Final Project

Nicolas Oshiro

Data and Processing

The data for this project can be found here. The data comes from Kaggle.com, a community of data scientists who share datasets, and it is published by the US Department of Transportation. The dataset includes a record for each oil pipeline leak or spill reported to the Pipeline and Hazardous Materials Safety Administration since 2010. The original dataset contains 2795 rows and 48 columns. I performed some data processing in R, which can be found here. In short, I removed any missing data, renamed some columns to be friendlier for d3, fixed some of the data encoding for rows that weren't being added properly, and filtered the data to include only the 500 most expensive oil spills since 2010. This number is somewhat arbitrary, but I found that the interactivity runs more smoothly and the data density remains unharmed when filtering to around 500 rows. As the last processing step, I reformatted the dates in Excel because they needed to be zero-padded to cooperate with the timeParse function. After processing, the data had 500 rows and 24 columns.
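
As a rough illustration of the zero-padding and top-500 filtering steps described above, here is a minimal JavaScript sketch. It is not the R or Excel processing actually used for this project. The helper names `zeroPadDate` and `topByCost`, the `cost` field, and the month/day/year date layout are all assumptions made for the example.

```javascript
// Hypothetical helper: zero-pad a date such as "1/5/2010" to "01/05/2010".
// Assumes a month/day/year layout separated by "/" (not confirmed by the source).
function zeroPadDate(dateString) {
  const [month, day, year] = dateString.trim().split("/");
  return [month.padStart(2, "0"), day.padStart(2, "0"), year].join("/");
}

// Hypothetical helper: keep the n rows with the largest value in costField.
// Copies the array first so the original row order is left untouched.
function topByCost(rows, n, costField) {
  return [...rows]
    .sort((a, b) => b[costField] - a[costField])
    .slice(0, n);
}

// Example with made-up rows; "cost" and "date" are placeholder column names.
const sampleRows = [
  { date: "1/5/2010", cost: 5 },
  { date: "12/25/2015", cost: 20 },
  { date: "3/14/2012", cost: 10 },
];

const cleaned = topByCost(sampleRows, 2, "cost").map((row) => ({
  ...row,
  date: zeroPadDate(row.date),
}));

console.log(cleaned.map((row) => row.date));
```

With the real data, the same idea would be applied with `n` set to 500 and the actual cost column name in place of `cost`.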