Many data analytics projects start with a big mess, including nonstandard inputs, miscalculations, typos and other human errors. In addition, there are times when the way data is collected is not ideal for the way the analyst wishes to interpret it. An example might be when dollar amounts are entered in thousands when billions might be more appropriate figures to work with. Google Refine offers a relatively simple way to clean up data and perform transformations on up to thousands of records all at once. We reviewed Google Refine using NYC Open Data to see this in process.
The materials required for this lab included a computer, Google Refine desktop software, and a dataset to work with. Google Refine was already available on the desktop computer of the Pratt Institute classroom where we performed the lab Originally we were going to use a library file from iTunes, but we were unable to load the file to Google Refine, so we copied data from NYC Open Data portal instead.
New York City Open Data Portal
It was also very helpful to have a copy of the Google Refine documentation:
In my NYC Bathrooms file, I noticed that in the BUILDINGTY column that most of the data is entered in all caps, but the “Recreation Center” is in title case. I can change this by first selecting a text facet on this column, and then performing a transformation on the results. First click the dropdown arrow on the column heading and select “Text Facet.”
This will open a window in the left column. Look for the item marked “Recreation Center.”
From the results, click the dropdown arrow on the column heading and select “Edit Cells” then “Transform.”
To change the value of the cells from “Recreation Center” to “RECREATION CENTER” simply type “RECREATION CENTER” (with the quotes) into the expression box. A preview below the expression box will show you if your changes are reflected properly.
I opened Google Refine and attempted to import the iTunes XML file to the system without success. I tried both the file import function and the copy text field but got errors. I turned to the NYC Open Data portal and found a simple CSV file of bathrooms in public parks, which I imported using the “This Computer” link and it worked.
Next, I reviewed the data for items that required cleaning. I found a number of bathrooms at Junior High School playgrounds whose records were labeled “Jhs” in the Description field. I decided I would try transforming the instances of “Jhs” to capitalized form, “JHS.” To do this, I found a transform expression that allowed me to replace the first four characters “Jhs“ to “JHS” successfully using the following expression:
“JHS ” + value.substring(3)
There were a few glitches that required me to edit some stray cells individually. For example, one record contained the string “Jhs ” which was not in the first four characters. I did not have enough class time to figure out how to exclude this record. I edited the cell by moving the string to the front of the phrase and then performing the transformation. I had to make a note to myself to correct it to the original placement after performing the transformation. There may be an expression that allows you to transform a string without identifying its location.
The following image shows the “DESCRIPTIO” column with the transformed cells and all mentions of “JHS” in capital letters:
Most of the problems that I encountered while using Google Refine may be related to my beginner status with the program. Google Refine’s documentation offers extensive information about the ways you can sort, filter and transform data, and numerous examples of expressions that you can use to edit and transform multiple records at once. With patience, a user may find learning Google Refine’s tools saves time in the long run, particularly when working with large datasets.