Date:December 31, 2013

Google Refine: NYC Public Bathroom Map

Many data analytics projects start with a big mess, including nonstandard inputs, miscalculations, typos and other human errors. In addition, there are times when the way data is collected is not ideal for the way the analyst wishes to interpret it. An example might be when dollar amounts are entered in thousands when billions might be more appropriate figures to work with. Google Refine offers a relatively simple way to clean up data and perform transformations on up to thousands of records all at once. We reviewed Google Refine using NYC Open Data to see this in process.

Materials

The materials required for this lab included a computer, Google Refine desktop software, and a dataset to work with. Google Refine was already available on the desktop computer of the Pratt Institute classroom where we performed the lab Originally we were going to use a library file from iTunes, but we were unable to load the file to Google Refine, so we copied data from NYC Open Data portal instead.

Dataset:

Map of Bathrooms
https://data.cityofnewyork.us/Housing-Development/Map-of-Bathrooms/swqh-s9ee

New York City Open Data Portal
https://nycopendata.socrata.com/

It was also very helpful to have a copy of the Google Refine documentation:
https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users

Method

  1. Get the Data.The first step is to find a dataset to work with. This can be something you have on your computer or something that you find on the Internet. Make sure that it is in a format that Google Refine will accept.These formats currently include the following for Google Refine, version 2.0:
    • TSV, CSV, or values separated by a custom separator you specify
    • Excel (.xls, xlsx)
    • XML, RDF as XML
    • JSON
    • Google Spreadsheets
    • RDF N3 triples

     

  2. Import the Data.Google Refine offers a few ways to import data into the system, including uploading a file or copying it into a text field. The file upload function accepts a number of formats including Excel, CSV, RDF, XML, and Google Spreadsheets, among others. You can reference a dataset via its URL, copy and paste the dataset from the Clipboard or select a dataset from Google Data. When you have selected the file or copied the data, click Next.

    refine-import

  3. Review the Data.Did it import correctly? Did the data separate into columns as it should? Are the column headings showing up as headings or as a record row? If you use a spreadsheet, a delimited file like CSV or a structured file such as RDF or XML, it should have imported correctly. If not, Google Refine offers a set of tools that can help you define separators, such as commas or tabs or a custom field, or combine columns if they were separated incorrectly. It also allows you to set the rows that you want to use as the column headings, if these are not defined in the original file.

    refine-review

  4. Create Project.When you are satisfied that the data imported properly, click on the “Create Project” button at the top right of the screen.

    refine-save-project

  5. Clean the Data.When the data is separated into columns, look for additional problems. Do there appear to be misspellings or unconventional entries? Are there numerical figures that appear to be in a nonstandard format? Make a note of these problems so you can address each of them in turn.Google Refine offers a number of tools for editing and transforming data. You can edit cells individually within the spreadsheet. To edit a single cell, hover your cursor over the cell you would like to change and click “Edit.” This will open a small window with a box holding data that you can edit. It also allows you to set a data type and either apply the data just to that cell or to all identical cells. Click the “Apply” button to confirm the changes.You can also make changes by selecting Text Facets, which allows you to view cells within a column that share a common value and perform a search/replace function on all cells sharing that value at one time. To do this, click the dropdown arrow at the column header and select Facet | Text facet (or any of the other facets: numeric, timeline, scatterplot, or custom). Review the results for items that could be changed or combined. For example, if there is an entry for “Astoria Boulevard” and another for “Astoria Blvd,” you may wish to combine them as a single value, “Astoria Boulevard.” The “Cluster” button will reveal any datapoints that could be combined in this way.

    refine-facet

     

  6. You can also Transform data in a column by using a user-defined expression, which operates similarly to formulas in a spreadsheet. Google Refine Expression Language, or GREL, was designed to resemble Javascript. The documentation website is particularly helpful with performing transformations as it reviews the various expressions you can use to make changes to the data.refine-transform

    In my NYC Bathrooms file, I noticed that in the BUILDINGTY column that most of the data is entered in all caps, but the “Recreation Center” is in title case. I can change this by first selecting a text facet on this column, and then performing a transformation on the results. First click the dropdown arrow on the column heading and select “Text Facet.”refine-edit-cells

    This will open a window in the left column. Look for the item marked “Recreation Center.”

    refine-filter

    From the results, click the dropdown arrow on the column heading and select “Edit Cells” then “Transform.”

    refine-text-facet

    To change the value of the cells from “Recreation Center” to “RECREATION CENTER” simply type “RECREATION CENTER” (with the quotes) into the expression box. A preview below the expression box will show you if your changes are reflected properly.

    refine-custom-transform

Results/Discussion

I opened Google Refine and attempted to import the iTunes XML file to the system without success. I tried both the file import function and the copy text field but got errors. I turned to the NYC Open Data portal and found a simple CSV file of bathrooms in public parks, which I imported using the “This Computer” link and it worked.

Next, I reviewed the data for items that required cleaning. I found a number of bathrooms at Junior High School playgrounds whose records were labeled “Jhs” in the Description field. I decided I would try transforming the instances of “Jhs” to capitalized form, “JHS.” To do this, I found a transform expression that allowed me to replace the first four characters “Jhs“ to “JHS” successfully using the following expression:

“JHS ” + value.substring(3)

refine-custom-text-transform

 

There were a few glitches that required me to edit some stray cells individually. For example, one record contained the string “Jhs ” which was not in the first four characters. I did not have enough class time to figure out how to exclude this record. I edited the cell by moving the string to the front of the phrase and then performing the transformation. I had to make a note to myself to correct it to the original placement after performing the transformation. There may be an expression that allows you to transform a string without identifying its location.

The following image shows the “DESCRIPTIO” column with the transformed cells and all mentions of “JHS” in capital letters:

refine-facet

Future Directions

Most of the problems that I encountered while using Google Refine may be related to my beginner status with the program. Google Refine’s documentation offers extensive information about the ways you can sort, filter and transform data, and numerous examples of expressions that you can use to edit and transform multiple records at once. With patience, a user may find learning Google Refine’s tools saves time in the long run, particularly when working with large datasets.