Data Cleaning in OpenRefine

Meme of Yoda from Star Wars saying "Dirty Data You Have; Clean It Up You Must."

I did not get much work done this past weekend, between family commitments and a lack of motivation because Jenna and I were in different states, but I did have an opportunity to use some work time today on data cleaning. At NYU Libraries, where I work by day, there is a Community of Practice group that meets once a month to learn new skills in OpenRefine, which is a “Java-based power tool that allows you to load data, understand it, clean it up, reconcile it, and augment it with data coming from the web. All from a web browser and the comfort and privacy of your own computer.” I have long suspected that this software would be very useful for ZineCat, so I am very excited to have gotten hands-on time in OpenRefine, along with monthly meetings and check-ins as I learn the tool and apply it to ZineCat.

So what did I learn? I learned to “cluster” similar values from different cells and then apply a single change to every cell in the cluster. I uploaded the CSV file for the Sallie Bingham zine collection, since I had some issues with the publication dates in this set of records and wanted to see if OpenRefine could help. Guess what!? It did!! For example, some of the dates in the Sallie Bingham collection were recorded as Winter 1994, Winter 94, Winter 1994-95, etc. OpenRefine clustered these together through a command in the application and let me set them all to Winter 1994. After running this cluster function and doing some data cleaning, I was able to format the “issuedate” column of the data set to be read as a date. Unfortunately, the session was only an hour. We were following a Data Carpentry lesson plan, and the facilitator moved on to other parts of the lesson module before I was able to fully standardize all the dates in the set. At this point the dates still appear in mixed forms such as Winter 1994, 1998, and October 2001, so they are not yet consistent. It’ll require a little more work to align all the dates properly.

Screen capture of OpenRefine with Sallie Bingham data with "issuedate" and cells with dates circled.
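
To make the clustering idea concrete, here is a minimal Python sketch of grouping date strings that collapse to the same key. This is not OpenRefine's own algorithm. The keying rule (a label followed by a starting year), the function names date_key and cluster_values, and the assumption that two-digit years belong to the 1900s are all mine, chosen only to match the Winter 1994 example above.

```python
import re
from collections import defaultdict

# Matches a label plus a year, with an optional trailing range,
# e.g. "Winter 94" or "Winter 1994-95". Hypothetical pattern for illustration.
LABEL_YEAR = re.compile(r"^([A-Za-z]+)\s+(\d{4}|\d{2})(?:\s*-\s*\d{2,4})?$")


def date_key(value):
    """Return a normalized key such as 'Winter 1994' for a raw date string."""
    text = value.strip()
    match = LABEL_YEAR.match(text)
    if not match:
        return text  # leave unrecognized values (e.g. "1998") untouched
    label = match.group(1).capitalize()
    year = match.group(2)
    if len(year) == 2:
        year = "19" + year  # assumption: two-digit years are in the 1900s
    return f"{label} {year}"


def cluster_values(values):
    """Group raw values by their normalized key."""
    groups = defaultdict(list)
    for value in values:
        groups[date_key(value)].append(value)
    return dict(groups)


dates = ["Winter 1994", "Winter 94", "Winter 1994-95", "1998", "October 2001"]
clusters = cluster_values(dates)
for key, members in clusters.items():
    print(key, "<-", members)
```

The three Winter variants land in one group, while 1998 and October 2001 stay in groups of their own, which mirrors the mixed state the column is still in.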

As we moved on to the next module, we were shown how to split cell data. For example, the “zinetopics” column from the Sallie Bingham records had multiple topics in each cell, but for some applications to run properly, each topic would need to be in its own cell. I was able to separate the topics into different cells with a simple command. However, the CollectiveAccess ingest accepts multiple items in one cell, so this OpenRefine function isn’t strictly necessary for ZineCat right now, though I can foresee a time when it will be useful.

Screen capture of OpenRefine with Sallie Bingham data with "zine_topics" and multiple cells with topics circled.
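
For anyone curious what splitting a multi-valued cell amounts to, here is a rough Python sketch. The "|" separator, the sample record, and the function name split_multivalued are made up for illustration, and the output layout is my own choice: this version copies the other columns onto each new row, which may not match how OpenRefine arranges the result.

```python
def split_multivalued(rows, column, separator="|"):
    """Return one row per value found in the given multi-valued column."""
    result = []
    for row in rows:
        parts = [part.strip() for part in row[column].split(separator)]
        parts = [part for part in parts if part]
        if not parts:
            result.append(dict(row))  # keep rows whose cell is empty
            continue
        for part in parts:
            new_row = dict(row)  # copy the other columns unchanged
            new_row[column] = part
            result.append(new_row)
    return result


# Made-up sample record, not taken from the Sallie Bingham data.
records = [{"title": "Example Zine", "zinetopics": "feminism | music | DIY"}]
split_rows = split_multivalued(records, "zinetopics")
for row in split_rows:
    print(row["title"], "->", row["zinetopics"])
```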

OpenRefine can do many more things than I’ve outlined above, but this was what I discovered with one hour of engagement. I anticipate that the application will continue to be of use as the project evolves.

In other news regarding the intersection of NYU and ZineCat, I’ve had several conversations with my colleagues at NYU about grant opportunities to support ZineCat’s development. I’ve been given some documents that provide a framework for creating budgets and drafting work plans, which will help me finalize some of the deliverables for the Independent Study report that I’ve been writing for close to a year. I don’t have many details to share right now, other than that it seems likely NYU will be able to support the project as we continue its development outside of the CUNY Graduate Center. I will hopefully have more to officially share on that front in the months to come.

That’s all for now, folks. Thanks for reading!
