Data is messy, especially in the social sciences. Messy data comes in all shapes and sizes, but one issue I faced recently was how to deal with free-response write-in items. For items relating to categorical assignments (e.g., demographic questions), I find that Excel functions make organizing a small dataset fairly simple. However, this method fell short when I began working on a project with previously collected data from well over 500,000 participants. My simple Excel functions were no match for a dataset this large, with this much variability in the written responses!

Luckily, the application OpenRefine (formerly Google Refine) was recommended to me to help face this Goliath dataset. OpenRefine is a free, open-source desktop application available for download for Windows, Mac, and Linux.

OpenRefine can be used for:

  • Cleaning messy data using transformations, facets, and clustering
  • Transforming data by converting formats and normalizing or denormalizing
  • Parsing data from websites
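
To give a feel for what clustering does to write-in responses, here is a minimal Python sketch of a "fingerprint" key-collision approach, one common way to group near-duplicate text values. It is an illustration only, not OpenRefine's actual code: the function names `fingerprint` and `cluster` and the sample responses are made up for this example, and it assumes that differences in case, punctuation, spacing, and word order should be ignored.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a response to a normalized key (hypothetical helper)."""
    # Trim whitespace, lowercase, and strip accents down to plain ASCII.
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    # Drop punctuation, then de-duplicate and sort the remaining words.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group distinct responses that share the same fingerprint key."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    # Only keys matched by more than one distinct spelling form a cluster.
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Made-up write-in responses for illustration.
responses = [
    "African American",
    "african american.",
    "American, African",
    "Asian",
    " asian ",
    "Hispanic",
]

for group in cluster(responses):
    print(group)
```

Each printed group is a set of spellings that a person could then review and merge into a single category. Note that this sketch removes punctuation outright, so a hyphenated entry such as "african-american" would collapse into one word and land in a different group; handling cases like that is where a choice of clustering methods becomes useful.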

Check out http://openrefine.org to learn more about the application and how to download and use the program for your research needs.