Data cleaning

Introduction

From the introduction of Chapman (2005), Principles and Methods of Data Cleaning:

Data Cleaning is an essential part of the Information Management Chain as mentioned in the associated document, Principles of Data Quality (Chapman 2005a). As stressed there, error prevention is far superior to error detection and cleaning, as it is cheaper and more efficient to prevent errors than to try and find them and correct them later. No matter how efficient the process of data entry, errors will still occur and therefore data validation and correction cannot be ignored. Error detection, validation and cleaning do have key roles to play, especially with legacy data (e.g. museum and herbarium data collected over the last 300 years), and thus both error prevention and data cleaning should be incorporated in an organisation’s data management policy.

One important product of data cleaning is identifying the root causes of the errors detected and using that information to improve the data entry process so that those errors do not recur.

Tools

  • OpenRefine, a tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services (such as georeferencing requests and taxonomic verification).
  • R packages for biodiversity data cleaning, for example scrubr, biogeo, taxize, and rgeospatialquality (see the sketch after this list).
  • Canadensys tools for date and geographic coordinate transformation and conversion.
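
As a minimal illustration of the R route, the sketch below uses scrubr to drop occurrence records with incomplete or impossible coordinates. The occurrence data frame and its column names are invented for the example, and scrubr has since been archived on CRAN, so it may need to be installed from the rOpenSci archive.

    # Minimal sketch: cleaning occurrence coordinates with scrubr.
    # Note: scrubr is archived on CRAN; install from the rOpenSci
    # archive if library(scrubr) fails.
    library(scrubr)

    # Hypothetical occurrence records; names and values are invented.
    occurrences <- dframe(data.frame(
      name      = c("Puma concolor", "Puma concolor", "Puma concolor"),
      latitude  = c(-3.47, NA, 91.0),      # one missing, one out-of-range
      longitude = c(-62.21, -62.21, 10.0)
    ))

    cleaned <- coord_incomplete(occurrences)  # drop rows with missing lat/lon
    cleaned <- coord_impossible(cleaned)      # drop lat outside [-90, 90],
                                              # lon outside [-180, 180]

    nrow(cleaned)  # 1: only the first record survives both checks

Similar coordinate checks exist in the other packages listed above (e.g. rgeospatialquality), so the choice largely comes down to which package fits the rest of the workflow.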

Documents