Data cleaning, publication and re-use: a list of tools, materials and documentation

Cleaning, enriching and publishing data is the everyday life of the Canadensys team, as well as of the curators, coordinators, technicians, volunteers, students, … who work in collections to digitize and publish specimens, collect data in the field, or even try to digitize historically important published datasets and papers.

The task can sometimes be easy, but more often it is daunting, especially when working with legacy datasets.

Fortunately, a vast spectrum of tools and documentation has been developed to help with the different steps leading to the publication of high-quality, standardized data. Here is a list of what we usually use. This list is absolutely not exhaustive, and will probably need several updates in the next few years, as biodiversity informatics is evolving quickly.

I am intentionally not talking about data management software here, because of the variety of needs and solutions. That would deserve an entire post of its own.

This post was meant to be the perfect host for the material developed for the series of workshops we organized last year in Wolfville, Vancouver and Montréal, but while we are at it, why not also list other tools and documents that may be helpful to you?

But first, you can find here all the slides, exercises and other documents developed for the workshops. In these workshops, we talked about data publication and data cleaning, we played with some of the tools I'm listing in this post, we discovered the Canadensys portal, based on ALA, as well as the GBIF portal, and we discussed data use. Feel free to re-use that material as much as you want!
This series of workshops was co-funded by us and the GBIF CESP program, and the material was based on previous workshops developed by GBIF Spain and the GBIF Secretariat.

So let's talk about data cleaning, which of course needs to be done before data publication (even if you will discover more data cleaning to do after publication).
Data cleaning is, in my opinion, closely linked to data enhancement (or enrichment), and the tools I will discuss allow both.

  • Open Refine: this is THE tool I use every time I need to check a dataset, and it makes my life so much easier. Not a database, not a spreadsheet, but something in between that helps you visualize your data and apply batch edits to it. Numerous tutorials and documentation are available online. Try it, and you will surely ask yourself: “Why had I never heard of, or used, this tool before?”
  • GBIF Data Validator: test your dataset before publishing it on the IPT. This tool replicates the indexing process your dataset will go through once harvested by GBIF. As a result, you will know whether it is in a suitable format and which issues have been discovered. Go back to Open Refine or your source database and fix those issues!
  • Species Matching: a great tool to check taxonomy, catch spelling errors and get taxonomic status. You can then import all this information back into your dataset in Open Refine.
  • APIs: APIs are at the base of a lot of data exchange on the Internet, often without you even knowing it. But this type of communication can also be used to harvest different types of data from remote databases, such as geographic information or taxonomy. Example: the Vascan API.
  • Georeferencing hub (VertNet): georeferencing is important, but can be complicated. This hub is a goldmine of documentation and tools!
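To give a concrete idea of what calling such an API looks like, here is a minimal Python sketch using GBIF's species match web service (`https://api.gbif.org/v1/species/match`), which takes a scientific name and returns matching taxonomy as JSON. The sample response below is abridged and illustrative, not a live query; the field names follow the GBIF API.

```python
# Minimal sketch: matching a scientific name against the GBIF backbone
# taxonomy via the species match API. The sample response is abridged
# for illustration; a real call would fetch the URL over HTTP.
from urllib.parse import urlencode

GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"

def build_match_url(name: str) -> str:
    """Build the request URL for a scientific name."""
    return GBIF_MATCH_URL + "?" + urlencode({"name": name})

def extract_taxonomy(response: dict) -> dict:
    """Keep only the fields we would import back into our dataset."""
    keys = ("scientificName", "status", "matchType", "family")
    return {k: response.get(k) for k in keys}

# Abridged example of what the API returns for "Acer saccharum":
sample_response = {
    "scientificName": "Acer saccharum Marshall",
    "status": "ACCEPTED",
    "matchType": "EXACT",
    "family": "Sapindaceae",
}

print(build_match_url("Acer saccharum"))
print(extract_taxonomy(sample_response))
```

The same pattern (build a URL, parse the JSON response, keep the fields you need) applies to other services like the Vascan API, and Open Refine can issue such calls for you in batch via its “Add column by fetching URLs” feature.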

Data publication is made easy by the IPT, the tool developed by GBIF to facilitate data publication and data harvesting.
Our ‘7-step guide to data publication’ is a good starting point, and you can either ask us to create an account for you on our IPT, or play with the demo provided by GBIF.
Depending on your data management system, you can decide to publish your dataset manually, or link your SQL-based system directly to our IPT, enabling automatic updates of the data on a schedule of your choice.

Documentation/Resources about the IPT:

You're not a data holder, but you would love to use all those data available on Canadensys, GBIF and all the other platforms around the world? Then the question of how to use them efficiently is a big one for you. Fortunately, a lot of resources have been developed by the community of users and aggregators.