Canadensys Explorer surpasses 1M records

We are pleased to announce that the Canadensys Explorer has surpassed one million occurrence records contributed by its partners. This milestone represents over a third of what we aim to help mobilize by the end of 2013 and showcases the immense human effort at each partnering institution as well as the capacity of the Canadensys technical team. We hope you agree that the Explorer handles 880,000 georeferenced records remarkably well.

The latest two collections to appear in the Canadensys database are the Lyman Entomological Museum (LEMQ), Macdonald Campus, McGill University and the E. H. Strickland Entomological Museum (UASM), University of Alberta.

The Lyman Museum is the second largest insect collection in Canada and the largest university insect collection in the country. It houses approximately 2.8M arthropod specimens. Its strengths are pinned beetles and butterflies, grasshoppers and related orthopteroids, and, more recently, flies. The Lyman Museum staff published over 250,000 records on the Canadensys Integrated Publishing Toolkit (doi:10.5886/q79vhp1e).

The Strickland Museum houses over 1M specimens of primarily Nearctic species, of which over 400,000 are in the beetle family Carabidae. Many of its substantial moth and butterfly holdings were collected in Alberta. The Strickland Museum staff shared over 290,000 records through their own Integrated Publishing Toolkit.

Data Aggregation Process

Although both collections contribute similar numbers of records, the processes by which these are aggregated on the Explorer differ. The Lyman Museum records are harvested from the local Canadensys repository, whereas the Strickland Museum records are harvested from the remote University of Alberta repository. The latter type of source requires that we listen for republication events and then execute a semi-automated script to refresh our cache of the data. Our goal is to remain synchronized with remote sources, and we are actively researching techniques to accelerate our current procedure. Individual collections across the country may therefore elect to publish their records on the Canadensys repository or, if they have in-house technical capacity, to install their own Integrated Publishing Toolkit. Participation in Canadensys is nearly identical for either option.
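To make the remote-harvesting step concrete, here is a minimal sketch of how one might watch for republication events by polling a publisher's RSS feed. The feed URL, field layout, and sample content below are assumptions based on generic RSS 2.0, not the actual University of Alberta feed:

```python
# Minimal sketch: detect republication events by polling a publisher's RSS
# feed. The sample feed content and URL are illustrative assumptions.
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

SAMPLE_RSS = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>University of Alberta IPT</title>
    <item>
      <title>E. H. Strickland Entomological Museum (UASM)</title>
      <link>http://example.org/ipt/resource?r=uasm</link>
      <pubDate>Tue, 15 Jan 2013 10:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

def resources_updated_since(rss_xml, last_sync):
    """Return (title, link) pairs for resources republished after last_sync."""
    root = ET.fromstring(rss_xml)
    updated = []
    for item in root.iter("item"):
        published = parsedate_to_datetime(item.findtext("pubDate"))
        if published > last_sync:
            updated.append((item.findtext("title"), item.findtext("link")))
    return updated

# Any resource republished since our last harvest needs a cache refresh.
last_sync = datetime(2013, 1, 1, tzinfo=timezone.utc)
for title, link in resources_updated_since(SAMPLE_RSS, last_sync):
    print(f"refresh cache for: {title} -> {link}")
```

In production the feed would be fetched over HTTP on a schedule and the refresh step would trigger the semi-automated harvesting script.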

Data Cleaning Process

A recent thread on the Taxacom listserv identifies data quality as an issue that all stakeholders in biodiversity data need to address. Each of us has different skills that can help identify shortcomings in the quality of biodiversity data or actively help improve it. Our Explorer is often the first view curators have of their data, and the reaction is typically one of surprise.

“How did that ground beetle get in the Pacific Ocean!?!”

If it weren’t for freely available, aggregated views of occurrence data like the Explorer, many data quality issues would remain unknown or unknowable.

It is relatively quick work for us to generate flagged lists of records and send them back to curatorial staff for verification. In the case of the Lyman and Strickland Museums, this resulted in immediate data cleaning routines. We were told that some of these actions were executed in bulk, but many others required re-examination of individual specimen labels. Clearly, only the curatorial staff of the host collection have the knowledge to take appropriate action.

One technique we use to flag records with potential georeferencing issues is to verify that geographic coordinates are in fact within the indicated Canadian province. In other words, we cross-reference a reverse geocode. We do so by executing a query against a PostGIS table of data ingested from the Global Administrative Areas database:

SELECT o.catalognumber, o.stateprovince, o.decimallatitude, o.decimallongitude
FROM occurrence o
LEFT JOIN canada c ON o.stateprovince = c.name_1
WHERE o.collectioncode = 'UASM'
AND o.country = 'Canada'
AND (NOT ST_Contains(c.the_geom, o.the_geom) OR c.gid IS NULL) \g /tmp/uasm_errors.txt

Similar data verification routines are being rolled into our flexible and scalable narwhal processor and associated application programming interfaces. We are encouraged to see other excellent work in this area by the Atlas of Living Australia (ALA), Centro de Referência em Informação Ambiental (CRIA), and FilteredPush, a project that is expected to have a significant impact on data quality workflows. In the meantime, we see room for simple tools that produce human-readable reports illustrating the quality of data before and after cleansing sessions. Without such reports, it is difficult to measure the impact local digitization projects have on the quality of data shared.

There are myriad ways to identify suspect records, many of which can be quite complex, involving combinations of fields and records within a dataset or across datasets. Some routines require access to third-party data (e.g., for reverse geocoding). Data cleaning never actually ends; it is an ongoing process that must be appropriately prioritized.
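A few of the simpler within-record checks can be sketched in a handful of lines. The rules below are illustrative examples of combining fields, assuming Darwin Core field names, and are not the actual Canadensys routines:

```python
# Minimal sketch of within-record quality flags, assuming Darwin Core field
# names; the rules and the 2013 cutoff are illustrative, not production logic.
def flag_record(rec):
    """Return a list of quality flags for a single occurrence record."""
    flags = []
    lat = rec.get("decimalLatitude")
    lon = rec.get("decimalLongitude")
    if lat is not None and lon is not None:
        if (lat, lon) == (0.0, 0.0):
            flags.append("ZERO_COORDINATES")  # a common default-value artifact
        if rec.get("country") == "Canada" and lon > 0:
            # All of Canada lies in the western hemisphere, so a positive
            # longitude usually signals a dropped minus sign.
            flags.append("POSITIVE_LONGITUDE_IN_CANADA")
    year = rec.get("year")
    if year is not None and year > 2013:
        flags.append("COLLECTED_IN_FUTURE")
    return flags

suspect = {"country": "Canada", "decimalLatitude": 53.5,
           "decimalLongitude": 113.5, "year": 2011}
print(flag_record(suspect))  # a likely sign error on the longitude
```

Checks spanning records or datasets, such as duplicate detection, require more machinery, but the single-record cases above already catch the "ground beetle in the ocean" class of error.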

  • What measures of data quality are most important to you?
  • If you are a member of a curatorial team, how do you prioritize data cleaning?
  • Do you think structured reports would be useful?
    • How should they be delivered?
    • Should they be made publicly available?