Why we should publish our data under Creative Commons Zero (CC0)

With the first datasets published and more coming soon, the question arises: under what license will we – the Canadensys community and the individual collections – publish our data? Dealing with legal matters can be tedious, which is why we looked into this issue with the Canadensys Steering Committee and the Science and Technology Advisory Board before opening the discussion to the whole community.

By data we mean specimen, observation or checklist datasets published as a Darwin Core Archive and any derivatives. To keep the discussion focused, this does not include pictures or software code.

2012.01.30 – Update to post: technically CC0 is not a license, but a waiver (see comment below).

What we hope to achieve

  1. One license for the whole Canadensys community, which is easier for aggregation and sends a strong message as one community.
  2. An existing license, because we don’t want to write our own legal documents.
  3. An open license, allowing our data to be really used.
  4. A clear license, so users can focus on doing great research with the data, instead of figuring out the fine print.
  5. Giving credit where credit is due.

Our recommendation

We recommend that Canadensys participants publish their data under Creative Commons Zero (CC0). With CC0 you waive any copyright you might have over the data(set) and dedicate it to the public domain. Users can copy, use, modify and distribute the data without asking your permission. You cannot be held liable for any (mis)use of the data either.

CC0 is recommended for data and databases and is used by hundreds of organizations. It is especially recommended for scientific data and thus encouraged by Pensoft (see their guidelines for biodiversity data papers) and Nature (see this opinion piece). Although CC0 doesn’t legally require users of the data to cite the source, it does not take away the moral responsibility to give attribution, as is common in scientific research (more about that below).

Why would I waive my copyright?

For starters, there’s very little copyright to be had in our data, datasets and databases. Copyright only applies to creative content and 99% of our data are facts, which cannot be copyrighted. We do hold copyright over some text in remarks fields, the data format or database model we chose/created, and pictures. If we consider a Darwin Core Archive (which is how we are publishing our data) the creative content is even further reduced: the data format is a standard and we only provide a link to pictures, not the pictures themselves.

Figuring out where the facts stop and where the (copyrightable) creative content begins can already be difficult for the content owner, so imagine what a legal nightmare it can become for the user. On top of that, the rules differ from country to country. Publishing our data under CC0 removes any ambiguity and red tape. We waive any copyright we might have had over the creative content and our data gets the legal status of public domain. It can no longer be copyrighted by anyone.

Can’t we use another license?

Let’s go over the options. Keep in mind that these licenses only apply to the creative aspect of the dataset, not the facts. But as pointed out above, figuring this out can be difficult or impossible for the user. So much so in fact, that the user may decide not to use the data at all, especially if they think they might not meet the conditions of the license.

All rights reserved

The user cannot use the data(set) without the permission of the owner.

Conclusion: Not good.

Open Data Commons Public Domain Dedication and License (PDDL)

There are no restrictions on how to use the data. This license is very similar to CC0.

Conclusion: Perfect, in fact this license was a precursor of CC0, but… it is less well known and maybe not as legally thorough as CC0. Creative Commons made a huge effort to have CC0 cover legislation in almost all countries, and its community is working hard to improve this even further. Therefore, if you have to choose, CC0 is probably better.

Creative Commons Attribution-NoDerivs (CC BY-ND)

The user cannot build upon the data(set), which is what most data use involves.

Conclusion: Not good, and sadly used by theplantlist.org. Roderic Page pointed this out by showing what cool things he can NOT do with the data.

Creative Commons Attribution-NonCommercial (CC BY-NC)

The user cannot use the data(set) for commercial purposes. This seems fine from an academic viewpoint, but the license is a lot more restrictive than one might intuitively think. See: Hagedorn, G. et al. ZooKeys 150 (2011). Creative Commons licenses and the non-commercial condition: Implications for the re-use of biodiversity information.

Conclusion: Not good.

Creative Commons Attribution-ShareAlike (CC BY-SA) or Open Data Commons Open Database License (ODbL)

The user has to share any work based upon the data(set) under a license that is identical or similar to the one used.

Conclusion: Good, but… this can lead to some problems for an aggregator like Canadensys or GBIF: if they are mixing and merging data with different SA licenses, which one do they choose? They might be incompatible.

Creative Commons Attribution (CC BY) or Open Data Commons Attribution License (ODC-By)

The user has to attribute the data(set) in the manner specified by the owner. This condition is also present in the three licenses above.

Conclusion: Good, but… this can lead to impractical “attribution stacking”. If an aggregator or a user of that aggregator is using and integrating different datasets provided under a BY license, they legally have to cite the owner for each and every one of those in the manner specified by these owners (again, for the potential creative content in the data). See point 5.3 at the bottom of this Creative Commons page for a better explanation and this blog post for an example.
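
To make attribution stacking concrete, here is a minimal Python sketch. The dataset names, the citation strings and the `required_attributions` helper are all hypothetical, made up for illustration. Treating every license with a BY condition as carrying a legal attribution requirement is a simplification of the licenses discussed above, not legal advice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """A hypothetical published dataset, its license, and the citation its owner asks for."""
    name: str
    license: str  # e.g. "CC0", "CC BY", "ODC-By"
    requested_citation: str


def required_attributions(datasets):
    """Return the citations a user would be legally obliged to reproduce.

    Assumption: any license whose name contains "BY" carries an attribution
    condition; CC0 and PDDL carry none.
    """
    return [d.requested_citation for d in datasets if "BY" in d.license.upper()]


# Hypothetical aggregated download mixing three sources.
download = [
    Dataset("Herbarium A", "CC BY", "Cite as requested by the owner of Herbarium A"),
    Dataset("Insect collection B", "ODC-By", "Cite every specimen by catalogue number"),
    Dataset("Checklist C", "CC0", "Checklist C"),
]

stacked = required_attributions(download)
print(len(stacked))  # 2
```

Every BY dataset added to the download adds another entry to that list, each with its own wording to follow. If all three datasets were published under CC0, the list would be empty.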

But giving credit is a good thing!

Absolutely, but legally enforcing it can lead to the opposite effect: a user may decide not to use the data out of fear of not completely complying with the license (see paragraph above). As hinted at the beginning of this post, CC0 removes the drastic legally enforceable requirement to give attribution, but it does not remove the moral obligation to give attribution. In fact, this has been the common practice in scientific research for many decades: legally, you don’t have to cite the research/data you’re using, but not doing so could be considered plagiarism, which would compromise your reputation and the credibility of your work.

To encourage users to give credit where credit is due, we propose to create Canadensys norms. Norms are not a legal document (see an example here), but a “code of conduct” where we declare how we would like users to use, share and cite our data, and how they can participate. We can explain how one could cite an individual specimen, a collection, a dataset or an aggregated “Canadensys” download. We can point out that our data are constantly being corrected or added to, so it is useful to keep coming back to the original repository and not to a secondary repository that may not have been updated. In addition to that, we can build tools to monitor downloads or automatically create an adequate citation. And with the arrival of data papers – drafts of which can now be generated automatically from the IPT – data(sets) are really brought into the realm of traditional publishing and the associated scientific recognition.


All this to say that there are mechanisms by which both users and data owners can benefit, without the legal burden. CC0 + norms guarantees that our data can be used now and in the future. I for one will update the license for our Université de Montréal Biodiversity Centre datasets. We hope you will join us!

Thanks to Gregor Hagedorn for his valuable advice on all the intricacies of data licensing.