What’s in a name: does the data publication metaphor work for primary biodiversity data?

January 25, 2012

tags: Aaike De Wever, Data nomenclature, Data publication

Looking for a solution: a flamingo in Chile. Image: Nuria Bonada

In the process of collecting, collating and mobilising freshwater biodiversity data for BioFresh, I routinely use the phrase “data publication” to convince data authors or holders to make their data publicly available. In a paper currently available for public review, Mark Parsons and Peter Fox discuss the applicability and limitations of the data publication metaphor for making data broadly available. As the authors state themselves, their paper is somewhat provocative. Judging by the responses on their blog, it sparked quite a discussion, which I must say got me thinking as well…

Where and how? Citing biodiversity data

When looking at our work within BioFresh, which for me at least focuses on primary biodiversity data (basically the what, where, how and by whom an organism was observed or collected, as defined by the Global Biodiversity Information Facility, GBIF), I must admit that I agree with most of the limitations Parsons and Fox attribute to the use of the term data publication. It is, for instance, true that there is no standard review process or mechanism for datasets that comes close to the well-accepted practice of peer review for scientific papers. In addition, holders of primary biodiversity data can make them available through the GBIF network without necessarily publishing a paper on them (e.g. data on museum collections). This isn’t a bad thing at all (see our previous posts on data sharing topics), but it doesn’t fit the term data publishing in a strict sense very well. Finally, these data rarely carry a persistent identifier like a Digital Object Identifier (DOI).

As such, we merely use the term data publication to stress that scientists who make their data available on-line shouldn’t see this as an act of ‘giving away’ their work. Instead, it is a way for their data to be reused and cited in other scientific work (e.g. large-scale biodiversity modelling), thus creating more visibility for their work. However, a citation of a dataset in the absence of a published scientific paper does not have the same value as a citation that can easily be tracked through scholarly search engines and counted in a citation score. So, yes, in a way the term data publication can be somewhat misleading.

What’s the alternative?

But is there a worthy alternative? The process of making primary biodiversity data available on-line parallels the widely adopted practice of submitting sequence data to publicly available databases such as EMBL/GenBank/DDBJ. For both primary biodiversity data and sequence data, authors need to supply a limited set of core data in a standardized format, and these data may be part of a larger dataset, e.g. one that also contains environmental data which are not made publicly available. If we only want to stress the process of making (primary biodiversity) data available, data submission seems a valuable alternative, especially as it sounds less voluntary than data sharing. But until data sharing has become common practice and/or is enforced by journal editors, I believe a good alternative to the data publication metaphor for convincing scientists has yet to be found.

Aaike De Wever

P.S.: I could further elaborate on the emerging topic of actual data papers in biodiversity science (e.g. Chavan & Penev 2011), but I’ll keep that for a follow-up post.