Special data feature: Obtaining information on freshwater databases
Authors: Aaike De Wever, Astrid Schmidt-Kloiber and Sian Davies
Today we begin a weekly series of posts written by BioFresh scientists, which give a behind-the-scenes account of how and why the BioFresh freshwater biodiversity metadatabase is being constructed. As we find out in the article, compiling data is not only a logistical task, it involves a fascinating network of politics and negotiation over data ownership, sharing and publication.
Why are we constructing a metadatabase?
One of the main products of the BioFresh project is the metadatabase, which is essentially a database centralising information on freshwater biodiversity-related databases. As outlined in the interview with Astrid Schmidt-Kloiber, this is one of her major tasks within the EU FP7-funded BioFresh project.
The aim of this metadatabase is to bring together all available information on freshwater-related databases and provide a resource where scientists, conservationists and policy makers can find databases relevant to their work. It was, however, first started as a tool to help scientists within BioFresh identify datasets that could be used in their biodiversity modelling work.
Development history
The work on the metadatabase started very early on in the project, and a first prototype was already available by the first project meeting in February 2010. This allowed us to discuss which fields scientists needed in order to identify suitable datasets, while making sure the metadatabase was compatible with common standards. By the summer of 2010, the extensive metadatabase questionnaire, specific to freshwater ecosystems, was ready, and project partners were encouraged to enter the databases they held.
In autumn 2010, we began welcoming and collating databases from external parties. From April 2011, the metadatabase was available for public viewing, although the majority of the datasets were still behind the scenes. During this first year we already gained a lot of experience in requesting metadata, which we would like to comment on in this blog post.
The present situation
At this stage (September 2011) we have 58 more or less complete database entries, excluding the intercalibration datasets (see a forthcoming post on this topic). 34 of these were filled in by BioFresh project partners. The other 24 databases were external and came from two main sources: people who contacted us, and databases we identified ourselves. Four of those were filled in by the data holders; for the other 20 entries we started by filling in as much metadata as we could ourselves before contacting the data holders. We chose this approach because our experience with the internal databases had already made clear that our chances of success (i.e. a fully completed questionnaire) would otherwise be low.
Comments on the experience of requesting data
Four out of five people who contacted us asking to include their datasets in the metadatabase have completed a metadatabase entry. Of the 20 external datasets we filled in, only 4 were not checked and approved by the data holder: two were non-responders; for one data contact we were unable to find a working email address and received no reply via a general address; and one data holder indicated that he was not interested in sharing data and subsequently did not check the metadata, although we specifically asked for this.
In general, the overall response rate of around 80% (excluding the intercalibration data) is probably not too bad, especially considering there is only one dataset we heard about but received no further information on, again because the data holder was not interested in sharing and did not seem to see the value of having the dataset in a metadatabase. However, (pre)filling the metadatabase entries and motivating data holders to check and complete them is quite a tedious task, which is also the main reason the number of metadatabase entries does not yet look impressive. More on that later.
Fortunately, the quality of most entries is relatively high, although typos and the omission of obvious keywords are quite common. Another weak point is the lack of detail on intellectual property rights for many of the databases, despite the fact that we specifically asked data holders to pay attention to this aspect when checking their entries.
We will continue to proactively search for and incorporate relevant datasets and to improve the quality of the entries. We believe we will end up with a specialised, high-quality database. But it is clear that there is a lot of freshwater data out there that will never be incorporated, so we won't hit the 1000+ mark for the number of entries in the metadatabase. At least not using this approach…
Luckily, we are certainly not the only initiative compiling metadata. We are currently looking into automatic mechanisms for exchanging metadata and have high hopes that we will be able to work together with several existing initiatives, which is another potential blog topic for later.
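The post does not say which exchange mechanism is planned, but a widely used standard for automated metadata harvesting is OAI-PMH, in which repositories expose records (often in Dublin Core) as XML. As a purely hypothetical sketch, and assuming an OAI-PMH-style response (the sample record and field names below are illustrative, not BioFresh's actual setup), harvested metadata could be parsed like this:

```python
# Hypothetical sketch: parsing an OAI-PMH-style ListRecords response.
# The sample record below is invented for illustration; a real harvester
# would fetch this XML from a repository's OAI-PMH endpoint.
import xml.etree.ElementTree as ET

SAMPLE_RESPONSE = """<?xml version="1.0"?>
<OAI-PMH xmlns:dc="http://purl.org/dc/elements/1.1/">
  <ListRecords>
    <record>
      <metadata>
        <dc:title>Example freshwater invertebrate survey</dc:title>
        <dc:identifier>oai:example.org:dataset-001</dc:identifier>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def extract_records(xml_text):
    """Return (title, identifier) pairs from a ListRecords response."""
    ns = {"dc": "http://purl.org/dc/elements/1.1/"}
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter("record"):
        title = record.find(".//dc:title", ns)
        ident = record.find(".//dc:identifier", ns)
        records.append((title.text if title is not None else None,
                        ident.text if ident is not None else None))
    return records

print(extract_records(SAMPLE_RESPONSE))
```

The appeal of this kind of harvesting is that, once a partner initiative exposes its metadata in a standard format, entries can be pulled in and refreshed automatically instead of being pre-filled and chased by hand.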
Some conclusions and final thoughts
In conclusion, we believe we have already learned quite a few lessons about collating metadatabases. It is clear that working with scientists to obtain information on the databases they hold is a slow and tedious process, but assisting with the entry of the information clearly speeds things up. However, unless a database is documented online or in a scientific paper, it is difficult (or even impossible) to enter any information on a dataset without access to it.
Another notable observation is that the concept of metadata seems to be poorly understood. This appears to lead to nervousness or a reluctance to make it available, despite the fact that publishing what is simply a description of a dataset will not diminish or compromise the data holder's work, nor will it mean they lose control of their data.
This theme will be explored further in next week's article, Working with intercalibration data.