FAIRness of WDCC - summary
Findability of WDCC Data
F1. (meta)data are assigned a globally unique and eternally persistent identifier.
The WDCC offers DataCite data publication for long-term archived data. Assigned DOIs ensure permanent access to published data, and both data and metadata remain unchanged. To be eligible for publication at WDCC, data must meet defined quality requirements. The oldest data DOI still in use today (Link) was registered for the WDCC on March 18, 2004.
Fig. 1: A schematic depiction of the OAIS AIP used for the WDCC archival process.
The WDCC database CERA contains not only the metadata (OAIS AIP, Figures 1 and 2) but also the data (OAIS AIP Content DataObject). DOIs are assigned at the coarse granularity levels (experiments or dataset groups) of the internal WDCC organization hierarchy. Metadata are kept for all levels of the hierarchy. While all levels bear persistent identifiers, only the higher levels are assigned true DOIs.
F2. data are described with rich metadata.
CERA is the WDCC's long-term archive system for data and metadata. The metadata are stored in a relational database whose tables are grouped in blocks covering different themes. In addition, the data model includes several modules (table groups) which extend the basic information given in the blocks. Once the CERA (meta)data have reached the ‘completely archived’ state in the archiving process, they are described with what DKRZ refers to as "rich metadata" (the OAIS AIPs are complete). During the DOI publication process, the metadata are extended with additional metadata, e.g. accuracy and statistical reports (see Figure 3).
Fig. 2: Details of the PDI (Preservation Description Information) used in the OAIS AIP at WDCC.
F3. (meta)data are registered or indexed in a searchable resource.
All metadata records are available to external harvesters through an OAI-PMH interface, with mappings to the Dublin Core, ISO 19115 and DataCite XML metadata sets. Important harvesters currently active are DWD-Gisc and WDS. Moreover, WDCC data with DOIs are visible in EUDAT B2FIND. Finally, the local WDCC GUI offers search and browse functions.
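As a rough illustration of what an external harvester does with such an interface, the following sketch builds an OAI-PMH `ListRecords` request and extracts identifiers and titles from a Dublin Core response. The base URL `https://example.org/oai` and the sample response are placeholders, not the actual WDCC endpoint or real records. The `verb` and `metadataPrefix` parameters and the `oai_dc` prefix are taken from the OAI-PMH specification; the function names are hypothetical.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Placeholder endpoint (assumption); substitute the real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", **extra):
    """Build a ListRecords URL; extra may carry 'from', 'until' or 'set'."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    params.update(extra)
    return base_url + "?" + urlencode(params)


def parse_dc_records(xml_text):
    """Return (identifier, title) pairs from an oai_dc ListRecords response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        ident = record.findtext(".//dc:identifier", default="", namespaces=NS)
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        pairs.append((ident, title))
    return pairs


# Invented sample response, used only to exercise the parser offline.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
          <dc:identifier>doi:10.1234/example</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url(BASE_URL, **{"from": "2020-01-01"})
records = parse_dc_records(SAMPLE_RESPONSE)
```

A real harvester would additionally follow `resumptionToken` elements to page through large result sets; that step is omitted here for brevity.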
F4. metadata specify the data identifier.
Data at the coarse granularity levels that bear DOIs are described by metadata that also include a reference to the assigned DOI. At the lower granularity levels, PIDs are currently not assigned. Some projects which are planned to be archived at WDCC will provide data where PIDs are also assigned at lower hierarchical levels. For these, PIDs are kept in the headers of data files and may also be kept within WDCC metadata. These procedures are not yet completely in place, and while the relevant projects (CMIP6 in particular) are of high importance for the community, not all data at WDCC will follow them.
Fig. 3: Additional information required for archived datasets in the process of DataCite DOI publication at DKRZ. The additional information ensures the re-usability of the data.
Accessibility of WDCC Data
A1. (meta)data are retrievable by their identifier using a standardized communications protocol.
Metadata can be retrieved via OAI-PMH. Data can be retrieved by HTTP. The WDCC data organization allows for small data volumes per individual download.
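To make retrieval by identifier over HTTP concrete, the sketch below prepares (but does not send) a request that resolves a DOI through the `https://doi.org/` resolver and asks for DataCite XML metadata via content negotiation. The DOI `10.1234/example` is a made-up placeholder and `doi_request` is a hypothetical helper. The media type is the one DataCite documents for its content negotiation service; whether a particular record is served this way is an assumption to verify.

```python
from urllib.parse import quote
from urllib.request import Request

# Media type documented by DataCite for XML metadata via content negotiation.
DATACITE_XML = "application/vnd.datacite.datacite+xml"


def doi_request(doi, accept=None):
    """Prepare an HTTP GET request for a DOI via the doi.org resolver.

    Without an Accept header the resolver redirects to the landing page;
    with one, it may return metadata in the requested format instead.
    """
    url = "https://doi.org/" + quote(doi, safe="/")
    headers = {"Accept": accept} if accept else {}
    return Request(url, headers=headers, method="GET")


# Placeholder DOI (assumption), not a real WDCC identifier.
req = doi_request("10.1234/example", accept=DATACITE_XML)
```

Passing `req` to `urllib.request.urlopen` would perform the actual retrieval; the request is kept separate here so the example runs without network access.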
A1.1 the protocol is open, free, and universally implementable.
HTTP and OAI-PMH are open, free and universally implementable.
A1.2 the protocol allows for an authentication and authorization procedure, where necessary.
A2. metadata are accessible, even when the data are no longer available.
If data are lost or removed for any reason, the associated metadata remain available but may be changed. In particular, metadata may be modified to indicate the cause of the data's removal or loss.
Interoperability of WDCC Data
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
For all metadata there are openly accessible Dublin Core, ISO 19115 and DataCite XML schema mappings. Metadata instances can be retrieved via OAI-PMH. The data model of the CERA database is also publicly documented. There are, however, still cases where the machine-interpretability of WDCC metadata could be improved by relying on more exhaustive ontologies and on mappings to self-describing, semantically enabled vocabularies and encodings.
Data use open format standards. Most data archived at WDCC conform to the CF-netcdf data format, which is in general self-describing and machine-readable and relies on commonly used controlled vocabularies, particularly the CF conventions. WDCC also archives other data formats, so not all data follow these conventions. WDCC has no policy enforcing CF-netcdf, so as not to discourage users from depositing valuable scientific data. Data in other formats are archived as they are, together with metadata but without provision of additional services.
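What "self-describing" means for a CF-netcdf file can be illustrated with a very small check over header attributes. The sketch below is not a CF compliance checker and is not part of any WDCC procedure: it only looks for the `Conventions` global attribute and for `units` plus either `standard_name` or `long_name` on each variable, which are attributes defined by the CF conventions. The function `cf_hints` and the sample header are hypothetical, and the header is represented as plain dictionaries so that no netCDF library is required.

```python
def cf_hints(global_attrs, variables):
    """Return hints about missing CF-style attributes.

    global_attrs: dict of global attributes from a file header.
    variables: dict mapping variable name to its attribute dict.
    An empty result only means none of these simple checks fired.
    """
    hints = []
    conventions = str(global_attrs.get("Conventions", ""))
    if "CF-" not in conventions:
        hints.append("global: Conventions attribute does not mention CF")
    for name, attrs in variables.items():
        if "standard_name" not in attrs and "long_name" not in attrs:
            hints.append(f"{name}: neither standard_name nor long_name set")
        if "units" not in attrs:
            hints.append(f"{name}: units not set")
    return hints


# Invented header contents for illustration only.
good_header = {"Conventions": "CF-1.7"}
good_vars = {"tas": {"standard_name": "air_temperature", "units": "K"}}

bare_header = {}
bare_vars = {"var1": {}}

good_result = cf_hints(good_header, good_vars)
bare_result = cf_hints(bare_header, bare_vars)
```

For real files the dictionaries would be filled from the file header with a netCDF library, and a dedicated CF checker would be the appropriate tool for a full assessment.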
I2. (meta)data use vocabularies that follow FAIR principles.
CF-netcdf is publicly documented and openly accessible. Whether to make the conventions citable via DOIs is under discussion within the CF committee.
I3. (meta)data include qualified references to other (meta)data.
The relations that can be specified via the DataCite ‘relationType’ attribute are implemented in CERA and are accessible from the CERA web user interface and the harvesting interfaces. Users are supported in providing relations as relevant and possible for their data. However, data and metadata undergoing archival in WDCC could be linked more systematically with each other and with other relevant external knowledge, though this is naturally an open-ended task.
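The following sketch shows how such qualified references appear in a DataCite XML record and how a client might read them. The sample record, its DOIs and the helper `related_identifiers` are invented for illustration. The element and attribute names (`relatedIdentifier`, `relationType`, `relatedIdentifierType`) and the relation values `IsPartOf` and `IsDocumentedBy` come from the DataCite metadata schema, and the `kernel-4` namespace is an assumption about the schema version of the record being read.

```python
import xml.etree.ElementTree as ET

# Namespace of DataCite schema version 4 (assumed version).
DATACITE_NS = {"d": "http://datacite.org/schema/kernel-4"}


def related_identifiers(xml_text):
    """Return (relationType, relatedIdentifierType, value) triples."""
    root = ET.fromstring(xml_text)
    triples = []
    for el in root.iterfind(".//d:relatedIdentifier", DATACITE_NS):
        triples.append((
            el.get("relationType"),
            el.get("relatedIdentifierType"),
            (el.text or "").strip(),
        ))
    return triples


# Invented sample record; the identifiers are placeholders.
SAMPLE_RECORD = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.1234/example</identifier>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI"
                       relationType="IsPartOf">10.1234/parent</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="URL"
                       relationType="IsDocumentedBy">https://example.org/report</relatedIdentifier>
  </relatedIdentifiers>
</resource>"""

relations = related_identifiers(SAMPLE_RECORD)
```

Because each reference carries both a relation type and an identifier type, a client can distinguish, for example, a parent dataset from a documenting report without parsing free text.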
Reusability of WDCC Data
R1. (meta)data have a plurality of accurate and relevant attributes.
In general, WDCC metadata contain rich information about the context in which data were generated; this is ensured by the metadata requirements for data submission. Specific items covered by CERA metadata are relevant timestamps (creation and collection date), conditions under which data were created, actors involved in preparing the data, and model-related technical attributes such as model parameters and model descriptions. There are specific limitations on the machine-interpretability of certain metadata aspects. In particular, because of their complexity, data accuracy statements are currently not required to follow a standardized form, i.e., accuracy descriptions may be provided as free text. Finally, metadata accuracy is checked during the publication process by WDCC and the DOI authors.
R1.1 (meta)data are released with a clear and accessible data usage license.
Metadata are released under CC0 universal license terms. The data licenses are dependent on the user. However, WDCC recommends using CC BY 4.0.
R1.2 (meta)data are associated with their provenance.
CERA metadata includes basic provenance information such as:
- Citation information: CERA references DOI authors.
- The workflow that led to the data: CERA project and experiment summary
- Who generated or collected it: DOI authors and contributors
Provenance information related to the workflow or procedures involved in generating data is described at a basic level with project and experiment summaries and accuracy reports. The level of detail is limited. Similarly, references to the data from which archived data were derived (and descriptions of how they were derived) are currently limited to DataCite relationTypes and selected netcdf header information. The quality depends largely on the projects that generate data and request archival at WDCC.
R1.3 (meta)data meet domain-relevant community standards.
Most metadata meet relevant community standards, in particular, CF-netcdf and the DataCite metadata kernel (cf. I1, I3). However, this may be project-dependent.