WDCC Preservation Plan and Workflow
version 1.1, 04 October 2023
The WDCC is a data repository which offers long-term archival and publication of datasets relevant to climate and Earth system research. Across the whole preservation workflow, the WDCC follows the guidelines given in the OAIS reference model (Lavoie 2014). The established workflow is described in the figure below and comprises four steps:
- Provisioning of the metadata and data by the data provider
- Ingest, checking, and updating of metadata in the WDCC metadata database
- Data checks
- Data filling to the tape archive, quality assurance (QA) of data, linking data with metadata, and curation of data and metadata
The data provider supplies the metadata (MD) either through MetaXa, the WDCC's graphical MD tool, or as specific CSV files. In collaboration with selected projects, the WDCC curators can also extract the MD from external resources (e.g. CMIP6 metadata are retrieved from ESGF). The MD are stored in a temporary WDCC database (WDCC temp) and checked against the WDCC metadata scheme before they are ingested into the productive WDCC MD database.
The data filling process then starts. The first step is a data check, which includes, among other things, checksum verification, a format check, and a check for zero-byte files. In the second step, data filling is ordered and the data are archived in the WDCC tape archive. To ensure that the data have not been altered or corrupted, they are automatically retrieved back from tape and subjected to fixity checks. The subsequent technical quality assurance (TQA) is an essential part of data curation. Curation processes such as TQA and DOI assignment generate new MD, which are added to the WDCC MD database immediately.
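The data check and the later fixity check can be illustrated with a minimal Python sketch. It is not the WDCC's actual tooling: the function names are hypothetical, and SHA-256 is an assumption, since the checksum algorithm used by the WDCC is not stated here.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pre_archive_check(path: Path, expected_checksum: str) -> List[str]:
    """Return the problems found before archival; an empty list means OK.

    Hypothetical helper: covers only the zero-byte and checksum checks.
    """
    problems = []
    if path.stat().st_size == 0:
        problems.append("zero-byte file")
    if sha256_of(path) != expected_checksum.lower():
        problems.append("checksum mismatch")
    return problems


def fixity_ok(recorded_checksum: str, retrieved_copy: Path) -> bool:
    """Compare the checksum recorded before archival with that of a copy
    read back from tape (hypothetical helper)."""
    return sha256_of(retrieved_copy) == recorded_checksum.lower()
```

In such a setup, the checksum supplied with the data would be passed to `pre_archive_check` before filling, and the same value would be passed to `fixity_ok` together with the copy retrieved from tape.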
Since 2003, WDCC has been providing long-term preservation of data and metadata for the climate science community, including an optional DataCite DOI data publication. Since then, it has continuously reviewed, evolved, and consolidated the established procedures to ensure that the archived data remain findable and accessible.
The WDCC applies a standardised preservation approach to all digital assets which it accepts for archival. This means that all archived data have the same preservation level, irrespective of, e.g., the size, format, or sensitivity of the data. The WDCC preserves non-proprietary file formats which are well established in the climate science community. It strongly recommends the use of NetCDF or WMO GRIB, both of which are non-proprietary, open, international community standard formats. NetCDF is not just a file format but also a set of open software libraries designed for storing, accessing, and sharing scientific data in a backward-compatible manner. The NetCDF libraries provide a platform-independent and self-describing way to store data. In exceptional cases, the WDCC also accepts non-preferred file formats, provided they are non-proprietary. It is, however, mandatory that the data objects are additionally provided and archived in an open source format. Before the data are archived in the WDCC, all file formats are checked by the curators as part of the WDCC quality assurance.
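As an illustration of a first-pass format check, the recommended formats can be recognised by their leading bytes. The sketch below is an assumption about how such a check might look, not a description of the curators' procedure; `sniff_format` and the label strings are hypothetical. It also assumes that the signature sits at byte offset 0, which is the usual case.

```python
from pathlib import Path
from typing import Optional

# Leading byte signatures ("magic numbers") of the recommended formats.
# The HDF5 signature identifies the container used by NetCDF-4; on its own
# it does not prove that the file follows the NetCDF-4 data model.
SIGNATURES = {
    b"CDF\x01": "NetCDF classic",
    b"CDF\x02": "NetCDF 64-bit offset",
    b"CDF\x05": "NetCDF CDF-5",
    b"\x89HDF\r\n\x1a\n": "NetCDF-4 (HDF5 container)",
    b"GRIB": "WMO GRIB",
}


def sniff_format(path: Path) -> Optional[str]:
    """Return a format label based on the first bytes, or None if unknown."""
    with path.open("rb") as handle:
        head = handle.read(8)
    for magic, label in SIGNATURES.items():
        if head.startswith(magic):
            return label
    return None
```

A file for which `sniff_format` returns `None` would need manual inspection. A signature match only shows that the file starts like the expected format; it does not show that the file is complete or valid.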
Once data files (NetCDF or GRIB) are archived in the WDCC, they are not altered or modified in any way, which ensures the preservation and integrity of the archived data. This also means that the WDCC preserves the data in the formats in which they were originally submitted by the data provider and that there is no format migration.
The WDCC curators, however, may modify the metadata associated with the archived data files whenever deemed necessary. Such metadata comprise the metadata that describe the archived data on the respective WDCC landing page (both human- and machine-readable) and, in the case of DataCite data publications, the DataCite metadata.
If a data provider withdraws its data after publication, or in the extremely unlikely event of unrecoverable data damage in the archive, the landing page of the data object will still be preserved. The WDCC curator will, however, mark the metadata objects as withdrawn or inaccessible.