CHAH Recommendation

Memorandum of Understanding

Herbarium Exchange Protocol 2020

Purpose

To provide advice to Australasian herbaria on the terms required for herbarium-to-herbarium data exchange.

Recommendations

1	That herbaria adopt this protocol and work towards the recording and exchange of the proposed categories of data.
2	That this protocol be implemented in herbaria by 30 June 2021.
3	CHAH member institutions should ensure, as a matter of urgency, that staff who do not already have an ORCID iD apply for one. Funding bodies increasingly expect staff to have an ORCID iD and to generate one if necessary.
4	CHAH member institution staff should work together on Bionomia.net and other online resources to collaboratively define unique identifiers for past research staff and historical collectors. The existence of unique identifiers for these people will accelerate their adoption in collections management systems.

Background

Herbaria in Australasia have transferred data in several formats in the last 10–20 years. Following the release of HISPID, it was common for herbaria to export and import HISPID version 3 or 4 data files. More recently, herbaria have used the herbarium community’s BioCASe providers to serve HISPID 5 data, or generated Darwin Core-based spreadsheets and CSV files. All but 2–3 Australian herbaria now support the exchange of data in Darwin Core. In New Zealand, the standard has been Darwin Core for several years.

With the availability of HISPID 6 (an extension of Darwin Core) and the availability of many other Darwin Core extensions, it is time to consider what a modern, efficient method of data transfer looks like for the Australasian herbarium community.

Australasian herbaria face challenges in 2020 that cannot be met with our current exchange protocol. These include the need for permit and unique agent data, a revision to the way type data is exchanged, and a decision on how herbaria formalise the link between duplicate specimens and the way this data is exchanged.

Data exchange using HISPID version 3 and 4 formats suffers from several issues:

There are no permit fields in this standard, which precludes it from being used if an herbarium needs to send Nagoya Protocol data.
There are no fields in this standard for the transfer of unique URIs for collectors and identifiers of specimens.

Data exchange using HISPID 5 format, which is based on ABCD 2.06, suffers from several issues:

The file format is complex, making it more difficult for under-resourced institutions or very small herbaria to exchange data.
There are no fields in this standard for the transfer of unique URIs for collectors and identifiers of specimens.

The limitations of current exchange formats also include:

Plain CSV and other delimited text files do not specify a character set, making it difficult to determine how the file should be read to preserve diacritical marks. Additionally, there is no way to formally signal that the end of the document has been reached (other than reaching the end of the document). This is an issue for network data transfer because it is not always clear whether the transfer ended prematurely. (These problems also affect the CSV files in a Darwin Core Archive, but we will explain how to correct this later.)
While XML stipulates a character set and defines the end of a document, it is excessively verbose. This makes it both difficult to create without the help of software (a problem for small institutions) and excessively large in file size.
HISPID versions before 5 are no longer suitable for use, as there are no permit fields in these standards. This precludes them from being used if an herbarium needs to send Nagoya Protocol data. There is additionally no way to specify a character set as part of the data file.
Data exchange in HISPID 5 (ABCD 2.06) format is no longer suitable for use because there is no support for Collector IDs. Additionally, the file format is complex, making it more difficult for under-resourced institutions or very small herbaria to exchange data. It also slows the rate of adoption of changes to exchange processes in larger institutions.
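
The character set problem described above can be illustrated with a short Python sketch. The collector name is only an example value. The point is that the same bytes decode differently depending on the character set the receiver assumes, and a bare CSV file carries no declaration to guide that choice.

```python
# The same bytes read with two different assumed character sets.
name = "Ferdinand von Müller"

# A sending herbarium writes its CSV file as UTF-8 ...
raw_bytes = name.encode("utf-8")

# ... but nothing in a bare CSV file says so. A receiver that assumes
# Latin-1 silently corrupts the diacritical mark instead of failing.
wrong = raw_bytes.decode("latin-1")
right = raw_bytes.decode("utf-8")

print(wrong)  # Ferdinand von MÃ¼ller
print(right)  # Ferdinand von Müller
```

Because the incorrect decoding raises no error, this kind of corruption can go unnoticed until the data has already been imported.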

A Darwin Core Archive provides the following features:

Ease of use: A Darwin Core Archive comprises one or more CSV files containing columns of data using Darwin Core, HISPID 6, GGBN and other terminology as noted in this document. A CSV file is easily created, even by hand, if necessary. In comparison, XML is difficult to create and is usually only managed through software.

Comparatively small archives: By adopting CSV, a file is small when compared to the equivalent output using XML and BioCASe.

Up-to-date terminology: By adopting a term-based vocabulary based on Darwin Core and HISPID 6, we gain access to recent terminologies, including ABCD 3.

Comparatively easy extension when a new property needs to be exchanged: By adopting vocabularies (HISPID 6, Darwin Core, ABCD 3, GGBN) rather than a document schema (HISPID 5, ABCD 2.06), it is easier to add a new property to an exchange archive.

Typed data: Darwin Core Archive’s metafile provides metadata about the CSV files in an archive. This includes a character set for the data files and the type of data that is being provided in each row and column of each CSV. Additionally, a Zip file comes with its own protection against premature end of file errors. These mostly avoid the problem we identified earlier with CSV data files.
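
As an illustration of the features above, the following Python sketch builds a very small Darwin Core Archive in memory: one CSV data file plus a meta.xml metafile that declares the character set and maps each column to a term. The record values, the file name occurrence.csv and the function name build_archive are assumptions made for this example, and the metafile is a minimal sketch rather than a complete template.

```python
import csv
import io
import zipfile

DWC = "http://rs.tdwg.org/dwc/terms/"

# Hypothetical occurrence record; the column names are Darwin Core terms.
ROWS = [
    {"occurrenceID": "urn:example:1", "catalogNumber": "EX 0000001",
     "recordedBy": "Ferdinand von Müller"},
]

# Minimal metafile: declares the encoding and the term held in each column.
META_XML = """<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n"
        fieldsEnclosedBy='"' ignoreHeaderLines="1" rowType="{dwc}Occurrence">
    <files><location>occurrence.csv</location></files>
    <id index="0"/>
    <field index="0" term="{dwc}occurrenceID"/>
    <field index="1" term="{dwc}catalogNumber"/>
    <field index="2" term="{dwc}recordedBy"/>
  </core>
</archive>
"""


def build_archive(rows):
    """Return the bytes of a zip file holding occurrence.csv and meta.xml."""
    text = io.StringIO()
    writer = csv.DictWriter(text, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Encode explicitly so the bytes match the encoding declared in meta.xml.
        archive.writestr("occurrence.csv", text.getvalue().encode("utf-8"))
        archive.writestr("meta.xml", META_XML.format(dwc=DWC).encode("utf-8"))
    return buffer.getvalue()


data = build_archive(ROWS)
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    names = archive.namelist()
    # testzip() re-checks each member's CRC; None means no corruption was found.
    crc_result = archive.testzip()

# A transfer that stopped halfway is rejected rather than silently accepted.
try:
    zipfile.ZipFile(io.BytesIO(data[: len(data) // 2]))
    truncation_detected = False
except zipfile.BadZipFile:
    truncation_detected = True

print(names, crc_result, truncation_detected)
```

The last step shows the protection the Zip container adds: a file cut off partway through cannot be opened as an archive, so an incomplete transfer is detected before any rows are imported.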

Resourcing

Institutions that use this MoU to exchange data should (a) be able to export the data for each property into an exchange document and (b) be able to import the data from each property in the exchange document into local systems. These basic requirements will ensure that it will eventually be possible to accurately exchange provenance, collector, type, and permit data between herbaria. Once this occurs, it also becomes possible for institutions to be notified when these properties of a linked specimen are changed by another institution.

An institution will need resources to change its file exchange protocols so that it can take part in the revised exchange protocol. Some work will be required to determine a mapping between the local institution’s collection database and the terms in this MoU, and it may be necessary to hire external consultants to do this. The cost of this work depends on the ease of access to the required data, the availability of staff with the skills needed to complete the work, the need for external consultants and the fees they charge.

Many herbaria already have the capability to generate and exchange documents of this kind, and there is a general push for the developers of collections management software such as Specify to be more closely involved in meeting the need for Darwin Core support.

Terms

The terms included in this MoU are those in the AVH MoU Technical Addendum (2020) plus the following additional terms.

Permits

The Nagoya Protocol on Access and Benefit Sharing establishes an obligation on signatories, including Australia, to document the terms of any agreement an institution has made for the use of genetic resources. The details of permits granted when collecting plant specimens must be shared with other institutions so that there is legal clarity for the ongoing use of herbarium specimens, including duplicate specimens. To support the transfer of permit data, we adopt the GGBN Permit Vocabulary, and in particular Darwin Core’s GGBN Permit Extension.

In the AVH MoU Technical Addendum, only permitStatus is necessary for AVH, and any terms included in the data sent to ALA are automatically displayed to the public. As permit data identifies a person by their permit number, this data is sensitive and must be treated like other kinds of private data.

For herbarium exchange, we must provide a repeating set of the following terms, one set per permit.

permitStatus

http://data.ggbn.org/schemas/ggbn/terms/permitStatus

Required

The status of the permit, taken from one of the allowed values, for example: Permit available. When no permits have been granted for a specimen, this value must be set to Permit not required.

permitType

http://data.ggbn.org/schemas/ggbn/terms/permitType

Required

The type of permit assigned, taken from one of the allowed values, for example: Collecting Permit. When no permits have been granted for a specimen, this value must be set to Other.

permitStatusQualifier

http://data.ggbn.org/schemas/ggbn/terms/permitStatusQualifier

Optional

Free text, perhaps describing why a certain permit was not required or why permitStatus is unknown.

permitURI

http://data.ggbn.org/schemas/ggbn/terms/permitURI

Optional

The unique identifier of the permit, such as a UUID.

permitText

http://data.ggbn.org/schemas/ggbn/terms/permitText

Optional

Free text detail of the permit.
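
The repeating set of permit terms can be sketched in Python as one row per permit, each linked back to its specimen. This is an illustration only: the function name permit_rows, the use of occurrenceID as the linking column and the example values are assumptions, and the only allowed values used are the ones quoted in the definitions above.

```python
GGBN = "http://data.ggbn.org/schemas/ggbn/terms/"
REQUIRED = ("permitStatus", "permitType")
OPTIONAL = ("permitStatusQualifier", "permitURI", "permitText")

# Full term URIs, as they would be declared for each column in a metafile.
TERM_URIS = {term: GGBN + term for term in REQUIRED + OPTIONAL}


def permit_rows(occurrence_id, permits):
    """Return one row per permit for a single specimen.

    `permits` is a list of dicts keyed by the term names above. An empty
    list yields the single placeholder row described in this section.
    """
    if not permits:
        permits = [{"permitStatus": "Permit not required", "permitType": "Other"}]
    rows = []
    for permit in permits:
        missing = [term for term in REQUIRED if not permit.get(term)]
        if missing:
            raise ValueError(f"{occurrence_id}: missing required terms {missing}")
        # occurrenceID as the link to the core record is an assumption.
        row = {"occurrenceID": occurrence_id}
        for term in REQUIRED + OPTIONAL:
            row[term] = permit.get(term, "")
        rows.append(row)
    return rows


no_permit = permit_rows("urn:example:1", [])
one_permit = permit_rows("urn:example:2", [
    {"permitStatus": "Permit available", "permitType": "Collecting Permit",
     "permitText": "Hypothetical permit detail"},
])
print(no_permit[0]["permitStatus"], "/", no_permit[0]["permitType"])
print(len(one_permit), one_permit[0]["permitType"])
```

Raising an error when a required term is missing keeps incomplete permit sets out of an exchange file, which matters because the receiving institution relies on them for legal clarity.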

Types and Type Status

The specimen or specimens that a taxonomist uses to describe and name a new species are type specimens. These are the most important specimens in a collection.

For AVH, we exchange type information using typeStatus only, but in herbarium exchange it is preferable to separate some of that data for easier consumption by also using typeOfType from HISPID 6.

typeStatus

http://rs.tdwg.org/dwc/terms/typeStatus

Required

A list (concatenated and separated) of nomenclatural types (type status, typified scientific name, publication) applied to the subject.

typeOfType

http://hiscom.chah.org.au/hispid/terms/typeOfType

Required

The type status of the collection object.

Unique identifiers for agents

The identities of the people who collected or identified a specimen are important to the taxonomic process. Unfortunately, there is immense diversity in the way humans represent names. Additionally, a person’s name and its representation can change over time.

For example, there are many ways to refer to Ferdinand von Mueller:

Ferdinand Jakob Heinrich von Mueller
Baron Ferdinand von Mueller
Ferdinand von Müller
F.J.H. von Mueller
F.Muell

Identifying the collectors and identifiers of specimens using unique identifiers makes it possible to disambiguate people with similar names as well as check for errors. Several identifier standards now exist to uniquely represent a person in a scientific setting, including ORCID, Wikidata and VIAF.

Unique Identifier Examples

ORCID iD: https://orcid.org/0000-0001-2345-6789
Wikidata ID: http://www.wikidata.org/entity/Q708002
VIAF ID: https://viaf.org/viaf/49224511

All variation in the way Ferdinand von Mueller is cited can be handled by storing and using his Wikidata ID, http://www.wikidata.org/entity/Q708002.
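
A brief Python sketch shows the effect of grouping by identifier rather than by name. The catalogue numbers are hypothetical; the name variants and the Wikidata ID are the ones listed above.

```python
from collections import defaultdict

MUELLER = "http://www.wikidata.org/entity/Q708002"

# Hypothetical records: (catalogue number, collector as written, collector ID).
records = [
    ("EX 1", "Baron Ferdinand von Mueller", MUELLER),
    ("EX 2", "F.J.H. von Mueller", MUELLER),
    ("EX 3", "Ferdinand von Müller", MUELLER),
]

by_name = defaultdict(list)
by_id = defaultdict(list)
for catalogue_number, name, agent_id in records:
    by_name[name].append(catalogue_number)
    by_id[agent_id].append(catalogue_number)

print(len(by_name))  # 3: each spelling looks like a different collector
print(len(by_id))    # 1: the identifier groups all three specimens
```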

recordedByID

https://rs.gbif.org/terms/1.0/recordedByID

Optional

GBIF has begun supporting unique identifiers for collectors by introducing a draft change to its infrastructure. The change allows them to accept a recordedByID term alongside the standard Darwin Core recordedBy term. This new term is to contain unique identifiers for each collector listed in the recordedBy term in the same order. GBIF does not make a recommendation about which identifier standard to use, but HISCOM recommends using one of the standards listed above.

This term is used in the core occurrences file, and its value must be a pipe-separated (|) list of absolute URIs consistent with the examples listed above.

Some research staff, especially those with a historical or high profile, will commonly already have a Wikidata or VIAF ID. These should be reused rather than generating an ORCID iD for them.

identifiedByID

https://rs.gbif.org/terms/1.0/identifiedByID

Optional

This term links the unique identifiers of the people who applied a taxon to a specimen. In all other respects it is treated the same as recordedByID.
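
The handling of recordedByID and identifiedByID values can be sketched in Python as follows. The function names are hypothetical, and the sketch makes two assumptions that this protocol does not state: the names from recordedBy or identifiedBy have already been split into a list, and an absolute URI is taken to mean one with both a scheme and a host, as in the examples above.

```python
from urllib.parse import urlparse


def split_agent_ids(value):
    """Split a pipe-separated ID list and check each entry is an absolute URI."""
    ids = [part.strip() for part in value.split("|")] if value.strip() else []
    for agent_id in ids:
        parsed = urlparse(agent_id)
        # Assumption: "absolute" means a scheme and a host are both present.
        if not (parsed.scheme and parsed.netloc):
            raise ValueError(f"not an absolute URI: {agent_id!r}")
    return ids


def pair_agents(names, id_value):
    """Pair already-split names with their IDs, relying on matching order."""
    ids = split_agent_ids(id_value)
    if not ids:
        # The ID terms are optional, so an empty value is accepted.
        return [(name, None) for name in names]
    if len(ids) != len(names):
        raise ValueError("name and ID lists differ in length")
    return list(zip(names, ids))


# Hypothetical names paired with the example identifiers from this document.
pairs = pair_agents(
    ["F. von Mueller", "A. Collector"],
    "http://www.wikidata.org/entity/Q708002|https://orcid.org/0000-0001-2345-6789",
)
for name, agent_id in pairs:
    print(name, agent_id)
```

Checking that the two lists have the same length is the only safeguard available when the link between a name and its identifier is positional, so a mismatch is treated as an error rather than guessed at.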

Specimen Duplicate provenance

See the document “Handling Duplicates” for the background and methodology for supplying specimen provenance data.