Barcode Data Standard

Introduction

The Barcode Data Standard was established by the Consortium for the Barcode of Life (CBOL) soon after Dr. Paul Hebert published the first scientific paper proposing the method of DNA barcoding. View and download the official data standard document here.

Please note that, while the GenBank keyword BARCODE is no longer actively assigned by NCBI, this document is still used as a reference when creating barcode-quality sequence records on GenBank. The portions referring to submission of trace files are no longer applicable, as the NCBI Trace Archive has been retired.

The data standard consists of several required and strongly recommended elements that relate either to specimen metadata or to sequence data. Below are brief summaries and explanations of each element, with commonly seen mistakes highlighted where relevant.

Specimen Metadata
Text from Standard	GenBank Field	Required or Recommended?
“a unique identifier for the voucher specimen using a structured field specified by CBOL and NCBI”	specimen_voucher	Required
“the name of a formally described species or a provisional label for an unpublished species”	organism	Required
“Country-Code using the controlled vocabulary used by GenBank”	Geo_Loc_Name	Required
“Latitude and longitude”	lat_lon	Strongly recommended
“Name of the collector”	collected_by	Strongly recommended
“Date of collection”	collection_date	Strongly recommended
Sequence Metadata
Text from Standard	GenBank Field	Required or Recommended?
“Come from a gene region accepted by CBOL as an effective barcode” … “Include the name of the region used”	gene	Required
“the sequences of all forward and reverse primers used”	Fwd_primer_seq/ Rev_primer_seq	Required
“the names of the forward and reverse primers”	Fwd_primer_name/ Rev_primer_name	Strongly recommended
“at least 75% contiguous, high quality bases from within the approved barcode region”	nucleotide_sequence	Required
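
As a rough illustration of how the two tables above might be applied, the following Python sketch checks a record for missing elements. It assumes the record is held as a plain dictionary keyed by the field names used in the tables; the function name missing_fields and that dictionary layout are hypothetical choices for this example, not part of the standard or of any GenBank tool.

```python
# Field names as written in the tables above, grouped by the
# "Required or Recommended?" column.
REQUIRED = [
    "specimen_voucher", "organism", "Geo_Loc_Name", "gene",
    "Fwd_primer_seq", "Rev_primer_seq", "nucleotide_sequence",
]
RECOMMENDED = [
    "lat_lon", "collected_by", "collection_date",
    "Fwd_primer_name", "Rev_primer_name",
]


def missing_fields(record):
    """Return (missing_required, missing_recommended) for a record dict.

    A field counts as missing when it is absent, None, or blank.
    This only checks presence, not whether the values are well formed.
    """
    def is_blank(name):
        return not str(record.get(name) or "").strip()

    return (
        [name for name in REQUIRED if is_blank(name)],
        [name for name in RECOMMENDED if is_blank(name)],
    )
```

A record with an empty first list has every required element present; the second list shows which strongly recommended elements could still be added.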

Note

The full official definitions and descriptions for all of these terms can be found on the INSDC Feature Table page at http://www.insdc.org/files/feature_table.html#7.3

Specimen Metadata

Collection Metadata

Geo_Loc_Name – Required

The GenBank field name “geo_loc_name” (previously “Country”) is slightly confusing, not only because the INSDC country controlled vocabulary list (http://www.insdc.org/country.html) includes oceans and seas in addition to countries, but also because the country name is often followed by a colon and more specific information about where the specimen was collected. The locality terms that follow the standardized country name are typically listed from least to most specific. An example for a specimen collected on the grounds of the Smithsonian Natural History Museum might be “USA: Washington, DC; Smithsonian Natural History Museum; West Loading Dock”.
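
The structure described above can be sketched in Python as follows. The function names are hypothetical, and the use of "; " between locality terms is an assumption taken from the example value, not a rule stated by GenBank. The country name is not checked against the INSDC controlled vocabulary.

```python
def build_geo_loc_name(country, localities=()):
    """Join a country name and optional locality terms into one value.

    `country` is expected to already be a term from the INSDC controlled
    vocabulary; this function does not verify that. Locality terms should
    be passed from least to most specific.
    """
    country = country.strip()
    parts = [p.strip() for p in localities if p and p.strip()]
    if not parts:
        return country
    return f"{country}: " + "; ".join(parts)


def split_geo_loc_name(value):
    """Split a value at the first colon into (country, locality_text).

    locality_text is an empty string when no colon is present.
    """
    country, _, locality = value.partition(":")
    return country.strip(), locality.strip()
```

Calling build_geo_loc_name("USA", ["Washington, DC", "Smithsonian Natural History Museum", "West Loading Dock"]) reproduces the example value given above.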

Latitude and Longitude – Strongly Recommended

The geographical coordinates of the collection site are stored in the “lat_lon” field in decimal degrees. GenBank uses the specific format “d[d.dddd] N|S d[dd.dddd] W|E”. An example of this is “38.891262 N 77.026093 W” for the Smithsonian Natural History Museum.
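
Coordinates are often stored as signed decimal degrees (negative for south and west), so a small conversion step is usually needed. The sketch below shows one way to do it in Python; the function name format_lat_lon and the default of six decimal places are assumptions for this example, not GenBank requirements.

```python
def format_lat_lon(latitude, longitude, places=6):
    """Format signed decimal degrees as 'd.dddd N|S d.dddd W|E'.

    Negative latitude maps to S and negative longitude maps to W.
    """
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180 <= longitude <= 180:
        raise ValueError(f"longitude out of range: {longitude}")
    ns = "N" if latitude >= 0 else "S"
    ew = "E" if longitude >= 0 else "W"
    return f"{abs(latitude):.{places}f} {ns} {abs(longitude):.{places}f} {ew}"
```

For the museum example, format_lat_lon(38.891262, -77.026093) returns "38.891262 N 77.026093 W".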

Collector Name – Strongly Recommended

The name of the person(s) or institute that collected the specimen. GenBank does not provide any guidance on how to structure names (“Given Name Surname” vs. “Surname, Given Name”) or how to group multiple names, so the main requirement is to be consistent.

Collection Date – Strongly Recommended

The date(s) on which the specimen was collected. Date ranges are supported by providing two collection dates from among the supported value formats, delimited by a forward-slash character.

Here are the supported value formats, with examples:

“DD-Mmm-YYYY”: 01-Jan-2016

“Mmm-YYYY”: Jan-2016

“YYYY”: 2016

“YYYY-MM-DD”: 2016-01-01

“YYYY-MM”: 2016-01
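
The formats listed above, including ranges, can be checked with a short Python sketch such as the one below. The function name is hypothetical. The check relies on datetime.strptime, whose month abbreviations depend on the locale, so it assumes an English (or default C) locale. It does not verify that the first date of a range precedes the second.

```python
import re
from datetime import datetime

# Each entry pairs a regex for the exact shape with a strptime format
# that is used to reject impossible dates such as 31-Feb-2016.
_FORMATS = [
    (re.compile(r"\d{2}-[A-Z][a-z]{2}-\d{4}"), "%d-%b-%Y"),  # 01-Jan-2016
    (re.compile(r"[A-Z][a-z]{2}-\d{4}"), "%b-%Y"),           # Jan-2016
    (re.compile(r"\d{4}"), "%Y"),                            # 2016
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "%Y-%m-%d"),          # 2016-01-01
    (re.compile(r"\d{4}-\d{2}"), "%Y-%m"),                   # 2016-01
]


def _is_single_date(text):
    """Return True if text matches one supported format and is a real date."""
    for shape, fmt in _FORMATS:
        if shape.fullmatch(text):
            try:
                datetime.strptime(text, fmt)
                return True
            except ValueError:
                return False
    return False


def is_valid_collection_date(value):
    """Accept a single date, or two dates separated by a forward slash."""
    parts = value.split("/")
    if len(parts) > 2:
        return False
    return all(_is_single_date(part) for part in parts)
```

With this sketch, "2016-01/2016-03" is accepted as a range, while "2016-13" and "31-Feb-2016" are rejected.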

Voucher Metadata

Specimen Voucher – Required

The specimen voucher field is the most important portion of the Barcode Data Standard, because it serves as the bridge between genetic data and specimen data. This field is even more important for plants, because the plant barcode consists of more than one gene region. The two sequences that make up a plant barcode are published as two separate GenBank records, so a unique specimen voucher field is the only thing that asserts that these sequences came from the same individual.

Not only is a unique identifier required for the specimen voucher, but it also needs to be in a specific format. This requirement is easy to miss because it appears in a footnote: the data standard document specifies that the voucher specimen identifier should use a triplet structure based on elements of the Darwin Core (DwC), separated by colons:

institutionCode:collectionCode:catalogNumber

There are also instances where the voucher specimen identifier uses a doublet separated by a colon, such as in the cases of botanical collections in herbaria. For example, the doublet US:12345678 would represent a voucher specimen in the United States National Herbarium, where the code US represents both institution code and collection code.
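
A minimal Python sketch of reading both forms is shown below. The names Voucher and parse_specimen_voucher are hypothetical, and the sketch assumes that none of the individual parts contains a colon. It checks structure only and does not look up the institution code in any registry.

```python
from typing import NamedTuple, Optional


class Voucher(NamedTuple):
    institution_code: str
    collection_code: Optional[str]  # None for the doublet form
    catalog_number: str


def parse_specimen_voucher(value):
    """Parse 'institutionCode:collectionCode:catalogNumber' or the
    doublet 'institutionCode:catalogNumber' into a Voucher."""
    parts = [p.strip() for p in value.split(":")]
    if any(not p for p in parts):
        raise ValueError(f"empty part in specimen voucher: {value!r}")
    if len(parts) == 3:
        return Voucher(parts[0], parts[1], parts[2])
    if len(parts) == 2:
        return Voucher(parts[0], None, parts[1])
    raise ValueError(f"expected 2 or 3 colon-separated parts: {value!r}")
```

For the herbarium example, parse_specimen_voucher("US:12345678") yields an institution code of "US", no separate collection code, and the catalog number "12345678".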

To ensure that specimen voucher identifiers are unique and traceable, GBIF maintains the GBIF Registry of Scientific Collections (GBIF.org), which builds on GRSciColl, a comprehensive, community-curated clearinghouse of collections information originally developed by the Consortium for the Barcode of Life (CBOL).

Organism – Required

The scientific name of the organism that provided the sequenced genetic material. The text from the data standard reads “the name of a formally described species or a provisional label for an unpublished species”, which leaves room for organisms identified only to the Order or Family level. For reproducibility, GenBank recommends basing provisional names on the value of the specimen voucher.

Sequence Metadata

Nucleotide Sequence – Required: This is the DNA sequence of the barcode record.
PCR Primer Sequence(s) – Required: This refers to the sequences of the PCR primers used to amplify the DNA barcode region. All sequences should be presented in 5’ to 3’ order.
PCR Primer Name(s) – Strongly Recommended: This refers to the “common names” of the primer sequences. Unfortunately, this field is optional, and the vast majority of barcode records do not have primer names listed.
Trace Files – Optional: If desired, trace files for the forward and reverse sequencing runs may be submitted to the NCBI Sequence Read Archive (SRA). See https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/ for further information.
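
For the nucleotide and primer sequence elements above, a simple character check can catch stray symbols before submission. The Python sketch below assumes that sequences are written with the standard IUPAC nucleotide codes; the function name is hypothetical. It cannot tell whether a primer is written in 5’ to 3’ order, and it does not test length or quality requirements.

```python
# Standard IUPAC nucleotide codes, including ambiguity codes.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def is_iupac_sequence(sequence):
    """Return True if the sequence is non-empty and uses only IUPAC codes.

    Case and surrounding whitespace are ignored.
    """
    seq = sequence.strip().upper()
    return bool(seq) and set(seq) <= IUPAC_NUCLEOTIDES
```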