Organizing GenBank Records with BioProjects
What is a BioProject?
From the NCBI BioProject homepage (https://www.ncbi.nlm.nih.gov/bioproject/):
“A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium. A BioProject record provides users a single place to find links to the diverse data types generated for that project.”
BioProjects grew out of the NCBI Genome Project database, which served solely to organize genome sequences in GenBank. It became apparent that the same organizational structure could group entries for many other kinds of data, so BioProjects became a distinct entity in 2011.
BioProjects can have a hierarchical structure, meaning that project-level BioProjects can be organized together under an “umbrella” BioProject.
The Smithsonian Barcoding Network (SIBN) and Global Genome Initiative (GGI) are using BioProjects to organize the GenBank sequence records generated by each project that they funded. Each funded project will have its own BioProject, which makes searching easier and tracking progress more convenient. The SIBN BioProject can be found at https://www.ncbi.nlm.nih.gov/bioproject/81359, and the GGI BioProject can be found at https://www.ncbi.nlm.nih.gov/bioproject/384793.
When a GenBank record is added to a BioProject, a link to other records in the same BioProject appears directly on the GenBank record.
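Records in a BioProject can also be looked up programmatically. The sketch below only builds a search URL for the NCBI E-utilities `esearch` endpoint; it does not send a request. It assumes that the nucleotide database (`nuccore`) accepts a `[BioProject]` field tag in the search term, and the function name `bioproject_search_url` is made up for illustration.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities esearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def bioproject_search_url(accession, db="nuccore", retmax=20):
    """Build an esearch URL for records linked to a BioProject accession.

    Assumption: the target database supports the [BioProject] field tag.
    """
    params = {
        "db": db,
        "term": f"{accession}[BioProject]",
        "retmax": retmax,      # maximum number of record IDs to return
        "retmode": "json",     # ask for a JSON response instead of XML
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"
```

For example, `bioproject_search_url("PRJNA81359")` gives a URL for the SIBN BioProject that can be opened in a browser or fetched with any HTTP client.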
Creating a BioProject
You will need to create a BioProject before new GenBank submissions or existing GenBank records can be organized under one.
To create a BioProject, navigate to the new NCBI Submission Portal: https://submit.ncbi.nlm.nih.gov/.
- Click on “Sign in to NCBI” before you get started. Generally, it is easiest to sign in with your Google account, so that you do not have to create yet another username and password to forget.
- After signing in, you should be directed back to the Submission Portal page. Now click on the link for BioProject.
- Click the New Submission button.
- Fill out the Submitter page.
- Select “Targeted Locus (Loci)” for Project Data Type, and “Multispecies” for Sample scope.
- Give a short description for “Multispecies description”.
- The submission portal will create an automated Project Title based on your previous entries, but overwrite this with the title of your project. Give a good description of the project in “Public description”, because this will be front-and-center on the BioProject page. Finally, check the “Yes” box to indicate that this project is part of a larger initiative.
- If this BioProject falls under the SI Barcode Network, then enter “SI Barcode Network” for Initiative description, and “PRJNA81359” for BioProject Accession.
- If this BioProject falls under GGI, then enter “Global Genome Initiative” for Initiative description, and “PRJNA384793” for BioProject Accession.
Leave the rest of the entries on this page blank.
- Skip the BioSample page.
- Add any Publications your project has generated on the Publications page. Don’t worry, you can come back and add publications later.
- Finally, the Overview tab will show all of your entries in one place. This will be your last chance to make any changes before submitting.
- A few days after you submit, you will receive an email from NCBI informing you that your BioProject has been successfully created. Most importantly, the email will include your BioProject ID, which you can now add to existing GenBank records or include in new GenBank submissions.
How to update BioProject information
If your BioProject has already been published and you would like to update any of the entries from the BioProject creation process, email the changes you would like to make to firstname.lastname@example.org.
Adding a BioProject to existing GenBank records
Adding a BioProject ID to sequence records that are already published to GenBank is a manual procedure done through email. Email email@example.com, and let them know:
- your BioProject ID, and
- the range of GenBank accessions to which you would like to add the BioProject ID.
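If your accessions are in a spreadsheet or a text file rather than one tidy block, a short script can collapse them into ranges for the email. This is a minimal sketch that assumes accessions consist of a letter prefix followed by digits, with an optional version suffix such as `.1` that is dropped. The function name `accession_ranges` is hypothetical.

```python
import re

# Letter prefix, numeric part, optional ".version" suffix (ignored).
ACCESSION = re.compile(r"^([A-Z]+)(\d+)(?:\.\d+)?$")


def _format_range(start, end):
    """Render one run of consecutive accessions as 'FIRST' or 'FIRST-LAST'."""
    prefix, width, first = start
    first_acc = f"{prefix}{first:0{width}d}"
    if start == end:
        return first_acc
    return f"{first_acc}-{prefix}{end[2]:0{width}d}"


def accession_ranges(accessions):
    """Collapse accessions into sorted ranges of consecutive numbers."""
    parsed = set()
    for acc in accessions:
        match = ACCESSION.match(acc.strip().upper())
        if match is None:
            raise ValueError(f"Unrecognized accession: {acc!r}")
        prefix, digits = match.groups()
        # Keep the digit width so leading zeros survive formatting.
        parsed.add((prefix, len(digits), int(digits)))

    ranges = []
    start = prev = None
    for item in sorted(parsed):
        same_series = prev is not None and item[:2] == prev[:2]
        if same_series and item[2] == prev[2] + 1:
            prev = item  # extend the current run
            continue
        if start is not None:
            ranges.append(_format_range(start, prev))
        start = prev = item  # begin a new run
    if start is not None:
        ranges.append(_format_range(start, prev))
    return ranges
```

With the made-up accessions `["MK000003", "MK000001", "MK000002", "MK000010"]`, the function returns `["MK000001-MK000003", "MK000010"]`, which can be pasted into the email.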
Adding a BioProject to new GenBank submissions
Unfortunately, of the several methods for publishing sequences to GenBank (BankIt, Sequin, tbl2asn, Geneious, and BOLD), only tbl2asn has a straightforward way to add a BioProject ID to a batch submission.
We are currently working with the Geneious developers to have BioProjects added to the Submission Details section of the GenBank Upload Plugin.
The tbl2asn instruction manual at https://www.ncbi.nlm.nih.gov/genbank/tbl2asn2/ lists the three files required to create a submission package: a “template file”, a FASTA file containing nucleotide sequences, and a feature table with annotations. The template file is where we include the BioProject ID for a submission.
To create a GenBank submission template file, go to https://submit.ncbi.nlm.nih.gov/genbank/template/submission/, and fill out the form. The last section of the form is for “BioProject/BioSample Information”, and this is where you will add your BioProject ID.
Press the “Create Template” button to download a “.sbt” file, and bundle that with your other components for the tbl2asn command line utility.
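Before running tbl2asn, it can be worth confirming that the downloaded template really carries your BioProject ID. The sketch below makes two assumptions that you should check against your own files and the tbl2asn manual: that the ID appears as plain text in the `.sbt` file, and that the flags `-t` (template), `-p` (directory of input files), `-a s` (batch of FASTA sequences) and `-V v` (run the validator) behave as documented for your tbl2asn version. The function names and the file names in the usage note are placeholders.

```python
import re
from pathlib import Path

# Assumed shape of a BioProject accession: "PRJ", two letters, then digits.
BIOPROJECT_ID = re.compile(r"PRJ[A-Z]{2}\d+")


def template_bioprojects(template_text):
    """Return the BioProject accessions found in the text of a .sbt template."""
    return sorted(set(BIOPROJECT_ID.findall(template_text)))


def tbl2asn_command(template, input_dir):
    """Build (but do not run) a tbl2asn command line as a list of arguments."""
    return [
        "tbl2asn",
        "-t", str(Path(template)),   # submission template with the BioProject ID
        "-p", str(Path(input_dir)),  # directory holding the FASTA and feature table files
        "-a", "s",                   # treat the FASTA file as a batch of sequences
        "-V", "v",                   # run the validator on the output
    ]
```

A typical use would be to read the template with `Path("template.sbt").read_text()`, check that `template_bioprojects(...)` contains your ID, and then pass the list from `tbl2asn_command("template.sbt", "submission")` to `subprocess.run`.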