NCBI to GTDB lineage conversion #100

janstett · 2024-10-01T15:13:29Z

TreeSAPP normally gets lineage information from the NCBI, using Entrez to fetch TaxIDs and lineages.

However, according to the Wiki, it's best practice to convert the NCBI taxonomies to GTDB taxonomies before updating the trees with sequences from genomes that have been labeled with GTDB. TreeSAPP does attempt to repair mismatches at each rank, but it keeps the lineage that appears most frequently, so an update can leave you with a mix of NCBI and GTDB lineages. There isn't a simple 1:1 taxonomic conversion, but I think I have a solution.

Genome assembly accessions are available for many protein sequences and TaxIDs.

GTDB-Tk's reference databases scrape the NCBI's RefSeq and GenBank assemblies and record the assembly accessions in addition to the GTDB lineage information. Since these assembly accessions exist in both databases, they essentially link the NCBI and GTDB records.
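To illustrate the linking idea, here is a minimal Python sketch that builds an accession-to-lineage lookup. It assumes a two-column, tab-separated GTDB taxonomy file (accession, then lineage) in which accessions carry an `RS_` or `GB_` prefix. The function name `load_gtdb_lineages` and the sample accessions are hypothetical, and this is not part of TreeSAPP.

```python
import io


def load_gtdb_lineages(handle):
    """Map assembly accession -> GTDB lineage from a two-column TSV.

    Assumes each line is '<accession>\t<lineage>' and that accessions may be
    prefixed with 'RS_' (RefSeq) or 'GB_' (GenBank).
    """
    lineages = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        accession, lineage = fields[0].strip(), fields[1].strip()
        # Drop the database prefix so the key matches NCBI's accession format
        if accession.startswith(("RS_", "GB_")):
            accession = accession[3:]
        lineages[accession] = lineage
    return lineages


# Made-up accessions, for illustration only
example = io.StringIO(
    "RS_GCF_000000001.1\td__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria\n"
    "GB_GCA_000000002.1\td__Archaea;p__Halobacteriota;c__Halobacteria\n"
)
gtdb = load_gtdb_lineages(example)
print(gtdb.get("GCF_000000001.1"))
```

With the prefix removed, the keys can be compared directly against assembly accessions obtained from the NCBI.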

Where these assembly accessions are available, it would be great to have a feature that produces both a GTDB lineage table and an NCBI lineage table. The user could then compare the two tables and adjust the GTDB-compliant table, especially where GTDB lineage information isn't available (e.g. for Eukaryotes) or where an assembly accession isn't available in either database. Entrez should also be capable of fetching assembly accessions, in addition to TaxID information. The NCBI also provides summary files that link assembly accessions to each TaxID.
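A rough sketch of what such a comparison table could look like, in Python. It assumes the assembly accession for each sequence has already been obtained (via Entrez or an NCBI summary file) and that a dictionary of accession-to-GTDB-lineage is at hand. The names `build_comparison` and `write_tsv`, the column names, the status labels and the sample records are all hypothetical, not an existing TreeSAPP interface.

```python
import csv
import io

FIELDS = ["seq_id", "assembly_accession", "ncbi_lineage", "gtdb_lineage", "status"]


def build_comparison(records, gtdb_by_accession):
    """Pair each NCBI lineage with its GTDB lineage where an accession links them.

    records: iterable of (seq_id, ncbi_lineage, assembly_accession or None)
    gtdb_by_accession: dict of assembly accession -> GTDB lineage
    """
    rows = []
    for seq_id, ncbi_lineage, accession in records:
        if not accession:
            gtdb, status = "", "no_accession"      # needs manual curation
        elif accession in gtdb_by_accession:
            gtdb, status = gtdb_by_accession[accession], "converted"
        else:
            gtdb, status = "", "not_in_gtdb"       # e.g. Eukaryotes
        rows.append({
            "seq_id": seq_id,
            "assembly_accession": accession or "",
            "ncbi_lineage": ncbi_lineage,
            "gtdb_lineage": gtdb,
            "status": status,
        })
    return rows


def write_tsv(rows, handle):
    """Write the side-by-side table as tab-separated text."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Made-up records, for illustration only
gtdb_map = {"GCF_000000001.1": "d__Bacteria;p__Pseudomonadota"}
records = [
    ("seq1", "Bacteria; Proteobacteria", "GCF_000000001.1"),
    ("seq2", "Eukaryota; Fungi", "GCA_000000003.1"),
    ("seq3", "Bacteria; Firmicutes", None),
]
out = io.StringIO()
write_tsv(build_comparison(records, gtdb_map), out)
print(out.getvalue())
```

The `status` column flags the rows that still need manual adjustment, which is where the user would edit the GTDB-compliant table by hand.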

Alternatively, we could provide a guide to users on how to generate these converted tables in the Wiki.

janstett added the "feature request" label (a request for a new feature unlike one that already exists) on Oct 1, 2024
