NCBI to GTDB lineage conversion #100

Open
janstett opened this issue Oct 1, 2024 · 0 comments
Labels
feature request A request for a new feature unlike one that already exists

Comments


janstett commented Oct 1, 2024

TreeSAPP normally acquires lineage information from the NCBI, using Entrez to retrieve TaxIDs and their lineages.

However, according to the Wiki, it's best practice to convert the NCBI taxonomies to GTDB taxonomies before updating the trees with sequences from genomes that have been labeled with GTDB. While TreeSAPP does attempt to repair mismatches at each rank, it keeps the lineage that appears most frequently, so you can wind up with a mix of NCBI and GTDB lineages after updating. There isn't a simple 1:1 conversion between the two taxonomies, but I think I have a solution.

For many protein sequences and TaxIDs, genome assembly accession identifiers are available.

GTDB-Tk's reference databases are built from the NCBI's RefSeq and GenBank assemblies, and they record the assembly accessions in addition to the GTDB lineage information. Because these assembly accessions exist in both databases, they essentially link the NCBI and GTDB lineages.
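A minimal sketch of that linkage, assuming the GTDB side is a headerless two-column TSV (accession, lineage) in which accessions carry an `RS_` or `GB_` prefix in front of the NCBI assembly accession. The function names are hypothetical and not part of TreeSAPP, and the accessions and lineages below are made-up placeholders.

```python
import csv
import io


def strip_gtdb_prefix(accession):
    """Drop a leading 'RS_' or 'GB_' (assumed GTDB prefix) from an accession."""
    if accession.startswith(("RS_", "GB_")):
        return accession[3:]
    return accession


def load_gtdb_taxonomy(handle):
    """Read a headerless TSV of (accession, lineage) into a dict keyed by
    the bare NCBI assembly accession."""
    lineages = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        lineages[strip_gtdb_prefix(row[0])] = row[1]
    return lineages


# Placeholder data standing in for a GTDB taxonomy file (not real records).
gtdb_tsv = io.StringIO(
    "RS_GCF_000000001.1\td__Bacteria;p__ExamplePhylum;c__ExampleClass\n"
    "GB_GCA_000000002.1\td__Archaea;p__OtherPhylum;c__OtherClass\n"
)
gtdb_map = load_gtdb_taxonomy(gtdb_tsv)

# An assembly accession obtained from the NCBI side is the join key.
found = gtdb_map.get("GCF_000000001.1")
missing = gtdb_map.get("GCF_999999999.1")  # None: no GTDB entry for this assembly
```

This only does exact matches on the accession, including its version suffix. Whether GCA/GCF pairs or differing versions should be treated as equivalent is left open here.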

Where these assembly accessions are available, it would be great to have a feature that produces both a GTDB lineage table and an NCBI lineage table, so the user can compare the two and make adjustments to the GTDB-compliant table. This matters especially where GTDB lineage information isn't available (e.g. for Eukaryotes) or where an assembly accession isn't available in either database. Entrez should also be capable of fetching assembly accessions in addition to TaxID information, and the NCBI provides summary files that link assembly accessions to each TaxID.
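Roughly what I have in mind for the two tables, as a sketch only. The record fields (`seq_id`, `assembly`, `ncbi_lineage`), the `status` values, and the function names are all hypothetical, and the data is made up. Rows that can't be converted are kept with an empty GTDB lineage and a status flag so they can be curated by hand.

```python
import csv
import io

FIELDS = ["seq_id", "assembly", "lineage", "status"]


def build_lineage_tables(records, gtdb_map):
    """Return (ncbi_rows, gtdb_rows) with one row per input record.

    gtdb_map maps bare assembly accessions to GTDB lineage strings.
    """
    ncbi_rows, gtdb_rows = [], []
    for rec in records:
        acc = rec.get("assembly") or ""
        ncbi_rows.append({"seq_id": rec["seq_id"], "assembly": acc,
                          "lineage": rec["ncbi_lineage"], "status": "ncbi"})
        gtdb_lineage = gtdb_map.get(acc) if acc else None
        if gtdb_lineage:
            status = "converted"
        elif acc:
            status = "assembly_not_in_gtdb"   # e.g. Eukaryotes
        else:
            status = "no_assembly_accession"
        gtdb_rows.append({"seq_id": rec["seq_id"], "assembly": acc,
                          "lineage": gtdb_lineage or "", "status": status})
    return ncbi_rows, gtdb_rows


def to_tsv(rows):
    """Serialise rows to a TSV string with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up inputs for illustration.
gtdb_map = {"GCF_000000001.1": "d__Bacteria;p__ExamplePhylum"}
records = [
    {"seq_id": "seq1", "assembly": "GCF_000000001.1",
     "ncbi_lineage": "Bacteria; NcbiPhylum"},
    {"seq_id": "seq2", "assembly": "GCA_000000009.1",
     "ncbi_lineage": "Eukaryota; SomeClade"},
    {"seq_id": "seq3", "assembly": None,
     "ncbi_lineage": "Bacteria; AnotherPhylum"},
]
ncbi_rows, gtdb_rows = build_lineage_tables(records, gtdb_map)
needs_review = [r["seq_id"] for r in gtdb_rows if r["status"] != "converted"]
gtdb_table = to_tsv(gtdb_rows)
```

Both tables share `seq_id`, so they can be compared side by side, and `needs_review` lists the entries the user would have to fill in manually.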

Alternatively, we could provide a guide to users on how to generate these converted tables in the Wiki.

@janstett janstett added the feature request A request for a new feature unlike one that already exists label Oct 1, 2024