TreeSAPP normally acquires lineage information from the NCBI, using Entrez to fetch TaxIDs and their lineages.
However, based on the Wiki, it's best practice to first convert the NCBI taxonomies to GTDB taxonomies before updating the trees with sequences from genomes that have been labeled with GTDB. While treesapp does attempt to repair mismatches at each rank, it keeps the lineage that appears most frequently, so you can wind up with a mix of NCBI and GTDB lineages after updating. There isn't a simple 1:1 taxonomic conversion, but I think I have a solution.
For many protein sequences and taxIDs, genome assembly accession identifiers are available.
GTDB-tk's reference databases scrape the NCBI's reference and GenBank assemblies and record the assembly accessions in addition to the GTDB lineage information. Since these assembly accessions exist in both databases, they essentially link the NCBI and GTDB entries.
For instances where these assembly accessions are available, it would be great to have a feature that produces a GTDB lineage table and an NCBI lineage table, so the user can compare both and make adjustments to the GTDB-compliant table. This matters most where GTDB lineage information isn't available (e.g. for Eukaryotes) or where an assembly accession isn't available in either database. Entrez should be capable of fetching assembly accessions in addition to TaxID information. The NCBI also has summary files that link assembly accessions to each taxID.
Alternatively, we could provide a guide to users on how to generate these converted tables in the Wiki.
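To make the idea concrete, here is a rough sketch of the join I have in mind, not anything TreeSAPP does today. It assumes the GTDB metadata is a tab-separated file with `accession` and `gtdb_taxonomy` columns, and that GTDB prefixes assembly accessions with `RS_` or `GB_`; please check both against the release you use. The function names, the `seq_id`/`assembly`/`ncbi_lineage` keys, the `UNRESOLVED` marker, and the example rows are all made up for illustration.

```python
import csv
import io


def strip_gtdb_prefix(accession):
    """Drop the 'RS_'/'GB_' prefix (assumed GTDB convention) from an accession."""
    if accession.startswith(("RS_", "GB_")):
        return accession[3:]
    return accession


def load_gtdb_lineages(handle, acc_col="accession", tax_col="gtdb_taxonomy"):
    """Map assembly accession -> GTDB lineage from a tab-separated metadata file.

    The column names are assumptions about the GTDB metadata layout.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {strip_gtdb_prefix(row[acc_col]): row[tax_col] for row in reader}


def build_lineage_tables(ncbi_rows, gtdb_lineages, missing="UNRESOLVED"):
    """Return (ncbi_table, gtdb_table) as lists of (seq_id, lineage) tuples.

    ncbi_rows: dicts with hypothetical keys 'seq_id', 'assembly', 'ncbi_lineage'.
    Rows with no assembly accession, or with one GTDB does not know about
    (e.g. Eukaryotes), get the `missing` marker so they can be fixed by hand.
    Accessions are matched by exact string only.
    """
    ncbi_table, gtdb_table = [], []
    for row in ncbi_rows:
        ncbi_table.append((row["seq_id"], row["ncbi_lineage"]))
        lineage = gtdb_lineages.get(row.get("assembly") or "")
        gtdb_table.append((row["seq_id"], lineage if lineage else missing))
    return ncbi_table, gtdb_table


# Illustrative data only: placeholder accessions and lineages.
example_metadata = io.StringIO(
    "accession\tgtdb_taxonomy\n"
    "RS_GCF_000000000.1\td__Bacteria;p__ExamplePhylum\n"
)
example_ncbi_rows = [
    {"seq_id": "seq1", "assembly": "GCF_000000000.1",
     "ncbi_lineage": "Bacteria; ExamplePhylumNCBI"},
    {"seq_id": "seq2", "assembly": None,
     "ncbi_lineage": "Eukaryota; ExampleClade"},
]

gtdb_lineages = load_gtdb_lineages(example_metadata)
ncbi_table, gtdb_table = build_lineage_tables(example_ncbi_rows, gtdb_lineages)
for (seq_id, ncbi_lineage), (_, gtdb_lineage) in zip(ncbi_table, gtdb_table):
    print(seq_id, ncbi_lineage, gtdb_lineage, sep="\t")
```

Keeping the unresolved rows in the GTDB table, instead of dropping them, is deliberate: those are exactly the entries the user would need to review and fill in manually.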