question on GTDBr214.1 gtdb taxdump file regarding taxID #8

Open
aababc1 opened this issue Dec 6, 2023 · 6 comments

Comments

@aababc1

aababc1 commented Dec 6, 2023

Hello, and thank you for your nice work.
I downloaded the GTDB taxonomy files you created in order to build a Kraken database.
I used the GTDB R214.1 taxdump files.

I found that one specific taxon does not follow the seven-level taxonomy (domain to species).
I checked this directly in the downloaded taxonomy files as well.
$grep 1830337315 *
GTDBr214.1_taxid_taxonomy:GCA_003162175.1 1830337315 Archaea;Halobacteriota;Bog-38;Bog-38 sp003162175;003162175

$taxonkit lineage <(echo 1830337315) --data-dir /data1/DBs/kraken2/gtdbr214.1/gtdb-taxdump/R214.1/
1830337315 Archaea;Halobacteriota;Bog-38;Bog-38 sp003162175;003162175
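
A quick way to find other affected TaxIds is to count the levels in each lineage. The following is a minimal sketch, not part of taxonkit: it assumes the input is the two-column text printed by `taxonkit lineage` (TaxId, then a semicolon-separated lineage) and that a complete GTDB lineage has eight levels (seven ranks plus the assembly leaf), as in the R220 output later in this thread. The function names are hypothetical.

```python
# Assumption: a complete lineage has 8 levels (7 GTDB ranks + assembly leaf).
EXPECTED_LEVELS = 8


def parse_lineage_line(line):
    """Split one 'taxid<whitespace>lineage' line into (taxid, [names])."""
    taxid, lineage = line.strip().split(None, 1)
    return taxid, lineage.split(";")


def short_lineages(lines, expected=EXPECTED_LEVELS):
    """Return (taxid, level_count) for lineages with fewer levels than expected."""
    result = []
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        taxid, names = parse_lineage_line(line)
        if len(names) < expected:
            result.append((taxid, len(names)))
    return result


# Two lineage lines copied from this thread (R214.1 and R220 output).
EXAMPLE_LINES = [
    "1830337315 Archaea;Halobacteriota;Bog-38;Bog-38 sp003162175;003162175",
    "1662163052\tBacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;WRAA01;"
    "WRAA01 sp009780445;009780445",
]

if __name__ == "__main__":
    for taxid, count in short_lineages(EXAMPLE_LINES):
        print(f"{taxid}: only {count} levels")
```

Only the first line is reported, because its lineage has five levels instead of eight.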

I think names that are repeated across different taxonomic ranks have somehow been removed.

On the official GTDB site, the same name appears at several different ranks.
image

I don't know whether they are removed when taxonkit runs or when the taxdump files are created.
Could you look into it?

Thank you very much.

aababc1 changed the title from "Hello I have a question on your data" to "question on GTDBr214.1 gtdb taxdump file regarding taxID" on Dec 6, 2023
@shenwei356
Owner

Yes, it's "removed" during taxdump file creation.

This is documented in the help message:

$ taxonkit create-taxdump -h
Attentions:
  1. Names should be distinct in taxa of different ranks.
     But for these missing some taxon nodes, using names of parent nodes is allowed:

       GB_GCA_018897955.1      d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155

     It can also detect duplicate names with different ranks, e.g.,
     the Class and Genus have the same name B47-G6, and the Order and Family
     between them have different names. In this case, we reassign a new TaxId
     by increasing the TaxId until it being distinct.

       GB_GCA_003663585.1      d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585
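
The visible effect of the two cases in the help message can be illustrated with a small sketch. This is not taxonkit's implementation; it only mimics the shape of the lineages shown in this thread by merging a name that is repeated at consecutive ranks, and by leaving names alone when different names sit between the repeats. The function name is hypothetical.

```python
def collapse_repeated_names(gtdb_lineage):
    """Drop rank prefixes and merge names repeated at consecutive ranks.

    Illustration only: it reproduces the look of the lineages in this
    thread, not the TaxId logic inside `taxonkit create-taxdump`.
    """
    names = [part.split("__", 1)[1] for part in gtdb_lineage.split(";")]
    collapsed = []
    for name in names:
        if not collapsed or collapsed[-1] != name:
            collapsed.append(name)
    return collapsed


# Case 1 from the help message: child nodes reuse the parent's name.
CASE_1 = ("d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;"
          "f__LFW-46;g__LFW-46;s__LFW-46 sp018897155")

# Case 2 from the help message: class and genus share a name, but the
# order and family between them are named differently.
CASE_2 = ("d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;"
          "f__47-G6;g__B47-G6;s__B47-G6 sp003663585")

if __name__ == "__main__":
    print(";".join(collapse_repeated_names(CASE_1)))
    print(";".join(collapse_repeated_names(CASE_2)))
```

In the first case the seven ranks shrink to four names, which is the pattern reported at the top of this issue. In the second case all seven names survive, because the repeated name is not adjacent to itself.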

@aababc1
Author

aababc1 commented Dec 7, 2023

Thank you for your reply.
I wonder how this affects downstream analysis.

When I used your taxdump files to build a Kraken2 database from the 85202 GTDB R214.1 species-representative genomes, the Kraken and Bracken report files lacked certain taxonomic units.

As I understand it, GTDB taxa that carry placeholder names, as in the case I mentioned, are genuine taxa that should be considered in the analysis.

In the Kraken report, the intermediate ranks with duplicated names (such as class, order, and family) are simply omitted, which would affect taxonomic abundance analysis at those ranks.

Or I may have misunderstood how Kraken2 and Bracken process the taxonomy.

When I converted the Bracken output to an mpa-style taxonomic composition report using KrakenTools, provided by the Kraken2 authors, it produced output like this:

bracken2mpa:d__Archaea|p__Halobacteriota|c__Bog-38 0.0002
bracken2mpa:d__Archaea|p__Halobacteriota|c__Bog-38|s__Bog-38_sp003158275 0.0002

At the species, phylum, and class levels the reports will be complete, but my guess is that the information at the order and family levels will simply vanish.
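
To see exactly which ranks are absent from an mpa-style line, the rank prefixes can be compared against the seven expected ones. This is a minimal sketch assuming the `d__`/`p__`/.../`s__` prefixes and `|` separator shown in the output above; `missing_ranks` is a hypothetical helper, not a KrakenTools function.

```python
# Rank prefixes in order, as they appear in the mpa-style output above.
RANK_PREFIXES = ["d", "p", "c", "o", "f", "g", "s"]


def missing_ranks(mpa_lineage):
    """Return the rank prefixes skipped above the deepest rank present."""
    present = [part.split("__", 1)[0] for part in mpa_lineage.split("|")]
    deepest = max(RANK_PREFIXES.index(prefix) for prefix in present)
    return [rank for rank in RANK_PREFIXES[:deepest] if rank not in present]


if __name__ == "__main__":
    lineages = [
        "d__Archaea|p__Halobacteriota|c__Bog-38",
        "d__Archaea|p__Halobacteriota|c__Bog-38|s__Bog-38_sp003158275",
    ]
    for lineage in lineages:
        print(lineage, "->", missing_ranks(lineage))
```

For the species line this reports order, family, and genus (`o`, `f`, `g`) as missing, while the class line has no gaps above it.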

Although Kraken2 and Bracken are not part of gtdb-taxdump, their analyses depend heavily on the taxonomy files, so I would like to hear your thoughts on this.

Thank you very much.

@shenwei356
Owner

shenwei356 commented Dec 7, 2023

I understand your worries. In practice, we only summarize at the phylum and species ranks.

Besides, predictions with an abundance as low as 0.0002 are probably false positives.

You can also ask whether KrakenTools can support these cases.
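
The two suggestions above (summarize only at phylum and species, and distrust very low abundances) could be applied to mpa-style rows roughly as follows. This is a sketch under stated assumptions: the `min_abundance` cutoff of 0.001 is a hypothetical value chosen for illustration, not a threshold recommended in this thread, and the first example row is made up.

```python
def summarize(rows, keep_ranks=("p", "s"), min_abundance=0.001):
    """Keep rows whose deepest rank is in keep_ranks and whose abundance
    reaches min_abundance (hypothetical cutoff, adjust for your data)."""
    kept = []
    for lineage, abundance in rows:
        deepest = lineage.split("|")[-1].split("__", 1)[0]
        if deepest in keep_ranks and abundance >= min_abundance:
            kept.append((lineage, abundance))
    return kept


# The first row is illustrative only; the other two come from this thread.
EXAMPLE_ROWS = [
    ("d__Archaea|p__Halobacteriota", 0.02),
    ("d__Archaea|p__Halobacteriota|c__Bog-38", 0.0002),
    ("d__Archaea|p__Halobacteriota|c__Bog-38|s__Bog-38_sp003158275", 0.0002),
]

if __name__ == "__main__":
    for lineage, abundance in summarize(EXAMPLE_ROWS):
        print(f"{lineage}\t{abundance}")
```

With these inputs only the phylum row is kept: the class row is not at a summarized rank, and the species row falls below the cutoff.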

@aababc1
Author

aababc1 commented Dec 7, 2023

Okay.
I see your point.
There may be some workable approaches I can adopt.
Thank you very much for your comment.

@aababc1
Author

aababc1 commented Apr 29, 2024

Hello.
I am using the GTDB R220 gtdb-taxdump files.
I asked once before about missing taxonomic ranks in gtdb-taxdump.
The taxonomy information now seems to have changed to align with GTDB's classification system:
shenwei356/taxonkit#92.

I have two questions about this.
1 : In the r220 GTDB taxonomy files, are you using the full taxonomy, as you changed in taxonkit 0.16.2 (by allowing duplicated names in different ranks)?
2 : If the answer to the first question is yes, can previous versions of taxonkit no longer be used with the updated GTDB taxonomy?

Thank you for your great contribution, Wei.

@shenwei356
Owner

1 : In the r220 GTDB taxonomy files, are you using the full taxonomy, as you changed in taxonkit 0.16.2 (by allowing duplicated names in different ranks)?

Yes. shenwei356/taxonkit#92 (comment)

2 : If the answer to the first question is yes, can previous versions of taxonkit no longer be used with the updated GTDB taxonomy?

0.16.2 has not been released yet :). The version is 0.16.0.

Old taxonkit versions can still be used with the updated GTDB taxonomy, because the taxdump file format has not changed.

taxonkit v0.2.5 (Oct 12, 2018)

$ ./taxonkit version 
taxonkit v0.2.5

Checking new version...
New version available: taxonkit v0.16.0 at https://github.com/shenwei356/taxonkit/releases/tag/v0.16.0

$ echo 1662163052 | ./taxonkit lineage --data-dir gtdb-taxdump/R220/
1662163052      Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;WRAA01;WRAA01 sp009780445;009780445

The latest version (v0.16.0)

$ taxonkit version 
taxonkit v0.16.0
$ echo 1662163052 | taxonkit lineage --data-dir gtdb-taxdump/R220/
1662163052      Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;WRAA01;WRAA01 sp009780445;009780445
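
Because the R220 lineage above carries all levels, each name can be paired with its rank by position. The sketch below assumes the first seven names are domain through species and the eighth is the assembly leaf, as in the output above; `to_prefixed` is a hypothetical helper. It refuses shorter lineages, where position no longer identifies the rank.

```python
GTDB_PREFIXES = ["d", "p", "c", "o", "f", "g", "s"]


def to_prefixed(lineage):
    """Turn 'Bacteria;Bacillota_A;...' into 'd__Bacteria;p__Bacillota_A;...'.

    Assumes the first seven names are domain..species; any further name
    (the assembly leaf) is ignored.
    """
    names = lineage.split(";")
    if len(names) < len(GTDB_PREFIXES):
        raise ValueError(f"expected at least 7 levels, got {len(names)}")
    return ";".join(
        f"{prefix}__{name}" for prefix, name in zip(GTDB_PREFIXES, names)
    )


# Lineage copied from the R220 output above.
R220_LINEAGE = ("Bacteria;Bacillota_A;Clostridia;Lachnospirales;"
                "WRAA01;WRAA01;WRAA01 sp009780445;009780445")

if __name__ == "__main__":
    print(to_prefixed(R220_LINEAGE))
```

The five-level R214.1 lineage from the start of this issue would raise `ValueError` here, which is a simple way to tell the two layouts apart.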
