Add InterProScan to Pipeline and integrate in AMPcombi #428
base: dev
Warning: A newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.1.0. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.
@nf-core-bot fix linting
Also fixes issue #434.
The main issue is that I don't like the use of `function`: we already use `function` in `funcscan` in a broad sense... Can you refine what exactly we are using InterProScan for? Then we can adjust the naming.
Almost there!
Also missing README update
Great work overall 💪
Now that you introduce the `protein_annotation` workflow, I wonder if we should rename the DNA-level `annotation` workflow (of pyrodigal, bakta etc.). Maybe to `contig_annotation`, `cds_annotation`, or `orf_annotation`?
| MultiQC | 1.24.0 | 1.27 |
| Pyrodigal | 3.3.0 | 3.6.3 |
| seqkit | 2.8.1 | 2.9.0 |
=======
=======
| Tool | Previous version | New version |
| ------------ | ---------------- | ----------- |
| AMPcombi | 0.2.2 | 2.0.1 |
| Bakta | 1.9.3 | 1.10.4 |
| InterProScan | - | 5.59_91.0 |
| Macrel | 1.2.0 | 1.4.0 |
| MMseqs2 | 15.6f452 | 17.b804f |
| MultiQC | 1.24.0 | 1.27 |
| Pyrodigal | 3.3.0 | 3.6.3 |
| seqkit | 2.8.1 | 2.9.0 |
Is this table formatting intended? I don't know if we should write it like this (i.e. without filling up spaces).
@@ -70,6 +70,14 @@

> Eddy S. R. (2011). Accelerated Profile HMM Searches. PLoS computational biology, 7(10), e1002195. [DOI: 10.1371/journal.pcbi.1002195](https://doi.org/10.1371/journal.pcbi.1002195)

- [InterPro](https://doi.org/10.1093/nar/gkaa977)

> Blum, M., Chang, H-Y., Chuguransky, S., Grego, T., Kandasaamy, S., Mitchell, A., Nuka, G., Paysan-Lafosse, T., Qureshi, M., Raj, S., Richardson, L., Salazar, G.A., Williams, L., Bork, P., Bridge, A., Gough, J., Haft, D.H., Letunic, I., Marchler-Bauer, A., Mi, H., Natale, D.A., Necci, M., Orengo, C.A., Pandurangan, A.P., Rivoire, C., Sigrist, C.A., Sillitoe, I., Thanki, N., Thomas, P.D., Tosatto, S.C.E, Wu, C.H., Bateman, A., Finn, R.D. (2021) The InterPro protein families and domains database: 20 years on, Nucleic Acids Research, 49(D1), D344–D354.[DOI: 10.1093/nar/gkaa977](https://doi.org/10.1093/nar/gkaa977).
Suggested change:
> Blum, M., Chang, H-Y., Chuguransky, S., Grego, T., Kandasaamy, S., Mitchell, A., Nuka, G., Paysan-Lafosse, T., Qureshi, M., Raj, S., Richardson, L., Salazar, G. A., Williams, L., Bork, P., Bridge, A., Gough, J., Haft, D. H., Letunic, I., Marchler-Bauer, A., Mi, H., Natale, D. A., Necci, M., Orengo, C. A., Pandurangan, A. P., Rivoire, C., Sigrist, C. A., Sillitoe, I., Thanki, N., Thomas, P. D., Tosatto, S. C. E., Wu, C. H., Bateman, A., Finn, R. D. (2021) The InterPro protein families and domains database: 20 years on. Nucleic Acids Research, 49(D1), D344–D354. [DOI: 10.1093/nar/gkaa977](https://doi.org/10.1093/nar/gkaa977)
- [InterProScan](https://doi.org/10.1093/bioinformatics/btu031)

> Jones, P., Binns, D., Chang, H-Y., Fraser, M., Li, W., McAnulla, C., McWilliam, H., Maslen, J., Mitchell, A., Nuka, G., Pesseat, S., Quinn, A.F., Sangrador-Vegas, A., Scheremetjew, M., Yong, S-Y., Lopez, R., Hunter, S. (2014)InterProScan 5: genome-scale protein function classification, Bioinformatics, 30(9), 1236–1240. [DOI: 10.1093/bioinformatics/btu031](https://doi.org/10.1093/bioinformatics/btu031)
Suggested change:
> Jones, P., Binns, D., Chang, H-Y., Fraser, M., Li, W., McAnulla, C., McWilliam, H., Maslen, J., Mitchell, A., Nuka, G., Pesseat, S., Quinn, A. F., Sangrador-Vegas, A., Scheremetjew, M., Yong, S-Y., Lopez, R., Hunter, S. (2014) InterProScan 5: genome-scale protein function classification. Bioinformatics, 30(9), 1236–1240. [DOI: 10.1093/bioinformatics/btu031](https://doi.org/10.1093/bioinformatics/btu031)
withName: SEQKIT_SEQ_FILTER {
    ext.prefix = { "${meta.id}_cleaned.faa" }
    publishDir = [
        path: { "${params.outdir}/protein_annotation/interproscan/" },
Are we sure we want the output in `${params.outdir}/protein_annotation/interproscan/` and not in `${params.outdir}/annotation/interproscan/`? I'd prefer the latter, to have it all in one place regardless of DNA (pyrodigal etc.) or protein annotation (interproscan). I think it's more intuitive to search for any annotation results in a single folder.
If not, what do you think of renaming the `annotation` output folder to `contig_annotation`?
    .first()
} else {
    INTERPROSCAN_DATABASE ( params.protein_annotation_interproscan_db_url )
    ch_versions = ch_versions.mix( INTERPROSCAN_DATABASE.out.versions )
Suggested change:
ch_versions = ch_versions.mix( INTERPROSCAN_DATABASE.out.versions )
}

INTERPROSCAN( ch_faa_for_interproscan, ch_interproscan_db )
ch_versions = ch_versions.mix( INTERPROSCAN.out.versions )
Suggested change:
ch_versions = ch_versions.mix( INTERPROSCAN.out.versions )
ch_versions = ch_versions.mix( INTERPROSCAN.out.versions )
ch_interproscan_tsv = ch_interproscan_tsv.mix( INTERPROSCAN.out.tsv )

// Current INTERPROSCAN version 5.59_91.0 only includes 13 columns and not 15 which ampcombi expects, so we added them here
Isn't this something to solve upstream, on the AMPcombi side? 😬 It's OK for now, I guess, but it would be better to have this column-count check done by AMPcombi rather than at pipeline level.
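For illustration, the padding step under discussion could look roughly like the sketch below if it were written as a standalone helper. This is a minimal Python sketch, not the PR's Nextflow code and not AMPcombi's implementation: the function name `pad_tsv_rows` and the `-` filler value are assumptions made for this example, and only the 13 and 15 column counts come from the code comment in the diff.

```python
EXPECTED_COLUMNS = 15  # column count that, per the diff comment, AMPcombi expects
FILLER = "-"           # assumption: placeholder value for the missing columns


def pad_tsv_rows(tsv_text: str, expected: int = EXPECTED_COLUMNS, filler: str = FILLER) -> str:
    """Right-pad every non-empty TSV row to `expected` tab-separated fields.

    Rows that already have `expected` or more fields are left unchanged.
    """
    padded_rows = []
    for line in tsv_text.splitlines():
        if not line:
            continue  # skip blank lines rather than emitting rows of fillers
        fields = line.split("\t")
        # A negative multiplier yields an empty list, so long rows are untouched.
        fields += [filler] * (expected - len(fields))
        padded_rows.append("\t".join(fields))
    # Keep a trailing newline only when there is at least one row.
    return "\n".join(padded_rows) + ("\n" if padded_rows else "")
```

Doing this once, inside the tool that consumes the file, would mean the expected column count lives next to the parser that depends on it, which is the reviewer's point.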
PROTEIN_ANNOTATION (
    ch_input_for_protein_annotation
)
Suggested change:
PROTEIN_ANNOTATION ( ch_input_for_protein_annotation )
ch_interproscan_tsv = PROTEIN_ANNOTATION.out.tsv.map { meta, file ->
    if (file == [] || file.isEmpty()) {
        log.warn("[nf-core/funcscan] Protein annotation with INTERPROSCAN produced an empty TSV file. No protein annotation will be added for ${meta.id}.")
Suggested change:
log.warn("[nf-core/funcscan] Protein annotation with InterProScan produced an empty TSV file. No protein annotation will be added for sample ${meta.id}.")
PR checklist
This PR adds InterProScan to FUNCSCAN. It also integrates it with AMPcombi v2.0.1, which can parse InterProScan output via an optional flag.
This PR also closes issue #434.
🚨 🚨 As InterProScan requires a large database, I have not added it to any of the CI tests, as downloading the database alone would take 4 hours!
👀 👀 👀 👀 👀 👀 Still TODO once AMPcombi 2.0.1 is updated in nf-core: DONE!
- Make sure your code lints (`nf-core lint`).
- Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- Check for unexpected warnings in debug mode (`nextflow run . -profile debug,test,docker --outdir <OUTDIR>`).
- Usage documentation in `docs/usage.md` is updated.
- Output documentation in `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).