Merge branch 'dev' into nf-core-template-merge-3.1.2

nf-core · Jan 21, 2025 · commit 53e3569
2 parents caaa3b3 + 75ea4d4

Showing 93 changed files with 4,926 additions and 994 deletions.
diff --git a/.github/workflows/awsfulltest.yml b/.github/workflows/awsfulltest.yml
@@ -45,7 +45,6 @@ jobs:
 
       - name: Launch workflow via Seqera Platform
         uses: seqeralabs/action-tower-launch@v2
-        # TODO nf-core: You can customise AWS full pipeline tests as required
         # Add full size test data (but still relatively small datasets for few samples)
         # on the `test_full.config` test runs with only one set of parameters
         with:

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -27,7 +27,7 @@ jobs:
     strategy:
       matrix:
         NXF_VER:
-          - "24.04.2"
+          - "24.10.1"
           - "latest-everything"
         profile:
           - "conda"

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,14 +3,17 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## v1.1.1 - [date]
+## v2.0.0~dev - [unreleased]
 
-Initial release of nf-core/pairgenomealign, created with the [nf-core](https://nf-co.re/) template.
+## [v1.1.1](https://github.com/nf-core/pairgenomealign/releases/tag/1.1.1) "Kani nabe" - [December 17th, 2024]
+
+This release brings the pipeline to the standards of Nextflow 24.10.1 and
+nf-core 3.1.0. No changes were made to the alignment process.
 
-### `Added`
+## [v1.1.0](https://github.com/nf-core/pairgenomealign/releases/tag/1.1.0) "Nattou maki" - [September 27th, 2024]
 
-### `Fixed`
+Added a new `softmask` parameter, to optionally keep original softmasking.
 
-### `Dependencies`
+## [v1.0.0](https://github.com/nf-core/pairgenomealign/releases/tag/1.0.0) "Sweet potato" - [August 27th, 2024]
 
-### `Deprecated`
+Initial release of nf-core/pairgenomealign, created with the [nf-core](https://nf-co.re/) template.
diff --git a/CITATIONS.md b/CITATIONS.md
@@ -8,15 +8,27 @@
 
 > Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.
 
+## Pipeline design
+
+> Charles Plessy, Michael J. Mansfield, Aleksandra Bliznina, Aki Masunaga, Charlotte West, Yongkai Tan, Andrew W. Liu, Jan Grašič, María Sara del Río Pisula, Gaspar Sánchez-Serna, Marc Fabrega-Torrus, Alfonso Ferrández-Roldán, Vittoria Roncalli, Pavla Navratilova, Eric M. Thompson, Takeshi Onuma, Hiroki Nishida, Cristian Cañestro, Nicholas M. Luscombe. Extreme genome scrambling in marine planktonic Oikopleura dioica cryptic species. Genome Res. 2024. 34: 426-440; doi: [10.1101/gr.278295.123](https://doi.org/10.1101/gr.278295.123). PubMed ID: [38621828](https://pubmed.ncbi.nlm.nih.gov/38621828/)
+
 ## Pipeline tools
 
-- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
+- [LAST](https://gitlab.com/mcfrith/last/)
+
+  > Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. Adaptive seeds tame genomic sequence comparison. Genome Res. 2011 21(3):487-93. doi: 10.1101/gr.113985.110. PubMed PMID: 21209072 (This describes the main algorithms used by LAST.)
+
+  > Frith MC, Noé L. Improved search heuristics find 20,000 new alignments between human and mouse genomes. doi: 10.1093/nar/gku104 PubMed PMID: 24493737 (This describes sensitive DNA seeding (MAM8 and MAM4).)
+
+  > Frith MC, Kawaguchi R. Split-alignment of genomes finds orthologies more accurately. Genome Biology. 2015 16:106. doi: 10.1186/s13059-015-0670-9 PubMed PMID: 25994148 (Describes the split alignment algorithm, and its application to whole genome alignment.)
+
+  > Hamada M, Ono Y, Asai K, Frith MC. Training alignment parameters for arbitrary sequencers with LAST-TRAIN. Bioinformatics. 2017 33(6):926-928. doi: 10.1093/bioinformatics/btw742 PubMed PMID: 28039163 (Describes last-train.)
 
-> Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].
+  > Frith MC, Shaw J, Spouge JL. How to optimally sample a sequence for rapid analysis. doi: 10.1093/bioinformatics/btad057 PubMed PMID: 36702468 (Describes the lastdb -u RY sparsity options.)
 
 - [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
 
-> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
+  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
 
 ## Software packaging/containerisation tools
 

diff --git a/README.md b/README.md
@@ -6,10 +6,10 @@
 </h1>
 
 [![GitHub Actions CI Status](https://github.com/nf-core/pairgenomealign/actions/workflows/ci.yml/badge.svg)](https://github.com/nf-core/pairgenomealign/actions/workflows/ci.yml)
-[![GitHub Actions Linting Status](https://github.com/nf-core/pairgenomealign/actions/workflows/linting.yml/badge.svg)](https://github.com/nf-core/pairgenomealign/actions/workflows/linting.yml)[![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?labelColor=000000&logo=Amazon%20AWS)](https://nf-co.re/pairgenomealign/results)[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX)
+[![GitHub Actions Linting Status](https://github.com/nf-core/pairgenomealign/actions/workflows/linting.yml/badge.svg)](https://github.com/nf-core/pairgenomealign/actions/workflows/linting.yml)[![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?labelColor=000000&logo=Amazon%20AWS)](https://nf-co.re/pairgenomealign/results)[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.13910535-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.13910535)
 [![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com)
 
-[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A524.04.2-23aa62.svg)](https://www.nextflow.io/)
+[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A524.10.1-23aa62.svg)](https://www.nextflow.io/)
 [![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
 [![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
 [![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
@@ -19,46 +19,44 @@
 
 ## Introduction
 
-**nf-core/pairgenomealign** is a bioinformatics pipeline that ...
+**nf-core/pairgenomealign** is a bioinformatics pipeline that aligns one or more _query_ genomes to a _target_ genome, and plots pairwise representations.
 
-<!-- TODO nf-core:
-   Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
-   major pipeline sections and the types of output it produces. You're giving an overview to someone new
-   to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
--->
+![Tubemap workflow summary](docs/images/pairgenomealign-tubemap.png "Tubemap workflow summary")
 
-<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
-     workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples.   -->
-<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
+The main steps of the pipeline are:
+
+1. Genome QC ([`assembly-scan`](https://github.com/rpetit3/assembly-scan)).
+2. Genome indexing ([`lastdb`](https://gitlab.com/mcfrith/last/-/blob/main/doc/lastdb.rst)).
+3. Genome pairwise alignments ([`lastal`](https://gitlab.com/mcfrith/last/-/blob/main/doc/lastal.rst)).
+4. Alignment plotting ([`last-dotplot`](https://gitlab.com/mcfrith/last/-/blob/main/doc/last-dotplot.rst)).
+
+The pipeline can generate four kinds of outputs, called _many-to-many_, _many-to-one_, _one-to-many_ and _one-to-one_, depending on whether sequences of one genome are allowed to match the other genome multiple times or not.
+
+These alignments are output in [MAF](https://genome.ucsc.edu/FAQ/FAQformat.html#format5) format, and optional line plot representations are output in PNG format.
 
 ## Usage
 
 > [!NOTE]
 > If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
 
-<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
-     Explain what rows and columns represent. For instance (please edit as appropriate):
-
 First, prepare a samplesheet with your input data that looks as follows:
 
 `samplesheet.csv`:
 
 ```csv
-sample,fastq_1,fastq_2
-CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
+sample,fasta
+query_1,path-to-query-genome-file-one.fasta
+query_2,path-to-query-genome-file-two.fasta
 ```
 
-Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
-
--->
+Each row represents one query genome in FASTA format. Add more rows to accommodate multiple query genomes.
 
 Now, you can run the pipeline using:
 
-<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->
-
 ```bash
 nextflow run nf-core/pairgenomealign \
    -profile <docker/singularity/.../institute> \
+   --target sequencefile.fa \
    --input samplesheet.csv \
    --outdir <OUTDIR>
 ```
@@ -76,11 +74,15 @@ For more details about the output files and reports, please refer to the
 
 ## Credits
 
-nf-core/pairgenomealign was originally written by charles-plessy.
+`nf-core/pairgenomealign` was originally written by [charles-plessy](https://github.com/charles-plessy); the original versions are available at <https://github.com/oist/plessy_pairwiseGenomeComparison>.
 
 We thank the following people for their extensive assistance in the development of this pipeline:
 
-<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
+- [Mahdi Mohammed](https://github.com/U13bs1125) ported the original pipeline to _nf-core_ template 2.14.x.
+- [Martin Frith](https://github.com/mcfrith/), the author of LAST, gave us extensive feedback and advice.
+- [Michael Mansfield](https://github.com/mjmansfi) tested the pipeline and provided critical comments.
+- [Aleksandra Bliznina](https://github.com/aleksandrabliznina) contributed to the creation of the initial `last/*` modules.
+- [Jiashun Miao](https://github.com/miaojiashun) and [Huyen Pham](https://github.com/ngochuyenpham) tested the pipeline on vertebrate genomes.
 
 ## Contributions and Support
 
@@ -90,10 +92,15 @@ For further information or help, don't hesitate to get in touch on the [Slack `#
 
 ## Citations
 
-<!-- TODO nf-core: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file. -->
-<!-- If you use nf-core/pairgenomealign for your analysis, please cite it using the following doi: [10.5281/zenodo.XXXXXX](https://doi.org/10.5281/zenodo.XXXXXX) -->
+If you use this pipeline, please cite:
+
+> **Extreme genome scrambling in marine planktonic Oikopleura dioica cryptic species.**
+> Charles Plessy, Michael J. Mansfield, Aleksandra Bliznina, Aki Masunaga, Charlotte West, Yongkai Tan, Andrew W. Liu, Jan Grašič, María Sara del Río Pisula, Gaspar Sánchez-Serna, Marc Fabrega-Torrus, Alfonso Ferrández-Roldán, Vittoria Roncalli, Pavla Navratilova, Eric M. Thompson, Takeshi Onuma, Hiroki Nishida, Cristian Cañestro, Nicholas M. Luscombe.
+> _Genome Res._ 2024. 34: 426-440; doi: [10.1101/gr.278295.123](https://doi.org/10.1101/gr.278295.123). PubMed ID: [38621828](https://pubmed.ncbi.nlm.nih.gov/38621828/)
+
+[OIST research news article](https://www.oist.jp/news-center/news/2024/4/25/oikopleura-who-species-identity-crisis-genome-community)
 
-<!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->
+Please also cite the [LAST papers](https://gitlab.com/mcfrith/last/-/blob/main/doc/last-papers.rst).
 
 An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.
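
Note on the new samplesheet format in this README hunk: the layout is now a two-column CSV (`sample,fasta`) with one row per query genome. Below is a minimal sketch of generating such a file programmatically. The helper name `build_samplesheet` is hypothetical and not part of the pipeline, and the paths are the placeholders from the README, not real files.

```python
import csv
import io


def build_samplesheet(queries):
    """Return CSV text with a `sample,fasta` header and one row per query genome.

    `queries` is an iterable of (sample_name, fasta_path) pairs.
    This helper is illustrative only; the pipeline does not ship it.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["sample", "fasta"])
    for sample, fasta in queries:
        writer.writerow([sample, fasta])
    return buf.getvalue()


# Placeholder sample names and paths copied from the README example.
sheet = build_samplesheet(
    [
        ("query_1", "path-to-query-genome-file-one.fasta"),
        ("query_2", "path-to-query-genome-file-two.fasta"),
    ]
)
print(sheet, end="")
```

The resulting text can be written to `samplesheet.csv` and passed with `--input`, alongside the `--target` genome shown in the run command of this hunk.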
 

diff --git a/assets/methods_description_template.yml b/assets/methods_description_template.yml
@@ -3,7 +3,6 @@ description: "Suggested text and references to use when describing pipeline usag
 section_name: "nf-core/pairgenomealign Methods Description"
 section_href: "https://github.com/nf-core/pairgenomealign"
 plot_type: "html"
-## TODO nf-core: Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
 ## You inject any metadata in the Nextflow '${workflow}' object
 data: |
   <h4>Methods</h4>

diff --git a/assets/multiqc_config.yml b/assets/multiqc_config.yml
@@ -13,3 +13,52 @@ report_section_order:
 export_plots: true
 
 disable_version_detection: true
+
+custom_data:
+  train:
+    file_format: "tsv"
+    section_name: "Training parameter statistics"
+    plot_type: "table"
+    headers:
+      id:
+        title: "ID"
+        description: "target___query"
+      substitution_percent_identity:
+        title: "Substitution Percent Identity"
+      "last -t":
+        title: "Temperature"
+        description: "Parameter for converting between scores and probability ratios.
+          This affects the column ambiguity estimates. A score is converted to a probability
+          ratio by this formula: exp(score / TEMPERATURE). The default value is 1/lambda,
+          where lambda is the scale factor of the scoring matrix, which is calculated
+          by the method of Yu and Altschul (YK Yu et al. 2003, PNAS 100(26):15688-93)."
+      "last -a":
+        title: "Gap existence"
+        description: "Gap existence cost (lastal -a)"
+      "last -b":
+        title: "Gap extension"
+        description: "Gap extension cost (lastal -b)"
+      "last -A":
+        title: "Insertion existence"
+        description: "Insertion existence cost (lastal -A)"
+      "last -B":
+        title: "Insertion extension"
+        description: "Insertion extension cost (lastal -B)"
+  last_o2o:
+    file_format: "tsv"
+    section_name: "Alignment statistics"
+    plot_type: "table"
+    headers:
+      id:
+        title: "ID"
+        description: "target__query"
+      TotalAlignmentLength:
+        title: "Total alignment length"
+      PercentSimilarity:
+        title: "Percent similarity"
+
+sp:
+  last_o2o:
+    fn: "*o2o_aln.tsv"
+  train:
+    fn: "*train.tsv"
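
Note on the "Temperature" header description added above: it states that a score is converted to a probability ratio by `exp(score / TEMPERATURE)`. The following is a small worked sketch of that formula only. The function name and the example values are assumptions chosen for illustration; they do not come from LAST or from this pipeline's output.

```python
import math


def score_to_probability_ratio(score, temperature):
    """Apply the formula quoted in the config: exp(score / TEMPERATURE).

    Hypothetical helper for illustration; not part of LAST or the pipeline.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return math.exp(score / temperature)


# A score of 0 always maps to a ratio of 1, whatever the temperature.
neutral = score_to_probability_ratio(0, 2.0)

# With the same score, a higher temperature gives a ratio closer to 1.
ratio_low_t = score_to_probability_ratio(6, 2.0)
ratio_high_t = score_to_probability_ratio(6, 4.0)

print(neutral, ratio_low_t, ratio_high_t)
```

Read this way, the temperature acts as a scale: dividing by a larger value flattens the ratios, which is consistent with the description that it affects the column ambiguity estimates.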