Skip to content

Commit

Permalink
chore: release 1.0.0 (#1)
Browse files Browse the repository at this point in the history
* chore: release 1.0.0

* feat: create symlink for selected clusters

* docs: add detailed descriptions of separate steps

* feat: add bandage graph

* feat: visualize tree cluster with ggtree

* chore: remove threads for bandage

* fix: correct logging

* chore: release 1.0.0

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Matin Nuhamunada <matinnu@biosustain.dtu.dk>
  • Loading branch information
github-actions[bot] and matinnuhamunada authored Apr 21, 2022
1 parent d4dd5ed commit eb90925
Show file tree
Hide file tree
Showing 11 changed files with 216 additions and 17 deletions.
16 changes: 16 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
# Changelog

## 1.0.0 (2022-04-20)


### Features

* add polishing step ([d16c791](https://www.github.com/matinnuhamunada/genome_assembly_tryouts/commit/d16c791b54e0cd447ba533fa65d0b35d93eda9a0))
* initial attempt of trycycler ([2ae19d4](https://www.github.com/matinnuhamunada/genome_assembly_tryouts/commit/2ae19d4003c09801c486748a6a09f32c2ecb3256))
* split steps ([0eed51f](https://www.github.com/matinnuhamunada/genome_assembly_tryouts/commit/0eed51ffe6c938bd630d82791051b1daf9dbfd8b))


### Bug Fixes

* clean files ([796bc78](https://www.github.com/matinnuhamunada/genome_assembly_tryouts/commit/796bc78a7fc392b85cbbaee8ad85d17397be7e3b))
* correct input folder name ([cbad160](https://www.github.com/matinnuhamunada/genome_assembly_tryouts/commit/cbad1602c3f119331428f3300cf5bb2dcf29cae1))
11 changes: 10 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,8 @@
[![Snakemake](https://img.shields.io/badge/snakemake-≥6.15.1-brightgreen.svg)](https://snakemake.github.io)

This is an experimental snakemake workflow for trying out [trycylcer](https://github.com/rrwick/Trycycler).
It follow the author's instruction: https://github.com/rrwick/Trycycler/wiki
It follow the author's instruction: https://github.com/rrwick/Trycycler/wiki, where the assembly are split into different steps.
See [step 4](#step-4-executing-the-workflow)

## Usage
### Step 1: Clone the workflow
Expand Down Expand Up @@ -56,14 +57,22 @@ Activate the conda environment:
conda activate snakemake

#### Part 1 - Generating Assemblies
This step generates multiple assemblies as described in: https://github.com/rrwick/Trycycler/wiki/Generating-assemblies

snakemake --snakefile workflow/Snakefile-assembly --use-conda --cores <n_cores>

TO DO: Generate a bandage gfa graph for each assemblies for manual curation
#### Part 2 - Clustering Contigs
This step clusters the assemblies into per-replicon groups as described in: https://github.com/rrwick/Trycycler/wiki/Clustering-contigs

snakemake --snakefile workflow/Snakefile-cluster --use-conda --cores <n_cores>

This step also generate `data/interim/02_trycycler_cluster/cluster.yaml` which should be copied to the config folder in order to proceed to the next step. NOTE: You can select or drops the bad contigs or clusters that will ber run in the next step

TO DO: Generate a tree image with necessary information (size, depth) for manual curation of the clusters
#### Part 3 - Generating Consensus
This step summarizes step 3, 4, 5, and 6 in the Trycycler wiki and generate the consensus contig sequence as described in: https://github.com/rrwick/Trycycler/wiki/Generating-a-consensus


snakemake --snakefile workflow/Snakefile-consensus --use-conda --cores <n_cores>

Expand Down
3 changes: 2 additions & 1 deletion workflow/Snakefile-assembly
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,8 @@ include: "rules/common.smk"
##### Target rules #####
rule all:
input:
expand('data/interim/01_trycycler_assembly/{strains}/nanopore/assemblies/assembly_{subsample}.fasta', subsample=['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12'], strains=STRAINS)
expand('data/interim/01_trycycler_assembly/{strains}/nanopore/assemblies/assembly_{subsample}.fasta', subsample=['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12'], strains=STRAINS),
expand('data/interim/01_trycycler_assembly/{strains}/nanopore/assemblies/assembly_{subsample}.png', subsample=['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12'], strains=STRAINS)

##### Modules #####
include: "rules/01_trycycler_assembly.smk"
3 changes: 2 additions & 1 deletion workflow/Snakefile-cluster
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,8 @@ include: "rules/common.smk"
##### Target rules #####
rule all:
input:
expand('data/interim/02_trycycler_cluster/cluster.yaml')
'data/interim/02_trycycler_cluster/cluster.yaml',
expand('data/processed/{strains}/02_trycycler_cluster/{strains}_cluster.png', strains=STRAINS),

##### Modules #####
include: "rules/02_trycycler_cluster.smk"
16 changes: 16 additions & 0 deletions workflow/envs/R.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
name: R_env
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- r-base
- bioconductor-ggtree
- bioconductor-treeio
- bioconductor-ggtreeextra
- r-ggplot2
- r-ape
- r-dplyr
- r-argparser
- r-aplot
- r-tidytree
1 change: 1 addition & 0 deletions workflow/envs/utilities.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@ dependencies:
- minimap2
- miniasm
- racon
- bandage
- pip
- pip:
- git+https://github.com/rrwick/Minipolish.git
14 changes: 14 additions & 0 deletions workflow/rules/01_trycycler_assembly.smk
Original file line number Diff line number Diff line change
Expand Up @@ -120,4 +120,18 @@ rule assemble_raven:
shell:
"""
raven --threads {threads} --graphical-fragment-assembly {output.graph} --disable-checkpoints {input} > {output.assembly} 2>> {log}
"""

rule draw_graph:
input:
graph = 'data/interim/01_trycycler_assembly/{strains}/nanopore/assemblies/assembly_{subsample}.gfa'
output:
graph = 'data/interim/01_trycycler_assembly/{strains}/nanopore/assemblies/assembly_{subsample}.png'
log:
"workflow/report/logs/01_trycycler_assembly/{strains}/bandage/{strains}_{subsample}.log"
conda:
"../envs/utilities.yaml"
shell:
"""
Bandage image {input.graph} {output.graph} &>> {log}
"""
18 changes: 16 additions & 2 deletions workflow/rules/02_trycycler_cluster.smk
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ rule trycylcer_cluster:

rule cluster_dump:
input:
expand('data/interim/02_trycycler_cluster/{strains}', strains=STRAINS)
expand('data/interim/02_trycycler_cluster/{strains}', strains=STRAINS),
output:
yaml = 'data/interim/02_trycycler_cluster/cluster.yaml'
log:
Expand All @@ -41,4 +41,18 @@ rule cluster_dump:

# write as yaml
with open(output.yaml, 'w') as f:
yaml.dump(clusters, f, Dumper=yaml_indent_dump, default_flow_style=False)
yaml.dump(clusters, f, Dumper=yaml_indent_dump, default_flow_style=False)

rule cluster_draw:
input:
cluster='data/interim/02_trycycler_cluster/{strains}'
output:
png = 'data/processed/{strains}/02_trycycler_cluster/{strains}_cluster.png'
log:
"workflow/report/logs/02_trycycler_cluster/draw_cluster_{strains}.log"
conda:
"../envs/R.yaml"
shell:
"""
Rscript workflow/scripts/ggtree.R -i {input.cluster}/contigs.newick -o {output.png} 2>> {log}
"""
24 changes: 12 additions & 12 deletions workflow/rules/03_trycycler_consensus.smk
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ rule trycylcer_intermediate:
reconcile = 'data/interim/03_trycycler_consensus/{strains}/{cluster}_copy.log'
threads: 12
log:
"workflow/report/logs/trycycler/trycylcer_intermediate-{cluster}-{strains}.log"
"workflow/report/logs/03_trycycler_consensus/trycycler_intermediate/trycycler_intermediate-{cluster}-{strains}.log"
params:
cluster_file = config["clusters"]
shell:
Expand All @@ -21,7 +21,7 @@ rule trycylcer_reconcile:
reconcile = 'data/interim/03_trycycler_consensus/{strains}/{cluster}/2_all_seqs.fasta'
threads: 12
log:
"workflow/report/logs/trycycler/trycylcer_reconcile-{cluster}-{strains}.log"
"workflow/report/logs/03_trycycler_consensus/trycylcer_reconcile-{cluster}-{strains}.log"
conda:
"../envs/trycycler.yaml"
params:
Expand All @@ -38,7 +38,7 @@ rule trycylcer_MSA:
msa = 'data/interim/03_trycycler_consensus/{strains}/{cluster}/3_msa.fasta'
threads: 12
log:
"workflow/report/logs/trycycler/trycylcer_MSA-{cluster}-{strains}.log"
"workflow/report/logs/03_trycycler_consensus/trycylcer_MSA-{cluster}-{strains}.log"
conda:
"../envs/trycycler.yaml"
params:
Expand All @@ -57,7 +57,7 @@ rule trycylcer_partition:
partition = "data/interim/03_trycycler_consensus/{strains}/partition.log"
threads: 12
log:
"workflow/report/logs/trycycler/trycylcer_partition_{strains}.log"
"workflow/report/logs/03_trycycler_consensus/trycylcer_partition_{strains}.log"
conda:
"../envs/trycycler.yaml"
params:
Expand All @@ -76,7 +76,7 @@ rule trycylcer_consensus:
consensus = 'data/interim/03_trycycler_consensus/{strains}/{cluster}/7_final_consensus.fasta'
threads: 12
log:
"workflow/report/logs/trycycler/trycylcer_consensus-{cluster}-{strains}.log"
"workflow/report/logs/03_trycycler_consensus/trycylcer_consensus-{cluster}-{strains}.log"
conda:
"../envs/trycycler.yaml"
shell:
Expand All @@ -86,22 +86,22 @@ rule trycylcer_consensus:

rule medaka_polish:
input:
cluster = "data/interim/03_trycycler_consensus/{strains}/{cluster}",
consensus = 'data/interim/03_trycycler_consensus/{strains}/{cluster}/7_final_consensus.fasta'
output:
medaka = "data/interim/03_trycycler_consensus/{strains}/{cluster}/8_medaka.fasta"
conda:
"../envs/trycycler.yaml"
threads: 12
log:
"workflow/report/logs/trycycler/medaka_polish-{cluster}-{strains}.log"
"workflow/report/logs/03_trycycler_consensus/medaka_polish-{cluster}-{strains}.log"
params:
model = 'r941_min_sup_g507'
model = 'r941_min_sup_g507',
cluster = "data/interim/03_trycycler_consensus/{strains}/{cluster}",
shell:
"""
medaka_consensus -i {input.cluster}/4_reads.fastq -d {input.consensus} -o {input.cluster}/medaka -m {params.model} -t 12 2>> {log}
mv {input.cluster}/medaka/consensus.fasta {input.cluster}/8_medaka.fasta
#rm -r {input.cluster}/medaka {input.cluster}/*.fai {input.cluster}/*.mmi
medaka_consensus -i {params.cluster}/4_reads.fastq -d {input.consensus} -o {params.cluster}/medaka -m {params.model} -t 12 &>> {log}
mv {params.cluster}/medaka/consensus.fasta {params.cluster}/8_medaka.fasta
#rm -r {params.cluster}/medaka {params.cluster}/*.fai {params.cluster}/*.mmi
"""

rule trycylcer_concat:
Expand All @@ -111,7 +111,7 @@ rule trycylcer_concat:
assembly = 'data/interim/03_trycycler_consensus/{strains}/assembly.fasta'
threads: 12
log:
"workflow/report/logs/trycycler/trycylcer_concat-{strains}.log"
"workflow/report/logs/03_trycycler_consensus/trycylcer_concat-{strains}.log"
conda:
"../envs/trycycler.yaml"
shell:
Expand Down
90 changes: 90 additions & 0 deletions workflow/scripts/ggtree.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
#library("treedataverse")
library("argparser")
library("aplot")
#library("svglite")
library("ape")
library("dplyr")
library("ggplot2")
library("tidytree")
library("treeio")
library("ggtree")
library("ggtreeExtra")

# Parse command line arguments
p <- arg_parser("Draw tree from trycycler cluster, with cluster names, sequence lengths, and read depths")

p <- add_argument(p, "--input",
short = "-i",
help = "Newick file generated by trycycler",
default = "contigs.newick")

p <- add_argument(p, "--output",
short = "-o",
default = "tree_cluster.png",
help = "Output image file (PNG)")

argv <- parse_args(p)


# READFILE
newick <- read.newick(argv$input)

# GENERATE EMPTY DATAFRAME
metadata <- data.frame(old_label = vector(mode = "character"),
label = vector(mode = "character"),
cluster_name = vector(mode = "character"),
length = vector(mode = "integer"),
depth = vector(mode = "numeric"))

# EXTRACT DATA FROM TREE
label <- newick$tip.label
for (i in seq_along(label)) {
item <- sapply(strsplit(label[i],"_"), function(x){
old_label <- label[i]
new_label <- paste(x[[2]], x[[3]], sep = "")
cluster_name <- paste(x[[1]], x[[2]], sep = "_")
length <- as.integer(x[length(x) - 2])
tip_depth <- tail(x, n=1)
depth <- as.numeric(substring(tip_depth, 1, nchar(tip_depth) - 1))
output <- c(old_label, new_label, cluster_name, length, depth)
})
metadata[nrow(metadata) + 1, ] <- item
}
metadata <- transform(metadata, length = as.integer(length),
depth = as.integer(depth))

# RENAME TIP LABEL AND MERGE DATA INTO TREE
tree <- rename_taxa(newick, metadata, old_label, label)
drops <- c("old_label")
tree_data <- metadata[ , !(names(metadata) %in% drops)]
p <- ggtree(tree) %<+% tree_data


# DRAW TREE
g <- p + geom_tiplab(offset = .005,
hjust = 0.005,
align=TRUE,
size = 3.2) +
geom_tippoint(aes(shape = cluster_name, color = cluster_name)) +
theme(legend.position = "right") +
theme_tree2()

# DRAW BUBBLES
p2 <- ggplot(tree_data, aes(x=0, y = label, label=length)) +
geom_point(aes(size = length, fill = depth), alpha = 0.9, shape = 21) +
geom_text(hjust=-0.5, vjust=0.5, size=2) +
scale_size(trans = "log10") +
scale_fill_gradient(low = "white", high = "blue") +
theme(axis.title = element_blank(),
axis.ticks.x = element_blank(),
axis.text = element_blank(),
panel.background=element_blank(),
panel.border=element_blank(),
panel.grid.major=element_blank(),
plot.background=element_blank(),
)

# CREATE COMPOSITE GRAPH
p2 %>% insert_left(g, width = 2)

ggsave(argv$output, dpi = 300)
37 changes: 37 additions & 0 deletions workflow/scripts/symlink_cluster.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
from pathlib import Path
import yaml
import sys

def get_clusters(filepath):
"""
Read the clusters output and return a dictionary
"""
try:
with open(filepath) as file:
selected_cluster = yaml.load(file, Loader=yaml.FullLoader)
return selected_cluster
except FileNotFoundError as e:
sys.stderr.write(f"No cluster selected. The file: <{filepath}> is not a valid cluster format. Check your config.yaml.\n")
raise e

def symlink_cluster(strain, cluster, cluster_file, source_dir = 'data/interim/02_trycycler_cluster', target_dir = 'data/interim/03_trycycler_consensus'):
clusters = get_clusters(cluster_file)
source_path = Path(source_dir)
target_path = Path(target_dir)

for contigs in clusters[strain][cluster]:
source = source_path / strain / cluster / "1_contigs" / f"{contigs}.fasta"
target = target_path / strain / cluster / "1_contigs" / f"{contigs}.fasta"
target.parent.mkdir(parents=True, exist_ok=True)
try:
target.symlink_to(source.resolve(), target_is_directory=False)
except:
if target.is_symlink():
target.unlink(missing_ok=True)
target.symlink_to(source.resolve(), target_is_directory=False)
with open(str(target_path / strain / f"{cluster}_copy.log"), 'w') as f:
f.write("")
return

if __name__ == "__main__":
symlink_cluster(sys.argv[1], sys.argv[2], sys.argv[3])

0 comments on commit eb90925

Please sign in to comment.