Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Addition of multithreading #419

Merged
merged 87 commits into from
Nov 15, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
87 commits
Select commit Hold shift + click to select a range
fbb5df9
Optimise recombination identification
nickjcroucher Apr 17, 2024
ec400bd
Merge pull request #404 from nickjcroucher/master
nickjcroucher Apr 17, 2024
dc34d59
Fix comment typo
nickjcroucher Apr 18, 2024
2e0cfa5
Restore master version to check CI
nickjcroucher Apr 18, 2024
afd8aed
Restore filtering block functions
nickjcroucher Apr 18, 2024
28fcdb2
Update get_blocks function
nickjcroucher Apr 18, 2024
899a885
Update downstream recombinations function
nickjcroucher Apr 18, 2024
37a5d36
Revert "Update get_blocks function"
nickjcroucher Apr 18, 2024
e32c59f
Add tree iteration functions
nickjcroucher Apr 18, 2024
c7c0f77
Access lists of nodes at different depths
nickjcroucher Apr 18, 2024
88056ef
Revert breaking changes
nickjcroucher Apr 18, 2024
0c75025
Prototype multithreading function
nickjcroucher Apr 18, 2024
98a8854
Remove node counter
nickjcroucher Apr 18, 2024
8865558
Rename variables to help rewrite
nickjcroucher Apr 19, 2024
da4470c
Introduce new data structures
nickjcroucher Apr 19, 2024
11385bd
Switch recursion to loop
nickjcroucher Apr 19, 2024
73307e4
Correct indentation of sequence storage
nickjcroucher Apr 19, 2024
8ceba52
Add pthreads to compilation commands
nickjcroucher Apr 19, 2024
0bf08b1
Correct length of sequence stored
nickjcroucher Apr 19, 2024
fbd7e33
Correct array parsing
nickjcroucher Apr 19, 2024
02e5201
Fix conditional test
nickjcroucher Apr 19, 2024
9d458a5
Fix NULL initialisation
nickjcroucher Apr 19, 2024
16ae012
Fix memory clearing error
nickjcroucher Apr 20, 2024
fdb338e
Fix memory allocation statements
nickjcroucher Apr 20, 2024
7fa1eb1
Update ordering of file with new function implementation
nickjcroucher Apr 20, 2024
5d97efb
Changes to formatting
nickjcroucher Apr 21, 2024
abb28ac
Multithread branch sequence reconstruction
nickjcroucher Apr 22, 2024
a41030f
Remove potential memory leaks
nickjcroucher Apr 22, 2024
d90ba08
Update memory management
nickjcroucher Apr 22, 2024
28cf4fd
Simplify code structure
nickjcroucher Apr 22, 2024
65ba08a
Test one thread
nickjcroucher Apr 22, 2024
65cce5b
Fix memory management
nickjcroucher Apr 23, 2024
b942613
Parallelise gap inference
nickjcroucher Apr 23, 2024
5478d65
Remove obsolete function
nickjcroucher Apr 23, 2024
000a04a
Modularise functions
nickjcroucher Apr 23, 2024
288ecd0
Fix iterator
nickjcroucher Apr 23, 2024
166d855
Fix counter value
nickjcroucher Apr 23, 2024
c13a6ba
Remove extra iterator
nickjcroucher Apr 23, 2024
abdac6e
Minor change to ll value
nickjcroucher Apr 23, 2024
b31f426
Remove obsolete sequence retrieval
nickjcroucher Apr 23, 2024
0f2658d
Enable thread safe printing
nickjcroucher Apr 23, 2024
a88692e
Enable spectification of thread number
nickjcroucher Apr 23, 2024
1adcc7f
Enable multithreaded recombination detection
nickjcroucher Apr 23, 2024
028bf05
Bump version
nickjcroucher Apr 23, 2024
309fb34
Update gubbins arguments
nickjcroucher Apr 23, 2024
f07aef1
Update gubbins arguments
nickjcroucher Apr 23, 2024
4530969
Update gubbins arguments
nickjcroucher Apr 23, 2024
d960245
Fix missing positional argument
nickjcroucher Apr 23, 2024
ef49279
Reorder gubbins flags
nickjcroucher Apr 23, 2024
d025174
Rename threaded functions
nickjcroucher Apr 23, 2024
ac194d1
Convert gap function to loop from recursive
nickjcroucher Apr 23, 2024
cd52952
Data structures for looping
nickjcroucher Apr 24, 2024
a04050d
Fix function definitions
nickjcroucher Apr 24, 2024
e0bfc6c
Fix genome length calculations
nickjcroucher Apr 24, 2024
20c1768
Fix SNP statistic calculation
nickjcroucher Apr 24, 2024
a49a330
Correct function name
nickjcroucher Apr 24, 2024
a03944b
Update expected tree following recombination length recalculation
nickjcroucher Apr 24, 2024
a7da2e2
Update after adjusting recombination length calculation
nickjcroucher Apr 24, 2024
eb630fa
Add stats test
nickjcroucher Apr 24, 2024
15fabaa
Parallelise gap function
nickjcroucher Apr 24, 2024
e298ad3
Set values by index not name
nickjcroucher Apr 24, 2024
57d1863
Avoid copying nodes when parallelising
nickjcroucher Apr 24, 2024
4b2488f
Update IQTREE multithreading argument
nickjcroucher May 27, 2024
d5091b5
Add ska.rust citation
nickjcroucher May 27, 2024
facb8db
Change CI for new OSX version
nickjcroucher May 28, 2024
da5cf49
Change CI again for new OSX version
nickjcroucher May 28, 2024
bbb2c8c
Remove chown command
nickjcroucher May 28, 2024
0f2f560
Add miniconda installation
nickjcroucher May 28, 2024
8b867ff
Change miniconda installation
nickjcroucher May 28, 2024
f4dac82
Change CI file indentation
nickjcroucher May 28, 2024
29d1792
Downgrade macos until dependencies upgraded
nickjcroucher May 28, 2024
cf77eaa
Change line width based on number of samples
nickjcroucher May 29, 2024
d13ff7f
Add script for extracting recombinant sequences
nickjcroucher May 31, 2024
f6843ef
Add extract_recombinant_sequences.py
nickjcroucher May 31, 2024
86f189a
Make script executable
nickjcroucher May 31, 2024
7a1218a
Add test_extract_recombination_sequences
nickjcroucher Jun 3, 2024
04f27d8
Add documentation for extract_recombinant_sequences.py
nickjcroucher Jun 3, 2024
13f1e64
Improve AVX RAxML selection
nickjcroucher Jun 7, 2024
5777717
Fix calculation of recombination lengths
nickjcroucher Jun 17, 2024
d2ed906
Enable recombination length calculation with gaps
nickjcroucher Jun 17, 2024
b2135ab
Calculate recombination length including gaps
nickjcroucher Jun 17, 2024
df8e440
Add argument to calculate_number_of_bases_in_recombinations_excluding…
nickjcroucher Jun 17, 2024
4ecbd72
Update function name
nickjcroucher Jun 17, 2024
4bd91a5
Test if rec length tests cause failure
nickjcroucher Jun 17, 2024
b0ddb28
Test if rec length tests cause failure
nickjcroucher Jun 17, 2024
98aa8fc
Update expected stats file
nickjcroucher Jun 17, 2024
8ca358f
Restore rec length tests
nickjcroucher Jun 17, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 6 additions & 1 deletion .github/workflows/build.yml
Original file line number Diff line number Diff line change
Expand Up @@ -6,14 +6,19 @@ on:
jobs:

test-osx:
runs-on: macos-latest
runs-on: macos-13

steps:
- uses: actions/checkout@v3
- name: Set up Python 3.9
uses: actions/setup-python@v3
with:
python-version: 3.9
- name: Set up miniconda
uses: conda-incubator/setup-miniconda@v3
with:
auto-update-conda: true
python-version: 3.9
- name: Install dependencies with conda
run: |
sudo chown -R $UID $CONDA && source $CONDA/etc/profile.d/conda.sh && conda env update --file environment.yml
Expand Down
15 changes: 13 additions & 2 deletions R/scripts/plot_gubbins.R
Original file line number Diff line number Diff line change
Expand Up @@ -533,7 +533,7 @@ plot_gubbins_recombination <- function(gubbins_rec,gubbins_tree, start_pos = NA,
tidyr::unnest_longer(gene,
values_to = "Taxa") %>%
dplyr::mutate(Taxa = factor(Taxa,
levels = rev(get_taxa_name(gubbins_tree)))) %>%
levels = rev(ggtree::get_taxa_name(gubbins_tree)))) %>%
dplyr::mutate(length = end - start + 1) %>%
dplyr::arrange(rev(length)) %>%
dplyr::select(-length)
Expand All @@ -543,6 +543,16 @@ plot_gubbins_recombination <- function(gubbins_rec,gubbins_tree, start_pos = NA,
trim_start(start_pos) %>%
trim_end(end_pos)

# Get the number of taxa for selecting the line width
n_taxa <- length(ggtree::get_taxa_name(gubbins_tree))
rec_linewidth <-
dplyr::case_when(
n_taxa < 10 ~ 5,
n_taxa < 50 ~ 2,
n_taxa < 100 ~ 1.5,
TRUE ~ 1
)

# Plot recombination
rec_plot <-
ggplot(gubbins_rec,
Expand All @@ -551,7 +561,8 @@ plot_gubbins_recombination <- function(gubbins_rec,gubbins_tree, start_pos = NA,
y = Taxa,
yend = Taxa,
colour = Colour)) +
geom_segment(alpha = 0.25) +
geom_segment(alpha = 0.5,
linewidth = rec_linewidth) +
scale_colour_manual(values = c("red" = "red",
"blue" = "blue")) +
scale_y_discrete(drop = FALSE) +
Expand Down
2 changes: 1 addition & 1 deletion VERSION
Original file line number Diff line number Diff line change
@@ -1 +1 @@
3.3.5
3.4
1 change: 1 addition & 0 deletions configure.ac
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@ AC_PROG_CXX

# Checks for pthread
AX_PTHREAD
LDFLAGS="$LDFLAGS -pthread"

# Checks for code coverage
AX_CODE_COVERAGE
Expand Down
8 changes: 8 additions & 0 deletions docs/gubbins_manual.md
Original file line number Diff line number Diff line change
Expand Up @@ -279,6 +279,14 @@ To get summary statistics (e.g. ***r***/***m***) and subtrees for distinct clade
extract_gubbins_clade_statistics.py --clades [file designating isolates to clades] --gff out.recombination_predictions.gff --snps out.branch_base_reconstruction.embl --tree out.final_tree.tre --out tree_anaysis
```

To extract the recombinant sequences identified by Gubbins from the alignment:

```
extract_recombinant_sequences.py [-h] --aln [input alignment file] --gff out.recombination_predictions.gff --out-dir OUT_DIR [--start START] [--end END] [--terminal-only]
```

Note that currently this only identifies the unique alleles at the recombinant loci from the sequences in the alignment - it is not using the reconstructed sequences used to infer the recombinations on internal branches. This script will hopefully be updated to offer greater functionality in the future.

## Examples

Two example alignments can be downloaded from http://nickjcroucher.github.io/gubbins/:
Expand Down
6 changes: 3 additions & 3 deletions python/gubbins/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -453,7 +453,7 @@ def parse_and_run(input_args, program_description=""):
gubbins_command = create_gubbins_command(
gubbins_exec, gaps_alignment_filename, gaps_vcf_filename, current_tree_name,
input_args.alignment_filename, input_args.min_snps, input_args.min_window_size, input_args.max_window_size,
input_args.p_value, input_args.trimming_ratio, input_args.extensive_search)
input_args.p_value, input_args.trimming_ratio, input_args.extensive_search, input_args.threads)
printer.print(["\nRunning Gubbins to detect recombinations...", gubbins_command])
try:
subprocess.check_call(gubbins_command, shell=True)
Expand Down Expand Up @@ -798,10 +798,10 @@ def select_best_models(snp_alignment_filename,basename,current_tree_builder,inpu

def create_gubbins_command(gubbins_exec, alignment_filename, vcf_filename, current_tree_name,
original_alignment_filename, min_snps, min_window_size, max_window_size,
p_value, trimming_ratio, extensive_search):
p_value, trimming_ratio, extensive_search, num_threads):
command = [gubbins_exec, "-r", "-v", vcf_filename, "-a", str(min_window_size),
"-b", str(max_window_size), "-f", original_alignment_filename, "-t", current_tree_name,
"-m", str(min_snps), "-p", str(p_value), "-i", str(trimming_ratio)]
"-m", str(min_snps), "-p", str(p_value), "-i", str(trimming_ratio), "-n", str(num_threads)]
if extensive_search:
command.append("-x")
command.append(alignment_filename)
Expand Down
12 changes: 12 additions & 0 deletions python/gubbins/tests/test_python_scripts.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@

import unittest
import os
import shutil
import subprocess
import hashlib
import glob
Expand Down Expand Up @@ -144,6 +145,17 @@ def test_recombination_counting_per_gene(self):
assert self.md5_check(out_tab, check_tab)
os.remove(out_tab)

# Test clade file extraction script
def test_extract_recombination_sequences(self):
multiple_aln = os.path.join(data_dir, "multiple_recombinations.aln")
multiple_gff = os.path.join(data_dir, "multiple_recombinations_gubbins.recombination_predictions.gff")
out_dir = os.path.join(data_dir, "test_loci")
rec_count_cmd = "extract_recombinant_sequences.py --aln " + multiple_aln + " --gff " + multiple_gff + " --out-dir " + out_dir
subprocess.check_call(rec_count_cmd, shell=True)
file_count = int(subprocess.run('ls -1 ' + out_dir + ' | wc -l', shell = True, stdout=subprocess.PIPE).stdout.decode('utf-8').rstrip())
assert file_count > 1
shutil.rmtree(out_dir)

# Test plotting script (an R exception)
def test_recombination_counting_per_gene(self):
tree_input = os.path.join(data_dir, "expected_RAxML_result.multiple_recombinations.iteration_5")
Expand Down
4 changes: 2 additions & 2 deletions python/gubbins/tests/test_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,8 +20,8 @@
class TestUtilities(unittest.TestCase):

def test_gubbins_command(self):
assert common.create_gubbins_command('AAA', 'BBB', 'CCC', 'DDD', 'EEE', 5, 10, 200, 0.05, 1.0, 0) \
== 'AAA -r -v CCC -a 10 -b 200 -f EEE -t DDD -m 5 -p 0.05 -i 1.0 BBB'
assert common.create_gubbins_command('AAA', 'BBB', 'CCC', 'DDD', 'EEE', 5, 10, 200, 0.05, 1.0, 0, 1) \
== 'AAA -r -v CCC -a 10 -b 200 -f EEE -t DDD -m 5 -p 0.05 -i 1.0 -n 1 BBB'

def test_translation_of_filenames_to_final_filenames(self):
assert common.translation_of_filenames_to_final_filenames('AAA', 'test') == {
Expand Down
2 changes: 1 addition & 1 deletion python/gubbins/treebuilders.py
Original file line number Diff line number Diff line change
Expand Up @@ -289,7 +289,7 @@ def __init__(self, threads: 1, model: str, bootstrap = 0, seed = None, internal_
self.citation = "https://doi.org/10.1093/molbev/msaa015"

# Set parallelisation
command.extend(["-nt", str(self.threads)])
command.extend(["-T", str(self.threads)])

# Add flags
command.extend(["-safe","-redo"])
Expand Down
3 changes: 2 additions & 1 deletion python/gubbins/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -121,7 +121,8 @@ def choose_executable_based_on_processor(list_of_executables: list):
for executable in list_of_executables:
if cpu_info and 'AVX2' in executable and 'avx2' in flags and which(executable):
break
elif cpu_info and 'AVX' in executable and 'avx' in flags and which(executable):
elif cpu_info and ('AVX' in executable and 'AVX2' not in executable)\
and ('avx' in flags and 'avx2' not in flags) and which(executable):
break
elif cpu_info and 'SSE3' in executable and 'sse3' in flags and which(executable):
break
Expand Down
101 changes: 101 additions & 0 deletions python/scripts/extract_recombinant_sequences.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@
#! python

# encoding: utf-8
# Wellcome Trust Sanger Institute and Imperial College London
# Copyright (C) 2020 Wellcome Trust Sanger Institute and Imperial College London
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
#

# Generic imports
import os
import sys
import argparse
import re
import math

# Biopython imports
from Bio import AlignIO
from Bio import Phylo
from Bio import SeqIO
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq

# command line parsing
def get_options():

parser = argparse.ArgumentParser(description='Extract all the unique alleles at recombinant loci')

# input options
parser.add_argument('--aln',
help = 'Input alignment (FASTA format)',
required = True)
parser.add_argument('--gff',
help = 'GFF of recombinant regions detected by Gubbins',
required = True)
parser.add_argument('--out-dir',
help = 'Output directory',
required = True)
parser.add_argument('--start',
help = 'Start of region of interest',
default = 1,
required = False)
parser.add_argument('--end',
help = 'End of region of interest',
default = math.inf,
required = False)
parser.add_argument('--terminal-only',
help = 'Only extract recombinations on terminal branches',
default = False,
action = 'store_true')

return parser.parse_args()

# main code
if __name__ == "__main__":

# Get command line options
args = get_options()

# Create output directory
if not os.path.isdir(args.out_dir):
os.mkdir(args.out_dir)

# Read recombinant regions from GFF
rec_start = []
rec_end = []
with open(args.gff,'r') as gff_file:
for line in gff_file.readlines():
if not line.startswith('##'):
# Get coordinates
info = line.rstrip().split('\t')
taxon_pattern = re.compile('taxa="([^"]*)"')
taxon_set = set(taxon_pattern.search(info[8]).group(1).split())
if (len(taxon_set) == 1 or not args.terminal_only):
rec_start.append(int(info[3]))
rec_end.append(int(info[4]))

# Read in alignment and identify recombinations
alignment = AlignIO.read(args.aln,'fasta')
for (start,end) in zip(rec_start,rec_end):
if start >= args.start and end <= args.end:
out_fn = 'locus_' + str(start) + '_' + str(end) + '.aln'
with open(os.path.join(args.out_dir,out_fn),'w') as out_file:
seen_seqs = []
rec_locus_alignment = alignment[:,(start-1):end]
for taxon in rec_locus_alignment:
if taxon.seq not in seen_seqs:
out_file.write('>' + taxon.id + '\n' + str(taxon.seq) + '\n')
seen_seqs.append(taxon.seq)
2 changes: 1 addition & 1 deletion python/scripts/generate_ska_alignment.py
Original file line number Diff line number Diff line change
Expand Up @@ -118,7 +118,7 @@ def ska_map_sequences(seq, k = None, ref = None):
' --out ' + args.out + '.csv',
shell = True)

sys.stderr.write("Completed generating alignment with ska2 (https://github.com/bacpop/ska.rust)\n")
sys.stderr.write("Completed generating alignment with ska2 (https://github.com/bacpop/ska.rust)\nPlease cite https://doi.org/10.1101/2024.03.25.586631\n")

# Clean up
if not args.no_cleanup:
Expand Down
1 change: 1 addition & 0 deletions python/setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@
'scripts/gubbins_alignment_checker.py',
'scripts/generate_files_for_clade_analysis.py',
'scripts/count_recombinations_per_gene.py',
'scripts/extract_recombinant_sequences.py',
'../R/scripts/plot_gubbins.R'
],
tests_require=[
Expand Down
Loading
Loading