Unable to run spacedust normally #3

Dx-wmc · 2023-10-14T11:00:32Z

Expected Behavior

The test run completes and produces the expected gene clusters.

Current Behavior

Using CDS

When I use the GFF file generated by Prokka, building the database fails with "Not enough columns in GFF file". The command was `./spacedust createsetdb *fna setDB tmpFolder --gff-dir gff.txt --gff-type CDS`.
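For anyone hitting the same message, here is a minimal Python sketch for checking whether a GFF file contains feature lines with fewer than the nine tab-separated columns that GFF3 defines. It is only a diagnostic aid and not how spacedust parses the file: the function name `find_short_gff_lines` and the sample lines are made up for illustration, and whether the trailing `##FASTA` section that Prokka appends to its GFF output has anything to do with this error is an assumption, not something confirmed in this thread.

```python
from typing import Iterable, List, Tuple


def find_short_gff_lines(
    lines: Iterable[str], expected_columns: int = 9
) -> List[Tuple[int, int]]:
    """Return (line_number, column_count) for feature lines with too few columns.

    Hypothetical helper for illustration; it does not mirror spacedust's parser.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line.startswith("##FASTA"):
            # Everything after this marker is sequence data, not features.
            break
        if not line.strip() or line.startswith("#"):
            # Skip blank lines, directives, and comments.
            continue
        count = len(line.split("\t"))
        if count < expected_columns:
            problems.append((number, count))
    return problems


# Made-up sample: line 3 uses spaces instead of tabs, so it has one column.
sample = [
    "##gff-version 3\n",
    "contig_1\tProdigal:002006\tCDS\t3\t98\t.\t+\t0\tID=ABC_00001\n",
    "contig_1 Prodigal:002006 CDS 120 500 . + 0 ID=ABC_00002\n",
    "##FASTA\n",
    ">contig_1\n",
    "ATGAAACGT\n",
]
print(find_short_gff_lines(sample))  # [(3, 1)]
```

To check a real file, pass an open file handle, for example `find_short_gff_lines(open("genome.gff"))`, where `genome.gff` is a placeholder path.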

When I then run the next command, `./spacedust clustersearch setDB setDB result.tsv tmpFolder`, an error occurs.

Using faa

There is no error when building the database, but `./spacedust clustersearch setDB setDB result.tsv tmpFolder` still fails with an error.

A puzzling point

When I use the example data provided in the repository, the CDS route still reports "Not enough columns in GFF file", while the faa route finishes within a few minutes.

My gff and faa files were generated with Prokka, and my genomes are about 4.5M in size. Even though I use the same commands, my own data does not work.

Your Environment

I ran the same commands separately on Ubuntu and on CentOS. On both, example_data runs successfully, but my own data fails.

spacedust Output (for bugs)

The output of the command `./spacedust clustersearch setDB setDB result.tsv tmpFolder`.

clustersearch setDB setDB result.tsv tmpFolder

MMseqs Version:                        	16b020301be952232d6eb2eaa2cd2ad0933d68b0
Substitution matrix                    	aa:blosum62.out,nucl:nucleotide.out
Add backtrace                          	true
Alignment mode                         	2
Alignment mode                         	0
Allow wrapped scoring                  	false
E-value threshold                      	10
Seq. id. threshold                     	0
Min alignment length                   	30
Seq. id. mode                          	0
Alternative alignments                 	0
Coverage threshold                     	0.8
Coverage mode                          	2
Max sequence length                    	65535
Compositional bias                     	1
Compositional bias                     	1
Max reject                             	2147483647
Max accept                             	2147483647
Include identical seq. id.             	false
Preload mode                           	0
Pseudo count a                         	substitution:1.100,context:1.400
Pseudo count b                         	substitution:4.100,context:5.800
Score bias                             	0
Realign hits                           	false
Realign score bias                     	-0.2
Realign max seqs                       	2147483647
Correlation score weight               	0
Gap open cost                          	aa:11,nucl:5
Gap extension cost                     	aa:1,nucl:2
Zdrop                                  	40
Threads                                	256
Compressed                             	0
Verbosity                              	3
Seed substitution matrix               	aa:VTML80.out,nucl:nucleotide.out
Sensitivity                            	5.7
k-mer length                           	0
k-score                                	seq:2147483647,prof:2147483647
Alphabet size                          	aa:21,nucl:5
Max results per query                  	300
Split database                         	0
Split mode                             	2
Split memory limit                     	0
Diagonal scoring                       	true
Exact k-mer matching                   	0
Mask residues                          	1
Mask residues probability              	0.9
Mask lower case residues               	0
Minimum diagonal score                 	15
Selected taxa                          	
Spaced k-mers                          	1
Spaced k-mer pattern                   	
Local temporary path                   	
Rescore mode                           	0
Remove hits by seq. id. and coverage   	false
Sort results                           	0
Mask profile                           	1
Profile E-value threshold              	0.001
Global sequence weighting              	false
Allow deletions                        	false
Filter MSA                             	1
Use filter only at N seqs              	0
Maximum seq. id. threshold             	0.9
Minimum seq. id.                       	0.0
Minimum score per column               	-20
Minimum coverage                       	0
Select N most diverse seqs             	1000
Pseudo count mode                      	0
Gap pseudo count                       	10
Min codons in orf                      	30
Max codons in length                   	32734
Max orf gaps                           	2147483647
Contig start mode                      	2
Contig end mode                        	2
Orf start mode                         	1
Forward frames                         	1,2,3
Reverse frames                         	1,2,3
Translation table                      	1
Translate orf                          	0
Use all table starts                   	false
Offset of numeric ids                  	0
Create lookup                          	0
Add orf stop                           	false
Overlap between sequences              	0
Sequence split mode                    	1
Header split mode                      	0
Chain overlapping alignments           	0
Merge query                            	1
Search type                            	0
Search iterations                      	1
Start sensitivity                      	4
Search steps                           	1
Exhaustive search mode                 	false
Filter results during exhaustive search	0
Strand selection                       	1
LCA search mode                        	false
Disk space limit                       	0
MPI runner                             	
Force restart with latest tmp          	false
Remove temporary files                 	false
Use simple best hit                    	true
Include sub-optimal hits with factor   	0
Alpha                                  	1
Aggregation mode                       	0
Filter self match                      	false
Multihit P-value cutoff                	0.01
Clustering and Ordering P-value cutoff 	0.01
Maximum gene gaps                      	3
Minimal cluster size                   	2
Cluster weighting factor               	false
Database output                        	true
Cluster search against profiles        	false
Cluster Search Mode                    	0

Create directory tmpFolder/3152204347500479419/search
search setDB setDB tmpFolder/3152204347500479419/result tmpFolder/3152204347500479419/search --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 10 --min-seq-id 0 --min-aln-len 30 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 2 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 256 --compressed 0 -v 3 --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 5.7 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --spaced-kmer-mode 1 --rescore-mode 0 --filter-hits 0 --sort-results 0 --mask-profile 1 --e-profile 0.001 --wg 0 --allow-deletion 0 --filter-msa 1 --filter-min-enable 0 --max-seq-id 0.9 --qid '0.0' --qsc -20 --cov 0 --diff 1000 --pseudo-cnt-mode 0 --gap-pc 10 --min-length 30 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 0 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --add-orf-stop 0 --sequence-overlap 0 --sequence-split-mode 1 --headers-split-mode 0 --chain-alignments 0 --merge-query 1 --search-type 0 --start-sens 4 --sens-steps 1 --exhaustive-search 0 --exhaustive-search-filter 0 --strand 1 --lca-search 0 --disk-space-limit 0 --force-reuse 0 --remove-tmp-files 0

prefilter setDB setDB tmpFolder/3152204347500479419/search/2069484046060416119/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 2 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 256 --compressed 0 -v 3 -s 5.7

Query database size: 12719 type: Aminoacid
Estimated memory consumption: 1G
Target database size: 12719 type: Aminoacid
Index table k-mer threshold: 112 at k-mer size 6
Index table: counting k-mers
[=================================================================] 100.00% 12.72K 0s 65ms
Index table: Masked residues: 15234
Index table: fill
[=================================================================] 100.00% 12.72K 0s 39ms
Index statistics
Entries:          3785086
DB size:          509 MB
Avg k-mer size:   0.059142
Top 10 k-mers
    GPGGTL	64
    GQQVAR	39
    SQQSER	30
    GLGNGK	24
    SGGSLR	24
    QLGQRV	24
    LPDEFY	23
    GQQIAR	21
    GEQVAR	21
    LGNAST	20
Time for index table init: 0h 0m 0s 583ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 112
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 12719
Target db start 1 to 12719
[=================================================================] 100.00% 12.72K 3s 22ms

301.207794 k-mers per position
6149 DB matches per sequence
0 overflows
55 sequences passed prefiltering per query sequence
45 median result list length
0 sequences with 0 size result lists
Time for merging to pref_0: 0h 0m 0s 14ms
Time for processing: 0h 0m 4s 194ms
align setDB setDB tmpFolder/3152204347500479419/search/2069484046060416119/pref_0 tmpFolder/3152204347500479419/result --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 10 --min-seq-id 0 --min-aln-len 30 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 2 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 256 --compressed 0 -v 3

Compute score, coverage and sequence identity
Query database size: 12719 type: Aminoacid
Target database size: 12719 type: Aminoacid
Calculation of alignments
[=================================================================] 100.00% 12.72K 0s 547ms
Time for merging to result: 0h 0m 0s 15ms
459801 alignments calculated
78951 sequence pairs passed the thresholds (0.171707 of overall calculated)
6.207328 hits per query sequence
Time for processing: 0h 0m 0s 775ms
prefixid tmpFolder/3152204347500479419/result tmpFolder/3152204347500479419/result_prefixed --threads 256 -v 3

[=================================================================] 100.00% 12.72K 0s 62ms
Time for merging to result_prefixed: 0h 0m 0s 9ms
Time for processing: 0h 0m 0s 264ms
besthitbyset setDB setDB tmpFolder/3152204347500479419/result_prefixed tmpFolder/3152204347500479419/aggregate --simple-best-hit 1 --suboptimal-hits 0 --threads 256 --compressed 0 -v 3

[=================================================================] 100.00% 12.72K 0s 81ms
Time for merging to aggregate: 0h 0m 0s 11ms
Time for processing: 0h 0m 0s 316ms
mergeresultsbyset setDB_set_to_member tmpFolder/3152204347500479419/aggregate tmpFolder/3152204347500479419/aggregate_merged --threads 256 -v 3

Time for merging to aggregate_merged: 0h 0m 0s 5ms
Time for processing: 0h 0m 0s 254ms
combinehits setDB setDB tmpFolder/3152204347500479419/aggregate_merged tmpFolder/3152204347500479419/matches tmpFolder/3152204347500479419 --alpha 1 --aggregation-mode 0 --filter-self-match 0 --threads 256 --compressed 0 -v 3

[=================================================================] 100.00% 3 0s 53ms
Time for merging to matches_h: 0h 0m 0s 9ms
Time for merging to matches: 0h 0m 0s 4ms
Time for processing: 0h 0m 0s 407ms
clusterhits setDB setDB tmpFolder/3152204347500479419/matches tmpFolder/3152204347500479419/clusters --multihit-pval 0.01 --cluster-pval 0.01 --max-gene-gap 3 --cluster-size 2 --cluster-use-weight 0 --db-output 1 --alpha 1 --threads 256 --compressed 0 -v 3

Invalid query lookup record                                       ] 0.00% 1 eta -
Error: clusterhits failed

Keepingle · 2024-02-15T21:19:59Z

I am running into the same problem.

RuoshiZhang · 2024-11-14T14:55:50Z

Thanks for the reminder and sorry for the delay. Spacedust currently only accepts faa files generated with Prodigal, from which the correct coordinate information can be parsed. If you would still like to work with Prokka, in principle you could create the DB with the gff files generated by Prokka.
For now I would strongly recommend the former approach (Prodigal faa files), as there are some outstanding issues with gff parsing and thread handling (currently it should work if you run createsetdb with --threads 1).
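To illustrate what "coordinate information" means here: Prodigal writes the start, end, and strand of each predicted gene into the FASTA header of its protein output, separated by ` # `, whereas a Prokka faa header normally carries only a locus tag and a product name. The Python sketch below parses a Prodigal-style header. It is an assumption about the kind of information spacedust needs, not a copy of its parser, and the names `OrfCoordinates` and `parse_prodigal_header` as well as the example headers are hypothetical.

```python
from typing import NamedTuple, Optional


class OrfCoordinates(NamedTuple):
    contig: str
    orf_index: int
    start: int
    end: int
    strand: int  # 1 for forward, -1 for reverse


def parse_prodigal_header(header: str) -> Optional[OrfCoordinates]:
    """Extract coordinates from a Prodigal-style FASTA header.

    Returns None when the header does not follow the
    '>contig_index # start # end # strand # ...' layout.
    """
    fields = [field.strip() for field in header.lstrip(">").split(" # ")]
    if len(fields) < 4:
        return None
    name, start, end, strand = fields[:4]
    # Prodigal names each protein '<contig>_<running index>'.
    contig, separator, index = name.rpartition("_")
    if not separator:
        return None
    try:
        return OrfCoordinates(contig, int(index), int(start), int(end), int(strand))
    except ValueError:
        return None


# Made-up headers in the two styles.
prodigal_style = ">scaffold1_12 # 337 # 2799 # -1 # ID=1_12;partial=00"
prokka_style = ">ABC_00001 hypothetical protein"

print(parse_prodigal_header(prodigal_style))
print(parse_prodigal_header(prokka_style))  # None: no coordinates in the header
```

Running this over the headers of your own faa file is a quick way to see whether it carries Prodigal-style coordinates at all.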

Dx-wmc · 2024-11-14T17:19:46Z

I can work with Prodigal as required. However, I have concerns about the accuracy of Prodigal's CDS predictions, as it lacks certain error-correction steps.

Support for formats generated by Prokka and Bakta would be incredibly valuable, as these tools are widely used, and enabling compatibility with them would likely expand Spacedust's usability and adoption significantly. Thank you for considering this potential enhancement!

KerrinSteensen · 2024-11-26T19:15:38Z

+1 here, as I also tried to run spacedust on bakta-generated faa files. Would be super helpful if that could be updated in the future!

Dx-wmc mentioned this issue Nov 10, 2024

version tags #6

Open
