Usage

Given a sequence database (in FASTA format), vpsearch build constructs an optimized vantage point search tree. Building the tree is a one-time operation that only needs to be repeated when the database changes. As an illustration, we build a vantage point tree for a database of sequences obtained by trimming the GTDB 16S database to the V3-V4 hypervariable region. This database contains 10875 unique sequences and can be found (in compressed form) in the data/ directory of this repository:

$ vpsearch build bac120_ssu_reps_r207-sliced-dedup.fa.gz
Building for 10875 sequences...done.
Linearizing...done.
Database created in bac120_ssu_reps_r207-sliced-dedup.fa.db

As this is a relatively small database, the build finishes quickly, in about 10 seconds. Larger databases take longer: building an index for the RDP database of full-length sequences, for example, takes about 20 minutes on a standard machine.

Once a tree has been built, unknown sequences can be looked up using the vpsearch query command. Here we supply a query file with a single sequence. The query.fa file can also be found in the data/ directory and represents a Lactobacillus helsingborgensis sample whose sequence was downloaded from RefSeq. The top hit, RS_GCF_000970855.1, is a perfect match because it is the same sequence. The remaining three matches are highly similar but not identical, and represent different species of Lactobacillus (kimbladii, melliventris, and panisapium, respectively):

$ vpsearch query bac120_ssu_reps_r207-sliced-dedup.fa.db query.fa
NR_126253.1     RS_GCF_000970855.1      100.00  253     0       0       1       253     1       253     0       1265
NR_126253.1     RS_GCF_014323605.1      98.81   253     0       0       1       253     1       253     0       1238
NR_126253.1     RS_GCF_013346935.1      98.02   253     0       0       1       253     1       253     0       1220
NR_126253.1     RS_GCF_002916935.1      97.63   253     0       0       1       253     1       253     0       1211

By default, the vpsearch query command outputs the four best matches in the database for each query sequence; the number of matches can be changed with the -k parameter. Each query sequence is looked up independently, so multiple queries can be processed in parallel by enabling multiple threads; use the -j option to specify the number of threads.
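
When vpsearch is driven from a script, the -k and -j options can be assembled programmatically. The following is a minimal Python sketch, not part of the package: the helper name build_query_command is hypothetical, and placing the options before the positional arguments is an assumption about the command-line syntax.

```python
import shlex


def build_query_command(database, query_file, num_matches=None, num_threads=None):
    """Assemble the argument list for a `vpsearch query` call (hypothetical helper)."""
    cmd = ["vpsearch", "query"]
    if num_matches is not None:
        # -k: number of matches to report per query sequence
        cmd += ["-k", str(num_matches)]
    if num_threads is not None:
        # -j: number of threads used to process queries in parallel
        cmd += ["-j", str(num_threads)]
    # Assumption: options are accepted before the positional arguments.
    cmd += [database, query_file]
    return cmd


cmd = build_query_command(
    "bac120_ssu_reps_r207-sliced-dedup.fa.db",
    "query.fa",
    num_matches=10,
    num_threads=4,
)
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

On a machine where vpsearch is installed, the resulting list could be passed to subprocess.run(cmd, capture_output=True, text=True, check=True) to collect the tabular output as a string.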

The vpsearch query command attempts to output its results in the standard BLAST tabular format. The interpretation of the columns is as follows:

Column name        Example              Notes
-----------        -------              -----
query ID           NR_126253.1
subject ID         RS_GCF_014323605.1
% identity         98.81
alignment length   253
mismatches         0                    currently not implemented
gap openings       0                    currently not implemented
query start        1
query end          253
subject start      1
subject end        253
E-value            0                    N/A (always 0)
bit score          1238                 interpreted as the alignment score
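
For downstream processing, each output line can be split into the twelve columns listed above. The sketch below is one possible way to do this in Python; the Hit type, its field names, and the parse_hit function are illustrative choices and are not provided by vpsearch. It assumes that fields are separated by whitespace and that sequence IDs contain no whitespace.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One line of tabular output; field names are illustrative."""
    query_id: str
    subject_id: str
    percent_identity: float
    alignment_length: int
    mismatches: int       # currently not implemented by vpsearch
    gap_openings: int     # currently not implemented by vpsearch
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int
    e_value: float        # N/A (always 0)
    score: float          # interpreted as the alignment score


# One converter per column, in output order.
CONVERTERS = (str, str, float, int, int, int, int, int, int, int, float, float)


def parse_hit(line):
    """Convert one whitespace-separated output line into a Hit."""
    fields = line.split()
    if len(fields) != len(CONVERTERS):
        raise ValueError(f"expected {len(CONVERTERS)} columns, got {len(fields)}")
    return Hit(*(convert(value) for convert, value in zip(CONVERTERS, fields)))


line = "NR_126253.1  RS_GCF_014323605.1  98.81  253  0  0  1  253  1  253  0  1238"
hit = parse_hit(line)
print(hit.subject_id, hit.percent_identity, hit.score)
```

A named tuple keeps the column order of the output while allowing access by name, which makes it straightforward to filter hits, for example by a minimum percent identity.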

Note that the numbers of mismatches and gap openings are currently not computed, so these two columns do not carry meaningful values in the output. This will be addressed in a future version of the package.
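
Until those columns are filled in, a rough count of non-identical alignment positions can be recovered from the percent identity and the alignment length. The sketch below assumes that vpsearch follows the BLAST convention, in which percent identity is 100 times the number of identical positions divided by the alignment length; this is an assumption, not something the package documentation confirms. The function name is hypothetical, and the result does not distinguish mismatches from gaps.

```python
def estimate_differing_positions(percent_identity, alignment_length):
    """Estimate how many alignment positions are not identical.

    Assumes percent identity = 100 * identical positions / alignment length
    (the BLAST convention); vpsearch may compute it differently.
    """
    identical = round(percent_identity / 100 * alignment_length)
    return alignment_length - identical


# Percent identities of the four hits in the example output (alignment length 253).
for percent_identity in (100.00, 98.81, 98.02, 97.63):
    print(percent_identity, estimate_differing_positions(percent_identity, 253))
```

Under this assumption, the four example hits differ from the query at 0, 3, 5, and 6 positions, respectively.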