====================
Implementation notes
====================

The tree construction operates in two phases. We first build the tree as a tree
of Python object nodes because it's easier to build with a dynamic data
structure. Then it linearizes the topology of the nodes into a few integer
arrays that are easy to serialize and fast to look up. The object that
represents the linearized tree can only query the database, not build the tree.
The slower tree-of-nodes implementation can build and query (albeit with more
overhead).

vpsearch is best suited for indexing sets of small-ish marker genes, such as
the bacterial 16S rRNA gene or the fungal ITS region (100s-1000s of basepairs),
and has been tested with databases of hundreds of thousands of sequences. In
general, vpsearch is able to construct the tree using (on average) ``O(n log n)``
sequence comparisons and uses ``O(n)`` memory to do so, where ``n`` is the number
of sequences in the database. Each sequence comparison involves a global
sequence alignment, which scales quadratically with the length of the sequence.
For short sequences this can be done quickly and efficiently, but for longer
sequences (e.g. full length viral or bacterial genomes), the total runtime and
memory usage can be considerable.  If you are interested in using vpsearch
under these conditions, please open an issue!