Input
LABEL accepts files in FASTA format, whether multi-line or single-line, containing any number of sequences. Sequences with duplicate headers are removed on a first-come, first-served basis: the first occurrence is kept. Space characters are replaced with the underscore character.
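
The input rules above can be illustrated with a minimal Python sketch. It is not LABEL's own code: the function name clean_fasta is hypothetical, and it assumes that spaces are replaced in the header line and that duplicates are detected after that replacement.

```python
def clean_fasta(text):
    """Return (header, sequence) pairs from FASTA text.

    Assumptions (not taken from LABEL itself): spaces are replaced in
    headers only, and duplicate detection happens after the replacement.
    """
    records = {}   # dicts keep insertion order, so the first occurrence wins
    header = None  # None means "skip sequence lines" (duplicate or no header yet)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip().replace(" ", "_")
            if name in records:
                header = None          # duplicate header: drop this record
            else:
                header = name
                records[header] = []
        elif header is not None:
            records[header].append(line)  # multi-line sequences are joined below
    return [(h, "".join(parts)) for h, parts in records.items()]
```

Both multi-line and single-line records come out as one sequence string per header.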

Output
LABEL produces the following output files* in a zipped archive:

File                        Type      Description
PROJ_final.tab              Standard  Tab-delimited headers & predicted clades
PROJ_final.txt              Standard  A prettier output of the above
LEVEL_trace.tab             Standard  Table of HMM scores at each level, suitable for visualization in R
LEVEL_result.tab            Standard  For the current LEVEL, tab-delimited headers & predicted clades
LEVEL_result.txt            Standard  For the current LEVEL, a prettier output of the above
FASTA/                      Standard  Folder containing fasta files and newick trees
FASTA/MOD_predictions.fas   Standard  Query sequence file with predictions added like: "_{PRED:CLADE}"
FASTA/MOD_control.fasta     Optional  Alignment of predictions fasta file and guide sequences
FASTA/MOD_control.nwk       Optional  Maximum likelihood tree of the above
FASTA/PROJ_reannotated.fas  Default   Query sequences file with annotations replaced with predicted ones, ordered by clade
FASTA/PROJ_ordered.fasta    Optional  Aligned version of the above, still ordered by clade
FASTA/PROJ_tree.nwk         Optional  Maximum likelihood tree of the above
FASTA/PROJ_clade_CLADE.fas  Standard  The re-annotated file partitioned into separate clade files
c-*/                        Standard  Clade/lineage subfolder for the hierarchical predictions
Read_Me.txt                           This file

*The project name is denoted "PROJ", the prediction level as "LEVEL", the lineage or clade is called "CLADE", and the module of interest is "MOD".
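
As a hedged example of working with the output, the sketch below pulls the "_{PRED:CLADE}" tag out of the headers of a predictions file and counts sequences per clade. The function names and the sample header in the usage note are hypothetical; only the tag format comes from the file list above.

```python
import re
from collections import Counter

# Matches the "_{PRED:CLADE}" tag described for FASTA/MOD_predictions.fas.
PRED_TAG = re.compile(r"_\{PRED:([^}]*)\}")

def predicted_clade(header):
    """Return the clade named in a header's PRED tag, or None if absent."""
    match = PRED_TAG.search(header)
    return match.group(1) if match else None

def tally_predictions(fasta_text):
    """Count predicted clades over all header lines of FASTA text."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            clade = predicted_clade(line)
            if clade is not None:
                counts[clade] += 1
    return counts
```

For instance, predicted_clade(">sample_1_{PRED:2.3.4}") returns "2.3.4" for that made-up header.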

Time
In general, expect no more than 1 second per sequence. Choosing alignment options may increase the runtime significantly. Guide sequence libraries never exceed 200 sequences. For best results with the alignment options, split the query sequence file into smaller files.
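
One way to split a large query file before using the alignment options is sketched below. The function name split_fasta and the per_file parameter are hypothetical, and the right chunk size is left to the user; nothing here is part of LABEL itself.

```python
def split_fasta(text, per_file):
    """Split FASTA text into chunks of at most per_file records each.

    Returns a list of strings, one per chunk, ready to be written to
    separate files.
    """
    if per_file < 1:
        raise ValueError("per_file must be at least 1")
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line])          # start a new record
        elif records and line.strip():
            records[-1].append(line)        # sequence line of the current record
    chunks = []
    for start in range(0, len(records), per_file):
        group = records[start:start + per_file]
        chunks.append("\n".join("\n".join(rec) for rec in group) + "\n")
    return chunks
```

Each returned string can then be saved under its own file name and submitted separately.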

How it Works
LABEL, or "Lineage Assignment By Extended Learning", uses hidden Markov model (HMM) profiles of lineages/clades, or groups of clades, to score every query sequence and then classifies the sequences via machine learning techniques. The HMM scoring step is performed with SAM v3.5 (see the manuscript for more information). Scoring is hierarchical: prediction starts at a general level (perhaps groups of many clades) and proceeds to a very specific, terminal level. This corresponds to the way phylogeny is usually structured. At each prediction level, data matrices and result files are produced.
The prediction step uses support vector machines (SVM) from the free SHOGUN Machine Learning Toolbox v1.1.0 (www.shogun-toolbox.org). Specifically, we use a multi-class SVM method called GMNP with an inhomogeneous, normalized polynomial kernel of degree 20. At the end of the computation, a final output file is produced at the root level in which every query sequence is given its final prediction.
The input FASTA sequence file is annotated with predictions, using the form "{PRED:clade}", and a second file that accepts the predicted annotations as fact is also created. This second file is sorted by clade and then partitioned into separate files for each clade.
Optionally, the re-annotated FASTA file can be aligned while retaining clade ordering (with MUSCLE v3.8.31, see www.drive5.com/muscle; with MAFFT if available, see mafft.cbrc.jp/alignment/software; or with SAM's align2model program) and then used to produce a maximum-likelihood tree. Finally, a guide sequence library containing reference annotations may optionally be aligned with the query sequences and used to produce a second maximum-likelihood tree as a positive control. Smaller guide trees may not always be as accurate as fuller ones; when in doubt, add more sequences to the clades of interest.
All maximum-likelihood trees use a GTR+GAMMA model with 1000 local support bootstraps and are computed by FastTreeMP v2.1.4 (see www.microbesonline.org/fasttree).
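
For readers unfamiliar with the kernel named above, here is a minimal Python sketch of an inhomogeneous, normalized polynomial kernel of degree 20. It is an illustration only and not SHOGUN's code: the offset of 1 and the normalization by the square root of the two self-kernel values are assumptions about what "inhomogeneous" and "normalized" mean here, and the input vectors stand in for per-sequence HMM score vectors.

```python
import math

def poly_kernel(x, y, degree=20, offset=1.0):
    """Inhomogeneous polynomial kernel: (x . y + offset) ** degree.

    The offset value of 1 is an assumption made for this sketch.
    """
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + offset) ** degree

def normalized_poly_kernel(x, y, degree=20, offset=1.0):
    """Kernel value scaled so that k(x, x) == 1 for every x (assumed normalization)."""
    kxy = poly_kernel(x, y, degree, offset)
    kxx = poly_kernel(x, x, degree, offset)
    kyy = poly_kernel(y, y, degree, offset)
    return kxy / math.sqrt(kxx * kyy)
```

With this normalization a vector compared with itself scores 1, and the high degree makes the value fall off quickly as two score vectors diverge.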

Modules
LABEL modules are simply directories within the LABEL_RES/training_data folder; each contains all associated pHMMs as well as SVM training data. Extensions such as x-filter.txt guard against inappropriate input data. The guide tree for positive control (if desired) is listed as MOD_downsample.fa for MAFFT/MUSCLE alignment, or in the x-control folder for faster pHMM alignment. Currently included modules:

H5v2015  Influenza A hemagglutinin module for subtype H5N1, current to 31 DEC 2014, nomenclature accepted 2015
H5v2013  Influenza A hemagglutinin module for subtype H5N1, current to 31 DEC 2012, nomenclature accepted 2013
H5v2011  Influenza A hemagglutinin module for subtype H5N1, current to 19 MAR 2012, nomenclature accepted c. 2011
H9v2011  Influenza A hemagglutinin module for subtype H9N2, current to 27 MAR 2012

Acknowledgements
LABEL v0.4.6 by Samuel S. Shepard (vfn4@cdc.gov)

Lineage Assignment by Extended Learning

Last updated Jun 2015, Centers for Disease Control & Prevention.