Flu Lineage Utilities LABEL README

Input
LABEL accepts files in FASTA format, whether multi-line or single-line, with any number of sequences. Sequences with redundant headers are removed on a first-come, first-served basis (the first occurrence is kept). Space characters are also replaced with the underscore character.
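The input handling described above can be sketched in a few lines of Python. This is only an illustration, not LABEL's own code: the function name normalize_fasta is hypothetical, and it assumes that spaces are replaced in headers and that duplicates are detected after that replacement.

```python
def normalize_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA text.

    Assumptions (not confirmed by the LABEL source): spaces are replaced
    in headers only, and duplicate detection happens after replacement.
    """
    records = {}      # header -> list of sequence chunks, in input order
    current = None    # chunk list of the record being read, or None to skip
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].replace(" ", "_")
            if header in records:
                current = None            # redundant header: drop this record
            else:
                current = records.setdefault(header, [])
        elif current is not None:
            current.append(line)          # joins multi-line sequences
    return [(h, "".join(chunks)) for h, chunks in records.items()]
```

Multi-line and single-line records are treated the same way, and only the first record with a given header survives.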

Output
LABEL produces the following output files* in a zipped archive:

PROJ_final.tab Standard. Tab-delimited headers & predicted clades

PROJ_final.txt Standard. A prettier output of the above

LEVEL_trace.tab Standard. Table of HMM scores at each level, suitable for visualization in R

LEVEL_result.tab Standard. For the current LEVEL, tab-delimited headers & predicted clades

LEVEL_result.txt Standard. For the current LEVEL, a prettier output of the above

FASTA/ Standard. Folder containing fasta files and newick trees

FASTA/MOD_predictions.fas Standard. Query sequence file with predictions added like: "_{PRED:CLADE}"

FASTA/MOD_control.fasta Optional. Alignment of predictions fasta file and guide sequences

FASTA/MOD_control.nwk Optional. Maximum likelihood tree of the above

FASTA/PROJ_reannotated.fas Default. Query sequences file with annotations replaced with predicted ones, ordered by clade

FASTA/PROJ_ordered.fasta Optional. Aligned version of the above, still ordered by clade

FASTA/PROJ_tree.nwk Optional. Maximum likelihood tree of the above

FASTA/PROJ_clade_CLADE.fas Standard. The re-annotated file partitioned into separate clade files

c-*/ Standard. Clade/lineage subfolder for the hierarchical predictions

Read_Me.txt This file

*The project name is denoted "PROJ", the prediction level as "LEVEL", the lineage or clade is called "CLADE", and the module of interest is "MOD".
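For downstream scripting, the "_{PRED:CLADE}" annotation in the predictions file can be pulled back out of each header. The following is a minimal sketch, assuming the annotation is appended at the very end of the header; the helper names split_prediction and group_by_clade are hypothetical and are not part of LABEL.

```python
import re
from collections import defaultdict

# Assumed layout: the original header followed by "_{PRED:CLADE}" at the end.
PRED_SUFFIX = re.compile(r"_\{PRED:([^}]+)\}$")

def split_prediction(header):
    """Return (original_header, clade), or (header, None) if unannotated."""
    match = PRED_SUFFIX.search(header)
    if match is None:
        return header, None
    return header[:match.start()], match.group(1)

def group_by_clade(headers):
    """Group original headers by predicted clade, keeping input order."""
    groups = defaultdict(list)
    for header in headers:
        name, clade = split_prediction(header)
        groups[clade].append(name)
    return dict(groups)
```

Grouping the headers this way mirrors the per-clade partitioning of the re-annotated file, although LABEL already writes those partitions itself.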

Time
In general, expect no more than 1 second per sequence. Choosing alignment options may increase the runtime significantly. Guide sequence libraries are never more than 200 sequences in size. For the best results when using the alignment options, break the query sequence file down into smaller files.
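Breaking a large query file into smaller ones can be done with any tool. As one possible sketch, assuming the records are already held as (header, sequence) pairs, the hypothetical helpers below split them into fixed-size batches and write each batch back out as FASTA text. The batch size is left to the user; this README does not prescribe one.

```python
def chunk_records(records, chunk_size):
    """Split a list of (header, sequence) pairs into consecutive batches."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [records[i:i + chunk_size]
            for i in range(0, len(records), chunk_size)]

def to_fasta(records):
    """Render (header, sequence) pairs as single-line FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)
```

Each batch can then be saved to its own file and submitted separately.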

How it Works
LABEL, or "Lineage Assignment By Extended Learning", uses hidden Markov model (HMM) profiles of lineages/clades, or of groups of clades, to score every query sequence and then classifies the sequences via machine learning techniques. The HMM scoring step is performed via SAM v3.5 (see the manuscript for more information). Scoring is hierarchical in the sense that prediction starts at a general level (perhaps groups of many clades) and proceeds to a very specific, or terminal, level. This corresponds to the way phylogeny is usually structured. At each prediction level, data matrices and result files are produced.
The prediction step is done via support vector machines (SVM) using the free SHOGUN Machine Learning Toolbox v1.1.0 (www.shogun-toolbox.org). Specifically, we use a multi-class SVM method called GMNP with an inhomogeneous, normalized polynomial kernel of degree 20. At the end of the computation, a final output file is produced at the root level, in which every query sequence is given its final prediction.
The input FASTA sequences are annotated with predictions using the form "{PRED:clade}", and a second file that accepts the predicted annotations as fact is also created. This second file is sorted by clade and then partitioned into separate files for each clade. One can optionally align the re-annotated FASTA file while retaining clade ordering (with MUSCLE v3.8.31, see www.drive5.com/muscle; with MAFFT if available, see mafft.cbrc.jp/alignment/software; or with SAM's align2model program) and then produce a maximum-likelihood tree. Finally, a guide sequence library containing reference annotations may optionally be aligned with the query sequences and used to produce a second maximum-likelihood tree as a positive control. Smaller guide trees may not always be as accurate as fuller ones; when in doubt, add more sequences to the clades of interest.
All maximum-likelihood trees use a GTR+GAMMA model with 1000 local support bootstraps and are computed by FastTreeMP v2.1.4 (see www.microbesonline.org/fasttree).
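To make the kernel choice concrete, here is a small pure-Python sketch of an inhomogeneous polynomial kernel of degree 20 with the usual square-root-of-diagonal normalization. It is an illustration only and does not call SHOGUN. The additive offset of 1 and the function names are assumptions, and the input vectors stand in for per-sequence HMM score vectors.

```python
import math

def poly_kernel(x, y, degree=20, offset=1.0):
    """Inhomogeneous polynomial kernel: (x . y + offset) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + offset) ** degree

def normalized_poly_kernel(x, y, degree=20, offset=1.0):
    """Kernel scaled so that k(x, x) == 1 for every x.

    Normalization used here: k(x, y) / sqrt(k(x, x) * k(y, y)).
    """
    kxy = poly_kernel(x, y, degree, offset)
    kxx = poly_kernel(x, x, degree, offset)
    kyy = poly_kernel(y, y, degree, offset)
    return kxy / math.sqrt(kxx * kyy)
```

With an even degree the normalized value always lies between 0 and 1, and a high degree such as 20 makes it fall off quickly as two score vectors diverge.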

Modules
LABEL modules are merely directories within the LABEL_RES/training_data folder and contain all associated pHMMs as well as SVM training data. Extensions such as x-filter.txt guard against inappropriate data input. The guide tree for positive control (if desired) is listed as MOD_downsample.fa for MAFFT/MUSCLE alignment or in the x-control folder for faster pHMM alignment. Currently included modules:

H5v2015	Influenza A hemagglutinin module for subtype H5N1, current to 31 DEC 2014, nomenclature accepted 2015
H5v2013	Influenza A hemagglutinin module for subtype H5N1, current to 31 DEC 2012, nomenclature accepted 2013
H5v2011	Influenza A hemagglutinin module for subtype H5N1, current to 19 MAR 2012, nomenclature accepted c. 2011
H9v2011	Influenza A hemagglutinin module for subtype H9N2, current to 27 MAR 2012

Acknowledgements
LABEL v0.4.6 by Samuel S. Shepard (vfn4@cdc.gov)

Lineage Assignment by Extended Learning

Last updated Jun 2015, Centers for Disease Control & Prevention.
