Plurality consensus sequences are returned in the main folder and are named after the virus genome or gene segment class label, see the IRMA output page for more details. The plurality consensus sequence is very useful when used with the BAM file to look at minor variants, since the most frequent allele will appear in the reference. A plurality rule was chosen over majority consensus because it is more inclusive for pattern matching purposes and does not assign strict thresholds for the dominant virus phase in the sample. Other parameters are available to restrict the quality of the consensus alleles as part of the amended consensus.
In case one has not read the paper, please note that IRMA is an iterative refinement meta-assembler, meaning it does iterative reference editing based on the observed read data, including substitutions and structural changes. The final plurality consensus and the final reference are the same sequence!
In addition to plurality consensus sequences, a variety of IRMA amended consensus sequences are returned in the amended_consensus sub-folder depending on the configuration. These sequences are modifications to the plurality consensus.
- The first type of amendment is base ambiguation for mixed alleles. If a minor variant is called and its frequency is ≥ MIN_AMBIG it will use the IUPAC code that combines the dominant and minority allele bases. For example, if the plurality consensus is A and the called minor variant is G at the same site with frequency greater than X% (whatever threshold), then the new allele at that position would be R. The thought process is that there is a frequency threshold where the mixture become significant or relevant, whether in a biological or statistical sense, that is up to the user.
- The second type of base amendment is for consensus allele quality control. IRMA uses both the average consensus allele quality score as well as the consensus allele coverage depth. If either value is below threshold, the base will be amended as an N. The corresponding configuration fields are MIN_CONS_QUALITY and MIN_CONS_SUPPORT respectively. One should take care to adjust these values based on the chemistry or platform. For example, long-read technology may have lower coverage depth, in general, than short-read technology. The coverage table will also be amended when consensus alleles fail to pass thresholds. Note: setting these values arbitrarily high will result in masking hard-earned data. Consider taking an evidence-based approach to finding the optimal setting for your platform.
- Optionally, if an HMM profile is provided in your module, one can use it to globally align the plurality consensus to the profile and also create an amended alignment. The alignment will create necessary deletion states relative to the HMM coordinate space; insertion states appear as lowercase letters in the sequence. The coordinate space is usually shared with the reference seeds in the consensus folder of the module. One must set ALIGN_AMENDED=1 for this option. Additional files with the suffix ".a2m" will appear in main folder for the plurality alignment and also in the amended_consensus folder. If one does not have an HMM profile, the modelfromalign and align2model programs are distributed with LABEL subject to the license for SAM. See the docs for usage.
- Amplicon dropouts can sometimes leave zero coverage regions where no in-read deletions were detected. It can be useful to differentiate these regions from deletions in the amended consensus. If one sets PADDED_CONSENSUS=1 (depends on ASSEM_REF=1 & ALIGN_AMENDED=1), a new file will be created with the suffix ".pad.fa". The sequence in this file pads putative amplicon dropout regions with N versus the amended consensus alignment. An algorithm sketch is as fellows:
- The ASSEM_REF option ensures the first final assembly round is against the reference seed, which for all modules we have created, is in the same coordinate space as the provided profile HMM. Given an amended global alignment also exists via ALIGN_AMENDED, one can attempt to detect dropout regions by reading in the SAM file and the amended A2M.
- Insertions are removed from the amended global alignment.
- Deletions in the global alignment are specially masked using "*".
- The SAM file is processed and deleted regions spanned by reads are restored to deletions.
- Regions still containing the mask have zero coverage and are not necessarily due to deletion events.
- Regions flanking dropouts or low coverage masked regions are ambiguated except at the 5' or 3' end.
- Insertions are added back into the consensus
- The special mask is replaced by N bases ("N-padding").
- Deletions are now removed to create a contiguous, unaligned sequence.
- A new file is written with suffix ".pad.fa"
- Use of ALIGN_AMENDED generates an alternative coverage table (TARGET-coverage.a2m.txt) with coordinate mapping from the plurality consensus to the profile HMM coordinate space. Use of PADDED_CONSENSUS generates a tertiary coverage table (TARGET-coverage.pad.txt) that additionally coverts deletion states to padded states as applicable by the above algorithm. The HMM coordinates and alignment state columns are mapped in-place to the variants, allAlleles, insertions, and deletions table for convenience.