Primary vs. secondary data

Primary data are assembled in the final assembly phase of IRMA. Secondary data are stored with unmatched read patterns and currently do not undergo final assembly. Data are classified as secondary for one of two reasons: (1) an insufficient number of reads or read patterns [see MIN_RC], or (2) the gene segment or genome belongs to a group and does not have the most reads for that group. The grouping variable, SORT_GROUPS, is usually a module-specific variable set in your init.sh relative to the organism or assembly strategy of interest.

Groups are often useful when the user provides two references for genes of different types, subtypes, or lineages. Competing lineages may appear because of sorting misclassification (more common with BLAT than LABEL), barcode bleed-over (samples in the same run not being de-multiplexed properly), contamination, or a true co-infection. The section below explains how the SORT_GROUPS variable affects this second aspect of primary/secondary data sorting.



SORT_GROUPS

SORT_GROUPS=""
No groups declared. Each gene segment or genome identified by the headers in the starting reference library (usually consensus.fasta) is considered primary data if sufficient reads are found. This setting assumes the data are sorted into the relevant gene segment or genome in the first place.

SORT_GROUPS="__ALL__"
All gene segments or genomes in one group. The reference gene segment or genome with the most reads is counted as primary data.

SORT_GROUPS="p1,p2,p3,...,pN"
One group for each pattern (p) in the comma-delimited list. A group is defined by every gene segment or genome name containing the same pattern or substring p. The gene segment or genome with the most reads in each group is considered primary data. Gene segments or genomes not in any group are considered primary if sufficient reads are found.

Examples (using the read counts listed further below):

SORT_GROUPS=""
Primary: A_PB2, A_PB1, A_PA, B_PB2, B_PB1
Secondary: B_PA

SORT_GROUPS="__ALL__"
Primary: A_PA
Secondary: A_PB2, A_PB1, B_PB2, B_PB1, B_PA

SORT_GROUPS="PB2,PB1,PA"
Primary: A_PB2, A_PB1, A_PA
Secondary: B_PB2, B_PB1, B_PA

SORT_GROUPS="PB1"
Primary: A_PB2, A_PB1, A_PA, B_PB2
Secondary: B_PB1, B_PA

Example influenza A & B polymerase gene set (and reads found) which would have corresponding headers in consensus.fasta:

Gene segment    Count
A_PB2           1000
A_PB1           1400
A_PA            2150
B_PB2           111
B_PB1           121
B_PA            10

Suppose also that the minimum read count and the minimum read pattern count are both 15 (see MIN_RP and MIN_RC). B_PA, with only 10 reads, would then always be secondary in this example.
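
The grouping rules can be sketched in a few lines of Python. This is only an illustration of the behavior described on this page, not IRMA's own code. The function name classify and the single min_reads threshold (standing in for both MIN_RC and MIN_RP) are assumptions made for the sketch. It also assumes that a name matching several patterns belongs to each of those groups, and that ties go to the first name listed.

```python
# Read counts from the example influenza A & B polymerase gene set.
READ_COUNTS = {
    "A_PB2": 1000,
    "A_PB1": 1400,
    "A_PA": 2150,
    "B_PB2": 111,
    "B_PB1": 121,
    "B_PA": 10,
}


def classify(counts, sort_groups, min_reads=15):
    """Return (primary, secondary) lists of reference names.

    Hypothetical helper: min_reads stands in for both MIN_RC and MIN_RP.
    """
    names = list(counts)

    # Build the groups implied by the SORT_GROUPS setting.
    if sort_groups == "":
        groups = []                      # no groups declared
    elif sort_groups == "__ALL__":
        groups = [names]                 # everything in one group
    else:
        patterns = sort_groups.split(",")
        # One group per pattern: every name containing the substring.
        groups = [[n for n in names if p in n] for p in patterns]

    grouped = {n for group in groups for n in group}
    primary = set()

    # Within each group, only the member with the most reads can be primary.
    for group in groups:
        if group:
            best = max(group, key=counts.get)
            if counts[best] >= min_reads:
                primary.add(best)

    # Names outside every group are primary if they have enough reads.
    for n in names:
        if n not in grouped and counts[n] >= min_reads:
            primary.add(n)

    primary_list = [n for n in names if n in primary]
    secondary_list = [n for n in names if n not in primary]
    return primary_list, secondary_list


for setting in ["", "__ALL__", "PB2,PB1,PA", "PB1"]:
    prim, sec = classify(READ_COUNTS, setting)
    print(f'SORT_GROUPS="{setting}" Primary: {prim} Secondary: {sec}')
```

Run against the example counts, the sketch gives the same primary and secondary lists as the four SORT_GROUPS examples above.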


This page last reviewed: Tuesday, November 19, 2019