Primary vs. secondary data

Primary data are assembled in the final assembly phase of IRMA. Secondary data are stored with unmatched read patterns and, currently, do not undergo final assembly. Data are classified as secondary for one of two reasons: (1) an insufficient number of reads or read patterns (see MIN_RC and MIN_RP), or (2) the gene segment or genome belongs to a group and does not have the most reads in that group. This variable is usually module-specific and is set in your init.sh according to the organism or assembly strategy of interest.

Groups are often useful when the user provides two references for genes of different types, subtypes, or lineages. Competing lineages may appear because of sorting misclassification (more common with BLAT than LABEL), barcode bleed-over (samples in the same run not being de-multiplexed properly), contamination, or a true co-infection. The next section explains how the SORT_GROUPS variable affects this second aspect of primary/secondary data sorting.


SORT_GROUPS

`SORT_GROUPS=""`	No groups declared. Each gene segment or genome identified by the headers in the starting reference library (usually consensus.fasta) is considered primary data, provided it has sufficient reads. This setting assumes the data are already sorted into the relevant gene segment or genome.

`SORT_GROUPS="__ALL__"`	All gene segments or genomes are placed in one group. Only the reference gene segment or genome with the most reads is counted as primary data.

`SORT_GROUPS="p1,p2,p3,...,pN"`	One group for each pattern (p) in the comma-delimited list. A group consists of every gene segment or genome whose name contains the pattern or substring p. The gene segment or genome with the most reads in that group is considered primary data. Gene segments or genomes not in any group are considered primary if sufficient reads are found.
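The substring rule for group membership can be illustrated with a short Python sketch. This is not IRMA's own code: the function name `build_groups` is hypothetical, and the handling of empty patterns is an assumption made for illustration.

```python
def build_groups(names, sort_groups):
    """Map each group label to the reference names it contains.

    Illustrative only: mirrors the three SORT_GROUPS forms described
    above, not IRMA's actual implementation.
    """
    if sort_groups == "":
        # No groups declared.
        return {}
    if sort_groups == "__ALL__":
        # Every reference falls into one shared group.
        return {"__ALL__": list(names)}
    # One group per comma-delimited pattern; membership is a plain
    # substring match. Empty patterns are skipped (an assumption).
    patterns = [p for p in sort_groups.split(",") if p]
    return {p: [n for n in names if p in n] for p in patterns}


names = ["A_PB2", "A_PB1", "A_PA", "B_PB2", "B_PB1", "B_PA"]
print(build_groups(names, "PB1"))  # {'PB1': ['A_PB1', 'B_PB1']}
```

Because membership is a substring match, a short pattern such as `PB` would place both PB2 and PB1 segments in a single group.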

Examples (using the read counts in the table that follows):

`SORT_GROUPS=""`	Primary: A_PB2, A_PB1, A_PA, B_PB2, B_PB1	Secondary: B_PA

`SORT_GROUPS="__ALL__"`	Primary: A_PA	Secondary: A_PB2, A_PB1, B_PB2, B_PB1, B_PA

`SORT_GROUPS="PB2,PB1,PA"`	Primary: A_PB2, A_PB1, A_PA	Secondary: B_PB2, B_PB1, B_PA

`SORT_GROUPS="PB1"`	Primary: A_PB2, A_PB1, A_PA, B_PB2	Secondary: B_PB1, B_PA

Example influenza A and B polymerase gene set, with the number of reads found for each segment. Each segment would have a corresponding header in consensus.fasta:

Gene segment	Count
A_PB2	1000
A_PB1	1400
A_PA	2150
B_PB2	111
B_PB1	121
B_PA	10

Suppose also that the minimum read count and the minimum read pattern count are both 15 (see MIN_RC and MIN_RP). B_PA, with only 10 reads, is then always secondary in this example.
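The four example settings can be checked against these counts with a small Python sketch. It is a simplified model, not IRMA's implementation. The name `classify` is hypothetical. It applies only the read-count threshold, because the example gives no read pattern counts. It also assumes that a segment below the threshold cannot win its group and that patterns do not overlap.

```python
def classify(counts, sort_groups, min_reads=15):
    """Split reference names into (primary, secondary) lists.

    Simplified model of the rules described above, not IRMA's code.
    """
    names = list(counts)
    if sort_groups == "":
        groups = []
    elif sort_groups == "__ALL__":
        groups = [names]
    else:
        # One group per pattern; membership is a substring match.
        groups = [[n for n in names if p in n]
                  for p in sort_groups.split(",") if p]

    # Reason 1: too few reads.
    low = {n for n in names if counts[n] < min_reads}

    # Reason 2: in a group, but not the member with the most reads.
    losers = set()
    for members in groups:
        eligible = [n for n in members if n not in low]
        if not eligible:
            continue
        winner = max(eligible, key=counts.get)
        losers.update(n for n in members if n != winner)

    secondary = low | losers
    primary = [n for n in names if n not in secondary]
    return primary, [n for n in names if n in secondary]


counts = {"A_PB2": 1000, "A_PB1": 1400, "A_PA": 2150,
          "B_PB2": 111, "B_PB1": 121, "B_PA": 10}

for setting in ("", "__ALL__", "PB2,PB1,PA", "PB1"):
    primary, secondary = classify(counts, setting)
    print(repr(setting), "primary:", primary, "secondary:", secondary)
```

Under these assumptions, the sketch gives the same primary and secondary lists as the four examples for this data set.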


This page last reviewed: Tuesday, July 15, 2025

Content source: Influenza Division Bioinformatics Team