IRMA customizable parameters
Configuration files are loaded up to 4 times by IRMA.
(1) A global configuration file is applied to ensure the initialization of all relevant variables and for any module that is run. Variables should not be deleted from this file. The global configuration file path is: IRMA_RES/defaults.sh
(2) Module-specific configurations are applied that may override any global arguments. These configurations help adjust the assembly to the organism of interest. They are located in: IRMA_RES/modules/<ORGANISM>/init.sh
(3.1) Next, run-specific named configuration files can be applied to specialize the assembly for different situations. The default run-specific file is: IRMA_RES/modules/<ORGANISM>/config/<ORGANISM>.sh
(3.2) IRMA searches for the named configuration file when you run it. Suppose you specify FLU-utr as the module/configuration argument. The corresponding file path would be: IRMA_RES/modules/FLU/config/FLU-utr.sh
Finally, IRMA accepts an --external-config argument that will override any other configs. Argument variables are checked for validity. You can use an anonymous pipe to modify single variables, though this will not work in Docker.
An example of this loading process is illustrated below for calling FLU-lowQC:
- Global: <install_path>/IRMA_RES/defaults.sh
- Module: <install_path>/IRMA_RES/modules/FLU/init.sh
- Run-specific: <install_path>/IRMA_RES/modules/FLU/config/FLU-lowQC.sh
- Optional external config: --external-config /your/path/to/the/file.sh
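The override order above can be imitated with plain shell sourcing. The sketch below is illustrative only: it builds throwaway files in a temporary directory, every value other than the documented MIN_LEN and MAX_ROUNDS defaults is invented for the demo, and load_configs is a hypothetical helper rather than part of IRMA.

```shell
# Illustrative only: imitates layered config loading with throwaway files.
# Values in init.sh, FLU-lowQC.sh and external.sh are invented for this demo;
# load_configs is a hypothetical helper, not an IRMA function.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/IRMA_RES/modules/FLU/config"

# (1) global, (2) module, (3) run-specific, (4) optional external config
printf 'MIN_LEN=125\nMAX_ROUNDS=5\n' > "$demo_root/IRMA_RES/defaults.sh"
printf 'MAX_ROUNDS=3\n' > "$demo_root/IRMA_RES/modules/FLU/init.sh"
printf 'MIN_LEN=70\n' > "$demo_root/IRMA_RES/modules/FLU/config/FLU-lowQC.sh"
printf 'MIN_LEN=50\n' > "$demo_root/external.sh"

load_configs() {
    # Source each readable file in order; later assignments win.
    for cfg in "$@"; do
        if [ -r "$cfg" ]; then
            . "$cfg"
        fi
    done
}

load_configs \
    "$demo_root/IRMA_RES/defaults.sh" \
    "$demo_root/IRMA_RES/modules/FLU/init.sh" \
    "$demo_root/IRMA_RES/modules/FLU/config/FLU-lowQC.sh" \
    "$demo_root/external.sh"

echo "MIN_LEN=$MIN_LEN MAX_ROUNDS=$MAX_ROUNDS"
rm -rf "$demo_root"
```

With these demo files the script prints MIN_LEN=50 MAX_ROUNDS=3: the external file wins for MIN_LEN, and the module file wins for MAX_ROUNDS because nothing later resets it. In a shell that supports process substitution, the external file could presumably be replaced by an anonymous pipe such as <(echo 'MIN_LEN=50'); that exact invocation is an assumption, not taken from this page.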
Configurations added:
- Since v1.1.0: USE_IRMA_CORE (experimental) is an opt-in configuration to run the mergeSAMpairs and fastQ_converter tasks more efficiently. This is currently most helpful for paired-end samples with a higher quantity of reads. The irma-core_<OS> binary must be in the IRMA_RES/scripts folder or in the PATH, otherwise the configuration will be turned back off.
- Since v1.0.4: PACKAGED_FASTQ can optionally be turned off so that each sorted assembly fastq is gzipped separately instead of being packaged together as a "tar.gz".
- Since v1.0.0: DEL_TYPE, MIN_CONS_SUPPORT, MIN_CONS_QUALITY, ENFORCE_CLIPPED_LENGTH, INS_T_DEPTH, DEL_T_DEPTH, ALIGN_AMENDED, MIN_BLAT_MATCH, and PADDED_CONSENSUS. Minimap2 uses default SSW params but can take its own params: MM2_A, MM2_B, MM2_O, and MM2_E.
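One way to picture the Minimap2 parameter fallback mentioned above is shell default expansion. This is a sketch of the idea and not IRMA's actual code: the SSW values are the documented defaults, and the MM2_O override is an arbitrary example.

```shell
# Sketch: MM2_* values fall back to the SSW_* settings unless set explicitly.
# Not IRMA's implementation; MM2_O=12 is an arbitrary example override.
unset MM2_A MM2_B MM2_O MM2_E

SSW_M=2; SSW_X=5; SSW_O=10; SSW_E=1

MM2_O=12

MM2_A=${MM2_A:-$SSW_M}   # match score
MM2_B=${MM2_B:-$SSW_X}   # mismatch penalty
MM2_O=${MM2_O:-$SSW_O}   # gap open penalty (kept at 12 because it was set)
MM2_E=${MM2_E:-$SSW_E}   # gap extension penalty

echo "MM2_A=$MM2_A MM2_B=$MM2_B MM2_O=$MM2_O MM2_E=$MM2_E"
```

Only the explicitly set gap open penalty differs from the SSW settings here; the other three inherit their SSW counterparts.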
Grid parallelization
Parameter | Values | Kind | Description |
GRID_ON | 0, 1 | off, on | If a Sun Grid Engine (SGE) heritage scheduler is available, use it. We have tested IRMA on UGE 8.2.1. |
GRID_PATH | /shared/dir | <PATH> | If grid execution is on, optionally set GRID_PATH to override the default working area, <install_path>/IRMA_RES/ppath. Folder should be accessible by all compute nodes, such as a shared drive. |
LIMIT_BLAT | 60000 | ≥ 1 | If grid execution is on, use scheduler for BLAT when available read patterns per sample are ≥ threshold. |
LIMIT_LABEL | 1000 | ≥ 1 | If grid execution is on, use scheduler for LABEL when available reads per sample-gene are ≥ threshold. |
LIMIT_SAM | 500 | ≥ 1 | If grid execution is on, use scheduler for SAM when available read patterns per sample-gene are ≥ threshold. |
LIMIT_SSW | 80000 | ≥ 1 | If grid execution is on, use scheduler for SSW when available reads per sample-gene are ≥ threshold. |
LIMIT_PHASE | 200 | ≥ 1 | If grid execution is on, use scheduler for phasing when available variants per sample-gene are ≥ threshold. |
MATCH_PROC | 20 | ≥ 1 | If grid execution is on & quantity limit exceeded, number of array tasks for the match step. |
SORT_PROC | 80 | ≥ 1 | If grid execution is on & limit exceeded, number of array tasks for the sort step. |
ALIGN_PROC | 20 | ≥ 1 | If grid execution is on & limit exceeded, number of array tasks for the align step per gene segment. |
ASSEM_PROC | 20 | ≥ 1 | If grid execution is on & limit exceeded, number of array tasks for the final assembly step per gene segment. |
PHASE_PROC | 80 | ≥ 1 | If grid execution is on & limit exceeded, number of array tasks for the variant phasing step per gene segment. |
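The LIMIT_* rows all describe the same rule: a step goes to the scheduler only when grid execution is on and the observed quantity reaches the limit. A minimal sketch of that rule follows, assuming a hypothetical helper named use_grid and made-up read-pattern counts; it is not IRMA's own dispatch code.

```shell
# Sketch of the threshold rule; use_grid is a hypothetical helper.
GRID_ON=1
LIMIT_BLAT=60000

use_grid() {
    # $1 = observed quantity (e.g. read patterns), $2 = limit for that step
    [ "$GRID_ON" -eq 1 ] && [ "$1" -ge "$2" ]
}

# Made-up read-pattern counts for two samples.
for patterns in 1200 75000; do
    if use_grid "$patterns" "$LIMIT_BLAT"; then
        echo "$patterns read patterns: BLAT via scheduler"
    else
        echo "$patterns read patterns: BLAT locally"
    fi
done
```

With GRID_ON=0 both samples would run locally regardless of the limit.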
Single-node setup
Parameter | Values | Kind | Description |
SINGLE_LOCAL_PROC | 16 | ≥ 1 | For parallelization of one task on the computer where IRMA is executed; typically the number of logical CPU cores. |
DOUBLE_LOCAL_PROC | 8 | ≥ 1 | For parallelization of two multi-threaded tasks. Usually half SINGLE_LOCAL_PROC. |
ALLOW_TMP | 0,1 | off, on | When not using the grid, try to use /tmp for working directory. |
TMP | /tmp | <PATH> | Absolute path to a writable scratch/tmpfs for IRMA's workspace. |
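The relationship between the two local settings can be derived from the machine itself. The fragment below is a sketch under the assumption that getconf _NPROCESSORS_ONLN is available (it is on common Linux and macOS systems); IRMA does not necessarily compute these values this way.

```shell
# Sketch: derive the local parallelism settings from the logical core count.
# Assumes getconf _NPROCESSORS_ONLN works on this system; falls back to 1.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
case "$cores" in
    ''|*[!0-9]*) cores=1 ;;   # guard against empty or non-numeric output
esac
[ "$cores" -ge 1 ] || cores=1

SINGLE_LOCAL_PROC=$cores
DOUBLE_LOCAL_PROC=$(( cores / 2 ))   # "usually half SINGLE_LOCAL_PROC"
[ "$DOUBLE_LOCAL_PROC" -ge 1 ] || DOUBLE_LOCAL_PROC=1

echo "SINGLE_LOCAL_PROC=$SINGLE_LOCAL_PROC"
echo "DOUBLE_LOCAL_PROC=$DOUBLE_LOCAL_PROC"
```

The two printed lines could be pasted into an external config file.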
Variant calling
Parameter | Default | Kind | Description |
MIN_F | 0.008 | [0,1] | Minimum frequency heuristic for calling single nucleotide variants. |
AUTO_F | 0,1 | off, on | Automatically adjust MIN_F based on the maximal zero-confidence minority allele. |
MIN_CONF | 0.80 | [0,1] | Minimum confidence that a single nucleotide variant is not machine error. |
MIN_FI | 0.005 | [0,1] | Minimum frequency heuristic for calling insertion variants. |
MIN_FD | 0.005 | [0,1] | Minimum frequency heuristic for calling deletion variants. |
MIN_AQ | 2 | [0,64] | Minimum average allele quality score heuristic for calling insertion & single nucleotide variants. |
MIN_C | 2 | ≥ 1 | Minimum allele count heuristic for calling variants. |
MIN_TCC | 100 | ≥ 1 | Minimum coverage depth heuristic (total coverage count) for calling variants. |
SIG_LEVEL | 0.999 | {.90,.95,.99,.999} | Significance level for variant calling statistical tests. |
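To make the heuristic thresholds concrete, the sketch below checks a candidate single nucleotide variant against MIN_F, MIN_C, MIN_TCC and MIN_AQ. It assumes all four heuristics must pass together and deliberately omits the confidence and significance tests; passes_snv_heuristics is a hypothetical helper and the counts are made up.

```shell
# Sketch only: combines the heuristic thresholds with a logical AND (an assumption).
# MIN_CONF and SIG_LEVEL tests are not modelled here.
MIN_F=0.008; MIN_C=2; MIN_TCC=100; MIN_AQ=2

passes_snv_heuristics() {
    # $1 = allele count, $2 = total coverage count, $3 = average allele quality
    awk -v c="$1" -v t="$2" -v q="$3" \
        -v f="$MIN_F" -v mc="$MIN_C" -v mt="$MIN_TCC" -v mq="$MIN_AQ" \
        'BEGIN { exit !(t >= mt && c >= mc && q >= mq && c / t >= f) }'
}

# Made-up candidates: count, coverage, average quality.
for candidate in "12 1000 30" "5 1000 30" "12 80 30"; do
    set -- $candidate
    if passes_snv_heuristics "$1" "$2" "$3"; then
        echo "count=$1 coverage=$2 quality=$3: passes heuristics"
    else
        echo "count=$1 coverage=$2 quality=$3: filtered"
    fi
done
```

The first candidate has frequency 0.012 and passes; the second falls below MIN_F at 0.005; the third fails MIN_TCC.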
Reference editing
Parameter | Default | Kind | Description |
DEL_TYPE | "" | "NNN","DEL","REF" | Advanced option. If sites are completely missing during read gathering use the reference seed (REF), delete by ambiguation (NNN), or just remove (DEL). Default is old behavior: Uses "NNN" with BLAT and "DEL" otherwise. Can specify per round with space delimiter. |
INS_T | 0.25 | [0,1] | Minimum insertion frequency to alter reference. Set to 1 for off. Used during final assembly phase. The frequency is the count of the most frequent insertion after position P divided by the average coverage depth at positions P and P+1. |
INS_T_DEPTH | 0 | ≥ 0 | Minimum insertion coverage depth to alter reference. Set to 0 for off. Used during final assembly phase. The coverage depth is the count of the most frequent insertion after position P divided by the average coverage depth at positions P and P+1. |
DEL_T | 0.60 | [0,1] | Minimum deletion frequency to alter reference. Set to 1 for off. Used during final assembly. Zero coverage depth is always deleted. |
DEL_T_DEPTH | 0 | ≥ 0 | Minimum deletion coverage depth to alter reference. Set to 0 for off. Used during final assembly. Zero coverage depth is always deleted. |
SKIP_E | 0,1 | off, on | Skip reference elongation/extension at the 5′ and 3′ ends. Used during read gathering phase. |
MIN_FA | 1 | [0,1] | Minimum frequency for alternative reference generation based on the alternative or most frequent minority allele. Set to 1 for off. Used during read gathering and final assembly. |
MIN_CA | 20 | ≥ 1 | Minimum allele count for alternative reference generation. |
MIN_AMBIG | 0.20 | [0,0.5] | Minimum called SNV frequency for mixed base calls in the amended consensus folder. Used at the end of the final assembly phase. |
MIN_CONS_SUPPORT | 1 | ≥ 1 | Minimum allele coverage depth to call plurality consensus, otherwise calls "N". Setting this value too high can negatively impact final amended consensus. |
MIN_CONS_QUALITY | 0 | ≥ 0 | Minimum allele average quality to call plurality consensus, otherwise calls "N". Setting this value too high can negatively impact final amended consensus. |
SILENCE_COMPLEX_INDELS | 0,1 | off, on | Optionally silences reads with 4 or more indels in any round of the final assembly where detected. Complex indels may be spurious and can affect consensus. Use is not advisable for long-read technology. |
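The insertion frequency described for INS_T can be worked through with small numbers. The sketch below uses made-up counts and assumes the threshold is inclusive (frequency ≥ INS_T); insertion_frequency is a hypothetical helper, not an IRMA function.

```shell
# Worked example of the INS_T frequency with made-up counts.
# Assumes an inclusive comparison against the threshold.
INS_T=0.25

insertion_frequency() {
    # $1 = count of the most frequent insertion after position P
    # $2 = coverage depth at P, $3 = coverage depth at P+1
    awk -v n="$1" -v a="$2" -v b="$3" \
        'BEGIN { printf "%.4f\n", n / ((a + b) / 2) }'
}

freq=$(insertion_frequency 30 100 120)   # 30 / ((100 + 120) / 2)

if awk -v f="$freq" -v t="$INS_T" 'BEGIN { exit !(f >= t) }'; then
    echo "frequency $freq meets INS_T=$INS_T"
else
    echo "frequency $freq is below INS_T=$INS_T"
fi
```

Here 30 insertions over an average depth of 110 gives about 0.2727, which meets the default threshold of 0.25.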
Quality control
Parameter | Default | Kind | Description |
INCL_CHIM | 0,1 | off, on | Include chimeric reads that are found. Chimeric reads contain both strands. |
USE_MEDIAN | 0,1 | off, on | Use median read quality to filter reads instead of the average read quality. |
QUAL_THRESHOLD | 30 | [0,64] | Minimum threshold for filtering reads by median or average quality. |
MIN_LEN | 125 | ≥ 1 | Minimum read length to include reads in read gathering. This value should not be greater than the typical read length. |
MIN_BLAT_MATCH | 0 | ≥ 0 | Minimum BLAT match length in order to include reads during the read gathering phase. Default settings are effectively 30bp even when set lower; setting to 0 turns off the length check. Final assembly match states may be less due to clipping by SSW or MINIMAP2. This value should not be greater than the typical read length. |
ADAPTER | AGATGTGTATAAGAGACAG | bases | Transposase adapter sequence. To turn off, set to an empty string (ADAPTER=""). Clips 5′ on forward adapter and 3′ on the reverse complement adapter. May apply to Nextera paired-end reads. |
ENFORCE_CLIPPED_LENGTH | 0,1 | off, on | If ADAPTER is set and ENFORCE_CLIPPED_LENGTH is on, the MIN_LEN will apply to adapter-clipped reads. |
FUZZY_ADAPTER | 0,1 | off, on | If ADAPTER is set and FUZZY_ADAPTER is on, additionally trim adapters up to 1 mismatch. |
NO_MERGE | 0, 1 | off, on | Do not merge read pairs after final assembly. Merging provides error correction & detection. |
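Because clipping uses both the forward adapter and its reverse complement, it can help to see the two strings side by side. The sketch below assumes the default adapter reads as one contiguous 19-base sequence; reverse_complement is a hypothetical helper and only handles the uppercase bases A, C, G and T.

```shell
# Sketch: derive the reverse complement of the default adapter.
# Assumes the adapter is a single contiguous string of uppercase A/C/G/T.
ADAPTER="AGATGTGTATAAGAGACAG"

reverse_complement() {
    # Reverse the string with awk, then swap complementary bases with tr.
    printf '%s\n' "$1" |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }' |
        tr 'ACGT' 'TGCA'
}

ADAPTER_RC=$(reverse_complement "$ADAPTER")
echo "forward adapter:    $ADAPTER"
echo "reverse complement: $ADAPTER_RC"
```

For the default adapter the reverse complement is CTGTCTCTTATACACATCT.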
Meta-assembly program control
Parameter | Default | Values | Description |
ALIGN_AMENDED | 0,1 | off,on | Do global alignment of the plurality consensus to the HMM profile (if it exists) and create an amended version of the alignment. The output will have the suffix ".a2m". Coverage, insertion, deletion, variant and alleles tables will gain extra columns with mapped HMM coordinates and alignment states, though the coverage tables will appear in separate files. |
PADDED_CONSENSUS | 0,1 | off,on | Attempt to pad amended_consensus with Ns for amplicon dropout: requires ALIGN_AMENDED=1 and ASSEM_REF=1. The output will have the suffix ".pad.fa". |
MAX_ROUNDS | 5 | ≥ 1 | Maximum rounds for the read gathering phase. |
MERGE_SECONDARY | 0,1 | off,on | After round 1, return secondary data into the unmatched read pool. Best for non-segmented data and when references are very similar. |
MATCH_PROG | BLAT | | Program for the match step, to quickly match all reads to all references. |
SORT_PROG | BLAT | {BLAT, LABEL} | Program for the sort step, to determine the best match for each matched read. |
MIN_RP | 15 | ≥ 1 | Minimum number of read patterns to continue to attempt assembly per gene segment. |
MIN_RC | 15 | ≥ 1 | Minimum read count to continue to attempt assembly per gene segment. |
SORT_GROUPS | "" | string | Determines the sort groups for primary and secondary data. |
SECONDARY_SORT | 0,1 | off, on | Applies to LABEL sorting. If on, do two-stage sorting using SECONDARY_LABEL_MODULES after a rough match/sort by BLAT according to GENE_GROUP; otherwise, use LABEL_MODULE for all segments matched. Not to be confused with secondary data. |
LABEL_MODULE | "" | string | Specifies the LABEL module to use in the assembly configuration. Usually this variable is set in the init.sh for the IRMA assembly module, example: LABEL_MODULE="irma-FLU" |
GENE_GROUP | "" | string | Needed for two-stage sorting. Stage one uses BLAT to sort into rough groups that can be further sorted using SECONDARY_LABEL_MODULES. Format is a comma-delimited list of patterns with an optional trailing colon and alias for gene names not found in the list ("otherwise"). Example: HA,NA:OG |
SECONDARY_LABEL_MODULES | "" | string | Needed for two-stage sorting. Specifies LABEL modules to use as a 2nd sort step after a rough match/sort by BLAT. This variable is usually set in the init.sh for the IRMA module. Format must match the order and count of patterns in GENE_GROUP; it uses a comma-delimited list of modules with an optional colon to separate a final module for sorting references not matching any given pattern. Example: SECONDARY_LABEL_MODULES="irma-FLU-HA,irma-FLU-NA:irma-FLU-OG" |
ALIGN_PROG | SAM | {SAM, BLAT} | Program for the align step, to roughly align to and edit each reference for the next round. |
MAX_ITER_ASSEM | 5 | ≥ 1 | Maximum iterations for the final assembly phase. |
ASSEM_PROG | SSW | {SSW, MINIMAP2} | Program for the final assembly step, which optimizes by the assembly score. Takes either SSW or MINIMAP2, where the latter integrates all features except multiple chains. |
SSW_M | 2 | ≥ 0 | Smith-Waterman (or MM2) local alignment match score. |
SSW_X | 5 | ≥ 0 | Smith-Waterman (or MM2) local alignment mismatch penalty. |
SSW_O | 10 | ≥ 0 | Smith-Waterman (or MM2) local alignment gap open penalty. |
SSW_E | 1 | ≥ 0 | Smith-Waterman (or MM2) local alignment gap extension penalty. |
MM2_A | 2 | ≥ 0 | Minimap2 local alignment match score (defaults to SSW_M). |
MM2_B | 5 | ≥ 0 | Minimap2 local alignment mismatch penalty (defaults to SSW_X). |
MM2_O | 10 | ≥ 0 | Minimap2 alignment gap open penalty (defaults to SSW_O). |
MM2_E | 1 | ≥ 0 | Minimap2 alignment gap extension penalty (defaults to SSW_E). |
REF_SET | $DEF_SET | <FASTA> | Reference set used for the first round of read gathering. Contained in the module's reference folder. |
ASSEM_REF | 0,1 | off,on | Classify REF_SET if needed & use to start final assembly instead of starting from read gathering results. |
SEG_NUMBERS | "" | string | Used for amended consensus renaming, typically module-specific. A comma-delimited list of key:value pairs. Example: "PB2:1,PB1:2,PA:3,HA:4,NP:5,NA:6,MP:7,NS:8" |
RESIDUAL_ASSEMBLY_FACTOR | 0 | integer | Assembles secondary data if observed factor is less than integer RESIDUAL_ASSEMBLY_FACTOR. Leave 0 for off. |
MIN_RP_RESIDUAL | 150 | ≥ 1 | Minimum number of read patterns to continue to attempt residual assembly per gene segment. |
MIN_RC_RESIDUAL | 150 | ≥ 1 | Minimum read count to continue to attempt residual assembly per gene segment. |
DO_SECONDARY | 0, 1 | off, on | Given a successful residual assembly, create a patchwork consensus and run a secondary assembly. |
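Two-stage sorting requires GENE_GROUP and SECONDARY_LABEL_MODULES to list the same number of patterns in the same order. A quick sanity check for a config file might look like the sketch below, which reuses the example values from the table; count_patterns is a hypothetical helper, and the check covers only the count, not the order.

```shell
# Sketch: check that both lists have the same number of comma-separated patterns.
# count_patterns is a hypothetical helper; ordering is not verified.
GENE_GROUP="HA,NA:OG"
SECONDARY_LABEL_MODULES="irma-FLU-HA,irma-FLU-NA:irma-FLU-OG"

count_patterns() {
    # Strip the optional ":<otherwise>" suffix, then count comma-separated items.
    printf '%s\n' "${1%%:*}" | awk -F',' '{ print NF }'
}

n_groups=$(count_patterns "$GENE_GROUP")
n_modules=$(count_patterns "$SECONDARY_LABEL_MODULES")

if [ "$n_groups" -eq "$n_modules" ]; then
    echo "OK: $n_groups patterns and $n_modules modules"
else
    echo "Mismatch: $n_groups patterns but $n_modules modules" >&2
fi
```

Both example lists contain two patterns before the colon, so the check reports OK.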
Miscellaneous
Parameter | Default | Values | Description |
PACKAGED_FASTQ | 0, 1 | off, on | Within the intermediate data subfolder, controls the packaging together of sorted assembly fastq as a "tar.gz" (on) or as separate gzip'd fastq (off). |
USE_IRMA_CORE (experimental) | 0, 1 | off, on | Configuration to run the mergeSAMpairs and fastQ_converter tasks more efficiently. This is currently most helpful for paired-end samples with a higher quantity of reads. The irma-core_<OS> binary must be in the IRMA_RES/scripts folder or in the PATH, otherwise the configuration will be turned back off. |
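The fallback described for USE_IRMA_CORE (look in IRMA_RES/scripts, then in the PATH, otherwise switch the option off) can be sketched as follows. The directory and binary name are placeholders because the real <OS> suffix is not spelled out here, and this is not IRMA's own detection code.

```shell
# Sketch of the opt-in fallback. Both names below are placeholders:
# the install path is hypothetical and "irma-core_OS" stands in for irma-core_<OS>.
USE_IRMA_CORE=1
IRMA_RES_DIR="/path/to/IRMA_RES"
CORE_BIN="irma-core_OS"

if [ -x "$IRMA_RES_DIR/scripts/$CORE_BIN" ] || command -v "$CORE_BIN" >/dev/null 2>&1; then
    echo "irma-core found; leaving USE_IRMA_CORE=$USE_IRMA_CORE"
else
    USE_IRMA_CORE=0
    echo "irma-core not found; USE_IRMA_CORE turned back off"
fi
```

With the placeholder names the binary is not found, so the option ends up off.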