Gene quantification

rustar-aligner can produce two flavours of quantification output during alignment, controlled by --quantMode. Both require a GTF file.

--quantMode GeneCounts produces a <prefix>ReadsPerGene.out.tab file with four tab-separated columns: the gene ID, then read counts under unstranded, forward-stranded, and reverse-stranded assumptions. The file starts with four summary rows, followed by one row per gene. The output is identical to STAR’s, so any downstream tool that already consumes STAR’s ReadsPerGene.out.tab (e.g. DESeq2, edgeR, MultiQC) works unchanged.

rustar-aligner \
--genomeDir /path/to/genome_index \
--readFilesIn reads_1.fq.gz reads_2.fq.gz \
--readFilesCommand zcat \
--sjdbGTFfile gencode.v45.gtf \
--quantMode GeneCounts \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix sample_

The resulting sample_ReadsPerGene.out.tab looks like this:

N_unmapped <count> <count> <count>
N_multimapping <count> <count> <count>
N_noFeature <count> <count> <count>
N_ambiguous <count> <count> <count>
ENSG00000000003.15 <count> <count> <count>
ENSG00000000005.6 <count> <count> <count>
...

The first four rows are summary categories (matching STAR / htseq-count semantics). Subsequent rows are per-gene counts. Pick the column that matches your library protocol:

| Column | Strand assumption |
| --- | --- |
| 2 | unstranded |
| 3 | forward / htseq-count -s yes |
| 4 | reverse / htseq-count -s reverse |
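
As a rough illustration of picking a column, the sketch below reads a ReadsPerGene.out.tab-style file and keeps only the counts for one protocol. The names `read_gene_counts` and `STRAND_COLUMN` are hypothetical helpers, not part of rustar-aligner, and the counts are made up for the example.

```python
import csv
import io

# The four summary rows that precede the per-gene rows.
SUMMARY_ROWS = {"N_unmapped", "N_multimapping", "N_noFeature", "N_ambiguous"}

# 0-based index of the count column for each protocol
# (index 0 is the gene ID, so file column 2 is index 1, and so on).
STRAND_COLUMN = {"unstranded": 1, "forward": 2, "reverse": 3}


def read_gene_counts(handle, protocol):
    """Return (summary, genes) dicts holding the chosen strand column."""
    col = STRAND_COLUMN[protocol]
    summary, genes = {}, {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        target = summary if row[0] in SUMMARY_ROWS else genes
        target[row[0]] = int(row[col])
    return summary, genes


# Made-up numbers, for illustration only.
example = io.StringIO(
    "N_unmapped\t10\t10\t10\n"
    "N_multimapping\t5\t5\t5\n"
    "N_noFeature\t20\t60\t25\n"
    "N_ambiguous\t3\t1\t2\n"
    "GENE_A\t100\t4\t96\n"
    "GENE_B\t50\t2\t48\n"
)
summary, genes = read_gene_counts(example, "reverse")
print(genes)
```

For a real run you would pass `open("sample_ReadsPerGene.out.tab")` instead of the in-memory example.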

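If you don’t know your library’s strandedness, the three columns let you infer it: compare the per-gene totals of columns 3 and 4. If one is much larger than the other, the library is stranded in that direction; if they are roughly equal, it is most likely unstranded.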
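
The comparison can be sketched as follows. This is a heuristic under stated assumptions rather than anything rustar-aligner does itself: `infer_strandedness` is a hypothetical helper, the 0.8 threshold is an arbitrary choice, and the counts are invented.

```python
import io

# The four summary rows are excluded; only per-gene rows are compared.
SUMMARY_ROWS = {"N_unmapped", "N_multimapping", "N_noFeature", "N_ambiguous"}


def infer_strandedness(handle, threshold=0.8):
    """Guess the protocol from the forward (col 3) and reverse (col 4) totals."""
    forward = reverse = 0
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or fields[0] in SUMMARY_ROWS:
            continue
        forward += int(fields[2])
        reverse += int(fields[3])
    total = forward + reverse
    if total == 0:
        return "undetermined"
    share = forward / total  # fraction of stranded counts on the forward side
    if share >= threshold:
        return "forward"
    if share <= 1 - threshold:
        return "reverse"
    return "unstranded"


# Made-up counts: column 4 dominates, so this looks reverse-stranded.
example = io.StringIO(
    "N_unmapped\t10\t10\t10\n"
    "GENE_A\t100\t4\t96\n"
    "GENE_B\t50\t2\t48\n"
)
guess = infer_strandedness(example)
print(guess)
```

A result of "unstranded" only means neither direction dominates; a borderline share is worth checking against the library preparation protocol.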

Transcriptome-coordinate SAM (TranscriptomeSAM)

Produces a <prefix>Aligned.toTranscriptome.out.bam file with reads mapped to transcriptome coordinates (one record per matching transcript) instead of genome coordinates. This is the format consumed by RSEM and similar transcript-level quantifiers.

rustar-aligner \
--genomeDir /path/to/genome_index \
--readFilesIn reads_1.fq.gz reads_2.fq.gz \
--readFilesCommand zcat \
--sjdbGTFfile gencode.v45.gtf \
--quantMode TranscriptomeSAM \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix sample_
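
To make "transcriptome coordinates" concrete, here is a minimal sketch of projecting a genomic position onto a transcript, assuming 1-based inclusive exon intervals. `genome_to_transcript` and the two example transcripts are hypothetical; this illustrates the idea and is not rustar-aligner's implementation.

```python
def genome_to_transcript(pos, exons, strand):
    """Map a 1-based genomic position to a 1-based transcript position.

    exons: list of (start, end) 1-based inclusive genomic intervals.
    Returns None when pos lies outside every exon (e.g. in an intron).
    """
    exons = sorted(exons)
    length = sum(end - start + 1 for start, end in exons)
    offset = 0  # transcript bases contributed by exons before the current one
    for start, end in exons:
        if start <= pos <= end:
            plus_pos = offset + (pos - start) + 1
            # On the minus strand the transcript is read from the other end.
            return plus_pos if strand == "+" else length - plus_pos + 1
        offset += end - start + 1
    return None


# Two hypothetical transcripts overlapping the same genomic position.
transcripts = {
    "tx1": ([(100, 199), (300, 399)], "+"),
    "tx2": ([(300, 399)], "+"),
}
# One entry per transcript that contains the position.
hits = {
    name: genome_to_transcript(305, exons, strand)
    for name, (exons, strand) in transcripts.items()
}
print(hits)
```

A position inside both transcripts yields one coordinate per transcript, which mirrors the "one record per matching transcript" behaviour described above.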

--quantTranscriptomeSAMoutput controls what’s allowed in the transcriptome BAM:

| Value | Meaning |
| --- | --- |
| BanSingleEnd_BanIndels_ExtendSoftclip | Default. RSEM-compatible: drop unpaired records, drop reads with indels, extend soft-clips into matches. |
| BanSingleEnd_ExtendSoftclip | Keep indels, still extend soft-clips. |
| BanSingleEnd | Keep indels and soft-clips as-is. |

Pick the variant that matches your downstream tool’s expectations. For RSEM, keep the default, BanSingleEnd_BanIndels_ExtendSoftclip.
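
The effect of the three values on a single alignment can be sketched at the CIGAR level. This is a simplified illustration, not rustar-aligner's code: `apply_policy` is a hypothetical helper, and the BanSingleEnd part (dropping unpaired records) is left out because it depends on the mate rather than on the CIGAR.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def apply_policy(cigar, policy):
    """Return the CIGAR to emit under `policy`, or None if the read is dropped."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    flags = policy.split("_")
    # BanIndels: any insertion or deletion removes the read entirely.
    if "BanIndels" in flags and any(op in "ID" for _, op in ops):
        return None
    # ExtendSoftclip: treat soft-clipped bases as aligned matches.
    if "ExtendSoftclip" in flags:
        ops = [(n, "M" if op == "S" else op) for n, op in ops]
        merged = []  # merge adjacent runs, e.g. 5M + 90M -> 95M
        for n, op in ops:
            if merged and merged[-1][1] == op:
                merged[-1] = (merged[-1][0] + n, op)
            else:
                merged.append((n, op))
        ops = merged
    return "".join(f"{n}{op}" for n, op in ops)


for policy in (
    "BanSingleEnd_BanIndels_ExtendSoftclip",
    "BanSingleEnd_ExtendSoftclip",
    "BanSingleEnd",
):
    print(policy, apply_policy("5S45M2I48M", policy))
```

The same read is dropped, rewritten, or passed through untouched depending on the value, which is why the choice has to match what the downstream quantifier accepts.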

You can request both modes in the same run:

--quantMode GeneCounts TranscriptomeSAM

This emits ReadsPerGene.out.tab and Aligned.toTranscriptome.out.bam in addition to the normal alignment output.

For best speed, supply --sjdbGTFfile when building the index (--runMode genomeGenerate). The transcript-level data structures are then persisted in the genome directory, so at alignment time you only need --quantMode TranscriptomeSAM (or GeneCounts); rustar-aligner reuses the persisted annotations.

If you supply --sjdbGTFfile only at alignment time, transcript info is rebuilt on the fly each run. This works but adds startup cost.
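
The two-step workflow might look like the sketch below, which only builds and prints the command lines rather than running them. Treat the details as assumptions: `--genomeFastaFiles` is STAR's flag name and is assumed to carry over to rustar-aligner, `genome.fa` is a placeholder file name, and `align_cmd` is a hypothetical helper.

```python
import shlex

GENOME_DIR = "/path/to/genome_index"

# Step 1 (once): build the index with the annotation included.
# ASSUMPTION: --genomeFastaFiles is borrowed from STAR; genome.fa is a placeholder.
generate_cmd = [
    "rustar-aligner",
    "--runMode", "genomeGenerate",
    "--genomeDir", GENOME_DIR,
    "--genomeFastaFiles", "genome.fa",
    "--sjdbGTFfile", "gencode.v45.gtf",
]


def align_cmd(prefix, reads):
    """Step 2 (per sample): align and quantify without passing the GTF again."""
    return [
        "rustar-aligner",
        "--genomeDir", GENOME_DIR,
        "--readFilesIn", *reads,
        "--readFilesCommand", "zcat",
        "--quantMode", "GeneCounts", "TranscriptomeSAM",
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", prefix,
    ]


sample_cmd = align_cmd("sample_", ["reads_1.fq.gz", "reads_2.fq.gz"])
print(shlex.join(generate_cmd))
print(shlex.join(sample_cmd))
```

The annotation is parsed once at index time, so every later alignment run skips that startup cost.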

For --quantMode GeneCounts, all three strand columns are written regardless of library protocol — pick the right one downstream.

For --quantMode TranscriptomeSAM, the orientation is inferred from the transcript annotation in the GTF; no explicit strand parameter is needed.