snapatac2.pp.make_fragment_file#

snapatac2.pp.make_fragment_file(bam_file, output_file, is_paired=True, barcode_tag=None, barcode_regex=None, umi_tag=None, umi_regex=None, shift_left=4, shift_right=-5, min_mapq=30, chunk_size=50000000, chrM=['chrM', 'M'], source=None, compression=None, compression_level=None, tempdir=None)[source]#

Convert a BAM file into a sorted fragment file.

Use this function to create a fragment file from paired-end or single-end BAM alignments before importing the data with import_fragments. The conversion filters low-quality records, extracts cell barcodes and optional UMIs, removes duplicates within each barcode, applies Tn5 insertion-site shifts for paired-end reads, and writes fragments in barcode-sorted order.

The BAM file does not need to be pre-sorted or pre-filtered.

Anti-Patterns#

  • Do NOT set both barcode_tag and barcode_regex; choose exactly one barcode source.

  • Do NOT use this function for Cell Ranger ATAC BAM files unless source="10x" is appropriate; Cell Ranger ATAC BAM headers can be malformed, and the Cell Ranger fragment file is usually the safer input.

  • Do NOT place tempdir on a small filesystem for large BAM files; sorting can create large temporary files.

Processing Steps#

  1. Filtering: remove reads that are unmapped, not primary alignment, mapq < 30, fails platform/vendor quality checks, or optical duplicate. For paired-end sequencing, it also removes reads that are not properly aligned.

  2. Deduplicate: Sort the reads by cell barcodes and remove duplicated reads for each unique cell barcode.

  3. Output: Convert BAM records to fragments (if paired-end) or single-end reads.

Notes

When using barcode_regex or umi_regex, the regular expression must contain exactly one capturing group that matches the barcode or UMI.

type bam_file:

Path

param bam_file:

Path to the input BAM file.

type output_file:

Path

param output_file:

Path to the output fragment file.

type is_paired:

bool

param is_paired:

Whether the BAM file contains paired-end reads.

type barcode_tag:

str | None

param barcode_tag:

Extract barcodes from TAG fields of BAM records, e.g., barcode_tag="CB".

type barcode_regex:

str | None

param barcode_regex:

Extract barcodes from read names of BAM records using regular expressions. Reguler expressions should contain exactly one capturing group (Parentheses group the regex between them) that matches the barcodes. For example, barcode_regex="(..:..:..:..):\w+$" extracts bd:69:Y6:10 from A01535:24:HW2MMDSX2:2:1359:8513:3458:bd:69:Y6:10:TGATAGGTTG. You can test your regex on this website: https://regex101.com/.

type umi_tag:

str | None

param umi_tag:

Extract UMI from TAG fields of BAM records.

type umi_regex:

str | None

param umi_regex:

Extract UMI from read names of BAM records using regular expressions. See barcode_regex for more details.

type shift_left:

int

param shift_left:

Insertion site correction for the left end. Note this has no effect on single-end reads.

type shift_right:

int

param shift_right:

Insertion site correction for the right end. Note this has no effect on single-end reads.

type min_mapq:

int | None

param min_mapq:

Filter the reads based on MAPQ.

type chunk_size:

int

param chunk_size:

The size of data retained in memory when performing sorting. Larger chunk sizes result in faster sorting and greater memory usage.

type source:

Optional[Literal['10x']]

param source:

The source of the BAM file. This is used for vendor-specific processing. For example, the BAM files generated by 10X Genomics needs special processing to fix the malformed BAM headers. Currently the only supported source is “10x”.

type chrM:

list[str] | None

param chrM:

A list of mitochondrial chromosome names, used to calculate QC metrics.

type compression:

Optional[Literal['gzip', 'zstandard']]

param compression:

Compression type. If None, it is inferred from the suffix.

type compression_level:

int | None

param compression_level:

Compression level. 1-9 for gzip, 1-22 for zstandard. If None, it is set to 6 for gzip and 3 for zstandard.

type tempdir:

Path | None

param tempdir:

Location to store temporary files. If None, system temporary directory will be used.

returns:

Dictionary containing sequencing and mapping QC metrics, including:

  • “sequenced_reads”: number of reads in the input BAM file.

  • “sequenced_read_pairs”: number of read pairs in the input BAM file.

  • “frac_duplicates”: Fraction of high-quality read pairs that are deemed to be PCR duplicates. This metric is a measure of sequencing saturation and is a function of library complexity and sequencing depth. More specifically, this is the fraction of high-quality fragments with a valid barcode that align to the same genomic position as another read pair in the library.

  • “frac_confidently_mapped”: Fraction of sequenced reads or read pairs with mapping quality > 30.

  • “frac_unmapped”: Fraction of sequenced reads or read pairs that have a valid barcode but could not be mapped to the genome.

  • “frac_valid_barcode”: Fraction of reads or read pairs with barcodes that match the whitelist after error correction.

  • “frac_nonnuclear”: Fraction of sequenced read pairs that have a valid barcode and map to non-nuclear genome contigs, including mitochondria, with mapping quality > 30.

  • “frac_fragment_in_nucleosome_free_region”: Fraction of high-quality fragments smaller than 147 basepairs.

  • “frac_fragment_flanking_single_nucleosome”: Fraction of high-quality fragments between 147 and 294 basepairs.

rtype:

dict[str, float]

Examples

>>> import snapatac2 as snap
>>> bam_file = snap.datasets.pbmc500(type="bam")
>>> metrics = snap.pp.make_fragment_file(
...     bam_file,
...     "fragments.tsv.gz",
...     barcode_tag="CB",
...     source="10x",
... )
>>> "sequenced_reads" in metrics
True