snapatac2.pp.make_fragment_file#
- snapatac2.pp.make_fragment_file(bam_file, output_file, is_paired=True, barcode_tag=None, barcode_regex=None, umi_tag=None, umi_regex=None, shift_left=4, shift_right=-5, min_mapq=30, chunk_size=50000000, chrM=['chrM', 'M'], source=None, compression=None, compression_level=None, tempdir=None)[source]#
Convert a BAM file into a sorted fragment file.
Use this function to create a fragment file from paired-end or single-end BAM alignments before importing the data with
import_fragments. The conversion filters low-quality records, extracts cell barcodes and optional UMIs, removes duplicates within each barcode, applies Tn5 insertion-site shifts for paired-end reads, and writes fragments in barcode-sorted order.The BAM file does not need to be pre-sorted or pre-filtered.
Anti-Patterns#
Do NOT set both
barcode_tagandbarcode_regex; choose exactly one barcode source.Do NOT use this function for Cell Ranger ATAC BAM files unless
source="10x"is appropriate; Cell Ranger ATAC BAM headers can be malformed, and the Cell Ranger fragment file is usually the safer input.Do NOT place
tempdiron a small filesystem for large BAM files; sorting can create large temporary files.
Processing Steps#
Filtering: remove reads that are unmapped, not primary alignment, mapq < 30, fails platform/vendor quality checks, or optical duplicate. For paired-end sequencing, it also removes reads that are not properly aligned.
Deduplicate: Sort the reads by cell barcodes and remove duplicated reads for each unique cell barcode.
Output: Convert BAM records to fragments (if paired-end) or single-end reads.
Notes
When using
barcode_regexorumi_regex, the regular expression must contain exactly one capturing group that matches the barcode or UMI.- type bam_file:
- param bam_file:
Path to the input BAM file.
- type output_file:
- param output_file:
Path to the output fragment file.
- type is_paired:
- param is_paired:
Whether the BAM file contains paired-end reads.
- type barcode_tag:
- param barcode_tag:
Extract barcodes from TAG fields of BAM records, e.g.,
barcode_tag="CB".- type barcode_regex:
- param barcode_regex:
Extract barcodes from read names of BAM records using regular expressions. Reguler expressions should contain exactly one capturing group (Parentheses group the regex between them) that matches the barcodes. For example,
barcode_regex="(..:..:..:..):\w+$"extractsbd:69:Y6:10fromA01535:24:HW2MMDSX2:2:1359:8513:3458:bd:69:Y6:10:TGATAGGTTG. You can test your regex on this website: https://regex101.com/.- type umi_tag:
- param umi_tag:
Extract UMI from TAG fields of BAM records.
- type umi_regex:
- param umi_regex:
Extract UMI from read names of BAM records using regular expressions. See
barcode_regexfor more details.- type shift_left:
- param shift_left:
Insertion site correction for the left end. Note this has no effect on single-end reads.
- type shift_right:
- param shift_right:
Insertion site correction for the right end. Note this has no effect on single-end reads.
- type min_mapq:
- param min_mapq:
Filter the reads based on MAPQ.
- type chunk_size:
- param chunk_size:
The size of data retained in memory when performing sorting. Larger chunk sizes result in faster sorting and greater memory usage.
- type source:
- param source:
The source of the BAM file. This is used for vendor-specific processing. For example, the BAM files generated by 10X Genomics needs special processing to fix the malformed BAM headers. Currently the only supported source is “10x”.
- type chrM:
- param chrM:
A list of mitochondrial chromosome names, used to calculate QC metrics.
- type compression:
- param compression:
Compression type. If
None, it is inferred from the suffix.- type compression_level:
- param compression_level:
Compression level. 1-9 for gzip, 1-22 for zstandard. If
None, it is set to 6 for gzip and 3 for zstandard.- type tempdir:
- param tempdir:
Location to store temporary files. If
None, system temporary directory will be used.- returns:
Dictionary containing sequencing and mapping QC metrics, including:
“sequenced_reads”: number of reads in the input BAM file.
“sequenced_read_pairs”: number of read pairs in the input BAM file.
“frac_duplicates”: Fraction of high-quality read pairs that are deemed to be PCR duplicates. This metric is a measure of sequencing saturation and is a function of library complexity and sequencing depth. More specifically, this is the fraction of high-quality fragments with a valid barcode that align to the same genomic position as another read pair in the library.
“frac_confidently_mapped”: Fraction of sequenced reads or read pairs with mapping quality > 30.
“frac_unmapped”: Fraction of sequenced reads or read pairs that have a valid barcode but could not be mapped to the genome.
“frac_valid_barcode”: Fraction of reads or read pairs with barcodes that match the whitelist after error correction.
“frac_nonnuclear”: Fraction of sequenced read pairs that have a valid barcode and map to non-nuclear genome contigs, including mitochondria, with mapping quality > 30.
“frac_fragment_in_nucleosome_free_region”: Fraction of high-quality fragments smaller than 147 basepairs.
“frac_fragment_flanking_single_nucleosome”: Fraction of high-quality fragments between 147 and 294 basepairs.
- rtype:
See also
Examples
>>> import snapatac2 as snap >>> bam_file = snap.datasets.pbmc500(type="bam") >>> metrics = snap.pp.make_fragment_file( ... bam_file, ... "fragments.tsv.gz", ... barcode_tag="CB", ... source="10x", ... ) >>> "sequenced_reads" in metrics True