Quick start guide¶
Installation¶
Download the latest release or clone the repository:
$ git clone https://github.com/pughlab/ConsensusCruncher.git
Install the dependencies:
All required python libraries can be installed by running:
$ pip install -r requirements.txt
Configuration¶
Set up config.ini
with the appropriate configurations for fastq2bam
and consensus
modes.
Alternatively, you can provide command-line arguments, which will overwrite config.ini
parameters.
Example config file:
[fastq2bam]
fastq1 = # Path to FASTQ containing Read 1 of paired-end reads. [MANDATORY]
fastq2 = # Path to FASTQ containing Read 2 of paired-end reads. [MANDATORY]
output = # Output directory, where barcode extracted FASTQ and
# BAM files will be placed in subdirectories 'fastq_tag'
# and 'bamfiles' respectively (dir will be created if
# they do not exist). [MANDATORY]
name = # Sample name extracted from filename using read number delimiter
# (e.g. Sample1_R1.fastq, delimiter = '_R', sample name = "Sample1"). [MANDATORY]
bwa = # Path to executable BWA. [MANDATORY]
ref = # Path to reference (BWA index). [MANDATORY]
samtools = # Path to executable samtools. [MANDATORY]
bpattern = # Barcode pattern (N = random barcode bases, A|C|G|T = fixed spacer bases). [MANDATORY]
blist = # List of barcodes (Text file with unique barcodes on each line). [MANDATORY]
# Note: You can input either a barcode list or barcode pattern or both.
# If both are provided, barcodes will first be matched with the list and then the constant
# spacer bases will be removed before the barcode is added to the header.
# e.g. ATNNGT means 2-bp barcode is flanked by two spacers matching 'AT' in front and
# 'GT' behind.
[consensus]
bam = # Input BAM file with barcodes in header. [MANDATORY]
c_output = # Output directory for consensus sequences. [MANDATORY]
samtools = # Path to executable samtools. [MANDATORY]
# Optional arguments
scorrect = # Singleton correction [True or False, default: True].
genome = # Genome version to determine which cytoband file to use for data splitting
# (hg19 or hg38, default: hg19)
bedfile = # Bedfile, default: cytoBand.txt. WARNING: It is HIGHLY RECOMMENDED that you
# use the provided cytoBand.txt unless you're working with genome build that
# is not hg19 or hg38. Then a separate bedfile is needed for data segmentation.
# For small BAM files, you may choose to turn off data splitting with '-b False'
# and process everything all at once (Division of data is only required for
# large data sets to offload the memory burden).
cutoff = # Consensus cut-off, default: 0.7
# (70% of reads must have the same base to form a consensus).
bdelim = # Delimiter before barcode in read name
# (e.g. '|' in 'HWI-D00331:196:C900FANXX:7:1110:14056:43945|TTTT')
cleanup = # Remove intermediate files. [True or False]
UMI list example:
TATGCGT
AACGGAT
AAGCCT
AATCGCT
ACCGAT
ACGCAAT
ACGTGT
AGGACAT
ATGTCCT
CACAGT
Running ConsensusCruncher¶
ConsensusCruncher has 2 modes:
Extract UMIs & align FASTQs¶
fastq2bam
extracts UMIs and aligns FASTQs with BWA to generate BAM files.Run
ConsensusCruncher.py [-c CONFIG] fastq2bam
with required input parameters (seeconfig.ini
or--help
for examples).This script extracts molecular barcode tags and removes spacers from unzipped FASTQ files found in the input directory (file names must contain “R1” or “R2”). Barcode extracted FASTQ files are written to the ‘fastq_tag’ directory and are subsequently aligned with BWA mem. Bamfiles are written to the ‘bamfile” directory under the project folder.
Error suppression¶
consensus
generates single-strand and duplex consensus sequences with ‘Singleton Correction’Run
ConsensusCruncher.py [-c CONFIG] consensus
with the required input parameters (seeconfig.ini
or--help
for examples).Using unique molecular identifiers (UMIs), duplicate reads from the same molecule are amalgamated into single-strand consensus sequences (SSCS). If complementary strands are present, SSCSs can be subsequently combined to form duplex consensus sequences (DCS).
If ‘Singleton Correction’ (SC) is enabled, single reads (singletons) can be error suppressed using complementary strand. These corrected singletons can be merged with SSCSs to be further collapsed into DCSs + SC.
Finally, a BAM file containing only unique molecules (i.e. no duplicates) is created by merging DCSs, remaining SSCSs (those that could not form DCSs), and remaining singletons (those that could not be corrected).
Multiple FASTQs¶
ConsensusCruncher.py
processes one sample (2 paired-end FASTQ files or 1 BAM file) at a time.
A sample script to generate shell scripts for multiple samples is provided here.
The script generator will create sh scripts for each file in a fastq directory.
- The following parameters need to be changed in the config file: name, bwa, ref, samtools, bpattern (alternatively if a barcode list is used instead, remove bpattern and add blist as parameter). Please note: fastq1, fastq2, output, bam, and c_output can be ignored as those will be updated using the generate_scripts.sh file.
- Update generate_scripts.sh with input, output, and code_dir.
- Run generate_scripts.sh to create sh files and then run those scripts.
Output¶
Please note the example below is for illustrative purposes only, as sample names and index files were removed for simplification. Order of directories and files were also altered to improve comprehension.
. Filetype
├── sscs
│ ├── badReads.bam Reads that are unmapped or have multiple alignments
│ ├── sscs.sorted.bam Single-Strand Consensus Sequences (SSCS)
│ ├── singleton.sorted.bam Single reads (Singleton) that cannot form SSCSs
├── sscs_SC
| ├── singleton.rescue.sorted.bam Singleton correction (SC) with complementary singletons
| ├── sscs.rescue.sorted.bam SC with complementary SSCSs
| ├── sscs.sc.sorted.bam SSCS combined with corrected singletons (from both rescue strategies) [*]
| ├── rescue.remaining.sorted.bam Singletons that could not be corrected
| ├── all.unique.sscs.sorted.bam SSCS + SC + remaining (uncorrected) singletons
├── dcs
│ ├── dcs.sorted.bam Duplex Consensus Sequence (DCS)
│ ├── sscs.singleton.sorted.bam SSCSs that could not form DCSs as complementary strand was missing
├── dcs_SC
│ ├── dcs.sc.sorted.bam DCS generated from SSCS + SC [*]
│ ├── sscs.sc.singleton.sorted.bam SSCS + SC that could not form DCSs
│ ├── all.unique.dcs.sorted.bam DCS (from SSCS + SC) + SSCS_SC_Singletons + remaining singletons
├── read_families.txt Family size and frequency
├── stats.txt Consensus sequence formation metrics
├── tag_fam_size.png Distribution of reads across family size
└── time_tracker.txt Time log
Through each stage of consensus formation, duplicate reads are collapsed together and single reads are written as separate files. This allows rentention of all unique molecules, while providing users with easy data management for cross-comparisons between error suppression strategies.
To simplify analyses, it would be good to focus on SSCS+SC (“sscs.sc.sorted.bam”) and DCS+SC (“dcs.sc.sorted.bam”) as highlighted above with [*].