Tutorial¶
Sample FASTQ files can be found under the test folder. Please note these FASTQ are only for testing purposes. For the full FASTQs used in our paper, please download the data from the NCBI Sequence Read Archive (SRA; https://www.ncbi.nlm.nih.gov/sra/) under access numbers SRP140497 and SRP141184.
In order to create consensus sequences, we first need to process FASTQ files into BAM files.
FASTQs to BAMs¶
Given FASTQs as input files, fastq2bam
mode removes the spacer region and extracts the
barcode tag from each sequencing read into the header with extract_barcode.py. The tag
removed FASTQs are then aligned with BWA mem into BAM files (Arguments can be provided in
the config.ini file or as command-line arguments. Please note command-line arguments
over-writes config.ini arguments).
REPO="[insert path to ConsensusCruncher repo]"
BWAPATH="[insert path to BWA]"
BWAINDEX="[insert path to BWA INDEX]"
BWAPATH="[insert path to SAMTOOLS]"
python ConsensusCruncher.py fastq2bam --fastq1 $REPO/test/fastq/LargeMid_56_L005_R1.fastq
--FASTQ2 $REPO/test/fastq/LargeMid_56_L005_R2.fastq -o $REPO/test -b $BWAPATH -r $BWAIndex
-s $SAMTOOLS -bpattern NNT
In the sample dataset, we utilized 2-bp (NN) barcodes and 1-bp (T) spacers. While the barcodes for each read can be one of 16 possible combinations (4^2), the spacer is an invariant “T” base used to ligate barcodes onto each end of a DNA fragment. Thus, a spacer filter is imposed to remove faulty reads. Barcodes from read 1 and read 2 are extracted and combined together before being added to the header.
READ FROM SEQUENCER
Read1:
@HWI-D00331:196:C900FANXX:5:1101:1332:2193 1:N:0:ACGTCACA [<-- HEADER]
ATTAAGCCCCAGGCAGTTGCTAATGATGGGAGCTTAGTGCACAAGGGCTGGGCCTCCCTCTTGGAGCTGAACATTGTTTCTTGGGGACGGCTGTGCCCACCTCAGCGGGGAGGCAAGGATTAAATC [<-- SEQUENCE]
+
BCCCCGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGEGGGGGBGGGGGGGGGGGGGGGGGGGGGGGEGG1:FGFGGGGGGGGG/CB>DG@GGGGGGG<DGGGGAAGGEGGB>DGGGEGGG/@G [<-- QUALITY SCORE]
Read2:
@HWI-D00331:196:C900FANXX:5:1101:1332:2193 2:N:0:ACGTCACA
GGTGGGCTCCAGCCCTGATTTCCTCCCCCAGCCCTGCAGGGCTCAGGTCCAGAGGACACAAGTTTAACTTGCGGGTGGTCACTTGCCTCGTGCGGTGACGCCATGGTGCCCTCTCTGTGCAGCGCA
+
BBBBCGGGGEGGGGFGGGGGGGGGGGGGGGGGGGGGGB:FCGGGGGGGGGGEGGGGGGGG=FCGG:@GGGEGBGGGAGFGDE@FGGGGGFGFGEGDGGGFCGGDEBGGGGGGGEG=EGGGEEGGG#
------
AFTER BARCODE EXTRACTION AND SPACER ("T") REMOVAL
Read1:
@HWI-D00331:196:C900FANXX:5:1101:1332:2193|AT.GG/1
AAGCCCCAGGCAGTTGCTAATGATGGGAGCTTAGTGCACAAGGGCTGGGCCTCCCTCTTGGAGCTGAACATTGTTTCTTGGGGACGGCTGTGCCCACCTCAGCGGGGAGGCAAGGATTAAATC
+
CCGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGEGGGGGBGGGGGGGGGGGGGGGGGGGGGGGEGG1:FGFGGGGGGGGG/CB>DG@GGGGGGG<DGGGGAAGGEGGB>DGGGEGGG/@G
Read2:
@HWI-D00331:196:C900FANXX:5:1101:1332:2193|AT.GG/2
GGGCTCCAGCCCTGATTTCCTCCCCCAGCCCTGCAGGGCTCAGGTCCAGAGGACACAAGTTTAACTTGCGGGTGGTCACTTGCCTCGTGCGGTGACGCCATGGTGCCCTCTCTGTGCAGCGCA
+
BCGGGGEGGGGFGGGGGGGGGGGGGGGGGGGGGGB:FCGGGGGGGGGGEGGGGGGGG=FCGG:@GGGEGBGGGAGFGDE@FGGGGGFGFGEGDGGGFCGGDEBGGGGGGGEG=EGGGEEGGG#
FASTQ files with extracted barcodes are placed in the fastq_tag directory and are subsequently aligned with BWA to generate BAMs in the bamfiles folder.
.
├── bamfiles
├── fastq
├── fastq_tag
└── qsub
ConsensusCruncher¶
consensus
mode creates a consensus directory and folders for each bam file. BAM files
undergo consensus construction through the workflow illustrated above. Output BAMs are
grouped according to type of error suppression (SSCS vs DCS) and whether Singleton
Correction (SC) was implemented.
.
├── bamfiles
├── consensus
│ ├── LargeMid_56_L005
│ │ ├── dcs
│ │ ├── dcs_SC
│ │ ├── sscs
│ │ └── sscs_SC
...
│ ├── LargeMid_62_L006
│ │ ├── dcs
│ │ ├── dcs_SC
│ │ ├── sscs
│ │ └── sscs_SC
│ └── qsub
├── fastq
├── fastq_tag
└── qsub
Within a sample directory (e.g. LargeMid_56_L005), you will find the following files:
Please note the example below is for illustrative purposes only, as sample names and index files were removed for simplification. Order of directories and files were also altered to improve comprehension.
. Filetype
├── sscs
│ ├── badReads.bam Reads that are unmapped or have multiple alignments
│ ├── sscs.sorted.bam Single-Strand Consensus Sequences (SSCS)
│ ├── singleton.sorted.bam Single reads (Singleton) that cannot form SSCSs
├── sscs_SC
| ├── singleton.rescue.sorted.bam Singleton correction (SC) with complementary singletons
| ├── sscs.rescue.sorted.bam SC with complementary SSCSs
| ├── sscs.sc.sorted.bam SSCS combined with corrected singletons (from both rescue strategies) [*]
| ├── rescue.remaining.sorted.bam Singletons that could not be corrected
| ├── all.unique.sscs.sorted.bam SSCS + SC + remaining (uncorrected) singletons
├── dcs
│ ├── dcs.sorted.bam Duplex Consensus Sequence (DCS)
│ ├── sscs.singleton.sorted.bam SSCSs that could not form DCSs as complementary strand was missing
├── dcs_SC
│ ├── dcs.sc.sorted.bam DCS generated from SSCS + SC [*]
│ ├── sscs.sc.singleton.sorted.bam SSCS + SC that could not form DCSs
│ ├── all.unique.dcs.sorted.bam DCS (from SSCS + SC) + SSCS_SC_Singletons + remaining singletons
├── read_families.txt Family size and frequency
├── stats.txt Consensus sequence formation metrics
├── tag_fam_size.png Distribution of reads across family size
└── time_tracker.txt Time log
Through each stage of consensus formation, duplicate reads are collapsed together and single reads are written as separate files. This allows rentention of all unique molecules, while providing users with easy data management for cross-comparisons between error suppression strategies.
To simplify analyses, it would be good to focus on SSCS+SC (“sscs.sc.sorted.bam”) and DCS+SC (“dcs.sc.sorted.bam”) as highlighted above with [*].
Within the stats file you should expect to see the following (Please note as this is a test dataset, the number of consensus reads is very low):
# === SSCS ===
Uncollapsed - Total reads: 19563
Uncollapsed - Unmapped reads: 17
Uncollapsed - Secondary/Supplementary reads: 24
SSCS reads: 0
Singletons: 19522
Bad spacers: 0
# QC: Total uncollapsed reads should be equivalent to mapped reads in bam file.
Total uncollapsed reads: 19563
Total mapped reads in bam file: 19563
QC: check dictionaries to see if there are any remaining reads
=== pair_dict remaining ===
=== read_dict remaining ===
=== csn_pair_dict remaining ===
0.02919737100601196
# === DCS ===
SSCS - Total reads: 0
SSCS - Unmapped reads: 0
SSCS - Secondary/Supplementary reads: 0
DCS reads: 0
SSCS singletons: 0
# === Singleton Correction ===
Total singletons: 19522
Singleton Correction by SSCS: 0
% Singleton Correction by SSCS: 0.0
Singleton Correction by Singletons: 4
% Singleton Correction by Singletons : 0.020489703923778302
Uncorrected Singletons: 19518
0.020557292302449546
# === DCS - Singleton Correction ===
SSCS SC - Total reads: 4
SSCS SC - Unmapped reads: 0
SSCS SC - Secondary/Supplementary reads: 0
DCS SC reads: 2
SSCS SC singletons: 0