Kai Ye (EMBL/EBI), Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads
Uses a pattern growth algorithm for string matching, a simple standard pattern detection method, to detect minimal/maximal substrings. To detect deletions from mapped paired-end reads, the method focuses on pairs where one read cannot be mapped. The unaligned read is broken into two parts that are aligned independently: the first part within a certain range of the anchor (the mapped read of the pair; the range is usually 2 x the average distance), the second part within the maximal expected deletion size (usually between 1 and 1 million bases).
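The split-read idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it uses naive exact substring search instead of Pindel's pattern growth algorithm, ignores sequencing errors and strand, and all names (`split_read_deletions`, `min_part`, the toy sequences) are made up for the example.

```python
def find_all(text, pattern, start, end):
    """Yield every start index of pattern lying fully inside text[start:end]."""
    i = text.find(pattern, start, end)
    while i != -1:
        yield i
        i = text.find(pattern, i + 1, end)


def split_read_deletions(ref, read, win_start, win_end, max_del, min_part=8):
    """Return sorted (deletion_start, deletion_length) candidates.

    The unmapped read is split at every position k. The left part must match
    inside the anchor window [win_start, win_end); the right part must match
    downstream, leaving a gap of 1..max_del reference bases between the parts.
    """
    calls = set()
    for k in range(min_part, len(read) - min_part + 1):
        left, right = read[:k], read[k:]
        for p in find_all(ref, left, win_start, win_end):
            del_start = p + k  # first reference base missing from the read
            search_end = min(len(ref), del_start + max_del + len(right))
            for q in find_all(ref, right, del_start + 1, search_end):
                calls.add((del_start, q - del_start))
    return sorted(calls)


# Toy example: the donor lacks a 10bp stretch that is present in the reference.
left_flank = "ACGTTGCAAGGCTTAACCGG"
deleted = "TTTTTTTTTT"
right_flank = "CATGCATGGATCCGATAGCT"
ref = left_flank + deleted + right_flank
donor = left_flank + right_flank
read = donor[5:35]  # a 30bp read spanning the break point

calls = split_read_deletions(ref, read, 0, len(ref), max_del=100)
print(calls)
```

The two window arguments mirror the two search ranges from the talk: one around the anchor for the first part of the read, and one bounded by the maximal expected deletion size for the second part.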
Simulation of deletions on chrX using 20 instances of different deletion sizes, with 36bp reads at different coverage. Recovery depends on the expected deletion size parameter: about 80% at 10bp, decreasing gradually towards 100k and 1M bp, where the FDR is high. Can be improved by longer read lengths.
Real data sample (4 billion paired 35bp reads, 40X) with 270,000 1-16bp indels (146k deletions), overlapping around 90% with the results of the Bentley et al. paper. 162k deletions between 1bp and 10kb. Peaks in a chart of deletion size vs frequency tend to be repeat elements (SINEs, LINEs).
Runtime and memory requirements scale with the maximum deletion size. The real data example required 1.5 GB of memory. Pindel is available at http://www.ebi.ac.uk/~kye/pindel/
Fiona Hyland (Applied Biosystems) on Next generation algorithms: detection of SNPs, InDels and CNV in massively parallel short-read oligonucleotide ligation sequencing
Starts with a general overview of the SOLiD system (improved accuracy due to reading each base twice, which allows errors to be distinguished from variation). Includes the mandatory comparisons to the Illumina platform.
How to detect…
- SNPs: diBayes (PDF), a Bayesian algorithm incorporating all measurable errors of the system as well as the population polymorphism rate (based on work from Gabor Marth). Also has a number of empirical detection filters (minimum coverage, allele ratio etc.) to combine with the statistical model, which can be set in different combinations.
61% of SNPs called at 29X, 40% of the heterozygotes at 5X. Concordance with HapMap genotypes, full discovery of homozygotes, 99% of heterozygotes at 27X. FDR remains low at 5X (0.0005)
- Insertions/Deletions: a method similar to the previous talk on pattern growth, using the mapped read of a paired-end read as an anchor followed by a gapped alignment of the second read within a given window size.
Empirical constraints (75% of reads must have consistent InDel size, compatible color space). Tested so far on fragment data, 96% of small InDels confirmed by Sanger sequencing, results published in Genome Research
- CNVs: Calculates a local score per window without a paired sample (observed / average vs expected / global coverage). Normalizing for mappability (which depends on the length of the reads and the distance between them) and for GC content are crucial steps. HMMs with copy number as states (fixed number), with set initial state and transition probabilities; post-processing includes merging adjacent windows with similar CNVs.
Decent overlap with the Toronto DB
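To make the CNV step more concrete, here is a minimal sketch of an HMM with copy number as hidden state, decoded with Viterbi and followed by merging adjacent windows. It is not the Applied Biosystems implementation: the Poisson emission model, the uniform start probabilities, the `stay` probability and all function names are assumptions for illustration, and the window counts are assumed to be already normalized for mappability and GC content.

```python
import math


def poisson_logpmf(k, lam):
    """Log probability of observing k reads when lam are expected."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def viterbi_copy_number(counts, depth_per_copy, states=(0, 1, 2, 3, 4),
                        stay=0.99, floor=0.5):
    """Most likely copy number per window (Viterbi decoding)."""
    n = len(states)
    switch = (1.0 - stay) / (n - 1)
    log_trans = [[math.log(stay if i == j else switch) for j in range(n)]
                 for i in range(n)]

    def emit(count, s):
        # Expected reads scale with copy number; floor keeps copy number 0 valid.
        return poisson_logpmf(count, max(states[s] * depth_per_copy, floor))

    score = [math.log(1.0 / n) + emit(counts[0], s) for s in range(n)]
    back = []
    for c in counts[1:]:
        prev, score, pointers = score, [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + log_trans[i][j])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][j] + emit(c, j))
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [states[s] for s in path]


def merge_segments(calls):
    """Merge adjacent windows with equal copy number into (first, last, cn)."""
    segments = []
    for i, cn in enumerate(calls):
        if segments and segments[-1][2] == cn:
            segments[-1] = (segments[-1][0], i, cn)
        else:
            segments.append((i, i, cn))
    return segments


# Toy read counts per window: normal, heterozygous deletion, normal, duplication.
counts = [30, 28, 33, 29, 31, 15, 14, 17, 16, 13, 15,
          30, 32, 27, 29, 44, 47, 43, 46, 45]
segments = merge_segments(viterbi_copy_number(counts, depth_per_copy=15))
print(segments)
```

A high `stay` probability is what keeps single noisy windows from being called as separate events; lowering it makes the segmentation more sensitive and more fragmented.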
Paul Medvedev (Toronto) on detecting CNVs with mated short reads
Standard input/output (reference genome, sequence, CNV annotations in reference genome). Workflow includes:
- Capture reference adjacencies in a ‘donor graph’ (regions that are similar to each other), representing the reference genome as a walk through the graph
- Capture donor adjacencies in the same graph. Mate pairs spanning the break points result in discordant mate pairs (similar mapped distance and location). The algorithm detects these clusters, adding an edge to the graph for each such cluster to indicate adjacency
- Identify the correct walk through the graph for the donor using coverage information
Uses a probabilistic model to match network walks (traversal counts) with the depth of coverage. Results in improved sensitivity in repeat regions / segmental duplications; claims better resolution compared to depth-of-coverage methods as mappings do not have to be unique.
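The clustering of discordant mate pairs from the workflow above can be sketched as follows. This is a simple greedy grouping written for illustration, not the algorithm from the talk; the 3 SD cutoff, the `max_shift` tolerance, `min_support` and the function name are all assumptions.

```python
def cluster_discordant_pairs(pairs, mean_insert, sd_insert, max_shift,
                             min_support=2):
    """Group discordant mate pairs with similar mapped location and distance.

    pairs: iterable of (left_pos, right_pos) mapped positions on the reference.
    A pair is discordant if its mapped distance deviates from the library
    mean by more than 3 standard deviations.
    """
    cutoff = 3 * sd_insert
    discordant = sorted(
        p for p in pairs if abs((p[1] - p[0]) - mean_insert) > cutoff
    )
    clusters = []
    for left, right in discordant:
        for cluster in clusters:
            last_left, last_right = cluster[-1]
            if (abs(left - last_left) <= max_shift
                    and abs(right - last_right) <= max_shift):
                cluster.append((left, right))
                break
        else:
            clusters.append([(left, right)])
    # Clusters with too little support are treated as noise.
    return [c for c in clusters if len(c) >= min_support]


pairs = [(100, 300), (110, 312), (1000, 2200), (1010, 2215),
         (1025, 2230), (5000, 9000), (120, 318)]
clusters = cluster_discordant_pairs(pairs, mean_insert=200, sd_insert=10,
                                    max_shift=50)

# One candidate donor adjacency per cluster: innermost left and right positions.
donor_edges = [(max(l for l, _ in c), min(r for _, r in c)) for c in clusters]
print(clusters, donor_edges)
```

Each entry in `donor_edges` stands in for the edge that would be added to the donor graph; choosing the correct walk through the graph from coverage is a separate step not shown here.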
Weldon Whitener (): a method for detecting small-scale human microsatellite length-polymorphisms using Solexa/Illumina paired-end sequencing data
Bridging reads that are uniquely anchored on either side of a repeat locus; using hanging reads (one paired-end read in the non-unique region). Model assumes normal distribution of reads and an upward bias of bridging reads.
Shift of bridging read distributions to the left or right (insertion vs deletion). As the number of bridging reads decreases the number of hanging reads increases.
Test set at 40X, 1106 loci with triple repeats, 90% of reference repeats called correctly. For other libraries an empirical distribution might be required as they are bimodal/non-normal, conditioned on the distribution across the reference length. Heterozygous calls are a source of error in the exact indel length calls. Might be fixable by sampling each allele at 0.5 probability.
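The shift of the bridging read distribution can be turned into a rough length estimate. The sketch below assumes a normal library distribution with known mean and SD and deliberately ignores the upward bias of bridging reads mentioned in the talk; the function name and the sign convention (positive shift means the donor repeat is shorter than the reference) are choices made for this example.

```python
import math
import statistics


def estimate_length_change(spans, lib_mean, lib_sd):
    """Estimate the shift of bridging pairs at one locus.

    spans: mapped distances (on the reference) of pairs bridging the repeat.
    Returns (shift, z): shift > 0 suggests a deletion in the donor,
    shift < 0 an insertion; z is the shift in units of its standard error.
    """
    n = len(spans)
    shift = statistics.mean(spans) - lib_mean
    stderr = lib_sd / math.sqrt(n)  # the mean of n pairs is n times less variable
    return shift, shift / stderr


shift, z = estimate_length_change([205, 207, 206, 206], lib_mean=200, lib_sd=10)
print(shift, z)
```

The standard error shrinks with the square root of the number of bridging pairs, which is why higher coverage makes smaller length changes callable.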
Seunghak Lee (Toronto) on detecting small indels from clone-end sequencing with mixtures of distributions
The difference in mapped distance between mate pairs is used to detect inserts; small indels cannot be detected this way (those < 3 STD) and are usually discarded as potential noise. Suggests using all mate pairs rather than discarding those too close to the insert size.
In haploid cases create a distribution of mapped distances in each mapped pair cluster. The distribution shifts if there’s an indel (again, see previous talk). Uses the Central Limit Theorem. Should be interesting for the biostats folks back home, but I’m not even going to try to explain the idea.
In diploid cases the comparison is between a diploid donor genome and the haploid reference genome, which results in two distributions for heterozygous indels. Uses an EM algorithm to detect indels based on the means of the distributions.
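A minimal sketch of the EM idea for the diploid case, assuming a Gaussian library distribution with known SD and two alleles with weight 0.5 each. The actual method may well model the library distribution differently, so the Gaussian components, the initialization and the function name are assumptions for illustration only.

```python
import math


def em_two_shifts(spans, lib_mean, lib_sd, n_iter=100):
    """Fit two equally weighted Gaussians with fixed SD to mapped distances.

    Returns the two estimated shifts relative to the library mean, one per
    allele. Equal shifts suggest a homozygous event, different shifts a
    heterozygous one.
    """
    ordered = sorted(spans)
    half = len(ordered) // 2
    # Initialize the two means from the lower and upper half of the data.
    m1 = sum(ordered[:half]) / half
    m2 = sum(ordered[half:]) / (len(ordered) - half)
    for _ in range(n_iter):
        # E-step: probability that each pair comes from the first allele.
        resp = []
        for x in spans:
            a = math.exp(-0.5 * ((x - m1) / lib_sd) ** 2)
            b = math.exp(-0.5 * ((x - m2) / lib_sd) ** 2)
            resp.append(a / (a + b) if a + b > 0 else 0.5)
        # M-step: responsibility-weighted means.
        w1 = sum(resp)
        w2 = len(spans) - w1
        if w1 > 0:
            m1 = sum(r * x for r, x in zip(resp, spans)) / w1
        if w2 > 0:
            m2 = sum((1 - r) * x for r, x in zip(resp, spans)) / w2
    return m1 - lib_mean, m2 - lib_mean


# Toy cluster: one allele matches the reference, the other carries a 30bp deletion.
spans = [195, 198, 200, 202, 205, 225, 228, 230, 232, 235]
d1, d2 = em_two_shifts(spans, lib_mean=200, lib_sd=5)
print(round(d1, 2), round(d2, 2))
```

With real data the two components overlap far more than in this toy cluster, which is where coverage comes in: more mate pairs per cluster give tighter estimates of the two means.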
Error rate decreases (as expected) with size of deletion and coverage. Small indels can be detected given enough coverage. Difficult to tell from the image shown, but 30X seems to be sufficient for indels in the 20bp range.
Published as MoDIL in Nature Methods