Short-seq SIG: SNP Discovery and Cancer Genomics

Jun 28, 12:00 am

Adrian Dalca (Toronto) on VARiD: Variation detection in color space and letter space

Two different data types with inherent errors and biases, useful to combine into a single framework. A sequencing error in color space changes a single color; a SNP changes the underlying sequence and therefore two adjacent colors. Usually a mix of signals though (heterozygous SNPs etc.).
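To make the one-color versus two-color distinction concrete, here is a minimal Python sketch of two-base (color space) encoding. It assumes the standard SOLiD scheme, in which each color is the XOR of the 2-bit codes of two adjacent bases. The sequences and the `to_colors` helper are made up for illustration and are not taken from VARiD.

```python
# 2-bit base codes; a color is the XOR of the codes of two adjacent bases
# (assumed standard SOLiD two-base encoding).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(seq):
    """Translate a letter-space sequence into its list of colors (0-3)."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


reference = "ACGTACGT"
donor = "ACGAACGT"  # hypothetical SNP: T -> A at position 3 (0-based)

ref_colors = to_colors(reference)
donor_colors = to_colors(donor)

# Positions at which the two color strings disagree.
changed = [i for i, (r, d) in enumerate(zip(ref_colors, donor_colors)) if r != d]
print(ref_colors, donor_colors, changed)
```

The single base change alters the two colors that overlap it, whereas a miscalled color in a read would alter only one position of the color string.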

Uses an HMM: 16 dinucleotide states but only four colours; only certain transitions are allowed. Transitions depend only on the previous state. Models different errors in color and letter space.

  • unknowns (donor)
  • emissions (reads)
  • dependence on previous states
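A rough sketch of that state space, as I understood it: each state is a dinucleotide, a state can only be followed by a state that starts with its second base, and each state maps to one of four colours. The names below (`STATES`, `allowed`, `color_of`) are my own, and the actual VARiD model may well be organised differently.

```python
from itertools import product

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

# 16 hidden states, one per dinucleotide.
STATES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def allowed(prev_state, next_state):
    """A transition XY -> YZ is only valid if the shared base agrees."""
    return prev_state[1] == next_state[0]


# For every state, the list of states that may follow it.
transitions = {s: [t for t in STATES if allowed(s, t)] for s in STATES}

# Color associated with each dinucleotide (assumed SOLiD XOR encoding).
color_of = {s: BASE_CODE[s[0]] ^ BASE_CODE[s[1]] for s in STATES}

# Group states by color: four dinucleotides share each color.
states_by_color = {}
for state, color in color_of.items():
    states_by_color.setdefault(color, []).append(state)

print(transitions["AC"])   # states reachable from AC
print(states_by_color[0])  # dinucleotides sharing color 0
```

Because four dinucleotides share each colour, a colour read alone is ambiguous; the restricted transitions are what tie consecutive colours back to a single underlying letter sequence.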

Expanded the framework for heterozygous SNPs by doubling the number of states, allowing different error rates for different colours/letters. One problem: proposed variants need to be checked against the original reads to confirm that they were actually observed and are not an artifact of combining the data in the HMM.

Test data set shows low false positive rates (similar to Corona).

Probabilistic detection of SNPs in tumour transcriptomes, Sohrab Shah (UBC)

Workflow for clinically relevant variations: alignment of reads, detection of variants (focus of the talk), confirmation of SNPs and distinction between somatic and germ line changes, annotation of SNPs with frequency, functional significance etc.

SNVMix1 models allelic counts in a probabilistic graphical model with 3 states (homozygous reference, heterozygous, homozygous variant), using a prior distribution over genotypes and biases towards the three states. Obviates the need for depth-based thresholding; fits the model to the data using EM. Comparison to MAQ (which uses fixed parameters) shows that the fitted model results in increased accuracy.
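The core idea can be sketched as a three-component binomial mixture: given the number of reference-matching reads at a position and the depth, compute a posterior over the three genotypes. This is only an illustration of the principle. The parameter values below (`mu`, `prior`) are invented placeholders, not the values SNVMix fits, and the EM step that would re-estimate them from the data is left out.

```python
from math import comb


def genotype_posterior(ref_count, depth,
                       mu=(0.99, 0.5, 0.01),      # assumed P(read shows ref | genotype)
                       prior=(0.9, 0.05, 0.05)):  # assumed genotype prior
    """Posterior over (hom. reference, heterozygous, hom. variant)."""
    # Binomial likelihood of the observed reference count under each genotype.
    likelihoods = [
        comb(depth, ref_count) * m ** ref_count * (1 - m) ** (depth - ref_count)
        for m in mu
    ]
    joint = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


# 10 of 20 reads match the reference: heterozygous is the most probable state.
print(genotype_posterior(10, 20))
# All 20 reads match the reference: homozygous reference dominates.
print(genotype_posterior(20, 20))
```

Note that no depth cut-off appears anywhere: a low-depth position simply yields a flatter posterior.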

SNVMix2 takes uncertainty in the form of mapping and base-call qualities into account, so low quality positions contribute less to the SNP calls. Gains additional coverage (?) / accuracy beyond what would normally be filtered by quality-based thresholding. Comparison of SNP calls with MAQ; MAQ has a slightly lower false-positive rate at lower depth. SNVMix1/2 recovers 305 true positives, calls 197 false positives.
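One way to picture quality weighting, as opposed to hard filtering: convert each Phred score to a probability that the base call is correct, and let an unreliable base fall back to an uninformative 0.5. The weighting formula and all names here are my assumptions for illustration; the exact SNVMix2 likelihood (which also uses mapping quality) was not spelled out in the talk.

```python
def phred_to_p_correct(q):
    """Standard Phred definition: error probability is 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def read_likelihood(is_ref, base_q, mu):
    """Likelihood of one base given P(ref allele | genotype) = mu.

    Assumed weighting: a trusted base follows the genotype model,
    an untrusted base contributes an uninformative 0.5.
    """
    p_ok = phred_to_p_correct(base_q)
    p_allele = mu if is_ref else 1.0 - mu
    return p_ok * p_allele + (1.0 - p_ok) * 0.5


def site_likelihood(reads, mu):
    """Product over all (is_ref, base_quality) observations at a site."""
    lik = 1.0
    for is_ref, base_q in reads:
        lik *= read_likelihood(is_ref, base_q, mu)
    return lik


# Eight good reference reads plus two non-reference reads of differing quality.
low_q = [(True, 40)] * 8 + [(False, 2)] * 2
high_q = [(True, 40)] * 8 + [(False, 40)] * 2

for name, reads in (("low", low_q), ("high", high_q)):
    print(name, site_likelihood(reads, 0.99), site_likelihood(reads, 0.5))
```

With poor-quality variant bases the homozygous-reference model (mu = 0.99) still explains the site better; with high-quality variant bases the heterozygous model (mu = 0.5) takes over, without any reads being thrown away.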

16 ovarian cancer transcriptomes from samples with different clinical outcomes. Identified recurring FoxL2 point mutation in granulosa cell tumours, not found in 800 other cancers.

Deeply sequenced breast cancer genome, identify the mutational evolution by comparison with the primary sample. 32 new somatic mutations, some genes hit multiple times but never at the same position. Four out of the 32 mutations present in the primary.

Dirk Evers (Illumina) on detecting polymorphisms in cancer tumour/normal pairs

Quick overview of the Eland mapping process. Short indels are identified by finding places where the paired second read does not match the reference alignment; a better method is to find multiple such reads in the area and build a contig of ‘shadows’ (non-matching paired reads) using a Smith-Waterman approach.
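For readers unfamiliar with it, Smith-Waterman is a dynamic-programming local alignment. The sketch below only computes the best local alignment score between two sequences, with scoring parameters I picked arbitrarily; it says nothing about how CASAVA actually assembles or scores shadow reads.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores are never allowed to drop below zero.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


# Two hypothetical shadow reads sharing the core "ACGT".
print(smith_waterman_score("TTACGTGG", "CCACGTAA"))
```

Reads that overlap score highly and can be chained into a contig, while unrelated reads score near zero.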

Identify tumor-specific variants in a melanoma cell line (COLO-829) with 75bp reads, coverage between 32 and 40X. Provides statistics of SNPs from CASAVA; identified somatic substitutions using a threshold-based approach. 30% of somatic substitutions in coding positions, 5 megabases of sequence used to validate SNPs by Sanger sequencing, 44 out of 50 somatic substitutions identified correctly. Large number of somatic indels (1010?).

Runs through a number of biological examples, including deletion of a 3.8 kb segment, inversions, CNVs (captured by changes in coverage, compares favorably to array data, complicated by the cell line’s karyotype).

Moving to probabilistic models for their in-house software. Eland v2 to be released in Q4 using multiple seeds with a banded gap alignment. CASAVA cleanup, move to standard short-seq formats, provide an API at some point.

Gerald Quon (Toronto), ISOLATE: a computational strategy for identifying the primary origin of cancers using high throughput sequencing

5% of all new cases are cancers of unknown primary origin at presentation. Two major problems:

  • unknown site of origin
  • sample heterogeneity; surgically removed tumours tend to be a mix of tumour and healthy surrounding tissue cells, and the mixture affects downstream analysis if not accounted for

Usual approach includes supervised classification based on select marker genes from known tumours. Works well on differentiated tumours, poorly on secondary tumours from metastatic cancer.

ISOLATE is a probabilistic mixture model that identifies the site of origin in an unsupervised way, estimates the composition of the tumour and finally identifies differentially expressed genes. Leverages the higher dynamic range of high-throughput sequencing.

Input to the model: expression profile of the tumor sample and a source panel of characterized sites of origin, including typical cell contaminants. Output is site of origin, composition, differentially expressed genes. Difference to ICA: ISOLATE does not try to solve these problems sequentially but in a cooperative manner.
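To illustrate just the composition part of the problem, here is a toy EM that estimates mixing proportions when a sample's read counts are drawn from a mixture of known, fixed source profiles. This is not ISOLATE: the real model also infers the site of origin and the cancer-specific expression changes jointly. The function name, the profiles and the counts are all invented for the example.

```python
def estimate_composition(counts, profiles, n_iter=200):
    """EM for mixture weights with fixed component profiles.

    counts[g]      : reads observed for gene g in the mixed sample
    profiles[k][g] : probability of gene g in source k (assumed > 0)
    """
    k = len(profiles)
    weights = [1.0 / k] * k
    total = sum(counts)
    for _ in range(n_iter):
        new = [0.0] * k
        for g, x in enumerate(counts):
            if x == 0:
                continue
            # E-step: share of gene g's reads attributed to each source.
            joint = [weights[j] * profiles[j][g] for j in range(k)]
            norm = sum(joint)
            for j in range(k):
                new[j] += x * joint[j] / norm
        # M-step: weights are the expected fraction of reads per source.
        weights = [n / total for n in new]
    return weights


# Two hypothetical sources over three genes; sample mixed 75% / 25%.
profiles = [[0.7, 0.2, 0.1],
            [0.1, 0.2, 0.7]]
counts = [550, 200, 250]
weights = estimate_composition(counts, profiles)
print(weights)
```

A sequential approach would fix these weights first and look for differential expression afterwards; the point made in the talk is that solving the pieces together avoids errors in one step propagating into the next.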

Source panel from existing high throughput data. Synthetic test set by generating tumor samples from the source panel (perturbation of DE genes, tumor-tumor variability).

  • Differential expression error compares the identified ranking of genes with known (synthetic) DE. Largely invariant to the number of tumor samples, does not need a large number of samples.
  • Origin error: number of times the origin is identified incorrectly. ISOLATE maintained a low error rate despite an increased number of sources (> 5). Moderate error rate even with few tumor samples.
  • Heterogeneity error: average error in composition prediction (amount each source contributes). Strict representation of cancer expression profiles in ISOLATE seems to help with the composition prediction (wish he’d given more details here).
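Two of these measures are simple enough to sketch. The exact definitions used in the talk were not shown, so the functions below are my guess at reasonable versions: origin error as the fraction of samples with a wrong predicted origin, heterogeneity error as the mean absolute difference between true and predicted source proportions. The example labels and proportions are invented.

```python
def origin_error(true_origins, predicted_origins):
    """Fraction of samples whose predicted site of origin is wrong."""
    wrong = sum(t != p for t, p in zip(true_origins, predicted_origins))
    return wrong / len(true_origins)


def heterogeneity_error(true_props, predicted_props):
    """Mean absolute error over all samples and all source proportions."""
    diffs = [
        abs(t - p)
        for t_row, p_row in zip(true_props, predicted_props)
        for t, p in zip(t_row, p_row)
    ]
    return sum(diffs) / len(diffs)


# Hypothetical synthetic evaluation: one of four origins is misassigned.
print(origin_error(["lung", "colon", "breast", "lung"],
                   ["lung", "colon", "lung", "lung"]))
# One sample, two sources: predicted 50/50 against a true 75/25 split.
print(heterogeneity_error([[0.75, 0.25]], [[0.5, 0.5]]))
```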

Real world (clinical) test set: had to use a microarray set of 93 tumor samples due to lack of next-gen sequencing samples. Six different sites of origin; tissue panel from a different data set with the same set of six healthy tissues. Compares ISOLATE to LDA, KL; ISOLATE tends to match/beat the reference approaches.

Available as a Python package.

Oliver Hofmann
