Cole Trapnell (Berkeley) — TopHat: discovering splice junctions with RNA seq
Short overview of transcripts and isoforms. Two tools in Tuxedo:
- TopHat – spliced short-read alignment (expands on Bowtie)
- Cufflinks – isoform assembly and quantitation
Align reads with Bowtie. Pileups of reads mark possible transcripts / exons; generate possible pairings of exons, build synthetic sequences around the junctions, and re-align the unmapped reads against them. The number of possible pairings grows quadratically, so TopHat adds a number of constraints to limit it. Misses very long introns.
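The pairing step can be sketched roughly as follows. This is a minimal illustration and not TopHat's code: the function name, the `flank` and `max_intron` parameters, and the two constraints shown (a maximum intron length and canonical GT...AG intron ends) are assumptions chosen for the example.

```python
from itertools import combinations

def junction_candidates(genome, islands, flank=4, max_intron=20):
    """Pair coverage islands (putative exons) and build synthetic junction sequences.

    islands: list of (start, end) half-open intervals, sorted by start.
    Returns a dict mapping (donor_end, acceptor_start) -> spliced flank sequence.
    """
    candidates = {}
    for (s1, e1), (s2, e2) in combinations(islands, 2):
        intron_len = s2 - e1
        # Assumed constraint 1: skip overlapping islands and overly long introns.
        if intron_len <= 0 or intron_len > max_intron:
            continue
        # Assumed constraint 2: keep only canonical GT...AG introns.
        intron = genome[e1:s2]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            continue
        left = genome[max(s1, e1 - flank):e1]
        right = genome[s2:min(e2, s2 + flank)]
        candidates[(e1, s2)] = left + right
    return candidates

# Toy genome: exon, intron, exon.
genome = "ACGTACGT" + "GTAAAAAG" + "TTGGCCAA"
islands = [(0, 8), (16, 24)]
junctions = junction_candidates(genome, islands)

# A read spanning the junction does not occur in the genome itself,
# but it does occur in the synthetic junction sequence.
read = "CGTTTG"
spanning = [j for j, seq in junctions.items() if read in seq]
```

Without such constraints every pair of islands would become a candidate, which is where the quadratic growth comes from.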
New developments in sequencing (100bp reads, paired ends of any size) allow new options. The new version is faster (fully threaded, based on Bowtie) and adds semi-canonical intron support and support for micro-exons. Reads are split into 25-mers that are mapped independently, then TopHat looks for linear alignments (that might miss some of the fragments). Running time is independent of maximum intron size. Junctions are another Bowtie index file.
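A toy version of the split-and-map idea, assuming exact matching as a stand-in for Bowtie and using a short segment length so the example stays readable. All function names are hypothetical. It only reports gaps between neighbouring segments that both map; a segment that itself contains the junction simply fails to map here.

```python
def split_segments(read, seg_len=25):
    """Cut a read into consecutive non-overlapping segments of seg_len bases.

    A trailing partial segment is dropped.
    """
    return [read[i:i + seg_len] for i in range(0, len(read) - seg_len + 1, seg_len)]

def map_exact(segment, genome):
    """Stand-in for an aligner: position of the first exact match, or None."""
    pos = genome.find(segment)
    return pos if pos >= 0 else None

def implied_introns(read, genome, seg_len=25):
    """Map segments independently; report gaps between neighbouring mapped segments."""
    hits = [map_exact(seg, genome) for seg in split_segments(read, seg_len)]
    introns = []
    for a, b in zip(hits, hits[1:]):
        if a is None or b is None:
            continue  # this sketch gives up on unmapped segments
        if b - (a + seg_len) > 0:
            introns.append((a + seg_len, b))  # half-open intron interval
    return introns

# Toy example with 4-base segments instead of 25-mers.
genome = "ACGTCAGT" + "GTAAAAAAAG" + "CTGATCGG"
read = "ACGTCAGT" + "CTGATCGG"
introns = implied_introns(read, genome, seg_len=4)
```

Because each segment is located on its own, the work per read does not depend on how far apart the segments land, which fits the note that running time is independent of the maximum intron size.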
Cufflinks handles ab initio assembly of reads into a few, highly probable transcripts. Uses a partial ordering of mates and builds a DAG for the ordered set. Graph-based algorithm to find the minimum number of transcripts required to explain the reads.
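Finding the fewest transcripts that explain all reads resembles the textbook minimum path cover problem on a DAG, which reduces to bipartite matching. The sketch below shows that generic reduction only; it is not Cufflinks' implementation. It assumes nodes are fragments numbered 0..n-1 and that `edges` lists every ordered pair of compatible fragments (i.e. the transitive closure is already included).

```python
def min_path_cover(n, edges):
    """Minimum number of vertex-disjoint paths covering a DAG with n nodes.

    Classic reduction: the answer is n minus the size of a maximum matching
    in the bipartite graph that has an edge (u, v) for every DAG edge u -> v.
    """
    succ = {u: [] for u in range(n)}
    for u, v in edges:
        succ[u].append(v)
    match_to = [None] * n  # match_to[v] == u means u -> v is used in a path

    def try_augment(u, seen):
        # Kuhn's augmenting-path search starting from u.
        for v in succ[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_to[v] is None or try_augment(match_to[v], seen):
                match_to[v] = u
                return True
        return False

    matched = sum(try_augment(u, set()) for u in range(n))
    return n - matched

# Fragments 0, 1, 2 are mutually compatible (one isoform); 3 and 4 form another.
two_isoforms = min_path_cover(5, [(0, 1), (1, 2), (0, 2), (3, 4)])

# Fragment 0 is compatible with both 1 and 2, but 1 and 2 exclude each other.
shared_start = min_path_cover(3, [(0, 1), (0, 2)])
```

In both toy cases two paths are needed, i.e. at least two transcripts are required to explain the fragments.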
Test data by Barbara Wold, C2C12 cell line. R50 isoforms account for 50% of the reads. Cufflinks and UCSC agree for most moderately and highly expressed transcripts (85%). UCSC has additional transcripts likely not expressed in the tested cell line.
TopHat available for download.
Regina Bohnert (FML/MPI): Quantitative detection of alternative transcripts
Using rQuant to quantitate RNA-seq data avoiding the simple approach of naive read counting. Biases introduced by cDNA library construction, sequencing, read mapping.
Normalized positional read coverage along the transcripts based on the relative transcript position (different profiles for different length bins). Used Flux Simulator for synthetic data modeling physical RNA fragmentation and random priming. Biases change with priming strategy. rQuant minimizes an error/loss function over transcript profile mixtures by alternating two steps until it reaches a stable minimum:
- Fix the profiles and optimize the transcript weights
- Fix the transcript weights and optimize the profiles
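To make the weight step concrete, here is a minimal sketch of fitting transcript weights with the profiles held fixed. The squared loss, the non-negativity constraint, and the coordinate-descent solver are assumptions for illustration; the talk notes do not say which loss or solver rQuant uses. The profile step would be the mirror image (weights fixed, profile values updated) and is not shown.

```python
def fit_weights(coverage, profiles, n_iter=200):
    """Non-negative least squares by coordinate descent.

    coverage: observed read coverage per genomic position.
    profiles: profiles[t][p] is the expected relative coverage of transcript t
              at position p (0.0 where the transcript has no exon).
    Returns one abundance weight per transcript.
    """
    n_t = len(profiles)
    weights = [0.0] * n_t
    for _ in range(n_iter):
        for t in range(n_t):
            num = 0.0
            den = 0.0
            for p, c in enumerate(coverage):
                # Coverage explained by all other transcripts at this position.
                others = sum(weights[s] * profiles[s][p] for s in range(n_t) if s != t)
                num += profiles[t][p] * (c - others)
                den += profiles[t][p] ** 2
            # Closed-form update for one weight, clipped at zero.
            weights[t] = max(0.0, num / den) if den > 0 else 0.0
    return weights

# Two isoforms over six positions; the second skips the middle exon.
profiles = [
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.0, 1.0, 1.0],
]
coverage = [8.0, 8.0, 3.0, 3.0, 8.0, 8.0]  # generated from weights 3 and 5
weights = fit_weights(coverage, profiles)
```

With flat profiles this reduces to explaining the coverage drop over the skipped exon; non-flat profiles would enter through the `profiles` values without changing the code.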
Compared rQuant (position-wise method with profiles) to itself without profiles and to segment-wise approaches, using simulated data and a subset of A. thaliana spliced genes; the profile-based method seems to improve quantitation accuracy.
- Additional biases such as the exonic GC content (higher GC content results in higher read coverage)
Software should be available at her homepage “soon”.
Jinze Liu (North Carolina) on mapping RNA-seq short reads for splice junction discovery
Problem: Infer splices from RNA-seq tags in Equus (which is EST-poor). Prefix and suffix of a tag align to different parts of the genome; splice junction t unknown. Needs to be error tolerant.
Approach: either the suffix or the prefix needs to align with at least m/2 nucleotides in the genome (m presumably being the tag length). Find half-tag alignments, followed by a constrained search around the anchor within distance D. The spliced alignment search minimizes mismatches between tag and genome (not conditioned on genomic features).
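A rough sketch of the anchor-then-search idea, covering only the prefix-anchored case (the suffix case would be symmetric). It assumes an exact match for the half-tag anchor, uses only the first anchor hit, and brute-forces the split point, so it does not reproduce the scaling behaviour of the real method. Function names are hypothetical; `max_dist` plays the role of D.

```python
def count_mismatches(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def prefix_anchored_splice(tag, genome, max_dist):
    """Anchor the first half of the tag, then search downstream for the rest.

    Returns (mismatches, anchor_pos, split, acceptor_pos) with the fewest
    mismatches, or None if the half-tag anchor is not found. `split` is the
    number of tag bases placed at the anchor before the splice.
    """
    m = len(tag)
    half = m // 2
    anchor = genome.find(tag[:half])  # exact half-tag anchor, first hit only
    if anchor < 0:
        return None
    best = None
    for split in range(half, m):
        donor_end = anchor + split
        if donor_end > len(genome):
            break
        left_mm = count_mismatches(tag[:split], genome[anchor:donor_end])
        rest = tag[split:]
        # Constrained search: acceptor within max_dist bases of the donor.
        for acceptor in range(donor_end + 1, donor_end + max_dist + 1):
            if acceptor + len(rest) > len(genome):
                break
            mm = left_mm + count_mismatches(rest, genome[acceptor:acceptor + len(rest)])
            candidate = (mm, anchor, split, acceptor)
            if best is None or candidate < best:
                best = candidate
    return best

genome = "ACGTCAGT" + "GTAAAAAAAG" + "CTGATCGG"
tag = "TCAGT" + "CTG"  # five bases from the first exon, three from the second
hit = prefix_anchored_splice(tag, genome, max_dist=15)
```

The score is purely the mismatch count, with no preference for GT...AG or other genomic features, matching the "not conditioned on genomic features" remark.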
Linear scaling with genome size, number of tags, maximum allowed intron size and tag size. That’s quite interesting and would make this great for an initial screen of a data set before possibly trying more complex approaches.
Require a minimum anchor size to keep false positives low, and a minimum number of tags crossing an intron. Good sensitivity at 8+ spliced reads (usual dependency on read coverage).
Inanc Birol (BC Cancer Agency): de novo transcriptome assembly with ABySS
New paradigm in assembly: large number of short and uniform reads. Old approach of overlap/layout/consensus no longer viable. The alternative is the de Bruijn graph method, which characterizes the order of the reads with respect to each other in a digraph.
k-mer extension requires higher coverage with increased k; with k less than the read length the virtual coverage increases (e.g., 8-mer reads with 5-mer sub-reads). Selection of k determines sensitivity and specificity. Misreads and repeats cause assembly branches, false branches, and bubbles. These need to be trimmed, removing the lesser-supported branches in bubbles, to come up with a linear assembly.
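A small sketch of the k-mer view, using the 8-mer reads / 5-mer sub-reads example from the talk. It builds a de Bruijn-style graph in which k-mers are nodes and consecutive k-mers in a read are linked, and shows how a single misread creates a weakly supported branch. This illustrates the general idea only; ABySS's actual graph representation and trimming rules are not described in these notes.

```python
from collections import Counter, defaultdict

def kmers(read, k):
    """All overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def de_bruijn(reads, k):
    """Count k-mers and link k-mers that are adjacent in a read (k-1 overlap)."""
    counts = Counter()
    edges = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        counts.update(ks)
        for a, b in zip(ks, ks[1:]):
            edges[a].add(b)
    return counts, edges

reads = [
    "ATGGCGTG",
    "ATGGCGTG",
    "GCGTGCAA",
    "ATGGCTTG",  # same start, but with a single-base misread (G -> T)
]
counts, edges = de_bruijn(reads, k=5)

# Each 8-mer read contributes 8 - 5 + 1 = 4 sub-reads.
per_read = len(kmers(reads[0], 5))

# The misread creates a branch after ATGGC; keep the better supported side.
branch = edges["ATGGC"]
kept = max(branch, key=counts.get)
```

Raising k shrinks the number of sub-reads per read, so the same data supports each node less, which is the coverage trade-off mentioned above.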
Output includes:
- Assembly contigs
- ‘Popped’ bubbles (allelic differences, near-repeats, read errors)
- Contig adjacency (k-1 overlaps and mate pairs)
Sample performance on E. coli K12, 21M Illumina reads, 325X with 36bp reads. Well studied substrain with no rearrangements. No mis-assemblies of contigs > 100bp.
Difference in transcript assembly: coverage levels depend on expression levels, split assembly paths can be isoforms, contig sizes vary with product sizes (small!), contig clusters are the expressed genes. Can discover novel transcripts.
194M reads from a lymphoma patient (both paired and unpaired reads). Picked a k-mer size of 28 as it results in a sharp decrease of contigs. Assembly took between 2 and 8 hours (unpaired, paired) and resulted in 30Mbp of transcriptome, 93% overlap with UCSC genes, 7% in intergenic regions. Retained popped bubble information provides allele information.
ABySS explorer (in review in IEEE Tr Vis and Comp Graphics) to explore contigs.
Chol-Hee Jung (IMB Queensland): Identification of novel non-coding RNAs using profiles of short seq reads from next generation sequencing data
In short: RNomics through reverse engineering. 10 million tags from 12 Drosophila datasets, 11% of the genome with perfectly matching tags. Cover miRNA, transposons, tRNA, snoRNAs. Recent studies identified novel ncRNA forms (mi/si/piRNA). Using the concept of tag-contigs to find novel ncRNA groups: continuously overlapping sequence tags on the same strand identified about half a million contigs.
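The tag-contig idea amounts to merging overlapping intervals per strand. A minimal sketch, assuming tags are given as (chromosome, strand, start, end) tuples with half-open coordinates; the function name and the `n_tags` field (number of tags merged, not per-base depth) are made up for the example.

```python
def tag_contigs(tags):
    """Merge continuously overlapping tags on the same chromosome and strand.

    tags: iterable of (chrom, strand, start, end), half-open coordinates.
    Returns a list of (chrom, strand, start, end, n_tags).
    """
    contigs = []
    # Sorting groups tags by chromosome and strand, then orders them by start.
    for chrom, strand, start, end in sorted(tags):
        last = contigs[-1] if contigs else None
        if last and last[0] == chrom and last[1] == strand and start < last[3]:
            # Overlaps the current contig: extend it and count the tag.
            contigs[-1] = (chrom, strand, last[2], max(last[3], end), last[4] + 1)
        else:
            contigs.append((chrom, strand, start, end, 1))
    return contigs

tags = [
    ("2L", "+", 100, 122),
    ("2L", "+", 110, 131),
    ("2L", "-", 115, 137),  # overlaps in position but lies on the other strand
    ("2L", "+", 200, 221),
    ("2L", "+", 125, 146),
]
contigs = tag_contigs(tags)
```

Filtering the result by length and tag count would then give candidate clusters of the kind described next.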
Identify contigs of specific length with high coverage in unannotated parts of the genome. Reduced the set to 100k clusters after removal of transposons. Take subset of clusters with specific length, depth, features (89 with depth > 100, no transposon similarity). First group mainly in introns, downstream of tRNAs (8 clusters with identical 18mer), confirmed by Northerns. Group 2 specifically expressed in tissue subsets.