AGBT Day 3

10 days ago

Disclaimer: All notes, comments and links are just cut’n’pasted from Twitter (#AGBT), FF and RSS feeds; blog posts are attributed, Twitter comments are not.

In general, people seem to be cautious over the niche that PacBio is likely to fit in, have written off Helicos, and seem to be excited about the single-mindedness of Complete Genomics and the new technologies such as Ion Torrent’s machine.

General notes

  • Summary of the current state of sequencing technology at Genetic Inference. Helicos is noticeably absent.
  • Slide used to describe Complete Genomics’ sequencing process
  • Details on Ion Torrent’s sequencing chemistry using an ion-sensitive layer (essentially a pH-meter); no light/scanning/cameras required.
  • Omics! take on the Pacific Biosciences release, and more coverage from MassGenomics. Lots of skepticism given the lack of hard data.
  • From Omics!, something already discussed on the SAM mailing list: “SAM/BAM stores all sorts of information on read pairs, and the strobe sequencing can generate many more than 2 tags per DNA fragment.”
  • IBM timed their press release on a new data analysis method well

The workshops and New Technologies session

Illumina

Tons of data. 2 billion bases per run, 25GB/day with the HiSeq 2000. Work in progress: total human transcriptome, 16 tissues. Massive throughput, plug-and-play reagents, remote access (e.g. iPhone). Two human genomes in one run.

Elliot Margulies: Using 2 HiSeq flowcells rather than 1 moves >10-read coverage from 97% to 98%. Not worth using both; basically, we’re now at the point where one run is overkill for sequencing a genome. SNP chip concordances: 98.2% for GAIIx, 99.7% for one HiSeq FC, >99.9% for both. Depths: 34X, 39X, 77X.

Complete Genomics

1 million genomes in the next five years, 500 genomes/month this year. Customers are delivered software and data – nothing else. The company will take care of the steps between DNA sample and genetic variant calls. Wants to build 10 sequencing centers worldwide. 50 genomes completed so far. Rade Drmanac from Complete: Read errors very low (0.05%). Errors come from bubbles, dust, DNA damage, somatic variation. RD calls CG’s offering “cloud sequencing”. Throughput per run is now approaching 2 trillion bases. That’s utterly insane. 20-instrument facility could do 100,000 genomes per year.
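
A quick back-of-the-envelope check of the throughput figures quoted above, as a minimal Python sketch. The variable names are mine, and the assumption that the load is spread evenly over instruments and over a 365-day year is an illustration only, not something stated in the talk.

```python
# Figures quoted in the notes above (Complete Genomics session).
GENOMES_PER_YEAR_FACILITY = 100_000   # claimed for a 20-instrument facility
INSTRUMENTS = 20
GENOMES_PER_MONTH_THIS_YEAR = 500     # stated target for this year
GOAL_GENOMES = 1_000_000              # stated target for the next five years
GOAL_YEARS = 5

# Assumption: load is spread evenly across instruments and a 365-day year.
per_instrument_year = GENOMES_PER_YEAR_FACILITY / INSTRUMENTS
per_instrument_day = per_instrument_year / 365

# Average yearly rate needed for the five-year goal vs. this year's rate.
needed_per_year = GOAL_GENOMES / GOAL_YEARS
current_per_year = GENOMES_PER_MONTH_THIS_YEAR * 12
scale_up = needed_per_year / current_per_year

print(f"{per_instrument_year:.0f} genomes per instrument per year")
print(f"{per_instrument_day:.1f} genomes per instrument per day")
print(f"{scale_up:.1f}x this year's rate needed on average")
```

Under those assumptions each instrument would have to finish roughly 14 genomes a day, and the five-year goal implies averaging about 33 times this year's rate.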

Jared Roach (ISB): four genomes from Complete from same family. Combining info across samples —> 99.9997% accuracy.

Dr. Zemin Zhang (Genentech): 1 mutation for every 3 cigarettes smoked… good to know. Cancer genomes. Aml norm and tumor. Had array data. Seq data>90% coverage above 10x. 40 and 60x avg coverage. Signed up for more genomes.

Helicos

ChIP-seq, direct-RNA. Runs with over 1e9. Single molecule. Helicos has short reads, high indel error rates, weighs 900 kg and costs $800K. Yeah, this talk is a tough sell.

PacBio

Eric Schadt: sequencing to explore energy-producing bacteria. Added 13X PacBio seq to 19X Illumina seq to assemble a bacterial genome. Strobed reads powerful for spanning repeats. Can integrate DNA variation, molecular traits and phenotypes to construct a probabilistic causal gene network.

LifeTech

Joseph Beechem (Life Tech) just got applause for having the longest talk title of the session. JB is introducing Life Tech’s new single molecule sequencing instrument based on quantum dots. Portable Nanometer sequencer. Tunable read length, tunable accuracy. Table top instrument. Can replenish a batch of polymerase mid-way through a run to replace dead molecules; effectively unlimited read length. Beta instruments will be available to a small set of collaborators by the end of 2010.

Ion Torrent

Jonathan Rothberg of Ion Torrent is a student of history. He sees Second Generation tech as a minicomputer. The box is just a box: Ion Torrent is a chip. Measures H+ as a base is incorporated. No lights, no moving parts. Not a single complexity.

Can leverage developments in the semiconductor industry, not reliant on optical technology like other platforms. Offering two free Ion Torrent instruments to researchers who come up with the best possible applications. Can sequence in hotels and on the backs of donkeys. With wireless you can analyze your data in GeneSifter.

Talk notes

  • Joseph Puglisi, Stanford University School of Medicine, The Molecular Choreography of Translation. Using the PacBio system to track translation.
  • Bing Ren, UCSD, Epigenomic Landscapes of Pluripotent and Lineage-Committed Human Cells.
  • Jesse Gray, Harvard Medical School, Widespread RNA Polymerase II Recruitment and Transcription at Enhancers During Stimulus-Dependent Gene Expression.
  • Keynote: Henry Erlich, Roche Molecular Systems, Applications of Next Generation Sequencing: HLA Typing With the GSFLX System.
  • Christopher Mason, Weill Cornell Medical College, Developmental Changes in Human Neocortical Transcriptome Revealed by RNA-Seq. No visible end to gene discovery. The more you sequence the more you see.
  • Yardena Samuels, NHGRI, Mutational Analysis of the Melanoma Genome.
Oliver Hofmann


---

AGBT Day 2

11 days ago

Day 2

General Twitter comments slowed down to a crawl in the morning, most likely due to most attendees suffering from the horrible WiFi connection or last night’s cocktail parties. Again, disclaimer: all notes, comments and links are just cut’n’pasted from Twitter (#AGBT), FF and RSS feeds; blog posts are attributed, Twitter comments are not, as this is mostly for my own records. Thanks everyone for the excellent coverage!

General notes

  • Pacific Biosciences CEO Hugh Martin previews “third-generation” sequencer. Other take on the new machine from Genetic Future with some more numbers and feature details
  • Also from Genetic Future a general update
  • MassGenomics’s update on cancer genomics at AGBT, and another MassGenomics article around the general topic. “In 2-3 years all ped leukemia patients will get a full genome sequencing along with parents and sibs at St. Jude’s” (J. Downing)
  • VCFTools gets released

Talk notes

  • Keynote from James Downing on cancer genomics at St. Jude’s
  • Elaine Mardis: use next-gen seq to find mutations present in a subset of cancer cells; more common mutations likely older. Testing PacBio. Accuracy 94% sites 6 fp and fn. Sensitivity. Detected low prev. Tumor cellularity. Prevalence. Tier 1 good. Tier 3 ok. PacBio proof of principle: cancer mutations tally pretty well with Illumina/454. Neither depth nor cost mentioned. See MassGenomics’s previous coverage on pediatric cancers
  • Levi Garraway, Harvard Med. Cancer transitioning from ‘where in the body is the primary’ to mutation-based diagnoses. Can (almost) give each cancer patient 1/10th of a lane of targeted Illumina: 100X coverage for ~250 cancer genes. Clinical oncologists will ‘shoot you or run away in terror’ if presented with a patient’s full genome sequence.
  • Nicole Cloonan, The University of Queensland, Translation-State RNAseq of Human Embryonic Stem Cells using Paired-End Sequencing. Beyond the Plurinet.
  • Shuro Sen, NHGRI, Transcriptome Profiling of ClinSeq Participants by Massively Parallel Short-Read DNA Sequencing.
  • Brian Haas, Broad, Genome annotation using mRNA-Seq: A case study of Schizosaccharomyces pombe. Overview of current algorithm and approaches for mRNA-seq assembly
  • Manuel Garber, Broad, Annotating LincRNA Transcripts Using Targeted Sequencing.
  • Stan Nelson (UCLA), “Whole Human Cancer Genome Sequencing: Progress Towards Common Application”. Gold standard for SNP discovery isn’t longer reads at high-quality, but low coverage overall
  • Dietrich Stephan (co-founder of @Navigenics, now CEO of the IGNITE Institute for Individualized Health) is discussing pers. medicine. There will be no market for targeted sequencing once whole-genome sequencing drops to a few hundred dollars.
  • David Craig (TGen): Lower-coverage whole-genome seq can be more cost-effective than exome seq for Mendelian discovery.
  • Kevin McKernan (Life Tech): sequencing your boss’ genome raises some serious career-ending possibilities. Announcing the improved SOLiD 4 technology: increased accuracy, increased output, changes to colour space encoding. Actually quite cool: some serious improvements in accuracy. By using both 2-base and 1-base encoding, SOLiD can bring error rates down to 1 in 10^6. Sequencing is so accurate that errors from library prep now dominate. Are working on gentler prep methods.
  • Pacific Biosciences Workshop with many more numbers, as well as details from Jonas Korlach’s talk. Stephen Turner: Raised nearly $300M. Influenza work. 9hrs to data. Single molecule heat maps detect subspecies. With single reads. Single read transcripts too. 10,351-base reads. Strobe sequencing allows you to generate pulses of sequence separated by an arbitrary distance (mate pair equivalent). Circular consensus. Can give q40 reads. Coverage not GC biased. Can detect and discriminate C, 5meC, OHMeC and 8-oxoG (Methylation assays take advantage of differential kinetics of base incorporation at me-sites and measure different kinds of me-groups). V1 2010 to 2013. V2 2014. Human genome in 15 min for $100. Three questions about error rates, and still Turner has given us no hard numbers. Presentation was like watching Lost. You think you’ll finally get some answers but you end up w/even more questions.
Oliver Hofmann


---

AGBT Day 1

12 days ago

Disclaimer: all notes, comments and links are a mere cut-n-paste from Twitter (#AGBT), Friendfeed or my RSS feeds (blog posts are linked, Twitter comments are not attributed though — my apologies).

General notes

  • Notes on running a sequence facility
  • MassGenomics’s first impressions
  • Blog post on statistical aspects, recommended read
  • Question from the floor notes that the 10% false discovery rate for 1000 Genomes = 1.8M false SNPs entering databases!
  • The 454 session. Not hugely impressed with specs; 700bp median read length but high error rates after 600. Very few 1k reads in 1k kit
  • David Gordon seems to have tweaked phred: Consed and Next_phred for Next-Gen Sequencing
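
The arithmetic behind the false-discovery comment in the list above can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes the 10% rate applies uniformly to all SNP calls, which the question from the floor did not spell out.

```python
def implied_calls(false_calls: float, fdr: float) -> tuple[float, float]:
    """Return (total calls, true calls) implied by a false-call count and an FDR."""
    if not 0 < fdr <= 1:
        raise ValueError("fdr must be in (0, 1]")
    total = false_calls / fdr          # FDR = false calls / total calls
    return total, total - false_calls


# Numbers quoted in the notes: 10% false discovery rate, 1.8M false SNPs.
total, true_calls = implied_calls(false_calls=1.8e6, fdr=0.10)
print(f"total calls: {total / 1e6:.1f}M, true calls: {true_calls / 1e6:.1f}M")
```

Read this way, the quoted figures imply a call set of roughly 18 million SNPs, about 16 million of them real.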

Talk notes

  • Debbie Nickerson again; DN: there are 2,500 Mendelian diseases in which we still don’t know the causative gene; sequencing will change that. U. Wash will be tackling 20 Mendelian diseases (160–200 exomes) in 2010; also 7,000 exomes for complex trait. Using evolutionary constraint to score candidate disease genes; powerful approach for filtering out false positives. Also crucial to integrate information from “normal” individuals to filter out non-disease-causing variation. Details here.
  • Stacey Gabriel on exome sequencing. Including capture cost an exome should cost 1/15X of a genome. SG done 120 cancer genomes deep seq; need to do 500 of the same tumor type to find 3%; this is the plan for the ICGC projects. Details here.
  • Arend Sidow, Stanford: Sequences do not determine phasing: phasing different in different (related) cell types. Use sequence start sites to measure nucleosome binding sites. See Antony’s notes.
  • Carlos Bustamante from the Broad; African- and Mexican-American samples display wide variation in degree of European ancestry. People are not always what they think they are. 1st principal component of worldwide variation is essentially a perfect measure of African/European admixture. Details via Antony.
  • Jim Knight on de novo assembly of Bonobo genome. Assembled 24X 454 sequence into max 90kb contigs/1.6Mb scaffolds
  • In the Microbiome session, Julie Segre asks if sequencing can tell us more than culture in identifying skin bacteria. The microbiome of your left arm is no more similar to your right arm than someone else’s. Why do we sterilize our outside (Purell) and infect our inside (probiotic yogurt)? See details.
  • Metagenomic approaches to pathogen discovery – David Wang. Their lab looking at Respiratory and GI infections–5m deaths annually, 40% from unknown agents. Antony’s notes.
  • M. Stromberg, BC, “Novel Mobile Element Insertions detected in human pop..”
  • Charlie Rose, Novel viruses have been present forever- human interconnectivity is what has changed. Risks higher than ever. Monitor airline cabin crews as early warning for new virus threats
  • Good ol’ Monsanto: Enjoyed the bit from the Monsanto Rep: Deforestation of Africa and Brazil is “Visionary” and opportunity for growth
  • Notes from Thomas Briese’s talk “New Frontiers in Molecular Diagnosis of Infectious Diseases”
  • Notes from Penny Chisholm’s talk “From Single Cells to Global Metagenomics”:
  • Notes from Elliot Margulies – Sequencing and analysis of matched tumor and normal genomes from a melanoma patient
  • Notes from Kristian Cibulskis on ITector: Accurate Somatic Mutation Detection in Whole Genome and Exome Capture:
  • Vanessa Hayes is discussing the work published this week in Nature on African genomes. VH just demonstrated the four different click sounds in Bushman languages as a preliminary to discussing their genomes. Desmond Tutu has a very thorough genome sequence: 30x SOLiD, 7.2X Illumina, 16X 454, and a 1M Illumina chip. African genomes yield a huge number of novel variants. Now building a new Affymetrix SNP chip to target the 1.3 million novel SNPs discovered in this project. Also see notes on Bustamante talk.
  • Margret Hoehe (Max Planck) is up now, discussing detailed genetic analysis of the human MHC gene cluster.
  • Daniel MacArthur, “Loss-of-Function Mutations in Healthy Human Genomes: Implications for Clinical Genome Sequencing”. loss of function variants in the 1000 Genomes Project. 1656 genes that contain loss-of-function variants in the 1KG data. Individuals have ~50–100 genes knocked out each (!). Can make a classifier for LOF-tolerant and non-LOF-tolerant genes; useful in search for Mendelian disorder genes. Notes here
  • Stephan Zuchner (U. Miami) on using exome sequencing to track down mutations in spastic paraplegia. The definition of the exome is still unclear – how many protein-coding exons actually exist? Current exome designs don’t include functionally important non-coding regions (promoters, untranslated regions).
  • Notes from Timothy Triche – Children’s Hospital Los Angeles, Unraveling the Complexity of Primary and Metastatic Ewing’s Sarcoma Using Helicos Single Molecule Sequencing
  • Notes from Ian Bosdet – Mutational Profiling of Pre and Post-Treatment Lung Tumors.
  • Notes from Ogan Abaan – NIH/NCI, “Identification of novel cancer mutations in sarcomas”

Oliver Hofmann


---

AGBT meeting notes: first evening

12 days ago

Couldn’t make it to AGBT (again), so I’m unfortunately limited to monitoring FF, Twitter and my RSS feed for coverage. Figured I might as well share my finds.

Pre-meeting

General notes

Talk notes

  • Richard Gibbs estimates 1,000 personal genomes will be produced in 2010, but not yet reaching high quality. Data production and computation flip flop; datastream comes in lurches, as opposed to gradual growth of computation. Computational resources being outpaced by data accumulation; we need to learn how to throw away data.
  • Debbie Nickerson, University of Washington, discussing experiences of data production at a sequencing facility. We’re likely to see 10,000 exomes in 2010 in addition to the 1,000 complete genomes predicted by Richard Gibbs. Array-based sequence capture techniques are impossible to scale; solution-based capture more reproducible, efficient (less variation in quality).
  • Kelly Frazer just gave a very interesting talk on choosing the right sequence capture method. Short answer: it depends. Sorry…
  • Wold recommends computation for ChIP/RNA-seq article
Oliver Hofmann


---

ISMB/ECCB 2009 Workshop: Bioinformatic Cores

251 days ago

Bioinformatics Core Facilities Workshop, Fran Lewitter, Michael Rebhan, Brent Richter & David Sexton

Missed the first five minutes due to another meeting.

Project types:

Prioritization: just long term misses new customers, grant opportunities. Focus on money or project size misses pilot projects, does not allow diversification. Suggestion: based on merit, what allows the core to grow in new directions, expand, supports the institutional community in general

Hiring: 50% FTE available for a new person, enough to get someone started. Identify new technologies, hire to get an early start. Consultants can fill gaps.

Time: maintain an overview of timeframes (putative project starts). Wrap up projects to avoid task switching overhead as much as possible. Make time for the planning stages, use management tools (Basecamp, Trac etc). Work is periodic, plan accordingly

Expectations: be open and transparent with regards to availability, feasibility, stick to realistic time estimates and turn down work if the resources are not available. Collaborate between Cores

Core Cancer Research UK, Cambridge: does not charge, only tracks usage by project and group. Analysis heavy (at least 50% based on description), with partial workflow building, software development (Bioconductor). LIMS for sample tracking and pipelines, re-usage of public tools (Ensembl, Galaxy — mostly to ‘offload’ a number of tasks back to the biologists).

Long term/researcher projects tend to bog down a core. Focus on short term, genomic projects that can help multiple groups with rapid turnover. Main difference to us: generate their own array, short-seq data. Scaling issues: usually 6-10 projects per person at any given time.

Manage workload: define scope early, manage using collaboration software, deliver data in stages to keep everyone in the loop (and happy!). Standardize and automate as early as possible. Train as much as possible to offload work.

Fran: Reward the group (chocolate and ice-cream seem to be enough!). Small group problem — someone on vacation stalls a large number of projects. Group develops own long term projects (TargetScan website in this case). Her take on priorities: publication, grant, ongoing experiments, exploratory research.

Tracking: Work hours between labs, departments, short/long projects. Report once a year to each lab to improve communication. Play fair with whom to support next. Seek co-authorship for collaborative projects (but does not seem to require this)

Hot topics of the month: those best received are hands-on tutorials that benefit the most people with basic tools (Ensembl, statistics packages). Unlikely to spend more time than this, but can be spread out over time. Conscious effort to prioritize them, otherwise they never get set up. Huge success with basic Perl courses just to do very basic data processing.

Chargeback model the standard approach from a quick poll, but 50% of the Cores represented do not charge their users at all (cost covered by their institutes). Fair number support industrial collaborations. Importance of time tracking, cost, part of every project discussion. Provide detailed reports (what was done and how long did it take) to the collaborator / customer at regular intervals.

Large-scale projects: try to avoid those, or outsource / hire specific support. Buy commercial solutions and (try to!) customize the tools, which can be risky depending on the level of commercial support.

Authorship: mostly just acknowledgement, many cores with focus on master’s level students; core authorship depends heavily on the scientific contribution. Alternative benefits – higher salary, ability to get involved in a large number of different interesting projects

Collaboration: central place to share knowledge of methods, tools, evaluation. One place to deposit this information could be the BioinfoCoreWiki or the mailing list.

Post-analysis of short reads

What kind of questions are being asked, can they handle the data themselves, and is there any way to build re-useable workflows? Experience seems to be that so far no question (beyond the assembly step) has come up twice.

Simon Andrews, Babraham Institute. Growth of data, but no additional people to handle short-seq. Huge, diverse range of data types (ChIP-seq, SNPs, mRNA, bisulfite, …) all require different downstream methods. Not involved in QC but take over for mapping, plus additional QC to save time on subsequent analysis and to set expectations on what can be done with the data (and build expertise on what went wrong and why during a run).

Additional visualization, quantitative analysis and biological analysis. An iterative process that kills efficiency (please change the cutoff, different colours, just a slightly different view — sounds very familiar). Division of labour required, biologists need to work with and delve into their own data. Support with scripts, keep track of data being used across experiments, results by a junior person. Core facility to handle the management of primary data, develop software/glue, and only get involved in more complex problems that cannot be handled by the biologists.

One example for this is SeqMonk, to be used by scientists (and thus needs to be useable on desktop machines, hence handles mapped data only). A tool to once again offload tasks back to the requester. Scan/visualize their data to understand it, generate reports. Experiment agnostic, as generic as possible. A number of other tools are also available from their website

Core facility jobs:

Get involved in the design stage, provide realistic quotes. Lead times to evaluate the correct algorithm and methods.

Take from Partners Core: 4 Solexas deployed across Partners’ institutions, about 3TB/week/instrument. Pre-analysis pipeline deployed by ERIS within an HPC environment, regular test of new base callers and assemblers by the Core. In 2009 283 alignments, 145 raw data sets handled by a single bioinformaticist; all that can be done is care/feed the pipeline and return alignments. All downstream analysis by investigators.

Most users only just getting started (likely further increase in data). Huge number of applications (see Nat Biotech paper from 2008), focus on genome re-sequencing, small RNA, SAGE, ChIP-seq. About 20 different analysis tools being tested (Eland, MQ, MOSAIK, RMAP, SHRIMP, SOAP, SSA, velvet, PyroBayes etc). Usually come back to basics:

Post-analysis further down the line: GenomeQuest, Genomatix, CLC bio tools as pay services, mostly in the testing phase. Most important aspects: development of the study with investigators, QC and high quality alignments. Looking into Galaxy as a framework for standard workflow development. Can be a technical challenge when it comes to scaling.

Additional challenges include:

Added from the audience: additional formats (not converging yet). Cooperate with sequencing center that ensures a bioinformatics consult has taken place prior to the samples being run (going to be difficult at Harvard, too de-centralized. Still worth trying to make contacts).

Try to compile information and discussions from the SEQanswers forum. Usually pass only coordinates and processed data on to collaborators; handling of data standards is a problem of the Core.

Data storage: keep processed data, discard raw data (Partners: about 300TB of storage). Can be a problem as base callers are improving. Very few projects warrant going back, though, difficult enough to keep up with new data. In general no charge for storage though.
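
To put the Partners numbers in perspective, here is a minimal Python sketch of how quickly raw output would fill the quoted storage. It assumes the 3TB/week/instrument figure refers to raw data, that all four instruments run every week, and that the full 300TB is available for it; none of this was stated in the workshop.

```python
INSTRUMENTS = 4          # Solexas deployed across Partners
TB_PER_WEEK_EACH = 3     # quoted output per instrument
STORAGE_TB = 300         # quoted storage at Partners

# Assumption: every instrument runs every week and all raw data is kept.
weekly_tb = INSTRUMENTS * TB_PER_WEEK_EACH
weeks_to_fill = STORAGE_TB / weekly_tb
yearly_tb = weekly_tb * 52

print(f"{weekly_tb} TB/week, storage full after {weeks_to_fill:.0f} weeks")
print(f"{yearly_tb} TB/year if nothing is discarded")
```

Under these assumptions the storage would last about 25 weeks, which fits the decision to keep only processed data.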

Oliver Hofmann


---
