ISMB/ECCB 2009 Workshop: Bioinformatic Cores

Jul 2, 08:50 am

Bioinformatics Core Facilities Workshop, Fran Lewitter, Michael Rebhan, Brent Richter & David Sexton

Missed the first five minutes due to another meeting.

Project types:

  • Long term: Pipeline development
  • Medium: Software / method development
  • Short: Ad hoc projects, quick analysis, preliminary data generation

Prioritization: focusing only on long-term projects misses new customers and grant opportunities. Focusing on money or project size misses pilot projects and does not allow diversification. Suggestion: prioritize on merit, i.e. what allows the core to grow in new directions, expand, and support the institutional community in general.

Hiring: 50% FTE available for a new person, enough to get someone started. Identify new technologies, hire to get an early start. Consultants can fill gaps.

Time: maintain an overview of timeframes (putative project starts). Wrap up projects to avoid task-switching overhead as much as possible. Make time for the planning stages, use management tools (Basecamp, Trac, etc.). Work is periodic, plan accordingly.

Expectations: be open and transparent with regard to availability and feasibility, stick to realistic time estimates and turn down work if the resources are not available. Collaborate between Cores.

Core Cancer Research UK, Cambridge: does not charge, only tracks usage by project and group. Analysis heavy (at least 50% based on description), with partial workflow building, software development (Bioconductor). LIMS for sample tracking and pipelines, reuse of public tools (Ensembl, Galaxy; mostly to ‘offload’ a number of tasks back to the biologists).

Long term/researcher projects tend to bog down a core. Focus on short term, genomic projects that can help multiple groups with rapid turnover. Main difference to us: generate their own array, short-seq data. Scaling issues: usually 6-10 projects per person at any given time.

Manage workload: define scope early, manage using collaboration software, deliver data in stages to keep everyone in the loop (and happy!). Standardize and automate as early as possible. Train as much as possible to offload work.

Fran: Reward the group (chocolate and ice-cream seem to be enough!). Small group problem: someone on vacation stalls a large number of projects. Group develops its own long-term projects (TargetScan website in this case). Her take on priorities: publication, grant, ongoing experiments, exploratory research.

Tracking: work hours per lab, department, and short/long project. Report once a year to each lab to improve communication. Play fair with whom to support next. Seek co-authorship for collaborative projects (but does not seem to require this).
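A minimal sketch of what such tracking could look like, assuming a simple time log of (lab, project type, hours) entries. The lab names, hours and the `hours_by_lab` helper are made up for illustration and were not part of the talk:

```python
from collections import defaultdict

# Hypothetical time log entries: (lab, project type, hours worked).
time_log = [
    ("Lab A", "short", 6.0),
    ("Lab A", "long", 30.0),
    ("Lab B", "short", 4.5),
    ("Lab B", "short", 2.5),
]


def hours_by_lab(entries):
    """Sum the hours per lab, split by short/long project type."""
    totals = defaultdict(lambda: defaultdict(float))
    for lab, project_type, hours in entries:
        totals[lab][project_type] += hours
    # Convert the nested defaultdicts to plain dicts for reporting.
    return {lab: dict(by_type) for lab, by_type in totals.items()}


report = hours_by_lab(time_log)
for lab in sorted(report):
    print(lab, report[lab])
```

The same totals could feed the yearly report to each lab.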

Hot topics of the month: the best received are hands-on tutorials that benefit the most people with basic tools (Ensembl, statistics packages). Unlikely to spend more time than this, but can be spread out over time. Conscious effort to prioritize them, otherwise they never get set up. Huge success with basic Perl courses just to do very basic data processing.

A chargeback model is the standard approach according to a quick poll, but 50% of the Cores represented do not charge their users at all (cost covered by their institutes). A fair number support industrial collaborations. Time tracking and cost are important and part of every project discussion. Provide detailed reports (what was done and how long it took) to the collaborator / customer at regular intervals.

Large-scale projects: try to avoid those, or outsource / hire support specifically for them. Buy commercial solutions and (try to!) customize the tools, which can be risky depending on the level of commercial support.

Authorship: mostly just acknowledgement, many cores with a focus on master’s level students; core authorship depends heavily on the scientific contribution. Alternative benefits: higher salary, ability to get involved in a large number of different interesting projects.

Collaboration: central place to share knowledge of methods, tools, evaluation. One place to deposit this information could be the BioinfoCoreWiki or the mailing list.

Post-analysis of short reads

What kind of questions are being asked, can they handle the data themselves, and is there any way to build reusable workflows? Experience seems to be that so far no question (beyond the assembly step) has come up twice.

Simon Andrews, Babraham Institute. Growth of data, but no additional people to handle short-seq. Huge, diverse range of data types (ChIP-seq, SNPs, mRNA, bisulfite, …) that all require different downstream methods. Not involved in the initial QC, but take over for mapping and additional QC, both to save time on subsequent analysis and to set expectations on what can be done with the data (and to build expertise on what went wrong during a run, and why).

Additional visualization, quantitative analysis and biological analysis. An iterative process that kills efficiency (please change the cutoff, different colours, just a slightly different view; sounds very familiar). Division of labour required, biologists need to work with and delve into their own data. Support with scripts, keep track of data being used across experiments, results by a junior person. Core facility to handle the management of primary data, develop software/glue, and only get involved in more complex problems that cannot be handled by the biologists.

One example of this is SeqMonk, to be used by scientists (and thus needs to be usable on desktop machines, hence handles mapped data only). A tool to once again offload tasks back to the requester. Scan/visualize their data to understand it, generate reports. Experiment agnostic, as generic as possible. A number of other tools are also available from their website.

Core facility jobs:

  • Mapping (probably with base calling as well?)
  • Post-mapping QC
  • Filtering (remove biases, PCR artifacts)
  • Construct standard workflows (no consensus yet)
  • Customized analysis (when unavoidable)
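As an illustration of the filtering step, a minimal sketch of one common heuristic for PCR artifacts: treat reads that map to the same chromosome, start position and strand as duplicates and keep only the first. This is an assumption for illustration, not necessarily the method the speakers use; the read records and the `remove_duplicates` helper are hypothetical:

```python
def remove_duplicates(reads):
    """Keep the first read seen at each (chromosome, start, strand) position."""
    seen = set()
    kept = []
    for read in reads:
        key = (read["chrom"], read["start"], read["strand"])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


# Hypothetical mapped reads; r2 maps to the same position and strand as r1.
reads = [
    {"id": "r1", "chrom": "chr1", "start": 100, "strand": "+"},
    {"id": "r2", "chrom": "chr1", "start": 100, "strand": "+"},
    {"id": "r3", "chrom": "chr1", "start": 100, "strand": "-"},
]

unique = remove_duplicates(reads)
print([read["id"] for read in unique])
```

Whether position-based duplicate removal is appropriate depends on the experiment: it can discard real signal in very deeply sequenced regions.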

Get involved in the design stage, provide realistic quotes. Allow lead time to evaluate the correct algorithm and methods.

Take from Partners Core: 4 Solexas deployed across Partners’ institutions, about 3TB/week/instrument. Pre-analysis pipeline deployed by ERIS within an HPC environment, regular tests of new base callers and assemblers by the Core. In 2009, 283 alignments and 145 raw data sets were handled by a single bioinformaticist; all that can be done is care for and feed the pipeline and return alignments. All downstream analysis is done by investigators.

Most users are only just getting started (likely further increase in data). Huge number of applications (see Nat Biotech paper from 2008), focus on genome re-sequencing, small RNA, SAGE, ChIP-seq. About 20 different analysis tools being tested (Eland, MAQ, MOSAIK, RMAP, SHRiMP, SOAP, SSA, Velvet, PyroBayes etc). Usually come back to basics:

  • Eland, Cross-Match (alignment)
  • MAQ and Vaal (alignment and variant detection)

Post-analysis further down the line: GenomeQuest, Genomatix, CLC bio tools as pay services, mostly in the testing phase. Most important aspects: development of the study with investigators, QC and high quality alignments. Looking into Galaxy as a framework for standard workflow development. Can be a technical challenge when it comes to scaling.

Additional challenges include:

  • alignment process (speed vs quality), need to evaluate each tool for a specific purpose. (That is going to be our main problem as well, does not scale at all)
  • lack of standards (comparison across platforms, versions, alignments). Best current way is to convert all to phred-like scores
  • the right algorithm (no benchmarks, comparison methods. Best practice is experimentation)
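For the conversion to phred-like scores, a minimal sketch assuming the input is on the Solexa quality scale, which is based on the odds of an error, p/(1-p), rather than on the error probability p itself. The notes do not say which scales were involved, so that choice and the function names are assumptions:

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled quality score to the Phred scale.

    Solexa: Q = -10 * log10(p / (1 - p))
    Phred:  Q = -10 * log10(p)
    """
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def phred_to_error_probability(q_phred):
    """Return the error probability p implied by a Phred score."""
    return 10 ** (-q_phred / 10)


for q in (-5, 0, 10, 40):
    phred = solexa_to_phred(q)
    print(q, round(phred, 2), round(phred_to_error_probability(phred), 4))
```

The two scales differ mostly for low quality values and are nearly identical for high ones, so the conversion matters most for the poor-quality ends of reads.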

Added from the audience: additional formats (not converging yet). Cooperate with sequencing center that ensures a bioinformatics consult has taken place prior to the samples being run (going to be difficult at Harvard, too de-centralized. Still worth trying to make contacts).

Try to compile information and discussions from the SEQanswers forum. Usually pass only coordinates and processed data on to collaborators, handling of data standards is a problem of the Core.

Data storage: keep processed data, discard raw data (Partners: about 300TB of storage). Can be a problem as base callers are improving. Very few projects warrant going back, though, and it is difficult enough to keep up with new data. In general no charge for storage.
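A back-of-the-envelope calculation using the Partners figures quoted in these notes (4 instruments, about 3TB/week/instrument, about 300TB of storage) shows why keeping everything is not an option. It assumes, purely for illustration, that the full instrument output would be kept and that the storage is used for nothing else:

```python
# Figures from the talk; treated as round numbers.
INSTRUMENTS = 4
TB_PER_WEEK_PER_INSTRUMENT = 3
STORAGE_TB = 300

# Assumption: all instrument output is kept, nothing is deleted or compressed.
weekly_tb = INSTRUMENTS * TB_PER_WEEK_PER_INSTRUMENT
weeks_until_full = STORAGE_TB / weekly_tb

print(f"{weekly_tb} TB/week, storage full after {weeks_until_full:.0f} weeks")
```

Under these assumptions the storage would fill up in roughly half a year.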

Oliver Hofmann
