This is a joint session with the BOSC SIG and we are switching to a 15 minute talk format.
Morris Swertz (Groningen) on MOLGENIS, an extensible platform for genotype and phenotype experiments
A database generator tool written in Java with a SOAP/R interface. Distinguishes between specialized parts and generalizable subsets that remain constant. Generates MySQL/PostgreSQL databases dynamically. Interesting for the Eclipse folks, way out of my comfort zone though.
XGAP (Xtensible Genotype And Phenotype) as a simple data model and interchange format for genetics experiments. Project links out to R, Taverna and of course MOLGENIS.
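To make the "generate a database from a model" idea concrete, here is a minimal sketch in Python. It is not MOLGENIS code (which is Java and targets MySQL/PostgreSQL). The model dictionary, the table and column names, and the use of in-memory SQLite are all assumptions made purely for illustration.

```python
import sqlite3

# Hypothetical model description: entity name -> {field name: SQL type}.
MODEL = {
    "individual": {"id": "INTEGER PRIMARY KEY", "name": "TEXT NOT NULL"},
    "measurement": {
        "id": "INTEGER PRIMARY KEY",
        "individual_id": "INTEGER NOT NULL REFERENCES individual(id)",
        "trait": "TEXT NOT NULL",
        "value": "REAL",
    },
}


def generate_ddl(model):
    """Turn the model dict into a list of CREATE TABLE statements."""
    statements = []
    for table, fields in model.items():
        columns = ", ".join(f"{name} {sql_type}" for name, sql_type in fields.items())
        statements.append(f"CREATE TABLE {table} ({columns})")
    return statements


def build_database(model):
    """Create an in-memory SQLite database from the model."""
    conn = sqlite3.connect(":memory:")
    for statement in generate_ddl(model):
        conn.execute(statement)
    return conn


conn = build_database(MODEL)
conn.execute("INSERT INTO individual (id, name) VALUES (1, 'mouse_01')")
conn.execute(
    "INSERT INTO measurement (individual_id, trait, value) VALUES (1, 'weight_g', 21.5)"
)
rows = conn.execute(
    "SELECT i.name, m.trait, m.value FROM measurement m "
    "JOIN individual i ON i.id = m.individual_id"
).fetchall()
print(rows)
```

The point of the sketch is only that the experiment-specific part lives in the model description, while the generator code stays constant.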
Robert Murphy (Carnegie Mellon), the Protein Subcellular Location Image DB (PSLID)
Using machine learning techniques to analyze the subcellular localization of proteins. Different cells have different shapes, sizes and orientations so comparison-based approaches won’t work. Need feature-based approaches.
SLIC is open source (Matlab, Python, C++) and includes tools for segmentation, feature calculation, ML tools. Granularity can be handled in 2D/3D and in static samples and time series. Example shows better annotation results than those obtained by curators.
Mixture models are required to handle combinations of different basic patterns (20% in ER, 80% in cytoplasm). Objects can be shared, PUnMix learns to unmix an image based on instances of previous patterns; uses clustered feature objects. Tested by creating real images based on mixed probes localizing to different subcellular compartments.
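A toy illustration of the unmixing idea, assuming each basic pattern is summarized as a vector of object-type frequencies and that a mixed image is a linear combination of two such vectors. This is a plain least-squares sketch for the two-pattern case, not the PUnMix algorithm itself, and the frequency values are made up.

```python
def unmix_two_patterns(mixed, pattern_a, pattern_b):
    """Least-squares estimate of the fraction of pattern_a in `mixed`.

    Model: mixed ~ f * pattern_a + (1 - f) * pattern_b, with 0 <= f <= 1.
    """
    diff = [a - b for a, b in zip(pattern_a, pattern_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("patterns are identical; fraction is not identifiable")
    numer = sum((m - b) * d for m, b, d in zip(mixed, pattern_b, diff))
    # Clamp to the valid range, since noise can push the estimate outside it.
    return min(1.0, max(0.0, numer / denom))


# Hypothetical object-type frequencies (fraction of objects per cluster).
er_pattern = [0.7, 0.2, 0.1]
cytoplasm_pattern = [0.1, 0.3, 0.6]

# Synthetic mixture: 20% ER, 80% cytoplasm.
mixed = [0.2 * e + 0.8 * c for e, c in zip(er_pattern, cytoplasm_pattern)]

er_fraction = unmix_two_patterns(mixed, er_pattern, cytoplasm_pattern)
print(f"ER: {er_fraction:.2f}, cytoplasm: {1 - er_fraction:.2f}")
```

With more than two basic patterns the same model would need a constrained multi-variable fit rather than this closed form.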
SLML tools build model instances from an image collection and, vice versa, generate images from model instances to communicate information on subcellular localization. The model serves as a compressed representation of the distribution description.
The last piece of the system is PSLID, which creates a database and web interface automatically from sets of images, i.e., a workflow set for a given experiment. Lots of annotated data sets are available via their website.
Mark Welsh (Geospiza.com) on Open Binary file formats for large-scale data management
Description of the BioHDF project building on the existing HDF5 hierarchical data format, focus on next-gen sequencing. Three phases:
- Primary data acquisition
- Per lane/sample data analysis (mapping, aligning)
- Tertiary analysis across many samples (SNP calls, differential expression)
Secondary analysis is complex, yet all data sets need to be kept in memory for the tertiary analysis. Tries to solve the complexity with a consistent data model and avoid redundant data processing. Additional difficulty: requirements to drill down to the original underlying read data based on tertiary analysis results.
Aim is to generate domain-specific HDF5 extensions to move away from a flat-file format. HDF is a toolkit to build novel binary file formats, including additional command-line tools, format annotations and so on. The project is in its early stages and soliciting feedback. Incremental data storage allows for changes in the queries without having to re-run early processing steps. Project description eventually will be on the HDF site.
Brad Chapman (biopython.org, MGH) — lowering barriers to publishing biological data
Reusable libraries help deal with data sharing / parsing problems (PyGr, BioPython). Similar approach to databases (BioSQL, Chado), web applications (Galaxy, GBrowse). Similar tools for web interfaces to speed up the process from idea to usable interfaces able to handle larger datasets.
- Provides reusable presentation components, quickly deployable frameworks
- Requires integration of libraries, schemas
A (simple) example of this: biosqlweb.appspot.com on the Google App Engine, a cloud computing framework; utilizes a number of existing frameworks like pylons, jQuery, BioPython and others on top of the Google framework.
Kam Dahlquist (Loyola Marymount) on XMLPipeDB to build relational databases from XML sources
Motivation: Maintenance and updates of GenMAPP, usually following changes of identifiers and new data / novel model organisms. Database generation needs to be robust to changes in source file formats (using a single data source, UniProt in this case).
- XSD-to-DB based on Hyperjaxb2 reads an XSD or DTD; nominal post-processing for UniProtDB and GoDB to ensure datatypes are supported in SQL, filtering reserved words
- XMLPipeDB as a reusable utility, breaks up XML files into record chunks for import due to the data size, additional TallyEngine to check XML/DB consistency
- Downstream application GenMAPP Builder that formats the data for GenMAPP
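The record-chunking and tally steps in the list above can be sketched with Python's standard library. This is not XMLPipeDB code (which is Java). The element names, the sample XML and the dictionary standing in for the database are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a large XML export; element names are hypothetical.
SAMPLE_XML = b"""<records>
  <entry id="P1"><name>alpha</name></entry>
  <entry id="P2"><name>beta</name></entry>
  <entry id="P3"><name>gamma</name></entry>
</records>"""


def iter_records(source, tag):
    """Yield one parsed record at a time instead of loading the whole file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {"id": elem.get("id"), "name": elem.findtext("name")}
            elem.clear()  # drop the record's children once it has been handled


def count_records(source, tag):
    """Independent tally of matching records in the XML source."""
    return sum(
        1
        for _event, elem in ET.iterparse(source, events=("end",))
        if elem.tag == tag
    )


database = {}  # stand-in for the relational database
for record in iter_records(io.BytesIO(SAMPLE_XML), "entry"):
    database[record["id"]] = record["name"]

# Tally check: the number of records in the XML should match the rows imported.
xml_count = count_records(io.BytesIO(SAMPLE_XML), "entry")
db_count = len(database)
print(f"XML records: {xml_count}, DB rows: {db_count}, consistent: {xml_count == db_count}")
```

Counting the source records in a separate pass keeps the consistency check independent of the import code, which seems to be the spirit of the tally step.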
Turns out to be very robust, requiring only minimal updates, and enabled undergrad students to build model organism databases for use in GenMAPP.
That’s it
This concludes the DAM SIG. Lots of discussion on FriendFeed; a master thread that links to the individual talks can be found in the ISMB 2009 DAM thread.