Keynote: Daphne Koller – Individual Genetic Variation: From Networks to Mechanisms
Understanding gene regulation: from networks to mechanisms — some new results caused a slight change of topics. RNA degradation mechanism, DNA modifications, endogenous and exogenous perturbations all important in gene regulation. Aims:
- Inferring regulatory networks
- Identify effect of perturbations
- Result on phenotype
Regulatory networks for expression
mRNA level of regulator an (imprecise) indicator of regulator activity. Target expression partially predicted by expression of regulators. Broad view of a regulator gene: TFs, signal transduction proteins, RNA processing factors, anything that might play a direct or indirect role in gene regulation. Second assumption: co-regulated genes have similar regulatory mechanisms, group genes into modules and predict expression profile for the entire module
Structure of the regulatory program (Segal 2003 Nat Genetics): notion of a regression tree for regulatory programs. Boolean-logic style description that is easy to understand, but has disadvantages: poor regulator selection lower in the tree, many regulators missed due to lack of statistical power, and an arbitrary choice between correlated regulators.
Move to regulation in a linear regression model to identify activating / repressing effect of regulators on a module. Usually hundreds of regulators with an impact on the module. Lasso (L1) regression, constant ‘drive’ towards zero, induces sparsity in the solution. Also computationally more efficient — but does not get around the arbitrary regulator choice.
Elastic net regression to avoid arbitrary regulator / feature choice. Cluster genes to modules, learn regulatory program for module, repeat for all modules, iterate after re-assignment of genes to modules based on how well a program predicts the expression of a gene in the module (in essence a Bayesian network, multiple genes share same program).
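To make the lasso vs. elastic net point concrete, here is a minimal sketch (not the speaker's implementation) of coordinate descent for a penalized linear regulatory program. The toy data, the penalty values `l1` and `l2`, and the function names are assumptions for illustration only. Two of the three hypothetical regulators are nearly identical, which is the situation where the lasso's choice between them is arbitrary.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t (the constant L1 'drive' towards zero)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net(X, y, l1=0.1, l2=0.0, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xw||^2 + l1*||w||_1 + (l2/2)*||w||^2.
    Setting l2 = 0 gives the plain lasso."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # residual with regulator j's current contribution added back
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / n
            w[j] = soft_threshold(rho, l1) / (col_sq[j] + l2)
    return w

# Toy data (assumed): regulators 0 and 1 are almost the same signal,
# regulator 2 is unrelated to the module expression y.
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
X = np.column_stack([base + 0.01 * rng.normal(size=n),
                     base + 0.01 * rng.normal(size=n),
                     rng.normal(size=n)])
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=n)

w_lasso = elastic_net(X, y, l1=0.1, l2=0.0)
w_enet = elastic_net(X, y, l1=0.1, l2=1.0)
print("lasso      :", np.round(w_lasso, 3))
print("elastic net:", np.round(w_enet, 3))
```

With the extra L2 term the weight is shared almost evenly between the two correlated regulators, whereas the lasso split depends on update order and noise.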
Genetic variation and regulation
Application of this approach to genotype to phenotype analysis. Data set eQTL (Brem 2002 Science), two different yeast strains, 112 individuals with array and SNP data. Adapt the regulatory network approach by including the genotype / markers. How do markers affect the expression level of a given module?
Sample results:
- Telomere module (40 out of 42 in telomere region): most dominant regulator a region on chromosome XII which includes Rif2 regulator
- Chromatin modules (4 out of 5 genes are consecutive): Sir1-containing region is the strongest controlling factor; additional modules controlled by regions with previously unknown Sir1 homologs
Evolutionary strategy to group target genes, identified 16 chromatin regulator regions
- Puf3 module (147/153 targets of a Puf3 pulldown), sequence specific mRNA binding protein regulating degradation.
But Puf3 is not correlated with the module — P-bodies are (translational repression of mRNA stored in the P-body). Dhh1 regulates mRNA de-capping, Puf3 predicted (and confirmed) to be involved in localization to the P-body. Primary regulator of Puf3/the regulators is a large region on chr14 with 30 genes. Trying to identify the ‘regulatory potential’:
- not all SNPs equally likely to be causal. Can create a set of features to select (coding, conservation, non-synonymous). How do you weight these features?
- use Bayesian L1-regularization, prior a Laplacian distribution, each regulator with its own prior determined by regulatory features of the regulator
- for each module, the properties of the regulators allow regulators to be selected in a more informed (biased) way
Metaprior method: bootstrap to learn the regulatory program, learn regulatory weights (e.g., stop codons get a higher weight), compute the potential for each SNP in the genome to bias the regulatory potential of each regulator, iterate until convergence (empirical hierarchical Bayes). Regulatory potentials do not change the selection of strong regulators, but help to disambiguate between multiple weak regulators. Strong regulators teach us what to look for in the putative weak regulators. In this case, the learned feature ranking / significance:
- strongest feature is the stop codon
- cis-regulation (SNP and adjacent gene)
- conservation
- different combination of gene functions (protein binding, glucose process, RNA modification)
Statistical evaluation using PGV, the percent of genetic variation explained by the predicted regulatory program for each gene. Explained about 50% variation for half of the genes.
Used the approach to find causal regulators in 13 chromosomal hotspots, including the P-body region, with good results. Learning regulatory priors that are specific to a data set and organism, can handle any kind of regulator type.
Regulation in the context of cell differentiation
Understanding the process underlying differentiation with the ImmGen consortium, 200 arrays from 63 immune cell types. Can identify shared regulatory programs for all 60 samples. Does not have the G-regulators from the eQTL study. One network for each cell type overfits the data, but can bias towards shared regulation. Use the ontogeny to guide conserved regulation. Penalize changes in the regulatory program.
Test data prediction on six cell types, predict on one array (leave one out), the soft, lineage-aware model provides better accuracy. Identified novel member (JARID1B) as a candidate regulator.
Phenotype
How does expression change cause changes in phenotype, and what regulatory programs cause them? FL to DLBCL (Diffuse large B cell lymphoma) transformation in patients. Represent each module as a metagene, use ML technique to learn classifiers to distinguish FL-t (pre-transformed) from DLBCL (to appear in Blood). Module of interest in embryonic development (ESC1/2/TGF-beta), also a good predictor of survival. Potentially therapeutic implications based on a connectivity analysis, identify drugs likely to interfere with key genes in the module.
Metabolic syndrome in mouse experiment (300+ animals) using a high-fat diet, samples from four tissues. Phenotype network can use modules from different tissues. Interesting module includes liver biosynthesis, in-between insulin and cardiovascular disease traits.
Gregory Kryukov (Brigham) – Learning from Resequencing Data: What To Do When the $1000 Genome Arrives?
Mendelian diseases can be characterized by linkage analysis (classical association studies). New sequencing technologies, ways to collect clinical populations and exon capturing approaches can revolutionize the search for genes underlying human phenotypes. All genes have rare coding variants, and while sequencing will uncover low frequency variants the power to detect their associations is reduced and the multiple testing correction becomes very stringent. One option: combine all non-synonymous variants into a single test
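The "single test" idea can be sketched as a collapsing (burden) score: count each individual's rare non-synonymous alleles in a gene and test that one number against the phenotype. The sketch below is an illustration under assumptions, not the speaker's method: the genotype matrix, the frequency cutoff `max_freq`, and the function names are all made up for the example.

```python
import numpy as np

def burden_scores(genotypes, is_nonsyn, max_freq=0.1):
    """genotypes: (individuals x variants) minor-allele counts (0/1/2).
    Returns, per individual, the number of rare non-synonymous alleles."""
    freq = genotypes.mean(axis=0) / 2.0
    keep = is_nonsyn & (freq <= max_freq)
    return genotypes[:, keep].sum(axis=1)

def burden_effect(scores, phenotype):
    """Least-squares slope of phenotype on burden score, plus the correlation."""
    s = scores - scores.mean()
    p = phenotype - phenotype.mean()
    slope = (s @ p) / (s @ s)
    r = (s @ p) / np.sqrt((s @ s) * (p @ p))
    return slope, r

# Toy data (assumed): 6 individuals, 3 variants in one gene.
# Variants 0 and 1 are rare, variant 2 is common and gets filtered out.
genotypes = np.array([[1, 0, 0],
                      [0, 1, 0],
                      [0, 0, 2],
                      [0, 0, 2],
                      [0, 0, 0],
                      [0, 0, 1]])
is_nonsyn = np.array([True, True, True])
phenotype = np.array([2.1, 1.9, 0.1, -0.1, 0.0, 0.0])

scores = burden_scores(genotypes, is_nonsyn)
slope, r = burden_effect(scores, phenotype)
print(scores, round(slope, 3), round(r, 3))
```

Collapsing gives one test per gene instead of one per variant, which is what keeps the multiple testing burden at the gene level.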
Sequence all exons from a population, characterize variations at the extreme. Theory:
- most new mis-sense mutations are functional
- are only weakly deleterious
- are likely to influence phenotype in the same direction
58 genes from 768 individuals (existing population sequencing data), estimate parameters of demographic history (from non-coding variation data), simulate genotypes, simulate phenotypes (quantitative traits), simulate a sequencing study to estimate the power
Demographic history of 370 generations agrees with experimental data, add natural selection (quarter of mutations neutral, majority at around 10^-3 selection coefficient).
For 20,000 genes we need a p-value of 2×10^-6 to associate mutations with phenotype. A small sample of 1,000 individuals has 75% power to detect genes which shift the phenotype by 2 standard deviations. With sample sizes approaching 10,000 individuals, re-sequencing based association studies are feasible (with phenotype information from 100k individuals).
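The genome-wide threshold is roughly what a Bonferroni correction over one test per gene gives. A quick check, assuming a family-wise error rate of 0.05 (the talk notes do not state which alpha or correction was used):

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

# One collapsed test per gene, about 20,000 genes (alpha = 0.05 is an assumption).
threshold = bonferroni_threshold(0.05, 20_000)
print(f"{threshold:.1e}")
```

This lands at 2.5×10^-6, the same order of magnitude as the figure quoted in the talk.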
Computational predictions of damaging mutations will help, as does multistage design.
Mikhail Zaslavskiy (Ecole des Mines de Paris) – Global Alignment of Protein-Protein Interaction Networks by Graph Matching Methods
Motivation: automatic identification of protein functional orthologs. Standard approaches like reciprocal best BLAST hits have problems when several top hits have similar scores: which pair to choose? Additional information helps to resolve ambiguity.
Protein clustering based on BLAST similarity clusters (InParanoid): only proteins in the same cluster can be annotated as functional orthologs.
PPI networks (the usual hairball) can also be used to resolve this. Ortholog assignments that conserve PPI interactions are ranked higher. Identify the mapping that maximizes the number of overlapping (conserved) interactions.
- Constrained alignment: InParanoid approach (Ideker 2006, MRF model). New approach here is the Message Passing (MP) Algorithm. MP is based on a forward-backward recursion, each node represents a protein cluster. Provides the maximum number of conserved interactions if the MP graph is a tree.
- Balanced alignment: find the alignment that maximizes the number of conserved interactions and the sum of all BLAST similarity scores of the pairs (Singh 2008), here with graph matching algorithms. Balanced alignment using the gradient ascent and path algorithm to solve the graph matching.
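The two objectives can be written down in a few lines. The sketch below only illustrates what is being optimized; it uses exhaustive search over the candidates in each cluster, not the message passing or graph matching algorithms from the talk. Protein names, the `lam` trade-off weight, and the function names are hypothetical.

```python
from itertools import product

def conserved_interactions(edges_a, edges_b, mapping):
    """Count edges (u, v) in network A whose images under the mapping are an edge in B."""
    b = {frozenset(e) for e in edges_b}
    return sum(1 for u, v in edges_a
               if u in mapping and v in mapping
               and frozenset((mapping[u], mapping[v])) in b)

def balanced_score(edges_a, edges_b, mapping, similarity, lam=0.5):
    """Weighted sum of conserved interactions and sequence similarity of mapped pairs."""
    seq = sum(similarity.get(pair, 0.0) for pair in mapping.items())
    return lam * conserved_interactions(edges_a, edges_b, mapping) + (1 - lam) * seq

def best_constrained_alignment(edges_a, edges_b, candidates):
    """Brute force: try every one-to-one choice of partner within each cluster."""
    proteins = sorted(candidates)
    best_map, best = None, -1
    for choice in product(*(candidates[p] for p in proteins)):
        if len(set(choice)) < len(choice):
            continue  # keep the mapping one-to-one
        mapping = dict(zip(proteins, choice))
        score = conserved_interactions(edges_a, edges_b, mapping)
        if score > best:
            best_map, best = mapping, score
    return best_map, best

# Toy example (assumed): a2 has two equally good sequence matches, b2 and b2p.
edges_a = [("a1", "a2"), ("a2", "a3")]
edges_b = [("b1", "b2"), ("b2", "b3"), ("b1", "b2p")]
candidates = {"a1": ["b1"], "a2": ["b2", "b2p"], "a3": ["b3"]}
best_map, best = best_constrained_alignment(edges_a, edges_b, candidates)
print(best_map, best)
```

Exhaustive search is only workable for tiny clusters; it is used here because it makes the objective explicit.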
PPI networks and InParanoid clusters from the Ideker paper, 2244 clusters (1552 with only 2 proteins / orthologs, 692 ambiguous clusters that need to be resolved). No cycles in this graph means constrained alignment approach can be used, results in 238 conserved interactions (run time of 1-2 seconds). Balanced alignment does not use the InParanoid clusters, recovers the highest number of conserved interactions when compared to PATH, IsoRank.
Constrained algorithm with message passing is an exact solution, graph matching algorithms deliver good performance for balanced alignment problems.
Jose Caldas (CIT, Helsinki) – Probabilistic retrieval and Visualization of Biologically Relevant Microarray Experiments
Trying to find a method to relate results in large array databases based on expression information rather than annotation. A need to retrieve all ‘related’ experiments from a database using the data, not the text. Standard approaches like Spearman correlation coefficients exist, but it would be interesting to use sets of experiments as a query rather than a single array.
Query with a binary phenotype comparison and try to get back other, similar comparisons. Requires encoding of the phenotype comparison such as a vector of t-tests (0/1 vector for differentially expressed genes). This can be noisy, so use a vector of differential GSEA results instead: fewer features to compare, and additional biological information is available for the sets.
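For reference, the single-array baseline mentioned here can be sketched as a rank-correlation search. This is a minimal illustration with made-up profiles and names; ties are broken by order of appearance rather than averaged, which a real implementation would handle properly.

```python
import numpy as np

def ranks(x):
    """Ranks 0..n-1; ties broken by order of appearance (no averaging)."""
    order = np.argsort(x, kind="stable")
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return np.corrcoef(ranks(x), ranks(y))[0, 1]

def retrieve(query, database, top_k=2):
    """Return the names of the top_k database profiles most similar to the query."""
    scores = {name: spearman(query, profile) for name, profile in database.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy expression profiles over five genes (assumed data).
query = [1.0, 2.0, 3.0, 4.0, 5.0]
database = {
    "same_trend": [10.0, 20.0, 30.0, 40.0, 50.0],
    "noisy": [1.0, 3.0, 2.0, 5.0, 4.0],
    "reversed": [5.0, 4.0, 3.0, 2.0, 1.0],
}
print(retrieve(query, database))
```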
Uses standard GSEA, number of genes in leading edge as vector value, ignoring the directionality. Uses LDA (Latent Dirichlet Allocation, used in bag-of-words approaches) rather than standard vector comparisons (combines gene sets into ‘topics’).
750+ binary phenotype comparisons from 288 experiments, focus on 105 comparisons for this analysis. Gene sets assigned to topics are coherent across a wide range of biological processes. Graph visualization allows exploration starting from phenotype pairs, gene sets or topics.
Retrieval performance better than random, but seems to have low-to-moderate recall for reasonable precision numbers (?)
Sven Nelander – Models from experiments: combinatorial perturbations of cancer cells
What happens when you change more than one factor in a cell? E.g., paired perturbation screens. Here: cancer patients, based on Nelander, Mol Syst Biol 2008
Defining feature of cancer cells: breakdown of regulatory systems. Pharmacology frequently targets the function of proteins in these pathways. Hope is to selectively counteract specific mutations (particularly gain of function mutations that can be inhibited). In many cases a tumour contains a whole set of mutations, so computational methods are needed to predict how it will react to a given drug.
Approaches to combinatorial perturbations:
- Gene-gene interaction / epistasis
- Drug-drug interaction
- Genomic interaction screens
- Flux models
- Now: “systems biology”
Drugs as input to the system, molecular / phenotype response as output, search for models recapitulating the experiment. Use the models to predict new potential interventions. Sample experiment from MCF7 cells. Used a continuous Hopfield network (neural network which allows feedback loops), trade-off between model fit and simplicity to construct model from the data. MC simulation results in probabilities of functional interaction based on the frequency with which interactions were observed during the simulations.
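A rough sketch of what such a model looks like, assuming the common continuous Hopfield-style form dx_i/dt = beta_i * tanh(sum_j W_ij x_j + u_i) - alpha_i x_i, where u is the drug perturbation. The exact equations, the parameter values and the two-node network below are assumptions, not taken from the talk.

```python
import numpy as np

def simulate(W, u, alpha, beta, dt=0.01, steps=5000):
    """Euler integration of dx/dt = beta * tanh(W @ x + u) - alpha * x from x = 0."""
    x = np.zeros(len(u))
    for _ in range(steps):
        dx = beta * np.tanh(W @ x + u) - alpha * x
        x = x + dt * dx
    return x

# Toy network (assumed): node 0 activates node 1; W[i, j] is the effect of j on i.
W = np.array([[0.0, 0.0],
              [1.0, 0.0]])
alpha = np.ones(2)   # decay rates
beta = np.ones(2)    # response amplitudes

x_ctrl = simulate(W, u=np.array([0.0, 0.0]), alpha=alpha, beta=beta)
x_drug = simulate(W, u=np.array([-1.0, 0.0]), alpha=alpha, beta=beta)  # inhibit node 0
print(np.round(x_ctrl, 3), np.round(x_drug, 3))
```

Inhibiting node 0 propagates to node 1 through the interaction matrix. Fitting W to observed perturbation responses, with a penalty on the number of non-zero entries, would be the model-construction step described above.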
Functional interpretation of the top pathways / network recapitulates MAPK cascade of the EGF receptor, PI3K-dependent AKT control, mTOR signaling. Experimental verification in the Sander lab. Cross-validation with leave-one-out experiments does “rather well”.
Current work: characterize human tumour samples. Input includes point mutations, CNV, gene fusions, DNA methylation (analogous to drug input), phenotype output are changes in expression measurements. Adapted CoPIA to cancer genomics data, the perturbation is gene dosage (gain/loss of DNA) with direct effect on transcription levels. Model also captures indirect regulatory effects. Reduced the problem to a linear summary model, derived from the S-system (1969).
- each tumour at a transcriptional steady state
- same model applies to all patients
- only try to explain differences between patients
Bootstrapped Lasso (Francis Bach 2008) and CoPIA approach (only effective up to 20 genes). Glioblastoma CoPIA model based on profiles from 200 tumours shows EGFR as a pleiotropic regulator along with new testable predictions (GPCR, Necdin, others with strong neural expression). NDN over-expression in U343 glioma cells slows growth of cells in a dose-dependent manner. In four out of six cases the experimental validation matches the CoPIA model predictions.
Anton Nekrutenko (Penn State) – Galaxy Library System for Management of Next Generation Sequencing Data
How to handle large data sets in tools with complex interfaces? Galaxy as a software framework. Instance is a piece of hardware running the framework, talk is focused on the software part.
NGS data must be served in an (immediately) useful form, ideally with a proliferation of ‘best practice’ workflows and practices. Showcase using the Penn State (modified) instance. Sequencing center provides sequencing run information as Galaxy libraries, pre-loaded on the server. Data can either be downloaded for offline / individual processing, or alternatively processed within the Galaxy framework. Includes:
- statistical analysis of data quality
- convert to graphical representation for easier evaluation
- align to target genome (usually also pre-loaded on the server)
- SNP calls, along with visualization of SNPs in a genome-browser view
[Brief demonstration of the mobile-optimized version of the site that shows job histories and status, just in case you want to check during lunch.]
Switch of speakers, emphasis on tool integration. Quick (if unreadable from the back of the room) walkthrough of a tool integration process; takes about 2-3 minutes to write up an XML binding for a command line program.
Igor Ulitsky (Tel Aviv) – Regulatory networks define phenotypic classes of human stem cell lines
How to identify differences between pluripotent and multipotent cell types? Should allow a rapid analysis of new cell lines and distinguish pluripotency from self-renewal and survival, increase the safety of cell-replacement therapies and direct differentiation in better ways.
mRNA, miRNA, DNA methylation, genomic and chromatin structure. Focus here on mRNA and miRNA in 200 stem cell-related samples followed by unbiased clustering and identification of mechanisms using a protein interaction network.
mRNA
Non-negative matrix factorization to reduce dimensionality (metagenes vs experiments), repeat from multiple starting points and observe which samples are frequently co-clustered (i.e., robust). Ended up with k=12 clusters, most stem cell types group as expected (hESC and iPSCs, NSC split into coherent but distinct clusters). Used MATISSE which identifies sets of genes / modules that form connected subgraphs in the PPI network and have highly correlated expression patterns (and are up/down-regulated in specific NMF clusters with respect to other clusters, a new extension). Yielded the PluriNet network described in the original Nature paper.
Double-checked the PluriNet expression patterns across stem cell and differentiated cells with good results; PluriNet can be used as a cell type classifier. See http://www.openstemcellwiki.org/ for more details.
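The "repeat from multiple starting points" step can be sketched as consensus clustering on top of a basic NMF. This is a minimal illustration using Lee-Seung multiplicative updates on a made-up block-structured matrix; the update rule, the number of runs and the toy data are assumptions, not details from the talk.

```python
import numpy as np

def nmf(V, k, rng, n_iter=300, eps=1e-9):
    """Multiplicative updates for V ~ W @ H with non-negative W (genes x k), H (k x samples)."""
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def consensus_matrix(V, k, n_runs=20, seed=0):
    """Fraction of random restarts in which each pair of samples lands in the same cluster."""
    rng = np.random.default_rng(seed)
    m = V.shape[1]
    C = np.zeros((m, m))
    for _ in range(n_runs):
        _, H = nmf(V, k, rng)
        labels = H.argmax(axis=0)  # assign each sample to its dominant metagene
        C += labels[:, None] == labels[None, :]
    return C / n_runs

# Toy data (assumed): 8 genes x 6 samples, two clean groups of samples.
V = np.zeros((8, 6))
V[:4, :3] = 1.0
V[4:, 3:] = 1.0
C = consensus_matrix(V, k=2)
print(np.round(C, 2))
```

Pairs of samples with consensus values near 1 are robustly co-clustered; values near 0.5 would indicate unstable assignments.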
miRNA
Quick overview of miRNAs (mechanism, specificity, seed sequences). Profiled miRNA expression in 26 cell lines including pluripotent and differentiated cell lines. This time a separation into six different groups with all differentiated cell lines being present in one robust cluster despite tissue heterogeneity. The relevant ESC miRNAs are clustered (and upregulated), with a large cluster of primate specific miRNAs on chr19.
miRNA sequences upregulated in ESC frequently have an AAGUGC seed sequence (p-value 10^-14), similar families in the early embryonic development in zebrafish and xenopus. Reverse complement of seed sequence over-represented in miRNAs downregulated in ESCs.
Detecting pathways that are co-regulated by miRNA (group of miRNAs and targets with targets forming a connected component that have similar expression in similar cell types and regulated by the same miRNA). Found 57 such modules, e.g., mir-16/17, which seem to regulate cell cycle progression. Additional example for miRNA associated with neurodegenerative disorders.
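A small sketch of how a seed enrichment like this can be checked, assuming the usual convention that the seed is nucleotides 2-7 of the mature miRNA and using a one-sided hypergeometric test. The sequences below are short illustrative toy strings, not a curated miRNA set, and the talk did not say which test produced the reported p-value.

```python
from math import comb

def seed(mirna):
    """Nucleotides 2-7 of the mature miRNA (0-based slice 1:7)."""
    return mirna[1:7]

def reverse_complement(rna):
    comp = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(comp[base] for base in reversed(rna))

def hypergeom_tail(N, K, n, k):
    """P(X >= k) when drawing n items without replacement from N, K of which are hits."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Toy sequences (assumed): the first three carry the AAGUGC seed.
all_mirnas = ["UAAGUGCUUCCAUG", "AAAGUGCAUCCUUU", "CAAGUGCCGUAAGG",
              "UGAGGUAGUAGGUU", "UAGCAGCACGUAAA", "UGGAAUGUAAAGAA",
              "UAGCUUAUCAGACU", "UUCAAGUAAUCCAG", "AACAUUCAACGCUG",
              "UGUAAACAUCCUCG"]
upregulated = all_mirnas[:3]

K = sum(seed(m) == "AAGUGC" for m in all_mirnas)
k = sum(seed(m) == "AAGUGC" for m in upregulated)
p_value = hypergeom_tail(len(all_mirnas), K, len(upregulated), k)
print(K, k, p_value, reverse_complement("AAGUGC"))
```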
Rui Jiang (MOE Key Lab) – Network modeling of human interactome and phenome
Prioritization of disease genes: Guilt by association learning methods — gene more likely to be causal if it shares properties with disease-associated genes. E.g., use PPI networks and examine distance or other graph properties for candidate genes with regards to known disease genes. Requires annotated and relevant disease (seed) genes, limiting the power of these methods. Scope limited to diseases for which we know causal genes.
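The distance-based flavour of guilt by association can be sketched with a breadth-first search from the known disease (seed) genes. Gene names and the network below are hypothetical, and ranking by shortest-path distance is just one of the graph properties mentioned.

```python
from collections import deque

def build_adjacency(edges):
    """Undirected PPI network as a dict of neighbour sets."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def distances_from_seeds(adj, seeds):
    """Multi-source BFS: shortest path length from any seed gene to every reachable gene."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for nb in adj.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def rank_candidates(adj, seeds, candidates):
    """Candidates closest to a known disease gene come first; unreachable ones last."""
    dist = distances_from_seeds(adj, seeds)
    return sorted(candidates, key=lambda g: dist.get(g, float("inf")))

# Toy network (assumed): S1 is a known disease gene, D-E is a disconnected pair.
adj = build_adjacency([("S1", "A"), ("A", "B"), ("B", "C"), ("D", "E")])
ranking = rank_candidates(adj, ["S1"], ["C", "A", "E", "B"])
print(ranking)
```

The dependence on seed genes is visible directly: with an empty seed list every candidate gets infinite distance and the ranking carries no information.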
Multi-layered network model that includes diseases, clinical traits, molecules and gene variation. Text mining of the human ‘phenome’ provides a similarity measure of all human diseases.
Linear regression model uses the gene proximity to define the disease similarity. CIPHER: calculate concordance score of gene to a phenotype of interest, used to rank multiple candidate genes
Evaluated using three different screens, method robust to noise in the input data (?). Case study using breast cancer data set to evaluate 22 OMIM candidate genes associated with breast cancer, 16 of which are in HPRD. Ranked high in a genome-wide scan (details got skipped too quickly to capture).
Used to generate a genetic landscape of human diseases; similar diseases tend to cluster together. Standard network alignment approaches (NetBlast) can be used to align the phenome and interactome to identify ‘bi-modules’, all of them enriched in a given disease category.
Keynote: Trey Ideker (UCSD) – New Challenges and Opportunities in Network Biology
All coverage over at Friendfeed