Bas Dutilh (UMC St Radboud): Increasing the coverage of a metapopulation consensus genome by iterative read mapping and assembly
Wild type as the average of the species. The aim is to obtain a consensus genome sequence of the community; as a related reference genome is available, they chose 32 bp Solexa reads for cost reasons. After Maq mapping, 65% of the genome is covered, with lots of gaps. Only 9% of the reads map to the reference genome at all.
First goal: lower the mapping stringency by mapping every read to its optimal position, then filter out the reads that are highly unlikely to come from this species. This uses BlastN with a high E-value cutoff and an alignment length cutoff of 20 nt. The assembly is a better representation of the community than the reference, so re-mapping the reads against the assembly should provide a better wild-type representation. The iterative mapping approach increases the coverage and reduces the zero-coverage regions.
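To make the iteration concrete, here is a minimal toy sketch in Python. It is not the pipeline from the talk: the ungapped brute-force matcher stands in for BlastN/Maq, the mismatch threshold stands in for the E-value and alignment-length filters, and all function names (`best_hit`, `consensus_round`, `iterate_consensus`) are made up for illustration.

```python
from collections import Counter


def best_hit(read, ref, max_mismatches):
    """Return (position, mismatches) of the best ungapped placement, or None.

    Brute-force stand-in for a real mapper; ties go to the leftmost position.
    """
    best = None
    for pos in range(len(ref) - len(read) + 1):
        window = ref[pos:pos + len(read)]
        mm = sum(1 for a, b in zip(read, window) if a != b)
        if mm <= max_mismatches and (best is None or mm < best[1]):
            best = (pos, mm)
    return best


def consensus_round(reads, ref, max_mismatches):
    """Map all reads to ref, then return (new consensus, covered fraction)."""
    piles = [Counter() for _ in ref]
    for read in reads:
        hit = best_hit(read, ref, max_mismatches)
        if hit is None:
            continue  # read is too different from the current reference
        pos, _ = hit
        for offset, base in enumerate(read):
            piles[pos + offset][base] += 1
    covered = sum(1 for pile in piles if pile)
    # Majority base where reads exist, otherwise keep the reference base.
    new_ref = "".join(
        pile.most_common(1)[0][0] if pile else ref_base
        for pile, ref_base in zip(piles, ref)
    )
    return new_ref, covered / len(ref)


def iterate_consensus(reads, ref, max_mismatches=1, max_rounds=10):
    """Re-map reads against each new consensus until it stops changing."""
    history = []  # covered fraction per round
    for _ in range(max_rounds):
        new_ref, coverage = consensus_round(reads, ref, max_mismatches)
        history.append(coverage)
        if new_ref == ref:
            break
        ref = new_ref
    return ref, history
```

Calling `iterate_consensus(reads, reference)` returns the final consensus and the covered fraction per round. Reads that fail the threshold against the original reference can map in a later round once overlapping reads have pulled the consensus closer to the sampled population, which is the effect described above.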
Convergence to the wild type was confirmed by contaminating the data with 50% reads from a different species; hardly any of the contaminants get incorporated during the iterative assembly. Proteomic validation uses MS/MS to identify which iteration explains most of the peptide peaks.
Reference-based assembly cannot incorporate large genomic changes. A hybrid approach (de novo assembly, mapping the de novo contigs to the reference, re-mapping to the chimera assembly) might address this.
Juliane Klein (Tuebingen) on LOCAS: a new low coverage assembler for short reads
Used in the 1001 Genomes Project, a re-sequencing effort for A. thaliana. Subregions of overlapping blocks are defined after the initial mapping to the reference genome, which allows local improvements, handling of small indels and integration of unmapped reads (‘left overs’ and unmapped mate reads).
LOCAS assigns unmapped mate reads to the same block as their mapped mate prior to subregion merging; left overs are mapped after subregion arrangement (?), potentially bridging polymorphic regions.
More details on the graph-based alignment approach in the paper.
Adam Kowalczyk (National ICT Australia): Poisson model significance for short-read concentrations
Exploit the digital nature of the data to improve statistical models. Skipped this talk, sorry to say.
Su Yeon Kim (Berkeley) on the design of association studies with pooled next-gen sequencing data
Identify the major/minor allele (here, minor means the rarer allele) associated with complex diseases using large-scale genotyping platforms in a large number of cases and controls (Wellcome Trust: 17,000 samples in total). Still, known variants explain at most 5% of the heritable variation of the disease (the rest is attributed to interactions, epigenetics, structural variation and other factors). The focus here is on rare SNPs.
The aim is to identify rare alleles with a minor allele frequency (MAF) below 5%, which requires a lot of individuals. Cost-effective strategies help to reduce the expenses involved. Examples include a focused approach on target regions of interest (all human exons). An alternative approach is to use pooled samples, as only the MAF across the population is needed.
Pooling adds another source of variation (pooling variance) on top of sequencing errors and mapping problems. An empirical study examined a pool of five DNA samples. Based on mutations unique to one of the five samples, the pooling variance can be estimated, and it turns out to be relatively low.
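A small Python sketch of the arithmetic behind this. It is not the estimator used in the study. It assumes diploid samples, equal DNA contribution per individual and a symmetric per-base sequencing error rate; the function names and the default error rate are made up for illustration.

```python
def pooled_allele_frequency(alt_reads, depth, error_rate=0.01):
    """Estimate the allele frequency in a pool from read counts at one site.

    Assumes a symmetric error model: a read shows the wrong allele with
    probability error_rate, so the expected alt-read fraction is
    f * (1 - e) + (1 - f) * e. Solving for f gives the correction below.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    if not 0 <= error_rate < 0.5:
        raise ValueError("error_rate must be in [0, 0.5)")
    observed = alt_reads / depth
    corrected = (observed - error_rate) / (1 - 2 * error_rate)
    # Clamp, since sampling noise can push the estimate outside [0, 1].
    return min(1.0, max(0.0, corrected))


def expected_singleton_frequency(pool_size, ploidy=2):
    """Expected frequency of a heterozygous mutation unique to one sample."""
    return 1 / (ploidy * pool_size)
```

Under these assumptions, a mutation unique to one sample in a pool of five should show up at a frequency of 1/10. The spread of `pooled_allele_frequency` values around that expectation across such sites, beyond what read sampling alone would produce, gives a rough handle on the pooling variance.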
Also focused on capturing just the exons, which introduces additional variation. The optimal design uses a two-stage approach with re-sequencing in the first stage; only the top SNPs from the initial sequence data are selected. Developed a likelihood ratio test (LRT) statistic that takes into account uncertainty in genotypes, the pooled structure and sequencing errors (a chi-square statistic is not appropriate for this). Details of the formula are in the publication.
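The formula itself was not shown in enough detail to reproduce here. As a rough illustration of what a likelihood ratio statistic on read counts looks like, here is a deliberately simplified Python sketch. It is not the statistic from the talk: it handles a symmetric sequencing error rate only and ignores genotype uncertainty and the pool structure. The function names and the default error rate are assumptions.

```python
import math


def _loglik(alt, depth, p):
    """Binomial log-likelihood of alt reads out of depth (constant dropped)."""
    ll = 0.0
    if alt > 0:
        ll += alt * math.log(p)
    if depth - alt > 0:
        ll += (depth - alt) * math.log(1 - p)
    return ll


def _mle_alt_read_prob(alt, depth, error_rate):
    """MLE of the alt-read probability, restricted to [e, 1 - e].

    With a symmetric error rate e the alt-read probability can never fall
    below e or rise above 1 - e, whatever the true allele frequency is.
    """
    return min(1 - error_rate, max(error_rate, alt / depth))


def lrt_statistic(case_alt, case_depth, ctrl_alt, ctrl_depth, error_rate=0.01):
    """2 * (log-lik with separate frequencies - log-lik with one shared)."""
    p_case = _mle_alt_read_prob(case_alt, case_depth, error_rate)
    p_ctrl = _mle_alt_read_prob(ctrl_alt, ctrl_depth, error_rate)
    p_null = _mle_alt_read_prob(
        case_alt + ctrl_alt, case_depth + ctrl_depth, error_rate
    )
    ll_alt = (_loglik(case_alt, case_depth, p_case)
              + _loglik(ctrl_alt, ctrl_depth, p_ctrl))
    ll_null = (_loglik(case_alt, case_depth, p_null)
               + _loglik(ctrl_alt, ctrl_depth, p_null))
    return max(0.0, 2 * (ll_alt - ll_null))  # guard against rounding noise
```

One visible effect of the error floor in this sketch: if both groups show alt reads at or below the assumed error rate, the statistic is zero, so differences that sequencing error alone could explain produce no signal.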
Extensive simulation studies with different case/control numbers, pool sizes and sequencing error rates. Other variables include sequencing depth, pooling and exon variance. Population structure is not accounted for at this stage. Sequencing more individuals at lower depth is more cost-effective than sequencing fewer samples at high coverage.
At a fixed sample size, coverage and pool size need to be balanced; high depth at larger pool sizes beats low depth with smaller pools.