CMB


Gene Structure and Array Design


The main focus of the group is the development of algorithms and tools for the analysis of next-generation sequencing (NGS) data. Recent advances in high-throughput sequencing technologies have led to an enormous increase in the amount of data generated. Furthermore, steadily improving and newly emerging technologies require continuous adaptation of existing software. In many cases, dedicated software has to be developed to handle and analyze the vast amount of sequencing data efficiently. While a few processing steps, such as quality control and mapping, are largely independent of the sequencing application (e.g. ChIP-seq, re-sequencing), each application needs specific software to address the questions of interest.

Transcriptome sequencing (RNA-seq)

[Schulz, Richard, Sun, Haas]

Given our early experience with RNA-seq data, the group was among the first to develop algorithms to evaluate transcript expression based on NGS data. Our tools CASI and DASI allow the detection of alternative transcript expression within one cell type and of differential expression of transcripts across different samples, respectively. In addition, we implemented an EM-based method, POEM, to quantify the expression of alternative transcripts. These predictions were subsequently validated experimentally on samples of HEK and B-cells in collaboration with the group of Marie-Laure Yaspo. Besides studying transcript expression, a main challenge is to assemble transcripts reliably from short read sequences. This task is complicated by the fact that the abundance of reads originating from different transcripts may vary by several orders of magnitude because of different expression levels. We therefore developed a de novo assembly tool (Oases) that efficiently reconstructs transcripts while taking estimated expression levels into account. In a collaborative project with the MPI for neurological research we recently started to apply state-of-the-art mapping tools as well as de novo assembly as a basis for the detection of fusion transcripts in samples of small-cell lung carcinomas. Such fusion transcripts may be prime candidates for mutations driving tumor progression, as shown for other tumor types.

Detection of disease-causing mutations (Re-sequencing)

[Emde, Love, Sun, Richard, O'Keeffe, Haas]

A major application of NGS is the sequencing of entire genomes or genomic regions of interest to determine the specific genotype of an individual. This information can be used either to unravel evolutionary relationships or, for example, to uncover mutations associated with a certain phenotype. In contrast to traditional methods, NGS-based re-sequencing allows a comprehensive and less biased screening for sequence variations (SVs) at even lower cost.
In close collaboration with the group of Hilger Ropers we set up a computational processing pipeline to enable the large-scale analysis of re-sequencing data, with the aim of detecting potential disease-causing mutations in samples of patients suffering from intellectual disability (ID). Frequently, causal mutations disrupt gene function by changing the protein sequence, or by deletion/duplication of the gene or parts of it. Therefore, the computational pipeline needs to evaluate all the different types of sequence variation. However, depending on the size of the SVs, different computational approaches have to be applied to recover them comprehensively. In a first step, we evaluate the basic read mapping alignment for consistent deviations from the reference genome sequence. This strategy allows us to determine base exchange variations and short (<=5 bp) insertions/deletions. In order to reduce the number of false-positive variant calls, we correct for potential PCR amplification artifacts and apply a robust quality-based read clipping.
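The first step can be illustrated with a minimal Python sketch. It is not the group's pipeline: it assumes ungapped alignments given as dictionaries with the hypothetical fields `start` (0-based), `seq` and `quals`, treats reads sharing a start position as PCR duplicates, clips low-quality bases from the trailing end of each read, and reports only base exchanges (short insertions/deletions are left out). The thresholds `min_qual`, `min_depth` and `min_fraction` are illustrative values, not ones taken from the text.

```python
from collections import Counter, defaultdict

def clip_low_quality(seq, quals, min_qual=20):
    """Trim trailing bases until the last base has quality >= min_qual."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    return seq[:end], quals[:end]

def remove_duplicates(reads):
    """Keep the best-quality read per start position (assumed PCR copies)."""
    best = {}
    for read in reads:
        key = read["start"]
        if key not in best or sum(read["quals"]) > sum(best[key]["quals"]):
            best[key] = read
    return list(best.values())

def call_base_exchanges(reference, reads, min_depth=3, min_fraction=0.3):
    """Report positions where a non-reference base is observed consistently."""
    pileup = defaultdict(Counter)
    for read in remove_duplicates(reads):
        seq, _ = clip_low_quality(read["seq"], read["quals"])
        for offset, base in enumerate(seq):
            pileup[read["start"] + offset][base] += 1
    calls = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        for base, n in sorted(counts.items()):
            if base != reference[pos] and n / depth >= min_fraction:
                # (position, reference base, observed base, support, depth)
                calls.append((pos, reference[pos], base, n, depth))
    return calls
```

Removing duplicates before building the pileup keeps a single amplified fragment from contributing several apparently independent observations of the same deviation.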
In a second step, we apply our spliced mapping tool, SplazerS, to recover reads that cross boundaries of potential insertion/deletion events by generating artificial paired-end reads. Deviations from the expected distance of such read pairs allow us to predict not only short insertions (<=30 bp) and medium-sized deletions (<50 kb) but also putative retrocopies or pseudogenes. Finally, we detect large duplications/deletions by evaluating the read depth distribution along the genomic region of interest. A significant increase or decrease in read depth indicates a potential duplication or deletion event, respectively. In exome enrichment data, for example, read depth is usually non-uniform and skewed towards the ends of the enriched region, depending on the enrichment technology used. We addressed this issue with our software ExomeCopy, which computes a representative background distribution of read depth against which a sample is compared. This strategy enables the detection of duplications/deletions even across data derived from different enrichment technologies.
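The idea of comparing a sample against a background read depth can be sketched as follows. This is a simplified stand-in and not the statistical model implemented in ExomeCopy: it assumes read counts per target window are already available, takes the per-window median of normalized control samples as the background, and labels windows with a fixed log2-ratio cutoff. All function names and the `threshold` value are assumptions made for illustration.

```python
import math
from statistics import median

def normalize(depths):
    """Scale per-window read counts so they sum to 1 (removes library size)."""
    total = sum(depths)
    return [d / total for d in depths]

def background_profile(controls):
    """Median normalized depth per window across the control samples."""
    normalized = [normalize(c) for c in controls]
    return [median(window) for window in zip(*normalized)]

def call_depth_changes(sample, background, threshold=0.4):
    """Label each window from the log2 ratio of sample to background depth."""
    calls = []
    for observed, expected in zip(normalize(sample), background):
        if expected == 0:
            calls.append("no-coverage")  # window never captured in controls
        elif observed == 0:
            calls.append("deletion")
        else:
            ratio = math.log2(observed / expected)
            if ratio > threshold:
                calls.append("duplication")
            elif ratio < -threshold:
                calls.append("deletion")
            else:
                calls.append("normal")
    return calls
```

Because each window is compared with the same window in the background rather than with its neighbours, a systematic coverage skew within an enriched region affects sample and background alike and cancels in the ratio.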
On top of the comprehensive set of tools for SV prediction, we provide functional annotations for all SVs in order to prioritize them according to their potential functional impact. This includes filtering out already known variations that are not expected to be associated with diseases, but also annotating known disease associations extracted from HGMD. In addition, we add information about sequence conservation and the impact on protein sequence or splicing in order to further prioritize candidate mutations. All SVs detected, including detailed functional annotations, are finally stored in a database that can be queried for distinct regional or functional subsets of SVs.
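The filtering and ranking described above could look roughly like the following Python sketch. The `Variant` fields, the effect categories and all weights are hypothetical and chosen only to show how known polymorphisms are removed before the remaining SVs are ordered by their annotations; the text does not specify an actual scoring scheme.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Variant:
    name: str
    known_polymorphism: bool            # already known, not expected to cause disease
    disease_association: Optional[str]  # e.g. an entry found in a disease database
    conservation: float                 # 0 (not conserved) to 1 (highly conserved)
    effect: str                         # "nonsense", "splice", "missense", "synonymous"

# Illustrative weights only; higher means a stronger expected impact.
EFFECT_WEIGHT = {"nonsense": 3.0, "splice": 2.5, "missense": 1.5, "synonymous": 0.0}

def priority(variant):
    """Combine the annotations into one score used for ranking."""
    score = EFFECT_WEIGHT.get(variant.effect, 0.0) + variant.conservation
    if variant.disease_association is not None:
        score += 2.0
    return score

def prioritize(variants):
    """Drop known polymorphisms, then rank the rest by decreasing priority."""
    candidates = [v for v in variants if not v.known_polymorphism]
    return sorted(candidates, key=priority, reverse=True)
```

Keeping the score a plain function of the stored annotations means the same ranking can be recomputed for any regional or functional subset of SVs.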
Besides developing the software infrastructure to recover and annotate sequence variations, the group also applied these tools successfully to patient cohorts for the detection of putative disease-causing mutations. In a first project we analyzed 136 patients with autosomal-recessive ID for whom a genomic linkage interval was already known from previous studies. Targeted sequencing of these regions revealed 50 genes newly associated with autosomal-recessive ID.
In a parallel project we analyzed the entire exome of chromosome X in >400 male patients suffering from X-linked ID. After filtering out common variants, our mutation analysis yields on average 3-4 candidate mutations per individual, which are subsequently validated experimentally and checked for co-segregation within the family. These results serve as input for genetic counseling of the parents of the affected children.

Selected Publications

    Emde AK, Schulz MH, Weese D, Sun R, Vingron M, Kalscheuer VM, Haas SA, Reinert K. (2012) Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using SplazerS.
    Bioinformatics, 28(5):619-627

    Love MI, Mysicková A, Sun R, Kalscheuer V, Vingron M, Haas SA. (2011) Modeling Read Counts for CNV Detection in Exome Sequencing Data.
    Statistical Applications in Genetics and Molecular Biology, 10(1), Article 52

    Najmabadi H, Hu H, Garshasbi M, Zemojtel T, Abedini SS, Chen W, Hosseini M, Behjati F, Haas S, Jamali P, Zecha A, Mohseni M, Puttmann L, Vahid LN, Jensen C, Moheb LA, Bienek M, Larti F, Mueller I, Weissmann R, Darvish H, Wrogemann K, Hadavi V, Lipkowitz B, Esmaeeli-Nieh S, Wieczorek D, Kariminejad R, Firouzabadi SG, Cohen M, Fattahi Z, Rost I, Mojahedi F, Hertzberg C, Dehghan A, Rajab A, Banavandi MJ, Hoffer J, Falah M, Musante L, Kalscheuer V, Ullmann R, Kuss AW, Tzschach A, Kahrizi K, Ropers HH. (2011) Deep sequencing reveals 50 novel genes for recessive cognitive disorders.
    Nature, 478(7367):57-63.

    Schraders M, Haas SA, Weegerink NJ, Oostrik J, Hu H, Hoefsloot LH, Kannan S, Huygen PL, Pennings RJ, Admiraal RJ, Kalscheuer VM, Kunst HP, Kremer H. (2011) Next-Generation Sequencing Identifies Mutations of SMPX, which Encodes the Small Muscle Protein, X-Linked, as a Cause of Progressive Hearing Impairment.
    Am J Hum Genet., 88:628-634



Contact:
Stefan Haas
MPI for Molecular Genetics
Computational Molecular Biology
Ihnestr. 73
D-14195 Berlin
Phone: +49 30 8413 1164
Fax: +49 30 8413 1152
Email: stefan.haas@molgen.mpg.de