Mikhail Kolmogorov, Ph.D.
- Center for Cancer Research
- National Cancer Institute
- Building 41, Room A100C
- Bethesda, MD 20892
- 240-858-3169
- mikhail.kolmogorov@nih.gov
RESEARCH SUMMARY
Dr. Kolmogorov’s laboratory focuses on computational genomics: algorithms, mathematical models, and tools aimed at answering fundamental questions about living systems through the analysis of large-scale sequencing data.
Areas of Expertise
Research
Cancer is a disease of the genome. Even with the latest breakthroughs in DNA sequencing technologies, it is impossible to read the complete genomic sequence from the beginning to the end. Instead, these technologies sample millions of short fragments (called "reads") from random locations in the genome. Reconstructing the genome sequence from these short fragments is not unlike assembling a giant jigsaw puzzle. Our goal is to develop methods to analyze large-scale sequencing data, which will ultimately enhance our understanding of carcinogenesis and improve genomics-based cancer diagnostics.
One of our research projects focuses on reconstructing cancer genome architecture using long sequencing reads. Cancer is driven by somatic changes in the genome, which can range from small nucleotide substitutions to chromosome-scale rearrangements. Until recently, it was difficult to study chromosomal architecture using traditional short-read sequencing because of mapping ambiguity and the limitations of a single reference genome. In contrast, long reads provide a much more comprehensive view of structural genomic changes. We will develop reference-free, graph-based algorithms to shed light on ubiquitous but elusive carcinogenic processes such as chromothripsis, chromoplexy, or extrachromosomal DNA amplification.
Another interest of our laboratory is the genomic analysis of highly heterogeneous cell communities. Solid tumors often consist of multiple clonal cell lines that are evolving under selective pressure. A seemingly unrelated example of a highly heterogeneous community is an environmental metagenome, such as bacteria in the human gut. We are developing methods for characterizing these complex communities using high-coverage bulk sequencing data.
Current and Past Research Highlights
Strain-level metagenome deconvolution. Microbial communities in many environments include distinct lineages of closely related organisms, which have proved challenging to separate in metagenomic assembly. It is difficult to distinguish between read errors and real polymorphisms between bacterial strains, but high-fidelity (HiFi) long reads have the potential to solve this issue. We recovered 428 complete or nearly complete bacterial genomes from a single sheep gut metagenomic sample, the highest resolution achieved with metagenomic deconvolution to date. HiFi assembly resolved many closely related microbial lineages into distinct contigs, proving to be a powerful tool for characterizing complex heterogeneous environments.
Metagenome assembly with metaFlye. Shotgun metagenomic assembly is a powerful method to characterize complex microbial communities (such as human gut or tumor microenvironments). Until recently, metagenome assemblies based on short reads (such as Illumina) were highly fragmented and incomplete (e.g., missing 16S genes). To enable long-read-based analysis, we developed metaFlye, the first dedicated method for long-read metagenomic assembly. Using metaFlye, we reconstructed many complete bacterial genomes from various metagenomic communities. We also showed that long-read assembly of human microbiomes enables the discovery of full-length biosynthetic gene clusters that encode biomedically important natural products (such as colibactin).
Long-read assembly using Flye. New long-read sequencing technologies (such as Pacific Biosciences or Oxford Nanopore) have increased read length to tens of thousands of nucleotides and substantially improved the quality of many genome assemblies. These technologies, however, face the challenge of high error rates. We created the Flye algorithm for assembly of long, error-prone reads to address this challenge. Flye uses a novel repeat graph framework, which enables fast and accurate assemblies of various organisms. In particular, Flye is well suited to assembling human genomes from ultra-long Oxford Nanopore sequencing data (such as NA12878 or CHM13).
We develop long-read assembly methods with the help of our collaborators from Rob Knight’s lab, the T2T consortium, Tim Smith’s lab, JGI, and many others.
Comparative assembly using multiple references. Since many de novo assemblies of large genomes are still incomplete, one can use information from related reference genomes to order and orient the contig fragments. We developed Ragout, which infers structural rearrangements between multiple input references and reconstructs the most probable architecture of a target genome. We used Ragout to produce chromosome assemblies of multiple mouse genomes, which gave insights into rodent genome evolution and novel functional loci. The mouse assemblies were generated as part of the Mouse genomes sequencing project, hosted by the Wellcome Sanger Institute.
Tools for assembly graph analysis. The analysis of genome graphs is helpful in studying the repeat structure of genomes (for example, mosaic segmental duplications in humans). To visualize large and complex assembly graphs, we developed AGB, an interactive graph visualization tool. We have also introduced a new Synteny Paths approach for comparing two related genomes in graph form, analogous to synteny blocks for linear genomes. The tools were developed in collaboration with the Center for Algorithmic Biotechnology and the Bioinformatics Institute in St. Petersburg, Russia.
Publications
- View Dr. Kolmogorov's Google Scholar bibliography.
Assembly of long, error-prone reads using repeat graphs
metaFlye: scalable long-read metagenome assembly using repeat graphs
Chromosome assembly of large and complex genomes using multiple references
Assembly of long error-prone reads using de Bruijn graphs
Biography
Before joining the Cancer Data Science Laboratory in January 2022, Mikhail was a postdoctoral fellow at the University of California (UC) Santa Cruz, supervised by Dr. Benedict Paten. Prior to that, he was a postdoctoral fellow at UC San Diego, co-supervised by Dr. Rob Knight and Dr. Pavel Pevzner. Mikhail completed his Ph.D. in Computer Science at UC San Diego in September 2019, under the mentorship of Dr. Pavel Pevzner. He received his M.Sc. in bioinformatics from St. Petersburg University of the Russian Academy of Sciences.
Job Vacancies
We have no open positions in our group at this time; please check back later.
To see all available positions at CCR, take a look at our Careers page. You can also subscribe to receive CCR's latest job and training opportunities in your inbox.
Team
News
Learn more about CCR research advances, new discoveries, and more in our news section.
Resources
Our Software
Flye is a de novo assembler for single molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. Flye also has a special mode for metagenome assembly.
HapDup (haplotype duplicator) is a pipeline that converts a haploid long-read assembly into a dual diploid assembly. The reconstructed haplotypes preserve heterozygous structural variants (in addition to small variants) and are locally phased.
Ragout is a tool for chromosome-level scaffolding using multiple references. Given initial assembly fragments (contigs/scaffolds) and one or multiple related references (complete or draft), it produces a chromosome-scale assembly (as a set of scaffolds).
AGB provides interactive visualization of assembly graphs, a wide range of tuning parameters, and various options for modifying or simplifying the graph.
Asgan is a tool for analysis and comparison of assembly graphs. The tool takes two assembly graphs in the GFA format as input and finds the minimum set of homologous sequences (synteny paths) shared between the graphs. As output, Asgan produces various statistics and a visualization of the found paths.
dipdiff is a simple structural variant (SV) calling package for diploid assemblies.
A tool that postprocesses whole genome alignment (for two or more genomes) and produces coarse-grained synteny blocks.