Texas A&M University statistician Bani K. Mallick has spent the past two decades using Bayesian statistics to develop more efficient algorithms for quantifying, qualifying and classifying big data. Thanks to a new grant from the National Institutes of Health, Mallick now has his sights set on the second-leading cause of death in the United States and one of the world’s biggest data-generating problems to date — cancer.
Mallick will receive $2.3 million during the next five years in support of his proposal, “Graph-Based Bayesian Analysis of Genomics and Proteomics Data.” Working in collaboration with fellow Texas A&M statistician Raymond J. Carroll as well as MD Anderson Cancer Center researchers Veerabhadran Baladandayuthapani and Han Liang, he will develop new statistical models designed to merge two vital areas: cancer-related data and its analysis.
“The worlds of bioinformatics and big data are joining together to discover innovative ways to integrate knowledge for cancer treatments,” Mallick said. “This project will create a wide assortment of novel methods for better integrating data across platforms so that we can effectively obtain a much more complete understanding of cancer characteristics and behavior and thereby improve its prevention, prediction and treatment.”
Mallick, a distinguished professor and holder of the Susan M. Arseven ’75 Chair in Data Science and Computational Statistics in the Texas A&M Department of Statistics, is considered one of today’s most influential and productive statisticians as a pioneer in Bayesian nonparametric regression and classification research. He is director of both the Center for Statistical Bioinformatics and the Bayesian Bioinformatics Laboratory and has developed novel methodology and theory that has become the foundation for interdisciplinary research in myriad fields, from bioinformatics and veterinary medicine to engineering and traffic mapping.
Mallick’s grant is classified as a research project (R01) award, the oldest NIH grant mechanism, otherwise known as a standard single-investigator grant. Although R01s are fairly common in the life sciences, as indicated by the dozen or so across the Texas A&M College of Science in biology and chemistry, Mallick notes they are somewhat rare in fundamental sciences such as mathematics and statistics. All R01s are required to have public health impacts, as are program project (P01) and cooperative agreement (U01) grants.
To tackle a problem as immensely complex as cancer, Mallick says one must first be prepared to address the huge databases of detailed genomic, molecular, imaging and clinical data from large cohorts of cancer patients that have been and continue to be assembled by major national and international projects.
As but one example, he notes that when a patient is diagnosed, their tumor’s genome might be sequenced to see if it is likely to respond to a particular drug. Repeating the genomic sequencing as each specific treatment is administered and progresses makes it possible to detect further changes. The patient sample also is characterized by a wide variety of platforms that measure most cellular changes in the tumor, including DNA mutations, copy-number alterations, epigenomic changes and changes in gene expression and protein abundance. Further patient-specific information, such as medical history and lifestyle, laboratory results, images and scanned data, subsequently will be recorded by the physicians. In sum, Mallick says it all adds up to an enormous amount of information, generated from a single patient.
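To make the scale of a single patient’s record concrete, here is a purely illustrative sketch of how such a multi-platform profile might be organized. The patient identifier, gene names, probe name and all values are invented for illustration and are not taken from the project.

```python
# Illustrative only: one possible shape for a single patient's
# multi-platform tumor profile. All identifiers and values are made up.
patient_profile = {
    "patient_id": "P0001",                   # hypothetical identifier
    "dna_mutations": {"TP53": "R175H"},      # gene -> observed mutation
    "copy_number": {"MYC": 4, "PTEN": 1},    # gene -> copy count
    "methylation": {"cg00000029": 0.82},     # probe -> beta value in [0, 1]
    "gene_expression": {"EGFR": 11.3},       # gene -> log2 expression
    "protein_abundance": {"EGFR": 2.1},      # protein -> relative level
    "clinical": {"age": 61, "stage": "II"},  # history, labs, imaging, etc.
}

# Even this toy record spans five measurement platforms; a real profile
# holds hundreds of thousands of values per platform.
n_platforms = sum(
    1 for key in patient_profile if key not in ("patient_id", "clinical")
)
```

The point of the sketch is only that each platform contributes its own block of data, all tied to the same patient, which is what makes joint analysis both valuable and hard.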
“Multiply all that by millions of people diagnosed with cancer each year, and we can begin to see the incredible size of the data,” Mallick said. “In addition, the data are coming from different platforms and are increasingly complex in nature. For these reasons, much of that information cannot yet be effectively exploited, nor extracted from all the platforms simultaneously.”
Mallick notes that while integrating data across all platforms is central to a better understanding of cancer, it also is immensely challenging. Obstacles include the complexity of the data coming from each individual platform, the very large number of variables in some platforms (for instance, he cites the Infinium HumanMethylation450 BeadChip array, which has 485,576 distinct probes per sample) and the complex, many-to-many relationships among probes from different platforms.
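The many-to-many relationships mentioned above can be illustrated with a toy example. A single methylation probe may annotate several genes, and a single gene may be covered by many probes, so neither direction of the mapping is one-to-one. Probe and gene names here are invented, not real annotations.

```python
# Illustrative sketch of many-to-many probe/gene relationships.
# Probe and gene names are fictional.
probe_to_genes = {
    "cg_A": ["GENE1", "GENE2"],  # one probe annotating two genes
    "cg_B": ["GENE2"],
    "cg_C": ["GENE2", "GENE3"],
}

# Invert the mapping: gene -> list of probes that measure it.
gene_to_probes = {}
for probe, genes in probe_to_genes.items():
    for gene in genes:
        gene_to_probes.setdefault(gene, []).append(probe)

# GENE2 is covered by three probes, so its methylation signal must be
# aggregated or modeled jointly before it can be linked to other platforms.
```

With hundreds of thousands of probes per platform, resolving these overlapping relationships consistently across platforms is one concrete form the integration challenge takes.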
Multifaceted integration obstacles aside, however, Mallick says the most concerning issue with cancer — and the ultimate reason he believes it has yet to be completely cured — is that it mutates differently for each individual and often reacts in unexpected ways based on the individual’s specific genetic makeup. Given that patients require personalized treatment and medication in order to manage their specific type of cancer, Mallick says the development of personalized therapeutic options based on genetic and molecular markers offers great promise to improve cancer therapy outcomes. He points to significant advances in biotechnology that have allowed the measurement of molecular data points across multiple levels from a single tumor sample to aid in clinical decisions as tangible progress.
“One of the key data-analytic challenges is to develop holistic approaches that combine information within and across these multiple molecular levels to inform the prognosis and guide evidence-based management of current and future cancer patients,” Mallick said. “A critical step toward addressing this challenge is to develop a deeper understanding of the underlying molecular mechanisms leading to cancer initiation and progression.”
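One way to picture “combining information within and across molecular levels” in the Bayesian spirit of Mallick’s work is precision-weighted pooling of independent estimates, a textbook conjugate-normal idea and not the project’s actual model. The function and values below are invented for illustration.

```python
# Hedged illustration, not the project's method: if two platforms give
# independent noisy estimates of the same underlying signal, a Bayesian
# analysis pools them by precision (inverse variance).
def pool_normal(estimates, variances):
    """Precision-weighted combination of independent normal estimates."""
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    mean = sum(p * e for p, e in zip(precisions, estimates)) / total
    return mean, 1.0 / total  # pooled mean, pooled variance

# e.g. an expression-based and a protein-based estimate of the same signal:
mean, var = pool_normal([1.0, 2.0], [0.5, 1.0])
# The pooled variance is smaller than either input variance, so combining
# platforms yields a sharper answer than any platform alone.
```

The more precise platform gets more weight, and the combined estimate is always less uncertain than either source, which is the statistical payoff of integration.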
As with other facets of such a complex problem, Mallick admits it’s a complicated task because disease phenotypes and cellular functions are driven not by individual entities, such as genes and proteins, but by coordinated changes across multiple networks and pathways reflecting various pathobiological processes. Therein lies not only the primary challenge but also his critical motivation.
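The distinction between individual-gene signal and coordinated network-level change can be sketched with a deliberately simple toy score. The pathway, gene names, per-gene changes and the averaging rule are all invented here; real pathway analysis uses far richer graph-based models.

```python
# Illustrative only: a toy pathway-level score. Nothing here reflects the
# project's actual models; the averaging rule is a stand-in for the
# coordinated, network-level effects described in the text.
pathway_members = {
    "toy_pathway": ["GENE1", "GENE2", "GENE3"],
}
gene_change = {"GENE1": 0.2, "GENE2": 0.3, "GENE3": 0.4, "GENE4": -1.5}

def pathway_score(pathway):
    """Average change across a pathway's member genes."""
    genes = pathway_members[pathway]
    return sum(gene_change[g] for g in genes) / len(genes)

# No single member gene changes dramatically, yet the pathway as a whole
# shifts in a consistent direction -- the kind of coordinated signal that
# single-gene analysis can miss.
score = pathway_score("toy_pathway")
```

The design point is simply that the unit of analysis moves from the gene to the pathway; graph-based Bayesian methods of the kind the grant proposes formalize this with actual network structure rather than a flat average.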
“For our purposes, big data here refers to a massive collection of data sets that are not only large but also enormously complex because they contain multi-scale, multi-structured data,” Mallick said. “The goal of this project is development of systematic, integrated and data-driven approaches based on these complex data sets.
“It goes without saying there’s a lot at stake across multiple disciplines and industries, not to mention for society as a whole.”
For more information on Mallick and his research, go to http://www.stat.tamu.edu/~bmallick/.
# # # # # # # # # #
About Research at Texas A&M University: As one of the world’s leading research institutions, Texas A&M is at the forefront in making significant contributions to scholarship and discovery, including that of science and technology. Research conducted at Texas A&M represented annual expenditures of more than $866.6 million in fiscal year 2015. Texas A&M ranked in the top 20 of the National Science Foundation’s Higher Education Research and Development survey (2014), based on expenditures of more than $854 million in fiscal year 2014. Texas A&M’s research creates new knowledge that provides basic, fundamental and applied contributions resulting in many cases in economic benefits to the state, nation and world. To learn more, visit http://research.tamu.edu.
Contact: Shana K. Hutchins, (979) 862-1237 or email@example.com or Dr. Bani K. Mallick, (979) 845-1275 or firstname.lastname@example.org