Big Data in Medical Science

By Charles Weston
Published November 30, 2022 | 3 min read

The First Lady Jill Biden said in early February 2022 that “we are living in a golden age of research and discovery”

The golden age is a phrase used too often, and every biological discovery made, standing of the shoulders of scientists that have come before, give us further insight into biological processes and disease. Perhaps what stands us apart now, though, is the ability to generate, and process, biological data on a scale that would have previously been unimaginable

Image of Genome Sequencing

This year is the 20^th anniversary of the completion of the human genome project in which all – well, most – of the first human genome was produced at a cost estimated around $3bn. Genomic data generation was already growing exponentially, and this work drove scientists to generate data at ever-faster rates. The cost of sequencing has fallen dramatically – already under $1,000 for a human genome, with the potential to reach $100 as early as next year – and at the same time, the ease of generating this data has increased.

The infrastructure behind this to process and store the data has scaled, and statistical approaches and machine learning algorithms have been developed to help scientists to elucidate meaningful findings from this data.

Genomics is now commonly used in cancer and rare disease diagnoses, and is being developed to quantify risk for common diseases. It is used extensively in plant and ecological science, in forensics and, of course, in the detection and monitoring of pathogens like the Sars Cov 2 (with over 8 million sequences of these viruses having been generated just in the last two years).

Data generation is now scalable to an extent that there are multiple programmes around the world seeking to sequence hundreds of thousands, or indeed millions of people to improve our understanding of health and disease.

Advances in sequencing technologies have resulted in big instruments with increasing throughput all the way through to a handheld, cheap device that produces DNA data within minutes and captures additional data from the DNA molecules. This small instrument has provided far more people with access to sequencing technologies, and could ultimately lead to consumer devices.

Genomics and genomic modification are just two pieces of the biological puzzle, with instructions from DNA carried by RNA to the manufacturing sites in the cells where the proteins are made. Proteins are things like hormones and enzymes, which are absolutely fundamental to affecting biological processes. They are incredibly complex, made from many more building blocks than those that make up DNA or RNA and with hundreds of modifications that change their function.

Although RNA can be studied like DNA, in a field called transcriptomics, the study of proteins, which is called proteomics, has been limited by this complexity.

Image of proteomics

But we’re now on the edge of another step change in data generation, with next-generation proteomics companies developing tools that can generate many orders of magnitude more data they can look for hundreds, or even thousands of proteins at the same time, or help scientists understand what different proteins are doing in different parts of the same organ or simply help understand what the shape of these molecules are.

Last year, Google’s subsidiary Deep Mind published AI-predicted structures of over 350,000 different proteins, and just this month Meta published on 600,000. Coupled with the emergence of these new detection instruments, the field of proteomics is grow exponentially.

Another amazing source of data is clinical records. Healthcare systems have data on millions of people, from birth to death, and the use of this data, appropriately anonymised, can also be mined for biological insight. The complexity of this data is again absolutely huge but as these systems morph to electronic records, and as machine learning systems become ever more sophisticated, they are providing actionable insights that directly lead to improvement in health.

Imagine the difficulty of combining genomics, transcriptomics, proteomics, and clinical records… Indeed, there are many ‘omics beyond those described above, each of which adds further dimensions of data, but as the fields of biology and big data analytics continue to grow together, alongside the huge investments in research being made by governments, foundations, venture capital and the public markets, the pace of biological discoveries is going to continue to accelerate.

Perhaps Jill Biden is indeed right, and this is a golden age of research and discovery but I think that, despite near-term turbulence in the capital markets, this golden age is going to get more and more lustrous as biologics and big data analytics converge and evolve.

Charles Weston
Equity Analyst, European Midcap Healthcare Research, RBC Capital Markets

Charles Weston joined RBC Capital Markets in May 2019 to cover the European midcap healthcare sector. Since 2006 Charles has been a research analyst, with roles at Berenberg, Numis, Nomura Code and Morgan Stanley, and he has also spent three years as an engineer in medical devices and two years as a CFO within a genomics company. Charles is a CFA Charterholder and received his Masters degree in Chemistry from the University of Oxford.

Big DataClinical RecordsDNAGenome SequencingGenomicsHealthcare SectorMedical DevicesMedical SciencePharmaceuticalProteomicsTranscriptomics

Big Data in Medical Science

Related Content