Building a starry sky of knowledge in the life sciences

As the life sciences generate ever more data, an institute near Cambridge is addressing how we archive, share and analyse the world’s biological data.

Modern science can’t exist without data. The volume and range of data generated by scientists is growing. The life sciences are an excellent example of this. Researchers in labs around the world perform experiments using increasingly complex technology and, as a result, generate large amounts of data. Based on these experiments, they gain a better understanding of how life works. This understanding is often presented in scientific publications.

But this is only the beginning. To make their science more open, on publication of their paper (and sometimes before) researchers upload the data to public data resources, such as the ones managed by the European Bioinformatics Institute (EMBL-EBI), one of the six sites of the European Molecular Biology Laboratory (EMBL).

This data publication enables two things. Firstly, it allows other scientists to examine the workings of the paper, ensuring the analysis is sound. Secondly, it empowers other scientists to use the data to develop new experiments, answer new research questions and gain new insights. And so the cycle of discovery continues.

Enriching the data

However, none of this would be possible without research infrastructures that manage data and make it ‘FAIR’:

  • findable
  • accessible
  • interoperable
  • reusable

EMBL-EBI is one of the very few places in the world that has provided the master copy of biological research data over many decades.

Not only does EMBL-EBI store the data on behalf of the world, but we also enrich and integrate them into the wider biological data ecosystem, using specialist knowledge and expertise.

If you take the human genome as an example, everything we know about it is the result of thousands of independent experiments performed by scientists around the world. Recreating these experiments every time researchers want to reuse the data simply wouldn’t work. As well as being exhausting, expensive and repetitive, it would require specialist skills.

Instead, EMBL-EBI’s Ensembl selects, computes, transforms and integrates information from these experiments into a consolidated view of the human genome that is freely available to all. It also regularly updates this view as new data emerge.
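To make the idea of a consolidated view more concrete, here is a purely illustrative toy sketch in Python. It is not how Ensembl works: the gene names, field names, lab names and the "first non-missing value wins" rule are all invented assumptions, chosen only to show how records from independent experiments might be merged into a single view per gene while keeping track of where each piece came from.

```python
from collections import defaultdict

# Hypothetical records, as if reported by independent experiments.
# All names and values here are invented for illustration only.
records = [
    {"gene": "GENE_A", "source": "lab_1", "chromosome": "1", "function": None},
    {"gene": "GENE_A", "source": "lab_2", "chromosome": None, "function": "kinase"},
    {"gene": "GENE_B", "source": "lab_3", "chromosome": "7", "function": None},
]


def consolidate(records):
    """Merge per-experiment records into one view per gene.

    Assumed rule: for each field, the first non-missing value wins.
    Every contributing source is kept, so the provenance of the
    merged view stays visible.
    """
    view = defaultdict(lambda: {"sources": []})
    for record in records:
        entry = view[record["gene"]]
        entry["sources"].append(record["source"])
        for field, value in record.items():
            # Skip the bookkeeping fields and any missing values.
            if field in ("gene", "source") or value is None:
                continue
            entry.setdefault(field, value)
    return dict(view)


consolidated = consolidate(records)
```

In this toy example, the two partial records for the invented GENE_A combine into one entry that carries both the chromosome and the function, along with the two labs that contributed them. A real resource would also need to handle conflicting values, versioning and regular updates as new data arrive.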

A starry sky of data and knowledge

EMBL-EBI holds the reference human genome. It also holds information about genes, RNA, protein sequences, structures and much more. These data show:

  • the genes that make up the human genome
  • how some genes encode proteins
  • how proteins fold, or do not fold, into three-dimensional structures
  • how they interact to execute particular parts of our molecular biology

This enables us to understand what happens in our bodies when we are healthy, when we become ill or when we age.

As well as holding this data for humans, EMBL-EBI also holds it for a whole spectrum of species, including:

  • chickens
  • macaques
  • marmosets
  • mice
  • platypus
  • rats
  • tomatoes
  • wheat

It’s a comprehensive list that even includes viruses like SARS-CoV-2, the virus that causes COVID-19.

I like to describe all of the information held by EMBL-EBI as being like a starry sky of data and knowledge.

Scientists are continually working out another data point, another star, another constellation. EMBL-EBI keeps that whole cosmos of data available for science to use, both now and in the future.

Together with colleagues at EMBL’s five other sites across Europe, EMBL-EBI also builds data analysis tools, performs cutting-edge research and helps train the next generation of scientists.

A foundational resource for artificial intelligence (AI)

It’s partly because of this wealth of publicly available data that the life sciences are perfectly placed to leverage increasingly powerful AI tools. The vast, information-rich and carefully curated datasets managed by EMBL-EBI are ideal for training AI algorithms.

Take, for example, the revolutionary AlphaFold AI from Google DeepMind, which is able to accurately predict protein structures. To train the algorithm, the company used publicly available data from data resources managed by EMBL-EBI, including:

  • experimentally determined protein structures from the Protein Data Bank
  • protein sequences and annotations from UniProt
  • metagenomics data from MGnify

Google DeepMind’s engineering prowess, combined with the large volumes of public biological data and the intriguing research question of how proteins fold, created a superb synergy which resulted in the development of AlphaFold.

This marked a step change in biology, with AlphaFold’s AI predictions now being used to support research in drug discovery, vaccine development, agri-tech and beyond.

With more AI tools being developed every day to help us extract knowledge from the vast volumes of life science data available, I expect AlphaFold is just the beginning. Powerful AI tools for genomics and bioimaging are already being developed.

Custodians of the world’s biological data

But the starry sky of data doesn’t maintain itself. Like any research infrastructure, it requires robust and sustained funding and resourcing. This enables EMBL-EBI to respond to an increase in data submissions and usage.

EMBL-EBI has seen a significant increase in the number of scientists submitting data to, and accessing data from, its data resources, especially since the COVID-19 pandemic. This highlights the importance of robust data sharing during public health emergencies.

A 2021 independent report by Charles Beagrie Ltd found that data resources managed by EMBL-EBI are critical to the life sciences and that they underpin research impacts estimated to be worth £1.3 billion annually. This shows just how embedded EMBL-EBI’s data resources really are in the science and innovation landscape in the UK, Europe and globally.

EMBL-EBI is based just outside Cambridge, a hotbed of innovation in the life sciences. We benefit from our close European links facilitated by EMBL’s wide network of collaborations and are strengthened by our global partnerships such as the Worldwide Protein Data Bank and the Global Alliance for Genomics and Health.

Over the years, EMBL-EBI has secured funding from the Biotechnology and Biological Sciences Research Council (BBSRC) and UK Research and Innovation (UKRI) to ensure our research infrastructure is fit for purpose.

The latest investment came in November 2023, when UKRI awarded EMBL-EBI £80.7 million from the UKRI Infrastructure Fund over the course of six years to transform the institute’s technical infrastructure to meet the data needs of the life sciences community.

This investment will provide storage, network, compute and cloud infrastructure to support EMBL-EBI in accommodating the rising production of publicly funded research data, including new data types such as bioimaging. It will also help us develop new data resources specific to serious global challenges, such as:

  • antimicrobial resistance
  • environmental integrity
  • global food security
  • personalised medicine

For over 30 years, EMBL-EBI has been building the starry sky of data and knowledge bit by bit, but none of this would have been possible without the amazing support of our funders, partners and users around the world.

Investing in research infrastructure to support the likes of EMBL-EBI is mission-critical to the success of UK bioscience.

Read BBSRC’s infrastructure strategic framework to find out how we are supporting adaptive, resilient and sustainable infrastructure to ensure the UK remains a leader in global bioscience research and innovation.

Top image: EMBL-EBI south building. Credit: Jeff Dowling
