Building a starry sky of knowledge in the life sciences

Modern science can’t exist without data. The volume and range of data generated by scientists are growing. The life sciences are an excellent example of this. Researchers in labs around the world perform experiments using increasingly complex technology and, as a result, generate large amounts of data. Based on these experiments, they gain a better understanding of how life works. This understanding is often represented in scientific publications.

But this is only the beginning. To make their science more open, on publication of their paper (and sometimes before), researchers upload the data to public data resources, such as the ones managed by the European Bioinformatics Institute (EMBL-EBI), one of the six sites of the European Molecular Biology Laboratory (EMBL).

This data publication enables two things. Firstly, it allows other scientists to examine the workings of the paper, ensuring the analysis is sound. Secondly, it empowers other scientists to use the data to develop new experiments, answer new research questions and gain new insights. And so the cycle of discovery continues.

Enriching the data

However, none of this would be possible without research infrastructures that manage data and make it ‘FAIR’:

findable
accessible
interoperable
reusable

EMBL-EBI is one of the very few places in the world that has provided the master copy of biological research data over many decades.

Not only does EMBL-EBI store the data on behalf of the world, but we also enrich and integrate them into the wider biological data ecosystem, using specialist knowledge and expertise.

If you take the human genome as an example, everything we know about it is the result of thousands of independent experiments performed by scientists around the world. Recreating these experiments every time researchers want to reuse the data simply wouldn’t work. As well as being exhausting, expensive and repetitive, it would require specialist skills.

Instead, EMBL-EBI’s Ensembl selects, computes, transforms and integrates information from these experiments into a consolidated view of the human genome that is freely available to all. It also regularly updates this view as new data emerge.

A starry sky of data and knowledge

EMBL-EBI holds the reference human genome. It also holds information about genes, RNA, protein sequences, structures and much more. These data show:

the genes that make up the human genome
how some genes encode proteins
how proteins fold into three-dimensional structures (or do not)
how they interact to execute particular parts of our molecular biology

This enables us to understand what happens in our bodies when we are healthy, when we become ill or when we age.

As well as holding this data for humans, EMBL-EBI also holds it for a whole spectrum of species, including:

chickens
macaques
marmosets
mice
platypus
rats
tomatoes
wheat

It’s a comprehensive list that even includes viruses like SARS-CoV-2, the virus that causes COVID-19.

I like to describe all of the information held by EMBL-EBI as being like a starry sky of data and knowledge.

Scientists are continually working out another data point, another star, another constellation. EMBL-EBI keeps that whole cosmos of data available for science to use, both now and in the future.

Together with colleagues at EMBL’s five other sites across Europe, EMBL-EBI also builds data analysis tools, performs cutting-edge research and helps train the next generation of scientists.

A foundational resource for artificial intelligence (AI)

It’s partly because of this wealth of publicly available data that the life sciences are perfectly placed to leverage increasingly powerful AI tools. The vast, information-rich and carefully curated datasets managed by EMBL-EBI are well suited to training AI algorithms.

Take, for example, the revolutionary AlphaFold AI from Google DeepMind, which is able to accurately predict protein structures. To train the algorithm, the company used publicly available data from data resources managed by EMBL-EBI, including:

experimentally determined protein structures from the Protein Data Bank
protein sequences and annotations from UniProt
metagenomics data from MGnify

Google DeepMind’s engineering prowess, combined with the large volumes of public biological data and the intriguing research question of how proteins fold, created a superb synergy which resulted in the development of AlphaFold.

This marked a step change in biology, with AlphaFold’s AI predictions now being used to support research in drug discovery, vaccine development, agri-tech and beyond.

With more AI tools being developed every day to help us extract knowledge from the vast volumes of life science data available, I expect AlphaFold is just the beginning. Powerful AI tools for genomics and bioimaging are already being developed.

Custodians of the world’s biological data

But the starry sky of data doesn’t maintain itself. Like any research infrastructure, it requires robust and sustained funding and resourcing. This enables EMBL-EBI to respond to an increase in data submissions and usage.

EMBL-EBI has seen a significant increase in the number of scientists submitting data to, and accessing data from, its data resources, especially since the COVID-19 pandemic. This highlights the importance of robust data sharing during public health emergencies.

A 2021 independent report by Charles Beagrie Ltd found that data resources managed by EMBL-EBI are critical to the life sciences and that they underpin research impacts estimated to be worth £1.3 billion annually. This shows just how embedded EMBL-EBI’s data resources really are in the science and innovation landscape in the UK, Europe and globally.

EMBL-EBI is based just outside Cambridge, a hotbed of innovation in the life sciences. We benefit from our close European links facilitated by EMBL’s wide network of collaborations and are strengthened by our global partnerships such as the Worldwide Protein Data Bank and the Global Alliance for Genomics and Health.

Over the years, EMBL-EBI has secured funding from the Biotechnology and Biological Sciences Research Council (BBSRC) and UK Research and Innovation (UKRI) to ensure our research infrastructure is fit for purpose.

The latest investment came in November 2023, when UKRI awarded EMBL-EBI £80.7 million from the UKRI Infrastructure Fund over the course of six years to transform the institute’s technical infrastructure to meet the data needs of the life sciences community.

This investment will provide storage, network, compute and cloud infrastructure to support EMBL-EBI in accommodating the rising production of publicly funded research data, including new data types such as bioimaging. It will also help us develop new data resources specific to serious global challenges, such as:

antimicrobial resistance
environmental integrity
global food security
personalised medicine

For over 30 years, EMBL-EBI has been building the starry sky of data and knowledge bit by bit, but none of this would have been possible without the amazing support of our funders, partners and users around the world.

Investing in research infrastructure to support the likes of EMBL-EBI is mission-critical to the success of UK bioscience.

Read BBSRC’s infrastructure strategic framework to find out how we are supporting adaptive, resilient and sustainable infrastructure to ensure the UK remains a leader in global bioscience research and innovation.

Ewan Birney

Deputy Director General, European Molecular Biology Laboratory (EMBL)

Professor Ewan Birney is Deputy Director General of the European Molecular Biology Laboratory (EMBL). He is also Director of EMBL’s European Bioinformatics Institute (EMBL-EBI) with Dr Rolf Apweiler and runs a small research group.

He played a vital role in annotating the genome sequences of human, mouse, chicken, and several other organisms. He led the analysis group for the ENCODE project, which is defining functional elements in the human genome.

Ewan’s main areas of research include:

functional genomics
assembly algorithms
statistical methods to analyse genomic information (in particular, information associated with individual differences)
compression of sequence information

Ewan completed his PhD at the Wellcome Sanger Institute with Richard Durbin. He has received a number of prestigious awards including:

2003 Francis Crick Award from The Royal Society
2005 Overton Prize from the International Society for Computational Biology
2005 Benjamin Franklin Award for contributions in open source bioinformatics

Ewan was elected a Fellow of The Royal Society in 2014 and a Fellow of the Academy of Medical Sciences in 2015.

Ewan is a non-executive Director of Genomics England, and is a consultant and advisor to a number of companies, including Oxford Nanopore Technologies and GSK. He is also the Chair of the Global Alliance for Genomics and Health.

Ewan is a member of BBSRC Council.