Earlham Institute: Leading the data-driven science revolution


Director Professor Neil Hall discusses the Earlham Institute’s multidisciplinary approach, current projects, and breakthroughs, as well as the research institute’s collaboration with Google to produce the next generation of coders. Employing an innovative computational science and biotechnology approach, the Earlham Institute is bringing biology into the digital age.

At the forefront of modern life science research, the Earlham Institute aims to answer ambitious biological questions, and generate resources that enable academic and industrial investigators to make new discoveries. Established in 2009, the Earlham Institute was founded as a national facility to promote the use of genomics and innovation in the UK. Ever-expanding, the research institute’s research groups include talented computer scientists, molecular biologists, mathematicians, and geneticists. The Earlham Institute applies computational methods for the collection, analysis and management of large biological datasets – driving progress onwards within the data-driven science revolution.

Outlining some of the Earlham Institute’s achievements and long-term strategy, Professor Neil Hall reveals why he thinks there may be a future shift away from viewing biology as the ‘messy, capricious and unpredictable’ sister of ‘clean’ physics.

What is involved in your role as the Director of EI?
As director, I am involved in setting the strategy for the institute and engaging with our main funder, the Biotechnology and Biological Sciences Research Council (BBSRC), as well as our key collaborators. I ensure that the Earlham Institute remains leading-edge – both in our technology platforms and in the research we deliver. I also have my own research group in microbial genomics.

Research Features actively encourages scientific collaboration. How does EI’s multidisciplinary approach improve our understanding of genomics?
We believe that transformational changes in research technology are driven by research need. Hence, at the Earlham Institute, we have computer scientists, molecular biologists, mathematicians, and geneticists (along with other disciplines) working side by side. So when one of the research principal investigators thinks “I need to be able to do something we can’t currently do”, our technology platforms can work with the various experts to deliver a solution. This would not work if we were merely concerned with delivering data to external users.

Could you briefly outline EI’s interdisciplinary programme: Digital Biology?
Our Digital Biology research programme is dedicated to applying computational methods to the analysis and management of large biological datasets. This can range from solving mundane problems, such as how to rapidly and accurately perform quality assessment of large DNA sequencing datasets, to more complex problems like reconstructing biological systems from multiple complex datasets. For example, we are developing methods to take multiple measurements of crops (including images, biochemical data and genetic data) and to process the collected data in order to predict which genes need to be selected to increase yields or protect against crop disease. The overarching aim is to take complex data and deliver something that is directly useful to breeders.

EI’s recent project, Engineering DNA with synthetic biology, aims to develop novel antibiotics in the future. What progress have you made so far?
Our Engineering Biology research programme is relatively new, with a BBSRC investment of over £3m to set up our ‘DNA foundry’. The foundry is a technical platform to generate synthetic DNA, and enables us to follow up hypotheses generated from our in silico analysis of large datasets. The lab has only recently been completed, with new faculty working in this area, including Dr Nicola Patron and Prof Anthony Hall. We have also been collaborating with the Giles Oldroyd Group at JIC, who are engineering plants to fix their own nitrogen. There is some way to go before we will be developing novel antibiotics.

EI is currently building the basis of a new system for large, energy-efficient DNA sequence searching. How do you think Project GENESYS: Genetic Search System could help future researchers in the health science field?
I think the major advantage of the optical processing technology we are testing in this project is that it could make high-performance computing (HPC) available in an affordable and portable system. It is well known that collecting biological data, particularly DNA sequence and images, is now faster and cheaper than ever. So the computational processing of that data is the new bottleneck for biological research, and the expense, running cost and technical expertise required are hurdles that must be overcome before HPC can be used as a deployable technology in a healthcare setting. In this project, we aim to allow genetic sequence analysis to be performed locally, without the prohibitive running and build costs of HPC systems.

What have been EI’s most significant achievements over the last year?
There are many candidates for this honour, and not all are scientific: we have run many successful training events, and a number of our young faculty members have won prestigious grants, all of which are institutional achievements. Perhaps one of the most significant achievements was made by Bernardo Clavijo, who worked with the Broad Institute, US, to develop a novel computational approach to assemble the wheat genome. This genome is one of the most complex sequenced to date, and putting together the millions of short sequences we generate into a complete genome is a longstanding computational problem. Now that this has been achieved, we can start to compare genomes from many different varieties of wheat and also start to understand the genetic basis of important traits, such as disease resistance, drought tolerance and bread-making quality.


What is in the pipeline within EI’s long-term strategy?
As many of our faculty have been here for less than three years, a lot will be new. We have a rapidly expanding research field in the genomics of fish, both wild and aquaculture species. Fish are growing in importance as a protein source for the growing human population and we need to accelerate genetic improvement. Also, cichlid fish are a model species for understanding natural evolutionary processes like speciation.

We also have new faculty working on non-coding RNA, conservation genomics, crop disease, and infield phenotyping. Hence, I expect the next five years to see a real diversification in our work programme.

EI is currently collaborating with Google, as a mentoring organisation for the Google Summer of Code 2016 programme. How do you feel the partnership will help produce the next generation of coders?
The Summer of Code enables young people to contribute to open source software, so that they can do something genuinely useful with their skills. This will hopefully inspire them to enter a career in computing. The Earlham Institute, like many organisations, is helping to mentor the student volunteers. The fact is that not all young people think of coding as ‘cool’, but Google is widely recognised as being a cool organisation. Hence, its name alone will help to gain people’s attention.

What are you personally most excited about for the future of computational science and biotechnology?
There are many practical applications that are coming to fruition currently, like personalised medicine, synthetic biology, and genomic selection in breeding programmes. However, I am actually excited by the possibility that within my lifetime we can derive knowledge directly from large-scale biological data such as genome sequence. We think of physics as being clean, where behaviour follows rules and theory can make accurate predictions. However, biology is thought of as messy, capricious and unpredictable – just too complex. I believe we are getting better at measurement, and importantly, our theory and models are improving so that we might be getting closer to a ‘physics’ world view.

• Neil Hall has been working in genomics for over 15 years. He has previously led research groups at the Sanger Institute, The Institute for Genomic Research, and The University of Liverpool. His research focuses on comparative and evolutionary genomics in pathogens (particularly parasitic protists) to understand the molecular basis of important phenotypes, such as virulence and host specificity. His group also apply genomics to the analysis of microbial communities, in order to understand how they may influence health or respond to changing environments. Neil serves on the Wellcome Trust Biomedical Resources Committee and the BBSRC Exploring New Ways of Working Strategy Panel.
