Last November the Wellcome Trust launched the Data Re-use Prize to celebrate innovative reuse of open data in either antimicrobial resistance (AMR) or malaria. Entrants were asked to generate a new insight, tool or health application from one of two open data resources: the AMR ATLAS dataset or the Malaria ROAD-MAP dataset.
MRC-LID PhD student and member of the winning team for AMR, Quentin Leclerc, dropped by the SGUL RDM Service to talk about the prize and the challenging but rewarding process of reusing open data.
Quentin, congratulations on the win. Can you tell me a little bit about your team’s entry for the Data Re-use Prize?
Sure. We developed a tool to help inform empiric therapy. Empiric therapy is basically when physicians pool multiple sources of data together to make the best informed guess about how to treat a patient. This is before they know exactly what bacteria a patient is infected with and its potential resistance to antibiotics. Say, for example, a patient has sepsis and needs to be treated right away. A physician might determine the most likely causes to be E. coli and S. aureus infection and then make an informed guess about the best antibiotic to prescribe to treat both of these bacteria, bearing in mind regional estimates of each pathogen’s resistance to different antibiotics. The physician is basically thinking, “given what we know about the common causes of this condition and antibiotic resistance, which antibiotic is likely to work best?”
Our proof-of-concept web app integrates data from a range of open data sources to visualise antibiotic resistance rates for common infections to help physicians prescribe faster and more accurately. If developed, the tool could be used to inform national guidelines on how to treat common infections in many countries, particularly in low- and middle-income countries where data aren’t always available to inform empiric therapy at the local or hospital level.
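The reasoning Quentin describes can be illustrated with a small worked calculation. The sketch below is not the team's method or code; it only assumes that each candidate antibiotic can be scored by the share of likely infections it would still be expected to treat. The pathogen shares, the resistance rates and the names "antibiotic A" and "antibiotic B" are all hypothetical values chosen for illustration.

```python
# Hypothetical figures for illustration only; not taken from ATLAS or any real dataset.
# Assumed share of cases of the condition caused by each pathogen (sums to 1).
pathogen_share = {"E. coli": 0.6, "S. aureus": 0.4}

# Assumed regional resistance rate of each pathogen to each candidate antibiotic.
resistance = {
    "antibiotic A": {"E. coli": 0.30, "S. aureus": 0.10},
    "antibiotic B": {"E. coli": 0.10, "S. aureus": 0.50},
}


def expected_coverage(shares, resistance_by_pathogen):
    """Share of cases in which the causative pathogen is expected to be susceptible."""
    return sum(
        share * (1.0 - resistance_by_pathogen[pathogen])
        for pathogen, share in shares.items()
    )


coverage = {
    drug: expected_coverage(pathogen_share, rates)
    for drug, rates in resistance.items()
}
best = max(coverage, key=coverage.get)

for drug, value in coverage.items():
    print(f"{drug}: expected coverage {value:.2f}")
print("Highest expected coverage:", best)
```

With these made-up numbers, antibiotic A scores higher even though E. coli, the more common cause, is more resistant to it, because antibiotic B performs poorly against S. aureus. A real decision would also weigh factors this sketch ignores, such as severity, side effects and local guidelines.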
Sounds very exciting. As a first year PhD student, what was it like to win a prize like this?
It was really unexpected. We didn’t expect to win, we just thought, ‘we’ll publish our findings anyway so let’s see how this goes’. The other entries for the prize were very specific while our entry was pretty broad so we weren’t very confident. It was a real surprise and a great effort from everyone on the team.
As a PhD student, it was an interesting experience overall. This project is very different from my PhD but working on this tool helped me to get used to the various datasets out there and to look at the big picture of antimicrobial resistance and antibiotic prescribing. It was an enlightening process.
Can you tell me a little bit more about the process of reusing existing data? What was it like?
It was surprising. The thing with data is that it’s collected for a purpose. When someone comes in trying to use that data for a different purpose, they start to see what’s missing. They start to make approximations and assumptions to use the data for something it wasn’t intended for. The ATLAS dataset is very accurate and it’s very rich but it suits its original purpose. For example, we needed to group the data in increasingly complex ways. Once we started doing this, the sample sizes started to look quite small. The dataset wasn’t suited to those kinds of groupings.
When we started comparing the ATLAS dataset to other datasets, the AMR data appeared to show slightly different information. So we started to ask, who collected this data? In what contexts would this data have been collected? Might there be a sampling bias that explains this difference we’re seeing between the datasets? There was a legitimate reason for the difference we were seeing, but that’s why it’s really important to think about why you’re using a dataset and exactly what you want to achieve because the data may not suit your purpose.
Also, we integrated data from a range of sources. When you start doing this, comparing available datasets, you realise the heterogeneity of the data that’s out there; they are all in different formats, they have different naming conventions, even the bacteria aren’t named in the same way and we had to work out exactly which bacteria different datasets were referring to. There aren’t any standards across the different sources to make integrating the datasets easy.
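To give a feel for the naming problem, here is a minimal sketch of one way to map differently written bacterial names onto a single canonical name before combining datasets. It is an illustration under stated assumptions, not the approach the team used: the alias table and the `canonical_name` helper are hypothetical, and a real project would need a far larger, curated mapping.

```python
import re

# Hypothetical alias table: normalised spelling -> canonical species name.
ALIASES = {
    "e coli": "Escherichia coli",
    "escherichia coli": "Escherichia coli",
    "s aureus": "Staphylococcus aureus",
    "staph aureus": "Staphylococcus aureus",
    "staphylococcus aureus": "Staphylococcus aureus",
}


def canonical_name(raw):
    """Return the canonical name for a raw label, or None if it is not recognised."""
    # Treat full stops and runs of whitespace as a single space, then lowercase.
    key = re.sub(r"[.\s]+", " ", raw).strip().lower()
    return ALIASES.get(key)


labels = ["E.coli", "E. coli", "Escherichia  coli", "S. aureus", "Staph aureus", "Klebsiella"]
for label in labels:
    print(repr(label), "->", canonical_name(label))
```

Returning `None` for unrecognised labels, rather than guessing, keeps unmatched names visible so they can be checked by hand.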
So there were a lot of challenges to reusing data that someone else created?
Yes, we needed to keep in mind that the data was not created to answer our research question. We also found that there was a lack of information in the available literature around the common causative pathogens of several infections to help us understand and use the data correctly.
What advice would you give to researchers who want to reuse open datasets but are hesitant?
It is important to look at the dataset and really understand it. Ask yourself why it was collected, where it was collected, how it was collected. Don’t take anything for granted. Open datasets are incredible resources but you can’t blindly go in there.
Once you understand the dataset you’ll naturally get the confidence to use it and ask the right questions of it. You won’t be scared or overwhelmed by it. You’ll also save a lot of time once you start working on the data and better understand how to combine it with other datasets.
Quentin and his team’s winning entry, Antibiotic Resistance: Interdisciplinary Action (AR:IA), is openly available here. The team was led by Dr Gwen Knight at the London School of Hygiene and Tropical Medicine and included Nichola Naylor, Francesc Coll and Alexander Aiken.
If you have any questions about finding and reusing open data contact Michelle Harricharan, Research Data Support Manager.
UPDATE 03/05/2019: You can read the official SGUL news release on this prize here.