Open Access Week 2019: Publicly funded research data are a public good

This week, October 21 – 27, 2019, is Open Access Week, an international event celebrating and promoting openness in research.

In keeping with this year’s theme, Open for Whom? Equity in Open Knowledge, this blog post reflects on the public benefits of open data and on the current challenges and opportunities.

We’re using the Library’s Twitter account (@sgullibrary) to retweet interesting articles and blog posts all this week.


Open for whom?

This week the international research community is celebrating Open Access Week by reflecting on equity in open knowledge and enabling inclusive and diverse conversations on a single question: “open for whom?” Today’s blog post focuses specifically on open research data. UK Research and Innovation (UKRI) state in their Common Principles on Data Policy that:

Publicly funded research data are a public good, produced in the public interest, which should be made openly available with as few restrictions as possible in a timely and responsible manner.

But who exactly does open research data benefit? We often speak about the benefits of open data to research and innovation:

  • enabling transparency
  • promoting reproducibility
  • boosting opportunities for collaboration
  • enhancing opportunities for innovation
  • reducing inefficiencies in research

The public ultimately benefit from open research data but are often treated as beneficiaries and not active, engaged partners.

This year’s theme asked me to challenge an assumption that open research data are for (and used primarily by) scientific/technical specialists working “in the public interest”, rather than the public themselves. A noble endeavour, I thought. So off I set…

Picture of a unicorn galloping over a rainbow.
Designed by Freepik

Who is the public?

At the very start, I faced a conundrum – who exactly is the public? The National Co-ordinating Centre for Public Engagement (NCCPE) helped ‘define the territory’. The short answer is everyone. Anyone can be a part of the range of groups that make up the public.

Graph of stakeholders in public engagement supplied by The National Co-ordinating Centre for Public Engagement.
Source: The National Co-ordinating Centre for Public Engagement

Non-governmental organisations, social enterprises, health and well-being agencies, local authorities, strategic bodies and community, cultural and special interest groups all comprise members of the public with an interest in accessing data to inform decisions that will benefit their group.

Releasing raw data in ways that make the data easy to find, access, understand and reuse helps maximise the potential benefits of research data across the social spectrum. It should be easy to discover what research data are available and how those data can be accessed. When released, data should be in open formats so that anyone can access them, not just a select or privileged few possessing expensive, proprietary software. Data should also be shared with sufficient information about how they were created, how they should be understood and how to reuse them meaningfully and responsibly. Finally, data should always be shared under licences which tell people what they can do with them. Together, these practices are known as the FAIR data principles (findable, accessible, interoperable, reusable), and they enable maximum reuse of research data.
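As a rough illustration of what this can look like in practice, here is a minimal Python sketch that writes a dataset to an open format (CSV) alongside a small metadata file. The function name `share_dataset`, the file names and the list of required metadata fields are all assumptions made for this example; they are not part of any standard or of any policy mentioned above.

```python
import csv
import json
from pathlib import Path

# Hypothetical minimum a reuser needs; real repositories define their own schemas.
REQUIRED_FIELDS = {"title", "description", "licence", "how_collected"}


def share_dataset(rows, fieldnames, metadata, out_dir):
    """Write rows to an open-format CSV with a JSON metadata file next to it."""
    missing = REQUIRED_FIELDS - metadata.keys()
    if missing:
        # Refuse to publish data that nobody could interpret or legally reuse.
        raise ValueError(f"metadata is missing: {sorted(missing)}")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    data_path = out_dir / "data.csv"
    with data_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    meta_path = out_dir / "metadata.json"
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return data_path, meta_path
```

The point of the check is simply that the data, the description of how they were created and the licence travel together, so that anyone can open the files and know what they are allowed to do with them.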

Measured voices

It’s here that a measured voice within me started whispering… and I listened carefully.

Colourfully drawn arrows going in different directions on a blackboard

Is this really enough? This still has the potential to get messy. Very messy. Especially if we’re talking about health and medical data derived from human beings, which can be sensitive and which we have taken responsibility for protecting.

In the fallout of various data scandals, including scandals about the data used to train artificial intelligence, organisations everywhere are scrambling to restore public trust in the way we handle and use data. Part of restoring that trust is in the transparency offered by open data. Another aspect of restoring trust is in safeguarding the data that people provide us with and using that data responsibly, in ways individuals have consented to.

This tension between openness and our professional responsibilities is recognised in the UKRI’s data policy as well:

UKRI recognises that there are legal, ethical and commercial constraints on release of research data. To ensure that the research process is not damaged by inappropriate release of data, research organisation policies and practices should ensure that these are considered at all stages in the research process.

This is a tension we are constantly negotiating given the kinds of data that we handle at St George’s.

Data ethics

A new field of applied ethics, called data ethics, gives us a useful framework for exploring and responding to legal and moral issues related to data collection, processing, sharing and reuse. The Open Data Institute has developed the Data Ethics Canvas to help organisations identify and manage ethical issues related to data. The UK Department for Digital, Culture, Media and Sport also provides a Data Ethics Framework to guide the use of data in the public sector.

Being responsible in our data sharing means that a large amount of the data produced from human participants is available only to other researchers, on request. This takes me right back to where I started, though with the caveat that it might be particularly relevant for health and medical research: an assumption that open research data are for (and used primarily by) scientific/technical specialists working “in the public interest”, rather than the public themselves.

But maybe there’s a middle ground for health and medical data derived from human participants? Maybe there are possibilities for us to create meaningful and lasting partnerships with ‘the public’ to realise the public benefits of data? The UK Biobank engages very closely with their participants, but they are still participants. I wonder if there are examples out there of projects where participants are also decision-makers about their data. Or examples of projects that have formed collaborations with civil society and/or public sector groups to realise the greater benefits of data. It would be nice to see examples of initiatives like these to use as a springboard for wider conversation. 

Michelle Harricharan, Research Data Support Manager (researchdata@sgul.ac.uk)


If you are interested in receiving updates from the Library on all things open access, open data and scholarly research communications, you can subscribe to the Library Blog using the Follow button or click here for further posts from us.

Challenging but rewarding – Wellcome Trust Data Re-use Prize winner, Quentin Leclerc, on reusing open data

Last November the Wellcome Trust launched the Data Re-use Prize to celebrate innovative reuse of open data either in antimicrobial resistance (AMR) or malaria. Entrants were asked to generate a new insight, tool or health application from two open data resources, the AMR ATLAS dataset or the Malaria ROAD-MAP dataset.

MRC-LID PhD student and member of the winning team for AMR, Quentin Leclerc, dropped by the SGUL RDM Service to talk about the prize and the challenging but rewarding process of reusing open data.

Quentin, congratulations on the win. Can you tell me a little bit about your team’s entry for the Data Re-Use Prize?

Sure. We developed a tool to help inform empiric therapy. Empiric therapy is basically when physicians pool multiple sources of data together to make the best informed guess about how to treat a patient. This is before they know exactly what bacteria a patient is infected with and its potential resistance to antibiotics. Say, for example, a patient has sepsis and needs to be treated right away. A physician might determine the most likely causes as E. coli and S. aureus infection and then make an informed guess about the best antibiotic to prescribe to treat both of these bacteria, bearing in mind regional estimates of each pathogen’s resistance to different antibiotics. The physician is basically thinking, “given what we know about the common causes of this condition and antibiotic resistance, which antibiotic is likely to work best?”

Our proof-of-concept web app integrates data from a range of open data sources to visualise antibiotic resistance rates for common infections to help physicians prescribe faster and more accurately. If developed, the tool could potentially be used to inform national guidelines on how to treat common infections in many countries, particularly in low- and middle-income countries where data aren’t always available to inform empiric therapy at the local or hospital level.
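(Editor’s aside: the reasoning Quentin describes can be sketched as a small weighted calculation. This is only an illustration under simple assumptions, not the team’s actual method. The function name `expected_coverage`, the placeholder antibiotic names and every number below are invented for the example and do not come from ATLAS or any real dataset.)

```python
def expected_coverage(pathogen_probs, resistance_rates):
    """Estimate, per antibiotic, the chance that it works for this condition.

    coverage = sum over pathogens of P(pathogen) * (1 - resistance rate)
    """
    coverage = {}
    for antibiotic, rates in resistance_rates.items():
        coverage[antibiotic] = sum(
            prob * (1.0 - rates[pathogen])
            for pathogen, prob in pathogen_probs.items()
        )
    return coverage


# Invented numbers, for illustration only.
pathogen_probs = {"E. coli": 0.6, "S. aureus": 0.4}
resistance_rates = {
    "antibiotic_A": {"E. coli": 0.5, "S. aureus": 0.1},
    "antibiotic_B": {"E. coli": 0.2, "S. aureus": 0.3},
}

coverage = expected_coverage(pathogen_probs, resistance_rates)
best = max(coverage, key=coverage.get)
print(best, round(coverage[best], 2))
```

With these made-up inputs, antibiotic_B comes out ahead even though it is the weaker option against S. aureus, because E. coli is assumed to be the more likely cause.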

app screenshot
Some visualisations from the team’s AR:IA app

Sounds very exciting. As a first year PhD student, what was it like to win a prize like this?

It was really unexpected. We didn’t expect to win, we just thought, ‘we’ll publish our findings anyway so let’s see how this goes’. The other entries for the prize were very specific while our entry was pretty broad so we weren’t very confident. It was a real surprise and a great effort from everyone on the team.

Team photo (l to r): Gwen Knight, Quentin Leclerc, Nichola Naylor and Alexander Aiken
Missing: Francesc Coll

As a PhD student, it was an interesting experience overall. This project is very different from my PhD but working on this tool helped me to get used to the various datasets out there and to look at the big picture of antimicrobial resistance and antibiotic prescribing. It was an enlightening process.

Can you tell me a little bit more about the process of reusing existing data? What was it like?

It was surprising. The thing with data is that it’s collected for a purpose. When someone comes in trying to use that data for a different purpose, they start to see what’s missing. They start to make approximations and assumptions to use the data for something it wasn’t intended for. The ATLAS dataset is very accurate and it’s very rich but it suits its original purpose. For example, we needed to group the data in increasingly complex ways. Once we started doing this, the sample sizes started to look quite small. The dataset wasn’t suited to those kinds of groupings.
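(Editor’s aside: a minimal sketch of why finer groupings shrink sample sizes. The records and the column names `country`, `year` and `species` are invented for the example and are not taken from the ATLAS dataset.)

```python
from collections import Counter
from itertools import product


def group_sizes(records, keys):
    """Count how many records fall into each combination of the given keys."""
    return Counter(tuple(record[key] for key in keys) for record in records)


# Invented records: one isolate for every combination of two countries,
# two years and two species (8 records in total).
records = [
    {"country": c, "year": y, "species": s}
    for c, y, s in product(
        ["Country A", "Country B"], [2016, 2017], ["E. coli", "S. aureus"]
    )
]

for keys in (["country"], ["country", "year"], ["country", "year", "species"]):
    sizes = group_sizes(records, keys)
    print(keys, "-> smallest group:", min(sizes.values()))
```

Here the smallest group shrinks from 4 records to 2 and then to 1 as each grouping variable is added, which is the effect described above on a much larger scale.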

When we started comparing the ATLAS dataset to other datasets, the AMR data appeared to show slightly different information. So we started to ask, who collected this data? In what contexts would this data have been collected? Might there be a sampling bias that explains this difference we’re seeing between the datasets? There was a legitimate reason for the difference we were seeing, but that’s why it’s really important to think about why you’re using a dataset and exactly what you want to achieve because the data may not suit your purpose.

Also, we integrated data from a range of sources. When you start doing this, comparing available datasets, you realise the heterogeneity of the data that’s out there; they are all in different formats, they have different naming conventions, even the bacteria aren’t named in the same way and we had to work out exactly which bacteria different datasets were referring to. There aren’t any standards across the different sources to make integrating the datasets easy.
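(Editor’s aside: one common way to handle inconsistent naming is to map every spelling to a single canonical name before combining datasets. The sketch below assumes a small hand-made lookup table; the table, `normalise` and `canonical_name` are hypothetical and are not the team’s code.)

```python
import re

# Hypothetical lookup from cleaned-up spellings to one canonical name.
CANONICAL = {
    "e coli": "Escherichia coli",
    "escherichia coli": "Escherichia coli",
    "s aureus": "Staphylococcus aureus",
    "staph aureus": "Staphylococcus aureus",
    "staphylococcus aureus": "Staphylococcus aureus",
}


def normalise(name):
    """Lower-case the name and turn punctuation and extra spaces into single spaces."""
    return re.sub(r"[^a-z]+", " ", name.lower()).strip()


def canonical_name(name):
    """Return the canonical name, or None when the spelling is not recognised."""
    return CANONICAL.get(normalise(name))


for raw in ["E.coli", "S. aureus", "  Staphylococcus  Aureus ", "Klebsiella"]:
    print(repr(raw), "->", canonical_name(raw))
```

Returning None for unrecognised spellings, rather than guessing, leaves those records to be checked by hand, which matters when a wrong match would change a resistance estimate.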

So there were a lot of challenges to reusing data that someone else created?

Yes, we needed to keep in mind that the data was not created to answer our research question. We also found that there was a lack of information in the available literature around the common causative pathogens of several infections to help us understand and use the data correctly.

What advice would you give to researchers who want to reuse open datasets but are hesitant?

It is important to look at the dataset and really understand it. Ask yourself why it was collected, where it was collected, how it was collected. Don’t take anything for granted. Open datasets are incredible resources but you can’t blindly go in there.

Once you understand the dataset you’ll naturally get the confidence to use it and ask the right questions of it. You won’t be scared or overwhelmed by it. You’ll also save a lot of time once you start working on the data and better understand how to combine it with other datasets.

Quentin and his team’s winning entry, Antibiotic Resistance: Interdisciplinary Action (AR:IA), is openly available here. The team was led by Dr Gwen Knight at the London School of Hygiene and Tropical Medicine and included Nichola Naylor, Francesc Coll and Alexander Aiken.    

If you have any questions about finding and reusing open data, contact Michelle Harricharan, Research Data Support Manager.

UPDATE 03/05/2019: You can read the official SGUL news release on this prize here.