Preprints in the biological, medical and health sciences: some questions answered.

The open research movement is about disseminating scientific outputs widely and openly as soon as possible. One of the ways that researchers can rapidly share their work with a wide audience is by posting a preprint to a preprint server. The practice of sharing and commenting on preprints has recently been described as ‘science in real time’1.

What is a preprint?
Why post preprints online?
Before you post your preprint, what should you consider?
Where can I post preprints?
Where are preprints indexed?
How do I find out about preprints?
Can SGUL researchers record and deposit preprints in CRIS/SORA/SGUL Data Repository?
The future of preprints
Queries about preprints or open research?

What is a preprint?

The preprint is the original version of your work, before peer review and before acceptance by a journal.

Why post preprints online?

  • Publishing your research as a preprint means that you can get your work out fast. From 2021, the Wellcome Trust2 will require that any research they fund that is relevant to a public health emergency be published as a preprint, in order to disseminate findings on such important areas as quickly as possible3,4.
  • Your work will be citeable and shareable as soon as it’s posted, allowing you to demonstrate the work you’re doing to funders, colleagues and potential collaborators.
  • Immediate feedback from your peers can help you improve your manuscript, as well as opening up potential avenues for follow-up work or collaborations.
  • By publishing your findings as a preprint, you can publicly establish priority by date-stamping your findings and making your preprint part of the scientific record.
  • Preprint servers (examples below) allow for disseminating hard-to-publish but important work such as negative/null findings.
  • In fields where posting preprints to preprint servers is commonplace, these can become a one-stop shop for getting a quick overview of the newest developments in the field – a piece in Nature5 highlights how bioRxiv can be used to help researchers stay abreast of what their colleagues are working on.

Before you post your preprint, what should you consider?

If you are posting as a step prior to publishing in a journal, check whether your prospective journal has any rules around preprints – do they consider posting preprints as ‘prior publication’?

What’s the best platform for what you want to achieve? If you want feedback on your paper from a specific group before going more public, you could share it on St George’s data repository via a closed group or a private link.

Are there charges for posting? Where there are charges, these tend to be much lower than open access fees in more established journals; however, you will still need to consider how these are paid.

Where can I post preprints?

bioRxiv is a preprint server for the biological sciences. Many journals allow you to submit work that has been previously published as a preprint, and preprints posted to bioRxiv can also be directly transferred for submission to a variety of other peer review services (e.g. PLOS, BMC). An analysis6 earlier this year of bioRxiv preprints found that “two-thirds of preprints posted before 2017 were later published in peer-reviewed journals”.

medRxiv is a preprint server using the same software as bioRxiv, and papers on health sciences topics can be posted there.

BioMed Central have recently launched a new prepublication option, In Review, for articles under consideration in four of their journals: BMC Anesthesiology, BMC Neurology, BMC Ophthalmology and Trials.

F1000 Research, Wellcome Open Research and the new AMRC Open Research operate under a slightly different model: preprints posted to these sites are then openly peer reviewed, and the article is considered published once it has passed peer review. 

All these sites screen contributions for plagiarism and appropriateness, and to ensure they meet ethical standards.

Where are preprints indexed?

bioRxiv and medRxiv preprints are indexed by Google, Google Scholar, Crossref and other search tools. They are not indexed by Web of Science; however, they are indexed in Europe PMC (EPMC) as follows:

“To distinguish preprints from peer reviewed articles in Europe PMC, each preprint is given a PPR ID, and is clearly labelled as a preprint, both on the abstract view and the search results… When preprints have subsequently been published as peer-reviewed articles and indexed in Europe PMC they are crosslinked to each other.”

Preprints are not indexed in PubMed until they have achieved sufficient peer review.

How do I find out about preprints?

Preprint platforms offer options to set up alerts for subject categories and recent additions, and to track papers when they are revised.

Rxivist combines preprints from bioRxiv with data from Twitter to surface the papers being discussed in a particular field, helping researchers deal with the “avalanche” of research7 they may be faced with.

I’m an SGUL researcher: can I record and deposit my preprints in SGUL’s CRIS (Current Research Information System), St George’s Research Data Repository or publications repository, SORA (St George’s Online Research Archive)?

Records for preprints can come into your CRIS profile from Crossref and Europe PMC. This is useful as it adds to the completeness of your publication list in CRIS.

As and when a paper from bioRxiv or medRxiv goes on to be published in a journal, we’d expect to see a record for this in CRIS too.

For the purposes of making full text available via SORA, we have historically made publicly available only those versions of an article that follow peer review (either the final accepted manuscript or, where possible, the publisher version).

For REF 2021, while preprints will be eligible for submission8, only outputs which have been ‘accepted for publication’ (such as a journal article or conference contribution with an ISSN) are within the scope of the REF 2021 open access policy. SGUL researchers should continue to follow the deposit on acceptance advice and upload the accepted version of their papers to CRIS for SORA.

The future of preprints

There has been debate about the pros and cons of preprints, particularly over whether research disseminated in this way will advance healthcare for patients9. Even so, improvements to preprint platforms (such as medRxiv’s cautionary advice to news media on its homepage) and backing by funders should mean that preprints, as a tool for researchers to quickly share and find preliminary findings, will be around for the foreseeable future.

As funder mandates and preprint practices develop in the medical and health sciences, we will keep our system capabilities for capturing and promoting researchers’ preprints under active review.

Queries about preprints or open research?

Contact us

CRIS & Deposit on acceptance:

Open Access Publications:

Research Data Management:

We look forward to hearing from you.

Michelle Harricharan, Research Data Support Manager
Jenni Hughes, Research Publications Assistant
Jennifer Smith, Research Publications Librarian

Look out for a Library blog post on open peer review during Peer Review Week which is taking place September 16-20 2019.

If you are interested in receiving updates from the Library on all things open access, open data and scholarly research communications, you can subscribe to the Library Blog using the Follow button or click here for further posts from us.


1. Knowledge Exchange. Preprints: Science in real time [Internet]. Bristol: Knowledge Exchange; 2018 [cited 2019 Aug 7]. Available from:

See also the slide deck:

Chiarelli A, Johnson R, Pinfield S, Richens E. Practices, drivers and impediments in the use of preprints: Phase 1 report [Internet]. 2019 [cited 2019 Aug 8]. Available from:

2. Wellcome Trust. Open Access Policy 2021 [Internet]. London: Wellcome; 2019 [cited 2019 Aug 8]. Available from:

3. Peiperl L. Preprints in medical research: Progress and principles. PLoS Med [Internet]. 2018 [cited 2019 Aug 8];15(4):e1002563. Available from:

4. Johansson MA, Reich NG, Meyers LA, Lipsitch M. Preprints: An underutilized mechanism to accelerate outbreak science. PLoS Med [Internet]. 2018 [cited 2019 Aug 8];15(4):e1002549. Available from:

5. Learn JR. What bioRxiv’s first 30,000 preprints reveal about biologists [Internet]. 2019 [cited 2019 Aug 8]. Available from:

6. Abdill RJ, Blekhman R. Tracking the popularity and outcomes of all bioRxiv preprints. bioRxiv [Internet]. 2019 [cited 2019 Aug 7];515643. Available from:

7. Abdill RJ, Blekhman R. Sorting biology preprints using social media and readership metrics. PLoS Biol [Internet]. 2019 [cited 2019 Aug 8];17(5):e3000269. Available from:

8. REF 2021. Guidance on submissions (2019/01) Section 238. [Internet]. 2019 [cited 2019 Aug 7]. Available from:

9. Krumholz HM, Ross JS, Otto CM. Will research preprints improve healthcare for patients? BMJ [Internet]. 2018 [cited 2019 Aug 8];362:k3628. Available from:

Challenging but rewarding – Wellcome Trust Data Re-use Prize winner, Quentin Leclerc, on reusing open data

Last November the Wellcome Trust launched the Data Re-use Prize to celebrate innovative reuse of open data either in antimicrobial resistance (AMR) or malaria. Entrants were asked to generate a new insight, tool or health application from two open data resources, the AMR ATLAS dataset or the Malaria ROAD-MAP dataset.

MRC-LID PhD student and member of the winning team for AMR, Quentin Leclerc, dropped by the SGUL RDM Service to talk about the prize and the challenging but rewarding process of reusing open data.

Quentin, congratulations on the win. Can you tell me a little bit about your team’s entry for the Data Re-Use Prize?

Sure. We developed a tool to help inform empiric therapy. Empiric therapy is basically when physicians pool multiple sources of data together to make the best informed guess about how to treat a patient. This is before they know exactly what bacteria a patient is infected with and its potential resistance to antibiotics. Say, for example, a patient has sepsis and needs to be treated right away. A physician might determine the most likely causes as E. coli and S. aureus infection and then make an informed guess about the best antibiotic to prescribe to treat both of these bacteria, bearing in mind regional estimates of each pathogen’s resistance to different antibiotics. The physician is basically thinking, “given what we know about the common causes of this condition and antibiotic resistance, which antibiotic is likely to work best?”

Our proof-of-concept web app integrates data from a range of open data sources to visualise antibiotic resistance rates for common infections to help physicians prescribe faster and more accurately. If developed, the tool can potentially be used to inform national guidelines on how to treat common infections in many countries, particularly in low- and middle-income countries where data aren’t always available to inform empiric therapy at the local or hospital level.
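The reasoning Quentin describes can be sketched as a small weighted-average calculation. This is only an illustration of the idea, not the team’s app or method: the antibiotic names, pathogen shares and resistance rates below are all made-up placeholders, and `expected_coverage` and `best_antibiotic` are hypothetical helper names.

```python
# Illustrative only: every number below is invented for this sketch
# and is NOT taken from ATLAS or any other dataset.

# Assumed probability that each pathogen is the cause of the infection.
pathogen_share = {"E. coli": 0.6, "S. aureus": 0.4}

# Assumed regional resistance rates: resistance[antibiotic][pathogen].
resistance = {
    "antibiotic A": {"E. coli": 0.30, "S. aureus": 0.10},
    "antibiotic B": {"E. coli": 0.10, "S. aureus": 0.50},
}


def expected_coverage(antibiotic):
    """Chance the antibiotic works, averaged over the likely pathogens."""
    return sum(
        share * (1.0 - resistance[antibiotic][pathogen])
        for pathogen, share in pathogen_share.items()
    )


def best_antibiotic():
    """Antibiotic with the highest expected coverage under these assumptions."""
    return max(resistance, key=expected_coverage)


coverage = {name: expected_coverage(name) for name in resistance}
```

With these placeholder figures, antibiotic A covers 0.6 × 0.7 + 0.4 × 0.9 = 0.78 of cases and antibiotic B covers 0.6 × 0.9 + 0.4 × 0.5 = 0.74, so A would be preferred. Real prescribing decisions weigh far more than this single number.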

Some visualisations from the team’s AR.IA app

Sounds very exciting. As a first year PhD student, what was it like to win a prize like this?

It was really unexpected. We didn’t expect to win, we just thought, ‘we’ll publish our findings anyway so let’s see how this goes’. The other entries for the prize were very specific while our entry was pretty broad so we weren’t very confident. It was a real surprise and a great effort from everyone on the team.

Team photo (l to r): Gwen Knight, Quentin Leclerc, Nichola Naylor and Alexander Aiken
Missing: Francesc Coll

As a PhD student, it was an interesting experience overall. This project is very different from my PhD but working on this tool helped me to get used to the various datasets out there and to look at the big picture of antimicrobial resistance and antibiotic prescribing. It was an enlightening process.

Can you tell me a little bit more about the process of reusing existing data? What was it like?

It was surprising. The thing with data is that it’s collected for a purpose. When someone comes in trying to use that data for a different purpose, they start to see what’s missing. They start to make approximations and assumptions to use the data for something it wasn’t intended for. The ATLAS dataset is very accurate and it’s very rich but it suits its original purpose. For example, we needed to group the data in increasingly complex ways. Once we started doing this, the sample sizes started to look quite small. The dataset wasn’t suited to those kinds of groupings.
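The point about sample sizes shrinking under finer groupings can be seen with a toy example. The records below are hypothetical and bear no relation to the ATLAS dataset; `smallest_group` is an illustrative helper, not part of any library.

```python
from collections import Counter

# Hypothetical isolate records as (country, year, species); not real data.
records = [
    ("UK", 2016, "E. coli"), ("UK", 2016, "E. coli"),
    ("UK", 2017, "E. coli"), ("UK", 2017, "S. aureus"),
    ("FR", 2016, "E. coli"), ("FR", 2016, "S. aureus"),
    ("FR", 2017, "S. aureus"), ("FR", 2017, "S. aureus"),
]


def smallest_group(rows, n_keys):
    """Size of the smallest group when grouping on the first n_keys fields."""
    counts = Counter(row[:n_keys] for row in rows)
    return min(counts.values())


# Group by country, then country + year, then country + year + species.
sizes = [smallest_group(records, n) for n in (1, 2, 3)]
```

Here the smallest group falls from 4 records to 2 and then to 1 as each grouping variable is added, which is why estimates from finely stratified data can quickly become unreliable.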

When we started comparing the ATLAS dataset to other datasets, the AMR data appeared to show slightly different information. So we started to ask, who collected this data? In what contexts would this data have been collected? Might there be a sampling bias that explains this difference we’re seeing between the datasets? There was a legitimate reason for the difference we were seeing, but that’s why it’s really important to think about why you’re using a dataset and exactly what you want to achieve because the data may not suit your purpose.

Also, we integrated data from a range of sources. When you start doing this, comparing available datasets, you realise the heterogeneity of the data that’s out there; they are all in different formats, they have different naming conventions, even the bacteria aren’t named in the same way and we had to work out exactly which bacteria different datasets were referring to. There aren’t any standards across the different sources to make integrating the datasets easy.
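As a rough illustration of the naming problem, a first step when combining sources is often to map each spelling of an organism to one canonical form. The sketch below is an assumption about how that might look, not the team’s actual approach: the `GENUS` lookup and `canonical_name` function are hypothetical, and single-letter abbreviations are genuinely ambiguous in practice (“S.” could be Staphylococcus, Streptococcus or Salmonella), so a real mapping would need checking against each dataset’s documentation.

```python
import re

# Hypothetical lookup from genus abbreviation to full genus name.
# Assumes "s" means Staphylococcus, which will not hold for every dataset.
GENUS = {"e": "Escherichia", "s": "Staphylococcus"}


def canonical_name(raw):
    """Map variants such as 'E.coli' or 'ESCHERICHIA COLI' to one spelling."""
    # Split on whitespace, dots and underscores, ignoring case.
    parts = [p for p in re.split(r"[\s._]+", raw.strip().lower()) if p]
    if len(parts) != 2:
        raise ValueError(f"unexpected organism name: {raw!r}")
    genus, species = parts
    genus = GENUS.get(genus, genus.capitalize())
    return f"{genus} {species}"
```

For example, `canonical_name("E.coli")` and `canonical_name("ESCHERICHIA COLI")` both return `"Escherichia coli"`, so records from the two sources can then be matched on the same key.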

So there were a lot of challenges to reusing data that someone else created?

Yes, we needed to keep in mind that the data was not created to answer our research question. We also found that there was a lack of information in the available literature around the common causative pathogens of several infections to help us understand and use the data correctly.

What advice would you give to researchers who want to reuse open datasets but are hesitant?

It is important to look at the dataset and really understand it. Ask yourself why it was collected, where it was collected, how it was collected. Don’t take anything for granted. Open datasets are incredible resources but you can’t blindly go in there.

Once you understand the dataset you’ll naturally get the confidence to use it and ask the right questions of it. You won’t be scared or overwhelmed by it. You’ll also save a lot of time once you start working on the data and better understand how to combine it with other datasets.

Quentin and his team’s winning entry, Antibiotic Resistance: Interdisciplinary Action (AR:IA), is openly available here. The team was led by Dr Gwen Knight at the London School of Hygiene and Tropical Medicine and included Nichola Naylor, Francesc Coll and Alexander Aiken.    

If you have any questions about finding and reusing open data contact Michelle Harricharan, Research Data Support Manager.

UPDATE 03/05/2019: You can read the official SGUL news release on this prize here.

The GDPR and health research

St George’s researchers will already be aware of the EU General Data Protection Regulation (GDPR) and the new UK Data Protection Bill, which will govern how we handle personal data after 25 May 2018. While we have learnt a lot about our obligations under the new regulations, researchers may not be clear about what these obligations mean for research. The SGUL Joint Research and Enterprise Services (JRES), Governance and Legal Assurance Services and the Research Data Management Service have come together to clear up a number of misconceptions about what the new regulations may mean for health and social care research. Read on!

It is not clear how the GDPR relates to health and social care research

The GDPR’s scope extends well beyond clinical research: it relates to all personal data, including data handled by web search engines, social media and much more. Specifically, data required in research (and the way it is managed) falls within its remit. Identifiers such as names, addresses, dates of birth and electronic medical numbers all constitute personal information. However, the GDPR expands the personal data definition to include information such as location information, genetic data and IP addresses. In sum, any data that could potentially be used to directly or indirectly identify a person is considered personal data. In addition, pseudonymised data will now be considered personal data and therefore governed by the GDPR.
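To see why pseudonymised data still counts as personal data, it can help to look at what pseudonymisation typically involves. The sketch below is one common technique (a keyed hash), offered as an illustration only and not as a compliance recipe; the `pseudonymise` function, the key and the participant IDs are all hypothetical.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key; in practice it would be stored securely and separately.
key = b"hypothetical-key-kept-by-the-data-controller"

token = pseudonymise("participant-0001", key)

# Anyone holding the key can recompute the token for a known identifier
# and so re-link the record to the person it describes.
relinked = pseudonymise("participant-0001", key) == token
```

Because whoever holds the key can re-link a token to an individual, the data remain indirectly identifiable, which is the reason they stay within the scope of the GDPR.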

We will have to change all of our research processes to meet the requirements of the GDPR

As many, including the Medical Research Council, have already acknowledged, the GDPR reiterates many of the key principles of good research practice when handling personal data. Research, particularly health research, is governed by very strict guidelines, and many of the mechanisms currently in place for assuring good practice can provide the safeguards needed to comply with the GDPR. For example, our ethics procedures and data management plans already address many of the requirements for privacy impact assessments and privacy by design. What we need to ensure is that all of our research is included in these processes, not just our funded research.

The GDPR will stifle research innovation

The GDPR ensures that innovation in health research can continue, but with the appropriate safeguards for data subjects. The new Data Protection Bill (which will replace the current Data Protection Act 1998) is currently going through parliament. This will direct the way the GDPR is implemented within the UK and any specific exemptions or “derogations”. It is widely accepted, but yet to be confirmed, that clinical research will have a number of related derogations to ensure that we are able to carry on normally with the business of improving and transforming health.

The research community will not be able to re-use/re-purpose data for future research

We are aware that it is not always possible to know all the ways research data could be processed when we are collecting it. The legislation also recognises this. Article 6(4) allows for further processing of personal data beyond the purposes for which it was collected, as long as those operations are considered ‘compatible’ with the original purpose under which consent was given, for example, medical research.

Further, data not originally collected for research can subsequently be used for research, as long as appropriate safeguards are met and the processing is in the public interest. This means we can continue to access health data to better understand and treat health conditions.

I am going to have to re-consent participants every few years if I want to continue to hold their personal data

Consent is not the lawful basis on which our researchers hold and process personal data. As a public authority, we will usually process personal data for health and social care research as a ‘task in the public interest’; as such, your participants may not need to be re-consented under the GDPR. However, under the GDPR you will need to ensure you have been lawful, fair and transparent about the personal data you have collected and how it is managed. It is important to understand what information has already been provided to your participants and whether this meets the GDPR requirements for transparency and accountability. This may require updates to your participant information sheet, or the addition of an information leaflet. The Health Research Authority (HRA) is working on consistent templates and wording to support researchers and sponsors, and has confirmed that, if required, this would be a non-substantial amendment, that is, one not requiring formal ethics approval.

Even though consent is not the legal basis for processing personal data for research, the common law duty of confidentiality is not changing, so consent is still needed for people outside the care team to access and use confidential patient information for research. Therefore, consent continues to be required to meet the high ethical and research governance expectations we place on our researchers.

How can I be fair and transparent?

Being fair and transparent with research participants means respecting their rights and wishes, and ensuring their personal data is used in line with their expectations.  The GDPR requires that the information provided should be concise and easy to understand. If you want to retain information you should state the reason and allow the participant to make that judgement.

Organisations should also display corporate level privacy information about their research in locations where it will be noticed, for example links on website homepages and in waiting rooms. Linking this to your information sheets is a good way of ensuring participants are aware of our institutional role in research.

The JRES is working on updating template documents such as protocol templates and information sheets, to ensure appropriate guidance is provided and considered during the development of our research.

My funder expects me to make my data openly available at the end of my project, the GDPR will prevent me from doing this

The GDPR does not preclude data sharing; it only requires that data is shared responsibly and robustly. This has always been the case with data sharing. The GDPR only covers data that personally identifies a living person. Research that does not involve personal data is not covered under the GDPR and can be shared. The legislation also does not cover data that has been appropriately anonymised according to the ICO’s Anonymisation Code. This is what the ICO calls de-identified data for publication. There are also options to share de-identified data for limited disclosure or access. The ICO Anonymisation Code covers different forms of data publication and the Research Data Management Service is available to discuss your options.

A participant has requested to withdraw from the study but my data has already been anonymised and analysed; I have to start all over

In exceptional circumstances research data are exempt from the right to erasure, where erasure is “likely to render impossible or seriously impair the achievement of the objectives of that processing” (Article 17(3)(d)). So you can continue to use this data in some circumstances. For data that has already been thoroughly anonymised, the GDPR does not apply.

The responsibility for GDPR compliance falls solely on project teams

The responsibility for compliance is corporate, that is, the organisation is accountable to the ICO, so it is important that researchers do not make decisions about legal compliance alone.

For St George’s University initiated research, we will usually be the data controller. This means we are responsible for outlining what data needs to be collected, why and how it is to be used/managed. For studies we collaborate in (where we are not the lead) we may be the data processor. In this instance, we are being directed on the data requirements and management.

If you are in doubt, you should check; this is particularly important if a research participant asks you about their personal data rights.


We hope this post has helped you to get better acquainted with how the new legislation will affect our research activities. With regard to health and social care research, the GDPR maintains existing best practice and we should use this opportunity to evaluate our systems and procedures to ensure that we are indeed engaging in good practice.

Queries about the GDPR not covered here can be emailed to


St George’s announces new research data repository

The Research Data Management Service has launched a research data repository for use by St George’s researchers, including our doctoral student researchers.


Powered by figshare, the repository is the first phase of a pilot project to develop a shared research data management infrastructure for UK higher education. The pilot is headed by Jisc, and St George’s is proud to be one of just 13 higher education organisations included in the project. More information about this can be found on the project website.

The SGUL data repository is a digital archive for sharing, storing and preserving research content produced at St George’s. It was acquired to enable our researchers to better engage in Open Science and to respond to funder and publisher requirements for data sharing and preservation.

Researchers can use the repository to share research data, source code, posters, PowerPoint presentations, images, videos, electronic lab notebooks and a range of other digital research outputs. The repository can also be used to catalogue and link to items that are already in the public domain, but are difficult to discover, cite and measure for impact. Each deposit in the repository is provided with a persistent identifier, which allows items to be uniquely identified, cited and measured for impact.

All items deposited with us will be preserved for the lifetime of the repository.

Depositing to the repository is easy. All research staff and doctoral students are automatically registered for the service. Just log in to the repository using your institutional credentials and deposit your items following figshare’s normal deposit procedures. All deposits will be checked by a member of the research data management team before your research is published, giving you added peace of mind.

It is advisable to contact the Research Data Management Service if you intend to deposit your data in the repository to avoid any delay in publishing your research.
