Topic, definition and scope
A data repository can be defined as a centralized location where data is stored, organized, and managed. It is a system that does not just store files, but also makes them discoverable and usable for specific purposes.
- “Everyone has the right to share in scientific advancement and its benefits” Article 27, Universal Declaration of Human Rights
- Data discovery is the process of understanding data and extracting valuable insights from multiple data streams according to the data’s uses and purposes.
- The European Commission’s guiding principle, “As open as possible, as closed as necessary”, has transformed how we approach the discovery and publication of scientific information.
Image: https://phaidra.univie.ac.at/download/o:1201054
To provide a clear roadmap for the students, the scope of this module is delimited to three core competencies:
- Where to look: Identifying the appropriate repository type based on the discipline.
- How to search: Leveraging rich metadata and advanced filtering.
- How to evaluate: Determining if a repository is trustworthy using quality markers like the CoreTrustSeal.
To ensure that European research is competitive and transparent, the EU (through initiatives like Horizon Europe and the European Open Science Cloud - EOSC) provides specific recommendations for the research data lifecycle:
- The FAIR principles. The EU’s primary recommendation is the implementation of FAIR Principles. For data to be discoverable within the European ecosystem, it must follow these standards:
- Findable: Data must be described with rich metadata and assigned a Persistent Identifier (PID), such as a DOI or Handle. This ensures that European “Data Harvesters” (like OpenAIRE) can index the work.
- Accessible: Even if data is sensitive (GDPR-protected), the metadata must remain publicly discoverable to notify the community of the data’s existence.
- Trusted Repositories & the EOSC. EU guidelines strongly recommend publishing in Certified Trusted Repositories. These are infrastructures that have earned quality marks like the CoreTrustSeal. By publishing in a trusted repository, your data is automatically “fed” into the European Open Science Cloud (EOSC). This creates a “web of FAIR data” where a researcher in Spain can seamlessly discover a dataset produced in the Netherlands.
- Data Management Plans (DMP) as Discovery Blueprints: This document is not just a hurdle; it is a discovery strategy. It forces researchers to decide how they will describe their data (metadata standards) and where they will host it so that it remains discoverable for at least 10 years after the project ends.
FAIR element(s)
- Findable: Data should be available in a discoverable resource (i.e. repository), have appropriate description (i.e. metadata) and have a persistent identifier (PID)
- Accessible: Data should be retrievable and understandable for both humans and machines
- Interoperable: Machines and humans can interpret and use the data in different settings and will be able to distinguish the metadata from the data file
- Reusable: The ultimate goal of FAIR is to advance the reuse of data in future research and to allow integration with other compatible data sources.
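To make the “Findable” element concrete, the sketch below shows a minimal dataset description in Python, loosely modelled on the mandatory properties of the DataCite metadata schema (identifier, creators, title, publisher, publication year, resource type). The field values, including the DOI, are hypothetical examples, not a real record:

```python
# Minimal, illustrative dataset metadata record, loosely modelled on the
# mandatory properties of the DataCite metadata schema.
record = {
    "identifier": "10.5281/zenodo.0000000",  # hypothetical DOI (the PID)
    "creators": ["Doe, Jane"],
    "title": "Enzyme kinetics measurements, human liver samples",
    "publisher": "Zenodo",
    "publicationYear": 2024,
    "resourceType": "Dataset",
}

MANDATORY = {"identifier", "creators", "title",
             "publisher", "publicationYear", "resourceType"}

def is_findable(rec: dict) -> bool:
    """A record is minimally 'Findable' if every mandatory field is filled."""
    return all(rec.get(field) for field in MANDATORY)

print(is_findable(record))  # True: all mandatory fields are present
```

A record that drops any of these fields (for example, one with only a title) would fail the check, which is exactly the situation that makes datasets hard to discover in harvesters such as OpenAIRE.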
Summary of Tasks / Actions
- Discussing reproducibility: why are FAIR principles important for data discovery?
- How do you search for data? See also the FAIRsharing educational factsheet for databases
- Discussing the process of data discovery, from developing a clear picture of the data to evaluating data quality.
- Use the lesson plan in Unit 1, Topic 3 (Data Life Cycle approach to FAIR / FAIR right from the start) to go through the data life cycle in the following scenario.
- Present a researcher’s story in any life science field and set up a search strategy. The story can be something like:
“A biochemistry researcher needs enzymology data for a research question: how are enzymes key factors in increasing the rate of metabolism in the human body?”
- How did the researcher discover and access such data?
- Did the researcher list the characteristics of the data they wanted to discover?
- Evaluate the quality of data
- Check the terms and conditions of access and use
- Let’s take the scenario above and look for any type of data you are interested in (e.g. “mitochondrial beta-oxidation”) in different data sources:
- OpenAIRE - Research Graph: [OpenAIRE Open Access](https://explore.openaire.eu/search/find?resultbestaccessright=%22Open%2520Access%22&fv0=miksa&f0=q&active=result)
- DataCite
- Re3data.org
- Dataset Search (google.com)
- FAIRsharing
- Of these resources,
- Which one provided the most relevant data for your search terms? Which one provides facilities to refine your search (i.e. filters)?
- Try more specific search terms. How did the search results improve?
- Is there citation guidance for your selected data? Are there any differences in citation guidance between these data sources?
- Can you find a licence for the selected data? Is there any clarification of how the data can be reused?
- How can data resources make data more discoverable by linking data to publications?
- Identifying innovative search tools for data discovery: demo on how to find the data behind a publication using Europe PMC, a literature database.
- Citation, licences and copyrights help to clarify the “R” in the FAIR principles.
- How to understand database conditions and attributes when choosing a repository (FAIRsharing documentation)
- How to licence data (openaire.eu)
- [How to Cite Datasets and Link to Publications (DCC)](https://www.dcc.ac.uk/guidance/how-guides/cite-datasets)
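The comparison of data sources above can also be done programmatically. As an illustration, the sketch below builds a query URL for the public DataCite REST API and parses a response of the shape DataCite documents (a JSON:API document with a `data` list of records carrying `attributes`). The sample response, its DOI, and its field values are invented for illustration; only the URL-building part runs against nothing but the standard library:

```python
import urllib.parse

DATACITE_API = "https://api.datacite.org/dois"

def build_search_url(term: str, page_size: int = 5) -> str:
    """Build a DataCite REST API URL for a free-text dataset search."""
    params = {"query": term, "page[size]": page_size}
    return DATACITE_API + "?" + urllib.parse.urlencode(params)

url = build_search_url("mitochondrial beta-oxidation")
print(url)

# The live API returns a JSON:API document; the exact shape below is an
# assumption based on DataCite's documented response format, and the
# record itself (DOI, title, licence) is a made-up example.
sample_response = {
    "data": [
        {"attributes": {
            "doi": "10.1234/example",
            "titles": [{"title": "Beta-oxidation assay data"}],
            "rightsList": [{"rights": "Creative Commons Attribution 4.0"}],
        }}
    ]
}

for item in sample_response["data"]:
    attrs = item["attributes"]
    title = attrs["titles"][0]["title"]
    licence = (attrs.get("rightsList") or [{}])[0].get("rights", "not stated")
    print(f"{attrs['doi']}: {title} [{licence}]")
```

Note how licence information (`rightsList`) travels with the metadata record itself; this is what lets a searcher check reuse conditions before ever downloading the data.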
Materials / Equipment
- Internet and browser
- https://europepmc.org/
Take home tasks/preparation
- Hands-on exercise: Find the data behind a publication of your interest using Europe PMC and answer the questions:
- Could you find the data citation on the publication?
- Is the data linked to the data repository?
- Could you access the data? Is the data format machine-readable?
- Could you easily find the licensing for the data of interest?
- How do you believe the use of FAIR principles contributed to your data discovery?
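For the hands-on exercise, the Europe PMC REST web services can be scripted as well as browsed. The sketch below only builds the request URLs for the public search and data-links endpoints; the `HAS_DATA:y` filter and the example PubMed ID are assumptions chosen for illustration, so run the URLs in a browser to inspect the live JSON:

```python
import urllib.parse

BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest"

def search_url(query: str) -> str:
    """URL for Europe PMC's article search endpoint, with JSON output."""
    return f"{BASE}/search?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})

def datalinks_url(source: str, ext_id: str) -> str:
    """URL for the data links associated with one article.

    `source` is the article source (e.g. 'MED' for PubMed records) and
    `ext_id` is the article's identifier within that source.
    """
    return f"{BASE}/{source}/{ext_id}/datalinks?format=json"

# Search for articles on the topic that declare associated data
# (HAS_DATA:y is an assumed Europe PMC search filter).
print(search_url('"mitochondrial beta-oxidation" AND HAS_DATA:y'))

# Then ask for the data links behind one article (hypothetical PubMed ID).
print(datalinks_url("MED", "29867326"))
```

Walking through these two steps mirrors the take-home questions: the search finds the publication, and the data-links response shows whether the data citation actually resolves to a repository.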
Lesson content
The Data Hunting Exercise
- Set up the challenge: Choose a topic the participants can explore while looking for data in a particular repository. Examples include:
- Environmental Science: Ocean acidification rates in the North Sea
- Health: Genomic Sequencing and antibiotic-resistant bacteria
- Education: Statistics on numbers of International Students in Medical Schools in Europe
- Looking for places to search: Provide the participants with different locations to look for this data:
- Generic Repositories: Zenodo, Figshare, DataverseNL, Dryad, Dataverse, DANS Data Station
- Domain-specific: PANGAEA (Earth Science), NCBI (Biology), GBIF (Biodiversity)
- Search Engines: Google Dataset Search, DataCite Commons, OpenAIRE Explore
- The Scavenger Hunt Checklist: Participants must find a dataset in at least two of the categories provided and fill out this evidence checklist:
- Persistent Identifier (PID): Can you find a DOI?
- Metadata richness: On a scale of 1-5, how well is the data described? (Are there column definitions, read-me files, methods?)
- Interoperability: What file formats are used? (Proprietary like .xlsx or open like .csv)
- The Debriefing: “Comparison Gallery”
- Which repository felt most trustworthy and why?
- Did you find the same dataset in two different places? (This introduces the concept of data harvesting and mirroring.)
- Which metadata record made you feel like you could actually reuse the data right away?
Annotations: Have participants make annotations in a shared document they can take home. This can also serve as a feedback and improvement mechanism for this particular activity.
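The PID item on the checklist can be given a quick automated sanity check. The sketch below uses a pragmatic regular expression based on Crossref's published guidance on matching DOIs; it accepts the vast majority of modern DOIs but is deliberately not a formal validator, and the Zenodo-style DOI in the example is illustrative:

```python
import re

# Pragmatic DOI pattern: a "10." prefix, a 4-9 digit registrant code,
# a slash, then a non-empty suffix with no whitespace.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(value: str) -> bool:
    """Check whether a string copied from a landing page looks like a DOI."""
    # Strip common resolver prefixes participants may have copied along.
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if value.lower().startswith(prefix):
            value = value[len(prefix):]
    return bool(DOI_RE.match(value))

print(looks_like_doi("https://doi.org/10.5281/zenodo.0000000"))  # True
print(looks_like_doi("not-a-doi"))                               # False
```

Participants could paste the identifiers they found into this check during the debriefing to distinguish true DOIs from plain URLs or internal accession numbers.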
The “In Silico” Shortcut Exercise
Scenario: A new variant of a rare respiratory virus has emerged. Your team needs to find an existing drug that can be “repurposed” to treat it.
1. The Challenge: Wet Lab vs. Data Mining
Divide participants into two groups (or have them compare the two strategies), each representing a different research approach:
- Strategy A: The Traditional Bench Scientist (Primary Data)
- Task: Synthesize 1,000 new chemical compounds and test them against live virus cultures.
- Cost: $2 Million + 3 years of clinical trials.
- Risk: High. Most compounds will be toxic or ineffective in humans.
- Data Produced: Very specific, high-resolution data for a small set of molecules.
- Strategy B: The Bioinformatician (Data Discovery & Reuse)
- Task: Use data discovery to mine the Protein Data Bank (PDB) for the virus’s structure and ChEMBL for existing FDA-approved drugs.
- Cost: $100k (mostly computing power and researcher time).
- Timeline: 3 months.
- Action: Run a “virtual screening” (docking) to see which already-approved drugs might “stick” to the virus protein.
- Data Reused: Structural biology data from a lab in Japan, chemical properties from a database in the UK, and clinical safety data from the 1990s.
2. The “Discovery” Checklist (Life Science Specific)
Have students identify which specific life-science repositories they would need to “discover” data from to succeed in Strategy B:
- Genomics: Where is the virus’s RNA sequence? (e.g., NCBI GenBank).
- Proteomics: Where is the 3D shape of the virus’s “spike” protein? (e.g., UniProt or PDB).
- Pharmacology: Where are the records of drugs already safe for humans? (e.g., DrugBank or PubChem).
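As a taste of Strategy B's workflow, the sketch below builds the file-download URL for a structure in the RCSB Protein Data Bank, the kind of first step a virtual-screening pipeline would take. The URL scheme is RCSB's public file service; the example ID 6VXX (a published SARS-CoV-2 spike glycoprotein structure) stands in for the fictional virus in the scenario:

```python
def pdb_download_url(pdb_id: str, fmt: str = "pdb") -> str:
    """Build the RCSB PDB file-download URL for a 4-character structure ID."""
    pdb_id = pdb_id.strip().lower()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"not a valid PDB ID: {pdb_id!r}")
    return f"https://files.rcsb.org/download/{pdb_id}.{fmt}"

# Example: the SARS-CoV-2 spike glycoprotein structure.
print(pdb_download_url("6VXX"))
```

The point for discussion is that one line of code replaces months of structure determination, precisely because a lab elsewhere already deposited the data with a resolvable identifier.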
3. Discussion: Why Reuse is Vital in Life Sciences
After the comparison, lead a discussion on these “Bio-Specific” benefits:
- The 3Rs (Ethics): How does reusing data reduce the need for animal testing? (If the data already exists, it is ethically questionable to repeat a painful animal experiment).
- The “N of 1” Problem: In rare disease research, there might be only 10 patients in the whole world, so no single hospital can run a study. Discovery and aggregation of data from 10 different countries is the only way to get a statistically significant result.
- Long-tail Data: A lot of life science data is hidden in “Supplemental Materials” of old papers. How does semantic annotation (Goal 2) help us find a gene mention hidden in a PDF from 2005?
Take away: Have participants note down their discussion points. These points might provide valuable insights for reproducing the exercise.
Repository Speed-Dating
Objective: Match a “Data Profile” to the correct “Repository Type” based on the discovery and reuse principles learned earlier.
1. The Setup (5 minutes)
Give each participant (or small group) three “Data Profile Cards.” You can display these on a screen or print them:
- Profile A: A small spreadsheet of water temperatures from a local lake, collected over 2 weeks. Needs to be cited in a paper.
- Profile B: 500GB of high-resolution 3D protein structures of a new virus variant.
- Profile C: Sensitive patient records from a rare disease study across three hospitals (requires restricted access).
2. The “Speed Match” (5 Minutes)
Participants must “match” their profiles to the most appropriate repository from the lesson (e.g., Zenodo, PDB, DANS, or NCBI) and defend their choice based on the FAIR principles.
The Twist: For each match, they must identify one “Dealbreaker.” (e.g., “I can’t put Profile C on Zenodo because it’s open-access and the data is sensitive/un-anonymized.”)
3. The “Legacy” Handshake (5 Minutes)
To finish, each participant writes one “Note to the Future” on a post-it or a shared digital doc (like a Jamboard or Padlet).
Example: “I am depositing [Type of Data]. To make sure someone can find and reuse this in 10 years, the most important metadata tag I will include is ____________ because ____________.”
Additional resources
- The FAIR Guiding Principles for scientific data management and stewardship
- Lost or Found? Discovering Data Needed for Research
- GOFAIR Discovery Implementation Network
- What is Data Mining
- Discover - Data Management Expert Guide
- Citing your data - Data Management Expert Guide
- Data reuse and the open data citation advantage
- FAIR Principles
- Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data
- Find a FAIR Repository
- Bridging the Data Discovery Gap: User-Centric Recommendations for Research Data Repositories
- CORE at Open Repositories 2025: Unlocking Insights and Empowering Open Access