Topic, definition and scope
Repositories can be defined as centralized services where data is stored, organized, and managed. These resources do not just store data and other research outputs; they also make them findable, accessible and reusable for specific purposes.
- “Everyone has the right to (…) share in scientific advancement and its benefits”. Article 27, Universal Declaration of Human Rights
- Data discovery is a process of understanding data and extracting valuable insight from multiple data streams according to data uses and purposes.
- The European Commission’s guiding principle, “As open as possible, as closed as necessary”, has transformed how we approach the discovery and publication of scientific information.
FAIR element(s)
- Findable: Data should be available in a discoverable resource (e.g. a repository), have an appropriate description (i.e. metadata) and have a persistent identifier (PID).
- Accessible: Data should be retrievable and understandable for both humans and machines.
- Interoperable: Machines and humans can interpret and use the data in different settings and will be able to distinguish the metadata from the data file.
- Reusable: The ultimate goal of FAIR is to advance the reuse of data in future research and allow integration with other compatible data sources.
Summary of Tasks / Actions
- Discuss: Why are the FAIR principles important for data discovery?
- How do you search for data?
- Present a researcher’s hypothesis or question and set up a search strategy
- How did the researcher discover and access such data?
- Evaluate the quality of data
- Check the terms and conditions of access and reuse
- Let’s take the scenario above and look for any type of data you are interested in (e.g. ‘mouse gut microbiome’) in different resources:
- You can start by searching in repository catalogues - Re3data.org, FAIRsharing - for trustworthy subject specific repositories.
- You can also try to find datasets directly through a search engine, e.g. Dataset Search (google.com).
- In different subject-specific or general repositories, consider asking:
- Which one provided the most relevant results for your search terms? Which one provides ways to refine your search (i.e. filters)?
- Try searching with more detailed search terms. How did the search results improve?
- Is there a citation for your selected data? Are there any differences in citation between these data sources?
- Can you find a license for the data?
- What other resources make data more discoverable by linking data repositories to publications?
- Identifying innovative search tools for data discovery: demo on how to find the data behind a publication using Europe PMC, a literature database.
- Citation, licenses and copyrights help to clarify the “R” in the FAIR principles.
- How to understand database conditions and attributes when choosing a repository
- How to licence data (openaire.eu)
- [How to Cite Datasets and Link to Publications DCC](https://www.dcc.ac.uk/guidance/how-guides/cite-datasets)
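The search steps above can also be scripted, which makes it easier to compare a broad query with a more detailed one. The following is a minimal sketch, assuming the public DataCite REST API endpoint `https://api.datacite.org/dois` and its `query`, `resource-type-id` and `page[size]` parameters. The helper name `build_datacite_search_url` is hypothetical, and the snippet only builds the request URLs, so it runs without a network connection.

```python
from urllib.parse import urlencode

# Assumed endpoint of the public DataCite REST API (not named in this lesson).
DATACITE_DOIS = "https://api.datacite.org/dois"


def build_datacite_search_url(terms, page_size=10):
    """Build a dataset search URL for free-text terms (hypothetical helper)."""
    params = {
        "query": terms,                  # free-text search terms
        "resource-type-id": "dataset",   # restrict results to datasets
        "page[size]": page_size,         # number of records per page
    }
    return f"{DATACITE_DOIS}?{urlencode(params)}"


# A broad query and a more detailed one, following the exercise above.
broad = build_datacite_search_url("mouse gut microbiome")
narrow = build_datacite_search_url("mouse gut microbiome 16S rRNA")
print(broad)
print(narrow)
```

Opening both URLs in a browser (or fetching them with an HTTP client) lets participants compare how the number and relevance of results change as the search terms become more specific.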
Materials / Equipment
- Internet and browser
- https://europepmc.org/
Take home tasks/preparation
- Hands-on exercise: Find the data behind a publication that interests you using Europe PMC and answer the following questions:
- Could you find the data citation in the publication?
- Is the data linked to a data repository?
- Could you access the data?
- Could you understand what the data is about? Is there documentation, e.g. metadata and/or a README file describing the data?
- Could you easily find the license for the data of interest?
- Is the data format interoperable?
- How do you believe the use of FAIR principles contributed to your data discovery?
Lesson content
The Data Hunting Exercise
- Set up the challenge: Choose a topic the participants could explore while looking for data in a particular repository. Examples can include:
- Environmental Science: Ocean acidification rates in the North Sea
- Health: Genomic sequencing and antibiotic-resistant bacteria
- Education: Statistics on numbers of international students in medical schools in Europe
- Looking for Places to Search: Provide the participants with different locations to look for this data:
- Generic Repositories: Zenodo, Figshare, DataverseNL, Dryad, Dataverse, DANS Data Station
- Domain Specific: PANGAEA (Earth Science), NCBI (Biology), GBIF (Biodiversity)
- Search Engines: Google Dataset Search, DataCite Commons, OpenAIRE Explore
- The Scavenger Hunt Checklist: Participants must find a dataset in at least two of the categories provided and fill out this evidence checklist:
- Persistent identifier (PID): Can you find a DOI?
- Metadata richness: On a scale of 1-5, how well is the data described? (Are there column definitions, README files, methods?)
- Interoperability: What file formats are used? (Proprietary like .xlsx, or open like .csv?)
- The Debriefing: “Comparison Gallery”
- Which repository felt most trustworthy and why?
- Did you find the same dataset in two different places? (This introduces the concept of data harvesting and mirroring.)
- Which metadata record made you feel like you could actually reuse the data right away?
Annotations: Have participants make annotations on a shared document that they can take home. This could also serve as a feedback and improvement mechanism for this activity.
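For groups that prefer to record their evidence in a structured way, the scavenger hunt checklist could be captured as a small record. This is only a sketch under stated assumptions: the class `EvidenceRecord`, the format groupings and the example values are all hypothetical, the DOI pattern is a common heuristic rather than an official validator, and the open/proprietary split simply follows the checklist's own examples.

```python
import re
from dataclasses import dataclass, field
from pathlib import PurePath

# Heuristic DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Illustrative groupings that extend the checklist's examples (.csv, .xlsx).
OPEN_FORMATS = {".csv", ".tsv", ".txt", ".json"}
PROPRIETARY_FORMATS = {".xlsx", ".xls"}


@dataclass
class EvidenceRecord:
    repository: str
    identifier: str
    metadata_score: int  # 1 (poorly described) to 5 (richly described)
    file_names: list = field(default_factory=list)

    def __post_init__(self):
        if not 1 <= self.metadata_score <= 5:
            raise ValueError("metadata_score must be between 1 and 5")

    def has_doi(self):
        """Return True if the identifier looks like a DOI."""
        doi = self.identifier.strip().lower()
        prefix = "https://doi.org/"
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
        return bool(DOI_PATTERN.match(doi))

    def format_summary(self):
        """Group file names by the openness of their file extension."""
        summary = {"open": [], "proprietary": [], "unknown": []}
        for name in self.file_names:
            suffix = PurePath(name).suffix.lower()
            if suffix in OPEN_FORMATS:
                summary["open"].append(name)
            elif suffix in PROPRIETARY_FORMATS:
                summary["proprietary"].append(name)
            else:
                summary["unknown"].append(name)
        return summary


# Made-up example entry; the DOI and file names are placeholders.
record = EvidenceRecord(
    repository="Example Repository",
    identifier="https://doi.org/10.1234/example.5678",
    metadata_score=4,
    file_names=["temperatures.csv", "lab_notes.xlsx", "README.txt"],
)
print(record.has_doi())
print(record.format_summary())
```

Collecting one record per dataset makes the debriefing comparison easier, since every group reports the same three pieces of evidence.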
The “In Silico” Shortcut Exercise
Scenario: A new variant of a rare respiratory virus has emerged. Your team needs to find an existing drug that can be “repurposed” to treat it.
1. The Challenge: Wet Lab vs. Data Mining
Divide participants into two groups (or have them compare the two strategies). In the scenario, each strategy is led by a different group of researchers.
- Strategy A: The Traditional Bench Scientist (Primary Data)
- Task: Synthesize 1,000 new chemical compounds and test them against live virus cultures.
- Cost: $2 Million + 3 years of clinical trials.
- Risk: High. Most compounds will be toxic or ineffective in humans.
- Data Produced: Very specific, high-resolution data for a small set of molecules.
- Strategy B: The Bioinformatician (Data Discovery & Reuse)
- Task: Use data discovery to mine the Protein Data Bank (PDB) for the virus’s structure and ChEMBL for existing FDA-approved drugs.
- Cost: $100k (mostly computing power and researcher time).
- Timeline: 3 months.
- Action: Run a “virtual screening” (docking) to see which already-approved drugs might “stick” to the virus protein.
- Data Reused: Structural biology data from a lab in Japan, chemical properties from a database in the UK, and clinical safety data from the 1990s.
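To make the contrast between the two strategy cards concrete, the figures can be compared directly. The sketch below uses only the numbers given above; treating the "3 years" on Strategy A's card as its timeline is an interpretation, and the variable names are illustrative.

```python
# Figures copied from the two strategy cards above.
# Assumption: Strategy A's "3 years" is read as its overall timeline.
strategy_a = {"cost_usd": 2_000_000, "months": 3 * 12}
strategy_b = {"cost_usd": 100_000, "months": 3}

cost_ratio = strategy_a["cost_usd"] / strategy_b["cost_usd"]
time_ratio = strategy_a["months"] / strategy_b["months"]

print(f"Strategy A costs {cost_ratio:.0f}x more and takes {time_ratio:.0f}x longer.")
# prints "Strategy A costs 20x more and takes 12x longer."
```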
2. The “Discovery” Checklist (Life Science Specific)
Have students identify which specific life-science repositories they would need to “discover” data from to succeed in Strategy B:
- Genomics: Where is the virus’s RNA sequence? (e.g., NCBI GenBank).
- Proteomics: Where is the 3D shape of the virus’s “spike” protein? (e.g., UniProt or PDB).
- Pharmacology: Where are the records of drugs already safe for humans? (e.g., DrugBank or PubChem).
3. Discussion: Why Reuse is Vital in Life Sciences
After the comparison, lead a discussion on these “Bio-Specific” benefits:
- The 3Rs (Ethics): How does reusing data reduce the need for animal testing? (If the data already exists, it is ethically questionable to repeat a painful animal experiment).
- The “N of 1” Problem: In rare disease research, there might only be 10 patients in the whole world. No single hospital can do a study. Discovering and aggregating data from 10 different countries is the only way to get a statistically significant result.
- Long-tail Data: A lot of life science data is hidden in “Supplemental Materials” of old papers. How does semantic annotation (Goal 2) help us find a gene mention hidden in a PDF from 2005?
Take away: Have participants note down their discussion points. These points may offer valuable insight when the exercise is run again.
Repository Speed-Dating
Objective: Match a “Data Profile” to the correct “Repository Type” based on the discovery and reuse principles learned earlier.
1. The Setup (5 Minutes)
Give each participant (or small group) three “Data Profile Cards.” You can display these on a screen or print them:
- Profile A: A small spreadsheet of water temperatures from a local lake, collected over 2 weeks. Needs to be cited in a paper.
- Profile B: 500GB of high-resolution 3D protein structures of a new virus variant.
- Profile C: Sensitive patient records from a rare disease study across three hospitals (requires restricted access).
2. The “Speed Match” (5 Minutes)
Participants must “match” their profiles to the most appropriate repository from the lesson (e.g., Zenodo, PDB, DANS, or NCBI) and defend their choice based on the FAIR principles.
The Twist: For each match, they must identify one “Dealbreaker.” (e.g., “I can’t put Profile C on Zenodo because it’s open-access and the data is sensitive/un-anonymized.”)
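The "Dealbreaker" step can be thought of as checking a data profile against a repository's constraints. The following is a rough sketch of that reasoning, not a statement of any real repository's policy: the classes, the repository names, the size limits and the 1 GB size given to Profile C are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataProfile:
    name: str
    size_gb: float
    sensitive: bool
    domain: str


@dataclass
class RepositoryCard:
    # Placeholder attributes for the exercise, not real repository policies.
    name: str
    max_size_gb: float
    supports_restricted_access: bool
    domains: tuple = ()  # an empty tuple means "generalist"


def dealbreakers(profile, repo):
    """Return the reasons why this profile cannot go into this repository."""
    reasons = []
    if profile.sensitive and not repo.supports_restricted_access:
        reasons.append("sensitive data needs restricted access")
    if profile.size_gb > repo.max_size_gb:
        reasons.append("dataset exceeds the size limit")
    if repo.domains and profile.domain not in repo.domains:
        reasons.append("outside the repository's subject scope")
    return reasons


# Profiles loosely based on the cards above; sizes for A and C are invented.
profile_a = DataProfile("Lake temperatures", 0.001, False, "environment")
profile_b = DataProfile("Protein structures", 500, False, "structural biology")
profile_c = DataProfile("Patient records", 1.0, True, "health")

# Hypothetical repository cards with made-up limits.
open_generalist = RepositoryCard("Open generalist repository", 50, False)
structure_archive = RepositoryCard(
    "Structural biology archive", 1000, False, ("structural biology",)
)

for profile in (profile_a, profile_b, profile_c):
    print(profile.name, "->", dealbreakers(profile, open_generalist))
```

An empty list means no dealbreaker was found for that pairing; participants still need to defend the match using the FAIR principles.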
3. The “Legacy” Handshake (5 Minutes)
To finish, each participant writes one “Note to the Future” on a post-it or a shared digital doc (like a Jamboard or Padlet).
Example: “I am depositing [Type of Data]. To make sure someone can find and reuse this in 10 years, the most important metadata tag I will include is ____________ because ____________.”
Additional resources
- The FAIR Guiding Principles for scientific data management and stewardship
- Lost or Found? Discovering Data Needed for Research
- GOFAIR Discovery Implementation Network
- What is Data Mining
- Discover - Data Management Expert Guide
- Citing your data - Data Management Expert Guide
- Data reuse and the open data citation advantage