
Data/Repository discovery

Topic, definition and scope

  • “Everyone has the right to share in scientific advancement and its benefits”
    Article 27, Universal Declaration of Human Rights
  • Data discovery is the process of understanding data and extracting valuable insights from multiple data streams according to the data's uses and purposes.

Image: https://phaidra.univie.ac.at/download/o:1201054


FAIR element(s)

  • Findable: Data should be available in a discoverable resource (i.e. repository), have appropriate description (i.e. metadata) and have a persistent identifier (PID)
  • Accessible: Data should be retrievable and understandable for both humans and machines
  • Interoperable: Machines and humans can interpret and use the data in different settings and will be able to distinguish the metadata from the data file
  • Reusable: The ultimate goal of FAIR is to advance the reuse of data in future research and to allow integration with other compatible data sources.

Summary of Tasks / Actions

  • Discussing reproducibility: why are FAIR principles important for data discovery?
  • How do you search for data? See also the FAIRsharing educational factsheet for databases

Research data cycle

  • Present a researcher’s story in any life science field and set up a search strategy. The story can be something like:

“A biochemistry researcher needs some enzymology data for a research question: how are enzymes key factors in increasing the rate of metabolism in the human body?”

  • How did the researcher discover and access such data?
  • Did the researcher list the characteristics of the data they wanted to discover?
  • Evaluate the quality of data
  • Check the terms and conditions of access and use

  • Let’s take the scenario above and look for any type of data you are interested in (e.g. “mitochondrial beta-oxidation”) in different data sources:
  • Of these resources:
    • Which one provided the most relevant data for your search terms? Which one provides facilities to refine your search (i.e. filters)?
    • Try more detailed search terms. How did the search results improve?
    • Is there citation guidance for your selected data? Are there any differences in citation guidance between these data sources?
    • Can you find a licence for the selected data? Is there any clarification of how the data can be reused?

  • How can data resources make data more discoverable by linking data to publications?
  • Identifying innovative search tools for data discovery: demo on how to find the data behind a publication using Europe PMC, a literature database.
  • Citation, licences and copyrights help to clarify the “R” in the FAIR principles.
    • How to understand database conditions and attributes when choosing a repository (FAIRsharing documentation)
    • How to licence data (openaire.eu)
    • [How to Cite Datasets and Link to Publications DCC](https://www.dcc.ac.uk/guidance/how-guides/cite-datasets)
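As a sketch of the Europe PMC demo above: Europe PMC exposes a public REST API, including a search endpoint and a data-links endpoint that lists data resources cross-referenced from an article. The code below only builds the request URLs (no network call); the endpoint paths and the `OPEN_ACCESS:y` search filter follow Europe PMC's public documentation, but verify them before relying on this.

```python
from urllib.parse import urlencode

EUROPE_PMC = "https://www.ebi.ac.uk/europepmc/webservices/rest"

def search_url(query: str, page_size: int = 10) -> str:
    """Build a Europe PMC literature search URL (JSON output)."""
    params = urlencode({"query": query, "format": "json", "pageSize": page_size})
    return f"{EUROPE_PMC}/search?{params}"

def datalinks_url(source: str, ext_id: str) -> str:
    """Build a URL listing the data links of one article,
    e.g. source='MED' (PubMed) and ext_id a PMID."""
    return f"{EUROPE_PMC}/{source}/{ext_id}/datalinks?format=json"

# Example: search open-access articles on the scenario topic,
# then inspect one article's cross-referenced data (PMID is illustrative).
print(search_url('"mitochondrial beta-oxidation" AND OPEN_ACCESS:y'))
print(datalinks_url("MED", "25883711"))
```

Fetching either URL returns JSON that participants can inspect in a browser, which keeps the demo dependency-free.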

Materials / Equipment


Take home tasks/preparation

  • Hands-on exercise: Find the data behind a publication of your interest using Europe PMC and answer the questions:
    • Could you find the data citation on the publication?
    • Is the data linked to the data repository?
    • Could you access the data? Is the data format machine-readable?
    • Could you easily find the licensing for the data of interest?
    • How do you believe the use of FAIR principles contributed to your data discovery?

Lesson content

Level: Before the lesson · LO: 5

The Data Hunting Exercise

  1. Set up the challenge: Choose a topic the participants could explore while looking for data in a particular repository. Examples can include:
  • Environmental Science: Ocean acidification rates in the North Sea
  • Health: Genomic Sequencing and antibiotic-resistant bacteria
  • Education: Statistics on numbers of International Students in Medical Schools in Europe
  2. Looking for places to search: Provide the participants with different locations to look for this data:
  • Generic repositories: Zenodo, Figshare, DataverseNL, Dryad, Dataverse, DANS Data Station
  • Domain-specific: PANGAEA (Earth science), NCBI (life sciences)
  • Search Engines: Google Dataset Search, DataCite Commons
  3. The Scavenger Hunt Checklist: Participants must find a dataset in at least two of the categories provided and fill out this evidence checklist:
  • Persistent identifier (PID): Can you find a DOI?
  • Metadata richness: On a scale of 1-5, how well is the data described? (Are there column definitions, readme files, methods?)
  • Interoperability: What file formats are used? (Proprietary like .xlsx, or open like .csv?)
  4. The Debriefing: “Comparison Gallery”
  • Which repository felt most trustworthy, and why?
  • Did you find the same dataset in two different places? (This introduces the concept of data harvesting and mirroring.)
  • Which metadata record made you feel like you could actually reuse the data right away?
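The PID item on the checklist above can be partially automated. The sketch below uses a deliberately loose DOI pattern (an approximation, not the full DOI grammar) and builds the doi.org resolver URL; a syntactic match does not prove the DOI actually resolves.

```python
import re

# Loose DOI shape: "10." + numeric registrant prefix + "/" + suffix.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(pid: str) -> bool:
    """Cheap syntactic check only; does not verify that the DOI resolves."""
    return bool(DOI_RE.match(pid.strip()))

def resolver_url(doi: str) -> str:
    """The doi.org resolver redirects to the dataset's landing page."""
    return f"https://doi.org/{doi.strip()}"

print(looks_like_doi("10.5281/zenodo.1234567"))  # True
print(looks_like_doi("hdl:1839/00-0000"))        # False: a Handle, not a DOI
```

Participants can paste any identifier from their chosen repository into `looks_like_doi` and then open the resolver URL to confirm it lands on the dataset.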

Annotations: Have participants make annotations in a common document that they can take home. This could also work as a feedback and improvement mechanism for this particular activity.

Time: 30 minutes · Type: Group Exercise

Level: During the lesson · LO: 1

The “In Silico” Shortcut Exercise

Scenario: A new variant of a rare respiratory virus has emerged. Your team needs to find an existing drug that can be “repurposed” to treat it.

1. The Challenge: Wet Lab vs. Data Mining

Divide participants into two groups, each taking the role of a different research team, or have them compare the two strategies:

  • Strategy A: The Traditional Bench Scientist (Primary Data)
    • Task: Synthesize 1,000 new chemical compounds and test them against live virus cultures.
    • Cost: $2 million, plus 3 years of clinical trials.
    • Risk: High. Most compounds will be toxic or ineffective in humans.
    • Data Produced: Very specific, high-resolution data for a small set of molecules.
  • Strategy B: The Bioinformatician (Data Discovery & Reuse)
    • Task: Use data discovery to mine the Protein Data Bank (PDB) for the virus’s structure and ChEMBL for existing FDA-approved drugs.
    • Cost: $100k (mostly computing power and researcher time).
    • Timeline: 3 months.
    • Action: Run a “virtual screening” (docking) to see which already-approved drugs might “stick” to the virus protein.
    • Data Reused: Structural biology data from a lab in Japan, chemical properties from a database in the UK, and clinical safety data from the 1990s.

2. The “Discovery” Checklist (Life Science Specific)

Have students identify which specific life-science repositories they would need to “discover” data from to succeed in Strategy B:

  1. Genomics: Where is the virus’s RNA sequence? (e.g., NCBI GenBank).
  2. Proteomics: Where is the 3D shape of the virus’s “spike” protein? (e.g., UniProt or PDB).
  3. Pharmacology: Where are the records of drugs already safe for humans? (e.g., DrugBank or PubChem).
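The genomics lookup in step 1 above could be scripted against NCBI's E-utilities `esearch` endpoint. The base URL and parameters below follow NCBI's public documentation; the search term is purely illustrative, and the code only builds the URL rather than calling the service.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db: str, term: str, retmax: int = 5) -> str:
    """Build an NCBI E-utilities search URL returning matching record IDs as JSON."""
    params = urlencode({"db": db, "term": term, "retmax": retmax, "retmode": "json"})
    return f"{EUTILS}/esearch.fcgi?{params}"

# Illustrative: look for nucleotide records of a respiratory virus genome.
print(esearch_url("nucleotide",
                  "respiratory syncytial virus[Organism] AND complete genome"))
```

Students can open the printed URL in a browser to see the ID list, then discuss which metadata fields would let a machine (rather than a human) decide if a record is relevant.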

3. Discussion: Why Reuse is Vital in Life Sciences

After the comparison, lead a discussion on these “Bio-Specific” benefits:

  • The 3Rs (Ethics): How does reusing data reduce the need for animal testing? (If the data already exists, it is ethically questionable to repeat a painful animal experiment).
  • The “N of 1” Problem: In rare disease research, there might only be 10 patients in the whole world. No single hospital can do a study. Discovery and Aggregation of data from 10 different countries is the only way to get a statistically significant result.
  • Long-tail Data: A lot of life science data is hidden in “Supplemental Materials” of old papers. How does semantic annotation (Goal 2) help us find a gene mention hidden in a PDF from 2005?

Take away: Have participants note down their discussion points. These points may provide valuable insight for reproducing the exercise.

Time: 30 minutes · Type: Working session