[Easier Mode] Metadata Generation on Datasets Given the Manuscript and Code Repository
Problem Statement:
Academic research produces a wealth of artifacts beyond traditional manuscripts, including code repositories, datasets, and experimental protocols. However, metadata describing these artifacts is often incomplete, inconsistent, or entirely absent. This lack of metadata hinders discoverability, reproducibility, and reuse of valuable research outputs.
Challenge:
Develop a system that automatically generates standardized, accurate metadata for academic research data. The system should extract the relevant information and structure it into a machine-readable format such as JSON-LD or RDF.
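As an illustration of what a machine-readable record could look like, the sketch below builds a JSON-LD document using the schema.org `Dataset` type with one `PropertyValue` per column. The choice of schema.org vocabulary, the `to_jsonld` helper, and the dataset and column names are assumptions made for this example; the challenge does not prescribe a vocabulary.

```python
import json


def to_jsonld(dataset_name, columns):
    """Build a schema.org Dataset record with one PropertyValue per column.

    `columns` is a list of dicts with the hypothetical keys
    "name", "definition", and "unit".
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": dataset_name,
        "variableMeasured": [
            {
                "@type": "PropertyValue",
                "name": col["name"],
                "description": col["definition"],
                "unitText": col["unit"],
            }
            for col in columns
        ],
    }


# Hypothetical column, reusing the temperature example from the task description.
record = to_jsonld(
    "example-weather-dataset",
    [
        {
            "name": "avg_temp",
            "definition": "Average daily temperature in Celsius",
            "unit": "Celsius",
        }
    ],
)
print(json.dumps(record, indent=2))
```

Any other vocabulary that can describe columns would serve equally well; the point is only that each column becomes a typed, named node rather than free text.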
Modes of Difficulty:
Easy Mode:
Input: A research manuscript (PDF or text), a corresponding code repository (e.g., GitHub URL), and the associated dataset (e.g., CSV, JSON, or other structured format).
Task: Generate detailed metadata for each column, including:
Column Name
Precise Definition (e.g., "Average daily temperature in Celsius")
Units of Measurement (e.g., "Celsius", "meters", "kilograms")
Data Type (e.g., "integer", "float", "string", "date")
Semantic Type (e.g., "temperature", "distance", "mass", "time")
Possible Value Ranges or Categories
Source Location (where the information was found within the manuscript)
Focus: Leverage the information from the manuscript and code to enrich the dataset metadata.
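The per-column fields listed above could be held in a small record type. The following is a minimal sketch, assuming a hypothetical `ColumnMetadata` dataclass and two illustrative helpers; it fills in only what can be read from the raw values (data type and observed range or categories) and leaves the definition, unit, semantic type, and source location empty, since those would have to come from the manuscript or code. Date detection is deliberately left out to keep the example short.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ColumnMetadata:
    """One record per dataset column; every field except the name may stay empty."""
    name: str
    definition: Optional[str] = None
    unit: Optional[str] = None
    data_type: Optional[str] = None
    semantic_type: Optional[str] = None
    value_range: Optional[tuple] = None
    categories: Optional[list] = None
    source: Optional[str] = None  # where in the manuscript the information was found


def infer_data_type(values):
    """Guess 'integer', 'float', or 'string' from raw string cells."""
    if not values:
        return None
    for label, cast in (("integer", int), ("float", float)):
        try:
            for v in values:
                cast(v)
            return label
        except ValueError:
            continue
    return "string"


def describe_column(name, values):
    """Fill in the fields that can be derived from the data alone."""
    dtype = infer_data_type(values)
    meta = ColumnMetadata(name=name, data_type=dtype)
    if dtype in ("integer", "float"):
        nums = [float(v) for v in values]
        meta.value_range = (min(nums), max(nums))
    elif dtype == "string":
        meta.categories = sorted(set(values))
    return meta


# Hypothetical column values, as they might be read from a CSV file.
col = describe_column("avg_temp", ["12.5", "14.0", "9.8"])
print(asdict(col))
```

Keeping every field optional matches the requirement that a field may be left out when it cannot be established reliably.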
Critical Requirement:
Confidence Threshold: Incorrect metadata is significantly more harmful than missing metadata, so the system must prioritize accuracy over completeness. If the system's confidence in the generated metadata falls below a defined threshold, it should return minimal or no metadata rather than guess.
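One way to enforce this requirement is to attach a confidence score to each candidate field and drop anything below the threshold. The sketch below assumes scores in the range 0 to 1 and a default threshold of 0.8; both the `apply_confidence_threshold` helper and that threshold value are illustrative choices, not part of the challenge specification.

```python
def apply_confidence_threshold(candidates, threshold=0.8):
    """Keep only the fields whose confidence meets the threshold.

    `candidates` maps a field name to a (value, confidence) pair,
    with confidence assumed to lie in [0, 1].
    """
    return {
        field: value
        for field, (value, confidence) in candidates.items()
        if confidence >= threshold
    }


# Hypothetical candidate fields for one column.
candidates = {
    "unit": ("Celsius", 0.95),
    "semantic_type": ("temperature", 0.90),
    "value_range": ((-40, 60), 0.40),  # low confidence: dropped
}
kept = apply_confidence_threshold(candidates)
print(kept)
```

With a stricter threshold the function returns an empty dictionary, which corresponds to returning no metadata at all.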
Evaluation Metrics:
Accuracy of extracted metadata fields.
Completeness of the generated metadata record.
Semantic similarity between generated metadata and ground truth (if available).
Ability to link the dataset to existing datasets with high-quality metadata (Hard Mode only).
Confidence score of generated metadata.
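The first two metrics can be computed separately so that omitting a field lowers completeness without lowering accuracy. The sketch below is one possible scoring scheme, assuming exact-match comparison against a ground-truth record; the `evaluate` helper and the example records are hypothetical, and semantic similarity is not covered here.

```python
def evaluate(predicted, truth):
    """Return (accuracy, completeness) for one metadata record.

    accuracy: share of the fields the system filled in that match the truth.
    completeness: share of ground-truth fields the system filled in.
    Either value is None when its denominator would be zero.
    """
    filled = {k: v for k, v in predicted.items() if v is not None}
    correct = sum(1 for k, v in filled.items() if truth.get(k) == v)
    accuracy = correct / len(filled) if filled else None
    covered = sum(1 for k in truth if k in filled)
    completeness = covered / len(truth) if truth else None
    return accuracy, completeness


# Hypothetical ground truth and system output for one column.
truth = {
    "unit": "Celsius",
    "data_type": "float",
    "semantic_type": "temperature",
    "definition": "Average daily temperature in Celsius",
}
predicted = {"unit": "Celsius", "data_type": "float", "semantic_type": None}

accuracy, completeness = evaluate(predicted, truth)
print(accuracy, completeness)
```

In this example the system filled in two of four fields and got both right, so a cautious system is rewarded on accuracy while the gap still shows up in completeness.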
Desired Outcomes:
A functional system that can automatically generate metadata for academic research datasets.
A demonstration of the feasibility of zero-shot or few-shot metadata generation.
Insights into the challenges and opportunities of metadata generation in academia.