[Hard Mode] Metadata Generation on Datasets with No Associated Manuscript or Code
Problem Statement:
Academic research produces a wealth of artifacts beyond traditional manuscripts, including code repositories, datasets, and experimental protocols. However, metadata describing these artifacts is often incomplete, inconsistent, or entirely absent. This lack of metadata hinders discoverability, reproducibility, and reuse of valuable research outputs.
Challenge:
Develop a system that can automatically generate standardized and accurate metadata for academic research data. The system should extract relevant information and structure it into a machine-readable format, such as JSON-LD or another RDF serialization.
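To make the target format concrete, here is a minimal sketch of what a JSON-LD record for a dataset might look like. It assumes the schema.org vocabulary (`Dataset`, `variableMeasured`, `PropertyValue`), which the challenge does not mandate; the function name `build_dataset_record` and the sample values are hypothetical.

```python
import json


def build_dataset_record(name, columns):
    """Return a JSON-LD-style dict describing a dataset.

    Assumption: schema.org is used as the vocabulary. Any other
    RDF-compatible vocabulary (e.g. DCAT) could be substituted.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        # One PropertyValue entry per column found in the data.
        "variableMeasured": [
            {"@type": "PropertyValue", "name": col} for col in columns
        ],
    }


# Hypothetical dataset name and column names, for illustration only.
record = build_dataset_record("example-measurements", ["site_id", "temperature_c"])
print(json.dumps(record, indent=2))
```

Only fields that can be read directly from the data are filled in here; descriptive fields such as a license or creator would need stronger evidence before being added.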
Modes of Difficulty:
Hard Mode:
Input: Only the dataset (e.g., CSV, JSON, or other structured format).
Task: Generate metadata for the dataset and attempt to link it to existing datasets with high-quality metadata (e.g., datasets in established repositories).
Focus: Emphasize data profiling, feature extraction, and similarity matching.
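The profiling and similarity-matching steps could be sketched as follows. This is one possible approach under stated assumptions, not a prescribed method: column types are guessed by trying `int` and then `float`, and candidate datasets are ranked by Jaccard similarity over normalized column names. The helper names, the sample CSV, and the candidate catalogue are all hypothetical.

```python
import csv
import io


def infer_type(values):
    """Guess a coarse column type from the non-empty string values."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "empty"
    for caster, label in ((int, "integer"), (float, "number")):
        try:
            for v in non_empty:
                caster(v)
            return label
        except ValueError:
            continue
    return "string"


def profile_csv(text):
    """Profile each column: inferred type, missing count, distinct count."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    profile = {}
    for name in reader.fieldnames or []:
        values = [(row.get(name) or "") for row in rows]
        non_empty = [v for v in values if v.strip()]
        profile[name] = {
            "type": infer_type(values),
            "missing": len(values) - len(non_empty),
            "distinct": len(set(non_empty)),
        }
    return profile


def normalize(name):
    """Lower-case a column name and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def rank_candidates(columns, catalogue):
    """Rank catalogue entries by column-name overlap, best match first."""
    own = [normalize(c) for c in columns]
    scored = [
        (dataset_id, jaccard(own, [normalize(c) for c in cols]))
        for dataset_id, cols in catalogue.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical input data and catalogue of well-described datasets.
sample = "site_id,temperature_c,note\n1,20.5,ok\n2,,\n3,19.0,ok\n"
catalogue = {
    "weather-stations": ["Site_ID", "temperature_c", "humidity"],
    "survey": ["respondent", "age"],
}

profile = profile_csv(sample)
ranked = rank_candidates(profile.keys(), catalogue)
print(profile)
print(ranked)
```

Column-name overlap is a deliberately simple signal. Value distributions, units, or embeddings of column names would likely give better matches, at the cost of more moving parts.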
Critical Requirement:
Confidence Threshold: Incorrect metadata is significantly more detrimental than missing metadata. The system must prioritize accuracy over completeness. If the system's confidence in the generated metadata is below a defined threshold, it should return minimal or no metadata.
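A minimal sketch of the thresholding rule, assuming each candidate field arrives as a (value, confidence) pair with confidence in the range 0 to 1. The threshold of 0.8, the function name, and the sample fields are illustrative assumptions; the challenge does not fix a value.

```python
def filter_by_confidence(candidates, threshold=0.8):
    """Keep only fields whose confidence meets the threshold.

    Fields below the threshold are dropped rather than emitted,
    reflecting the rule that wrong metadata is worse than missing metadata.
    An empty dict is a valid result.
    """
    return {
        field: value
        for field, (value, confidence) in candidates.items()
        if confidence >= threshold
    }


# Hypothetical candidate fields with made-up confidence scores.
candidates = {
    "name": ("Daily station temperatures", 0.93),
    "license": ("CC-BY-4.0", 0.41),
    "keywords": (["temperature", "weather"], 0.85),
}

accepted = filter_by_confidence(candidates, threshold=0.8)
print(accepted)
```

Here the low-confidence `license` guess is withheld, while `name` and `keywords` are kept.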
Evaluation Metrics:
Accuracy of extracted metadata fields.
Completeness of the generated metadata record.
Semantic similarity between generated metadata and ground truth (if available).
Ability to link the dataset to existing datasets with high-quality metadata (for Hard Mode).
Confidence score of generated metadata.
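The first two metrics could be computed along these lines. This sketch assumes exact-match comparison against a ground-truth record, which is stricter than the semantic similarity metric listed above, and it defines completeness as the share of ground-truth fields that the system emitted. The function name and the sample records are hypothetical.

```python
def evaluate(generated, truth):
    """Return (accuracy, completeness) for a generated metadata record.

    accuracy: share of emitted fields (that exist in the ground truth)
        whose value matches exactly; None if nothing is comparable.
    completeness: share of ground-truth fields that were emitted at all;
        None if the ground truth is empty.
    """
    comparable = [f for f in generated if f in truth]
    correct = [f for f in comparable if generated[f] == truth[f]]
    accuracy = len(correct) / len(comparable) if comparable else None
    completeness = len(comparable) / len(truth) if truth else None
    return accuracy, completeness


# Hypothetical ground truth and system output.
truth = {"name": "A", "license": "CC0", "creator": "X", "year": 2020}
generated = {"name": "A", "license": "MIT"}

accuracy, completeness = evaluate(generated, truth)
print(accuracy, completeness)
```

Reporting the two numbers separately keeps the trade-off visible: a system that withholds uncertain fields will score lower on completeness but should not be penalized on accuracy.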
Desired Outcomes:
A functional system that can automatically generate metadata for academic research datasets.
A demonstration of the feasibility of zero-shot or few-shot metadata generation.
Insights into the challenges and opportunities of metadata generation in academia.