Content from Introduction to Data Curation
Last updated on 2026-04-12
Overview
Questions
- What is data curation?
- Why is data curation important in data science?
- What are the stages of the data curation lifecycle?
- How does good curation improve research quality and reproducibility?
- What are common challenges in managing data?
Objectives
- Define data curation and its purpose
- Understand the lifecycle of curated data
- Recognize best practices for organizing and managing datasets
- Identify common metadata and documentation standards
- Appreciate the role of data curation in reproducible science
What is Data Curation?
Data curation is the process of organizing, documenting, preserving, and maintaining data so that it remains useful, understandable, and reusable over time.
It includes:
- Cleaning and validating data
- Organizing files and formats
- Creating metadata
- Preserving datasets for long-term access
Key Idea
Data curation is not just storing files — it is making data usable for future analysis.
Why Data Curation Matters
Poorly curated data can lead to:
- Lost files
- Confusing variable names
- Missing context
- Irreproducible research
Well-curated data helps:
- Ensure reproducibility
- Enable collaboration
- Improve data quality
- Support long-term preservation
The Data Curation Lifecycle
Data curation happens throughout the life of a dataset.
Core Principles of Good Data Curation
1. Organization
Use clear folder structures.
Example:
Inside your project folder, create sub-folders with the following names:
- data_raw/
- data_clean/
- scripts/
- outputs/
- documentation/
For example, write your scripts so that they save results into
outputs/. A consistent structure keeps your work traceable: anyone,
including your future self, should be able to rerun the whole analysis
with ease.
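If Python is available, a structure like this can be set up programmatically; the sketch below simply builds the sub-folders listed above (the folder names are the ones from this lesson, not a universal standard):

```python
from pathlib import Path

# The sub-folders listed above; adjust the names to your own project.
SUBFOLDERS = ["data_raw", "data_clean", "scripts", "outputs", "documentation"]

def create_project(root="project"):
    """Create the project root and its standard sub-folders."""
    for name in SUBFOLDERS:
        # exist_ok=True makes the script safe to re-run;
        # parents=True also creates the root folder if it is missing.
        (Path(root) / name).mkdir(parents=True, exist_ok=True)
    return sorted(p.name for p in Path(root).iterdir())
```

Running `create_project()` once at the start of a project means every analysis begins from the same layout.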
2. Naming Conventions
Good file names should be:
- Descriptive
- Consistent
- Machine-readable
Example: river_discharge_monthly_2024.csv
Avoid: data_new_latest2.csv
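Machine-readability can even be checked automatically. The pattern below is one possible convention (an assumption for illustration, not a universal rule): lowercase words and numbers separated by underscores, followed by an extension.

```python
import re

# Hypothetical convention: lowercase tokens joined by underscores,
# e.g. river_discharge_monthly_2024.csv
NAME_PATTERN = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*\.[a-z]+$")

def is_machine_readable(filename):
    """Return True if the filename matches the naming convention above."""
    return bool(NAME_PATTERN.match(filename))
```

Note that a regex can only enforce consistency and machine-readability; whether a name is *descriptive* still requires human judgment.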
Data Cleaning vs Data Curation
These are related but different:
- Data cleaning fixes problems within the values themselves (errors, duplicates, invalid entries).
- Data curation manages the whole dataset over its lifetime (organization, metadata, preservation).
File Formats Matter
Choose formats that are:
- Open
- Reusable
- Non-proprietary
Preferred:
- CSV instead of XLSX
- TXT instead of DOCX for plain text
- GeoJSON instead of closed GIS formats when possible
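Open formats are also easy to produce with standard tools. For instance, Python's built-in csv module writes plain CSV with no proprietary dependencies (the column names below are made up for illustration):

```python
import csv

def write_rows_as_csv(path, rows):
    """Write a list of dicts to CSV, an open, non-proprietary format."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

# Illustrative rows; real values would come from your measurements.
example_rows = [
    {"date": "2024-01-01", "discharge_m3s": "12.4"},
    {"date": "2024-02-01", "discharge_m3s": "9.8"},
]
```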
Version Control in Data Curation
Track changes to files over time.
Methods:
- Version numbering (v1, v2)
- Git / GitHub
- Changelogs
Example: survey_cleaned_v3.csv
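Version numbering can be automated. The helper below is a hypothetical sketch (the stem and extension defaults match the example above) that derives the next vN filename from the files already present:

```python
import re

def next_version(existing, stem="survey_cleaned", ext="csv"):
    """Return the next vN filename for stem, given existing filenames."""
    pattern = re.compile(rf"^{re.escape(stem)}_v(\d+)\.{re.escape(ext)}$")
    # Collect the version numbers of files that match the pattern.
    versions = [int(m.group(1)) for name in existing
                if (m := pattern.match(name))]
    return f"{stem}_v{max(versions, default=0) + 1}.{ext}"
```

For example, given `survey_cleaned_v1.csv` and `survey_cleaned_v3.csv`, it returns `survey_cleaned_v4.csv`.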
Tip
Never overwrite original raw data.
Keep raw data unchanged.
Backup and Preservation
Use the 3-2-1 Rule:
- 3 copies of data
- 2 different storage types
- 1 offsite backup
Example:
- Local computer
- External drive
- Cloud storage
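A backup is only useful if the copy is intact. Checksums let you confirm that a backup is identical to the original; here is a minimal sketch using only the standard library:

```python
import hashlib

def file_checksum(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copies_match(original, backup):
    """True when a backup copy is byte-for-byte identical to the original."""
    return file_checksum(original) == file_checksum(backup)
```

Recording checksums alongside the data also lets you detect silent corruption years later.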
FAIR Principles
Good curated data should be:
- Findable
- Accessible
- Interoperable
- Reusable
Common Challenges in Data Curation
- Inconsistent naming
- Missing metadata
- Lost context over time
- Proprietary formats
- Lack of backup
Real-World Example
Imagine sharing a climate dataset without:
- Units
- Dates
- Sensor details
Even accurate data becomes nearly useless without context.
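That missing context can be captured in a small metadata record stored next to the data file. The field names below are illustrative examples, not a formal standard:

```python
import json

# Illustrative metadata record; field names are examples, not a standard.
metadata = {
    "variable": "air_temperature",
    "units": "degrees Celsius",
    "time_coverage": "2024-01-01 to 2024-12-31",
    "sensor": "example thermistor, accuracy +/- 0.1 C",
}

def save_metadata(path, record):
    """Store the metadata as JSON next to the data file it describes."""
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
```

Even a simple record like this restores the units, dates, and sensor details a reader needs.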
Hands-On Exercise
Accessibility and Ethics in Data Curation
Remember:
- Protect sensitive data
- Remove personal identifiers
- Follow privacy guidelines
- Respect licensing restrictions
Final Takeaways
Good data curation:
- Saves time later
- Prevents mistakes
- Improves collaboration
- Makes research reproducible
Discussion
- Have you ever struggled with poorly organized data?
- What curation practice would improve your current workflow most?
Content from Additional Topics
Last updated on 2026-04-15
Overview
Questions
- What happens after data is created?
- How do institutions manage research data repositories?
- What is PURR and how does it support data curation?
- How is large-scale data transferred and preserved globally?
- What challenges arise when curating very large datasets?
Objectives
- Explore advanced topics in data curation
- Understand institutional repositories such as PURR
- Learn how large datasets are stored and transferred
- Recognize challenges in global-scale research data management
- Connect local data practices to international infrastructure
Beyond Basic Data Curation
Data curation does not stop at organizing files on your computer.
As projects grow larger, data must often be:
- Shared across institutions
- Archived in repositories
- Transferred across countries
- Managed in cloud environments
Modern data science depends on scalable curation systems.
1. Data Generation: Where Data Begins
Before curation begins, data must first be generated, for example by sensors, surveys, simulations, or laboratory experiments.
2. Institutional Repositories: Example of PURR
What is PURR?
PURR (Purdue University Research Repository) is Purdue University’s platform for:
- Publishing datasets
- Preserving research outputs
- Sharing reproducible workflows
PURR helps researchers:
- Store curated datasets securely
- Assign DOIs (Digital Object Identifiers)
- Share data publicly or privately
- Meet grant and publication requirements
3. Data Publishing and DOI Assignment
When curated data is deposited into repositories like PURR:
- It receives permanent identifiers
- Others can cite it in publications
- It becomes discoverable worldwide
Example citation:

> Smith et al. (2025). Soil Moisture Data for Indiana Watersheds.
This turns datasets into scholarly products.
4. Large-Scale Cloud Data Transfer
Some datasets are too large for email, USB drives, or local sharing.
Examples:
- Satellite imagery archives
- Climate model simulations
- Genomics databases
- High-resolution remote sensing data
These often require:
- Cloud platforms
- Distributed storage systems
- High-speed transfer protocols
5. Global Data Transfer Systems
Large-scale research often uses tools such as:
Globus
Globus is widely used for:
- Secure high-volume data transfer
- Moving terabytes between institutions
- Automating research workflows
Example: A researcher in Indiana transfers 5 TB of satellite data to collaborators in Europe.
6. Cloud Storage and Distributed Infrastructure
Modern curated datasets may live in:
- Amazon S3
- Google Cloud Storage
- Microsoft Azure
- Institutional HPC clusters
These systems support:
- Redundancy
- Backup replication
- Global accessibility
7. Challenges of Big Data Curation
Large-scale datasets introduce new problems:
- Storage and transfer costs
- Long transfer times for terabyte-scale data
- Keeping metadata consistent across systems
- Managing access permissions across institutions
8. Data Lifecycle at Global Scale
For large collaborative projects:
Generate → Process → Curate → Store → Transfer → Archive → Reuse
Unlike small projects, this cycle may involve:
- Multiple countries
- Multiple institutions
- Automated pipelines
9. Reproducibility in Shared Infrastructure
When sharing globally:
- File formats must be standardized
- Metadata must be machine-readable
- Access permissions must be managed carefully
Example standards:
- NetCDF for climate data
- HDF5 for scientific arrays
- JSON metadata schemas
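Machine-readable metadata can be validated automatically before data is shared. The sketch below checks a JSON record for required fields; the field list is an assumption for illustration, since real repositories define their own schemas:

```python
import json

# Hypothetical required fields; real schemas (e.g. repository-specific
# JSON schemas) define their own lists.
REQUIRED_FIELDS = {"title", "creator", "format", "license"}

def missing_fields(metadata_json):
    """Return required fields absent from a JSON metadata record."""
    record = json.loads(metadata_json)
    return sorted(REQUIRED_FIELDS - record.keys())
```

Running such a check in an automated pipeline catches incomplete records before they reach collaborators.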
10. Future of Data Curation
Emerging trends include:
- AI-assisted metadata generation
- Automated cloud archiving
- FAIR-compliant repositories
- Real-time streaming curation pipelines
Final Takeaways
Data curation today extends far beyond local folders:
- Institutions use repositories like PURR
- Large datasets require cloud infrastructure
- Global collaboration depends on scalable transfer systems
Discussion
- What kinds of projects require cloud-scale curation?
- How might your own research data eventually outgrow local storage?