All in One View

Content from Introduction to Data Curation


Last updated on 2026-04-12

Overview

Questions

  • What is data curation?
  • Why is data curation important in data science?
  • What are the stages of the data curation lifecycle?
  • How does good curation improve research quality and reproducibility?
  • What are common challenges in managing data?

Objectives

  • Define data curation and its purpose
  • Understand the lifecycle of curated data
  • Recognize best practices for organizing and managing datasets
  • Identify common metadata and documentation standards
  • Appreciate the role of data curation in reproducible science

What is Data Curation?


Data curation is the process of organizing, documenting, preserving, and maintaining data so that it remains useful, understandable, and reusable over time.

It includes:

  • Cleaning and validating data
  • Organizing files and formats
  • Creating metadata
  • Preserving datasets for long-term access
Callout

Key Idea

Data curation is not just storing files — it is making data usable for future analysis.


Why Data Curation Matters


Poorly curated data can lead to:

  • Lost files
  • Confusing variable names
  • Missing context
  • Irreproducible research

Well-curated data helps:

  • Ensure reproducibility
  • Enable collaboration
  • Improve data quality
  • Support long-term preservation

Example:

A dataset named final_data_v2_revised_REAL_final.csv gives little confidence or clarity.

A curated alternative: soil_moisture_2025_stationA_clean.csv


The Data Curation Lifecycle


Data curation happens throughout the life of a dataset.

Typical Lifecycle Stages:

  1. Create / Collect
  2. Organize
  3. Document
  4. Store / Backup
  5. Preserve
  6. Share / Publish
  7. Reuse / Reanalyze
Callout

Important

Curation begins when data is created — not after the project ends.


Core Principles of Good Data Curation


1. Organization

Use clear folder structures.

Example:

In your project folder, create sub-folders along these lines:

    project/
    ├── data_raw/
    ├── data_clean/
    ├── scripts/
    ├── outputs/
    └── documentation/

For example, write your scripts so that they save their results to outputs/. This keeps your research traceable: you (or anyone else) should be able to rerun the whole analysis, from raw data to final outputs, with ease.
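
If you prefer to set this structure up programmatically, here is a minimal Python sketch (the folder names follow the structure above; adjust them to your project):

    from pathlib import Path

    # Hypothetical project root; rename for your own project.
    root = Path("project")

    # Create the standard sub-folders for a curated project.
    for sub in ["data_raw", "data_clean", "scripts", "outputs", "documentation"]:
        (root / sub).mkdir(parents=True, exist_ok=True)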


2. Naming Conventions

Good file names should be:

  • Descriptive
  • Consistent
  • Machine-readable

Example: river_discharge_monthly_2024.csv

Avoid: data_new_latest2.csv
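
A quick way to enforce a convention is to test names against a pattern. A minimal Python sketch, assuming a hypothetical convention of lowercase words separated by underscores and ending in a four-digit year:

    import re

    # Hypothetical convention: lowercase words joined by underscores,
    # ending in a four-digit year and a file extension.
    PATTERN = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*_\d{4}\.\w+$")

    for name in ["river_discharge_monthly_2024.csv", "data_new_latest2.csv"]:
        print(name, "->", "OK" if PATTERN.match(name) else "rename me")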


3. Documentation

Every dataset should include documentation:

  • README file
  • Variable descriptions
  • Units of measurement
  • Data source notes

Example README includes:

  • Project title
  • Author
  • Date created
  • File descriptions
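
For illustration, a minimal README skeleton covering those items (all details are placeholders):

    Project title : Soil Moisture Monitoring, Station A
    Author        : <your name>
    Date created  : <YYYY-MM-DD>
    Files:
      soil_moisture_2025_stationA_raw.csv   - original sensor export
      soil_moisture_2025_stationA_clean.csv - cleaned station A readings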

4. Metadata

Metadata = “data about data”

Examples:

  • Who created the dataset?
  • When was it collected?
  • What instruments were used?
  • What do columns mean?
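
Metadata is most useful when it is machine-readable. A minimal Python sketch that writes a hypothetical JSON "sidecar" file alongside a dataset (the field names are illustrative, not a formal standard):

    import json

    # Illustrative metadata; the field names are ad hoc, not a formal schema.
    metadata = {
        "creator": "<your name>",
        "collected": "2025-06-01 to 2025-08-31",
        "instrument": "capacitance soil moisture probe",
        "columns": {
            "timestamp": "observation time, ISO 8601, UTC",
            "moisture": "volumetric water content, percent",
        },
    }

    with open("soil_moisture_2025_stationA_clean.json", "w") as f:
        json.dump(metadata, f, indent=2)
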
Discussion

Challenge

Why is metadata essential if someone else uses your dataset five years later?


Data Cleaning vs Data Curation


These are related but different:

Data Cleaning:

Fixes errors in the data, for example (see the sketch after this list):

  • Missing values
  • Typos
  • Duplicates
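
A minimal pandas sketch addressing all three (the file and column names are hypothetical):

    import pandas as pd

    # Hypothetical raw survey export.
    df = pd.read_csv("data_raw/survey.csv")

    df = df.drop_duplicates()                        # remove duplicate rows
    df = df.dropna(subset=["age"])                   # drop rows missing a key value
    df["city"] = df["city"].str.strip().str.title()  # normalize casing/whitespace typos

    df.to_csv("data_clean/survey_cleaned_v1.csv", index=False)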

Data Curation:

Maintains long-term usability.

  • Documentation
  • Preservation
  • Versioning

Both are necessary.


File Formats Matter


Choose formats that are:

  • Open
  • Reusable
  • Non-proprietary

Preferred:

  • CSV instead of XLSX
  • TXT instead of DOCX for plain text
  • GeoJSON instead of closed GIS formats when possible
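
If data arrives in a proprietary format, it is often worth converting it early. A minimal pandas sketch converting a spreadsheet to CSV (assumes pandas and an Excel reader such as openpyxl are installed; the file names are hypothetical):

    import pandas as pd

    # Read the proprietary spreadsheet, write an open CSV copy.
    df = pd.read_excel("data_raw/field_measurements.xlsx")
    df.to_csv("data_clean/field_measurements.csv", index=False)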

Version Control in Data Curation


Track changes to files over time.

Methods:

  • Version numbering (v1, v2)
  • Git / GitHub
  • Changelogs

Example: survey_cleaned_v3.csv
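
For scripts and other text files, Git records a complete history without version-numbered names. A minimal command-line sketch (assumes Git is installed; large binary data files are usually better handled with version numbers or dedicated tools):

    git init
    git add scripts/clean_survey.py
    git commit -m "Add survey cleaning script"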

Callout

Tip

Never overwrite original raw data.

Keep raw data unchanged.


Backup and Preservation


Use the 3-2-1 Rule:

  • 3 copies of data
  • 2 different storage types
  • 1 offsite backup

Example:

  • Local computer
  • External drive
  • Cloud storage
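
A minimal Python sketch of the local half of this rule, copying a project folder to a second drive (the paths are hypothetical; cloud backup is usually handled by a dedicated sync client):

    import shutil
    from pathlib import Path

    src = Path("project")                      # working copy on the local computer
    dst = Path("/mnt/external_drive/project")  # second copy on a different device

    # Copy the whole folder tree, refreshing an existing backup if present.
    shutil.copytree(src, dst, dirs_exist_ok=True)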

FAIR Principles


Well-curated data should be:

F — Findable

Easy to locate

A — Accessible

Available to authorized users

I — Interoperable

Compatible with other systems

R — Reusable

Well-documented and understandable

Callout

FAIR Data = Better Science

The FAIR framework is widely used in research data management.


Common Challenges in Data Curation


  • Inconsistent naming
  • Missing metadata
  • Lost context over time
  • Proprietary formats
  • Lack of backup

Real-World Example


Imagine sharing a climate dataset without:

  • Units
  • Dates
  • Sensor details

Even accurate data becomes nearly useless without context.


Hands-On Exercise


Task:

Create a curated folder structure for a sample project.

Include:

  • Raw data folder
  • Clean data folder
  • README file
  • Metadata sheet

Tip: You can check our GitHub page to see how we handled our data.


Accessibility and Ethics in Data Curation


Remember:

  • Protect sensitive data
  • Remove personal identifiers
  • Follow privacy guidelines
  • Respect licensing restrictions

Final Takeaways


Good data curation:

  • Saves time later
  • Prevents mistakes
  • Improves collaboration
  • Makes research reproducible
Discussion
  • Have you ever struggled with poorly organized data?
  • What curation practice would improve your current workflow most?

Content from Additional Topics


Last updated on 2026-04-15

Overview

Questions

  • What happens after data is created?
  • How do institutions manage research data repositories?
  • What is PURR and how does it support data curation?
  • How is large-scale data transferred and preserved globally?
  • What challenges arise when curating very large datasets?

Objectives

  • Explore advanced topics in data curation
  • Understand institutional repositories such as PURR
  • Learn how large datasets are stored and transferred
  • Recognize challenges in global-scale research data management
  • Connect local data practices to international infrastructure

Beyond Basic Data Curation


Data curation does not stop at organizing files on your computer.

As projects grow larger, data must often be:

  • Shared across institutions
  • Archived in repositories
  • Transferred across countries
  • Managed in cloud environments

Modern data science depends on scalable curation systems.


1. Data Generation: Where Data Begins


Before curation begins, data must first be generated.

Common sources of data:

  • Scientific instruments (sensors, satellites, microscopes)
  • Surveys and questionnaires
  • Field observations
  • Simulations and computational models
  • Web APIs and streaming platforms

Example:

A weather station may generate:

  • Temperature every minute
  • Humidity every hour
  • Rainfall every day
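
Sampling frequency matters for curation: mixed cadences like these need to be documented, and often resampled before analysis. A minimal pandas sketch with synthetic readings standing in for a real sensor:

    import numpy as np
    import pandas as pd

    # Synthetic stand-in: one day of minute-level temperature readings.
    times = pd.date_range("2025-06-01", periods=24 * 60, freq="min")
    temps = pd.Series(20 + np.random.randn(len(times)), index=times)

    # Aggregate to the hourly cadence of the humidity sensor,
    # and note the aggregation method in your metadata.
    hourly_mean = temps.resample("h").mean()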
Callout

Important

The way data is generated affects how it should be curated later.


2. Institutional Repositories: Example of PURR


What is PURR?

PURR (Purdue University Research Repository) is Purdue University’s platform for:

  • Publishing datasets
  • Preserving research outputs
  • Sharing reproducible workflows

You can find it at https://purr.purdue.edu.

PURR helps researchers:

  • Store curated datasets securely
  • Assign DOIs (Digital Object Identifiers)
  • Share data publicly or privately
  • Meet grant and publication requirements

Why repositories matter:

Repositories protect data from loss and make it reusable beyond the original project.


3. Data Publishing and DOI Assignment


When curated data is deposited into repositories like PURR:

  • It receives permanent identifiers
  • Others can cite it in publications
  • It becomes discoverable worldwide

Example citation:

Smith et al. (2025). Soil Moisture Data for Indiana Watersheds.

This turns datasets into scholarly products.


4. Large-Scale Cloud Data Transfer


Some datasets are too large for email, USB drives, or local sharing.

Examples:

  • Satellite imagery archives
  • Climate model simulations
  • Genomics databases
  • High-resolution remote sensing data

These often require:

  • Cloud platforms
  • Distributed storage systems
  • High-speed transfer protocols

5. Global Data Transfer Systems


Large-scale research often uses tools such as:

Globus

You can find it at https://www.globus.org.

Globus is widely used for:

  • Secure high-volume data transfer
  • Moving terabytes between institutions
  • Automating research workflows

Example: A researcher in Indiana transfers 5 TB of satellite data to collaborators in Europe.
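
For programmatic transfers, Globus provides a Python SDK (globus-sdk). A minimal sketch of submitting one transfer, assuming an access token has already been obtained and both endpoint IDs are known (all identifiers below are placeholders):

    import globus_sdk

    # Placeholders: a real token comes from a Globus login flow;
    # endpoint IDs come from the Globus web app.
    TOKEN = "..."
    SRC = "source-endpoint-uuid"
    DST = "destination-endpoint-uuid"

    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(TOKEN)
    )

    # Describe the transfer: recursively copy one folder between endpoints.
    tdata = globus_sdk.TransferData(tc, SRC, DST, label="satellite archive")
    tdata.add_item("/data/satellite/", "/incoming/satellite/", recursive=True)

    task = tc.submit_transfer(tdata)
    print("Task ID:", task["task_id"])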


6. Cloud Storage and Distributed Infrastructure


Modern curated datasets may live in:

  • Amazon S3
  • Google Cloud Storage
  • Microsoft Azure
  • Institutional HPC clusters

These systems support:

  • Redundancy
  • Backup replication
  • Global accessibility
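
As one concrete example, uploading a curated file to Amazon S3 with the boto3 library (assumes boto3 is installed and AWS credentials are configured; the bucket and key names are hypothetical):

    import boto3

    s3 = boto3.client("s3")

    # Upload a local curated file under a clear, descriptive key.
    s3.upload_file(
        "data_clean/soil_moisture_2025_stationA_clean.csv",  # local file
        "my-research-archive",                               # hypothetical bucket
        "soil_moisture/2025/stationA/clean.csv",             # object key
    )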

7. Challenges of Big Data Curation


Large-scale datasets introduce new problems:

Storage Costs

Huge datasets require expensive infrastructure.

Transfer Speed

Slow internet limits movement of terabytes.

Metadata Complexity

Larger systems require richer documentation.

Preservation Risk

Formats may become obsolete over decades.

Discussion

Challenge

Why might a 100 TB climate archive require different curation strategies than a 10 MB CSV file?


8. Data Lifecycle at Global Scale


For large collaborative projects:

Generate → Process → Curate → Store → Transfer → Archive → Reuse

Unlike small projects, this cycle may involve:

  • Multiple countries
  • Multiple institutions
  • Automated pipelines

9. Reproducibility in Shared Infrastructure


When sharing globally:

  • File formats must be standardized
  • Metadata must be machine-readable
  • Access permissions must be managed carefully

Example standards:

  • NetCDF for climate data
  • HDF5 for scientific arrays
  • JSON metadata schemas
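
As a small illustration, writing an array to HDF5 with the h5py library, attaching units and provenance as machine-readable attributes (assumes h5py and numpy are installed; the values are illustrative):

    import h5py
    import numpy as np

    temps = 15 + 10 * np.random.rand(24)  # stand-in for one day of hourly readings

    with h5py.File("station_a_2025.h5", "w") as f:
        dset = f.create_dataset("temperature", data=temps)
        # Attributes travel with the data, keeping units and
        # provenance machine-readable wherever the file goes.
        dset.attrs["units"] = "degC"
        dset.attrs["station"] = "A"
        dset.attrs["cadence"] = "hourly mean"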

10. Future of Data Curation


Emerging trends include:

  • AI-assisted metadata generation
  • Automated cloud archiving
  • FAIR-compliant repositories
  • Real-time streaming curation pipelines

Final Takeaways


Data curation today extends far beyond local folders:

  • Institutions use repositories like PURR
  • Large datasets require cloud infrastructure
  • Global collaboration depends on scalable transfer systems
Discussion
  • What kinds of projects require cloud-scale curation?
  • How might your own research data eventually outgrow local storage?