Additional Topics

Last updated on 2026-04-15

Estimated time: 55 minutes

Overview

Questions

  • What happens after data is created?
  • How do institutions manage research data repositories?
  • What is PURR and how does it support data curation?
  • How is large-scale data transferred and preserved globally?
  • What challenges arise when curating very large datasets?

Objectives

  • Explore advanced topics in data curation
  • Understand institutional repositories such as PURR
  • Learn how large datasets are stored and transferred
  • Recognize challenges in global-scale research data management
  • Connect local data practices to international infrastructure

Beyond Basic Data Curation


Data curation does not stop at organizing files on your computer.

As projects grow larger, data must often be:

  • Shared across institutions
  • Archived in repositories
  • Transferred across countries
  • Managed in cloud environments

Modern data science depends on scalable curation systems.


1. Data Generation: Where Data Begins


Before curation begins, data must first be generated.

Common sources of data:

  • Scientific instruments (sensors, satellites, microscopes)
  • Surveys and questionnaires
  • Field observations
  • Simulations and computational models
  • Web APIs and streaming platforms

Example:

A weather station may generate:

  • Temperature every minute
  • Humidity every hour
  • Rainfall every day

Callout

Important

The way data is generated affects how it should be curated later.
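
As a rough illustration, the sketch below counts how many records each weather-station variable produces per day. The intervals come from the example above; the dictionary and function names are made up for this illustration.

```python
# Sampling intervals from the weather-station example, in minutes.
SAMPLING_INTERVAL_MINUTES = {
    "temperature": 1,      # every minute
    "humidity": 60,        # every hour
    "rainfall": 24 * 60,   # every day
}

MINUTES_PER_DAY = 24 * 60


def records_per_day(interval_minutes: int) -> int:
    """Return how many readings one sensor produces in a day."""
    return MINUTES_PER_DAY // interval_minutes


for variable, interval in SAMPLING_INTERVAL_MINUTES.items():
    print(f"{variable}: {records_per_day(interval)} records per day")
```

Temperature yields 1,440 records per day while rainfall yields one, so the three variables cannot share a single table until you decide how to align their timestamps.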


2. Institutional Repositories: Example of PURR


What is PURR?

PURR (Purdue University Research Repository) is Purdue University’s platform for:

  • Publishing datasets
  • Preserving research outputs
  • Sharing reproducible workflows

PURR helps researchers:

  • Store curated datasets securely
  • Assign DOIs (Digital Object Identifiers)
  • Share data publicly or privately
  • Meet grant and publication requirements

Why repositories matter:

Repositories protect data from loss and make it reusable beyond the original project.
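
One common safeguard against silent loss or corruption is to record a checksum for every file before deposit, so that later copies can be verified. The sketch below is a generic illustration using Python's standard library. It is not PURR's own procedure, and the function names are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to `folder`) to its SHA-256 checksum."""
    return {
        path.relative_to(folder).as_posix(): sha256_of_file(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }
```

Calling `build_manifest(Path("my_dataset"))` on a hypothetical dataset folder returns a dictionary that can be saved alongside the data and recomputed later to confirm nothing has changed.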


3. Data Publishing and DOI Assignment


When curated data is deposited into repositories like PURR:

  • It receives a persistent identifier, such as a DOI
  • Others can cite it in publications
  • It becomes discoverable worldwide

Example citation:

Smith et al. (2025). Soil Moisture Data for Indiana Watersheds.

This turns datasets into scholarly products.
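
A DOI is usually cited as a link through the `https://doi.org/` resolver. The sketch below assembles a dataset citation from its parts. The function name, the field order, and the DOI value are placeholders for illustration only; the DOI shown is not a real identifier.

```python
def format_citation(authors: str, year: int, title: str,
                    repository: str, doi: str) -> str:
    """Build a one-line dataset citation ending in a DOI link."""
    return f"{authors} ({year}). {title}. {repository}. https://doi.org/{doi}"


# Placeholder values: the DOI below is NOT a real identifier.
citation = format_citation(
    authors="Smith et al.",
    year=2025,
    title="Soil Moisture Data for Indiana Watersheds",
    repository="Purdue University Research Repository",
    doi="10.0000/example-doi",
)
print(citation)
```

Repositories and journals each have their own citation style, so treat the field order here as one possible layout.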


4. Large-Scale Cloud Data Transfer


Some datasets are too large for email, USB drives, or local sharing.

Examples:

  • Satellite imagery archives
  • Climate model simulations
  • Genomics databases
  • High-resolution remote sensing data

These often require:

  • Cloud platforms
  • Distributed storage systems
  • High-speed transfer protocols

5. Global Data Transfer Systems


Large-scale research often uses tools such as:

Globus

Globus is widely used for:

  • Secure high-volume data transfer
  • Moving terabytes between institutions
  • Automating research workflows

Example: A researcher in Indiana transfers 5 TB of satellite data to collaborators in Europe.
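
To get a feel for why dedicated transfer tools matter, the sketch below estimates how long the 5 TB example would take at a few link speeds. The speeds are assumed values, and the calculation assumes the link runs at full speed the whole time, which real networks rarely do.

```python
def transfer_time_hours(size_terabytes: float, speed_gbps: float) -> float:
    """Estimate transfer time, assuming the link runs at full speed throughout.

    Uses decimal units: 1 TB = 8,000 gigabits.
    """
    gigabits = size_terabytes * 8_000
    seconds = gigabits / speed_gbps
    return seconds / 3600


for speed in (0.1, 1, 10):  # assumed sustained speeds in gigabits per second
    print(f"{speed} Gbps: {transfer_time_hours(5, speed):.1f} hours")
```

Under these assumptions, 5 TB takes about 11 hours at 1 Gbps and more than four days at 100 Mbps. Protocol overhead and shared links usually make real transfers slower.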


6. Cloud Storage and Distributed Infrastructure


Modern curated datasets may live in:

  • Amazon S3
  • Google Cloud Storage
  • Microsoft Azure Blob Storage
  • Institutional HPC clusters

These systems support:

  • Redundancy
  • Backup replication
  • Global accessibility

7. Challenges of Big Data Curation


Large-scale datasets introduce new problems:

Storage Costs

Very large datasets need costly storage infrastructure, and the cost recurs for as long as the data is kept.

Transfer Speed

Limited network bandwidth makes moving terabytes slow; a single transfer can take hours or days.

Metadata Complexity

The more files, formats, and contributors a dataset has, the more documentation it needs to stay usable.

Preservation Risk

Formats may become obsolete over decades.

Discussion

Challenge

Why might a 100 TB climate archive require different curation strategies than a 10 MB CSV file?


8. Data Lifecycle at Global Scale


For large collaborative projects:

Generate → Process → Curate → Store → Transfer → Archive → Reuse

Unlike small projects, this cycle may involve:

  • Multiple countries
  • Multiple institutions
  • Automated pipelines
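
An automated pipeline needs some way to track which lifecycle stage a dataset is in. The sketch below encodes the stages listed above as an ordered list with a helper that returns the next stage. The names `LIFECYCLE` and `next_stage` are illustrative and do not come from any particular pipeline tool.

```python
from typing import Optional

# Stage names and order taken from the lifecycle described above.
LIFECYCLE = ["Generate", "Process", "Curate", "Store", "Transfer", "Archive", "Reuse"]


def next_stage(current: str) -> Optional[str]:
    """Return the stage that follows `current`, or None after the last stage."""
    position = LIFECYCLE.index(current)  # raises ValueError for unknown stages
    if position + 1 < len(LIFECYCLE):
        return LIFECYCLE[position + 1]
    return None


print(next_stage("Curate"))
```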

9. Reproducibility in Shared Infrastructure


When sharing globally:

  • File formats must be standardized
  • Metadata must be machine-readable
  • Access permissions must be managed carefully

Example standards:

  • NetCDF for climate data
  • HDF5 for scientific arrays
  • JSON metadata schemas
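
As a small illustration of machine-readable metadata, the sketch below stores a dataset description as JSON and checks it against a list of required fields. The field names and the required set are hypothetical; real repositories and metadata standards define their own.

```python
import json

# Hypothetical minimal set of required fields; real repositories define their own.
REQUIRED_FIELDS = {"title", "creator", "date_created", "format", "license"}


def missing_fields(record: dict) -> set:
    """Return the required fields that are absent from a metadata record."""
    return REQUIRED_FIELDS - set(record)


# Illustrative record; the values are placeholders, not a real dataset.
record = {
    "title": "Soil Moisture Data for Indiana Watersheds",
    "creator": "Smith et al.",
    "date_created": "2025",
    "format": "NetCDF",
}

print(json.dumps(record, indent=2))
print("Missing:", sorted(missing_fields(record)))
```

Because the record is plain JSON, a script can run this check before upload instead of relying on a person to spot the missing license.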

10. Future of Data Curation


Emerging trends include:

  • AI-assisted metadata generation
  • Automated cloud archiving
  • FAIR-compliant repositories
  • Real-time streaming curation pipelines

Final Takeaways


Data curation today extends far beyond local folders:

  • Institutions use repositories like PURR
  • Large datasets require cloud infrastructure
  • Global collaboration depends on scalable transfer systems

Discussion
  • What kinds of projects require cloud-scale curation?
  • How might your own research data eventually outgrow local storage?