Additional Topics
Last updated on 2026-04-15
Overview
Questions
- What happens after data is created?
- How do institutions manage research data repositories?
- What is PURR and how does it support data curation?
- How is large-scale data transferred and preserved globally?
- What challenges arise when curating very large datasets?
Objectives
- Explore advanced topics in data curation
- Understand institutional repositories such as PURR
- Learn how large datasets are stored and transferred
- Recognize challenges in global-scale research data management
- Connect local data practices to international infrastructure
Beyond Basic Data Curation
Data curation does not stop at organizing files on your computer.
As projects grow larger, data must often be:
- Shared across institutions
- Archived in repositories
- Transferred across countries
- Managed in cloud environments
Modern data science depends on scalable curation systems.
1. Data Generation: Where Data Begins
Before curation begins, data must first be generated. Data may come from laboratory instruments, field sensors, simulations, surveys, or existing archives, and how it is produced shapes how it later needs to be organized, described, and shared.
2. Institutional Repositories: Example of PURR
What is PURR?
PURR (Purdue University Research Repository) is Purdue University’s platform for:
- Publishing datasets
- Preserving research outputs
- Sharing reproducible workflows
The repository is available at purr.purdue.edu.
PURR helps researchers:
- Store curated datasets securely
- Assign DOIs (Digital Object Identifiers)
- Share data publicly or privately
- Meet grant and publication requirements
3. Data Publishing and DOI Assignment
When curated data is deposited into repositories like PURR:
- It receives permanent identifiers
- Others can cite it in publications
- It becomes discoverable worldwide
Example citation:
> Smith et al. (2025). Soil Moisture Data for Indiana Watersheds.
This turns datasets into scholarly products.
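Because a DOI resolves on the web, citation metadata can also be retrieved programmatically. Below is a minimal sketch in Python using DOI content negotiation; the DOI string is a made-up placeholder, not a real record.

```python
import requests

# Hypothetical DOI used only for illustration; substitute a real dataset DOI.
doi = "10.1234/example-soil-moisture"

# DOI content negotiation: ask doi.org to return citation metadata (CSL JSON)
# instead of redirecting to the repository's landing page.
response = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
response.raise_for_status()

metadata = response.json()
print(metadata.get("title"))
print(metadata.get("publisher"))
```

The same mechanism is what lets reference managers turn a dataset DOI into a formatted citation.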
4. Large-Scale Cloud Data Transfer
Some datasets are too large for email, USB drives, or local sharing.
Examples:
- Satellite imagery archives
- Climate model simulations
- Genomics databases
- High-resolution remote sensing data
These often require:
- Cloud platforms
- Distributed storage systems
- High-speed transfer protocols
5. Global Data Transfer Systems
Large-scale research often uses tools such as:
Globus
The service is available at globus.org.
Globus is widely used for:
- Secure high-volume data transfer
- Moving terabytes between institutions
- Automating research workflows
Example: A researcher in Indiana transfers 5 TB of satellite data to collaborators in Europe.
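Transfers are usually started from the Globus web app, but they can also be scripted with the globus-sdk Python package. The sketch below assumes you already have a transfer access token and the UUIDs of both endpoints; the token, UUIDs, and paths shown are placeholders.

```python
import globus_sdk

# Placeholders: a real script would obtain a token through a Globus login flow
# and look up the endpoint UUIDs in the Globus web app.
TRANSFER_TOKEN = "..."
SOURCE_ENDPOINT = "source-endpoint-uuid"
DEST_ENDPOINT = "destination-endpoint-uuid"

# Authenticate the transfer client with an existing access token.
authorizer = globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
tc = globus_sdk.TransferClient(authorizer=authorizer)

# Describe the transfer: satellite data moved to a European collaborator.
tdata = globus_sdk.TransferData(
    tc,
    SOURCE_ENDPOINT,
    DEST_ENDPOINT,
    label="Indiana satellite data to EU collaborators",
    sync_level="checksum",  # re-transfer only files whose checksums differ
)
tdata.add_item("/projects/satellite/2025/", "/incoming/satellite/2025/", recursive=True)

# Submit the task; Globus handles retries and integrity checks in the background.
task = tc.submit_transfer(tdata)
print("Submitted Globus task:", task["task_id"])
```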
6. Cloud Storage and Distributed Infrastructure
Modern curated datasets may live in:
- Amazon S3
- Google Cloud Storage
- Microsoft Azure
- Institutional HPC clusters
These systems support:
- Redundancy
- Backup replication
- Global accessibility
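As a small illustration of working with object storage, the sketch below uploads a curated file to an Amazon S3 bucket with boto3. The bucket name, file name, and key are hypothetical, and AWS credentials are assumed to be configured already.

```python
import boto3

# Assumes AWS credentials are already configured (e.g. via environment
# variables or ~/.aws/credentials); bucket and key names are placeholders.
s3 = boto3.client("s3")

bucket = "my-curated-datasets"          # hypothetical bucket name
local_file = "soil_moisture_2025.csv"   # curated data file on disk
key = "indiana-watersheds/soil_moisture_2025.csv"

# Upload the file; the storage service then provides redundancy and global access.
s3.upload_file(local_file, bucket, key)
print(f"Uploaded {local_file} to s3://{bucket}/{key}")
```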
7. Challenges of Big Data Curation
Large-scale datasets introduce new problems:
- Storage costs and capacity limits
- Slow or unreliable transfers between sites
- Keeping metadata consistent across thousands of files
- Verifying that copies arrive intact (see the sketch below)
- Managing access permissions and long-term preservation
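One recurring problem is confirming that a large file arrives intact after a long transfer. A minimal sketch using only Python's standard library computes a SHA-256 checksum that can be compared between the source and destination copies; the file name is a placeholder.

```python
import hashlib
from pathlib import Path

def sha256_checksum(path, chunk_size=1024 * 1024):
    """Compute a SHA-256 checksum without loading the whole file into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical file name; run the same function on both ends of a transfer
# and compare the two hex digests to confirm the copies are identical.
print(sha256_checksum(Path("satellite_scene_001.tif")))
```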
8. Data Lifecycle at Global Scale
For large collaborative projects:
Generate → Process → Curate → Store → Transfer → Archive → Reuse
Unlike small projects, this cycle may involve:
- Multiple countries
- Multiple institutions
- Automated pipelines
9. Reproducibility in Shared Infrastructure
When sharing globally:
- File formats must be standardized
- Metadata must be machine-readable
- Access permissions must be managed carefully
Example standards:
- NetCDF for climate data
- HDF5 for scientific arrays
- JSON metadata schemas
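Machine-readable metadata can be as simple as a JSON sidecar file stored alongside the dataset. The sketch below writes one with Python's standard library; the field names are illustrative choices, not taken from any particular metadata schema.

```python
import json
from datetime import date

# Illustrative metadata record; field names are examples, not a formal schema.
metadata = {
    "title": "Soil Moisture Data for Indiana Watersheds",
    "creators": ["Smith, A."],
    "created": date.today().isoformat(),
    "format": "NetCDF",
    "variables": [{"name": "soil_moisture", "units": "m3/m3"}],
    "license": "CC-BY-4.0",
}

# Write the sidecar file next to the dataset so tools and repositories can parse it.
with open("soil_moisture_indiana.metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```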
10. Future of Data Curation
Emerging trends include:
- AI-assisted metadata generation
- Automated cloud archiving
- FAIR-compliant repositories
- Real-time streaming curation pipelines
Final Takeaways
Data curation today extends far beyond local folders:
- Institutions use repositories like PURR
- Large datasets require cloud infrastructure
- Global collaboration depends on scalable transfer systems
Questions to consider:
- What kinds of projects require cloud-scale curation?
- How might your own research data eventually outgrow local storage?