DataPlug
Cloud-Aware Data Partitioning

Revolutionizing data partitioning for unstructured scientific data in cloud environments.
Zero-cost re-partitioning with parallel access for elastic workloads.

Cloud-Native · Elastic Scaling · High Performance

Core Capabilities

Advanced data partitioning technology designed for modern cloud infrastructure

Cloud-Native

Targets cold raw data in object storage (e.g., Amazon S3) and exploits S3 byte-range reads for parallel, high-bandwidth access.

Read-Only Processing

Pre-processes data in a read-only fashion. Indexes and metadata are stored decoupled from the raw objects, keeping the cold data as-is.

Zero-Cost Partitioning

Lazy-evaluated partitions with zero-cost re-partitioning. Partitions are serializable for distributed computing frameworks such as PySpark, Dask, and Ray, as sketched below.
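For illustration, here is a minimal sketch of the lazy-partitioning flow. The class, import paths, and strategy names below (CloudObject, FASTA, partition_chunks_strategy) are assumptions modeled on DataPlug-style usage; consult the DataPlug documentation for the exact API.

from dataplug import CloudObject                                               # assumed import path
from dataplug.formats.genomics.fasta import FASTA, partition_chunks_strategy   # assumed names

# Point at a cold FASTA object in S3; nothing is downloaded yet.
co = CloudObject.from_s3(FASTA, "s3://my-bucket/sequences.fasta")

# One read-only pre-processing pass builds indexes and metadata,
# stored decoupled from the raw object.
co.preprocess()

# Partitioning is lazy: slices are lightweight, serializable descriptors,
# so re-partitioning with a different chunk count costs nothing.
slices = co.partition(partition_chunks_strategy, num_chunks=16)

# Bytes are fetched via S3 byte-range reads only when a slice is evaluated.
first_chunk = slices[0].get()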

Meet Data Cockpit

An interactive IPython widget built on top of the DataPlug framework. Upload, browse, benchmark, and partition your scientific data with a beautiful interface.

Interactive Jupyter Widget

Built on top of DataPlug's cloud-aware partitioning, Data Cockpit provides an end-to-end Jupyter UI for seamless data processing.
pip install cloud_data_cockpit
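A minimal notebook sketch follows. The DataCockpit class name is an assumption about how the widget is exposed; get_data_slices() is the retrieval call described in the workflow below.

from cloud_data_cockpit import DataCockpit   # class name is an assumption

cockpit = DataCockpit()   # renders the upload/browse/benchmark/partition UI
cockpit                   # last expression in the cell displays the widget

# After selecting and partitioning a dataset in the UI:
slices = cockpit.get_data_slices()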

What Data Cockpit Adds

Upload & Browse
Upload local files directly into any S3 bucket and browse existing datasets from the AWS Open Data Registry
Explore Collections
Explore curated public and Metaspace collections for scientific data discovery
Performance Benchmarking
Run benchmarks across configurable batch sizes to discover optimal throughput
One-Click Partitioning
Partition a variety of scientific data types into chunks or batches with one click
Jupyter Integration
Integrate seamlessly into Jupyter notebooks for elastic, parallel workloads

PyRun Cloud Platform

Effortless cloud computing for Python: experience true serverless Python and run scalable workloads for data processing, AI, and distributed computing without managing complex cloud infrastructure. Data Cockpit automatically obtains credentials for DataPlug to access your data, so you can focus purely on processing it, with no configuration overhead.

Serverless Python Execution

Focus on your code, not the setup. PyRun provides an integrated environment with automated scaling and powerful framework support.

Why Choose PyRun?

Effortless Execution
Write standard Python and run it seamlessly in the cloud. PyRun automatically handles server management, scaling, and resource optimization.
Integrated & Automated
VS Code-like web interface with automatic credential management. Data Cockpit handles all AWS/S3 configuration, so you can focus solely on data processing.
Scalable & Versatile
Built-in, first-class support for powerful frameworks like Lithops (FaaS) and Dask. Scale from simple scripts to massively parallel computations (see the sketch below).
Real-Time Monitoring
Gain instant insights into job performance with detailed metrics for CPU, memory, disk, network usage, and task execution timelines.
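Because PyRun treats Lithops as a first-class framework, a standard Lithops program is a natural fit. A minimal sketch, with an illustrative function and inputs:

import lithops

def double(x):
    return 2 * x

# On PyRun, the compute backend and credentials are resolved for you.
fexec = lithops.FunctionExecutor()
fexec.map(double, range(100))   # fan out 100 serverless tasks
results = fexec.get_result()    # [0, 2, 4, ...]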

Seamless Integration with DataPlug & Data Cockpit

  1. DataPlug Integration
    Direct integration with DataPlug for efficient data partitioning and processing
  2. Data Cockpit Interface
    Built-in Data Cockpit widget with automatic credential management for frictionless data access
  3. Cloud-Native Execution
    Execute DataPlug workflows directly in the cloud with automatic scaling
  4. Real-Time Monitoring
    Monitor DataPlug and Data Cockpit operations with detailed performance metrics
Complete Workflow
  • Write Python code with DataPlug and Data Cockpit
  • Deploy to PyRun cloud platform
  • Execute with automatic scaling
  • Monitor performance in real-time
  • Scale from simple scripts to massive computations

Your Workflow with Data Cockpit

  1. Upload
    Upload your local files directly into any S3 bucket
  2. Browse
    Browse existing buckets or public datasets from the AWS Open Data Registry
  3. Benchmark
    Run benchmarks across configurable batch sizes to find optimal throughput
  4. Process & Partition
    Process & partition your data with one click, displaying progress entirely in-notebook
  5. Retrieve Slices
    Retrieve partitions via get_data_slices() for downstream processing
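As a sketch of step 5, slices retrieved from the widget can be fanned out to any supported framework. Lithops is shown here; the process function and the slice-level .get() call are assumptions about the slice interface.

import lithops

slices = cockpit.get_data_slices()   # from the Data Cockpit widget above

def process(data_slice):
    data = data_slice.get()          # the byte-range fetch happens here, in the worker
    return len(data)

# Slices are lazy, serializable descriptors, so they ship cheaply to workers.
fexec = lithops.FunctionExecutor()
fexec.map(process, slices)
sizes = fexec.get_result()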
Why Data Cockpit?
  • Built on DataPlug's Cloud-Aware Partitioning
  • Pre-processes data in read-only fashion
  • Exploits S3 byte-range reads for parallel access
  • Supports multiple scientific domains
  • Allows re-partitioning with different strategies
  • Zero-cost re-partitioning, with no data movement

Supported Domains

Genomics

DNA/RNA sequencing data processing
FASTA
FASTQ
VCF

Geospatial

Spatial data and point clouds
LiDAR
Cloud-Optimized Point Cloud
COG

Metabolomics

Imaging mass spectrometry data
ImzML

Generic

Standard data formats
CSV
Raw Text

Astronomics

Astronomical measurement data
MeasurementSet

Format Examples

Explore real examples for each supported format. Each example includes working code and sample data.

Genomics (3 formats available)
  • FASTA: DNA/RNA sequences
  • FASTQ: Sequencing reads with quality scores
  • VCF: Variant Call Format

Geospatial (3 formats available)
  • LiDAR: Point cloud data
  • Cloud-Optimized Point Cloud: Optimized point cloud formats
  • COG: Cloud Optimized GeoTIFF

Metabolomics (1 format available)
  • ImzML: Imaging mass spectrometry

Generic (2 formats available)
  • CSV: Comma-separated values
  • Raw Text: Plain text files

Astronomics (1 format available)
  • MeasurementSet: Astronomical measurement data

How It Works

  1. Pre-processing
    Build lightweight indexes decoupled from the raw objects
  2. Data Slicing
    Create lazy-evaluated partitions with metadata
  3. Parallel Access
    Multiple workers perform HTTP GET byte-range requests
  4. Evaluation
    Data is accessed only when needed, not before
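To make step 3 concrete, here is a minimal, self-contained illustration of the access pattern using boto3. The bucket, key, chunk size, and range count are illustrative; DataPlug's actual workers derive their byte ranges from the pre-built indexes.

from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "data/sequences.fasta"
CHUNK = 64 * 1024 * 1024  # 64 MiB per worker

def read_range(i):
    # Each worker issues an independent HTTP GET with a Range header.
    start, end = i * CHUNK, (i + 1) * CHUNK - 1
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={start}-{end}")
    return resp["Body"].read()

# Disjoint ranges read in parallel aggregate S3 bandwidth across workers.
with ThreadPoolExecutor(max_workers=8) as pool:
    chunks = list(pool.map(read_range, range(8)))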
Compatible Frameworks
PySpark · Dask · Ray · Any Python
10+ data formats · Zero-cost re-partitioning · 100% parallel access

Ready to Get Started?

Join the community of scientists and engineers using DataPlug, Data Cockpit, and PyRun for efficient data partitioning.