DataPlug
Cloud-Aware Data Partitioning

Revolutionizing data partitioning for unstructured scientific data in cloud environments.
Zero-cost re-partitioning with parallel access for elastic workloads.

Cloud-Native · Elastic Scaling · High Performance

Core Capabilities

Advanced data partitioning technology designed for modern cloud infrastructure

Cloud-Native

Targets cold raw data in object storage (e.g., Amazon S3) and exploits S3 byte-range reads for parallel, high-bandwidth access.

Read-Only Processing

Pre-processes data in a read-only fashion. Indexes and metadata are stored decoupled from the raw objects, keeping the cold data as-is.

Zero-Cost Partitioning

Lazy-evaluated partitions with zero-cost re-partitioning. Partitions are serializable for distributed computing frameworks such as PySpark, Dask, and Ray, as sketched below.
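For illustration, here is a minimal sketch of the lazy-partitioning flow. The class, import paths, and strategy names below (CloudObject, FASTA, partition_chunks_strategy) are assumptions modeled on DataPlug-style usage; consult the DataPlug documentation for the exact API.

from dataplug import CloudObject                                               # assumed import path
from dataplug.formats.genomics.fasta import FASTA, partition_chunks_strategy   # assumed names

# Point at a cold FASTA object in S3; nothing is downloaded yet.
co = CloudObject.from_s3(FASTA, "s3://my-bucket/sequences.fasta")

# One read-only pre-processing pass builds indexes and metadata,
# stored decoupled from the raw object.
co.preprocess()

# Partitioning is lazy: slices are lightweight, serializable descriptors,
# so re-partitioning with a different chunk count costs nothing.
slices = co.partition(partition_chunks_strategy, num_chunks=16)

# Bytes are fetched via S3 byte-range reads only when a slice is evaluated.
first_chunk = slices[0].get()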

Meet Data Cockpit

An interactive IPython widget built on top of the DataPlug framework. Upload, browse, benchmark, and partition your scientific data with a beautiful interface.

Interactive Jupyter Widget

Built on top of DataPlug's cloud-aware partitioning, Data Cockpit provides an end-to-end Jupyter UI for seamless data processing.
pip install cloud_data_cockpit
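A minimal notebook sketch follows. The DataCockpit class name is an assumption about how the widget is exposed; get_data_slices() is the retrieval call described in the workflow below.

from cloud_data_cockpit import DataCockpit   # class name is an assumption

cockpit = DataCockpit()   # renders the upload/browse/benchmark/partition UI
cockpit                   # last expression in the cell displays the widget

# After selecting and partitioning a dataset in the UI:
slices = cockpit.get_data_slices()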

What Data Cockpit Adds

Upload & Browse
Upload local files directly into any S3 bucket and browse existing datasets from the AWS Open Data Registry
Explore Collections
Explore curated public and Metaspace collections for scientific data discovery
Performance Benchmarking
Run benchmarks across configurable batch sizes to discover optimal throughput
One-Click Partitioning
Partition a variety of scientific data types into chunks or batches with one click
Jupyter Integration
Integrate seamlessly into Jupyter notebooks for elastic, parallel workloads

PyRun Cloud Platform

Effortless cloud computing for Python: experience true serverless Python and run scalable workloads for data processing, AI, and distributed computing without managing complex cloud infrastructure. Data Cockpit automatically obtains credentials for DataPlug to access your data, so you can focus purely on processing it, with no configuration overhead.

Serverless Python Execution

Focus on your code, not the setup. PyRun provides an integrated environment with automated scaling and powerful framework support.

Why Choose PyRun?

Effortless Execution
Write standard Python and run it seamlessly in the cloud. PyRun automatically handles server management, scaling, and resource optimization.
Integrated & Automated
VS Code-like web interface with automatic credential management. Data Cockpit handles all AWS/S3 configuration, so you can focus solely on data processing.
Scalable & Versatile
Built-in, first-class support for powerful frameworks like Lithops (FaaS) and Dask. Scale from simple scripts to massively parallel computations (see the sketch below).
Real-Time Monitoring
Gain instant insights into job performance with detailed metrics for CPU, memory, disk, network usage, and task execution timelines.
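Because PyRun treats Lithops as a first-class framework, a standard Lithops program is a natural fit. A minimal sketch, with an illustrative function and inputs:

import lithops

def double(x):
    return 2 * x

# On PyRun, the compute backend and credentials are resolved for you.
fexec = lithops.FunctionExecutor()
fexec.map(double, range(100))   # fan out 100 serverless tasks
results = fexec.get_result()    # [0, 2, 4, ...]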

Seamless Integration with DataPlug & Data Cockpit

  1. DataPlug Integration
    Direct integration with DataPlug for efficient data partitioning and processing
  2. Data Cockpit Interface
    Built-in Data Cockpit widget with automatic credential management for frictionless data access
  3. Cloud-Native Execution
    Execute DataPlug workflows directly in the cloud with automatic scaling
  4. Real-Time Monitoring
    Monitor DataPlug and Data Cockpit operations with detailed performance metrics
Complete Workflow
  • Write Python code with DataPlug and Data Cockpit
  • Deploy to PyRun cloud platform
  • Execute with automatic scaling
  • Monitor performance in real-time
  • Scale from simple scripts to massive computations

Your Workflow with Data Cockpit

  1. Upload
    Upload your local files directly into any S3 bucket
  2. Browse
    Browse existing buckets or public datasets from the AWS Open Data Registry
  3. Benchmark
    Run benchmarks across configurable batch sizes to find optimal throughput
  4. Process & Partition
    Process & partition your data with one click, displaying progress entirely in-notebook
  5. Retrieve Slices
    Retrieve partitions via get_data_slices() for downstream processing
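As a sketch of step 5, slices retrieved from the widget can be fanned out to any supported framework. Lithops is shown here; the process function and the slice-level .get() call are assumptions about the slice interface.

import lithops

slices = cockpit.get_data_slices()   # from the Data Cockpit widget above

def process(data_slice):
    data = data_slice.get()          # the byte-range fetch happens here, in the worker
    return len(data)

# Slices are lazy, serializable descriptors, so they ship cheaply to workers.
fexec = lithops.FunctionExecutor()
fexec.map(process, slices)
sizes = fexec.get_result()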
Why Data Cockpit?
  • Built on DataPlug's Cloud-Aware Partitioning
  • Pre-processes data in read-only fashion
  • Exploits S3 byte-range reads for parallel access
  • Supports multiple scientific domains
  • Allows re-partitioning with different strategies
  • Zero-cost re-partitioning, with no data movement

Supported Domains

Genomics

DNA/RNA sequencing data processing
FASTA
FASTQ
VCF

Geospatial

Spatial data and point clouds
LiDAR
Cloud-Optimized Point Cloud
COG

Metabolomics

Imaging mass spectrometry data
ImzML

Generic

Standard data formats
CSV
Raw Text

Astronomics

Astronomical measurement data
MeasurementSet

Format Examples

Explore real examples for each supported format. Each example includes working code and sample data.

Genomics (3 formats available)
  • FASTA: DNA/RNA sequences
  • FASTQ: Sequencing reads with quality scores
  • VCF: Variant Call Format

Geospatial (3 formats available)
  • LiDAR: Point cloud data
  • Cloud-Optimized Point Cloud: Optimized point cloud formats
  • COG: Cloud Optimized GeoTIFF

Metabolomics (1 format available)
  • ImzML: Imaging mass spectrometry

Generic (2 formats available)
  • CSV: Comma-separated values
  • Raw Text: Plain text files

Astronomics (1 format available)
  • MeasurementSet: Astronomical measurement data

How It Works

  1. Pre-processing
    Build lightweight indexes decoupled from the raw objects
  2. Data Slicing
    Create lazy-evaluated partitions with metadata
  3. Parallel Access
    Multiple workers perform HTTP GET byte-range requests
  4. Evaluation
    Data is accessed only when needed, not before
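To make step 3 concrete, here is a minimal, self-contained illustration of the access pattern using boto3. The bucket, key, chunk size, and range count are illustrative; DataPlug's actual workers derive their byte ranges from the pre-built indexes.

from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "data/sequences.fasta"
CHUNK = 64 * 1024 * 1024  # 64 MiB per worker

def read_range(i):
    # Each worker issues an independent HTTP GET with a Range header.
    start, end = i * CHUNK, (i + 1) * CHUNK - 1
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={start}-{end}")
    return resp["Body"].read()

# Disjoint ranges read in parallel aggregate S3 bandwidth across workers.
with ThreadPoolExecutor(max_workers=8) as pool:
    chunks = list(pool.map(read_range, range(8)))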
Compatible Frameworks
PySpark · Dask · Ray · Any Python
10+ data formats · Zero-cost re-partitioning · 100% parallel access

Ready to Get Started?

Join the community of scientists and engineers using DataPlug, Data Cockpit, and PyRun for efficient data partitioning.