Additional Services
This section explains how end users can extend the default provider-datalab capabilities by deploying additional services and tools that support their daily workflows.
A Datalab environment provides a preconfigured VS Code Server with a persistent file system and access to the connected object storage, along with essential CLI tools such as git, curl, aws, or rclone.
While this already covers many data exploration and transformation needs, users often require more specialized tooling — for example, dashboards for visualization, services for experiment tracking, or out-of-process compute backends for scalable data processing.
Although many of these tools can be started directly from the integrated terminal and exposed via VS Code’s port forwarding feature, that approach tends to be fragile and transient - you must carefully manage Python environments, avoid breaking dependencies during upgrades, and remember that the terminal session lifetime is temporary.
A more robust approach is to deploy such services as native Kubernetes applications — directly from within the Datalab. Because each Datalab session has access to the Kubernetes API (depending on the operator configuration), users can deploy workloads within their assigned namespace or, when running in vCluster mode, inside a fully isolated virtual cluster with their own CRDs, RBAC rules, and controllers. This enables running even complex frameworks that typically require cluster-wide resources — for example, a Dask Gateway.
Note: The
kubectlandhelmCLIs are preinstalled as well. You can apply manifests, install Helm charts, and inspect Kubernetes resources directly from the terminal.
Example: Deploying a Dask Cluster
The following example shows how to start a simple Dask scheduler and worker deployment directly inside your Datalab namespace.
This provides a minimal distributed compute backend that you can connect to from Python via dask.distributed.Client.
Click to expand: Deploy Dask
kubectl apply -f - <<'EOF'
---
apiVersion: v1
kind: Service
metadata:
name: dask-scheduler
spec:
selector:
app: dask-scheduler
ports:
- name: tcp-scheduler
port: 8786
targetPort: 8786
- name: http-dashboard
port: 8787
targetPort: 8787
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: dask-scheduler
spec:
replicas: 1
selector:
matchLabels:
app: dask-scheduler
template:
metadata:
labels:
app: dask-scheduler
spec:
containers:
- name: scheduler
image: daskdev/dask:2025.4.0
args: ["dask-scheduler", "--dashboard-address", ":8787"]
ports:
- containerPort: 8786
- containerPort: 8787
resources:
requests: {cpu: "500m", memory: "1Gi"}
limits: {cpu: "1", memory: "2Gi"}
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: dask-worker
spec:
replicas: 1
selector:
matchLabels:
app: dask-worker
template:
metadata:
labels:
app: dask-worker
spec:
containers:
- name: worker
image: daskdev/dask:2025.4.0
args: ["dask-worker", "tcp://dask-scheduler:8786", "--nthreads", "2", "--memory-limit", "2GB"]
resources:
requests: {cpu: "500m", memory: "1Gi"}
limits: {cpu: "1", memory: "2Gi"}
EOF
Once running, you can port-forward and use the VS Code Ports tab to explore the Dask dashboard:
kubectl port-forward svc/dask-scheduler 8787:8787
You can also deploy Dask Gateway via Helm — this is only possible in vCluster mode, since it requires cluster-wide resources such as CRDs and RBAC cluster roles:
helm repo update
helm upgrade --install dask-gateway dask/dask-gateway -n "${DEFAULT_NAMESPACE:-default}" --create-namespace --set gateway.auth.type=simple --set gateway.auth.simple.password='' --set traefik.service.type=ClusterIP --set gateway.backend.image.name=ghcr.io/dask/dask-gateway --set gateway.backend.image.tag=2025.4.0 --wait --atomic
Example: Deploying MLflow with Persistent Storage
MLflow is a popular experiment-tracking platform that complements data exploration workflows.
The following example deploys an MLflow server together with a simple SQLite backend and a PersistentVolumeClaim for artifact and metadata storage.
Note: The PVC is bound to your Datalab session.
Once the Datalab is deleted, the PVC and stored data will also be removed unless your operator configures a persistent storage backend.
Click to expand: Deploy MLflow
export BUCKET=ws-frank # replace accordingly
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
name: minio-creds
type: Opaque
stringData:
accessKey: "${AWS_ACCESS_KEY_ID}"
secretKey: "${AWS_SECRET_ACCESS_KEY}"
---
apiVersion: v1
kind: ConfigMap
metadata:
name: minio-config
data:
endpoint: "${AWS_ENDPOINT_URL}"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: mlflow
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
name: mlflow
spec:
selector:
app: mlflow
ports:
- name: http
port: 5000
targetPort: 5000
protocol: TCP
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: mlflow
spec:
replicas: 1
selector:
matchLabels:
app: mlflow
template:
metadata:
labels:
app: mlflow
spec:
containers:
- name: mlflow
image: ghcr.io/mlflow/mlflow:latest
command: ["/bin/sh","-lc"]
args:
- |
python -m pip install --no-cache-dir --upgrade pip &&
pip install --no-cache-dir boto3 &&
exec mlflow server \
--backend-store-uri sqlite:////mlflow/mlflow.db \
--serve-artifacts \
--artifacts-destination s3://"${BUCKET}"/mlruns \
--host 0.0.0.0 --port 5000 \
--workers 2 \
--allowed-hosts '*' \
--cors-allowed-origins '*'
ports:
- containerPort: 5000
resources:
requests: { cpu: "100m", memory: "512Mi" }
limits: { cpu: "300m", memory: "2Gi" }
env:
- name: MLFLOW_S3_ENDPOINT_URL
valueFrom:
configMapKeyRef: { name: minio-config, key: endpoint }
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef: { name: minio-creds, key: accessKey }
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef: { name: minio-creds, key: secretKey }
- name: AWS_S3_FORCE_PATH_STYLE
value: "true"
- name: AWS_EC2_METADATA_DISABLED
value: "true"
volumeMounts:
- name: data
mountPath: /mlflow
readinessProbe:
httpGet: { path: "/", port: 5000 }
initialDelaySeconds: 5
periodSeconds: 10
livenessProbe:
httpGet: { path: "/", port: 5000 }
initialDelaySeconds: 20
periodSeconds: 20
volumes:
- name: data
persistentVolumeClaim:
claimName: mlflow
EOF
Once running, you can port-forward and use the VS Code Ports tab to explore the MLflow UI:
kubectl port-forward svc/mlflow 5000:5000
To use MLflow in your code, you need to connect to the tracking server running at http://localhost:5000. This can be done by setting the following environment variable:
export MLFLOW_TRACKING_URI="http://127.0.0.1:5000"
Summary
In its current form, provider-datalab focuses on deploying ephemeral or stateless services on Kubernetes in a seamless and reproducible way. These services are tied to the Datalab session lifecycle, ensuring automatic cleanup and cost efficiency when sessions are terminated.
However, if your operator provides additional storage capabilities — for example:
- persistent block storage (via Kubernetes
StorageClass) - relational databases (PostgreSQL, MySQL)
- key–value stores (Redis, etcd)
then more complex, stateful workloads can also be supported. Such setups, however, come with additional maintenance effort and require clear alignment of responsibilities between operators and end users. Making these integrations easier and more declarative is planned for future releases.