Additional Services

This section explains how a Datalab can be extended with more services and tools. Users can deploy helpful session-adjacent applications for daily work. Operators should keep durable services - databases, stores, registries, and backup-worthy data - explicit in the platform model.

A Datalab environment provides a preconfigured VS Code Server with a persistent file system and access to the connected object storage, along with essential CLI tools such as git, curl, aws, or rclone. While this already covers many data exploration and transformation needs, users often require more specialized tooling — for example, dashboards for visualization, services for experiment tracking, or out-of-process compute backends for scalable data processing.

Although many of these tools can be started directly from the integrated terminal and exposed via VS Code’s port forwarding feature, that approach is fragile and transient: you must manage Python environments carefully, avoid breaking dependencies during upgrades, and accept that anything started this way stops when the terminal session ends.

A more robust approach is to deploy such services as native Kubernetes applications — directly from within the Datalab. Because each Datalab session has access to the Kubernetes API (depending on the operator configuration), users can deploy workloads within their assigned namespace or, when running in vCluster mode, inside a fully isolated virtual cluster with their own CRDs, RBAC rules, and controllers. This enables running even complex frameworks that typically require cluster-wide resources — for example, a Dask Gateway.

For production-like state, prefer the declarative Datalab service fields where possible: spec.databases, spec.documentStores, spec.cacheStores, spec.vectorStores, and spec.registry. For object-storage buckets, use Provider Storage or another storage process, then pass the credentials to Provider Datalab. This keeps the durable parts visible to the operator.

Note: The kubectl and helm CLIs are preinstalled as well. You can apply manifests, install Helm charts, and inspect Kubernetes resources directly from the terminal.
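
Because API access depends on the operator configuration, it can help to check what your session is allowed to do before applying manifests. The sketch below wraps the `kubectl auth can-i` subcommand from Python. The helper names `can_i_command` and `can_i` are illustrative and not part of any Datalab tooling, and the code assumes kubectl is on the PATH as described in the note above.

```python
import subprocess


def can_i_command(verb, resource, namespace=None):
    """Build the argv for `kubectl auth can-i <verb> <resource>`."""
    cmd = ["kubectl", "auth", "can-i", verb, resource]
    if namespace:
        cmd += ["--namespace", namespace]
    return cmd


def can_i(verb, resource, namespace=None):
    """Return True if the API server answers "yes" for this action."""
    result = subprocess.run(
        can_i_command(verb, resource, namespace),
        capture_output=True,
        text=True,
        check=False,  # kubectl exits non-zero when the answer is "no"
    )
    return result.stdout.strip() == "yes"
```

For example, `can_i("create", "deployments")` covers the namespaced examples in this section, while `can_i("create", "customresourcedefinitions")` gives a hint as to whether installs that need cluster-wide resources are likely to succeed.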


Example: Deploying a Dask Cluster

The following example shows how to start a simple Dask scheduler and worker deployment directly inside your Datalab namespace. This provides a minimal distributed compute backend that you can connect to from Python via dask.distributed.Client.

kubectl apply -f - <<'EOF'
---
apiVersion: v1
kind: Service
metadata:
  name: dask-scheduler
spec:
  selector:
    app: dask-scheduler
  ports:
    - name: tcp-scheduler
      port: 8786
      targetPort: 8786
    - name: http-dashboard
      port: 8787
      targetPort: 8787
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dask-scheduler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dask-scheduler
  template:
    metadata:
      labels:
        app: dask-scheduler
    spec:
      containers:
        - name: scheduler
          image: daskdev/dask:2025.4.0
          args: ["dask-scheduler", "--dashboard-address", ":8787"]
          ports:
            - containerPort: 8786
            - containerPort: 8787
          resources:
            requests: {cpu: "500m", memory: "1Gi"}
            limits:   {cpu: "1",    memory: "2Gi"}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dask-worker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dask-worker
  template:
    metadata:
      labels:
        app: dask-worker
    spec:
      containers:
        - name: worker
          image: daskdev/dask:2025.4.0
          args: ["dask-worker", "tcp://dask-scheduler:8786", "--nthreads", "2", "--memory-limit", "2GB"]
          resources:
            requests: {cpu: "500m", memory: "1Gi"}
            limits:   {cpu: "1",    memory: "2Gi"}
EOF

Once the pods are running, forward the dashboard port and open it from the VS Code Ports tab to explore the Dask dashboard:

kubectl port-forward svc/dask-scheduler 8787:8787
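
To use the cluster from Python, the client needs the scheduler port (8786) rather than the dashboard port. The sketch below assumes you have additionally run `kubectl port-forward svc/dask-scheduler 8786:8786` and that the `distributed` package is installed in your Python environment, ideally at a version matching the image tag used above. `scheduler_address` and `run_demo` are illustrative helper names.

```python
def scheduler_address(host="127.0.0.1", port=8786):
    """Return the tcp:// address of the (port-forwarded) Dask scheduler."""
    return f"tcp://{host}:{port}"


def run_demo(address):
    """Submit one small task to the cluster and return its result."""
    # Imported lazily so this module loads even where dask is not installed.
    from dask.distributed import Client

    with Client(address) as client:
        # sum(range(10)) is executed on a worker, not in this process.
        future = client.submit(sum, range(10))
        return future.result()
```

Calling `run_demo(scheduler_address())` submits one small task to the worker and returns its result. Whether the in-cluster name `dask-scheduler` also resolves from your session depends on how the Datalab is set up, which is why the default here goes through the port-forward.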

You can also deploy Dask Gateway via Helm — this is only possible in vCluster mode, since it requires cluster-wide resources such as CRDs and RBAC cluster roles:

helm repo add dask https://helm.dask.org
helm repo update
helm upgrade --install dask-gateway dask/dask-gateway \
  -n "${DEFAULT_NAMESPACE:-default}" \
  --create-namespace \
  --set gateway.auth.type=simple \
  --set gateway.auth.simple.password='' \
  --set traefik.service.type=ClusterIP \
  --set gateway.backend.image.name=ghcr.io/dask/dask-gateway \
  --set gateway.backend.image.tag=2025.4.0 \
  --wait --atomic

Example: Deploying MLflow with Persistent Storage

MLflow is a popular experiment-tracking platform that complements data exploration workflows. The following example deploys an MLflow server whose metadata lives in a SQLite database on a PersistentVolumeClaim, while artifacts are served from the connected object-storage bucket.

Note: The PVC is bound to your Datalab session. Once the Datalab is deleted, the PVC and stored data will also be removed unless your operator configures a persistent storage backend. For durable MLflow metadata or artifacts, use a managed database and an object-storage bucket provisioned outside Provider Datalab, for example with Provider Storage.

export BUCKET=ws-frank # replace accordingly

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: minio-creds
type: Opaque
stringData:
  accessKey: "${AWS_ACCESS_KEY_ID}"
  secretKey: "${AWS_SECRET_ACCESS_KEY}"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: minio-config
data:
  endpoint: "${AWS_ENDPOINT_URL}"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mlflow
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow
spec:
  selector:
    app: mlflow
  ports:
    - name: http
      port: 5000
      targetPort: 5000
      protocol: TCP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
        - name: mlflow
          image: ghcr.io/mlflow/mlflow:latest
          command: ["/bin/sh","-lc"]
          args:
            - |
              python -m pip install --no-cache-dir --upgrade pip &&
              pip install --no-cache-dir boto3 &&
              exec mlflow server \
                --backend-store-uri sqlite:////mlflow/mlflow.db \
                --serve-artifacts \
                --artifacts-destination s3://"${BUCKET}"/mlruns \
                --host 0.0.0.0 --port 5000 \
                --workers 2 \
                --allowed-hosts '*' \
                --cors-allowed-origins '*'
          ports:
            - containerPort: 5000
          resources:
            requests: { cpu: "100m", memory: "512Mi" }
            limits:   { cpu: "300m", memory: "2Gi" }
          env:
            - name: MLFLOW_S3_ENDPOINT_URL
              valueFrom:
                configMapKeyRef: { name: minio-config, key: endpoint }
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef: { name: minio-creds, key: accessKey }
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef: { name: minio-creds, key: secretKey }
            - name: AWS_S3_FORCE_PATH_STYLE
              value: "true"
            - name: AWS_EC2_METADATA_DISABLED
              value: "true"
          volumeMounts:
            - name: data
              mountPath: /mlflow
          readinessProbe:
            httpGet: { path: "/", port: 5000 }
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet: { path: "/", port: 5000 }
            initialDelaySeconds: 20
            periodSeconds: 20
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: mlflow
EOF

Once the pod is running, forward the service port and open it from the VS Code Ports tab to explore the MLflow UI:

kubectl port-forward svc/mlflow 5000:5000

To use MLflow in your code, point the client at the port-forwarded tracking server by setting the following environment variable:

export MLFLOW_TRACKING_URI="http://127.0.0.1:5000"
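
A minimal sketch of logging a run against that server from Python follows. It assumes the `mlflow` package is installed in your environment and that the port-forward from the previous step is active. The experiment name `datalab-demo` and the logged values are placeholders, and `tracking_uri` and `log_demo_run` are illustrative helper names.

```python
import os


def tracking_uri(default="http://127.0.0.1:5000"):
    """Prefer MLFLOW_TRACKING_URI from the environment, else the local forward."""
    return os.environ.get("MLFLOW_TRACKING_URI", default)


def log_demo_run(experiment="datalab-demo"):
    """Create one run with a parameter and a metric, and return its run ID."""
    # Imported lazily so this module loads even where mlflow is not installed.
    import mlflow

    mlflow.set_tracking_uri(tracking_uri())
    mlflow.set_experiment(experiment)  # created on first use if missing
    with mlflow.start_run() as run:
        mlflow.log_param("learning_rate", 0.01)  # placeholder value
        mlflow.log_metric("accuracy", 0.9)       # placeholder value
        return run.info.run_id
```

After `log_demo_run()` returns, the run should appear under the chosen experiment in the MLflow UI.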

Example: Using psql to access and work with a managed database

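If you have created a database through the Workspace UI and your Datalab provides database credentials (via POSTGRES_<HOST>_<DATABASE>_URL_EXTERNAL environment variables), you can immediately connect to the managed PostgreSQL instance by starting a pod that drops you directly into a psql shell. The example uses POSTGRES_PG0_DEV_URL_EXTERNAL; replace it with the variable for your own database. envsubst is restricted to that one variable so that the $POSTGRES_URL_EXTERNAL reference inside the pod command is left for the container shell to expand instead of being replaced with an empty string.

envsubst '${POSTGRES_PG0_DEV_URL_EXTERNAL}' <<'EOF' | kubectl apply -f -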
apiVersion: v1
kind: Pod
metadata:
  name: pgsql-debug
spec:
  restartPolicy: Never
  containers:
  - name: psql
    image: postgres:17
    stdin: true
    tty: true
    command:
      - sh
      - -lc
      - |
        exec psql "$POSTGRES_URL_EXTERNAL"
    env:
      - name: POSTGRES_URL_EXTERNAL
        value: ${POSTGRES_PG0_DEV_URL_EXTERNAL}
EOF

Now attach to the pod.

kubectl attach -it pgsql-debug

Then run your psql commands, for example:

\set ON_ERROR_STOP on
\l

DROP TABLE IF EXISTS demo;

CREATE TABLE demo (
    id serial PRIMARY KEY,
    name text,
    created_at timestamptz DEFAULT now(),
    meta jsonb
);

INSERT INTO demo (name, meta) VALUES
('alpha', '{"type":"test","value":1}'),
('beta',  '{"type":"test","value":2}'),
('gamma', '{"type":"prod","value":3}');

SELECT * FROM demo;
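
If you are unsure which database variables your session exposes, the naming pattern mentioned at the start of this example can be scanned from Python. This is a sketch using only the standard library. It assumes the values are `postgresql://` style connection URLs, and `find_postgres_urls` and `redact` are illustrative names.

```python
import os
import re
from urllib.parse import urlsplit

# Matches POSTGRES_<HOST>_<DATABASE>_URL_EXTERNAL; host and database are kept
# together as one key because the underscore split is ambiguous.
PATTERN = re.compile(r"^POSTGRES_(?P<key>.+)_URL_EXTERNAL$")


def find_postgres_urls(environ=None):
    """Map each '<HOST>_<DATABASE>' key to its connection URL."""
    environ = os.environ if environ is None else environ
    found = {}
    for name, value in environ.items():
        match = PATTERN.match(name)
        if match:
            found[match.group("key")] = value
    return found


def redact(url):
    """Return the URL without its password, for safe printing."""
    parts = urlsplit(url)
    user = f"{parts.username}@" if parts.username else ""
    host = parts.hostname or ""
    port = f":{parts.port}" if parts.port else ""
    return f"{parts.scheme}://{user}{host}{port}{parts.path}"
```

Printing `redact(url)` instead of the raw value avoids leaking the password into terminal scrollback or notebooks.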

Summary

Additional services fall into two operational categories:

  • User-deployed, session-adjacent services such as the Dask and MLflow examples above. These are useful for exploration and are cleaned up as part of the workspace lifecycle.
  • Platform-managed services requested through the Datalab spec, such as PostgreSQL databases, document stores, Redis key-value/cache stores, Qdrant vector stores, and Docker registries. These should be treated as operator-owned resources with explicit backup, monitoring, upgrade, and retirement policies. Object-storage buckets are the exception here: create them through Provider Storage or another storage process, then wire their credentials into the Datalab.

The goal is not to prevent users from experimenting. The goal is to make sure that durable state does not become invisible. If a service matters beyond the current session, promote it into the declarative platform model so the operator knows it exists and can take responsibility for it.