Provider Datalab – Usage & Concepts
This section explains how to use the provider-datalab configuration packages once they are installed. It focuses on the concepts of Sessions, Files, vclusters, Storage Secrets, Databases and the optional Keycloak integration for identity and access.
Concepts
Sessions
A Datalab claim may declare one or more spec.sessions.
- If at least one session is listed, a corresponding WorkshopSession is automatically created and will run permanently until stopped by the operator.
- If no sessions are given, no session object is pre-created. The shared runtime namespace and non-session resources can still be reconciled and tested without a WorkshopSession.
Sessions can also be patched into the spec later if needed.
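As a minimal sketch of patching a session in later, a merge patch only needs to carry the spec.sessions list. The session name default is the one used in the examples of this document; the file name sessions-patch.yaml is a hypothetical choice.

```yaml
# sessions-patch.yaml (hypothetical file name)
# Merge patch that sets the session list of an existing Datalab
# to a single session called "default".
spec:
  sessions:
    - default
```

Such a file could be applied with kubectl patch datalab s-joe -n workspace --type merge --patch-file sessions-patch.yaml, where the claim name and namespace are taken from the examples in this document. Note that a merge patch replaces the whole list, so any sessions that should remain must be listed again.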
Persistence
Each Datalab session is equipped with a persistent volume for storing files, in addition to the connected object storage. This ensures that user data and session state are preserved even if the workshop pod is restarted or rescheduled by Kubernetes. Installing code libraries, handling metadata, or working with Git repositories often generates many small files that may be updated frequently. A storage class providing NFS-like capabilities is usually a good fit for these kinds of workloads; object storage abstractions are not.
The persistent volume claim (PVC) is tied to the active session and is deleted automatically when the workshop session shuts down (for example, through a culling process when using session mode auto). This does not necessarily mean that data is lost. When the session is restarted from the same manifests, Kubernetes recreates the PVC with the same name. In environments that use an NFS server or another shared storage backend, the new PVC points to the same physical folder and therefore reattaches to the existing data.
This behavior works as long as the associated StorageClass has its reclaimPolicy set to Retain (not Delete), so that the backing data is not removed when the claim is deleted. It also depends on maintaining a consistent link between the PVC name and the actual storage path. If the underlying storage system assigns randomized volume identifiers (such as UIDs for folder paths), the data still remains on the storage backend after the session ends, but Kubernetes will not automatically reattach it to a new PVC. Manual reassociation may then be required.
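The two conditions above (a Retain reclaim policy and a stable mapping from PVC name to storage path) can be sketched as a StorageClass. This is an illustration under assumptions, not the platform's actual configuration: it assumes the csi-driver-nfs provisioner, and the class name, server, and share are placeholders.

```yaml
# Hypothetical StorageClass for session volumes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: datalab-nfs-retain        # hypothetical name
provisioner: nfs.csi.k8s.io       # assumes the csi-driver-nfs provisioner is installed
reclaimPolicy: Retain             # keep the backing data when the PVC is deleted
volumeBindingMode: Immediate
parameters:
  server: nfs.example.internal    # placeholder NFS server
  share: /exports/datalabs        # placeholder export path
  # Assumption: the driver supports a subDir template, so the folder is
  # derived from the PVC identity rather than from a random volume ID.
  subDir: ${pvc.metadata.namespace}/${pvc.metadata.name}
```

How a storage class is selected for session volumes depends on the environment configuration and is not shown here.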
Database
Many Datalab workloads require a stateful database in addition to files and object storage, for example metadata catalogs or application backends.
Instead of running databases inside sessions, Datalabs attach to a platform-managed database cluster. The platform creates logical databases inside that cluster and provisions credentials automatically.
spec:
  databases:
    pg0:
      names:
        - dev
        - prod
      storage: 1Gi
      backupStorage: 3Gi
- pg0 - target database cluster managed by the platform
- names - logical databases created inside the cluster
- storage - persistent storage allocation
- backupStorage - space reserved for backups
The platform automatically:
- creates databases and users
- stores credentials in a Secret
- injects connection details into sessions
- performs backups
- keeps data independent from session lifecycle
This keeps compute ephemeral while database state remains durable.
If a Kubernetes gateway service is running in the cluster and enabled in the global configuration, the database can also be exposed externally. In that case, corresponding environment variables such as the external hostname or external URL are injected into the session as well.
Note: The Postgres endpoint is exposed through a gateway TLSRoute, which requires immediate TLS with SNI (direct TLS). The PostgreSQL server and libpq-based clients (e.g. psql, psycopg) support this from PostgreSQL 17 onward, where libpq exposes it as the sslnegotiation=direct connection parameter. However, some non-libpq drivers such as asyncpg do not yet implement this negotiation correctly and may fail during connection setup.
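To illustrate why direct TLS is needed, the sketch below shows the general shape of a Gateway API TLSRoute. The real route is generated by the platform; every name, hostname, and the namespace here is a hypothetical placeholder.

```yaml
# Illustrative only: the actual route is created by the platform.
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: TLSRoute
metadata:
  name: pg0-external              # hypothetical
  namespace: workspace
spec:
  parentRefs:
    - name: platform-gateway      # hypothetical Gateway with a TLS listener
  hostnames:
    - pg0.datalab.acme.org        # hypothetical hostname matched against SNI
  rules:
    - backendRefs:
        - name: pg0-primary       # hypothetical PostgreSQL Service
          port: 5432
```

The gateway selects the backend from the SNI hostname in the TLS ClientHello, so the client has to open the connection with TLS right away instead of starting with the plain-text PostgreSQL SSL request.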
Document, Cache, and Vector Stores
For non-relational workloads, a Datalab can also provision optional document, cache, and vector stores:
spec:
  documentStores:
    prod:
      storage: 1Gi
  cacheStores:
    prod:
      storage: 1Gi
  vectorStores:
    prod:
      storage: 1Gi
- documentStores provisions MongoDBCommunity resources (mongodbcommunity.mongodb.com/v1).
- cacheStores provisions Redis resources (redis.redis.opstreelabs.in/v1beta2).
- vectorStores provisions QdrantCluster resources (qdrant.io/v1alpha1).
- Access credentials are created as namespaced Secrets with predictable names:
  - Mongo: <store>-mongodb-auth (key: password)
  - Redis: <store>-redis-auth (key: password)
  - Qdrant: <store>-qdrant-auth (keys: apiKey, readApiKey)
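Because the Secret names are predictable, a workload in the runtime namespace can reference them directly. The following is a minimal sketch for a document store named prod; the Pod name, image, and environment variable name are hypothetical.

```yaml
# Hypothetical workload that reads the MongoDB password from the generated Secret.
apiVersion: v1
kind: Pod
metadata:
  name: mongo-client-example                      # hypothetical
spec:
  containers:
    - name: app
      image: registry.example.com/my-app:latest   # placeholder image
      env:
        - name: MONGODB_PASSWORD                  # variable name chosen for illustration
          valueFrom:
            secretKeyRef:
              name: prod-mongodb-auth             # <store>-mongodb-auth with store "prod"
              key: password
```

The same pattern would apply to prod-redis-auth (key password) and prod-qdrant-auth (keys apiKey and readApiKey).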
Authentication
Provider Datalab is a building block for workspace provisioning. It can wire authentication into the runtime, but the stronger and more flexible pattern is often to delegate user authentication to the surrounding platform, especially at the ingress layer.
Multiple options are possible:
- Enable built-in runtime authentication. By default, auth.type = credentials uses the same credentials that are used to access the connected object storage buckets for session login. This is a simple basic-auth style option, but it ties workspace users to the credentials known by the Datalab runtime.
- Set auth.type = none and let another platform component protect access before requests reach the workspace. This does not mean that unauthenticated access is required; it means authentication is delegated to another layer, such as the Kubernetes ingress controller.
Delegating authentication is often more flexible because users accessing a workspace do not necessarily have to exist in Keycloak. For example, a workspace ingress can be protected by oauth2-proxy with NGINX ingress annotations:
nginx.ingress.kubernetes.io/auth-url: "https://auth.acme.org/oauth2/auth"
nginx.ingress.kubernetes.io/auth-signin: "https://auth.acme.org/oauth2/start?rd=$escaped_request_uri"
Those annotations can be added by platform policy instead of being specified in every Datalab. One option is a Kyverno mutation policy:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: annotate-datalab-ingresses
spec:
  background: false
  rules:
    - name: add-oauth2-proxy-annotations-for-datalab-domain
      match:
        any:
          - resources:
              kinds:
                - Ingress
      preconditions:
        all:
          - key: "{{ request.object.spec.rules[?ends_with(host, '.datalab.acme.org')] | length(@) }}"
            operator: GreaterThan
            value: 0
      mutate:
        patchStrategicMerge:
          metadata:
            annotations:
              nginx.ingress.kubernetes.io/auth-url: "https://auth.acme.org/oauth2/auth"
              nginx.ingress.kubernetes.io/auth-signin: "https://auth.acme.org/oauth2/start?rd=$escaped_request_uri"
Kyverno is only one way to apply this policy. The same result can be achieved with a mutating admission webhook or any other platform automation that consistently annotates the generated Ingress resources.
Keycloak-managed access is supported. When it is used, the composition automatically provisions the Keycloak client, groups, roles, role bindings, and memberships needed for the workspace.
Files and the Workshop Tab
The spec.files array is optional.
- When empty or omitted, no workshop tab is rendered in the Educates UI.
- When at least one source is defined, workshop and/or data content is mounted and the tab is enabled.
Supported sources:
- OCI image (spec.files[].image)
- Git repository (spec.files[].git)
- HTTP(S) download (spec.files[].http)
Filters (includePaths, excludePaths, newRootPath, path) control what ends up visible.
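The Git variant appears in a full example later in this document. As a sketch of the other two source types, the fragment below assumes that image and http take a url sub-field in the same way git does; that sub-field, the URLs, and the target path are assumptions for illustration.

```yaml
spec:
  files:
    # OCI image source; the url sub-field is assumed to mirror git.url
    - image:
        url: registry.example.com/workshops/intro:latest   # placeholder
      includePaths:
        - /workshop/**
    # HTTP(S) download; url sub-field assumed as above
    - http:
        url: https://example.com/datasets/sample.tar.gz    # placeholder
      path: data                                           # hypothetical target folder
```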
vcluster toggle
spec.vcluster is a boolean flag.
- true → the datalab provisions a vcluster for runtime isolation.
- false → workloads run directly in the namespace.
Storage Secret
A Datalab requires credentials to an S3-compatible storage system.
Credentials are expected to exist in a Kubernetes Secret named via spec.secretName, in the same namespace as the Datalab.
This secret must include at least the access_key and access_secret. The endpoint and provider are defined in EnvironmentConfig.data.storage.
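A minimal sketch of such a Secret is shown below, using the two key names stated above. The name s-joe and the namespace workspace follow the examples in this document, and the values are placeholders to be replaced.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: s-joe                 # must match spec.secretName of the Datalab
  namespace: workspace        # same namespace as the Datalab
type: Opaque
stringData:
  access_key: REPLACE_WITH_ACCESS_KEY       # placeholder
  access_secret: REPLACE_WITH_SECRET_KEY    # placeholder
```

Using stringData lets Kubernetes handle the base64 encoding when the Secret is stored.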
Security and Access Policy
The spec.security section controls access permissions and runtime privilege level for sessions.
Key fields:
- policy: defines the Pod Security Standard (restricted, baseline, privileged). privileged enables Docker-in-Docker with 20 Gi of local storage.
- kubernetesAccess: whether a Kubernetes service account token is mounted inside the session.
- kubernetesRole: defines the in-namespace RBAC level (admin, edit, view).
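As a sketch of how these fields combine, the fragment below describes a session without Docker that may only read Kubernetes resources in its namespace. The chosen values are an illustrative combination, not a recommended default.

```yaml
spec:
  security:
    policy: baseline          # no Docker-in-Docker; only "privileged" enables it
    kubernetesAccess: true    # mount a service account token in the session
    kubernetesRole: view      # read-only RBAC level within the namespace
```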
Resource Quotas
The spec.quota section allows per-Datalab overrides of default compute and storage budgets.
- memory: memory allocation per session (default 2 Gi).
- storage: persistent volume size (default 1 Gi).
- budget: Educates resource budget profile (small, medium, large, x-large, etc.).
When unspecified, defaults from the EnvironmentConfig apply.
| Budget | CPU | Memory |
|---|---|---|
| small | 1000m | 1Gi |
| medium | 2000m | 2Gi |
| large | 4000m | 4Gi |
| x-large | 8000m | 8Gi |
| xx-large | 8000m | 12Gi |
| xxx-large | 8000m | 16Gi |
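A per-Datalab override could look like the following sketch. The values are illustrative; any field left out would fall back to the EnvironmentConfig defaults.

```yaml
spec:
  quota:
    memory: 4Gi       # per-session memory, overriding the 2 Gi default
    storage: 10Gi     # persistent volume size, overriding the 1 Gi default
    budget: large     # budget profile: 4000m CPU and 4Gi memory
```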
Identity and Keycloak Resources
When Keycloak-managed access is used, users listed under spec.users must already exist in Keycloak.
When a Datalab is created for that pattern, the composition automatically provisions the required Keycloak resources:
- Groups for the datalab and datalab administrators
- Group memberships for the listed users
- A dedicated OAuth2 client
- User and admin roles, plus the role bindings for the generated groups
This ensures that authentication and authorization are consistently enforced across the runtime and UI. If authentication is delegated to the ingress or another platform component, the identities allowed through that outer layer are managed by that component and do not necessarily have to be users in the Datalab Keycloak realm.
Example: Joe (no session by default)
# Joe gets a personal datalab s-joe with no pre-created session.
# He must explicitly start a session himself; nothing is running by default.
# No vcluster is provisioned and no workshop files are attached.
# Credentials to storage are expected to exist in a secret "s-joe" in the same namespace.
# A Keycloak group, role, and client are created; user "joe" must exist in Keycloak.
apiVersion: pkg.internal/v1beta1
kind: Datalab
metadata:
  name: s-joe
spec:
  users:
    - joe
  secretName: s-joe
- Joe’s Datalab exists but is idle until he launches a session.
- Useful for lightweight, on-demand environments.
- Keycloak ensures Joe is authorized to access his workspace.
Example: Jeff, Jim, and Jane (shared store validation, privileged with Docker)
# Jeff (owner), Jim (admin) and Jane (user) share a datalab s-jeff with no pre-created session.
# This is the canonical shared store-validation example: the lab stays sessionless by default.
# The lab does not use a vcluster and has no workshop files.
# Credentials to storage are expected to exist in a secret "s-jeff" in the same namespace.
# A Keycloak group, role, and client are created; users "jeff", "jim" and "jane" must exist in Keycloak.
# This configuration runs the lab in privileged mode:
# - Security policy: "privileged" → automatically enables Docker with 20 Gi workspace storage.
# - Docker registry is disabled for this shared example.
# - Session quota: increased to 6 Gi memory, 1 Gi storage, budget class "x-large".
# - Kubernetes API access is disabled (kubernetesAccess=false).
# The data component for the object storage mount and browser UI is disabled.
# Additionally, two PostgreSQL databases are provisioned for the lab: "prod" and "dev".
# Additionally, one MongoDB-backed document store is provisioned:
# - prod with 1 Gi storage
# Additionally, one Redis-backed cache store is provisioned:
# - prod with 1 Gi storage
# Additionally, one Qdrant-backed vector store is provisioned:
# - prod with 1 Gi storage
# Access credentials are generated as secrets in the runtime namespace:
# - MongoDB: <store>-mongodb-auth
# - Redis: <store>-redis-auth
# - Qdrant: <store>-qdrant-auth
apiVersion: pkg.internal/v1beta1
kind: Datalab
metadata:
  name: s-jeff
spec:
  users:
    - jeff
    - jim
    - jane
  userOverrides:
    jim:
      grantedAt: "2025-01-10T19:00:00Z"
      role: admin
  secretName: s-jeff
  sessions: []
  vcluster: false
  data:
    enabled: false
  quota:
    memory: 6Gi
    storage: 1Gi
    budget: x-large
  files: []
  security:
    policy: privileged
    kubernetesAccess: false
  registry:
    enabled: false
    storage: 3Gi
  documentStores:
    prod:
      storage: 1Gi
  cacheStores:
    prod:
      storage: 1Gi
  vectorStores:
    prod:
      storage: 1Gi
  databases:
    pg0:
      names:
        - dev
        - prod
      storage: 1Gi
      backupStorage: 3Gi
- No WorkshopSession is pre-created for this shared example. The runtime namespace and backing services can be validated without a session pod.
- Runs in privileged mode with Docker support and increased ephemeral disk (20 Gi).
- No Kubernetes API access is granted inside the environment. The shared example leaves the registry disabled.
- Access is secured through the corresponding Keycloak group and role.
Example: Jane (isolated vcluster with admin role and higher quota)
# Jane runs a datalab s-jane with a default session automatically created.
# That session will run permanently until stopped by the operator,
# and a dedicated vcluster is provisioned for runtime isolation.
# No workshop files are attached. Credentials to storage are expected
# to exist in a secret "s-jane" in the same namespace.
# A Keycloak group, role, and client are created; user "jane" must exist in Keycloak.
# This configuration explicitly overrides default resource quotas and security settings:
# - Security policy: "privileged" → automatically enables Docker with 20 Gi workspace storage.
# - Docker registry is enabled with 3 Gi storage.
# - Session quota: increased to 4 Gi memory, 40 Gi storage, budget class "x-large".
# - Kubernetes role: elevated to "admin" for full namespace permissions.
# The data component for the object storage mount and browser UI is configured as readonly.
# Additionally, one PostgreSQL database is provisioned for the lab: "analytics".
apiVersion: pkg.internal/v1beta1
kind: Datalab
metadata:
  name: s-jane
spec:
  users:
    - jane
  secretName: s-jane
  sessions:
    - default
  vcluster: true
  data:
    readOnlyMount: true
  quota:
    memory: 4Gi
    storage: 40Gi
    budget: x-large
  registry:
    enabled: true
    storage: 3Gi
  security:
    policy: privileged
    kubernetesRole: admin
  databases:
    pg0:
      names:
        - analytics
      storage: 1Gi
      backupStorage: 3Gi
- Jane’s workloads run inside an isolated virtual cluster (vcluster: true).
- The lab also runs in privileged mode, which enables Docker with 20 Gi of session-local workspace storage.
- The admin role grants full control within her namespace/vcluster.
- This is the registry-enabled example, so session-backed registry behavior can be validated here.
- Suitable for advanced development or testing requiring full Kubernetes control.
- Keycloak enforces role-based access protection for this lab.
Example: John (with Git-based workshop files)
# John has a datalab s-john with a default session automatically created.
# That session will run permanently until stopped by the operator.
# No vcluster is provisioned. Workshop and data files are pulled from Git,
# enabling the workshop tab in the Educates UI.
# Credentials to storage are expected in a secret "s-john" in the same namespace.
# A Keycloak group, role, and client are created; user "john" must exist in Keycloak.
apiVersion: pkg.internal/v1beta1
kind: Datalab
metadata:
  name: s-john
spec:
  users:
    - john
  secretName: s-john
  sessions:
    - default
  vcluster: false
  files:
    - git:
        url: https://github.com/versioneer-tech/datalab-example
        ref: origin/main
      includePaths:
        - /workshop/**
        - /data/**
        - /README.md
      path: .
- Preloads workshop materials from Git.
- Activates the workshop tab in the UI for guided exercises.
- Keycloak ensures only John has access to this environment and tooling.
Verifying Provisioning
Once a Datalab claim has been applied, you can verify that the provisioning worked.
Check Composite Status
kubectl get datalabs -n workspace
You should see all Datalabs READY=True once reconciliation is complete:
NAME     SYNCED   READY   COMPOSITION        AGE
s-joe    True     True    datalab-educates   2m
s-jeff   True     True    datalab-educates   2m
s-jane   True     True    datalab-educates   2m
s-john   True     True    datalab-educates   2m
Inspect details:
kubectl describe datalab s-jeff -n workspace
Look for conditions like Ready=True and any event messages.
Find the Storage Secret
Each Datalab references a Secret in the same namespace via spec.secretName.
For example, the claim s-jeff with secretName: s-jeff requires a Secret named s-jeff.
kubectl get secret s-jeff -n workspace -o yaml
Decode credentials (AWS-style):
kubectl get secret s-jeff -n workspace -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d; echo
kubectl get secret s-jeff -n workspace -o jsonpath='{.data.AWS_SECRET_ACCESS_KEY}' | base64 -d; echo
Connect to Databases
Starting with version 0.3.0, databases can be provisioned on a dedicated PostgreSQL host. Optionally, these databases can also be exposed externally using a TLSRoute, enabling secure access from outside the cluster. External exposure requires a Kubernetes Gateway Controller that operates at Layer-4, such as Envoy.
All additional users are created as regular database roles with limited privileges. Full administrative access is provided through the built-in postgres superuser account. This account can create extensions, manage schemas, and grant permissions to other users as needed.
Database credentials are managed by the PostgreSQL operator and stored as Kubernetes Secrets. To locate the credentials for database users, look for Secrets matching *-pguser-*. These Secrets contain the connection details required to authenticate against the corresponding PostgreSQL roles.
Summary
- A Datalab defines users, sessions, optional vcluster, quotas, and security policies.
- Security controls combine Pod Security Standards, Kubernetes roles, and Docker privilege toggles.
- Each Datalab requires a storage credential Secret.
- For Keycloak-managed access, users must already exist in Keycloak; the Datalab provisions groups, memberships, a client, roles, and role bindings.
- For delegated access, auth.type = none leaves authentication to the ingress layer or another platform component.
- Sessions may be long-lived (auto-created) or on-demand (user started).
- Workshop files enable the Educates UI workshop tab.
- Check kubectl get datalabs for readiness and confirm Secret and Keycloak resource creation where applicable.