# ClickHouse Infrastructure & Setup
This document details the ClickHouse database infrastructure running in our GKE clusters. It covers the deployment architecture, tenant management, integration with Cloud SQL, and backup strategies.
## Overview
ClickHouse is deployed on Kubernetes (GKE) using the Altinity ClickHouse Operator. It serves as our primary analytics engine.
- Environments: Staging and Production (each in its own GCP project).
- Cluster Topology:
  - ClickHouse: 1 shard, 2 replicas (high availability).
  - Keeper: 3 replicas (ZooKeeper replacement for coordination).
- Operator: `altinity-clickhouse-operator` manages the cluster state.
- Storage: Persistent Volumes using the `premium-rwo-cmek` storage class.
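Under this topology, the operator-managed objects can be inspected directly (a sketch; the `clickhouse` namespace is assumed from the commands used later in this document):

```shell
# Inspect the operator-managed resources (namespace assumed from this doc).
kubectl -n clickhouse get clickhouseinstallations        # the ClickHouse cluster (CHI)
kubectl -n clickhouse get clickhousekeeperinstallations  # the Keeper ensemble (CHK)
kubectl -n clickhouse get pods                           # expect 2 ClickHouse + 3 Keeper pods
```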
## Infrastructure Hierarchy
Our infrastructure is managed via Terraform across two main repositories:
- Base Infrastructure (`deepdesk-gcp`):
  - Deploys the ClickHouse Operator.
  - Deploys the ClickHouse Cluster (`ClickHouseInstallation` CRD) and Keeper (`ClickHouseKeeperInstallation` CRD).
  - Sets up the underlying Kubernetes resources (Service Accounts, PVCs, ConfigMaps).
  - Manages the global backup schedule.
- Account Provisioning (`deepdesk-terraform`):
  - Manages tenant-specific resources (customers).
  - Creates databases (`{account}_analytics`) and users.
  - Configures federation with Cloud SQL (PostgreSQL).
  - Manages Access Control Lists (ACLs) and grants.
## Architecture Components
### 1. The ClickHouse Cluster (`deepdesk-gcp`)
The cluster is defined in the `clickhouse` Terraform module. Key configurations include:

- Resources: CPU and memory requests are parameterized variables (e.g., `cpu_requests`, `memory_requests`).
- Networking:
  - Exposed via an internal Nginx Ingress at `clickhouse{region_subdomain}.deepdesk.com`.
  - Service-to-service communication is permitted via Istio Ambient mesh policies (the namespace is labeled `istio.io/dataplane-mode: ambient`).
- Monitoring:
  - Prometheus metrics are exposed on port `9363`.
  - `ServiceMonitor` resources automatically scrape metrics for the ClickHouse cluster and Keeper.
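A quick way to spot-check the metrics endpoint by hand (pod and namespace names follow this doc; the `ClickHouse` metric-name prefix is an assumption about the embedded Prometheus exporter):

```shell
# Port-forward the metrics port from the first replica, then fetch a sample.
kubectl -n clickhouse port-forward pod/chi-cluster-main-0-0-0 9363:9363 &
sleep 2
curl -s http://localhost:9363/metrics | grep -m5 '^ClickHouse'
kill %1
```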
### 2. Cloud SQL Federation
Each ClickHouse pod runs a Cloud SQL Proxy sidecar (implemented as an `initContainer` with `restartPolicy: Always`).
- Purpose: Allows ClickHouse to connect to our Cloud SQL (PostgreSQL) instances via `localhost:5432`.
- Implementation:
  - A Named Collection is created for each tenant, pointing to `localhost:5432`.
  - A federated database `${account}_admin_db` is created using the `PostgreSQL` engine.
  - This allows us to query live data from Cloud SQL directly within ClickHouse for specific use cases.
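The Terraform-managed federation objects correspond roughly to the following SQL (a hedged sketch: the collection name `pg_acme`, the `acme` account, and the credentials are illustrative, not the real values):

```shell
# Hypothetical sketch of the per-tenant federation setup; names/credentials are examples.
clickhouse-client --query "
  CREATE NAMED COLLECTION IF NOT EXISTS pg_acme AS
    host = 'localhost', port = 5432,
    database = 'acme_admin_db', user = 'federation_user', password = 'REDACTED'"
clickhouse-client --query "
  CREATE DATABASE IF NOT EXISTS acme_admin_db ENGINE = PostgreSQL(pg_acme)"
```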
### 3. ClickHouse Keeper
We use ClickHouse Keeper (a C++ implementation compatible with ZooKeeper) for cluster coordination.
- It runs as a separate StatefulSet with 3 replicas to ensure quorum.
- Configured via the `ClickHouseKeeperInstallation` CRD.
## Tenant & Database Management (`deepdesk-terraform`)
When a new customer account is provisioned:
- Analytics Database: A native ClickHouse database named `${account}_analytics` is created.
- Federated Database: A PostgreSQL-backed database named `${account}_admin_db` is connected.
- Permissions:
  - The Terraform Provider for ClickHouse (`clickhousedbops`) manages grants.
  - Specific services (e.g., `admin_celery_worker`, `backend`, `analytics_readonly`) are granted granular permissions (`SELECT`, `INSERT`, `ALTER DELETE`) on these databases.
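As a sketch, the grants managed by `clickhousedbops` are equivalent to statements like these (the exact mapping of service to privilege shown here is illustrative; `acme` is an example account):

```shell
clickhouse-client --query "GRANT SELECT ON acme_analytics.* TO analytics_readonly"
clickhouse-client --query "GRANT SELECT, INSERT ON acme_analytics.* TO backend"
clickhouse-client --query "GRANT SELECT, INSERT, ALTER DELETE ON acme_analytics.* TO admin_celery_worker"
```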
## Backup Strategy
Backups are automated using a custom sidecar container running Supercronic.
- Image: `deepdesk-custom-docker-images/clickhouse-backup-cron`.
- Tool: The `clickhouse-backup` utility.
- Schedule: Runs hourly (`0 * * * *`).
### Backup Workflow (`clickhouse-backups.sh`)
- Leader Only: The script executes only on the first replica (`chi-cluster-main-0-0-0`) to avoid duplication.
- Discovery: Iterates through all databases ending in `_analytics`.
- Storage: Backups are uploaded to Google Cloud Storage (GCS).
  - Bucket pattern: `${project}-backups-${account}`.
- Full vs. Incremental:
  - Incremental: The default backup mode.
  - Full: Triggered if:
    - No previous full backup exists.
    - The latest full backup is more than 24 hours old.
    - A "mutation" (cleanup/delete) event was detected since the last full backup (ensuring consistency).
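The workflow above can be sketched as a few shell functions (a minimal illustration, not the actual `clickhouse-backups.sh`; the function names and the epoch-based interface are assumptions):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the backup-workflow decision logic; names are illustrative.

# Only the first replica runs backups; pod name taken from this doc.
is_leader() { [ "${1:-$(hostname)}" = "chi-cluster-main-0-0-0" ]; }

# Keep only databases ending in _analytics (reads database names on stdin).
analytics_dbs() { grep -E '_analytics$'; }

# choose_mode LAST_FULL_EPOCH MUTATION_SEEN NOW_EPOCH -> "full" | "incremental"
choose_mode() {
  local last_full="$1" mutation_seen="$2" now="$3"
  local day=$((24 * 60 * 60))
  if [ -z "$last_full" ]; then
    echo full          # no previous full backup exists
  elif [ $((now - last_full)) -gt "$day" ]; then
    echo full          # latest full backup is > 24 hours old
  elif [ "$mutation_seen" = "1" ]; then
    echo full          # mutation (cleanup/delete) seen since last full
  else
    echo incremental   # the default mode
  fi
}
```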
## Recovery
Backups are stored in GCS and can be restored using the `clickhouse-backup` tool.
- List Backups: You can list available backups by executing the command inside the `clickhouse-backup` container on any ClickHouse cluster pod.

  ```shell
  kubectl -n clickhouse exec -it chi-cluster-main-0-0-0 -c clickhouse-backup -- /bin/bash
  GCS_BUCKET="${project}-backups-${account}" clickhouse-backup list remote
  ```
- Restore Procedure: Restoring in a replicated environment requires specific steps to ensure consistency. You must restore the schema on all replicas but restore the data only on the first replica.

  In this example, we restore the `${account}_analytics` database to a new temporary database `${account}_analytics_temp`. This pattern allows for safe verification and manual recovery of specific data subsets without immediately overwriting live data.

  - Restore Schema on Replica 0:

    ```shell
    kubectl -n clickhouse exec -it chi-cluster-main-0-0-0 -c clickhouse-backup -- /bin/bash
    GCS_BUCKET="${project}-backups-${account}" clickhouse-backup restore_remote --schema --tables="${account}_analytics.*" --restore-database-mapping="${account}_analytics:${account}_analytics_temp" <backup_name>
    exit
    ```

  - Restore Schema on Replica 1:

    ```shell
    kubectl -n clickhouse exec -it chi-cluster-main-0-1-0 -c clickhouse-backup -- /bin/bash
    GCS_BUCKET="${project}-backups-${account}" clickhouse-backup restore_remote --schema --tables="${account}_analytics.*" --restore-database-mapping="${account}_analytics:${account}_analytics_temp" <backup_name>
    exit
    ```

  - Restore Data on Replica 0:

    ```shell
    kubectl -n clickhouse exec -it chi-cluster-main-0-0-0 -c clickhouse-backup -- /bin/bash
    GCS_BUCKET="${project}-backups-${account}" clickhouse-backup restore_remote --tables="${account}_analytics.*" --restore-database-mapping="${account}_analytics:${account}_analytics_temp" <backup_name>
    exit
    ```
## Disaster Recovery (Destructive Restore)
β οΈ WARNING: This procedure DROPS existing tables and data before restoring. Use this ONLY in catastrophic scenarios where data is lost or corrupted and you need to revert to a backup completely.
- Restore Schema on Replica 1 (with `--drop`):

  ```shell
  kubectl -n clickhouse exec -it chi-cluster-main-0-1-0 -c clickhouse-backup -- /bin/bash
  GCS_BUCKET="${project}-backups-${account}" clickhouse-backup restore_remote --tables="${account}_analytics.*" --schema --drop <backup_name>
  exit
  ```

- Restore Schema & Data on Replica 0 (with `--drop`):

  ```shell
  kubectl -n clickhouse exec -it chi-cluster-main-0-0-0 -c clickhouse-backup -- /bin/bash
  GCS_BUCKET="${project}-backups-${account}" clickhouse-backup restore_remote --tables="${account}_analytics.*" --schema --data --drop <backup_name>
  exit
  ```
- Extras:

  Some account-specific configurations (like users, grants, etc.) are managed via Terraform in the `deepdesk-terraform` repo. The safest way to make sure these are fully applied is to:

  - Go to the Cloud Build triggers page in the right GCP project (e.g., staging).
  - Find the trigger named `deploy-accounts-infra`.
  - Click Run.
  - Set the Branch to the tag of the latest release from the `deepdesk-terraform` repo for any production environment, or to `main` for staging.
  - Set the value of the `_ACCOUNT` variable to the account that has been restored (`${account}`).
  - Click Run trigger.

  The above can also be achieved with the following `gcloud` command:

  ```shell
  gcloud builds triggers run \
    $(gcloud builds triggers list --filter name=deploy-accounts-infra --format json --project ${project} | jq -r '.[0].id') \
    --region global \
    --substitutions _ACCOUNT=${account} \
    $(
      if [ "${project}" = "deepdesk-nl-staging" ]; then
        echo -n "--branch main"
      else
        echo "--tag $(git ls-remote --tags --sort="v:refname" git@github.com:Deepdesk/deepdesk-terraform.git | tail -n1 | sed 's/.*\/\|\^{}//g')"
      fi
    ) \
    --project ${project}
  ```
## Key Repositories
| Repository | Purpose | Path |
|---|---|---|
| `deepdesk-gcp` | Base Infrastructure & Cluster Deployment | `terraform/modules/clickhouse` |
| `deepdesk-terraform` | Tenant DBs, Users & Grants | `projects/analytics/`, `modules/database/` |
| `deepdesk-custom-docker-images` | Backup Cron Image | `clickhouse-backup-cron/` |