# ClickHouse Infrastructure & Setup
This document details the ClickHouse database infrastructure running in our GKE clusters. It covers the deployment architecture, tenant management, integration with Cloud SQL, and backup strategies.
## Overview
ClickHouse is deployed on Kubernetes (GKE) using the Altinity ClickHouse Operator. It serves as our primary analytics engine.
- Environments: Staging and Production (each in its own GCP project).
- Cluster Topology:
  - ClickHouse: 1 shard, 2 replicas (high availability).
  - Keeper: 3 replicas (ZooKeeper replacement for coordination).
- Operator: `altinity-clickhouse-operator` manages the cluster state.
- Storage: Persistent Volumes using the `premium-rwo-cmek` storage class.
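Under this topology, the operator-managed objects can be inspected directly (a sketch; the `clickhouse` namespace is assumed from the commands used later in this document):

```shell
# Inspect the operator-managed resources (namespace assumed from this doc).
kubectl -n clickhouse get clickhouseinstallations        # the ClickHouse cluster (CHI)
kubectl -n clickhouse get clickhousekeeperinstallations  # the Keeper ensemble (CHK)
kubectl -n clickhouse get pods                           # expect 2 ClickHouse + 3 Keeper pods
```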
## Infrastructure Hierarchy
Our infrastructure is managed via Terraform across two main repositories:
- Base Infrastructure (`deepdesk-gcp`):
  - Deploys the ClickHouse Operator.
  - Deploys the ClickHouse Cluster (`ClickHouseInstallation` CRD) and Keeper (`ClickHouseKeeperInstallation` CRD).
  - Sets up the underlying Kubernetes resources (Service Accounts, PVCs, ConfigMaps).
  - Manages the global backup schedule.
- Account Provisioning (`deepdesk-terraform`):
  - Manages tenant-specific resources (customers).
  - Creates databases (`{account}_analytics`) and users.
  - Configures federation with Cloud SQL (PostgreSQL).
  - Manages Access Control Lists (ACLs) and grants.
## Architecture Components
### 1. The ClickHouse Cluster (`deepdesk-gcp`)
The cluster is defined in the `clickhouse` Terraform module. Key configurations include:

- Resources: CPU and memory requests are parameterized variables (e.g., `cpu_requests`, `memory_requests`).
- Networking:
  - Exposed via an internal Nginx Ingress at `clickhouse{region_subdomain}.deepdesk.com`.
  - Service-to-service communication is permitted via Istio Ambient mesh policies (the namespace is labeled `istio.io/dataplane-mode: ambient`).
- Monitoring:
  - Prometheus metrics are exposed on port `9363`.
  - `ServiceMonitor` resources automatically scrape metrics for the ClickHouse cluster and Keeper.
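A quick way to spot-check the metrics endpoint by hand (pod and namespace names follow this doc; the `ClickHouse` metric-name prefix is an assumption about the embedded Prometheus exporter):

```shell
# Port-forward the metrics port from the first replica, then fetch a sample.
kubectl -n clickhouse port-forward pod/chi-cluster-main-0-0-0 9363:9363 &
sleep 2
curl -s http://localhost:9363/metrics | grep -m5 '^ClickHouse'
kill %1
```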
### 2. Cloud SQL Federation
Each ClickHouse pod runs a Cloud SQL Proxy sidecar (implemented as an `initContainer` with `restartPolicy: Always`).
- Purpose: Allows ClickHouse to connect to our Cloud SQL (PostgreSQL) instances via `localhost:5432`.
- Implementation:
  - A Named Collection is created for each tenant, pointing to `localhost:5432`.
  - A federated database `${account}_admin_db` is created using the `PostgreSQL` engine.
  - This allows us to query live data from Cloud SQL directly within ClickHouse for specific use cases.
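The Terraform-managed federation objects correspond roughly to the following SQL (a hedged sketch: the collection name `pg_acme`, the `acme` account, and the credentials are illustrative, not the real values):

```shell
# Hypothetical sketch of the per-tenant federation setup; names/credentials are examples.
clickhouse-client --query "
  CREATE NAMED COLLECTION IF NOT EXISTS pg_acme AS
    host = 'localhost', port = 5432,
    database = 'acme_admin_db', user = 'federation_user', password = 'REDACTED'"
clickhouse-client --query "
  CREATE DATABASE IF NOT EXISTS acme_admin_db ENGINE = PostgreSQL(pg_acme)"
```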
### 3. ClickHouse Keeper
We use ClickHouse Keeper (a C++ implementation compatible with ZooKeeper) for cluster coordination.
- It runs as a separate StatefulSet with 3 replicas to ensure quorum.
- Configured via the `ClickHouseKeeperInstallation` CRD.
## Tenant & Database Management (`deepdesk-terraform`)
When a new customer account is provisioned:
- Analytics Database: A native ClickHouse database named `${account}_analytics` is created.
- Federated Database: A PostgreSQL-backed database named `${account}_admin_db` is connected.
- Permissions:
  - The Terraform Provider for ClickHouse (`clickhousedbops`) manages grants.
  - Specific services (e.g., `admin_celery_worker`, `backend`, `analytics_readonly`) are granted granular permissions (`SELECT`, `INSERT`, `ALTER DELETE`) on these databases.
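As a sketch, the grants managed by `clickhousedbops` are equivalent to statements like these (the exact mapping of service to privilege shown here is illustrative; `acme` is an example account):

```shell
clickhouse-client --query "GRANT SELECT ON acme_analytics.* TO analytics_readonly"
clickhouse-client --query "GRANT SELECT, INSERT ON acme_analytics.* TO backend"
clickhouse-client --query "GRANT SELECT, INSERT, ALTER DELETE ON acme_analytics.* TO admin_celery_worker"
```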
## Backup Strategy
Backups are automated using a custom sidecar container running Supercronic.
- Image: `deepdesk-custom-docker-images/clickhouse-backup-cron`.
- Tool: The `clickhouse-backup` utility.
- Schedule: Runs hourly (`0 * * * *`).
### Backup Workflow (`clickhouse-backups.sh`)
- Leader Only: The script executes only on the first replica (`chi-cluster-main-0-0-0`) to avoid duplication.
- Discovery: Iterates through all databases ending in `_analytics`.
- Storage: Backups are uploaded to Google Cloud Storage (GCS).
  - Bucket pattern: `${project}-backups-${account}`.
- Full vs. Incremental:
  - Incremental: The default backup mode.
  - Full: Triggered if:
    - No previous full backup exists.
    - The latest full backup is more than 24 hours old.
    - A "mutation" (cleanup/delete) event was detected since the last full backup (ensuring consistency).
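The workflow above can be sketched as a few shell functions (a minimal illustration, not the actual `clickhouse-backups.sh`; the function names and the epoch-based interface are assumptions):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the backup-workflow decision logic; names are illustrative.

# Only the first replica runs backups; pod name taken from this doc.
is_leader() { [ "${1:-$(hostname)}" = "chi-cluster-main-0-0-0" ]; }

# Keep only databases ending in _analytics (reads database names on stdin).
analytics_dbs() { grep -E '_analytics$'; }

# choose_mode LAST_FULL_EPOCH MUTATION_SEEN NOW_EPOCH -> "full" | "incremental"
choose_mode() {
  local last_full="$1" mutation_seen="$2" now="$3"
  local day=$((24 * 60 * 60))
  if [ -z "$last_full" ]; then
    echo full          # no previous full backup exists
  elif [ $((now - last_full)) -gt "$day" ]; then
    echo full          # latest full backup is > 24 hours old
  elif [ "$mutation_seen" = "1" ]; then
    echo full          # mutation (cleanup/delete) seen since last full
  else
    echo incremental   # the default mode
  fi
}
```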
## Recovery
Backups are stored in GCS and can be restored using the `clickhouse-backup` tool.
- List Backups: You can list available backups by executing the command inside the `clickhouse-backup` container on any ClickHouse cluster pod.

  ```shell
  kubectl -n clickhouse exec -it chi-cluster-main-0-0-0 -c clickhouse-backup -- /bin/bash
  GCS_BUCKET="${project}-backups-${account}" clickhouse-backup list remote
  ```
- Restore Procedure: Restoring in a replicated environment requires specific steps to ensure consistency. You must restore the schema on all replicas but restore the data only on the first replica.

  In this example, we restore the `${account}_analytics` database to a new temporary database `${account}_analytics_temp`. This pattern allows for safe verification and manual recovery of specific data subsets without immediately overwriting live data.

  - Restore Schema on Replica 0:

    ```shell
    kubectl -n clickhouse exec -it chi-cluster-main-0-0-0 -c clickhouse-backup -- /bin/bash
    GCS_BUCKET="${project}-backups-${account}" clickhouse-backup restore_remote --schema --tables="${account}_analytics.*" --restore-database-mapping="${account}_analytics:${account}_analytics_temp" <backup_name>
    exit
    ```

  - Restore Schema on Replica 1:

    ```shell
    kubectl -n clickhouse exec -it chi-cluster-main-0-1-0 -c clickhouse-backup -- /bin/bash
    GCS_BUCKET="${project}-backups-${account}" clickhouse-backup restore_remote --schema --tables="${account}_analytics.*" --restore-database-mapping="${account}_analytics:${account}_analytics_temp" <backup_name>
    exit
    ```

  - Restore Data on Replica 0:

    ```shell
    kubectl -n clickhouse exec -it chi-cluster-main-0-0-0 -c clickhouse-backup -- /bin/bash
    GCS_BUCKET="${project}-backups-${account}" clickhouse-backup restore_remote --tables="${account}_analytics.*" --restore-database-mapping="${account}_analytics:${account}_analytics_temp" <backup_name>
    exit
    ```
## Disaster Recovery (Destructive Restore)
β οΈ WARNING: This procedure DROPS existing tables and data before restoring. Use this ONLY in catastrophic scenarios where data is lost or corrupted and you need to revert to a backup completely.
- Restore Schema on Replica 1 (with `--drop`):

  ```shell
  kubectl -n clickhouse exec -it chi-cluster-main-0-1-0 -c clickhouse-backup -- /bin/bash
  GCS_BUCKET="${project}-backups-${account}" clickhouse-backup restore_remote --tables="${account}_analytics.*" --schema --drop <backup_name>
  exit
  ```

- Restore Schema & Data on Replica 0 (with `--drop`):

  ```shell
  kubectl -n clickhouse exec -it chi-cluster-main-0-0-0 -c clickhouse-backup -- /bin/bash
  GCS_BUCKET="${project}-backups-${account}" clickhouse-backup restore_remote --tables="${account}_analytics.*" --schema --data --drop <backup_name>
  exit
  ```
- Extras:

  Some account-specific configurations (like users, grants, etc.) are managed via Terraform in the `deepdesk-terraform` repo. The safest way to make sure these are fully applied is to:

  - Go to the Cloud Build triggers page in the right GCP project (e.g., staging).
  - Find the trigger named `deploy-accounts-infra`.
  - Click Run.
  - Set the Branch to the tag of the latest release from the `deepdesk-terraform` repo for any production environment, or to `main` for staging.
  - Set the value of the `_ACCOUNT` variable to the account that has been restored (`${account}`).
  - Click Run trigger.

  The above can also be achieved with the following `gcloud` command:

  ```shell
  gcloud builds triggers run \
    $(gcloud builds triggers list --filter name=deploy-accounts-infra --format json --project ${project} | jq -r '.[0].id') \
    --region global \
    --substitutions _ACCOUNT=${account} \
    $(
      if [ "${project}" = "deepdesk-nl-staging" ]; then
        echo -n "--branch main"
      else
        echo "--tag $(git ls-remote --tags --sort="v:refname" git@github.com:Deepdesk/deepdesk-terraform.git | tail -n1 | sed 's/.*\/\|\^{}//g')"
      fi
    ) \
    --project ${project}
  ```
## Key Repositories
| Repository | Purpose | Path |
|---|---|---|
| `deepdesk-gcp` | Base Infrastructure & Cluster Deployment | `terraform/modules/clickhouse` |
| `deepdesk-terraform` | Tenant DBs, Users & Grants | `projects/analytics/`, `modules/database/` |
| `deepdesk-custom-docker-images` | Backup Cron Image | `clickhouse-backup-cron/` |