Deepdesk GCP terraform

This page describes the deepdesk-gcp GitHub repo. This repo contains two terraform projects that provision all deepdesk GCP resources needed to set up a 'region'. A region is one Google project, like deepdesk-nl-production, containing a k8s cluster that runs all deepdesk services. Resources that need to be created for every account do not belong in this repository: they should go in the 'deepdesk-terraform' repo.

Project structure

The repository structure is rather standard: there are two projects (called per-project and global).

per-project: Provisions resources that need to be created in every GCP project, like a k8s cluster, SQL instance, etc.
global: Provisions resources at the organization level, or in the deepdesk-cloud project, i.e. things that are only created once (static asset buckets, for example).

Besides the projects/ folder, there's a modules folder containing all the shared modules used by both projects. Every major software component has its own module, to keep things logical and clean. The individual modules are described below.
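Under assumptions about the exact folder names (module list abbreviated), the layout looks roughly like this:

```
deepdesk-gcp/
├── projects/
│   ├── per-project/   # run once per GCP project ('region')
│   └── global/        # run once: org level / deepdesk-cloud project
└── modules/
    ├── analytics/
    ├── apis/
    ├── database/
    └── ...            # one module per major software component
```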

Manually performing a run

To manually run terraform for a project, refer to the README in the repository.

Modules

This section describes the individual modules and any exceptions/specials that are good to know about.

Analytics

What? Provisions Analytics BQ project-level resources, mainly BQ backups, alerts, and BQ custom roles.
Specials? This module creates a BQ backup k8s cronjob that backs up all BQ datasets/tables in JSON format to a project-level bucket (<projectname>-bigquery-backups). The job runs once a week on Sunday @ 0:00.
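A minimal sketch of the weekly backup CronJob as it might look in terraform. The resource names, image and backup command are assumptions; only the schedule follows from the text:

```hcl
# Hypothetical sketch of the weekly BQ backup CronJob (image/command assumed).
resource "kubernetes_cron_job_v1" "bq_backup" {
  metadata {
    name      = "bq-backup"
    namespace = "default"
  }
  spec {
    schedule = "0 0 * * 0" # once a week, Sunday @ 0:00
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            container {
              name    = "bq-backup"
              image   = "google/cloud-sdk:slim" # assumed image
              command = ["/bin/sh", "-c", "/scripts/backup-all-datasets.sh"] # backup logic elided
            }
            restart_policy = "OnFailure"
          }
        }
      }
    }
  }
}
```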

Apis

What? Enables the Google APIs needed to run all deepdesk products.
Specials? -
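This kind of module typically boils down to a `google_project_service` per API. A sketch; the service list here is illustrative, not the module's actual list:

```hcl
# Sketch: enable required Google APIs (service list is illustrative).
resource "google_project_service" "apis" {
  for_each = toset([
    "container.googleapis.com",
    "sqladmin.googleapis.com",
    "secretmanager.googleapis.com",
  ])

  project = var.project_id
  service = each.value

  # keep APIs enabled even if the terraform resource is destroyed
  disable_on_destroy = false
}
```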

Cert-exporter

What? Deploys the 'cert-exporter' tool into the GKE cluster, enabling Prometheus to check for expired SSL certificates. Deployed using Helm.
Specials? -

Cert-manager

What? Installs 'cert-manager' into the GKE cluster. This enables automated provisioning of TLS certificates through Let's Encrypt (and other CAs) via ACME.
Specials? Uses workload-identity to gain access to Google DNS for ACME DNS01 challenges.
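Several modules in this repo use the same workload-identity pattern. A sketch of what it looks like for cert-manager; the service account id, namespace and k8s service account name are assumptions:

```hcl
# Sketch of the workload-identity pattern (account/namespace names assumed).
resource "google_service_account" "cert_manager" {
  account_id = "cert-manager-dns01"
  project    = var.project_id
}

# Let the k8s service account impersonate the GCP service account.
resource "google_service_account_iam_member" "wlid" {
  service_account_id = google_service_account.cert_manager.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[cert-manager/cert-manager]"
}

# Grant DNS admin so cert-manager can solve ACME DNS01 challenges.
resource "google_project_iam_member" "dns_admin" {
  project = var.project_id
  role    = "roles/dns.admin"
  member  = "serviceAccount:${google_service_account.cert_manager.email}"
}
```

The same binding shape (with a different role) is what external-dns and external-secrets use below.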

Database

What? A module used to create a Postgres database in a Postgres CloudSQL instance. Username and password need to be provided as variables.
Specials? After creating the database, a k8s Job is created in the 'default' namespace to revoke cloudsqlsuperuser permissions from the database user and instead grant all privileges only on the database just created. It also enables the pgcrypto extension for the database (db-level encryption).
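The terraform side of this module is presumably a database plus a user; a sketch with assumed variable names (the permission-fixing k8s Job is elided):

```hcl
# Sketch: per-database resources (variable names assumed).
resource "google_sql_database" "db" {
  name     = var.database_name
  instance = var.instance_name
  project  = var.project_id
}

resource "google_sql_user" "user" {
  name     = var.database_user
  password = var.database_password
  instance = var.instance_name
  project  = var.project_id
}
```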

Database_instance

What? Creates and configures a CloudSQL Postgres instance.
Specials?
- The password for the 'postgres' user is set to what's in the postgres_db_password secret in Secret Manager. This needs to be set/created before using this module. A postgres-db secret and CloudSQL proxy are created in the default namespace, allowing developers easy access for DB maintenance.
- Creates a 'sql-backup' k8s cronjob that backs up all databases in SQL format to a project-level bucket (<projectname>-sql-backups). This job runs daily @ 6:00.
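Reading the pre-created secret and wiring it into the 'postgres' user might look like this (a sketch; the instance resource name is an assumption):

```hcl
# Sketch: read the pre-created Secret Manager secret (id from the text above).
data "google_secret_manager_secret_version" "postgres_db_password" {
  secret  = "postgres_db_password"
  project = var.project_id
}

# Set the superuser password on the (assumed) instance resource.
resource "google_sql_user" "postgres" {
  name     = "postgres"
  instance = google_sql_database_instance.main.name
  password = data.google_secret_manager_secret_version.postgres_db_password.secret_data
}
```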

External-dns

What? Deploys external-dns into the GKE cluster, enabling dynamic DNS updates for services/ingresses.
Specials? Uses workload-identity to gain access to Google DNS (project deepdesk-cloud).

External-secrets

What? Deploys external-secrets into the GKE cluster, allowing automatic synchronization of Secret Manager secrets into kubernetes. These secrets are then used by FluxCD to deploy deepdesk services.
Specials? Uses workload-identity to gain access to Secret Manager secrets.

Flux

What? Deploys cluster-wide FluxCD resources into the GKE cluster (CRDs, Flux controllers, GitRepositories, ImageAutomations and webhook receivers).
Specials?
- FluxCD doesn't supply a Helm chart, so this module contains a custom Helm chart. The chart has been generated with the following command:
  flux install --export --components-extra image-automation-controller,image-reflector-controller
  This outputs one big YAML file containing all Flux resources and CRDs. The Helm chart is the split-up version of this file with some minor modifications. Modifications are marked with # DD and will need to be re-done when regenerating the Helm chart.
- To allow the image automation controller to access Google Container Registry, the service account for this controller uses workload-identity.
- The image automation controller uses a GitHub SSH key to write to the deepdesk-config GitHub repository and update the image tags.

GCR-cleanup

What? Deploys a CronJob in GKE to clean up Google Container Registry images / image tags. The tool used can be found here: https://github.com/GoogleCloudPlatform/gcr-cleaner
Specials? It removes images where all tags match the following regex: `"^\w7`

Generated_secret

What? Module to generate a random password/secret and store it in Secret Manager.
Specials? -
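A sketch of how such a module is commonly built; the variable names and password policy are assumptions:

```hcl
# Sketch of the generated_secret module (interface assumed).
resource "random_password" "secret" {
  length  = 32
  special = false
}

resource "google_secret_manager_secret" "secret" {
  secret_id = var.secret_name
  project   = var.project_id

  replication {
    auto {} # older google providers use `automatic = true` instead
  }
}

resource "google_secret_manager_secret_version" "secret" {
  secret      = google_secret_manager_secret.secret.id
  secret_data = random_password.secret.result
}
```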

GKE_configuration

What? Generic module for GKE configuration that does not fit anywhere else. Currently only contains creation of PodDisruptionBudgets for kube-system components. Can be removed once GKE 1.22 is live.
Specials? -

GKE_private_cluster

What? Creates a GKE private cluster with Node Auto Provisioning. Sets up permissions for the cluster service account and creates firewall rules for Master → Node traffic. Also sets up a BQ billing table to summarize GKE cost records, used in a DataStudio cost details report.
Specials?
- The cluster endpoint has a public IP, but is only reachable from within the VPC. There's a jumphost and VPN in place to get access from the outside.
- Each project has one cluster, using the same VPC/subnetwork IP ranges. This means there's IP overlap between projects, which can lead to problems with org-wide VPNs, peerings, etc.
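The private-cluster shape described above roughly corresponds to this terraform sketch (CIDRs, limits and names are illustrative, not the module's real values):

```hcl
# Sketch of the private cluster with Node Auto Provisioning (values illustrative).
resource "google_container_cluster" "cluster" {
  name       = var.cluster_name
  location   = var.region
  network    = var.vpc
  subnetwork = var.subnetwork

  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false # endpoint gets a public IP...
    master_ipv4_cidr_block  = "172.16.0.0/28"
  }

  # ...but only networks listed here may reach it (VPC/jumphost entries elided).
  master_authorized_networks_config {}

  # Node Auto Provisioning: GKE creates node pools within these limits.
  cluster_autoscaling {
    enabled = true
    resource_limits {
      resource_type = "cpu"
      maximum       = 64
    }
    resource_limits {
      resource_type = "memory"
      maximum       = 256
    }
  }
}
```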

Image_proxy

What? Deploys an instance of 'image proxy': an image resizing and caching proxy used to serve resized (dashboard) images.
Specials?
- It uses a GCP bucket as a cache to store resized images. The bucket has a lifecycle policy to remove files older than 30 days. WLID is used to access this bucket.
- Image proxy does not provide a Helm chart. The module contains a custom chart in the 'chart' subdir.
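The cache bucket with its 30-day lifecycle rule might look like this (bucket name is an assumption):

```hcl
# Sketch: image-proxy cache bucket with the 30-day cleanup policy.
resource "google_storage_bucket" "image_cache" {
  name     = "${var.project_id}-image-cache" # bucket name assumed
  location = var.region

  lifecycle_rule {
    condition {
      age = 30 # days since object creation
    }
    action {
      type = "Delete"
    }
  }
}
```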

Istio

What? Deploys and configures Istio Service Mesh into the GKE cluster.
Specials?
- We use the 'Istio Operator' to deploy Istio. The operator itself is part of a custom Helm chart that can be found in the 'chart' subdir, file templates/istio-deepdesk.yaml.
- We don't use the Istio gateways.
- Auto-injection of the sidecar is enabled, but in 'opt-in' mode. The namespace or pods need to be labeled to get auto-injection of the Istio sidecar.
- Istio mTLS is enforced in most namespaces, but not enabled cluster-wide (yet).
- We used to have Istio CNI enabled, but there were issues with Kubeflow pods and init container order.
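Opting a namespace in to sidecar injection is done with the standard Istio label; a sketch with a hypothetical namespace:

```hcl
# Sketch: opt a namespace in to Istio sidecar auto-injection via label.
resource "kubernetes_namespace" "example" {
  metadata {
    name = "example" # hypothetical namespace
    labels = {
      "istio-injection" = "enabled"
    }
  }
}
```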

Kube_downscaler

What? This module deploys kube-downscaler, a tool to downscale deployments outside office hours to save on GKE costs.
Specials? Only enabled on staging, scaling everything down at night and on weekends.
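kube-downscaler is driven by annotations; for example, a workload can be excluded from downscaling with the tool's `downscaler/exclude` annotation. A sketch with a hypothetical deployment:

```hcl
# Sketch: exclude a deployment from downscaling (deployment name hypothetical).
resource "kubernetes_annotations" "exclude_from_downscaling" {
  api_version = "apps/v1"
  kind        = "Deployment"
  metadata {
    name      = "critical-service"
    namespace = "default"
  }
  annotations = {
    "downscaler/exclude" = "true"
  }
}
```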

Kubeflow

What? Deploys kubeflow-pipelines into the GKE cluster and connects it to the Google AI Pipelines console UI.
Specials?
- A howto for updating kubeflow can be found here (document not migrated)
- Kubeflow does not offer a Helm chart, so a custom chart can be found in the chart/ subdir.
- Kubeflow service accounts have access to buckets and BQ through WLID.
- A separate MySQL CloudSQL database instance is deployed to contain the kubeflow database.

Logging

What? Configures GCP project-level logging buckets and filters. Filters out useless logging to save on storage and ingestion costs.
Specials? -

Metabase

What? Deploys the 'Metabase' product into GKE. Metabase is used for analytics from BQ tables/views and provides an API for scheduled reporting.
Specials? The Metabase Helm chart is deployed using FluxCD. The terraform bit only deploys a Kustomization to configure the FluxCD resources.

Nginx

What? Deploys the NGINX Ingress Controller. Both an internal and an external controller are deployed. The internal controller can only be reached via VPN. This module also deploys the 'deepdesk.com' TLS certificate.
Specials?
- External connectivity and TLS offloading are provided by a Google application load balancer, deployed and configured by an 'ingress' of type 'gce' / 'gce-internal'.
- We use the community version of the NGINX controller, not the NGINX Inc. version.
- Both controllers have the Istio sidecar enabled to communicate over mTLS with all services in the cluster.

Nvidia

What? Installs the Nvidia driver installer daemonset into GKE. Needed to enable GPU support for kubeflow jobs.
Specials? -

Prometheus

What? Installs the Kube Prometheus Stack into GKE. This stack is a pre-configured collection of Prometheus, Grafana and Alertmanager, combined into one Helm chart. It also configures log-based metrics in GCP monitoring, used by some of the dashboards and alerts.
Specials?
- Grafana dashboards are provisioned as JSON files in the 'dashboards' subdir. Placing .json files in this dir will add the dashboard to Grafana.
- Grafana is configured to only use Google OAuth based login.
- All Prometheus tools are exposed through the internal Nginx ingress controller (prometheus.nl.deepdesk.com, alertmanager.nl.deepdesk.com, grafana.nl.deepdesk.com, etc.)
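One common way to wire a dashboards/ dir into Grafana is a ConfigMap per JSON file, labeled for the kube-prometheus-stack sidecar. A sketch; the exact mechanism, namespace and naming used by this module are assumptions:

```hcl
# Sketch: turn dashboards/*.json into Grafana dashboard ConfigMaps.
resource "kubernetes_config_map" "dashboards" {
  for_each = fileset("${path.module}/dashboards", "*.json")

  metadata {
    # assumes filenames already yield valid ConfigMap names
    name      = "dashboard-${replace(each.value, ".json", "")}"
    namespace = "monitoring" # assumed namespace
    labels = {
      grafana_dashboard = "1" # picked up by the Grafana sidecar
    }
  }

  data = {
    (each.value) = file("${path.module}/dashboards/${each.value}")
  }
}
```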

Prometheus Adapter

What? Installs prometheus-adapter, used to calculate and serve custom metrics. Mainly used to expose an 'istio_requests_total' metric per service/pod, used in horizontal pod autoscaling of GPT and backend.
Specials? -

Quota-exporter

What? Deploys the 'quota exporter' tool into GKE. This is a Prometheus exporter that serves GCP quota as Prometheus metrics. Used to monitor and alert on GCP quota.
Specials? The docker image source can be found here

Services

What? This module deploys all non-account-specific services, like admin, broker, anonymizer, etc. It deploys the FluxCD Kustomization to pull in all FluxCD resources for automatic deployment.
Specials? -

Stackdriver-adapter

What? Installs the Google Stackdriver metrics adapter, allowing GKE components to query GCP metrics. Used by HPA to scale based on GPU resource usage.
Specials? -

VPC

What? Deploys and configures VPC networks, subnets and google-private-access.
Specials?
- Enables 'private google access' for all APIs except for storage.googleapis.com. Cause: host-header-based bucket access doesn't work using the 'private' version of the storage API. We use this to serve static assets. See this google case.
- For every project a small unique IP range/subnet is created to allow for proper routing via VPN. When creating a new region, the unique_subnet_numbers map needs to be extended.
- Subnets used by GKE use the same IP range in every project, causing IP overlap. This should have been unique, but cannot be changed without re-creating the entire cluster. We work around this by deploying a jumphost into a unique subnet and doing SNAT from this subnet.
- The public IP of the NAT gateway(s) must remain constant, because it's whitelisted in customer firewalls.
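The unique_subnet_numbers map mentioned above presumably looks something like this (the entries and number scheme are illustrative, not the real values):

```hcl
# Sketch: per-project subnet numbers; extend when creating a new region.
variable "unique_subnet_numbers" {
  type = map(number)
  default = {
    "deepdesk-nl-production" = 1 # illustrative entries
    "deepdesk-nl-staging"    = 2
    # add an entry here for each new region/project
  }
}
```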

VPN

What? Deploys a VPN gateway into the project and sets up WireGuard VPN. Also installs the Squid proxy on the gateway to allow Cloud Build jobs to connect to GKE.
Specials?
- See comments under 'VPC' about IP overlap / SNAT.
- End-user docs can be found here (document not migrated)
- Squid is installed and reachable as jumphost.deepdesk.internal from within the VPC and through VPC peerings that have peering DNS enabled. We use private Cloud Build pools with a DNS-enabled VPC peering, so the jobs can use the jumphost as an HTTP proxy and connect to GKE. This can't be done directly because GCP VPC peerings are not transitive. See also this article: https://medium.com/deepdesk/using-google-cloud-build-with-private-gke-clusters-8c98acb1bdf