Stackable Operator for Apache Superset
The Stackable Operator for Apache Superset is an operator that can deploy and manage Apache Superset clusters on Kubernetes. Superset is a data exploration and visualization tool that connects to data sources via SQL. Store your data in Apache Druid or Trino, and manage your Druid and Trino instances with the Stackable Operators for Apache Druid or Trino. This operator helps you manage your Superset instances on Kubernetes efficiently.
Getting started
Get started using Superset with the Stackable Operator by following the Getting started guide. It guides you through installing the Operator alongside a PostgreSQL database, connecting to your Superset instance, and analyzing some preloaded example data.
Resources
The Operator manages two custom resources: the SupersetCluster and the DruidConnection. It creates a number of different Kubernetes resources based on these custom resources.
Custom resources
The SupersetCluster is the main resource for the configuration of the Superset instance. The resource defines only one
role, the node. The various configuration options are explained in the
Usage guide. It helps you tune your cluster to your needs by configuring
resource usage, security,
logging and more.
A DruidConnection resource links a Superset instance and a Druid instance. It lets you define this connection in the familiar way of deploying a resource (instead of configuring the connection via the Superset UI or API). The operator then configures the connection between Druid and the Superset instance.
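As a rough illustration of what defining the connection as a resource can look like, the following Python sketch builds a DruidConnection manifest and prints it as JSON. The apiVersion string, the spec layout (superset and druid references, each with name and namespace), and all object names are assumptions made for illustration; check them against the operator's CRD reference before use. The helper druid_connection_manifest is hypothetical and not part of any Stackable tooling.

```python
import json


def druid_connection_manifest(name, superset_name, druid_name, namespace="default"):
    """Build a DruidConnection manifest as a plain dict.

    Assumption: the apiVersion and the spec field names below reflect the
    DruidConnection CRD; verify them against the CRD reference.
    """
    return {
        "apiVersion": "superset.stackable.tech/v1alpha1",  # assumed API group/version
        "kind": "DruidConnection",
        "metadata": {"name": name},
        "spec": {
            # Which Superset instance to register the connection in (assumed field names).
            "superset": {"name": superset_name, "namespace": namespace},
            # Which Druid instance to connect to (assumed field names).
            "druid": {"name": druid_name, "namespace": namespace},
        },
    }


if __name__ == "__main__":
    # The instance names "superset" and "druid" are placeholders.
    manifest = druid_connection_manifest("superset-druid-connection", "superset", "druid")
    print(json.dumps(manifest, indent=2))
```

Saved to a file, the printed JSON can be applied with kubectl apply -f like any other manifest.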
Kubernetes resources
Based on the custom resources you define, the Operator creates ConfigMaps, StatefulSets and Services.
The diagram above depicts all the Kubernetes resources created by the operator, and how they relate to each other. The Job created for the DruidConnection resource is not shown.
For every role group you define, the Operator creates a
StatefulSet with the number of replicas defined in the RoleGroup. Every Pod in the StatefulSet has two containers: the
main container running Superset and a sidecar container gathering metrics for Monitoring. The
Operator creates a Service for the node role as well as a single Service per role group.
Additionally, a ConfigMap is created for each RoleGroup. Each of these ConfigMaps contains two files,
log_config.py and superset_config.py, which contain the logging and general Superset configuration respectively.
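The mapping from role groups to Kubernetes resources described above can be sketched in a few lines of Python. This is an illustrative model only, not the operator's implementation: the function planned_resources, the PlannedResource type, and the naming pattern cluster-role-rolegroup are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedResource:
    kind: str
    name: str
    detail: str = ""


def planned_resources(cluster, role_groups, role="node"):
    """List the resources expected for one cluster.

    role_groups maps a role group name to its replica count.
    Assumption: resources are named <cluster>-<role>[-<rolegroup>].
    """
    # One Service for the role as a whole.
    resources = [PlannedResource("Service", f"{cluster}-{role}", "role Service")]
    for group, replicas in role_groups.items():
        name = f"{cluster}-{role}-{group}"
        # One StatefulSet per role group; each Pod runs Superset plus a metrics sidecar.
        resources.append(
            PlannedResource("StatefulSet", name, f"{replicas} replicas, 2 containers per Pod")
        )
        # One Service per role group.
        resources.append(PlannedResource("Service", name, "role group Service"))
        # One ConfigMap per role group, holding the two configuration files.
        resources.append(
            PlannedResource("ConfigMap", name, "log_config.py, superset_config.py")
        )
    return resources


if __name__ == "__main__":
    # Two hypothetical role groups, "default" and "reporting".
    for res in planned_resources("superset", {"default": 2, "reporting": 1}):
        print(f"{res.kind:<12} {res.name:<28} {res.detail}")
```

With two role groups, the sketch yields one role Service plus a StatefulSet, a Service and a ConfigMap for each group, seven resources in total.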
Required external component: Metastore SQL database
Superset requires an SQL database in which to store its metadata, dashboards and users. The Getting started guide walks you through installing an example database alongside a Superset instance; this database is fine for getting started, but it is not suitable for production use. For a production database, follow the setup instructions for one of the supported databases.
Connecting to data sources
Superset does not store its own data; instead, it connects to other products where the data is stored. On the Stackable Platform, the two commonly used choices are Apache Druid and Trino. For Druid, you can connect a Druid instance declaratively with a custom resource. For Trino, this is on the roadmap. Have a look at the demos linked below for examples of using Superset with Druid or Trino.
Demos
Many of the Stackable demos use Superset in the stack for data visualization and exploration. The demos come in two main variants.
With Druid
The nifi-kafka-druid-earthquake-data and nifi-kafka-druid-water-level-data demos show Superset connected to Druid, exploring earthquake and water level data respectively.
With Trino
The spark-k8s-anomaly-detection-taxi-data, trino-taxi-data, trino-iceberg and data-lakehouse-iceberg-trino-spark demos all use a Trino instance on top of S3 storage that holds the data to analyze. Superset is connected to Trino to analyze a variety of different datasets.