[go: up one dir, main page]

Skip to main content

Tableau Integration with Apache Spark

Architecture Overview

Integration between Tableau and Ilum allows for direct SQL interaction with distributed datasets managed by the Ilum platform. This connection utilizes the Hive Thrift Protocol exposed by Ilum SQL (Apache Kyuubi).

The data flow for this integration is as follows:

  1. Tableau Desktop initiates a connection using the kyuubi-hive-jdbc driver.
  2. The connection is routed to the Ilum SQL service (ilum-sql-thrift-binary), which acts as the Gateway.
  3. Ilum SQL manages a Spark Session for the user.
  4. SQL queries generated by Tableau are translated into Spark Jobs.
  5. Spark Executors retrieve data from the underlying storage (S3, HDFS, GCS) using metadata from the Hive Metastore.

This architecture enables Tableau to query massive datasets without moving data to an intermediate database, leveraging Spark's distributed computing engine for processing.

Server-Side Configuration

To enable external JDBC access, the Ilum cluster must be configured to expose the necessary services.

Enable Ilum SQL and Metastore

Ensure that ilum-sql (Kyuubi) and hive-metastore are enabled in your Helm release. These components are responsible for handling JDBC connections and catalog management.

helm upgrade ilum ilum/ilum \
--set ilum-hive-metastore.enabled=true \
--set ilum-core.metastore.enabled=true \
--set ilum-kyuubi.enabled=true \
--reuse-values

Network Exposure Strategy

Tableau requires TCP access to the Ilum SQL Thrift interface. Depending on your network topology, expose the service using one of the following methods.

Production: LoadBalancer

For production environments, expose the Thrift binary service via a cloud LoadBalancer.

helm upgrade ilum ilum/ilum \
--set ilum-kyuubi.services.thriftBinary.type="LoadBalancer" \
--reuse-values

Verify the external IP assignment:

kubectl get service ilum-sql-thrift-binary

LoadBalancer Configuration

Development: Port Forwarding

For local development or testing without an external LoadBalancer, you can tunnel the connection using kubectl.

kubectl port-forward svc/ilum-sql-thrift-binary 10009:10009

Client-Side Configuration

Driver Installation

Tableau requires a specific JDBC driver to communicate with the Kyuubi server. Use the shaded Kyuubi Hive JDBC driver to avoid classpath conflicts.

  1. Download the Kyuubi Hive JDBC Shaded JAR from Maven Central.

  2. Place the .jar file in the driver directory specific to your operating system:

    • Windows: C:\Program Files\Tableau\Drivers
    • macOS: ~/Library/Tableau/Drivers
    • Linux: /opt/tableau/tableau_driver/jdbc
  3. Restart Tableau Desktop to load the new driver.

Establishing Connection

  1. Open Tableau Desktop.
  2. Navigate to Connect > To a Server > More... > Other Databases (JDBC).
  3. Configure the connection URL based on your exposure strategy.

Standard Connection String Syntax:

jdbc:kyuubi://<HOST>:<PORT>/<DATABASE>;<SESSION_CONF>#<SPARK_CONF>
  • Local Port-Forward: jdbc:kyuubi://localhost:10009/default
  • LoadBalancer: jdbc:kyuubi://<EXTERNAL_IP>:10009/default
  1. Dialect: Select SQL92 if prompted, though the driver typically handles dialect negotiation.
  2. Authentication:
    • If Ilum is configured with Internal Auth, use your Ilum username and password.
    • If No Authentication is configured (dev mode), use anonymous as the username and leave the password blank.

Advanced Configuration

Spark Session Tuning

You can pass Spark configuration parameters directly through the JDBC URL to tune the session resources for specific Tableau workloads. This is critical for heavy aggregation tasks.

Example: Allocating more memory to the driver and executors

jdbc:kyuubi://<IP>:10009/default;spark.driver.memory=4g;spark.executor.memory=8g;spark.executor.cores=4

Connection Optimization

  • Live vs. Extract:

    • Live Connection: Submits a Spark job for every interaction in the dashboard. Best for massive datasets where freshness is critical. Note that Spark job startup time contributes to latency.
    • Extract: Imports data into Tableau's Hyper engine. Recommended for performance on datasets that fit in memory, as it avoids Spark overhead for interactive filtering.
  • Initial SQL: Use the "Initial SQL" tab in the Tableau connection dialog to set session-specific variables or register temporary functions before queries begin.

Usage Workflow

  1. Select Schema: Upon connection, select spark_catalog and the default (or target) database.
  2. Table Selection: Drag tables onto the canvas. Ilum supports Federation, so you may see tables from multiple catalogs (Iceberg, Delta Lake, Hudi) if configured.

Tableau Data Source

  1. Visualization: Proceed to Sheet 1. Columns are mapped to Spark SQL data types.

Tableau Chart

Custom SQL

For complex analytics, prefer using Custom SQL in the data source tab rather than joining huge tables in the Tableau UI. This allows you to write optimized Spark SQL queries (e.g., pre-aggregating data) that push the computational load to the Ilum cluster rather than retrieving raw rows.

Troubleshooting

ClassNotFoundException

Symptom: Tableau fails to connect with a "driver not found" error. Resolution: Verify that the kyuubi-hive-jdbc-shaded jar is in the correct Drivers folder and that Tableau has been restarted. Ensure you are not using the standard Apache Hive driver, as it may lack specific protocol support for Kyuubi.

Connection Timeout

Symptom: Connection hangs or times out. Resolution:

  1. Check network connectivity to the LoadBalancer IP on port 10009 using telnet or nc.
  2. If using port-forward, ensure the terminal session is still active.
  3. Spark sessions may take time to initialize. Increase the connection timeout setting in Tableau if possible.