When you use Databricks as a data source with Tableau, you can provide powerful interactive analytics, bringing the contributions of your data scientists and data engineers to your business analysts by scaling to massive datasets.
生活发现【版权所有,谢谢】 console.log('可以直接使用JS,意味着md页面可以做任何事'). To add an image in markdown file the.md file and the image should be in the same directory. As in my case my.md file was in doc folder so i also moved the image into the same folder. After that write the following syntax in.md file. Learn how to use Tableau with Databricks.
Tableau provides four tools:
- Tableau Desktop: Tableau Desktop runs on local machines and can be used to connect to raw data sources and manipulate that data into views, dashboards, and so on.
- Tableau Server: Tableau Server is like Desktop except that it allows multiple users to access a centrally-managed Tableau deployment and enables data source governance.
- Tableau Online: The cloud version of Tableau Server.
- Tableau Prep: A tool for connecting to a variety of data sources and combining, shaping, and cleaning your data for Tableau Desktop.
For more information about Tableau, see these Tableau resources.
There are multiple options for connecting and authenticating to Databricks. This article focuses on Tableau Desktop and Tableau Online.
Connect to Databricks from Tableau Desktop (2019.3 and above)
- If you aren’t running a supported driver, Tableau displays a message in the connection dialog with a link to the driver download page, where you can find driver links and installation instructions. Download the Databricks ODBC driver version 2.6.15 or above.
- Continue with Connect to Databricks from Tableau Desktop and Tableau Online.
Connect to Databricks from Tableau Desktop and Tableau Online
Decide how you are going to authenticate from Tableau to Databricks.
You can authenticate using a personal access token (recommended) or a username and password.
Step 1: Get Databricks connection information
- Get the server hostname and HTTP path.
- If you authenticate using a personal access token, get a token.
Step 2: Configure Databricks connection in Tableau
Launch Tableau.
Go to the Connect > To a Server menu and select the Databricks connector.
In the Databricks dialog, enter the Server Hostname and HTTP Path from Step 1.
In the Authentication drop-down, choose your preferred authentication method:
Note
Although the dialog offers the Azure Active Directory authentication option, it is not supported on Databricks.
Provide the parameters required for the authentication method and sign in:
- Personal Access Token
- In the Password field enter the token from Step 1.
- Click Sign in.
- Username / Password
- Enter username and password and sign in.
- Click Sign in.
- Personal Access Token
Tableau Server on Linux
Edit /etc/odbcinst.ini
to include the following:
Note
Tableau Server on Linux recommends 64-bit processing architecture.
Databricks Markdown Bold
Best practices and troubleshooting
In this section:
The two fundamental actions to optimize Tableau queries are:
- Reduce the number of records being queried and visualized in a single chart or dashboard.
- Reduce the number of queries being sent by Tableau in a single chart or dashboard.
Deciding which to try first depends on your dashboard. If you have a number of different charts for individual users all in the same dashboard, it’s likely that Tableau is sending too many queries to Databricks. If you only have a couple of charts but they take a long time to load, there are probably too many records being returned by Databricks to load effectively.
Tableau performance recording, available on both Tableau Desktop and Tableau Server, can help you understand where performance bottlenecks are by identifying processes that are causing latency when you run a particular workflow or dashboard.
For instance, if query execution is the problem, you know it has to do with the data engine process or the data source that you are querying. If the visual layout is performing slowly, you know that it is the VizQL.
If the performance recording says that the latency is in executing query, it is likely that too much time is taken by Databricks returning the results or by the ODBC/Connector overlay processing the data into SQL for VizQL. When this occurs, you should analyze what you are returning and attempt to change the analytical pattern to have a dashboard per group, segment, or article instead of trying to cram everything into one dashboard and relying on Quick Filters.
If the poor performance is caused by sorting or visual layout, the problem may be the number of marks the dashboard is trying to return. Databricks can return one million records quickly, but Tableau may not be able to compute the layout and sort the results. If this is a problem, aggregate the query and drill into the lower levels. You can also try a bigger machine, since Tableau is only constrained by physical resources on the machine on which it is running.
For an in-depth tutorial on the performance recorder, see Create a Performance Recording.
In general, a workflow that runs on Tableau Desktop will not be any faster on Tableau Server. A dashboard that doesn’t execute on Tableau Desktop won’t execute on Tableau Server. This is important to keep in mind.
In fact, getting things working on Desktop is a much better troubleshooting technique, because Tableau Server has more processes to consider when you troubleshoot. And if things work in Tableau Desktop but not in Tableau Server, then you can safely narrow the issue down to the processes in Tableau Server that aren’t in Tableau Desktop.
By default, the parameters from the connection URL override those in theSimba ODBC DSN. There are two ways you can customize the ODBC configurationsfrom Tableau:
.tds
file for a single data source:- Follow the instructions in Save Data Sources to export the
.tds
file for the data source. - Find the property line
odbc-connect-string-extras='
in the.tds
file and set the parameters. For example, to enableAutoReconnect
andUseNativeQuery
, you can change the line toodbc-connect-string-extras='AutoReconnect=1,UseNativeQuery=1'
. - Reload the
.tds
file by reconnecting the connection.
The compute resource is optimized to use less heap memory for collecting large results, so it can serve more rows per fetch block than Simba ODBC’s default. Append
RowsFetchedPerBlock=100000'
to the value of theodbc-connect-string-extras
property.- Follow the instructions in Save Data Sources to export the
.tdc
file for all data sources:- If you have never created a
.tdc
file, you can add TableauTdcExample.tdc to the folderDocument/MyTableauRepository/Datasources
. - Add the file to all developers’ Tableau Desktop installations, so that it works when the dashboards are shared.
- If you have never created a
There are a number of tactical chart optimizations that can help you improve the performance of your Tableau worksheets.
For filters that don’t change often and are not meant to be interacted with, use context filters, which speed up execution time.Another good rule of thumb is to use if/else
statements instead of case/when
statements in your queries.
Tableau can push down filters into data sources, which can greatly speed up query speeds. See Filtering Across Multiple Data Sources Using a Parameter and Filter Data Across Multiple Data Sources for more information about data source push down filters.
It is best to avoid table calculations if you can because they need to scan the full dataset. For more information about table calculations, see Transform Values with Table Calculations.
Here are a number of tips and troubleshooting exercises you can apply to improve the performance of your Tableau dashboard.
A common source of issues with Tableau dashboards connected to Databricks is the use of quick filters on individual dashboards that serve a number of different users, functions, or segments. You can attach global quick filters to all of the charts on the dashboard. It’s a great feature, but one that can quickly cause problems. One global quick filter on a dashboard with five charts will cause a minimum of 10 queries to be sent to Databricks. This can spiral to greater numbers as more filters are added and can cause massive performance problems, because Spark is not built to handle many concurrent queries starting at the same exact moment. This becomes more problematic when the Databricks Workspace cluster or SQL Analytics SQL endpoint that you are using is not large enough to handle the high volume of queries.
As a first step, we recommend that you use Tableau performance recording to troubleshoot what might be causing the issue.
If the poor performance is caused by sorting or visual layout, the problem may be the number of marks the dashboard is trying to return. Databricks can return one million records quickly, but Tableau may not be able to compute the layout and sort the results. If this is a problem, aggregate the query and drill into the lower levels. You can also try a bigger machine, since Tableau is only constrained by physical resources on the machine on which it is running.
Databricks Markdown Latex
For information about drilling down in Tableau, see Drill down into the details.
Databricks Markdown Table Of Contents
In general, seeing many granular marks is often a poor analytical pattern, because it doesn’t provide insight. Drilling down from higher levels of aggregation makes more sense and reduces the number of records that need to be processed and visualized.
Use actions to optimize dashboards
To drill from group to segment to article in order to obtain the same analysis and information as the “ocean boiled” dashboard, you can use Tableau actions. Actions allow you to click a mark (for example a state on a map) and be sent to another dashboard that filters based on the state you click. This reduces the need to have many filters on one dashboard and reduces the number of records that need to be generated, because you can set an action to not generate records until it gets a predicate to filter on.
For more information, see Actions and 6 Tips to Make Your Dashboards More Performant.
Caching data is a good way to improve the performance of worksheets or dashboards.
Caching in Tableau
Tableau has four layers of caching before it goes back to the data, whether that data is in a live connection or an extract:
- Tiles: If someone is loading the exact same dashboard and nothing changes, then Tableau tries to reuse the same tiles for the charts. This is similar to Google Maps tiles.
- Model: There are mathematical calculations used to generate visualizations in the event that tiles can’t be used. Tableau Server attempts to use the same models.
- Abstract: Aggregate results of queries are stored as well. This is the third “defense” level. If a query returns Sum(Sales), Count(orders), Sum(Cost), in a previous query and a future query wants just Sum(Sales), then Tableau grabs that result and uses it.
- Native Cache: If the query is the exact same as another one, Tableau uses the same results. This is the last level of caching. If this fails, then Tableau goes to the data.
Caching frequency
Tableau has administrative settings for caching more or less often. If the server is set to Refresh Less Often, Tableau keeps data in the cache for up to 12 hours. If it is set to Refresh More Often, Tableau goes back to the data on every page refresh.
Customers who have the same dashboard being used over again—for example, “Monday morning pipeline reports”—should be on a server set to Refresh Less Often so that the dashboards all use the same cache.
Cache warming
In Tableau you can warm the cache by setting a subscription for the dashboard to be sent before you want the dashboard viewed.This is because the dashboard needs to be rendered in order to generate the image for the subscription email. SeeWarming the Tableau Server Cache Using Subscriptions.
Caching in Databricks
You can cache popular tables or critical tables before users consume Tableau dashboards to reduce the time it takes for Databricks to return the results to Tableau.
You can run scripts in the morning to SELECTCACHE
for specific tables with Delta caching on virtual machines that are optimized for caching.
Databricks Markdown Image
You can even cache specific queries that Databricks gets from the driver.