How we decided on using Apache Superset for embedded analytics
At funda, we aim to deliver innovative tools for our users. In this blog post, Stephen Willems, Lead Service Owner, explains how we selected the right analytics tool for our brokers platform. From research to proof of concept, discover why Apache Superset became our solution of choice.
Vendor selection
First, a little background: as an online marketplace, we strive to provide the best market insights to sellers and buyers, and the best way to deliver them is by embedding analytics within the website.
We started by listing our requirements for visualizations, security, performance, and supported data sources. Based on these criteria, we created a list of vendors to research and evaluate. This list included many large, NASDAQ-listed vendors, as well as some open-source options like Apache Superset, Cube, and Metabase.
The next step in our selection process was desk research on each vendor, followed by a proof of concept (PoC) in which we checked some mandatory use cases against an anonymized dataset on the most promising candidate(s).
- Desk research is the selection phase where each vendor is reviewed based on what’s available in their marketing and technical documentation. This often yields quick results.
- The PoC phase is a more in-depth and time-consuming practical validation of the desk research. For a commercial vendor, a Request for Quotation (RFQ)-type process may be necessary due to the volume and features required. The PoC requires access to a working instance of the service, so a free tier or time-limited account without an upfront ‘hard sell’ is a plus.
Desk research
This kind of research can also surface, with little effort, other qualities or requirements that are of interest or outright disqualifying. For example, poor-quality documentation and pricing information, or barriers to accessing them (such as having to log in, go through sales staff or complete a formal RFQ-type process), are somewhat negative, suggesting a lack of openness and a smaller community of enterprise users.
Besides validating the basic features, our desk research quickly identified one key implementation detail that could split the field. It was clear there were two implementation camps:
- Embed an iframe in the site with some authentication
- Model and present using a Software Development Kit (SDK)
To determine which camp would work for us, we needed to know how much control over the design was required: was it form over function? An iframe is quick and requires minimal developer effort, but offers limited control. The SDK route offers full control, but requires more developer effort and ongoing maintenance.
Our target audience in this case is real estate agents, who are sensitive to design but arguably more interested in data correctness, flexibility and how we can improve insights over time. So we sided with the iframe: function over form. But was there more to know about the iframe implementation?
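To make the contrast concrete, here's a minimal sketch of the iframe camp. It's deliberately generic rather than vendor-specific: the dashboard URL, the token parameter and the container id are hypothetical placeholders, and in practice the short-lived token would be issued by your backend.

```ts
// Generic illustration of the iframe camp (not any specific vendor's API).
// The host page points an iframe at a dashboard URL carrying a short-lived
// token; the URL shape and "token" parameter are hypothetical placeholders.
const shortLivedToken = "token-issued-by-our-backend";

const iframe = document.createElement("iframe");
iframe.src =
  "https://analytics.example.com/dashboards/market-insights" +
  `?token=${encodeURIComponent(shortLivedToken)}`;
iframe.style.width = "100%";
iframe.style.height = "600px";
iframe.style.border = "none";

// Mount it wherever the insights should appear on the page.
document.getElementById("insights")?.appendChild(iframe);
```

An SDK-based integration would instead pull data and render charts with the host application's own components, which is where the extra control, and the extra maintenance, come from.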
An old-school iframe
Beyond the user requirement, there was a host of other reasons the iframe was more desirable. It empowers the data analysts and the data team to be independent: they can use tools they manage end-to-end, and they are free to change everything with drag and drop, SQL and a few clicks.
It also avoids burdening a product development team that isn't usually set up to handle data visualization, unless that's its speciality. Finally, it saves the step of having to dashboard the data to understand it, only to then reimplement everything in code. Yes, there's a library for everything nowadays, but that route requires the ability to code and is slower to change and experiment with. There's a reason specialized data visualization tools exist.
See also: Decoding funda's tech stack: the reasons behind our choices
There’s also limited coupling between the host page and the iframe, so if there’s a decision to change to another solution in the future this is more straightforward.
But there's a downside here too: iframes are unlikely to play well on small-screen devices, and developers can struggle with them due to the lack of control over layout, size and inter-frame communication. That said, we'd arrived at a very typical analytics tool and a level of independence for the analyst team, which are big wins.
Deciding on the iframe also narrowed the field: eliminating the SDK-only options left us with the vendors that supported both approaches, or just the iframe.
Other features
We continued the desk research on the remaining tools and noticed there was no significant difference between them in their filters, aggregations and visualizations. This may be a point of contention, but unless you have a very specific use case, almost all of them offer the same chart types and (cross-)filters.
They do differ in their data connectors and data modelling capabilities, but we already use Stitch Data and dbt for our ETL and data modelling. If we didn't already have these tools, the selection process might have taken a very different turn.
They all had similar performance, largely determined by either the cache size or the performance of the underlying data source. The same could be said about security: each of them had some form of access control that would fulfil most use cases.
The next major difference was commercial: our list contained a mix of open-source and closed-source options. This is where cost, time and effort would become factors.
At what cost
Analytics services are typically charged per viewer at $10-15 per month and per editor at $45-75 per month, plus storage and/or compute, and often a fixed fee of thousands per month.
Unit and fixed pricing are usually negotiable at scale, which softens the blow somewhat. But we need to support thousands of logged-in viewers, so the costs ramp up quickly: at 2,000 viewers, for example, even the low $10 per-viewer rate already works out to $20,000 per month before any fixed fees. This was a key factor.
There's also an indirect cost when you need a feature that isn't available by default. Changes may be possible at a price, but vendors often do not support creating custom components in the iframe, or more than minor changes to the default components via the UI.
These points made the open-source options more interesting. The ability to use every feature under a very low-cost (free-ish) model is great. Additionally, being able to adjust what's under the hood, somewhat akin to the SDK route while retaining the iframe/traditional analytics approach, is very appealing.
The cost of open source is largely determined by maintenance hours, compute and storage, which, as a tech company, we are able to assess and handle. In past cases this has been predictable, so right now we favor open source.
Open source as a solution
The desk research identified Apache Superset as an ideal contender in the open-source category. It has most, if not all, of the same features as the commercial options. It's one of the most mature tools in the open-source community, and there's a managed cloud version via preset.io in case you outgrow self-hosting. You're in good company, too: the project maintains a public list of other Superset users in the wild.
The other tools we researched in the open-source space were still too immature or had a mixed model where the embedded features were part of the commercial package, e.g. Metabase.
Proof of concept
We continued to the PoC phase of the selection process knowing that the iframe would be a good solution and open source would be the most desirable from a flexibility and cost point of view. Apache Superset was our top contender, so we moved forward with it.
We quickly set up an instance on Azure, connected a BigQuery data source and implemented visualizations that we expected to eventually use in production. All of this was straightforward.
There were some rough edges in the styling, but these were fixed using CSS. The tension between our site's aesthetics and what's possible in Superset was most evident here, and we recognized that close collaboration between an analyst and a UX designer would be needed; something we had already expected based on the desk research.
The available security features allowed us to limit users to their own data, and the ability to use Jinja templating within SQL for dynamic query generation improved query performance. Another useful feature was the ability to set filters via URLs.
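As a sketch of how the "own data only" restriction can work in Superset: the backend requests a guest token scoped to one dashboard and can attach a row-level security (RLS) clause that Superset applies to every query the embedded dashboard runs. The snippet below assumes a Node/TypeScript backend; the service account, dashboard UUID and `broker_id` column are our own illustrative placeholders, not funda's actual setup.

```ts
// Sketch: issue a Superset guest token scoped to one dashboard and one
// broker's rows. SUPERSET_URL, the service account, DASHBOARD_UUID and the
// broker_id column are placeholders for illustration.
const SUPERSET_URL = "https://superset.example.com";
const DASHBOARD_UUID = "uuid-from-the-embed-dashboard-dialog";

async function loginAsServiceAccount(): Promise<string> {
  // Authenticate a Superset service account to obtain an access token.
  const res = await fetch(`${SUPERSET_URL}/api/v1/security/login`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      username: "embed-service", // placeholder service account
      password: process.env.SUPERSET_PASSWORD,
      provider: "db",
      refresh: true,
    }),
  });
  const { access_token } = await res.json();
  return access_token;
}

export async function fetchGuestToken(brokerId: number): Promise<string> {
  const accessToken = await loginAsServiceAccount();
  // Request a guest token limited to one dashboard, with an RLS clause so
  // the viewer only ever sees rows belonging to their own brokerage.
  const res = await fetch(`${SUPERSET_URL}/api/v1/security/guest_token/`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${accessToken}`,
    },
    body: JSON.stringify({
      user: { username: `broker-${brokerId}` },
      resources: [{ type: "dashboard", id: DASHBOARD_UUID }],
      rls: [{ clause: `broker_id = ${brokerId}` }],
    }),
  });
  const { token } = await res.json();
  return token;
}
```

The Jinja templating mentioned above is complementary: macros such as `current_username()` can be referenced inside a dataset's SQL, so the query itself adapts to the viewer.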
Additionally, our developers were able to add the iframe to a host page with access control in a few hours. There were some teething pains when following the online documentation, due to differences between versions.
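For reference, the embedding itself boils down to very little host-page code when you use Superset's @superset-ui/embedded-sdk, which creates and sizes the iframe for you. The container id and the `/api/guest-token` endpoint (which would serve tokens like the backend sketch above) are assumptions of ours:

```ts
import { embedDashboard } from "@superset-ui/embedded-sdk";

// Mount the embedded dashboard. The SDK injects the iframe into mountPoint
// and re-fetches the guest token before it expires.
embedDashboard({
  id: "uuid-from-the-embed-dashboard-dialog",
  supersetDomain: "https://superset.example.com", // placeholder domain
  mountPoint: document.getElementById("superset-container")!, // placeholder id
  fetchGuestToken: async () => {
    // Our own (hypothetical) backend endpoint wrapping the guest-token call.
    const res = await fetch("/api/guest-token");
    const { token } = await res.json();
    return token;
  },
  dashboardUiConfig: { hideTitle: true, filters: { expanded: false } },
});
```

Version mismatches between the SDK, the documentation and the deployed Superset instance are exactly where the teething pains mentioned above tend to appear.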
There are great resources for Superset. The preset.io documentation is largely relevant, so look there for more specific details, but be mindful that there may be subtle differences.
The PoC confirmed what the desk research had suggested: Apache Superset could satisfy our current use cases in terms of visualization, security, data sources, cost and flexibility.
Apache Superset selected
We quickly went from PoC to a self-hosted multi-environment setup for development, acceptance and production using the open-source Helm chart. Internal support hours so far have been minimal, partly because we're fortunate to have a talented SRE team - thanks, Max Flentge!
We now have three dashboards in production and one more on the way, which will include one customized visualization - proving the benefits of open source. The setup has been remarkably reliable so far; credit goes to all the developers who have contributed to the Apache Superset project! Their work has been invaluable in making this tool both robust and versatile.
See also: Game changer: why we ditched Selenium for Playwright
Question?
Do you have a burning question for Stephen after reading this blog? Feel free to reach out to him via email.