Jeff Nappi is the Director of Software Engineering at ClearVoice, which delivers high-quality content marketing campaigns by leveraging big data. Their data analytics stack at Clearvoice includes AWS Redshift, Stitch Data and Periscope Data, as described below.
Historically, if you wanted to report against all of the business operations of your company, it was a very expensive ordeal. At ClearVoice, we needed to be able to collect data across many platforms, including our own, and combine it all into a single view of how the business was performing. We looked at several all-in-one providers and integrated business intelligence tools but found their rigid nature created a significant amount of friction that kept us from getting things done. In addition, the cost of such solutions is often much higher than a typical startup would be able to stomach.
We began our hunt for the perfect business intelligence solution back in January of 2016. We evaluated products from a variety of vendors. It was an exhaustive effort in navigating the sales process of each company, grokking the feature sets of their products and evaluating the fit for our needs.
After evaluating a few all-in-one solutions, we came to the conclusion that they had quite a few downsides. First, there was significant lock-in. Once you commit to a provider that handles both your data pipeline and your analytics interface, you'll have a hard time making changes in the future. We also have a complex product with a marketplace as well as subscription and services revenue. None of the out-of-the-box solutions are really capable of handling our complex scenarios without a significant amount of customization that would primarily take place within a proprietary system. The takeaway here is that if you have a straightforward business model -- like a traditional e-commerce site or a simple software as a service (SaaS) subscription business, you may be able to find an all-in-one solution. If you have a more complex business, though, you will get better results combining multiple services to get the best solution.
Once we had decided that we wanted to retain control over our data, we knew that we had to gather three primary pieces of our system and proceed with integrating them. We needed a data service that would handle integrating with all of the services we wanted to pull data from: our application databases, customer relationship management system (CRM), accounting software, web analytics, billing and email automation system. Second, we needed a data warehouse service that was capable of storing all of our information in a single database that was scalable and had a common SQL interface with powerful multi-node cluster performance characteristics. Lastly, we needed a visualization tool that would sit on top of the SQL-based data warehouse and give us infinite flexibility in how we report on and analyze our data in a user-friendly and efficient manner.
We'll start with the back-end and move up the stack. First, we needed to set up the actual data warehouse, and that was one of the most straightforward decisions. Each of the cloud providers has solutions with powerful query capabilities in a clustered data warehouse system. We are already heavy users of one of the major clouds, and it was a no-brainer to adopt their solution in terms of our needs. The columnar analytical database provides advanced abilities to execute queries in parallel across many nodes -- this is an important ability as your query complexity and data size grows.
Once we had the data warehouse back-end up and running, we took a close look at the available data-pipeline services. These are a new breed of providers who integrate with all of the services and databases you use and replicate your data into a data warehouse, historically a labor-intensive process involving ETL developers and manual processes. While there are several SaaS providers out there providing this, there is also a fantastic new open-source project called Singer that makes it possible to set up your own in-house solution while leveraging the community to develop sources (taps) and destinations (targets).
Finally, and most importantly to our business users and analysts, we evaluated the visualization tools. Our specific need was to be able to build dashboards with a direct SQL interface. The dashboarding tool we selected allows you to go from SQL to a beautiful chart in moments. The interface is lightning fast, the charting abilities are very powerful and they have insanely good customer service. Our providers' crowning achievement is their chat support: They are instantly responsive on chat and can guide you through any issue you have with the software. The support is so good that it's one of the key things they point you to the moment you engage with them.
So there you have it. We've been able to build a full-stack analytics solution that integrates all of the data from our organization in one place, allowing us to join results across data from every service and database we use to run our business. We can answer just about any question that comes up from our executive, finance or success teams in a matter of minutes.
If you are looking to build a data lake reporting solution for your business, I recommend you follow these steps:
1. Identify if your business model and the services you use are well supported by existing tools. Start there if so.
2. Make a list of the data sources you need to report from and what data you would like to collect from them.
3. Research vendors that provide a data-pipeline service, or use your own development resources to make use of Singer.
4. Set up a data warehouse solution for storing your data. If you have a smaller data set, you can get away with using standard open-source databases such as Postgres or MySQL.
5. Evaluate analytics tools that are the best fit for your organization. I recommend that an SQL-based tool be used to provide the most flexibility. However, there are many BI tools out there that allow data sets to be set up to leverage SQL but provide a simplified interface for putting together dashboards.
6. Circle back to your reports periodically to validate and remove any out-of-date or inaccurate data as your business evolves.