Data Preparation

Data Preparation refers to the process of integrating data sources, clean and transform data into formats that are ready for reporting and analysis.

Introduction#

Your company has a lot of data from multiple sources (multiple application databases, google sheets, excel files, 3rd party sources etc). You need to combine these sources to build reports and draw business insights.

In order to do this, these data need to be combined into a single place (database). That is the data warehouse, and the process/flow in which the data is regularly synced is called the data pipeline.

At the simple high level, there are 4 steps to build this process:

  1. Setup Data Warehouse: A database is setup and configured to be a data-warehouse
  2. Load/Import Data from multiple sources into a central database (data warehouse)
  3. Summarize/Transform Data: Data in data-warehouse are then pre-processed, summarized to aggregated tables for faster queries
  4. Data Reporting: A data reporting/BI tool is then connected to your data warehouse to build reports and dashboards

Building The Pipeline#

With Holistics Data Preparation, the steps to build the data pipeline are as follows:

1. Setup Data Warehouse#

Customer will setup a database on their own server to be a data-warehouse. This database is where all the data will be loaded into and summarized for analytics.

If your data is not big, you can start with simple database like PostgreSQL, SQL Server, etc. When your data grows, you can move to other platforms like Amazon Redshift, Google BigQuery, etc.

2. Sync Production DB to Data Warehouse#

Configure Holistics Data Transports to sync data from your production database into your data warehouse. Standard practice is to sync daily, but you can configure to sync it more regularly.

At this stage, your data warehouse is ready and already have data from your production database. You can immediately hook Holistics Reporting into the data-warehouse to build reports and dashboards.

3. Sync Spreadsheets to Data Warehouse#

If you have regular Excel, Google Sheets you want to sync into the data warehouse, Holistics Data Imports can be set up to sync these data.

4. Summarize / Transform Data#

When your data gets big or complicated, you will want to pre-summarize data. This is when you can use Holistics Data Transforms to summarize data from raw tables into summarized tables.

Let’s walk through an example after the pipeline is setup:

  • Everyday after midnight, data from application databases and different data sources are copied to your data warehouse. Your data warehouse should have all data for yesterday.
  • Right after, these data are then transformed and summarized into summarized tables
  • In the morning, when the team starts, they can log in to dashboards and see the up-to-date data for yesterday. Since the tables are summarized, accessing these reports are fast

Benefits#

Benefits of using Holistics Data Preparation:

1. Agile and Flexible#

The process is designed such that when a new report requirement comes in, Data team can easily import new tables to data warehouse and immediately build reports out of it (agile).

There is no need to analyze, plan and design the data warehouse (waterfall model) that might take weeks to implement.

2. Total Data Control#

Data from reporting to data-warehouse sits on customer’s data infrastructure; customer has total flexibility in choosing the data-warehouse technology, or when the data grows, they can easily scale up/scale down the database on their own terms.

3. No Coding Required#

Customers don’t need to write code to build data-warehouse.