What is Big Data Analytics?
In a nutshell, it means the meaningful insights, such as correlations and hidden patterns, that are derived from bringing together and then examining large volumes of data.
Okay, but can you elaborate, please?
Imagine you own a company that has been in operation for the last few years. Besides certain industry-specific solutions, you might have a core system or an ERP, or separate components: a financial and accounting system, a customer relationship management (CRM) system, a procurement system, a human resource management system, several in-house developed systems, some competitive benchmarking data in Excel sheets and so on. You want to use data to help you make business decisions, but you realise that it is too difficult, or near impossible. The systems do not talk to each other. Some are installed on servers within your premises. Some are hosted in datacentres while others reside in the cloud.

This is where big data becomes extremely useful. Big data technology allows you to process large volumes of data that can come in any shape, form or size. The data has to undergo a series of transformation and cleansing processes before it can be consumed by downstream analytical tools, the end result of which would be beautiful reports, charts, graphs and dashboards.

With that said, visualisation and reporting are just the tip of the iceberg. Big data can help you discover and uncover facts you never knew existed: hidden patterns, habits, trends and much more. You can also choose to add machine learning and artificial intelligence to create predictive and decision engines that suggest actions and decisions to bring your business to the next level. There are multiple routes you can take to achieve your intended objectives, but in order to understand which route to choose, you must first understand the components that make up a big data solution.
"Data is the new oil. It's valuable, but if unrefined it cannot be used."
- CLIVE HUMBY, British Mathematician and Data Science Entrepreneur
Data is the New Oil
"Data is the new oil. It's valuable, but if unrefined it cannot really be used." The phrase was originally coined by Clive Humby, a British mathematician and data science entrepreneur. It was immortalized over the last decade as it was echoed by Gartner, The Economist, Forbes and pretty much every article that speaks about the importance of data. From marketing and finance to logistics and product, decision-making at most large private corporations today is driven by data at all levels.
While data warehouses and data lakes may share some similar features and use cases, there are fundamental differences between the two in terms of design characteristics, data management philosophies and ideal use conditions. To understand which works best for you, we need to first understand the difference between the two.
Data Ingestion
Data ingestion is the process of collecting data from various data sources, e.g. data lakes, databases, sensors, IoT devices, SaaS applications, social media and even manual Excel files, and loading it into a target data repository, most often a data warehouse. Knowing the type, size and complexity of the data will eventually influence your decision on what to do next and how to do it. A very crucial decision to be made at this stage is whether to choose real-time data ingestion or batch data ingestion.
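As a rough sketch of the batch option, the snippet below collects a whole file's worth of rows and loads them into a warehouse table in one go. SQLite stands in for the warehouse, and the table name, columns and CSV contents are hypothetical, for illustration only.

```python
import csv
import io
import sqlite3

# Stand-in "warehouse": an in-memory SQLite database.
# Table name and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id TEXT, amount REAL)")

# Batch ingestion: read the whole source file, then load it in one operation.
raw_csv = io.StringIO("order_id,amount\nA-1,19.90\nA-2,45.00\n")
rows = [(r["order_id"], float(r["amount"])) for r in csv.DictReader(raw_csv)]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
conn.commit()

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)
```

Real-time ingestion would instead insert each record as it arrives, trading the simplicity of scheduled batches for lower latency.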
Data Lake
A data lake is a centralized data repository system where data from a variety of sources, whether structured, semi-structured or unstructured, is stored as-is in its raw format, usually accomplished through an ELT (Extract-Load-Transform) process. Data lakes help eliminate data silos as they can act as a single landing zone for data from multiple sources. Data lakes ingest all data types in their source format without the need for structures and pre-defined data schemas. Data is aggregated and transformed only at the point of query. This data model is known as schema-on-read.
In an ELT process, data is extracted from data sources and loaded into the data lake in its raw form. The data is only transformed at the point of query.
The schema-on-read data model does not require a clear understanding of use cases up front. Compute resources are consumed only at the point of query, making it more cost-effective.
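The schema-on-read idea can be sketched in a few lines. Below, raw JSON records land in a list that stands in for the lake, exactly as they arrived and with no predefined schema; structure is imposed only when a query runs. The field names are hypothetical.

```python
import json

# Data lake stand-in: raw records are landed as-is (the "EL" of ELT),
# with no predefined schema. Field names are hypothetical.
lake = [
    '{"user": "amy", "clicks": 3}',
    '{"user": "bob"}',                                     # missing field, still accepted
    '{"user": "eve", "clicks": "7", "note": "from CRM"}',  # extra field, loose typing
]

# Schema-on-read: structure is imposed only at the point of query (the "T").
def read_clicks(record: str) -> int:
    doc = json.loads(record)
    return int(doc.get("clicks", 0))

total_clicks = sum(read_clicks(r) for r in lake)
print(total_clicks)
```

Note that the messy second and third records were accepted at load time without complaint; only the query has to decide how to interpret them.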
Data Warehouse
A data warehouse is a data management system that provides business intelligence for structured operational data. Data warehouses use predefined data schema to ingest structured data usually accomplished through an ETL (Extract-Transform-Load) process. The data source must fit into a predefined structure before it can enter the warehouse. This is also known as schema-on-write data model. The data is later connected to downstream analytical tools for Business Intelligence (BI) initiatives.
In an ETL process, data must first be extracted onto the staging layer of the data warehouse. The data is then cleaned and transformed into Dimensional Models and finally loaded onto a Data Mart (or Data Cube) for reporting, dashboarding and visualizations.
The schema-on-write data model requires a clear understanding of use cases, as well as more time and compute resources up front.
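The contrast with the data lake can be sketched as a tiny ETL flow: records are extracted, cleansed and conformed to a predefined schema, and only then loaded. SQLite again stands in for the warehouse; the table, columns and sample records are hypothetical.

```python
import sqlite3

# Warehouse stand-in with a predefined schema (schema-on-write).
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_sales (order_id TEXT NOT NULL, amount_usd REAL NOT NULL)"
)

# Extract: raw records from a source system (inlined here for illustration).
raw = [
    {"order_id": "A-1", "amount": "19.90"},
    {"order_id": "A-2", "amount": ""},  # dirty row: no amount
]

# Transform: cleanse and conform BEFORE the data enters the warehouse.
clean = [
    (r["order_id"], float(r["amount"]))
    for r in raw
    if r["amount"]  # drop rows that cannot satisfy the schema
]

# Load: only schema-conformant rows are written.
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", clean)
row_count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(row_count)
```

Unlike the schema-on-read sketch, the dirty row is rejected at load time, which is exactly why schema-on-write demands clarity about the data and its use before ingestion.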
When to Use Data Lake and When to Use Data Warehouse
There is no hard and fast rule, but the below can be used as a quick guide:
Use Data Warehouse when you:
- have smaller data volumes that do not change too fast
- need periodic reports and dashboards like daily sales, weekly financial report or monthly performance reports
- deal with structured data
- have full clarity of what you want to accomplish (data can only be loaded into the data warehouse after its use has been defined)
- want to give fast and easy access to many operational business users
Use Data Lake when you:
- have larger data volumes that change rapidly
- need real-time/near real-time data that tells you what happened one minute or five minutes ago
- deal with structured, semi-structured and unstructured data in its raw form
- do not have full clarity of the use case (data is loaded as-is in its raw form without any transformation)
- want to give fast and easy access to a few data scientists and power users
"If you think of a data warehouse as a store of bottled water, cleansed and packaged and structured for easy consumption, the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."
- JAMES DIXON, Founder of Pentaho
The Verdict
A data warehouse is the better option for companies that have clearly defined use cases and operational business users who need periodic reports, dashboards and visualizations. However, a data warehouse can become very expensive to maintain as the volume of data increases. A data lake is the perfect solution for a centralized repository that eliminates data silos. It is fast to deploy and well suited to large data volumes. Since data is only transformed at the point of query, it is mainly data scientists and power users who consume the data directly. Having said that, this does not mean that data from a data lake cannot be consumed by analytical software, business intelligence tools and others. To harness the technological potential of both the data warehouse and the data lake, large corporations can choose to build a data lake and then load the data into data warehouses and data marts.