- HDInsight Essentials(Second Edition)
- Rajesh Nadipalli
- 1002字
- 2021-08-06 19:26:56
Enterprise Data Warehouse architecture
Over the last 3 decades, organizations have built EDW that consolidates data from various sources across the organization to enable business decisions, typically, related to current operational metrics and future what-if analysis for strategy decisions.
The following figure shows you a typical EDW architecture and also shows how information flows from the various source systems to the hands of business users:

Let's take a look at the stack from bottom to top.
Source systems
Typical data sources for an EDW are as follows:
- OLTP databases: These databases store data for transactional systems such as customer relationship management (CRM), Enterprise resource planning (ERP), including manufacturing, inventory, shipping, and others.
- XML and Text Files: Data is also received in the form of text files, which are generally delimited, or XML, or some other fixed format known within the organization.
Data warehouse
A data warehouse has two key subcomponents: storage and processing. Let's review these in detail.
The following are the key data stores for EDW:
- EDW: This is the heart of the complete architecture and is a relational database that hosts data from disparate sources in a consistent format such as base facts and dimensions. It is organized by the subject area/domain and preserves history for several years to enable analytics, trends, and ad hoc queries. An EDW infrastructure needs to be robust and scalable to meet the business continuity and growth requirements.
- Data marts: Each data mart is a relational database and is a subset of EDW typically, focusing on one subject area such as finance. It queries base facts from EDW and builds summarized facts and stores them as star or snowflake dimensional models.
- MDM: Master data management or MDM is a relational database that stores reference data to ensure consistent reporting across various business units of an organization. Common MDM datasets include products, customers, and accounts. MDM systems require governance to ensure reporting from various data marts that can be correlated and are consistent.
The following are the key processing mechanisms for EDW:
- ETL: Extract, Transform, and Load is a standard data warehouse design pattern that has three key steps: extract from various sources, transform to cleanse, and convert data to the information that is then loaded to various data marts for reporting. There are several tools in the marketplace such as Microsoft SQL Server Integration Services, Informatica, Pentaho, and others. ETL workflows are scheduled typically at daily frequency to update EDW facts and dimensions.
- SQL-based stored procedures: This is an alternative to using ETL tools and transform data natively using database features. Most relational databases provide custom stored procedure capabilities such as SQL Server, Oracle, and IBM.
User access
The following are typically used access mechanisms:
- BI dashboard: Business intelligence tools access the data from data marts to provide key performance indicators (KPI), scorecards, and dashboards. They allow the business to look at historical trends, current operational performance, perform what-if analysis, and predict future trends. There are several tools in the marketplace, including Microsoft SQL Server Reporting Services, Microsoft Power BI, Oracle Business Intelligence Enterprise Edition (OBIEE), SAP BusinessObjects, Tableau, and so on.
- Ad hoc analysis: IT organizations have realized the need to provide business with direct access to certain data for discovery and ad hoc analysis. Excel and several reporting tools fit this need.
- Operational reports: These are the day-to-day canned reports required to run the businesses such as daily sales collections, and customer support tickets opened and closed for the current day. These reports are generally required to be near real time and are based on one source system such as customer help desk system. There are several reporting systems such as SQL Server Reporting Services, IBM Cognos, and others that fit this need.
- Analytics: Analytical reports look for trends such as how is my customer satisfaction trending over time and the average operational overhead in dollars for support. These reports typically have a one day/week refresh cycle and collect information from multiple sources such as help desk system and PPM. Business intelligence tools typically fit this need.
Provisioning and monitoring
This area of the architecture is responsible for the following functions:
- Managing deployments of ETL code across various environments: development, test, and production
- Monitoring ingestions and jobs to ensure service-level agreements are met
- Operating procedures to recover in case of a failure
Data governance and security
Effective data governance in an enterprise ensures that data is consistent across departments and requires data stewards that identify and enforce the system of records. This requires tools, processes, and people to have high-quality data.
Security on the enterprise data warehouse affects all layers of the architecture and ensures that the right people have the right access to data. There are several technologies that enable fine-grained access at database level, table level, and row level. The filesystem can be encrypted to provide additional level of security.
Pain points of EDW
Based on an IDC research, by 2020, the digital universe will reach 40 ZB (zettabyte), which is a 50-fold growth from the beginning of 2010. (Reference: http://www.emc.com/about/news/press/2012/20121211-01.htm). Current IT data architectures based on EDW were not designed to handle this amount of data and are being stretched. The following are the key pain points of EDW:
- Scale: A data warehouse was built to handle data in terabytes (TBs) and currently business needs are reaching petabytes (PB). Typically, data warehouse DBAs archive data older than a certain date window like 5 years to address this issue. This data can be a useful information for long-term trends.
- Cost: EDW is typically an appliance-based model with proprietary engineered hardware and software supported by vendors such as Teradata, Oracle, IBM, and others. Upgrades are expensive and typically require all servers to be identical.
- Unstructured data: Data warehouses struggle to handle unstructured data such as logfiles, social media, and machine data.
- Timeliness: To produce business insights, data has to go through several transformations. Current EDW architectures are unable to meet growth and business demands for new insights on their data.
- Learning Neo4j
- Responsive Web Design by Example
- 小程序開發(fā)原理與實(shí)戰(zhàn)
- Jenkins Continuous Integration Cookbook(Second Edition)
- 區(qū)塊鏈技術(shù)與應(yīng)用
- C語言程序設(shè)計(jì)
- Clojure for Java Developers
- Illustrator CS6設(shè)計(jì)與應(yīng)用任務(wù)教程
- 計(jì)算語言學(xué)導(dǎo)論
- Mastering Object:Oriented Python(Second Edition)
- Java程序設(shè)計(jì)
- JavaScript Unit Testing
- Socket.IO Cookbook
- MySQL核心技術(shù)與最佳實(shí)踐
- Mastering R for Quantitative Finance