Building a Scalable Data Warehouse with Data Vault 2.0 - Daniel Linstedt, Michael Olschimke

Building a Scalable Data Warehouse with Data Vault 2.0 (eBook)

eBook Download: PDF | EPUB
2015 | 1st edition
684 pages
Elsevier Science (publisher)
978-0-12-802648-9 (ISBN)
The Data Vault was invented by Dan Linstedt at the U.S. Department of Defense, and the standard has been successfully applied to data warehousing projects at organizations of different sizes, from small to large-size corporations. Due to its simplified design, which is adapted from nature, the Data Vault 2.0 standard helps prevent typical data warehousing failures. 'Building a Scalable Data Warehouse' covers everything one needs to know to create a scalable data warehouse end to end, including a presentation of the Data Vault modeling technique, which provides the foundations to create a technical data warehouse layer. The book discusses how to build the data warehouse incrementally using the agile Data Vault 2.0 methodology. In addition, readers will learn how to create the input layer (the stage layer) and the presentation layer (data mart) of the Data Vault 2.0 architecture, including implementation best practices. Drawing upon years of practical experience and using numerous examples and an easy-to-understand framework, Dan Linstedt and Michael Olschimke discuss:
- How to load each layer using SQL Server Integration Services (SSIS), including automation of the Data Vault loading processes
- Important data warehouse technologies and practices
- Data Quality Services (DQS) and Master Data Services (MDS) in the context of the Data Vault architecture
The book also:
- Provides a complete introduction to data warehousing, applications, and the business context so readers can get up and running fast
- Explains theoretical concepts and provides hands-on instruction on how to build and implement a data warehouse
- Demystifies data vault modeling with beginning, intermediate, and advanced techniques
- Discusses the advantages of the data vault approach over other techniques, also including the latest updates to Data Vault 2.0 and multiple improvements to Data Vault 1.0

Dan Linstedt has more than 25 years of experience in the Data Warehousing and Business Intelligence field and is internationally known for inventing the Data Vault 1.0 model and the Data Vault 2.0 System of Business Intelligence. He helps business and government organizations around the world achieve BI excellence by applying his proven knowledge in Big Data, unstructured information management, agile methodologies and product development. He has held training classes and presented at TDWI, Teradata Partners, DAMA, Informatica, Oracle user groups and the Data Modeling Zone conference. He has a background in SEI/CMMI Level 5, has contributed architecture efforts to petabyte-scale data warehouses, and offers high-quality online training and consulting services for Data Vault.

Chapter 2

Scalable Data Warehouse Architecture


Abstract


Scalable data warehouses, as a desired solution to some of the problems introduced in the previous chapter, have specific architectural dimensions that are explained in this chapter, including workload, data complexity, query complexity, availability and data latency.

This chapter introduces the architecture of a Data Vault-based data warehouse, including the stage, data warehouse, and information mart layers. In addition, it shows how to use a business vault and other components of the proposed architecture.

Keywords


data complexity
query complexity
data latency
data warehouse
data vault
data architecture
Today’s data warehouse systems make it easy for analysts to access integrated data. To achieve this, the data warehouse development team has to process and model the data based on the requirements of the business users. The best approach for developing a data warehouse is an iterative development process [1]. That means that the functionality of the data warehouse, as requested by the business users, is designed, developed, implemented and deployed in iterations (sometimes called sprints or cycles). In each iteration, more functionality is added to the data warehouse. This is the opposite of a “big-bang” approach, where all functionality is developed in one large process and finally deployed as a whole.
However, when executing the project, even with an iterative approach, the effort (and the cost tied to it) to add new functionality usually increases because of existing dependencies that have to be taken care of.
Figure 2.1 shows that the effort to implement the first information mart is relatively low. But when implementing the second information mart, the development team has to maintain the existing solution and take care of existing dependencies, for example on data sources integrated for the first information mart or on operational systems consuming information from existing tables. To make sure that this previously built functionality doesn’t break when the new functionality for the second information mart is deployed, the old functionality has to be retested. In many cases, the existing solution needs to be refactored to maintain the functionality of the individual information marts when new sources are added to the overall solution. All these activities increase the effort for creating the second information mart and, equally, any subsequent information mart or other new functionality. This additional effort is depicted as the rise of the graph in Figure 2.1: once the first information mart is produced, the solution falls into a maintenance mode for all existing functionality. The next project implements another information mart. To implement the second information mart, the effort includes adding the new functionality and maintaining the existing functionality. Because of these dependencies, the existing functionality needs to be refactored and retested regularly.
Figure 2.1 The maintenance nightmare [2].
In other words, the extensibility of many data warehouse architectures, including those presented in Chapter 1, Introduction to Data Warehousing, is not optimal. Furthermore, typical data warehouse architectures often lack dimensions of scalability other than the described dimension of extensibility. We discuss these dimensions in the next section.

2.1. Dimensions of Scalable Data Warehouse Architectures


Business users of data warehouse systems expect to load and prepare more and more data, in terms of variety, volume, and velocity [3]. In addition, the workload that is put on typical data warehouse environments keeps increasing, especially once the initial version of the warehouse has become a success with its first users. Therefore, scalability has multiple dimensions.

2.1.1. Workload


The enterprise data warehouse (EDW) is “by far the largest and most computationally intense business application” in a typical enterprise. EDW systems consist of huge databases, containing historical data with volumes ranging from multiple gigabytes to terabytes of storage [4]. Successful EDW systems face two issues regarding the workload of the system: first, they experience rapidly increasing data volumes and application workloads and, second, an increasing number of concurrent users [5]. In order to meet the performance requirements, EDW systems are implemented on large-scale parallel computers, such as massively parallel processing (MPP) or symmetric multiprocessor (SMP) system environments and clusters, and on parallel database software. In fact, most medium- to large-size data warehouses would not be implementable without large-scale parallel hardware and parallel database software to support them [4].
Handling the requested workload requires more than parallel hardware and parallel database software, however. The logical and physical design of the databases has to be optimized for the expected data volumes [6-8].
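Because the book's examples target Microsoft SQL Server, a minimal, hypothetical sketch of what such physical design optimization can look like there is date-based partitioning of a large, append-only table. The object and column names below (pfLoadDate, psLoadDate, SatCustomer) are illustrative assumptions for this sketch, not definitions taken from the book:

```sql
-- Sketch (assumed names): partition a large, frequently loaded table by load date
-- so that loads and queries on recent data only touch a few partitions.
CREATE PARTITION FUNCTION pfLoadDate (DATE)
AS RANGE RIGHT FOR VALUES ('2015-01-01', '2015-04-01', '2015-07-01', '2015-10-01');

CREATE PARTITION SCHEME psLoadDate
AS PARTITION pfLoadDate ALL TO ([PRIMARY]);  -- in practice, spread over several filegroups

CREATE TABLE dbo.SatCustomer (
    CustomerHashKey CHAR(32)      NOT NULL,  -- hash key referencing the parent hub
    LoadDate        DATE          NOT NULL,  -- partitioning column
    RecordSource    VARCHAR(100)  NOT NULL,
    CustomerName    NVARCHAR(200) NULL,
    CONSTRAINT PK_SatCustomer PRIMARY KEY CLUSTERED (CustomerHashKey, LoadDate)
) ON psLoadDate (LoadDate);
```

A real design for the expected volumes would also weigh filegroup placement, compression, and index strategy; the point here is only that physical design decisions, not just hardware, determine whether the workload can be handled.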

2.1.2. Data Complexity


Another dimension of enterprise data warehouse scalability is data complexity. The following factors contribute to the growth of data complexity [9]:
• Variety of data: nowadays, enterprise organizations capture more than just traditional (e.g., relational or mainframe) master or transactional data. There is an increasing amount of semi-structured data, for example emails, e-forms or HTML and XML files, and unstructured data, such as document collections, social network data, images, video and sound files. Another type of data is sensor- and machine-generated data, which might require specific handling. In many cases, enterprises try to derive structured information from unstructured or semi-structured data to increase the business value of the data (a small sketch of this follows after this list). While the files may have a structure, the content of the files doesn’t have one. For example, it is not possible to find the face of a specific person in a video without fully processing all frames of the video and building metadata tags to indicate where faces appear in the content.
• Volume of data: the rate at which companies generate and accumulate new data is increasing. Examples include content from Web sites or social networks, document and email collections, weblog data and machine-generated data. The increased data volume leads to much larger data sets, which can run into hundreds of terabytes or even petabytes of data or beyond.
• Velocity of data: not only do the variety and volume of data increase, but the rate at which the data is created also increases rapidly. One example is financial data from financial markets such as the stock exchange. Such data is generated at very high rates and immediately analyzed in order to respond to changes in the market. Other examples include credit card transaction data for fraud detection and sensor data or data from closed-circuit television (CCTV), which is captured for automated video and image analysis in real-time or near-real-time.
• Veracity (trustworthiness) of data: in order to have confidence in data, it must have strong data governance, lineage traceability and robust data integration [10].
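As a hedged illustration of deriving structured information from semi-structured data on the SQL Server stack the book uses, the following T-SQL shreds XML e-form submissions from a staging table into relational columns. The table, column, and element names (stage.EFormSubmission, Payload, Form) are assumptions made for this sketch, not objects defined in the book:

```sql
-- Assumed staging table: stage.EFormSubmission(SubmissionId INT, Payload XML).
-- Extract selected fields from each XML e-form into structured columns.
SELECT f.SubmissionId,
       x.frm.value('(CustomerName/text())[1]', 'NVARCHAR(200)') AS CustomerName,
       x.frm.value('(Email/text())[1]',        'NVARCHAR(200)') AS Email,
       x.frm.value('(SubmittedOn/text())[1]',  'DATETIME2')     AS SubmittedOn
FROM stage.EFormSubmission AS f
CROSS APPLY f.Payload.nodes('/Form') AS x(frm);
```

Unstructured content such as video or free text is far harder: as the example above notes, it typically requires dedicated processing (e.g., tagging frames with metadata) before anything set-oriented can be done with it.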

2.1.3. Analytical Complexity


Due to the availability of large volumes of data with high velocity and variety, businesses demand different and more complex analytical tasks to produce the insight required to solve their business problems. Some of these analyses require that the data be prepared in a fashion not foreseen by the original data warehouse developers. For example, the data fed into a data mining algorithm may have different characteristics regarding the variety, volume and velocity of data.
Consider the example of retail marketing: the campaign accuracy and timeliness need to be improved when moving from retail stores to online channels where more detailed customer insights are required [11]:
• In order to determine customer segmentation and purchase behavior, the business might need historical analysis and reporting of customer demographics and purchase transactions.
• Cross-sell opportunities can be identified by analyzing market baskets that show products that can be sold together (a simple query sketch follows after this list).
• To understand the online behavior of their customers, click-stream analysis is required. This can help to present up-sell offers to the visitors of a Web site.
• Given the high amount of social network data and user-generated content, businesses tap into the data by analyzing product reviews, ratings, likes and dislikes, comments, customer service interactions, and so on.
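As a minimal, hypothetical sketch of the market-basket analysis mentioned above (the table and column names SalesLineItem, TransactionId, ProductId are assumptions, not from the book), a set-oriented query can count how often two products appear in the same transaction:

```sql
-- Count product pairs that occur in the same sales transaction;
-- frequently co-occurring pairs are simple cross-sell candidates.
SELECT a.ProductId AS ProductA,
       b.ProductId AS ProductB,
       COUNT(*)    AS BasketCount
FROM dbo.SalesLineItem AS a
JOIN dbo.SalesLineItem AS b
  ON  a.TransactionId = b.TransactionId
  AND a.ProductId     < b.ProductId     -- count each unordered pair once
GROUP BY a.ProductId, b.ProductId
HAVING COUNT(*) >= 50                   -- arbitrary support threshold
ORDER BY BasketCount DESC;
```

A full market-basket analysis would typically compute support, confidence, and lift rather than raw pair counts, but even this simple form shows why such analyses need transaction-level detail that a pre-aggregated mart may not have preserved.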
These examples should make it clear that, in order to solve such new and complex analytical tasks, data sources of varying complexity are required. Also, mixing structured and unstructured data is becoming more and more common [11].

2.1.4. Query Complexity


When business intelligence (BI) vendors select a database management system for the storage and management of warehouse data, a relational database management system (RDBMS) is a natural choice. Relational databases provide simple data structures and high-level, set-oriented languages that make...
