Next-Generation Big Data

A Practical Guide to Apache Kudu, Impala, and Spark

Butch Quinto (Author)

Book | Softcover

XXIII, 557 pages

2018
Apress (Publisher)
978-1-4842-3146-3 (ISBN)

Details how to integrate popular third-party applications and platforms such as StreamSets, Zoomdata, Talend, Pentaho, Cask, Oracle, and SQL Server with next-generation big data technologies such as Kudu, Impala, and Spark
First book covering Apache Kudu, a game-changing relational data store from Cloudera that will disrupt the traditional data warehouse market
Features big data use cases and case studies from some of the most successful deployments—GoPro, Mastercard, British Telecom, Navistar, oPower, Cerner, Shopzilla, and Caesars Entertainment

Utilize this practical and easy-to-follow guide to modernize traditional enterprise data warehouse and business intelligence environments with next-generation big data technologies.

Next-Generation Big Data takes a holistic approach, covering the most important aspects of modern enterprise big data. The book covers not only the main technology stack but also the next-generation tools and applications used for big data warehousing, data warehouse optimization, real-time and batch data ingestion and processing, real-time data visualization, big data governance, data wrangling, big data cloud deployments, and distributed in-memory big data computing.

Finally, the book offers extensive and detailed coverage of big data case studies from Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard.

Install Apache Kudu, Impala, and Spark to modernize enterprise data warehouse and business intelligence environments, complete with real-world, easy-to-follow examples, and practical advice
Integrate HBase, Solr, Oracle, SQL Server, MySQL, Flume, Kafka, HDFS, and Amazon S3 with Apache Kudu, Impala, and Spark
Use StreamSets, Talend, Pentaho, and CDAP for real-time and batch data ingestion and processing
Utilize Trifacta, Alteryx, and Datameer for data wrangling and interactive data processing
Turbocharge Spark with Alluxio, a distributed in-memory storage platform
Deploy big data in the cloud using Cloudera Director
Perform real-time data visualization and time series analysis using Zoomdata, Apache Kudu, Impala, and Spark
Understand enterprise big data topics such as big data governance, metadata management, data lineage, impact analysis, and policy enforcement, and how to use Cloudera Navigator to perform common data governance tasks
Implement big data use cases such as big data warehousing, data warehouse optimization, Internet of Things, real-time data ingestion and analytics, complex event processing, and scalable predictive modeling
Study real-world big data case studies from innovative companies, including Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard

This book is for BI and big data warehouse professionals interested in gaining practical, real-world insight into next-generation big data processing and analytics using Apache Kudu, Impala, and Spark, and for those who want to learn more about other advanced enterprise topics.

Butch Quinto is Director of Analytics and Information Management at Deloitte, where he leads technology innovation, strategy, solutions development and delivery, business development, vendor alliance, and due diligence. He is also Technical Leader of Deloitte's ClearLight Lab, an R&D division that conducts innovative and game-changing research around advanced analytics, artificial intelligence, Internet of Things, and big data. Butch has more than 20 years of experience in various technical and leadership roles in several industries, including banking and finance, telecommunications, government, utilities, transportation, e-commerce, retail, technology, manufacturing, and bioinformatics. Butch is a recognized thought leader and a frequent speaker at conferences and events. He is a contributor to the Apache Spark and Apache Kudu open source projects, founder of the Cloudera Melbourne User Group, and Deloitte's Director of Alliance for Cloudera.

Chapter 1.- Introduction to the Current Big Data Landscape
* Introduction

* The Current Big Data Landscape

* Big Data in the Enterprise

Chapter 2.- Introduction to Kudu

* Introduction to Kudu

* Kudu History

* Kudu-Impala Integration

* Concepts and Terms

* Architectural Overview

* Use Cases

* Getting Started with Kudu

* Installation Guide

* Configuring Kudu

* Administering Kudu

* Troubleshooting Kudu

* Developing Applications with Kudu

* Kudu Schema Design

* Kudu Security

* Kudu Transaction Semantics

* Background Maintenance Tasks

* Kudu Configuration Reference

* Kudu Command Line Tools Reference

* Known Issues and Limitations

Chapter 3.- Introduction to Impala

* Architecture

* Parquet File Format

* The Impala Shell

* Impala Benchmarks

* Impala SQL

* Impala Functions

o Math Functions

o String Functions

o Date and Time Functions

o Analytics Functions

* Impala User Defined Functions

Chapter 4.- High Performance Data Analysis with Impala and Kudu

* Impala and Kudu Integration

* Impala and Kudu vs Relational Data Warehouse

* Impala and Kudu Schema Design

o Data Types

o Partitioning

* Impala and Kudu Monitoring

* Impala and Kudu Performance Tuning

* Impala and Kudu Troubleshooting

* ODBC/JDBC

o Linked Server from SQL Server

o Oracle DB Link

o BI Applications

o PHP ODBC

o Java JDBC

Chapter 5.- Introduction to Spark

* Introduction

* Introduction to Functional Programming

* Introduction to Scala

* Spark Architecture

* Spark Core

* Spark SQL

* Spark Streaming

* Spark MLlib

* Spark GraphX

Chapter 6.- High-Performance Data Processing with Spark and Kudu

* Kudu and Spark Integration

o Spark and the Kudu Context

o Spark and Kudu Examples

* Spark and Kudu in the Enterprise

o CSV, JSON and XML to Kudu

o Oracle and Kudu

o SQL Server and Kudu

o MySQL and Kudu

o HBase and Kudu

o Solr and Kudu

o Parquet and ORC and Kudu

o Amazon S3 and Kudu

o Spark Streaming with Kudu

* Spark and Kudu Monitoring

* Spark and Kudu Performance Tuning

* Spark and Kudu Troubleshooting

Chapter 7.- Batch and Real-time Data Ingestion and Processing

* Introduction to Batch Data Ingestion

* Introduction to Real-time Data Ingestion

* StreamSets

* NiFi

* Cask CDAP

* Talend

* Pentaho

* Other Players

o Informatica PowerCenter

o IBM DataStage

o SQL Server Integration Services

o Oracle Data Integrator

o Syncsort

o SnapLogic

* Native Tools

o Kafka

o Sqoop

o HDFS file commands

o Spark JDBC

o Kudu Java/C++ API

Chapter 8.- Big Data Warehousing and Business Intelligence

* Introduction to Data Warehousing and Business Intelligence

* Data Warehousing and Business Intelligence in the Age of Big Data

* EDW Optimization

o ETL Offloading

o Active Archiving

o Data Consolidation

o ODS Replacement

o Data Mart Replacement

o Data Warehouse Replacement

* Data Warehousing

o Star Schema

o Snowflake Schema

* Microsoft SQL Server 2016 Integration

o SQL Server Analysis Services

o SQL Server Reporting Services

o SQL Server Integration Services

o SQL Server PolyBase

o SQL Server Linked Server

* Oracle 12c

o Oracle Gateway

o JDBC

* OBIEE

Chapter 9.- Self-Service Big Data Analysis and Wrangling

* Introduction

* Zoomdata

* Tableau

* Qlik

* Power BI

* Datameer

* Trifacta

* Alteryx

* AtScale

* Hue

* Ambari Views

* Data Science Workbench

* Jupyter

* Apache Zeppelin

Chapter 10.- Distributed Big Data In-Memory Computing

* Introduction

* Alluxio

* Ignite

* Geode

* MemSQL

Chapter 11.- Big Data Governance and Management

* Cloudera Navigator

* Apache Atlas

* Informatica Metadata Manager

* Collibra

* Smartlogic

Chapter 12.- Big Data in the Cloud

* Cloudera Director

* AWS

* Azure

* Cloudera Altus

* EMR

Chapter 13.- Big Data Use Cases

* Data Warehousing

* ETL Offloading

* Data Consolidation

* Data Archiving

* Internet of Things Platform

* Cybersecurity

* Fraud Detection

* Audit and Reporting Platform

Chapter 14.- Big Data Case Studies

* AMD - Data Warehousing

* British Telecom - Data Consolidation

* Mastercard - Anti-fraud, Advanced Search

* Cerner - Sepsis Detection

* Navistar - IoT

* Shopzilla - ETL Offloading and Data Science

* Caesars Entertainment - Customer 360

* Wargaming - Machine Learning, Recommendation Engine

“The book assumes familiarity with basic data analytics methods and tools, especially Hadoop, and presents high-level introductions to emerging technologies. It thus serves as a helpful guide for data analytics professionals seeking to keep pace with this dynamic and quickly advancing field.” (Harry J. Foxwell, Computing Reviews, April 10, 2019)

Publication date	June 15, 2018
Additional info	326 Illustrations, black and white
Place of publication	Berkeley
Language	English
Dimensions	178 x 254 mm
Weight	1097 g
Binding	Paperback
Subject area	Computer Science ► Databases ► Data Warehouse / Data Mining
Subject area	Mathematics / Computer Science ► Computer Science ► Networks
Keywords	Big Data • Data Warehouse • Impala • Kudu • Spark
ISBN-10	1-4842-3146-5 / 1484231465
ISBN-13	978-1-4842-3146-3 / 9781484231463
Condition	New