Next-Generation Big Data - Butch Quinto

Next-Generation Big Data

A Practical Guide to Apache Kudu, Impala, and Spark

Butch Quinto (Author)

Book | Softcover
XXIII, 557 pages
2018
Apress (Publisher)
978-1-4842-3146-3 (ISBN)
48.14 incl. VAT
  • Details how to integrate popular third-party applications and platforms such as StreamSets, Zoomdata, Talend, Pentaho, Cask, Oracle, and SQL Server with next-generation big data technologies such as Kudu, Impala, and Spark
  • First book covering Apache Kudu, a game-changing relational data store from Cloudera that will disrupt the traditional data warehouse market
  • Features big data use cases and case studies from some of the most successful deployments—GoPro, Mastercard, British Telecom, Navistar, oPower, Cerner, Shopzilla, and Caesars Entertainment

Utilize this practical and easy-to-follow guide to modernize traditional enterprise data warehouse and business intelligence environments with next-generation big data technologies.

Next-Generation Big Data takes a holistic approach, covering the most important aspects of modern enterprise big data. The book covers not only the main technology stack but also the next-generation tools and applications used for big data warehousing, data warehouse optimization, real-time and batch data ingestion and processing, real-time data visualization, big data governance, data wrangling, big data cloud deployments, and distributed in-memory big data computing.

Finally, the book offers extensive and detailed coverage of big data case studies from Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard.

  • Install Apache Kudu, Impala, and Spark to modernize enterprise data warehouse and business intelligence environments, complete with real-world, easy-to-follow examples and practical advice
  • Integrate HBase, Solr, Oracle, SQL Server, MySQL, Flume, Kafka, HDFS, and Amazon S3 with Apache Kudu, Impala, and Spark
  • Use StreamSets, Talend, Pentaho, and CDAP for real-time and batch data ingestion and processing
  • Utilize Trifacta, Alteryx, and Datameer for data wrangling and interactive data processing
  • Turbocharge Spark with Alluxio, a distributed in-memory storage platform
  • Deploy big data in the cloud using Cloudera Director
  • Perform real-time data visualization and time series analysis using Zoomdata, Apache Kudu, Impala, and Spark
  • Understand enterprise big data topics such as big data governance, metadata management, data lineage, impact analysis, and policy enforcement, and how to use Cloudera Navigator to perform common data governance tasks
  • Implement big data use cases such as big data warehousing, data warehouse optimization, Internet of Things, real-time data ingestion and analytics, complex event processing, and scalable predictive modeling
  • Study real-world big data case studies from innovative companies, including Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard


This book is for BI and big data warehouse professionals interested in gaining practical, real-world insight into next-generation big data processing and analytics using Apache Kudu, Impala, and Spark, and for those who want to learn more about other advanced enterprise topics.

Butch Quinto is Director of Analytics and Information Management at Deloitte, where he leads technology innovation, strategy, solutions development and delivery, business development, vendor alliance, and due diligence. He is also Technical Leader of Deloitte's ClearLight Lab, an R&D division that conducts innovative and game-changing research around advanced analytics, artificial intelligence, Internet of Things, and big data. Butch has more than 20 years of experience in various technical and leadership roles in several industries, including banking and finance, telecommunications, government, utilities, transportation, e-commerce, retail, technology, manufacturing, and bioinformatics. Butch is a recognized thought leader and a frequent speaker at conferences and events. He is a contributor to the Apache Spark and Apache Kudu open source projects, founder of the Cloudera Melbourne User Group, and Deloitte's Director of Alliance for Cloudera.

Chapter 1.- Introduction to the Current Big Data Landscape

* Introduction

* The Current Big Data Landscape

* Big Data in the Enterprise



Chapter 2.- Introduction to Kudu

* Introduction to Kudu

* Kudu History

* Kudu-Impala Integration

* Concepts and Terms

* Architectural Overview

* Use Cases

* Getting Started with Kudu

* Installation Guide

* Configuring Kudu

* Administering Kudu

* Troubleshooting Kudu

* Developing Applications with Kudu

* Kudu Schema Design

* Kudu Security

* Kudu Transaction Semantics

* Background Maintenance Tasks

* Kudu Configuration Reference

* Kudu Command Line Tools Reference

* Known Issues and Limitations



Chapter 3.- Introduction to Impala

* Architecture

* Parquet File Format

* The Impala Shell

* Impala Benchmarks

* Impala SQL

* Impala Functions

o Math Functions

o String Functions

o Date and Time Functions

o Analytics Functions

* Impala User-Defined Functions



Chapter 4.- High Performance Data Analysis with Impala and Kudu

* Impala and Kudu Integration

* Impala and Kudu vs Relational Data Warehouse

* Impala and Kudu Schema Design

o Data Types

o Partitioning

* Impala and Kudu Monitoring

* Impala and Kudu Performance Tuning

* Impala and Kudu Troubleshooting

* ODBC/JDBC

o Linked Server from SQL Server

o Oracle DB Link

o BI Applications

o PHP ODBC

o Java JDBC



Chapter 5.- Introduction to Spark

* Introduction

* Introduction to Functional Programming

* Introduction to Scala

* Spark Architecture

* Spark Core

* Spark SQL

* Spark Streaming

* Spark MLlib

* Spark GraphX



Chapter 6.- High-Performance Data Processing with Spark and Kudu

* Kudu and Spark Integration

o Spark and the Kudu Context

o Spark and Kudu Examples

* Spark and Kudu in the Enterprise

o CSV, JSON and XML to Kudu

o Oracle and Kudu

o SQL Server and Kudu

o MySQL and Kudu

o HBase and Kudu

o Solr and Kudu

o Parquet, ORC, and Kudu

o Amazon S3 and Kudu

o Spark Streaming with Kudu

* Spark and Kudu Monitoring

* Spark and Kudu Performance Tuning

* Spark and Kudu Troubleshooting



Chapter 7.- Batch and Real-time Data Ingestion and Processing

* Introduction to Batch Data Ingestion

* Introduction to Real-time Data Ingestion

* StreamSets

* NiFi

* Cask CDAP

* Talend

* Pentaho

* Other Players

o Informatica PowerCenter

o IBM DataStage

o SQL Server Integration Services

o Oracle Data Integrator

o Syncsort

o SnapLogic

* Native Tools

o Kafka

o Sqoop

o HDFS File Commands

o Spark JDBC

o Kudu Java/C++ API



Chapter 8.- Big Data Warehousing and Business Intelligence

* Introduction to Data Warehousing and Business Intelligence

* Data Warehousing and Business Intelligence in the Age of Big Data

* EDW Optimization

o ETL Offloading

o Active Archiving

o Data Consolidation

o ODS Replacement

o Data Mart Replacement

o Data Warehouse Replacement

* Data Warehousing

o Star Schema

o Snowflake Schema

* Microsoft SQL Server 2016 Integration

o SQL Server Analysis Services

o SQL Server Reporting Services

o SQL Server Integration Services

o SQL Server PolyBase

o SQL Server Linked Server

* Oracle 12c

o Oracle Gateway

o JDBC

* OBIEE



Chapter 9.- Self-Service Big Data Analysis and Wrangling

* Introduction

* Zoomdata

* Tableau

* Qlik

* Power BI

* Datameer

* Trifacta

* Alteryx

* AtScale

* Hue

* Ambari Views

* Data Science Workbench

* Jupyter

* Apache Zeppelin



Chapter 10.- Distributed Big Data In-Memory Computing

* Introduction

* Alluxio

* Ignite

* Geode

* MemSQL



Chapter 11.- Big Data Governance and Management

* Cloudera Navigator

* Apache Atlas

* Informatica Metadata Manager

* Collibra

* Smartlogic



Chapter 12.- Big Data in the Cloud

* Cloudera Director

* AWS

* Azure

* Cloudera Altus

* EMR



Chapter 13.- Big Data Use Cases

* Data Warehousing

* ETL Offloading

* Data Consolidation

* Data Archiving

* Internet of Things Platform

* Cybersecurity

* Fraud Detection

* Audit and Reporting Platform



Chapter 14.- Big Data Case Studies

* AMD - Data Warehousing

* British Telecom - Data Consolidation

* Mastercard - Anti-fraud, Advanced Search

* Cerner - Sepsis Detection

* Navistar - IoT

* Shopzilla - ETL Offloading and Data Science

* Caesars Entertainment - Customer 360

* Wargaming - Machine Learning, Recommendation Engine

“The book assumes familiarity with basic data analytics methods and tools, especially Hadoop, and presents high-level introductions to emerging technologies. It thus serves as a helpful guide for data analytics professionals seeking to keep pace with this dynamic and quickly advancing field.” (Harry J. Foxwell, Computing Reviews, April 10, 2019)

Publication date
Additional info 326 illustrations, black and white
Place of publication Berkeley
Language English
Dimensions 178 x 254 mm
Weight 1097 g
Binding Paperback
Subject area Computer Science / Databases / Data Warehouse and Data Mining
Mathematics and Computer Science / Computer Science / Networks
Keywords Big Data • Data Warehouse • Impala • Kudu • Spark
ISBN-10 1-4842-3146-5 / 1484231465
ISBN-13 978-1-4842-3146-3 / 9781484231463
Condition New