Practical Apache Spark - Subhashini Chellappan, Dharanitharan Ganesan

Practical Apache Spark (eBook)

Using the Scala API

Subhashini Chellappan, Dharanitharan Ganesan (Authors)

eBook Download: PDF

2018 | 1st edition
XVI, 288 pages
Apress (Publisher)
978-1-4842-3652-9 (ISBN)

On completion, you'll have knowledge of the functional programming aspects of Scala, and hands-on expertise in various Spark components. You'll also become familiar with machine learning algorithms with real-time usage.

What You Will Learn

Discover the functional programming features of Scala
Understand the complete architecture of Spark and its components
Integrate Apache Spark with Hive and Kafka
Use Spark SQL, DataFrames, and Datasets to process data using traditional SQL queries
Work with different machine learning concepts and libraries using Spark's MLlib packages

Who This Book Is For

Developers and professionals who deal with batch and stream data processing.

Subhashini Chellappan is an associate manager and technology enthusiast. She has rich experience in both academia and the software industry. She has published two books: Big Data Analytics and Pro Tableau. Her areas of interest and expertise are centered on business intelligence, big data analytics and cloud computing.

Bharath Kumar Dasa is a technology lead with expertise in the big data space and core expertise in the complete Hadoop stack. He has worked on the HDP distribution and has architected multiple data management and data life cycle auto service management projects for financial institutions. He has been working in machine learning and integration of machine learning with big data technologies for the past few years. His areas of interest and expertise are centered on big data and analytics, machine learning, data visualization and deep learning.

Dharanitharan Ganesan is a senior analyst with five years of experience in IT. He has a high level of exposure and experience in big data - Apache Hadoop, Apache Spark and various Hadoop ecosystem components. He has a proven track record of improving efficiency and productivity through the automation of various routine and administrative functions in business intelligence and big data technologies. His areas of interest and expertise are centered on machine learning algorithms, statistical modelling and predictive analysis.

Work with Apache Spark using Scala to deploy and set up single-node, multi-node, and high-availability clusters. This book discusses various components of Spark such as Spark Core, DataFrames, Datasets and SQL, Spark Streaming, Spark MLlib, and R on Spark with the help of practical code snippets for each topic. Practical Apache Spark also covers the integration of Apache Spark with Kafka with examples. You'll follow a learn-to-do-by-yourself approach to learning: learn the concepts, practice the code snippets in Scala, and complete the assignments given to get an overall exposure.

Table of Contents
About the Authors
About the Technical Reviewers
Acknowledgments
Introduction
Chapter 1: Scala: Functional Programming Aspects
What Is Functional Programming?
What Is a Pure Function?
Example of Pure Function
Scala Programming Features
Variable Declaration and Initialization
Type Inference
Immutability
Lazy Evaluation
String Interpolation
String - s Interpolator
String - f Interpolator
String - raw Interpolator
Pattern Matching
Scala Class vs. Object
Singleton Object
Companion Classes and Objects
Case Classes
Pattern Matching on Case Classes
Scala Collections
Iterating Over the Collection
Common Methods of Collection
Functional Programming Aspects of Scala
Anonymous Functions
Higher Order Functions
Function Composition
Function Currying
Nested Functions
Functions with Variable Length Parameters
Reference Links
Points to Remember
Chapter 2: Single and Multinode Cluster Setup
Spark Multinode Cluster Setup
Recommended Platform
Operating System
Prerequisites
Spark Installation Steps
Spark Web UI
Spark Master UI
Spark Application UI
Stopping the Spark Cluster
Spark Single-Node Cluster Setup
Prerequisites
Spark Installation Steps
Spark Master UI
Points to Remember
Chapter 3: Introduction to Apache Spark and Spark Core
What Is Apache Spark?
Why Apache Spark?
Spark vs. Hadoop MapReduce
Apache Spark Architecture
Spark Components
Spark Core (RDD)
Spark SQL
Spark Streaming
MLlib
GraphX
SparkR
Spark Shell
Spark Core: RDD
RDD Operations
Transformations
Actions
Creating an RDD
Using Parallelized Collection
From External Data Source
Creating an RDD from the Hadoop File System
Creating an RDD: File Partitioning
RDD Transformations
RDD Actions
Working with Pair RDDs
Directed Acyclic Graph in Apache Spark
How DAG Works in Spark
How Spark Achieves Fault Tolerance Through DAG
Persisting RDD
Shared Variables
Broadcast Variables
Accumulators
Simple Build Tool (SBT)
Assignments
Reference Links
Points to Remember
Chapter 4: Spark SQL, DataFrames, and Datasets
What Is Spark SQL?
Datasets and DataFrames
Spark Session
Creating DataFrames
DataFrame Operations
Untyped DataFrame Operation: Select
Untyped DataFrame Operation: Filter
Untyped DataFrame Operation: Aggregate Operations
Running SQL Queries Programmatically
Creating Views
Dataset Operations
Interoperating with RDDs
Reflection-Based Approach to Infer Schema
Different Data Sources
Generic Load and Save Functions
Manually Specifying Options
Run SQL on Files Directly
JDBC to External Databases
Working with Hive Tables
Building Spark SQL Application with SBT
Points to Remember
Chapter 5: Introduction to Spark Streaming
Data Processing
Streaming Data
Why Streaming Data Are Important
Introduction to Spark Streaming
Internal Working of Spark Streaming
Spark Streaming Concepts
Discretized Streams (DStream)
Streaming Context
DStream Operations
Spark Streaming Example Using TCP Socket
Stateful Streaming
Window-Based Streaming
Full-Session-Based Streaming
Streaming Applications Considerations
Points to Remember
Chapter 6: Spark Structured Streaming
What Is Spark Structured Streaming?
Spark Structured Streaming Programming Model
Word Count Example Using Structured Streaming
Creating Streaming DataFrames and Streaming Datasets
Operations on Streaming DataFrames/Datasets
Stateful Streaming: Window Operations on Event-Time
Stateful Streaming: Handling Late Data and Watermarking
Triggers
Fault Tolerance
Points to Remember
Chapter 7: Spark Streaming with Kafka
Introduction to Kafka
Kafka Core Concepts
Kafka APIs
Kafka Fundamental Concepts
Kafka Architecture
Kafka Topics
Leaders and Replicas
Setting Up the Kafka Cluster
Spark Streaming and Kafka Integration
Spark Structured Streaming and Kafka Integration
Points to Remember
Chapter 8: Spark Machine Learning Library
What Is Spark MLlib?
Spark MLlib APIs
Vectors in Scala
Vector Representation in Spark
Basic Statistics
Correlation
Hypothesis Testing
Extracting, Transforming, and Selecting Features
Feature Extractors
Term Frequency–Inverse Document Frequency (TF–IDF)
Example
Feature Transformers
Tokenizer
StopWordsRemover
StringIndexer
Feature Selectors
VectorSlicer
ML Pipelines
Pipeline Components
Estimators
Transformers
Pipeline Examples
Machine Learning Regression and Classification Algorithms
Regression Algorithms
Linear Regression
Classification Algorithms
Logistic Regression
Clustering Algorithms
K-Means Clustering
Points to Remember
Chapter 9: Working with SparkR
Introduction to SparkR
SparkDataFrame
SparkSession
Starting SparkR from RStudio
Creating SparkDataFrames
From a Local R DataFrame
From Other Data Sources
From Hive Tables
SparkDataFrame Operations
Selecting Rows and Columns
Grouping and Aggregation
Operating on Columns
Applying User-Defined Functions
Run a Given Function on a Large Data Set Using dapply or dapplyCollect
Running SQL Queries from SparkR
Machine Learning Algorithms
Regression and Classification Algorithms
Linear Regression
Logistic Regression
Decision Tree
Points to Remember
Chapter 10: Spark Real-Time Use Case
Data Analytics Project Architecture
Data Ingestion
Data Storage
Data Processing
Data Visualization
Use Cases
Event Detection Use Case
Build Procedure
Building the Application with SBT
Points to Remember
Index

Publication date (per publisher)	12.12.2018
Additional info	XVI, 280 p. 303 illus.
Place of publication	Berkeley
Language	English
Subject areas	Mathematics / Computer Science ► Computer Science ► Databases
	Mathematics / Computer Science ► Computer Science ► Networks
	Mathematics / Computer Science ► Computer Science ► Programming Languages / Tools
Keywords	Apache Spark • Big Data • Kafka • machine learning • R • Scala
ISBN-10	1-4842-3652-1 / 1484236521
ISBN-13	978-1-4842-3652-9 / 9781484236529

PDF (watermark)
Size: 23.9 MB

DRM: Digital watermark
This eBook contains a digital watermark and is thereby personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables, and figures. A PDF can be displayed on almost any device but is only of limited suitability for small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. It is not compatible with the Amazon Kindle, however.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.

Additional feature: Online reading
In addition to downloading it, you can also read this eBook online in your web browser.

Buying eBooks from abroad
For tax law reasons we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.

Print edition

Book | Softcover

53,49 €