Spark (eBook)
216 pages
John Wiley & Sons (publisher)
978-1-119-25404-1 (ISBN)
Spark: Big Data Cluster Computing in Production goes beyond general Spark overviews to provide targeted guidance toward using lightning-fast big-data clustering in production. Written by an expert team well-known in the big data community, this book walks you through the challenges in moving from proof-of-concept or demo Spark applications to live Spark in production. Real use cases provide deep insight into common problems, limitations, challenges, and opportunities, while expert tips and tricks help you get the most out of Spark performance. Coverage includes Spark SQL, Tachyon, Kerberos, MLlib, YARN, and Mesos, with clear, actionable guidance on resource scheduling, database connectors, streaming, security, and much more.
Spark has become the tool of choice for many Big Data problems, with more active contributors than any other Apache Software project. General introductory books abound, but this book is the first to provide deep insight and real-world advice on using Spark in production. Specific guidance, expert tips, and invaluable foresight make this guide an incredibly useful resource for real production settings.
* Review Spark hardware requirements and estimate cluster size
* Gain insight from real-world production use cases
* Tighten security, schedule resources, and fine-tune performance
* Overcome common problems encountered using Spark in production
Spark works with other big data tools including MapReduce and Hadoop, and uses languages you already know like Java, Scala, Python, and R. Lightning speed makes Spark too good to pass up, but understanding limitations and challenges in advance goes a long way toward easing actual production implementation. Spark: Big Data Cluster Computing in Production tells you everything you need to know, with real-world production insight and expert guidance, tips, and tricks.
Ilya Ganelin is a data engineer working at Capital One Data Innovation Lab. Ilya is an active contributor to the core components of Apache Spark and a committer to Apache Apex. Ema Orhian is a Big Data Engineer interested in scaling algorithms. She is the main committer on jaws-spark-sql-rest, a data warehouse explorer on top of Spark SQL. Kai Sasaki is a software engineer working in distributed computing and machine learning. He is a Spark contributor who works mainly on the MLlib and ML libraries. Brennon York has been a core contributor to Apache Spark since 2014, including development on GraphX and the core build environment.
Introduction
Chapter 1 Finishing Your Spark Job
Installation of the Necessary Components
Native Installation Using a Spark Standalone Cluster
The History of Distributed Computing That Led to Spark
Enter the Cloud
Understanding Resource Management
Using Various Formats for Storage
Text Files
Sequence Files
Avro Files
Parquet Files
Making Sense of Monitoring and Instrumentation
Spark UI
Spark Standalone UI
Metrics REST API
Metrics System
External Monitoring Tools
Summary
Chapter 2 Cluster Management
Background
Spark Components
Driver
Workers and Executors
Configuration
Spark Standalone
Architecture
Single-Node Setup Scenario
Multi-Node Setup
YARN
Architecture
Dynamic Resource Allocation
Scenario
Mesos
Setup
Architecture
Dynamic Resource Allocation
Basic Setup Scenario
Comparison
Summary
Chapter 3 Performance Tuning
Spark Execution Model
Partitioning
Controlling Parallelism
Partitioners
Shuffling Data
Shuffling and Data Partitioning
Operators and Shuffling
Shuffling Is Not That Bad After All
Serialization
Kryo Registrators
Spark Cache
Spark SQL Cache
Memory Management
Garbage Collection
Shared Variables
Broadcast Variables
Accumulators
Data Locality
Summary
Chapter 4 Security
Architecture
Security Manager
Setup Configurations
ACL
Configuration
Job Submission
Web UI
Network Security
Encryption
Event logging
Kerberos
Apache Sentry
Summary
Chapter 5 Fault Tolerance or Job Execution
Lifecycle of a Spark Job
Spark Master
Spark Driver
Spark Worker
Job Lifecycle
Job Scheduling
Scheduling within an Application
Scheduling with External Utilities
Fault Tolerance
Internal and External Fault Tolerance
Service Level Agreements (SLAs)
Resilient Distributed Datasets (RDDs)
Batch versus Streaming
Testing Strategies
Recommended Configurations
Summary
Chapter 6 Beyond Spark
Data Warehousing
Spark SQL CLI
Thrift JDBC/ODBC Server
Hive on Spark
Machine Learning
DataFrame
MLlib and ML
Mahout on Spark
Hivemall on Spark
External Frameworks
Spark Package
XGBoost
spark-jobserver
Future Works
Integration with the Parameter Server
Deep Learning
Enterprise Usage
Collecting User Activity Log with Spark and Kafka
Real-Time Recommendation with Spark
Real-Time Categorization of Twitter Bots
Summary
Index
Publication date (per publisher) | 4 March 2016 |
---|---|
Language | English |
Subject areas | Computer Science ► Databases ► Data Warehouse / Data Mining |
Mathematics / Computer Science ► Computer Science ► Networks | |
Keywords | Apache Spark • Big Data • Cluster (computer network) • Computer Science • Database & Data Warehousing Technologies • Databases and Data Warehousing |
ISBN-10 | 1-119-25404-3 / 1119254043 |
ISBN-13 | 978-1-119-25404-1 / 9781119254041 |
Size: 13.9 MB
Copy protection: Adobe DRM
Adobe DRM is a copy-protection scheme intended to protect the eBook against misuse. The eBook is authorized to your personal Adobe ID at download time. You can then read it only on devices that are also registered to your Adobe ID.
File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables, and figures. A PDF can be displayed on almost any device, but it is only partly suitable for small displays (smartphone, eReader).
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a
eReader: This eBook can be read on (almost) all eBook readers. It is not compatible with the Amazon Kindle, however.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a
Additional feature: Online reading
In addition to downloading this eBook, you can also read it online in your web browser.
Buying eBooks from abroad
For tax law reasons, we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.