Apache Hadoop YARN

Arun Murthy, Jeffrey Markham, Vinod Vavilapalli, Doug Eadline (Authors)

Book | Softcover

336 pages

2014
Addison-Wesley Educational Publishers (Publisher)
978-0-321-93450-5 (ISBN)

This title is unfortunately out of print;
no new edition is planned

Written by Arun Murthy, lead developer for Hadoop 2.0
Details the architecture of YARN applications and how they are structured
Describes the functional requirements for each element of an application
Walk-through of a sample app

"This book is a critically needed resource for the newly released Apache Hadoop 2.0, highlighting YARN as the significant breakthrough that broadens Hadoop beyond the MapReduce paradigm." - From the Foreword by Raymie Stata, CEO of Altiscale

Apache Hadoop is right at the heart of the Big Data revolution. In the brand-new Release 2, Hadoop’s data processing has been thoroughly overhauled. The result is Apache Hadoop YARN, a generic compute fabric providing resource management at datacenter scale, and a simple method to implement distributed applications such as MapReduce to process petabytes of data on Apache Hadoop HDFS. Apache Hadoop 2 and YARN truly deserve to be called breakthroughs.

In "Apache Hadoop YARN", key YARN developer Arun Murthy shows how to get your existing code to run on Apache Hadoop 2 and how to develop new applications that take full advantage of Hadoop clusters. Drawing on insights from the entire Apache Hadoop 2 team, Murthy and Dr. Douglas Eadline:

Review Apache Hadoop YARN’s goals, design, architecture, and components
Guide you through migrating existing MapReduce applications quickly and painlessly
Identify the functional requirements for each element of an Apache Hadoop 2 application
Walk you through a complete sample application project
Offer multiple examples and case studies drawn from their cutting-edge experience

Apache Hadoop is helping drive the Big Data revolution. Now, its data processing has been completely overhauled: Apache Hadoop YARN provides resource management at data center scale and easier ways to create distributed applications that process petabytes of data.

And now in Apache Hadoop™ YARN, two Hadoop technical leaders show you how to develop new applications and adapt existing code to fully leverage these revolutionary advances. YARN project founder Arun Murthy and project lead Vinod Kumar Vavilapalli demonstrate how YARN increases scalability and cluster utilization, enables new programming models and services, and opens new options beyond Java and batch processing. They walk you through the entire YARN project lifecycle, from installation through deployment.

You’ll find many examples drawn from the authors’ cutting-edge experience—first as Hadoop’s earliest developers and implementers at Yahoo! and now as Hortonworks developers moving the platform forward and helping customers succeed with it.

Coverage includes

YARN’s goals, design, architecture, and components—how it expands the Apache Hadoop ecosystem
Exploring YARN on a single node
Administering YARN clusters and Capacity Scheduler
Running existing MapReduce applications
Developing a large-scale clustered YARN application
Discovering new open source frameworks that run under YARN

Arun Murthy has contributed to Apache Hadoop full-time since the inception of the project in early 2006. He is a long-term Hadoop committer and a member of the Apache Hadoop Project Management Committee. Previously, he was the architect and lead of the Yahoo! Hadoop MapReduce development team and was ultimately responsible, technically, for providing Hadoop MapReduce as a service for all of Yahoo!, currently running on nearly 50,000 machines. Arun is the founder and architect of Hortonworks Inc., a software company that is helping to accelerate the development and adoption of Apache Hadoop. Hortonworks was formed by the key architects and core Hadoop committers from the Yahoo! Hadoop software engineering team in June 2011. Funded by Yahoo! and Benchmark Capital, one of the preeminent technology investors, the company aims to ensure that Apache Hadoop becomes the standard platform for storing, processing, managing, and analyzing big data.

Vinod Kumar Vavilapalli has been contributing to the Apache Hadoop project full-time since mid-2007. At the Apache Software Foundation, he is a long-term Hadoop contributor, Hadoop committer, member of the Apache Hadoop Project Management Committee, and a foundation member. Vinod is a MapReduce and YARN go-to guy at Hortonworks Inc. For more than five years, he has been working on Hadoop. He was involved in HadoopOnDemand, Hadoop-0.20, CapacityScheduler, Hadoop security, and MapReduce, and is now a lead developer and the project lead for Apache Hadoop YARN. Before Hortonworks, he was at Yahoo!, working in the Grid team that made Hadoop what it is today, running at large scale, up to tens of thousands of nodes. Vinod loves reading books of all kinds and is passionate about using computers to change the world for the better, bit by bit. He has a bachelor's degree in computer science and engineering from the Indian Institute of Technology Roorkee. He can be reached on Twitter at @tshooter.

Douglas Eadline, Ph.D., began his career as a practitioner and a chronicler of the Linux Cluster HPC revolution and now documents big data analytics. Starting with the first Beowulf How To document, Doug has written hundreds of articles, white papers, and instructional documents covering virtually all aspects of HPC computing. Prior to starting and editing the popular ClusterMonkey.net website in 2005, he served as editor-in-chief for ClusterWorld magazine, and was senior HPC editor for Linux Magazine. Currently, he is a consultant to the HPC industry and writes a monthly column in HPC Admin magazine. Both clients and readers have recognized Doug's ability to present a "technological value proposition" in a clear and accurate style. He has practical, hands-on experience in many aspects of HPC, including hardware and software design, benchmarking, storage, GPU, cloud, and parallel computing. He is the author of Hadoop Fundamentals LiveLessons (video) from Addison-Wesley.

Joseph Niemiec is a big data solutions engineer whose focus is on designing Hadoop solutions for many Fortune 1000 companies. In this position, Joseph has worked with customers to build multiple YARN applications, which gives him a unique perspective on moving customers beyond batch processing, and has worked on YARN development directly. An avid technologist, Joseph has been focused on technology innovations since 2001. His interest in data analytics originally started in game score optimization as a teenager, and has shifted to helping customers adopt new technology innovations such as Hadoop and, most recently, building new data applications using YARN.

Jeff Markham is a solution engineer at Hortonworks Inc., the company promoting open source Hadoop. Previously, he was with VMware, Red Hat, and IBM, helping companies build distributed applications with distributed data. He has written articles on Java application development and has spoken at several conferences and to Hadoop User Groups. Jeff is a contributor to Apache Pig and Apache HDFS.

Foreword by Raymie Stata xiii
Foreword by Paul Dix xv
Preface xvii
Acknowledgments xxi
About the Authors xxv

Chapter 1: Apache Hadoop YARN: A Brief History and Rationale 1
Introduction 1
Apache Hadoop 2
Phase 0: The Era of Ad Hoc Clusters 3
Phase 1: Hadoop on Demand 3
Phase 2: Dawn of the Shared Compute Clusters 9
Phase 3: Emergence of YARN 18
Conclusion 20

Chapter 2: Apache Hadoop YARN Install Quick Start 21
Getting Started 22
Steps to Configure a Single-Node YARN Cluster 22
Run Sample MapReduce Examples 30
Wrap-up 31

Chapter 3: Apache Hadoop YARN Core Concepts 33
Beyond MapReduce 33
Apache Hadoop MapReduce 35
Apache Hadoop YARN 38
YARN Components 39
Wrap-up 42

Chapter 4: Functional Overview of YARN Components 43
Architecture Overview 43
ResourceManager 45
YARN Scheduling Components 46
Containers 49
NodeManager 49
ApplicationMaster 50
YARN Resource Model 50
Managing Application Dependencies 53
Wrap-up 57

Chapter 5: Installing Apache Hadoop YARN 59
The Basics 59
System Preparation 60
Script-based Installation of Hadoop 2 62
Script-based Uninstall 68
Configuration File Processing 68
Configuration File Settings 68
Start-up Scripts 71
Installing Hadoop with Apache Ambari 71
Wrap-up 84

Chapter 6: Apache Hadoop YARN Administration 85
Script-based Configuration 85
Monitoring Cluster Health: Nagios 90
Real-time Monitoring: Ganglia 97
Administration with Ambari 99
JVM Analysis 103
Basic YARN Administration 106
Wrap-up 114

Chapter 7: Apache Hadoop YARN Architecture Guide 115
Overview 115
ResourceManager 117
NodeManager 127
ApplicationMaster 138
YARN Containers 148
Summary for Application-writers 150
Wrap-up 151

Chapter 8: Capacity Scheduler in YARN 153
Introduction to the Capacity Scheduler 153
Capacity Scheduler Configuration 155
Queues 156
Hierarchical Queues 156
Queue Access Control 159
Capacity Management with Queues 160
User Limits 163
Reservations 166
State of the Queues 167
Limits on Applications 168
User Interface 169
Wrap-up 169

Chapter 9: MapReduce with Apache Hadoop YARN 171
Running Hadoop YARN MapReduce Examples 171
MapReduce Compatibility 181
The MapReduce ApplicationMaster 181
Calculating the Capacity of a Node 182
Changes to the Shuffle Service 184
Running Existing Hadoop Version 1 Applications 184
Running MapReduce Version 1 Existing Code 187
Advanced Features 188
Wrap-up 190

Chapter 10: Apache Hadoop YARN Application Example 191
The YARN Client 191
The ApplicationMaster 208
Wrap-up 226

Chapter 11: Using Apache Hadoop YARN Distributed-Shell 227
Using the YARN Distributed-Shell 227
Internals of the Distributed-Shell 232
Wrap-up 240

Chapter 12: Apache Hadoop YARN Frameworks 241
Distributed-Shell 241
Hadoop MapReduce 241
Apache Tez 242
Apache Giraph 242
Hoya: HBase on YARN 243
Dryad on YARN 243
Apache Spark 244
Apache Storm 244
REEF: Retainable Evaluator Execution Framework 245
Hamster: Hadoop and MPI on the Same Cluster 245
Wrap-up 245

Appendix A: Supplemental Content and Code Downloads 247
Available Downloads 247

Appendix B: YARN Installation Scripts 249
install-hadoop2.sh 249
uninstall-hadoop2.sh 256
hadoop-xml-conf.sh 258

Appendix C: YARN Administration Scripts 263
configure-hadoop2.sh 263

Appendix D: Nagios Modules 269
check_resource_manager.sh 269
check_data_node.sh 271
check_resource_manager_old_space_pct.sh 272

Appendix E: Resources and Additional Information 277

Appendix F: HDFS Quick Reference 279
Quick Command Reference 279

Index 287

Publication date (per publisher)	3 April 2014
Additional info	black & white illustrations, figures
Place of publication	New Jersey
Language	English
Dimensions	179 x 228 mm
Weight	524 g
Binding	Paperback
Subject areas	Computer Science ► Databases ► Data Warehouse / Data Mining
	Computer Science ► Software Development ► SOA / Web Services
	Mathematics / Computer Science ► Computer Science ► Web / Internet
ISBN-10	0-321-93450-4 / 0321934504
ISBN-13	978-0-321-93450-5 / 9780321934505
Condition	New