Virtualizing Hadoop
VMware Press (publisher)
978-0-13-381116-2 (ISBN)
Enterprises running Hadoop must absorb rapid changes in big data ecosystems, frameworks, products, and workloads. Virtualized approaches can offer important advantages in speed, flexibility, and elasticity. Now, a world-class team of enterprise virtualization and big data experts guides you through the choices, considerations, and tradeoffs surrounding Hadoop virtualization. The authors help you decide whether to virtualize Hadoop, deploy Hadoop in the cloud, or integrate conventional and virtualized approaches in a blended solution.
First, Virtualizing Hadoop reviews big data and Hadoop from the standpoint of the virtualization specialist. The authors demystify MapReduce, YARN, and HDFS and guide you through each stage of Hadoop data management. Next, they turn the tables, introducing big data experts to modern virtualization concepts and best practices.
Finally, they bring Hadoop and virtualization together, guiding you through the decisions you'll face in planning, deploying, provisioning, and managing virtualized Hadoop. From security to multitenancy to day-to-day management, you'll find reliable answers for choosing your best Hadoop strategy and executing it.
Coverage includes the following:
* Reviewing the frameworks, products, distributions, use cases, and roles associated with Hadoop
* Understanding YARN resource management, HDFS storage, and I/O
* Designing data ingestion, movement, and organization for modern enterprise data platforms
* Defining SQL engine strategies to meet strict SLAs
* Considering security, data isolation, and scheduling for multitenant environments
* Deploying Hadoop as a service in the cloud
* Reviewing the essential concepts, capabilities, and terminology of virtualization
* Applying current best practices, guidelines, and key metrics for Hadoop virtualization
* Managing multiple Hadoop frameworks and products as one unified system
* Virtualizing master and worker nodes to maximize availability and performance
* Installing and configuring Linux for a Hadoop environment
George J. Trujillo, Jr. is an experienced corporate executive with exceptional communication skills. He is an expert in change management with strong skills in leadership, critical thinking, and data-driven decision making. George is an internationally recognized data architect, leader, and speaker in big data and cloud solutions. His background includes big data architecture, Hadoop (Hortonworks, Cloudera), data governance, schema design, metadata management, security, NoSQL, and BI. He has many industry recognitions, including Oracle Recognized Double ACE, Sun Ambassador for Sun Microsystems' Application Middleware Platform, VMware Recognized vExpert, VMware Certified Instructor, MySQL's Socrates Award, and MySQL Certified DBA. His leadership in the user community includes the Independent Oracle Users Group (IOUG) board of directors, president of the IOUG Cloud SIG, chair of the RMOUG Big Data SIG, president of the RMOUG Cloud SIG, the Oracle Fusion Council and Oracle Beta Leadership Council, election to the IOUG's "Oracles of Oracle" circle, and master presenter for the IOUG's Master Series. His positions have included vice president of big data architecture in the financial services industry, master principal big data specialist at Hortonworks, tier one data specialist for the VMware Center of Excellence, and CEO of a professional services and training organization.
Charles Kim is the president of Viscosity North America, a niche consulting organization specializing in big data, Oracle Exadata/RAC, and virtualization. Charles is an architect in Hadoop/big data, Linux infrastructure, cloud, virtualization, engineered systems, and Oracle clustering technologies. Charles is an author with Oracle Press, Pearson, and APress in Oracle, Hadoop, and Linux technology stacks. He holds certifications in Oracle, VMware, Red Hat Linux, and Microsoft and has more than 23 years of IT experience on mission- and business-critical systems. Charles presents regularly at VMworld, Oracle OpenWorld, IOUG, and various local/regional user group conferences. He is an Oracle ACE director, VMware vExpert, Oracle Certified DBA, Certified Exadata Specialist, and a Certified RAC Expert. Charles's books include the following:
* Oracle Database 11g New Features for DBA and Developers
* Linux Recipes for Oracle DBAs
* Oracle Data Guard 11g Handbook
* Virtualizing Business Critical Oracle Databases: Database as a Service
* Oracle ASM 12c Pocket Reference Guide
* Expert Exadata Handbook
Charles is the president of the Cloud Computing (and Virtualization) SIG for the Independent Oracle User Group. Charles blogs regularly at the DBAExpert.com/ blog site. His LinkedIn profile is http://www.linkedin.com/in/chkim. His Twitter tag is @racdba.
Steven Jones is a 16-year veteran of technical training with experience in UNIX, networking, database technology, virtualization, and big data. Steven works at VMware as a VMware Certified Instructor; VCA; VCP 4, 5, 6; and vExpert 2014, 2015. He is a coauthor of Virtualize Oracle Business Critical Databases: Database Infrastructure as a Service, by Charles Kim, George Trujillo, Steven Jones, and Sudhir Balasubramanian (iBooks, 2014). He spoke at VMworld 2013 in San Francisco and Barcelona on Virtualizing Mission Critical Oracle RAC with vC Ops, and was a co-speaker worldwide for the VMware Education SDDC Intensive Workshop. Steven seeks to bring innovation, analogy, and narrative to understanding and mastering information technology as a service.
Rommel Garcia is a senior solutions engineer at Hortonworks, a leading open source company driving the adoption of Hadoop. Rommel has spent the past few years focusing on the design, installation, and deployment of large-scale Hadoop ecosystems. He has helped organizations implement security best practices and guidelines for Hadoop platforms. He has performance tuned Hadoop clusters for organizations ranging from fast-growing startups to Fortune 100 companies. Rommel is a nationally recognized speaker at Hadoop and big data conferences. He is also well known for his expertise in performance tuning Java applications and middle-tier platforms. He has a BS in electronics engineering and an MS degree in computer science. Rommel resides in Atlanta with his wife, Elizabeth, and his children, Mila and Braden.
Justin Murray is a senior technical marketing architect at VMware. He holds a BA and a post-graduate diploma in computer science from University College Cork in Ireland. Justin has worked in software engineering, technical training, and consulting in various companies in the UK and the United States. Since 2007, he has been working with VMware's partner companies to validate and optimize big data and other next-generation application workloads on VMware vSphere.
Foreword
Preface
Part I: Introduction to Hadoop
Chapter 1 Understanding the Big Data World
The Data Revolution
Traditional Data Systems
Semi-Structured and Unstructured Data
Causation and Correlation
Data Challenges
The Modern Data Architecture
Organizational Transformations
Industry Transformation
Summary
Chapter 2 Hadoop Fundamental Concepts
Types of Data in Hadoop
Use Cases
What Is Hadoop?
Hadoop Distributions
Hadoop Frameworks
NoSQL Databases
What Is NoSQL?
A Hadoop Cluster
Hadoop Software Processes
Hadoop Hardware Profiles
Roles in the Hadoop Environment
Summary
Chapter 3 YARN and HDFS
A Hadoop Cluster Is Distributed
Hadoop Directory Layouts
Hadoop Operating System Users
The Hadoop Distributed File System
YARN Logging
The NameNode
The DataNode
Block Placement
NameNode Configurations and Managing Metadata
Rack Awareness
Block Management
The Balancer
Maintaining Data Integrity in the Cluster
Quotas and Trash
YARN and the YARN Processing Model
Running Applications on YARN
Resource Schedulers
Benchmarking
TeraSort Benchmarking Suite
Summary
Chapter 4 The Modern Data Platform
Designing a Hadoop Cluster
Enterprise Data Movement
Summary
Chapter 5 Data Ingestion
Extraction, Loading, and Transformation (ELT)
Sqoop: Data Movement with SQL Sources
Flume: Streaming Data
Oozie: Scheduling and Workflow
Falcon: Data Lifecycle Management
Kafka: Real-time Data Streaming
Summary
Chapter 6 Hadoop SQL Engines
Where SQL Was Born
SQL in Hadoop
Hadoop SQL Engines
Selecting the SQL Tool For Hadoop
Now Getting Groovy with Hive and Pig
Hive
HCatalog
Pig
Summary
Chapter 7 Multitenancy in Hadoop
Securing the Access
Authentication
Auditing
Authorization
Data Protection
Isolating the Data
Isolating the Process
Summary
Part II: Introduction to Virtualization
Chapter 8 Virtualization Fundamentals
Why Virtualize Hadoop?
Introduction to Virtualization
Summary
References
Chapter 9 Best Practices for Virtualizing Hadoop
Running Virtualized Hadoop with Purpose and Discipline
The Discipline of Purpose Starts with a Clear Target
Virtualizing Different Tiers of Hadoop
Industry Best Practices
Summary
Part III: Virtualizing Hadoop
Chapter 10 Virtualizing Hadoop
How Are Hadoop Ecosystems Going to Be Managed?
Building an Enterprise Hadoop Platform That Is Agile and Flexible
Clarification of Terms
The Journey from Bare-Metal to Virtualization
Why Consider Virtualizing Hadoop?
Benefits of Virtualizing Hadoop
Virtualized Hadoop Can Run as Fast or Faster Than Native
Coordination and Cross-Purpose Specialization Is the Future
Barriers Can Be Organizational
Virtualization Is Not an All or Nothing Option
Rapid Provisioning and Improving Quality of Development and Test Environments
Improve High Availability with Virtualization
Use Virtualization to Leverage Hadoop Workloads
Hadoop in the Cloud
Big Data Extensions
The Path to Virtualization
The Software-Defined Data Center
Virtualizing the Network
vRealize Suite
Summary
References
Chapter 11 Virtualizing Hadoop Master Servers
Virtualizing Servers in a Hadoop Cluster
Virtualizing the Environment Around Hadoop
Virtualizing the Master Hadoop Servers
Virtualizing Without the SAN
Summary
Chapter 12 Virtualizing the Hadoop Worker Nodes
A Brief Introduction to the Worker Nodes in Hadoop
Deployment Models for Hadoop Clusters
The Combined Model
The Separated Model
Network Effects of the Data-Compute Separation
The Shared-Storage Approach to the Data-Compute Separated Model
Local Disks for the Application's Temporary Data
The Shared Storage Architecture Model Using Network-Attached Storage (NAS)
Deployment Model Summary
Best Practices for Virtualizing Hadoop Workers
Disk I/O
The Hadoop Virtualization Extensions (HVE)
Summary
References
Resources
Chapter 13 Deploying Hadoop as a Service in the Private Cloud
The Cloud Context
Stakeholders for Hadoop
Overview of the Solution Architecture
Summary
References
Chapter 14 Understanding the Installation of Hadoop
Map the Right Solutions to the Right Use Case
Thoughts About Installing Hadoop
Configuring Repositories
Installing HDP 2.2
Environment Preparation
Setting Up the Hadoop Configuration
Starting HDFS and YARN
Start YARN
Verifying MapReduce Functionality
Installing and Configuring Hive
Installing and Configuring MySQL Database
Installing and Configuring Hive and HCatalog
Summary
Chapter 15 Configuring Linux for Hadoop
Supported Linux Platforms
Different Deployment Models
Linux Golden Templates
Building a Linux Enterprise Hadoop Platform
Selecting the Linux Distribution
Optimal Linux Kernel Parameters and System Settings
epoll
Disable Swap Space
Disable Security During Install
IO Scheduler Tuning
Check Transparent Huge Pages Configuration
Limits.conf
Partition Alignment for RDMs
File System Considerations
Lazy Count Parameter for XFS
Mount Options
I/O Scheduler
Disk Read and Write Options
Storage Benchmarking
Java Version
Set Up NTP
Enable Jumbo Frames
Additional Network Considerations
Summary
Appendix A Hadoop Cluster Creation: A Prerequisite Checklist
Appendix B Big Data/Hadoop on VMware vSphere Reference Materials
Deployment Guides
Reference Architectures
Customer Case Studies
Performance
vSphere Big Data Extensions (BDE)
Other vSphere Features and Big Data
Series | VMware Press Technology |
---|---|
Place of publication | NJ |
Language | English |
Weight | 1 g |
Subject area | Mathematics / Computer Science ► Computer Science ► Databases |
ISBN-10 | 0-13-381116-6 / 0133811166 |
ISBN-13 | 978-0-13-381116-2 / 9780133811162 |
Condition | New |