Hadoop in Practice
Manning Publications (publisher)
978-1-61729-023-7 (ISBN)
"Hadoop in Practice" collects 85 battle-tested examples and presents them in a problem/solution format. It balances conceptual foundations with practical recipes for key problem areas such as data ingress and egress, serialization, and LZO compression. You'll explore each technique step by step, learning how to build a specific solution along with the thinking that went into it. As a bonus, the book's examples create a well-structured and understandable codebase you can tweak to meet your own needs.
This book assumes the reader knows the basics of Hadoop.
Topics covered:
- Conceptual overview of Hadoop and MapReduce
- 85 practical, tested techniques
- Real problems, real solutions
- How to integrate MapReduce and R
Alex Holmes is a senior software engineer with extensive expertise in solving big data problems using Hadoop. He has presented at JavaOne and Jazoon and is a technical lead at VeriSign.
preface
acknowledgments
about this book
Part 1 Background and Fundamentals
1 Hadoop in a heartbeat
1.1 What is Hadoop?
1.2 Running Hadoop
1.3 Chapter summary
Part 2 Data Logistics
2 Moving data in and out of Hadoop
2.1 Key elements of ingress and egress
2.2 Moving data into Hadoop
Technique 1 Pushing system log messages into HDFS with Flume
Technique 2 An automated mechanism to copy files into HDFS
Technique 3 Scheduling regular ingress activities with Oozie
Technique 4 Database ingress with MapReduce
Technique 5 Using Sqoop to import data from MySQL
Technique 6 HBase ingress into HDFS
Technique 7 MapReduce with HBase as a data source
2.3 Moving data out of Hadoop
Technique 8 Automated file copying from HDFS
Technique 9 Using Sqoop to export data to MySQL
Technique 10 HDFS egress to HBase
Technique 11 Using HBase as a data sink in MapReduce
2.4 Chapter summary
3 Data serialization—working with text and beyond
3.1 Understanding inputs and outputs in MapReduce
3.2 Processing common serialization formats
Technique 12 MapReduce and XML
Technique 13 MapReduce and JSON
3.3 Big data serialization formats
Technique 14 Working with SequenceFiles
Technique 15 Integrating Protocol Buffers with MapReduce
Technique 16 Working with Thrift
Technique 17 Next-generation data serialization with MapReduce
3.4 Custom file formats
Technique 18 Writing input and output formats for CSV
3.5 Chapter summary
Part 3 Big Data Patterns
4 Applying MapReduce patterns to big data
4.1 Joining
Technique 19 Optimized repartition joins
Technique 20 Implementing a semi-join
4.2 Sorting
Technique 21 Implementing a secondary sort
Technique 22 Sorting keys across multiple reducers
4.3 Sampling
Technique 23 Reservoir sampling
4.4 Chapter summary
5 Streamlining HDFS for big data
5.1 Working with small files
Technique 24 Using Avro to store multiple small files
5.2 Efficient storage with compression
Technique 25 Picking the right compression codec for your data
Technique 26 Compression with HDFS, MapReduce, Pig, and Hive
Technique 27 Splittable LZOP with MapReduce, Hive, and Pig
5.3 Chapter summary
6 Diagnosing and tuning performance problems
6.1 Measuring MapReduce and your environment
6.2 Determining the cause of your performance woes
Technique 28 Investigating spikes in input data
Technique 29 Identifying map-side data skew problems
Technique 30 Determining if map tasks have an overall low throughput
Technique 31 Small files
Technique 32 Unsplittable files
Technique 33 Too few or too many reducers
Technique 34 Identifying reduce-side data skew problems
Technique 35 Determining if reduce tasks have an overall low throughput
Technique 36 Slow shuffle and sort
Technique 37 Competing jobs and scheduler throttling
Technique 38 Using stack dumps to discover unoptimized user code
Technique 39 Discovering hardware failures
Technique 40 CPU contention
Technique 41 Memory swapping
Technique 42 Disk health
Technique 43 Networking
6.3 Visualization
Technique 44 Extracting and visualizing task execution times
6.4 Tuning
Technique 45 Profiling your map and reduce tasks
Technique 46 Avoid the reducer
Technique 47 Filter and project
Technique 48 Using the combiner
Technique 49 Blazingly fast sorting with comparators
Technique 50 Collecting skewed data
Technique 51 Reduce skew mitigation
6.5 Chapter summary
Part 4 Data Science
7 Utilizing data structures and algorithms
7.1 Modeling data and solving problems with graphs
Technique 52 Find the shortest distance between two users
Technique 53 Calculating FoFs
Technique 54 Calculate PageRank over a web graph
7.2 Bloom filters
Technique 55 Parallelized Bloom filter creation in MapReduce
Technique 56 MapReduce semi-join with Bloom filters
7.3 Chapter summary
8 Integrating R and Hadoop for statistics and more
8.1 Comparing R and MapReduce integrations
8.2 R fundamentals
8.3 R and Streaming
Technique 57 Calculate the daily mean for stocks
Technique 58 Calculate the cumulative moving average for stocks
8.4 Rhipe—Client-side R and Hadoop working together
Technique 59 Calculating the CMA using Rhipe
8.5 RHadoop—a simpler integration of client-side R and Hadoop
Technique 60 Calculating CMA with RHadoop
8.6 Chapter summary
9 Predictive analytics with Mahout
9.1 Using recommenders to make product suggestions
Technique 61 Item-based recommenders using movie ratings
9.2 Classification
Technique 62 Using Mahout to train and test a spam classifier
9.3 Clustering with K-means
Technique 63 K-means with a synthetic 2D dataset
9.4 Chapter summary
Part 5 Taming the Elephant
10 Hacking with Hive
10.1 Hive fundamentals
10.2 Data analytics with Hive
Technique 64 Loading log files
Technique 65 Writing UDFs and compressed partitioned tables
Technique 66 Tuning Hive joins
10.3 Chapter summary
11 Programming pipelines with Pig
11.1 Pig fundamentals
11.2 Using Pig to find malicious actors in log data
Technique 67 Schema-rich Apache log loading
Technique 68 Reducing your data with filters and projection
Technique 69 Grouping and counting IP addresses
Technique 70 IP Geolocation using the distributed cache
Technique 71 Combining Pig with your scripts
Technique 72 Combining data in Pig
Technique 73 Sorting tuples
Technique 74 Storing data in SequenceFiles
11.3 Optimizing user workflows with Pig
Technique 75 A four-step process to working rapidly with big data
11.4 Performance
Technique 76 Pig optimizations
11.5 Chapter summary
12 Crunch and other technologies
12.1 What is Crunch?
12.2 Finding the most popular URLs in your logs
Technique 77 Crunch log parsing and basic analytics
12.3 Joins
Technique 78 Crunch’s repartition join
12.4 Cascading
12.5 Chapter summary
13 Testing and debugging
13.1 Testing
Technique 79 Unit Testing MapReduce functions, jobs, and pipelines
Technique 80 Heavyweight job testing with the LocalJobRunner
13.2 Debugging user space problems
Technique 81 Examining task logs
Technique 82 Pinpointing a problem Input Split
Technique 83 Figuring out the JVM startup arguments for a task
Technique 84 Debugging and error handling
13.3 MapReduce gotchas
Technique 85 MapReduce anti-patterns
13.4 Chapter summary
appendix A Related technologies
appendix B Hadoop built-in ingress and egress tools
appendix C HDFS dissected
appendix D Optimized MapReduce join frameworks
index
I first encountered Hadoop in the fall of 2008 when I was working on an internet crawl and analysis project at Verisign. My team was making discoveries similar to those that Doug Cutting and others at Nutch had made several years earlier regarding how to efficiently store and manage terabytes of crawled and analyzed data. At the time, we were getting by with our home-grown distributed system, but the influx of a new data stream and requirements to join that stream with our crawl data couldn't be supported by our existing system in the required timelines.

After some research we came across the Hadoop project, which seemed to be a perfect fit for our needs—it supported storing large volumes of data and provided a mechanism to combine them. Within a few months we'd built and deployed a MapReduce application encompassing a number of MapReduce jobs, woven together with our own MapReduce workflow management system onto a small cluster of 18 nodes. It was a revelation to observe our MapReduce jobs crunching through our data in minutes. Of course we couldn't anticipate the amount of time that we'd spend debugging and performance-tuning our MapReduce jobs, not to mention the new roles we took on as production administrators—the biggest surprise in this role was the number of disk failures we encountered during those first few months supporting production!

As our experience and comfort level with Hadoop grew, we continued to build more of our functionality using Hadoop to help with our scaling challenges. We also started to evangelize the use of Hadoop within our organization and helped kick-start other projects that were also facing big data challenges.

The greatest challenge we faced when working with Hadoop (and specifically MapReduce) was relearning how to solve problems with it. MapReduce is its own flavor of parallel programming, which is quite different from the in-JVM programming that we were accustomed to. The biggest hurdle was the first one—training our brains to think MapReduce, a topic which the book Hadoop in Action by Chuck Lam (Manning Publications, 2010) covers well.

After you're used to thinking in MapReduce, the next challenge is typically related to the logistics of working with Hadoop, such as how to move data in and out of HDFS, and effective and efficient ways to work with data in Hadoop. These areas of Hadoop haven't received much coverage, and that's what attracted me to the potential of this book—that of going beyond the fundamental word-count Hadoop usages and covering some of the more tricky and dirty aspects of Hadoop.

As I'm sure many authors have experienced, I went into this project confidently believing that writing this book was just a matter of transferring my experiences onto paper. Boy, did I get a reality check, but not altogether an unpleasant one, because writing introduced me to new approaches and tools that ultimately helped better my own Hadoop abilities. I hope that you get as much out of reading this book as I did writing it.
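The preface refers to the fundamental word-count job as the baseline that the book moves beyond. For readers who haven't seen it, here is a minimal sketch of that canonical MapReduce program, closely following the standard Apache Hadoop tutorial example rather than any code from the book itself; the class names and argument handling are illustrative.

```java
// Canonical MapReduce word count, modeled on the stock Apache Hadoop
// tutorial example (illustrative only; not code from the book).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in each input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the 1s emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, the job would be launched with an HDFS input directory and a not-yet-existing output directory, for example: `hadoop jar wordcount.jar WordCount /input /output`.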
| Publication date (per publisher) | 13.10.2012 |
|---|---|
| Additional information | illustrations |
| Place of publication | New York |
| Language | English |
| Dimensions | 188 x 237 mm |
| Weight | 875 g |
| Subject areas | Mathematics / Computer Science ► Computer Science ► Operating Systems / Servers |
| | Computer Science ► Databases ► Data Warehouse / Data Mining |
| | Mathematics / Computer Science ► Computer Science ► Software Development |
| | Mathematics / Computer Science ► Computer Science ► Theory / Study |
| | Mathematics / Computer Science ► Computer Science ► Web / Internet |
| ISBN-10 | 1-61729-023-8 / 1617290238 |
| ISBN-13 | 978-1-61729-023-7 / 9781617290237 |
| Condition | New |