Big Data - Nathan Marz, James Warren

Big Data

Principles and Best Practices of Scalable Realtime Data Systems
Book | Softcover
425 pages
2015
Manning Publications (publisher)
978-1-61729-034-3 (ISBN)
47.95 incl. VAT
  • Introduction to big data systems
  • Real-time processing of web-scale data
  • Tools like Hadoop, Cassandra, and Storm
  • Extensions to traditional database skills

Services like social networks, web analytics, and intelligent e-commerce often need to manage data at a scale too big for a traditional database. As scale and demand increase, so does complexity.

Fortunately, scalability and simplicity are not mutually exclusive; rather than using some trendy technology, a different approach is needed. Big data systems use many machines working in parallel to store and process data, which introduces fundamental challenges unfamiliar to most developers.

Big Data shows how to build these systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data.

It describes a scalable, easy-to-understand approach to big data systems that can be built and run by a small team. Following a realistic example, this book guides readers through the theory of big data systems, how to use them in practice, and how to deploy and operate them once they're built.

This book requires no previous exposure to large-scale data analysis or NoSQL tools. Familiarity with traditional databases is helpful.

Nathan Marz is the creator of Apache Storm and the originator of the Lambda Architecture for big data systems. He is an engineer at Twitter. He was previously Lead Engineer at BackType, a marketing intelligence company that was acquired by Twitter in July of 2011. He is the author of two major open source projects: Storm, a distributed realtime computation system, and Cascalog, a tool for processing data on Hadoop. He is a frequent speaker and writes a blog at nathanmarz.com.

James Warren is an analytics architect at Storm8 with a background in big data processing, machine learning and scientific computing.

preface
acknowledgments
about this book

1 A new paradigm for Big Data
1.1 How this book is structured
1.2 Scaling with a traditional database
1.3 NoSQL is not a panacea
1.4 First principles
1.5 Desired properties of a Big Data system
1.6 The problems with fully incremental architectures
1.7 Lambda Architecture
1.8 Recent trends in technology
1.9 Example application: SuperWebAnalytics.com
1.10 Summary
Part 1 Batch layer

2 Data model for Big Data
2.1 The properties of data
2.2 The fact-based model for representing data
2.3 Graph schemas
2.4 A complete data model for SuperWebAnalytics.com
2.5 Summary
3 Data model for Big Data: Illustration
3.1 Why a serialization framework?
3.2 Apache Thrift
3.3 Limitations of serialization frameworks
3.4 Summary
4 Data storage on the batch layer
4.1 Storage requirements for the master dataset
4.2 Choosing a storage solution for the batch layer
4.3 How distributed filesystems work
4.4 Storing a master dataset with a distributed filesystem
4.5 Vertical partitioning
4.6 Low-level nature of distributed filesystems
4.7 Storing the SuperWebAnalytics.com master dataset on a distributed filesystem
4.8 Summary
5 Data storage on the batch layer: Illustration
5.1 Using the Hadoop Distributed File System
5.2 Data storage in the batch layer with Pail
5.3 Storing the master dataset for SuperWebAnalytics.com
5.4 Summary
6 Batch layer
6.1 Motivating examples
6.2 Computing on the batch layer
6.3 Recomputation algorithms vs. incremental algorithms
6.4 Scalability in the batch layer
6.5 MapReduce: a paradigm for Big Data computing
6.6 Low-level nature of MapReduce
6.7 Pipe diagrams: a higher-level way of thinking about batch computation
6.8 Summary
7 Batch layer: Illustration
7.1 An illustrative example
7.2 Common pitfalls of data-processing tools
7.3 An introduction to JCascalog
7.4 Composition
7.5 Summary
8 An example batch layer: Architecture and algorithms
8.1 Design of the SuperWebAnalytics.com batch layer
8.2 Workflow overview
8.3 Ingesting new data
8.4 URL normalization
8.5 User-identifier normalization
8.6 Deduplicate pageviews
8.7 Computing batch views
8.8 Summary
9 An example batch layer: Implementation
9.1 Starting point
9.2 Preparing the workflow
9.3 Ingesting new data
9.4 URL normalization
9.5 User-identifier normalization
9.6 Deduplicate pageviews
9.7 Computing batch views
9.8 Summary

Part 2 Serving layer

10 Serving layer
10.1 Performance metrics for the serving layer
10.2 The serving layer solution to the normalization/denormalization problem
10.3 Requirements for a serving layer database
10.4 Designing a serving layer for SuperWebAnalytics.com
10.5 Contrasting with a fully incremental solution
10.6 Summary
11 Serving layer: Illustration
11.1 Basics of ElephantDB
11.2 Building the serving layer for SuperWebAnalytics.com
11.3 Summary

Part 3 Speed layer

12 Realtime views
12.1 Computing realtime views
12.2 Storing realtime views
12.3 Challenges of incremental computation
12.4 Asynchronous versus synchronous updates
12.5 Expiring realtime views
12.6 Summary
13 Realtime views: Illustration
13.1 Cassandra’s data model
13.2 Using Cassandra
13.3 Summary
14 Queuing and stream processing
14.1 Queuing
14.2 Stream processing
14.3 Higher-level, one-at-a-time stream processing
14.4 SuperWebAnalytics.com speed layer
14.5 Summary
15 Queuing and stream processing: Illustration
15.1 Defining topologies with Apache Storm
15.2 Apache Storm clusters and deployment
15.3 Guaranteeing message processing
15.4 Implementing the SuperWebAnalytics.com uniques-over-time speed layer
15.5 Summary
16 Micro-batch stream processing
16.1 Achieving exactly-once semantics
16.2 Core concepts of micro-batch stream processing
16.3 Extending pipe diagrams for micro-batch processing
16.4 Finishing the speed layer for SuperWebAnalytics.com
16.5 Pageviews over time and bounce-rate analysis
16.6 Another look at the bounce-rate-analysis example
16.7 Summary
17 Micro-batch stream processing: Illustration
17.1 Using Trident
17.2 Finishing the SuperWebAnalytics.com speed layer
17.3 Fully fault-tolerant, in-memory, micro-batch processing
17.4 Summary
18 Lambda Architecture in depth
18.1 Defining data systems
18.2 Batch and serving layers
18.3 Speed layer
18.4 Query layer
18.5 Summary

index

When I first entered the world of Big Data, it felt like the Wild West of software development. Many were abandoning the relational database and its familiar comforts for NoSQL databases with highly restricted data models designed to scale to thousands of machines. The number of NoSQL databases, many of them with only minor differences between them, became overwhelming. A new project called Hadoop began to make waves, promising the ability to do deep analyses on huge amounts of data. Making sense of how to use these new tools was bewildering.

At the time, I was trying to handle the scaling problems we were faced with at the company at which I worked. The architecture was intimidatingly complex—a web of sharded relational databases, queues, workers, masters, and slaves. Corruption had worked its way into the databases, and special code existed in the application to handle the corruption. Slaves were always behind. I decided to explore alternative Big Data technologies to see if there was a better design for our data architecture.

One experience from my early software-engineering career deeply shaped my view of how systems should be architected. A coworker of mine had spent a few weeks collecting data from the internet onto a shared filesystem. He was waiting to collect enough data so that he could perform an analysis on it. One day while doing some routine maintenance, I accidentally deleted all of my coworker’s data, setting him behind weeks on his project. I knew I had made a big mistake, but as a new software engineer I didn’t know what the consequences would be. Was I going to get fired for being so careless? I sent out an email to the team apologizing profusely—and to my great surprise, everyone was very sympathetic. I’ll never forget when a coworker came to my desk, patted my back, and said “Congratulations. You’re now a professional software engineer.”

In his joking statement lay a deep unspoken truism in software development: we don’t know how to make perfect software. Bugs can and do get deployed to production. If the application can write to the database, a bug can write to the database as well. When I set about redesigning our data architecture, this experience profoundly affected me. I knew our new architecture not only had to be scalable, tolerant to machine failure, and easy to reason about—but tolerant of human mistakes as well.

My experience re-architecting that system led me down a path that caused me to question everything I thought was true about databases and data management. I came up with an architecture based on immutable data and batch computation, and I was astonished by how much simpler the new system was compared to one based solely on incremental computation. Everything became easier, including operations, evolving the system to support new features, recovering from human mistakes, and doing performance optimization. The approach was so generic that it seemed like it could be used for any data system.

Something confused me though. When I looked at the rest of the industry, I saw that hardly anyone was using similar techniques. Instead, daunting amounts of complexity were embraced in the use of architectures based on huge clusters of incrementally updated databases. So many of the complexities in those architectures were either completely avoided or greatly softened by the approach I had developed. Over the next few years, I expanded on the approach and formalized it into what I dubbed the Lambda Architecture.
When working on a startup called BackType, our team of five built a social media analytics product that provided a diverse set of realtime analytics on over 100 TB of data. Our small team also managed deployment, operations, and monitoring of the system on a cluster of hundreds of machines. When we showed people our product, they were astonished that we were a team of only five people. They would often ask “How can so few people do so much?” My answer was simple: “It’s not what we’re doing, but what we’re not doing.” By using the Lambda Architecture, we avoided the complexities that plague traditional architectures. By avoiding those complexities, we became dramatically more productive.

The Big Data movement has only magnified the complexities that have existed in data architectures for decades. Any architecture based primarily on large databases that are updated incrementally will suffer from these complexities, causing bugs, burdensome operations, and hampered productivity. Although SQL and NoSQL databases are often painted as opposites or as duals of each other, at a fundamental level they are really the same. They encourage this same architecture with its inevitable complexities. Complexity is a vicious beast, and it will bite you regardless of whether you acknowledge it or not.

This book is the result of my desire to spread the knowledge of the Lambda Architecture and how it avoids the complexities of traditional architectures. It is the book I wish I had when I started working with Big Data. I hope you treat this book as a journey—a journey to challenge what you thought you knew about data systems, and to discover that working with Big Data can be elegant, simple, and fun.

Nathan Marz
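For readers new to the idea, here is a minimal, hypothetical Python sketch (not code from the book) of the query-time merge the Lambda Architecture is known for: the batch layer precomputes views over the complete, immutable master dataset, the speed layer keeps realtime views for data the batch layer has not yet processed, and a query combines the two. The store names and numbers below are made up for illustration.

# Hypothetical stand-ins for the serving-layer (batch) and speed-layer
# (realtime) stores, keyed by (url, hour bucket).
batch_view = {("http://example.com", 0): 120, ("http://example.com", 1): 95}
realtime_view = {("http://example.com", 1): 7}  # pageviews not yet absorbed by a batch run

def pageviews_over_time(url, start_hour, end_hour):
    """Merge batch and realtime counts over a range of hour buckets."""
    total = 0
    for hour in range(start_hour, end_hour + 1):
        total += batch_view.get((url, hour), 0)      # complete, but hours out of date
        total += realtime_view.get((url, hour), 0)   # covers only recent data
    return total

print(pageviews_over_time("http://example.com", 0, 1))  # prints 222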

Publication date (per publisher) 14 May 2015
Place of publication New York
Language English
Dimensions 188 x 235 mm
Weight 554 g
Binding Paperback
Subject area Computer Science • Databases • Data Warehouse / Data Mining
Mathematics / Computer Science • Computer Science • Web / Internet
Keywords Big Data • Hadoop • NoSQL
ISBN-10 1-61729-034-3 / 1617290343
ISBN-13 978-1-61729-034-3 / 9781617290343
Condition New