Storm Applied

Strategies for Real-Time Event Processing

Sean T. Allen, Peter Pathirana, Matthew Jankowski (Authors)

Book | Softcover

280 pages

2015
Manning Publications (Publisher)
978-1-61729-189-0 (ISBN)

Immediately useful practical guide
Applies Storm to real-world use cases
Takes Storm from development to a fully tuned and optimized production setup

Storm is a tool for processing "big data" in real time. Think of performing real-time analysis of all the tweets going through Twitter.

It's a lot harder to make sense out of data when it's coming at full speed.

Apache Storm's efficient stream processing capabilities are relied upon by giants like Twitter and Yahoo for swiftly extracting intelligence from their Big Data streams.

Storm's fault-tolerance guarantees make it an invaluable and versatile platform in the Big Data landscape. It integrates seamlessly with battle-tested message queuing systems (like Kafka) and NoSQL databases (like Cassandra).

Storm is built to run on the JVM but provides straightforward extensions for working with non-JVM languages like Ruby and Python.

Storm Applied is a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing real-time data streams.

The book starts by building a solid foundation in the Storm essentials. It then quickly dives into real-world case studies that bring the novice up to speed with productionizing Storm: the knowledge needed to scale a high-throughput stream processor and ensure smooth operation within a production cluster.

It moves on to teach readers how to use Trident to treat streams as batches for solving a different class of problems, and covers the tools available within the Storm open source community that are crucial for any seasoned Storm developer.

While prior experience with Storm is not necessary, acquaintance with related Big Data problem solving is helpful. A basic understanding of concurrency and of Java or a similar JVM language is assumed.

Sean T. Allen, Peter Pathirana, and Matthew Jankowski are leaders on the development team at TheLadders, a high-volume, search-intensive web application. Sean is a seasoned architect with an abiding interest in distributed systems. You can follow Sean on Twitter at @SeanTAllen. Peter is a lead engineer specializing in search and recommendations platform architecture. Matthew has spent the last couple of years integrating Storm into TheLadders' problem domain and continues to search for new uses for Storm in solving big data problems. You can follow Matthew on Twitter at @mattjanks16.

Contents
foreword
preface
acknowledgments
about this book
about the cover illustration

Chapter 1 Introducing Storm
What is big data?
How Storm fits into the big data picture
Why you’d want to use Storm
Summary
Chapter 2 Core Storm concepts
Problem definition: GitHub commit count dashboard
Basic Storm concepts
Implementing a GitHub commit count dashboard in Storm
Summary
Chapter 3 Topology design
Approaching topology design
Problem definition: a social heat map
Precepts for mapping the solution to Storm
Initial implementation of the design
Scaling the topology
Topology design paradigms
Summary
Chapter 4 Creating robust topologies
Requirements for reliability
Problem definition: a credit card authorization system
Basic implementation of the bolts
Guaranteed message processing
Replay semantics
Summary
Chapter 5 Moving from local to remote topologies
The Storm cluster
Fail-fast philosophy for fault tolerance within a Storm cluster
Installing a Storm cluster
Getting your topology to run on a Storm cluster
The Storm UI and its role in the Storm cluster
Summary
Chapter 6 Tuning in Storm
Problem definition: Daily Deals! reborn
Initial implementation
Tuning: I wanna go fast
Latency: when external systems take their time
Storm’s metrics-collecting API
Summary
Chapter 7 Resource contention
Changing the number of worker processes running on a worker node
Changing the amount of memory allocated to worker processes (JVMs)
Figuring out which worker nodes/processes a topology is executing on
Contention for worker processes in a Storm cluster
Memory contention within a worker process (JVM)
Memory contention on a worker node
Worker node CPU contention
Worker node I/O contention
Summary
Chapter 8 Storm internals
The commit count topology revisited
Diving into the details of an executor
Routing and tasks
Knowing when Storm’s internal queues overflow
Addressing internal Storm buffers overflowing
Tweaking buffer sizes for performance gain
Summary
Chapter 9 Trident
What is Trident?
Kafka and its role with Trident
Problem definition: Internet radio
Implementing the internet radio design as a Trident topology
Accessing the persisted counts through DRPC
Mapping Trident operations to Storm primitives
Scaling a Trident topology
Summary

afterword
index

Preface

At TheLadders, we’ve been using Storm since it was introduced to the world (version 0.5.x). In those early days, we implemented solutions with Storm that supported noncritical business processes. Our Storm cluster ran uninterrupted for a long time and “just worked.” Little attention was paid to this cluster, as it never really had any problems. It wasn’t until we started identifying more business cases where Storm was a good fit that we started to experience problems. Contention for resources in production, not having a great understanding of how things were working under the covers, sub-optimal performance, and a lack of visibility into the overall health of the system were all issues we struggled with.

This prompted us to focus a lot of time and effort on learning much of what we present in this book. We started with gaining a solid understanding of the fundamentals of Storm, which included reading (and rereading many times) the existing Storm documentation, while also digging into the source code. We then identified some “best practices” for how we liked to design solutions using Storm. We added better monitoring, which enabled us to troubleshoot and tune our solutions in a much more efficient manner.

While the documentation for the fundamentals of Storm was readily available online, we felt there was a lack of documentation for best practices in terms of dealing with Storm in a production environment. We wrote a couple of blog posts based on our experiences with Storm, and when Manning asked us to write a book about Storm, we jumped at the opportunity. We knew we had a lot of knowledge we wanted to share with the world. We hoped to help others avoid the frustrations and pitfalls we had gone through. While we knew that we wanted to share our hard-won experiences with running a production Storm cluster—tuning, debugging, and troubleshooting—what we really wanted was to impart a solid grasp of the fundamentals of Storm. We also wanted to illustrate how flexible Storm is, and how it can be used across a wide range of use cases. We knew ours were just a small sampling of the many use cases among the many companies leveraging Storm.

The result of this is Storm Applied. We’ve tried to identify as many different types of use cases as possible to illustrate how Storm can be used in many scenarios. We cover the core concepts of Storm in hopes of laying a solid foundation before diving into tuning, debugging, and troubleshooting Storm in production. We hope this format works for everyone, from the beginner just getting started with Storm, to the experienced developer who has run into some of the same troubles we have.

This book has been the definition of teamwork, from everyone who helped us at Manning to our colleagues at TheLadders, who very patiently and politely allowed us to test our ideas early on. We hope you are able to find this book useful, no matter your experience level with Storm. We have enjoyed writing it and continue to learn more about Storm every day.

Foreword

“Backend rewrites are always hard.” That’s how ours began, with a simple statement from my brilliant and trusted colleague, Keith Bourgoin. We had been working on the original web analytics backend behind Parse.ly for over a year. We called it “PTrack”. Parse.ly uses Python, so we built our systems atop comfortable distributed computing tools that were handy in that community, such as multiprocessing and celery. Despite our mastery of these, it seemed like every three months, we’d double the amount of traffic we had to handle and hit some other limitation of those systems. There had to be a better way.

So, we started the much-feared backend rewrite. This new scheme to process our data would use small Python processes that communicated via ZeroMQ. We jokingly called it “PTrack3000,” referring to the “Python3000” name given to the future version of Python by the language’s creator, when it was still a far-off pipe dream. By using ZeroMQ, we thought we could squeeze more messages per second out of each process and keep the system operationally simple. But what this setup gained in operational ease and performance, it lost in data reliability.

Then, something magical happened. BackType, a startup whose progress we had tracked in the popular press, was acquired by Twitter. One of the first orders of business upon being acquired was to publicly release its stream processing framework, Storm, to the world. My colleague Keith studied the documentation and code in detail, and realized: Storm was exactly what we needed! It even used ZeroMQ internally (at the time) and layered on other tooling for easy parallel processing, hassle-free operations, and an extremely clever data reliability model. Though it was written in Java, it included some documentation and examples for making other languages, like Python, play nicely with the framework. So, with much glee, “PTrack9000!” (exclamation point required) was born: a new Parse.ly analytics backend powered by Storm.

Nathan Marz, Storm’s original creator, spent some time cultivating the community via conferences, blog posts, and user forums. But in those early days of the project, you had to scrape tiny morsels of Storm knowledge from the vast web. Oh, how I wish Storm Applied, the book you’re currently reading, had already been written in 2011. Although Storm’s documentation on its design rationale was very strong, there were no practical guides on making use of Storm (especially in a production setting) when we adopted it. Frustratingly, despite a surge of popularity over the next three years, there were still no good books on the subject through the end of 2014! No one had put in the significant effort required to detail how Storm components worked, how Storm code should be written, how to tune topology performance, and how to operate these clusters in the real world. That is, until now.

Sean, Matthew, and Peter decided to write Storm Applied by leveraging their hard-earned production experience at TheLadders, and it shows. This will, no doubt, become the definitive practitioner’s guide for Storm users everywhere. Through their clear prose, illuminating diagrams, and practical code examples, you’ll gain as much Storm knowledge in a few short days as it took my team several years to acquire. You will save yourself many stressful firefights, head-scratching moments, and painful code re-architectures. I’m convinced that with the newfound understanding provided by this book, the next time a colleague turns to you and says, “Backend rewrites are always hard,” you’ll be able to respond with confidence: “Not this time.”

Happy hacking!

ANDREW MONTALENTI
COFOUNDER & CTO, PARSE.LY
CREATOR OF STREAMPARSE, A PYTHON PACKAGE FOR STORM

Place of publication	New York
Language	English
Dimensions	190 x 234 mm
Weight	478 g
Binding	Paperback
Subject area	Computer Science ► Databases ► Data Warehouse / Data Mining
Subject area	Mathematics / Computer Science ► Computer Science ► Web / Internet
Keywords	Apache • Apache web server • Big Data • Big data analysis • data stream analysis • Java Virtual Machine (JVM) • Storm
ISBN-10	1-61729-189-7 / 1617291897
ISBN-13	978-1-61729-189-0 / 9781617291890
Condition	New