Pro Hadoop Data Analytics - Kerry Koitzsch

Pro Hadoop Data Analytics (eBook)

Designing and Building Big Data Systems using the Hadoop Ecosystem
eBook Download: PDF
2016 | 1st ed.
XXI, 298 pages
Apress (publisher)
978-1-4842-1910-2 (ISBN)
System requirements
42.79 incl. VAT
  • Download available immediately

Learn advanced analytical techniques and leverage existing toolkits to make your analytic applications more powerful, precise, and efficient. This book provides the right combination of architecture, design, and implementation information to create analytical systems which go beyond the basics of classification, clustering, and recommendation.

In Pro Hadoop Data Analytics best practices are emphasized to ensure coherent, efficient development. A complete example system will be developed using standard third-party components which will consist of the toolkits, libraries, visualization and reporting code, as well as support glue to provide a working and extensible end-to-end system.

The book emphasizes four important topics:

  • The importance of end-to-end, flexible, configurable, high-performance data pipeline systems with analytical components as well as appropriate visualization results. 
  • Best practices and structured design principles. This will include strategic topics as well as the how-to example portions.
  • The importance of mix-and-match or hybrid systems, using different analytical components in one application to accomplish application goals. The hybrid approach will be prominent in the examples.
  • Use of existing third-party libraries is key to effective development. Deep dive examples of the functionality of some of these toolkits will be showcased as you develop the example system.

What You'll Learn 

  • The what, why, and how of building big data analytic systems with the Hadoop ecosystem
  • Libraries, toolkits, and algorithms to make development easier and more effective
  • Best practices to use when building analytic systems with Hadoop, and metrics to measure performance and efficiency of components and systems
  • How to connect to standard relational databases, noSQL data sources, and more
  • Useful case studies and example components which assist you in creating your own systems
Who This Book Is For

Software engineers, architects, and data scientists with an interest in the design and implementation of big data analytical systems using Hadoop, the Hadoop ecosystem, and other associated technologies.


Kerry Koitzsch is a software engineer and student of history interested in the early history of science, particularly chemistry. He frequently publishes papers and attends conferences on scientific and historical topics, including early chemistry and alchemy, sociology of science, and other historical subjects. He has presented many lectures, talks, and demonstrations on a variety of subjects for the United States Army, the Society for Utopian Studies, American Association for Artificial Intelligence (AAAI), Association for Studies in Esotericism (ASE), and others, and has published many papers, with two books on historical subjects to be published in 2016. His most recent published work is a chapter in 'The Individual and Utopia', a collection of sociological papers, published by Ashgate Press.

He was educated at Interlochen Arts Academy, MIT, and the San Francisco Conservatory of Music. He served in the United States Army and United States Army Reserve, and is the recipient of the United States Army Achievement Medal.  For the last thirty years he has been a software engineer specializing in computer vision, machine learning, and database technologies, and currently lives and works in Sunnyvale, California.


Learn advanced analytical techniques and leverage existing tool kits to make your analytic applications more powerful, precise, and efficient. This book provides the right combination of architecture, design, and implementation information to create analytical systems that go beyond the basics of classification, clustering, and recommendation.

Pro Hadoop Data Analytics emphasizes best practices to ensure coherent, efficient development. A complete example system will be developed using standard third-party components that consist of the tool kits, libraries, visualization and reporting code, as well as support glue to provide a working and extensible end-to-end system.

The book also highlights the importance of end-to-end, flexible, configurable, high-performance data pipeline systems with analytical components as well as appropriate visualization results. You'll discover the importance of mix-and-match or hybrid systems, using different analytical components in one application. This hybrid approach will be prominent in the examples.

What You'll Learn

  • Build big data analytic systems with the Hadoop ecosystem
  • Use libraries, tool kits, and algorithms to make development easier and more effective
  • Apply metrics to measure performance and efficiency of components and systems
  • Connect to standard relational databases, noSQL data sources, and more
  • Follow case studies with example components to create your own systems

Who This Book Is For

Software engineers, architects, and data scientists with an interest in the design and implementation of big data analytical systems using Hadoop, the Hadoop ecosystem, and other associated technologies.

Kerry Koitzsch is a software engineer with an interest in the early history of science, particularly chemistry. He frequently publishes papers and attends conferences on scientific and historical topics, including early chemistry and alchemy, and sociology of science. He has presented many lectures, talks, and demonstrations on a variety of subjects for the United States Army, the Society for Utopian Studies, American Association for Artificial Intelligence (AAAI), Association for Studies in Esotericism (ASE), and others. He has also published several papers and written two historical books.

Kerry was educated at Interlochen Arts Academy, MIT, and the San Francisco Conservatory of Music. He served in the United States Army and United States Army Reserve, and is the recipient of the United States Army Achievement Medal. He has been a software engineer specializing in computer vision, machine learning, and database technologies for 30 years, and currently lives and works in Sunnyvale, California.

PART I: CONCEPTS

Chapter 1: Overview: Building Data Analytic Systems with Hadoop
In this chapter we discuss what analytic systems using Hadoop are, why they are important, data sources which may be used, and applications which are, and are not, suitable for a distributed system approach using Hadoop.
Subtopics:
1. Introduction: The Need for Distributed Analysis
2. How the Hadoop Ecosystem Implements Big Data Analysis
3. A Survey of the Hadoop Ecosystem
4. Architectures for Building
5. Summary

Chapter 2: Programming Languages: A Scala and Python Refresher
This chapter consists of a concise overview of the Scala and Python programming languages, and details why these languages are important ingredients of most modern Hadoop analytical systems. The chapter is primarily aimed at Java/C++ programmers who need a quick review of, or introduction to, the Scala and Python programming languages.
Subtopics:
1. Motivation: Selecting the Right Language(s) Defines the Application
2. Review of Scala
3. Review of Python
4. Programming Applications and Examples
5. Summary

Chapter 3: Necessary Ingredients: Standard Toolkits for Hadoop and Analytics
In this chapter we describe an example system which we develop throughout the remainder of the book using standard toolkits from the Hadoop ecosystem, and other analytical toolkits in combination with development components such as Maven, OpenCV, Apache Mahout, and others to create a Hadoop-based system appropriate for a variety of applications.
Subtopics:
1. Libraries, Components, and Toolkits: A Survey
2. Numerical and Statistical Libraries: R, Weka, and Others
3. Hadoop Toolkits for Analysis: Mahout and Friends
4. Apache Spark Libraries and Components: H2O, Sparkling Water, and More
5. Examples of Use and System Building
6. Summary

Chapter 4: Relational, NoSQL, and Graph Databases
In this chapter we describe relational databases such as MySQL, NoSQL databases such as Cassandra, and graph databases such as Neo4j, how to integrate them with the Hadoop ecosystem, and how to create customized data sources and sinks using Apache Camel.
Subtopics:
1. Introduction to Databases: Relational, NoSQL, and Graph
2. Relational Data Sources
3. NoSQL Data Sources: Cassandra
4. Graph Databases: Neo4j
5. Integrating Data with the Analytical Engine
6. Summary

Chapter 5: Data Pipelines and How to Construct Them
In this chapter we describe how to construct basic data pipelines using data sources and the Hadoop ecosystem. We provide an end-to-end example of how data sources may be linked and processed using Hadoop and other analytical components, and how this is similar to a standard ETL process.
Subtopics:
1. The Basic Data Pipeline
2. Data Sources and Sinks
3. Computation and Transformation
4. Visualizing and Reporting the Results
5. Summary

Chapter 6: Advanced Search Techniques with Hadoop, Lucene, and Solr
In this chapter we describe the structure and use of the Lucene and Solr third-party search engine components, how to use them with Hadoop, and how to develop advanced search capability customized for an analytical application.
Subtopics:
1. Introduction to Customized Search Engines
2. Distributed Search Techniques
3. Basic Examples: A Custom Search Component
4. Extended Examples: Scaling, Tuning, and Customizing the Search Component
5. Summary

PART II: ARCHITECTURES AND ALGORITHMS

Chapter 7: An Overview of Analytical Techniques and Algorithms
In this chapter, we provide an overview of four categories of algorithm: statistical, Bayesian, ontology-driven, and hybrid algorithms which leverage the more basic algorithms found in standard libraries to perform more in-depth and accurate analyses using Hadoop.
Subtopics:
1. Survey of Algorithm Types
2. Statistical / Numerical Techniques
3. Bayesian Techniques
4. Ontology-Driven Algorithms
5. Hybrid Algorithms: Combining Algorithm Types
6. Code Examples
7. Summary

Chapter 8: Rule Engines, System Control, and System Orchestration
In this chapter, we describe the Drools rule engine and how it may be used to control and orchestrate Hadoop analysis pipelines. We describe an example rule-based controller which can be used for a variety of data types and applications in combination with the Hadoop ecosystem.
Subtopics:
1. Introduction to Rule Systems: Drools
2. Rule-Based Software System Control
3. System Orchestration with Drools
4. Analytical Engine Example with Rule Control
5. Summary

Chapter 9: Putting it All Together: Designing a Complete Analytical System
In this chapter, we describe an end-to-end design example, using many of the components discussed so far, as well as ‘best practices’ to use during the requirements acquisition, planning, architecting, development, and test phases of the system development project.
Subtopics:
1. Goals and Requirements for Analytical System Building
2. Architecture
3. Initial Code Framework Example
4. Extended Code Framework Example
5. Summary

PART III: COMPONENTS AND SYSTEMS

Chapter 10: Using Library Components for Statistical Analytics and Data Mining
In this chapter, we describe four standard statistical analysis packages: R/Weka, MLlib, Mahout, and NumPy Extended. These toolkits are used to develop a data mining example using a Hadoop cluster and a variety of the Hadoop ecosystem components to provide a dashboard-based result report.
Subtopics:
1. A Survey of Data Mining Techniques and Applications
2. R/Weka Example
3. NumPy Extended Example
4. Integration with Hadoop Analytical Components
5. Data Mining Example
6. Summary

Chapter 11: Semantic Web Technologies and Natural Language Processing
In this chapter, we describe the use of knowledge information sources such as taxonomies, ontologies, and grammars, why they are useful, and how to integrate them with Hadoop analytical components as well as with natural language processing components to provide an added layer of ease-of-use to an analytical system.
Subtopics:
1. Introduction to Semantic Web Technologies
2. Semantic Web For Hadoop (Examples)
3. Data Integration with Semantic Web Technologies
4. Code Examples with Data Integration using Apache Camel
5. Extended Example
6. Summary

Chapter 12: Machine Learning Components with Hadoop
In this chapter, we discuss a number of machine learning components including neural net, genetic algorithm, Markov modeling, and hybrid components, and how they may be used with the Hadoop ecosystem to provide cognitive computing elements to an analytical engine.
Subtopics:
1. Introduction: The Need for Machine Learning
2. Machine Learning Toolkits and Hadoop
3. Code Examples using Apache Mahout
4. Extended Code Examples
5. Neural Nets, Genetic Algorithms, and Hybrids
6. Summary

Chapter 13: Data Visualizers: Seeing and Interacting with the Analysis
In this chapter, we discuss how to create data visualization components, how to connect them with the analytical modules of the system, and how to provide the user with the ability to interact with the charts, dashboards, and reports.
Subtopics:
1. Introduction to Data Visualization: The Need to See Results
2. Visualizers for Simple Data: Some Examples
3. Data Visualizers and Hadoop: Some Examples
4. Visualizers for more than Two Dimensions (three-D examples and extended plots/charting)
5. Summary: Future Directions for Data Visualization

PART IV: CASE STUDIES AND APPLICATIONS

Chapter 14: A Case Study in Bioinformatics: Analyzing Microscope Slide Data
In this chapter, we describe an application to analyze microscopic slide data such as might be found in medical examinations of patient samples. We illustrate how a Hadoop system might be used on a small Hadoop cluster to organize, analyze, and correlate bioinformatic data.
Subtopics:
1. Introduction to Bioinformatics
2. Analyzing Microscope Slide Data Automatically
3. Basic Examples
4. Extended Examples
5. Summary

Chapter 15: A Bayesian Analysis Software Component: Identifying Credit Card Fraud
In this chapter, we describe a Bayesian analysis component plugin which may be used to analyze credit card transactions in order to identify fraudulent use of the credit card by illicit users.
Subtopics:
1. Introduction to Bayesian Analysis
2. The Problem of Credit Fraud and Possible Solutions
3. Basic Applications of the Data Models
4. Examples of Fraud Detection
5. Summary

Chapter 16: Searching for Oil: Geological Data Analysis with Mahout
In this chapter, we describe a system which uses geospatial data, ontologies, and other semantic web information to predict where geological resources, such as oil or bauxite (aluminum ore), might be found.
Subtopics:
1. Introduction to the Geospatial Data Arena
2. Components and Architecture
3. Data Sources for Geospatial Data
4. Basic Examples and Visualizations
5. Extended Examples
6. Summary

Chapter 17: ‘Image as Big Data’ Systems: Some Case Studies
In this chapter, we describe the use of ‘images as big data’ and how image data may be used in combination with the Hadoop ecosystem to provide information for a variety of systems.
Subtopics:
1. Introduction to the Image as Big Data Concept
2. Components and Architecture
3. Data Sources for Imagery and How to Use Them
4. The Image as Big Data Pipeline
5. Examples
6. Summary

Chapter 18: A Generic Data Pipeline Analytical System
In this chapter, we detail an end-to-end analytical system using many of the techniques we discussed throughout the book to provide an evaluation system the user may extend and edit to create her own Hadoop data analysis system.
Subtopics:
1. Architecture and Description of Example System
2. How to obtain and run the system
3. Basic examples
4. Extended Examples
5. How to extend the system for custom applications
6. Summary

Chapter 19: Conclusions and The Future of Big Data Analysis
In this chapter we sum up what we have learned in the previous chapters and discuss some of the developing trends in big data analysis including ‘incubator’ projects and ‘young’ projects for data analysis, and we speculate on what the future holds for big data analysis and the Hadoop ecosystem (it can only continue to grow).
Subtopics:
1. Conclusions: The Current state of Hadoop Data Analytics

Publication date (per publisher) 29 December 2016
Additional info XXI, 298 p. 161 illus., 152 illus. in color.
Place of publication Berkeley
Language English
Subject areas Computer Science Databases Data Warehouse / Data Mining
Mathematics / Computer Science Computer Science Networks
Mathematics / Computer Science Computer Science Programming Languages / Tools
Computer Science Software Development Object Orientation
Computer Science Theory / Studies Algorithms
Keywords algorithms • Analytics • Apache Mahout • Architecture • data analytics • data visualisation • Hadoop • Lucene • machine learning • Maven • NoSQL • OpenCV • Python • Relational Database • Scala • Solr
ISBN-10 1-4842-1910-4 / 1484219104
ISBN-13 978-1-4842-1910-2 / 9781484219102
PDF (watermark)
Size: 22.7 MB

DRM: Digital watermark
This eBook contains a digital watermark and is thereby personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly suitable for technical books with columns, tables, and figures. A PDF can be displayed on almost all devices, but is only of limited suitability for small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read with (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.

Buying eBooks from abroad
For tax law reasons we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.
