Data Algorithms
O'Reilly Media (Publisher)
978-1-4919-0618-7 (ISBN)
If you are ready to dive into the MapReduce framework for processing large datasets, this practical book takes you step by step through the algorithms and tools you need to build distributed MapReduce applications with Apache Hadoop or Apache Spark.
Each chapter provides a recipe for solving a massive computational problem, such as building a recommendation system. You'll learn how to implement the appropriate MapReduce solution with code that you can use in your projects.
Dr. Mahmoud Parsian covers basic design patterns, optimization techniques, and data mining and machine learning solutions for problems in bioinformatics, genomics, statistics, and social network analysis. This book also includes an overview of MapReduce, Hadoop, and Spark.
Topics include:
- Market basket analysis for a large set of transactions
- Data mining algorithms (K-means, KNN, and Naive Bayes)
- Using huge genomic data to sequence DNA and RNA
- Naive Bayes theorem and Markov chains for data and market prediction
- Recommendation algorithms and pairwise document similarity
- Linear regression, Cox regression, and Pearson correlation
- Allelic frequency and mining DNA
- Social network analysis (recommendation systems, counting triangles, sentiment analysis)
Mahmoud Parsian, Ph.D. in Computer Science, is a practicing software professional with 30 years of experience as a developer, designer, architect, and author. For the past 15 years, he has been involved in Java server-side, databases, MapReduce, and distributed computing. Dr. Parsian currently leads Illumina's Big Data team, which is focused on large-scale genome analytics and distributed computing. He leads and develops scalable regression algorithms; DNA sequencing and RNA sequencing pipelines using Java, MapReduce, Hadoop, HBase, and Spark; and open source tools. He is also the author of JDBC Recipes and JDBC Metadata (both from Apress).
Chapter 1. Secondary Sort: Introduction
Solutions to the Secondary Sort Problem
MapReduce/Hadoop Solution to Secondary Sort
Spark Solution to Secondary Sort
Chapter 2. Secondary Sort: A Detailed Example
Secondary Sorting Technique
Complete Example of Secondary Sorting
Sample Run—Old Hadoop API
Sample Run—New Hadoop API
Chapter 3. Top 10 List
Top N, Formalized
MapReduce/Hadoop Implementation: Unique Keys
Spark Implementation: Unique Keys
Spark Implementation: Nonunique Keys
Spark Top 10 Solution Using takeOrdered()
MapReduce/Hadoop Top 10 Solution: Nonunique Keys
Chapter 4. Left Outer Join
Left Outer Join Example
Implementation of Left Outer Join in MapReduce
Spark Implementation of Left Outer Join
Spark Implementation with leftOuterJoin()
Chapter 5. Order Inversion
Example of the Order Inversion Pattern
MapReduce/Hadoop Implementation of the Order Inversion Pattern
Sample Run
Chapter 6. Moving Average
Example 1: Time Series Data (Stock Prices)
Example 2: Time Series Data (URL Visits)
Formal Definition
POJO Moving Average Solutions
MapReduce/Hadoop Moving Average Solution
Chapter 7. Market Basket Analysis
MBA Goals
Application Areas for MBA
Market Basket Analysis Using MapReduce
Spark Solution
Chapter 8. Common Friends
Input
POJO Common Friends Solution
MapReduce Algorithm
Solution 1: Hadoop Implementation Using Text
Solution 2: Hadoop Implementation Using ArrayListOfLongsWritable
Spark Solution
Chapter 9. Recommendation Engines Using MapReduce
Customers Who Bought This Item Also Bought
Frequently Bought Together
Recommend Connection
Chapter 10. Content-Based Recommendation: Movies
Input
MapReduce Phase 1
MapReduce Phases 2 and 3
Movie Recommendation Implementation in Spark
Chapter 11. Smarter Email Marketing with the Markov Model
Markov Chains in a Nutshell
Markov Model Using MapReduce
Spark Solution
Chapter 12. K-Means Clustering
What Is K-Means Clustering?
Application Areas for Clustering
Informal K-Means Clustering Method: Partitioning Approach
K-Means Distance Function
K-Means Clustering Formalized
MapReduce Solution for K-Means Clustering
K-Means Implementation by Spark
Chapter 13. k-Nearest Neighbors
kNN Classification
Distance Functions
kNN Example
An Informal kNN Algorithm
Formal kNN Algorithm
Java-like Non-MapReduce Solution for kNN
kNN Implementation in Spark
Chapter 14. Naive Bayes
Training and Learning Examples
Conditional Probability
The Naive Bayes Classifier in Depth
The Naive Bayes Classifier: MapReduce Solution for Symbolic Data
The Naive Bayes Classifier: MapReduce Solution for Numeric Data
Naive Bayes Classifier Implementation in Spark
Using Spark and Mahout
Chapter 15. Sentiment Analysis
Sentiment Examples
Sentiment Scores: Positive or Negative
A Simple MapReduce Sentiment Analysis Example
Sentiment Analysis in the Real World
Chapter 16. Finding, Counting, and Listing All Triangles in Large Graphs
Basic Graph Concepts
Importance of Counting Triangles
MapReduce/Hadoop Solution
Spark Solution
Chapter 17. K-mer Counting
Input Data for K-mer Counting
Applications of K-mer Counting
K-mer Counting Solution in MapReduce/Hadoop
K-mer Counting Solution in Spark
Chapter 18. DNA Sequencing
Input Data for DNA Sequencing
Input Data Validation
DNA Sequence Alignment
MapReduce Algorithms for DNA Sequencing
Chapter 19. Cox Regression
The Cox Model in a Nutshell
Cox Regression Using R
Cox Regression Application
Cox Regression POJO Solution
Input for MapReduce
Cox Regression Using MapReduce
Chapter 20. Cochran-Armitage Test for Trend
Cochran-Armitage Algorithm
Application of Cochran-Armitage
MapReduce Solution
Chapter 21. Allelic Frequency
Basic Definitions
Formal Problem Statement
MapReduce Solution for Allelic Frequency
MapReduce Solution, Phase 1
MapReduce Solution, Phase 2
MapReduce Solution, Phase 3
Special Handling of Chromosomes X and Y
Chapter 22. The T-Test
Performing the T-Test on Biosets
MapReduce Problem Statement
Input
Expected Output
MapReduce Solution
Spark Implementation
Chapter 23. Pearson Correlation
Pearson Correlation Formula
Pearson Correlation Example
Data Set for Pearson Correlation
POJO Solution for Pearson Correlation
POJO Solution Test Drive
MapReduce Solution for Pearson Correlation
Hadoop Implementation Classes
Spark Solution for Pearson Correlation
Spearman Correlation Using Spark
Chapter 24. DNA Base Count
FASTA Format
FASTQ Format
MapReduce Solution: FASTA Format
Sample Run
MapReduce Solution: FASTQ Format
Spark Solution: FASTA Format
Spark Solution: FASTQ Format
Chapter 25. RNA Sequencing
Data Size and Format
MapReduce Workflow
RNA Sequencing Analysis Overview
MapReduce Algorithms for RNA Sequencing
Chapter 26. Gene Aggregation
Input
Output
MapReduce Solutions (Filter by Individual and by Average)
Gene Aggregation in Spark
Spark Solution: Filter by Individual
Spark Solution: Filter by Average
Chapter 27. Linear Regression
Basic Definitions
Simple Example
Problem Statement
Input Data
Expected Output
MapReduce Solution Using SimpleRegression
Hadoop Implementation Classes
MapReduce Solution Using R’s Linear Model
Chapter 28. MapReduce and Monoids
Introduction
Definition of Monoid
Monoidic and Non-Monoidic Examples
MapReduce Example: Not a Monoid
MapReduce Example: Monoid
Spark Example Using Monoids
Conclusion on Using Monoids
Functors and Monoids
Chapter 29. The Small Files Problem
Solution 1: Merging Small Files Client-Side
Solution 2: Solving the Small Files Problem with CombineFileInputFormat
Alternative Solutions
Chapter 30. Huge Cache for MapReduce
Implementation Options
Formalizing the Cache Problem
An Elegant, Scalable Solution
Implementing the LRUMap Cache
MapReduce Using the LRUMap Cache
Chapter 31. The Bloom Filter
Bloom Filter Properties
A Simple Bloom Filter Example
Bloom Filters in Guava Library
Using Bloom Filters in MapReduce
Appendix Bioset
Appendix Spark RDDs
Spark Operations
Tuple
RDDs
Publication date (per publisher) | 28 July 2015 |
---|---|
Place of publication | Sebastopol |
Language | English |
Dimensions | 177 x 232 mm |
Weight | 1294 g |
Binding | Paperback |
Subject area | Computer Science ► Databases ► Data Warehouse / Data Mining |
 | Computer Science ► Theory / Study ► Algorithms |
 | Computer Science ► Theory / Study ► Artificial Intelligence / Robotics |
Keywords | Algorithms • Big Data • Data Analysis • Data Mining Algorithms • Data Structures & Algorithms • Hadoop • KNN • MapReduce • MapReduce Algorithms • Spark |
ISBN-10 | 1-4919-0618-9 / 1491906189 |
ISBN-13 | 978-1-4919-0618-7 / 9781491906187 |
Condition | New |