Predictive Analytics and Data Mining - Bala Deshpande, Vijay Kotu

Predictive Analytics and Data Mining (eBook)

Concepts and Practice with RapidMiner

Bala Deshpande, Vijay Kotu (authors)

eBook Download: PDF | EPUB

2014 | 1st edition
446 pages
Elsevier Science (publisher)
978-0-12-801650-3 (ISBN)

Put Predictive Analytics into Action

Learn the basics of Predictive Analysis and Data Mining through an easy to understand conceptual framework and immediately practice the concepts learned using the open source RapidMiner tool. Whether you are brand new to Data Mining or working on your tenth project, this book will show you how to analyze data, uncover hidden patterns and relationships to aid important decisions and predictions. Data Mining has become an essential tool for any enterprise that collects, stores and processes data as part of its operations. This book is ideal for business users, data analysts, business analysts, business intelligence and data warehousing professionals and for anyone who wants to learn Data Mining.

You'll be able to:
1. Gain the necessary knowledge of different data mining techniques, so that you can select the right technique for a given data problem and create a general purpose analytics process.
2. Get up and running fast with more than two dozen commonly used powerful algorithms for predictive analytics using practical use cases.
3. Implement a simple step-by-step process for predicting an outcome or discovering hidden relationships from the data using RapidMiner, an open source GUI based data mining tool.

Predictive analytics and Data Mining techniques covered: Exploratory Data Analysis, Visualization, Decision trees, Rule induction, k-Nearest Neighbors, Naïve Bayesian, Artificial Neural Networks, Support Vector machines, Ensemble models, Bagging, Boosting, Random Forests, Linear regression, Logistic regression, Association analysis using Apriori and FP Growth, K-Means clustering, Density based clustering, Self Organizing Maps, Text Mining, Time series forecasting, Anomaly detection and Feature selection.

Implementation files can be downloaded from the book companion site at www.LearnPredictiveAnalytics.com

- Demystifies data mining concepts with easy to understand language
- Shows how to get up and running fast with 20 commonly used powerful techniques for predictive analysis
- Explains the process of using open source RapidMiner tools
- Discusses a simple 5 step process for implementing algorithms that can be used for performing predictive analytics
- Includes practical use cases and examples

Vijay Kotu is Vice President of Analytics at ServiceNow. He leads the implementation of large-scale data platforms and services to support the company's enterprise business. He has led analytics organizations for over a decade with a focus on data strategy, business intelligence, machine learning, experimentation, engineering, enterprise adoption, and building analytics talent. Prior to joining ServiceNow, he was Vice President of Analytics at Yahoo. He also worked at Life Technologies and Adteractive, where he led marketing analytics, created algorithms to optimize online purchasing behavior, and developed data platforms to manage marketing campaigns. He is a member of the Association for Computing Machinery and a member of the Advisory Board at RapidMiner.


Front Cover
Predictive Analytics and Data Mining
Copyright
Dedication
Contents
Foreword
Preface
WHY THIS BOOK?
WHO CAN USE THIS BOOK?
Acknowledgments
Chapter 1 - Introduction
1.1 WHAT DATA MINING IS
1.2 WHAT DATA MINING IS NOT
1.3 THE CASE FOR DATA MINING
1.4 TYPES OF DATA MINING
1.5 DATA MINING ALGORITHMS
1.6 ROADMAP FOR UPCOMING CHAPTERS
REFERENCES
Chapter 2 - Data Mining Process
2.1 PRIOR KNOWLEDGE
2.2 DATA PREPARATION
2.3 MODELING
2.4 APPLICATION
2.5 KNOWLEDGE
WHAT’S NEXT?
REFERENCES
Chapter 3 - Data Exploration
3.1 OBJECTIVES OF DATA EXPLORATION
3.2 DATA SETS
3.3 DESCRIPTIVE STATISTICS
3.4 DATA VISUALIZATION
3.5 ROADMAP FOR DATA EXPLORATION
REFERENCES
Chapter 4 - Classification
4.1 DECISION TREES
4.2 RULE INDUCTION
4.3 K-NEAREST NEIGHBORS
4.4 NAÏVE BAYESIAN
4.5 ARTIFICIAL NEURAL NETWORKS
4.6 SUPPORT VECTOR MACHINES
4.7 ENSEMBLE LEARNERS
REFERENCES
Chapter 5 - Regression Methods
5.1 LINEAR REGRESSION
5.2 LOGISTIC REGRESSION
CONCLUSION
REFERENCES
Chapter 6 - Association Analysis
6.1 CONCEPTS OF MINING ASSOCIATION RULES
6.2 APRIORI ALGORITHM
6.3 FP-GROWTH ALGORITHM
CONCLUSION
REFERENCES
Chapter 7 - Clustering
CLUSTERING TO DESCRIBE THE DATA
CLUSTERING FOR PREPROCESSING
7.1 TYPES OF CLUSTERING TECHNIQUES
7.2 K-MEANS CLUSTERING
7.3 DBSCAN CLUSTERING
7.4 SELF-ORGANIZING MAPS
REFERENCES
Chapter 8 - Model Evaluation
8.1 CONFUSION MATRIX (OR TRUTH TABLE)
8.2 RECEIVER OPERATOR CHARACTERISTIC (ROC) CURVES AND AREA UNDER THE CURVE (AUC)
8.3 LIFT CURVES
8.4 EVALUATING THE PREDICTIONS: IMPLEMENTATION
CONCLUSION
REFERENCES
Chapter 9 - Text Mining
9.1 HOW TEXT MINING WORKS
9.2 IMPLEMENTING TEXT MINING WITH CLUSTERING AND CLASSIFICATION
CONCLUSION
REFERENCES
Chapter 10 - Time Series Forecasting
10.1 DATA-DRIVEN APPROACHES
10.2 MODEL-DRIVEN FORECASTING METHODS
CONCLUSION
REFERENCES
Chapter 11 - Anomaly Detection
11.1 ANOMALY DETECTION CONCEPTS
11.3 DENSITY-BASED OUTLIER DETECTION
11.4 LOCAL OUTLIER FACTOR
CONCLUSION
REFERENCES
Chapter 12 - Feature Selection
12.1 CLASSIFYING FEATURE SELECTION METHODS
12.2 PRINCIPAL COMPONENT ANALYSIS
12.3 INFORMATION THEORY–BASED FILTERING FOR NUMERIC DATA
CATEGORICAL DATA
12.5 WRAPPER-TYPE FEATURE SELECTION
CONCLUSION
REFERENCES
Chapter 13 - Getting Started with RapidMiner
13.1 USER INTERFACE AND TERMINOLOGY
13.2 DATA IMPORTING AND EXPORTING TOOLS
13.3 DATA VISUALIZATION TOOLS
13.4 DATA TRANSFORMATION TOOLS
13.5 SAMPLING AND MISSING VALUE TOOLS
CONCLUSION
REFERENCES
Comparison of Data Mining Algorithms
Index
About the Authors

Chapter 2

Data Mining Process

Abstract

Successfully uncovering patterns using data mining is an iterative process. Chapter 2 provides a framework to solve the data mining problem. The five-step process outlined in this chapter provides guidelines on gathering subject matter expertise; exploring the data with statistics and visualization; building a model using data mining algorithms; testing the model and deploying it in a production environment; and finally reflecting on new knowledge gained in the cycle. As data mining practice has evolved, various academic and commercial bodies have put forward different frameworks for the data mining process, such as the Cross Industry Standard Process for Data Mining (CRISP-DM) and knowledge discovery in databases (KDD). These data mining frameworks exhibit common characteristics, and hence we will be using a generic framework closely resembling the CRISP process.

Keywords

CRISP; KDD; data mining process; prior knowledge; modeling; data preparation; evaluation; application

The methodical discovery of useful relationships and patterns in data is enabled by a set of iterative activities known as the data mining process. The standard data mining process involves (1) understanding the problem, (2) preparing the data samples, (3) developing the model, (4) applying the model on a data set to see how the model may work in the real world, and (5) production deployment. As data mining practice has evolved, various academic and commercial bodies have put forward different frameworks for the data mining process. In this chapter, we will discuss the key steps involved in building a successful data mining solution. The framework we put forward in this chapter is synthesized from a few data mining frameworks, and is explained using a simple example data set. This chapter serves as a high-level roadmap for building deployable data mining models, and discusses the challenges faced in each step, as well as important considerations and pitfalls to avoid. Most of the concepts discussed in this chapter are reviewed later in the book with detailed explanations and examples.
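As a rough illustration of the steps listed above, the following Python sketch pushes a toy data set through data preparation, model building, and a test on held-out records. Everything in it is an assumption made for illustration: the data values, the function names (prepare, split, build_model, evaluate), and the deliberately trivial "predict the training mean" model. It is not the book's implementation, which uses RapidMiner.

```python
import random

def prepare(records):
    """Step 2 (data preparation): keep only records with no missing value."""
    return [(x, y) for x, y in records if x is not None and y is not None]

def split(records, test_fraction=0.25, seed=0):
    """Hold back part of the sample so the model can be tested on unseen data."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def build_model(train):
    """Step 3 (modeling): a trivial model that always predicts the training mean."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def evaluate(model, test):
    """Step 4 (application): mean absolute error on the held-out records."""
    return sum(abs(model(x) - y) for x, y in test) / len(test)

# Step 1 (understanding the problem) happens before any code is written.
# Hypothetical raw sample with two incomplete records.
raw = [(1, 10.0), (2, 12.0), (3, None), (4, 16.0), (5, 18.0),
       (None, 7.0), (6, 20.0), (7, 22.0), (8, 24.0), (9, 26.0)]

clean = prepare(raw)
train, test = split(clean)
model = build_model(train)
error = evaluate(model, test)
print(f"{len(clean)} clean records, {len(train)} for training, {len(test)} for testing")
print(f"mean absolute error on held-out data: {error:.2f}")
# Step 5 (production deployment) would follow only if the error is acceptable;
# otherwise the process loops back to an earlier step.
```

The held-out error is what tells us whether to loop back and revise the data preparation or the model, which is why the process is described as iterative rather than strictly sequential.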

One of the most popular data mining process frameworks is CRISP-DM, which is an acronym for Cross Industry Standard Process for Data Mining. This framework was developed by a consortium of many companies involved in data mining (Chapman et al., 2000). The CRISP-DM process is the most widely adopted framework for developing data mining solutions. Figure 2.1 provides a visual overview of the CRISP-DM framework. Other data mining frameworks are SEMMA, which is an acronym for Sample, Explore, Modify, Model, and Assess, developed by the SAS Institute (SAS Institute, 2013); DMAIC, which is an acronym for Define, Measure, Analyze, Improve, and Control, used in Six Sigma practice (Kubiak & Benbow, 2005); and the Selection, Preprocessing, Transformation, Data Mining, Interpretation, and Evaluation framework used in the knowledge discovery in databases (KDD) process (Fayyad et al., 1996). We feel all these frameworks exhibit common characteristics, and hence we will be using a generic framework closely resembling the CRISP process. As with any process framework, a data mining process recommends performing a certain set of tasks to achieve optimal output. The process of extracting information from the data is iterative. The steps within the data mining process are not linear: there are many loops back and forth between steps, and at times the process returns to the first step to redefine the data mining problem statement.

Figure 2.1 CRISP data mining framework.

The data mining process presented in Figure 2.2 is a generic set of steps that is agnostic to the business, the algorithm, and the data mining tool. The fundamental objective of any process that involves data mining is to address the analysis question. The problem at hand could be segmentation of customers, prediction of climate patterns, or a simple data exploration. The algorithm used to solve the business question could be automated clustering or an artificial neural network. The software tools used to develop and implement the data mining algorithm could be custom coding, IBM SPSS, SAS, R, or RapidMiner, to mention a few.

Data mining, specifically in the context of big data, has gained a lot of importance in the last few years. Perhaps the most visible and discussed part of data mining is the third step: modeling. It involves building representative models that can be derived from the sample data set and can be used either for predictions (predictive modeling) or for describing the underlying pattern in the data (descriptive or explanatory modeling). Rightfully so, there is plenty of academic and business research on this step, and we have dedicated most of the book to discussing various algorithms and the quantitative foundations that go with them. We specifically wish to emphasize considering data mining as an end-to-end, multistep, iterative process instead of just a model building step. Seasoned data mining practitioners can attest to the fact that the most time-consuming part of the overall data mining process is not the model building part, but the preparation of data, followed by data and business understanding. There are many data mining tools, both open source and commercial, available in the market that can automate the model building. The most commonly used tools include RapidMiner, R, Weka, SAS, SPSS, Oracle Data Miner, Salford, and Statistica (Piatetsky, 2014). Asking the right business questions, gaining in-depth business understanding, sourcing and preparing the data for the data mining task, mitigating implementation considerations, and, most useful of all, gaining knowledge from the data mining process remain crucial to the success of the data mining process. Let's get started with Step 1: framing the data mining question and understanding the context.

Figure 2.2 Data mining process.

2.1. Prior Knowledge

Prior knowledge refers to information that is already known about a subject. The objective of data mining doesn’t emerge in isolation; it always develops on top of existing subject matter and contextual information that is already known. The prior knowledge step in the data mining process helps to define what problem we are solving, how it fits in the business context, and what data we need to solve the problem.

2.1.1. Objective

The data mining process starts with an analysis need, a question, or a business objective. This is possibly the most important step in the data mining process (Shearer, 2000). Without a well-defined statement of the problem, it is impossible to come up with the right data set and pick the right data mining algorithm. Even though the data mining process is sequential, it is common to go back to previous steps and revise the assumptions, approach, and tactics. It is imperative to get the objective of the whole process right, even if it is exploratory data mining.

We are going to explain the data mining process using a hypothetical example. Let’s assume we are in the consumer loan business, where a loan is provisioned for individuals with the collateral of assets like a home or car, i.e., a mortgage or an auto loan. As many home owners know, an important component of the loan, for the borrower and the lender, is the interest rate at which the borrower repays the loan on top of the principal. The interest rate on a loan depends on a gamut of variables like the current federal funds rate as determined by the central bank, the borrower’s credit score, income level, home value, initial deposit (down payment) amount, current assets and liabilities of the borrower, etc. The key factor here is whether the lender sees enough reward (interest on the loan) for the risk of losing the principal (borrower’s default on the loan). In an individual case, the default status of a loan is Boolean: either the borrower defaults or not during the period of the loan. But in a group of tens of thousands of borrowers, we can find the default rate, a continuous numeric variable that indicates the percentage of borrowers who default on their loans. All the variables related to the borrower, like credit score, income, and current liabilities, are used to assess the default risk in a related group; based on this, the interest rate is determined for a loan. The business objective of this hypothetical use case is: If we know the interest rates of past borrowers with a range of credit scores, can we predict the interest rate for a new borrower?
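To make the two quantities in this example concrete, here is a minimal Python sketch. It first turns Boolean default flags for a group of borrowers into a default rate, and then fits a straight line of interest rate against credit score to score a new borrower. The borrower records, the function names (default_rate, fit_line), and the choice of a least-squares line are all assumptions made for illustration; the excerpt does not say which algorithm the book applies to this objective.

```python
def default_rate(defaulted):
    """Percentage of borrowers in a group whose Boolean default flag is True."""
    return 100.0 * sum(defaulted) / len(defaulted)

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical group of ten borrowers: True means the loan defaulted.
group = [False, False, True, False, False, False, False, True, False, False]
rate = default_rate(group)

# Hypothetical past borrowers: credit score and interest rate in percent.
credit_scores = [500, 600, 700, 800]
interest_rates = [9.0, 8.0, 7.0, 6.0]
slope, intercept = fit_line(credit_scores, interest_rates)

# Predicted interest rate for a new borrower with a credit score of 720.
new_score = 720
predicted = intercept + slope * new_score
print(f"default rate: {rate:.1f}%")
print(f"predicted rate for score {new_score}: {predicted:.2f}%")
```

The made-up data lie exactly on a line, so the fit is perfect here; real loan data would be noisy and would involve the many other variables the example mentions.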

2.1.2. Subject Area

The process of data mining uncovers hidden patterns in the data set by exposing relationships between attributes. The difficulty is that it uncovers a lot of patterns, and many of them are false signals. It is up to the data mining practitioner to filter through the patterns and accept the ones that are valid and relevant to answering the objective question. Hence, it is essential to know the subject matter, the context, and the business process generating the data.
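The false-signal problem is easy to reproduce. The Python sketch below generates a target and 500 attributes that are pure random noise, then counts how many attributes still appear to correlate with the target. The sample size, the number of attributes, the 0.4 cutoff, and the seed are arbitrary assumptions chosen for illustration, not values from the book.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(42)           # fixed seed so the run is repeatable
n_records, n_attributes = 20, 500

# The target and every attribute are independent noise: no real relationship exists.
target = [rng.gauss(0, 1) for _ in range(n_records)]
correlations = []
for _ in range(n_attributes):
    attribute = [rng.gauss(0, 1) for _ in range(n_records)]
    correlations.append(abs(pearson(attribute, target)))

strongest = max(correlations)
n_false_signals = sum(r > 0.4 for r in correlations)
print(f"strongest |correlation| found in pure noise: {strongest:.2f}")
print(f"attributes with |correlation| > 0.4: {n_false_signals} of {n_attributes}")
```

With few records and many attributes, some noise attributes will look related to the target purely by chance. Subject matter knowledge is what lets the practitioner reject such patterns.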

The lending business is one of the oldest, most prevalent, and most complex of all businesses. If the data mining objective is to predict the interest rate, then it is important to know how the lending business works, why the prediction matters, what we do once we know the predicted interest rate, what data points can be collected from borrowers, what data points cannot be collected because of regulations, what other external factors can affect the...

Publication date (per publisher)	27.11.2014
Language	English
Subject area	Mathematics / Computer Science ► Computer Science ► Operating Systems / Servers
	Computer Science ► Databases ► Data Warehouse / Data Mining
	Computer Science ► Theory / Studies ► Artificial Intelligence / Robotics
ISBN-10	0-12-801650-7 / 0128016507
ISBN-13	978-0-12-801650-3 / 9780128016503


PDF (Adobe DRM)
Size: 41.0 MB

Copy protection: Adobe DRM
Adobe DRM is a copy protection scheme intended to protect the eBook from misuse. The eBook is authorized to your personal Adobe ID at download time. You can then read the eBook only on devices that are also registered to your Adobe ID.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables, and figures. A PDF can be displayed on almost any device, but it is of limited use on small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need an Adobe ID and the Adobe Digital Editions software (free). We advise against using the OverDrive Media Console; in our experience it frequently causes problems with Adobe DRM.
eReader: This eBook can be read on (almost) all eBook readers. It is not compatible with the Amazon Kindle, however.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need an Adobe ID and a free app.

Additional feature: online reading
In addition to downloading this eBook, you can also read it online in your web browser.

Buying eBooks from abroad
For tax law reasons we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.

EPUB (Adobe DRM)
Size: 25.5 MB

File format: EPUB (Electronic Publication)
EPUB is an open standard for eBooks and is particularly well suited to fiction and non-fiction. The running text adapts dynamically to the display and font size, so EPUB also works well on mobile reading devices.
