Advanced Data Analytics Using Python (eBook)
XV, 186 pages
Apress (publisher)
978-1-4842-3450-1 (ISBN)
- Work with data analysis techniques such as classification, clustering, regression, and forecasting
- Handle structured and unstructured data, ETL techniques, and different kinds of databases such as Neo4j, Elasticsearch, MongoDB, and MySQL
- Examine the different big data frameworks, including Hadoop and Spark
- Discover advanced machine learning concepts such as semi-supervised learning, deep learning, and NLP
Gain a broad foundation of advanced data analytics concepts and discover the recent revolution in databases such as Neo4j, Elasticsearch, and MongoDB. This book discusses how to implement ETL techniques including topical crawling, which is applied in domains such as high-frequency algorithmic trading and goal-oriented dialog systems. You'll also see examples of machine learning concepts such as semi-supervised learning, deep learning, and NLP. Advanced Data Analytics Using Python also covers important traditional data analysis techniques such as time series and principal component analysis. After reading this book you will have experience of every technical aspect of an analytics project. You'll get to know the concepts using Python code, giving you samples to use in your own projects.

Who This Book Is For
Data scientists and software developers interested in the field of data analytics.
Sayan Mukhopadhyay has more than 13 years of industry experience and has been associated with companies such as Credit-Suisse, PayPal, CA Technology, CSC, and Mphasis. He has a deep understanding of the applications of data analysis in domains such as investment banking, online payments, online advertising, IT infrastructure, and retail. His area of expertise is applied high-performance computing in distributed and data-driven environments such as real-time analysis and high-frequency trading.
Table of Contents
About the Author
About the Technical Reviewer
Acknowledgments
Chapter 1: Introduction
Why Python?
When to Avoid Using Python
OOP in Python
Calling Other Languages in Python
Exposing the Python Model as a Microservice
High-Performance API and Concurrent Programming
Chapter 2: ETL with Python (Structured Data)
MySQL
How to Install MySQLdb?
Database Connection
INSERT Operation
READ Operation
DELETE Operation
UPDATE Operation
COMMIT Operation
ROLL-BACK Operation
Elasticsearch
Connection Layer API
Neo4j Python Driver
neo4j-rest-client
In-Memory Database
MongoDB (Python Edition)
Import Data into the Collection
Create a Connection Using pymongo
Access Database Objects
Insert Data
Update Data
Remove Data
Pandas
ETL with Python (Unstructured Data)
E-mail Parsing
Topical Crawling
Crawling Algorithms
Chapter 3: Supervised Learning Using Python
Dimensionality Reduction with Python
Correlation Analysis
Principal Component Analysis
Mutual Information
Classifications with Python
Semisupervised Learning
Decision Tree
Which Attribute Comes First?
Random Forest Classifier
Naive Bayes Classifier
Support Vector Machine
Nearest Neighbor Classifier
Sentiment Analysis
Image Recognition
Regression with Python
Least Square Estimation
Logistic Regression
Classification and Regression
Intentionally Bias the Model to Over-Fit or Under-Fit
Dealing with Categorical Data
Chapter 4: Unsupervised Learning: Clustering
K-Means Clustering
Choosing K: The Elbow Method
Distance or Similarity Measure
Properties
General and Euclidean Distance
Squared Euclidean Distance
Distance Between String-Edit Distance
Levenshtein Distance
Needleman–Wunsch Algorithm
Similarity in the Context of Document
Types of Similarity
What Is Hierarchical Clustering?
Bottom-Up Approach
Algorithm
Distance Between Clusters
Single Linkage Method
Complete Linkage Method
Average Linkage Method
Top-Down Approach
Algorithm
Graph Theoretical Approach
How Do You Know If the Clustering Result Is Good?
Chapter 5: Deep Learning and Neural Networks
Backpropagation
Backpropagation Approach
Generalized Delta Rule
Update of Output Layer Weights
Update of Hidden Layer Weights
BPN Summary
Backpropagation Algorithm
Other Algorithms
TensorFlow
Recurrent Neural Network
Chapter 6: Time Series
Classification of Variation
Analyzing a Series Containing a Trend
Curve Fitting
Removing Trends from a Time Series
Analyzing a Series Containing Seasonality
Removing Seasonality from a Time Series
By Filtering
By Differencing
Transformation
To Stabilize the Variance
To Make the Seasonal Effect Additive
To Make the Data Distribution Normal
Cyclic Variation
Irregular Fluctuations
Stationary Time Series
Stationary Process
Autocorrelation and the Correlogram
Estimating Autocovariance and Autocorrelation Functions
Time-Series Analysis with Python
Useful Methods
Moving Average Process
Fitting Moving Average Process
Autoregressive Processes
Estimating Parameters of an AR Process
Mixed ARMA Models
Integrated ARMA Models
The Fourier Transform
An Exceptional Scenario
Missing Data
Chapter 7: Analytics at Scale
Hadoop
MapReduce Programming
Partitioning Function
Combiner Function
HDFS File System
MapReduce Design Pattern
Summarization Pattern
Filtering Pattern
Join Patterns
Spark
Analytics in the Cloud
Internet of Things
Index
Publication date (per publisher) | 29 March 2018 |
---|---|
Additional info | XV, 186 p. 18 illus. |
Place of publication | Berkeley |
Language | English |
Subject area | Mathematics / Computer Science ► Computer Science ► Databases; Computer Science ► Programming Languages / Tools ► Python |
Keywords | Analytics • Apache Spark • Deep learning • Elastic Search • Hadoop • machine learning • Neo4j • Python • Storm • Time Series |
ISBN-10 | 1-4842-3450-2 / 1484234502 |
ISBN-13 | 978-1-4842-3450-1 / 9781484234501 |
Size: 2.3 MB
DRM: Digital watermark
This eBook contains a digital watermark and is thereby personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.
File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly suitable for technical books with columns, tables, and figures. A PDF can be displayed on almost any device but is only of limited suitability for small displays (smartphone, eReader).
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, such as Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, such as the free Adobe Digital Editions app.
Additional feature: Read online
In addition to downloading it, you can also read this eBook online in your web browser.
Buying eBooks from abroad
For tax reasons, we can sell eBooks only within Germany and Switzerland. Unfortunately, we cannot fulfill eBook orders from other countries.