Data Science - Field Cady

Data Science

The Executive Summary - A Technical Book for Non-Technical Professionals

Field Cady (Author)

Book | Hardcover
208 pages
2021
John Wiley & Sons Inc (Publisher)
978-1-119-54408-1 (ISBN)
79.13 incl. VAT
Tap into the power of data science with this comprehensive resource for non-technical professionals

Data Science: The Executive Summary – A Technical Book for Non-Technical Professionals is a comprehensive resource for people in non-engineering roles who want to fully understand data science and analytics concepts. Accomplished data scientist and author Field Cady describes both the "business side" of data science, including what problems it solves and how it fits into an organization, and the technical side, including analytical techniques and key technologies.

Data Science: The Executive Summary covers topics like:

Assessing whether your organization needs data scientists, and what to look for when hiring them
When Big Data is the best approach to use for a project, and when it actually ties analysts’ hands
Cutting-edge Artificial Intelligence, as well as classical approaches that work better for many problems
How many techniques rely on dubious mathematical idealizations, and when you can work around them

Perfect for executives who make critical decisions based on data science and analytics, as well as managers who hire and assess the work of data scientists, Data Science: The Executive Summary also belongs on the bookshelves of salespeople and marketers who need to explain what a data analytics product does. Finally, data scientists themselves will improve their technical work with insights into the goals and constraints of the business situation.

Field Cady is a data scientist and author in the Seattle area. Most of his career has focused on consulting for clients of all sizes in a range of industries. More recently he focused on using AI to mine scientific literature at the Allen Institute for Artificial Intelligence. His previous book, The Data Science Handbook, was published in 2017. His work has been covered in Wired, MIT Press, and the Wall Street Journal, among others.

1 Introduction

1.1 Why Managers Need to Know About Data Science

1.2 The New Age of Data Literacy

1.3 Data-Driven Development

1.4 How to Use this Book

2 The Business Side of Data Science

2.1 What Is Data Science?

2.1.1 What Data Scientists Do

2.1.2 History of Data Science

2.1.3 Data Science Roadmap

2.1.4 Demystifying the Terms: Data Science, Machine Learning, Statistics, and Business Intelligence

2.1.4.1 Machine Learning

2.1.4.2 Statistics

2.1.4.3 Business Intelligence

2.1.5 What Data Scientists Don’t (Necessarily) Do

2.1.5.1 Working Without Data

2.1.5.2 Working with Data that Can’t Be Interpreted

2.1.5.3 Replacing Subject Matter Experts

2.1.5.4 Designing Mathematical Algorithms

2.2 Data Science in an Organization

2.2.1 Types of Value Added

2.2.1.1 Business Insights

2.2.1.2 Intelligent Products

2.2.1.3 Building Analytics Frameworks

2.2.1.4 Offline Batch Analytics

2.2.2 One-Person Shops and Data Science Teams

2.2.3 Related Job Roles

2.2.3.1 Data Engineer

2.2.3.2 Data Analyst

2.2.3.3 Software Engineer

2.3 Hiring Data Scientists

2.3.1 Do I Even Need Data Science?

2.3.2 The Simplest Option: Citizen Data Scientists

2.3.3 The Harder Option: Dedicated Data Scientists

2.3.4 Programming, Algorithmic Thinking, and Code Quality

2.3.5 Hiring Checklist

2.3.6 Data Science Salaries

2.3.7 Bad Hires and Red Flags

2.3.8 Advice with Data Science Consultants

2.4 Management Failure Cases

2.4.1 Using Them as Devs

2.4.2 Inadequate Data

2.4.3 Using Them as Graph Monkeys

2.4.4 Nebulous Questions

2.4.5 Laundry Lists of Questions Without Prioritization

3 Working with Modern Data

3.1 Unstructured Data and Passive Collection

3.2 Data Types and Sources

3.3 Data Formats

3.3.1 CSV Files

3.3.2 JSON Files

3.3.3 XML and HTML

3.4 Databases

3.4.1 Relational Databases and Document Stores

3.4.2 Database Operations

3.5 Data Analytics Software Architectures

3.5.1 Shared Storage

3.5.2 Shared Relational Database

3.5.3 Document Store+Analytics RDB

3.5.4 Storage+Parallel Processing

4 Telling the Story, Summarizing Data

4.1 Choosing What to Measure

4.2 Outliers, Visualizations, and the Limits of Summary Statistics: A Picture Is Worth a Thousand Numbers

4.3 Experiments, Correlation, and Causality

4.4 Summarizing One Number

4.5 Key Properties to Assess: Central Tendency, Spread, and Heavy Tails

4.5.1 Measuring Central Tendency

4.5.1.1 Mean

4.5.1.2 Median

4.5.1.3 Mode

4.5.2 Measuring Spread

4.5.2.1 Standard Deviation

4.5.2.2 Percentiles

4.5.3 Advanced Material: Managing Heavy Tails

4.6 Summarizing Two Numbers: Correlations and Scatterplots

4.6.1 Correlations

4.6.1.1 Pearson Correlation

4.6.1.2 Ordinal Correlations

4.6.2 Mutual Information

4.7 Advanced Material: Fitting a Line or Curve

4.7.1 Effects of Outliers

4.7.2 Optimization and Choosing Cost Functions

4.8 Statistics: How to Not Fool Yourself

4.8.1 The Central Concept: The p-Value

4.8.2 Reality Check: Picking a Null Hypothesis and Modeling Assumptions

4.8.3 Advanced Material: Parameter Estimation and Confidence Intervals

4.8.4 Advanced Material: Statistical Tests Worth Knowing

4.8.4.1 χ²-Test

4.8.4.2 T-test

4.8.4.3 Fisher’s Exact Test

4.8.4.4 Multiple Hypothesis Testing

4.8.5 Bayesian Statistics

4.9 Advanced Material: Probability Distributions Worth Knowing

4.9.1 Probability Distributions: Discrete and Continuous

4.9.2 Flipping Coins: Bernoulli Distribution

4.9.3 Adding Coin Flips: Binomial Distribution

4.9.4 Throwing Darts: Uniform Distribution

4.9.5 Bell-Shaped Curves: Normal Distribution

4.9.6 Heavy Tails 101: Log-Normal Distribution

4.9.7 Waiting Around: Exponential Distribution and the Geometric Distribution

4.9.8 Time to Failure: Weibull Distribution

4.9.9 Counting Events: Poisson Distribution

5 Machine Learning

5.1 Supervised Learning, Unsupervised Learning, and Binary Classifiers

5.1.1 Reality Check: Getting Labeled Data and Assuming Independence

5.1.2 Feature Extraction and the Limitations of Machine Learning

5.1.3 Overfitting

5.1.4 Cross-Validation Strategies

5.2 Measuring Performance

5.2.1 Confusion Matrices

5.2.2 ROC Curves

5.2.3 Area Under the ROC Curve

5.2.4 Selecting Classification Cutoffs

5.2.5 Other Performance Metrics

5.2.6 Lift Curves

5.3 Advanced Material: Important Classifiers

5.3.1 Decision Trees

5.3.2 Random Forests

5.3.3 Ensemble Classifiers

5.3.4 Support Vector Machines

5.3.5 Logistic Regression

5.3.6 Lasso Regression

5.3.7 Naive Bayes

5.3.8 Neural Nets

5.4 Structure of the Data: Unsupervised Learning

5.4.1 The Curse of Dimensionality

5.4.2 Principal Component Analysis and Factor Analysis

5.4.2.1 Scree Plots and Understanding Dimensionality

5.4.2.2 Factor Analysis

5.4.2.3 Limitations of PCA

5.4.3 Clustering

5.4.3.1 Real-World Assessment of Clusters

5.4.3.2 k-means Clustering

5.4.3.3 Advanced Material: Other Clustering Algorithms

5.4.3.4 Advanced Material: Evaluating Cluster Quality

5.5 Learning as You Go: Reinforcement Learning

5.5.1 Multi-Armed Bandits and ε-Greedy Algorithms

5.5.2 Markov Decision Processes and Q-Learning

6 Knowing the Tools

6.1 A Note on Learning to Code

6.2 Cheat Sheet

6.3 Parts of the Data Science Ecosystem

6.3.1 Scripting Languages

6.3.2 Technical Computing Languages

6.3.2.1 Python’s Technical Computing Stack

6.3.2.2 R

6.3.2.3 Matlab and Octave

6.3.2.4 Mathematica

6.3.2.5 SAS

6.3.2.6 Julia

6.3.3 Visualization

6.3.3.1 Tableau

6.3.3.2 Excel

6.3.3.3 D3.js

6.3.4 Databases

6.3.5 Big Data

6.3.5.1 Types of Big Data Technologies

6.3.5.2 Spark

6.3.6 Advanced Material: The Map-Reduce Paradigm

6.4 Advanced Material: Database Query Crash Course

6.4.1 Basic Queries

6.4.2 Groups and Aggregations

6.4.3 Joins

6.4.4 Nesting Queries

7 Deep Learning and Artificial Intelligence

7.1 Overview of AI

7.1.1 Don’t Fear the Skynet: Strong and Weak AI

7.1.2 System 1 and System 2

7.2 Neural Networks

7.2.1 What Neural Nets Can and Can’t Do

7.2.2 Enough Boilerplate: What’s a Neural Net?

7.2.3 Convolutional Neural Nets

7.2.4 Advanced Material: Training Neural Networks

7.2.4.1 Manual Versus Automatic Feature Extraction

7.2.4.2 Dataset Sizes and Data Augmentation

7.2.4.3 Batches and Epochs

7.2.4.4 Transfer Learning

7.2.4.5 Feature Extraction

7.2.4.6 Word Embeddings

7.3 Natural Language Processing

7.3.1 The Great Divide: Language Versus Statistics

7.3.2 Save Yourself Some Trouble: Consider Regular Expressions

7.3.3 Software and Datasets

7.3.4 Key Issue: Vectorization

7.3.5 Bag-of-Words

7.4 Knowledge Bases and Graphs

Postscript

Index

Place of publication: New York
Language: English
Dimensions: 147 x 229 mm
Weight: 431 g
Subject areas: Mathematics / Computer Science (Mathematics)
Business (Business Administration / Management)
ISBN-10: 1-119-54408-4 / 1119544084
ISBN-13: 978-1-119-54408-1 / 9781119544081
Condition: New