Beginning Data Science in R -  Thomas Mailund

Beginning Data Science in R (eBook)

Data Analysis, Visualization, and Modelling for the Data Scientist
eBook Download: PDF
2017 | 1st ed.
XXVII, 352 pages
Apress (publisher)
978-1-4842-2671-1 (ISBN)
62.99 incl. VAT
  • Download available immediately
Discover best practices for data analysis and software development in R and start on the path to becoming a fully-fledged data scientist. This book teaches you techniques for both data manipulation and visualization and shows you the best way to develop new software packages for R.

Beginning Data Science in R details how data science is a combination of statistics, computational science, and machine learning. You'll see how to efficiently structure and mine data to extract useful patterns and build mathematical models. This requires computational methods and programming, and R is an ideal programming language for this. 

This book is based on a number of lecture notes for classes the author has taught on data science and statistical programming using the R programming language. Modern data analysis requires computational skills and usually a minimum of programming. 

What You Will Learn
  • Perform data science and analytics using statistics and the R programming language
  • Visualize and explore data, including working with large data sets found in big data
  • Build an R package
  • Test and check your code
  • Practice version control
  • Profile and optimize your code

Who This Book Is For

Those with some data science or analytics background, but not necessarily experience with the R programming language.

Thomas Mailund is an associate professor in bioinformatics at Aarhus University, Denmark. His background is in math and computer science but for the last decade his main focus has been on genetics and evolutionary studies, particularly comparative genomics, speciation, and gene flow between emerging species.



Contents at a Glance 4
Contents 5
About the Author 16
About the Technical Reviewer 17
Acknowledgments 18
Introduction 19
Chapter 1: Introduction to R Programming 24
Basic Interaction with R 24
Using R as a Calculator 26
Simple Expressions 26
Assignments 28
Actually, All of the Above Are Vectors of Values… 28
Indexing Vectors 29
Vectorized Expressions 30
Comments 31
Functions 31
Getting Documentation for Functions 32
Writing Your Own Functions 33
Vectorized Expressions and Functions 35
A Quick Look at Control Structures 35
Factors 39
Data Frames 41
Dealing with Missing Values 43
Using R Packages 44
Data Pipelines (or Pointless Programming) 45
Writing Pipelines of Function Calls 46
Writing Functions that Work with Pipelines 46
The magical “.” argument 47
Defining Functions Using . 48
Anonymous Functions 49
Other Pipeline Operations 50
Coding and Naming Conventions 51
Exercises 51
Mean of Positive Values 51
Root Mean Square Error 51
Chapter 2: Reproducible Analysis 52
Literate Programming and Integration of Workflow and Documentation 53
Creating an R Markdown/knitr Document in RStudio 53
The YAML Language 56
The Markdown Language 57
Formatting Text 58
Cross-Referencing 61
Bibliographies 62
Controlling the Output (Templates/Stylesheets) 62
Running R Code in Markdown Documents 63
Using Chunks when Analyzing Data (Without Compiling Documents) 65
Caching Results 66
Displaying Data 66
Exercises 67
Create an R Markdown Document 67
Produce Different Output 67
Add Caching 67
Chapter 3: Data Manipulation 68
Data Already in R 68
Quickly Reviewing Data 70
Reading Data 71
Examples of Reading and Formatting Datasets 72
Breast Cancer Dataset 72
Boston Housing Dataset 78
The readr Package 79
Manipulating Data with dplyr 81
Some Useful dplyr Functions 82
select(): Pick Selected Columns and Get Rid of the Rest 82
mutate(): Add Computed Values to Your Data Frame 84
transmute(): Add Computed Values to Your Data Frame and Get Rid of All Other Columns 85
arrange(): Reorder Your Data Frame by Sorting Columns 85
filter(): Pick Selected Rows and Get Rid of the Rest 86
group_by(): Split Your Data Into Subtables Based on Column Values 87
summarise/summarize(): Calculate Summary Statistics 87
Breast Cancer Data Manipulation 88
Tidying Data with tidyr 92
Exercises 95
Importing Data 96
Using dplyr 96
Using tidyr 96
Chapter 4: Visualizing Data 97
Basic Graphics 97
The Grammar of Graphics and the ggplot2 Package 105
Using qplot() 106
Using Geometries 110
Facets 119
Scaling 122
Themes and Other Graphics Transformations 127
Figures with Multiple Plots 131
Exercises 133
Chapter 5: Working with Large Datasets 134
Subsample Your Data Before You Analyze the Full Dataset 134
Running Out of Memory During Analysis 136
Too Large to Plot 137
Too Slow to Analyze 141
Too Large to Load 142
Exercises 145
Subsampling 145
Hex and 2D Density Plots 145
Chapter 6: Supervised Learning 146
Machine Learning 146
Supervised Learning 146
Regression versus Classification 147
Inference versus Prediction 148
Specifying Models 149
Linear Regression 149
Logistic Regression (Classification, Really) 154
Model Matrices and Formula 157
Validating Models 166
Evaluating Regression Models 166
Evaluating Classification Models 168
Confusion Matrix 169
Accuracy 170
Sensitivity and Specificity 172
Other Measures 173
More Than Two Classes 174
Random Permutations of Your Data 174
Cross-Validation 178
Selecting Random Training and Testing Data 180
Examples of Supervised Learning Packages 182
Decision Trees 182
Random Forests 184
Neural Networks 185
Support Vector Machines 186
Naive Bayes 186
Exercises 187
Fitting Polynomials 187
Evaluating Different Classification Measures 187
Breast Cancer Classification 187
Leave-One-Out Cross-Validation (Slightly More Difficult) 188
Decision Trees 188
Random Forests 188
Neural Networks 188
Support Vector Machines 188
Compare Classification Algorithms 188
Chapter 7: Unsupervised Learning 189
Dimensionality Reduction 189
Principal Component Analysis 189
Multidimensional Scaling 197
Clustering 201
k-Means Clustering 202
Hierarchical Clustering 208
Association Rules 212
Exercises 216
Dealing with Missing Data in the HouseVotes84 Data 216
Rescaling for k-Means Clustering 216
Varying k 216
Project 1 216
Importing Data 217
Exploring the Data 218
Distribution of Quality Scores 218
Is This Wine Red or White? 219
Fitting Models 223
Exercises 224
Exploring Other Formulas 224
Exploring Different Models 224
Analyzing Your Own Dataset 224
Chapter 8: More R Programming 225
Expressions 225
Arithmetic Expressions 225
Boolean Expressions 226
Basic Data Types 227
The Numeric Type 227
The Integer Type 228
The Complex Type 228
The Logical Type 228
The Character Type 229
Data Structures 229
Vectors 229
Matrix 230
Lists 232
Indexing 233
Named Values 235
Factors 236
Formulas 236
Control Structures 236
Selection Statements 236
Loops 238
A Word of Warning About Looping 239
Functions 240
Named Arguments 241
Default Parameters 242
Return Values 242
Lazy Evaluation 243
Scoping 244
Function Names Are Different from Variable Names 247
Recursive Functions 247
Exercises 249
Fibonacci Numbers 249
Outer Product 249
Linear Time Merge 249
Binary Search 250
More Sorting 250
Selecting the k Smallest Element 251
Chapter 9: Advanced R Programming 252
Working with Vectors and Vectorizing Functions 252
ifelse 254
Vectorizing Functions 254
The apply Family 256
apply 257
lapply 259
sapply and vapply 260
Advanced Functions 261
Special Names 261
Infix Operators 261
Replacement Functions 262
How Mutable Is Data Anyway? 264
Functional Programming 265
Anonymous Functions 265
Functions Taking Functions as Arguments 266
Functions Returning Functions (and Closures) 266
Filter, Map, and Reduce 267
Function Operations: Functions as Input and Output 269
Ellipsis Parameters 272
Exercises 274
between 274
apply_if 274
power 274
Row and Column Sums 274
Factorial Again 274
Function Composition 275
Chapter 10: Object Oriented Programming 276
Immutable Objects and Polymorphic Functions 276
Data Structures 276
Example: Bayesian Linear Model Fitting 277
Classes 278
Polymorphic Functions 280
Defining Your Own Polymorphic Functions 281
Class Hierarchies 282
Specialization as Interface 282
Specialization in Implementations 283
Exercises 286
Shapes 286
Polynomials 286
Chapter 11: Building an R Package 287
Creating an R Package 287
Package Names 287
The Structure of an R Package 288
.Rbuildignore 288
Description 289
Title 289
Version 289
Description 290
Author and Maintainer 290
License 290
Type, Date, LazyData 290
URL and BugReports 290
Dependencies 291
Using an Imported Package 291
Using a Suggested Package 292
NAMESPACE 292
R/ and man/ 293
Roxygen 293
Documenting Functions 293
Import and Export 294
Package Scope Versus Global Scope 295
Internal Functions 295
File Load Order 295
Adding Data to Your Package 296
Building an R Package 297
Exercises 298
Chapter 12: Testing and Package Checking 299
Unit Testing 299
Automating Testing 300
Using testthat 301
Writing Good Tests 302
Using Random Numbers in Tests 303
Testing Random Results 303
Checking a Package for Consistency 304
Exercise 304
Chapter 13: Version Control 305
Version Control and Repositories 305
Using git in RStudio 306
Installing git 306
Making Changes to Files, Staging Files, and Committing Changes 307
Adding git to an Existing Project 309
Bare Repositories and Cloning Repositories 309
Pushing Local Changes and Fetching and Pulling Remote Changes 310
Handling Conflicts 312
Working with Branches 312
Typical Workflows Involve Lots of Branches 315
Pushing Branches to the Global Repository 315
GitHub 315
Moving an Existing Repository to GitHub 317
Installing Packages from GitHub 318
Collaborating on GitHub 318
Pull Requests 318
Forking Repositories Instead of Cloning 319
Exercises 319
Chapter 14: Profiling and Optimizing 320
Profiling 320
A Graph-Flow Algorithm 321
Speeding Up Your Code 332
Parallel Execution 334
Switching to C++ 337
Exercises 339
Project 2 339
Bayesian Linear Regression 340
Exercises: Priors and Posteriors 341
Sample from a Multivariate Normal Distribution 341
Computing the Posterior Distribution 343
Predicting Target Variables for New Predictor Values 345
Formulas and Their Model Matrix 347
Working with Model Matrices in R 348
Exercises 351
Building Model Matrices 351
Fitting General Models 351
Model Matrices Without Response Variables 351
Exercises 352
Model Matrices for New Data 352
Predicting New Targets 352
Interface to a blm Class 353
Constructor 353
Updating Distributions: An Example Interface 354
Designing Your blm Class 357
Model Methods 357
coefficients 357
confint 358
deviance 358
fitted 358
plot 358
predict 358
print 358
residuals 359
summary 359
Building an R Package for blm 359
Deciding on the Package Interface 359
Organization of Source Files 359
Document Your Package Interface Well 360
Adding README and NEWS Files to Your Package 360
README 361
NEWS 361
Testing 361
GitHub 361
Conclusions 361
Data Science 362
Machine Learning 362
Data Analysis 362
R Programming 362
The End 363
Acknowledgements 363
Index 364

Publication date (per publisher): 9 March 2017
Additional info: XXVII, 352 p. 100 illus.
Place of publication: Berkeley
Language: English
Subject areas: Computer Science > Databases > Data Warehouse / Data Mining
Mathematics / Computer Science > Computer Science > Networks
Mathematics / Computer Science > Computer Science > Programming Languages / Tools
Social Sciences > Sociology > Empirical Social Research
Technology
Keywords: AI • Analytics • Big Data • Cloud • Coding • Data Science • Deep learning • machine learning • programming • R • Software • Statistics
ISBN-10: 1-4842-2671-2 / 1484226712
ISBN-13: 978-1-4842-2671-1 / 9781484226711
PDF (watermark)
Size: 6.7 MB

DRM: Digital watermark
This eBook contains a digital watermark and is therefore personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables, and figures. A PDF can be displayed on almost all devices, but is only of limited suitability for small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. It is not compatible with the Amazon Kindle, however.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.

Additional feature: Read online
In addition to downloading it, you can also read this eBook online in your web browser.

Buying eBooks from abroad
For tax reasons, we can sell eBooks only within Germany and Switzerland. Unfortunately, we cannot fulfill eBook orders from other countries.
