Data Wrangling with Python - Jacqueline Kazil, Katharine Jarmul

Data Wrangling with Python

Tips and Tools to Make Your Life Easier
Book | Softcover
508 pages
2016
O'Reilly Media (publisher)
978-1-4919-4881-1 (ISBN)
44.85 incl. VAT
Digging into data does not have to be painful. With Data Wrangling with Python, you'll learn how to clean and analyze data, create compelling stories, and scale that data as necessary. There are awesome discoveries to be made in unassuming datasets and stories to be told. You don't have to be a programmer to tell them.

What you need is to understand the context of the data and to know a few of the techniques found in this book. You'll learn enough Python to be empowered to engage with your data, through a series of examples that grow in complexity throughout the book.
How do you take your data analysis skills beyond Excel to the next level? By learning just enough Python to get stuff done. This hands-on guide shows non-programmers like you how to process information that’s initially too messy or difficult to access. You don't need to know a thing about the Python programming language to get started.

Through various step-by-step exercises, you’ll learn how to acquire, clean, analyze, and present data efficiently. You’ll also discover how to automate your data process, schedule file-editing and clean-up tasks, process larger datasets, and create compelling stories with data you obtain.
  • Quickly learn basic Python syntax, data types, and language concepts
  • Work with both machine-readable and human-consumable data
  • Scrape websites and APIs to find a bounty of useful information
  • Clean and format data to eliminate duplicates and errors in your datasets
  • Learn when to standardize data and when to test and script data cleanup
  • Explore and analyze your datasets with new Python libraries and techniques
  • Use Python solutions to automate your entire data-wrangling process

Jacqueline Kazil is a Presidential Innovation Fellow working on Disaster Response and Recovery at the Federal Emergency Management Agency (FEMA). Jackie is a software developer passionate about human behavior and open data. Most recently, she worked for CACI, where she was lead developer on a contract at The Library of Congress, working on projects such as Chronicling America and Congress.gov. Previously, Jackie worked for The Washington Post on news-driven data applications - including the notable Top Secret America series, which received multiple awards including the 2010 George Polk Award for Journalism and was a SXSW Finalist for Technical Achievement. She has experience in software development using best practices, data analysis, modeling and simulation, social network analysis, data handling, data storage, mapping, and geospatial analysis. She is also active in open-source community development. She founded PyLadies DC and Geo DC. She also runs Django District and assists with DC Python. Jackie received her MA in Convergence Journalism from the University of Missouri, and she is currently working on her PhD in Computational Social Science at George Mason University.

Katharine Jarmul is a Python developer who enjoys data analysis and acquisition, web scraping, teaching Python, and all things Unix. She has worked at small and large startups before starting her consulting career overseas. Originally from Los Angeles, she learned Python while working at The Washington Post in 2008. As one of the founders of PyLadies, Katharine hopes to promote diversity in Python and other open source languages through education and training. She has led numerous workshops and tutorials ranging from beginner to advanced topics in Python. For more information on upcoming trainings, reach out to her on Twitter or via her website.

Chapter 1 Introduction to Python
Why Python
Getting Started with Python
Summary
Chapter 2 Python Basics
Basic Data Types
Data Containers
What Can the Various Data Types Do?
Helpful Tools: type, dir, and help
Putting It All Together
What Does It All Mean?
Summary
Chapter 3 Data Meant to Be Read by Machines
CSV Data
JSON Data
XML Data
Summary
Chapter 4 Working with Excel Files
Installing Python Packages
Parsing Excel Files
Getting Started with Parsing
Summary
Chapter 5 PDFs and Problem Solving in Python
Avoid Using PDFs!
Programmatic Approaches to PDF Parsing
Parsing PDFs Using pdfminer
Learning How to Solve Problems
Uncommon File Types
Summary
Chapter 6 Acquiring and Storing Data
Not All Data Is Created Equal
Fact Checking
Readability, Cleanliness, and Longevity
Where to Find Data
Case Studies: Example Data Investigation
Storing Your Data: When, Why, and How?
Databases: A Brief Introduction
When to Use a Simple File
Alternative Data Storage
Summary
Chapter 7 Data Cleanup: Investigation, Matching, and Formatting
Why Clean Data?
Data Cleanup Basics
Summary
Chapter 8 Data Cleanup: Standardizing and Scripting
Normalizing and Standardizing Your Data
Saving Your Data
Determining What Data Cleanup Is Right for Your Project
Scripting Your Cleanup
Testing with New Data
Summary
Chapter 9 Data Exploration and Analysis
Exploring Your Data
Analyzing Your Data
Summary
Chapter 10 Presenting Your Data
Avoiding Storytelling Pitfalls
Visualizing Your Data
Presentation Tools
Publishing Your Data
Summary
Chapter 11 Web Scraping: Acquiring and Storing Data from the Web
What to Scrape and How
Analyzing a Web Page
Getting Pages: How to Request on the Internet
Reading a Web Page with Beautiful Soup
Reading a Web Page with LXML
Summary
Chapter 12 Advanced Web Scraping: Screen Scrapers and Spiders
Browser-Based Parsing
Spidering the Web
Networks: How the Internet Works and Why It’s Breaking Your Script
The Changing Web (or Why Your Script Broke)
A (Few) Word(s) of Caution
Summary
Chapter 13 APIs
API Features
A Simple Data Pull from Twitter’s REST API
Advanced Data Collection from Twitter’s REST API
Advanced Data Collection from Twitter’s Streaming API
Summary
Chapter 14 Automation and Scaling
Why Automate?
Steps to Automate
What Could Go Wrong?
Where to Automate
Special Tools for Automation
Simple Automation
Large-Scale Automation
Monitoring Your Automation
No System Is Foolproof
Summary
Chapter 15 Conclusion
Duties of a Data Wrangler
Beyond Data Wrangling
Where Do You Go from Here?
Appendix Comparison of Languages Mentioned
C, C++, and Java Versus Python
R or MATLAB Versus Python
HTML Versus Python
JavaScript Versus Python
Node.js Versus Python
Ruby and Ruby on Rails Versus Python
Appendix Python Resources for Beginners
Online Resources
In-Person Groups
Appendix Learning the Command Line
Bash
Windows CMD/PowerShell
Appendix Advanced Python Setup
Step 1: Install GCC
Step 2: (Mac Only) Install Homebrew
Step 3: (Mac Only) Tell Your System Where to Find Homebrew
Step 4: Install Python 2.7
Step 5: Install virtualenv (Windows, Mac, Linux)
Step 6: Set Up a New Directory
Step 7: Install virtualenvwrapper
Learning About Our New Environment (Windows, Mac, Linux)
Advanced Setup Review
Appendix Python Gotchas
Hail the Whitespace
The Dreaded GIL
= Versus == Versus is, and When to Just Copy
Default Function Arguments
Python Scope and Built-Ins: The Importance of Variable Names
Defining Objects Versus Modifying Objects
Changing Immutable Objects
Type Checking
Catching Multiple Exceptions
The Power of Debugging
Appendix IPython Hints
Why Use IPython?
Getting Started with IPython
Magic Functions
Final Thoughts: A Simpler Terminal
Appendix Using Amazon Web Services
Spinning Up an AWS Server
Logging into an AWS Server

Publication date (per publisher): March 15, 2016
Additional info: black & white illustrations
Place of publication: Sebastopol
Language: English
Dimensions: 177 x 232 mm
Weight: 862 g
Binding: paperback
Subject areas: Computer Science > Databases > Data Warehouse / Data Mining
Computer Science > Programming Languages / Tools > Python
Mathematics / Computer Science > Computer Science > Web / Internet
Keywords: Data Mining Analytics • Data Warehouse • Python
ISBN-10: 1-4919-4881-7 / 1491948817
ISBN-13: 978-1-4919-4881-1 / 9781491948811
Condition: New