Data Wrangling with SQL (eBook)
350 Seiten
Packt Publishing (Verlag)
978-1-83763-430-9 (ISBN)
The amount of data generated continues to grow rapidly, making it increasingly important for businesses to be able to wrangle this data and understand it quickly and efficiently. Although data wrangling can be challenging, with the right tools and techniques you can efficiently handle enormous amounts of unstructured data.
The book starts by introducing you to the basics of SQL, focusing on the core principles and techniques of data wrangling. You'll then explore advanced SQL concepts like aggregate functions, window functions, CTEs, and subqueries that are very popular in the business world. The next set of chapters will walk you through different functions within SQL query that cause delays in data transformation and help you figure out the difference between a good query and bad one. You'll also learn how data wrangling and data science go hand in hand. The book is filled with datasets and practical examples to help you understand the concepts thoroughly, along with best practices to guide you at every stage of data wrangling.
By the end of this book, you'll be equipped with essential techniques and best practices for data wrangling, and will predominantly learn how to use clean and standardized data models to make informed decisions, helping businesses avoid costly mistakes.
Become a data wrangling expert and make well-informed decisions by effectively utilizing and analyzing raw unstructured data in a systematic mannerPurchase of the print or Kindle book includes a free PDF eBookKey FeaturesImplement query optimization during data wrangling using the SQL language with practical use casesMaster data cleaning, handle the date function and null value, and write subqueries and window functionsPractice self-assessment questions for SQL-based interviews and real-world case study roundsBook DescriptionThe amount of data generated continues to grow rapidly, making it increasingly important for businesses to be able to wrangle this data and understand it quickly and efficiently. Although data wrangling can be challenging, with the right tools and techniques you can efficiently handle enormous amounts of unstructured data. The book starts by introducing you to the basics of SQL, focusing on the core principles and techniques of data wrangling. You'll then explore advanced SQL concepts like aggregate functions, window functions, CTEs, and subqueries that are very popular in the business world. The next set of chapters will walk you through different functions within SQL query that cause delays in data transformation and help you figure out the difference between a good query and bad one. You'll also learn how data wrangling and data science go hand in hand. The book is filled with datasets and practical examples to help you understand the concepts thoroughly, along with best practices to guide you at every stage of data wrangling. By the end of this book, you'll be equipped with essential techniques and best practices for data wrangling, and will predominantly learn how to use clean and standardized data models to make informed decisions, helping businesses avoid costly mistakes.What you will learnBuild time series models using data wranglingDiscover data wrangling best practices as well as tips and tricksFind out how to use subqueries, window functions, CTEs, and aggregate functionsHandle missing data, data types, date formats, and redundant dataBuild clean and efficient data models using data wrangling techniquesRemove outliers and calculate standard deviation to gauge the skewness of dataWho this book is forThis book is for data analysts looking for effective hands-on methods to manage and analyze large volumes of data using SQL. The book will also benefit data scientists, product managers, and basically any role wherein you are expected to gather data insights and develop business strategies using SQL as a language. If you are new to or have basic knowledge of SQL and databases and an understanding of data cleaning practices, this book will give you further insights into how you can apply SQL concepts to build clean, standardized data models for accurate analysis.]]>
Table of Contents
Preface
Part 1: Data Wrangling Introduction
1
Database Introduction
Getting started
Establishing the foundation
Efficient data organization
Data integrity and consistency
Technical requirements
Decoding database structures – relational and non-relational
What is a database?
Types of databases
Tables and relationships
The SQL CREATE DATABASE statement
The SQL CREATE TABLE statement
SQL DROP TABLE versus TRUNCATE TABLE
SQL ALTER TABLE
SQL constraints
SQL keys
Database relationships
Comparing database normalization and denormalization
Normalization
Types of normalization
Denormalization
When to apply denormalization
Disadvantages of denormalization
Summary
Practical exercises
Practical exercise 1
Practical exercise 2
Practical exercise 3
Practical exercise 4
2
Data Profiling and Preparation before Data Wrangling
What is data wrangling?
Data wrangling steps
The importance of data wrangling
Benefits of data wrangling
Data wrangling use cases
Business use cases
Data capture
How does data get captured?
Data-capturing techniques
Web scraping
Structured versus unstructured data
Paid-for versus free data-wrangling tools
Data profiling
Data profiling types
Data profiling techniques
Practical exercise
Step 1 – Discovery
Step 2 – Structuring
Step 3 – Cleaning
Step 4 – Enriching
Step 5 – Validating
Step 6 – Publishing
Summary
Part 2: Data Wrangling Techniques Using SQL
3
Data Wrangling on String Data Types
SQL data types
Numeric data types
Date and time data types
String data type
SQL string functions
RIGHT()
LEFT()
LEN()
TRIM()
RTRIM()
LTRIM()
RPAD()
LPAD()
REPLACE()
REVERSE()
SUBSTRING()
CAST()
CONCATENATE()
CONCATENATE_WS()
UPPER function
LOWER function
INITCAP function
INSTR function
Summary
Practical exercises
Practical exercise 1
Practical exercise 2
Practical exercise 3
Practical exercise 4
4
Data Wrangling on the DATE Data Type
SQL DATE data type functions
EXTRACT
DATEDIFF()
TIMEDIFF()
DATE_ADD()
DATE_SUB()
DATE_FORMAT()
STR_TO_DATE()
Extracting the current date and time
Summary
5
Handling NULL Values
The impact of missing data and NULL values on data analysis
Understanding the importance of data validation and cleaning before analyzing data
Identifying NULL/missing values
NULL values versus zero values
Using the IS NULL and IS NOT NULL operators to filter and select data with NULL values
IS NULL() and IS NOT NULL() – scenario
Using the COALESCE and IFNULL functions to replace NULL values with a default value
IFNULL()
COALESCE()
IS NULL versus = NULL
Summary
6
Pivoting Data Using SQL
SQL Transpose – rows to columns
Use case scenario
SQL Cross Tab – columns to rows
Use case scenario
Unpivoting data in SQL
Analytical workflow – from SQL to business intelligence – transforming data into actionable insights
Summary
Part 3: SQL Subqueries, Aggregate And Window Functions
7
Subqueries and CTEs
Introduction to subqueries
Simple subqueries
Correlated subqueries
Using subqueries in SELECT statements
Using subqueries in FROM statements
Using subqueries in WHERE statements
Nested subqueries
Correlated subqueries
Using subqueries in INSERT, UPDATE, and DELETE statements
Managing and maintaining subqueries
Common table expressions
Performance considerations for subqueries and CTEs
Subquery versus CTEs
Summary
8
Aggregate Functions
Overview of aggregate functions in SQL
Using GROUP BY
COUNT()
SUM()
AVG()
MIN() and MAX()
COUNT(DISTINCT)
Case scenario – using all aggregate functions
Summary
9
SQL Window Functions
The importance of SQL window functions
SQL aggregate functions
SQL window functions versus aggregate functions
Window functions versus aggregate functions – an example to illustrate the differences
Window functions
SUM()
COUNT()
AVG()
ROW_NUMBER()
RANK() and DENSE_RANK()
Lead() and Lag()
NTILE()
Summary
Part 4: Optimizing Query Performance
10
Optimizing Query Performance
Introduction to query optimization
Query execution plan
Query optimization techniques
Example
Caching
Normalization
Query monitoring and troubleshooting
Query profiling
Query logging
Database monitoring
Tips and tricks for writing efficient queries
Summary
In the next chapter, we will learn about descriptive statistics using SQL, which will provide us with insights into the distribution, central tendency, and variability of data, which can, in turn, help us identify outliers and anomalies. Common SQL functions and statements used for descriptive statistics include COUNT, AVG, MIN, MAX, and GROUP BY. By using SQL to analyze data, researchers and analysts can efficiently extract and summarize information from large datasets.
Part 5:Data Science And Wrangling
11
Descriptive Statistics with SQL
Calculating descriptive statistics with SQL
Mean
Median
Mode
Standard deviation
Variance
Variability
Summary
In the next chapter, we will learn how SQL can be used for time series analysis.
12
Time Series with SQL
Running totals
Case scenario
Lead and lag for time series analysis
Case scenario
Key KPIs
Percentage change
Case scenario
Key KPIs
Moving averages
Case scenario
Key KPIs
Rank for time series analysis
Case scenario
Key KPIs
CTE for time series analysis
Importance of using CTEs while performing time series analysis
Forecasting with linear regression
Case scenario
Key KPIs
Summary
In the next chapter, we will learn different methods to find outliers in the data easily. Outlier detection is an important aspect of data analysis as it helps determine if the data is correct, looks at the skewness of the data, and removes any unexpected values.
13
Outlier Detection
Measures of central tendency and dispersion
Case scenario
Key KPIs
Methods for detecting...
Erscheint lt. Verlag | 31.7.2023 |
---|---|
Sprache | englisch |
Themenwelt | Mathematik / Informatik ► Informatik ► Datenbanken |
Mathematik / Informatik ► Informatik ► Programmiersprachen / -werkzeuge | |
Mathematik / Informatik ► Informatik ► Theorie / Studium | |
Mathematik / Informatik ► Informatik ► Web / Internet | |
ISBN-10 | 1-83763-430-0 / 1837634300 |
ISBN-13 | 978-1-83763-430-9 / 9781837634309 |
Haben Sie eine Frage zum Produkt? |
Digital Rights Management: ohne DRM
Dieses eBook enthält kein DRM oder Kopierschutz. Eine Weitergabe an Dritte ist jedoch rechtlich nicht zulässig, weil Sie beim Kauf nur die Rechte an der persönlichen Nutzung erwerben.
Dateiformat: EPUB (Electronic Publication)
EPUB ist ein offener Standard für eBooks und eignet sich besonders zur Darstellung von Belletristik und Sachbüchern. Der Fließtext wird dynamisch an die Display- und Schriftgröße angepasst. Auch für mobile Lesegeräte ist EPUB daher gut geeignet.
Systemvoraussetzungen:
PC/Mac: Mit einem PC oder Mac können Sie dieses eBook lesen. Sie benötigen dafür die kostenlose Software Adobe Digital Editions.
eReader: Dieses eBook kann mit (fast) allen eBook-Readern gelesen werden. Mit dem amazon-Kindle ist es aber nicht kompatibel.
Smartphone/Tablet: Egal ob Apple oder Android, dieses eBook können Sie lesen. Sie benötigen dafür eine kostenlose App.
Geräteliste und zusätzliche Hinweise
Buying eBooks from abroad
For tax law reasons we can sell eBooks just within Germany and Switzerland. Regrettably we cannot fulfill eBook-orders from other countries.
aus dem Bereich