PySpark SQL Recipes - Raju Kumar Mishra, Sundar Rajan Raman

PySpark SQL Recipes (eBook)

With HiveQL, Dataframe and Graphframes
eBook Download: PDF
2019 | 1st ed.
XXIV, 323 pages
Apress (publisher)
978-1-4842-4335-0 (ISBN)
System requirements
46.99 incl. VAT
  • Download available immediately
Carry out data analysis with PySpark SQL, graphframes, and graph data processing using a problem-solution approach. This book provides solutions to problems related to dataframes, data manipulation, summarization, and exploratory analysis. You will improve your skills in graph data analysis using graphframes and see how to optimize your PySpark SQL code.

PySpark SQL Recipes starts with recipes on creating dataframes from different types of data sources, data aggregation and summarization, and exploratory data analysis using PySpark SQL. You'll also discover how to solve problems in graph analysis using graphframes.

On completing this book, you'll have ready-made code for all your PySpark SQL tasks, including creating dataframes using data from different file formats as well as from SQL or NoSQL databases.

What You Will Learn

  • Understand PySpark SQL and its advanced features
  • Use SQL and HiveQL with PySpark SQL
  • Work with structured streaming
  • Optimize PySpark SQL 
  • Master graphframes and graph processing

Who This Book Is For
Data scientists, Python programmers, and SQL programmers.






Raju Kumar Mishra has strong interests in data science and in systems capable of handling large amounts of data and running complex mathematical models through computational programming. He was inspired to pursue an M.Tech in computational science at the Indian Institute of Science in Bangalore, India. Raju primarily works in the areas of data science and its different applications. Working as a corporate trainer, he has developed unique insights that help him teach and explain complex ideas with ease. Raju is also a data science consultant solving complex industrial problems. He works with programming tools such as R, Python, scikit-learn, Statsmodels, Hadoop, Hive, Pig, Spark, and many others. His venture Walsoul Private Ltd provides training in data science, programming, and big data.

Sundar Rajan Raman is an artificial intelligence practitioner currently working at Bank of America. He holds a Bachelor of Technology degree from the National Institute of Technology, India. A seasoned Java and J2EE programmer, he has worked on critical applications for companies such as AT&T, Singtel, and Deutsche Bank. He is also a seasoned big data architect. His current focus is on the artificial intelligence space, including machine learning and deep learning.




Chapter 1: Introduction to PySparkSQL
Chapter Goal: The reader will understand PySpark, PySparkSQL, the Catalyst optimizer, Project Tungsten, and Hive (a short sketch follows the sub-topics below).

No. of pages: 20-30

Sub-topics:

1. PySpark
2. PySparkSQL
3. Hive
4. Catalyst
5. Project Tungsten
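
As a minimal illustration of these entry points (not code taken from the book), the sketch below starts a SparkSession and asks for the plan of a trivial query, which is produced by the Catalyst optimizer; the application name is a placeholder.

    from pyspark.sql import SparkSession

    # SparkSession is the entry point for PySpark SQL.
    spark = SparkSession.builder.appName("intro-to-pysparksql").getOrCreate()

    df = spark.range(10)             # a trivial DataFrame with a single 'id' column
    df.filter(df.id > 5).explain()   # prints the plan produced by the Catalyst optimizer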

 

Chapter 2: Some Time with Installation
Chapter Goal: The learner will understand the installation of Spark, Hive, PostgreSQL, MySQL, MongoDB, Cassandra, etc. (a quick verification sketch follows the sub-topics below).

No. of pages: 30-40

Sub-topics:

1. Installing Spark
2. Installing Hive
3. Installing MySQL
4. Installing MongoDB
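
Once Spark is installed (for example via pip, which is one common route and not necessarily the one the book prescribes), a short Python check like the sketch below confirms that the bindings are importable:

    # Assumes PySpark was installed, e.g. with `pip install pyspark`.
    import pyspark
    print(pyspark.__version__)   # should print the installed Spark version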

Chapter 3: IO in PySparkSQL
Chapter Goal: This chapter provides recipes that enable the reader to create PySparkSQL DataFrames from different sources (a short sketch follows the sub-topics below).

No. of pages: 40-50

Sub-topics:

1. Creating a DataFrame from data
2. Reading a CSV file to create a DataFrame
3. Reading a JSON file to create a DataFrame
4. Saving DataFrames to different formats
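
A minimal sketch of this kind of IO, assuming an active SparkSession named spark; the file paths (people.csv, people.json, people_parquet) are placeholders, not examples from the book:

    # Read a CSV file with a header row and let Spark infer the column types.
    csv_df = spark.read.csv("people.csv", header=True, inferSchema=True)

    # Read a JSON file (one JSON object per line by default).
    json_df = spark.read.json("people.json")

    # Save a DataFrame in a different format, here Parquet.
    csv_df.write.mode("overwrite").parquet("people_parquet")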

 

Chapter 4: Operations on PySparkSQL DataFrames
Chapter Goal: The reader will learn about data filtering, data manipulation, descriptive analysis, dealing with missing values, etc. (a short sketch follows the sub-topics below).

No. of pages: 40-50

1. Data filtering
2. Data manipulation
3. Row and column manipulation
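
A minimal sketch of such operations, assuming an existing DataFrame df with hypothetical age and name columns:

    from pyspark.sql import functions as F

    adults = df.filter(F.col("age") >= 18)                        # data filtering
    flagged = adults.withColumn("is_senior", F.col("age") >= 65)  # column manipulation
    cleaned = flagged.na.fill({"name": "unknown"})                # one way to handle missing values
    cleaned.describe("age").show()                                # simple descriptive statistics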

 

Chapter 5: Data Merging and Data Aggregation Using PySparkSQL
Chapter Goal: The reader will learn about data merging and aggregation using PySparkSQL (a short sketch follows the sub-topics below).

1. Data merging
2. Data aggregation
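
A minimal sketch, assuming two hypothetical DataFrames, orders and customers, that share a customer_id column, with orders also holding an amount column:

    from pyspark.sql import functions as F

    merged = orders.join(customers, on="customer_id", how="inner")   # data merging
    summary = (merged.groupBy("customer_id")
                     .agg(F.count("*").alias("n_orders"),
                          F.sum("amount").alias("total_amount")))    # data aggregation
    summary.show()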

 

Chapter 6: SQL, NoSQL, and PySparkSQL
Chapter Goal: The reader will learn to run SQL and HiveQL queries on DataFrames (a short sketch follows the sub-topics below).

No. of pages: 30-40

Sub-topics:

1. Running SQL on DataFrame

2. Running HiveQL
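
A minimal sketch, assuming an active SparkSession spark and a DataFrame df with hypothetical name and age columns:

    # Register the DataFrame as a temporary view so it can be queried with SQL.
    df.createOrReplaceTempView("people")
    result = spark.sql("SELECT name, AVG(age) AS avg_age FROM people GROUP BY name")
    result.show()

    # With Hive support enabled on the session, e.g.
    # SparkSession.builder.enableHiveSupport().getOrCreate(),
    # the same spark.sql() call can run HiveQL against existing Hive tables.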

 

Chapter 7: Structured Streaming
Chapter Goal: The reader will understand structured streaming (a short sketch follows the sub-topics below).

No. of pages: 30-40

1. Different types of modes
2. Data aggregation in structured streaming
3. Different types of sources
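
A minimal sketch using the built-in socket source and the console sink; the host and port are placeholders, and the aggregation with the "complete" output mode is only one combination among those the chapter covers:

    # Read a stream of lines from a TCP socket (one of several built-in sources).
    lines = (spark.readStream
                  .format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())

    counts = lines.groupBy("value").count()      # streaming aggregation

    query = (counts.writeStream
                   .outputMode("complete")       # other output modes: append, update
                   .format("console")
                   .start())
    query.awaitTermination()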


Chapter 8: Optimizing PySparkSQL
Chapter Goal: The reader will learn about optimizing PySparkSQL (a short sketch follows below).

No. of pages: 20-30

Optimizing PySparkSQL
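
Two common optimization levers, sketched here on an assumed DataFrame df with a hypothetical age column; the configuration value shown is an illustrative example, not a recommendation from the book:

    df.cache()        # keep a frequently reused DataFrame in memory
    df.count()        # an action that materializes the cache

    # Inspect the physical plan that Catalyst/Tungsten chose for a query.
    df.filter(df.age > 30).explain(True)

    # Tune the number of shuffle partitions (the default is 200).
    spark.conf.set("spark.sql.shuffle.partitions", "64")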


Chapter 9: GraphFrames
Chapter Goal: The reader will understand graph data analysis with GraphFrames (a short sketch follows the sub-topics below).

No. of pages: 30-40

1. GraphFrame creation
2. PageRank
3. Breadth-first search
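
A minimal sketch, assuming an active SparkSession spark, that the external graphframes package is available (e.g. added via Spark's --packages option), and using a tiny made-up graph:

    from graphframes import GraphFrame

    # Vertices need an 'id' column; edges need 'src' and 'dst' columns.
    vertices = spark.createDataFrame(
        [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
    edges = spark.createDataFrame(
        [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])

    g = GraphFrame(vertices, edges)                        # GraphFrame creation
    ranks = g.pageRank(resetProbability=0.15, maxIter=10)  # PageRank
    paths = g.bfs(fromExpr="id = 'a'", toExpr="id = 'c'")  # breadth-first search
    ranks.vertices.show()
    paths.show()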

 

Publication date (per publisher): March 18, 2019
Additional information: XXIV, 323 p., 57 illus.
Place of publication: Berkeley
Language: English
Subject areas: Computer Science > Databases > SQL Server; Mathematics / Computer Science > Computer Science > Networks; Mathematics / Computer Science > Computer Science > Programming Languages / Tools
Keywords: Big Data • Data processing • Graph frames • No SQL • PySpark • PySpark SQL • Python • Spark Streaming
ISBN-10: 1-4842-4335-8 / 1484243358
ISBN-13: 978-1-4842-4335-0 / 9781484243350
PDF (watermarked)
Size: 4.8 MB

DRM: Digital watermark
This eBook contains a digital watermark and is therefore personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables, and figures. A PDF can be displayed on almost all devices but is only suitable to a limited extent for small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You will need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read with (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You will need a PDF viewer, e.g. the free Adobe Digital Editions app.

Buying eBooks from abroad
For tax reasons we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.
