Apache Spark 3 for Data Engineering and Analytics with Python

  • 30 Day Money Back Guarantee
  • Completion Certificate
  • 24/7 Technical Support

Highlights

  • On-Demand course
  • 8 hours 30 minutes
  • All levels

Description

This course focuses primarily on the concepts of Python and PySpark and will help you enhance your data analysis skills using the structured Spark DataFrame APIs.

Apache Spark 3 is an open-source distributed engine for querying and processing data. This course gives you a detailed understanding of PySpark and its stack, and is carefully designed to guide you through data analytics with Python and Spark. The author takes an interactive approach to explaining key concepts of PySpark, such as the Spark architecture, Spark execution, and transformations and actions using the structured API. You will be able to leverage the power of Python, Java, and SQL and put it to use in the Spark ecosystem.

You will start by building a firm understanding of the Apache Spark architecture and setting up a Python environment for Spark. You will then move on to techniques for collecting, cleaning, and visualizing data by creating dashboards in Databricks, and learn how to use SQL to interact with DataFrames. The author also provides an in-depth review of RDDs and contrasts them with DataFrames. Problem challenges are provided at intervals throughout the course so that you get a firm grasp of the concepts taught.

The code bundle for this course is available at https://github.com/PacktPublishing/Apache-Spark-3-for-Data-Engineering-and-Analytics-with-Python-

What You Will Learn

Learn Spark architecture, transformations, and actions using the structured API
Learn to set up your own local PySpark environment
Learn to interpret DAG (Directed Acyclic Graph) for Spark execution
Learn to interpret the Spark web UI
Learn the RDD (Resilient Distributed Datasets) API
Learn to visualize (graphs and dashboards) data on Databricks

Audience

This course is designed for Python developers who want to learn how to use the language for data engineering and analytics with PySpark; aspiring data engineering and analytics professionals; data scientists and analysts who want to learn an analytical processing strategy that can be deployed over a big data cluster; and data managers who want to gain a deeper understanding of managing data over a cluster.

Approach

The course follows a hands-on approach, ensuring that you can practice and understand all core concepts, and includes interactive activities to give you a practical learning experience.

Key Features

  • Apply PySpark and SQL concepts to analyze data
  • Understand the Databricks interface and use Spark on Databricks
  • Learn Spark transformations and actions using the RDD (Resilient Distributed Datasets) API

Github Repo

https://github.com/PacktPublishing/Apache-Spark-3-for-Data-Engineering-and-Analytics-with-Python-

About the Author

David Mngadi

David Mngadi is a data management professional who is influenced by the power of data in our lives and has helped several companies become more data-driven to gain a competitive edge and meet regulatory requirements. Over the last 15 years, he has designed and implemented data warehousing solutions in the retail, telco, and banking industries, and more recently in big data lake-specific implementations. He is passionate about technology and teaching programming online.

Course Outline

1. Introduction to Spark and Installation

1. Introduction

This video introduces you to PySpark and explains what Spark and PySpark are: PySpark is the Python API for Spark.

2. The Spark Architecture

In this session, we will understand the architecture of Spark.

3. The Spark Unified Stack

In this video, we will get a detailed description of the Spark unified stack.

4. Java Installation

We will be installing Java SE in this session (Windows).

5. Hadoop Installation

We will be installing and setting up Hadoop in this session (Windows).

6. Python Installation

We will be installing Python in this session (Windows).

7. PySpark Installation

In this session, we will go ahead and install PySpark (Windows).

8. Install Microsoft Build Tools

In this session, we will install the Microsoft Build Tools required to set up the Jupyter Notebook, which we will use throughout the course (Windows).

9. MacOS - Java Installation

We will install Java in this session (macOS).

10. MacOS - Python Installation

We will install Python in this session (macOS).

11. MacOS - PySpark Installation

We will be installing PySpark in this session (macOS).

12. MacOS - Testing the Spark Installation

We will be testing our Spark installation for macOS in this session.

13. Install Jupyter Notebooks

We will be installing the Jupyter Notebook in this session (macOS).

14. The Spark Web UI

In this session, we will go through the Spark Web UI and see how we can use it to track our Spark jobs.

15. Section Summary

Let's recap what we have learned so far in this section.


2. Spark Execution Concepts

1. Section Introduction

This video gives an overview of the entire section.

2. Spark Application and Session

In this video session, we will dive deep into the Spark application. Let's create a Spark program and learn more about Spark sessions.
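
For orientation, here is a minimal sketch of creating a Spark session in PySpark (the application name and master setting are illustrative assumptions, not the course's exact code):

    from pyspark.sql import SparkSession

    # Build (or reuse) a SparkSession; local[*] runs Spark locally on all cores
    spark = (SparkSession.builder
             .appName("SparkSessionDemo")   # hypothetical application name
             .master("local[*]")
             .getOrCreate())

    print(spark.version)                    # confirm the session is up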

3. Spark Transformations and Actions Part 1

In this video, we will see how Spark executes its application.

4. Spark Transformations and Actions Part 2

In this session, we will learn about narrow and wide transformations and Spark actions.
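
A rough sketch of the lazy-execution idea, assuming the SparkSession named spark from the earlier sketch (the data is made up): transformations only build the plan, and an action triggers the job.

    from pyspark.sql.functions import col

    df = spark.range(1, 1001)                      # one-column DataFrame of ids 1..1000

    evens = df.filter(col("id") % 2 == 0)          # narrow transformation - nothing runs yet
    by_bucket = evens.groupBy((col("id") % 10).alias("bucket")).count()  # wide transformation (shuffle)

    by_bucket.show()                               # action - only now does Spark run the job
    print(evens.count())                           # another action, returns 500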

5. DAG Visualisation

In this session, we will revisit the Spark web UI and unpack the DAG.


3. RDD Crash Course

1. Introduction to RDDs

This video introduces you to RDDs. RDD stands for Resilient Distributed Dataset.

2. Data Preparation

In this session, we're going to unpack RDD transformations and actions, but first, let's prepare the session and data.

3. Distinct and Filter Transformations

In this session, we will explore the distinct() and filter() transformations.
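
As an illustration, on made-up data and assuming the SparkSession named spark from the earlier sketch:

    sc = spark.sparkContext                        # the RDD API is reached through the SparkContext

    nums = sc.parallelize([1, 2, 2, 3, 4, 4, 5])
    print(nums.distinct().collect())               # duplicates removed (order may vary)
    print(nums.filter(lambda x: x > 3).collect())  # keep only elements matching the predicate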

4. Map and Flat Map Transformations

Let's learn about map and flat map transformations in this session.
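
A small illustrative sketch of the difference between the two, assuming the SparkContext sc from the previous sketch (sample sentences are made up):

    lines = sc.parallelize(["spark makes big data simple",
                            "pyspark is the python api for spark"])

    print(lines.map(lambda s: s.split(" ")).collect())      # one list of words per input line
    print(lines.flatMap(lambda s: s.split(" ")).collect())  # all words flattened into a single list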

5. SortByKey Transformations

In this session, we will perform sorting using the sortByKey() transformation.
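
For example, on a made-up pair RDD (again assuming the SparkContext sc from above):

    temps = sc.parallelize([("london", 12), ("cairo", 30), ("oslo", 4)])

    print(temps.sortByKey().collect())                  # sort by key, ascending
    print(temps.sortByKey(ascending=False).collect())   # sort by key, descending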

6. RDD Actions

In this video session, we will explore a couple of Spark actions.
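
A few common actions, sketched on made-up data:

    rdd = sc.parallelize([5, 3, 8, 1, 9])            # assumes the SparkContext sc from above

    print(rdd.count())                               # number of elements
    print(rdd.first())                               # first element
    print(rdd.take(3))                               # first three elements
    print(rdd.reduce(lambda a, b: a + b))            # sum of all elements via reduce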

7. Challenge - Convert Fahrenheit to Centigrade

In this video, we will look at a challenge problem and find the solution.

8. Challenge - XYZ Research

In this session, we will look at another challenge.

9. Challenge - XYZ Research Part 1

In this session, we will address part one of the challenge that we discussed in our previous lesson. Let's look at how many research projects were initiated in the three-year period.

10. Challenge XYZ Research Part 2

In this session, we will address part two of the challenge. In this video, we will look at how many projects were completed in the first year.


4. Structured API - Spark DataFrame

1. Structured APIs Introduction

This session introduces you to Spark structured APIs.

2. Preparing the Project Folder

In this session, we will set up our Python environment to learn the DataFrame API (structured APIs).

3. PySpark DataFrame, Schema, and DataTypes

In this session, let's learn about the DataFrame, schema, and data types.
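
A minimal sketch of declaring a schema explicitly (column names and rows are invented for illustration; this is not the course's dataset):

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    schema = StructType([
        StructField("name", StringType(), True),     # nullable string column
        StructField("age", IntegerType(), True),     # nullable integer column
    ])

    people = spark.createDataFrame([("Ada", 36), ("Linus", 52)], schema)
    people.printSchema()
    people.show()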

4. DataFrame Reader and Writer

In this session, we will learn to use the DataFrame reader and writer. The DataFrame reader is a built-in API that allows you to read various source files such as CSV and JSON, as well as big data file types such as Parquet, ORC, and Avro.
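
A hedged sketch of the reader and writer APIs (the file paths are hypothetical):

    # Read a CSV file with a header row, letting Spark infer the column types
    sales = (spark.read
             .option("header", True)
             .option("inferSchema", True)
             .csv("data/sales.csv"))              # hypothetical path

    # Write the same data back out in Parquet format
    sales.write.mode("overwrite").parquet("output/sales_parquet")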

5. Challenge Part 1 - Brief

It is time for your first task. This session provides you with all the details for the task.

6. Challenge Part 1 - Data Preparation

Let's tackle the first task that was discussed in our previous lesson. You can compare your solution with the solution provided in this video.

7. Working with Structured Operations

In this video session, we will start working with structured operations.

8. Managing Performance Errors

In this video session, we will learn how to manage performance errors. Spark is not designed to run on a single-node computer, so you are bound to experience some unusual errors.

9. Reading a JSON File

In this session, we will learn to read the JSON file.

10. Columns and Expressions

Let's explore columns and expressions in this session.

11. Filter and Where Conditions

Let's learn to filter data using the filter() and where() functions in this session.
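
Both functions are interchangeable; a sketch using hypothetical column names on the sales DataFrame from the earlier reader example:

    from pyspark.sql.functions import col

    sales.filter(col("Quantity") > 1).show()                 # column-expression form
    sales.where("Price > 100 AND City = 'New York'").show()  # SQL-string form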

12. Distinct Drop Duplicates Order By

Depending on the dataset you are working with, you may require a unique set of rows. Let's explore how to get a unique set of rows, drop duplicates, and sort/order the DataFrame.
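
For illustration, again with hypothetical column names:

    from pyspark.sql.functions import col

    sales.select("City").distinct().show()             # unique values of one column
    sales.dropDuplicates(["OrderID"]).show()            # drop rows repeating the same OrderID
    sales.orderBy(col("Price").desc()).show(5)          # sort by Price, highest first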

13. Rows and Union

In this lesson, we will learn how to create individual rows, build a DataFrame from them, and use the union transformation to combine DataFrames.

14. Adding, Renaming, and Dropping Columns

In this session, we're going to learn how to add, rename, and drop columns.
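
A short sketch (the column names are assumptions, not the course's dataset):

    from pyspark.sql.functions import col

    updated = (sales
               .withColumn("Revenue", col("Price") * col("Quantity"))  # add a derived column
               .withColumnRenamed("City", "PurchaseCity")              # rename a column
               .drop("Notes"))                                         # drop an unwanted column
    updated.printSchema()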

15. Working with Missing or Bad Data

In this video, we will learn to clean the DataFrame and remove missing or bad data.
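
Two common approaches, sketched with hypothetical column names:

    cleaned = sales.na.drop("any")                                # drop rows containing any null
    filled = sales.na.fill({"Quantity": 0, "City": "Unknown"})    # or replace nulls with defaults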

16. Working with User-Defined Functions

In this session, let's learn how to create user-defined functions.
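
A minimal sketch of wrapping a Python function as a UDF (the banding logic and column name are made up):

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StringType

    @udf(returnType=StringType())
    def price_band(price):
        # plain Python logic applied to a DataFrame column, row by row
        return "high" if price is not None and price > 100 else "low"

    sales.withColumn("PriceBand", price_band(col("Price"))).show(5)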

17. Challenge Part 2 - Brief

It is time for a challenge. In this challenge, you are required to prepare and clean the data (remove defects).

18. Challenge Part 2 - Remove Null Row and Bad Records

In this session, we will work on the first part of the challenge discussed in the previous video. We will start with removing null and bad rows.

19. Challenge Part 2 - Get the City and State

Let's go ahead and work on the second part of the challenge. In this session, we will extract the city and state from the purchase address.

20. Challenge Part 2 - Rearrange the Schema

Let's continue with the challenge. In this session, we will change some data types, rename a few columns, drop some columns, and add new ones.

21. Challenge Part 2 - Write Partitioned DataFrame to Parquet

Let's finish working on the last part of the challenge. In this session, we will write the final DataFrame into a partitioned Parquet file.
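
The general pattern looks roughly like this (the DataFrame, column names, and output path are assumptions):

    (final_df.write
        .partitionBy("ReportYear", "Month")    # one sub-folder per distinct partition value
        .mode("overwrite")
        .parquet("output/sales"))              # hypothetical output path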

22. Aggregations

In this session, we will understand the concept of aggregations.

23. Aggregations - Setting Up Flight Summary Data

In this session, we will look at some data to understand the concept of aggregation better.

24. Aggregations - Count and Count Distinct

In this session, we will explore the concept of aggregation further and learn to use the count and countDistinct aggregation functions.

25. Aggregations - Min Max Sum SumDistinct AVG

In this session, we will work with the aggregation functions min, max, sum, sumDistinct, and avg.

26. Aggregations with Grouping

In this lesson, we will group the data and then apply an aggregation function.
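
A sketch that combines the aggregation functions from the last few lessons with a groupBy (column names are hypothetical):

    from pyspark.sql import functions as F

    (sales.groupBy("City")
          .agg(F.count("OrderID").alias("orders"),
               F.countDistinct("Product").alias("distinct_products"),
               F.sum("Price").alias("total_sales"),
               F.avg("Price").alias("avg_price"))
          .orderBy(F.col("total_sales").desc())
          .show())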

27. Challenge Part 3 - Brief

In this video session, we will look at another challenge.

28. Challenge Part 3 - Prepare 2019 Data

In this session, we will address the first task of the challenge that was discussed in the previous video. We will be preparing the 2019 data and modularizing our programs.

29. Challenge Part 3 - Q1 Get the Best Sales Month

In this session, we will discuss the answer to the first question of the challenge: Which month had the best sales?

30. Challenge Part 3 - Q2 Get the City that Sold the Most Products

In this session, we will discuss the second question of the challenge and get the city that sold the most products.

31. Challenge Part 3 - Q3 When to Advertise

In this session, we will address the third question of the challenge: What time should we display advertisements to maximize the likelihood of customers buying products?

32. Challenge Part 3 - Q4 Products Bought Together

In this session, we will address the final question of the challenge, which is: What products are often sold together in the state of 'NY'?


5. Introduction to Spark SQL and Databricks

1. Introduction to Databricks

In this session, we will discuss the idea behind Databricks.

2. Spark SQL Introduction

In this session, we will discuss SQL and Spark SQL.

3. Register Account on Databricks

In this session, we will go ahead and register on Databricks.

4. Create a Databricks Cluster

In this session, we will create a Databricks cluster.

5. Creating our First 2 Databricks Notebooks

We are now ready with our newly created cluster. In this session, we will go ahead and create our first two notebooks.

6. Reading CSV Files into DataFrame

In this session, we will load our CSV files from our previous project into the DataFrame.

7. Creating a Database and Table

In this session, we will create a database and a table.
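
A rough Spark SQL sketch of the idea (the database, table, and column names are invented; on Databricks the table format may default to Delta):

    spark.sql("CREATE DATABASE IF NOT EXISTS sales_db")

    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_db.orders (
            order_id INT,
            product  STRING,
            price    DOUBLE
        ) USING PARQUET
    """)

    spark.sql("SHOW TABLES IN sales_db").show()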

8. Inserting Records into a Table

In this session, we will insert records into the table we created previously.
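
Continuing the hypothetical table above, an insert might look like this:

    spark.sql("""
        INSERT INTO sales_db.orders VALUES
            (1, 'Monitor', 149.99),
            (2, 'Cable',     9.99)
    """)

    spark.sql("SELECT * FROM sales_db.orders").show()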

9. Exposing Bad Records

In this session, we will expose the bad records in our data so that we can ensure good data quality.

10. Figuring out How to Remove Bad Records

In this session, we will learn to produce a query that excludes the bad records.

11. Extract the City and State

We will extract the city and state from the purchase address in this video session.

12. Inserting Records to Final Sales Table

In this session, we will use SQL to insert records into the final sales table.

13. What was the Best Month in Sales?

By now, you should have the sales analytics data ready. In this session, we will look at a few questions about the data.

14. Get the City that Sold the Most Products

In this session, we will address the question of which city sold the most products.

15. Get the Right Time to Advertise

In this session, we will address the question of what time we should display advertisements to maximize the likelihood of customers buying products.

16. Get the Most Products Sold Together

What products are most often sold together in the state of NY? This is the question we will be addressing in this session.

17. Create a Dashboard

In this session, we will be creating a Databricks dashboard.

18. Summary

Congratulations! You have successfully completed the course. Let's look at a short summary of the things you have learned so far before we wrap up.

About The Provider

Packt
Birmingham
Founded in 2004 in Birmingham, UK, Packt’s mission is to help the world put software to work in new ways, through the delivery of effective learning and i...
