Booking options
£41.99
On-Demand course
8 hours 30 minutes
All levels
This course focuses primarily on explaining the concepts of Python and PySpark. It will help you enhance your data analysis skills using the structured Spark DataFrame APIs.
Apache Spark 3 is an open-source distributed engine for querying and processing data. This course will provide you with a detailed understanding of PySpark and its stack. The course is carefully developed and designed to guide you through the process of data analytics using Python Spark. The author uses an interactive approach to explaining key concepts of PySpark, such as the Spark architecture, Spark execution, transformations and actions using the structured API, and much more. You will be able to leverage the power of Python, Java, and SQL and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Apache Spark architecture and how to set up a Python environment for Spark. You will then learn the techniques for collecting, cleaning, and visualizing data by creating dashboards in Databricks. You will learn how to use SQL to interact with DataFrames. The author provides an in-depth review of RDDs and contrasts them with DataFrames. There are multiple problem challenges provided at intervals in the course so that you get a firm grasp of the concepts taught. The code bundle for this course is available here: https://github.com/PacktPublishing/Apache-Spark-3-for-Data-Engineering-and-Analytics-with-Python-
Learn Spark architecture, transformations, and actions using the structured API
Learn to set up your own local PySpark environment
Learn to interpret the DAG (Directed Acyclic Graph) for Spark execution
Learn to interpret the Spark web UI
Learn the RDD (Resilient Distributed Datasets) API
Learn to visualize (graphs and dashboards) data on Databricks
This course is designed for Python developers who wish to learn how to use the language for data engineering and analytics with PySpark. It will also suit aspiring data engineering and analytics professionals, data scientists/analysts who wish to learn an analytical processing strategy that can be deployed over a big data cluster, and data managers who want to gain a deeper understanding of managing data over a cluster.
The entire course follows a practical hands-on approach ensuring that you can practice and understand all core concepts. It provides interactive activities to give you a practical learning experience.
Apply PySpark and SQL concepts to analyze data
Understand the Databricks interface and use Spark on Databricks
Learn Spark transformations and actions using the RDD (Resilient Distributed Datasets) API
https://github.com/PacktPublishing/Apache-Spark-3-for-Data-Engineering-and-Analytics-with-Python-
David Mngadi is a data management professional who is influenced by the power of data in our lives and has helped several companies become more data-driven to gain a competitive edge as well as meet regulatory requirements. In the last 15 years, he has had the pleasure of designing and implementing data warehousing solutions in the retail, telco, and banking industries, and more recently in big data lake-specific implementations. He is passionate about technology and teaching programming online.
1. Introduction to Spark and Installation
1. Introduction This video session introduces you to PySpark. Let's understand what PySpark/Spark is. PySpark is the Python API for Spark.
2. The Spark Architecture In this session, we will understand the architecture of Spark.
3. The Spark Unified Stack In this video, we will get a detailed description of the Spark unified stack.
4. Java Installation We will be installing Java SE in this session (Windows).
5. Hadoop Installation We will be installing and setting up Hadoop in this session (Windows).
6. Python Installation We will be installing Python in this session (Windows).
7. PySpark Installation In this session, we will go ahead and install PySpark (Windows).
8. Install Microsoft Build Tools In this session, we will install the Microsoft Build Tools and then the Jupyter Notebook, the tool we will be using throughout the course (Windows).
9. MacOS - Java Installation We will install Java in this session (macOS).
10. MacOS - Python Installation We will install Python in this session (macOS).
11. MacOS - PySpark Installation We will be installing PySpark in this session (macOS).
12. MacOS - Testing the Spark Installation We will be testing our Spark installation for macOS in this session; a minimal verification sketch follows this section's lesson list.
13. Install Jupyter Notebooks We will be installing the Jupyter Notebook in this session for macOS.
14. The Spark Web UI In this session, we will go through the Spark Web UI and see how we can use it to track our Spark jobs.
15. Section Summary Let's recap what we have learned so far in this section.
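Once the installation lessons above are complete, a local setup can be sanity-checked with a few lines of PySpark. This is only a minimal sketch (the application name and sample data are arbitrary and not part of the course material); while the session is running, the Spark web UI covered in lesson 14 is normally reachable at http://localhost:4040.

    from pyspark.sql import SparkSession

    # Start a local Spark session; "local[*]" uses all available CPU cores.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("InstallationCheck")
             .getOrCreate())

    print(spark.version)  # confirms which Spark version is running

    # A tiny DataFrame proves the installation works end to end.
    df = spark.createDataFrame([(1, "spark"), (2, "pyspark")], ["id", "name"])
    df.show()

    spark.stop()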
2. Spark Execution Concepts
1. Section Introduction This video gives an overview of the entire section.
2. Spark Application and Session In this video session, we will dive deep into the Spark application. Let's create a Spark program and learn more about Spark sessions.
3. Spark Transformations and Actions Part 1 In this video, we will see how Spark executes its application.
4. Spark Transformations and Actions Part 2 In this session, we will learn about narrow and wide transformations and Spark actions (a brief sketch of lazy transformations versus actions follows this section's lesson list).
5. DAG Visualisation In this session, we will revisit the Spark web UI and unpack the DAG.
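As a rough illustration of the execution concepts in this section (the data and column names below are invented, not the course's own example), transformations such as filter and groupBy are lazy and only extend the query plan, while an action such as show triggers a job whose DAG then appears in the Spark web UI.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("ExecutionConcepts")
             .getOrCreate())

    df = spark.createDataFrame(
        [("Alice", 34), ("Bob", 45), ("Carol", 29)], ["name", "age"])

    # Transformations: nothing runs yet, Spark only records the plan/lineage.
    adults = df.filter(df.age > 30)                       # narrow transformation
    by_initial = (adults
                  .groupBy(adults.name.substr(1, 1).alias("initial"))
                  .count())                               # wide transformation (shuffle)

    # Action: this triggers the job; its DAG is visible in the web UI.
    by_initial.show()

    spark.stop()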
3. RDD Crash Course
1. Introduction to RDDs This video introduces you to RDDs. RDD stands for Resilient Distributed Datasets.
2. Data Preparation In this session, we're going to unpack RDD transformations and actions, but first, let's prepare the session and data.
3. Distinct and Filter Transformations In this session, we will explore the distinct() and filter() transformations (a combined sketch of this section's RDD operations follows the lesson list).
4. Map and Flat Map Transformations Let's learn about the map and flatMap transformations in this session.
5. SortByKey Transformations In this session, we will perform sorting using the sortByKey() transformation.
6. RDD Actions In this video session, we will explore a couple of Spark actions.
7. Challenge - Convert Fahrenheit to Centigrade In this video, we will look at a challenge problem and find the solution (one possible approach is sketched after this section's lesson list).
8. Challenge - XYZ Research In this session, we will look at another challenge.
9. Challenge - XYZ Research Part 1 In this session, we will address part one of the challenge that we discussed in our previous lesson. Let's look at how many research projects were initiated in the three-year period.
10. Challenge - XYZ Research Part 2 In this session, we will address part two of the challenge and look at how many projects were completed in the first year.
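To make the RDD transformations and actions in this section concrete, here is a combined sketch. It is only an illustration with made-up numbers and sentences, not the course's dataset, but the operations (distinct, filter, map, flatMap, sortByKey, and the actions that trigger them) are the ones the lessons above cover.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("RDDCrashCourse").getOrCreate()
    sc = spark.sparkContext

    numbers = sc.parallelize([1, 2, 2, 3, 4, 4, 5])
    lines = sc.parallelize(["spark makes big data simple",
                            "pyspark is the python api for spark"])

    # Transformations (lazy)
    unique_even = numbers.distinct().filter(lambda n: n % 2 == 0)
    squared = numbers.map(lambda n: n * n)
    words = lines.flatMap(lambda line: line.split(" "))
    word_counts = (words.map(lambda w: (w, 1))
                        .reduceByKey(lambda a, b: a + b)
                        .sortByKey())

    # Actions (trigger execution)
    print(unique_even.collect())   # e.g. [2, 4]
    print(squared.take(3))         # first three squared values
    print(numbers.count())         # 7
    print(word_counts.collect())   # (word, count) pairs sorted by word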
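For the Fahrenheit-to-Centigrade challenge, one possible approach, sketched here with invented readings and reusing the SparkContext from the sketch above, is a simple map over an RDD of temperatures. This is not necessarily the solution presented in the video.

    fahrenheit = sc.parallelize([32.0, 68.0, 98.6, 212.0])

    # C = (F - 32) * 5/9, rounded to one decimal place
    centigrade = fahrenheit.map(lambda f: round((f - 32) * 5.0 / 9.0, 1))
    print(centigrade.collect())    # [0.0, 20.0, 37.0, 100.0]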
4. Structured API - Spark DataFrame
1. Structured APIs Introduction This session introduces you to the Spark structured APIs.
2. Preparing the Project Folder In this session, we will set up our Python environment to learn the DataFrame API (structured APIs).
3. PySpark DataFrame, Schema, and DataTypes In this session, let's learn about the DataFrame, schema, and data types.
4. DataFrame Reader and Writer In this session, we will learn to use the DataFrame reader and writer. The DataFrame reader is a built-in API within the DataFrame that allows you to read various source files such as CSV, JSON, and other big data file types such as Parquet, ORC, and Avro (a reader sketch follows this section's lesson list).
5. Challenge Part 1 - Brief It is time for your first task. This session provides you with all the details for the task.
6. Challenge Part 1 - Data Preparation Let's tackle the first task that was discussed in our previous lesson. You can compare your solution with the solution provided in this video.
7. Working with Structured Operations In this video session, we will start working with structured operations.
8. Managing Performance Errors In this video session, we will learn how to manage performance errors. Spark is not designed to work on a single-node computer; hence, you are bound to experience weird errors.
9. Reading a JSON File In this session, we will learn to read a JSON file.
10. Columns and Expressions Let's explore columns and expressions in this session (sketched after this section's lesson list).
11. Filter and Where Conditions Let's learn to filter data using the filter() and where() functions in this session.
12. Distinct Drop Duplicates Order By Depending on the dataset that you are working with, you may require a unique set of rows. Let's explore how to get a unique set of rows, drop duplicates, and sort/order the DataFrame.
13. Rows and Union In this lesson, we will learn how to create individual rows, make a DataFrame out of the rows, and use the union transformation to combine DataFrames.
14. Adding, Renaming, and Dropping Columns In this session, we're going to learn how to add, rename, and drop columns.
15. Working with Missing or Bad Data In this video, we will learn to clean the DataFrame and remove missing or bad data (a cleaning and UDF sketch follows this section's lesson list).
16. Working with User-Defined Functions In this session, let's learn how to create user-defined functions.
17. Challenge Part 2 - Brief It is time for a challenge. In this challenge, you are required to prepare and clean the data (remove defects).
18. Challenge Part 2 - Remove Null Row and Bad Records In this session, we will work on the first part of the challenge discussed in the previous video. We will start with removing null and bad rows.
19. Challenge Part 2 - Get the City and State Let's go ahead and work on the second part of the challenge. In this session, we will extract the city and state from the purchase address.
20. Challenge Part 2 - Rearrange the Schema Let's continue with the challenge. In this session, we will be changing some data types, renaming a few columns, dropping some columns, and adding new ones.
21. Challenge Part 2 - Write Partitioned DataFrame to Parquet Let's finish working on the last part of the challenge. In this session, we will write the final DataFrame into a partitioned Parquet file (a writer sketch follows this section's lesson list).
22. Aggregations In this session, we will understand the concept of aggregations.
23. Aggregations - Setting Up Flight Summary Data In this session, we will look at some data to understand the concept of aggregation better.
24. Aggregations - Count and Count Distinct In this session, we will explore the concept of aggregation further and learn to use the count and countDistinct aggregation functions (an aggregation sketch follows this section's lesson list).
25. Aggregations - Min Max Sum SumDistinct AVG In this session, we will work with the aggregation functions: min, max, sum, sumDistinct, and avg.
26. Aggregations with Grouping In this lesson, we will group the data and then apply an aggregation function.
27. Challenge Part 3 - Brief In this video session, we will look at another challenge.
28. Challenge Part 3 - Prepare 2019 Data In this session, we will address the first task of the challenge that was discussed in the previous video. We will be preparing the 2019 data and modularizing our programs.
29. Challenge Part 3 - Q1 Get the Best Sales Month In this session, we will discuss the answer to the first question of the challenge, which is: Which month had the best sales?
30. Challenge Part 3 - Q2 Get the City that Sold the Most Products In this session, we will discuss the second question of the challenge and get the city that sold the most products.
31. Challenge Part 3 - Q3 When to Advertise In this session, we will be addressing the third question of the challenge, which is: What time should we display advertisements to maximize the likelihood of customers buying products?
32. Challenge Part 3 - Q4 Products Bought Together In this session, we will address the final question of the challenge, which is: What products are often sold together in the state of 'NY'?
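The DataFrame, schema, and reader lessons above can be sketched as follows. The file path, column names, and types are placeholders invented for illustration, not the course's sales dataset.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField,
                                   IntegerType, StringType, DoubleType)

    spark = SparkSession.builder.master("local[*]").appName("StructuredAPI").getOrCreate()

    # An explicit schema avoids relying on (slower, sometimes wrong) inference.
    sales_schema = StructType([
        StructField("order_id", IntegerType(), True),
        StructField("product", StringType(), True),
        StructField("quantity", IntegerType(), True),
        StructField("price", DoubleType(), True),
        StructField("purchase_address", StringType(), True),
    ])

    # DataFrame reader: CSV here, but json(), parquet(), and orc() work the same way.
    sales_df = (spark.read
                .option("header", "true")
                .schema(sales_schema)
                .csv("data/sales.csv"))

    sales_df.printSchema()
    sales_df.show(5)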
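Building on the hypothetical sales_df from the previous sketch, the structured operations covered above (columns and expressions, filter/where, and distinct/dropDuplicates/orderBy) look roughly like this.

    from pyspark.sql import functions as F

    # Columns and expressions
    with_total = (sales_df
                  .withColumn("total", F.col("price") * F.col("quantity"))
                  .withColumn("product_upper", F.expr("upper(product)")))

    # filter and where are interchangeable
    expensive = with_total.filter(F.col("total") > 100)
    expensive_ny = with_total.where("total > 100 AND purchase_address LIKE '%NY%'")

    # Unique rows, duplicate removal, and ordering
    unique_products = sales_df.select("product").distinct()
    deduped = sales_df.dropDuplicates(["order_id"])
    sales_df.orderBy(F.col("price").desc()).show(5)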
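The lessons on adding, renaming, and dropping columns, handling missing or bad data, and user-defined functions can be sketched in the same vein. The address format assumed by the UDF (street, city, state and ZIP separated by commas) is only an illustration of city extraction, not the course's exact logic.

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Add, rename, and drop columns
    cleaned = (sales_df
               .withColumn("order_year", F.lit(2019))
               .withColumnRenamed("purchase_address", "address")
               .drop("quantity"))          # dropped purely to illustrate drop()

    # Handle missing or bad data: drop rows missing key fields, fill the rest
    no_nulls = cleaned.na.drop(subset=["order_id", "price"])
    filled = no_nulls.na.fill({"address": "unknown"})

    # User-defined function: pull the city out of an address such as
    # "123 Main St, Boston, MA 02118"
    @F.udf(returnType=StringType())
    def extract_city(address):
        parts = address.split(",") if address else []
        return parts[1].strip() if len(parts) > 1 else None

    with_city = filled.withColumn("city", extract_city(F.col("address")))
    with_city.show(5)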
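The aggregation lessons (count, countDistinct, min, max, sum, sumDistinct, avg, and grouping) could be exercised on the same hypothetical columns like this; the flight summary data used in the videos would be queried in the same way.

    from pyspark.sql import functions as F

    # Whole-DataFrame aggregations
    sales_df.select(
        F.count("order_id").alias("orders"),
        F.countDistinct("product").alias("distinct_products"),
        F.min("price").alias("min_price"),
        F.max("price").alias("max_price"),
        F.sum("price").alias("revenue"),
        F.sumDistinct("price").alias("sum_of_distinct_prices"),  # sum_distinct in newer releases
        F.avg("price").alias("avg_price"),
    ).show()

    # Aggregation with grouping
    (sales_df.groupBy("product")
             .agg(F.sum("quantity").alias("units_sold"),
                  F.avg("price").alias("avg_price"))
             .orderBy(F.col("units_sold").desc())
             .show(10))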
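Finally, the DataFrame writer and the partitioned-Parquet step of the challenge map onto the write API roughly as follows; the output path and the with_city DataFrame from the earlier cleaning sketch are placeholders, not the course's actual output.

    # Write the cleaned DataFrame as Parquet, one folder per city value
    (with_city.write
              .mode("overwrite")
              .partitionBy("city")
              .parquet("output/sales_parquet"))

    # Reading it back is symmetric, and Spark prunes partitions on filters
    reloaded = spark.read.parquet("output/sales_parquet")
    reloaded.where("city = 'Boston'").show(5)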
5. Introduction to Spark SQL and Databricks
1. Introduction to Databricks In this session, we will discuss the idea behind Databricks.
2. Spark SQL Introduction In this session, we will discuss SQL and Spark SQL.
3. Register Account on Databricks In this session, we will go ahead and register on Databricks.
4. Create a Databricks Cluster In this session, we will create a Databricks cluster.
5. Creating our First 2 Databricks Notebooks We are now ready with our newly created cluster. In this session, we will go ahead and create our first two notebooks.
6. Reading CSV Files into DataFrame In this session, we will load our CSV files from our previous project into the DataFrame.
7. Creating a Database and Table In this session, we will create a database and a table (a Spark SQL sketch follows this section's lesson list).
8. Inserting Records into a Table In this session, we will insert records into the table we created previously.
9. Exposing Bad Records In this session, we will expose the bad records so that we can ensure we end up with good-quality data.
10. Figuring out How to Remove Bad Records In this session, we will learn to produce a query without the bad records.
11. Extract the City and State We will extract the city and state from the purchase address in this video session.
12. Inserting Records to Final Sales Table In this session, we will use SQL to insert records into the final sales table.
13. What was the Best Month in Sales? By now, you should be ready with the sales analytics data. In this session, we will look at a few questions about the data.
14. Get the City that Sold the Most Products In this session, we will address the question of which city sold the most products.
15. Get the Right Time to Advertise In this session, we will address the question of what time we should display advertisements to maximize the likelihood of customers buying products.
16. Get the Most Products Sold Together What products are most often sold together in the state of NY? This is the question we will be addressing in this session.
17. Create a Dashboard In this session, we will be creating a Databricks dashboard.
18. Summary Congratulations! You have successfully completed the course. Let's look at a short summary of the things you have learned so far before we wrap up.
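As a closing illustration of the Spark SQL workflow in this section, here is a rough sketch of creating a database and table, inserting records, and answering one of the analytical questions. The database, table, and column names are invented, and plain local Spark SQL is shown rather than anything Databricks-specific; on Databricks the same statements can be run from a notebook cell.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("SparkSQL").getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS sales_db")
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_db.final_sales (
            order_id INT, product STRING, quantity INT, price DOUBLE, city STRING
        ) USING parquet
    """)

    # Stage some records as a temporary view, then insert them with SQL
    staged = spark.createDataFrame(
        [(1, "USB-C Charging Cable", 2, 11.95, "Boston"),
         (2, "Wired Headphones", 1, 11.99, "New York")],
        ["order_id", "product", "quantity", "price", "city"])
    staged.createOrReplaceTempView("staged_sales")

    spark.sql("""
        INSERT INTO sales_db.final_sales
        SELECT order_id, product, quantity, price, city FROM staged_sales
    """)

    # Example question: which city sold the most products?
    spark.sql("""
        SELECT city, SUM(quantity) AS units_sold
        FROM sales_db.final_sales
        GROUP BY city
        ORDER BY units_sold DESC
    """).show()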