Introduction to PySpark SQL
PySpark SQL is the Spark module for structured data processing, and it natively supports the Python programming language. PySpark provides APIs for reading data from heterogeneous data sources for processing with the Spark framework. It is highly scalable and can be applied to very high-volume datasets. PySpark is known for advanced features such as speed, powerful caching, real-time computation, deployability on Hadoop and Spark clusters, and polyglot support for multiple programming languages like Scala, Python, R, and Java. Because of its robust features and efficiency, it is gaining popularity in data science and machine learning implementations.
What is PySpark SQL?
PySpark SQL is a tool that brings Spark SQL support to Python; it was developed to support Python programming in Spark. For a proper understanding of PySpark, knowledge of Python, Big Data, and Spark is required. It is steadily gaining popularity among database programmers due to its feature set.
PySpark SQL works on a distributed system and is scalable, which is why it is heavily used in data science. In PySpark, machine learning is provided by a Python library known as MLlib, Spark's machine learning library.
Features of PySpark SQL
Some of the important features of the PySpark SQL are given below:
- Speed: It is much faster than traditional large-scale data processing frameworks such as Hadoop MapReduce.
- Powerful Caching: PySpark provides a simple programming layer with more effective caching than other frameworks.
- Real-Time: Computation in PySpark SQL takes place in memory, which enables near real-time processing.
- Deployment: It can be deployed through Hadoop (YARN) or Spark's own cluster manager.
- Polyglot: It supports programming in Scala, Java, Python, and R.
It is used wherever Big Data and data analytics are involved, and it is one of the hottest tools in the Big Data analytics market.
Major Uses of PySpark SQL
Below are some of the sectors where PySpark is most heavily used:
E-commerce Industry
In the E-commerce industry, PySpark plays a major role. It is used to enhance user accessibility, provide offers to targeted customers, and advertise to genuine customers. E-commerce companies like eBay, Alibaba, Flipkart, and Amazon use it to derive genuine insights for marketing purposes.
Media
Media-driven companies like YouTube, Netflix, and Amazon use PySpark extensively to process large volumes of data and make it available to users. This processing takes place in real time in server-side applications.
Banking
Banking is another important sector where PySpark is used on a very large scale. It helps the finance sector process real-time transactions across millions of records, target advertisements to genuine customers, assess credit risk, and more.
PySpark Modules
Some of the important classes and their characteristics are given below:
- pyspark.sql.SparkSession: This class is the main entry point for programming Spark with DataFrame and SQL functionality. A SparkSession is used to create DataFrames, register DataFrames as tables, cache tables, and execute SQL over tables.
- pyspark.sql.DataFrame: The DataFrame class represents a distributed collection of data grouped into named columns. A Spark SQL DataFrame is similar to a relational database table. A DataFrame can be created using SparkSession (or, in older versions, SQLContext) methods.
- pyspark.sql.Column: A column expression in a DataFrame; column instances can be created using this class.
- pyspark.sql.Row: A row of data in a DataFrame can be created using this class.
- pyspark.sql.GroupedData: This class provides the aggregation methods on data grouped by groupBy().
- pyspark.sql.DataFrameNaFunctions: This class provides functionality for working with missing data.
- pyspark.sql.DataFrameStatFunctions: This class provides the statistic functions available on Spark SQL DataFrames.
- pyspark.sql.functions: A collection of built-in functions available for working with DataFrames. Some of the built-in functions are given below:
Built In Methods |
--- |
abs(col) |
acos(col) |
add_months(start, months) |
approxCountDistinct(col, rsd=None) |
array([cols]) |
array_contains(col, value) |
asc(col) |
ascii(col) |
asin(col) |
atan(col) |
atan2(col1, col2) |
avg(col) |
base64(col) |
bin(col) |
bitwiseNot(col) |
broadcast(df) |
bround(col, scale=0) |
cbrt(col) |
ceil(col) |
coalesce([col]) |
col(col) |
collect_list(col) |
collect_set(col) |
column(col) |
concat(*cols) |
concat_ws(sep, *col) |
conv(col, fromBase, toBase) |
corr(col1, col2) |
cos(col) |
cosh(col) |
count(col) |
countDistinct(col, *cols) |
covar_pop(col1, col2) |
covar_samp(col1, col2) |
crc32(col) |
create_map(*cols) |
cume_dist() |
current_date() |
current_timestamp() |
date_add(start, days) |
date_format(date, format) |
date_sub(start, days) |
datediff(end, start) |
dayofmonth(col) |
dayofyear(col) |
decode(col, charset) |
degrees(col) |
dense_rank() |
desc(col) |
encode(col, charset) |
exp(col) |
explode(col) |
expm1(col) |
expr(str) |
factorial(col) |
first(col, ignorenulls=False) |
floor(col) |
format_number(col, d) |
format_string(format, *cols) |
from_json(col, schema, options={}) |
from_unixtime(timestamp, format='yyyy-MM-dd HH:mm:ss') |
from_utc_timestamp(timestamp, tz) |
get_json_object(col, path) |
greatest(*cols) |
grouping(col) |
grouping_id(*cols) |
hash(*cols) |
hex(cols) |
hour(col) |
hypot(col1, col2) |
initcap(col) |
input_file_name() |
instr(str, substr) |
isnan(col) |
isnull(col) |
json_tuple(col, *fields) |
kurtosis(col) |
lag(col, count=1, default=None) |
last(col, ignorenulls=False) |
last_day(date) |
lead(col, count=1, default=None) |
length(col) |
locate(substr, str, pos=1) |
log(arg1, arg2=None) |
log10(col) |
log1p(col) |
log2(col) |
lower(col) |
ltrim(col) |
max(col) |
md5(col) |
mean(col) |
min(col) |
minute(col) |
monotonically_increasing_id() |
month(col) |
months_between(date1, date2) |
nanvl(col1, col2) |
next_day(date, dayOfWeek) |
ntile(n) |
percent_rank() |
posexplode(col) |
pow(col1, col2) |
quarter(col) |
radians(col) |
rand(seed=None) |
randn(seed=None) |
rank() |
regexp_extract(str, pattern, idx) |
regexp_replace(str, pattern, replacement) |
repeat(col, n) |
reverse(col) |
rint(col) |
round(col, scale=0) |
row_number() |
rpad(col, len, pad) |
rtrim(col) |
second(col) |
sha1(col) |
sha2(col, numBits) |
shiftLeft(col, numBits) |
shiftRight(col, numBits) |
shiftRightUnsigned(col, numBits) |
signum(col) |
sin(col) |
sinh(col) |
size(col) |
skewness(col) |
sort_array(col, asc=True) |
soundex(col) |
spark_partition_id() |
split(str, pattern) |
sqrt(col) |
stddev(col) |
stddev_pop(col) |
stddev_samp(col) |
struct(*cols) |
substring(str, pos, len) |
substring_index(str, delim, count) |
sum(col) |
sumDistinct(col) |
tan(col) |
toDegrees(col) |
toRadians(col) |
to_date(col) |
to_json(col, options={}) |
to_utc_timestamp(timestamp, tz) |
translate(srcCol, matching, replace) |
trim(col) |
trunc(date, format) |
udf(f, returnType=StringType) |
unbase64(col) |
unhex(col) |
unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss') |
upper(col) |
var_pop(col) |
var_samp(col) |
variance(col) |
weekofyear(col) |
when(condition, value) |
window(timeColumn, windowDuration, slideDuration=None, startTime=None) |
year(col) |
least(*cols) |
lit(col) |
levenshtein(left, right) |
- pyspark.sql.types: The classes in this module represent Spark SQL data types and are used in data type conversion. Using these classes, a SQL object can be converted into a native Python object.
- pyspark.sql.streaming: This class handles queries that execute continuously in the background. The methods used in streaming are stateless. The built-in functions given above are also available for streaming DataFrames and can be used by importing the functions module.
- pyspark.sql.Window: The methods provided by this class are used to define and work with windows over DataFrames.
To explore more about SQL, please visit our SQL Tutorial section.