Home » Tutorials » PySpark Tutorial » PySpark – SQL

PySpark – SQL

Introduction to PySpark SQL

PySpark SQL is the module in Spark that manages the structured data and it natively supports  Python programming language. PySpark provides APIs that support heterogeneous data sources to read the data for processing with Spark Framework. It is highly scalable and can be applied to a very high-volume dataset. PySpark is known for its advanced features such as speed, powerful caching, real-time computation, deployable with Hadoop and Spark cluster also, polyglot with multiple programming languages like Scala, Python, R, and Java. Because of its robust features and efficiency, It is gaining popularity in Data since and machine learning implementations.

What is PySpark SQL?

It is a tool to support python with Spark SQL. It is developed to support Python in Spark. For a Proper understanding of the PySpark, knowledge of Python, Big Data & Spark is required. It is slowly gaining popularity among database programmers due to its important features.

PySpark SQL works on the distributed System and It is also scalable that why it’s heavily used in data science. In PySpark SQL Machine learning is provided by the python library. This Python library is known as a machine learning library.

Features of PySpark SQL

Some of the important features of the PySpark SQL are given below:

  • Speed: It is much faster than the traditional large data processing frameworks like Hadoop.
  • Powerful Caching: PySpark provides a simple programming layer that helps in the caching than the other frameworks caching.
  • Real-Time: Computation in the PySpark SQL takes place in the memory that’s why it is real-time.
  • Deployment: It can deploy through the Hadoop or own cluster manager.
  • Polyglot: It supports programming in Scala, Java, Python, and R.
Related:  PySpark - round

It is used in Big data & where there is Big data involves that related to data analytics. It is the hottest tool in the market of Big Data Analytics.

Major Uses of PySpark SQL

Below is given some of the sectors where Pyspark is used in the majority:

Major Uses of PySpark SQL

E-commerce Industry

In the E-commerce industry, PySpark adds a major role. It’s used the enhance user accessibility, providing offers to the targeted customers, advertising to genuine customers. Different E-commerce industries like eBay, Alibaba, Flipkart, Amazon, etc use it to get genuine data for marketing purposes.

Media

Different media driving industries like Youtube, Netflix, Amazon, etc use PySpark in the majority for processing large data to make it available to the users. This processing of data takes place in real-time to the server-side applications.

Banking

Banking is another important sector where PySpark is being used on a very vast level. It is helping the finance sector to process real-time transactions for million of record processing, advertisement to genuine customers, credit risk assessment, etc.

PySpark Modules

Some of the important classes & their characteristics are given below:

  • pyspark.sql.SparkSession: This class enables programmers to program in Spark with DataFrame and SQL functionality. SparkSession used to create DataFrame, register DataFrame as tables, cache tables, executes SQL over tables.
  • pyspark.sql.DataFrame: DataFrame class plays an important role in the distributed collection of data. This data grouped into named columns. Spark SQL DataFrame is similar to a relational data table. A DataFrame can be created using SQLContext methods.
  • pyspark.sql.Columns: A column instances in DataFrame can be created using this class.
  • pyspark.sql.Row: A row in DataFrame can be created using this class.
  • pyspark.sql.GroupedData: GroupedData class provide the aggregation methods created by groupBy().
  • pyspark.sql.DataFrameNaFunctions: This class provides the functionality to work with the missing data.
  • pyspark.sql.DataFrameStatFunctions: Statistic functions are available with the DataFrames of Spark SQL. The functionality of the statistic functions is provided by this class.
  • pyspark.sql.functions: Many built-in functions in the Spark are available to work with the DataFrames. Some of the built-in functions are given below:
Related:  PySpark - select
Built In Methods
abs(col)
acos(col)
add_months(start, months)
approxCountDistinct(col,  res=none)
array([cols])
array_contains(col, value)
asc(col)
ascii(col)
asin(col)
atan
atan2
avg
base64
bin
bitwiseNot
Broadcast
Bround
cbrt
ceil
coalesce([col])
col(col)
collect_list(col)
collect_set(col)
column(col)
concat(*cols)
concat_ws(sep, *col)
conv(col, fromBase, toBase)
corr(col1, col2)
cos(col)
cosh(col)
count(col)
countDistinct(col, *cols)
covar_pop(col1, col2)
covar_samp(col1, col2)
crc32(col)
create_map(*cols)
cume_dist()
current_date()
current_timestamp()
date_add(start, days)
date_format(date, format)
date_sub(start, days)
datediff(end, start)
dayofmonth(col)
dayofyear(col)
decode(col, charset)
degrees(col)
dense_rank()
desc(col)
encode(col, charset)
exp(col)
explode(col)
expm1(col)
expr(str)
factorial(col)
first(col, ignorenulls=False)
floor(col)
format_number(col, d)
format_string(format, *cols)
from_json(col, schema, options={})
from_unixtime(timestamp, format=’yyyy-MM-dd HH:mm:ss’)
from_utc_timestamp(timestamp, tz)
get_json_object(col, path)
greatest(*cols)
grouping(col)
grouping_id(*cols)
hash(*cols)
hex(cols)
hour(col)
hypot(col1, col2)
initcap(col)
input_file_name()
instr(str, substr)
isnan(col)
isnull(col)
json_tuple(col, *fields)
kurtosis(col)
lag(col, count=1, default=None)
last(col, ignorenulls=False)
last_day(date)
lead(col, count=1, default=None)
length(col)
Built In Methods
locate(substr, str, pos=1)
log(arg1, arg2=None)
log10(col)
log1p(col)
log2(col)
lower(col)
ltrim(col)
max(col)
md5(col)
mean(col)
min(col)
minute(col)
monotonically_increasing_id()
month(col)
months_between(date1, date2)
nanvl(col1, col2)
next_day(date, dayOfWeek)
ntile(n)
percent_rank()
posexplode(col)
pow(col1, col2)
quarter(col)
radians(col)
rand(seed=None
randn(seed=None)
rank()
regexp_extract(str, pattern, idx)
regexp_replace(str, pattern, replacement)
repeat(col, n)
reverse(col)
rint(col)
round(col, scale=0)
row_number()
rpad(col, len, pad)
rtrim(col)
second(col)
sha1(col)
sha2(col, numBits)
shiftLeft(col, numBits)
shiftRight(col, numBits)
shiftRightUnsigned(col, numBits)
signum(col)
sin(col)
sinh(col)
size(col)
skewness(col)
sort_array(col, asc=True)
soundex(col)
spark_partition_id()
split(str, pattern)
sqrt(col)
stddev(col)
stddev_pop(col)
stddev_samp(col)
struct(*cols)
substring(str, pos, len)
substring_index(str, delim, count)
sum(col)
sumDistinct(col)
tan(col)
toDegrees(col)
toRadians(col)
to_date(col)
to_json(col, options={})
to_utc_timestamp(timestamp, tz)
translate(srcCol, matching, replace)
trim(col)
trunc(date, format)
udf(f, returnType=StringType)
unbase64(col)
unhex(col)
unix_timestamp(timestamp=None, format=’yyyy-MM-dd HH:mm:ss’)
upper(col)
var_pop(col)
var_samp(col)
variance(col)
weekofyear(col)
when(condition, value)
window(timeColumn, windowDuration, slideDuration=None, startTime=None)
year(col)
least(*cols) , lit(col)
levenshtein(left, right)
Related:  PySpark - datediff() and months_between()

pyspark.sql.types: These class types used in data type conversion. Using this class an SQL object can be converted into a native Python object.

  • pyspark.sql.streamingThis class handles all those queries which execute continue in the background. All these methods used in the streaming are stateless. The above given built-in functions are available to work with the dataFrames. These functions can be used by referring to the functions library.
  • pyspark.sql.WindowAll methods provided by this class can be used in defining & working with windows in DataFrames.

To explore more about SQL please visit our SQL Tutorial section.

Leave a Comment