Why does Spark SQL require double percent symbols for like conditions? - apache-spark

When I am using a like condition in Spark SQL, it seems that it requires the use of 2 percent symbols %%.
However, I could not find any documentation on this in the Spark SQL docs. I am curious as to why my set-up might be causing this requirement.
https://spark.apache.org/docs/3.3.0/sql-ref-syntax-qry-select-like.html
Example data
product_table
id
product_type
region
location
measurement
43635665
ORANGE - Blood Orange
EU
FRA
30.5
78960788
APPLE GrannySmith
NA
USA
16.0
12312343
APPLE [Organic Washington]
NA
CAN
7.1
67867634
ORANGE, NavelOrange
NA
MEX
88.4
import pyspark
from pyspark.sql import functions as F
APP_NAME = "Product: Fruit Template"
SPARK_CONF = [
("spark.dynamicAllocation.maxExecutors", "5"),
("spark.executor.memory", "10g"),
("spark.executor.cores", "4"),
("spark.executor.memoryOverhead", "2000"),
]
spark_conf = pyspark.SparkConf()
spark_conf.setAppName(APP_NAME)
spark_conf.setAll(SPARK_CONF)
sc = pyspark.SparkContext(conf=spark_conf)
spark = pyspark.sql.SparkSession(sc)
def sql(query):
return spark.sql(query)
df = sql("""
SELECT *
FROM product_table
""")
this returns data
df.filter(F.col("product_type").like("ORANGE%%")).show()
whereas this returns an empty dataframe
df.filter(F.col("product_type").like("ORANGE%")).show()
Maybe worth noting, the same issue happens when the LIKE condition is used in the SQL string
this returns data
df_new = sql("""
SELECT *
FROM product_table
WHERE product_type like 'ORANGE%%'
""")
df_new.show()
whereas this returns an empty dataframe
df_new = sql("""
SELECT *
FROM product_table
WHERE product_type like 'ORANGE%'
""")
df_new.show()

i am using PySpark Version: 2.3.2.
conf = (SparkConf()
.set("spark.executor.instances", "24")
.set("spark.executor.cores", "5")
.set("spark.executor.memory", "33g")
.set("spark.driver.memory", "55g")
.set("spark.driver.maxResultSize", "10g")
.set("spark.sql.catalogImplementation", "hive")
.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
)
spark = (
SparkSession.builder.appName("default")
.enableHiveSupport()
.config(conf=conf)
.getOrCreate()
)
df = spark.createDataFrame(
[('43635665','ORANGE - Blood Orange'),
('78960788','APPLE GrannySmith'),
('12312343','APPLE [Organic Washington'),
('67867634','ORANGE, NavelOrange')],
['id', 'product_type'])
df.createOrReplaceTempView("product_table")
def sql(query):
print(query)
return spark.sql(query)
df2 = sql("""
SELECT *
FROM product_table
""")
df2.filter(F.col("product_type").like("ORANGE%")).show(truncate=False)

Related

How to reliably obtain partition columns of delta table

I need to obtain the partitioning columns of a delta table, but the returned result of a
DESCRIBE delta.`my_table` returns different results on databricks and locally on PyCharm.
Minimal example:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
delta_table_path = "c:/temp_delta_table"
partition_column = ["rs_nr"]
schema = StructType([
StructField("rs_nr", StringType(), False),
StructField("event_category", StringType(), True),
StructField("event_counter", IntegerType(), True)])
data = [{'rs_nr': '001', 'event_category': 'event_01', 'event_counter': 1},
{'rs_nr': '002', 'event_category': 'event_02', 'event_counter': 2},
{'rs_nr': '003', 'event_category': 'event_03', 'event_counter': 3},
{'rs_nr': '004', 'event_category': 'event_04', 'event_counter': 4}]
sdf = spark.createDataFrame(data=data, schema=schema)
sdf.write.format("delta").mode("overwrite").partitionBy(partition_column).save(delta_table_path)
df_descr = spark.sql(f"DESCRIBE delta.`{delta_table_path}`")
df_descr.toPandas()
Shows, on databricks, the partition column(s):
col_name data_type comment
0 rs_nr string None
1 event_category string None
2 event_counter int None
3 # Partition Information
4 # col_name data_type comment
5 rs_nr string None
But when running this locally in PyCharm, I get the following different output:
col_name data_type comment
0 rs_nr string
1 event_category string
2 event_counter int
3
4 # Partitioning
5 Part 0 rs_nr
Parsing both types of return value seems ugly to me, so is there a reason that this is returned like this?
Setup:
In Pycharm:
pyspark = 3.2.3
delta-spark = 2.0.0
In DataBricks:
DBR 11.3 LTS
Spark = 3.3.0 (I just noted that this differs, I will test if 3.3.0 works locally in the meantime)
Scala = 2.12
In PyCharm, I create the connection using:
def get_spark():
spark = SparkSession.builder.appName('schema_checker')\
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")\
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")\
.config("spark.jars.packages", "io.delta:delta-core_2.12:2.0.0")\
.config("spark.sql.catalogImplementation", "in-memory")\
.getOrCreate()
return spark
If you're using Python, then instead of executing SQL command that is harder to parse, it's better to use Python API. The DeltaTable instance has a detail function that returns a dataframe with details about the table (doc), and this dataframe has the partitionColumns column that is array of strings with partition columns names. So you can just do:
from delta.tables import *
detailDF = DeltaTable.forPath(spark, delta_table_path).detail()
partitions = detailDF.select("partitionColumns").collect()[0][0]

Find the list of all persisted dataframes in Spark

I have a long spark code similar to below which has a lot of keywords :
df = spark.sql("""
select * from abc
""")
df.persist()
df2 = spark.sql("""
select * from def
""")
df2.persist()
df3 = spark.sql("""
select * from mno""")
I want to find out all the dataframes which have been persisted and store them in a list.
Output:
l1 = [df, df2]
How can we proceed with this?
Try this:
from pyspark.sql import DataFrame
df = spark.sql("""
select * from abc
""")
df.persist()
df2 = spark.sql("""
select * from def
""")
df2.persist()
df3 = spark.sql("""
select * from mno
""")
dfNameList = []
for k, v in globals().items():
if isinstance(v, DataFrame):
# k is the name of DF, v is DF itself.
if v.storageLevel.useMemory == True:
dfNameList.append(k)
print(dfNameList)
Output:
['df', 'df2']
Loop globals().items();
Find DataFrame instance;
Determine whether DF is persistent in memory;
Collect the DF name and print.
If you want to put all DF in the list instead of DF names, just append the v to list.
Output will like:
[DataFrame[fieldOne: typeOne, fieldTwo: typeTwo, ……], DataFrame[fieldOne: typeOne, fieldTwo: typeTwo,……]]

What is the most efficient way to do two different joins on the same two dataframes in Pyspark

I am trying to compare two dataframes to look for new records and updated records, which in turn will be used to create a third dataframe. I am using Pyspark 2.4.3
As I come from a SQL background (ASE), my initial thought would be to do a left join to find new records and a != on a hash of all the columns to find updates:
SELECT a.*
FROM Todays_Data a
Left Join Yesterdays_PK_And_Hash b on a.pk = b.pk
WHERE (b.pk IS NULL) --finds new records
OR (b.hashOfColumns != HASHBYTES('md5',<converted and concatenated columns>)) --updated records
I have been playing around with Pyspark and have come up with a script that achieves the results I am after:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import md5, concat_ws, col, lit
sc = SparkContext("local", "test App")
sqlContext = SQLContext(sc)
sp = SparkSession \
.builder \
.appName("test App") \
.getOrCreate()
df = sp.createDataFrame(
[("Fred", "Smith", "16ba5519cdb13f99e087473e4faf3825"), # hashkey here is created based on YOB of 1973. To test for an update
("Fred", "Davis", "253ab75676cdbd73b874c97a62d27608"),
("Barry", "Clarke", "cc3baaa05a1146f2f8cf0a743c9ab8c4")],
["First_name", "Last_name", "hashkey"]
)
df_a = sp.createDataFrame(
[("Fred", "Smith", "Adelaide", "Doctor", 1971),
("Fred", "Davis", "Melbourne", "Baker", 1970),
("Barry", "Clarke", "Sydney", "Scientist", 1975),
("Jane", "Hall", "Sydney", "Dentist", 1980)],
["First_name", "Last_name", "City", "Occupation", "YOB"]
)
df_a = df_a.withColumn("hashkey", md5(concat_ws("", *df_a.columns)))
df_ins = df_a.alias('a').join(df.alias('b'), (col('a.First_name') == col('b.First_name')) &
(col('a.Last_name') == col('b.Last_name')), 'left_anti') \
.select(lit("Insert").alias("_action"), 'a.*') \
.dropDuplicates()
df_up = df_a.alias('a').join(df.alias('b'), (col('a.First_name') == col('b.First_name')) &
(col('a.Last_name') == col('b.Last_name')) &
(col('a.hashkey') != col('b.hashkey')), 'inner') \
.select(lit("Update").alias("_action"), 'a.*') \
.dropDuplicates()
df_delta = df_ins.union(df_up).sort("YOB")
df_delta = df_delta.drop("hashkey")
df_delta.show(truncate=False)
What this produces is my final delta as such:
+-------+----------+---------+--------+----------+----+
|_action|First_name|Last_name|City |Occupation|YOB |
+-------+----------+---------+--------+----------+----+
|Update |Fred |Smith |Adelaide|Doctor |1971|
|Insert |Jane |Hall |Sydney |Dentist |1980|
+-------+----------+---------+--------+----------+----+
While I am getting the results I am after, I am unsure how efficient the above code is.
Ultimately in the end, I would like to run similar patterns against datasets into the 100's of million records.
Is there anyway to make this more efficient?
Thanks
Have you explored broadcast join? Your join statements could be problematic if you have 100M + records. If the dataset B is smaller, this would the tiny modification I would try:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import md5, concat_ws, col, lit, broadcast
sc = SparkContext("local", "test App")
sqlContext = SQLContext(sc)
sp = SparkSession \
.builder \
.appName("test App") \
.getOrCreate()
df = sp.createDataFrame(
[("Fred", "Smith", "16ba5519cdb13f99e087473e4faf3825"), # hashkey here is created based on YOB of 1973. To test for an update
("Fred", "Davis", "253ab75676cdbd73b874c97a62d27608"),
("Barry", "Clarke", "cc3baaa05a1146f2f8cf0a743c9ab8c4")],
["First_name", "Last_name", "hashkey"]
)
df_a = sp.createDataFrame(
[("Fred", "Smith", "Adelaide", "Doctor", 1971),
("Fred", "Davis", "Melbourne", "Baker", 1970),
("Barry", "Clarke", "Sydney", "Scientist", 1975),
("Jane", "Hall", "Sydney", "Dentist", 1980)],
["First_name", "Last_name", "City", "Occupation", "YOB"]
)
df_a = df_a.withColumn("hashkey", md5(concat_ws("", *df_a.columns)))
df_ins = df_a.alias('a').join(broadcast(df.alias('b')), (col('a.First_name') == col('b.First_name')) &
(col('a.Last_name') == col('b.Last_name')), 'left_anti') \
.select(lit("Insert").alias("_action"), 'a.*') \
.dropDuplicates()
df_up = df_a.alias('a').join(broadcast(df.alias('b')), (col('a.First_name') == col('b.First_name')) &
(col('a.Last_name') == col('b.Last_name')) &
(col('a.hashkey') != col('b.hashkey')), 'inner') \
.select(lit("Update").alias("_action"), 'a.*') \
.dropDuplicates()
df_delta = df_ins.union(df_up).sort("YOB")
Maybe rewriting the code cleanly would be easier to follow too.
#Ash, from a readability standpoint, you could do a couple of things:
Use variables
Use functions.
Use pep-8 guiding style as much as possible. (ex: no more than 80 chars in a line)
joinExpr = (col('a.First_name') == col('b.First_name')) &
(col('a.Last_name') == col('b.Last_name')
joinType = 'left_anti'
df_up = df_a.alias('a').join(broadcast(df.alias('b')), joinExpr) &
(col('a.hashkey') != col('b.hashkey')), joinType) \
.select(lit("Update").alias("_action"), 'a.*') \
.dropDuplicates()
This is still long, but you get the idea.

How do find out the total amount for each month using spark in python

I'm looking for a way to aggregate by month my data. I want firstly to keep only month in my visitdate. My DataFrame looks like this:
Row(visitdate = 1/1/2013,
patientid = P1_Pt1959,
amount = 200,
note = jnut,
)
My objectif subsequently is to group by visitdate and calculate the sum of amount. I tried this :
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
file_path = "G:/Visit Data.csv"
patients = spark.read.csv(file_path,header = True)
patients.createOrReplaceTempView("visitdate")
sqlDF = spark.sql("SELECT visitdate,SUM(amount) as totalamount from visitdate GROUP BY visitdate")
sqlDF.show()
This is the result :
visitdate|totalamount|
+----------+-----------+
| 9/1/2013| 10800.0|
|25/04/2013| 12440.0|
|27/03/2014| 16930.0|
|26/03/2015| 18560.0|
|14/05/2013| 13770.0|
|30/06/2013| 13880.0
My objectif is to get something like this:
visitdate|totalamount|
+----------+-----------+
|1/1/2013| 10800.0|
|1/2/2013| 12440.0|
|1/3/2013| 16930.0|
|1/4/2014| 18560.0|
|1/5/2015| 13770.0|
|1/6/2015| 13880.0|
You need to truncate your date's down to months so they group properly, then do a groupBy/sum. There is a spark function to do this for you call date_trunc. For example.
from datetime import date
from pyspark.sql.functions import date_trunc, sum
data = [
(date(2000, 1, 2), 1000),
(date(2000, 1, 2), 2000),
(date(2000, 2, 3), 3000),
(date(2000, 2, 4), 4000),
]
df = spark.createDataFrame(sc.parallelize(data), ["date", "amount"])
df.groupBy(date_trunc("month", df.date)).agg(sum("amount"))
+-----------------------+-----------+
|date_trunc(month, date)|sum(amount)|
+-----------------------+-----------+
| 2000-01-01 00:00:00| 3000|
| 2000-02-01 00:00:00| 7000|
+-----------------------+-----------+

Use WHERE or FILTER whe creating TempView

Is it possible to use where or filter when creating a SparkSQL TempView ?
I have a Cassandra table words with
word | count
------------
apples | 20
banana | 10
I tried
%spark
val df = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options( Map ("keyspace"-> "temp", "table"->"words" ))
.where($"count" > 10)
.load()
.createOrReplaceTempView("high_counted")
or
%spark
val df = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options( Map ("keyspace"-> "temp", "table"->"words" ))
.where("count > 10")
.load()
.createOrReplaceTempView("high_counted")
You cannot do a WHERE or FILTER without .load()ing the table as #undefined_variable suggested.
Try:
%spark
val df = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options( Map ("keyspace"-> "temp", "table"->"words" ))
.load()
.where($"count" > 10)
.createOrReplaceTempView("high_counted")
Alternatively, you can do a free form query as documented here.
Spark evaluated statements in lazy fashion and the above statement is a Transformation. (If you are thinking we need to filter before we load)

Resources