SparkSQL query using "PARTITION by" giving wrong output - apache-spark

I have a bunch of CSV files that I am processing with PySpark. However, I am a total noob with Spark (PySpark). So far I have been able to create an RDD, a subsequent data frame, and a temporary view (country_name) to easily query the data.
Input Data
+---+--------------------------+-------+--------------------------+-------------------+
|ID |NAME |COUNTRY|ADDRESS |DESCRIPTION |
+---+--------------------------+-------+--------------------------+-------------------+
|1 | |QAT | |INTERIOR DECORATING|
|2 |S&T |QAT |AL WAAB STREET |INTERIOR DECORATING|
|3 | |QAT | |INTERIOR DECORATING|
|4 |THE ROSA BERNAL COLLECTION|QAT | |INTERIOR DECORATING|
|5 | |QAT |AL SADD STREET |INTERIOR DECORATING|
|6 |AL MANA |QAT |SALWA ROAD |INTERIOR DECORATING|
|7 | |QAT |SUHAIM BIN HAMAD STREET |INTERIOR DECORATING|
|8 |INTERTEC |QAT |AL MIRQAB AL JADEED STREET|INTERIOR DECORATING|
|9 | |EGY | |HOTELS |
|10 | |EGY |QASIM STREET |HOTELS |
|11 |AIRPORT HOTEL |EGY | |HOTELS |
|12 | |EGY |AL SOUQ |HOTELS |
+---+--------------------------+-------+--------------------------+-------------------+
I am stuck trying to convert this particular PostgreSQL query into sparksql.
select country,
name as 'col_name',
description,
ct,
ct_desc,
(ct*100/ct_desc)
from
(select description,
country,
count(name) over (PARTITION by description) as ct,
count(description) over (PARTITION by description) as ct_desc
from country_table
) x
group by 1,2,3,4,5,6
Correct output from PostgreSQL -
+-------+--------+-------------------+--+-------+----------------+
|country|col_name|description |ct|ct_desc|(ct*100/ct_desc)|
+-------+--------+-------------------+--+-------+----------------+
|QAT |name |INTERIOR DECORATING|7 |14 |50.0 |
+-------+--------+-------------------+--+-------+----------------+
Here is the Spark SQL query I am using -
df_fill_by_col = spark.sql("""
    select country,
           name as 'col_name',
           description,
           ct,
           ct_desc,
           (ct*100/ct_desc)
    from
        (select description,
                country,
                count(name) over (PARTITION by description) as ct,
                count(description) over (PARTITION by description) as ct_desc
         from country_name
        ) x
    group by 1,2,3,4,5,6
""")
df_fill_by_col.show()
Output from Spark SQL -
+-------+--------+-------------------+--+-------+----------------+
|country|col_name|description |ct|ct_desc|(ct*100/ct_desc)|
+-------+--------+-------------------+--+-------+----------------+
|QAT |name |INTERIOR DECORATING|14|14 |100.0 |
+-------+--------+-------------------+--+-------+----------------+
The Spark SQL query gives odd output, especially where some values are null in the dataframe. For the same file and record, the ct column comes out double: 14 instead of the expected 7.
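For reference, here is a minimal sketch (made-up data) of the count() behaviour that may be involved: count(col) skips NULLs but does count empty strings, which can matter when csv lines are split manually, since that tends to produce '' rather than null. This is a guess, not a confirmed diagnosis.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
demo = spark.createDataFrame(
    [(1, None, "INTERIOR DECORATING"),
     (2, "", "INTERIOR DECORATING"),
     (3, "S&T", "INTERIOR DECORATING")],
    ["ID", "NAME", "DESCRIPTION"])
demo.createOrReplaceTempView("demo")

spark.sql("""
    select DESCRIPTION,
           count(NAME)        as ct,       -- skips the NULL row, but counts the '' row
           count(DESCRIPTION) as ct_desc
    from demo
    group by DESCRIPTION
""").show()
# ct = 2, ct_desc = 3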
Below is the entire code, from reading the csv file to creating the dataframe and querying the data.
from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
import csv, copy, os, sys, unicodedata, string, time, glob
from pyspark.sql.types import Row, StructField, StructType, StringType, IntegerType
if __name__ == "__main__":
    spark = SparkSession.builder.appName("PythonSQL").config("spark.some.config.option", "some-value").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("path_to_csvfiles")
    parts = lines.map(lambda l: l.split("|"))
    country_name = parts.map(lambda p: (p[0], p[1], p[2], p[3], p[4].strip()))

    schemaString = "ID NAME COUNTRY ADDRESS DESCRIPTION"
    fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
    df_schema = StructType(fields)

    df_schema1 = spark.createDataFrame(country_name, df_schema)
    df_schema1.createOrReplaceTempView("country_name")
    df_schema1.cache()

    df_fill_by_col = spark.sql("select country, name as 'col_name', description, ct, ct_desc, (ct*100/ct_desc) from ( Select description, country, count(name) over (PARTITION by description) as ct, count(description) over (PARTITION by description) as ct_desc from country_name )x group by 1,2,3,4,5,6 ")
    df_fill_by_col.show()
Please let me know if there is a way of getting the Spark SQL query to work.
Thanks,
Pankaj
Edit - This code will run on multiple countries and columns

Related

How to create a combined data frame from each column?

I am trying to concatenate the values of the same columns from two data frames into a single data frame.
For example:
df1=
name | department| state | id|hash
-----+-----------+-------+---+---
James|Sales |NY |101| c123
Maria|Finance |CA |102| d234
Jen |Marketing |NY |103| df34
df2=
name | department| state | id|hash
-----+-----------+-------+---+----
James| Sales1 |null |101|4df2
Maria| Finance | |102|5rfg
Jen | |NY2 |103|234
Since both have the same column names, I renamed the columns of df1:
new_col = [c + '_r' for c in df1.columns]
df1 = df1.toDF(*new_col)
joined_df = df1.join(df2, df1.id_r == df2.id, "inner")
+------+------------+-------+----+------+-----+----------+-----+---+----+
|name_r|department_r|state_r|id_r|hash_r|name |department|state|id |hash|
+------+------------+-------+----+------+-----+----------+-----+---+----+
|James |Sales       |NY     |101 |c123  |James|Sales1    |null |101|4df2|
|Maria |Finance     |CA     |102 |d234  |Maria|Finance   |     |102|5rfg|
|Jen   |Marketing   |NY     |103 |df34  |Jen  |          |NY2  |103|2f34|
+------+------------+-------+----+------+-----+----------+-----+---+----+
So now I am trying to concatenate the values of the same columns and create a single data frame:
from pyspark.sql.functions import concat, col, lit
from pyspark.sql.types import StructType

combined_df = spark.createDataFrame([], StructType([]))
for col1 in df1.columns:
    for col2 in df2.columns:
        if col1[:-2] == col2:
            joindf = joined_df.select(concat(lit('['), col(col1), lit(","), col(col2), lit(']')).alias("arraycol" + col2))
            col_to_select = "arraycol" + col2
            filtered_df = joindf.select(col_to_select)
            renamed_df = filtered_df.withColumnRenamed(col_to_select, col2)
            renamed_df.show()
            if combined_df.count() < 0:
                combined_df = renamed_df
            else:
                combined_df = combined_df.rdd.zip(renamed_df.rdd).map(lambda x: x[0] + x[1])
new_combined_df = spark.createDataFrame(combined_df, df2.schema)
new_combined_df.show()
But it returns an error saying:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. Can only zip RDDs with same number of elements in each partition
I can see in the loop that renamed_df.show() produces the expected column with values, e.g.:
renamed_df.show()
+-----------------+
|name             |
+-----------------+
|['James','James']|
|['Maria','Maria']|
|['Jen','Jen']    |
+-----------------+
But I am expecting to create a combined df as seen below:
+-----------------+---------------------+------------+-------------+---------------+
|name             |department           |state       |id           |hash           |
+-----------------+---------------------+------------+-------------+---------------+
|['James','James']|['Sales','Sales']    |['NY',null] |['101','101']|['c123','4df2']|
|['Maria','Maria']|['Finance','Finance']|['CA','']   |['102','102']|['d234','5rfg']|
|['Jen','Jen']    |['Marketing','']     |['NY','NY2']|['102','103']|['df34','2f34']|
+-----------------+---------------------+------------+-------------+---------------+
Any solution to this?
You actually want to use collect_list to do this. Gather all the data in one data frame, then group it so we can use collect_list:
from pyspark.sql import functions as f
from pyspark.sql.functions import col

union_all = df1.unionByName(df2, allowMissingColumns=True)

myArray = []
for myCol in union_all.columns:
    myArray += [f.collect_list(myCol)]

(union_all
    .withColumn("temp_name", col("id"))   # extra copy of id, to use for grouping
    .groupBy("temp_name")
    .agg(*myArray)
    .drop("temp_name"))                   # cleanup of the extra column used for grouping
If you only want unique values you can use collect_set instead.
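For completeness, here is a minimal sketch of how this answer could be run end to end, rebuilding df1 and df2 from the question. The column list, the creation of the frames, and grouping directly on id (instead of a temp copy of it) are my assumptions; also note allowMissingColumns needs Spark 3.1+, and collect_list drops nulls, so ('NY', null) comes back as ['NY'].
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

cols = ["name", "department", "state", "id", "hash"]
df1 = spark.createDataFrame(
    [("James", "Sales", "NY", "101", "c123"),
     ("Maria", "Finance", "CA", "102", "d234"),
     ("Jen", "Marketing", "NY", "103", "df34")], cols)
df2 = spark.createDataFrame(
    [("James", "Sales1", None, "101", "4df2"),
     ("Maria", "Finance", "", "102", "5rfg"),
     ("Jen", "", "NY2", "103", "2f34")], cols)

# Stack both frames, then collect each remaining column into a list per id.
union_all = df1.unionByName(df2, allowMissingColumns=True)
agg_cols = [f.collect_list(c).alias(c) for c in cols if c != "id"]
union_all.groupBy("id").agg(*agg_cols).show(truncate=False)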

Check value from Spark DataFrame column and do transformations

I have a dataframe consisting of person, transaction_id & is_successful. The dataframe contains duplicate values for person with different transaction_ids, and is_successful is True/False for each transaction.
I would like to derive a new dataframe with one record per person, containing that person's latest transaction_id, and True if any of their transactions were successful.
val input_df = sc.parallelize(Seq((1,1, "True"), (1,2, "False"), (2,1, "False"), (2,2, "False"), (2,3, "True"), (3,1, "False"), (3,2, "False"), (3,3, "False"))).toDF("person","transaction_id", "is_successful")
input_df: org.apache.spark.sql.DataFrame = [person: int, transaction_id: int ... 1 more field]
input_df.show(false)
+------+--------------+-------------+
|person|transaction_id|is_successful|
+------+--------------+-------------+
|1 |1 |True |
|1 |2 |False |
|2 |1 |False |
|2 |2 |False |
|2 |3 |True |
|3 |1 |False |
|3 |2 |False |
|3 |3 |False |
+------+--------------+-------------+
Expected Df:
+------+--------------+-------------+
|person|transaction_id|is_successful|
+------+--------------+-------------+
|1 |2 |True |
|2 |3 |True |
|3 |3 |False |
+------+--------------+-------------+
How can we derive the dataframe like above?
What you can do is the below in Spark SQL:
select person,max(transaction_id) as transaction_id,max(is_successful) as is_successful from <table_name> group by person
Leave the complex work to the max operator. Under max, True comes out over False (simple string ordering), so if one of your persons has three False values and one True, the max of those is True.
You may achieve this by grouping your dataframe on person and finding the max transaction_id and max is_successful.
I've included an example below of how this may be achieved using spark sql.
First, I created a temporary view of your dataframe in order to access it using spark sql, then ran the following sql statement.
input_df.createOrReplaceTempView("input_df");
val result_df = sparkSession.sql("<insert sql below here>");
The sql statement groups the data for each person before using max to determine the last transaction id and a combination of max (sum could be used with the same logic also) and case expressions to derive the is_successful value. The case expression is nested as I've converted True to a numeric value of 1 and False to 0 to leverage a numeric comparison. This is within an outer case expression which checks if the max value is > 0 (i.e. any value was successful) before printing True/False.
SELECT
person,
MAX(transaction_id) as transaction_id,
CASE
WHEN MAX(
CASE
WHEN is_successful = 'True' THEN 1
ELSE 0
END
) > 0 THEN 'True'
ELSE 'False'
END as is_successful
FROM
input_df
GROUP BY
person
Here is @ggordon's SQL answer in a DataFrame (Scala) version:
import org.apache.spark.sql.functions._
import spark.implicits._   // needed outside spark-shell for the 'is_successful symbol syntax

input_df.groupBy("person")
  .agg(max("transaction_id").as("transaction_id"),
       when(max(when('is_successful === "True", 1).otherwise(0)) > 0, "True")
         .otherwise("False").as("is_successful"))

pyspark : How to explode a column of string type into rows and columns of a spark data frame

I am working with Spark 2.3. I have a Spark data frame which is of the following format:
| person_id | person_attributes
____________________________________________________________________________
| id_1 "department=Sales__title=Sales_executive__level=junior"
| id_2 "department=Engineering__title=Software Engineer__level=entry-level"
and so on.
The person_attributes column is of type string.
How can I explode this frame to get a data frame like the one below, without the level attribute_key?
| person_id | attribute_key| attribute_value
____________________________________________________________________________
| id_1 department Sales
| id_1 title Sales_executive
| id_2 department Engineering
| id_2 title Software Engineer
This is a big distributed data frame, so converting to pandas or caching is not an option.
Try this,
import org.apache.spark.sql.functions._
df
.withColumn("attributes_splitted", split(col("person_attributes"), "__")) // Split by delimiter `__`
.withColumn("exploded", explode(col("attributes_splitted"))) // explode the splitted column
.withColumn("temp", split(col("exploded"), "=")) // again split based on delimiter `=`
.withColumn("attribute_key", col("temp").getItem(0))
.withColumn("attribute_value", col("temp").getItem(1))
.drop("attributes_splitted", "exploded", "temp", "person_attributes")
.show(false)
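Since the question is tagged pyspark, here is a rough translation of the Scala snippet above into PySpark (assuming a PySpark DataFrame df with the same columns as the question; the extra filter on level is my addition, since the expected output excludes that key):
from pyspark.sql import functions as F

(df
 .withColumn("attributes_splitted", F.split(F.col("person_attributes"), "__"))  # split by delimiter `__`
 .withColumn("exploded", F.explode(F.col("attributes_splitted")))               # one row per attribute
 .withColumn("temp", F.split(F.col("exploded"), "="))                           # split key=value on `=`
 .withColumn("attribute_key", F.col("temp").getItem(0))
 .withColumn("attribute_value", F.col("temp").getItem(1))
 .filter(F.col("attribute_key") != "level")                                     # drop the level attribute
 .drop("attributes_splitted", "exploded", "temp", "person_attributes")
 .show(truncate=False))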
Try this for Spark2.3:
from pyspark.sql import functions as F
df.withColumn("arr", F.split("person_attributes",'\=|__'))\
.withColumn("map", F.create_map(F.lit('department'),F.col("arr")[1]\
,F.lit('title'),F.col("arr")[3]))\
.select("person_id", F.explode("map").alias("attribute_key","attribute_value"))\
.show(truncate=False)
#+---------+-------------+-----------------+
#|person_id|attribute_key|attribute_value |
#+---------+-------------+-----------------+
#|id_1 |department |Sales |
#|id_1 |title |Sales_executive |
#|id_2 |department |Engineering |
#|id_2 |title |Software Engineer|
#+---------+-------------+-----------------+
Try this for Spark2.4+
from pyspark.sql import functions as F
df.withColumn("arr", F.split("person_attributes",'\=|__'))\
.withColumn("map", F.map_from_arrays(F.expr("""filter(arr,(x,i)->i%2=0)""")\
,F.expr("""filter(arr,(x,i)->i%2!=0)""")))\
.select("person_id", F.explode("map").alias("attribute_key","attribute_value")).filter("""attribute_key!='level'""")\
.show(truncate=False)
#+---------+-------------+-----------------+
#|person_id|attribute_key|attribute_value |
#+---------+-------------+-----------------+
#|id_1 |department |Sales |
#|id_1 |title |Sales_executive |
#|id_2 |department |Engineering |
#|id_2 |title |Software Engineer|
#+---------+-------------+-----------------+

PySpark - Using lists inside LIKE operator

I would like to use a list inside the LIKE operator in PySpark in order to create a column.
I have the following input df :
input_df :
+------+--------------------+-------+
| ID| customers|country|
+------+--------------------+-------+
|161 |xyz Limited |U.K. |
|262 |ABC Limited |U.K. |
|165 |Sons & Sons |U.K. |
|361 |TÜV GmbH |Germany|
|462 |Mueller GmbH |Germany|
|369 |Schneider AG |Germany|
|467 |Sahm UG |Austria|
+------+--------------------+-------+
I would like to add a column CAT_ID. CAT_ID takes value 1 if "ID" contains "16" or "26". CAT_ID takes value 2 if "ID" contains "36" or "46".
So, I want my output df to look like this -
The desired output_df :
+------+--------------------+-------+-------+
| ID| customers|country|Cat_ID |
+------+--------------------+-------+-------+
|161 |xyz Limited |U.K. |1 |
|262 |ABC Limited |U.K. |1 |
|165 |Sons & Sons |U.K. |1 |
|361 |TÜV GmbH |Germany|2 |
|462 |Mueller GmbH |Germany|2 |
|369 |Schneider AG |Germany|2 |
|467 |Sahm UG |Austria|2 |
+------+--------------------+-------+-------+
I am interested in learning how this can be done using the LIKE statement and lists.
I know how to implement it without lists, which works perfectly:
from pyspark.sql import functions as F
def add_CAT_ID(df):
return df.withColumn(
'CAT_ID',
F.when( ( (F.col('ID').like('16%')) | (F.col('ID').like('26%')) ) , "1") \
.when( ( (F.col('ID').like('36%')) | (F.col('ID').like('46%')) ) , "2") \
.otherwise('999')
)
output_df = add_CAT_ID(input_df)
However, I would love to use lists and have something like:
list1 =['16', '26']
list2 =['36', '46']
def add_CAT_ID(df):
return df.withColumn(
'CAT_ID',
F.when( ( (F.col('ID').like(list1 %)) ) , "1") \
.when( ( (F.col('ID').like('list2 %')) ) , "2") \
.otherwise('999')
)
output_df = add_CAT_ID(input_df)
Thanks a lot in advance,
SQL wildcards do not support "or" clauses. There are several ways you can handle it though.
1. Regular expressions
You can use rlike with a regular expression:
import pyspark.sql.functions as psf
list1 =['16', '26']
list2 =['36', '46']
df.withColumn(
'CAT_ID',
psf.when(psf.col('ID').rlike('({})\d'.format('|'.join(list1))), '1') \
.when(psf.col('ID').rlike('({})\d'.format('|'.join(list2))), '2') \
.otherwise('999')) \
.show()
+---+------------+-------+------+
| ID| customers|country|CAT_ID|
+---+------------+-------+------+
|161| xyz Limited| U.K.| 1|
|262|ABC Limited| U.K.| 1|
|165| Sons & Sons| U.K.| 1|
|361| TÜV GmbH|Germany| 2|
|462|Mueller GmbH|Germany| 2|
|369|Schneider AG|Germany| 2|
|467| Sahm UG|Austria| 2|
+---+------------+-------+------+
Here, for list1 we get the regular expression (16|26)\d, matching 16 or 26 followed by a digit (\d is equivalent to [0-9]).
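To make the pattern construction explicit, the format/join pair expands like this:
list1 = ['16', '26']
pattern = r'({})\d'.format('|'.join(list1))
print(pattern)  # (16|26)\d  -> rlike matches anywhere in ID: 16 or 26 followed by a digit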
2. Dynamically build an SQL clause
If you want to keep the sql like, you can use selectExpr and chain the values with ' OR ':
df.selectExpr(
'*',
"CASE WHEN ({}) THEN '1' WHEN ({}) THEN '2' ELSE '999' END AS CAT_ID"
.format(*[' OR '.join(["ID LIKE '{}%'".format(x) for x in l]) for l in [list1, list2]]))
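For readability, this is the SQL expression that the format call generates for list1 and list2:
clauses = [' OR '.join("ID LIKE '{}%'".format(x) for x in l) for l in [list1, list2]]
expr = ("CASE WHEN ({}) THEN '1' WHEN ({}) THEN '2' ELSE '999' END AS CAT_ID"
        .format(*clauses))
print(expr)
# CASE WHEN (ID LIKE '16%' OR ID LIKE '26%') THEN '1'
#      WHEN (ID LIKE '36%' OR ID LIKE '46%') THEN '2' ELSE '999' END AS CAT_ID   (printed on one line)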
3. Dynamically build a Python expression
You can also use eval if you don't want to write SQL:
df.withColumn(
'CAT_ID',
psf.when(eval(" | ".join(["psf.col('ID').like('{}%')".format(x) for x in list1])), '1')
.when(eval(" | ".join(["psf.col('ID').like('{}%')".format(x) for x in list2])), '2')
.otherwise('999'))
With Spark 2.4 onwards, you can use higher order functions in the spark-sql.
Try the one below; the SQL solution is the same for both Scala and Python.
val df = Seq(
("161","xyz Limited","U.K."),
("262","ABC Limited","U.K."),
("165","Sons & Sons","U.K."),
("361","TÜV GmbH","Germany"),
("462","Mueller GmbH","Germany"),
("369","Schneider AG","Germany"),
("467","Sahm UG","Germany")
).toDF("ID","customers","country")
df.show(false)
df.createOrReplaceTempView("secil")
spark.sql(
""" with t1 ( select id, customers, country, array('16','26') as a1, array('36','46') as a2 from secil),
t2 (select id, customers, country, filter(a1, x -> id like x||'%') a1f, filter(a2, x -> id like x||'%') a2f from t1),
t3 (select id, customers, country, a1f, a2f,
case when size(a1f) > 0 then 1 else 0 end a1r,
case when size(a2f) > 0 then 2 else 0 end a2r
from t2)
select id, customers, country, a1f, a2f, a1r, a2r, a1r+a2r as Cat_ID from t3
""").show(false)
Results:
+---+------------+-------+
|ID |customers |country|
+---+------------+-------+
|161|xyz Limited |U.K. |
|262|ABC Limited|U.K. |
|165|Sons & Sons |U.K. |
|361|TÜV GmbH |Germany|
|462|Mueller GmbH|Germany|
|369|Schneider AG|Germany|
|467|Sahm UG |Germany|
+---+------------+-------+
+---+------------+-------+----+----+---+---+------+
|id |customers |country|a1f |a2f |a1r|a2r|Cat_ID|
+---+------------+-------+----+----+---+---+------+
|161|xyz Limited |U.K. |[16]|[] |1 |0 |1 |
|262|ABC Limited|U.K. |[26]|[] |1 |0 |1 |
|165|Sons & Sons |U.K. |[16]|[] |1 |0 |1 |
|361|TÜV GmbH |Germany|[] |[36]|0 |2 |2 |
|462|Mueller GmbH|Germany|[] |[46]|0 |2 |2 |
|369|Schneider AG|Germany|[] |[36]|0 |2 |2 |
|467|Sahm UG |Germany|[] |[46]|0 |2 |2 |
+---+------------+-------+----+----+---+---+------+
Create the List, then Dataframe
If the list is structured a little differently, we can do a simple join using the like function and an expression after turning the list into a pyspark Dataframe. This demonstrates that while 'or' is not supported in the like clause, we can get around that by restructuring the inputs and using a join.
list1 =[('16', '1'),
('26', '1'),
('36', '2'),
('46', '2')]
listCols = ['listDat','Cat_ID']
df = spark.createDataFrame(data=list1, schema=listCols)
Create input_df
data = [(161, 'xyz Limited','U.K.'),
(262,'ABC Limited','U.K.'),
(165,'Sons & Sons','U.K.'),
(361,'TÜV GmbH','Germany'),
(462,'Mueller GmbH','Germany'),
(369,'Schneider AG','Germany'),
(467,'Sahm UG','Austria')]
dataCols = ['ID','Customers','Country']
input_df = spark.createDataFrame(data=data, schema = dataCols)
Given the Dataframe, the join can now take place with wildcards. Note that while in T-SQL (e.g. in SSMS) we get to use the plus sign to concatenate, with pyspark we need to use concat().
Before displaying, the extra column is dropped since it isn't needed in the result set.
from pyspark.sql import functions as F

output_df = (input_df.alias('i')
             .join(df.alias('d'),
                   F.expr("i.ID like concat('%', d.listDat, '%')"),
                   how='left')
             .drop('listDat'))
output_df.show()
Display output_df
+---+------------+-------+------+
| ID| Customers|Country|Cat_ID|
+---+------------+-------+------+
|161| xyz Limited| U.K.| 1|
|262| ABC Limited| U.K.| 1|
|165| Sons & Sons| U.K.| 1|
|361| TÜV GmbH|Germany| 2|
|462|Mueller GmbH|Germany| 2|
|369|Schneider AG|Germany| 2|
|467| Sahm UG|Austria| 2|
+---+------------+-------+------+

Spark structured streaming drop duplicates keep last

I would like to maintain a streaming dataframe that gets "updates".
To do so I will use dropDuplicates.
But dropDuplicates drops the latest change.
How can I retain only the last one?
Assuming you need to select the last record for each id by removing the other duplicates, you can use window functions and filter on row_number = count. Check this out:
scala> val df = Seq((120,34.56,"2018-10-11"),(120,65.73,"2018-10-14"),(120,39.96,"2018-10-20"),(122,11.56,"2018-11-20"),(122,24.56,"2018-10-20")).toDF("id","amt","dt")
df: org.apache.spark.sql.DataFrame = [id: int, amt: double ... 1 more field]
scala> val df2=df.withColumn("dt",'dt.cast("date"))
df2: org.apache.spark.sql.DataFrame = [id: int, amt: double ... 1 more field]
scala> df2.show(false)
+---+-----+----------+
|id |amt |dt |
+---+-----+----------+
|120|34.56|2018-10-11|
|120|65.73|2018-10-14|
|120|39.96|2018-10-20|
|122|11.56|2018-11-20|
|122|24.56|2018-10-20|
+---+-----+----------+
scala> df2.createOrReplaceTempView("ido")
scala> spark.sql(""" select id,amt,dt,row_number() over(partition by id order by dt) rw, count(*) over(partition by id) cw from ido """).show(false)
+---+-----+----------+---+---+
|id |amt |dt |rw |cw |
+---+-----+----------+---+---+
|122|24.56|2018-10-20|1 |2 |
|122|11.56|2018-11-20|2 |2 |
|120|34.56|2018-10-11|1 |3 |
|120|65.73|2018-10-14|2 |3 |
|120|39.96|2018-10-20|3 |3 |
+---+-----+----------+---+---+
scala> spark.sql(""" select id,amt,dt from (select id,amt,dt,row_number() over(partition by id order by dt) rw, count(*) over(partition by id) cw from ido) where rw=cw """).show(false)
+---+-----+----------+
|id |amt |dt |
+---+-----+----------+
|122|11.56|2018-11-20|
|120|39.96|2018-10-20|
+---+-----+----------+
scala>
If you want to sort on dt descending you can just use "order by dt desc" in the over() clause. Does this help?
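If a DataFrame-API version is easier to adapt, here is a PySpark sketch of the same keep-last-per-id idea, ordering by dt descending and keeping the first row per id (assuming an equivalent PySpark DataFrame named df2; note that, like the SQL above, non-time-based window functions are not supported directly on streaming DataFrames):
from pyspark.sql import functions as F, Window

w = Window.partitionBy("id").orderBy(F.col("dt").desc())
latest = (df2.withColumn("rw", F.row_number().over(w))  # rank rows per id, newest first
             .filter("rw = 1")                          # keep only the latest row per id
             .drop("rw"))
latest.show(truncate=False)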
