Only keep rows with specific condition in PySpark - apache-spark

I'm processing my logs using PySpark. I have two dataframes: a logs DF that stores search queries, and a clicks DF that stores clicked document IDs.
Here is their structure:
+-------------------+-------+----------------------+
|timestamp |user |query |
+-------------------+-------+----------------------+
|2021-12-01 06:14:38|m96cles|minoration |
|2021-12-01 06:32:54|m96ngro|associés |
|2021-12-01 06:40:40|m96mbeg|cessation |
|2021-12-01 07:02:42|m96ngro|membres de société |
|2021-12-01 07:02:58|m96ngro|cumul |
|2021-12-01 07:07:30|m96rara|cessation |
|2021-12-01 07:09:37|m64nesc|INVF |
|2021-12-01 07:16:14|m83ccat|report didentifiation |
+-------------------+-------+----------------------+
+-------------------+-------+------+
|timestamp |user |doc_id|
+-------------------+-------+------+
|2021-12-01 06:14:42|m96cles|783 |
|2021-12-01 06:33:38|m96ngro|6057 |
|2021-12-01 06:40:52|m96mbeg|1407 |
|2021-12-01 06:49:12|m96mbeg|1414 |
|2021-12-01 06:53:19|m51cstr|15131 |
|2021-12-01 06:53:35|m51cstr|14992 |
|2021-12-01 06:53:55|m51cstr|15093 |
|2021-12-01 06:54:20|m51cstr|15110 |
+-------------------+-------+------+
I merged both dataframes by doing df = logs.unionByName(clicks, allowMissingColumns=True), and sorted it by timestamp.
+-------------------+--------+--------------------+------+
| timestamp| user| query|doc_id|
+-------------------+--------+--------------------+------+
|2022-05-31 20:23:40|ozenfada| null| 7931|
|2022-05-31 21:06:44| m97emou| apnée du sommeil| null|
|2022-05-31 21:28:24| m64lbeh| null| 192|
|2022-05-31 21:29:04| m97emou| null| 3492|
+-------------------+--------+--------------------+------+
The idea is to keep only rows with search queries that lead to clicks; I don't want to keep log rows whose queries lead to no document clicks. To achieve this, I'm trying to look at the next rows for the same user and see whether they clicked on at least one document within 5 minutes. In the end, I only want to keep rows that have a search query value.
Here's what I've done so far; I tried to create a boolean column:
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

df = df.withColumn(
    'valid',
    df.user == F.lead('user').over(
        W.partitionBy(clicks.user, F.window('timestamp', '5 minutes')).orderBy('timestamp')
    )
)
Here's the desired output. Notice how only rows with search queries that lead to clicks (a search query row (query != null) followed by click row(s) (doc_id != null) from the same user) have the true flag. Also, the row with query "rech" led to a query correction "recherche" and therefore shouldn't be flagged as true.
+-------------------+--------+----------------------------------------+------+-----+
|timestamp |user |query |doc_id|valid|
+-------------------+--------+----------------------------------------+------+-----+
|2022-05-31 18:56:47|m97bcar |exemple |null |false|
|2022-05-31 19:22:40|ozenfada|fort |null |true |
|2022-05-31 19:23:40|ozenfada|null |7931 |false|
|2022-05-31 19:24:09|ozenfada|null |1034 |false|
|2022-05-31 21:06:44|m97emou |apnée du sommeil |null |true |
|2022-05-31 21:07:24|m64lbeh |rech |192 |false|
|2022-05-31 21:07:40|m64lbeh |recherche |null |true |
|2022-05-31 21:08:21|m64lbeh |null |3002 |false|
|2022-05-31 21:11:04|m97emou |null |3492 |false|
+-------------------+--------+----------------------------------------+------+-----+
Any help would be greatly appreciated.

The following will get you all the queries that resulted in a click within 5 minutes. A join on user works, with the extra condition that the timestamp difference between the query and the click is <= 5 minutes. The result printed below is for the sample data provided.
from pyspark.sql import functions as F

# rename columns to avoid ambiguity
logs = logs.withColumnRenamed('timestamp', 'query_timestamp')
clicks = clicks.withColumnRenamed('timestamp', 'click_timestamp')
clicks = clicks.withColumnRenamed('user', 'click_user')

# join on the same username, and only if the click is within 5 minutes of the query
time_diff_in_seconds = F.unix_timestamp(clicks['click_timestamp']) - F.unix_timestamp(logs['query_timestamp'])
join_cond = (logs['user'] == clicks['click_user']) & \
            (time_diff_in_seconds >= 0) & \
            (time_diff_in_seconds <= 5 * 60)
df2 = logs.join(clicks, join_cond, how='left')

# drop all queries that didn't lead to a click
df2 = df2.filter(df2['doc_id'].isNotNull())

# select only the necessary columns
df2 = df2.select('query_timestamp', 'user', 'query').distinct()
df2.show()
+-------------------+-------+----------+
| query_timestamp| user| query|
+-------------------+-------+----------+
|2021-12-01 06:14:38|m96cles|minoration|
|2021-12-01 06:40:40|m96mbeg| cessation|
|2021-12-01 06:32:54|m96ngro| associés|
+-------------------+-------+----------+
Update - To handle misspelled queries
Introduce a column that shows the time difference in seconds between a query and a click. After the join, both rows will be retained, but the time difference will be larger for the misspelled query. So do an orderBy() on the time difference and drop the second row. This can be done with dropDuplicates(['click_timestamp', 'user', 'doc_id']).
Let's say there was a search for minor in the 2nd row:
+-------------------+-------+--------------------+
| timestamp| user| query|
+-------------------+-------+--------------------+
|2021-12-01 06:14:38|m96cles| minoration|
|2021-12-01 06:14:39|m96cles| minor|
... and rest of the rows
logs = logs.withColumnRenamed('timestamp', 'query_timestamp')
clicks = clicks.withColumnRenamed('timestamp', 'click_timestamp')
clicks = clicks.withColumnRenamed('user', 'click_user')

time_diff_in_seconds = F.unix_timestamp(clicks['click_timestamp']) - F.unix_timestamp(logs['query_timestamp'])
join_cond = (logs['user'] == clicks['click_user']) & \
            (time_diff_in_seconds >= 0) & \
            (time_diff_in_seconds <= 5 * 60)
df2 = logs.join(clicks, join_cond, how='left')
df2 = df2.withColumn('time_diff_in_seconds', time_diff_in_seconds)

# if a query leads to multiple clicks, drop the duplicates caused by the left join
df2 = df2.orderBy('time_diff_in_seconds').dropDuplicates(['query_timestamp', 'user', 'query'])
# keep only the latest query that led to a click
df2 = df2.orderBy('time_diff_in_seconds').dropDuplicates(['click_timestamp', 'user', 'doc_id'])

df2 = df2.filter(df2['doc_id'].isNotNull())
df2 = df2.select('query_timestamp', 'user', 'query')
df2.show()
+-------------------+-------+---------+
| query_timestamp| user| query|
+-------------------+-------+---------+
|2021-12-01 06:14:39|m96cles| minor|
|2021-12-01 06:32:54|m96ngro| associés|
|2021-12-01 06:40:40|m96mbeg|cessation|
+-------------------+-------+---------+
You may need to test for more complex scenarios, and perhaps modify the code a bit. I think this logic would work, based on the sample data.
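If you would rather keep the merged df from the question and produce the boolean valid column shown in the desired output, here is a rough window-based sketch (my own adaptation, only checked against the sample data; it assumes df is the unioned frame and that click rows have a null query):
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

w = W.partitionBy('user').orderBy('timestamp')

# carry the timestamp of the most recent query row forward onto the following rows of the same user
df = df.withColumn(
    'last_query_ts',
    F.last(F.when(F.col('query').isNotNull(), F.col('timestamp')), ignorenulls=True).over(w)
)

# a click row validates the query it follows if it happens within 5 minutes
click_is_valid = (
    F.col('query').isNull() &
    F.col('doc_id').isNotNull() &
    ((F.unix_timestamp('timestamp') - F.unix_timestamp('last_query_ts')) <= 5 * 60)
)

# (user, timestamp) pairs of queries that received at least one click in time
validated = (
    df.filter(click_is_valid)
      .select('user', F.col('last_query_ts').alias('timestamp'))
      .distinct()
      .withColumn('valid', F.lit(True))
)

# flag every row; query rows that led to a click get true, everything else false
result = (
    df.join(validated, on=['user', 'timestamp'], how='left')
      .withColumn('valid', F.coalesce(F.col('valid'), F.lit(False)))
      .drop('last_query_ts')
)
From there, result.filter('query is not null') keeps only the search-query rows, as described in the question.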

Related

Unable to create a new column from a list using spark concat method?

I have the below data frame, in which I am trying to create a new column by concatenating the columns named in a list.
df=
+-----+----------+-----+---+----+
|name |department|state|id |hash|
+-----+----------+-----+---+----+
|James|Sales1    |null |101|4df2|
|Maria|Finance   |     |102|5rfg|
|Jen  |          |NY2  |103|234 |
+-----+----------+-----+---+----+
key_list=['name','state','id']
df.withColumn('prim_key', concat(*key_list))
df.show()
but the above returns the same result:
+-----+----------+-----+---+----+
|name |department|state|id |hash|
+-----+----------+-----+---+----+
|James|Sales1    |null |101|4df2|
|Maria|Finance   |     |102|5rfg|
|Jen  |          |NY2  |103|234 |
+-----+----------+-----+---+----+
I suspected it might be due to spaces in the DF column names, so I used trim to remove all spaces from the column names, but no luck; it returned the same result.
Any solution to this?
I found it... the issue was that the result has to be assigned back to a new or existing df:
df = df.withColumn('prim_key', concat(*key_list))
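A minimal runnable sketch of the fix, assuming the name, state and id columns shown above (note that concat returns null for a row as soon as any of its inputs is null, which matters for the null state values in this data):
from pyspark.sql.functions import concat, concat_ws

key_list = ['name', 'state', 'id']
# withColumn returns a new DataFrame, so the result must be assigned back
df = df.withColumn('prim_key', concat(*key_list))
# if null inputs should be skipped rather than nulling the whole key, concat_ws can be used instead:
# df = df.withColumn('prim_key', concat_ws('', *key_list))
df.select('name', 'state', 'id', 'prim_key').show()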

How to create a combined data frame from each columns?

I am trying to concatenate the values of the same columns from two data frames into a single data frame.
For example:
df1=
name | department| state | id|hash
-----+-----------+-------+---+---
James|Sales |NY |101| c123
Maria|Finance |CA |102| d234
Jen |Marketing |NY |103| df34
df2=
name | department| state | id|hash
-----+-----------+-------+---+----
James| Sales1 |null |101|4df2
Maria| Finance | |102|5rfg
Jen | |NY2 |103|234
Since both have the same column names, I renamed the columns of df1:
new_col = [c + '_r' for c in df1.columns]
df1 = df1.toDF(*new_col)
joined_df = df1.join(df2, df1.id_r == df2.id, "inner")
+------+------------+-------+----+------+-----+----------+-----+---+----+
|name_r|department_r|state_r|id_r|hash_r|name |department|state|id |hash|
+------+------------+-------+----+------+-----+----------+-----+---+----+
|James |Sales       |NY     |101 |c123  |James|Sales1    |null |101|4df2|
|Maria |Finance     |CA     |102 |d234  |Maria|Finance   |     |102|5rfg|
|Jen   |Marketing   |NY     |103 |df34  |Jen  |          |NY2  |103|2f34|
+------+------------+-------+----+------+-----+----------+-----+---+----+
So now I am trying to concatenate the values of the same columns and create a single data frame:
combined_df = spark.createDataFrame([], StructType([]))
for col1 in df1.columns:
    for col2 in df2.columns:
        if col1[:-2] == col2:
            joindf = joined_df.select(concat(lit('['), col(col1), lit(","), col(col2), lit(']')).alias("arraycol" + col2))
            col_to_select = "arraycol" + col2
            filtered_df = joindf.select(col_to_select)
            renamed_df = filtered_df.withColumnRenamed(col_to_select, col2)
            renamed_df.show()
            if combined_df.count() < 0:
                combined_df = renamed_df
            else:
                combined_df = combined_df.rdd.zip(renamed_df.rdd).map(lambda x: x[0] + x[1])
new_combined_df = spark.createDataFrame(combined_df, df2.schema)
new_combined_df.show()
but it returns an error:
an error occurred while calling z:org.apache.spark.api.python.PythonRdd.runJob. can only zip RDD with same number of elements in each partition
In the loop, I can see that renamed_df.show() produces the expected column with values, e.g.:
renamed_df.show()
+-----------------+
|name             |
+-----------------+
|['James','James']|
|['Maria','Maria']|
|['Jen','Jen']    |
+-----------------+
but I am expecting to create a combined df as seen below:
+-----------------+---------------------+------------+-------------+---------------+
|name             |department           |state       |id           |hash           |
+-----------------+---------------------+------------+-------------+---------------+
|['James','James']|['Sales','Sales']    |['NY',null] |['101','101']|['c123','4df2']|
|['Maria','Maria']|['Finance','Finance']|['CA','']   |['102','102']|['d234','5rfg']|
|['Jen','Jen']    |['Marketing','']     |['NY','NY2']|['102','103']|['df34','2f34']|
+-----------------+---------------------+------------+-------------+---------------+
Any solution to this?
You actually want to use collect_list to do this. Gather all the data into one data frame, then group it so that we can use collect_list:
from pyspark.sql import functions as f
from pyspark.sql.functions import col

union_all = df1.unionByName(df2, allowMissingColumns=True)
myArray = []
for myCol in union_all.columns:
    myArray += [f.collect_list(myCol)]

(union_all
    .withColumn("temp_name", col("id"))  # extra column to use for grouping
    .groupBy("temp_name")
    .agg(*myArray)
    .drop("temp_name"))  # clean up the extra column used for grouping
If you only want unique values you can use collect_set instead.
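For instance, the aggregation list could be built in one line with collect_set, aliasing each aggregate back to its original column name (the aliases are my addition):
myArray = [f.collect_set(myCol).alias(myCol) for myCol in union_all.columns]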

How to identify if a particular string/pattern exist in a column using pySpark

Below is my sample dataframe for household things.
Here W represents Wooden, G represents Glass, and P represents Plastic, and different items are classified into those categories.
I want to identify which items fall into the W, G, and P categories. As an initial step, I tried classifying it for Chair:
M = sqlContext.createDataFrame([('W-Chair-Shelf;G-Vase;P-Cup',''),
                                ('W-Chair',''),
                                ('W-Shelf;G-Cup;P-Chair',''),
                                ('G-Cup;P-ShowerCap;W-Board','')],
                               ['Household_chores_arrangements','Chair'])
M.createOrReplaceTempView('M')
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
| W-Chair-Shelf;G-Vase;P-Cup| |
| W-Chair| |
| W-Shelf;G-Cup;P-Chair| |
| G-Cup;P-ShowerCap;W-Board| |
+-----------------------------+-----+
I tried to do it for one condition where I can mark it as W, but I am not getting the expected results; maybe my condition is wrong.
df = sqlContext.sql("select * from M where Household_chores_arrangements like '%W%Chair%'")
display(df)
Is there a better way to do this in PySpark?
Expected output:
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
| W-Chair-Shelf;G-Vase;P-Cup| W|
| W-Chair| W|
| W-Shelf;G-Cup;P-Chair| P|
| G-Cup;P-ShowerCap;W-Board| NULL|
+-----------------------------+-----+
Thanks #mck for the solution.
Update
In addition to that, I was trying to analyse the regexp_extract option further, so I altered the sample set:
M = sqlContext.createDataFrame([('Wooden|Chair',''),
                                ('Wooden|Cup;Glass|Chair',''),
                                ('Wooden|Cup;Glass|Showercap;Plastic|Chair','')],
                               ['Household_chores_arrangements','Chair'])
M.createOrReplaceTempView('M')
df = spark.sql("""
select
Household_chores_arrangements,
nullif(regexp_extract(Household_chores_arrangements, '(Wooden|Glass|Plastic)(|Chair)', 1), '') as Chair
from M
""")
display(df)
Result:
+----------------------------------------+------+
|Household_chores_arrangements           |Chair |
+----------------------------------------+------+
|Wooden|Chair                            |Wooden|
|Wooden|Cup;Glass|Chair                  |Wooden|
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Wooden|
+----------------------------------------+------+
I changed the delimiter to | instead of - and made changes in the query as well. I was expecting the results below, but derived the wrong result shown above.
+----------------------------------------+-------+
|Household_chores_arrangements           |Chair  |
+----------------------------------------+-------+
|Wooden|Chair                            |Wooden |
|Wooden|Cup;Glass|Chair                  |Glass  |
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Plastic|
+----------------------------------------+-------+
If only the delimiter is changed, do we need to change anything else?
Update - 2
I have got the solution for the above-mentioned update.
For the pipe delimiter, we have to escape it using 4 backslashes (\\\\).
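For reference, a sketch of what that escaped pattern looks like against the pipe-delimited sample above (my own reconstruction of the fixed query, assuming Spark's default string-literal escaping):
# the four backslashes in the Python source reach the regex engine as a single escaped pipe
df = spark.sql("""
    select
        Household_chores_arrangements,
        nullif(regexp_extract(Household_chores_arrangements,
                              '(Wooden|Glass|Plastic)\\\\|Chair', 1), '') as Chair
    from M
""")
df.show(truncate=False)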
You can use regexp_extract to extract the categories, and if no match is found, replace empty string with null using nullif.
df = spark.sql("""
select
Household_chores_arrangements,
nullif(regexp_extract(Household_chores_arrangements, '([A-Z])-Chair', 1), '') as Chair
from M
""")
df.show(truncate=False)
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
|W-Chair-Shelf;G-Vase;P-Cup |W |
|W-Chair |W |
|W-Shelf;G-Cup;P-Chair |P |
|G-Cup;P-ShowerCap;W-Board |null |
+-----------------------------+-----+

Get all rows after doing GroupBy in SparkSQL

I tried to do a group by in Spark SQL, which works fine, but most of the rows seem to have gone missing.
spark.sql(
"""
| SELECT
| website_session_id,
| MIN(website_pageview_id) as min_pv_id
|
| FROM website_pageviews
| GROUP BY website_session_id
| ORDER BY website_session_id
|
|
|""".stripMargin).show(10,truncate = false)
I am getting output like this :
+------------------+---------+
|website_session_id|min_pv_id|
+------------------+---------+
|1 |1 |
|10 |15 |
|100 |168 |
|1000 |1910 |
|10000 |20022 |
|100000 |227964 |
|100001 |227966 |
|100002 |227967 |
|100003 |227970 |
|100004 |227973 |
+------------------+---------+
The same query in MySQL gives the desired result.
What is the best way to do this, so that all rows are fetched by my query?
Please note I have already checked other answers related to this, like joining to get all rows etc., but I want to know if there is any other way by which we can get the result like we get in MySQL.
It looks like it is ordered alphabetically, in which case 10 comes before 2.
You might want to check that the column's type is a number, not a string.
What datatypes do the columns have (printSchema())?
I think website_session_id is of string type. Cast it to an integer type and see what you get:
spark.sql(
"""
| SELECT
| CAST(website_session_id AS int) as website_session_id,
| MIN(website_pageview_id) as min_pv_id
|
| FROM website_pageviews
| GROUP BY website_session_id
| ORDER BY website_session_id
|
|
|""".stripMargin).show(10,truncate = false)

filter dataframe by multiple columns after exploding

My df contains product names and corresponding information. Relevant here are the product name and the countries it is sold to:
+--------------------+-------------------------+
| Product_name|collect_set(Countries_en)|
+--------------------+-------------------------+
| null| [Belgium,United K...|
| #5 pecan/almond| [Belgium]|
| #8 mango/strawberry| [Belgium]|
|& Sully A Mild Th...| [Belgium,France]|
|"70CL Liqueu...| [Belgium,France]|
|"Gingembre&q...| [Belgium]|
|"Les Schtrou...| [Belgium,France]|
|"Sho-key&quo...| [Belgium]|
|"mini Chupa ...| [Belgium,France]|
| 'S Lands beste| [Belgium]|
|'T vlierbos confi...| [Belgium]|
|(H)eat me - Spagh...| [Belgium]|
| -cheese flips| [Belgium]|
| .soupe cerfeuil| [Belgium]|
|1 1/2 Minutes Bas...| [Belgium,Luxembourg]|
| 1/2 Reblochon AOP| [Belgium]|
| 1/2 nous de jambon| [Belgium]|
|1/2 tarte cerise ...| [Belgium]|
|10 Original Knack...| [Belgium,France,S...|
| 10 pains au lait| [Belgium,France]|
+--------------------+-------------------------+
sample input data:
[Row(code=2038002038.0, Product_name='Formula 2 men multi vitaminic', Countries_en='France,Ireland,Italy,Mexico,United States,Argentina-espanol,Armenia-pyсский,Aruba-espanol,Asia-pacific,Australia-english,Austria-deutsch,Azerbaijan-русский,Belarus-pyсский,Belgium-francais,Belgium-nederlands,Bolivia-espanol,Bosnia-i-hercegovina-bosnian,Botswana-english,Brazil-portugues,Bulgaria-български,Cambodia-english,Cambodia-ភាសាខ្មែរ,Canada-english,Canada-francais,Chile-espanol,China-中文,Colombia-espanol,Costa-rica-espanol,Croatia-hrvatski,Cyprus-ελληνικά,Czech-republic-čeština,Denmark-dansk,Ecuador-espanol,El-salvador-espanol,Estonia-eesti,Europe,Finland-suomi,France-francais,Georgia-ქართული,Germany-deutsch,Ghana-english,Greece-ελληνικά,Guatemala-espanol,Honduras-espanol,Hong-kong-粵語,Hungary-magyar,Iceland-islenska,India-english,Indonesia-bahasa-indonesia,Ireland-english,Israel-עברית,Italy-italiano,Jamaica-english,Japan-日本語,Kazakhstan-pyсский,Korea-한국어,Kyrgyzstan-русский,Latvia-latviešu,Lebanon-english,Lesotho-english,Lithuania-lietuvių,Macau-中文,Malaysia-bahasa-melayu,Malaysia-english,Malaysia-中文,Mexico-espanol,Middle-east-africa,Moldova-roman,Mongolia-монгол-хэл,Namibia-english,Netherlands-nederlands,New-zealand-english,Nicaragua-espanol,North-macedonia-македонски-јазик,Norway-norsk,Panama-espanol,Paraguay-espanol,Peru-espanol,Philippines-english,Poland-polski,Portugal-portugues,Puerto-rico-espanol,Republica-dominicana-espanol,Romania-romană,Russia-русский,Serbia-srpski,Singapore-english,Slovak-republic-slovenčina,Slovenia-slovene,South-africa-english,Spain-espanol,Swaziland-english,Sweden-svenska,Switzerland-deutsch,Switzerland-francais,Taiwan-中文,Thailand-ไทย,Trinidad-tobago-english,Turkey-turkce,Ukraine-yкраї́нська,United-kingdom-english,United-states-english,United-states-espanol,Uruguay-espanol,Venezuela-espanol,Vietnam-tiếng-việt,Zambia-english', Traces_en=None, Additives_tags=None, Main_category_en='Vitamins', Image_url='https://static.openfoodfacts.org/images/products/203/800/203/8/front_en.12.400.jpg', Quantity='60 compresse', Packaging_tags='barattolo,tablet', )]
Since I want to explore which countries the products are sold to besides Belgium, I split the country column to show every country individually using the code below:
#create df with grouped products
countriesDF = productsDF\
    .select("Product_name", "Countries_en")\
    .groupBy("Product_name")\
    .agg(F.collect_set("Countries_en").cast("string").alias("Countries"))\
    .orderBy("Product_name")
#split df to show countries the product is sold to in a separate column
countriesDF = countriesDF\
    .where(col("Countries") != "null")\
    .select("Product_name",
            F.split("Countries", ",").alias("Countries"),
            F.posexplode(F.split("Countries", ",")).alias("pos", "val")
    )\
    .drop("val")\
    .select(
        "Product_name",
        F.concat(F.lit("Countries"), F.col("pos").cast("string")).alias("name"),
        F.expr("Countries[pos]").alias("val")
    )\
    .groupBy("Product_name").pivot("name").agg(F.first("val"))\
    .show()
However, this table now has over 400 columns for countries alone, which is not presentable. So my questions are:
Am I doing the splitting / exploding correctly?
Can I split the df so that I get the countries as column names (e.g. 'France' instead of 'countries1' etc.), counting the number of times the product is sold in that country?
Some sample data :
val sampledf = Seq(("p1","BELGIUM,GERMANY"),("p1","BELGIUM,ITALY"),("p1","GERMANY"),("p2","BELGIUM")).toDF("Product_name","Countries_en")
Transform to required df :
val df = sampledf
  .withColumn("country_list", split(col("Countries_en"), ","))
  .select(col("Product_name"), explode(col("country_list")).as("country"))
+------------+-------+
|Product_name|country|
+------------+-------+
| p1|BELGIUM|
| p1|GERMANY|
| p1|BELGIUM|
| p1| ITALY|
| p1|GERMANY|
| p2|BELGIUM|
+------------+-------+
If you need only counts per country :
val countDF = df.groupBy("Product_name", "country").count()
countDF.show()
+------------+-------+-----+
|Product_name|country|count|
+------------+-------+-----+
| p1|BELGIUM| 2|
| p1|GERMANY| 1|
| p2|BELGIUM| 1|
+------------+-------+-----+
Except Belgium :
countDF.filter(col("country") =!= "BELGIUM").show()
+------------+-------+-----+
|Product_name|country|count|
+------------+-------+-----+
| p1|GERMANY| 1|
+------------+-------+-----+
And if you really want countries as Columns :
countDF.groupBy("Product_name").pivot("country").agg(first("count"))
+------------+-------+-------+
|Product_name|BELGIUM|GERMANY|
+------------+-------+-------+
| p2| 1| null|
| p1| 2| 1|
+------------+-------+-------+
And you can .drop("BELGIUM") to achieve it.
Final code used:
# create df where the countries are split off
df = productsDF\
    .withColumn("country_list", split(col("Countries_en"), ","))\
    .select(col("Product_name"), explode(col("country_list")).alias("Country"))
# create the count and filter out Country Belgium; the product name can be changed as needed
countDF = df.groupBy("Product_name", "Country").count()\
    .filter(col("Country") != "Belgium")\
    .filter(col("Product_name") == 'Café').show()
