Detect existence of column element in multiple other columns using join - apache-spark

I'm using PySpark 2.4.
I have a dataframe like below as input:
+-------+-------+---------+
| ceci_p| ceci_l|ceci_stok|
+-------+-------+---------+
|SFIL401| BPI202|   BPI202|
| BPI202| CDC111|   BPI202|
| LBP347|SFIL402|  SFIL402|
| LBP347|SFIL402|   LBP347|
+-------+-------+---------+
I want to detect which ceci_stok values exist in both ceci_l and ceci_p columns using a join (maybe a self join).
For example: ceci_stok = BPI202 exists in both ceci_l and ceci_p.
As a result, I want to create a new dataframe containing the ceci_stok values that exist in both ceci_l and ceci_p.

# create data for testing
data = [("SFIL401","BPI202","BPI202"),
        ("BPI202","CDC111","BPI202"),
        ("LBP347","SFIL402","SFIL402"),
        ("LBP347","SFIL402","LBP347")]
data_schema = ["ceci_p","ceci_l","ceci_stok"]
df = spark.createDataFrame(data=data, schema=data_schema)
df.cache()  # don't forget to cache a table you reference multiple times

# rename each column to a common key so the two sides can be joined
ceci_p = df.select(df.ceci_p.alias("join_key")).distinct()
ceci_l = df.select(df.ceci_l.alias("join_key")).distinct()

vals = ceci_l.join(ceci_p, "join_key").distinct()  # values common to both columns you're interested in
df.join(vals, df.ceci_stok == vals.join_key).show()
+-------+-------+---------+--------+
| ceci_p| ceci_l|ceci_stok|join_key|
+-------+-------+---------+--------+
|SFIL401| BPI202| BPI202| BPI202|
| BPI202| CDC111| BPI202| BPI202|
+-------+-------+---------+--------+
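An equivalent way to build vals (a sketch, not from the original answer) is a column intersection, which avoids the rename-and-join step:
vals = (df.select(df.ceci_p.alias("join_key"))
        .intersect(df.select(df.ceci_l.alias("join_key"))))  # intersect() also deduplicates
df.join(vals, df.ceci_stok == vals.join_key).show()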

The following seems to be working in Spark 3.0.2. Please try it.
from pyspark.sql import functions as F
df2 = (
    df.select('ceci_stok').alias('_stok')
    .join(df.alias('_p'), F.col('_stok.ceci_stok') == F.col('_p.ceci_p'), 'leftsemi')
    .join(df.alias('_l'), F.col('_stok.ceci_stok') == F.col('_l.ceci_l'), 'leftsemi')
    .distinct()
)
df2.show()
# +---------+
# |ceci_stok|
# +---------+
# | BPI202|
# +---------+

You're right, that can be done using a self join. If you have a dataframe
>>> df.show(truncate=False)
+-------+-------+---------+
|ceci_p |ceci_l |ceci_stok|
+-------+-------+---------+
|SFIL401|BPI202 |BPI202 |
|BPI202 |CDC111 |BPI202 |
|LBP347 |SFIL402|SFIL402 |
|LBP347 |SFIL402|LBP347 |
+-------+-------+---------+
...then the following couple of joins (with "leftsemi" to drop right-hand side) should produce what you need:
>>> df.select("ceci_stok") \
        .join(df.select("ceci_p"), df.ceci_stok == df.ceci_p, "leftsemi") \
        .join(df.select("ceci_l"), df.ceci_stok == df.ceci_l, "leftsemi") \
        .show(truncate=False)
+---------+
|ceci_stok|
+---------+
|BPI202 |
|BPI202 |
+---------+
You can dedup the result if you're just interested in unique values.
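For example, a sketch that just appends distinct() to the joins above:
df.select("ceci_stok") \
    .join(df.select("ceci_p"), df.ceci_stok == df.ceci_p, "leftsemi") \
    .join(df.select("ceci_l"), df.ceci_stok == df.ceci_l, "leftsemi") \
    .distinct() \
    .show(truncate=False)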

Related

How to update two columns in PySpark satisfying the same condition?

I have a table in which there are 4 columns: "ID", "FLAG_A", "FLAG_B", "FLAG_C".
This is the SQL query I want to transform into PySpark; there are two conditions that need to be satisfied to update both columns "FLAG_A" and "FLAG_B". How do I do it in PySpark?
UPDATE STATUS_TABLE SET STATUS_TABLE.[FLAG_A] = "JAVA",
STATUS_TABLE.FLAG_B = "PYTHON"
WHERE (((STATUS_TABLE.[FLAG_A])="PROFESSIONAL_CODERS") AND
((STATUS_TABLE.FLAG_C) Is Null));
Is it possible to code this in a single statement by giving two conditions and satisfying the "FLAG_A" and "FLAG_B" columns in PySpark?
I can't think of a way to rewrite this as the single statement you had in mind. I tried writing the UPDATE query inside Spark, but it seems UPDATE is not supported:
: java.lang.UnsupportedOperationException: UPDATE TABLE is not supported temporarily.
The following does exactly the same as your UPDATE query:
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 'PROFESSIONAL_CODERS', 'X', None),
     (2, 'KEEP', 'KEEP', 'KEEP')],
    ['ID', 'FLAG_A', 'FLAG_B', 'FLAG_C'])
Script:
cond = (F.col('FLAG_A') == 'PROFESSIONAL_CODERS') & F.isnull('FLAG_C')
df = df.withColumn('FLAG_B', F.when(cond, 'PYTHON').otherwise(F.col('FLAG_B')))
df = df.withColumn('FLAG_A', F.when(cond, 'JAVA').otherwise(F.col('FLAG_A')))
df.show()
# +---+------+------+------+
# | ID|FLAG_A|FLAG_B|FLAG_C|
# +---+------+------+------+
# | 1| JAVA|PYTHON| null|
# | 2| KEEP| KEEP| KEEP|
# +---+------+------+------+
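If you do want a single statement, a select with two when expressions is one option (a sketch; df2 is just a name for the result). It also sidesteps an ordering subtlety: the script above must update FLAG_B before FLAG_A, because cond reads the original FLAG_A.
df2 = df.select(
    'ID',
    F.when(cond, 'JAVA').otherwise(F.col('FLAG_A')).alias('FLAG_A'),
    F.when(cond, 'PYTHON').otherwise(F.col('FLAG_B')).alias('FLAG_B'),
    'FLAG_C')  # both when() expressions see the original FLAG_A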

Select a next or previous record on a dataframe (PySpark)

I have a spark dataframe that has a list of timestamps (partitioned by uid, ordered by timestamp). Now, I'd like to query the dataframe to get either previous or next record.
df = myrdd.toDF().repartition("uid").sort(desc("timestamp"))
df.show()
+------------+-------------------+
|         uid|          timestamp|
+------------+-------------------+
|Peter_Parker|2020-09-19 02:14:40|
|Peter_Parker|2020-09-19 01:07:38|
|Peter_Parker|2020-09-19 00:04:39|
|Peter_Parker|2020-09-18 23:02:36|
|Peter_Parker|2020-09-18 21:58:40|
+------------+-------------------+
So for example if I were to query:
ts=datetime.datetime(2020, 9, 19, 0, 4, 39)
I want to get the previous record on (2020-09-18 23:02:36), and only that one.
How can I get the previous one?
It's possible to do it using withColumn() and a diff, but is there a smarter, more efficient way? I really don't need to calculate the diff for ALL events, since the data is already ordered; I just want the previous/next record.
You can use a filter and order by, and then limit the results to 1 row:
df2 = (df.filter("uid = 'Peter_Parker' and timestamp < timestamp('2020-09-19 00:04:39')")
       .orderBy('timestamp', ascending=False)
       .limit(1)
)
df2.show()
+------------+-------------------+
| uid| timestamp|
+------------+-------------------+
|Peter_Parker|2020-09-18 23:02:36|
+------------+-------------------+
Or by using row_number after filtering:
from pyspark.sql import Window
from pyspark.sql import functions as F
df1 = df.filter("timestamp < '2020-09-19 00:04:39'") \
    .withColumn("rn", F.row_number().over(Window.orderBy(F.desc("timestamp")))) \
    .filter("rn = 1").drop("rn")
df1.show()
#+------------+-------------------+
#| uid| timestamp|
#+------------+-------------------+
#|Peter_Parker|2020-09-18 23:02:36|
#+------------+-------------------+
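Note that Window.orderBy without partitionBy moves every row into a single partition. With many uids, a partitioned window is safer (a sketch; w is a hypothetical name):
w = Window.partitionBy("uid").orderBy(F.desc("timestamp"))
df1 = df.filter("timestamp < '2020-09-19 00:04:39'") \
    .withColumn("rn", F.row_number().over(w)) \
    .filter("rn = 1").drop("rn")  # latest row per uid before the cutoff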

Pyspark: join on date-type data

For example:
I have two dataframes in Pyspark.
A_dataframe (table name: link_data_test) is very big, about 1 billion rows:
+-----+--------------------+--------------+
|   id|           link_date|      tuch_url|
+-----+--------------------+--------------+
|day_1|2020-01-01 06:00:...|www.google.com|
|day_2|2020-01-01 11:00:...|www.33e.......|
|day_3|2020-01-03 22:21:...|www.3tg.......|
|day_4|2019-01-04 20:00:...|www.96g.......|
|  ...|                 ...|           ...|
+-----+--------------------+--------------+
B_dataframe (table name: url_data_test):
+--------------+----------+
|           url|extra_date|
+--------------+----------+
|www.google.com|2019-02-01|
|www.23........|2020-01-02|
|www.hsi.......|2020-01-03|
|www.cc........|2020-01-05|
|           ...|       ...|
+--------------+----------+
I can use the spark.sql() to create a query:
sql_str="""
select
t1.*,t2.*
from
link_data_test as t1
inner join
url_data_test as t2
on
t1.link_date> t2.extra_date and t1.link_date< date_add(t2.extra_date,8)
where
t1.tuch_url like "%t2.url%"
"""
test1=spark.sql(sql_str).saveAsTable("xxxx",mode="overwrite")
For some other tests, I tried to replace the SQL above with the DataFrame API, but I don't know how to write it:
A_dataframe.join(B_dataframe, ......,'inner').select(....).saveAsTable("xxxx",mode="overwrite")
Thank you for your help!
Here is the way.
df1 = spark.read.option("header","true").option("inferSchema","true").csv("test1.csv")
df1.show(10, False)
df2 = spark.read.option("header","true").option("inferSchema","true").csv("test2.csv")
df2.show(10, False)
+-----+-------------------+--------------+
|id |link_date |tuch_url |
+-----+-------------------+--------------+
|day_1|2020-01-08 23:59:59|www.google.com|
+-----+-------------------+--------------+
+--------------+----------+
|url |extra_date|
+--------------+----------+
|www.google.com|2020-01-01|
+--------------+----------+
from pyspark.sql.functions import broadcast, col, date_add

# the contains() check mirrors the intent of t1.tuch_url like '%<t2.url>%'
df1.join(broadcast(df2),
         col('link_date').between(col('extra_date'), date_add('extra_date', 7))
         & col('tuch_url').contains(col('url')), 'inner') \
   .show(10, False)
+-----+-------------------+--------------+--------------+----------+
|id |link_date |tuch_url |url |extra_date|
+-----+-------------------+--------------+--------------+----------+
|day_1|2020-01-08 23:59:59|www.google.com|www.google.com|2020-01-01|
+-----+-------------------+--------------+--------------+----------+
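To persist the result as in your spark.sql version, go through df.write (a sketch; "xxxx" is your placeholder table name, and note that saveAsTable is a method of DataFrameWriter, not of the DataFrame itself):
result = df1.join(broadcast(df2),
                  col('link_date').between(col('extra_date'), date_add('extra_date', 7))
                  & col('tuch_url').contains(col('url')), 'inner')
result.write.mode("overwrite").saveAsTable("xxxx")  # same effect as mode="overwrite" in the SQL version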

Calculate new column in spark Dataframe, crossing a tokens list column in df1 with a text column in df2 with pyspark

I am using Spark 2.4.5 and I need to calculate the sentiment score of a token-list column (the MeaningfulWords column) of df1, according to the words in df2 (a Spanish sentiment dictionary). In df1 I must create a new column with the list of token scores and another column with the mean score (sum of scores / word count) of each record. If any token in the list (df1) is not in the dictionary (df2), it is scored zero.
The Dataframes looks like this:
df1.select("ID","MeaningfulWords").show(truncate=True, n=5)
+------------------+------------------------------+
| ID| MeaningfulWords|
+------------------+------------------------------+
|abcde00000qMQ00001|[casa, alejado, buen, gusto...|
|abcde00000qMq00002|[clientes, contentos, servi...|
|abcde00000qMQ00003| [resto, bien]|
|abcde00000qMQ00004|[mal, servicio, no, antiend...|
|abcde00000qMq00005|[gestion, adecuada, proble ...|
+------------------+------------------------------+
df2.show(5)
+-----+----------+
|score| word|
+-----+----------+
| 1.68|abandonado|
| 3.18| abejas|
| 2.8| aborto|
| 2.46| abrasador|
| 8.13| abrazo|
+-----+----------+
The new columns to add to df1 should look like this:
+------------------+---------------------+
| MeanScore| ScoreList|
+------------------+---------------------+
| 2.95|[3.10, 2.50, 1.28,...|
| 2.15|[1.15, 3.50, 2.75,...|
| 2.75|[4.20, 1.00, 1.75,...|
| 3.25|[3.25, 2.50, 3.20,...|
| 3.15|[2.20, 3.10, 1.28,...|
+------------------+---------------------+
I have reviewed some options using .join, but joining columns with different data types gives an error.
I have also tried converting the Dataframes to RDD and calling a function:
def map_words_to_values(review_words, dict_df):
    return [dict_df[word] for word in review_words if word in dict_df]
RDD1=swRemoved.rdd.map(list)
RDD2=Dict_df.rdd.map(list)
reviewsRDD_dict_values = RDD1.map(lambda tuple: (tuple[0], map_words_to_values(tuple[1], RDD2)))
reviewsRDD_dict_values.take(3)
But with this option I get the error:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
I have found some examples that score text using the afinn library, but it doesn't work with Spanish text.
I want to use native PySpark functions instead of UDFs to avoid hurting performance, if possible. But I'm a beginner in Spark and I would like to find the Spark way to do this.
You could do this by first joining using array_contains(MeaningfulWords, word), then a groupBy with the aggregations first, collect_list, and mean (Spark 2.4+).
Welcome to SO!
df1.show()
#+------------------+----------------------------+
#|ID |MeaningfulWords |
#+------------------+----------------------------+
#|abcde00000qMQ00001|[casa, alejado, buen, gusto]|
#|abcde00000qMq00002|[clientes, contentos, servi]|
#|abcde00000qMQ00003|[resto, bien] |
#+------------------+----------------------------+
df2.show()
#+-----+---------+
#|score| word|
#+-----+---------+
#| 1.68| casa|
#| 2.8| alejado|
#| 1.03| buen|
#| 3.68| gusto|
#| 0.68| clientes|
#| 2.1|contentos|
#| 2.68| servi|
#| 1.18| resto|
#| 1.98| bien|
#+-----+---------+
from pyspark.sql import functions as F
df1.join(df2, F.expr("array_contains(MeaningfulWords, word)"), 'left')\
   .groupBy("ID").agg(F.first("MeaningfulWords").alias("MeaningfulWords"),
                      F.collect_list("score").alias("ScoreList"),
                      F.mean("score").alias("MeanScore"))\
   .show(truncate=False)
#+------------------+----------------------------+-----------------------+------------------+
#|ID                |MeaningfulWords             |ScoreList              |MeanScore         |
#+------------------+----------------------------+-----------------------+------------------+
#|abcde00000qMQ00003|[resto, bien]               |[1.18, 1.98]           |1.58              |
#|abcde00000qMq00002|[clientes, contentos, servi]|[0.68, 2.1, 2.68]      |1.8200000000000003|
#|abcde00000qMQ00001|[casa, alejado, buen, gusto]|[1.68, 2.8, 1.03, 3.68]|2.2975            |
#+------------------+----------------------------+-----------------------+------------------+
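One caveat: with the left join above, tokens missing from the dictionary are simply absent from ScoreList and ignored by mean, rather than scored as 0 as the question asks. A per-token variant (a sketch using explode and coalesce; scored/result are hypothetical names) makes the zero explicit:
scored = (df1.select('ID', F.explode('MeaningfulWords').alias('word'))
          .join(df2, 'word', 'left')
          .withColumn('score', F.coalesce('score', F.lit(0.0))))  # unmatched token -> 0
result = scored.groupBy('ID').agg(
    F.collect_list('score').alias('ScoreList'),
    F.mean('score').alias('MeanScore'))
result.show(truncate=False)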

How to deal with white space in column names to use spark coalesce function in expr method

I am working with Spark's coalesce functionality in my project. The code works fine on columns without spaces but fails on columns whose names contain spaces.
e1.csv
id,code,type,no root
1,,A,1
2,,,0
3,123,I,1
e2.csv
id,code,type,no root
1,456,A,1
2,789,A1,0
3,,C,0
Logic code:
Dataset<Row> df1 = spark.read().format("csv").option("header", "true").load("/home/user/Videos/<folder>/e1.csv");
Dataset<Row> df2 = spark.read().format("csv").option("header", "true").load("/home/user/Videos/<folder>/e2.csv");
Dataset<Row> newDS = df1.as("a").join(df2.as("b")).where("a.id== b.id").selectExpr("coalesce(`a.no root`,`b.no root`) AS `a.no root`");
newDS.show();
What I have tried
Dataset<Row> newDS = df1.as("a").join(df2.as("b")).where("a.id== b.id").selectExpr("""coalesce(`a.no root`,`b.no root`) AS `a.no root`""");
The expected result would be like
no root
1
0
1
Using the following expression
val newDS = df1.as("a").join(df2.as("b")).where("a.id==b.id").selectExpr("coalesce(a.`no root`,b.`no root`) AS `a.no root`")
will generate the expected output
+---------+
|a.no root|
+---------+
| 1|
| 0|
| 1|
+---------+
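The same fix carries over to PySpark; the key point in both languages is that the backticks wrap only the column name, never the alias-qualified name (a sketch, assuming df1 and df2 are the two CSVs loaded as above):
from pyspark.sql import functions as F

df1.alias("a").join(df2.alias("b"), F.col("a.id") == F.col("b.id")) \
    .selectExpr("coalesce(a.`no root`, b.`no root`) AS `no root`") \
    .show()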
