Replace values of several columns with mapped values from another dataframe (PySpark) - apache-spark

I need to replace values of several columns (many more than those in the example, so I would like to avoid doing multiple left joins) of a dataframe with values from another dataframe (mapping).
Example:
df1 EXAM:
id | question1 | question2 | question3
---------------------------------------
1  | 12        | 12        | 5
2  | 12        | 13        | 6
3  | 3         | 7         | 5
df2 VOTE MAPPING:
id | description
------------------
3  | bad
5  | insufficient
6  | sufficient
12 | very good
13 | excellent
Output:
id | question1 | question2 | question3
------------------------------------------
1  | very good | very good | insufficient
2  | very good | excellent | sufficient
3  | bad       | null      | insufficient
Edit 1: Corrected id for excellent in vote map

First of all, you can create a reference dataframe:
import pyspark.sql.functions as func
from pyspark.sql.types import MapType, StringType

# Collect all id -> description pairs of df2 into a single map held in one row.
df3 = df2.select(
    func.create_map(func.col('id'), func.col('description')).alias('ref')
).groupBy().agg(
    func.collect_list('ref').alias('ref')
).withColumn(
    'ref', func.udf(lambda lst: {k: v for element in lst for k, v in element.items()},
                    returnType=MapType(StringType(), StringType()))(func.col('ref'))
)
+--------------------------------------------------------------------------------+
|ref                                                                             |
+--------------------------------------------------------------------------------+
|{3 -> bad, 12 -> very good, 5 -> insufficient, 13 -> excellent, 6 -> sufficient}|
+--------------------------------------------------------------------------------+
Then you can replace the values in the question columns by looking each one up in the reference map with a single crossJoin:
df4 = df1.crossJoin(df3)\
    .select(
        'id',
        *[func.col('ref').getItem(func.col(col)).alias(col) for col in df1.columns[1:]]
    )
df4.show(10, False)
+---+---------+---------+------------+
|id |question1|question2|question3   |
+---+---------+---------+------------+
|1  |very good|very good|insufficient|
|2  |very good|excellent|sufficient  |
|3  |bad      |null     |insufficient|
+---+---------+---------+------------+
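As a side note (not part of the original answer), if the mapping table is small enough to collect to the driver, you could skip both the UDF and the crossJoin by building a literal map column from df2 and looking each question column up in it. A minimal sketch, assuming df1 and df2 as defined in the question:

from itertools import chain
import pyspark.sql.functions as func

# Collect the id -> description pairs and turn them into a literal map column.
mapping = {row['id']: row['description'] for row in df2.collect()}
mapping_col = func.create_map(*[func.lit(x) for x in chain(*mapping.items())])

df4 = df1.select(
    'id',
    *[mapping_col.getItem(func.col(c)).alias(c) for c in df1.columns[1:]]
)

The result should match the output above; the trade-off is that the mapping has to fit in driver memory.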

Related

Filter DataFrame to delete duplicate values in pyspark

I have the following dataframe
date | value | ID
--------------------------------------
2021-12-06 15:00:00 25 1
2021-12-06 15:15:00 35 1
2021-11-30 00:00:00 20 2
2021-11-25 00:00:00 10 2
I want to join this DF with another one like this:
idUser | Name | Gender
-------------------
1 John M
2 Anne F
My expected output is:
ID | Name | Gender | Value
---------------------------
1 John M 35
2 Anne F 20
What I need is: get only the most recent value from the first dataframe and join only that value with my second dataframe. However, my Spark script is joining both values:
My code:
from pyspark.sql.functions import col, max
from pyspark.sql.types import StringType

df = df1.select(
    col("date"),
    col("value"),
    col("ID"),
).orderBy(
    col("ID").asc(),
    col("date").desc(),
).groupBy(
    col("ID"), col("date").cast(StringType()).substr(0, 10).alias("date")
).agg(
    max(col("value")).alias("value")
)
final_df = df2.join(
    df,
    col("idUser") == col("ID"),
    how="left"
)
When I perform this join (formatting of the columns is abstracted away in this post) I get the following output:
ID | Name | Gender | Value
---------------------------
1 John M 35
2 Anne F 20
2 Anne F 10
I use substr to remove hours and minutes so that I filter only by date. But when I have the same ID on different days, my output df has both values instead of only the most recent one. How can I fix this?
Note: I'm using only pyspark functions to do this (I do not want to use spark.sql(...)).
You can use a window and the row_number function in PySpark:
from pyspark.sql import functions as f
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

windowSpec = Window.partitionBy("ID").orderBy(f.col("date").desc())
df1_latest_val = df1.withColumn("row_number", row_number().over(windowSpec)).filter(
    f.col("row_number") == 1
)
The output of df1_latest_val will look something like this:
date | value | ID | row_number |
-----------------------------------------------------
2021-12-06 15:15:00 35 1 1
2021-11-30 00:00:00 20 2 1
Now df1_latest_val holds only the latest value per ID, and you can join it directly with the other table.
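A minimal sketch of that final join, using the column names from the question (this step is not spelled out in the original answer):

# keep only the columns needed from the deduplicated dataframe
latest = df1_latest_val.select("ID", "value")
final_df = df2.join(latest, df2["idUser"] == latest["ID"], how="left")
final_df.show()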

How to create group | sub-group columns in a (pre-defined) cyclic order by considering identical consecutive groupings in a Pandas DataFrame?

Task 1: I am looking for a solution to create a group by considering the identical consecutive groupings in one of the columns of my Pandas DataFrame (treating that column as the values of a list):
from itertools import groupby
import pandas as pd
test_list = ['AA', 'AA', 'BB', 'CC', 'DD', 'DD', 'DD', 'AA', 'BB', 'EE', 'CC']
data = pd.DataFrame(test_list)
data['batches'] = ['1','1','2','3','4','4','4','5','6','7','8'] # this is the goal to reach
print(data)
result = [list(y) for x, y in groupby(test_list)]
print(result)
[['AA', 'AA'], ['BB'], ['CC'], ['DD', 'DD', 'DD'], ['AA'], ['BB'], ['EE'], ['CC']]
So, I have a DataFrame with two columns: the first holds the elements that must be kept in order and grouped into batches of identical consecutive values; the batches column is where the result should be assigned.
I couldn't find a solution or a workaround. As you can see, I've created a list using the itertools groupby function by grouping the same consecutive items, but this isn't the final result I'd like to see. I know that itertools groupby allows me to utilize a lambda function with the 'key=' parameter to perhaps get to my solution.
I was thinking of merging the above and looping it into a dictionary, with the key being the batch numbers obtained by iterating the list using enumerate and the values being the list elements:
{1:['AA', 'AA'], 2:['BB'], 3:['CC'], 4: ['DD', 'DD', 'DD']...}
After that, I'll convert the dictionary (or any other solution/workaround) to a pandas Series and add it to my batches column:
In this exercise, I just want to return the key(s) of my 'dictionary' (the number of unique batches) to the batches column.
| list | batches |
| -------- | ------- |
| AA | 1 |
| AA | 1 |
| BB | 2 |
| CC | 3 |
| DD | 4 |
| DD | 4 |
| DD | 4 |
| AA | 5 |
| BB | 6 |
| EE | 7 |
| CC | 8 |
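For reference, a common pandas idiom for Task 1 (a sketch, not taken from the original thread) marks the start of each new run by comparing every value with the previous row and then takes a cumulative sum:

import pandas as pd

test_list = ['AA', 'AA', 'BB', 'CC', 'DD', 'DD', 'DD', 'AA', 'BB', 'EE', 'CC']
data = pd.DataFrame({'list': test_list})

# A new batch starts wherever the value differs from the previous row;
# cumsum() then numbers the consecutive runs 1, 2, 3, ...
data['batches'] = (data['list'] != data['list'].shift()).cumsum()
print(data)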
EDITED:
Task 2: The added query for a similar task:
In this scenario, my initial list has a (pre-defined) cyclic order to follow, such as AA -- AB -- AC belonging to one main group and DA -- DB to another group.
The question is how to calculate the sub column so that I can have sub-group listings under my main group, so to say, capturing repeated groups within the main group.
| list | sub | main gr |
| ---- | --- | ------- |
| AA   | 1   | 1       |
| AB   | 1   | 1       |
| AC   | 1   | 1       |
| AA   | 2   | 1       |
| AB   | 2   | 1       |
| AC   | 2   | 1       |
| DA   | 1   | 2       |
| DB   | 1   | 2       |
I found a solution whose logic was based on #Shubham's comment. My solution uses the .cumcount() function as follows: df['sub'] = df.groupby(['main gr', 'list']).cumcount() + 1 (the + 1 makes the sub-order count/index start at 1 instead of 0).
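A minimal runnable demo of that cumcount approach (the 'main gr' column is assumed here to be already assigned from the pre-defined cyclic order; this demo is not part of the original post):

import pandas as pd

df = pd.DataFrame({
    'list':    ['AA', 'AB', 'AC', 'AA', 'AB', 'AC', 'DA', 'DB'],
    'main gr': [1, 1, 1, 1, 1, 1, 2, 2],
})

# Number the repeats of each (main gr, list) pair, starting at 1.
df['sub'] = df.groupby(['main gr', 'list']).cumcount() + 1
print(df)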
(I'm not looking for the best solution, I am looking for a solution. Nevertheless, I would like to use this code for large datasets containing millions of entries).
I will highly appreciate any comment or supporting feedback.

Calculate mean per few columns in Pandas Dataframe

I have a Pandas dataframe, Data:
ID | A1| A2| B1| B2
ID1| 2 | 1 | 3 | 7
ID2| 4 | 6 | 5 | 3
I want to calculate the mean of columns (A1 and A2) and (B1 and B2) separately and row-wise. My desired output:
ID | A1A2 mean | B1B2 mean
ID1| 1.5 | 5
ID2| 5 | 4
I can compute the mean of all columns together, but cannot find any function to get my desired output.
Is there any built-in method in Python?
Use DataFrame.groupby with a lambda function to get the first letter of the column names for the mean; also, if the first column is not the index, use DataFrame.set_index:
df=df.set_index('ID').groupby(lambda x: x[0], axis=1).mean().add_suffix('_mean').reset_index()
print (df)
ID A_mean B_mean
0 ID1 1.5 5.0
1 ID2 5.0 4.0
Another solution is to extract the first letter of the column names by indexing with str[0]:
df = df.set_index('ID')
print (df.columns.str[0])
Index(['A', 'A', 'B', 'B'], dtype='object')
df = df.groupby(df.columns.str[0], axis=1).mean().add_suffix('_mean').reset_index()
print (df)
ID A_mean B_mean
0 ID1 1.5 5.0
1 ID2 5.0 4.0
Or:
df = (df.set_index('ID')
        .groupby(df.columns[1:].str[0], axis=1)
        .mean()
        .add_suffix('_mean')
        .reset_index())
Verify solution:
a = df.filter(like='A').mean(axis=1)
b = df.filter(like='B').mean(axis=1)
df = df[['ID']].assign(A_mean=a, B_mean=b)
print (df)
ID A_mean B_mean
0 ID1 1.5 5.0
1 ID2 5.0 4.0
EDIT:
If you have different column names and need to specify them in lists:
a = df[['A1','A2']].mean(axis=1)
b = df[['B1','B2']].mean(axis=1)
df = df[['ID']].assign(A_mean=a, B_mean=b)
print (df)
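If the column-to-group assignment is irregular, another option (a sketch, starting again from the original df and assuming a pandas version where axis=1 groupby is available, as the answers above do) is to pass an explicit column-to-group mapping to groupby:

mapper = {'A1': 'A', 'A2': 'A', 'B1': 'B', 'B2': 'B'}
df = (df.set_index('ID')
        .groupby(mapper, axis=1)   # group columns by their mapped label
        .mean()
        .add_suffix('_mean')
        .reset_index())
print(df)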

How to iterate over columns of "spark" dataframe?

I have the following Spark dataframe, which is created dynamically:
+-------+------------+
| name  | number     |
+-------+------------+
| Andy  | (20,10,30) |
| Berta | (30,40,20) |
| Joe   | (40,90,60) |
+-------+------------+
Now, I need to iterate over each row and column in Spark to print the following output. How can I do this?
Andy 20
Andy 10
Andy 30
Berta 30
Berta 40
Berta 20
Joe 40
Joe 90
Joe 60
Assuming the number column is of string data type, you can achieve the desired results by following the steps below.
Original Data Frame:
val df = Seq(("Andy", "20,10,30"), ("Berta", "30,40,20"), ("Joe", "40,90,60"))
  .toDF("name", "number")
Then create an intermediate data frame with 3 number columns by splitting the number column on the comma:
val Interim_Df = df.withColumn("n1", split(col("number"), ",").getItem(0))
  .withColumn("n2", split(col("number"), ",").getItem(1))
  .withColumn("n3", split(col("number"), ",").getItem(2))
  .drop("number")
Then generate the final result data frame by taking the union of onlyOneIndexDfs:
val columnIndexes = Seq(1, 2, 3)
val onlyOneIndexDfs = columnIndexes.map(x =>
  Interim_Df.select(
    $"name",
    col(s"n$x").alias("number")))
val resultDF = onlyOneIndexDfs.reduce(_ union _)
You need the explode function. Here is a sample of its usage.
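A minimal sketch of that idea in PySpark (explode also exists in the Scala API), assuming an equivalent dataframe where number is a comma-separated string, as in the answer above:

from pyspark.sql import functions as F

result = df.select(
    "name",
    F.explode(F.split(F.col("number"), ",")).alias("number")   # one row per number
)
result.show()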

Apache Spark - Finding Array/List/Set subsets

I have 2 dataframes each one having Array[String] as one of the columns. For each entry in one dataframe, I need to find out subsets, if any, in the other dataframe. An example is here:
DF1:
----------------------------------------------------
id : Long | labels : Array[String]
---------------------------------------------------
10 | [label1, label2, label3]
11 | [label4, label5]
12 | [label6, label7]
DF2:
----------------------------------------------------
item : String | labels : Array[String]
---------------------------------------------------
item1 | [label1, label2, label3, label4, label5]
item2 | [label4, label5]
item3 | [label4, label5, label6, label7]
After the subset operation I described, the expected output should be
DF3:
----------------------------------------------------
item : String | id : Long
---------------------------------------------------
item1 | [10, 11]
item2 | [11]
item3 | [11, 12]
It is guaranteed that DF2 will always have corresponding subsets in DF1, so there won't be any leftover elements.
Can someone please help with the right approach here? It looks like for each element in DF2, I need to scan DF1 and do a subset operation (or set subtraction) on the 2nd column until I find all the subsets and exhaust the labels in that row, while accumulating the list of "id" fields. How do I do this in a compact and efficient manner? Any help is greatly appreciated. Realistically, I may have 100s of elements in DF1 and 1000s of elements in DF2.
I'm not aware of any way to perform this kind of operation in an efficient way. However, here is one possible solution using a UDF as well as a Cartesian join.
The UDF takes two sequences and checks whether all strings in the first exist in the second:
val matchLabel = udf((array1: Seq[String], array2: Seq[String]) => {
  array1.forall{x => array2.contains(x)}
})
To use a Cartesian join, it needs to be enabled, since it is computationally expensive.
val spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.crossJoin.enabled", true)
The two dataframes are joined together utilizing the UDF. Afterwards the resulting dataframe is grouped by the item column to collect a list of all ids. Using the same DF1 and DF2 as in the question:
val DF3 = DF2.join(DF1, matchLabel(DF1("labels"), DF2("labels")))
  .groupBy("item")
  .agg(collect_list("id").as("id"))
The result is as follows:
+-----+--------+
| item| id|
+-----+--------+
|item3|[11, 12]|
|item2| [11]|
|item1|[10, 11]|
+-----+--------+
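As a side note (not part of the original answer), on Spark 2.4+ the UDF can be avoided with the built-in array_except function. A sketch of the same logic in PySpark, using an explicit crossJoin:

from pyspark.sql import functions as F

# DF1's labels are a subset of DF2's labels when nothing is left after array_except.
DF3 = (DF2.alias("b").crossJoin(DF1.alias("a"))
          .where(F.size(F.array_except(F.col("a.labels"), F.col("b.labels"))) == 0)
          .groupBy("item")
          .agg(F.collect_list("id").alias("id")))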
