Logical conditions in PySpark - where versus aggregation with when - apache-spark

Can anyone explain to me why these two conditions produce different outputs (even different count() )?
FIRST:
(df
 .where(cond1)
 .where((cond2) | (cond3))
 .groupBy('id')
 .agg(F.avg(F.column('col1')).alias('name'),
      F.avg(F.column('col2')).alias('name'))
).count()
SECOND:
(df
 .groupBy('id')
 .agg(F.avg(F.when(((cond2) | (cond3)) & (cond1),
                   F.column('col1'))).alias('name'),
      F.avg(F.when(((cond2) | (cond3)) & (cond1),
                   F.column('col2'))).alias('name'))
).count()

I just figured it out. when() returns null when it finds no match (since there is no otherwise()), but a null is still a value, which means the aggregation still sees every row and produces a group for every id. Compared to the same df grouped by the same column and aggregated with no conditions at all, the number of result rows is identical.
On the other hand, where() filters the DataFrame before the groupBy, so the aggregation is only applied to the filtered version of the DataFrame and ids whose rows are all filtered out disappear, hence the lower number of results.
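A minimal sketch that reproduces the effect; the data, column names and condition below are made up for illustration and are not from the question:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df_demo = spark.createDataFrame([(1, 10.0), (1, 20.0), (2, 5.0)], ['id', 'col1'])
cond = F.col('col1') > 15  # only id 1 has a matching row
# where() drops id 2 before the groupBy, so count() == 1
filtered = df_demo.where(cond).groupBy('id').agg(F.avg('col1').alias('name'))
# when() keeps every row (null where the condition fails), so count() == 2
conditional = df_demo.groupBy('id').agg(F.avg(F.when(cond, F.col('col1'))).alias('name'))
print(filtered.count(), conditional.count())  # 1 2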

Without knowing what the conditions are, my understanding is that these are different processes: in the first case you first filter the rows you need to process, group by id and take the averages of the filtered data, which results in, let's say, x rows. In the second case you group by id first, with no row filtering, and you tell Spark to add a column named 'name' to the grouped df that holds the conditional average. You never filter any rows out, so you end up with x plus some additional rows (depending on your conditions):
(df
 .where(cond1)                                # remove rows by applying cond1
 .where((cond2) | (cond3))                    # remove rows by applying cond2, cond3
 .groupBy('id')                               # group the *remaining* rows by id
 .agg(F.avg(F.column('col1')).alias('name'),  # then take the averages
      F.avg(F.column('col2')).alias('name'))
).count()
But:
(df
 .groupBy('id')                                    # group the initial data by id
 .agg(F.avg(F.when(((cond2) | (cond3)) & (cond1),  # add a column to the grouped data that
                   F.column('col1'))).alias('name'),  # computes the average conditionally
      F.avg(F.when(((cond2) | (cond3)) & (cond1),
                   F.column('col2'))).alias('name'))
).count()
# the agg does not change the number of rows.
Hope this helps (I think you've already figured it out though :) ). Good luck!

Related

drop_duplicates after unionByName

I am trying to stack two dataframes (with unionByName()) and then drop duplicate entries (with drop_duplicates()).
Can I trust that unionByName() will preserve the order of the rows, i.e., that df1.unionByName(df2) will always produce a dataframe whose first N rows are df1's? Because, if so, when applying drop_duplicates(), df1's rows would always be preserved, which is the behaviour I want.
unionByName() will not guarantee that df1's records come first and df2's after. These are distributed, parallel tasks, so you definitely cannot build on that ordering.
A solution is to add a technical priority column to each DataFrame, then unionByName() them and use the row_number() analytic function to rank rows by priority within each ID, keeping the one with the higher priority (in the code below, 1 means higher than 2).
Take a look at the Scala code below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number}

val df1WithPriority = df1.withColumn("priority", lit(1))
val df2WithPriority = df2.withColumn("priority", lit(2))

df1WithPriority
  .unionByName(df2WithPriority)
  .withColumn(
    "row_num",
    row_number().over(Window.partitionBy("ID").orderBy(col("priority").asc))
  )
  .where(col("row_num") === lit(1))
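Since the question is phrased in terms of the Python API, roughly the same idea in PySpark could look like the sketch below; it assumes both DataFrames share an "ID" column, mirroring the Scala example above:
from pyspark.sql import Window, functions as F
df1_p = df1.withColumn("priority", F.lit(1))
df2_p = df2.withColumn("priority", F.lit(2))
w = Window.partitionBy("ID").orderBy(F.col("priority").asc())
deduped = (df1_p.unionByName(df2_p)
           .withColumn("row_num", F.row_number().over(w))
           .where(F.col("row_num") == 1)
           .drop("priority", "row_num"))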

How to drop entire record if more than 90% of features have missing value in pandas

I have a pandas dataframe called df with 500 columns and 2 million records.
I am able to drop columns that contain more than 90% of missing values.
But how can I drop in pandas the entire record if 90% or more of the columns have missing values across the whole record?
I have seen a similar post for "R" but I am coding in python at the moment.
You can use df.dropna() and set the thresh parameter (the minimum number of non-NA values a row must have to be kept) to the value that corresponds to 10% of your columns, here 50 out of 500:
df.dropna(axis=0, thresh=50, inplace=True)
You could use isna + mean on axis=1 to get the fraction of NaN values in each row, then keep only the rows where that fraction is less than 0.9 (i.e. 90%) using loc:
out = df.loc[df.isna().mean(axis=1)<0.9]
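A quick sanity check of both suggestions on a toy frame (10 columns instead of 500; the data below is made up):
import numpy as np
import pandas as pd
# toy frame: 10 columns; row 'c' is entirely NaN, row 'b' is half NaN
toy = pd.DataFrame(np.ones((3, 10)), index=list('abc'))
toy.loc['b', toy.columns[:5]] = np.nan
toy.loc['c', :] = np.nan
# thresh = 10% of the columns: a row needs at least that many non-NA values to be kept
kept_a = toy.dropna(axis=0, thresh=int(0.1 * toy.shape[1]))
# fraction-of-NaN version: keep rows with less than 90% missing values
kept_b = toy.loc[toy.isna().mean(axis=1) < 0.9]
print(kept_a.index.tolist(), kept_b.index.tolist())  # ['a', 'b'] ['a', 'b']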

Using Pandas: How do I combine multiple rows of data into a single row based on a common key?

Need help merging multiple rows of data with various datatypes for multiple columns
I have a dataframe that contains 14 columns and x number of rows of data. An example slice of the dataframe is linked below:
Current Example of my dataframe
I want to be able to merge all four rows of data into a single row based on the "work order" column. See linked image below. I am currently using pandas to take data from four different data sources and create a dataframe that has all the relevant data I want based on each work order number. I have tried various methods including groupby, merge, join, and others without any good results.
How I want my dataframe to look in the end
I essentially want to groupby the work order value, merge all the site names into a single value, then have all data essentially condense to a single row. If there is identical data in a column then I just want it to merge together. If there are values in a column that are different (such as in "Operator Ack Timestamp") then I don't mind the data being a continuous string of data (ex. one date after the next within the same cell).
example dataframe data:
import pandas as pd

df = pd.DataFrame({'Work Order': [10025, 10025, 10025, 10025],
                   'Site': ['SC1', 'SC1', 'SC1', 'SC1'],
                   'Description_1': ['', '', 'Inverter 10A-1 - No Comms', ''],
                   'Description_2': ['', '', 'Inverter 10A-1 - No Comms', ''],
                   'Description_3': ['Inverter 10A-1 has lost communications.', '', '', ''],
                   'Failure Type': ['', '', 'Communications', ''],
                   'Failure Class': ['', '', '2', ''],
                   'Start of Fault': ['', '', '2021-05-30 06:37:00', ''],
                   'Operator Ack Timestamp': ['2021-05-30 8:49:21', '', '2021-05-30 6:47:57', ''],
                   'Timestamp of Notification': ['2021-05-30 07:18:58', '', '', ''],
                   'Actual Start Date': ['', '2021-05-30 6:37:00', '', '2021-05-30 6:37:00'],
                   'Actual Start Time': ['', '06:37:00', '', '06:37:00'],
                   'Actual End Date': ['', '2021-05-30 08:24:00', '', ''],
                   'Actual End Time': ['', '08:24:00', '', '']})
df.head()
4 steps to get the expected output:
1. Replace empty values with pd.NA,
2. Group your data by the Work Order column, because it seems to be the index key,
3. For each group, fill NA values with the last valid observation and keep the last record,
4. Reset the index to get the same format as the input.
I chose to group by "Work Order" because it seems to be the key of your dataframe, so first make it the index:
df = df.set_index("Work Order")
out = df.replace({'': pd.NA}) \
        .groupby("Work Order", as_index=False) \
        .apply(lambda x: x.ffill().tail(1)) \
        .reset_index(level=0, drop=True)
>>> out.T # transpose for better visualisation
Work Order 10025
Site SC1
Description_1 Inverter 10A-1 - No Comms
Description_2 Inverter 10A-1 - No Comms
Description_3 Inverter 10A-1 has lost communications.
Failure Type Communications
Failure Class 2
Start of Fault 2021-05-30 06:37:00
Operator Ack Timestamp 2021-05-30 6:47:57
Timestamp of Notification 2021-05-30 07:18:58
Actual Start Date 2021-05-30 6:37:00
Actual Start Time 06:37:00
Actual End Date 2021-05-30 08:24:00
Actual End Time 08:24:00
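For what it's worth, GroupBy.last() takes the last non-null entry of each column within each group, so a shorter route to essentially the same result could be the sketch below, starting again from the question's df with Work Order still a regular column (pd.NA needs pandas >= 1.0):
out2 = (df.replace({'': pd.NA})
          .groupby('Work Order', as_index=False)
          .last())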

Spark: Filter & withColumn using row values?

I need to create a column called sim_count for every row in my spark dataframe, whose value is the count of all other rows from the dataframe that match some conditions based on the current row's values. Is it possible to access row values while using when?
Is something like this possible? I have already implemented this logic using a UDF, but serialization of the dataframe's rdd map is very costly and I am trying to see if there is a faster alternative to find this count value.
Edit
<Row's col_1 val> refers to the outer scope row I am calculating the count for, NOT the inner scope row inside the df.where. For example, I know this is incorrect syntax, but I'm looking for something like:
df.withColumn('sim_count',
              f.when(
                  f.col("col_1").isNotNull(),
                  (
                      df.where(
                          f.col("price_list").between(f.col("col1"), f.col("col2"))
                      ).count()
                  )
              ).otherwise(f.lit(None).cast(LongType()))
              )
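One UDF-free direction, purely as a sketch and not from the original post: assuming df has a unique key column named "id" and the goal is to count, for each row, the rows whose price_list falls between that row's col1 and col2, a non-equi self-join followed by a per-key count could work:
from pyspark.sql import functions as F
left = df.alias("l")
right = df.alias("r")
# every pair (l, r) where r.price_list lies inside [l.col1, l.col2]
counts = (left.join(right,
                    F.col("r.price_list").between(F.col("l.col1"), F.col("l.col2")),
                    "left")
          .groupBy(F.col("l.id").alias("id"))
          .agg(F.count(F.col("r.price_list")).alias("sim_count")))
result = df.join(counts, on="id", how="left")
A self-join of this shape can still be expensive on large data, so it is worth benchmarking against the UDF version on a realistic sample.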

How to filter out duplicate rows based on some columns in spark dataframe?

Suppose, I have a Dataframe like below:
Here, you can see that transactions 1, 2 and 3 have the same values for columns A, B and C but different values for columns D and E. Column E holds date entries.
For the same A, B and C combination (A=1, B=1, C=1) we have 3 rows. I want to keep only one row, based on the most recent transaction date in column E. But for the most recent date there are 2 transactions, and I want to take just one of them if two or more rows share the same combination of A, B, C and the most recent date in column E.
So my expected output for this combination will be row number 3 or 4 (either will do).
For the same A, B and C combination (A=2, B=2, C=2) we have 2 rows. Based on column E, the most recent date is the one in row number 5, so we just take that row for this combination of A, B and C.
So my expected output for this combination will be row number 5.
So the final output will be (3 and 5) or (4 and 5).
Now, how should I approach this?
I read this:
Both reduceByKey and groupByKey can be used for the same purposes but
reduceByKey works much better on a large dataset. That’s because Spark
knows it can combine output with a common key on each partition before
shuffling the data.
I tried groupBy on columns A, B, C with max on column E. But it can't give me back the full rows when multiple rows are present for the same date.
What is the most optimized approach to solve this? Thanks in advance.
EDIT: I also need to get back my filtered transactions. How can I do that?
I used Spark window functions to get my solution:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val window = Window
  .partitionBy(dataframe("A"), dataframe("B"), dataframe("C"))
  .orderBy(dataframe("E").desc)

val dfWithRowNumber = dataframe.withColumn("row_number", row_number().over(window))
val filteredDf = dfWithRowNumber.filter(dfWithRowNumber("row_number") === 1)
Another option is to link them in several steps. First the aggregated DataFrame:
val aggregatedDF = initialDF.select("A", "B", "C", "E").groupBy("A", "B", "C").agg(max("E").as("E_max"))
Then join the initial and aggregated DataFrames:
initialDF.join(aggregatedDF, List("A", "B", "C"))
If the initial DataFrame comes from Hive, all of this can be simplified.
val initialDF = Seq((1,1,1,1,"2/28/2017 0:00"), (1,1,1,2,"3/1/2017 0:00"),
  (1,1,1,3,"3/1/2017 0:00"), (2,2,2,1,"2/28/2017 0:00"), (2,2,2,2,"2/25/2017 0:00"))
This will miss out on the corresponding column D:
initialDF
  .toDS.groupBy("_1", "_2", "_3")
  .agg(max(col("_5"))).show
In case you want the corresponding column D for the max column:
initialDF.toDS.map(x => (x._1, x._2, x._3, (x._5, x._4))).groupBy("_1", "_2", "_3")
  .agg(max(col("_4")).as("_4")).select(col("_1"), col("_2"), col("_3"), col("_4._2"), col("_4._1")).show
For reduceByKey you can convert the Dataset to a pair RDD and then work off that. It should be faster in case Catalyst is not able to optimise the groupByKey in the first approach. See Rolling your own reduceByKey in Spark Dataset.
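A rough PySpark illustration of that suggestion, purely as a sketch: it assumes a DataFrame df with columns A, B, C, D, E, an existing SparkSession named spark, and that column E compares correctly as-is (string dates like those above would need parsing first):
pair_rdd = df.rdd.map(lambda r: ((r['A'], r['B'], r['C']), r))
# per (A, B, C) key, keep the row with the greatest E
latest = (pair_rdd
          .reduceByKey(lambda a, b: a if a['E'] >= b['E'] else b)
          .values())
result = spark.createDataFrame(latest)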
