GroupBy dataframe column without aggregation and set not null values - apache-spark

I have a dataframe having records like below:
+---+----+----+
|id |L1  |L2  |
+---+----+----+
|101|202 |null|
|101|null|303 |
+---+----+----+
Is there a simple way to groupBy and get a result like the one below in Spark SQL:
+---+----+----+
|id |L1  |L2  |
+---+----+----+
|101|202 |303 |
+---+----+----+
Thanks.

Use max or min to aggregate the data. Since you only have a single valid value, this is the one that will be selected. Note that it's not possible to use first here (which is faster) since that can still return null values.
When the columns are of numeric types it can be solved as follows:
df.groupBy("id").agg(max($"L1").as("L1"), max($"L2").as("L2"))
However, if you are dealing with strings, you can instead collect all values as a list (which drops the nulls) and then take its first element:
df.groupBy("id")
.agg(collect_list($"L1")(0).as("L1"), collect_list($"L2")(0).as("L2"))
Of course, this assumes that the nulls are not strings but actual nulls.
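For reference, here is a minimal PySpark sketch of the same max-based approach, assuming an existing SparkSession named spark (the data and column names are taken from the question):
import pyspark.sql.functions as F

df = spark.createDataFrame([(101, 202, None), (101, None, 303)], ["id", "L1", "L2"])
# max ignores nulls, so the single non-null value in each group is kept
df.groupBy("id").agg(F.max("L1").alias("L1"), F.max("L2").alias("L2")).show()
# +---+---+---+
# | id| L1| L2|
# +---+---+---+
# |101|202|303|
# +---+---+---+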

Related

Pyspark AND/ALSO Partition Column Query

How do you perform an AND/ALSO query in pyspark? I want both conditions to be met for results to be filtered.
Original dataframe:
df.count()
4105
The first condition does not filter out any records:
df.filter((df.created_date != 'add')).count()
4105
Therefore I would expect the AND clause here to return 4105, but instead it continues to filter on df.lastUpdatedDate:
df.filter((df.created_date != 'add') & (df.lastUpdatedDate != '2022-12-21')).count()
3861
To me 3861 is the result of an OR clause. How do I address this? lastUpdatedDate is a partition filter based on .explain() so maybe that has something to do with these results?
...PartitionFilters: [isnotnull(lastUpdatedDate#26), NOT
(lastUpdatedDate#26 = 2022-12-21)], PushedFilters:
[IsNotNull(created_date), Not(EqualTo(created_date,add))], ReadSchema ...
Going by our conversation in the comments, your requirement is to filter out rows where (df.created_date != 'add') & (df.lastUpdatedDate != '2022-12-21').
Your confusion seems to come from the name of the method, filter; rather, consider it as where.
where and filter work, by definition, like the SQL WHERE clause: they retain the rows for which the expression returns true and drop the rest.
For example, consider a dataframe:
+---+-----+-------+
| id| Name|Subject|
+---+-----+-------+
| 1| Sam|History|
| 2| Amy|History|
| 3|Harry| Maths|
| 4| Jake|Physics|
+---+-----+-------+
The filter below returns a new DataFrame with only the rows where Subject is History, i.e. where the expression returned true (rows where it is false are filtered out):
rightDF.filter(rightDF.col("Subject").equalTo("History")).show();
OR
rightDF.where(rightDF.col("Subject").equalTo("History")).show();
Output:
+---+----+-------+
| id|Name|Subject|
+---+----+-------+
| 1| Sam|History|
| 2| Amy|History|
+---+----+-------+
So in your case, you would want to negate the statements to get the results you desire, i.e. use equal to instead of not equal to:
df.filter((df.created_date == 'add') & (df.lastUpdatedDate == '2022-12-21')).count()
This means: keep the rows where both statements are true and filter out the rows where either of them is false.
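For illustration, here is a short sketch of the possible variants, assuming the df and column names from the question:
# Keeps rows where BOTH inequalities hold (this is what the original filter does)
df.filter((df.created_date != 'add') & (df.lastUpdatedDate != '2022-12-21')).count()
# Keeps rows where BOTH equalities hold (the suggestion above)
df.filter((df.created_date == 'add') & (df.lastUpdatedDate == '2022-12-21')).count()
# To instead drop only the rows that match BOTH equalities, negate the whole conjunction
df.filter(~((df.created_date == 'add') & (df.lastUpdatedDate == '2022-12-21'))).count()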
Let me know if this works or if you need anything else.

Can't fathom how aggregate distinct count works?

I want to calculate the number of distinct rows according to one column.
I see that the following works :
long countDistinctAtt = Math.toIntExact(dataset.select(att).distinct().count());
But this doesn't :
long countDistinctAtt = dataset.agg(countDistinct(att)).agg(count("*")).collectAsList().get(0).getLong(0);
Why does the second solution not calculate the number of distinct rows?
The second command needs the rows to be grouped with groupBy before any aggregation with agg takes place; as written it does not specify what rows the aggregation(s) should be computed over, so it will not do what you expect.
The main problem with the second command, though, is that even with the rows grouped and their values aggregated by a column, the results are produced per (grouped and aggregated) row rather than for the entire DataFrame/DataSet: with that kind of logic you are telling the engine to count the occurrences of a value for each group. The result is therefore a column/list of values instead of a single total count, with one element per aggregated row. Taking the first of those values (get(0)) does not really make sense here, because even if the command ran, you would only get the count of a single row.
The first command bypasses the hassles by specifying that we only want the distinct values of the selected column, so you can count these values up and find the total number of them. This will result in just one value (which is long and you correctly cast it to int).
As a rule of thumb, 9 times out of 10 you should use groupBy/agg when you want row-based (per-group) computations. If you do not really care about individual rows and just want a single total for the whole DataFrame/DataSet, you can use Spark's built-in SQL functions (all of them are listed in the Spark SQL functions documentation, and you can study their Java/Scala/Python usage in each function's documentation), as in the first command.
To illustrate this, let's say we have a DataFrame (or DataSet, doesn't matter at this point) named dfTest with the following data:
+------+------+
|letter|number|
+------+------+
| a| 5|
| b| 8|
| c| 14|
| d| 20|
| e| 8|
| f| 8|
| g| 20|
+------+------+
If we use the basic built-in SQL functions to select the number column values, filter out the duplicates, and count the remaining rows, the command will correctly put out 4, because there are indeed 4 unique values in number:
// In Scala:
println(dfTest.select("number").distinct().count().toInt)
// In Java:
System.out.println(Math.toIntExact(dfTest.select("number").distinct().count()));
// Output:
4
In contrast, if we group the DataFrame rows by number and count the rows of each group (no need to use agg here, since count can be called directly on the grouped data), the result is the following DataFrame, where a count is calculated for each distinct value of the number column:
// In Scala & Java:
dfTest.groupBy("number")
.count()
.show()
// Output:
+------+-----+
|number|count|
+------+-----+
| 20| 2|
| 5| 1|
| 8| 3|
| 14| 1|
+------+-----+
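The same contrast in PySpark, as a small sketch (assuming an existing SparkSession named spark):
dfTest = spark.createDataFrame(
    [("a", 5), ("b", 8), ("c", 14), ("d", 20), ("e", 8), ("f", 8), ("g", 20)],
    ["letter", "number"])

# One value for the whole DataFrame: the number of distinct numbers
print(dfTest.select("number").distinct().count())  # 4

# One count per group: a row for each distinct value of number
dfTest.groupBy("number").count().show()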

Eliminate null value rows for a specific column while doing partitionBy column in pyspark

I have a pyspark dataframe like this:
+---+-----+-----+
| id| name|state|
+---+-----+-----+
|111| null|   CT|
|222|name1|   CT|
|222|name2|   CT|
|333|name3|   CT|
|333|name4|   CT|
|333| null|   CT|
+---+-----+-----+
For a given id, I would like to keep the record even though "name" is null, as long as that id is not repeated. But if the id is repeated, then I would like to check the name column, make sure it does not contain duplicates within that id, and also remove rows where "name" is null, ONLY for the repeated ids. Below is the desired output:
+---+-----+-----+
| id| name|state|
+---+-----+-----+
|111| null|   CT|
|222|name1|   CT|
|222|name2|   CT|
|333|name3|   CT|
|333|name4|   CT|
+---+-----+-----+
How can I achieve this in PySpark?
You can do this by grouping on the id column and counting the number of names in each group. Null values are ignored by count by default in Spark, so any group that has a count of 0 should be kept. We can then filter away any nulls in groups with a count larger than 0.
In Scala this can be done with a window function as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

val w = Window.partitionBy("id")
val df2 = df.withColumn("gCount", count($"name").over(w))
  .filter($"name".isNotNull or $"gCount" === 0)
  .drop("gCount")
The PySpark equivalent:
from pyspark.sql import Window
from pyspark.sql.functions import col, count

w = Window.partitionBy("id")
df2 = (df.withColumn("gCount", count("name").over(w))
         .filter((col("name").isNotNull()) | (col("gCount") == 0))
         .drop("gCount"))
The above will not remove rows that have multiple nulls for the same id (all these will be kept).
If these should be removed as well, keeping only a single row with name==null, an easy way would be to use .dropDuplicates(['id','name']) before or after running the above code. Note that this also will remove any other duplicates (in which case .dropDuplicates(['id','name', 'state']) could be preferable).
I think you can do that in two steps. First, count values by id:
import pyspark.sql.window as psw
import pyspark.sql.functions as psf

w = psw.Window.partitionBy("id")
df = df.withColumn("n", psf.sum(psf.lit(1)).over(w))
Then filter to remove nulls when n > 1:
df.filter(~((psf.col('name').isNull()) & (psf.col('n') > 1)))
Edit
As mentioned by @Shubham Jain, if you have several null values for name (duplicates), the above filter will keep them. In that case, the solution proposed by @Shaido is useful: add a post-treatment using .dropDuplicates(['id','name']), or .dropDuplicates(['id','name','state']), depending on your preference.
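Putting the window-count filter and the dropDuplicates post-treatment together, a minimal PySpark sketch (using the column names from the question) might look like this:
from pyspark.sql import Window
from pyspark.sql.functions import col, count

w = Window.partitionBy("id")
result = (df.withColumn("gCount", count("name").over(w))                 # non-null names per id
            .filter((col("name").isNotNull()) | (col("gCount") == 0))    # keep nulls only for ids with no names
            .drop("gCount")
            .dropDuplicates(["id", "name", "state"]))                    # collapse any remaining duplicate rows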

How to select distinct rows from a Spark Window partition

I have a sample DF with duplicate rows like this:
+--------------------+--------------------+----+-----------+-------+----------+
|ID                  |CL_ID               |NBR |DT         |TYP    |KEY       |
+--------------------+--------------------+----+-----------+-------+----------+
|1000031075_20190422 |10017157594301072477|10  |2019-04-24 |N      |0000000000|
|1000031075_20190422 |10017157594301072477|10  |2019-04-24 |N      |0000000000|
|1006473016_20190421 |10577157412800147475|11  |2019-04-21 |N      |0000000000|
|1006473016_20190421 |10577157412800147475|11  |2019-04-21 |N      |0000000000|
+--------------------+--------------------+----+-----------+-------+----------+
val w = Window.partitionBy($"ENCOUNTER_ID")
Using the above Spark Window partition, is it possible to select distinct rows? I am expecting the output DF as:
+--------------------+--------------------+----+-----------+-------+----------+
|ID                  |CL_ID               |NBR |DT         |TYP    |KEY       |
+--------------------+--------------------+----+-----------+-------+----------+
|1000031075_20190422 |10017157594301072477|10  |2019-04-24 |N      |0000000000|
|1006473016_20190421 |10577157412800147475|11  |2019-04-21 |N      |0000000000|
+--------------------+--------------------+----+-----------+-------+----------+
I don't want to use DF.DISTINCT or DF.DROPDUPLICATES as it would involve shuffling.
I prefer not to use lag or lead because, in real-time, the order of rows can't be guaranteed.
Window functions also shuffle data, so if all of your columns are duplicated, df.dropDuplicates will be the better option to use. If your use case requires a Window function, you can use the approach below.
scala> df.show()
+-------------------+--------------------+---+----------+---+----------+
|                 ID|               CL_ID|NBR|        DT|TYP|       KEY|
+-------------------+--------------------+---+----------+---+----------+
|1000031075_20190422|10017157594301072477| 10|2019-04-24|  N|0000000000|
|1000031075_20190422|10017157594301072477| 10|2019-04-24|  N|0000000000|
|1006473016_20190421|10577157412800147475| 11|2019-04-21|  N|0000000000|
|1006473016_20190421|10577157412800147475| 11|2019-04-21|  N|0000000000|
+-------------------+--------------------+---+----------+---+----------+
// In partitionBy, use the columns that need to be checked for duplicates, along with an appropriate orderBy; a sample Window is used here
scala> val W = Window.partitionBy(col("ID"),col("CL_ID"),col("NBR"),col("DT"), col("TYP"), col("KEY")).orderBy(lit(1))
scala> df.withColumn("duplicate", when(row_number.over(W) === lit(1), lit("Y")).otherwise(lit("N")))
.filter(col("duplicate") === lit("Y"))
.drop("duplicate")
.show()
+-------------------+--------------------+---+----------+---+----------+
|                 ID|               CL_ID|NBR|        DT|TYP|       KEY|
+-------------------+--------------------+---+----------+---+----------+
|1000031075_20190422|10017157594301072477| 10|2019-04-24|  N|0000000000|
|1006473016_20190421|10577157412800147475| 11|2019-04-21|  N|0000000000|
+-------------------+--------------------+---+----------+---+----------+
An answer to your question that scales up well with big data:
df.dropDuplicates(['ID'])  # include your key columns here, ID in this case
Window functions shuffle data, but if you have duplicate entries and want to choose which one to keep, for example, or want to sum the values of the duplicates, then a window function is the way to go:
w = Window.partitionBy('id')
df.withColumn('value_col', first('value_col').over(w))  # you can use max, min, sum, first or last depending on how you want to treat the duplicates
An interesting third possibility, if you want to keep the values of the duplicates (for the record), is the one below:
df.withColumn('dup_values', collect_list('value_col').over(w))
This will create an extra column holding an array per row, so you keep the duplicate values even after you have got rid of the duplicate rows.
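A rough PySpark sketch combining that third idea with the deduplication itself, using the sample columns from the question (the column collected into dup_values is only illustrative):
from pyspark.sql import Window
from pyspark.sql.functions import col, collect_list, lit, row_number

key_cols = ["ID", "CL_ID", "NBR", "DT", "TYP", "KEY"]        # columns that define a duplicate
w_all = Window.partitionBy(*key_cols)                        # whole partition, for collect_list
w_ord = Window.partitionBy(*key_cols).orderBy(lit(1))        # ordered, as required by row_number

deduped = (df.withColumn("dup_values", collect_list("NBR").over(w_all))  # stash the duplicates' values
             .withColumn("rn", row_number().over(w_ord))
             .filter(col("rn") == 1)
             .drop("rn"))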

how to get first value and last value from dataframe column in pyspark?

I have a DataFrame, and I want to get the first value and the last value from a DataFrame column.
+----+-----+--------------------+
|test|count|             support|
+----+-----+--------------------+
| A| 5| 0.23809523809523808|
| B| 5| 0.23809523809523808|
| C| 4| 0.19047619047619047|
| G| 2| 0.09523809523809523|
| K| 2| 0.09523809523809523|
| D| 1|0.047619047619047616|
+----+-----+--------------------+
The expected output is the first and last value of the support column, i.e. x = [0.23809523809523808, 0.047619047619047616].
You may use collect but the performance is going to be terrible since the driver will collect all the data, just to keep the first and last items. Worse than that, it will most likely cause an OOM error and thus not work at all if you have a big dataframe.
Another idea would be to use agg with the first and last aggregation function. This does not work! (because the reducers do not necessarily get the records in the order of the dataframe)
Spark offers a head function, which makes getting the first element very easy. However, Spark does not offer any last function. A straightforward approach would be to sort the dataframe backward and use the head function again.
first=df.head().support
import pyspark.sql.functions as F
last=df.orderBy(F.monotonically_increasing_id().desc()).head().support
Finally, since it is a shame to sort a dataframe simply to get its first and last elements, we can use the RDD API and zipWithIndex to index the dataframe and only keep the first and the last elements.
size = df.count()
df.rdd.zipWithIndex()\
.filter(lambda x : x[1] == 0 or x[1] == size-1)\
.map(lambda x : x[0].support)\
.collect()
You can try indexing the data frame; see the example below:
df = <your dataframe>
first_record = df.collect()[0]
last_record = df.collect()[-1]
EDIT:
You have to pass the column name as well.
df = <your dataframe>
first_record = df.collect()[0]['column_name']
last_record = df.collect()[-1]['column_name']
Since version 3.0.0, Spark also has a DataFrame function called .tail() to get the last value.
It returns a list of Row objects:
last=df.tail(1)[0].support
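So on Spark 3.0+, one way to assemble the expected list (a small sketch based on the example DataFrame above) is:
x = [df.head().support, df.tail(1)[0].support]
# [0.23809523809523808, 0.047619047619047616]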
