Can't fathom how aggregate distinct count works? - apache-spark

I want to calculate the number of distinct rows according to one column.
I see that the following works:
long countDistinctAtt = Math.toIntExact(dataset.select(att).distinct().count());
But this doesn't:
long countDistinctAtt = dataset.agg(countDistinct(att)).agg(count("*")).collectAsList().get(0).getLong(0);
Why does the second solution not calculate the number of distinct rows?

The second command needs a grouping of rows with a groupBy method before any agg aggregation takes place. This particular command does not specify which rows the aggregation(s) should be computed over, so as written it does not do what you expect.
The main problem with the second command, though, is that even with grouping and aggregating the rows based on a column, the results are computed per group (i.e. with that kind of logic you ask for the number of occurrences of a value within each group) rather than over the entire DataFrame/Dataset. This means the result is a column of values instead of a single total count, because each element corresponds to one group. Taking the first of those values with get(0) doesn't really make sense here, because even if the command ran, you would only get the count of a single group.
The first command avoids these issues by selecting only the distinct values of the chosen column, so you can count them up and get their total number. This results in just one value (a long, which you correctly convert to an int).
As a rule of thumb, 9 times out of 10 you should use groupBy/agg when you want per-group computations. When you do not really care about groups and just want a total result for the whole DataFrame/Dataset, you can use Spark's built-in SQL functions (you can find all of them here, with documentation for their Java/Scala/Python implementations) as in the first command.
To illustrate this, let's say we have a DataFrame (or DataSet, doesn't matter at this point) named dfTest with the following data:
+------+------+
|letter|number|
+------+------+
| a| 5|
| b| 8|
| c| 14|
| d| 20|
| e| 8|
| f| 8|
| g| 20|
+------+------+
If we use the basic built-in SQL functions to select the number column values, filter out the duplicates, and count the remaining rows, the command correctly prints 4, because there are indeed 4 unique values in number:
// In Scala:
println(dfTest.select("number").distinct().count().toInt)
// In Java:
System.out.println(Math.toIntExact(dfTest.select("number").distinct().count()));
// Output:
4
By contrast, if we group the DataFrame rows and count the rows within each group (no need to use agg here, since the grouped count method does this directly), the result is the following DataFrame, where the count is calculated separately for each distinct value of the number column:
// In Scala & Java:
dfTest.groupBy("number")
.count()
.show()
// Output:
+------+-----+
|number|count|
+------+-----+
| 20| 2|
| 5| 1|
| 8| 3|
| 14| 1|
+------+-----+
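For completeness, the distinct count can also be computed as a single global aggregation with the built-in countDistinct function, which returns a one-row result holding the total. Below is a minimal PySpark sketch on the dfTest data above, assuming an active spark session (the same function exists in the Java/Scala API as functions.countDistinct):
# In PySpark:
from pyspark.sql import functions as F
# countDistinct as a global aggregation: one row, one column holding the total
distinct_count = dfTest.agg(F.countDistinct("number").alias("n")).first()["n"]
print(distinct_count)
# Output:
4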

Related

Pyspark AND/ALSO Partition Column Query

How do you perform an AND/ALSO query in pyspark? I want both conditions to be met for results to be filtered.
Original dataframe:
df.count()
4105
The first condition does not find any records:
df.filter((df.created_date != 'add')).count()
4105
Therefore I would expect the AND clause here to return 4105, but instead it continues to filter on df.lastUpdatedDate:
df.filter((df.created_date != 'add') & (df.lastUpdatedDate != '2022-12-21')).count()
3861
To me 3861 is the result of an OR clause. How do I address this? lastUpdatedDate is a partition filter based on .explain() so maybe that has something to do with these results?
...PartitionFilters: [isnotnull(lastUpdatedDate#26), NOT
(lastUpdatedDate#26 = 2022-12-21)], PushedFilters:
[IsNotNull(created_date), Not(EqualTo(created_date,add))], ReadSchema ...
Going by our conversation in the comments - your requirement is to filter out rows where (df.created_date != 'add') & (df.lastUpdatedDate != '2022-12-21').
Your confusion seems to come from the name of the method, i.e. filter; consider it as where instead.
where or filter, by definition, works the same as a SQL WHERE clause, i.e. it retains the rows where the expression returns true and drops the rest.
For example, consider a DataFrame:
+---+-----+-------+
| id| Name|Subject|
+---+-----+-------+
| 1| Sam|History|
| 2| Amy|History|
| 3|Harry| Maths|
| 4| Jake|Physics|
+---+-----+-------+
The filter below returns a new DataFrame with only the rows where Subject is History, i.e. where the expression returned true (rows where it is false are filtered out):
rightDF.filter(rightDF.col("Subject").equalTo("History")).show();
OR
rightDF.where(rightDF.col("Subject").equalTo("History")).show();
Output:
+---+----+-------+
| id|Name|Subject|
+---+----+-------+
| 1| Sam|History|
| 2| Amy|History|
+---+----+-------+
So in your case, you would want to negate the statements to get the results you desire,
i.e. use equal to instead of not equal to:
df.filter((df.created_date == 'add') & (df.lastUpdatedDate == '2022-12-21')).count()
This keeps rows where both statements are true and filters out rows where either of them is false.
Let me know if this works, or if you need anything else.
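If it helps to see the semantics, here is a small PySpark sketch with made-up data (assuming an active spark session) showing that & keeps only the rows where both conditions hold, which is exactly the same as chaining two filter calls; that is why adding the lastUpdatedDate condition drops extra rows even though the created_date condition alone matched everything:
# Made-up rows just to illustrate how chained conditions drop rows
data = [("x", "2022-12-20"), ("x", "2022-12-21"), ("add", "2022-12-21")]
df = spark.createDataFrame(data, ["created_date", "lastUpdatedDate"])
# & keeps only rows where BOTH conditions are true ...
both = df.filter((df.created_date != 'add') & (df.lastUpdatedDate != '2022-12-21'))
# ... which is the same as applying the two filters one after the other
chained = df.filter(df.created_date != 'add').filter(df.lastUpdatedDate != '2022-12-21')
print(both.count(), chained.count())
# Output:
1 1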

simplify multiple (30 columns) column complex pyspark aggregation in one go

I have a sample spark df as below:
df = spark.createDataFrame([[1, 'a', 'b', 'c'],
                            [1, 'b', 'c', 'b'],
                            [1, 'b', 'a', 'b'],
                            [2, 'c', 'a', 'a'],
                            [3, 'b', 'b', 'a']], ['id', 'field1', 'field2', 'field3'])
What I need next is multiple aggregations summarizing the a, b, c values for each field. I have a working but tedious process as below:
agg_table = (
    df
    .groupBy('id')
    .agg(
        # field1
        sum(when(col('field1') == 'a', 1).otherwise(0)).alias('field1_a_count')
        ,sum(when(col('field1') == 'b', 1).otherwise(0)).alias('field1_b_count')
        ,sum(when(col('field1') == 'c', 1).otherwise(0)).alias('field1_c_count')
        # field2
        ,sum(when(col('field2') == 'a', 1).otherwise(0)).alias('field2_a_count')
        ,sum(when(col('field2') == 'b', 1).otherwise(0)).alias('field2_b_count')
        ,sum(when(col('field2') == 'c', 1).otherwise(0)).alias('field2_c_count')
        # field3
        ,sum(when(col('field3') == 'a', 1).otherwise(0)).alias('field3_a_count')
        ,sum(when(col('field3') == 'b', 1).otherwise(0)).alias('field3_b_count')
        ,sum(when(col('field3') == 'c', 1).otherwise(0)).alias('field3_c_count')
    ))
What I am expecting to get is this:
agg_table = (['id':'1','2','3'],
['field1_a_count':1,0,0],
['field1_b_count':2,0,1],
['field1_c_count':0, 1, 0],
['field2_a_count':1,1,0],
['field2_b_count':1,0,1],
['field2_c_count':1,0,0],
['field3_a_count':0,1,1],
['field3_b_count':2,0,0],
['field3_c_count':1,0,0])
This is just fine if I only have 3 fields, but I have 30 fields with varying/custom names. Maybe somebody can help me with the repetitive task of coding the aggregated sum per field. I tried playing around with a suggestion from:
https://danvatterott.com/blog/2018/09/06/python-aggregate-udfs-in-pyspark/
I can make it work if I will only pull one column and one value, but I get varying errors, one of them is:
AnalysisException: cannot resolve '`value`' given input columns: ['field1','field2','field3']
One last line I tried is using:
validated_cols = ['field1','field2','field3']
df.select(validated_cols).groupBy('id').agg(collect_list($'field1_a_count',$'field1_b_count',$'field1_c_count', ...
$'field30_c_count')).show()
Output: SyntaxError: invalid syntax
I tried pivot too, but from my searches so far it seems to be good for only one column. I tried this for multiple columns:
df.withColumn("p", concat($"p1", $"p2"))
.groupBy("a", "b")
.pivot("p")
.agg(...)
I still get a syntax error.
Another link I tried: https://danvatterott.com/blog/2019/02/05/complex-aggregations-in-pyspark/
I also tried the exprs approach: exprs1 = {x: "sum" for x in df.columns if x != 'id'}
Any suggestions will be appreciated. Thanks.
Let me answer your question in two steps. First, you are wondering whether it is possible to avoid hard coding all your aggregations. It is. I would do it like this:
from pyspark.sql import functions as f
# let's assume that this is known, but we could compute it as well
values = ['a', 'b', 'c']
# All the columns except the id
cols = [c for c in df.columns if c != 'id']
def count_values(column, value):
    return f.sum(f.when(f.col(column) == value, 1).otherwise(0))\
            .alias(f"{column}_{value}_count")
# And this gives you the result of your hard coded aggregations:
df\
    .groupBy('id')\
    .agg(*[count_values(c, value) for c in cols for value in values])\
    .show()
But that is not what you expect, right? You are trying to compute some kind of pivot on the id column. To do this, I would not use the previous result, but just work the data differently. I would start by replacing all the columns of the dataframe except id (which is renamed to x) with an array of values of the form {column_name}_{value}_count, and then explode that array. From there, we just need to compute a simple pivot on the former id column (renamed x), grouped by the values contained in the exploded array.
df\
    .select(f.col('id').alias('x'), f.explode(
        f.array(
            [f.concat_ws('_', f.lit(c), f.col(c), f.lit('count')).alias(c)
             for c in cols]
        )
    ).alias('id'))\
    .groupBy('id')\
    .pivot('x')\
    .count()\
    .na.fill(0)\
    .orderBy('id')\
    .show()
which yields:
+--------------+---+---+---+
| id| 1| 2| 3|
+--------------+---+---+---+
|field1_a_count| 1| 0| 0|
|field1_b_count| 2| 0| 1|
|field1_c_count| 0| 1| 0|
|field2_a_count| 1| 1| 0|
|field2_b_count| 1| 0| 1|
|field2_c_count| 1| 0| 0|
|field3_a_count| 0| 1| 1|
|field3_b_count| 2| 0| 0|
|field3_c_count| 1| 0| 0|
+--------------+---+---+---+
Update
Based on discussion in the comments, I think this question is a case of an X-Y problem. The task at hand is something that is seen very frequently in the world of Data Engineering and ETL development: how to partition and then quantify good and bad records.
In the case where the data is being prepared to load to a data warehouse / hadoop ecosystem, the usual pattern is to take the raw input and load it to a dataframe, then apply transformations & validations that partition the data into "The Good, The Bad, and The Ugly":
The first partition, hopefully the largest, contains records that are successfully transformed and which pass validation. These will go on to be persisted in durable storage and certified for use in analytics.
The second partition contains records that were successfully transformed but which failed during QA. The QA rules should include checks for illegal nulls, string pattern matching (like phone number format), etc...
The third partition is for records that are rejected early in the process because they failed on a transformation step. Examples include fields that contain non-number values that are cast to numeric types, text fields that exceed the maximum length, or strings that contain control characters that are not supported by the database.
The goal should not be to generate counts for each of these 3 classifications across every column and for every row. Trying to do that is counterproductive. Why? Because when a transformation step or QA check fails for a given record, that entire record should be rejected immediately and sent to a separate output stream to be analyzed later. Each row in the data set should be treated as just that: a single record. It isn't possible for a single field to fail and still have the complete record pass, which makes metrics at this granularity unnecessary. What action will you take knowing that 100 rows passed on the "address" field? For valid records, all that matters is the total number that passed for every column. Otherwise, it wouldn't be a valid record.
With that said, remember that the goal is to build a usable and cleansed data set; analyzing the rejected records is a secondary task and can be done offline.
It is common practice to add a field to the rejected data to indicate which column caused the failure. That makes it easy to troubleshoot any malformed data, so there is really no need to generate counts across all columns, even for bad records. Instead, just review the rejected data after the main job finishes, and address the problems. Continue doing that iteratively until the number of rejected records is below whatever threshold you think is reasonable, and then continue to monitor it going forward.
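As a rough sketch of that pattern (the df, column names, and rules below are hypothetical, made up for illustration), the validation rules can be declared once, the first failing rule recorded in a reject_reason column, and the DataFrame split into a clean output and a rejects output:
from pyspark.sql import functions as F
# Hypothetical rules: each maps a reason label to a boolean "this record is bad" expression
rules = {
    "null_id": F.col("id").isNull(),
    "non_numeric_amount": F.col("amount").cast("double").isNull(),
}
# The first failing rule (if any) becomes the reject reason; null means the record is clean
reason = F.coalesce(*[F.when(cond, F.lit(name)) for name, cond in rules.items()])
flagged = df.withColumn("reject_reason", reason)
good = flagged.filter(F.col("reject_reason").isNull()).drop("reject_reason")
rejected = flagged.filter(F.col("reject_reason").isNotNull())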
Old answer
This is a sign of a design flaw in the data. Whatever the "field1", "field2", etc... columns actually represent, it appears they are all related, in the sense that the values quantify some attribute (maybe each one is a count for a specific merchandise ID, or the number of people with a certain property...). The problem is that these fields are being added as individual columns on a fact table1, which then needs to be aggregated, resulting in the situation that you're facing.
A better design would be to collapse those "field1", "field2", etc... columns into a single code field that can be used as the GROUP BY field when doing the aggregation. You might want to consider creating a separate table to do this if the existing one has many other columns and making this change would alter the grain in a way that might cause other problems.
1: it's usually a big red flag to have a table with a bunch of enumerated columns with the same name and purpose. I've even seen cases where someone has created tables with "spare" columns for when they want to add more attributes later. Not good.
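To make the suggested redesign concrete, here is a hedged PySpark sketch that unpivots the question's field1/field2/field3 columns into a single (field, value) pair per cell, after which the aggregation is a plain groupBy on those columns instead of 90 hand-written expressions:
from pyspark.sql import functions as F
# Unpivot the enumerated columns into one (field, value) row per original cell
long_df = df.select(
    "id",
    F.expr("stack(3, 'field1', field1, 'field2', field2, 'field3', field3) as (field, value)")
)
# A single GROUP BY now replaces the per-column/per-value aggregations
long_df.groupBy("id", "field", "value").count().show()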

PySpark - Timestamp behavior

I'm trying to understand the behaviour differences between pyspark.sql.functions.current_timestamp() and datetime.now().
If I create a Spark dataframe in DataBricks using these 2 mechanisms to create a timestamp column, everything works nicely as expected....
from datetime import date, datetime
import pyspark.sql.functions as F
curDate2 = spark.range(10)\
    .withColumn("current_date_lit", F.lit(date.today()))\
    .withColumn("current_timestamp_lit", F.lit(F.current_timestamp()))\
    .withColumn("current_timestamp", F.current_timestamp())\
    .withColumn("now", F.lit(datetime.now()))
+---+----------------+---------------------+--------------------+--------------------+
| id|current_date_lit|current_timestamp_lit| current_timestamp| now|
+---+----------------+---------------------+--------------------+--------------------+
| 0| 2022-02-12| 2022-02-12 16:40:...|2022-02-12 16:40:...|2022-02-12 16:40:...|
| 1| 2022-02-12| 2022-02-12 16:40:...|2022-02-12 16:40:...|2022-02-12 16:40:...|
| 2| 2022-02-12| 2022-02-12 16:40:...|2022-02-12 16:40:...|2022-02-12 16:40:...|
+---+----------------+---------------------+--------------------+--------------------+
However, when I then call show() on the dataframe a couple of minutes later, the columns based on current_timestamp() show me the time NOW (16:44) whilst the datetime.now() column shows me the timestamp from the first creation of the dataframe (16:40).
Clearly one column holds a literal value and the other evaluates the function at runtime, but I'm at a loss to understand why they behave differently.
show() a few mins later...
+---+----------------+---------------------+--------------------+--------------------+
| id|current_date_lit|current_timestamp_lit| current_timestamp| now|
+---+----------------+---------------------+--------------------+--------------------+
| 0| 2022-02-12| 2022-02-12 16:44:...|2022-02-12 16:44:...|2022-02-12 16:40:...|
| 1| 2022-02-12| 2022-02-12 16:44:...|2022-02-12 16:44:...|2022-02-12 16:40:...|
| 2| 2022-02-12| 2022-02-12 16:44:...|2022-02-12 16:44:...|2022-02-12 16:40:...|
+---+----------------+---------------------+--------------------+--------------------+
Thanks - I hope this makes sense!
Good question; I tried it out with the rand() function just to check. It is sort of intuitive, but at the same time an Action without a prior .cache() applied to the data would lead one to believe that a new round means a new set of results.
show() is an Action with some smarts. Here it is based on the same underlying RDD, and logically one would expect a deterministic outcome - at least I think so.
However, F.current_timestamp() is re-evaluated for each query, so two successive show()'s will show 2 different times. The other answer states that and points to the docs. So that is the exception, which is why I also tried with rand(). See below.
datetime.now() is held constant by Spark - see WholeStageCodegen - which is just how it works for the same underlying DF; the first lit value still applies because the preceding definition of the DF (underlying RDD) still exists. I did a check with rand(), and all successive show() Actions return the same sequence of random numbers - the same seed is used. This emulates deterministic behaviour, which is what we would want with 2 successive show()'s.
With a new DF with the same name, that is re-evaluated too, obviously.
You can try and see what happens if you use .cache().
The range(10) example is contrived, of course.
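If you want to try the .cache() suggestion, a sketch along these lines should pin the timestamps once the cached data has been fully materialized (materializing with count() here, since show() only computes a few rows; assumes an active spark session):
import pyspark.sql.functions as F
cached = spark.range(10).withColumn("ts", F.current_timestamp()).cache()
cached.count()    # materialize the cache; "ts" is now fixed in memory
cached.show(3)    # some timestamp ...
cached.show(3)    # ... still the same a moment later, because the cached rows are reused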
current_timestamp() returns a TimestampType column, the value of which is evaluated at query time as described in the docs. So it is 'computed' each time you call show:
Returns the current timestamp at the start of query evaluation as a
TimestampType column. All calls of current_timestamp within the same
query return the same value.
Passing this column to a lit call doesn't change anything; if you check the source code you can see lit simply returns the column you called it with:
return col if isinstance(col, Column) else _invoke_function("lit", col)
If you call lit with something other than a column, e.g. a datetime object, then a new column is created with this literal value. The literal is the datetime object returned from datetime.now(), which is a static value representing the time the datetime.now function was called.
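A minimal PySpark reproduction of the difference described above (with a sleep to make the drift visible; assumes an active spark session):
import time
from datetime import datetime
import pyspark.sql.functions as F
df = spark.range(3)\
    .withColumn("now_literal", F.lit(datetime.now()))\
    .withColumn("current_ts", F.current_timestamp())
df.show(truncate=False)   # both columns show roughly the same time
time.sleep(120)
df.show(truncate=False)   # current_ts moves to the new query time, now_literal does not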

GroupBy dataframe column without aggregation and set not null values

I have a dataframe having records like below:
+---+----+----+
|id |L1 |L2 |
+---+----+----+
|101|202 |null|
|101|null|303 |
+---+----+----+
Is there a simple way to groupBy and get a result like the one below in Spark SQL:
+---+----+----+
|id |L1 |L2 |
+---+----+----+
|101|202 |303 |
+---+----+----+
Thanks.
Use max or min to aggregate the data. Since you only have a single valid value, this is the one that will be selected. Note that it's not possible to use first here (which is faster) since that can still return null values.
When the columns are of numeric types it can be solved as follows:
df.groupBy("id").agg(max($"L1").as("L1"), max($"L2").as("L2"))
However, if you are dealing with strings, you can collect all values as a list (collect_list drops the nulls) and then take the first element:
df.groupBy("id")
  .agg(element_at(collect_list($"L1"), 1).as("L1"), element_at(collect_list($"L2"), 1).as("L2"))
Of course, this assumes that the nulls are not strings but actual nulls.
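For reference, a PySpark sketch of the max approach on the sample data from the question (assumes an active spark session):
from pyspark.sql import functions as F
df = spark.createDataFrame([(101, 202, None), (101, None, 303)], ["id", "L1", "L2"])
# max ignores nulls, so the single non-null value in each group is kept
df.groupBy("id").agg(F.max("L1").alias("L1"), F.max("L2").alias("L2")).show()
# Output: one row with id=101, L1=202, L2=303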

how to get first value and last value from dataframe column in pyspark?

I have a DataFrame, and I want to get the first value and last value from a DataFrame column.
+----+-----+--------------------+
|test|count| support|
+----+-----+--------------------+
| A| 5| 0.23809523809523808|
| B| 5| 0.23809523809523808|
| C| 4| 0.19047619047619047|
| G| 2| 0.09523809523809523|
| K| 2| 0.09523809523809523|
| D| 1|0.047619047619047616|
+----+-----+--------------------+
The expected output is the first and last values of the support column, i.e. x = [0.23809523809523808, 0.047619047619047616].
You may use collect but the performance is going to be terrible since the driver will collect all the data, just to keep the first and last items. Worse than that, it will most likely cause an OOM error and thus not work at all if you have a big dataframe.
Another idea would be to use agg with the first and last aggregation functions. This does not work! (The reducers do not necessarily get the records in the order of the dataframe.)
Spark offers a head function, which makes getting the first element very easy. However, Spark does not offer any last function. A straightforward approach would be to sort the dataframe backward and use the head function again.
first=df.head().support
import pyspark.sql.functions as F
last=df.orderBy(F.monotonically_increasing_id().desc()).head().support
Finally, since it is a shame to sort a dataframe simply to get its first and last elements, we can use the RDD API and zipWithIndex to index the dataframe and only keep the first and the last elements.
size = df.count()
df.rdd.zipWithIndex()\
.filter(lambda x : x[1] == 0 or x[1] == size-1)\
.map(lambda x : x[0].support)\
.collect()
You can try indexing the data frame, see the example below:
df = <your dataframe>
first_record = df.collect()[0]
last_record = df.collect()[-1]
EDIT:
You have to pass the column name as well.
df = <your dataframe>
first_record = df.collect()[0]['column_name']
last_record = df.collect()[-1]['column_name']
Since version 3.0.0, Spark also has a DataFrame function called .tail() to get the last value.
This will return List of Row objects:
last=df.tail(1)[0].support
