aggregate function in lit() of pyspark along with withColumn

aggregate function in lit() of pyspark along with withColumn - apache-spark

I have column quantity in dataframe. I want to add a new column to this dataframe with each record having min("Quantity"). I am trying to use lit() in pyspark. something like below
df.withColumn("min_quant", lit(min(col("Quantity")))).show().
It's resulting in the getting below error
grouping expressions sequence is empty, and `InvoiceNo` is not an aggregate function.
Wrap (min(`Quantity`) AS `min_quant`) in windowing function(s) or wrap
This is working:
df.withColumn("min_quant", lit(2)).show().
But, in place of 2 here, I want min(Quantity). Am I missing something?

Please try using window function as min() function needs aggregation.
val windowSpec = Window.orderBy("InvoiceNo")
df.withColumn("min_quant", min("Quantity") over(windowSpec)).show()
Sample Result:
+---------+----+--------+---------+
|InvoiceNo|name|Quantity|min_quant|
+---------+----+--------+---------+
| 1| ABC| 19| 1|
| 1| ABC| 1| 1|
| 1| ABC| 8| 1|
| 1| ABC| 389| 1|
| 1| ABC| 196| 1|
| 2| CBD| 10| 1|
| 2| CBD| 946| 1|
| 3| XYZ| 3| 1|
+---------+----+--------+---------+

Related

regarding the usage of F.count(F.col("some column").isNotNull()) in window function

I am trying to test the usage of F.count(F.col().isNotNull()) in window function. Please see the following code script
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
list=([1,5,4],
[1,5,None],
[1,5,1],
[1,5,4],
[2,5,1],
[2,5,2],
[2,5,None],
[2,5,None],
[2,5,4])
df=spark.createDataFrame(list,['I_id','p_id','xyz'])
w= Window().partitionBy("I_id","p_id").orderBy(F.col("xyz").asc_nulls_first())
df.withColumn("xyz1",F.count(F.col("xyz").isNotNull()).over(w)).show()
The result is shown as follows. In the first two rows, my understanding is that F.count(F.col("xyz") should count the non-zero items from xyz = -infinity to xyz = null, how does the following isNotNull() process this. Why it gets 2 for the first two rows in xyz1 column.

If you count the Booleans, since they are either True or False, you will count all the rows in the specified window, regardless of whether xyz is null or not.
What you could do is to sum the isNotNull Boolean rather than counting them.
df.withColumn("xyz1",F.sum(F.col("xyz").isNotNull().cast('int')).over(w)).show()
+----+----+----+----+
|I_id|p_id| xyz|xyz1|
+----+----+----+----+
| 2| 5|null| 0|
| 2| 5|null| 0|
| 2| 5| 1| 1|
| 2| 5| 2| 2|
| 2| 5| 4| 3|
| 1| 5|null| 0|
| 1| 5| 1| 1|
| 1| 5| 4| 3|
| 1| 5| 4| 3|
+----+----+----+----+
Another way is to do a conditional count using when:
df.withColumn("xyz1",F.count(F.when(F.col("xyz").isNotNull(), 1)).over(w)).show()
+----+----+----+----+
|I_id|p_id| xyz|xyz1|
+----+----+----+----+
| 2| 5|null| 0|
| 2| 5|null| 0|
| 2| 5| 1| 1|
| 2| 5| 2| 2|
| 2| 5| 4| 3|
| 1| 5|null| 0|
| 1| 5| 1| 1|
| 1| 5| 4| 3|
| 1| 5| 4| 3|
+----+----+----+----+

how does sortWithinPartitions sort?

After applying sortWithinPartitions to a df and writing the output to a table I'm getting a result I'm not sure how to interpret.
df
.select($"type", $"id", $"time")
.sortWithinPartitions($"type", $"id", $"time")
result file looks somewhat like
1 a 5
2 b 1
1 a 6
2 b 2
1 a 7
2 b 3
1 a 8
2 b 4
It's not actually random, but neither is it sorted like I would expect it to be. Namely, first by type, then id, then time.
If I try to use a repartition before sorting, then I get the result I want. But for some reason the files weight 5 times more(100gb vs 20gb).
I'm writing to a hive orc table with compresssion set to snappy.
Does anyone know why it's sorted like this and why a repartition gets the right order, but a larger size?
Using spark 2.2.

The documentation of sortWithinPartition states
Returns a new Dataset with each partition sorted by the given expressions
The easiest way to think of this function is to imagine a fourth column (the partition id) that is used as primary sorting criterion. The function spark_partition_id() prints the partition.
For example if you have just one large partition (something that you as a Spark user would never do!), sortWithinPartition works as a normal sort:
df.repartition(1)
.sortWithinPartitions("type","id","time")
.withColumn("partition", spark_partition_id())
.show();
prints
+----+---+----+---------+
|type| id|time|partition|
+----+---+----+---------+
| 1| a| 5| 0|
| 1| a| 6| 0|
| 1| a| 7| 0|
| 1| a| 8| 0|
| 2| b| 1| 0|
| 2| b| 2| 0|
| 2| b| 3| 0|
| 2| b| 4| 0|
+----+---+----+---------+
If there are more partitions, the results are only sorted within each partition:
df.repartition(4)
.sortWithinPartitions("type","id","time")
.withColumn("partition", spark_partition_id())
.show();
prints
+----+---+----+---------+
|type| id|time|partition|
+----+---+----+---------+
| 2| b| 1| 0|
| 2| b| 3| 0|
| 1| a| 5| 1|
| 1| a| 6| 1|
| 1| a| 8| 2|
| 2| b| 2| 2|
| 1| a| 7| 3|
| 2| b| 4| 3|
+----+---+----+---------+
Why would one use sortWithPartition instead of sort? sortWithPartition does not trigger a shuffle, as the data is only moved within the executors. sort however will trigger a shuffle. Therefore sortWithPartition executes faster. If the data is partitioned by a meaningful column, sorting within each partition might be enough.

Skewed By in Spark

I have a dataset that I want to partition by a particular key (clientID) but some clients produce far, far more data that others. There's a feature in Hive called either "ListBucketing" invoked by "skewed by" specifically to deal with this situation.
However, I cannot find any indication that Spark supports this feature, or how (if it does support it) to make use of it.
Is there a Spark feature that is the equivalent? Or, does Spark have some other set of features by which this behavior can be replicated?
(As a bonus - and requirement for my actual use-case - does your suggest method work with Amazon Athena?)

As far as I know, there is no such out of the box tool in Spark. In case of skewed data, what's very common is to add an artificial column to further bucketize the data.
Let's say you want to partition by column "y", but the data is very skewed like in this toy example (1 partition with 5 rows, the others with only one row):
val df = spark.range(8).withColumn("y", when('id < 5, 0).otherwise('id))
df.show()
+---+---+
| id| y|
+---+---+
| 0| 0|
| 1| 0|
| 2| 0|
| 3| 0|
| 4| 0|
| 5| 5|
| 6| 6|
| 7| 7|
+-------+
Now let's add an artificial random column and write the dataframe.
val maxNbOfBuckets = 3
val part_df = df.withColumn("r", floor(rand() * nbOfBuckets))
part_df.show
+---+---+---+
| id| y| r|
+---+---+---+
| 0| 0| 2|
| 1| 0| 2|
| 2| 0| 0|
| 3| 0| 0|
| 4| 0| 1|
| 5| 5| 2|
| 6| 6| 2|
| 7| 7| 1|
+---+---+---+
// and writing. We divided the partition with 5 elements into 3 partitions.
part_df.write.partitionBy("y", "r").csv("...")

Divide aggregate value using values from data frame in PySpark

I have a data frame like below in pyspark.
+---+-------------+----+
| id| device| val|
+---+-------------+----+
| 3| mac pro| 1|
| 1| iphone| 2|
| 1|android phone| 2|
| 1| windows pc| 2|
| 1| spy camera| 2|
| 2| spy camera| 3|
| 2| iphone| 3|
| 3| spy camera| 1|
| 3| cctv| 1|
+---+-------------+----+
I want to populate some columns based on the below lists
phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']
I have done like below.
import pyspark.sql.functions as F
df.withColumn('cat',
F.when(df.device.isin(phone_list), 'phones').otherwise(
F.when(df.device.isin(pc_list), 'pc').otherwise(
F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
I got the desired result.
Now I want to do some change to the code I want to populate the column value after I divide the cat column with the value in the data frame for that id.
I tried something like below but didn't get the correct result
df.withColumn('cat',
F.when(df.device.isin(phone_list), 'phones').otherwise(
F.when(df.device.isin(pc_list), 'pc').otherwise(
F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')/ df.val).show()
How can I get what I want?
edit
Expected result
+---+----+------+--------+
| id| pc|phones|security|
+---+----+------+--------+
| 1| 0.5| 1| 0.5|
| 3| 1| null| 2|
| 2|null| 0.33| 0.33|
+---+----+------+--------+

Aggregation would need an aggregation function, a simple column would not be identified
Since val column contains same value for each group of id column, you can use first inbuilt function as
df.withColumn('cat',
F.when(df.device.isin(phone_list), 'phones').otherwise(
F.when(df.device.isin(pc_list), 'pc').otherwise(
F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')/ F.first(df.val)).show()
which should give you
+---+----+------------------+------------------+
| id| pc| phones| security|
+---+----+------------------+------------------+
| 3| 1.0| null| 2.0|
| 1| 0.5| 1.0| 0.5|
| 2|null|0.3333333333333333|0.3333333333333333|
+---+----+------------------+------------------+

sort the most frequent combinations of two columns in descending order

I have dataframe that looks like this
+---+---+---
| A| B| C|
+---+---+---
| 1| 3| 1|
| 2| 1| 1|
| 2| 3| 1|
| 1| 2| 1|
| 3| 1| 1|
| 1| 2| 1|
| 2| 1| 1|
| 1| 3| 1|
| 1| 2| 1|
+---+---+---
I want to reduce the data to only the most frequent combinations of two columns (A and B) sorted in descending order
The output should look like
+---+---+-----+
| A| B|count|
+---+---+-----+
| 1| 2| 3|
| 2| 1| 2|
+---+---+-----+
I wrote this code but it does not sort
import pandas as pd
import numpy as np
data=pd.read_csv("file.csv",sep=',')
gps = data[['A','B','C']]
gps1=gps.groupby(['A','C'])
gps1=gps1.count()
gps1.columns=['count']
gps1.sort_values(['count'],ascending=False)
print(gps1)

use nlargest
gps.groupby(['A', 'B']).size().nlargest(2)
A B
1 2 3
3 2
dtype: int64
or
gps.groupby(['A', 'B']).size().nlargest(2).reset_index(name='count')

You need to assign the result of sort_values() back into gps1 or use `inplace=True:
gps1.sort_values(['count'],ascending=False, inplace=True)
or
gps1 = gps1.sort_values(['count'],ascending=False)
As stated in the documentation of sort_values, inplace is by default set to False

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

aggregate function in lit() of pyspark along with withColumn - apache-spark

Related

regarding the usage of F.count(F.col("some column").isNotNull()) in window function

how does sortWithinPartitions sort?

Skewed By in Spark

Divide aggregate value using values from data frame in PySpark

sort the most frequent combinations of two columns in descending order

Categories

Resources