Can someone help me with pyspark using both aggregate and groupby functions? I have made my data frames, and applied filters and selects to get the data I want. However, I am now stuck trying to aggregate things correctly.
Currently, my code outputs the content below:
+----------+-----------+--------------+---------------+----------+---------+
|l_orderkey|o_orderdate|o_shippriority|l_extendedprice|l_discount| rev|
+----------+-----------+--------------+---------------+----------+---------+
| 53634| 1995-02-22| 0| 20517.44| 0.08|18876.045|
| 265539| 1995-01-25| 0| 70423.08| 0.01| 69718.85|
| 331590| 1994-12-10| 0| 46692.75| 0.03| 45291.97|
| 331590| 1994-12-10| 0| 37235.1| 0.1| 33511.59|
| 420545| 1995-03-05| 0| 75542.1| 0.04|72520.414|
| 420545| 1995-03-05| 0| 1062.0| 0.07|987.66003|
| 420545| 1995-03-05| 0| 9729.45| 0.1| 8756.505|
| 420545| 1995-03-05| 0| 15655.6| 0.04|15029.375|
| 420545| 1995-03-05| 0| 3121.3| 0.03|3027.6611|
| 420545| 1995-03-05| 0| 71723.0| 0.03| 69571.31|
| 488928| 1995-02-15| 0| 1692.77| 0.01|1675.8423|
| 488928| 1995-02-15| 0| 22017.84| 0.01|21797.662|
| 488928| 1995-02-15| 0| 57100.42| 0.04|54816.402|
| 488928| 1995-02-15| 0| 3807.76| 0.05| 3617.372|
| 488928| 1995-02-15| 0| 73332.52| 0.01|72599.195|
| 510754| 1994-12-21| 0| 41171.78| 0.09| 37466.32|
| 512422| 1994-12-26| 0| 87251.56| 0.07| 81143.95|
| 677761| 1994-12-26| 0| 60123.34| 0.0| 60123.34|
| 956646| 1995-03-07| 0| 61853.68| 0.05|58760.996|
| 1218886| 1995-02-13| 0| 24844.0| 0.01| 24595.56|
+----------+-----------+--------------+---------------+----------+---------+
I wish to group by l_orderkey and aggregate rev as a sum.
Here is my most recent attempt, with t being the dataframe and F being the pyspark.sql functions (from pyspark.sql import functions as F):
(t.groupby(t.l_orderkey, t.o_orderdate, t.o_shippriority)
 .agg(F.collect_set(sum(t.rev)), F.collect_set(t.l_orderkey))
 .show())
Can someone tell me if I'm on the right track? I keep getting "Column is not iterable".
total_rev = t.groupby(t.l_orderkey).agg(F.sum(t.rev).alias('total_rev'))
# print /show the top results
total_rev.show()
This would give you a new dataframe with columns l_orderkey and total_rev, where total_rev holds the aggregated sum of rev.
collect_set is used when you want to collect distinct values (i.e. remove duplicates), not to sum them.
You are also getting "Column is not iterable" because you are using the built-in Python function sum rather than the Spark function F.sum.
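If you also want to keep o_orderdate and o_shippriority in the output, as in your attempt, a minimal sketch (assuming t is the dataframe shown above):
from pyspark.sql import functions as F

result = (t.groupby(t.l_orderkey, t.o_orderdate, t.o_shippriority)
           .agg(F.sum(t.rev).alias('total_rev')))  # Spark's F.sum, not Python's built-in sum
result.show()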
I have a data frame containing daily events related to various entities in time.
I want to fill the gaps in those times series.
Here is the aggregated data I have:
+---------+----------+-------+
|entity_id|      date|counter|
+---------+----------+-------+
|        3|2020-01-01|      7|
|        1|2020-01-01|     10|
|        2|2020-01-01|      3|
|        2|2020-01-02|      9|
|        1|2020-01-03|     15|
|        2|2020-01-04|      3|
|        1|2020-01-04|     14|
|        2|2020-01-05|      6|
+---------+----------+-------+
And here is the data I want to have:
+---------+----------+-------+
|entity_id|      date|counter|
+---------+----------+-------+
|        3|2020-01-01|      7|
|        1|2020-01-01|     10|
|        2|2020-01-01|      3|
|        2|2020-01-02|      9|
|        1|2020-01-02|      0|
|        3|2020-01-02|      0|
|        1|2020-01-03|     15|
|        2|2020-01-03|      0|
|        3|2020-01-03|      0|
|        3|2020-01-04|      0|
|        2|2020-01-04|      3|
|        1|2020-01-04|     14|
|        2|2020-01-05|      6|
|        1|2020-01-05|      0|
|        3|2020-01-05|      0|
+---------+----------+-------+
I have used this stack overflow topic, which was very useful:
Filling gaps in timeseries Spark
Here is my code (filtered for only one entity); it is in Python, but I think the API is the same in Scala:
from pyspark.sql import functions as sf

a = 60 * 60 * 24  # seconds in one day, used as the range step

(
    df
    .withColumn("date", sf.to_date("created_at"))
    .groupBy(
        sf.col("entity_id"),
        sf.col("date")
    )
    .agg(sf.count(sf.lit(1)).alias("counter"))
    .filter(sf.col("entity_id") == 1)
    .select(
        sf.col("date"),
        sf.col("counter")
    )
    .join(
        spark
        .range(
            # range start: min(created_at) truncated down to midnight, as a unix timestamp
            df
            .filter(sf.col("entity_id") == 1)
            .select(sf.unix_timestamp(sf.min("created_at")).alias("min"))
            .first().min // a * a,
            # range end: max(created_at) rounded up to the next midnight
            (df
             .filter(sf.col("entity_id") == 1)
             .select(sf.unix_timestamp(sf.max("created_at")).alias("max"))
             .first().max // a + 1) * a,
            a  # range step
        )
        .select(sf.to_date(sf.from_unixtime("id")).alias("date")),
        ["date"],    # column used for the join
        how="right"  # type of join
    )
    .withColumn("counter", sf.when(sf.isnull("counter"), 0).otherwise(sf.col("counter")))
    .sort(sf.col("date"))
    .show(200)
)
This works very well, but now I want to avoid the filter and build a range that fills the time series gaps for every entity (entity_id == 2, entity_id == 3, ...). For your information, the minimum and maximum of the date column can differ depending on the entity_id value; nevertheless, if your help involves the global minimum and maximum of the whole data frame, that is fine for me as well.
If you need any other information, feel free to ask.
Edit: added an example of the data I want to have.
When creating the elements of the date range, I would rather use the pandas date_range function than Spark's range, as the Spark range function has some shortcomings when dealing with date values. The number of distinct dates is usually small; even when dealing with a time span of multiple years, it is small enough to be easily broadcast in a join.
import pandas as pd
from pyspark.sql import functions as F

#get the minimum and maximum date and collect them to the driver
min_date, max_date = df.select(F.min("date"), F.max("date")).first()
#use pandas to create all dates and switch back to a PySpark DataFrame
timerange = pd.date_range(start=min_date, end=max_date, freq='1d')
all_dates = spark.createDataFrame(timerange.to_frame(), ['date'])
#get all combinations of dates and entity_ids
all_dates_and_ids = all_dates.crossJoin(df.select("entity_id").distinct())
#create the final result by doing a left join and filling null values with 0
result = all_dates_and_ids.join(df, on=['date', 'entity_id'], how="left_outer") \
    .fillna({'counter': 0}) \
    .orderBy(['date', 'entity_id'])
This gives
+-------------------+---------+-------+
| date|entity_id|counter|
+-------------------+---------+-------+
|2020-01-01 00:00:00| 1| 10|
|2020-01-01 00:00:00| 2| 3|
|2020-01-01 00:00:00| 3| 7|
|2020-01-02 00:00:00| 1| 0|
|2020-01-02 00:00:00| 2| 9|
|2020-01-02 00:00:00| 3| 0|
|2020-01-03 00:00:00| 1| 15|
|2020-01-03 00:00:00| 2| 0|
|2020-01-03 00:00:00| 3| 0|
|2020-01-04 00:00:00| 1| 14|
|2020-01-04 00:00:00| 2| 3|
|2020-01-04 00:00:00| 3| 0|
|2020-01-05 00:00:00| 1| 0|
|2020-01-05 00:00:00| 2| 6|
|2020-01-05 00:00:00| 3| 0|
+-------------------+---------+-------+
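If you would rather have per-entity date ranges instead of the global minimum and maximum, a possible sketch using F.sequence and F.explode (assuming Spark 2.4+ and that df is the aggregated frame with entity_id, date and counter):
import pyspark.sql.functions as F

# per-entity min/max dates, expanded into one row per entity and date
per_entity_dates = (df
    .groupBy("entity_id")
    .agg(F.min("date").alias("min_date"), F.max("date").alias("max_date"))
    .select("entity_id",
            F.explode(F.sequence("min_date", "max_date")).alias("date")))

# left join the actual counters back in and fill the gaps with 0
result = (per_entity_dates
    .join(df, on=["entity_id", "date"], how="left")
    .fillna({"counter": 0})
    .orderBy("entity_id", "date"))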
A crossJoin can be done as follows:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
date_today = datetime.now()  # assumed anchor date; not defined in the original snippet
df1 = pd.DataFrame({'subgroup': ['A', 'B', 'C', 'D']})
df2 = pd.DataFrame({'dates': pd.date_range(date_today, date_today + timedelta(3), freq='D')})
sdf1 = spark.createDataFrame(df1)
sdf2 = spark.createDataFrame(df2)
sdf1.crossJoin(sdf2).toPandas()
In this example there are two dataframes, each containing 4 rows; in the end, I get 16 rows.
However, for my problem, I would like to do the cross join per user, where the user is another column in the two dataframes, e.g.:
df1 = pd.DataFrame({'user':[1,1,1,1,2,2,2,2],'subgroup':['A','B','C','D','A','B','D','E']})
df2 = pd.DataFrame({'user':[1,1,1,1,2,2,2,2],'dates':np.hstack([np.array(pd.date_range(date_today, date_today + timedelta(3), freq='D')),np.array(pd.date_range(date_today+timedelta(1), date_today + timedelta(4), freq='D'))])})
The result of applying the per-user crossJoin should be a dataframe with 32 rows. Is this possible in pyspark and how can this be done?
A cross join is a join that generates a multiplication of rows because the joining key does not identify rows uniquely (in our case the joining key is trivial, or there is no joining key at all).
Let's start with sample data frames:
import numpy as np
import pyspark.sql.functions as psf
import pyspark.sql.types as pst

df1 = spark.createDataFrame(
    [[user, value] for user, value in zip(5 * list(range(2)), np.random.randint(0, 100, 10).tolist())],
    schema=pst.StructType([pst.StructField(c, pst.IntegerType()) for c in ['user', 'value1']]))
df2 = spark.createDataFrame(
    [[user, value] for user, value in zip(5 * list(range(2)), np.random.randint(0, 100, 10).tolist())],
    schema=pst.StructType([pst.StructField(c, pst.IntegerType()) for c in ['user', 'value2']]))
+----+------+
|user|value1|
+----+------+
| 0| 76|
| 1| 59|
| 0| 14|
| 1| 71|
| 0| 66|
| 1| 61|
| 0| 2|
| 1| 22|
| 0| 16|
| 1| 83|
+----+------+
+----+------+
|user|value2|
+----+------+
| 0| 65|
| 1| 81|
| 0| 60|
| 1| 69|
| 0| 21|
| 1| 61|
| 0| 98|
| 1| 76|
| 0| 40|
| 1| 21|
+----+------+
Let's try joining the data frames on a constant column to see the equivalence between a cross join and regular join on a constant (trivial) column:
df = df1.withColumn('key', psf.lit(1)) \
.join(df2.withColumn('key', psf.lit(1)), on=['key'])
We get an error from Spark > 2.x, because it realises we're trying to do a cross join (a Cartesian product):
Py4JJavaError: An error occurred while calling o1865.showString.
: org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
LogicalRDD [user#1538, value1#1539], false
and
LogicalRDD [user#1542, value2#1543], false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
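As the message suggests, there are two remedies; a minimal sketch of both, using the df1 and df2 defined above (the flag is enabled by default in Spark 3):
# Option 1: make the Cartesian product explicit
df = df1.crossJoin(df2)

# Option 2: allow implicit Cartesian products via the configuration flag
spark.conf.set("spark.sql.crossJoin.enabled", "true")
df = df1.withColumn('key', psf.lit(1)) \
        .join(df2.withColumn('key', psf.lit(1)), on=['key'])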
If your joining key (user here) is not a column that uniquely identifies rows, you'll get a multiplication of lines as well but within each user group:
df = df1.join(df2, on='user')
print("Number of rows : \tdf1: {} \tdf2: {} \tdf: {}".format(df1.count(), df2.count(), df.count()))
Number of rows : df1: 10 df2: 10 df: 50
+----+------+------+
|user|value1|value2|
+----+------+------+
| 1| 59| 81|
| 1| 59| 69|
| 1| 59| 61|
| 1| 59| 76|
| 1| 59| 21|
| 1| 71| 81|
| 1| 71| 69|
| 1| 71| 61|
| 1| 71| 76|
| 1| 71| 21|
| 1| 61| 81|
| 1| 61| 69|
| 1| 61| 61|
| 1| 61| 76|
| 1| 61| 21|
| 1| 22| 81|
| 1| 22| 69|
| 1| 22| 61|
| 1| 22| 76|
| 1| 22| 21|
+----+------+------+
5 * 5 rows for user 0 + 5 * 5 rows for user 1, hence 50
Note: Using a self join followed by a filter usually means you should be using window functions instead.
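For illustration, a minimal sketch of that idea: keeping each user's maximum value2 from df2 with a window instead of a self join plus filter (max_per_user is just an illustrative name):
from pyspark.sql import Window
import pyspark.sql.functions as psf

# keep only each user's row(s) with the maximum value2, without a self join
w = Window.partitionBy("user")
max_per_user = (df2
    .withColumn("max_value2", psf.max("value2").over(w))
    .filter(psf.col("value2") == psf.col("max_value2"))
    .drop("max_value2"))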
I have a dataset that I want to partition by a particular key (clientID), but some clients produce far, far more data than others. There's a feature in Hive called "ListBucketing", invoked by "skewed by", specifically to deal with this situation.
However, I cannot find any indication that Spark supports this feature, or how (if it does support it) to make use of it.
Is there a Spark feature that is the equivalent? Or does Spark have some other set of features by which this behavior can be replicated?
(As a bonus, and a requirement for my actual use case: does your suggested method work with Amazon Athena?)
As far as I know, there is no such out-of-the-box tool in Spark. With skewed data, it is very common to add an artificial column to further bucketize the data.
Let's say you want to partition by column "y", but the data is very skewed, as in this toy example (one partition with 5 rows, the others with only one row each):
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.range(8).withColumn("y", when('id < 5, 0).otherwise('id))
df.show()
+---+---+
| id| y|
+---+---+
| 0| 0|
| 1| 0|
| 2| 0|
| 3| 0|
| 4| 0|
| 5| 5|
| 6| 6|
| 7| 7|
+---+---+
Now let's add an artificial random column and write the dataframe.
val nbOfBuckets = 3
val part_df = df.withColumn("r", floor(rand() * nbOfBuckets))
part_df.show
+---+---+---+
| id| y| r|
+---+---+---+
| 0| 0| 2|
| 1| 0| 2|
| 2| 0| 0|
| 3| 0| 0|
| 4| 0| 1|
| 5| 5| 2|
| 6| 6| 2|
| 7| 7| 1|
+---+---+---+
// and writing. We divided the partition with 5 elements into 3 partitions.
part_df.write.partitionBy("y", "r").csv("...")
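For reference, the same salting idea as a rough PySpark sketch, assuming a DataFrame df with the skewed column y:
import pyspark.sql.functions as F

nb_of_buckets = 3
part_df = df.withColumn("r", F.floor(F.rand() * nb_of_buckets))
part_df.write.partitionBy("y", "r").csv("...")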
I'm using pyspark to generate a dataframe where I need to update the 'amt' column with the previous row's 'amt' value, but only when amt = 0.
For example, below is my dataframe
+---+-----+
| id|amt |
+---+-----+
| 1| 5|
| 2| 0|
| 3| 0|
| 4| 6|
| 5| 0|
| 6| 3|
+---+-----+
Now, I want the following DF to be created: whenever amt = 0, the modi_amt column should contain the previous row's non-zero value; otherwise, no change.
+---+-----+----------+
| id|amt |modi_amt |
+---+-----+----------+
| 1| 5| 5|
| 2| 0| 5|
| 3| 0| 5|
| 4| 6| 6|
| 5| 0| 6|
| 6| 3| 3|
+---+-----+----------+
I'm able to get the previous row's value, but I need help for the rows where multiple 0 amt values appear consecutively (for example, id = 2, 3).
Code I'm using:
from pyspark.sql import functions as F
from pyspark.sql.functions import when
from pyspark.sql.window import Window

my_window = Window.partitionBy().orderBy("id")
DF = DF.withColumn("prev_amt", F.lag(DF.amt).over(my_window))
DF = DF.withColumn("modi_amt", when(DF.amt == 0, DF.prev_amt).otherwise(DF.amt)).drop('prev_amt')
I'm getting the below DF
+---+-----+----------+
| id|amt |modi_amt |
+---+-----+----------+
| 1| 5| 5|
| 2| 0| 5|
| 3| 0| 0|
| 4| 6| 6|
| 5| 0| 6|
| 6| 3| 3|
+---+-----+----------+
Basically, id 3 should also have modi_amt = 5.
I've used the below approach to get the output and it's working fine:
from pyspark.sql import functions as F
from pyspark.sql.functions import when, lit, last
from pyspark.sql.window import Window

my_window = Window.partitionBy().orderBy("id")
# this will hold the previous row's amt value
DF = DF.withColumn("prev_amt", F.lag(DF.amt).over(my_window))
# this replaces amt 0 with the previous row's value, but not for consecutive rows having 0 amt
DF = DF.withColumn("amt_adjusted", when(DF.amt == 0, DF.prev_amt).otherwise(DF.amt))
# set null for the rows where both amt and amt_adjusted are 0 (logic for consecutive rows having 0 amt)
DF = DF.withColumn('zeroNonZero', when((DF.amt == 0) & (DF.amt_adjusted == 0), lit(None)).otherwise(DF.amt_adjusted))
# replace all null values with the previous non-zero amt row value
DF = DF.withColumn('modi_amt', last("zeroNonZero", ignorenulls=True).over(Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)))
Is there any other better approach?
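A more compact variant of the same forward-fill idea, as a sketch: turn zeros into nulls and let last(..., ignorenulls=True) carry the previous non-zero amt forward:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# running window up to the current row; zeros become nulls and are skipped by last()
w = Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)
DF = DF.withColumn(
    "modi_amt",
    F.last(F.when(F.col("amt") != 0, F.col("amt")), ignorenulls=True).over(w))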
I have a data frame like below in pyspark.
+---+-------------+----+
| id| device| val|
+---+-------------+----+
| 3| mac pro| 1|
| 1| iphone| 2|
| 1|android phone| 2|
| 1| windows pc| 2|
| 1| spy camera| 2|
| 2| spy camera| 3|
| 2| iphone| 3|
| 3| spy camera| 1|
| 3| cctv| 1|
+---+-------------+----+
I want to populate some columns based on the below lists
phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']
I have done like below.
import pyspark.sql.functions as F
df.withColumn('cat',
F.when(df.device.isin(phone_list), 'phones').otherwise(
F.when(df.device.isin(pc_list), 'pc').otherwise(
F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
I got the desired result.
Now I want to make a change to the code: I want to populate the column values by dividing the cat count by the val column in the data frame for that id.
I tried something like the below but didn't get the correct result:
df.withColumn('cat',
F.when(df.device.isin(phone_list), 'phones').otherwise(
F.when(df.device.isin(pc_list), 'pc').otherwise(
F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')/ df.val).show()
How can I get what I want?
edit
Expected result
+---+----+------+--------+
| id| pc|phones|security|
+---+----+------+--------+
| 1| 0.5| 1| 0.5|
| 3| 1| null| 2|
| 2|null| 0.33| 0.33|
+---+----+------+--------+
An aggregation needs an aggregation function; a plain column will not be recognized there.
Since the val column contains the same value within each group of the id column, you can use the first built-in function:
df.withColumn('cat',
F.when(df.device.isin(phone_list), 'phones').otherwise(
F.when(df.device.isin(pc_list), 'pc').otherwise(
F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')/ F.first(df.val)).show()
which should give you
+---+----+------------------+------------------+
| id| pc| phones| security|
+---+----+------------------+------------------+
| 3| 1.0| null| 2.0|
| 1| 0.5| 1.0| 0.5|
| 2|null|0.3333333333333333|0.3333333333333333|
+---+----+------------------+------------------+