Sample random n rows from each group - apache-spark

I would like to sample at most n rows from each group in the data, where the grouping is defined by a single column. There are many answers for selecting the top n rows, but I don't need any ordering and I am not sure whether ordering would introduce unnecessary shuffling.
I have looked at:
sampleBy(), but I need a maximum absolute number of rows rather than a fraction.
Windows, but they always seem to imply ordering the values.
groupBy, but I was not able to construct something from the available aggregate functions.
Code example:
data = [('A',1), ('B',1), ('C',2)]
columns = ["field_1","field_2"]
df = spark.createDataFrame(data=data, schema = columns)
Where I would be looking for a pandas-like
df.groupby('field_2').head(1)
I would also be happy with a suitable SQL expression.
Otherwise if there is no better performance than using
Window.partitionBy(df['field_2']).orderBy('field_1')...
then I'd also be happy to know that.
Thanks!

The below would work if a sort isn't required, and it uses RDD transformations.
For a dataframe like the following
sdf.show()
# +-----------+-------+--------+----+
# |bvdidnumber|dt_year|dt_rfrnc|goal|
# +-----------+-------+--------+----+
# | 1| 2020| 202006| 0|
# | 1| 2020| 202012| 1|
# | 1| 2020| 202012| 0|
# | 1| 2021| 202103| 0|
# | 1| 2021| 202106| 0|
# | 1| 2021| 202112| 1|
# | 2| 2020| 202006| 0|
# | 2| 2020| 202012| 0|
# | 2| 2020| 202012| 1|
# | 2| 2021| 202103| 0|
# | 2| 2021| 202106| 0|
# | 2| 2021| 202112| 1|
# +-----------+-------+--------+----+
I created a function that can be shipped to all executors, and then used with flatMapValues() in RDD transformation.
# best to ship this function to all executors for optimum performance
def get_n_from_group(group, num_recs):
    """
    get `N` number of sample records
    """
    res = []
    i = 0
    for rec in group:
        res.append(rec)
        i = i + 1
        if i == num_recs:
            break
    return res

rdd = sdf.rdd. \
    groupBy(lambda x: x.bvdidnumber). \
    flatMapValues(lambda k: get_n_from_group(k, 2))  # 2 records only
top_n_sdf = spark.createDataFrame(rdd.values(), schema=sdf.schema)
top_n_sdf.show()
# +-----------+-------+--------+----+
# |bvdidnumber|dt_year|dt_rfrnc|goal|
# +-----------+-------+--------+----+
# | 1| 2020| 202006| 0|
# | 1| 2020| 202012| 1|
# | 2| 2020| 202006| 0|
# | 2| 2020| 202012| 0|
# +-----------+-------+--------+----+
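The cap-per-group logic can also be tried without a cluster. The following is a minimal plain-Python sketch, not Spark code: take_n_per_group is a hypothetical helper name, rows are plain tuples, and a per-key counter plays the role of the early break in get_n_from_group. Like the RDD version, it keeps the first n rows it sees per group rather than a random sample.

```python
from collections import defaultdict


def take_n_per_group(rows, key_index, n):
    """Keep at most `n` rows per key, in first-seen order (no sorting)."""
    counts = defaultdict(int)
    kept = []
    for row in rows:
        key = row[key_index]
        if counts[key] < n:  # group still below its cap
            counts[key] += 1
            kept.append(row)
    return kept


# rows shaped like the question's example: (field_1, field_2)
data = [("A", 1), ("B", 1), ("C", 2)]
sampled = take_n_per_group(data, key_index=1, n=1)
print(sampled)
```

With n=1 this mirrors the pandas df.groupby('field_2').head(1) from the question.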

Related

How to group by consecutive 1s in a column in PySpark and keep groups with a specific size

I want to split my PySpark dataframe into groups with a monotonically increasing trend and keep the groups with size greater than 10.
Here is the part of the code I tried:
from pyspark.sql import functions as F, Window
df = df1.withColumn(
    "FLAG_INCREASE",
    F.when(
        F.col("x")
        > F.lag("x").over(Window.partitionBy("x1").orderBy("time")),
        1,
    ).otherwise(0),
)
I don't know how to group by consecutive ones in PySpark. Does anyone have a better solution for this?
In pandas the same thing can be done like this:
df=df1.groupby((df1['x'].diff() < 0).cumsum())
How can I convert this code to PySpark?
Example dataframe:
x
0 1
1 2
2 2
3 2
4 3
5 3
6 4
7 5
8 4
9 3
10 2
11 1
12 2
13 3
14 4
15 5
16 5
17 6
expected output
group1:
x
0 1
1 2
2 2
3 2
4 3
5 3
6 4
7 5
group2:
x
0 1
1 2
2 3
3 4
4 5
5 5
6 6
I'll map out all the steps (and keep all columns in the output) to replicate (df1['x'].diff() < 0).cumsum() which is easy to calculate using a lag.
However, it is important that your data has an ID column that records the order of the dataframe because, unlike pandas, Spark does not retain a dataframe's row order (due to its distributed nature). For this example, I've assumed that your data has an ID column named idx, which is the index you printed in your example input.
# input data
data_sdf.show(5)
# +---+---+
# |idx|val|
# +---+---+
# | 0| 1|
# | 1| 2|
# | 2| 2|
# | 3| 2|
# | 4| 3|
# +---+---+
# only showing top 5 rows
# calculating the group column
import sys
from pyspark.sql import functions as func, Window as wd

data_sdf. \
    withColumn('diff_with_prevval',
               func.col('val') - func.lag('val').over(wd.orderBy('idx'))
               ). \
    withColumn('diff_lt_0',
               func.coalesce((func.col('diff_with_prevval') < 0).cast('int'), func.lit(0))
               ). \
    withColumn('diff_lt_0_cumsum',
               func.sum('diff_lt_0').over(wd.orderBy('idx').rowsBetween(-sys.maxsize, 0))
               ). \
    show()
# +---+---+-----------------+---------+----------------+
# |idx|val|diff_with_prevval|diff_lt_0|diff_lt_0_cumsum|
# +---+---+-----------------+---------+----------------+
# | 0| 1| null| 0| 0|
# | 1| 2| 1| 0| 0|
# | 2| 2| 0| 0| 0|
# | 3| 2| 0| 0| 0|
# | 4| 3| 1| 0| 0|
# | 5| 3| 0| 0| 0|
# | 6| 4| 1| 0| 0|
# | 7| 5| 1| 0| 0|
# | 8| 4| -1| 1| 1|
# | 9| 3| -1| 1| 2|
# | 10| 2| -1| 1| 3|
# | 11| 1| -1| 1| 4|
# | 12| 2| 1| 0| 4|
# | 13| 3| 1| 0| 4|
# | 14| 4| 1| 0| 4|
# | 15| 5| 1| 0| 4|
# | 16| 5| 0| 0| 4|
# | 17| 6| 1| 0| 4|
# +---+---+-----------------+---------+----------------+
You can now use the diff_lt_0_cumsum column in your groupBy() for further use.
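To check the grouping logic on the question's sample values without Spark, here is a small plain-Python sketch. It is only an illustration: increasing_runs and min_size are hypothetical names, and min_size=5 is an assumption chosen so that the two runs in the sample (8 and 7 values) pass the size filter.

```python
from itertools import accumulate, groupby


def increasing_runs(values, min_size=1):
    """Split values into non-decreasing runs; keep runs with at least min_size items."""
    if not values:
        return []
    # 1 where the value dropped compared with the previous one, else 0
    drops = [0] + [int(cur < prev) for prev, cur in zip(values, values[1:])]
    # running total of drops = group id, like (diff < 0).cumsum()
    group_ids = list(accumulate(drops))
    runs = []
    for _, members in groupby(zip(group_ids, values), key=lambda pair: pair[0]):
        run = [value for _, value in members]
        if len(run) >= min_size:
            runs.append(run)
    return runs


x = [1, 2, 2, 2, 3, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4, 5, 5, 6]
kept_runs = increasing_runs(x, min_size=5)
print(kept_runs)
```

The two runs that survive are the group1 and group2 from the expected output in the question. In Spark, the size filter would be a count per diff_lt_0_cumsum value followed by a filter on that count.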

Splitting rows of a dataset depending on a column values

I am using Spark 3.1.1 with Java 8. I am trying to split a Dataset<Row> according to the values of one of its numerical columns (greater or less than a threshold). The split is possible only if some string column values of the rows are identical. I am trying something like this:
Iterator<Row> iter2 = partition.toLocalIterator();
while (iter2.hasNext()) {
    Row item = iter2.next();
    // getColVal is a function that gets the value given a column
    String numValue = getColVal(item, dim);
    if (Integer.parseInt(numValue) < threshold)
        pl.add(item);
    else
        pr.add(item);
}
But how can I check, before splitting, whether some other (string) column values of the rows concerned are identical, so that the split can be performed?
PS: I tried to groupBy the columns before splitting, like so:
Dataset<Row> newDataset=oldDataset.groupBy("col1","col4").agg(col("col1"));
but it's not working.
Thank you for the help
EDIT:
A sample dataset which I want to split is:
abc,9,40,A
abc,7,50,A
cde,4,20,B
cde,3,25,B
If the threshold is 30, then the first two rows and the last two rows will form two datasets, because the first and fourth columns of these rows are identical; otherwise the split is not possible.
EDIT: the resulting output would be
abc,9,40,A
abc,7,50,A
cde,4,20,B
cde,3,25,B
I mainly use PySpark, but you could adapt this to your environment.
import pandas as pd
from pyspark.sql import functions as F, Window
## could add some conditional logic or just always output 2 data frames where
## one would be empty
print("pdf - two dataframe")
## create pandas dataframe
pdf = pd.DataFrame({'col1':['abc','abc','cde','cde'],'col2':[9,7,4,3],'col3':[40,50,20,25],'col4':['A','A','B','B']})
print( pdf )
## move it to spark
print("sdf")
sdf = spark.createDataFrame(pdf)
sdf.show()
# +----+----+----+----+
# |col1|col2|col3|col4|
# +----+----+----+----+
# | abc| 9| 40| A|
# | abc| 7| 50| A|
# | cde| 4| 20| B|
# | cde| 3| 25| B|
# +----+----+----+----+
## filter
pl = sdf.filter('col3 <= 30')\
.groupBy("col1","col4").agg(F.sum('col2').alias('sumC2'))
pr = sdf.filter('col3 > 30')\
.groupBy("col1","col4").agg(F.sum('col2').alias('sumC2'))
print("pl")
pl.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# | cde| B| 7|
# +----+----+-----+
print("pr")
pr.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# | abc| A| 16|
# +----+----+-----+
print("pdf - one dataframe")
## create pandas dataframe
pdf = pd.DataFrame({'col1':['abc','abc','cde','cde'],'col2':[9,7,4,3],'col3':[11,29,20,25],'col4':['A','A','B','B']})
print( pdf )
## move it to spark
print("sdf")
sdf = spark.createDataFrame(pdf)
sdf.show()
# +----+----+----+----+
# |col1|col2|col3|col4|
# +----+----+----+----+
# | abc| 9| 11| A|
# | abc| 7| 29| A|
# | cde| 4| 20| B|
# | cde| 3| 25| B|
# +----+----+----+----+
pl = sdf.filter('col3 <= 30')\
.groupBy("col1","col4").agg( F.sum('col2').alias('sumC2') )
pr = sdf.filter('col3 > 30')\
.groupBy("col1","col4").agg(F.sum('col2').alias('sumC2'))
print("pl")
pl.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# | abc| A| 16|
# | cde| B| 7|
# +----+----+-----+
print("pr")
pr.show()
# +----+----+-----+
# |col1|col4|sumC2|
# +----+----+-----+
# +----+----+-----+
Filtering by a dynamic mean
print("pdf - filter by mean")
## create pandas dataframe
pdf = pd.DataFrame({'col1':['abc','abc','cde','cde'],'col2':[9,7,4,3],'col3':[40,50,20,25],'col4':['A','A','B','B']})
print( pdf )
## move it to spark
print("sdf")
sdf = spark.createDataFrame(pdf)
sdf.show()
# +----+----+----+----+
# |col1|col2|col3|col4|
# +----+----+----+----+
# | abc| 9| 40| A|
# | abc| 7| 50| A|
# | cde| 4| 20| B|
# | cde| 3| 25| B|
# +----+----+----+----+
## no orderBy here: an ordered window would give a running mean instead
w = Window.partitionBy("col1")
## add another column, the mean of col2 partitioned by col1
sdf = sdf.withColumn('mean_c2', F.mean('col2').over(w))
## filter by the dynamic mean
pr = sdf.filter('col2 > mean_c2')
pr.show()
# +----+----+----+----+-------+
# |col1|col2|col3|col4|mean_c2|
# +----+----+----+----+-------+
# | cde| 4| 20| B| 3.5|
# | abc| 9| 40| A| 8.0|
# +----+----+----+----+-------+
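A caveat when defining such a window: if the spec also carries an orderBy and no explicit frame, Spark's default frame runs from the start of the partition to the current row, so an aggregate such as the mean becomes a running value rather than a per-group value. The plain-Python sketch below illustrates the difference for a single group; the values 1, 2, 10 are made up for the illustration and are not from the question.

```python
from itertools import accumulate

values = [1, 2, 10]  # one group, already sorted ascending

# per-group mean: the same value for every row of the group
group_mean = sum(values) / len(values)

# running mean: mean of all rows from the start of the group up to the current row
running_means = [total / (i + 1) for i, total in enumerate(accumulate(values))]

above_group_mean = [v for v in values if v > group_mean]
above_running_mean = [v for v, m in zip(values, running_means) if v > m]

print(above_group_mean)
print(above_running_mean)
```

The two filters disagree on the value 2, which is above the running mean of the first two rows but below the mean of the whole group. On the sample data above, both definitions happen to keep the same rows.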

PySpark using both aggregate and group by

Can someone help me with pyspark using both aggregate and groupby functions? I have made my data frames, and applied filters and selects to get the data I want. However, I am now stuck trying to aggregate things correctly.
Currently, my code outputs the content below:
+----------+-----------+--------------+---------------+----------+---------+
|l_orderkey|o_orderdate|o_shippriority|l_extendedprice|l_discount| rev|
+----------+-----------+--------------+---------------+----------+---------+
| 53634| 1995-02-22| 0| 20517.44| 0.08|18876.045|
| 265539| 1995-01-25| 0| 70423.08| 0.01| 69718.85|
| 331590| 1994-12-10| 0| 46692.75| 0.03| 45291.97|
| 331590| 1994-12-10| 0| 37235.1| 0.1| 33511.59|
| 420545| 1995-03-05| 0| 75542.1| 0.04|72520.414|
| 420545| 1995-03-05| 0| 1062.0| 0.07|987.66003|
| 420545| 1995-03-05| 0| 9729.45| 0.1| 8756.505|
| 420545| 1995-03-05| 0| 15655.6| 0.04|15029.375|
| 420545| 1995-03-05| 0| 3121.3| 0.03|3027.6611|
| 420545| 1995-03-05| 0| 71723.0| 0.03| 69571.31|
| 488928| 1995-02-15| 0| 1692.77| 0.01|1675.8423|
| 488928| 1995-02-15| 0| 22017.84| 0.01|21797.662|
| 488928| 1995-02-15| 0| 57100.42| 0.04|54816.402|
| 488928| 1995-02-15| 0| 3807.76| 0.05| 3617.372|
| 488928| 1995-02-15| 0| 73332.52| 0.01|72599.195|
| 510754| 1994-12-21| 0| 41171.78| 0.09| 37466.32|
| 512422| 1994-12-26| 0| 87251.56| 0.07| 81143.95|
| 677761| 1994-12-26| 0| 60123.34| 0.0| 60123.34|
| 956646| 1995-03-07| 0| 61853.68| 0.05|58760.996|
| 1218886| 1995-02-13| 0| 24844.0| 0.01| 24595.56|
+----------+-----------+--------------+---------------+----------+---------+
I wish to group by l_orderkey and aggregate rev as a sum.
Here is my most recent attempt, with t being the dataframe and F being the functions module from pyspark.sql (from pyspark.sql import functions as F):
(t.groupby(t.l_orderkey, t.o_orderdate, t.o_shippriority)
  .agg(F.collect_set(sum(t.rev)), F.collect_set(t.l_orderkey)).show())
Can someone tell me whether I'm on the right track? I keep getting "Column is not iterable".
from pyspark.sql import functions as F
total_rev = t.groupby(t.l_orderkey).agg(F.sum(t.rev).alias('total_rev'))
# print / show the top results
total_rev.show()
This would give you a new df with columns l_orderkey and total_rev, where total_rev stores the aggregated sum of rev.
collect_set is meant for collecting the distinct values of a column (removing duplicates), so it is not needed here.
You are getting "Column is not iterable" because you are using the built-in Python function sum and not the Spark function F.sum.
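The mechanism behind that error can be reproduced without Spark. Python's built-in sum tries to iterate over its argument, and a column expression refuses to be iterated. In the sketch below, FakeColumn is a hypothetical stand-in written for this illustration, not PySpark's own class.

```python
class FakeColumn:
    """Hypothetical stand-in for a Spark column expression (not PySpark's class)."""

    def __iter__(self):
        # refuse iteration, which is what the error in the question reports
        raise TypeError("Column is not iterable")


def try_builtin_sum(column):
    """Call Python's built-in sum and return the error message, if any."""
    try:
        sum(column)
    except TypeError as exc:
        return str(exc)
    return None


message = try_builtin_sum(FakeColumn())
print(message)
```

F.sum, by contrast, does not iterate anything on the driver; it builds an aggregate expression that Spark evaluates per group.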

Filling gaps in time series Spark for different entities

I have a data frame containing daily events related to various entities in time.
I want to fill the gaps in those times series.
Here is the aggregate data I have (left), and on the right side, the data I want to have:
+---------+----------+-------+ +---------+----------+-------+
|entity_id| date|counter| |entity_id| date|counter|
+---------+----------+-------+ +---------+----------+-------+
| 3|2020-01-01| 7| | 3|2020-01-01| 7|
| 1|2020-01-01| 10| | 1|2020-01-01| 10|
| 2|2020-01-01| 3| | 2|2020-01-01| 3|
| 2|2020-01-02| 9| | 2|2020-01-02| 9|
| 1|2020-01-03| 15| | 1|2020-01-02| 0|
| 2|2020-01-04| 3| | 3|2020-01-02| 0|
| 1|2020-01-04| 14| | 1|2020-01-03| 15|
| 2|2020-01-05| 6| | 2|2020-01-03| 0|
+---------+----------+-------+ | 3|2020-01-03| 0|
| 3|2020-01-04| 0|
| 2|2020-01-04| 3|
| 1|2020-01-04| 14|
| 2|2020-01-05| 6|
| 1|2020-01-05| 0|
| 3|2020-01-05| 0|
+---------+----------+-------+
I have used this stack overflow topic, which was very useful:
Filling gaps in timeseries Spark
Here is my code (filtered for only one entity). It is in Python, but I think the API is the same in Scala:
from pyspark.sql import functions as sf
a = 60 * 60 * 24  # seconds in one day
(
    df
    .withColumn("date", sf.to_date("created_at"))
    .groupBy(
        sf.col("entity_id"),
        sf.col("date")
    )
    .agg(sf.count(sf.lit(1)).alias("counter"))
    .filter(sf.col("entity_id") == 1)
    .select(
        sf.col("date"),
        sf.col("counter")
    )
    .join(
        spark
        .range(
            df  # range start
            .filter(sf.col("entity_id") == 1)
            .select(sf.unix_timestamp(sf.min("created_at")).alias("min"))
            .first().min // a * a,  # a = 60 * 60 * 24 = seconds in one day
            (df  # range end
             .filter(sf.col("entity_id") == 1)
             .select(sf.unix_timestamp(sf.max("created_at")).alias("max"))
             .first().max // a + 1) * a,
            a  # range step, a = 60 * 60 * 24 = seconds in one day
        )
        .select(sf.to_date(sf.from_unixtime("id")).alias("date")),
        ["date"],  # column which will be used for the join
        how="right"  # type of join
    )
    .withColumn("counter", sf.when(sf.isnull("counter"), 0).otherwise(sf.col("counter")))
    .sort(sf.col("date"))
    .show(200)
)
This works very well, but now I want to avoid the filter and build a range that fills the time series gaps for every entity (entity_id == 2, entity_id == 3, ...). For your information, the minimum and maximum of the date column can differ depending on the entity_id value; nevertheless, if your solution uses the global minimum and maximum of the whole data frame, that is fine for me as well.
If you need any other information, feel free to ask.
edit: add data example I want to have
When creating the elements of the date range, I would rather use the pandas function than the Spark range, as the Spark range function has some shortcomings when dealing with date values. The number of distinct dates is usually small. Even for a time span of multiple years, there are so few distinct dates that they can easily be broadcast in a join.
from pyspark.sql import functions as F
import pandas as pd

# get the minimum and maximum date and collect them to the driver
min_date, max_date = df.select(F.min("date"), F.max("date")).first()
# use pandas to create all dates and switch back to a PySpark DataFrame
timerange = pd.date_range(start=min_date, end=max_date, freq='D')
all_dates = spark.createDataFrame(timerange.to_frame(), ['date'])
# get all combinations of dates and entity_ids
all_dates_and_ids = all_dates.crossJoin(df.select("entity_id").distinct())
# create the final result by doing a left join and filling null values with 0
result = all_dates_and_ids.join(df, on=['date', 'entity_id'], how="left_outer") \
    .fillna({'counter': 0}) \
    .orderBy(['date', 'entity_id'])
This gives
+-------------------+---------+-------+
| date|entity_id|counter|
+-------------------+---------+-------+
|2020-01-01 00:00:00| 1| 10|
|2020-01-01 00:00:00| 2| 3|
|2020-01-01 00:00:00| 3| 7|
|2020-01-02 00:00:00| 1| 0|
|2020-01-02 00:00:00| 2| 9|
|2020-01-02 00:00:00| 3| 0|
|2020-01-03 00:00:00| 1| 15|
|2020-01-03 00:00:00| 2| 0|
|2020-01-03 00:00:00| 3| 0|
|2020-01-04 00:00:00| 1| 14|
|2020-01-04 00:00:00| 2| 3|
|2020-01-04 00:00:00| 3| 0|
|2020-01-05 00:00:00| 1| 0|
|2020-01-05 00:00:00| 2| 6|
|2020-01-05 00:00:00| 3| 0|
+-------------------+---------+-------+
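The same three steps (global date range, cross join with the entity ids, left join with a default of 0) can be sketched in plain Python to check the expected result on the question's sample rows. This is only an illustration using the standard library; the variable names are made up for the sketch.

```python
from datetime import date, timedelta

# (entity_id, date) -> counter, copied from the question's aggregated data
observed = {
    (3, date(2020, 1, 1)): 7,
    (1, date(2020, 1, 1)): 10,
    (2, date(2020, 1, 1)): 3,
    (2, date(2020, 1, 2)): 9,
    (1, date(2020, 1, 3)): 15,
    (2, date(2020, 1, 4)): 3,
    (1, date(2020, 1, 4)): 14,
    (2, date(2020, 1, 5)): 6,
}

entity_ids = sorted({entity for entity, _ in observed})
all_days = [day for _, day in observed]
start, end = min(all_days), max(all_days)

# every calendar day between the global minimum and maximum, inclusive
calendar = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# cross join of days and entities, defaulting missing counters to 0
filled = [
    (day, entity, observed.get((entity, day), 0))
    for day in calendar
    for entity in entity_ids
]

for row in filled:
    print(row)
```

Five days times three entities gives the 15 rows shown in the table above.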

Saving iteratively to a new DataFrame in Pyspark

I'm performing computations based on 3 different PySpark DataFrames.
This script works in the sense that it performs the computation as it should, however, I struggle with working properly with the results of said computation.
import sys
import numpy as np
from pyspark import SparkConf, SparkContext, SQLContext
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType
sc = SparkContext("local")
sqlContext = SQLContext(sc)
# Dummy Data
df = sqlContext.createDataFrame([[0,1,0,0,0],[1,1,0,0,1],[0,0,1,0,1],[1,0,1,1,0],[1,1,0,0,0]], ['p1', 'p2', 'p3', 'p4', 'p5'])
df.show()
+---+---+---+---+---+
| p1| p2| p3| p4| p5|
+---+---+---+---+---+
| 0| 1| 0| 0| 0|
| 1| 1| 0| 0| 1|
| 0| 0| 1| 0| 1|
| 1| 0| 1| 1| 0|
| 1| 1| 0| 0| 0|
+---+---+---+---+---+
# Values
values = sqlContext.createDataFrame([(0,1,'p1'),(None,1,'p2'),(0,0,'p3'),(None,0, 'p4'),(1,None,'p5')], ('f1', 'f2','index'))
values.show()
+----+----+-----+
| f1| f2|index|
+----+----+-----+
| 0| 1| p1|
|null| 1| p2|
| 0| 0| p3|
|null| 0| p4|
| 1|null| p5|
+----+----+-----+
# Weights
weights = sqlContext.createDataFrame([(4,3,'p1'),(None,1,'p2'),(2,2,'p3'),(None, 3, 'p4'),(3,None,'p5')], ('f1', 'f2','index'))
weights.show()
+----+----+-----+
| f1| f2|index|
+----+----+-----+
| 4| 3| p1|
|null| 1| p2|
| 2| 2| p3|
|null| 3| p4|
| 3|null| p5|
+----+----+-----+
# Function: sums the entries of W where Row equals V, then divides by the length of V.
# If there are no matches between Row and V, it outputs 0.
def W_sum(row,v,w):
    if len(w[row==v])>0:
        return float(np.sum(w[row==v])/len(w))
    else:
        return 0.0
For each of the columns and for each row in Data, the above function is applied.
# We iterate over the columns of Values (except the last one called index)
for val in values.columns[:-1]:
    # we filter the data to work only with the columns that are defined for the selected Value
    defined_col = [i[0] for i in values.where(F.col(val) >= 0).select(values.index).collect()]
    # we select only the useful columns
    df_select= df.select(defined_col)
    # we retrieve the reference value and weights
    V = np.array(values.where(values.index.isin(defined_col)).select(val).collect()).flatten()
    W = np.array(weights.where(weights.index.isin(defined_col)).select(val).collect()).flatten()
    W_sum_udf = F.udf(lambda row: W_sum(row, V, W), FloatType())
    df_select.withColumn(val, W_sum_udf(F.array(*(F.col(x) for x in df_select.columns))))
This gives :
+---+---+---+---+---+---+
| p1| p2| p3| p4| p5| f1|
+---+---+---+---+---+---+
| 0| 1| 0| 0| 0|2.0|
| 1| 1| 0| 0| 1|1.0|
| 0| 0| 1| 0| 1|2.0|
| 1| 0| 1| 1| 0|0.0|
| 1| 1| 0| 0| 0|0.0|
+---+---+---+---+---+---+
It added the column to the sliced DataFrame as I asked it to. The problem is that I would rather collect the data into a new DataFrame that I could access at the end to consult the results.
Is it possible to grow (somewhat efficiently) a DataFrame in PySpark as I would with pandas?
Edit to make my goal clearer:
Ideally I would get a DataFrame with just the computed columns, like this:
+---+---+
| f1| f2|
+---+---+
|2.0|1.0|
|1.0|2.0|
|2.0|0.0|
|0.0|0.0|
|0.0|2.0|
+---+---+
There are some issues with your question...
First, your for loop will produce an error, since df_select in the last line is nowhere defined; there is also no assignment at the end (what does it produce?).
Assuming that df_select is actually your subsubsample dataframe, defined some lines before, and that your last line is something like
new_df = subsubsample.withColumn(val, W_sum_udf(F.array(*(F.col(x) for x in subsubsample.columns))))
then your problem starts getting more clear. Since
values.columns[:-1]
# ['f1', 'f2']
the result of the whole loop would be just
+---+---+---+---+---+
| p1| p2| p3| p4| f2|
+---+---+---+---+---+
| 0| 1| 0| 0|1.0|
| 1| 1| 0| 0|2.0|
| 0| 0| 1| 0|0.0|
| 1| 0| 1| 1|0.0|
| 1| 1| 0| 0|2.0|
+---+---+---+---+---+
i.e. with only the column f2 included (natural, since the results with f1 are simply overwritten).
Now, as I said, assuming that the situation is like this, and that your problem is actually how to have both columns f1 & f2 together rather than in different dataframes, you can just forget subsubsample and append columns to your initial df, possibly dropping the unwanted ones afterwards:
init_cols = df.columns
init_cols
# ['p1', 'p2', 'p3', 'p4', 'p5']
new_df = df
for val in values.columns[:-1]:
    # we filter the data to work only with the columns that are defined for the selected Value
    defined_col = [i[0] for i in values.where(F.col(val) >= 0).select(values.index).collect()]
    # we retrieve the reference value and weights
    V = np.array(values.where(values.index.isin(defined_col)).select(val).collect()).flatten()
    W = np.array(weights.where(weights.index.isin(defined_col)).select(val).collect()).flatten()
    W_sum_udf = F.udf(lambda row: W_sum(row, V, W), FloatType())
    new_df = new_df.withColumn(val, W_sum_udf(F.array(*(F.col(x) for x in defined_col))))  # change here
# drop initial columns:
for i in init_cols:
    new_df = new_df.drop(i)
The resulting new_df will be:
+---+---+
| f1| f2|
+---+---+
|2.0|1.0|
|1.0|2.0|
|2.0|0.0|
|0.0|0.0|
|0.0|2.0|
+---+---+
UPDATE (after comment): To force the division in your W_sum function to be a float, use:
from __future__ import division
new_df now will be:
+---------+----+
| f1| f2|
+---------+----+
| 2.0| 1.5|
|1.6666666|2.25|
|2.3333333|0.75|
| 0.0|0.75|
|0.6666667|2.25|
+---------+----+
with f2 exactly as it should be according to your comment.
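To double-check the f1 column by hand, the computation can be reproduced without NumPy or Spark. The sketch below is a plain-Python re-statement of W_sum under the assumption of true (float) division; w_sum, v_f1, w_f1 and rows_f1 are names made up for this check, with the values read off the question's values, weights and data tables for the columns where f1 is defined (p1, p3, p5).

```python
def w_sum(row, v, w):
    """Sum the weights where row matches v, divided by the number of weights."""
    matched = [weight for r, ref, weight in zip(row, v, w) if r == ref]
    return sum(matched) / len(w) if matched else 0.0


# reference values and weights of f1 for p1, p3, p5
v_f1, w_f1 = [0, 0, 1], [4, 2, 3]
# the five data rows restricted to columns p1, p3, p5
rows_f1 = [(0, 0, 0), (1, 0, 1), (0, 1, 1), (1, 1, 0), (1, 0, 0)]

f1 = [w_sum(row, v_f1, w_f1) for row in rows_f1]
print([round(value, 4) for value in f1])
```

Up to float rounding, these are the f1 values in the table above.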
