Imagine a Spark DataFrame consisting of value observations from variables. Each observation has a specific timestamp, and those timestamps differ between variables, because a timestamp is only recorded when the value of a variable changes.
#Variable Time Value
#852-YF-007 2016-05-10 00:00:00 0
#852-YF-007 2016-05-09 23:59:00 0
#852-YF-007 2016-05-09 23:58:00 0
Problem: I would like to put all variables onto the same frequency (for instance 10 min) using forward fill. To visualize this, I copied a page from the book "Python for Data Analysis"; a rough equivalent is sketched below.
Question: How can this be done efficiently on a Spark DataFrame?
A Spark DataFrame is simply not a good fit for an operation like this one. In general, the SQL primitives won't be expressive enough, and the PySpark DataFrame API doesn't provide the low-level access required to implement it.
Re-sampling itself, however, can be expressed easily using epoch / timestamp arithmetic. With data like this:
from pyspark.sql.functions import col, max as max_, min as min_
df = (spark
    .createDataFrame([
        ("2012-06-13", 0.694), ("2012-06-20", -2.669), ("2012-06-27", 0.245)],
        ["ts", "val"])
    .withColumn("ts", col("ts").cast("date").cast("timestamp")))
we can re-sample input:
day = 60 * 60 * 24
epoch = (col("ts").cast("bigint") / day).cast("bigint") * day
with_epoch = df.withColumn("epoch", epoch)
min_epoch, max_epoch = with_epoch.select(min_("epoch"), max_("epoch")).first()
and join with reference:
# Reference range
ref = spark.range(
    min_epoch, max_epoch + 1, day
).toDF("epoch")

(ref
    .join(with_epoch, "epoch", "left")
    .orderBy("epoch")
    .withColumn("ts_resampled", col("epoch").cast("timestamp"))
    .show(15, False))
## +----------+---------------------+------+---------------------+
## |epoch |ts |val |ts_resampled |
## +----------+---------------------+------+---------------------+
## |1339459200|2012-06-13 00:00:00.0|0.694 |2012-06-12 02:00:00.0|
## |1339545600|null |null |2012-06-13 02:00:00.0|
## |1339632000|null |null |2012-06-14 02:00:00.0|
## |1339718400|null |null |2012-06-15 02:00:00.0|
## |1339804800|null |null |2012-06-16 02:00:00.0|
## |1339891200|null |null |2012-06-17 02:00:00.0|
## |1339977600|null |null |2012-06-18 02:00:00.0|
## |1340064000|2012-06-20 00:00:00.0|-2.669|2012-06-19 02:00:00.0|
## |1340150400|null |null |2012-06-20 02:00:00.0|
## |1340236800|null |null |2012-06-21 02:00:00.0|
## |1340323200|null |null |2012-06-22 02:00:00.0|
## |1340409600|null |null |2012-06-23 02:00:00.0|
## |1340496000|null |null |2012-06-24 02:00:00.0|
## |1340582400|null |null |2012-06-25 02:00:00.0|
## |1340668800|2012-06-27 00:00:00.0|0.245 |2012-06-26 02:00:00.0|
## +----------+---------------------+------+---------------------+
In Spark >= 3.1 replace
col("epoch").cast("timestamp")
with
from pyspark.sql.functions import timestamp_seconds
timestamp_seconds("epoch")
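To hit the 10-minute frequency asked about in the question rather than daily buckets, only the bucket size changes (a sketch based on the same arithmetic as above):
step = 10 * 60  # 10-minute buckets, in seconds
epoch_10min = (col("ts").cast("bigint") / step).cast("bigint") * step
# the reference range would then use `step` instead of `day`: spark.range(min_epoch, max_epoch + 1, step)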
Using low-level APIs it is possible to fill data like this, as I've shown in my answer to Spark / Scala: forward fill with last observation. Using RDDs we could also avoid shuffling the data twice (once for the join, once for the reordering).
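For a rough DataFrame-only sketch of the forward fill itself (a window with last(..., ignorenulls=True) over the re-sampled output above, not the low-level approach just referenced):
from pyspark.sql import Window
from pyspark.sql.functions import col, last

resampled = (ref
    .join(with_epoch, "epoch", "left")
    .withColumn("ts_resampled", col("epoch").cast("timestamp")))

# carry the last observed value forward along the time axis;
# with multiple variables you would also partitionBy("variable") to keep this parallel
w = Window.orderBy("epoch").rowsBetween(Window.unboundedPreceding, Window.currentRow)
filled = resampled.withColumn("val_filled", last("val", ignorenulls=True).over(w))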
But there is a much more important problem here. Spark performs optimally when a problem can be reduced to element-wise or partition-wise computations. While forward fill is a case where that is possible, as far as I am aware this is typically not true of commonly used time-series models, and if an operation requires sequential access then Spark won't provide any benefit at all.
So if you work with series which are large enough to require a distributed data structure, you'll probably want to aggregate them into some object that can be easily handled by a single machine and then use your favorite non-distributed tool to handle the rest.
If you work with multiple time series where each can be handled in memory then there is of course sparkts, but I know you're already aware of that.
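If you go that route on a newer Spark (3.0+), one way to sketch it is a grouped Pandas UDF that hands each variable's series to pandas for the actual resampling; the DataFrame name sdf and the Variable / Time / Value columns below are assumptions based on the sample in the question:
def resample_group(pdf):
    # pandas does the per-series work: last value per 10-minute bucket, then forward fill
    pdf = pdf.sort_values("Time").set_index("Time")
    out = pdf[["Value"]].resample("10min").last().ffill().reset_index()
    out["Value"] = out["Value"].astype("float64")   # match the declared double type
    out["Variable"] = pdf["Variable"].iloc[0]
    return out[["Variable", "Time", "Value"]]

resampled = (sdf
    .groupBy("Variable")
    .applyInPandas(resample_group, schema="Variable string, Time timestamp, Value double"))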
I once answered a similar question; it's a bit of a hack, but the idea makes sense in your case: map every value onto a list of timestamps, then flatten the list vertically.
From: Inserting records in a spark dataframe:
You can generate timestamp ranges, flatten them and select rows
import pyspark.sql.functions as func
from pyspark.sql.types import IntegerType, ArrayType
a = sc.parallelize([[670098928, 50], [670098930, 53], [670098934, 55]])\
    .toDF(['timestamp', 'price'])

# wrap range(...) in list(...) so the UDF returns a list (not a range object) for the ArrayType column
f = func.udf(lambda x: list(range(x, x + 5)), ArrayType(IntegerType()))

a.withColumn('timestamp', f(a.timestamp))\
    .withColumn('timestamp', func.explode(func.col('timestamp')))\
    .groupBy('timestamp')\
    .agg(func.max(func.col('price')))\
    .show()
+---------+----------+
|timestamp|max(price)|
+---------+----------+
|670098928| 50|
|670098929| 50|
|670098930| 53|
|670098931| 53|
|670098932| 53|
|670098933| 53|
|670098934| 55|
|670098935| 55|
|670098936| 55|
|670098937| 55|
|670098938| 55|
+---------+----------+
This is an old post, though I recently had to solve this with Spark 3.2. Here's the solution I came up with to both up-sample and down-sample the time-series to obtain exactly one data-point per object and per time period.
Assume the following input data, which we want to re-sample per day. Some variables have several data points per day, while others have no data at all for several days:
from pyspark.sql.types import StructType, StringType, ArrayType, DoubleType, TimestampType
from pyspark.sql.functions import udf, date_trunc, row_number, desc, coalesce, datediff, lead, explode, col, lit
from pyspark.sql import Window, Row
from datetime import datetime, timedelta
df = spark.createDataFrame([
Row(variable="A", record_ts=datetime.fromisoformat("2021-10-01T03:34:23.000"), value=1.),
Row(variable="B", record_ts=datetime.fromisoformat("2021-10-01T04:34:23.000"), value=10.),
Row(variable="C", record_ts=datetime.fromisoformat("2021-10-01T05:34:23.000"), value=100.),
Row(variable="A", record_ts=datetime.fromisoformat("2021-10-02T01:34:23.000"), value=2.),
Row(variable="A", record_ts=datetime.fromisoformat("2021-10-02T05:34:23.000"), value=3.),
Row(variable="C", record_ts=datetime.fromisoformat("2021-10-02T02:34:23.000"), value=200.),
Row(variable="C", record_ts=datetime.fromisoformat("2021-10-02T05:34:23.000"), value=200.),
Row(variable="B", record_ts=datetime.fromisoformat("2021-10-04T10:34:23.000"), value=40.),
Row(variable="B", record_ts=datetime.fromisoformat("2021-10-04T12:34:23.000"), value=42.),
Row(variable="B", record_ts=datetime.fromisoformat("2021-10-04T14:34:23.000"), value=46.),
Row(variable="A", record_ts=datetime.fromisoformat("2021-10-05T14:34:23.000"), value=6.),
Row(variable="A", record_ts=datetime.fromisoformat("2021-10-07T09:34:23.000"), value=7.),
Row(variable="B", record_ts=datetime.fromisoformat("2021-10-07T08:34:23.000"), value=70.),
Row(variable="C", record_ts=datetime.fromisoformat("2021-10-07T05:34:23.000"), value=700.),
])
I first need this simple udf, which essentially just builds a sequence of timestamps:
@udf(ArrayType(TimestampType()))
def pad_time(count: int, start_time: datetime):
    # one timestamp per day to fill: start_time, start_time + 1 day, ...
    if count is None:
        return []
    else:
        return [start_time + timedelta(days=c) for c in range(count)]
Down-sampling can be done with a simple groupBy or partitionBy, keeping at most one value per variable per day (I chose partitionBy in the example below).
Up-sampling with a "fill-forward" strategy can be done by measuring the size of the time gap between two successive rows, and then using this information to call the udf above.
result = (df
    # down-sampling: keep the last value of each variable each day
    .withColumn("record_day", date_trunc("DAY", "record_ts"))
    .withColumn("row_num",
                row_number().over(
                    Window.partitionBy("variable", "record_day").orderBy(desc("record_ts"))
                ))
    .where("row_num == 1")
    # up-sampling part 1: count the number of days to be filled (or 1 for the very last value)
    .withColumn("gap",
                coalesce(
                    datediff(
                        lead("record_day").over(Window.partitionBy("variable").orderBy("record_day")),
                        "record_day"),
                    lit(1)))
    .select(
        # up-sampling part 2: pad the time axis as dictated by "gap"; the other two fields are repeated
        explode(pad_time("gap", "record_day")).alias("record_day"),
        "variable",
        "value",
    )
    .orderBy("record_day", "variable"))

result.show(25)
The result looks like this:
+-------------------+--------+-----+
| record_day|variable|value|
+-------------------+--------+-----+
|2021-10-01 00:00:00| A| 1.0|
|2021-10-01 00:00:00| B| 10.0|
|2021-10-01 00:00:00| C|100.0|
|2021-10-02 00:00:00| A| 3.0|
|2021-10-02 00:00:00| B| 10.0|
|2021-10-02 00:00:00| C|200.0|
|2021-10-03 00:00:00| A| 3.0|
|2021-10-03 00:00:00| B| 10.0|
|2021-10-03 00:00:00| C|200.0|
|2021-10-04 00:00:00| A| 3.0|
|2021-10-04 00:00:00| B| 46.0|
|2021-10-04 00:00:00| C|200.0|
|2021-10-05 00:00:00| A| 6.0|
|2021-10-05 00:00:00| B| 46.0|
|2021-10-05 00:00:00| C|200.0|
|2021-10-06 00:00:00| A| 6.0|
|2021-10-06 00:00:00| B| 46.0|
|2021-10-06 00:00:00| C|200.0|
|2021-10-07 00:00:00| A| 7.0|
|2021-10-07 00:00:00| B| 70.0|
|2021-10-07 00:00:00| C|700.0|
+-------------------+--------+-----+
Since Spark 2.4, you can use the sequence built-in function with a window to generate all the timestamps between a date of change and the next date of change, and then use explode to flatten those timestamps.
If we start with the following dataframe df:
+----------+-------------------+---------+
|variable |time |value |
+----------+-------------------+---------+
|852-YF-007|2012-06-13 00:00:00|0.694283 |
|852-YF-007|2012-06-20 00:00:00|-2.669195|
|852-YF-007|2012-06-27 00:00:00|0.245842 |
+----------+-------------------+---------+
when we use the following code:
from pyspark.sql import Window
from pyspark.sql import functions as F
next_start_time = F.lead('time').over(Window.partitionBy('variable').orderBy('time'))
end_time = F.when(next_start_time.isNull(),
F.col('time')
).otherwise(
F.date_sub(next_start_time, 1)
)
result = df.withColumn('start', F.col('time')) \
.withColumn('stop', end_time) \
.withColumn('time', F.explode(F.sequence(
F.col('start'), F.col('stop'), F.expr("INTERVAL 1 DAY"))
)) \
.drop('start', 'stop')
You get the following result dataframe:
+----------+-------------------+---------+
|variable |time |value |
+----------+-------------------+---------+
|852-YF-007|2012-06-13 00:00:00|0.694283 |
|852-YF-007|2012-06-14 00:00:00|0.694283 |
|852-YF-007|2012-06-15 00:00:00|0.694283 |
|852-YF-007|2012-06-16 00:00:00|0.694283 |
|852-YF-007|2012-06-17 00:00:00|0.694283 |
|852-YF-007|2012-06-18 00:00:00|0.694283 |
|852-YF-007|2012-06-19 00:00:00|0.694283 |
|852-YF-007|2012-06-20 00:00:00|-2.669195|
|852-YF-007|2012-06-21 00:00:00|-2.669195|
|852-YF-007|2012-06-22 00:00:00|-2.669195|
|852-YF-007|2012-06-23 00:00:00|-2.669195|
|852-YF-007|2012-06-24 00:00:00|-2.669195|
|852-YF-007|2012-06-25 00:00:00|-2.669195|
|852-YF-007|2012-06-26 00:00:00|-2.669195|
|852-YF-007|2012-06-27 00:00:00|0.245842 |
+----------+-------------------+---------+
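To produce the 10-minute grid from the original question instead of daily rows, the same pattern applies with a smaller interval (a sketch reusing the columns above):
next_start_time = F.lead('time').over(Window.partitionBy('variable').orderBy('time'))
end_time = F.when(next_start_time.isNull(), F.col('time')) \
    .otherwise(next_start_time - F.expr('INTERVAL 10 MINUTES'))

result = df.withColumn('time', F.explode(F.sequence(
    F.col('time'), end_time, F.expr('INTERVAL 10 MINUTES'))))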
I have a Spark dataframe, like so:
# For sake of simplicity only one id is shown, but there are multiple objects
+---+-------------------+------+
| id| timstm|signal|
+---+-------------------+------+
| X1|2022-07-01 00:00:00| null|
| X1|2022-07-02 00:00:00| true|
| X1|2022-07-03 00:00:00| null|
| X1|2022-07-05 00:00:00| null|
| X1|2022-07-09 00:00:00| true|
+---+-------------------+------+
And I want to create a new column that contains the time since the signal column was last true
+---+-------------------+------+---------+
| id| timstm|signal|time_diff|
+---+-------------------+------+---------+
| X1|2022-07-01 00:00:00| null| null|
| X1|2022-07-02 00:00:00| true| 0.0|
| X1|2022-07-03 00:00:00| null| 1.0|
| X1|2022-07-05 00:00:00| null| 3.0|
| X1|2022-07-09 00:00:00| true| 0.0|
+---+-------------------+------+---------+
Any ideas how to approach this? My intuition is to somehow use window and filter to achieve this, but I'm not sure
So this logic is a bit hard to express in native PySpark. It might be easier to express it as a pandas_udf. I will use the Fugue library to bring the Python/Pandas code to a Pandas UDF; if you don't want to use Fugue, you can still bring it to a Pandas UDF yourself, it just takes more code.
Setup
Here I am just creating the DataFrame from the example. It is a Pandas DataFrame for now; we will convert it to Spark and run the solution on Spark later.
I suggest filling the nulls with False in the original DataFrame. This is because the Pandas code uses a group-by, and NULL values are dropped by default in a Pandas groupby. Filling the NULLs with False makes it work properly (and I think it also makes conversion between Spark and Pandas easier).
import pandas as pd
df = pd.DataFrame({"id": ["X1"]*5,
"timestm": ["2022-07-01", "2022-07-02", "2022-07-03", "2022-07-05", "2022-07-09"],
"signal": [None, True, None, None, True]})
df['timestm'] = pd.to_datetime(df['timestm'])
df['signal'] = df['signal'].fillna(False)
Solution 1
So when we use a Pandas UDF, the important piece is that the function is applied per Spark partition, so it only needs to be able to handle one id. We then partition the Spark DataFrame by id and run the function for each group.
Also be aware that ordering may not be guaranteed, so we sort the data by time as the first step. The Pandas code here is really just taken from another post and modified.
def process(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values('timestm')
    # days elapsed since the previous row
    df['days_since_last_event'] = df['timestm'].diff().apply(lambda x: x.days)
    # within each "since the last True signal" group, accumulate the day gaps
    df.loc[:, 'days_since_last_event'] = df.groupby(df['signal'].shift().cumsum())['days_since_last_event'].cumsum()
    # rows where the signal fired are reset to 0
    df.loc[df['signal'] == True, 'days_since_last_event'] = 0
    return df
process(df)
This will give us:
id timestm signal days_since_last_event
X1 2022-07-01 False NaN
X1 2022-07-02 True 0.0
X1 2022-07-03 False 1.0
X1 2022-07-05 False 3.0
X1 2022-07-09 True 0.0
Which looks right. Now we can bring it to Spark using Fugue with minimal additional code. This will partition the data and run the function on each partition. A schema is required for Pandas UDFs, so Fugue needs it as well, but it uses a simpler way to define it.
import fugue.api as fa
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)
out = fa.transform(sdf, process, schema="*, days_since_last_event:int", partition={"by": "id"})
# out is a Spark DataFrame because a Spark DataFrame was passed in
out.show()
which gives us:
+---+-------------------+------+---------------------+
| id| timestm|signal|days_since_last_event|
+---+-------------------+------+---------------------+
| X1|2022-07-01 00:00:00| false| null|
| X1|2022-07-02 00:00:00| true| 0|
| X1|2022-07-03 00:00:00| false| 1|
| X1|2022-07-05 00:00:00| false| 3|
| X1|2022-07-09 00:00:00| true| 0|
+---+-------------------+------+---------------------+
Note that you need to define the partition when running on the full data.
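For reference, if you would rather not add the Fugue dependency, the same process function can be run as a grouped Pandas UDF directly (a sketch assuming Spark 3.0+; the schema string mirrors the columns above):
# apply process per id group as a grouped-map Pandas UDF
out = sdf.groupBy("id").applyInPandas(
    process,
    schema="id string, timestm timestamp, signal boolean, days_since_last_event double")
out.show()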
I have the spark dataframe below:
+----------+-------------+--------------+------------+----------+-------------------+
| part| company| country| city| price| date|
+----------+-------------+--------------+------------+----------+-------------------+
| 52125-136| Brainsphere| null| Braga| 493.94€|2016-05-10 11:13:43|
| 70253-307|Chatterbridge| Spain| Barcelona| 969.29€|2016-05-10 13:06:30|
| 50563-113| Kanoodle| Japan| Niihama| ¥72909.95|2016-05-10 13:11:57|
|52380-1102| Flipstorm| France| Nanterre| 794.84€|2016-05-10 13:19:12|
| 54473-578| Twitterbeat| France| Annecy| 167.48€|2016-05-10 15:09:46|
| 76335-006| Ntags| Portugal| Lisbon| 373.07€|2016-05-10 15:20:22|
| 49999-737| Buzzbean| Germany| Düsseldorf| 861.2€|2016-05-10 15:21:51|
| 68233-011| Flipstorm| Greece| Athens| 512.89€|2016-05-10 15:22:03|
| 36800-952| Eimbee| France| Amiens| 219.74€|2016-05-10 21:22:46|
| 16714-295| Teklist| null| Arnhem| 624.4€|2016-05-10 21:57:15|
| 42254-213| Thoughtmix| Portugal| Amadora| 257.99€|2016-05-10 22:01:04|
From these columns, only the country column has null values. So what I want to do is fill the null values in the country column with the country that corresponds to the city in the same row. The dataframe is big, and there are rows where Braga (for example) appears with its country filled in and other rows where it does not.
So, how can I fill those null values in the country column based on the city column, while taking advantage of Spark's parallel computation?
You can use a window function for that.
from pyspark.sql import functions as F, Window
df.withColumn(
    "country",
    F.coalesce(
        F.col("country"),
        # take any non-null country already recorded for the same city
        F.first("country", ignorenulls=True).over(Window.partitionBy("city")),
    ),
).show()
Use the coalesce function in Spark to get the first non-null value from a list of columns.
Example:
df.show()
#+--------+---------+
#| country| city|
#+--------+---------+
#| null| Braga|
#| Spain|Barcelona|
#| null| Arnhem|
#|portugal| Amadora|
#+--------+---------+
from pyspark.sql.functions import *
df.withColumn("country",coalesce(col("country"),col("city"))).show()
#+--------+---------+
#| country| city|
#+--------+---------+
#| Braga| Braga|
#| Spain|Barcelona|
#| Arnhem| Arnhem|
#|portugal| Amadora|
#+--------+---------+
StringIndexer seems to infer the indices based on the unique values in the data. This is a problem when the data does not contain every possible value. The toy example below considers three t-shirt sizes (Small, Medium, and Large), but only two (Small and Large) are in the data. I would like the StringIndexer to still consider all 3 possible sizes. Is there some way to create a column using the index of a string in a supplied list? It would be preferable to do it as a Transformer() so that it could be re-used in a pipeline.
from pyspark.sql import Row
from pyspark.ml.feature import StringIndexer
df = spark.createDataFrame([Row(id='0', size='Small'),
Row(id='1', size='Small'),
Row(id='2', size='Large')])
(
StringIndexer(inputCol="size", outputCol="size_idx")
.fit(df)
.transform(df)
.show()
)
+---+-----+--------+
| id| size|size_idx|
+---+-----+--------+
| 0|Small| 0.0|
| 1|Small| 0.0|
| 2|Large| 1.0|
+---+-----+--------+
Desired output
+---+-----+--------+
| id| size|size_idx|
+---+-----+--------+
| 0|Small| 0.0|
| 1|Small| 0.0|
| 2|Large| 2.0|
+---+-----+--------+
It looks like you can create the StringIndexer model directly from a set of labels instead of fitting from the data.
from pyspark.sql import Row
from pyspark.ml.feature import StringIndexerModel
df = spark.createDataFrame([Row(id='0', size='Small'),
Row(id='1', size='Small'),
Row(id='2', size='Large')])
si = StringIndexerModel.from_labels(['Small', 'Medium', 'Large'],
inputCol="size",
outputCol="size_idx")
si.transform(df).show()
+---+-----+--------+
| id| size|size_idx|
+---+-----+--------+
| 0|Small| 0.0|
| 1|Small| 0.0|
| 2|Large| 2.0|
+---+-----+--------+
I have two Dataframes A and B.
A
+---+------+-----+----------+
| id|player|score| date|
+---+------+-----+----------+
| 1| alpha| 5|2018-02-13|
| 2| beta| 6|2018-02-13|
+---+------+-----+----------+
B
+---+------+-----+----------+
| id|player|score| date|
+---+------+-----+----------+
| 1| alpha| 100|2019-02-13|
| 2| beta| 6|2018-02-13|
+---+------+-----+----------+
and I must create a new Dataframe where the score is updated by looking at the date
result
+---+------+-----+----------+
|id |player|score|date |
+---+------+-----+----------+
| 1| alpha| 100|2019-02-13|
| 2| beta| 6|2018-02-13|
+---+------+-----+----------+
You can join the two dataframes, and use pyspark.sql.functions.when() to pick the values for the score and date columns.
from pyspark.sql.functions import col, when
df_A.alias("a").join(df_B.alias("b"), on=["id", "player"], how="inner")\
.select(
"id",
"player",
when(
col("b.date") > col("a.date"),
col("b.score")
).otherwise(col("a.score")).alias("score"),
when(
col("b.date") > col("a.date"),
col("b.date")
).otherwise(col("a.date")).alias("date")
)\
.show()
#+---+------+-----+----------+
#| id|player|score| date|
#+---+------+-----+----------+
#| 1| alpha| 100|2019-02-13|
#| 2| beta| 6|2018-02-13|
#+---+------+-----+----------+
Read more on when: Spark Equivalent of IF Then ELSE
I am making the assumption that every player is allocated an id and that it doesn't change. OP wants the resulting dataframe to contain the score from the most recent date.
# Creating both the DataFrames.
from pyspark.sql.functions import to_date, col
df_A = sqlContext.createDataFrame([(1,'alpha',5,'2018-02-13'),(2,'beta',6,'2018-02-13')],('id','player','score','date'))
df_A = df_A.withColumn('date',to_date(col('date'), 'yyyy-MM-dd'))
df_B = sqlContext.createDataFrame([(1,'alpha',100,'2019-02-13'),(2,'beta',6,'2018-02-13')],('id','player','score','date'))
df_B = df_B.withColumn('date',to_date(col('date'), 'yyyy-MM-dd'))
The idea is to take a union() of these two dataframes and then keep only the distinct rows. The reason for taking distinct rows afterwards is the following: if there was no update for a player, then its values in dataframe B will be the same as in dataframe A, so we remove such duplicates.
# Importing the requisite packages.
from pyspark.sql.functions import col, max
from pyspark.sql import Window
df = df_A.union(df_B).distinct()
df.show()
+---+------+-----+----------+
| id|player|score| date|
+---+------+-----+----------+
| 1| alpha| 5|2018-02-13|
| 1| alpha| 100|2019-02-13|
| 2| beta| 6|2018-02-13|
+---+------+-----+----------+
Now, as a final step, use a Window function over the unioned dataframe df to find the latestDate and keep only those rows where the date equals the latestDate. That way, for every player that had an update (manifested by a newer date in dataframe B), the outdated row is removed.
w = Window.partitionBy('id','player')
df = df.withColumn('latestDate', max('date').over(w))\
.where(col('date') == col('latestDate')).drop('latestDate')
df.show()
+---+------+-----+----------+
| id|player|score| date|
+---+------+-----+----------+
| 1| alpha| 100|2019-02-13|
| 2| beta| 6|2018-02-13|
+---+------+-----+----------+
I have a dataframe and I want to randomize rows in the dataframe. I tried sampling the data by giving a fraction of 1, which didn't work (interestingly this works in Pandas).
It works in Pandas because taking a sample in a local system is typically implemented by shuffling the data. Spark, on the other hand, avoids shuffling by performing linear scans over the data. This means that sampling in Spark only randomizes the members of the sample, not their order.
You can order the DataFrame by a column of random numbers:
from pyspark.sql.functions import rand
df = sc.parallelize(range(20)).map(lambda x: (x, )).toDF(["x"])
df.orderBy(rand()).show(3)
## +---+
## | x|
## +---+
## | 2|
## | 7|
## | 14|
## +---+
## only showing top 3 rows
but it is:
expensive - because it requires a full shuffle, which is something you typically want to avoid (a cheaper, partition-local alternative is sketched below).
suspicious - because the order of values in a DataFrame is not something you can really depend on in non-trivial cases, and since a DataFrame doesn't support indexing it is relatively useless without collecting.
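The partition-local alternative mentioned above is just a sketch: it shuffles rows within each existing partition without triggering a full shuffle, so rows never move between partitions (spark here is assumed to be the active SparkSession):
import random

def shuffle_partition(rows):
    # materialize and shuffle the rows of a single partition locally
    rows = list(rows)
    random.shuffle(rows)
    return rows

locally_shuffled = spark.createDataFrame(df.rdd.mapPartitions(shuffle_partition), df.schema)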
This code works for me without any RDD operations:
import pyspark.sql.functions as F
df = df.select("*").orderBy(F.rand())
Here is a more elaborated example:
import pyspark.sql.functions as F
import pandas as pd

# Example: create a small DataFrame for the example
pandas_df = pd.DataFrame(([1,2],[3,1],[4,2],[7,2],[32,7],[123,3]),columns=["id","col1"])
df = sqlContext.createDataFrame(pandas_df)
df = df.select("*").orderBy(F.rand())
df.show()
+---+----+
| id|col1|
+---+----+
| 1| 2|
| 3| 1|
| 4| 2|
| 7| 2|
| 32| 7|
|123| 3|
+---+----+
df.select("*").orderBy(F.rand()).show()
+---+----+
| id|col1|
+---+----+
| 7| 2|
|123| 3|
| 3| 1|
| 4| 2|
| 32| 7|
| 1| 2|
+---+----+