StringIndexer where category levels passed as list - apache-spark

StringIndexer seems to infer the indices based on the unique values in the data. This is a problem when the data does not contain every possible value. The toy example below considers three t-shirt sizes (Small, Medium, and Large), but only two (Small and Large) are in the data. I would like the StringIndexer to still consider all 3 possible sizes. Is there some way to create a column using the index of a string in a supplied list? It would be preferable to do it as a Transformer() so that it could be re-used in a pipeline.
from pyspark.sql import Row
from pyspark.ml.feature import StringIndexer

df = spark.createDataFrame([Row(id='0', size='Small'),
                            Row(id='1', size='Small'),
                            Row(id='2', size='Large')])

(
    StringIndexer(inputCol="size", outputCol="size_idx")
    .fit(df)
    .transform(df)
    .show()
)
+---+-----+--------+
| id| size|size_idx|
+---+-----+--------+
| 0|Small| 0.0|
| 1|Small| 0.0|
| 2|Large| 1.0|
+---+-----+--------+
Desired output
+---+-----+--------+
| id| size|size_idx|
+---+-----+--------+
| 0|Small| 0.0|
| 1|Small| 0.0|
| 2|Large| 2.0|
+---+-----+--------+

It looks like you can create the StringIndexer model directly from a set of labels instead of fitting from the data.
from pyspark.sql import Row
from pyspark.ml.feature import StringIndexerModel

df = spark.createDataFrame([Row(id='0', size='Small'),
                            Row(id='1', size='Small'),
                            Row(id='2', size='Large')])

si = StringIndexerModel.from_labels(['Small', 'Medium', 'Large'],
                                    inputCol="size",
                                    outputCol="size_idx")
si.transform(df).show()
+---+-----+--------+
| id| size|size_idx|
+---+-----+--------+
| 0|Small| 0.0|
| 1|Small| 0.0|
| 2|Large| 2.0|
+---+-----+--------+
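Since StringIndexerModel is already a fitted Transformer, it can also be dropped into a pipeline, which covers the re-use requirement from the question. A minimal sketch (not from the original answer):
from pyspark.ml import Pipeline

# si is the StringIndexerModel built above; a fitted model is a Transformer,
# so Pipeline.fit() passes it through unchanged and the resulting PipelineModel
# can be saved, loaded and reused like any other pipeline.
pipeline_model = Pipeline(stages=[si]).fit(df)
pipeline_model.transform(df).show()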

Related

Create column using Spark pandas_udf, with dynamic number of input columns

I have this df:
df = spark.createDataFrame(
    [('row_a', 5.0, 0.0, 11.0),
     ('row_b', 3394.0, 0.0, 4543.0),
     ('row_c', 136111.0, 0.0, 219255.0),
     ('row_d', 0.0, 0.0, 0.0),
     ('row_e', 0.0, 0.0, 0.0),
     ('row_f', 42.0, 0.0, 54.0)],
    ['value', 'col_a', 'col_b', 'col_c']
)
I would like to use .quantile(0.25, axis=1) from Pandas which would add one column:
import pandas as pd
pdf = df.toPandas()
pdf['25%'] = pdf.quantile(0.25, axis=1)
print(pdf)
# value col_a col_b col_c 25%
# 0 row_a 5.0 0.0 11.0 2.5
# 1 row_b 3394.0 0.0 4543.0 1697.0
# 2 row_c 136111.0 0.0 219255.0 68055.5
# 3 row_d 0.0 0.0 0.0 0.0
# 4 row_e 0.0 0.0 0.0 0.0
# 5 row_f 42.0 0.0 54.0 21.0
Performance is important to me, so I assume pandas_udf from pyspark.sql.functions could do it in a more optimized way. But I struggle to make a performant and useful function. This is my best attempt:
from pyspark.sql import functions as F
import pandas as pd

@F.pandas_udf('double')
def quartile1_on_axis1(a: pd.Series, b: pd.Series, c: pd.Series) -> pd.Series:
    pdf = pd.DataFrame({'a': a, 'b': b, 'c': c})
    return pdf.quantile(0.25, axis=1)

df = df.withColumn('25%', quartile1_on_axis1('col_a', 'col_b', 'col_c'))
I don't like that I need an argument for every column and then have to address those arguments separately inside the function to build a DataFrame. All of those columns serve the same purpose, so IMHO there should be a way to address them all together, something like in this pseudocode:
def quartile1_on_axis1(*cols) -> pd.Series:
    pdf = pd.DataFrame(cols)
This way I could use this function for any number of columns.
Is it necessary to create a pd.DataFrame inside the UDF? To me this seems the same as without a UDF (Spark df -> Pandas df -> Spark df), as shown above. Without a UDF it's even shorter. Should I really try to make it work with pandas_udf performance-wise? I think pandas_udf was designed specifically for this kind of purpose...
The udf approach will get you the result you need, and is definitely the most straightforward. However, if performance really is top priority you can create your own native Spark implementation for quantile. The basics can be coded quite easily, if you want to use any of the other pandas parameters you'll need to tweak it yourself.
Note: this is taken from the pandas API docs for interpolation='linear'. If you intend to use it, please test the performance and verify the results yourself on large datasets.
import math
from pyspark.sql import functions as f

def quantile(q, cols):
    if q < 0 or q > 1:
        raise ValueError("Parameter q should be 0 <= q <= 1")
    if not cols:
        raise ValueError("List of columns should be provided")

    idx = (len(cols) - 1) * q
    i = math.floor(idx)
    j = math.ceil(idx)
    fraction = idx - i

    arr = f.array_sort(f.array(*cols))
    return arr.getItem(i) + (arr.getItem(j) - arr.getItem(i)) * fraction

columns = ['col_a', 'col_b', 'col_c']
df.withColumn('0.25%', quantile(0.25, columns)).show()
+-----+--------+-----+--------+-------+
|value|   col_a|col_b|   col_c|  0.25%|
+-----+--------+-----+--------+-------+
|row_a|     5.0|  0.0|    11.0|    2.5|
|row_b|  3394.0|  0.0|  4543.0| 1697.0|
|row_c|136111.0|  0.0|219255.0|68055.5|
|row_d|     0.0|  0.0|     0.0|    0.0|
|row_e|     0.0|  0.0|     0.0|    0.0|
|row_f|    42.0|  0.0|    54.0|   21.0|
+-----+--------+-----+--------+-------+
As a side note, there is also the pandas API on Spark, however axis=1 is not (yet) implemented. Potentially this will be added in the future.
df.to_pandas_on_spark().quantile(0.25, axis=1)
NotImplementedError: axis should be either 0 or "index" currently.
I would use GroupedData with applyInPandas. Because this requires you to pass the output schema, add a column with the required data type to the DataFrame first and grab its schema, then pass that schema to applyInPandas. Code below:
from pyspark.sql.functions import lit

# Generate the new schema by adding the new column
sch = df.withColumn('quantile25', lit(110.5)).schema

# Function applied to each group
def quartile1_on_axis1(pdf):
    pdf = pdf.assign(quantile25=pdf.quantile(0.25, axis=1))
    return pdf

# Apply it
df.groupby('value').applyInPandas(quartile1_on_axis1, schema=sch).show()
#outcome
+-----+--------+-----+--------+----------+
|value| col_a|col_b| col_c|quantile25|
+-----+--------+-----+--------+----------+
|row_a| 5.0| 0.0| 11.0| 2.5|
|row_b| 3394.0| 0.0| 4543.0| 1697.0|
|row_c|136111.0| 0.0|219255.0| 68055.5|
|row_d| 0.0| 0.0| 0.0| 0.0|
|row_e| 0.0| 0.0| 0.0| 0.0|
|row_f| 42.0| 0.0| 54.0| 21.0|
+-----+--------+-----+--------+----------+
You could also use numpy in a udf to get this done. If you do not want to list all the columns, slice them by index.
import numpy as np
from pyspark.sql.functions import udf, array
from pyspark.sql.types import FloatType

quartile1_on_axis1 = udf(lambda x: float(np.quantile(x, 0.25)), FloatType())

df.withColumn("0.25%", quartile1_on_axis1(array(df.columns[1:]))).show(truncate=False)
+-----+--------+-----+--------+-------+
|value|col_a |col_b|col_c |0.25% |
+-----+--------+-----+--------+-------+
|row_a|5.0 |0.0 |11.0 |2.5 |
|row_b|3394.0 |0.0 |4543.0 |1697.0 |
|row_c|136111.0|0.0 |219255.0|68055.5|
|row_d|0.0 |0.0 |0.0 |0.0 |
|row_e|0.0 |0.0 |0.0 |0.0 |
|row_f|42.0 |0.0 |54.0 |21.0 |
+-----+--------+-----+--------+-------+
You can pass a single struct column instead of using multiple columns like this:
@F.pandas_udf('double')
def quartile1_on_axis1(s: pd.DataFrame) -> pd.Series:
    return s.quantile(0.25, axis=1)

cols = ['col_a', 'col_b', 'col_c']
df = df.withColumn('25%', quartile1_on_axis1(F.struct(*cols)))
df.show()
# +-----+--------+-----+--------+-------+
# |value| col_a|col_b| col_c| 25%|
# +-----+--------+-----+--------+-------+
# |row_a| 5.0| 0.0| 11.0| 2.5|
# |row_b| 3394.0| 0.0| 4543.0| 1697.0|
# |row_c|136111.0| 0.0|219255.0|68055.5|
# |row_d| 0.0| 0.0| 0.0| 0.0|
# |row_e| 0.0| 0.0| 0.0| 0.0|
# |row_f| 42.0| 0.0| 54.0| 21.0|
# +-----+--------+-----+--------+-------+
From the pyspark.sql.functions.pandas_udf documentation:
Note that the type hint should use pandas.Series in all cases but there is one variant that pandas.DataFrame should be used for its input or output type hint instead when the input or output column is of pyspark.sql.types.StructType.
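Since the struct is built from a plain list of column names, the same pandas_udf works for any number of columns. A small usage sketch (the assumption here, matching the sample data, is that the first column is the non-numeric 'value' label):
numeric_cols = df.columns[1:]   # every column except the label; adjust as needed
df = df.withColumn('25%', quartile1_on_axis1(F.struct(*numeric_cols)))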
The following seems to do what's required, but instead of pandas_udf it uses a regular udf. It would be great if I could employ pandas_udf in a similar way.
from pyspark.sql import functions as F
import numpy as np

@F.udf('double')
def lower_quart(*cols):
    return float(np.quantile(cols, 0.25))

df = df.withColumn('25%', lower_quart('col_a', 'col_b', 'col_c'))
df.show()
#+-----+--------+-----+--------+-------+
#|value| col_a|col_b| col_c| 25%|
#+-----+--------+-----+--------+-------+
#|row_a| 5.0| 0.0| 11.0| 2.5|
#|row_b| 3394.0| 0.0| 4543.0| 1697.0|
#|row_c|136111.0| 0.0|219255.0|68055.5|
#|row_d| 0.0| 0.0| 0.0| 0.0|
#|row_e| 0.0| 0.0| 0.0| 0.0|
#|row_f| 42.0| 0.0| 54.0| 21.0|
#+-----+--------+-----+--------+-------+

How to add column with alternate values in PySpark dataframe?

I have the following sample dataframe
df = spark.createDataFrame([('start','end'), ('start1','end1')] ,["start", "end"])
and I want to explode the values in each row and associate alternating 1-0 values in the generated rows. This way I can identify the start/end entries in each row.
I am able to achieve the desired result this way
from pyspark.sql.window import Window
from pyspark.sql import functions as fn
from pyspark.sql.functions import lit

w = Window().orderBy(lit('A'))
df = (df.withColumn('start_end', fn.array('start', 'end'))
        .withColumn('date', fn.explode('start_end'))
        .withColumn('row_num', fn.row_number().over(w)))
df = (df.withColumn('is_start', fn.when(fn.col('row_num') % 2 == 0, 0).otherwise(1))
        .select('date', 'is_start'))
which gives
| date | is_start |
|--------|----------|
| start | 1 |
| end | 0 |
| start1 | 1 |
| end1 | 0 |
but it seems overly complicated for such a simple task.
Is there any better/cleaner way without using UDFs?
You can use pyspark.sql.functions.posexplode along with pyspark.sql.functions.array.
First create an array out of your start and end columns, then explode this with the position:
from pyspark.sql.functions import array, posexplode
df.select(posexplode(array("end", "start")).alias("is_start", "date")).show()
#+--------+------+
#|is_start| date|
#+--------+------+
#| 0| end|
#| 1| start|
#| 0| end1|
#| 1|start1|
#+--------+------+
You can try union:
from pyspark.sql import functions as F

df = spark.createDataFrame([('start','end'), ('start1','end1')], ["start", "end"])
df = df.withColumn('startv', F.lit(1))
df = df.withColumn('endv', F.lit(0))
df = df.select(['start', 'startv']).union(df.select(['end', 'endv']))
df.show()
+------+------+
| start|startv|
+------+------+
| start| 1|
|start1| 1|
| end| 0|
| end1| 0|
+------+------+
From here you can rename the columns and re-order the rows, as sketched below.
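A minimal sketch of that rename step (the target column names 'date' and 'is_start' are taken from the desired output in the question; 'result' is just an illustrative variable name):
result = (df.withColumnRenamed('start', 'date')
            .withColumnRenamed('startv', 'is_start'))
result.show()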
I had a similar situation in my use case: a huge dataset (~50 GB) where any self-join or heavy transformation resulted in excessive memory usage and unstable execution.
I went one level lower, to the underlying RDD, and used flatMap. This is a map-side transformation, so it is cost-effective in terms of shuffle, CPU and memory.
df = spark.createDataFrame([('start','end'), ('start1','end1')] ,["start", "end"])
df.show()
+------+----+
| start| end|
+------+----+
| start| end|
|start1|end1|
+------+----+
final_df = df.rdd.flatMap(lambda row: [(row.start, 1), (row.end, 0)]).toDF(['date', 'is_start'])
final_df.show()
+------+--------+
| date|is_start|
+------+--------+
| start| 1|
| end| 0|
|start1| 1|
| end1| 0|
+------+--------+

Apply a transformation to multiple columns pyspark dataframe

Suppose I have the following spark-dataframe:
+-----+-------+
| word| label|
+-----+-------+
| red| color|
| red| color|
| blue| color|
| blue|feeling|
|happy|feeling|
+-----+-------+
Which can be created using the following code:
sample_df = spark.createDataFrame([
        ('red', 'color'),
        ('red', 'color'),
        ('blue', 'color'),
        ('blue', 'feeling'),
        ('happy', 'feeling')
    ],
    ('word', 'label')
)
I can perform a groupBy() to get the counts of each word-label pair:
sample_df = sample_df.groupBy('word', 'label').count()
#+-----+-------+-----+
#| word| label|count|
#+-----+-------+-----+
#| blue| color| 1|
#| blue|feeling| 1|
#| red| color| 2|
#|happy|feeling| 1|
#+-----+-------+-----+
And then pivot() and sum() to get the label counts as columns:
import pyspark.sql.functions as f
sample_df = sample_df.groupBy('word').pivot('label').agg(f.sum('count')).na.fill(0)
#+-----+-----+-------+
#| word|color|feeling|
#+-----+-----+-------+
#| red| 2| 0|
#|happy| 0| 1|
#| blue| 1| 1|
#+-----+-----+-------+
What is the best way to transform this dataframe such that each row is divided by the total for that row?
# Desired output
+-----+-----+-------+
| word|color|feeling|
+-----+-----+-------+
| red| 1.0| 0.0|
|happy| 0.0| 1.0|
| blue| 0.5| 0.5|
+-----+-----+-------+
One way to achieve this result is to use Python's built-in sum (NOT pyspark.sql.functions.sum) to get the row-wise total and then call withColumn() for each label:
labels = ['color', 'feeling']

sample_df.withColumn('total', sum([f.col(x) for x in labels]))\
    .withColumn('color', f.col('color')/f.col('total'))\
    .withColumn('feeling', f.col('feeling')/f.col('total'))\
    .select('word', 'color', 'feeling')\
    .show()
But there has to be a better way than enumerating each of the possible columns.
More generally, my question is:
How can I apply an arbitrary transformation, that is a function of the current row, to multiple columns simultaneously?
Found an answer on this Medium post.
First make a column for the total (as above), then use the * operator to unpack a list comprehension over the labels in select():
labels = ['color', 'feeling']
sample_df = sample_df.withColumn('total', sum([f.col(x) for x in labels]))

sample_df.select(
    'word', *[(f.col(col_name)/f.col('total')).alias(col_name) for col_name in labels]
).show()
The approach shown in the linked post generalizes this to arbitrary transformations; a hedged sketch of such a generalization follows.
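A hedged sketch of the generalization (the helper name and signature are illustrative, not taken from the linked post): pass any function that maps a column name to a column expression, and apply it to every listed column inside a single select().
def apply_to_columns(df, cols, transform):
    # keep the untouched columns, replace the listed ones with transform(col_name)
    other = [c for c in df.columns if c not in cols]
    return df.select(*other, *[transform(c).alias(c) for c in cols])

# e.g. the row-normalisation from above:
normalized = apply_to_columns(
    sample_df, labels, lambda c: f.col(c) / f.col('total')
).drop('total')
normalized.show()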

PySpark: how to resample frequencies

Imagine a Spark DataFrame consisting of value observations from variables. Each observation has a specific timestamp, and those timestamps are not the same between different variables, because a timestamp is generated (and the value recorded) whenever the value of a variable changes.
#Variable Time Value
#852-YF-007 2016-05-10 00:00:00 0
#852-YF-007 2016-05-09 23:59:00 0
#852-YF-007 2016-05-09 23:58:00 0
Problem: I would like to put all variables on the same frequency (for instance 10 min) using forward-fill. To visualize this, I copied a page from the book "Python for Data Analysis". Question: How to do that on a Spark DataFrame in an efficient way?
A Spark DataFrame is simply not a good choice for an operation like this one. In general, SQL primitives won't be expressive enough and PySpark DataFrame doesn't provide the low-level access required to implement it.
Re-sampling itself, however, can easily be represented using epoch / timestamp arithmetic. With data like this:
from pyspark.sql.functions import col, max as max_, min as min_

df = (spark
    .createDataFrame([
        ("2012-06-13", 0.694), ("2012-06-20", -2.669), ("2012-06-27", 0.245)],
        ["ts", "val"])
    .withColumn("ts", col("ts").cast("date").cast("timestamp")))
we can re-sample input:
day = 60 * 60 * 24
epoch = (col("ts").cast("bigint") / day).cast("bigint") * day
with_epoch = df.withColumn("epoch", epoch)
min_epoch, max_epoch = with_epoch.select(min_("epoch"), max_("epoch")).first()
and join with reference:
# Reference range
ref = spark.range(
    min_epoch, max_epoch + 1, day
).toDF("epoch")

(ref
    .join(with_epoch, "epoch", "left")
    .orderBy("epoch")
    .withColumn("ts_resampled", col("epoch").cast("timestamp"))
    .show(15, False))
## +----------+---------------------+------+---------------------+
## |epoch |ts |val |ts_resampled |
## +----------+---------------------+------+---------------------+
## |1339459200|2012-06-13 00:00:00.0|0.694 |2012-06-12 02:00:00.0|
## |1339545600|null |null |2012-06-13 02:00:00.0|
## |1339632000|null |null |2012-06-14 02:00:00.0|
## |1339718400|null |null |2012-06-15 02:00:00.0|
## |1339804800|null |null |2012-06-16 02:00:00.0|
## |1339891200|null |null |2012-06-17 02:00:00.0|
## |1339977600|null |null |2012-06-18 02:00:00.0|
## |1340064000|2012-06-20 00:00:00.0|-2.669|2012-06-19 02:00:00.0|
## |1340150400|null |null |2012-06-20 02:00:00.0|
## |1340236800|null |null |2012-06-21 02:00:00.0|
## |1340323200|null |null |2012-06-22 02:00:00.0|
## |1340409600|null |null |2012-06-23 02:00:00.0|
## |1340496000|null |null |2012-06-24 02:00:00.0|
## |1340582400|null |null |2012-06-25 02:00:00.0|
## |1340668800|2012-06-27 00:00:00.0|0.245 |2012-06-26 02:00:00.0|
## +----------+---------------------+------+---------------------+
In Spark >= 3.1 replace
col("epoch").cast("timestamp")
with
from pyspark.sql.functions import timestamp_seconds
timestamp_seconds("epoch")
Using low level APIs it is possible to fill data like this as I've shown in my answer to Spark / Scala: forward fill with last observation. Using RDDs we could also avoid shuffling data twice (once for join, once for reordering).
But there is a much more important problem here. Spark performs optimally when a problem can be reduced to element-wise or partition-wise computations. Forward fill is a case where that is possible, but as far as I am aware this is typically not true of commonly used time series models, and if some operation requires sequential access then Spark won't provide any benefit at all.
So if you work with series which are large enough to require a distributed data structure, you'll probably want to aggregate them into some object that can be easily handled by a single machine and then use your favorite non-distributed tool to handle the rest.
If you work with multiple time series where each can be handled in memory then there is of course sparkts, but I know you're already aware of that.
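As a hedged alternative to the RDD-based approach from that linked answer, on more recent Spark versions the nulls produced by the join above can also be forward-filled with a window and last(..., ignorenulls=True). A minimal sketch:
from pyspark.sql import Window
from pyspark.sql.functions import col, last

# Single-series case: no partitioning key, so the window pulls everything into
# one partition; with several variables you would partitionBy the variable column.
w = Window.orderBy("epoch").rowsBetween(Window.unboundedPreceding, 0)

filled = (ref
    .join(with_epoch, "epoch", "left")
    .withColumn("ts_resampled", col("epoch").cast("timestamp"))
    .withColumn("val_filled", last("val", ignorenulls=True).over(w)))
filled.orderBy("epoch").show(15, False)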
I once answered a similar question; it's a bit of a hack, but the idea makes sense in your case: map every value onto a list, then flatten the list vertically.
From: Inserting records in a spark dataframe:
You can generate timestamp ranges, flatten them and select rows
import pyspark.sql.functions as func
from pyspark.sql.types import IntegerType, ArrayType

a = sc.parallelize([[670098928, 50], [670098930, 53], [670098934, 55]])\
      .toDF(['timestamp', 'price'])

# wrap range() in list() so the udf returns an array on Python 3
f = func.udf(lambda x: list(range(x, x + 5)), ArrayType(IntegerType()))

a.withColumn('timestamp', f(a.timestamp))\
 .withColumn('timestamp', func.explode(func.col('timestamp')))\
 .groupBy('timestamp')\
 .agg(func.max(func.col('price')))\
 .show()
+---------+----------+
|timestamp|max(price)|
+---------+----------+
|670098928| 50|
|670098929| 50|
|670098930| 53|
|670098931| 53|
|670098932| 53|
|670098933| 53|
|670098934| 55|
|670098935| 55|
|670098936| 55|
|670098937| 55|
|670098938| 55|
+---------+----------+
This is an old post, though I recently had to solve this with Spark 3.2. Here's the solution I came up with to both up-sample and down-sample the time series and obtain exactly one data point per object and per time period.
Assume the following input data that we want to re-sample per day. Some variables have several data points per day, some have no data for several days:
from pyspark.sql.types import StructType, StringType, ArrayType, DoubleType, TimestampType
from pyspark.sql.functions import udf, date_trunc, row_number, desc, coalesce, datediff, lead, explode, col, lit
from pyspark.sql import Window, Row
from datetime import datetime, timedelta
df = spark.createDataFrame([
    Row(variable="A", record_ts=datetime.fromisoformat("2021-10-01T03:34:23.000"), value=1.),
    Row(variable="B", record_ts=datetime.fromisoformat("2021-10-01T04:34:23.000"), value=10.),
    Row(variable="C", record_ts=datetime.fromisoformat("2021-10-01T05:34:23.000"), value=100.),
    Row(variable="A", record_ts=datetime.fromisoformat("2021-10-02T01:34:23.000"), value=2.),
    Row(variable="A", record_ts=datetime.fromisoformat("2021-10-02T05:34:23.000"), value=3.),
    Row(variable="C", record_ts=datetime.fromisoformat("2021-10-02T02:34:23.000"), value=200.),
    Row(variable="C", record_ts=datetime.fromisoformat("2021-10-02T05:34:23.000"), value=200.),
    Row(variable="B", record_ts=datetime.fromisoformat("2021-10-04T10:34:23.000"), value=40.),
    Row(variable="B", record_ts=datetime.fromisoformat("2021-10-04T12:34:23.000"), value=42.),
    Row(variable="B", record_ts=datetime.fromisoformat("2021-10-04T14:34:23.000"), value=46.),
    Row(variable="A", record_ts=datetime.fromisoformat("2021-10-05T14:34:23.000"), value=6.),
    Row(variable="A", record_ts=datetime.fromisoformat("2021-10-07T09:34:23.000"), value=7.),
    Row(variable="B", record_ts=datetime.fromisoformat("2021-10-07T08:34:23.000"), value=70.),
    Row(variable="C", record_ts=datetime.fromisoformat("2021-10-07T05:34:23.000"), value=700.),
])
I first need this simple udf which essentially just builds a sequence of timestamps:
@udf(ArrayType(TimestampType()))
def pad_time(count: int, start_time: datetime):
    # build one timestamp per day to fill the gap
    if count is None:
        return []
    return [start_time + timedelta(days=c) for c in range(count)]
Down-sampling can be done with a simple groupBy or partitionBy, keeping at most 1 value per variable per day (I chose partitionBy in the example below).
Up-sampling with a "fill-forward" strategy can be done by measuring the size of the time gap between 2 successive rows, and then using that information to call the udf above.
result = (df
    # down-sampling: keep the last value of each variable each day
    .withColumn("record_day", date_trunc("DAY", "record_ts"))
    .withColumn("row_num",
                row_number().over(
                    Window.partitionBy("variable", "record_day").orderBy(desc("record_ts"))
                ))
    .where("row_num == 1")
    # up-sampling part 1: count the number of days to be filled (or 1 for the very last value)
    .withColumn("gap",
                coalesce(
                    datediff(
                        lead("record_day").over(Window.partitionBy("variable").orderBy("record_day")),
                        "record_day"),
                    lit(1)))
    # up-sampling part 2: pad the time axis as dictated by "gap"; the other two fields are repeated
    .select(
        explode(pad_time("gap", "record_day")).alias("record_day"),
        "variable",
        "value",
    )
    .orderBy("record_day", "variable"))

result.show(25)
The result looks like this:
+-------------------+--------+-----+
| record_day|variable|value|
+-------------------+--------+-----+
|2021-10-01 00:00:00| A| 1.0|
|2021-10-01 00:00:00| B| 10.0|
|2021-10-01 00:00:00| C|100.0|
|2021-10-02 00:00:00| A| 3.0|
|2021-10-02 00:00:00| B| 10.0|
|2021-10-02 00:00:00| C|200.0|
|2021-10-03 00:00:00| A| 3.0|
|2021-10-03 00:00:00| B| 10.0|
|2021-10-03 00:00:00| C|200.0|
|2021-10-04 00:00:00| A| 3.0|
|2021-10-04 00:00:00| B| 46.0|
|2021-10-04 00:00:00| C|200.0|
|2021-10-05 00:00:00| A| 6.0|
|2021-10-05 00:00:00| B| 46.0|
|2021-10-05 00:00:00| C|200.0|
|2021-10-06 00:00:00| A| 6.0|
|2021-10-06 00:00:00| B| 46.0|
|2021-10-06 00:00:00| C|200.0|
|2021-10-07 00:00:00| A| 7.0|
|2021-10-07 00:00:00| B| 70.0|
|2021-10-07 00:00:00| C|700.0|
+-------------------+--------+-----+
Since Spark 2.4, you can use the sequence built-in function with a window to generate all the timestamps between one change date and the next, and then use explode to flatten those timestamps.
If we start with the following dataframe df:
+----------+-------------------+---------+
|variable |time |value |
+----------+-------------------+---------+
|852-YF-007|2012-06-13 00:00:00|0.694283 |
|852-YF-007|2012-06-20 00:00:00|-2.669195|
|852-YF-007|2012-06-27 00:00:00|0.245842 |
+----------+-------------------+---------+
when we use the following code:
from pyspark.sql import Window
from pyspark.sql import functions as F

next_start_time = F.lead('time').over(Window.partitionBy('variable').orderBy('time'))
end_time = F.when(next_start_time.isNull(), F.col('time')) \
            .otherwise(F.date_sub(next_start_time, 1))

result = df.withColumn('start', F.col('time')) \
    .withColumn('stop', end_time) \
    .withColumn('time', F.explode(F.sequence(
        F.col('start'), F.col('stop'), F.expr("INTERVAL 1 DAY")))) \
    .drop('start', 'stop')
You get the following result dataframe:
+----------+-------------------+---------+
|variable |time |value |
+----------+-------------------+---------+
|852-YF-007|2012-06-13 00:00:00|0.694283 |
|852-YF-007|2012-06-14 00:00:00|0.694283 |
|852-YF-007|2012-06-15 00:00:00|0.694283 |
|852-YF-007|2012-06-16 00:00:00|0.694283 |
|852-YF-007|2012-06-17 00:00:00|0.694283 |
|852-YF-007|2012-06-18 00:00:00|0.694283 |
|852-YF-007|2012-06-19 00:00:00|0.694283 |
|852-YF-007|2012-06-20 00:00:00|-2.669195|
|852-YF-007|2012-06-21 00:00:00|-2.669195|
|852-YF-007|2012-06-22 00:00:00|-2.669195|
|852-YF-007|2012-06-23 00:00:00|-2.669195|
|852-YF-007|2012-06-24 00:00:00|-2.669195|
|852-YF-007|2012-06-25 00:00:00|-2.669195|
|852-YF-007|2012-06-26 00:00:00|-2.669195|
|852-YF-007|2012-06-27 00:00:00|0.245842 |
+----------+-------------------+---------+

PySpark: Randomize rows in dataframe

I have a dataframe and I want to randomize rows in the dataframe. I tried sampling the data by giving a fraction of 1, which didn't work (interestingly this works in Pandas).
It works in Pandas because taking a sample on a local system is typically implemented by shuffling the data. Spark, on the other hand, avoids shuffling by performing linear scans over the data. That means sampling in Spark only randomizes which rows end up in the sample, not their order.
You can order DataFrame by a column of random numbers:
from pyspark.sql.functions import rand
df = sc.parallelize(range(20)).map(lambda x: (x, )).toDF(["x"])
df.orderBy(rand()).show(3)
## +---+
## | x|
## +---+
## | 2|
## | 7|
## | 14|
## +---+
## only showing top 3 rows
but it is:
expensive - because it requires a full shuffle, which is something you typically want to avoid.
suspicious - because the order of values in a DataFrame is not something you can really depend on in non-trivial cases, and since a DataFrame doesn't support indexing it is relatively useless without collecting.
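One small, hedged aside (an assumption about your needs, not part of the original answer): if you accept the cost of the shuffle but want the order to be repeatable across runs, rand() accepts a seed.
from pyspark.sql.functions import rand

df.orderBy(rand(seed=42)).show(3)   # repeatable given the same input data and partitioning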
This code works for me without any RDD operations:
import pyspark.sql.functions as F
df = df.select("*").orderBy(F.rand())
Here is a more elaborated example:
import pandas as pd
import pyspark.sql.functions as F

# Example: create a DataFrame for the example
pandas_df = pd.DataFrame(([1,2],[3,1],[4,2],[7,2],[32,7],[123,3]), columns=["id","col1"])
df = sqlContext.createDataFrame(pandas_df)
df = df.select("*").orderBy(F.rand())
df.show()
+---+----+
| id|col1|
+---+----+
| 1| 2|
| 3| 1|
| 4| 2|
| 7| 2|
| 32| 7|
|123| 3|
+---+----+
df.select("*").orderBy(F.rand()).show()
+---+----+
| id|col1|
+---+----+
| 7| 2|
|123| 3|
| 3| 1|
| 4| 2|
| 32| 7|
| 1| 2|
+---+----+
