I have this df:
df = spark.createDataFrame(
[('row_a', 5.0, 0.0, 11.0),
('row_b', 3394.0, 0.0, 4543.0),
('row_c', 136111.0, 0.0, 219255.0),
('row_d', 0.0, 0.0, 0.0),
('row_e', 0.0, 0.0, 0.0),
('row_f', 42.0, 0.0, 54.0)],
['value', 'col_a', 'col_b', 'col_c']
)
I would like to use .quantile(0.25, axis=1) from Pandas which would add one column:
import pandas as pd
pdf = df.toPandas()
pdf['25%'] = pdf.quantile(0.25, axis=1)
print(pdf)
# value col_a col_b col_c 25%
# 0 row_a 5.0 0.0 11.0 2.5
# 1 row_b 3394.0 0.0 4543.0 1697.0
# 2 row_c 136111.0 0.0 219255.0 68055.5
# 3 row_d 0.0 0.0 0.0 0.0
# 4 row_e 0.0 0.0 0.0 0.0
# 5 row_f 42.0 0.0 54.0 21.0
Performance is important to me, so I assume pandas_udf from pyspark.sql.functions could do it in a more optimized way. But I'm struggling to make a performant and useful function. This is my best attempt:
from pyspark.sql import functions as F
import pandas as pd
@F.pandas_udf('double')
def quartile1_on_axis1(a: pd.Series, b: pd.Series, c: pd.Series) -> pd.Series:
    pdf = pd.DataFrame({'a': a, 'b': b, 'c': c})
    return pdf.quantile(0.25, axis=1)
df = df.withColumn('25%', quartile1_on_axis1('col_a', 'col_b', 'col_c'))
I don't like that I need a separate argument for every column and then have to address those arguments individually inside the function to build a DataFrame. All of those columns serve the same purpose, so IMHO there should be a way to address them all together, something like in this pseudocode:
def quartile1_on_axis1(*cols) -> pd.Series:
    pdf = pd.DataFrame(cols)
This way I could use this function for any number of columns.
Is it necessary to create a pd.DataFrame inside the UDF? To me this seems the same as going without a UDF (Spark df -> pandas df -> Spark df), as shown above, and without the UDF it's even shorter. Is pandas_udf really worth the effort performance-wise? I think pandas_udf was designed specifically for this kind of purpose...
The udf approach will get you the result you need, and is definitely the most straightforward. However, if performance really is the top priority, you can create your own native Spark implementation of the quantile. The basics can be coded quite easily; if you want any of the other pandas parameters, you'll need to tweak it yourself.
Note: this is taken from the pandas API docs for interpolation='linear'. If you intend to use it, please test the performance and verify the results yourself on large datasets.
import math
from pyspark.sql import functions as f
def quantile(q, cols):
    if q < 0 or q > 1:
        raise ValueError("Parameter q should be 0 <= q <= 1")
    if not cols:
        raise ValueError("List of columns should be provided")

    idx = (len(cols) - 1) * q
    i = math.floor(idx)
    j = math.ceil(idx)
    fraction = idx - i

    arr = f.array_sort(f.array(*cols))
    return arr.getItem(i) + (arr.getItem(j) - arr.getItem(i)) * fraction
columns = ['col_a', 'col_b', 'col_c']
df.withColumn('0.25%', quantile(0.25, columns)).show()
+-----+--------+-----+--------+-------+
|value|   col_a|col_b|   col_c|  0.25%|
+-----+--------+-----+--------+-------+
|row_a|     5.0|  0.0|    11.0|    2.5|
|row_b|  3394.0|  0.0|  4543.0| 1697.0|
|row_c|136111.0|  0.0|219255.0|68055.5|
|row_d|     0.0|  0.0|     0.0|    0.0|
|row_e|     0.0|  0.0|     0.0|    0.0|
|row_f|    42.0|  0.0|    54.0|   21.0|
+-----+--------+-----+--------+-------+
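The default above matches pandas' interpolation='linear'. As a hedged sketch of how the other interpolation modes could be tweaked in (the function names are illustrative), the 'lower' and 'higher' variants simply pick the floor or ceiling element of the sorted array:
# Sketch: 'lower' / 'higher' interpolation, mirroring pandas' interpolation parameter.
# Verify the results against pandas before relying on them.
def quantile_lower(q, cols):
    idx = math.floor((len(cols) - 1) * q)
    return f.array_sort(f.array(*cols)).getItem(idx)

def quantile_higher(q, cols):
    idx = math.ceil((len(cols) - 1) * q)
    return f.array_sort(f.array(*cols)).getItem(idx)

df.withColumn('25%_lower', quantile_lower(0.25, columns)).show()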
As a side note, there is also the pandas API on Spark; however, axis=1 is not (yet) implemented for quantile. It may be added in the future.
df.to_pandas_on_spark().quantile(0.25, axis=1)
NotImplementedError: axis should be either 0 or "index" currently.
I would use GroupedData with applyInPandas. Because this requires passing the output schema, first add a column with the required datatype to the df and grab the resulting schema, then pass that schema to applyInPandas. Code below:
from pyspark.sql.functions import lit

# Generate the new schema by adding the new column
# (the literal value is only a dummy to set the datatype)
sch = df.withColumn('quantile25', lit(110.5)).schema

# udf applied per group
def quartile1_on_axis1(pdf):
    pdf = pdf.assign(quantile25=pdf.quantile(0.25, axis=1))
    return pdf

# apply udf
df.groupby('value').applyInPandas(quartile1_on_axis1, schema=sch).show()
#outcome
+-----+--------+-----+--------+----------+
|value| col_a|col_b| col_c|quantile25|
+-----+--------+-----+--------+----------+
|row_a| 5.0| 0.0| 11.0| 2.5|
|row_b| 3394.0| 0.0| 4543.0| 1697.0|
|row_c|136111.0| 0.0|219255.0| 68055.5|
|row_d| 0.0| 0.0| 0.0| 0.0|
|row_e| 0.0| 0.0| 0.0| 0.0|
|row_f| 42.0| 0.0| 54.0| 21.0|
+-----+--------+-----+--------+----------+
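As a side note on the design: grouping by value means every group holds a single row, which adds shuffle overhead. A hedged alternative (assuming Spark 3.x; the function name is illustrative) is mapInPandas, which applies the same pandas logic once per Arrow batch instead of once per group, reusing the schema from above:
# Sketch, assuming Spark 3.x: same pandas logic, applied per batch rather than per group.
# Like the groupby version, this relies on pandas skipping the non-numeric 'value' column.
def add_quantile_per_batch(batches):
    for pdf in batches:
        yield pdf.assign(quantile25=pdf.quantile(0.25, axis=1))

df.mapInPandas(add_quantile_per_batch, schema=sch).show()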
You could also use numpy in a udf to get this done. If you do not want to list all the columns explicitly, slice them by index (df.columns[1:] below).
import numpy as np
from pyspark.sql.functions import array, udf
from pyspark.sql.types import FloatType

quartile1_on_axis1 = udf(lambda x: float(np.quantile(x, 0.25)), FloatType())
df.withColumn("0.25%", quartile1_on_axis1(array(df.columns[1:]))).show(truncate=False)
+-----+--------+-----+--------+-------+
|value|col_a |col_b|col_c |0.25% |
+-----+--------+-----+--------+-------+
|row_a|5.0 |0.0 |11.0 |2.5 |
|row_b|3394.0 |0.0 |4543.0 |1697.0 |
|row_c|136111.0|0.0 |219255.0|68055.5|
|row_d|0.0 |0.0 |0.0 |0.0 |
|row_e|0.0 |0.0 |0.0 |0.0 |
|row_f|42.0 |0.0 |54.0 |21.0 |
+-----+--------+-----+--------+-------+
You can pass a single struct column instead of using multiple columns like this:
@F.pandas_udf('double')
def quartile1_on_axis1(s: pd.DataFrame) -> pd.Series:
    return s.quantile(0.25, axis=1)
cols = ['col_a', 'col_b', 'col_c']
df = df.withColumn('25%', quartile1_on_axis1(F.struct(*cols)))
df.show()
# +-----+--------+-----+--------+-------+
# |value| col_a|col_b| col_c| 25%|
# +-----+--------+-----+--------+-------+
# |row_a| 5.0| 0.0| 11.0| 2.5|
# |row_b| 3394.0| 0.0| 4543.0| 1697.0|
# |row_c|136111.0| 0.0|219255.0|68055.5|
# |row_d| 0.0| 0.0| 0.0| 0.0|
# |row_e| 0.0| 0.0| 0.0| 0.0|
# |row_f| 42.0| 0.0| 54.0| 21.0|
# +-----+--------+-----+--------+-------+
From the pyspark.sql.functions.pandas_udf documentation:
Note that the type hint should use pandas.Series in all cases but there is one variant that pandas.DataFrame should be used for its input or output type hint instead when the input or output column is of pyspark.sql.types.StructType.
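If you would rather not build a struct, a hedged alternative is the iterator-of-multiple-Series pandas_udf variant (assuming Spark 3.x; the function name here is illustrative). Each batch is a tuple with one pd.Series per input column, so the function itself is independent of the number of columns:
from typing import Iterator, Tuple

# Sketch, assuming Spark 3.x supports Iterator[Tuple[pd.Series, ...]] type hints.
@F.pandas_udf('double')
def quartile1_any_cols(batches: Iterator[Tuple[pd.Series, ...]]) -> Iterator[pd.Series]:
    for batch in batches:
        # batch is a tuple of Series, one per column passed at call time
        yield pd.concat(batch, axis=1).quantile(0.25, axis=1)

cols = ['col_a', 'col_b', 'col_c']
df.withColumn('25%', quartile1_any_cols(*cols)).show()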
The following seems to do what's required, but instead of pandas_udf it uses a regular udf. It would be great if I could employ pandas_udf in a similar way.
from pyspark.sql import functions as F
import numpy as np
@F.udf('double')
def lower_quart(*cols):
    return float(np.quantile(cols, 0.25))
df = df.withColumn('25%', lower_quart('col_a', 'col_b', 'col_c'))
df.show()
#+-----+--------+-----+--------+-------+
#|value| col_a|col_b| col_c| 25%|
#+-----+--------+-----+--------+-------+
#|row_a| 5.0| 0.0| 11.0| 2.5|
#|row_b| 3394.0| 0.0| 4543.0| 1697.0|
#|row_c|136111.0| 0.0|219255.0|68055.5|
#|row_d| 0.0| 0.0| 0.0| 0.0|
#|row_e| 0.0| 0.0| 0.0| 0.0|
#|row_f| 42.0| 0.0| 54.0| 21.0|
#+-----+--------+-----+--------+-------+
I have a Spark dataframe, like so:
# For sake of simplicity only one id is shown, but there are multiple objects
+---+-------------------+------+
| id| timstm|signal|
+---+-------------------+------+
| X1|2022-07-01 00:00:00| null|
| X1|2022-07-02 00:00:00| true|
| X1|2022-07-03 00:00:00| null|
| X1|2022-07-05 00:00:00| null|
| X1|2022-07-09 00:00:00| true|
+---+-------------------+------+
And I want to create a new column that contains the time (in days) since the signal column was last true:
+---+-------------------+------+---------+
| id| timstm|signal|time_diff|
+---+-------------------+------+---------+
| X1|2022-07-01 00:00:00| null| null|
| X1|2022-07-02 00:00:00| true| 0.0|
| X1|2022-07-03 00:00:00| null| 1.0|
| X1|2022-07-05 00:00:00| null| 3.0|
| X1|2022-07-09 00:00:00| true| 0.0|
+---+-------------------+------+---------+
Any ideas how to approach this? My intuition is to somehow use window and filter to achieve this, but I'm not sure how.
So this logic is a bit hard to express in native PySpark. It might be easier to express it as a Pandas UDF. I will use the Fugue library to bring Python/Pandas code to a Pandas UDF; if you don't want to use Fugue you can still write the Pandas UDF yourself, it just takes more code (a sketch of that approach is included at the end of this answer).
Setup
Here I am just creating the DataFrame from the example. It is a Pandas DataFrame for now; we will convert it to Spark and run the solution on Spark later.
I suggest filling the null with False in the original DataFrame. This is because the Pandas code uses a group-by and NULL values are dropped by default in Pandas groupby. Filling the NULL with False will make it work properly (and I think it's also easier for conversion between Spark and Pandas).
import pandas as pd
df = pd.DataFrame({"id": ["X1"]*5,
"timestm": ["2022-07-01", "2022-07-02", "2022-07-03", "2022-07-05", "2022-07-09"],
"signal": [None, True, None, None, True]})
df['timestm'] = pd.to_datetime(df['timestm'])
df['signal'] = df['signal'].fillna(False)
Solution 1
When we use a Pandas UDF, the important piece is that the function is applied per partition of the data, so the function just needs to be able to handle a single id. We then partition the Spark DataFrame by id and run the function on each group.
Also be aware that row order is not guaranteed, so we sort the data by time as the first step. The Pandas code here is really just taken from another post and modified.
def process(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values('timestm')
    df['days_since_last_event'] = df['timestm'].diff().apply(lambda x: x.days)
    df.loc[:, 'days_since_last_event'] = df.groupby(df['signal'].shift().cumsum())['days_since_last_event'].cumsum()
    df.loc[df['signal'] == True, 'days_since_last_event'] = 0
    return df
process(df)
This will give us:
id timestm signal days_since_last_event
X1 2022-07-01 False NaN
X1 2022-07-02 True 0.0
X1 2022-07-03 False 1.0
X1 2022-07-05 False 3.0
X1 2022-07-09 True 0.0
Which looks right. Now we can bring it to Spark using Fugue with minimal additional code. This will partition the data and run the function on each partition. A schema is a requirement for Pandas UDFs, so Fugue needs it too, but it uses a simpler way to define it.
import fugue.api as fa
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)
out = fa.transform(sdf, process, schema="*, days_since_last_event:int", partition={"by": "id"})
# out is a Spark DataFrame because a Spark DataFrame was passed in
out.show()
which gives us:
+---+-------------------+------+---------------------+
| id| timestm|signal|days_since_last_event|
+---+-------------------+------+---------------------+
| X1|2022-07-01 00:00:00| false| null|
| X1|2022-07-02 00:00:00| true| 0|
| X1|2022-07-03 00:00:00| false| 1|
| X1|2022-07-05 00:00:00| false| 3|
| X1|2022-07-09 00:00:00| true| 0|
+---+-------------------+------+---------------------+
Note that you need to define the partitioning (by id here) when running on the full data.
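For completeness, here is a hedged sketch of the same idea without Fugue, using applyInPandas directly; the output schema below is built by hand and reuses the process function defined above (DoubleType is chosen so the NaN for the first row per id survives):
from pyspark.sql import types as T

# Sketch: same per-id pandas logic via applyInPandas, without Fugue.
# The output schema is the input schema plus the new column.
out_schema = T.StructType(
    sdf.schema.fields + [T.StructField("days_since_last_event", T.DoubleType())]
)

out = sdf.groupby("id").applyInPandas(process, schema=out_schema)
out.show()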
StringIndexer seems to infer the indices based on the unique values in the data. This is a problem when the data does not have every possible value. The toy example below considers three t-shirt sizes (Small, Medium, and Large), but only two (Small and Large) are in the data. I would like the StringIndexer to still consider all 3 possible sizes. Is there some way to create a column using the index of a string in a supplied list? It would be preferable to do it as a Transformer so that it could be re-used in a pipeline.
from pyspark.sql import Row
from pyspark.ml.feature import StringIndexer

df = spark.createDataFrame([Row(id='0', size='Small'),
                            Row(id='1', size='Small'),
                            Row(id='2', size='Large')])
(
    StringIndexer(inputCol="size", outputCol="size_idx")
    .fit(df)
    .transform(df)
    .show()
)
+---+-----+--------+
| id| size|size_idx|
+---+-----+--------+
| 0|Small| 0.0|
| 1|Small| 0.0|
| 2|Large| 1.0|
+---+-----+--------+
Desired output
+---+-----+--------+
| id| size|size_idx|
+---+-----+--------+
| 0|Small| 0.0|
| 1|Small| 0.0|
| 2|Large| 2.0|
+---+-----+--------+
It looks like you can create the StringIndexer model directly from a set of labels instead of fitting from the data.
from pyspark.sql import Row
from pyspark.ml.feature import StringIndexerModel
df = spark.createDataFrame([Row(id='0', size='Small'),
Row(id='1', size='Small'),
Row(id='2', size='Large')])
si = StringIndexerModel.from_labels(['Small', 'Medium', 'Large'],
inputCol="size",
outputCol="size_idx")
si.transform(df).show()
+---+-----+--------+
| id| size|size_idx|
+---+-----+--------+
| 0|Small| 0.0|
| 1|Small| 0.0|
| 2|Large| 2.0|
+---+-----+--------+
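Since the question mentions re-use in a pipeline: a fitted StringIndexerModel is itself a Transformer, so (as a minimal sketch, with the single-stage pipeline purely illustrative) it can be dropped into a Pipeline like any other stage:
from pyspark.ml import Pipeline

# Sketch: the pre-built StringIndexerModel used as an ordinary Pipeline stage.
pipeline = Pipeline(stages=[si])  # add further stages as needed
pipeline.fit(df).transform(df).show()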
I'm having some trouble getting the round function in PySpark to work. I have the block of code below, where I'm trying to round the new_bid column to 2 decimal places and rename the column as bid afterwards. I'm importing pyspark.sql.functions as func for reference, and using the round function contained within it:
output = output.select(col("ad").alias("ad_id"),
col("part").alias("part_id"),
func.round(col("new_bid"), 2).alias("bid"))
The new_bid column here is of type float. The resulting dataframe does not have the newly named bid column rounded to 2 decimal places as I am trying to do; rather, it still shows 8 or 9 decimal places.
I've tried various things but can't seem to get the resulting dataframe to have the rounded value - any pointers would be greatly appreciated! Thanks!
Here are a couple of ways to do it with some toy data:
spark.version
# u'2.2.0'
import pyspark.sql.functions as func
df = spark.createDataFrame(
[(0.0, 0.2, 3.45631),
(0.4, 1.4, 2.82945),
(0.5, 1.9, 7.76261),
(0.6, 0.9, 2.76790),
(1.2, 1.0, 9.87984)],
["col1", "col2", "col3"])
df.show()
# +----+----+-------+
# |col1|col2| col3|
# +----+----+-------+
# | 0.0| 0.2|3.45631|
# | 0.4| 1.4|2.82945|
# | 0.5| 1.9|7.76261|
# | 0.6| 0.9| 2.7679|
# | 1.2| 1.0|9.87984|
# +----+----+-------+
# round 'col3' in a new column:
df2 = df.withColumn("col4", func.round(df["col3"], 2)).withColumnRenamed("col4","new_col3")
df2.show()
# +----+----+-------+--------+
# |col1|col2| col3|new_col3|
# +----+----+-------+--------+
# | 0.0| 0.2|3.45631| 3.46|
# | 0.4| 1.4|2.82945| 2.83|
# | 0.5| 1.9|7.76261| 7.76|
# | 0.6| 0.9| 2.7679| 2.77|
# | 1.2| 1.0|9.87984| 9.88|
# +----+----+-------+--------+
# round & replace existing 'col3':
df3 = df.withColumn("col3", func.round(df["col3"], 2))
df3.show()
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | 0.0| 0.2|3.46|
# | 0.4| 1.4|2.83|
# | 0.5| 1.9|7.76|
# | 0.6| 0.9|2.77|
# | 1.2| 1.0|9.88|
# +----+----+----+
It's a matter of personal taste, but I am not a great fan of either col or alias; I prefer withColumn and withColumnRenamed instead. Nevertheless, if you would like to stick with select and col, here is how you could adapt your own code snippet:
from pyspark.sql.functions import col
df4 = df.select(col("col1").alias("new_col1"),
col("col2").alias("new_col2"),
func.round(df["col3"],2).alias("new_col3"))
df4.show()
# +--------+--------+--------+
# |new_col1|new_col2|new_col3|
# +--------+--------+--------+
# | 0.0| 0.2| 3.46|
# | 0.4| 1.4| 2.83|
# | 0.5| 1.9| 7.76|
# | 0.6| 0.9| 2.77|
# | 1.2| 1.0| 9.88|
# +--------+--------+--------+
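If the concern is that a rounded float can still display extra digits downstream, a hedged alternative (not part of the answer above) is to cast the rounded value to a decimal type with a fixed scale:
# Sketch: pin the scale explicitly by casting the rounded value to decimal(10,2).
df5 = df.withColumn("col3", func.round(df["col3"], 2).cast("decimal(10,2)"))
df5.show()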
Suppose I have the following spark-dataframe:
+-----+-------+
| word| label|
+-----+-------+
| red| color|
| red| color|
| blue| color|
| blue|feeling|
|happy|feeling|
+-----+-------+
Which can be created using the following code:
sample_df = spark.createDataFrame([
('red', 'color'),
('red', 'color'),
('blue', 'color'),
('blue', 'feeling'),
('happy', 'feeling')
],
('word', 'label')
)
I can perform a groupBy() to get the counts of each word-label pair:
sample_df = sample_df.groupBy('word', 'label').count()
#+-----+-------+-----+
#| word| label|count|
#+-----+-------+-----+
#| blue| color| 1|
#| blue|feeling| 1|
#| red| color| 2|
#|happy|feeling| 1|
#+-----+-------+-----+
And then pivot() and sum() to get the label counts as columns:
import pyspark.sql.functions as f
sample_df = sample_df.groupBy('word').pivot('label').agg(f.sum('count')).na.fill(0)
#+-----+-----+-------+
#| word|color|feeling|
#+-----+-----+-------+
#| red| 2| 0|
#|happy| 0| 1|
#| blue| 1| 1|
#+-----+-----+-------+
What is the best way to transform this dataframe such that each row is divided by the total for that row?
# Desired output
+-----+-----+-------+
| word|color|feeling|
+-----+-----+-------+
| red| 1.0| 0.0|
|happy| 0.0| 1.0|
| blue| 0.5| 0.5|
+-----+-----+-------+
One way to achieve this result is to use Python's built-in sum (NOT pyspark.sql.functions.sum) to get the row-wise total and then call withColumn() for each label:
labels = ['color', 'feeling']
sample_df.withColumn('total', sum([f.col(x) for x in labels]))\
    .withColumn('color', f.col('color')/f.col('total'))\
    .withColumn('feeling', f.col('feeling')/f.col('total'))\
    .select('word', 'color', 'feeling')\
    .show()
But there has to be a better way than enumerating each of the possible columns.
More generally, my question is:
How can I apply an arbitrary transformation, that is a function of the current row, to multiple columns simultaneously?
Found an answer on this Medium post.
First make a column for the total (as above), then use the * operator to unpack a list comprehension over the labels in select():
labels = ['color', 'feeling']
sample_df = sample_df.withColumn('total', sum([f.col(x) for x in labels]))
sample_df.select(
    'word', *[(f.col(col_name)/f.col('total')).alias(col_name) for col_name in labels]
).show()
The approach shown in the linked post generalizes this to arbitrary transformations.
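As a hedged sketch of that generalization (the helper name apply_to_columns is illustrative, not from the linked post): build the per-column expressions from a small lambda and unpack them into select():
# Sketch: apply an arbitrary row-wise expression to a set of columns at once.
def apply_to_columns(df, cols, make_expr):
    others = [c for c in df.columns if c not in cols]
    return df.select(*others, *[make_expr(c).alias(c) for c in cols])

normalized = apply_to_columns(sample_df, labels, lambda c: f.col(c) / f.col('total'))
normalized.select('word', *labels).show()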