How to bin in PySpark? - apache-spark

For example, I'd like to classify a DataFrame of people into the following 4 bins according to age.
age_bins = [0, 6, 18, 60, np.Inf]
age_labels = ['infant', 'minor', 'adult', 'senior']
I would use pandas.cut() to do this in pandas. How do I do this in PySpark?

You can use the Bucketizer feature transformer from Spark's ML library.
values = [("a", 23), ("b", 45), ("c", 10), ("d", 60), ("e", 56), ("f", 2), ("g", 25), ("h", 40), ("j", 33)]
df = spark.createDataFrame(values, ["name", "ages"])
from pyspark.ml.feature import Bucketizer
bucketizer = Bucketizer(splits=[0, 6, 18, 60, float('Inf')], inputCol="ages", outputCol="buckets")
df_buck = bucketizer.setHandleInvalid("keep").transform(df)
df_buck.show()
Output:
+----+----+-------+
|name|ages|buckets|
+----+----+-------+
|   a|  23|    2.0|
|   b|  45|    2.0|
|   c|  10|    1.0|
|   d|  60|    3.0|
|   e|  56|    2.0|
|   f|   2|    0.0|
|   g|  25|    2.0|
|   h|  40|    2.0|
|   j|  33|    2.0|
+----+----+-------+
If you want names for each bucket, you can use a UDF to create a new column with the bucket names:
from pyspark.sql.functions import udf
from pyspark.sql.types import *
t = {0.0:"infant", 1.0: "minor", 2.0:"adult", 3.0: "senior"}
udf_foo = udf(lambda x: t[x], StringType())
df_buck.withColumn("age_bucket", udf_foo("buckets")).show()
Output:
+----+----+-------+----------+
|name|ages|buckets|age_bucket|
+----+----+-------+----------+
|   a|  23|    2.0|     adult|
|   b|  45|    2.0|     adult|
|   c|  10|    1.0|     minor|
|   d|  60|    3.0|    senior|
|   e|  56|    2.0|     adult|
|   f|   2|    0.0|    infant|
|   g|  25|    2.0|     adult|
|   h|  40|    2.0|     adult|
|   j|  33|    2.0|     adult|
+----+----+-------+----------+
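If you'd rather avoid the Python UDF, a join against a small labels DataFrame works too (a sketch; the labels name and values are mine, mirroring the splits above):
labels = spark.createDataFrame(
    [(0.0, "infant"), (1.0, "minor"), (2.0, "adult"), (3.0, "senior")],
    ["buckets", "age_bucket"]
)
# the labels table is tiny, so Spark will typically broadcast it for the join
df_buck.join(labels, on="buckets", how="left").show()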

You could also write a PySpark UDF:
def categorizer(age):
    if age < 6:
        return "infant"
    elif age < 18:
        return "minor"
    elif age < 60:
        return "adult"
    else:
        return "senior"
Then:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

bucket_udf = udf(categorizer, StringType())
bucketed = df.withColumn("bucket", bucket_udf("age"))
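The same binning can also be expressed without a UDF by chaining when conditions (a sketch, assuming df has an age column as in the question):
from pyspark.sql import functions as F

bucketed = df.withColumn(
    "bucket",
    F.when(F.col("age") < 6, "infant")
     .when(F.col("age") < 18, "minor")
     .when(F.col("age") < 60, "adult")
     .otherwise("senior")
)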

In my case I had to randomly bucket a string column, which required a few extra steps:
from pyspark.sql.types import LongType, IntegerType
import pyspark.sql.functions as F

buckets_number = 4  # number of buckets desired

df.withColumn("sub", F.substring(F.md5('my_col'), 0, 16)) \
  .withColumn("translate", F.translate("sub", "abcdefghijklmnopqrstuvwxyz", "01234567890123456789012345").cast(LongType())) \
  .select("my_col",
          (F.col("translate") % buckets_number).cast(IntegerType()).alias("bucket_my_col"))
The steps are:
hash the column with MD5
substring the result to 16 characters (otherwise the number in the following steps would be too big)
translate the letters generated by MD5 into numbers
apply the modulo function based on the number of desired buckets
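If exact MD5 semantics are not required, a shorter sketch (my assumption, not part of the original approach) uses the built-in hash function, which is also deterministic per value:
import pyspark.sql.functions as F

buckets_number = 4  # number of buckets desired
df.withColumn("bucket_my_col", F.abs(F.hash("my_col")) % buckets_number).show()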

If you know the bin width, you can use division with a cast. The result is then multiplied by the bin width to get the lower bound of the bin as the label.
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

def categorize(df, bin_width):
    df = df.withColumn('bucket', (col('value') / bin_width).cast(IntegerType()) * bin_width)
    return df
values = [("a", 23), ("b", 45), ("e", 56), ("f", 2)]
df = spark.createDataFrame(values, ["name", "value"])
categorize(df, bin_width=10).show()
Output:
+----+-----+------+
|name|value|bucket|
+----+-----+------+
|   a|   23|    20|
|   b|   45|    40|
|   e|   56|    50|
|   f|    2|     0|
+----+-----+------+
Notice that it also works for floating point attributes:
values = [("a", .23), ("b", .45), ("e", .56), ("f", .02)]
df = spark.createDataFrame(values, ["name", "value"])
categorize(df, bin_width=.10).show()
Output:
+----+-----+------+
|name|value|bucket|
+----+-----+------+
|   a| 0.23|   0.2|
|   b| 0.45|   0.4|
|   e| 0.56|   0.5|
|   f| 0.02|   0.0|
+----+-----+------+
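Note that the integer cast truncates toward zero, so negative values would be labelled by their bin's upper bound. If that matters, a floor-based variant (a sketch, not part of the original answer) keeps the lower bound as the label:
from pyspark.sql.functions import col, floor

def categorize_floor(df, bin_width):
    # floor() rounds toward negative infinity, so e.g. -3 with bin_width=10 lands in bucket -10
    return df.withColumn('bucket', floor(col('value') / bin_width) * bin_width)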

Related

How to fill up null values in Spark Dataframe based on other columns' value?

Given this dataframe:
+-----+-----+----+
|num_a|num_b| sum|
+-----+-----+----+
|    1|    1|   2|
|   12|   15|  27|
|   56|   11|null|
|   79|    3|  82|
|  111|  114| 225|
+-----+-----+----+
How would you fill null values in the sum column if the value can be computed from the other columns? In this example, 56 + 11 would be the value.
I've tried df.fillna with a UDF, but that doesn't seem to work, as it was just getting the column name, not the actual value. I want to compute the value only for the rows with missing values, so creating a new column would not be a viable option.
If your requirement is a UDF, it can be done as:
import pyspark.sql.functions as F
from pyspark.sql.types import LongType

df = spark.createDataFrame(
    [(1, 2, 3),
     (12, 15, 27),
     (56, 11, None),
     (79, 3, 82)],
    ["num_a", "num_b", "sum"]
)

@F.udf(returnType=LongType())
def fill_with_sum(num_a, num_b, sum):
    return sum if sum is not None else (num_a + num_b)

df = df.withColumn("sum", fill_with_sum(F.col("num_a"), F.col("num_b"), F.col("sum")))
[Out]:
+-----+-----+---+
|num_a|num_b|sum|
+-----+-----+---+
|    1|    2|  3|
|   12|   15| 27|
|   56|   11| 67|
|   79|    3| 82|
+-----+-----+---+
You can use the coalesce function. Check this sample code:
import pyspark.sql.functions as f

df = spark.createDataFrame(
    [(1, 2, 3),
     (12, 15, 27),
     (56, 11, None),
     (79, 3, 82)],
    ["num_a", "num_b", "sum"]
)
df.withColumn("sum", f.coalesce(f.col("sum"), f.col("num_a") + f.col("num_b"))).show()
Output is:
+-----+-----+---+
|num_a|num_b|sum|
+-----+-----+---+
|    1|    2|  3|
|   12|   15| 27|
|   56|   11| 67|
|   79|    3| 82|
+-----+-----+---+
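The same logic can be written as a single SQL expression if you prefer (equivalent to the coalesce call above):
df.withColumn("sum", f.expr("coalesce(`sum`, num_a + num_b)")).show()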

Select column name per row for max value in PySpark

I have a dataframe like this; only two columns are shown here, but there are many columns in the original dataframe.
data = [(("ID1", 3, 5)), (("ID2", 4, 12)), (("ID3", 8, 3))]
df = spark.createDataFrame(data, ["ID", "colA", "colB"])
df.show()
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID1|   3|   5|
|ID2|   4|  12|
|ID3|   8|   3|
+---+----+----+
I want to extract, for each row, the name of the column that has the max value. Hence the expected output is like this:
+---+----+----+-------+
| ID|colA|colB|Max_col|
+---+----+----+-------+
|ID1|   3|   5|   colB|
|ID2|   4|  12|   colB|
|ID3|   8|   3|   colA|
+---+----+----+-------+
In case of a tie, where colA and colB have the same value, choose the first column.
How can I achieve this in PySpark?
You can use a UDF on each row for row-wise computation and use struct to pass multiple columns to the UDF. Hope this helps.
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, StringType, StructType, StructField
from operator import itemgetter

data = [("ID1", 3, 5, 78), ("ID2", 4, 12, 45), ("ID3", 70, 3, 67)]
df = spark.createDataFrame(data, ["ID", "colA", "colB", "colC"])
df.show()
+---+----+----+----+
| ID|colA|colB|colC|
+---+----+----+----+
|ID1|   3|   5|  78|
|ID2|   4|  12|  45|
|ID3|  70|   3|  67|
+---+----+----+----+
cols = df.columns
# to get max of values in a row
maxcol = F.udf(lambda row: max(row), IntegerType())
maxDF = df.withColumn("maxval", maxcol(F.struct([df[x] for x in df.columns[1:]])))
maxDF.show()
+---+----+----+----+------+
| ID|colA|colB|colC|maxval|
+---+----+----+----+------+
|ID1|   3|   5|  78|    78|
|ID2|   4|  12|  45|    45|
|ID3|  70|   3|  67|    70|
+---+----+----+----+------+
# to get the max value & the corresponding column name
schema = StructType([StructField('maxval', IntegerType()), StructField('maxval_colname', StringType())])

maxcol = F.udf(lambda row: max(row, key=itemgetter(0)), schema)
maxDF = df.withColumn('maxfield', maxcol(F.struct([F.struct(df[x], F.lit(x)) for x in df.columns[1:]]))) \
    .select(df.columns + ['maxfield.maxval', 'maxfield.maxval_colname'])
maxDF.show()
+---+----+----+----+------+--------------+
| ID|colA|colB|colC|maxval|maxval_colname|
+---+----+----+----+------+--------------+
|ID1|   3|   5|  78|    78|          colC|
|ID2|   4|  12|  45|    45|          colC|
|ID3|  70|   3|  67|    70|          colA|
+---+----+----+----+------+--------------+
There are multiple options to achieve this. I am providing an example for one and can provide hints for the rest:
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W
from pyspark.sql import types as T
data = [(("ID1", 3, 5)), (("ID2", 4, 12)), (("ID3", 8, 3))]
df = spark.createDataFrame(data, ["ID", "colA", "colB"])
df.show()
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID1|   3|   5|
|ID2|   4|  12|
|ID3|   8|   3|
+---+----+----+
# Below, F.array creates an array of column-name/value pairs like [['colA', 3], ['colB', 5]];
# F.explode then breaks this array into rows, so each column-name/value pair gets its own row
df = df.withColumn(
    "max_val",
    F.explode(
        F.array([
            F.array([F.lit(cl), F.col(cl)]) for cl in df.columns[1:]
        ])
    )
)
df.show()
+---+----+----+----------+
| ID|colA|colB|   max_val|
+---+----+----+----------+
|ID1|   3|   5| [colA, 3]|
|ID1|   3|   5| [colB, 5]|
|ID2|   4|  12| [colA, 4]|
|ID2|   4|  12|[colB, 12]|
|ID3|   8|   3| [colA, 8]|
|ID3|   8|   3| [colB, 3]|
+---+----+----+----------+
# Then select columns so that the column name and value end up in separate columns
df = df.select(
    "ID",
    "colA",
    "colB",
    F.col("max_val").getItem(0).alias("col_name"),
    F.col("max_val").getItem(1).cast(T.IntegerType()).alias("col_value"),
)
df.show()
+---+----+----+--------+---------+
| ID|colA|colB|col_name|col_value|
+---+----+----+--------+---------+
|ID1|   3|   5|    colA|        3|
|ID1|   3|   5|    colB|        5|
|ID2|   4|  12|    colA|        4|
|ID2|   4|  12|    colB|       12|
|ID3|   8|   3|    colA|        8|
|ID3|   8|   3|    colB|        3|
+---+----+----+--------+---------+
# Rank column values based on ID in desc order
df = df.withColumn(
    "rank",
    F.rank().over(W.partitionBy("ID").orderBy(F.col("col_value").desc()))
)
df.show()
+---+----+----+--------+---------+----+
| ID|colA|colB|col_name|col_value|rank|
+---+----+----+--------+---------+----+
|ID2|   4|  12|    colB|       12|   1|
|ID2|   4|  12|    colA|        4|   2|
|ID3|   8|   3|    colA|        8|   1|
|ID3|   8|   3|    colB|        3|   2|
|ID1|   3|   5|    colB|        5|   1|
|ID1|   3|   5|    colA|        3|   2|
+---+----+----+--------+---------+----+
# Finally, filter rank = 1: the max value gets rank 1 because we ranked by value in descending order
df.where("rank=1").show()
+---+----+----+--------+---------+----+
| ID|colA|colB|col_name|col_value|rank|
+---+----+----+--------+---------+----+
|ID2|   4|  12|    colB|       12|   1|
|ID3|   8|   3|    colA|        8|   1|
|ID1|   3|   5|    colB|        5|   1|
+---+----+----+--------+---------+----+
Other options are:
Use a UDF on your base df and return the column name that has the max value.
In the same example, after creating the col_name and col_value columns, instead of rank, group by ID and take the max of col_value, then join back with the previous df (see the sketch below).
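A sketch of that second hint, assuming the exploded frame with ID, colA, colB, col_name and col_value shown above (before the rank column is added):
# take the per-ID maximum of col_value, then join back to recover the matching column name
max_df = df.groupBy("ID").agg(F.max("col_value").alias("col_value"))
max_df.join(df, ["ID", "col_value"]).select("ID", "colA", "colB", "col_name").show()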
You can use the RDD API to add the new column:
from pyspark.sql import Row

df.rdd.map(lambda r: r.asDict())\
  .map(lambda r: Row(Max_col=max([i for i in r.items() if i[0] != 'ID'],
                                 key=lambda kv: kv[1])[0], **r))\
  .toDF()
Resulting in:
+---+-------+----+----+
| ID|Max_col|colA|colB|
+---+-------+----+----+
|ID1|   colB|   3|   5|
|ID2|   colB|   4|  12|
|ID3|   colA|   8|   3|
+---+-------+----+----+
Extending what Suresh has done... returning the appropriate column name:
from pyspark.sql import functions as f
from pyspark.sql.types import StringType

data = [("ID1", 3, 5, 78), ("ID2", 4, 12, 45), ("ID3", 68, 3, 67)]
df = spark.createDataFrame(data, ["ID", "colA", "colB", "colC"])
df.show()

cols = df.columns
# index of the row-wise max within the struct, shifted by 1 to skip the ID column
maxcol = f.udf(lambda row: cols[row.index(max(row)) + 1], StringType())
maxDF = df.withColumn("Max_col", maxcol(f.struct([df[x] for x in df.columns[1:]])))
maxDF.show(truncate=False)
+---+----+----+----+-------+
|ID |colA|colB|colC|Max_col|
+---+----+----+----+-------+
|ID1|3   |5   |78  |colC   |
|ID2|4   |12  |45  |colC   |
|ID3|68  |3   |67  |colA   |
+---+----+----+----+-------+
Try the following:
from pyspark.sql import functions as F

data = [("ID1", 3, 5), ("ID2", 4, 12), ("ID3", 8, 3)]
df = spark.createDataFrame(data, ["ID", "colA", "colB"])

# >= so that a tie picks the first column (colA), as required
df.withColumn('max_col',
              F.when(F.col('colA') >= F.col('colB'), 'colA')
               .otherwise('colB')).show()
Yields:
+---+----+----+-------+
| ID|colA|colB|max_col|
+---+----+----+-------+
|ID1|   3|   5|   colB|
|ID2|   4|  12|   colB|
|ID3|   8|   3|   colA|
+---+----+----+-------+
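The when expression hard-codes the two columns. For many columns, a struct comparison generalises it without a UDF (a sketch; note that on ties it picks the lexicographically greatest column name rather than the first column):
from pyspark.sql import functions as F

value_cols = df.columns[1:]  # every column except ID
df.withColumn(
    "max_col",
    # greatest() compares the structs field by field: value first, then column name
    F.greatest(*[F.struct(F.col(c).alias("v"), F.lit(c).alias("n")) for c in value_cols])["n"]
).show()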

PySpark : change column names of a df based on relations defined in another df

I have two Spark dataframes loaded from CSV, of the form:
mapping_fields (the df with mapped names):
new_name  old_name
A         aa
B         bb
C         cc
and
aa  bb  cc  dd
1   2   3   43
12  21  4   37
to be transformed into:
A   B   C   D
1   2   3
12  21  4
As dd didn't have any mapping in the original table, the D column should contain all null values.
How can I do this without converting mapping_fields into a dictionary and checking each name individually? (That would mean I have to collect mapping_fields and check, which somewhat contradicts my use case of handling all the datasets in a distributed way.)
Thanks!
With melt borrowed from here you could:
from pyspark.sql import functions as f

mapping_fields = spark.createDataFrame(
    [("A", "aa"), ("B", "bb"), ("C", "cc")],
    ("new_name", "old_name"))
df = spark.createDataFrame(
    [(1, 2, 3, 43), (12, 21, 4, 37)],
    ("aa", "bb", "cc", "dd"))

(melt(df.withColumn("id", f.monotonically_increasing_id()),
      id_vars=["id"], value_vars=df.columns, var_name="old_name")
    .join(mapping_fields, ["old_name"], "left_outer")
    .withColumn("value", f.when(f.col("new_name").isNotNull(), f.col("value")))
    .withColumn("new_name", f.coalesce("new_name", f.upper(f.col("old_name"))))
    .groupBy("id")
    .pivot("new_name")
    .agg(f.first("value"))
    .drop("id")
    .show())
+---+---+---+----+
|  A|  B|  C|  DD|
+---+---+---+----+
|  1|  2|  3|null|
| 12| 21|  4|null|
+---+---+---+----+
but in your description nothing justifies this. Because the number of columns is fairly limited, I'd rather:
mapping = dict(
    mapping_fields
    .filter(f.col("old_name").isin(df.columns))
    .select("old_name", "new_name").collect())

df.select([
    (f.lit(None).cast(t) if c not in mapping else f.col(c)).alias(mapping.get(c, c.upper()))
    for (c, t) in df.dtypes])
+---+---+---+----+
|  A|  B|  C|  DD|
+---+---+---+----+
|  1|  2|  3|null|
| 12| 21|  4|null|
+---+---+---+----+
At the end of the day you should use distributed processing when it provides performance or scalability improvements. Here it would do the opposite and make your code overly complicated.
To ignore no-matches:
(melt(df.withColumn("id", f.monotonically_increasing_id()),
      id_vars=["id"], value_vars=df.columns, var_name="old_name")
    .join(mapping_fields, ["old_name"])
    .groupBy("id")
    .pivot("new_name")
    .agg(f.first("value"))
    .drop("id")
    .show())
or
df.select([
    f.col(c).alias(mapping.get(c))
    for (c, t) in df.dtypes if c in mapping])
I tried with a simple for loop; hope this helps too.
from pyspark.sql import functions as F

l1 = [('A', 'aa'), ('B', 'bb'), ('C', 'cc')]
l2 = [(1, 2, 3, 43), (12, 21, 4, 37)]
df1 = spark.createDataFrame(l1, ['new_name', 'old_name'])
df2 = spark.createDataFrame(l2, ['aa', 'bb', 'cc', 'dd'])
>>> df1.show()
+--------+--------+
|new_name|old_name|
+--------+--------+
|       A|      aa|
|       B|      bb|
|       C|      cc|
+--------+--------+
>>> df2.show()
+---+---+---+---+
| aa| bb| cc| dd|
+---+---+---+---+
|  1|  2|  3| 43|
| 12| 21|  4| 37|
+---+---+---+---+
When you need the missing columns with null values:
>>> cols = df2.columns
>>> for i in cols:
...     val = df1.where(df1['old_name'] == i).first()
...     if val is not None:
...         df2 = df2.withColumnRenamed(i, val['new_name'])
...     else:
...         df2 = df2.withColumn(i, F.lit(None))
>>> df2.show()
+---+---+---+----+
|  A|  B|  C|  dd|
+---+---+---+----+
|  1|  2|  3|null|
| 12| 21|  4|null|
+---+---+---+----+
When we need only the mapped columns, change the else part to:
    else:
        df2 = df2.drop(i)
>>> df2.show()
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  2|  3|
| 12| 21|  4|
+---+---+---+
This will transform the original df2 dataframe though.
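If you only need the renaming (keeping unmapped columns under their original names), a toDF sketch applied to the original df2 avoids the repeated withColumnRenamed calls; it still collects the small mapping to the driver, as discussed above:
mapping = dict(df1.select('old_name', 'new_name').collect())
df2 = df2.toDF(*[mapping.get(c, c) for c in df2.columns])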

Unpivot in Spark SQL / PySpark

I have a problem statement at hand wherein I want to unpivot a table in Spark SQL / PySpark. I have gone through the documentation and I can see there is support only for pivot, but no support for unpivot so far.
Is there a way I can achieve this?
Let my initial table look like this:
When I pivot this in PySpark:
df.groupBy("A").pivot("B").sum("C")
I get this as the output:
Now I want to unpivot the pivoted table. In general, this operation may/may not yield the original table based on how I've pivoted the original table.
Spark SQL as of now doesn't provide out of the box support for unpivot. Is there a way I can achieve this?
You can use the built-in stack function, for example in Scala:
scala> val df = Seq(("G",Some(4),2,None),("H",None,4,Some(5))).toDF("A","X","Y", "Z")
df: org.apache.spark.sql.DataFrame = [A: string, X: int ... 2 more fields]
scala> df.show
+---+----+---+----+
|  A|   X|  Y|   Z|
+---+----+---+----+
|  G|   4|  2|null|
|  H|null|  4|   5|
+---+----+---+----+
scala> df.select($"A", expr("stack(3, 'X', X, 'Y', Y, 'Z', Z) as (B, C)")).where("C is not null").show
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  G|  X|  4|
|  G|  Y|  2|
|  H|  Y|  4|
|  H|  Z|  5|
+---+---+---+
Or in PySpark:
In [1]: df = spark.createDataFrame([("G",4,2,None),("H",None,4,5)],list("AXYZ"))
In [2]: df.show()
+---+----+---+----+
|  A|   X|  Y|   Z|
+---+----+---+----+
|  G|   4|  2|null|
|  H|null|  4|   5|
+---+----+---+----+
In [3]: df.selectExpr("A", "stack(3, 'X', X, 'Y', Y, 'Z', Z) as (B, C)").where("C is not null").show()
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  G|  X|  4|
|  G|  Y|  2|
|  H|  Y|  4|
|  H|  Z|  5|
+---+---+---+
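The same stack call also works in plain Spark SQL once the DataFrame is registered as a view (the table and subquery alias names here are mine):
df.createOrReplaceTempView("tbl")
spark.sql("""
    SELECT A, B, C
    FROM (SELECT A, stack(3, 'X', X, 'Y', Y, 'Z', Z) AS (B, C) FROM tbl) t
    WHERE C IS NOT NULL
""").show()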
Spark 3.4+
df = df.melt(['A'], ['X', 'Y', 'Z'], 'B', 'C')
# OR
df = df.unpivot(['A'], ['X', 'Y', 'Z'], 'B', 'C')
+---+---+----+
|  A|  B|   C|
+---+---+----+
|  G|  Y|   2|
|  G|  Z|null|
|  G|  X|   4|
|  H|  Y|   4|
|  H|  Z|   5|
|  H|  X|null|
+---+---+----+
To filter out nulls: df = df.filter("C is not null")
Spark 3.3 and below
to_melt = {'X', 'Y', 'Z'}
new_names = ['B', 'C']
melt_str = ','.join([f"'{c}', `{c}`" for c in to_melt])
df = df.select(
    *(set(df.columns) - to_melt),
    F.expr(f"stack({len(to_melt)}, {melt_str}) ({','.join(new_names)})")
).filter(f"!{new_names[1]} is null")
Full test:
from pyspark.sql import functions as F
df = spark.createDataFrame([("G", 4, 2, None), ("H", None, 4, 5)], list("AXYZ"))
to_melt = {'X', 'Y', 'Z'}
new_names = ['B', 'C']
melt_str = ','.join([f"'{c}', `{c}`" for c in to_melt])
df = df.select(
    *(set(df.columns) - to_melt),
    F.expr(f"stack({len(to_melt)}, {melt_str}) ({','.join(new_names)})")
).filter(f"!{new_names[1]} is null")
df.show()
# +---+---+---+
# |  A|  B|  C|
# +---+---+---+
# |  G|  Y|  2|
# |  G|  X|  4|
# |  H|  Y|  4|
# |  H|  Z|  5|
# +---+---+---+

How to get the min of each row in PySpark DataFrame [duplicate]

I am working on a PySpark DataFrame with n columns. I have a set of m columns (m < n) and my task is to choose the column with the max values in it.
For example:
Input: PySpark DataFrame containing :
col_1 = [1,2,3], col_2 = [2,1,4], col_3 = [3,2,5]
Output:
col_4 = max(col_1, col_2, col_3) = [3, 2, 5]
There is something similar in pandas as explained in this question.
Is there any way of doing this in PySpark, or should I convert my PySpark df to a pandas df and then perform the operations?
You can reduce using SQL expressions over a list of columns:
from pyspark.sql.functions import max as max_, col, when
from functools import reduce

def row_max(*cols):
    return reduce(
        lambda x, y: when(x > y, x).otherwise(y),
        [col(c) if isinstance(c, str) else c for c in cols]
    )

df = (sc.parallelize([(1, 2, 3), (2, 1, 2), (3, 4, 5)])
      .toDF(["a", "b", "c"]))

df.select(row_max("a", "b", "c").alias("max"))
Spark 1.5+ also provides least and greatest:
from pyspark.sql.functions import greatest
df.select(greatest("a", "b", "c"))
If you want to keep the name of the max you can use structs:
from pyspark.sql.functions import struct, lit

def row_max_with_name(*cols):
    cols_ = [struct(col(c).alias("value"), lit(c).alias("col")) for c in cols]
    return greatest(*cols_).alias("greatest({0})".format(",".join(cols)))

maxs = df.select(row_max_with_name("a", "b", "c").alias("maxs"))
And finally you can use the above to find and select the "top" column:
from pyspark.sql.functions import max

((_, c), ) = (maxs
    .groupBy(col("maxs")["col"].alias("col"))
    .count()
    .agg(max(struct(col("count"), col("col"))))
    .first())

df.select(c)
We can use greatest
Creating DataFrame
df = spark.createDataFrame(
    [[1, 2, 3], [2, 1, 2], [3, 4, 5]],
    ['col_1', 'col_2', 'col_3']
)
df.show()
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
|    1|    2|    3|
|    2|    1|    2|
|    3|    4|    5|
+-----+-----+-----+
Solution
from pyspark.sql.functions import greatest
df2 = df.withColumn('max_by_rows', greatest('col_1', 'col_2', 'col_3'))
#Only if you need col
#from pyspark.sql.functions import col
#df2 = df.withColumn('max', greatest(col('col_1'), col('col_2'), col('col_3')))
df2.show()
+-----+-----+-----+-----------+
|col_1|col_2|col_3|max_by_rows|
+-----+-----+-----+-----------+
|    1|    2|    3|          3|
|    2|    1|    2|          2|
|    3|    4|    5|          5|
+-----+-----+-----+-----------+
You can also use the pyspark built-in least:
from pyspark.sql.functions import least, col
df = df.withColumn('min', least(col('c1'), col('c2'), col('c3')))
Another simple way of doing it. Let us say that the below df is your dataframe:
df = sc.parallelize([(10, 10, 1 ), (200, 2, 20), (3, 30, 300), (400, 40, 4)]).toDF(["c1", "c2", "c3"])
df.show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 10| 10|  1|
|200|  2| 20|
|  3| 30|300|
|400| 40|  4|
+---+---+---+
You can process the above df as below to get the desired results:
from pyspark.sql.functions import lit, min

df.select(lit('c1').alias('cn1'), min(df.c1).alias('c1'),
          lit('c2').alias('cn2'), min(df.c2).alias('c2'),
          lit('c3').alias('cn3'), min(df.c3).alias('c3')
          )\
  .rdd.flatMap(lambda r: [(r.cn1, r.c1), (r.cn2, r.c2), (r.cn3, r.c3)])\
  .toDF(['Column', 'Min']).show()
+------+---+
|Column|Min|
+------+---+
|    c1|  3|
|    c2|  2|
|    c3|  1|
+------+---+
Scala solution:
val df = sc.parallelize(Seq((10, 10, 1), (200, 2, 20), (3, 30, 300), (400, 40, 4))).toDF("c1", "c2", "c3")

df.rdd
  .map(row => List[String](row(0).toString, row(1).toString, row(2).toString))
  .map(x => (x(0), x(1), x(2), x.min))
  .toDF("c1", "c2", "c3", "min")
  .show
+---+---+---+---+
| c1| c2| c3|min|
+---+---+---+---+
| 10| 10|  1|  1|
|200|  2| 20|  2|
|  3| 30|300|  3|
|400| 40|  4|  4|
+---+---+---+---+
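To also keep track of which column holds the row minimum, the struct trick shown above with greatest works with least as well (a sketch against the col_1/col_2/col_3 DataFrame from earlier):
from pyspark.sql.functions import least, struct, lit, col

min_struct = least(*[struct(col(c).alias("value"), lit(c).alias("col"))
                     for c in ['col_1', 'col_2', 'col_3']])
df.withColumn("min_val", min_struct["value"]) \
  .withColumn("min_col", min_struct["col"]) \
  .show()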
