The below UDF doesn't work - am I passing 2 columns in correctly & calling the function in the right way?
Thanks!!
def shield(x, y):
if x == '':
shield = y
else:
shield = x
return shield
df3.withColumn("shield", shield(df3.custavp1, df3.custavp1))
I think the passing the arguments to the udf is incorrect.
The correct way is given below:
>>> ls
[[1, 2, 3, 4], [5, 6, 7, 8]]
>>> from pyspark.sql import Row
>>> R = Row("A1", "A2")
>>> df = sc.parallelize([R(*r) for r in zip(*ls)]).toDF()
>>> df.show
<bound method DataFrame.show of DataFrame[A1: bigint, A2: bigint]>
>>> df.show()
+---+---+
| A1| A2|
+---+---+
| 1| 5|
| 2| 6|
| 3| 7|
| 4| 8|
+---+---+
>>> def foo(x,y):
... if x%2 == 0:
... return x
... else:
... return y
...
>>>
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import IntegerType
>>>
>>> custom_udf = udf(foo, IntegerType())
>>> df1 = df.withColumn("res", custom_udf(col("A1"), col("A2")))
>>> df1.show()
+---+---+---+
| A1| A2|res|
+---+---+---+
| 1| 5| 5|
| 2| 6| 2|
| 3| 7| 7|
| 4| 8| 4|
+---+---+---+
Let me know if it helps.
Related
Say I have a dataframe:
+-----+-----+-----+
|id |foo. |bar. |
+-----+-----+-----+
| 1| baz| 0|
| 2| baz| 0|
| 3| 333| 2|
| 4| 444| 1|
+-----+-----+-----+
I want to set the 'foo' column to a value depending on the value of bar.
If bar is 2: set the value of foo for that row to 'X',
else if bar is 1: set the value of foo for that row to 'Y'
And if neither condition is met, leave the foo value as it is.
pyspark.when seems like the closest method, but that doesn't seem to work based on another columns value.
when can work with other columns. You can use F.col to get the value of the other column and provide an appropriate condition:
import pyspark.sql.functions as F
df2 = df.withColumn(
'foo',
F.when(F.col('bar') == 2, 'X')
.when(F.col('bar') == 1, 'Y')
.otherwise(F.col('foo'))
)
df2.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+
We can solve this using when òr UDF in spark to insert new column based on condition.
Create Sample DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('AddConditionalColumn').getOrCreate()
data = [(1,"baz",0),(2,"baz",0),(3,"333",2),(4,"444",1)]
columns = ["id","foo","bar"]
df = spark.createDataFrame(data = data, schema = columns)
df.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3|333| 2|
| 4|444| 1|
+---+---+---+
Using When:
from pyspark.sql.functions import when
df2 = df.withColumn("foo", when(df.bar == 2,"X")
.when(df.bar == 1,"Y")
.otherwise(df.foo))
df2.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+
Using UDF:
import pyspark.sql.functions as F
from pyspark.sql.types import *
def executeRule(value):
if value == 2:
return 'X'
elif value == 1:
return 'Y'
else:
return value
# Converting function to UDF
ruleUDF = F.udf(executeRule, StringType())
df3 = df.withColumn("foo", ruleUDF("bar"))
df3.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1| 0| 0|
| 2| 0| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+
I have a pyspark dataframe with the following schema
+-----------+---------+----------+-----------+
| userID|grouping1| grouping2| features|
+-----------+---------+----------+-----------+
|12462563356| 1| A | [5.0,43.0]|
|12462563701| 2| A | [1.0,8.0]|
|12462563701| 1| B | [2.0,12.0]|
|12462564356| 1| C | [1.0,1.0]|
|12462565487| 3| C | [2.0,3.0]|
|12462565698| 2| D | [1.0,1.0]|
|12462565698| 1| E | [1.0,1.0]|
|12462566081| 2| C | [1.0,2.0]|
|12462566081| 1| D | [1.0,15.0]|
|12462566225| 2| E | [1.0,1.0]|
|12462566225| 1| A | [9.0,85.0]|
|12462566526| 2| C | [1.0,1.0]|
|12462566526| 1| D | [3.0,79.0]|
|12462567006| 2| D |[11.0,15.0]|
|12462567006| 1| B |[10.0,15.0]|
|12462567006| 3| A |[10.0,15.0]|
|12462586595| 2| B | [2.0,42.0]|
|12462586595| 3| D | [2.0,16.0]|
|12462589343| 3| E | [1.0,1.0]|
+-----------+---------+----------+-----------+
For values in grouping2 A, B, C and D I need to apply UDF_A, UDF_B, UDF_C and UDF_D respectively. Is there a way I can write something along the lines of
dataset = dataset.withColumn('outputColName', selectUDF(**params))
where selectUDF is defined as
def selectUDF(**params):
if row[grouping2] == A:
return UDF_A(**params)
elif row[grouping2] == B:
return UDF_B(**params)
elif row[grouping2] == C:
return UDF_C(**params)
elif row[grouping2] == D:
return UDF_D(**params)
Using the following example to illustrate what I'm trying to do
Yes i thought so too. I'm using the following toy code to check this
>>> df = sc.parallelize([[1,2,3], [2,3,4]]).toDF(("a", "b", "c"))
>>> df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
+---+---+---+
>>> def udf1(col):
... return col1*col1
...
>>> def udf2(col):
... return col2*col2*col2
...
>>> def select_udf(col1, col2):
... if col1 == 2:
... return udf1(col2)
... elif col1 == 3:
... return udf2(col2)
... else:
... return 0
...
>>> from pyspark.sql.functions import col
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import IntegerType
>>> select_udf = udf(select_udf, IntegerType())
>>> udf1 = udf(udf1, IntegerType())
>>> udf2 = udf(udf2, IntegerType())
>>> df.withColumn("outCol", select_udf(col("b"), col("c"))).show()
[Stage 9:============================================> (3 + 1) / 4]
This seems to be stuck at this stage forever. Can anyone suggest what might be wrong here?
You don't need a selectUDF, simply use when expression to apply the desired udf depending on the value of grouping2 column:
from pyspark.sql.functions import col, when
df = df.withColumn(
"outCol",
when(col("grouping2") == "A", UDF_A(*params))
.when(col("grouping2") == "B", UDF_B(*params))
.when(col("grouping2") == "C", UDF_C(*params))
.when(col("grouping2") == "D", UDF_D(*params))
)
I am trying to do binning on a particular column in dataframe based on the data given in the dictionary.
Below is the dataframe I used:
df
SNo,Name,CScore
1,x,700
2,y,850
3,z,560
4,a,578
5,b,456
6,c,678
I have created the below function,it is working fine if I use it seperately.
def binning(column,dict):
finalColumn=[]
lent = len(column)
for i in range(lent):
for j in range(len(list(dict))):
if( int(column[i]) in range(list(dict)[j][0],list(dict)[j][1])):
finalColumn.append(dict[list(dict)[j]])
return finalColumn
I have used the above function in the below statement.
newDf = df.withColumn("binnedColumn",binning(df.select("CScore").rdd.flatMap(lambda x: x).collect(),{(1,400):'Low',(401,900):'High'}))
I am getting the below error:
Traceback (most recent call last):
File "", line 1, in
File "C:\spark_2.4\python\pyspark\sql\dataframe.py", line 1988, in withColumn
assert isinstance(col, Column), "col should be Column"
AssertionError: col should be Column
Any help to solve this issue will be of great help.Thanks.
Lets start with creating the data:
rdd = sc.parallelize([[1,"x",700],[2,"y",850],[3,"z",560],[4,"a",578],[5,"b",456],[6,"c",678]])
df = rdd.toDF(["SNo","Name","CScore"])
>>> df.show()
+---+----+------+
|SNo|Name|CScore|
+---+----+------+
| 1| x| 700|
| 2| y| 850|
| 3| z| 560|
| 4| a| 578|
| 5| b| 456|
| 6| c| 678|
+---+----+------+
if your final goal is to provide a binning_dictionary like your example do, to find the corresponding categorical value.
udf is the solution.
the following is your normal function.
bin_lookup = {(1,400):'Low',(401,900):'High'}
def binning(value, lookup_dict=bin_lookup):
for key in lookup_dict.keys():
if key[0] <= value <= key[1]:
return lookup_dict[key]
to register as a udf and run it via dataframe:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
binning_udf = udf(binning, StringType())
>>> df.withColumn("binnedColumn", binning_udf("CScore")).show()
+---+----+------+------------+
|SNo|Name|CScore|binnedColumn|
+---+----+------+------------+
| 1| x| 700| High|
| 2| y| 850| High|
| 3| z| 560| High|
| 4| a| 578| High|
| 5| b| 456| High|
| 6| c| 678| High|
+---+----+------+------------+
directly apply to rdd:
>>> rdd.map(lambda row: binning(row[-1], bin_lookup)).collect()
['High', 'High', 'High', 'High', 'High', 'High']
I am working on a PySpark DataFrame with n columns. I have a set of m columns (m < n) and my task is choose the column with max values in it.
For example:
Input: PySpark DataFrame containing :
col_1 = [1,2,3], col_2 = [2,1,4], col_3 = [3,2,5]
Ouput :
col_4 = max(col1, col_2, col_3) = [3,2,5]
There is something similar in pandas as explained in this question.
Is there any way of doing this in PySpark or should I change convert my PySpark df to Pandas df and then perform the operations?
You can reduce using SQL expressions over a list of columns:
from pyspark.sql.functions import max as max_, col, when
from functools import reduce
def row_max(*cols):
return reduce(
lambda x, y: when(x > y, x).otherwise(y),
[col(c) if isinstance(c, str) else c for c in cols]
)
df = (sc.parallelize([(1, 2, 3), (2, 1, 2), (3, 4, 5)])
.toDF(["a", "b", "c"]))
df.select(row_max("a", "b", "c").alias("max")))
Spark 1.5+ also provides least, greatest
from pyspark.sql.functions import greatest
df.select(greatest("a", "b", "c"))
If you want to keep name of the max you can use `structs:
from pyspark.sql.functions import struct, lit
def row_max_with_name(*cols):
cols_ = [struct(col(c).alias("value"), lit(c).alias("col")) for c in cols]
return greatest(*cols_).alias("greatest({0})".format(",".join(cols)))
maxs = df.select(row_max_with_name("a", "b", "c").alias("maxs"))
And finally you can use above to find select "top" column:
from pyspark.sql.functions import max
((_, c), ) = (maxs
.groupBy(col("maxs")["col"].alias("col"))
.count()
.agg(max(struct(col("count"), col("col"))))
.first())
df.select(c)
We can use greatest
Creating DataFrame
df = spark.createDataFrame(
[[1,2,3], [2,1,2], [3,4,5]],
['col_1','col_2','col_3']
)
df.show()
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| 1| 2| 3|
| 2| 1| 2|
| 3| 4| 5|
+-----+-----+-----+
Solution
from pyspark.sql.functions import greatest
df2 = df.withColumn('max_by_rows', greatest('col_1', 'col_2', 'col_3'))
#Only if you need col
#from pyspark.sql.functions import col
#df2 = df.withColumn('max', greatest(col('col_1'), col('col_2'), col('col_3')))
df2.show()
+-----+-----+-----+-----------+
|col_1|col_2|col_3|max_by_rows|
+-----+-----+-----+-----------+
| 1| 2| 3| 3|
| 2| 1| 2| 2|
| 3| 4| 5| 5|
+-----+-----+-----+-----------+
You can also use the pyspark built-in least:
from pyspark.sql.functions import least, col
df = df.withColumn('min', least(col('c1'), col('c2'), col('c3')))
Another simple way of doing it. Let us say that the below df is your dataframe
df = sc.parallelize([(10, 10, 1 ), (200, 2, 20), (3, 30, 300), (400, 40, 4)]).toDF(["c1", "c2", "c3"])
df.show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 10| 10| 1|
|200| 2| 20|
| 3| 30|300|
|400| 40| 4|
+---+---+---+
You can process the above df as below to get the desited results
from pyspark.sql.functions import lit, min
df.select( lit('c1').alias('cn1'), min(df.c1).alias('c1'),
lit('c2').alias('cn2'), min(df.c2).alias('c2'),
lit('c3').alias('cn3'), min(df.c3).alias('c3')
)\
.rdd.flatMap(lambda r: [ (r.cn1, r.c1), (r.cn2, r.c2), (r.cn3, r.c3)])\
.toDF(['Columnn', 'Min']).show()
+-------+---+
|Columnn|Min|
+-------+---+
| c1| 3|
| c2| 2|
| c3| 1|
+-------+---+
Scala solution:
df = sc.parallelize(Seq((10, 10, 1 ), (200, 2, 20), (3, 30, 300), (400, 40, 4))).toDF("c1", "c2", "c3"))
df.rdd.map(row=>List[String](row(0).toString,row(1).toString,row(2).toString)).map(x=>(x(0),x(1),x(2),x.min)).toDF("c1","c2","c3","min").show
+---+---+---+---+
| c1| c2| c3|min|
+---+---+---+---+
| 10| 10| 1| 1|
|200| 2| 20| 2|
| 3| 30|300| 3|
|400| 40| 4| 4|
+---+---+---+---+
I have a data in a file in the following format:
1,32
1,33
1,44
2,21
2,56
1,23
The code I am executing is following:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import spark.implicits._
import sqlContext.implicits._
case class Person(a: Int, b: Int)
val ppl = sc.textFile("newfile.txt").map(_.split(","))
.map(p=> Person(p(0).trim.toInt, p(1).trim.toInt))
.toDF()
ppl.registerTempTable("people")
val result = ppl.select("a","b").groupBy('a).agg()
result.show
Expected Output is:
a 32, 33, 44, 23
b 21, 56
Instead of aggregation by sum, count, mean etc. I want every element in the row.
Try collect_set function inside agg()
val df = sc.parallelize(Seq(
(1,3), (1,6), (1,5), (2,1),(2,4)
(2,1))).toDF("a","b")
+---+---+
| a| b|
+---+---+
| 1| 3|
| 1| 6|
| 1| 5|
| 2| 1|
| 2| 4|
| 2| 1|
+---+---+
val df2 = df.groupBy("a").agg(collect_set("b")).show()
+---+--------------+
| a|collect_set(b)|
+---+--------------+
| 1| [3, 6, 5]|
| 2| [1, 4]|
+---+--------------+
And if you want duplicate entries , can use collect_list
val df3 = df.groupBy("a").agg(collect_list("b")).show()
+---+---------------+
| a|collect_list(b)|
+---+---------------+
| 1| [3, 6, 5]|
| 2| [1, 4, 1]|
+---+---------------+