I get a StackOverflowException when I call withColumn() multiple times to update the values of a column in PySpark.
The code that produced the StackOverflowException was:
df = df.withColumn("element", when(df["element"] == 1,"first").otherwise(df["element"]))
df = df.withColumn("element", when(df["element"] == 2,"second").otherwise(df["element"]))
df = df.withColumn("element", when(df["element"] == 3,"third").otherwise(df["element"]))
df = df.withColumn("element", when(df["element"] == 4,"fourth").otherwise(df["element"]))
The Spark documentation suggests using the select() function instead, so I tried:
df = df.select("*", (when(df["element"] == 1,"first")).alias("element"))
df = df.select("*", (when(df["element"] == 2,"second")).alias("element"))
df = df.select("*", (when(df["element"] == 3,"third")).alias("element"))
df = df.select("*", (when(df["element"] == 4,"fourth")).alias("element"))
But I receive an error because the column "element" isn't updated; instead, another column with the same name is created. The error is this:
Py4JJavaError: An error occurred while calling o3723.apply.
: org.apache.spark.sql.AnalysisException: Reference 'element' is ambiguous, could be: element, element.;
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
at org.apache.spark.sql.Dataset.resolve(Dataset.scala:229)
at org.apache.spark.sql.Dataset.col(Dataset.scala:1282)
at org.apache.spark.sql.Dataset.apply(Dataset.scala:1249)
at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
How could I do it?
Thank you in advance!
I think you can chain .when multiple times and finish with a single .otherwise. Also, you should name the new column something different so that you don't get an ambiguous column error:
df = df.withColumn("element_new", when(df["element"] == 1,"first").when(df["element"] == 2,"second").when(df["element"] == 3,"third").when(df["element"] == 4,"fourth").otherwise(df["element"]))
Using .select:
df = df.select("*",when(df["element"] == 1,"first").when(df["element"] == 2,"second").when(df["element"] == 3,"third").when(df["element"] == 4,"fourth").otherwise(df["element"]).alias("element_new"))
Example output:
+-------+-----------+
|element|element_new|
+-------+-----------+
| 1| first|
| 2| second|
| 3| third|
| 4| fourth|
| 5| 5|
+-------+-----------+
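If the lookup grows, a variant of the same idea (just a sketch, assuming the same df and element column as above; the mapping dict is made up) is to build the chained when from a plain dict, which also lets you overwrite the original column in a single withColumn call instead of creating element_new:
import pyspark.sql.functions as F

# hypothetical lookup table: value in "element" -> replacement label
mapping = {1: "first", 2: "second", 3: "third", 4: "fourth"}

# fold the dict into one chained when(...).when(...)...otherwise(...) expression
expr = None
for value, label in mapping.items():
    cond = F.col("element") == value
    expr = F.when(cond, label) if expr is None else expr.when(cond, label)

df = df.withColumn("element", expr.otherwise(F.col("element")))
Because the whole mapping ends up in one expression, the plan stays a single projection, which is what avoids the stack overflow caused by many stacked withColumn calls.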
I have set up a Spark cluster, version 3.1.2, and I am using the Python API for Spark. I have some JSON data loaded into a dataframe, and I have to parse a nested column (ADSZ_2) that looks like the following format:
ADSZ_2: [{key,value}, {key,value}, {key,value}]
I have developed the following code for this purpose:
...
...
def parseCell(array_data):
    final_list = []
    if array_data is not None:
        for record in array_data:
            record_dict = record.asDict()
            if "string1" in record_dict:
                string1 = remover(record_dict["string1"])
                record_dict["string1"] = string1
            if "string2" in record_dict:
                string2 = remover(record_dict["string2"])
                record_dict["string2"] = string2
            final_list.append(Row(**record_dict))
    return final_list

df = spark.read.load(data_path, multiline="false", format="json")
udf_fun = udf(lambda row: parseCell(row), ArrayType(StructType()))
df.withColumn("new-name", udf_fun(col("ADSZ_2"))).show()
...
When I run the above code, I get the following exception:
21/10/07 09:09:07 ERROR Executor: Exception in task 0.0 in stage 116.0 (TID 132)
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:773)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:213)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:123)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:136)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec.$anonfun$evaluate$6(BatchEvalPythonExec.scala:94)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
21/10/07 09:09:07 WARN TaskSetManager: Lost task 0.0 in stage 116.0 (TID 132) (hadoop-master.local executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:773)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:213)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:123)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:136)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec.$anonfun$evaluate$6(BatchEvalPythonExec.scala:94)
I have tried the various options given at 1, but none of these solutions works. Where is the problem?
Is there any better way to do this job?
I will propose an alternative solution where you transform your rows with the rdd of the dataframe. Here is a self-contained example that I have tried to adapt to your data:
import pyspark.sql.functions as F
from pyspark.sql import Row
import pyspark.sql.types as T

df = spark.createDataFrame([
    Row(ADSZ_2=[{"string1": "a", "string2": "b"}, {"string1": "c", "string2": "d"}]),
    Row(ADSZ_2=[{"string1": "e", "string2": "f"}, {"string1": "g", "not_taken": "1", "string2": "h"}]),
    Row(ADSZ_2=[{"string1": "i", "string2": "j"}, {"string1": "k", "not_taken": "1", "string2": "l"}]),
    Row(ADSZ_2=None),
    Row(ADSZ_2=[None, {"string1": "m", "not_taken": "1", "string2": "n"}])
])
df.show(20, False)

def parseCell(row):
    final_list = []
    l = row["ADSZ_2"]
    if l:
        for record_dict in l:
            if record_dict:
                new_dict = {key: val for key, val in record_dict.items() if key in ["string1", "string2"]}
                if new_dict:
                    final_list.append(Row(**new_dict))
    return final_list

df_rdd = df.rdd.flatMap(lambda row: parseCell(row))
new_df = spark.createDataFrame(df_rdd)
new_df.show()
output:
+----------------------------------------------------------------------------+
|ADSZ_2 |
+----------------------------------------------------------------------------+
|[{string1 -> a, string2 -> b}, {string1 -> c, string2 -> d}] |
|[{string1 -> e, string2 -> f}, {not_taken -> 1, string1 -> g, string2 -> h}]|
|[{string1 -> i, string2 -> j}, {not_taken -> 1, string1 -> k, string2 -> l}]|
|null |
|[null, {not_taken -> 1, string1 -> m, string2 -> n}] |
+----------------------------------------------------------------------------+
+-------+-------+
|string1|string2|
+-------+-------+
| a| b|
| c| d|
| e| f|
| g| h|
| i| j|
| k| l|
| m| n|
+-------+-------+
You need to make sure that all the rows you generate in parseCell contain the correct number of columns.
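One caveat with createDataFrame on a bare RDD: the schema is inferred from the first rows, which can fail or give the wrong nullability when the RDD is empty or the rows are inconsistent. Here is a small sketch (field names assumed from the question) that passes an explicit schema instead; note that the fields are matched positionally, so parseCell should emit them in a consistent order:
import pyspark.sql.types as T

# explicit schema for the rows produced by parseCell (assumed field names)
row_schema = T.StructType([
    T.StructField("string1", T.StringType(), True),
    T.StructField("string2", T.StringType(), True),
])

new_df = spark.createDataFrame(df_rdd, schema=row_schema)
new_df.show()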
I am trying to apply the PySpark SQL hash function to every row in two dataframes to identify the differences. The hash is case sensitive, i.e. if a column contains 'APPLE' and 'Apple' they are considered two different values, so I want to change the case of both dataframes to either upper or lower. I am able to achieve this only for the dataframe headers, but not for the dataframe values. Please help.
#Code for Dataframe column headers
self.df_db1 = self.df_db1.toDF(*[c.lower() for c in self.df_db1.columns])
Assuming df is your dataframe, this should do the job:
from pyspark.sql import functions as F
for col in df.columns:
df = df.withColumn(col, F.lower(F.col(col)))
Both answers seem to be OK, with one exception: if you have a numeric column, it will be converted to a string column. To avoid this, try:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val fields = df.schema.fields
val stringFields = df.schema.fields.filter(f => f.dataType == StringType)
val nonStringFields = df.schema.fields.filter(f => f.dataType != StringType).map(f => f.name).map(f => col(f))
val stringFieldsTransformed = stringFields.map(f => f.name).map(f => upper(col(f)).as(f))
val df = sourceDF.select(stringFieldsTransformed ++ nonStringFields: _*)
Now the types are also correct when you have non-string (e.g. numeric) fields.
If you know that each column is of string type, use one of the other answers - they are correct in that case :)
Python code in PySpark:
from pyspark.sql.functions import *
from pyspark.sql.types import *
sourceDF = spark.createDataFrame([(1, "a")], ['n', 'n1'])
fields = sourceDF.schema.fields
stringFields = filter(lambda f: isinstance(f.dataType, StringType), fields)
nonStringFields = map(lambda f: col(f.name), filter(lambda f: not isinstance(f.dataType, StringType), fields))
stringFieldsTransformed = map(lambda f: upper(col(f.name)).alias(f.name), stringFields)
allFields = [*stringFieldsTransformed, *nonStringFields]
df = sourceDF.select(allFields)
You can generate an expression using list comprehension:
from pyspark.sql import functions as psf
select_expression = [psf.lower(psf.col(x)).alias(x) for x in df.columns]
And then just call it over your existing dataframe
>>> df.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
>>> df.select(*select_expression).show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
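If you want both properties at once - lower-casing only the string columns and leaving every other type untouched - the two ideas combine into a single comprehension (a sketch, not tested against your exact dataframe):
from pyspark.sql import functions as psf
from pyspark.sql.types import StringType

# lower-case string columns, pass all other columns through unchanged
select_expression = [
    psf.lower(psf.col(f.name)).alias(f.name) if isinstance(f.dataType, StringType)
    else psf.col(f.name)
    for f in df.schema.fields
]

df = df.select(*select_expression)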
Is there any elegant way to explode a map column in PySpark 2.2 without losing null values? explode_outer was introduced in PySpark 2.3.
The schema of the affected column is:
|-- foo: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- first: long (nullable = true)
| | |-- last: long (nullable = true)
I would like to replace the empty map with some dummy values so that I can explode the whole dataframe without losing null values. I have tried something like this, but I get an error:
from pyspark.sql.functions import when, size, col
df = spark.read.parquet("path").select(
when(size(col("foo")) == 0, {"key": [0, 0]}).alias("bar")
)
And the error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.when.
: java.lang.RuntimeException: Unsupported literal type class java.util.HashMap {key=[0, 0]}
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:163)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:163)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:162)
at org.apache.spark.sql.functions$.typedLit(functions.scala:112)
at org.apache.spark.sql.functions$.lit(functions.scala:95)
at org.apache.spark.sql.functions$.when(functions.scala:1256)
at org.apache.spark.sql.functions.when(functions.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
So I have finally made it work. I replaced the empty map with some dummy values, then used explode and dropped the original column.
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import MapType, StringType, StructType, StructField, LongType

replace_empty_map = udf(
    lambda x: {"key": [0, 1]} if len(x) == 0 else x,
    MapType(StringType(),
            StructType([StructField("first", LongType()), StructField("last", LongType())]))
)

df = df.withColumn("foo_replaced", replace_empty_map(df["foo"])).drop("foo")
df = df.select('*', explode('foo_replaced').alias('foo_key', 'foo_val')).drop("foo_replaced")
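For what it's worth, the same replacement can be done without a Python UDF: the dummy map that the question tried to write as a dict literal can be built with create_map and struct, both of which exist in Spark 2.2. This is only a sketch assuming the schema shown above, and the isNull check also covers rows where foo is null rather than merely empty:
from pyspark.sql.functions import col, create_map, explode, lit, size, struct, when

# literal map<string, struct<first: long, last: long>> used as the dummy value
dummy = create_map(
    lit("key"),
    struct(lit(0).cast("long").alias("first"), lit(0).cast("long").alias("last"))
)

df = df.withColumn(
    "foo_replaced",
    when(col("foo").isNull() | (size(col("foo")) == 0), dummy).otherwise(col("foo"))
).drop("foo")

df = df.select("*", explode("foo_replaced").alias("foo_key", "foo_val")).drop("foo_replaced")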
I am using Python 3 on Spark (2.2.0). I want to apply my UDF to a specified list of strings.
df = ['Apps A','Chrome', 'BBM', 'Apps B', 'Skype']
def calc_app(app, app_list):
    browser_list = ['Chrome', 'Firefox', 'Opera']
    chat_list = ['WhatsApp', 'BBM', 'Skype']
    sum = 0
    for data in app:
        name = data['name']
        if name in app_list:
            sum += 1
    return sum

calc_appUDF = udf(calc_app)
df = df.withColumn('app_browser', calc_appUDF(df['apps'], browser_list))
df = df.withColumn('app_chat', calc_appUDF(df['apps'], chat_list))
But it fails and returns: 'Unsupported literal type class java.util.ArrayList'
If I understood your requirement correctly, then you should try this:
from pyspark.sql.functions import udf, col

#sample data
df_list = ['Apps A', 'Chrome', 'BBM', 'Apps B', 'Skype']
df = sqlContext.createDataFrame([(l,) for l in df_list], ['apps'])
df.show()

#some lists definition
browser_list = ['Chrome', 'Firefox', 'Opera']
chat_list = ['WhatsApp', 'BBM', 'Skype']

#udf definition
def calc_app(app, app_list):
    if app in app_list:
        return 1
    else:
        return 0

def calc_appUDF(app_list):
    return udf(lambda l: calc_app(l, app_list))

#add new columns
df = df.withColumn('app_browser', calc_appUDF(browser_list)(col('apps')))
df = df.withColumn('app_chat', calc_appUDF(chat_list)(col('apps')))
df.show()
Sample input:
+------+
| apps|
+------+
|Apps A|
|Chrome|
| BBM|
|Apps B|
| Skype|
+------+
Output is:
+------+-----------+--------+
| apps|app_browser|app_chat|
+------+-----------+--------+
|Apps A| 0| 0|
|Chrome| 1| 0|
| BBM| 0| 1|
|Apps B| 0| 0|
| Skype| 0| 1|
+------+-----------+--------+
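An alternative sketch that keeps the original two-argument UDF: wrap each Python list in array(lit(...)) so it becomes a column Spark can serialize, which is exactly what the 'Unsupported literal type class java.util.ArrayList' error was complaining about. The column and list names reuse the ones above:
from pyspark.sql.functions import udf, col, array, lit
from pyspark.sql.types import IntegerType

def calc_app(app, app_list):
    # app is a single string; app_list arrives as a Python list of strings
    return 1 if app in app_list else 0

calc_appUDF = udf(calc_app, IntegerType())

# turn the plain Python lists into array columns of literals
browser_col = array(*[lit(x) for x in browser_list])
chat_col = array(*[lit(x) for x in chat_list])

df = df.withColumn('app_browser', calc_appUDF(col('apps'), browser_col))
df = df.withColumn('app_chat', calc_appUDF(col('apps'), chat_col))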
I'm applying an aggregate function (max) to a column, which I'm then referencing in a join.
The column becomes max(column_name) in the data frame. So, to make it easier to reference using Python's dot notation, I aliased the column, but I'm still getting an error:
tmp = hiveContext.sql("SELECT * FROM s3_data.nate_glossary WHERE profile_id_guid='ffaff64b-e87c-4a43-b593-b0e4bccc2731'")

max_processed = tmp.groupby('profile_id_guid', 'profile_version_id', 'record_type', 'scope', 'item_id', 'attribute_key') \
    .agg(max("processed_date").alias("max_processed_date"))

df = max_processed.join(tmp, [max_processed.profile_id_guid == tmp.profile_id_guid,
                              max_processed.profile_version_id == tmp.profile_version_id,
                              max_processed.record_type == tmp.record_type,
                              max_processed.scope == tmp.scope,
                              max_processed.item_id == tmp.item_id,
                              max_processed.attribute_key == tmp.attribute_key,
                              max_processed.max_processed_date == tmp.processed_date])
The error:
File "", line 7, in File
"/usr/hdp/2.5.0.0-1245/spark/python/pyspark/sql/dataframe.py", line
650, in join
jdf = self._jdf.join(other._jdf, on._jc, "inner") File "/usr/hdp/2.5.0.0-1245/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 813, in call File
"/usr/hdp/2.5.0.0-1245/spark/python/pyspark/sql/utils.py", line 51, in
deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: u'resolved attribute(s)
processed_date#10 missing from
record_type#41,scope#4,item_id#5,profile_id_guid#1,data_type#44,attribute_value#47,logical_id#45,profile_version_id#40,profile_version_id#2,attribute_key#8,max_processed_date#37,attribute_key#46,processed_date#48,scope#42,record_type#3,item_id#43,profile_id_guid#39,ems_system_id#38
in operator !Join Inner, Some((((((((profile_id_guid#1 =
profile_id_guid#1) && (profile_version_id#2 = profile_version_id#2))
&& (record_type#3 = record_type#3)) && (scope#4 = scope#4)) &&
(item_id#5 = item_id#5)) && (attribute_key#8 = attribute_key#8)) &&
(max_processed_date#37 = processed_date#10)));'
Note the error message: "processed_date#10 missing". I see processed_date#48 and processed_date#10 in the list of attributes.
See:
# DataFrame transformation
tmp -> max_processed -> df
The above three DataFrames share the same lineage, so if you want to use the same column more than once, you need to use an alias.
For example:
import pyspark.sql.functions as f

tmp = spark.createDataFrame([(1, 3, 1), (1, 3, 0), (2, 3, 1)], ['key1', 'key2', 'val'])
max_processed = tmp.groupBy(['key1', 'key2']).agg(f.max(tmp['val']).alias('max_val')) \
    .withColumnRenamed('key1', 'max_key1').withColumnRenamed('key2', 'max_key2')
df = max_processed.join(tmp, on=[max_processed['max_key1'] == tmp['key1'],
                                 max_processed['max_key2'] == tmp['key2'],
                                 max_processed['max_val'] == tmp['val']])
df.show()
+--------+--------+-------+----+----+---+
|max_key1|max_key2|max_val|key1|key2|val|
+--------+--------+-------+----+----+---+
| 1| 3| 1| 1| 3| 1|
| 2| 3| 1| 2| 3| 1|
+--------+--------+-------+----+----+---+
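The same fix can also be written with DataFrame.alias, which is the more direct way to "use alias" for a join over a shared lineage; here is a sketch on the same toy tmp frame (column names as above):
import pyspark.sql.functions as f

t = tmp.alias('t')
m = tmp.groupBy(['key1', 'key2']).agg(f.max('val').alias('max_val')).alias('m')

# qualified references resolve unambiguously thanks to the aliases
df = m.join(t, on=[f.col('m.key1') == f.col('t.key1'),
                   f.col('m.key2') == f.col('t.key2'),
                   f.col('m.max_val') == f.col('t.val')])
df.show()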
I still think this is a defect in Spark's lineage handling, to be honest.