Spark error using a variable in map operation [duplicate] - apache-spark

This question already has answers here:
Resolving dependency problems in Apache Spark
(7 answers)
Closed 4 years ago.
I am trying to iterate over a DataFrame and apply a map operation over its rows.
import spark.implicits._
import org.apache.spark.sql.Row
case class SomeData(name:String, value: Int)
val input = Seq(SomeData("a",2), SomeData("b", 3)).toDF
val SOME_STRING = "some_string"
input.map(row =>
SOME_STRING
).show
The above code fails with the following exception:
ERROR TaskSetManager: Task 0 in stage 4.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 14, ip-xxxx, executor 4): java.lang.NoClassDefFoundError: L$iw;
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2583)
at java.lang.Class.getDeclaredField(Class.java:2068)
However, if the variable is replaced with a string literal, the code works.
input.map(row =>
"some_string"
).show
+-----------+
| value|
+-----------+
|some_string|
|some_string|
+-----------+
Is there anything wrong with the above code? Is it possible to use variables and function calls inside a map operation?

That's expected: your variable is defined on the driver and then used inside a worker, so the worker doesn't know about it.
What you can do is define it inside the closure:
input.map { row =>
  val SOME_STRING = "some_string"
  SOME_STRING
}.show
You can also look at broadcast variables: https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables
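For completeness, a minimal sketch of the broadcast route, reusing the spark session and input DataFrame from the question (this is an illustration, not part of the original answer):
// Broadcast the value once; each executor reads it via .value inside the closure.
val someString = spark.sparkContext.broadcast("some_string")

input.map(row => someString.value).show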

Related

How to find a median value in pyspark

These are the values of my dataframe:
+-------+----------+
| ID| Date_Desc|
+-------+----------+
|8951354|2012-12-31|
|8951141|2012-12-31|
|8952745|2012-12-31|
|8952223|2012-12-31|
|8951608|2012-12-31|
|8950793|2012-12-31|
|8950760|2012-12-31|
|8951611|2012-12-31|
|8951802|2012-12-31|
|8950706|2012-12-31|
|8951585|2012-12-31|
|8951230|2012-12-31|
|8955530|2012-12-31|
|8950570|2012-12-31|
|8954231|2012-12-31|
|8950703|2012-12-31|
|8954418|2012-12-31|
|8951685|2012-12-31|
|8950586|2012-12-31|
|8951367|2012-12-31|
+-------+----------+
I tried to compute median values grouped by the date column in pyspark:
df1 = df1.groupby('Date_Desc').agg(
    f.expr('percentile(ID, array(0.25))')[0].alias('%25'),
    f.expr('percentile(ID, array(0.50))')[0].alias('%50'),
    f.expr('percentile(ID, array(0.75))')[0].alias('%75'))
But I get this error:
Py4JJavaError: An error occurred while calling o198.showString. :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 29.0 failed 1 times, most recent failure: Lost task
1.0 in stage 29.0 (TID 427, 5bddc801333f, executor driver): org.apache.spark.SparkUpgradeException: You may get a different result
due to the upgrading of Spark 3.0: Fail to parse '11/23/04 9:00' in
the new parser. You can set spark.sql.legacy.timeParserPolicy to
LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED
and treat it as an invalid datetime string.
With Spark ≥ 3.1.0:
from pyspark.sql.functions import percentile_approx
df1.groupBy("Date_Desc").agg(percentile_approx("ID", 0.5).alias("%50"))

How to read/write a hive table from within the spark executors

I have a requirement wherein I am using a DStream to retrieve messages from Kafka. After getting a message or RDD, I use a map operation to process the messages independently on the executors. The challenge I am facing is that I need to read/write to a Hive table from within the executors, and for this I need access to a SQLContext. But as far as I know, SparkSession is available on the driver side only and should not be used within the executors. Without the Spark session (in Spark 2.1.1) I can't get hold of a SQLContext. To summarize:
My driver code looks something like:
if (inputDStream_obj.isSuccess) {
  val inputDStream = inputDStream_obj.get
  inputDStream.foreachRDD(rdd => {
    if (!rdd.isEmpty) {
      val rdd1 = rdd.map(idocMessage => SegmentLoader.processMessage(props, idocMessage.value(), true))
    }
  })
}
So after this rdd.map, the next code is executed on the executors, and there I have something like:
val sqlContext = spark.sqlContext
import sqlContext.implicits._
spark.sql("USE " + databaseName)
val result = Try(df.write.insertInto(tableName))
Passing the SparkSession or SQLContext gives an error when they are used on the executors:
When I try to obtain the existing SparkSession: org.apache.spark.SparkException: A master URL must be set in your configuration
When I broadcast the session variable: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 9, <server>, executor 2): java.lang.NullPointerException
When I pass the SparkSession object: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 9, <server>, executor 2): java.lang.NullPointerException
Let me know if you can suggest how to query/update a hive table from within the executors.
Thanks,
Ritwick
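A minimal sketch of one common workaround, not taken from this thread: keep the per-message processing on the executors, but hand the resulting RDD back to driver-side code inside foreachRDD, where the SparkSession is safe to use. SegmentLoader.processMessage, props and tableName are the question's own names; the toDF()/Encoder assumption is mine.
import org.apache.spark.sql.SparkSession

inputDStream.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    // map() runs on the executors, exactly as in the question
    val processed = rdd.map(m => SegmentLoader.processMessage(props, m.value(), true))
    // the body of foreachRDD itself runs on the driver, so the session is available here
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._
    // assumes processMessage returns a case class (or another type with an Encoder)
    processed.toDF().write.insertInto(tableName)
  }
}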

DataFrame : Convert array inside a column to RDD[Array[String]]

Given a dataframe :
+---+----------+
|key| value|
+---+----------+
|foo| bar|
|bar| one, two|
+---+----------+
Then I'd like to use the value column as input to FPGrowth, which must look like RDD[Array[String]]:
val transactions: RDD[Array[String]] = df.select("value").rdd.map(x => x.getList(0).toArray.map(_.toString))
import org.apache.spark.mllib.fpm.{FPGrowth, FPGrowthModel}
val fpg = new FPGrowth().setMinSupport(0.01)
val model = fpg.run(transactions)
I get the exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 141.0 failed 1 times, most recent failure: Lost task 7.0 in stage 141.0 (TID 2232, localhost): java.lang.ClassCastException: java.lang.String cannot be cast to scala.collection.Seq
Any suggestion welcome!
Instead of
val transactions: RDD[Array[String]] = df.select("value").rdd.map(x => x.getList(0).toArray.map(_.toString))
try using
val transactions= df.select("value").rdd.map(_.toString.stripPrefix("[").stripSuffix("]").split(","))
It gives the desired output as expected, i.e. RDD[Array[String]]:
val transactions= df.select("value").rdd.map(_.toString.stripPrefix("[").stripSuffix("]").split(","))
transactions: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[10] at map at <console>:33
scala> transactions.take(2)
res21: Array[Array[String]] = Array(Array(bar), Array(one, two))
To remove the "[" and "]", one can use the stripPrefix and stripSuffix functions before the split function.
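If the value column is in fact a plain comma-separated string, an alternative sketch (my assumption, not from the original answer) is to split it into an array column with Spark SQL's split function, avoiding the parsing of Row.toString altogether:
import org.apache.spark.sql.functions.split
import org.apache.spark.rdd.RDD

// Split "one, two" into Array("one", "two"); the regex also trims the space after the comma.
val transactions: RDD[Array[String]] =
  df.select(split(df("value"), ",\\s*").as("items"))
    .rdd
    .map(_.getSeq[String](0).toArray)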

Handle Null Values when using CustomSchema in apache spark

I am importing data based on a customSchema which I have defined in the following way
import org.apache.spark.sql.types.{StructType, StructField,DoubleType,StringType };
val customSchema_train = StructType(Array(
StructField("x53",DoubleType,true),
StructField("x95",DoubleType,true),
StructField("x88",DoubleType,true),
StructField("x30",DoubleType,true),
StructField("x42",DoubleType,true),
StructField("x28",DoubleType,true)
))
val train_orig = sqlContext.read.format("com.databricks.spark.csv").option("header","true").schema(customSchema_train).option("nullValue","null").load("/....../train.csv").cache
Now I know there are null values in my data, present as the string "null", and I have tried to handle that accordingly. The import happens without any error, but when I try to describe the data I get this error:
train_df.describe().show
SparkException: Job aborted due to stage failure: Task 0 in stage 46.0 failed 1 times, most recent failure: Lost task 0.0 in stage 46.0 (TID 56, localhost): java.text.ParseException: Unparseable number: "null"
How can I handle this error?
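A sketch of one common workaround (my assumption, not from this thread): load the columns as plain strings so the text "null" never reaches the numeric parser, then cast them to DoubleType; values that fail the cast come back as real nulls.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

val raw = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/....../train.csv")

// cast every column; "null" (or anything non-numeric) becomes a real null
val train_df = raw.columns.foldLeft(raw) { (df, c) =>
  df.withColumn(c, col(c).cast(DoubleType))
}

train_df.describe().show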

ClassNotFoundException: org.apache.zeppelin.spark.ZeppelinContext when using Zeppelin input value inside spark DataFrame's filter method

I've been having trouble for two days already and can't find any solution.
I'm getting
ClassNotFoundException: org.apache.zeppelin.spark.ZeppelinContext
when using the input value inside Spark DataFrame's filter method.
val city = z.select("City",cities).toString
oDF.select("city").filter(r => city.equals(r.getAs[String]("city"))).count()
I even tried copying the input value to another val with
new String(bytes[])
but still get the same error.
The same code works seamlessly if, instead of getting the value from z.select,
I declare it as a String literal:
city: String = "NY"
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 49.0 failed 4 times, most recent failure: Lost task 0.3 in stage
49.0 (TID 277, 10.6.60.217): java.lang.NoClassDefFoundError:
Lorg/apache/zeppelin/spark/ZeppelinContext;
You are taking this in the wrong direction:
val city = "NY"
gives you a Scala String with NY as the value, but when you say
z.select("City", cities)
this returns you a DataFrame, and you are then converting that object to a String with toString and trying to compare.
This won't work!
What you can do is either collect one DataFrame and then pass the Scala String accordingly into the other DataFrame, or do a join if you want to do it for multiple values.
But the toString approach will definitely not work!
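As a minimal sketch of the collect-then-filter suggestion (citiesDF is a hypothetical name for the DataFrame behind the dropdown, not a name from the thread):
// Collect the single value on the driver, then compare with a Column expression
// instead of a Scala closure, so nothing from the notebook context gets serialized.
val city: String = citiesDF.select("City").first().getString(0)
oDF.filter(oDF("city") === city).count()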
