Why can't I read these dataframes - apache-spark

I'm having trouble reading several dataframes. I have this function:
def readDF(hdfsPath:String, more arguments): DataFrame = {//function goes here}
It takes an HDFS path for a partition and returns a dataframe (it basically wraps spark.read.parquet, but I am required to use it). I'm trying to read several partitions by using show partitions in the following fashion:
val dfs = spark.sql("show partitions table")
.where(col("partition").contains(someFilterCriteria))
.map(partition => {
val hdfsPath = s"hdfs/path/to/table/$partition"
readDF(hdfsPath)
}).reduce(_.union(_))
but it gives me this error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 3.0 failed 4 times, most recent failure: Lost task 12.3 in stage 3.0 (TID 44, csmlcsworki0021.unix.aacc.corp, executor 1): java.lang.NullPointerException
I think it's because I'm calling spark.read.parquet inside a map operation on a dataframe, because if I change my code to the following
val dfs = spark.sql("show partitions table")
.where(col("partition").contains(someFilterCriteria))
.map(row=> row.getString(0))
.collect
.toSeq
.map(partition => {
val hdfsPath = s"hdfs/path/to/table/$partition"
readDF(hdfsPath)
}).reduce(_.union(_))
it loads the data correctly. However, I don't want to use collect if possible. How can I achieve this?

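readDF creates a DataFrame from Parquet files in HDFS, so it must be executed on the driver. In the first version you call it inside a map over the rows of the original DataFrame, which means you are trying to create a DataFrame on the executors, and that is not possible: the SparkSession is only usable on the driver, which is why the tasks fail with a NullPointerException. Note that the collect in the second version only brings the partition names to the driver, not the table data.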
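A minimal sketch of the driver-side approach described above, assuming readDF simply wraps spark.read.parquet. The object name DriverSideRead, the helper names, and the explicit spark parameter are hypothetical and not part of the original code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DriverSideRead {
  // Hypothetical stand-in for the question's readDF (assumed to wrap spark.read.parquet).
  def readDF(spark: SparkSession, hdfsPath: String): DataFrame =
    spark.read.parquet(hdfsPath)

  // Runs on the driver: lists the partition names of `table` that match `criteria`.
  def matchingPartitions(spark: SparkSession, table: String, criteria: String): Seq[String] =
    spark.sql(s"SHOW PARTITIONS $table")
      .where(col("partition").contains(criteria))
      .collect()               // only partition names travel to the driver
      .map(_.getString(0))
      .toSeq

  // Also on the driver: this is a plain Scala Seq.map, so using the SparkSession is safe.
  def readAll(spark: SparkSession, basePath: String, partitions: Seq[String]): DataFrame = {
    require(partitions.nonEmpty, "no matching partitions")
    partitions.map(p => readDF(spark, s"$basePath/$p")).reduce(_ union _)
  }
}
```

The collect here moves only a list of strings, so it should stay cheap even for large tables. If readDF were not mandatory, spark.read.parquet also accepts several paths in a single call.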

Related

Pyspark RDD collect method shows nothing [duplicate]

I am using Spark 1.4.0 on my local system. Whenever I create an RDD and call collect on it through the Spark Scala shell, it works fine. But when I create a standalone application and call 'collect' on the RDD, I don't see the result, even though the Spark log messages during the run say that a certain number of bytes have been sent to the driver:
INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1991 bytes result sent to driver
INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1948 bytes result sent to driver
This is the code:-
import org.apache.spark.{SparkConf, SparkContext}
object Test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val rdd1 = sc.textFile("input.txt")
    val rdd2 = rdd1.map(_.split(",")).map(x => (x(0), x(1)))
    rdd2.collect
  }
}
If I change the last statement to the following, it does display the result:-
rdd2.collect.foreach(println)
So the question is: why does calling 'collect' on its own not print anything?
collect by itself in a console app does not display anything, because all it does is return the data to the driver. You have to do something to display it, as you are doing with the foreach(println), or do something else with it, like saving it to disk.
Now, if you were to run that code in the spark-shell (minus SparkContext creation), then you would indeed see output* as the shell always calls the toString of the objects that are returned.
*Noting that toString is not the same as foreach(println) since the shell would truncate at some point

Will dataframe count trigger spark.driver.maxResultSize limitation?

I have a Spark (2.4) job that failed with the exception "org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 5252 tasks is bigger than spark.driver.maxResultSize"
Here's my code snippet; it involves joining two dataframes:
val df_a = ... //load from HDFS
val df_b = ... //load from HDFS
val a_deduped = df_a.dropDuplicates("id")
val a_duplicates = df_a.exceptAll(a_deduped)
val duplicates = a_deduped.join(df_b, col("id")===col("history_id"), "left_outer").where(col("history_id").isNotNull)
val df_c = a_deduped.union(duplicates)
df_c.count
The line that triggers this failure is df_c.count.
How does dataframe count work? My understanding is that it sums the number of rows in every partition and returns a single number to the driver, so the data transferred to the driver should be minimal. Why, then, is the spark.driver.maxResultSize limit hit? Any ideas?

Handling corrupt JSON rows in Spark 2.11 - different behaviour than 1.6

We have Snappy-compressed files that we read with the SQL context, e.g.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("s3://bucket/problemfile.snappy")
In spark 1.6 we would handle corrupt records by something like the below:
val invalidJSON = rawEvents.select("*").where("_corrupt_record is not null")
val validJSON = rawEvents.select("*").where("_corrupt_record is null")
In Spark 2.11, we are not even able to read the corrupted record e.g
scala> df.select("*").where("_corrupt_record is null").count()
18/03/31 00:45:06 ERROR TaskSetManager: Task 0 in stage 1.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, ip-172-31-48-73.ec2.internal, executor 2):
java.io.CharConversionException: Unsupported UCS-4 endianness (3412) detected
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.reportWeirdUCS4(ByteSourceJsonBootstrapper.java:469)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.checkUTF32(ByteSourceJsonBootstrapper.java:434)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.detectEncoding(ByteSourceJsonBootstrapper.java:141)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.constructParser(ByteSourceJsonBootstrapper.java:215)
at com.fasterxml.jackson.core.JsonFactory._createParser(JsonFactory.java:1287)
I know we can set spark.sql.files.ignoreCorruptFiles=true in 2.X, but then we'd potentially lose records depending on where the corrupted record was.
Is there any other way we can skip over the corrupted record?
Thanks
You could do something like this:
val spark = SparkSession.builder().getOrCreate()
val df = spark.read
.option("mode", "DROPMALFORMED")
.json("s3://bucket/problemfile.snappy")
This way Spark drops the malformed JSON records for you, but you will no longer be able to inspect the corrupt records.
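If the corrupt rows need to stay visible, one option is the default PERMISSIVE mode, which keeps the raw text of each malformed row in a dedicated column. The following is only a sketch under assumptions: the object name CorruptJson and the data field id are hypothetical, and it does not establish whether this avoids the CharConversionException shown in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object CorruptJson {
  // Returns (valid rows, corrupt rows) for the JSON file(s) at `path`.
  def split(spark: SparkSession, path: String): (DataFrame, DataFrame) = {
    val schema = StructType(Seq(
      StructField("id", StringType),               // hypothetical data field
      StructField("_corrupt_record", StringType))) // receives the raw text of bad rows
    val df = spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json(path)
      .cache() // Spark 2.3+ rejects queries on raw JSON that reference only the corrupt column
    val isCorrupt = col("_corrupt_record").isNotNull
    (df.where(!isCorrupt), df.where(isCorrupt))
  }
}
```

The explicit schema matters here: with schema inference, the _corrupt_record column only appears when at least one malformed row is found.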

How to read/write a hive table from within the spark executors

I have a requirement where I use a DStream to retrieve messages from Kafka. After getting a batch of messages as an RDD, I use a map operation to process the messages independently on the executors. The one challenge I am facing is that I need to read from and write to a Hive table from within the executors, and for this I need access to a SQLContext. But as far as I know, the SparkSession is available on the driver side only and should not be used within the executors. Without the SparkSession (in Spark 2.1.1) I can't get hold of a SQLContext. To summarize:
My driver code looks something like:
if (inputDStream_obj.isSuccess) {
  val inputDStream = inputDStream_obj.get
  inputDStream.foreachRDD(rdd => {
    if (!rdd.isEmpty) {
      val rdd1 = rdd.map(idocMessage => SegmentLoader.processMessage(props, idocMessage.value(), true))
    }
  })
}
So after this rdd.map the next code is executed on the executors and there I have something like:
val sqlContext = spark.sqlContext
import sqlContext.implicits._
spark.sql("USE " + databaseName)
val result = Try(df.write.insertInto(tableName))
Passing the SparkSession or SQLContext gives an error when they are used on the executors:
When I try to obtain the existing SparkSession: org.apache.spark.SparkException: A master URL must be set in your configuration
When I broadcast the session variable: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 9, <server>, executor 2): java.lang.NullPointerException
When I pass the SparkSession object: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 9, <server>, executor 2): java.lang.NullPointerException
Let me know if you can suggest how to query/update a Hive table from within the executors.
Thanks,
Ritwick
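One possible restructuring, offered only as a sketch: the body of foreachRDD runs on the driver, so the per-message work can stay in rdd.map on the executors while the table write is issued from the driver as a DataFrame operation. Everything named here is an assumption for illustration: the object HiveBatchWriter, a processMessage that returns a String, and a target table that already exists with a single string column.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object HiveBatchWriter {
  // Hypothetical per-message processing; a pure function, so it is safe on executors.
  def processMessage(raw: String): String = raw.trim

  // Intended to be called from the body of foreachRDD, which runs on the driver.
  def writeBatch(spark: SparkSession, rdd: RDD[String], tableName: String): Unit = {
    import spark.implicits._
    if (!rdd.isEmpty()) {
      // The closure passed to map runs on executors and touches no Spark entry point.
      val processed = rdd.map(processMessage)
      // The write is requested from the driver; Spark distributes the actual work.
      processed.toDF("payload").write.mode("append").insertInto(tableName)
    }
  }
}
```

With the names from the question, the call site might look like inputDStream.foreachRDD(rdd => HiveBatchWriter.writeBatch(spark, rdd.map(_.value()), tableName)). Note that insertInto matches columns by position, not by name.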

ClassNotFoundException: org.apache.zeppelin.spark.ZeppelinContext when using Zeppelin input value inside spark DataFrame's filter method

I've been stuck on this for two days and can't find a solution.
I'm getting
ClassNotFoundException: org.apache.zeppelin.spark.ZeppelinContext
when using input value inside spark DataFrame's filter method.
val city = z.select("City",cities).toString
oDF.select("city").filter(r => city.equals(r.getAs[String]("city"))).count()
I even tried copying the input value to another val with
new String(bytes[])
but still get the same error.
The same code works seamlessly if, instead of getting the value from z.select,
I declare it as a String literal
city: String = "NY"
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 49.0 failed 4 times, most recent failure: Lost task 0.3 in stage
49.0 (TID 277, 10.6.60.217): java.lang.NoClassDefFoundError:
Lorg/apache/zeppelin/spark/ZeppelinContext;
You are approaching this the wrong way.
val city="NY"
gives you a Scala String with the value NY, but
z.select("City",cities)
returns a DataFrame, which you then convert to a String with toString before comparing.
This won't work!
What you can do is either collect one DataFrame and pass the resulting Scala String into the other DataFrame's filter, or do a join if you need to match multiple values.
