Is it possible to ignore failed tasks in Spark - apache-spark

I have some large datasets where certain records cause a UDF to crash. Once such a record is processed, the task fails, which leads to the whole job failing. The crashes happen in native code (we use a native Fortran library via JNA), so I cannot catch them in the UDF.
What I'd like is a fault-tolerance mechanism that would allow me to skip/ignore/blacklist bad partitions/tasks so that my Spark app does not fail.
Is there a way to do this?
The only workaround I could come up with is to process small chunks of data in a foreach loop:
val dataFilters: Seq[Column] = ???
val myUDF: UserDefinedFunction = ???

dataFilters.foreach { filter =>
  try {
    ss.table("sourcetable")
      .where(filter)
      .withColumn("udf_result", myUDF($"inputcol"))
      .write.insertInto("targettable")
  } catch {
    case e: Exception => // log and skip this chunk if its job fails
  }
}
This is not ideal because Spark is relatively slow at processing small amounts of data, e.g. the input table is read many times.
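One partial mitigation, sketched below under my own assumptions (it is not something proposed above), is to read and cache the source table once so that each filtered chunk reuses the cached data instead of rescanning the table. It does not remove the per-job overhead of running many small chunks.

import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.col

val ss: SparkSession = SparkSession.builder().getOrCreate()
val dataFilters: Seq[Column] = ???
val myUDF: UserDefinedFunction = ???

// Cache the source once so every chunk reuses it instead of rescanning the table.
val source: DataFrame = ss.table("sourcetable").cache()

dataFilters.foreach { filter =>
  try {
    source
      .where(filter)
      .withColumn("udf_result", myUDF(col("inputcol")))
      .write.insertInto("targettable")
  } catch {
    case e: Exception => // the whole chunk is skipped when its job fails
  }
}

source.unpersist()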

Related

Optimal (low-latency) spark settings for small datasets

I'm aware that Spark is designed for large datasets, for which it's great. But under certain circumstances I don't need this scalability, e.g. for unit tests or for data exploration on small datasets. Under these conditions Spark performs relatively poorly compared to an implementation in pure Scala/Python/MATLAB/R, etc.
Note that I don't want to drop spark entirely, I want to keep the framework for larger workloads without re-implementing everything.
How can I disable Spark's overhead as much as possible on small datasets (say 10 to 1000s of records)? I tried using only 1 partition in local mode (setting spark.sql.shuffle.partitions=1 and spark.default.parallelism=1). Even with these settings, simple queries on 100 records take on the order of 1-2 seconds.
Note that I'm not trying to reduce the time for SparkSession instantiation, just the execution time given an existing SparkSession.
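For reference, a minimal sketch of the configuration described above (local mode with a single partition for shuffles and for the default parallelism; the single-core master is my own choice for illustration):

import org.apache.spark.sql.SparkSession

// Local mode with one core and one partition everywhere, as described in the question.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("small-data")
  .config("spark.sql.shuffle.partitions", "1")
  .config("spark.default.parallelism", "1")
  .getOrCreate()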
Operations in Spark have the same signature as the Scala collections.
You could implement something like:
import org.apache.spark.rdd.RDD

val useSpark = false

val rdd: RDD[String] = ???       // Spark path
val list: List[String] = Nil     // plain Scala path

def mapping: String => Int = s => s.length

if (useSpark) {
  rdd.map(mapping)
} else {
  list.map(mapping)
}
I think this code could be abstracted even more.
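For example, here is a minimal sketch of such an abstraction using a hypothetical wrapper trait (none of these names come from Spark or from the answer above):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical wrapper that hides whether the data lives in a List or an RDD.
sealed trait DataSeq[A] {
  def map[B: ClassTag](f: A => B): DataSeq[B]
  def toList: List[A]
}

final case class LocalSeq[A](underlying: List[A]) extends DataSeq[A] {
  def map[B: ClassTag](f: A => B): DataSeq[B] = LocalSeq(underlying.map(f))
  def toList: List[A] = underlying
}

final case class SparkSeq[A](underlying: RDD[A]) extends DataSeq[A] {
  def map[B: ClassTag](f: A => B): DataSeq[B] = SparkSeq(underlying.map(f))
  def toList: List[A] = underlying.collect().toList
}

Calling code would then only depend on DataSeq, using LocalSeq for unit tests and small explorations and SparkSeq for large workloads.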

Iterating on ResultSet in Spark

When I execute a query using JDBC against a SQL DB, I get a ResultSet object in return. Say I want to iterate through each row returned in the ResultSet and then do operations on each one; my question is whether the initial iteration through the ResultSet is handled by the driver or by the executors.
For example, say I have a service where I want to handle many WordCount jobs in large batches. Perhaps I have a DB with the following schema:
JobId: int
Input: string (HDFS location)
Output: string (HDFS path)
Status: (not started, in progress, complete, etc.)
Every time my Spark application runs, I want to use JDBC to read from the DB and get every row where the status is "not started". This is returned as a ResultSet, and each row essentially contains the parameters for Spark to run a WordCount. When I iterate over the ResultSet, does it get split up so that the executors iterate over small chunks, or does the driver handle iterating over each row? If the former, what happens once I start loading DataFrames for the given input locations and running the necessary transformations and actions to get the word count? Would the executors further split up the DataFrames to other executors for processing?
Sorry if this question is unclear, I'm still learning about Spark and having trouble wrapping my mind around some of this. As well, would this generally be considered a good approach to handling multiple requests in one large batch? Or is there a better way to go about doing this?
Thanks!
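To make the scenario concrete, here is a minimal sketch of the pattern described above (the connection string, table, and column names are hypothetical). Plain JDBC ResultSet iteration like this happens entirely on the driver; only the word-count job launched inside the loop is distributed to the executors.

import java.sql.DriverManager
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("BatchWordCount").getOrCreate()

// Plain JDBC: this query and the ResultSet iteration run on the driver only.
val conn = DriverManager.getConnection("jdbc:postgresql://dbhost/jobs", "user", "pass")
val rs = conn.createStatement()
  .executeQuery("SELECT JobId, Input, Output FROM jobs WHERE Status = 'not started'")

while (rs.next()) {
  val input  = rs.getString("Input")
  val output = rs.getString("Output")

  // Only this part is distributed: the word count itself runs on the executors.
  spark.sparkContext.textFile(input)
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1L))
    .reduceByKey(_ + _)
    .saveAsTextFile(output)
}
conn.close()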

Multiple computations in one iteration with Spark

How can I iterate over a big collection of files producing different results in just one step with Spark? For example:
val tweets : RDD[Tweet] = ...
val topWords : RDD[String] = getTopWords(tweets)
val topHashtags : RDD[String] = getTopHashtags(tweets)
topWords.collect().foreach(println)
topHashtags.collect().foreach(println)
It looks like Spark is going to iterate twice over the tweets dataset. Is there any way to prevent this? Is Spark smart enough to make this kind of optimization?
Thanks in advance,
Spark will keep data it has loaded cached for as long as it can, but that's not something you should rely on, so your best bet is to call tweets.cache() so that after the initial load the subsequent computations work off an in-memory store. The only other solution would be to combine your two functions into one that returns a tuple of (resultType1, resultType2).
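A minimal sketch of the caching approach, assuming Tweet, getTopWords, and getTopHashtags are defined as in the question (loadTweets is a hypothetical stand-in for the elided loading code):

import org.apache.spark.rdd.RDD

val tweets: RDD[Tweet] = loadTweets()   // hypothetical loader, standing in for the "..." above
tweets.cache()                          // materialised on the first action, then reused

val topWords: RDD[String] = getTopWords(tweets)
val topHashtags: RDD[String] = getTopHashtags(tweets)

topWords.collect().foreach(println)     // first action loads the data and fills the cache
topHashtags.collect().foreach(println)  // second action reads tweets from the cache

tweets.unpersist()                      // release the cached data when done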

Parallel writing from same DataFrame in Spark

Let's say I have a DataFrame in Spark and I need to write the results of it to two databases, where one stores the original data frame but the other stores a slightly modified version (e.g. drops some columns). Since both operations can take a few moments, is it possible/advisable to run these operations in parallel or will that cause problems because Spark is working on the same object in parallel?
import java.util.concurrent.Executors
import scala.concurrent._
import scala.concurrent.duration._

implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))

def write1(): Unit = {
  // your save statement for the first DataFrame
}

def write2(): Unit = {
  // your save statement for the second DataFrame
}

def writeAllTables(): Unit = {
  val f1 = Future { write1() }
  val f2 = Future { write2() }
  // block until both writes finish so the driver does not exit early
  Await.result(Future.sequence(Seq(f1, f2)), Duration.Inf)
}
Let me ask you: do you really need to do it? If you are not sure, then you most probably don't.
So, let's assume a scenario similar to the one you explained:
val df1 = spark.read.csv("someFile.csv")  // original DataFrame
val df2 = df1.withColumn("newColumn", concat(col("oldColumn"), lit(" is blah!")))  // modified DataFrame; note that df2 is a different object
df1.write.save("db_loc1")  // write to DB1: already parallelised and uses Spark resources optimally
df2.write.save("db_loc2")  // write to DB2: already parallelised and uses Spark resources optimally
The Spark scheduler divides the first DataFrame df1 into partitions and writes them in parallel to db_loc1.
It then picks up the second DataFrame df2, again breaks it into partitions, and writes those partitions in parallel to db_loc2.
By default, the degree of parallelism per write is chosen so that the available cluster resources are used optimally.
Small writes are usually not worth repartitioning, since the write time is low and repartitioning only adds overhead. In the unusual case where you have a lot of small writes, it might make sense to try to parallelise them. But the best way to do so is to redesign your code to run one Spark job per DataFrame instead of trying to parallelise the DataFrame.write() calls in the same driver program.
A large write will probably use all available resources in parallel during the single DataFrame write itself. Hence, if Spark allowed issuing another write operation for a different DataFrame at the same time, it would only delay both operations, as they would now be racing with each other for resources. Not to mention there may be some slowdown due to the increased overhead from the sheer increase in the number of tasks that Spark now has to manage and track.
Also, you can read this answer to learn more about this.

Why does Spark DataFrame run out of memory when the same process on an RDD completes fine?

I'm working with a fairly large amount of data (a few TBs). When I use a subset of the data, I find that Spark DataFrames are great to work with. However, when I try calculations on my full dataset, the same code gives me the dreaded "java.lang.OutOfMemoryError: GC overhead limit exceeded". What surprised me is that the process completes fine when doing the same thing with an RDD. I thought DataFrames were supposed to have better optimization. Is this a mistake in my approach or a limitation of DataFrames?
For example, here is a simple task using dataframes that completes fine for a subset of my data and chokes on the full sample:
import com.databricks.spark.avro._  // provides .avro(...) on DataFrameReader

val records = sqlContext.read.avro(datafile)
val uniqueIDs = records.select("device_id").dropDuplicates(Array("device_id"))
val uniqueIDsCount = uniqueIDs.count().toDouble
val sampleIDs = uniqueIDs.sample(withReplacement = false, 100000 / uniqueIDsCount)
sampleIDs.write.format("com.databricks.spark.csv").option("delimiter", "|").save(outputfile)
In this case it even chokes on the count.
However, when I try the same thing using RDDs in the following way it calculates fine (and pretty quickly at that).
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable

val rawinput = sc.hadoopFile[AvroWrapper[Observation], NullWritable,
    AvroInputFormat[Observation]](rawinputfile)
  .map(x => x._1.datum)
val tfdistinct = rawinput.map(x => x.getDeviceId).distinct
val distinctCount = tfdistinct.count().toDouble
tfdistinct.sample(false, 100000 / distinctCount).saveAsTextFile(outputfile)
I'd love to keep using DataFrames in the future; am I approaching this wrong?
