How can I reproduce the indeterminacy exception in Spark? - apache-spark

Here's the indeterminacy exception I'm looking to reproduce:

ERROR org.apache.spark.deploy.yarn.Client: Application diagnostics message: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle map stage with indeterminate output was failed and retried. However, Spark cannot rollback the ShuffleMapStage 401 to re-process the input data, and has to fail this job. Please eliminate the indeterminacy by checkpointing the RDD before repartition and try again.
As I understand it, this happens because part of the logic is non-deterministic (e.g. using LocalTime.now() or scala.util.Random). I wrote a piece of code that was retried and confirmed to be non-deterministic, but it still didn't fail with the indeterminacy exception. In fact, it succeeded.
import org.apache.spark.TaskContext
import scala.sys.process._ // for the "pkill ...".!! shell call
import spark.implicits._

// Entry, Output and RichDate are my own types/helpers.
val data = spark
  .range(0, 100, 1)
  .map(identity)
  .repartition(7)
  .map { x =>
    if ((x % 10) == 0) {
      Thread.sleep(1000)
    }
    val result = Entry("key" + x, x + RichDate.now.timestamp)
    println(x + " " + result)
    result
  }
  .map { item =>
    if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId > 0 && TaskContext.get.partitionId() < 3) {
      println(
        "Throw " + TaskContext.get.attemptNumber() + " " + TaskContext.get.partitionId() + " " + TaskContext.get.stageAttemptNumber()
      )
      throw new Exception("pkill -f -n java".!!)
    }
    Output(item.toString)
  }
// write to S3
The println(x + " " + result) call confirms that this part of the code is retried and produces a different result on the second try.
What did I do wrong? And does anyone have sample code that can reproduce the indeterminacy exception?
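For reference, the workaround the error message itself suggests is to checkpoint before the repartition. The following is only a rough sketch of that shape, not a verified reproduction or fix; it assumes a SparkSession named spark, and the checkpoint and output paths are placeholders:
import spark.implicits._

// set a checkpoint directory so checkpoint() has somewhere to write
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") // placeholder path

val stable = spark
  .range(0, 100, 1)
  .map(x => ("key" + x, System.currentTimeMillis())) // value differs between attempts
  .checkpoint()   // materialize the data before the shuffle so a retry re-reads stable input
  .repartition(7)

stable.write.parquet("s3://bucket/output") // placeholder output path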

Related

Spark HOF: reduce not identified

While experimenting with Spark's HOFs, I see transform, filter, exists, etc. working fine, but reduce is throwing an error.
The code is:
spark.sql("SELECT" +
" celsius," +
" reduce(celsius,0,(t, acc) -> t + acc, acc -> (acc div size(celsius) * 9 div 5) + 32 ) as avgFahrenheit" +
" FROM celsiusView")
Error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Undefined function: 'reduce'. This function is neither a registered
temporary function nor a permanent function registered in the database
'default'.;
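For what it's worth, reduce is only registered as a built-in higher-order function in newer Spark releases (it is an alias for aggregate). On versions where it is undefined, the same query can usually be written with aggregate, which has been built in since Spark 2.4. A sketch, assuming the same celsiusView:
spark.sql("SELECT" +
  " celsius," +
  " aggregate(celsius, 0, (t, acc) -> t + acc, acc -> (acc div size(celsius) * 9 div 5) + 32) as avgFahrenheit" +
  " FROM celsiusView")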

Guidance: converting a parallelly incremental nested loop to streams

I am trying to convert the following loop to a Java stream.
def a = [12, 34, 5, 64, 24, 56], b = [1, 23, 45]
for (int i = 0; i < a.size();)
    for (int j = 0; j < b.size() && a[i]; j++)
        println a[i++] + "," + b[j]
Output:
12,1
34,23
5,45
64,1
24,23
56,45
I tried a few ways, but I am not sure how to increment the outer loop index from the inner loop. Any guidance is appreciated. The following code is the furthest I have got.
a.stream().forEach({ x ->
    b.stream().filter({ y -> y % 2 != 0 }).forEach({ y ->
        println x + "," + y
    });
});
Output:
12,1
12,23
12,45
34,1
34,23
34,45
5,1
5,23
5,45
64,1
64,23
64,45
24,1
24,23
24,45
56,1
56,23
56,45
import java.util.stream.IntStream;

int[] left = {12, 34, 5, 64, 24, 56};  // left is a
int[] right = {1, 23, 45};             // right is b
IntStream.range(0, left.length)
    .mapToObj(i -> left[i] + "," + right[i % right.length]) // wrap around b with the modulo
    .forEachOrdered(System.out::println);

Function doesn't execute inside loop

I'm making a simple terminal calculator, but for some reason a function isn't executing inside a while loop even though it executes outside the loop.
Given this input: ((1 + 2) + (3 + 4))
It should output: 10
But it gets stuck in an infinite loop because it doesn't replace the innermost expressions with their result.
The function that doesn't execute is s.replace(basicOp, answer);
Here is a snippet of the problem:
public static function processInput(s:String):String
{
    var result:Null<Float> = parseNumber(s);
    if (result != null)
    {
        return Std.string(result);
    }
    var closeParPos:Int = 0;
    var openParPos:Int = 0;
    var basicOp:String;
    var answer:String = "";
    // ERROR HERE
    while (Std.string(answer) != s)
    {
        closeParPos = s.indexOf(")");
        openParPos = s.lastIndexOf("(", closeParPos);
        basicOp = s.substring(openParPos, closeParPos + 1);
        answer = processBasicOp(basicOp);
        // This isn't executed
        s.replace(basicOp, answer);
        trace("Input: " + s + " basicOp: " + basicOp + " Answer: " + answer);
    }
    return (result == null) ? "" : Std.string(result);
}
All the code is here; just run make test.
The input syntax is: ([number] [operator] [number]) or ([operator] [number])
There must be a single space between numbers and operators.
There shouldn't be any space between numbers and parentheses.
Supported operations:
+ - / *
% (remainder),
div (quotient),
sqr (square),
sqroot (square root),
sin cos tan (in degrees, bugged)
fact (factorial)
It isn't complete yet and there may be other problems, but this one prevents me from advancing.
Can someone help me find the solution?
Thank you.
I can't actually get this to run, but StringTools.replace() doesn't modify the string in place: Haxe strings are immutable, and replace() returns a new string.
Try changing s.replace(basicOp, answer); to s = s.replace(basicOp, answer);

Spark is taking too much time and creating thousands of jobs for some tasks

Machine Config :
RAM: 16 GB
Processor: 4 cores (Xeon E3, 3.3 GHz)
Problem:
Time consuming: taking more than 18 minutes
Case Scenario :
Spark Mode: Local
Database: Using Cassandra 2.1.12
I am fetching 3 tables into DataFrames, each of which has fewer than 10 rows. Yes, fewer than 10 (ten).
After fetching them into DataFrames I perform join, count, show and collect operations many times. When I execute my program, Spark creates 40404 jobs 4 times, which indicates that count needs to perform those jobs. I use count 4-5 times in the program. After waiting for more than 18 minutes (approx. 18.5 to 20) it gives me the expected output.
Why is Spark creating that many jobs?
Is it normal ('ok') to take this much time (18 minutes) to execute this number of jobs (40404 * 4 approx.)?
Thanks in advance.
Sample code 1:
def getGroups(id: Array[String], level: Int): DataFrame = {
  var lvl = level
  if (level >= 0) {
    for (iterated_id <- id) {
      val single_level_group = supportive_df.filter("id = '" + iterated_id + "' and level = " + level).select("family_id")
      //single_level_group.show()
      intermediate_df = intermediate_df.unionAll(single_level_group)
      //println("for loop portion...")
    }
    final_df = final_df.unionAll(intermediate_df)
    lvl -= 1
    val user_id_param = intermediate_df.collect().map { row => row.getString(0) }
    intermediate_df = empty_df
    //println("new method...if portion...")
    getGroups(user_id_param, lvl)
  } else {
    //println("new method...")
    final_df.distinct()
  }
}
Sample code 2:
setGetGroupsVars("u_id", user_id.toString(), sa_user_df)
var user_belong_groups: DataFrame = empty_df
val user_array = Array[String](user_id.toString())
val user_levels = sa_user_df.filter("id = '" + user_id + "'").select("level").distinct().collect().map { x => x.getInt(0) }
println(user_levels.length+"...rapak")
println(user_id.toString())
for (u_lvl <- user_levels) {
  val x1 = getGroups(user_array, u_lvl)
  x1.show()
  empty_df.show()
  user_belong_groups.show()
  user_belong_groups = user_belong_groups.unionAll(x1)
  x1.show()
}
setGetGroupsVars("obj_id", obj_id.toString(), obj_type_specific_df)
var obj_belong_groups: DataFrame = empty_df
val obj_array = Array[String](obj_id.toString())
val obj_levels = obj_type_specific_df.filter("id = '" + obj_id + "'").select("level").distinct().collect().map { x => x.getInt(0) }
println(obj_levels.length)
for (ob_lvl <- obj_levels) {
  obj_belong_groups = obj_belong_groups.unionAll(getGroups(obj_array, ob_lvl))
}
user_belong_groups = user_belong_groups.distinct()
obj_belong_groups = obj_belong_groups.distinct()
var user_obj_joined_df = user_belong_groups.join(obj_belong_groups)
user_obj_joined_df.show()
println("vbgdivsivbfb")
var user_obj_access_df = user_obj_joined_df
  .join(sa_other_access_df, user_obj_joined_df("u_id") === sa_other_access_df("user_id")
    && user_obj_joined_df("obj_id") === sa_other_access_df("object_id"))
user_obj_access_df.show()
println("KDDD..")
val user_obj_access_cond1 = user_obj_access_df.filter("u_id = '" + user_id + "' and obj_id != '" + obj_id + "'")
if (user_obj_access_cond1.count() == 0) {
  val user_obj_access_cond2 = user_obj_access_df.filter("u_id != '" + user_id + "' and obj_id = '" + obj_id + "'")
  if (user_obj_access_cond2.count() == 0) {
    val user_obj_access_cond3 = user_obj_access_df.filter("u_id != '" + user_id + "' and obj_id != '" + obj_id + "'")
    if (user_obj_access_cond3.count() == 0) {
      default_df
    } else {
      val result_ugrp_to_objgrp = user_obj_access_cond3.select("permission").agg(max("permission"))
      println("cond4")
      result_ugrp_to_objgrp
    }
  } else {
    val result_ugrp_to_ob = user_obj_access_cond2.select("permission")
    println("cond3")
    result_ugrp_to_ob
  }
} else {
  val result_u_to_obgrp = user_obj_access_cond1.select("permission")
  println("cond2")
  result_u_to_obgrp
}
// the else below closes an enclosing if that is not shown in this snippet
} else {
  println("cond1")
  individual_access
}
These are the two major code blocks in my program where execution takes too long. Most of the time is spent in the show or count operations.
First, check in the Spark UI which stage of your program is taking a long time.
Second, you are using distinct() many times, so when you use distinct() you should look at how many partitions come out of it. I suspect that's why Spark is creating thousands of jobs.
If that is the reason, you can use coalesce() after distinct().
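A rough sketch of that suggestion (it assumes a Spark 2.x+ SparkSession named spark and the final_df DataFrame from the question):
// distinct() shuffles into spark.sql.shuffle.partitions partitions (200 by default),
// which inflates the task count for tiny data
spark.conf.set("spark.sql.shuffle.partitions", "4") // small data, few shuffle partitions

val deduped = final_df.distinct().coalesce(1) // collapse the result to a single partition
println(deduped.rdd.getNumPartitions)         // sanity check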
Ok, so let's remember some basics!
Spark is lazy, and show and count are actions.
An action triggers the transformations, and you have loads of them. And if you are pulling data from Cassandra (or any other source), this costs a lot, since you don't seem to be caching your transformations!
So you need to consider caching when you compute intensively on a DataFrame or RDD; that will make your actions run faster!
As for why you have that many tasks (jobs): it is of course explained by Spark's parallelism mechanism performing your actions, multiplied by the number of transformations/actions you are executing, not to mention the loops!
Nevertheless, with the information given and the quality of the code snippets posted in the question, this is as far as my answer goes.
I hope this helps!
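A sketch of the caching advice, reusing the DataFrame names from the question (whether it actually helps depends on how often each DataFrame is reused):
// persist DataFrames that several actions reuse, so each count/show/join does not
// recompute the whole lineage back to the Cassandra reads
user_belong_groups = user_belong_groups.distinct().cache()
obj_belong_groups = obj_belong_groups.distinct().cache()

user_belong_groups.count() // the first action materializes the cache
user_belong_groups.show()  // later actions reuse the cached data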

ForkJoinPool for parallel processing

I am trying to run some code 1 million times. I initially wrote it using threads, but this seemed clunky. I started doing some more reading and came across ForkJoin. This seems like exactly what I need, but I can't figure out how to translate what I have below into "Scala style". Can someone explain the best way to use ForkJoin in my code?
val l = (1 to 1000000) map {_.toLong}
println("running......be patient")
l.foreach { x =>
  if (x % 10000 == 0) println("got to: " + x)
  val thread = new Thread {
    override def run {
      //my code (API calls) here. writes to file if call success
    }
  }
}
The easiest way is to use par (it will use ForkJoinPool automatically):
val l = (1 to 1000000).map(_.toLong).toList
l.par.foreach { x =>
  if (x % 10000 == 0) println("got to: " + x) // will be executed in parallel
  // your code (API calls) here. It will also be executed in parallel (but on the same thread as the `println("got to: " + x)` above)
}
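One addition (not part of the original answer): on Scala 2.13+ the parallel collections were moved to a separate module, so .par needs an extra dependency and import. The version shown below is indicative:
// build.sbt: libraryDependencies += "org.scala-lang.modules" %% "scala-parallel-collections" % "1.0.4"
import scala.collection.parallel.CollectionConverters._ // brings back .par on standard collections

val l = (1 to 1000000).map(_.toLong).toList
l.par.foreach { x =>
  if (x % 10000 == 0) println("got to: " + x)
}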
Another way is to use Future:
import scala.concurrent._
import ExecutionContext.Implicits.global // the global ExecutionContext is backed by a ForkJoinPool

val l = (1 to 1000000) map {_.toLong}
println("running......be patient")
l.foreach { x =>
  if (x % 10000 == 0) println("got to: " + x)
  Future {
    //your code (API calls) here. writes to file if call success
  }
}
If you need work stealing - you should mark blocking code with scala.concurrent.blocking:
Future {
scala.concurrent.blocking {
//blocking API call here
}
}
It tells the ForkJoinPool to compensate for the blocked thread with a new one, so you can avoid thread starvation (but there are some disadvantages).
In Scala, you can use Future:
import scala.concurrent._
import ExecutionContext.Implicits.global

val l = (1 to 1000000) map {
  _.toLong
}
println("running......be patient")
l.foreach { x =>
  if (x % 10000 == 0) println("got to: " + x)
  Future {
    println(x)
  }
}
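One caveat with both Future-based versions (an addition, not from the original answers): nothing waits for the futures to complete, so the JVM may exit before all the API calls finish. A sketch that blocks once at the end using the standard Future.sequence and Await:
import scala.concurrent._
import scala.concurrent.duration._
import ExecutionContext.Implicits.global

val l = (1 to 1000000).map(_.toLong)
val futures = l.map { x =>
  Future {
    // your code (API calls) here
  }
}
Await.result(Future.sequence(futures), Duration.Inf) // block once so the program doesn't exit early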
