When joining vertices, am I forced to use MEMORY_ONLY caching?

Looking at the source of outerJoinVertices, I wonder if this is a bug or a feature:
override def outerJoinVertices[U: ClassTag, VD2: ClassTag]
    (other: RDD[(VertexId, U)])
    (updateF: (VertexId, VD, Option[U]) => VD2)
    (implicit eq: VD =:= VD2 = null): Graph[VD2, ED] = {
  // The implicit parameter eq will be populated by the compiler if VD and VD2 are equal, and left
  // null if not
  if (eq != null) {
    vertices.cache() // <===== what if I wanted it serialized?
    // updateF preserves type, so we can use incremental replication
    val newVerts = vertices.leftJoin(other)(updateF).cache()
    val changedVerts = vertices.asInstanceOf[VertexRDD[VD2]].diff(newVerts)
    val newReplicatedVertexView = replicatedVertexView.asInstanceOf[ReplicatedVertexView[VD2, ED]]
      .updateVertices(changedVerts)
    new GraphImpl(newVerts, newReplicatedVertexView)
  } else {
    // updateF does not preserve type, so we must re-replicate all vertices
    val newVerts = vertices.leftJoin(other)(updateF)
    GraphImpl(newVerts, replicatedVertexView.edges)
  }
}
Questions
If my graph / joined vertices are already cached with another StorageLevel (e.g. MEMORY_ONLY_SER), is this what causes the org.apache.spark.graphx.impl.ShippableVertexPartitionOps ... WARN ShippableVertexPartitionOps: Joining two VertexPartitions with different indexes is slow. warning?
If so, is this a bug in Spark (this is from 1.3.1)? I couldn't find a JIRA issue for it (but I didn't look too hard...).
Why isn't the fix as trivial as letting this method take a StorageLevel parameter?
What are the workarounds? (One I can think of is to create a new Graph from vertices.join(otherVertices) and originalGraph.edges, or something along those lines... but it feels wrong.)

Well, I think it's actually not a bug.
Looking at the code for VertexRDD, it overrides the cache method and reuses the StorageLevel that was used to create the vertex RDD:
override def cache(): this.type = {
  partitionsRDD.persist(targetStorageLevel)
  this
}
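If that's the root cause, one workaround is to choose the storage level when the graph is first built, since Graph's constructors accept edgeStorageLevel and vertexStorageLevel parameters (defaulting to MEMORY_ONLY) in recent 1.x releases. A minimal sketch, assuming an existing SparkContext sc; the attribute types and sample data are illustrative:

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Assumed inputs - replace with your own RDDs.
val vertexRDD: RDD[(VertexId, Int)] = sc.parallelize(Seq((1L, 0), (2L, 0)))
val edgeRDD: RDD[Edge[String]] = sc.parallelize(Seq(Edge(1L, 2L, "e")))

// Build the graph with a serialized storage level; the VertexRDD remembers it as its
// targetStorageLevel, so the internal vertices.cache() calls in outerJoinVertices
// persist with MEMORY_ONLY_SER instead of MEMORY_ONLY.
val graph = Graph(
  vertexRDD,
  edgeRDD,
  defaultVertexAttr = 0,
  edgeStorageLevel = StorageLevel.MEMORY_ONLY_SER,
  vertexStorageLevel = StorageLevel.MEMORY_ONLY_SER)

val otherVertices: RDD[(VertexId, Int)] = sc.parallelize(Seq((1L, 42)))
val updated = graph.outerJoinVertices(otherVertices) { (id, attr, opt) => opt.getOrElse(attr) }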

Related

NullPointer in broadcast value being passed as parameter

:) I'd like to say that I'm new to Spark, as many of these posts start... but the truth is I'm not that new.
Still, I'm facing this issue with broadcast variables.
When a variable is broadcast, each executor receives a copy of it. Later on, when that variable is referenced in code that runs on the executors (say, inside map or foreach), if the broadcast reference that was set in the driver is not passed along, the executor does not know what we are talking about, which I think is perfectly explained here.
My problem is that I am getting a NullPointerException even though I passed the broadcast reference to the executors.
class A {
  var broadcastVal: Broadcast[DataFrame] = _
  ...
  def method1() {
    broadcastVal = otherMethodWhichSendBroadcast
    doSomething(broadcastVal, others)
  }
}

class B {
  def doSomething(...) {
    foreachPartition { x => doSomethingElse(x, broadcastVal) }
  }
}

object C {
  def doSomethingElse(...) {
    broadcastVal.value.show // --> Exception
  }
}
What am I missing?
Thanks in advance!
RDDs and DataFrames are already distributed structures, so there is no need to broadcast them as a local variable. (The org.apache.spark.sql.functions.broadcast() function, which is used when doing joins, is not a local-variable broadcast.)
Even if you try it, the code won't show any compilation error; instead it will throw a RuntimeException such as a NullPointerException at run time, which is entirely expected.
Example to explain the behavior:
package examples

import org.apache.log4j.Level
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.{DataFrame, SparkSession}

object BroadCastCheck extends App {
  org.apache.log4j.Logger.getLogger("org").setLevel(Level.OFF)
  val spark = SparkSession.builder().appName(getClass.getName).master("local").getOrCreate()
  val sc = spark.sparkContext

  val df = spark.range(100).toDF()
  var broadcastVal: Broadcast[DataFrame] = sc.broadcast(df)

  val t1 = sc.parallelize(0 until 10)
  val t2 = sc.broadcast(2) // this is right: a local variable can be a primitive, a map or any Scala collection
  val t3 = t1.filter(_ % t2.value == 0).persist() // this is the right way to use the broadcast value
  t3.foreach {
    x =>
      println(x)
      // broadcastVal.value.toDF().show // null pointer - wrong way
      // spark.range(100).toDF().show   // null pointer - wrong way
  }
}
Result (if you uncomment broadcastVal.value.toDF().show or spark.range(100).toDF().show in the code above):
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:56)
at org.apache.spark.sql.execution.WholeStageCodegenExec.metrics$lzycompute(WholeStageCodegenExec.scala:528)
at org.apache.spark.sql.execution.WholeStageCodegenExec.metrics(WholeStageCodegenExec.scala:527)
Further reading on the difference between a broadcast variable and the broadcast function here...
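To make the distinction concrete, here is a minimal sketch reusing the spark session from the example above (the map contents and DataFrame names are illustrative): a broadcast variable carries a small driver-side value into closures, while broadcast() is only a join hint applied to an already distributed DataFrame.

import org.apache.spark.sql.functions.broadcast

// Broadcast variable: a small local collection, read via .value inside closures.
val lookup = spark.sparkContext.broadcast(Map("a" -> 1, "b" -> 2))
val mapped = spark.sparkContext.parallelize(Seq("a", "b", "c"))
  .map(k => k -> lookup.value.getOrElse(k, -1))

// broadcast() function: a join hint on a DataFrame, not a local-variable broadcast.
val largeDf = spark.range(1000000).toDF("id")
val smallDf = spark.range(10).toDF("id")
val joined = largeDf.join(broadcast(smallDf), "id")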

Camel-Olingo2: The metadata constraints '[Nullable=true, MaxLength=16]' do not match the literal

I'm using the camel-olingo2 component to query SAP SuccessFactors OData V2 endpoints. The route is:
from("direct:start")
.to(olingoEndpoint)
.process(paging)
.loopDoWhile(simple("\${header.CamelOlingo2.\$skiptoken} != null"))
.to(olingoEndpoint)
.process(paging)
.end()
The paging processor is:
Processor paging = new Processor() {
    @Override
    void process(Exchange g) throws Exception {
        ODataDeltaFeed feed = g.in.getMandatoryBody(ODataDeltaFeed)
        if (consumer) feed.getEntries().forEach(consumer)
        String next = feed.getFeedMetadata().getNextLink()
        if (next) {
            List<NameValuePair> lst = URLEncodedUtils.parse(new URI(next), StandardCharsets.UTF_8)
            NameValuePair skiptoken = lst.find { it.name == "\$skiptoken" }
            g.out.headers."CamelOlingo2.\$skiptoken" = skiptoken.value
        } else {
            g.out.headers.remove("CamelOlingo2.\$skiptoken")
        }
    }
}
Everything is OK with most entities, but several entities have fields with the wrong nullability or data length, so I got:
Caused by: org.apache.olingo.odata2.api.edm.EdmSimpleTypeException: The metadata constraints '[Nullable=true, MaxLength=16]' do not match the literal 'Bor.Kralja Petra I.16'.
at org.apache.olingo.odata2.core.edm.EdmString.internalValueOfString(EdmString.java:62)
at org.apache.olingo.odata2.core.edm.AbstractSimpleType.valueOfString(AbstractSimpleType.java:91)
at org.apache.olingo.odata2.core.ep.consumer.JsonPropertyConsumer.readSimpleProperty(JsonPropertyConsumer.java:236)
at org.apache.olingo.odata2.core.ep.consumer.JsonPropertyConsumer.readPropertyValue(JsonPropertyConsumer.java:169)
In the documentation for the Olingo2 Camel component I cannot find a way to disable this check, or any other workaround. Can you suggest a good approach?
Please do not recommend server-side data changes (e.g. modifying the metadata); that is out of scope for this task.
I have a plan B: plain HTTPS requests with JSON parsing. It's quite simple, just a little tedious.

Spark: broadcasting a HashMap gives no NullPointerException, but it doesn't fetch any values either

I am broadcasting a HashMap and returning a map from the method below:
public static Map<Object1, Object2> lkpBC(JavaSparkContext ctx, String FilePath) {
    Broadcast<Map<Object1, Object2>> CodeBC = null;
    Map<Object1, Object2> codePairMap = null;
    try {
        Map<Object1, Object2> CodepairMap = LookupUtil.loadLookup(ctx, FilePath);
        CodeBC = ctx.broadcast(codePairMap);
        codePairMap = CodeBC.value();
    } catch (Exception e) {
        LOG.error("Error while broadcasting ", e);
    }
    return codePairMap;
}
and passing the map to the method below:
public static JavaRDD<Object3> fetchDetails(
        JavaSparkContext ctx,
        JavaRDD<Object3> CleanFileRDD,
        String FilePath,
        Map<Object1, Object2> BcMap
) {
    JavaRDD<Object3> assignCd = CleanFileRDD.map(row -> {
        Object3 FileData = null;
        try {
            FileData = row;
            if (BCMap.containsKey("some key")) { ...... }
        } catch (Exception e) {
            LOG.error("Error in Map function ", e);
        }
        return some object;
    });
    return assignCd;
}
In local mode this works fine without any issues, but when I run it on a Spark standalone cluster (1 master, 3 slaves) on EC2 it doesn't fetch any values, nor does it throw an error. All the objects you see in these methods are serializable. Does it matter whether I call these methods from a main class or from some other class?
PS: We use the Kryo serializer in the Spark conf.
I think what's going on is that you are not accessing the broadcast variable inside the closure of your map function. I think you are directly accessing the underlying BcMap (or BCMap; not sure if they are supposed to be different).
The line if (BCMap.containsKey("some key")) isn't accessing the broadcast variable CodeBC, since the type of BCMap appears to be Map, not Broadcast.
To access the broadcast variable you would call CodeBC.value.containsKey.
Spark is designed in a functional way: it doesn't "do" anything to the underlying map; it makes a copy of it, broadcasts the copy, and wraps that copy in a Broadcast type.
I don't know what LookupUtil.loadLookup does, but I guess that if the file doesn't exist or is empty it returns an empty map?
Here is an example of how you would do it in Scala:
val bcMap = ctx.broadcast(LookupUtil.loadLookup(ctx, FilePath))

cleanFileRDD.map(row =>
  if (bcMap.value.containsKey("some key")) ...
  else ...)
I think you will solve your situation by following the wise words of a friend of mine: "first solve all the obvious issues, then the harder issues seem to solve themselves". In your case those are:
Using mutable variables that get initialised to null.
Using try/catches that log errors but don't re-throw them. Just let exceptions bubble up.
Prematurely splitting things out into lots of different methods before you have it working as just one method.
And just because something works locally doesn't mean it will work when distributed. There are a lot of differences between running something locally and across a cluster, such as: a) data locality, b) serialization, c) closure capture, d) number of threads, e) execution order, etc.
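For completeness, here is a self-contained sketch of the pattern described above (all names are illustrative, not the poster's code): broadcast the map once on the driver and always go through .value inside the closure.

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastMapExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("bc-map").setMaster("local[2]"))

    // Build the lookup map on the driver and broadcast it once.
    val lookup = Map("some key" -> "some value")
    val lookupBc = sc.broadcast(lookup)

    // Inside the closure, access the data only through the Broadcast handle.
    val enriched = sc.parallelize(Seq("some key", "other key"))
      .map(k => k -> lookupBc.value.getOrElse(k, "missing"))

    enriched.collect().foreach(println)
    sc.stop()
  }
}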

How do I pass functions into Spark transformations during scalatest?

I am using FlatSpec to run a test and keep hitting an error because I pass a function into map. I've encountered this problem a few times before, but worked around it by using an anonymous function. That doesn't seem to be possible in this case. Is there a way of passing functions into transformations under ScalaTest?
code:
"test" should "fail" in {
  val expected = sc.parallelize(Array(Array("foo", "bar"), Array("bar", "qux")))

  def validateFoos(firstWord: String): Boolean = {
    if (firstWord == "foo") true else false
  }

  val validated = expected.map(x => validateFoos(x(0)))
  val trues = expected.map(row => true)

  assert(None === RDDComparisons.compareWithOrder(validated, trues))
}
error:
org.apache.spark.SparkException: Task not serializable
*This uses Holden Karau's Spark testing base:
https://github.com/holdenk/spark-testing-base
The "normal" way of handing this is to define the outer class to be serilizable, this is a bad practice in anything except for tests since you don't want to ship a lot of data around.

Actor's value sometimes returns null

I have an Actor and some other object:
object Config {
  val readValueFromConfig() = { //....}
}

class MyActor extends Actor {
  val confValue = Config.readValueFromConfig()
  val initValue = Future {
    val a = confValue                    // sometimes it's null
    val a = Config.readValueFromConfig() // always works well
  }
  //..........
}
The code above is a very simplified version of what I actually have. The odd thing is that sometimes val a = confValue returns null, whereas if I replace it with val a = Config.readValueFromConfig() then it always works well.
I wonder, is this due to the fact that the only way to interact with an actor is by sending it a message? Therefore, since val confValue is not a local variable, must I either use val a = Config.readValueFromConfig() (a different object, not an actor) or val a = self ! GetConfigValue and read the result afterwards?
val readValueFromConfig() = { //....}
This gives me a compile error. I assume you mean without parentheses?
val readValueFromConfig = { //....}
Same logic with different timing giving different results = a race condition.
val confValue = Config.readValueFromConfig() is always executed during construction of MyActor objects (because it's a field of MyActor). Sometimes this returns null.
val a = Config.readValueFromConfig() // always works well is always executed later, after MyActor is constructed, when the Future initValue is run by its Executor. It seems this never returns null.
Possible causes:
This could be explained away if the body of readValueFromConfig were dependent upon another parallel/async operation having completed. Any chance you're reading the config asynchronously? Given the name of this method, it probably just reads synchronously from a file, meaning this is not the cause.
Singleton objects are not thread-safe?? I compiled your code. Here's the decompilation of your singleton object's Java class:
public final class Config
{
  public static String readValueFromConfig()
  {
    return Config$.MODULE$.readValueFromConfig();
  }
}

public final class Config$
{
  public static final Config$ MODULE$;
  private final String readValueFromConfig;

  static
  {
    new Config$();
  }

  public String readValueFromConfig()
  {
    return this.readValueFromConfig;
  }

  private Config$()
  {
    MODULE$ = this;
    this.readValueFromConfig = // ... your logic here;
  }
}
Mmmkay... Unless I'm mistaken, that ain't thread-safe.
If two threads access readValueFromConfig (say Thread1 accesses it first), then inside the private Config$() constructor MODULE$ is unsafely published before this.readValueFromConfig is set (the reference to this prematurely escapes the constructor). Thread2, which is right behind, can read MODULE$.readValueFromConfig before it is set. This is highly likely to be a problem if '... your logic here' is slow and blocks the thread, which is precisely what synchronous I/O does.
Moral of the story: avoid accessing stateful singleton objects from actors (or from any threads at all, including Executors), OR make them thread-safe through a very careful coding style. Work-around: change it to a def which internally caches the value in a private val, as sketched below.
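A sketch of that work-around; the idiomatic Scala spelling of "a def that caches into a private val" is a lazy val, whose initialization the compiler synchronizes, so the value cannot be observed before it has been assigned (the returned string is a placeholder):

object Config {
  // Initialized on first access, under a compiler-generated lock, instead of
  // during the unsafe construction-time publication shown above.
  lazy val readValueFromConfig: String = {
    // ... your (synchronous, possibly slow) config-reading logic here ...
    "value-from-config"
  }
}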
I wonder, is this due to the fact that the only way to interact with an actor is sending it a message? Therefore, since val confValue is not a local variable, I must either use val a = Config.readValueFromConfig() (a different object, not an actor)
Just because it's not an actor, doesn't mean it's necessarily safe. It probably isn't.
or val a = self ! GetConfigValue and read the result afterwards?
That's almost right. You mean self ? GetConfigValue, I think - that will return a Future, which you can then map over. ! doesn't return anything.
You cannot read from an actor's variables directly inside a Future because (in general) that Future could be running on any thread, on any processor core, and you don't have any memory barrier there to force the CPU caches to reload the value from main memory.
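For reference, a hedged sketch of the ask pattern mentioned above (the message type, timeout, and actor wiring are illustrative):

import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

case object GetConfigValue

class MyActor extends Actor {
  val confValue = Config.readValueFromConfig
  def receive = {
    // Reading confValue here is safe: it happens on the actor's own thread.
    case GetConfigValue => sender() ! confValue
  }
}

val system = ActorSystem("demo")
val myActor = system.actorOf(Props[MyActor], "my-actor")

implicit val timeout: Timeout = Timeout(5.seconds)
// `?` returns a Future; map over it instead of reading the actor's fields directly.
(myActor ? GetConfigValue).mapTo[String].foreach(v => println(s"config value: $v"))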
