Spark Broadcasting a HashMap no nullpointer but it doesnt fetch any values aswell - apache-spark

I am broadcasting a hashmap and returning a map from the below method
public static Map<Object1, Object2> lkpBC (JavaSparkContext ctx, String FilePath) {
Broadcast<Map<Object1, Object2>> CodeBC = null;
Map<Object1, Object2> codePairMap=null;
try {
Map<Object1, Object2> CodepairMap = LookupUtil.loadLookup(ctx, FilePath);
CodeBC = ctx.broadcast(codePairMap);
codePairMap= CodeBC.value();
} catch (Exception e) {
LOG.error("Error while broadcasting ", e);
}
return codePairMap;
}
and passing the map to the below method
public static JavaRDD<Object3> fetchDetails(
JavaSparkContext ctx,
JavaRDD<Object3> CleanFileRDD,
String FilePath,
Map<Object1, Object2> BcMap
) {
JavaRDD<Object3r> assignCd = CleanFileRDD.map(row -> {
object3 FileData = null;
try {
FileData = row;
if (BCMap.containsKey("some key")) {......}
} catch (Exception e) {
LOG.error("Error in Map function ", e);
}
return some object;
});
return assignCd;
}
in the local mode it works fine without any issues but when i run this on a spark standalone cluster(1 master 3 slaves) on EC2 this doesn't fetch any values nor throws an error. All the objects you see in the methods are serialized. Does it matter if i am calling these methods from a main class or any other different class?
PS: We use Kyro serializer in the spark conf

I think what's going on is you are not accessing the broadcast variable inside the closure of your map function. I think you are directly accessing the underlying BcMap (or BCMap, not sure if they are supposed to be different).
Line if (BCMap.containsKey("some key")) isn't accessing the broadcast variable CodeBC. Since it seems the type of BCMap is Map, not Broadcast.
To access the broadcast variable you would call CodeBC.value.containsKey.
Spark is designed in a functional way, it doesn't "do" anything to the underlying map, it makes a copy of it, broadcasts the copy, and wraps that copy in a Broadcast type.
I don't know what LookupUtil.loadLookup does, but I guess if the file doesn't exist or is empty does it return an empty map?
Here is an example of how you would do it in Scala:
val bcMap = ctx.broadcast(LookupUtil.loadLookup(ctx, FilePath))
cleanFileRDD.map(row =>
if (bcMap.value.containsKey("some key") ...
else ...)
I think you will solve your situation by following the wise words of a friend of mine "first solve all the obvious issues, then the harder issues seem to solve themselves". In your case they are:
Using mutable variables that get initialised to null
Using try catches that log errors but don't re-throw them. Just let exceptions bubble up.
Prematurely splitting things out into lots of different methods before you have it working as just one method.
And just because something works locally doesn't mean it will work when distributed. There are a lot of differences between running something locally and across a cluster, like: a) Data locality b) Serialization c) Closure capture d) Number of threads e) execution order ... etc

Related

the right way to return a Single from a CompletionStage

I'm playing around with reactive flows using RxJava2, Micronaut and Cassandra. I'm new to rxjava and not sure what is the correct way to return a of List Person in the best async manner?
data is coming from a Cassandra Dao interface
public interface PersonDAO {
#Query("SELECT * FROM cass_drop.person;")
CompletionStage<MappedAsyncPagingIterable<Person>> getAll();
}
that gets injected into a micronaut controller
return Single.just(personDAO.getAll().toCompletableFuture().get().currentPage())
.subscribeOn(Schedulers.io())
.map(people -> HttpResponse.ok(people));
OR
return Single.just(HttpResponse.ok())
.subscribeOn(Schedulers.io())
.map(it -> it.body(personDAO.getAll().toCompletableFuture().get().currentPage()));
OR switch to RxJava3
return Single.fromCompletionStage(personDAO.getAll())
.map(page -> HttpResponse.ok(page.currentPage()))
.onErrorReturn(throwable -> HttpResponse.ok(Collections.emptyList()));
Not a pro of RxJava nor Cassandra :
In your first and second example, you are blocking the thread executing the CompletionStage with get, even if you are doing it in the IO thread, I would not recommand doing so.
You are also using a Single wich can emit, only one value, or an error. Since you want to return a List, I would sugest to go for at least an Observable.
Third point, the result from Cassandra is paginated, I don't know if it's intentionnaly but you list only the first page, and miss the others.
I would try a solution like the one below, I kept using the IO thread (the operation may be costly in IO) and I iterate over the pages Cassandra fetch :
/* the main method of your controller */
#Get()
public Observable<Person> listPersons() {
return next(personDAO.getAll()).subscribeOn(Schedulers.io());
}
private Observable<Person> next(CompletionStage<MappedAsyncPagingIterable<Person>> pageStage) {
return Single.fromFuture(pageStage.toCompletableFuture())
.flatMapObservable(personsPage -> {
var o = Observable.fromIterable(personsPage.currentPage());
if (!personsPage.hasMorePages()) {
return o;
}
return o.concatWith(next(personsPage.fetchNextPage()));
});
}
If you ever plan to use reactor instead of RxJava, then you can give cassandra-java-driver-reactive-mapper a try.
The syntax is fairly simple and works in compile-time only.

In Kotlin Native, how to keep an object around in a separate thread, and mutate its state from any other thead without using C pointers?

I'm exploring Kotlin Native and have a program with a bunch of Workers doing concurrent stuff
(running on Windows, but this is a general question).
Now, I wanted to add simple logging. A component that simply logs strings by appending them as new lines to a file that is kept open in 'append' mode.
(Ideally, I'd just have a "global" function...
fun log(text:String) {...} ]
...that I would be able to call from anywhere, including from "inside" other workers and that would just work. The implication here is that it's not trivial to do this because of Kotlin Native's rules regarding passing objects between threads (TLDR: you shouldn't pass mutable objects around. See: https://github.com/JetBrains/kotlin-native/blob/master/CONCURRENCY.md#object-transfer-and-freezing ).
Also, my log function would ideally accept any frozen object. )
What I've come up with are solutions using DetachedObjectGraph:
First, I create a detached logger object
val loggerGraph = DetachedObjectGraph { FileLogger("/foo/mylogfile.txt")}
and then use loggerGraph.asCPointer() ( asCPointer() ) to get a COpaquePointer to the detached graph:
val myPointer = loggerGraph.asCPointer()
Now I can pass this pointer into the workers ( via the producer lambda of the Worker's execute function ), and use it there. Or I can store the pointer in a #ThreadLocal global var.
For the code that writes to the file, whenever I want to log a line, I have to create a DetachedObjectGraph object from the pointer again,
and attach() it in order to get a reference to my fileLogger object:
val fileLogger = DetachedObjectGraph(myPointer).attach()
Now I can call a log function on the logger:
fileLogger.log("My log message")
This is what I've come up with looking at the APIs that are available (as of Kotlin 1.3.61) for concurrency in Kotlin Native,
but I'm left wondering what a better approach would be ( using Kotlin, not resorting to C ). Clearly it's bad to create a DetachedObjectGraph object for every line written.
One could pose this question in a more general way: How to keep a mutable resource open in a separate thread ( or worker ), and send messages to it.
Side comment: Having Coroutines that truly use threads would solve this problem, but the question is about how to solve this task with the APIs currently ( Kotlin 1.3.61 ) available.
You definitely shouldn't use DetachedObjectGraph in the way presented in the question. There's nothing to prevent you from trying to attach on multiple threads, or if you pass the same pointer, trying to attach to an invalid one after another thread as attached to it.
As Dominic mentioned, you can keep the DetachedObjectGraph in an AtomicReference. However, if you're going to keep DetachedObjectGraph in an AtomicReference, make sure the type is AtomicRef<DetachedObjectGraph?> and busy-loop while the DetachedObjectGraph is null. That will prevent the same DetachedObjectGraph from being used by multiple threads. Make sure to set it to null, and repopulate it, in an atomic way.
However, does FileLogger need to be mutable at all? If you're writing to a file, it doesn't seem so. Even if so, I'd isolate the mutable object to a separate worker and send log messages to it rather than doing a DetachedObjectGraph inside an AtomicRef.
In my experience, DetachedObjectGraph is super uncommon in production code. We don't use it anywhere at the moment.
To isolate mutable state to a Worker, something like this:
class MutableThing<T:Any>(private val worker:Worker = Worker.start(), producer:()->T){
private val arStable = AtomicReference<StableRef<T>?>(null)
init {
worker.execute(TransferMode.SAFE, {Pair(arStable, producer).freeze()}){
it.first.value = StableRef.create(it.second()).freeze()
}
}
fun <R> access(block:(T)->R):R{
return worker.execute(TransferMode.SAFE, {Pair(arStable, block).freeze()}){
it.second(it.first.value!!.get())
}.result
}
}
object Log{
private val fileLogger = MutableThing { FileLogger() }
fun log(s:String){
fileLogger.access { fl -> fl.log(s) }
}
}
class FileLogger{
fun log(s:String){}
}
The MutableThing uses StableRef internally. producer makes the mutable state you want to isolate. To log something, call Log.log, which will wind up calling the mutable FileLogger.
To see a basic example of MutableThing, run the following test:
#Test
fun goIso(){
val mt = MutableThing { mutableListOf("a", "b")}
val workers = Array(4){Worker.start()}
val futures = mutableListOf<Future<*>>()
repeat(1000) { rcount ->
val future = workers[rcount % workers.size].execute(
TransferMode.SAFE,
{ Pair(mt, rcount).freeze() }
) { pair ->
pair.first.access {
val element = "ttt ${pair.second}"
println(element)
it.add(element)
}
}
futures.add(future)
}
futures.forEach { it.result }
workers.forEach { it.requestTermination() }
mt.access {
println("size: ${it.size}")
}
}
The approach you've taken is pretty much correct and the way it's supposed to be done.
The thing I would add is, instead of passing around a pointer around. You should pass around a frozen FileLogger, which will internally hold a reference to a AtomicRef<DetachedObjectGraph>, the the attaching and detaching should be done internally. Especially since DetachedObjectGraphs are invalid once attached.

Actor's value sometimes returns null

I have an Actor and some other object:
object Config {
val readValueFromConfig() = { //....}
}
class MyActor extends Actor {
val confValue = Config.readValueFromConfig()
val initValue = Future {
val a = confValue // sometimes it's null
val a = Config.readValueFromConfig() //always works well
}
//..........
}
The code above is a very simplified version of what I actually have. The odd thing is that sometimes val a = confValue returns null, whereas if I replace it with val a = Config.readValueFromConfig() then it always works well.
I wonder, is this due to the fact that the only way to interact with an actor is sending it a message? Therefore, since val confValue is not a local variable, I must either use val a = Config.readValueFromConfig() (a different object, not an actor) or val a = self ! GetConfigValue and read the result afterwards?
val readValueFromConfig() = { //....}
This gives me a compile error. I assume you mean without parentheses?
val readValueFromConfig = { //....}
Same logic with different timing gives different result = a race condition.
val confValue = Config.readValueFromConfig() is always executed during construction of MyActor objects (because it's a field of MyActor). Sometimes this is returning null.
val a = Config.readValueFromConfig() //always works well is always executed later - after MyActor is constructed, when the Future initValue is executed by it's Executor. It seems this never returns null.
Possible causes:
Could be explained away if the body of readValueFromConfig was dependent upon another
parallel/async operation having completed. Any chance you're reading the config asynchronously? Given the name of this method, it probably just reads synchronously from a file - meaning this is not the cause.
Singleton objects are not threadsafe?? I compiled your code. Here's the decompilation of your singleton object java class:
public final class Config
{
public static String readValueFromConfig()
{
return Config..MODULE$.readValueFromConfig();
}
}
public final class Config$
{
public static final MODULE$;
private final String readValueFromConfig;
static
{
new ();
}
public String readValueFromConfig()
{
return this.readValueFromConfig;
}
private Config$()
{
MODULE$ = this;
this.readValueFromConfig = // ... your logic here;
}
}
Mmmkay... Unless I'm mistaken, that ain't thread-safe.
IF two threads are accessing readValueFromConfig (say Thread1 accesses it first), then inside method private Config$(), MODULE$ is unsafely published before this.readValueFromConfig is set (reference to this prematurely escapes the constructor). Thread2 which is right behind can read MODULE$.readValueFromConfig before it is set. Highly likely to be a problem if '... your logic here' is slow and blocks the thread - which is precisely what synchronous I/O does.
Moral of story: avoid stateful singleton objects from Actors (or any Threads at all, including Executors) OR make them become thread-safe through very careful coding style. Work-Around: change to a def, which internally caches the value in a private val.
I wonder, is this due to the fact that the only way to interact with an actor is sending it a message? Therefore, since val confValue is not a local variable, I must either use val a = Config.readValueFromConfig() (a different object, not an actor)
Just because it's not an actor, doesn't mean it's necessarily safe. It probably isn't.
or val a = self ! GetConfigValue and read the result afterwards?
That's almost right. You mean self ? GetConfigValue, I think - that will return a Future, which you can then map over. ! doesn't return anything.
You cannot read from an actor's variables directly inside a Future because (in general) that Future could be running on any thread, on any processor core, and you don't have any memory barrier there to force the CPU caches to reload the value from main memory.

ConcurrentModificationException with WeakHashMap

I have the code below but I'm getting ConcurrentModificationException, how should I avoid this issue? (I have to use WeakHashMap for some reason)
WeakHashMap<String, Object> data = new WeakHashMap<String, Object>();
// some initialization code for data
for (String key : data.keySet()) {
if (data.get(key) != null && data.get(key).equals(value)) {
//do something to modify the key
}
}
The Javadoc for WeakHashMap class explains why this would happen:
Map invariants do not hold for this class. Because the garbage
collector may discard keys at any time, a WeakHashMap may behave as
though an unknown thread is silently removing entries
Furthermore, the iterator generated under the hood by the enhanced for-loop you're using is of fail-fast type as per quoted explanation in that javadoc.
The iterators returned by the iterator method of the collections
returned by all of this class's "collection view methods" are
fail-fast: if the map is structurally modified at any time after the
iterator is created, in any way except through the iterator's own
remove method, the iterator will throw a
ConcurrentModificationException. Thus, in the face of concurrent
modification, the iterator fails quickly and cleanly, rather than
risking arbitrary, non-deterministic behavior at an undetermined time
in the future.
Therefore your loop can throw this exception for these reasons:
Garbage collector has removed an object in the keyset.
Something outside the code added an object to that map.
A modification occurred inside the loop.
As your intent appears to be processing the objects that are not GC'd yet, I would suggest using an iterator as follows:
Iterator<String> it = data.keySet().iterator();
int count = 0;
int maxTries = 3;
while(true) {
try {
while (it.hasNext()) {
String str = it.next();
// do something
}
break;
} catch (ConcurrentModificationException e) {
it = data.keySet().iterator(); // get a new iterator
if (++count == maxTries) throw e;
}
}
You can clone the key set first, but note that you hold the strong reference after that:
Set<KeyType> keys;
while(true) {
try {
keys = new HashSet<>(weakHashMap.keySet());
break;
} catch (ConcurrentModificationException ignore) {
}
}
for (KeyType key : keys) {
// ...
}
WeakHashMap's entries are automatically removed when no ordinary use of the key is realized anymore, this may happens in a different thread. While cloning the keySet() into a different Set a concurrent Thread may remove entries meanwhile, in this case a ConcurrentModificationException will 100% be thrown! You must synchronize the cloning.
Example:
Collections.synchronizedMap(data);
Please understand that
Collections.synchronizedSet(data.keySet());
Can not be used because data.keySet() rely on data's instance who is not synchronized here! More detail: synchronize(keySet) prevents the execution of methods on the keySet but keySet's remove-method is never called but WeakHashMap's remove-method is called so you have to synchronize over WeakHashMap!
Probably because your // do something in the iteration is actually modifying the underlying collection.
From ConcurrentModificationException:
For example, if a thread modifies a collection directly while it is iterating over the collection with a fail-fast iterator, the iterator will throw this exception.
And from (Weak)HashMap's keySet():
Returns a Set view of the keys contained in this map. The set is backed by the map, so changes to the map are reflected in the set, and vice-versa. If the map is modified while an iteration over the set is in progress (except through the iterator's own remove operation), the results of the iteration are undefined.

System.Linq.Dynamic .Select("new ...") does not appear to be thread safe

I grabbed System.Linq.Dynamic.DynamicQueryable from here:
http://weblogs.asp.net/scottgu/archive/2008/01/07/dynamic-linq-part-1-using-the-linq-dynamic-query-library.aspx
The issue that I am running into is in code that looks like this:
var results = dataContext.GetTable<MyClass>.Select("new (MyClassID, Name, Description)").Take(5);
It appears that if that line of code is executed by multiple threads near simultaneously, Microsoft's dynamic Linq code crashes in their ClassFactory.GetDynamicClass() method, which looks like this:
public Type GetDynamicClass(IEnumerable<DynamicProperty> properties)
{
rwLock.AcquireReaderLock(Timeout.Infinite);
try
{
Signature signature = new Signature(properties);
Type type;
if (!classes.TryGetValue(signature, out type))
{
type = CreateDynamicClass(signature.properties);
classes.Add(signature, type); // <-- crashes over here!
}
return type;
}
finally
{
rwLock.ReleaseReaderLock();
}
}
The crash is a simple dictionary error: "An item with the same key has already been added."
In Ms code, The rwLock variable is a ReadWriterLock class, but it does nothing to block multiple threads from getting inside classes.TryGetValue() if statement, so clearly, the Add will fail.
I can replicate this error pretty easily in any code that creates a two or more threads that try to execute the Select("new") statement.
Anyways, I'm wondering if anyone else has run into this issue, and if there are fixes or workarounds I can implement.
Thanks.
I did the following (requires .NET 4 or later to use System.Collections.Concurrent):
changed the classes field to a ConcurrentDictionary<Signature, Type> ,
removed all the ReaderWriterLock rwLock field and all the code referring to it,
updated GetDynamicClass to:
public Type GetDynamicClass(IEnumerable<DynamicProperty> properties) {
var signature = new Signature(properties);
return classes.GetOrAdd(signature, sig => CreateDynamicClass(sig.properties));
}
removed the classCount field and updated CreateDynamicClass to use classes.Count instead:
Type CreateDynamicClass(DynamicProperty[] properties) {
string typeName = "DynamicClass" + Guid.NewGuid().ToString("N");
...

Resources