Why does a Spark Accumulator's output type need to be thread safe? - multithreading

According to AccumulatorV2, the output of it should be
a type that can be read atomically (e.g., Int, Long), or thread-safely
(e.g., synchronized collections) because it will be read from other threads.
Let's say I have a class called CheckSumAccumulator that extends AccumulatorV2 and whose output type is CheckSum. CheckSumAccumulator has a private field called checkSum; CheckSum has a private field called count with public getter and setter methods.
public class CheckSumAccumulator extends AccumulatorV2<String, CheckSum> {
private CheckSum checkSum;
...
}
public class CheckSum implements Serializable {
private long count;
public long getCount() {
return count;
}
public void setCount(long count) {
this.count = count;
}
}
What could go wrong?
Does an Accumulator instance run in a single thread on each executor?

A Spark accumulator is a shared variable used to accumulate values across the tasks and stages of a job. The object that a task updates is not confined to that task's thread: as the AccumulatorV2 documentation quoted above says, its value "will be read from other threads" (an executor may, for example, read it from a reporting thread while the task is still running). If the output type is not thread-safe, such a reader can observe a stale or inconsistent value. In the CheckSum class above, count is a plain long with no synchronization, so a reading thread has no visibility guarantee for writes made through setCount. Thread safety is typically achieved with synchronization (locks) or with atomic types.
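As a minimal sketch of what a thread-safe output type could look like, the class below replaces the plain long field with an AtomicLong. It deliberately does not depend on Spark; SafeCheckSum and SafeCheckSumDemo are hypothetical names, not part of the question or of the Spark API, and the multi-writer loop is only there to make the atomicity visible.

```java
import java.io.Serializable;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical thread-safe variant of the CheckSum class from the question.
class SafeCheckSum implements Serializable {
    private static final long serialVersionUID = 1L;

    // AtomicLong gives atomic updates and makes writes visible to other threads.
    private final AtomicLong count = new AtomicLong();

    long getCount() {
        return count.get();
    }

    void add(long delta) {
        count.addAndGet(delta);
    }
}

class SafeCheckSumDemo {
    // Updates one SafeCheckSum from several threads and returns the final count.
    static long runConcurrently(int threads, int perThread) {
        SafeCheckSum sum = new SafeCheckSum();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    sum.add(1);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return sum.getCount();
    }

    public static void main(String[] args) {
        System.out.println(runConcurrently(4, 10_000)); // prints 40000
    }
}
```

An accumulator whose value() returned an object like this would satisfy the "read thread-safely" requirement quoted in the question; whether that is enough for a real checksum depends on what else the class holds.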

Related

Can the same Lombok builder be accessed/updated by multiple threads?

@Builder
public class X {
@Nonnull String a;
@Nonnull String b;
}
main () {
X.XBuilder builder = X.builder();
//thread 1
CompletableFuture.runAsync(()-> {
builder.a("some");
});
//thread 2
CompletableFuture.runAsync(()-> {
builder.b("thing");
});
}
Here the same object is being accessed and modified by two threads at the same time.
So is this code thread safe?
The use case is calling multiple APIs, where each API's result populates one of the fields of class X.
If you want to know how the stuff that Lombok generates works, you can always use the delombok tool.
With regard to thread safety of @Builder, you will see that you can in fact access a builder instance from multiple threads, but only under these constraints:
Don't call the same field setter from different threads. Otherwise you'll never know which value makes it to the builder eventually.
Make sure that all value-setting threads have terminated before you call build(). (If you want to call build() in a thread, too, make sure you create this thread after all value-setting threads have terminated.)
This is necessary because @Builder wasn't designed for concurrency (as that's not something you typically do with a builder). In particular, @Builder does not use synchronization or volatile fields, so you have to create a happens-before relation for all setter calls with the build() call.
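The two constraints above can be sketched as follows. To keep the example self-contained, X.XBuilder here is a hand-written, simplified stand-in for what Lombok would generate (an assumption; run delombok to see the real output), and BuilderDemo is a hypothetical name. Each setter is called from exactly one task, and build() is only called after waiting for both tasks.

```java
import java.util.concurrent.CompletableFuture;

// Hand-written stand-in for the class that @Builder would generate (simplified).
final class X {
    final String a;
    final String b;

    private X(String a, String b) {
        this.a = a;
        this.b = b;
    }

    static XBuilder builder() {
        return new XBuilder();
    }

    static final class XBuilder {
        // Plain fields: no volatile, no synchronization.
        private String a;
        private String b;

        XBuilder a(String a) { this.a = a; return this; }
        XBuilder b(String b) { this.b = b; return this; }
        X build() { return new X(a, b); }
    }
}

class BuilderDemo {
    static X buildFromTwoThreads() {
        X.XBuilder builder = X.builder();
        // Each field is set by exactly one task.
        CompletableFuture<Void> setA = CompletableFuture.runAsync(() -> builder.a("some"));
        CompletableFuture<Void> setB = CompletableFuture.runAsync(() -> builder.b("thing"));
        // Wait for both setters to finish before build() reads the fields.
        CompletableFuture.allOf(setA, setB).join();
        return builder.build();
    }

    public static void main(String[] args) {
        X x = buildFromTwoThreads();
        System.out.println(x.a + x.b); // prints something
    }
}
```

A design that avoids sharing the builder at all is to let each asynchronous call return its value and fill the builder in a single thread once all results are available.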

In a scala object, is an immutable val thread safe?

object Users {
val userCountByAgeMap = readFromFile(); // read from file returns immutable map
}
As in the example above, Users will be a singleton object and userCountByAgeMap will be initialized lazily.
So is this initialization atomic, i.e. will one and only one thread be able to initialize it?
Suppose userCountByAgeMap is initialized by Thread A; will it be visible to Thread B?
If the initialization is not atomic or memory visibility is not ensured, will making userCountByAgeMap a lazy val fix it?
In Scala, an object is initialized in a static block, so thread safety is guaranteed by the JVM (Java static initializers are thread safe). You can use the JAD decompiler to analyze the bytecode. Here is the code:
object Users {
val userCountByAgeMap = Map.empty[String, Int]
}
And decompiled Users$.class file:
// Decompiled by Jad v1.5.8g. Copyright 2001 Pavel Kouznetsov.
// Jad home page: http://www.kpdus.com/jad.html
// Decompiler options: packimports(3)
// Source File Name: Users.scala
import scala.Predef$;
import scala.collection.immutable.Map;
import scala.collection.immutable.Map$;
public final class Users$
{
public Map userCountByAgeMap()
{
return userCountByAgeMap;
}
private Users$()
{
userCountByAgeMap = Predef$.MODULE$.Map().empty();
}
public static final Users$ MODULE$ = this;
private final Map userCountByAgeMap;
static
{
new Users$();
}
}
Because you're using an immutable Map, you get thread safety automatically, so it's fine to access this field from different threads.
Yes, a val in an object is thread safe; you don't need to change it to a lazy val for thread safety. As @Artavazd Balayan's bytecode shows, a Scala object is equivalent to a Java singleton, so it is roughly equivalent to:
class Users {
public static Users users;
static {
users = new Users();
}
val userCountByAgeMap = readFromFile(); // read from file returns immutable map
}
As we know, Java's static block runs when the class (Users) is loaded, so it's thread safe.
Note also that a lazy val is itself thread safe: the field is initialized only when it is first accessed, and that initialization is protected against concurrent access by double-checked locking.
http://docs.scala-lang.org/sips/pending/improved-lazy-val-initialization.html
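The static-initializer guarantee that both answers rely on can be illustrated in plain Java. This is only a sketch: the Users class below is a hypothetical Java analogue of the Scala object, readFromFile is a stub that returns a fixed immutable map instead of reading a file, and INIT_CALLS exists purely to count how often the initializer runs.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical Java analogue of the Scala object Users.
final class Users {
    // Counts initializer runs; declared first so it exists before the map is built.
    static final AtomicInteger INIT_CALLS = new AtomicInteger();

    // Assigned once during class initialization, which the JVM runs under a lock.
    static final Map<String, Integer> USER_COUNT_BY_AGE = readFromFile();

    private Users() {}

    // Stub standing in for the file read; returns an immutable map.
    private static Map<String, Integer> readFromFile() {
        INIT_CALLS.incrementAndGet();
        return Collections.singletonMap("30", 2);
    }
}

class UsersDemo {
    // Reads the map from several threads and reports how often it was initialized.
    static int initCallsAfterConcurrentReads(int threads) {
        Thread[] readers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            readers[i] = new Thread(() -> Users.USER_COUNT_BY_AGE.get("30"));
            readers[i].start();
        }
        for (Thread t : readers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return Users.INIT_CALLS.get();
    }

    public static void main(String[] args) {
        System.out.println(initCallsAfterConcurrentReads(8)); // prints 1
    }
}
```

However many threads race to touch Users first, the initializer runs once, and every thread sees the fully built map afterwards.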

Template method in threaded contexts

Let's say we have a template method that looks like this
abstract class Worker
{
    public void DoJob()
    {
        BeforJob();
        DoRealJob();
        AfterJob();
    }
    // abstract members cannot be private, so the hook is declared protected
    protected abstract void DoRealJob();
}
Subclasses that inherit from the Worker class should implement the DoRealJob() method.
When the implementation runs on the same thread, everything is fine: the three parts of the DoJob() method are executed in this order:
BeforJob()
DoRealJob()
AfterJob()
But when DoRealJob() runs on another thread, AfterJob() may be executed before DoRealJob() has completed.
My current solution is to let the subclasses call AfterJob(), but this doesn't prevent a subclass from forgetting to call it, and we lose the benefit of a template method.
Are there other ways to get a consistent call order regardless of whether DoRealJob() is blocking or not?
You can't have both simple inheritance (signature and hooking) and support for asynchronous operations in your code.
These two goals are mutually exclusive.
The inheritors must be aware of the callback mechanism, either direct (Tasks, async) or indirect (events, callback functions, Auto(Manual)ResetEvents or other synchronization constructs). Some of them are new, some old, and it is difficult to say which one will be better for a concrete use case.
Well, it may look like there is a simple way with multithreaded code, but what if your DoRealJob will actually run in another process or use some remote job queuing, so the real job will be executed even outside your app?
So:
If you really consider that your class will be used as the basis for some
async worker, then you should design it accordingly.
If not - do not overengineer. You can't consider any possible
scenario. Just document your class well enough and I doubt that
anyone will try to implement the DoRealJob asynchronously,
especially if you name it DoRealJobSynchronously. If someone tries to
do it then in that case your conscience can be pristinely clean.
EDIT:
Do you think it would be correct if I provide both versions, sync and
async, of DoRealJob and a flag IsAsynchronous so I can decide which
one to call
As I have already said I don't know your actual usage scenarios. And it is unrealistic to consider that the design will be able to effectively handle all of them.
Also there are two very important questions to consider that pertain to your overall Worker class and its DoJob method:
1) You have to determine whether you want the DoJob method to be synchronous or asynchronous, or do you want to have both the synchronous and asynchronous versions? It is not directly related to your question, but it is still very important design decision, because it will have great impact on your object model. This question could be rephrased as:
Do you want the DoJob method to block any actions after it is called until it does its job or do you want to call it as some StartJob method, that will just launch the real processing but it is up to other mechanisms to notify you when the job has ended(or to stop it manually):
//----------------Sync worker--------------------------
SyncWorker syncWorker = CreateSyncStringWriter("The job is done");
Console.WriteLine("SyncWorker will be called now");
syncWorker.DoJob(); // "The job is done" is written here
Console.WriteLine("SyncWorker call ended");
//----------------Async worker--------------------------
Int32 delay = 1000;
AsyncWorker asyncWorker = CreateAsyncStringWriter("The job is done", delay);
Console.WriteLine("AsyncWorker will be called now");
asyncWorker.StartDoJob(); // "The job is done" won't probably be written here
Console.WriteLine("AsyncWorker call ended");
// "The job is done" could be written somewhere here.
2) If you want DoJob to be async(or to have async version) you should consider whether you want to have some mechanisms that will notify when DoJob finishes the processing - Async Programming Patterns , or it is absolutely irrelevant for you when or whether at all it ends.
SO:
Do you have the answers to these two questions?
If yes - that is good.
If not - refine and consider your requirements.
If you are still unsure - stick with simple sync methods.
If you, however, think that you need some async-based infrastructure, then, taking into account that it is C# 3.0, you should use the Asynchronous Programming Model.
Why this one and not the event based? Because IAsyncResult interface despite its cumbersomeness is quite generic and can be easily used in Task-based model, simplifying future transition to higher .NET versions.
It will be something like:
/// <summary>
/// Interface for both the sync and async job.
/// </summary>
public interface IWorker
{
void DoJob();
IAsyncResult BeginDoJob(AsyncCallback callback);
void EndDoJob(IAsyncResult asyncResult);
}
/// <summary>
/// Base class that provides DoBefore and DoAfter methods
/// </summary>
public abstract class Worker : IWorker
{
protected abstract void DoBefore();
protected abstract void DoAfter();
public IAsyncResult BeginDoJob(AsyncCallback callback)
{
return new Action(((IWorker)this).DoJob)
.BeginInvoke(callback, null);
}
//...
}
public abstract class SyncWorker : Worker
{
abstract protected void DoRealJobSync();
public void DoJob()
{
DoBefore();
DoRealJobSync();
DoAfter();
}
}
public abstract class AsyncWorker : Worker
{
abstract protected IAsyncResult BeginDoRealJob(AsyncCallback callback);
abstract protected void EndDoRealJob(IAsyncResult asyncResult);
public void DoJob()
{
DoBefore();
IAsyncResult asyncResult = this.BeginDoRealJob(null);
this.EndDoRealJob(asyncResult);
DoAfter();
}
}
P.S.: This example is incomplete and not tested.
P.P.S: You may also consider using delegates in place of abstract (virtual) methods to express your jobs:
public class ActionWorker : Worker
{
private Action doRealJob;
//...
public ActionWorker(Action doRealJob)
{
if (doRealJob == null)
throw new ArgumentNullException();
this.doRealJob = doRealJob;
}
public void DoJob()
{
this.DoBefore();
this.doRealJob();
this.DoAfter();
}
}
DoBefore and DoAfter can be expressed in a similar way.
P.P.P.S: Action delegate is a 3.5 construct, so you will probably have to define your own delegate that accepts zero parameters and returns void.
public delegate void MyAction();
Consider changing DoRealJob to DoRealJobAsync and giving it a Task return value, so you can await the eventual asynchronous result.
So your code would look like this:
abstract class Worker
{
    // await is only allowed in an async method, so DoJob returns a Task
    public async Task DoJob()
    {
        BeforJob();
        await DoRealJobAsync();
        AfterJob();
    }
    protected abstract Task DoRealJobAsync();
}
If you don't have .NET 4.5 and don't want to use the old 3.0 CTP of async, you could use the normal task-based style:
abstract class Worker
{
    public void DoJob()
    {
        BeforJob();
        // AfterJob runs as a continuation once the real job's task has finished
        DoRealJobAsync().ContinueWith(prevTask =>
        {
            AfterJob();
        });
    }
    protected abstract Task DoRealJobAsync();
}
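For readers more at home on the JVM, the same continuation idea can be sketched in Java with CompletableFuture. This is an illustration rather than a translation of the answer's code: AsyncTemplateWorker, SleepyWorker and TemplateDemo are hypothetical names, and the log list exists only so the call order can be inspected.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

abstract class AsyncTemplateWorker {
    // Records the call order; only needed for the demonstration.
    final List<String> log = Collections.synchronizedList(new ArrayList<>());

    // Template method: afterJob is chained onto the future, so it cannot
    // run before the real job has completed, whichever thread runs it.
    final CompletableFuture<Void> doJob() {
        beforeJob();
        return doRealJobAsync().thenRun(this::afterJob);
    }

    void beforeJob() { log.add("before"); }
    void afterJob() { log.add("after"); }

    // Subclasses return a future instead of calling afterJob themselves.
    protected abstract CompletableFuture<Void> doRealJobAsync();
}

class SleepyWorker extends AsyncTemplateWorker {
    @Override
    protected CompletableFuture<Void> doRealJobAsync() {
        return CompletableFuture.runAsync(() -> {
            try {
                Thread.sleep(50); // simulate slow work on another thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            log.add("real");
        });
    }
}

class TemplateDemo {
    static List<String> run() {
        SleepyWorker worker = new SleepyWorker();
        worker.doJob().join(); // wait until afterJob has run
        return new ArrayList<>(worker.log);
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [before, real, after]
    }
}
```

A synchronous subclass can still fit this shape by doing its work inline and returning an already completed future.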

Thread-Safe Properties in C#

I know that this subject is slightly "Played Out", but I am still terribly confused. I have a class with properties that will be updated by multiple threads, and I am trying to allow the properties to be updated in a thread-safe manner.
Below, I have included a few examples of what I have tried thus far (the class is contained within a BindingList so its properties call a PropertyChangingEventHandler event).
Method 1 - Doubles
private double _Beta;
public double Beta
{
get
{
return _Beta;
}
}
private readonly object BetaLocker = new object();
public void UpdateBeta(double Value)
{
lock (BetaLocker)
{
_Beta = Value;
NotifyPropertyChanged("Beta");
}
}
Method 2 - Ints
private int _CurrentPosition;
public int CurrentPosition
{
get
{
return _CurrentPosition;
}
}
public void UpdatePosition(int UpdateQuantity)
{
Interlocked.Add(ref _CurrentPosition, UpdateQuantity);
NotifyPropertyChanged("CurrentPosition");
}
Basically - is the current way that I am creating properties completely threadsafe for both ints and doubles?
You have to ask yourself what it means to be Thread Safe (yes, it's a link to wikipedia and it's blacked out ^_^):
A piece of code is thread-safe if it only manipulates shared data structures in a manner that guarantees safe execution by multiple threads at the same time. There are various strategies for making thread-safe data structures.
So now you have to determine whether your code guarantees safe execution when run by multiple threads. The quick answer is that the updates in both of your code samples are thread safe! (One caveat: the Beta getter reads a double without taking the lock, and C# does not guarantee atomic reads of 64-bit values, so on a 32-bit runtime a reader could see a torn value.) However (and this is a big one), you also have to consider how the object is used and determine whether that usage is thread safe too. Here is an example:
if(instance.Beta==10.0)
{
instance.UpdateBeta(instance.Beta*10.0);
}
// what's instance.Beta now?
In this case you have absolutely no guarantee that Beta will be 100.0 because beta could have changed after you checked it. Imagine this situation:
Thread 2: UpdateBeta(10.0)
Thread 1: if(Beta == 10.00)
Thread 2: UpdateBeta(20.0)
Thread 1: UpdateBeta(Beta*10.0)
// Beta is now 200.0!!!
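The quick and dirty way to fix this is to use a double-checked lock (this only helps if every thread that updates Beta takes the same lock on instance, because UpdateBeta itself locks a different object, BetaLocker):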
if(instance.Beta==10.0)
{
lock(instance)
{
if(instance.Beta==10.0)
{
instance.UpdateBeta(instance.Beta*10.0);
}
}
}
The same is true for CurrentPosition.
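Another way to close the check-then-act gap is to move the whole compound operation into the object, so the comparison and the update happen under one lock. The sketch below shows the idea in Java rather than C#; BetaHolder, multiplyIfEquals and BetaDemo are hypothetical names that do not appear in the question.

```java
import java.util.concurrent.atomic.AtomicInteger;

class BetaHolder {
    private double beta; // guarded by the intrinsic lock on this

    synchronized double getBeta() {
        return beta;
    }

    synchronized void updateBeta(double value) {
        beta = value;
    }

    // Check and update under the same lock, so no writer can slip in between.
    synchronized boolean multiplyIfEquals(double expected, double factor) {
        if (beta != expected) {
            return false;
        }
        beta = beta * factor;
        return true;
    }
}

class BetaDemo {
    // Lets several threads race on the same conditional update and
    // returns how many of them actually performed it.
    static int race(BetaHolder holder, int threads) {
        AtomicInteger successes = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                if (holder.multiplyIfEquals(10.0, 10.0)) {
                    successes.incrementAndGet();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return successes.get();
    }

    public static void main(String[] args) {
        BetaHolder holder = new BetaHolder();
        holder.updateBeta(10.0);
        System.out.println(race(holder, 8) + " " + holder.getBeta()); // prints 1 100.0
    }
}
```

Because callers never read the value and write it back in two separate steps, the outcome no longer depends on how the threads interleave.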

Java: singleton, static variable and thread safety

class MyClass
{
private static MyClass obj;
public static MyClass getInstance()
{
if(obj==null)
{
obj = new MyClass();
}
return obj;
}
}
In the above Java code sample, because obj is a static variable inside the class,
will getInstance still be non-thread-safe? Static variables are shared by all threads, so two simultaneous threads would be using the same object, wouldn't they?
Because static variables are so widely shared they are extremely un-thread safe.
Consider what happens if two threads call your getInstance at the same time. Both threads will be looking at the shared static obj and both threads will see that obj is null in the if check. Both threads will then create a new obj.
You may think: "hey, it is thread safe since obj will only ever have one value, even if it is initialized multiple times." There are several problems with that statement. In our previous example, the callers of getInstance will both get their own obj back. If both callers keep their references to obj then you will have multiple instances of your singleton being used.
Even if the callers in our previous example just did: MyClass.getInstance(); and didn't save a reference to what MyClass.getInstance(); returned, you can still end up getting different instances back from getInstance on those threads. You can even get into the condition where new instances of obj are created even when the calls to getInstance do not happen concurrently!
I know my last claim seems counter-intuitive since the last assignment to obj would seem to be the only value that could be returned from future calls to MyClass.getInstance(). You need to remember, however, that each thread in the JVM has its own local cache of main memory. If two threads call getInstance, their local caches could have different values assigned to obj and future calls to getInstance from those threads will return what is in their caches.
The simplest way to make sure that getInstance is thread safe is to make the method synchronized. This ensures that:
Two threads cannot enter getInstance at the same time.
Threads trying to use obj will never get a stale value of obj from their cache.
Don't try to get clever and use double-checked locking (it is only correct when the field is volatile, on Java 5 or later):
http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html
Good explanation can be found here:
http://en.wikipedia.org/wiki/Singleton_pattern
The wiki article highlights various thread-safe approaches along with some of their pros and cons.
In this case getInstance() is not thread-safe, even though you use a static variable. Only synchronization makes it thread-safe.
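One further approach, not proposed in the answers here, is the initialization-on-demand holder idiom. It is sketched below as an alternative to a synchronized getInstance: it relies on the JVM's class-initialization locking instead of an explicit lock. SingletonDemo and the Holder class are illustrative names.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class MyClass {
    private MyClass() {
        // your initialization code
    }

    // Holder is not initialized until getInstance() first touches it;
    // the JVM guarantees that this initialization happens exactly once.
    private static final class Holder {
        static final MyClass INSTANCE = new MyClass();
    }

    static MyClass getInstance() {
        return Holder.INSTANCE;
    }
}

class SingletonDemo {
    // Calls getInstance() from several threads and counts distinct objects seen.
    static int distinctInstances(int threads) {
        Set<MyClass> seen = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> seen.add(MyClass.getInstance()));
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctInstances(16)); // prints 1
    }
}
```

The instance is still created lazily, and getInstance needs neither synchronized nor volatile.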
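The following example shows an unusual, modified singleton pattern that is meant to be thread-safe and supports generics as well.
To make it thread-safe, use the synchronized block and the volatile keyword (transient has no effect on a static field, since static fields are never serialized).
Notice the double check: the synchronized block is inside an if. This improves performance, because synchronized is expensive. Be aware, though, that in the Edge example the first check reads a plain HashMap outside the lock while another thread may be writing to it, which HashMap does not support; the Singleton class further below does not have this problem.
Of course, a real singleton would not use maps; as I said, this is a modified one.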
public class Edge<T> {
@SuppressWarnings({"unchecked"})
private static transient volatile HashMap<Object,HashMap<Object, Edge>> instances = new HashMap<Object, HashMap<Object,Edge>>();
/**
* This function is used to get an Edge instance
* @param <T> Datatype of the nodes.
* @param node1, the source node
* @param node2, the destination node
* @return the edge of the two nodes.
*/
@SuppressWarnings({"unchecked"})
public static <T> Edge<T> getInstance(T node1, T node2){
if(!(instances.containsKey(node1) && instances.get(node1).containsKey(node2))){
synchronized (Edge.class) {
if(!(instances.containsKey(node1) && instances.get(node1).containsKey(node2))){
Edge<T> edge = new Edge<T>(node1, node2);
if(!instances.containsKey(node1)){
instances.put(node1, new HashMap<Object, Edge>());
}
instances.get(node1).put(node2, edge);
}
}
}
return (Edge<T>)instances.get(node1).get(node2);
}
// The fields and the Edge(T, T) constructor are not shown in the original.
}
public class Singleton{
private static transient volatile Singleton instance;
public static Singleton getInstance(){
if(instance==null)synchronized(Singleton.class){
if(instance==null){
instance = new Singleton();
}
}
return instance;
}
private Singleton(){
/*....*/
}
}
Page 182:
http://books.google.com/books?id=GGpXN9SMELMC&printsec=frontcover&dq=design+patterns&hl=de&ei=EFGCTbyaIozKswbHyaiCAw&sa=X&oi=book_result&ct=result&resnum=2&ved=0CDMQ6AEwAQ#v=onepage&q&f=false
Think this can be tagged as answered now.
class MyClass
{
private static MyClass obj;
private MyClass(){
// your initialization code
}
public static synchronized MyClass getInstance()
{
if(obj==null)
{
obj = new MyClass();
}
return obj;
}
}
I agree with @Manoj.
I believe the above is one of the best ways to implement a singleton object.
Synchronization makes it thread safe,
even though it's static :)
