I'm using Spark Streaming to process a stream by handling each partition (saving events to HBase), then acking the last event in each RDD from the driver to the receiver, so that the receiver can ack it to its source in turn.
public class StreamProcessor {

    final AckClient ackClient;

    public StreamProcessor(AckClient ackClient) {
        this.ackClient = ackClient;
    }

    public void process(final JavaReceiverInputDStream<Event> inputDStream) {
        inputDStream.foreachRDD(rdd -> {
            JavaRDD<Event> lastEvents = rdd.mapPartitions(events -> {
                // ------ this code executes on the worker -------
                // process events one by one; I don't use ackClient here
                // return the event with the max delivery tag here
            });
            // ------ this code executes on the driver -------
            Event lastEvent = ...; // find the event with the max delivery tag across partitions
            ackClient.ack(lastEvent); // use ackClient to ack the last event
        });
    }
}
The problem here is that I get the following error (even though everything seems to work fine):
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1435)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:602)
at org.apache.spark.api.java.JavaRDDLike$class.mapPartitions(JavaRDDLike.scala:141)
at org.apache.spark.api.java.JavaRDD.mapPartitions(JavaRDD.scala:32)
...
Caused by: java.io.NotSerializableException: <some non-serializable object used by AckClient>
...
It seems that Spark is trying to serialize AckClient to send it to the workers, but I thought that only code inside mapPartitions is serialized/shipped to the workers, and that the code at the RDD level (i.e. inside foreachRDD but not inside mapPartitions) would not be serialized/shipped to the workers.
Can someone confirm if my thinking is correct or not? And if it is correct, should this be reported as a bug?
You are correct; this was fixed in 1.1. However, if you look at the stack trace, the closure cleaner that is throwing is being invoked from your mapPartitions call:
at org.apache.spark.SparkContext.clean(SparkContext.scala:1435)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:602)
So the problem has to do with your mapPartitions closure. Make sure that you aren't accidentally capturing this (and with it the non-serializable AckClient) inside it, as that is a common issue.
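As an illustration of what to check: the per-partition closure should only reference its own parameters and local variables, never the enclosing StreamProcessor (directly, through an instance method, or through an anonymous inner class). A minimal sketch under that constraint; it assumes Event exposes a numeric getDeliveryTag() accessor, which is not in the original code:

    public void process(final JavaReceiverInputDStream<Event> inputDStream) {
        inputDStream.foreachRDD(rdd -> {
            // The closure below uses only its parameter and local variables, so it does not
            // drag the enclosing StreamProcessor (and with it ackClient) into serialization.
            // Returning an Iterable matches the Spark 1.x Java API; 2.x expects an Iterator.
            JavaRDD<Event> lastEvents = rdd.mapPartitions(events -> {
                Event maxEvent = null;
                while (events.hasNext()) {
                    Event event = events.next();
                    // save the event to HBase here ...
                    if (maxEvent == null || event.getDeliveryTag() > maxEvent.getDeliveryTag()) {
                        maxEvent = event;
                    }
                }
                return maxEvent == null
                        ? java.util.Collections.<Event>emptyList()
                        : java.util.Collections.singletonList(maxEvent);
            });
            // Back on the driver: using ackClient here is fine.
            Event lastEvent = null;
            for (Event e : lastEvents.collect()) {
                if (lastEvent == null || e.getDeliveryTag() > lastEvent.getDeliveryTag()) {
                    lastEvent = e;
                }
            }
            if (lastEvent != null) {
                ackClient.ack(lastEvent);
            }
        });
    }

If the closure really does need a client on the executors, the usual alternative is to create it inside mapPartitions (one instance per partition) rather than serializing it from the driver.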
Related
I am using a Scala Iterator for a waiting loop inside a synchronized block:
anObject.synchronized {
  if (Try(anObject.foo()).isFailure) {
    Iterator.continually {
      anObject.wait()
      Try(anObject.foo())
    }.dropWhile(_.isFailure).next()
  }
  anObject.notifyAll()
}
Is it acceptable to use Iterator with concurrency and multithreading? If not, why not? And what should be used instead, and how?
There are some details, if it matters. anObject is a mutable queue, and there are multiple producers and consumers to the queue, so the block above is the code of such a producer or consumer. anObject.foo is a simplified stand-in for a function that either enqueues (for a producer) or dequeues (for a consumer) data to/from the queue.
Iterator is mutable internally, so you have to take that into consideration if you use it in a multi-threaded environment. If you can guarantee that you won't end up in a situation where, e.g.:
2 threads check hasNext()
one of them calls next() - it happens to be the last element
the other calls next() - NPE
(or similar), then you should be OK. In your example the Iterator doesn't even leave the scope, so the errors shouldn't come from the Iterator.
However, in your code I see an issue with having anObject.wait() and anObject.notifyAll() next to each other: if you call .wait, you will never reach the .notifyAll that would unblock it. You can check in the REPL that this hangs:
@ val anObject = new Object { def foo() = throw new Exception }
anObject: {def foo(): Nothing} = ammonite.$sess.cmd21$$anon$1@126ae0ca
@ anObject.synchronized {
    if (Try(anObject.foo()).isFailure) {
      Iterator.continually {
        anObject.wait()
        Try(anObject.foo())
      }.dropWhile(_.isFailure).next()
    }
    anObject.notifyAll()
  }
// waits indefinitely
I would suggest changing the design to NOT rely on wait and notifyAll. However, from your code it is hard to say what you want to achieve, so I cannot tell whether this is more like a Promise/Future case, monix.Observable, monix.Task, or something else.
If your use case is a queue with producers and consumers, then it sounds like a use case for reactive streams - e.g. FS2 + Monix, but it could be FS2 + IO or something from Akka Streams:
val queue: Queue[Task, Item] // depending on the use case the queue might need to be bounded

// in one part of the application
queue.enqueue1(item) // Task[Unit]

// in another part of the application
queue
  .dequeue
  .evalMap { item =>
    // ...
    result: Task[Result]
  }
  .compile
  .drain
This approach requires a change in how you think about designing an application, because you no longer work with threads directly; instead you design a flow of data, declaring what is sequential and what can be done in parallel, and threads become just an implementation detail.
My program does the following at a high level:
Task 1
get the data from the System X
the Java DSL split
post the data to the System Y
post the reply data to the X
the Java DSL aggregate
Task 2
get the data from the System X
the Java DSL split
post the data to the System Y
post the reply data to the X
the Java DSL aggregate
...
The problem is that when one "post the data to the System Y" sub-task fails, the error message is correctly sent back to the System X, but after that no other sub-tasks or tasks are executed.
My error handler does this:
...
Message<String> newMessage = MessageBuilder.withPayload("error occurred")
        .copyHeadersIfAbsent(message.getPayload().getFailedMessage().getHeaders()).build();
...
// set some extra headers etc.
...
return newMessage;
What could be the problem?
Edit:
I debugged Spring Integration. In the error situation, only the first error message reaches the method AbstractCorrelatingMessageHandler.handleMessageInternal; the other successful and failing messages never reach it.
If there are no errors, all the messages reach the method and the group is finally released.
What could be wrong in my program?
Edit 2:
This is working:
Added the advice for the Http.outboundGateway:
.handle(Http.outboundGateway(...),
        c -> c.advice(myAdvice()))
and the myAdvice bean
@Bean
private Advice myAdvice() {
    return new MyAdvice();
}
and the MyAdvice class
public class MyAdvice<T> extends AbstractRequestHandlerAdvice {

    @SuppressWarnings("unchecked")
    @Override
    protected Object doInvoke(final ExecutionCallback callback, final Object target, final Message<?> message)
            throws Exception {
        ...
        try {
            result = (MessageBuilder<T>) callback.execute();
        } catch (final MessageHandlingException e) {
            // take the exception cause for the new payload
        }
        // return a new message with the old headers, the replyChannel header,
        // and result.payload or the exception cause as the payload
    }
}
There is nothing wrong with your program; that's exactly how a regular loop works in Java. To catch an exception for each iteration and continue with the remaining items, you definitely need a try..catch in the Java loop. So you need to apply something similar here for the splitter. It can be achieved with an ExpressionEvaluatingRequestHandlerAdvice, an ExecutorChannel as the output from the splitter, or with a gateway call via a service activator on the splitter's output channel.
Since the story is about an aggregator afterward, you still need to finish the group somehow, and this can only be done with an error compensation message emitted from the error handling and returned to the aggregator's input channel. In this case you need to make sure to copy the request headers from the failedMessage of the MessagingException thrown to the error flow. After aggregation of the group you would need to separate the messages with errors from the normal ones. That can be done with a special payload, or you may just use the exception as the payload, so that errors can be properly distinguished from normal messages in the final result from the aggregator.
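For the ExpressionEvaluatingRequestHandlerAdvice option mentioned above, here is a minimal sketch of how it could be wired onto the HTTP gateway; the bean name, the SpEL expression, and the use of the ...ExpressionString setters (present in recent Spring Integration versions) are my assumptions, not part of the original flow:

    @Bean
    public Advice errorCompensationAdvice() {
        ExpressionEvaluatingRequestHandlerAdvice advice = new ExpressionEvaluatingRequestHandlerAdvice();
        // On failure, evaluate this SpEL expression against the failed message;
        // '#exception' is the exception thrown by the HTTP gateway.
        advice.setOnFailureExpressionString("#exception.cause.message");
        // Return the expression result as the reply instead of rethrowing, so the split
        // continues and the aggregator still receives a (compensation) message for the group.
        advice.setReturnFailureExpressionResult(true);
        advice.setTrapException(true);
        return advice;
    }

    // applied the same way as the custom advice in the question:
    // .handle(Http.outboundGateway(...), c -> c.advice(errorCompensationAdvice()))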
I have a method which creates an emitter like the one below. There is a problem (maybe it is normal behavior) with calling onError in the Retrofit callback: I get an UndeliverableException when I try to call onError.
I can avoid this by checking subscriber.isDisposed(), but I wonder how I can call onError, because I need to notify my UI level.
Addition 1
--> RxJava2CallAdapterFactory already implemented
private static Retrofit.Builder builderSwift = new Retrofit.Builder()
.baseUrl(URL_SWIFT)
.addCallAdapterFactory(RxJava2CallAdapterFactory.create())
.addConverterFactory(GsonConverterFactory.create())
.addConverterFactory(new ToStringConverterFactory());
--> When I added the code below to the application class the app won't crash,
--> but I get java.lang.Exception instead of my custom exception
RxJavaPlugins.setErrorHandler(Functions.<Throwable>emptyConsumer());
@Override
public void onFileUploadError(Throwable e) {
    Log.d(TAG, "onFileUploadError: " + e.getMessage());
}
public Observable<UploadResponseBean> upload(final UploadRequestBean uploadRequestBean, final File file) {
    return Observable.create(new ObservableOnSubscribe<UploadResponseBean>() {
        @Override
        public void subscribe(@NonNull final ObservableEmitter<UploadResponseBean> subscriber) throws Exception {
            // ---> there is no problem with the subscriber while calling onError
            // ---> Retrofit2 service request
            ftsService.upload(token, uploadRequestBean, body).enqueue(new Callback<UploadResponseBean>() {
                @Override
                public void onResponse(Call<UploadResponseBean> call, Response<UploadResponseBean> response) {
                    if (response.code() == 200) {
                        // ---> calling onNext works properly
                        subscriber.onNext(new UploadResponseBean(response.body().getUrl()));
                    }
                    else {
                        // ---> calling onError throws UndeliverableException
                        subscriber.onError(new NetworkConnectionException(response.message()));
                    }
                }
                @Override
                public void onFailure(Call call, Throwable t) {
                    subscriber.onError(new NetworkConnectionException(t.getMessage()));
                }
            });
        }
    });
}
Since version 2.1.1 tryOnError is available:
The emitter API (such as FlowableEmitter, SingleEmitter, etc.) now
features a new method, tryOnError that tries to emit the Throwable if
the sequence is not cancelled/disposed. Unlike the regular onError, if
the downstream is no longer willing to accept events, the method
returns false and doesn't signal an UndeliverableException.
https://github.com/ReactiveX/RxJava/blob/2.x/CHANGES.md
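Applied to the onFailure callback from the question, only the emitter call changes (a minimal sketch):

    @Override
    public void onFailure(Call call, Throwable t) {
        // tryOnError emits the error only if the downstream is still subscribed;
        // otherwise it returns false instead of signalling UndeliverableException.
        subscriber.tryOnError(new NetworkConnectionException(t.getMessage()));
    }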
The problem is, like you say, that you need to check whether the Subscriber is already disposed; that's because RxJava2 is stricter about errors that are thrown after the Subscriber is already disposed.
RxJava2 delivers this kind of error to RxJavaPlugins.onError, which by default prints the stack trace and calls the thread's uncaught exception handler. You can read the full explanation here.
Now, what happens here is that you probably unsubscribed (disposed) from this Observable before the query was done and the error was delivered, and as such you get the UndeliverableException.
I wonder how I can call onError because I need to notify my UI level.
As this happened after your UI was unsubscribed, the UI shouldn't care; in the normal flow this error will be delivered properly.
Some general points regarding your implementation:
the same issue will happen at the onError call in onResponse in case you've been unsubscribed before.
there is no cancellation logic here (that's what is causing this problem), so the request continues even if the Subscriber unsubscribed.
even if you implement this logic (using ObservableEmitter.setCancellable() / setDisposable()), you will still encounter this problem if you unsubscribe before the request is done - the cancellation will make your onFailure logic call onError(), and the same issue will happen.
as you are performing an async call via Retrofit, the specified subscription Scheduler will not make the actual request happen on the Scheduler thread, only the subscription. You can use Observable.fromCallable and Retrofit's blocking execute() call to gain more control over the thread on which the actual call happens (see the sketch after this list).
to sum it up -
guarding calls to onError() with ObservableEmitter.isDisposed() is a good practice in this case.
But I think the best practice is to use the Retrofit RxJava call adapter, so you'll get a wrapped Observable that does the Retrofit call and already takes all of these considerations into account.
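A minimal sketch of the Observable.fromCallable approach mentioned above, reusing the names from the question (ftsService, token, body, NetworkConnectionException) and assuming ftsService.upload(...) returns a Retrofit Call:

    public Observable<UploadResponseBean> upload(final UploadRequestBean uploadRequestBean, final File file) {
        return Observable.fromCallable(() -> {
            // Blocking call: with .subscribeOn(Schedulers.io()) at the call site,
            // the actual HTTP request runs on that Scheduler's thread.
            Response<UploadResponseBean> response =
                    ftsService.upload(token, uploadRequestBean, body).execute();
            if (response.code() == 200) {
                return new UploadResponseBean(response.body().getUrl());
            }
            throw new NetworkConnectionException(response.message());
        });
    }

You would typically add .subscribeOn(Schedulers.io()) at the call site so the blocking execute() runs off the main thread.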
I found out that this issue was caused by using the incorrect context when retrieving the view model in the Fragment:
ViewModelProviders.of(requireActivity(), myViewModelFactory).get(MyViewModel.class);
Because of this, the view model lived in the context of the activity instead of the fragment. Changing it to the following code fixed the problem.
ViewModelProviders.of(this, myViewModelFactory).get(MyViewModel.class);
I have the following scenario
I am writing a server that process files (jobs)
a file has a "prefix" and a time
the files should be processed according to time (older file first) but also take into account the prefix (files with same prefix can't be processed concurrently)
I have a thread (Task with Timer) that watches over a directory and adds files to a "queue" (producer)
I have several consumers that take the file from "queue" (consumer) - they should conform to the above rules.
the job of each task is kept in some list (this indicates the constraints)
There are several consumers, the number of consumers is determined at startup.
One of the requirement is to be able to gracefully stop the consumers (either immediately or let ongoing processes to finish).
I did something along these lines:
while (processing)
{
    // limits the number of concurrent tasks
    _processingSemaphore.Wait(queueCancellationToken);
    // take the next job when available, or wait for the cancel signal
    currentWork = workQueue.Take(taskCancellationToken);
    // check that it can actually process this work
    if (CanProcess(currentWork))
    {
        var task = CreateTask(currentWork);
        task.ContinueWith((t) => { /* release processing slot */ });
    }
    else
    {
        // release slot, return job? something else?
    }
}
The cancellation token sources are in the caller code and can be cancelled. There are two of them in order to be able to stop queuing while not cancelling running tasks.
I tried to implement the "queue" as a BlockingCollection wrapping a "safe" SortedSet. The general idea works (ordering by time) except for the case in which I need to find a new job that matches the constraints. If I return the job to the queue and try to take it again, I will get the same one.
It is possible to take jobs from the queue until I find a proper one and then return the "illegal" jobs, but this may cause issues with other consumers processing jobs out of order.
Another alternative is to pass a simple collection and a way to lock it and just lock and do a simple search according to current constraints. Again, this means writing code that will possibly not be thread-safe.
Any other suggestion / pointers / data structures that can help?
I think Hans is right: if you already have a thread-safe SortedSet (that implements IProducerConsumerCollection, so it can be used in BlockingCollection), then all you need is to put only files that can be processed right now into the collection. If you finish a file which makes another file available for processing, add the other file to the collection at this point, not earlier.
I would have implemented your requirement(s) with TPL Dataflow. Look at the way you could implement the Producer-Consumer pattern with it. I believe this will meet all the requirements you have (including cancellation on the consumers).
EDIT (for those that do not like to read documentation, but who does...)
Here is an example of how you could implement the requirements with TPL Dataflow. The beauty of this implementation is that consumers are not bound to a single thread and only use a pool thread when they need to process data.
static void Main(string[] args)
{
    BufferBlock<string> source = new BufferBlock<string>();
    var cancellation = new CancellationTokenSource();

    LinkConsumer(source, "A", cancellation.Token);
    LinkConsumer(source, "B", cancellation.Token);
    LinkConsumer(source, "C", cancellation.Token);

    // Link an action that will process source values that are not processed by the others
    source.LinkTo(new ActionBlock<string>((s) => Console.WriteLine("Default action")));

    while (cancellation.IsCancellationRequested == false)
    {
        ConsoleKey key = Console.ReadKey(true).Key;
        switch (key)
        {
            case ConsoleKey.Escape:
                cancellation.Cancel();
                break;
            default:
                Console.WriteLine("Posted value {0} on thread {1}.", key, Thread.CurrentThread.ManagedThreadId);
                source.Post(key.ToString());
                break;
        }
    }

    source.Complete();
    Console.WriteLine("Done.");
    Console.ReadLine();
}

private static void LinkConsumer(ISourceBlock<string> source, string prefix, CancellationToken token)
{
    // Link a consumer that will buffer and process all input with the specified prefix
    var consumer = new ActionBlock<string>(
        new Action<string>(Process),
        new ExecutionDataflowBlockOptions()
        {
            MaxDegreeOfParallelism = 1,
            SingleProducerConstrained = true,
            CancellationToken = token,
            TaskScheduler = TaskScheduler.Default
        });

    var linkDisposable = source.LinkTo(consumer, (p) => p == prefix);

    // Dispose the link (remove the link) when cancellation is requested.
    token.Register(linkDisposable.Dispose);
}

private static void Process(string arg)
{
    Console.WriteLine("Processed value {0} in thread {1}", arg, Thread.CurrentThread.ManagedThreadId);
    // Simulate work
    Thread.Sleep(500);
}
I'm implementing my own logging framework. The following is my BaseLogger, which receives the log entries and pushes them to the actual logger, which implements the abstract Log method.
I originally used the C# TPL for logging in an async manner, but I now use Threads instead of TPL. (A TPL task doesn't hold a real thread, so if all threads of the application end, tasks will stop as well, which would cause all 'waiting' log entries to be lost.)
public abstract class BaseLogger
{
    // ... Omitted properties, constructor, etc. ... //

    public virtual void AddLogEntry(LogEntry entry)
    {
        if (!AsyncSupported)
        {
            // the underlying logger doesn't support async.
            // Simply call the Log method and return.
            Log(entry);
            return;
        }

        // Logger supports async.
        LogAsync(entry);
    }

    private void LogAsync(LogEntry entry)
    {
        lock (LogQueueSyncRoot) // make sure we have the lock before accessing the queue.
        {
            LogQueue.Enqueue(entry);
        }

        if (LogThread == null || LogThread.ThreadState == ThreadState.Stopped)
        {   // either the thread has completed, or this is the first time we're logging to this logger.
            LogThread = new Thread(new ThreadStart(() =>
            {
                while (true)
                {
                    LogEntry logEntry;
                    lock (LogQueueSyncRoot)
                    {
                        if (LogQueue.Count > 0)
                        {
                            logEntry = LogQueue.Dequeue();
                        }
                        else
                        {
                            break;
                            // is it possible for a message to be added
                            // right after the break, when I leave the lock {} but
                            // before I exit the loop and the thread gets 'completed'?
                        }
                    }
                    Log(logEntry);
                }
            }));
            LogThread.Start();
        }
    }

    // Actual logger implementations will implement this method.
    protected abstract void Log(LogEntry entry);
}
Note that AddLogEntry can be called from multiple threads at the same time.
My question is, is it possible for this implementation to lose log entries?
I'm worried about the following: is it possible for a log entry to be added to the queue right after my thread exits the loop via the break statement (in the else clause) and leaves the lock block, while the thread is still in the 'Running' state?
I do realize that, because I'm using a queue, even if I miss an entry, the next request to log will push the missed entry as well. But this is not acceptable, especially if it happens for the last log entry of the application.
Also, please let me know whether and how I can implement the same thing, but using the new C# 5.0 async and await keywords, with cleaner code. I don't mind requiring .NET 4.5.
Thanks in Advance.
While you could likely get this to work, in my experience I'd recommend, if possible, using an existing logging framework :) For instance, there are various options for async logging/appenders with log4net, such as this async appender wrapper thingy.
Otherwise, IMHO, since you're going to be blocking a threadpool thread during your logging operation anyway, I would instead just start a dedicated thread for your logging. You seem to be kind-of going for that approach already, just via Task so that you don't hold a threadpool thread when nothing is logging. However, I think the simplification in implementation is worth just having the dedicated thread.
Once you have a dedicated logging thread, you then only need an intermediate ConcurrentQueue. At that point, your log method just adds to the queue and your dedicated logging thread just runs that while loop you already have. You can wrap it with a BlockingCollection if you need blocking/bounded behavior.
By having the dedicated thread as the only thing that writes, it eliminates any possibility of having multiple threads/tasks pulling off queue entries and trying to write log entries at the same time (painful race condition). Since the log method is now just adding to a collection, it doesn't need to be async and you don't need to deal with the TPL at all, making it simpler and easier to reason about (and hopefully in the category of 'obviously correct' or thereabouts :)
This 'dedicated logging thread' approach is what I believe the log4net appender I linked to does as well, FWIW, in case that helps serve as an example.
I see two race conditions off the top of my head:
You can spin up more than one Thread if multiple threads call AddLogEntry. This won't cause lost events but is inefficient.
Yes, an event can be queued while the Thread is exiting, and in that case it would be "lost".
Also, there's a serious performance issue here: unless you're logging constantly (thousands of times a second), you're going to be spinning up a new Thread for each log entry. That will get expensive quickly.
Like James, I agree that you should use an established logging library. Logging is not as trivial as it seems, and there are already many solutions.
That said, if you want a nice .NET 4.5-based approach, it's pretty easy:
public abstract class BaseLogger
{
    private readonly ActionBlock<LogEntry> block;

    protected BaseLogger(int maxDegreeOfParallelism = 1)
    {
        block = new ActionBlock<LogEntry>(
            entry =>
            {
                Log(entry);
            },
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = maxDegreeOfParallelism,
            });
    }

    public virtual void AddLogEntry(LogEntry entry)
    {
        block.Post(entry);
    }

    protected abstract void Log(LogEntry entry);
}
Regarding losing the waiting messages on an app crash due to an unhandled exception: I've bound a handler to the AppDomain.CurrentDomain.DomainUnload event. It goes like this:
protected ManualResetEvent flushing = new ManualResetEvent(true);

protected AsyncLogger() // ctor of logger
{
    AppDomain.CurrentDomain.DomainUnload += CurrentDomain_DomainUnload;
}

protected void CurrentDomain_DomainUnload(object sender, EventArgs e)
{
    if (!IsEmpty)
    {
        flushing.WaitOne();
    }
}
Maybe not too clean, but it works.