I'm wondering if it is possible to add a member object that can be used across multiple map() calls. For example, a StringBuilder:
private final StringBuilder builder = new StringBuilder();
public void map(...){
...
builder.setLength(0);
builder.append(a);
builder.append(b);
builder.append(c);
d = builder.toString();
...
}
Obviously, if the mapper object is shared across multiple threads, the builder object above will not behave as expected because more than one thread will access it concurrently.
So my question is: is it guaranteed that each thread in Hadoop uses its own dedicated mapper object? Or is this configurable behavior?
Thanks
As long as you are using your own mapper class and not MultithreadedMapper, there will be no problem. map() is called sequentially, not in parallel.
It is common to use a StringBuilder or other data structures to buffer a few objects between the calls.
But make sure you copy any input objects you want to keep. Hadoop passes the same key and value instances to map() on every call and refills them each time, which avoids a lot of GC, so a stored reference will silently change on the next call.
So there is no need to synchronize or take care of race conditions.
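To make the copying advice concrete, here is a minimal sketch in plain Java with no Hadoop dependency. MutableRecord is a hypothetical stand-in for a reused input object such as a Hadoop Text value, and the loop stands in for the framework calling map() repeatedly. None of these names come from Hadoop itself.

```java
import java.util.ArrayList;
import java.util.List;

public class ReusedRecordDemo {
    /** Hypothetical stand-in for a mutable input object that the framework reuses. */
    static final class MutableRecord {
        private String value = "";
        void set(String v) { value = v; }
        String get() { return value; }
        MutableRecord copy() {
            MutableRecord r = new MutableRecord();
            r.set(value);
            return r;
        }
    }

    /** Buffers the reference it is handed: every entry aliases the same object. */
    static List<String> bufferReferences(String[] inputs) {
        MutableRecord shared = new MutableRecord(); // the single reused instance
        List<MutableRecord> buffer = new ArrayList<>();
        for (String in : inputs) {
            shared.set(in);     // the "framework" refills the same object
            buffer.add(shared); // bug: stores an alias, not a snapshot
        }
        return values(buffer);
    }

    /** Buffers a copy on each call, so earlier entries keep their contents. */
    static List<String> bufferCopies(String[] inputs) {
        MutableRecord shared = new MutableRecord();
        List<MutableRecord> buffer = new ArrayList<>();
        for (String in : inputs) {
            shared.set(in);
            buffer.add(shared.copy()); // snapshot of the current contents
        }
        return values(buffer);
    }

    private static List<String> values(List<MutableRecord> records) {
        List<String> out = new ArrayList<>();
        for (MutableRecord r : records) {
            out.add(r.get());
        }
        return out;
    }

    public static void main(String[] args) {
        String[] inputs = {"a", "b", "c"};
        System.out.println(bufferReferences(inputs)); // prints [c, c, c]
        System.out.println(bufferCopies(inputs));     // prints [a, b, c]
    }
}
```

A StringBuilder member is not affected by this, because append() copies the characters out of the input at the time of the call.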
I don't think that's possible. The reason for that is that each mapper runs in its own JVM (they will be distributed on different machines), so there's no way you can share a variable or object across multiple mappers or reducers easily.
Now if all your mappers run on the same node, I believe there is a configuration for JVM reuse somewhere, but honestly I wouldn't bother with that, especially if all you need is a StringBuilder :)
I've seen this question once before, and it could be solved very easily by changing the design of the application. Maybe you can tell more about what you're trying to accomplish with this to see if this is really needed. If you really need it, you can still serialize your object, put it in HDFS, then read it with each mapper, deserialize it, but that seems backwards.
Related
I created this list in a JSR223 PreProcessor. All the threads access it, but the problem is that it is not synchronized:
props.put("listOfTasks", new ArrayList());
Does anyone have an idea how to do this? Thank you.
props itself is a Properties instance, which inherits from Hashtable, which is synchronized by nature. More information: Top 8 JMeter Java Classes You Should Be Using with Groovy
What you store there is entirely up to you. ArrayList itself is not synchronized, so you might want to use CopyOnWriteArrayList or wrap the list with Collections.synchronizedList().
It's quite hard to give advice without seeing your code or knowing what you're trying to achieve. The safest solution from a JMeter perspective is to work with your list inside the scope of a Critical Section Controller, which avoids race conditions.
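As a rough illustration of the Collections.synchronizedList() suggestion, the sketch below uses plain Java threads in place of JMeter threads and a locally created Properties object in place of JMeter's props. The class and method names (SharedListDemo, fillFromThreads) are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

public class SharedListDemo {
    /** Stores a synchronized list in props, fills it from several threads, returns its final size. */
    static int fillFromThreads(Properties props, int threads, int perThread) {
        // Wrap the ArrayList so that each add() call is synchronized.
        props.put("listOfTasks", Collections.synchronizedList(new ArrayList<String>()));

        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread worker = new Thread(() -> {
                @SuppressWarnings("unchecked")
                List<String> tasks = (List<String>) props.get("listOfTasks");
                for (int i = 0; i < perThread; i++) {
                    tasks.add("task-" + id + "-" + i);
                }
            });
            workers.add(worker);
            worker.start();
        }

        // Wait for every worker before reading the size.
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return ((List<?>) props.get("listOfTasks")).size();
    }

    public static void main(String[] args) {
        System.out.println(fillFromThreads(new Properties(), 8, 1000)); // prints 8000
    }
}
```

Note that the wrapper only makes individual calls such as add() atomic. Iterating over the list still has to happen inside a synchronized block on the list itself, which is one reason CopyOnWriteArrayList can be the simpler choice when reads dominate.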
Let's have a worker thread which is accessed from a wide variety of objects. This worker object has some public slots, so anyone who connects its signals to the worker's slots can use emit to trigger the worker thread's useful tasks.
This worker thread needs to be almost global, in the sense that several different classes use it, some of them are deep in the hierarchy (child of a child of a child of the main application).
I guess there are two major ways of doing this:
All the methods of the child classes pass their messages up the hierarchy via their return values, and let the main (e.g. the GUI) object handle all the emitting.
All those classes which require the services of the worker thread have a pointer to the Worker object (which is a member of the main class), and they all connect() to it in their constructors. Every such class then does the emitting by itself. Basically, dependency injection.
Option 2 seems much cleaner and more flexible to me. I'm only worried that it will create a huge number of connections. For example, if I have an array of objects that need the thread, I will have a separate connection for each element of the array.
Is there an "official" way of doing this, as the creators of Qt intended it?
There is no magic silver bullet for this. You'll need to consider many factors, such as:
Why do those objects emit the data in the first place? Is it because they need to do something, that is, emission is a “command”? Then maybe they could call some sort of service to do the job without even worrying about whether it's going to happen in another thread or not. Or is it because they inform about an event? In that case they should probably just emit signals but not connect them. It's up to the using code to decide what to do with events.
How many objects are we talking about? Some performance tests are needed. Maybe it's not even an issue.
If there is an array of objects, what purpose does it serve? Perhaps instead of using a plain array some sort of “container” class is needed? Then the container could handle the emission and connection and objects could just do something like container()->handle(data). Then you'd only have one connection per container.
I'm building a library that others--i.e. those uninterested in internals--could use to pull data from our DBs. In the internals, I want a couple I/O calls to be performed in parallel for performance purposes. The trade-off here is that the client (who, again, might not care much about this whole threading thing) would need to provide an appropriate execution context. Therefore, I provide a suggestion to use a helpful execution context in a helper object:
object ThreadPoolHelper {
val cachedThreadPoolContext: ExecutionContext =
ExecutionContext.fromExecutor(Executors.newCachedThreadPool())
}
The question is (assuming that someday I also provide other options, like, say, a fixed thread pool for the clients to optionally use) am I fine just leaving this (these) as a val? Or am I better off making it lazy? Or a def?
One way or another, lazy is the way to go.
Making them lazy vals would be a good all-purpose choice, as each could be initialized as needed (as they are accessed). Then, you would never initialize more thread pools than are needed. scala.concurrent.ExecutionContext.Implicits.global is an implicit lazy val.
Technically, singleton objects (like ThreadPoolHelper) are lazy by default, so they will not be initialized until they are first accessed. A val would be fine if you only had one ExecutionContext in an object. However, multiple ExecutionContexts as vals in the same object wouldn't make as much sense, because accessing one would initialize them all--which would use more resources than needed.
A def would not make sense, because then you would be creating a new ExecutionContext on each call, and throwing it away when done. That could cause a lot of unwanted overhead, and defeat the purpose of having a thread pool in the first place.
Some ExecutionContexts that are a little more custom than yours are singleton objects that extend ExecutionContext and implement their own custom behavior. These would also be lazy.
I'm having a go with greenDAO and so far it's going pretty well. One thing that doesn't seem to be covered by the docs or website (or anywhere :( ) is how it handles thread safety.
I know the basics mentioned elsewhere, like "use a single dao session" (general practice for Android + SQLite), and I understand the Java memory model quite well. The library internals even appear threadsafe, or at least built with that intention. But nothing I've seen covers this:
greenDAO caches entities by default. This is excellent for a completely single-threaded program - transparent and a massive performance boost for most uses. But if I e.g. loadAll() and then modify one of the elements, I'm modifying the same object globally across my app. If I'm using it on the main thread (e.g. for display), and updating the DB on a background thread (as is right and proper), there are obvious threading problems unless extra care is taken.
Does greenDAO do anything "under the hood" to protect against common application-level threading problems? For example, modifying a cached entity in the UI thread while saving it in a background thread (better hope they don't interleave! especially when modifying a list!)? Are there any "best practices" to protect against them, beyond general thread safety concerns (i.e. something that greenDAO expects and works well with)? Or is the whole cache fatally flawed from a multithreaded-application safety standpoint?
I've no experience with greenDAO but the documentation here:
http://greendao-orm.com/documentation/queries/
Says:
If you use queries in multiple threads, you must call forCurrentThread() on the query to get a Query instance for the current thread. Starting with greenDAO 1.3, object instances of Query are bound to their owning thread that build the query. This lets you safely set parameters on the Query object while other threads cannot interfere. If other threads try to set parameters on the query or execute the query bound to another thread, an exception will be thrown. Like this, you don’t need a synchronized statement. In fact you should avoid locking because this may lead to deadlocks if concurrent transactions use the same Query object.
To avoid those potential deadlocks completely, greenDAO 1.3 introduced the method forCurrentThread(). This will return a thread-local instance of the Query, which is safe to use in the current thread. Every time, forCurrentThread() is called, the parameters are set to the initial parameters at the time the query was built using its builder.
As far as I can see, the documentation doesn't explicitly say anything else about multithreading, but this makes it pretty clear that it is handled. It talks about multiple threads using the same Query object, so clearly multiple threads can access the same database. It's certainly normal for databases and DAOs to handle concurrent access, and there are a lot of proven techniques for working with caches in this situation.
By default GreenDAO caches and returns cached entity instances to improve performance. To prevent this behaviour, you need to call:
daoSession.clear()
to clear all cached instances. Alternatively you can call:
objectDao.detachAll()
to clear cached instances only for the specific DAO object.
You will need to call these methods every time you want to clear the cached instances, so if you want to disable all caching, I recommend calling them in your Session or DAO accessor methods.
Documentation:
http://greenrobot.org/greendao/documentation/sessions/#Clear_the_identity_scope
Discussion: https://github.com/greenrobot/greenDAO/issues/776
What is the best way to keep a collection of objects synchronized between various threads in .Net?
I need to have a List or Dictionary accessed from different threads in a thread-safe way, with adds, removes, foreach iteration, etc.
Basically it depends on the pattern you need to use.
If you have several threads writing to and reading from the same place, you can use the same data structure that you would have used with a single thread (hashtable, array, etc.) with a lock/monitor or a ReaderWriterLock to prevent race conditions.
If you need to pass data between threads, you'll need some kind of queue (synced or lock-free) that the thread(s) of group A insert into and the thread(s) of group B dequeue from. You might want to use an EventWaitHandle (AutoResetEvent or ManualResetEvent) so that you don't waste CPU when the queue is empty.
It really depends on what kind of workflow you want to achieve.
You could implement a lock-free queue:
http://www.boyet.com/Articles/LockfreeQueue.html
Or handle the synchronization yourself using locks:
http://www.albahari.com/threading/part2.html#_Locking
Hashtable.Synchronized method returns a synchronized (thread safe) wrapper for the Hashtable.
http://msdn.microsoft.com/en-us/library/system.collections.hashtable.synchronized(VS.80).aspx
This also exists for other collections.
A number of the collection classes in .Net have built-in support for synchronization, making access from multiple threads safe. For example (in C++/CLI):
Collections::Queue ^unsafe_queue = gcnew Collections::Queue();
Collections::Queue ^safe_queue = Collections::Queue::Synchronized(unsafe_queue);
You can throw away the reference to unsafe_queue, and keep the reference to safe_queue. It can be shared between threads, and you're guaranteed thread safe access. Other collection classes, like ArrayList and Hashtable, also support this, in a similar manner.
Without knowing specifics, I'd lean towards delegates and events to notify of changes.
http://msdn.microsoft.com/en-us/library/17sde2xt(VS.71).aspx
And implementing the Observer or Publish Subscribe pattern
http://en.wikipedia.org/wiki/Observer_pattern
http://msdn.microsoft.com/en-us/library/ms978603.aspx