What multithreading based data structure should I use?

What multithreading based data structure should I use? - multithreading

I have recently come across a question based on multi-threading. I was given a situation where there will be variable no of cars constantly changing there locations. Also there are multiple users who are posting requests to get location of any car at any moment. What would be data structure to handle this situation and why?

You could use a mutex (one per car).
Lock: before changing location of the associated car
Unlock: after changing location of the associated car
Lock: before getting location of the associated car
Unlock: after done doing work that relies on that location being up to date

I'd answer with:
Try to make threading an external concept to your system yet make the system as modular and encapsulated as possible at the same time. It will allow adding concurrency at later phase at low cost and in case the solution happens to work nicely in a single thread (say by making it event-loop-based) no time will have been burnt for nothing.

There are several ways to do this. Which way you choose depends a lot on the number of cars, the frequency of updates and position requests, the expected response time, and how accurate (up to date) you want the position reports to be.
The easiest way to handle this is with a simple mutex (lock) that allows only one thread at a time to access the data structure. Assuming you're using a dictionary or hash map, your code would look something like this:
Map Cars = new Map(...)
Mutex CarsMutex = new Mutex(...)
Location GetLocation(carKey)
{
acquire mutex
result = Cars[carKey].Location
release mutex
return result
}
You'd do that for Add, Remove, Update, etc. Any method that reads or updates the data structure would require that you acquire the mutex.
If the number of queries far outweighs the number of updates, then you can do better with a reader/writer lock instead of a mutex. With an RW lock, you can have an unlimited number of readers, OR you can have a single writer. With that, querying the data would be:
acquire reader lock
result = Cars[carKey].Location
release reader lock
return result
And Add, Update, and Remove would be:
acquire writer lock
do update
release writer lock
Many runtime libraries have a concurrent dictionary data structure already built in. .NET, for example, has ConcurrentDictionary. With those, you don't have to worry about explicitly synchronizing access with a Mutex or RW lock; the data structure handles synchronization for you, either with a technique similar to that shown above, or by implementing lock-free algorithms.
As mentioned in comments, a relational database can handle this type of thing quite easily and can scale to a very large number of requests. Modern relational databases, properly constructed and with sufficient hardware, are surprisingly fast and can handle huge amounts of data with very high throughput.
There are other, more involved, methods that can increase throughput in some situations depending on what you're trying to optimize. For example, if you're willing to have some latency in reported position, then you could have position requests served from a list that's updated once per minute (or once every five minutes). So position requests are fulfilled immediately with no lock required from a static copy of the list that's updated once per minute. Updates are queued and once per minute a new list is created by applying the updates to the old list, and the new list is made available for requests.
There are many different ways to solve your problem.

Related

Can LMDB be made concurrent for writes as well under specific circumstances?

MDB_NOLOCK as described at mdb_env_open() apidoc:
MDB_NOLOCK Don't do any locking. If concurrent access is anticipated, the caller must manage all concurrency itself. For proper operation the caller must enforce single-writer semantics, and must ensure that no readers are using old transactions while a writer is active. The simplest approach is to use an exclusive lock so that no readers may be active at all when a writer begins.
What if an RW txnA intends to modify a set of keys which has no key in common with another set of keys which another RW txnB intends to modify? Couldn't they be sent concurrently?
Isn't the single-writer semantic wasteful for such situations? As one txn is waiting for the previous one to finish, even though they intend to operate in entirely separate regions in an lmdb env.
In an environment opened with MDB_NOLOCK, what if the client app calculates in the domainland, that two write transactions are intending to RW to mutually exclusive set of keys anywhere in an lmdb environment, and sends only such transactions concurrently anyway? What could go wrong?
Could such concurrent writes scale linearly with cores? Like RO txns do? Given the app is able to manage these concurrent writes, in the manner described in 3.

No, since modifying key/value pairs requires also modifying the b-tree structure, and the two transactions would conflict with each other.
You should avoid doing long-running computations in the middle of a write transaction. Try to do as much as possible beforehand. If you can't do this, then LMDB might not be a great fit for you application. Usually you can though.
Very bad stuff. Application crashes and DB corruption.
Writes are generally IO bound, and will not scale with many cores anyway. There are some very hacky things you can do with LMDB's writemap and/or pwrite(2), but you are very much on your own here.

I'm going to assume that writing to the value part of a pre-existing key does not modify the b-tree because you are not modifying the keys. So what Doug Hoyte's comment stands, except possibly point 3:
Key phrase here is "are intending to RW to mutually exclusive set of keys". So assuming that the keys are pre-allocated, and already in the DB, changing the values should not matter. I don't even know if LMDB can store variable sized values, in which case it could matter if the values are different sizes.
So, it should be possible to write with MDB_NOLOCK concurrently as long as you can guarantee to never modify, add, or delete any keys during the concurrent writes.

Empirically I can state that working with LMDB opened with MDB_NO_LOCK (or lock=False in Python) and simply modifying values of pre-existing keys, or even only adding new key/values - seems to work well. Even if LMDB itself is mounted across an NFS like medium and queried from different machines.
#Doug Hoyte - I would appreciate more context as to what specific circumstances might lead to a crash or corruption. In my case there are many small short-lived type of writes to the same DB.

Designing concurrency in a Python program

I'm designing a large-scale project, and I think I see a way I could drastically improve performance by taking advantage of multiple cores. However, I have zero experience with multiprocessing, and I'm a little concerned that my ideas might not be good ones.
Idea
The program is a video game that procedurally generates massive amounts of content. Since there's far too much to generate all at once, the program instead tries to generate what it needs as or slightly before it needs it, and expends a large amount of effort trying to predict what it will need in the near future and how near that future is. The entire program, therefore, is built around a task scheduler, which gets passed function objects with bits of metadata attached to help determine what order they should be processed in and calls them in that order.
Motivation
It seems to be like it ought to be easy to make these functions execute concurrently in their own processes. But looking at the documentation for the multiprocessing modules makes me reconsider- there doesn't seem to be any simple way to share large data structures between threads. I can't help but imagine this is intentional.
Questions
So I suppose the fundamental questions I need to know the answers to are thus:
Is there any practical way to allow multiple threads to access the same list/dict/etc... for both reading and writing at the same time? Can I just launch multiple instances of my star generator, give it access to the dict that holds all the stars, and have new objects appear to just pop into existence in the dict from the perspective of other threads (that is, I wouldn't have to explicitly grab the star from the process that made it; I'd just pull it out of the dict as if the main thread had put it there itself).
If not, is there any practical way to allow multiple threads to read the same data structure at the same time, but feed their resultant data back to a main thread to be rolled into that same data structure safely?
Would this design work even if I ensured that no two concurrent functions tried to access the same data structure at the same time, either for reading or for writing?
Can data structures be inherently shared between processes at all, or do I always explicitly have to send data from one process to another as I would with processes communicating over a TCP stream? I know there are objects that abstract away that sort of thing, but I'm asking if it can be done away with entirely; have the object each thread is looking at actually be the same block of memory.
How flexible are the objects that the modules provide to abstract away the communication between processes? Can I use them as a drop-in replacement for data structures used in existing code and not notice any differences? If I do such a thing, would it cause an unmanageable amount of overhead?
Sorry for my naivete, but I don't have a formal computer science education (at least, not yet) and I've never worked with concurrent systems before. Is the idea I'm trying to implement here even remotely practical, or would any solution that allows me to transparently execute arbitrary functions concurrently cause so much overhead that I'd be better off doing everything in one thread?
Example
For maximum clarity, here's an example of how I imagine the system would work:
The UI module has been instructed by the player to move the view over to a certain area of space. It informs the content management module of this, and asks it to make sure that all of the stars the player can currently click on are fully generated and ready to be clicked on.
The content management module checks and sees that a couple of the stars the UI is saying the player could potentially try to interact with have not, in fact, had the details that would show upon click generated yet. It produces a number of Task objects containing the methods of those stars that, when called, will generate the necessary data. It also adds some metadata to these task objects, assuming (possibly based on further information collected from the UI module) that it will be 0.1 seconds before the player tries to click anything, and that stars whose icons are closest to the cursor have the greatest chance of being clicked on and should therefore be requested for a time slightly sooner than the stars further from the cursor. It then adds these objects to the scheduler queue.
The scheduler quickly sorts its queue by how soon each task needs to be done, then pops the first task object off the queue, makes a new process from the function it contains, and then thinks no more about that process, instead just popping another task off the queue and stuffing it into a process too, then the next one, then the next one...
Meanwhile, the new process executes, stores the data it generates on the star object it is a method of, and terminates when it gets to the return statement.
The UI then registers that the player has indeed clicked on a star now, and looks up the data it needs to display on the star object whose representative sprite has been clicked. If the data is there, it displays it; if it isn't, the UI displays a message asking the player to wait and continues repeatedly trying to access the necessary attributes of the star object until it succeeds.

Even though your problem seems very complicated, there is a very easy solution. You can hide away all the complicated stuff of sharing you objects across processes using a proxy.
The basic idea is that you create some manager that manages all your objects that should be shared across processes. This manager then creates its own process where it waits that some other process instructs it to change the object. But enough said. It looks like this:
import multiprocessing as m
manager = m.Manager()
starsdict = manager.dict()
process = Process(target=yourfunction, args=(starsdict,))
process.run()
The object stored in starsdict is not the real dict. instead it sends all changes and requests, you do with it, to its manager. This is called a "proxy", it has almost exactly the same API as the object it mimics. These proxies are pickleable, so you can pass as arguments to functions in new processes (like shown above) or send them through queues.
You can read more about this in the documentation.
I don't know how proxies react if two processes are accessing them simultaneously. Since they're made for parallelism I guess they should be safe, even though I heard they're not. It would be best if you test this yourself or look for it in the documentation.

Is it required to lock shared variables in perl for read access?

I am using shared variables on perl with use threads::shared.
That variables can we modified only from single thread, all other threads are only 'reading' that variables.
Is it required in the 'reading' threads to lock
{
lock $shared_var;
if ($shared_var > 0) .... ;
}
?
isn't it safe to simple verification without locking (in the 'reading' thread!), like
if ($shared_var > 0) ....
?

Locking is not required to maintain internal integrity when setting or fetching a scalar.
Whether it's needed or not in your particular case depends on the needs of the reader, the other readers and the writers. It rarely makes sense not to lock, but you haven't provided enough details for us to determine what your needs are.
For example, it might not be acceptable to use an old value after the writer has updated the shared variable. For starters, this can lead to a situation where one thread is still using the old value while the another thread is using the new value, a situation that can be undesirable if those two threads interact.

It depends on whether it's meaningful to test the condition just at some point in time or other. The problem however is that in a vast majority of cases, that Boolean test means other things, which might have already changed by the time you're done reading the condition that says it represents a previous state.
Think about it. If it's an insignificant test, then it means little--and you have to question why you are making it. If it's a significant test, then it is telltale of a coherent state that may or may not exist anymore--you won't know for sure, unless you lock it.
A lot of times, say in real-time reporting, you don't really care which snapshot the database hands you, you just want a relatively current one. But, as part of its transaction logic, it keeps a complete picture of how things are prior to a commit. I don't think you're likely to find this in code, where the current state is the current state--and even a state of being in a provisional state is a definite state.
I guess one of the times this can be different is a cyclical access of a queue. If one consumer doesn't get the head record this time around, then one of them will the next time around. You can probably save some processing time, asynchronously accessing the queue counter. But here's a case where it means little in context of just one iteration.
In the case above, you would just want to put some locked-level instructions afterward that expected that the queue might actually be empty even if your test suggested it had data. So, if it is just a preliminary test, you would have to have logic that treated the test as unreliable as it actually is.

non-blocking producer and consumer using .NET 2.0

In our scenario,
the consumer takes at least half-a-second to complete a cycle of process (against a row in a data table).
Producer produces at least 8 items in a second (no worries, we don't mind about the duration of a consuming).
the shared data is simply a data table.
we should never ask producer to wait (as it is a server and we don't want it to wait on this)
How can we achieve the above without locking the data table at all (as we don't want producer to wait in any way).
We cannot use .NET 4.0 yet in our org.

There is a great example of a producer/consumer queue using Monitors at this page under the "Producer/Consumer Queue" section. In order to synchronize access to the underlying data table, you can have a single consumer.
That page is probably the best resource for threading in .NET on the net.

Create a buffer that holds the data while it is being processed.
It takes you half a second to process, and you get 8 items a second... unless you have at least 4 processors working on it, you'll have a problem.
Just to be safe I'd use a buffer at least twice the side needed (16 rows), and make sure it's possible with the hardware.

There is no magic bullet that is going to let you access a DataTable from multiple threads without using a blocking synchronization mechanism. What I would do is to hold the lock for as short a duration as possible. Keep in mind that modifying any object in the data table's hierarchy will require locking the whole data table. This is because modifying a column value on a DataRow can change the internal indexing structures inside the parent DataTable.
So what I would do is from the producer acquire a lock, add a new row, and release the lock. Then in the conumser you will acquire the same lock, copy data contained in a DataRow into a separate data structure, and then release the lock immediately. Now, you can operate on the copied data without synchronization mechanisms since it is isolated. After you have completed the operation on it you will again acquire the lock, merge the changes back into the DataRow, and then release the lock and start the process all over again.

Thread locking / exclusive access improvements

I have 2 threaded methods running in 2 separate places but sharing access at the same time to a list array object (lets call it PriceArray), the first thread Adds and Removes items from PriceArray when necessary (the content of the array gets updated from a third party data provider) and the average update rate is between 0.5 and 1 second.
The second thread only reads -for now- the content of the array every 3 seconds using a foreach loop (takes most items but not all of them).
To ensure avoiding the nasty Collection was modified; enumeration operation may not execute exception when the second thread loops through the array I have wrapped the add and remove operation in the first thread with lock(PriceArray) to ensure exclusive access and prevent that exception from occurring. The problem is I have noticed a performance issue when the second method tries to loop through the array items as most of the time the array is locked by the add/remove thread.
Having the scenario running this way, do you have any suggestions how to improve the performance using other thread-safety/exclusive access tactics in C# 4.0?
Thanks.

Yes, there are many alternatives.
The best/easiest would be to switch to using an appropriate collection in System.Collections.Concurrent. These are all thread-safe collections, and will allow you to use them without managing your own locks. They are typically either lock-free or use very fine grained locking, so will likely dramatically improve the performance impacts you're getting from the synchronization.
Another option would be to use ReaderWriterLockSlim to allow your readers to not block each other. Since a third party library is writing this array, this may be a more appropriate solution. It would allow you to completely block during writing, but the readers would not need to block each other during reads.

My suggestion is that ArrayList.Remove() takes most of the time, because in order to perform deletion it performs two costly things:
linear search: just takes elements one by one and compares with element being removed
when index of the element being removed is found - it shifts everything below it by one position to the left.
Thus every deletion takes time proportionally to count of elements currently in the collection.
So you should try to replace ArrayList with more appropriate structure for this task. I need more information about your case to suggest which one to choose.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string