Thread-safe search-and-add - Linux

I need to be able to do the following:
search a linked list;
add a new node to the list if it's not found;
be thread-safe, using an rwlock, since it's a read-mostly list.
The issue I'm having is that when I promote from read_lock to write_lock, I have to search the list again just to make sure another thread didn't add the same node while it held the write_lock I was queued behind during my read-locked search.
Is there a way to achieve the above without the double list search (perhaps a seqlock of some sort)?

Convert the linked list to a sorted linked list. When it's time to add a new node, you can check whether another writer added an equivalent node while you were acquiring the lock by inspecting only the two nodes around the insertion point, instead of searching the entire list. You will spend a little more time on each insertion because you need to determine the sorted position of the new node, but you will save time by not having to search the entire list again. Overall you will probably come out well ahead.
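A minimal sketch of this idea in Java (class and method names are my own; it assumes integer keys and no deletions, so a remembered predecessor node stays valid). A ReentrantReadWriteLock cannot be upgraded in place, so the code drops the read lock, takes the write lock, and then re-checks only the nodes around the remembered insertion point:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sorted singly linked list with search-or-add under a read-write lock.
// Assumption: nodes are never removed, so a predecessor found under the
// read lock is still on the list when we re-check under the write lock.
class SortedIntList {
    static final class Node {
        final int key;
        Node next;
        Node(int key, Node next) { this.key = key; this.next = next; }
    }

    private final Node head = new Node(Integer.MIN_VALUE, null); // sentinel
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Last node with key < target; its successor (if any) has key >= target.
    private Node findPredecessor(int key) {
        Node pred = head;
        while (pred.next != null && pred.next.key < key) pred = pred.next;
        return pred;
    }

    /** Adds key if absent; returns true if this call inserted it. */
    boolean searchOrAdd(int key) {
        lock.readLock().lock();
        Node pred;
        try {
            pred = findPredecessor(key);
            if (pred.next != null && pred.next.key == key) return false; // found
        } finally {
            lock.readLock().unlock();
        }
        lock.writeLock().lock();
        try {
            // Re-check from the remembered predecessor: another writer may have
            // inserted nodes here meanwhile, but we only walk the few nodes
            // added since, not the whole list.
            while (pred.next != null && pred.next.key < key) pred = pred.next;
            if (pred.next != null && pred.next.key == key) return false; // lost the race
            pred.next = new Node(key, pred.next);
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    boolean contains(int key) {
        lock.readLock().lock();
        try {
            Node pred = findPredecessor(key);
            return pred.next != null && pred.next.key == key;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

If deletions were allowed, the remembered predecessor could leave the list while the read lock is dropped, and you would have to restart the search from the head (or use hand-over-hand locking) instead.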

Related

How can I have a Map that all processes can access?

I'm building a multithreaded web crawler.
I launch a thread that gets the first n href links and parses some data. It should then add those links to a Visited list that other threads can access, and add the data to a global map that will be printed when the program is done. The thread then launches n new threads, all doing the same thing.
How can I set up a global list of visited sites that all threads can access, and a global map that all threads can also write to?
You can't share data between processes. That doesn't mean you can't share information.
The usual way is to use a special process (a server) in charge of this job: it maintains the state; in your case, the list of visited links.
Another way is to use ETS (or Mnesia, the database built upon ETS), which is designed to share information between processes.
Just to clarify: Erlang/Elixir uses processes rather than threads.
Given a list of elements, a generic approach:
An empty list called processed is saved to ETS, DETS, Mnesia, or some DB.
The new list of elements is filtered against the processed list so no task is unnecessarily repeated.
For each element of the filtered list, a task is run (which in turn spawns a process) that does some work on the element and returns a map of the required data. See the Task module; Task.async/1 and Task.yield_many/2 could be useful.
Once all the tasks have returned or yielded:
all the maps, or the relevant parts of the data in them, are merged and can be persisted if/as appropriate;
the elements whose tasks did not crash or time out are added to the processed list in the DB.
Tasks that crash or time out can be handled differently.
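For contrast, in a shared-memory threaded language the crawler's "visited list plus global map" can simply be two concurrent collections; a hypothetical Java sketch (class and method names are mine, not from the question):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical crawler-wide state shared by all worker threads:
// a visited set and a result map, using lock-free concurrent collections.
class CrawlState {
    // Set.add returns false if the URL was already present, so
    // "check and mark visited" is a single atomic step - no extra lock needed.
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final ConcurrentHashMap<String, String> results = new ConcurrentHashMap<>();

    /** Returns true only for the first thread to claim this URL. */
    boolean claim(String url) { return visited.add(url); }

    void record(String url, String data) { results.put(url, data); }

    int resultCount() { return results.size(); }
}
```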

Multi Threading in a Tree like structure

Below is a question I was asked in an interview. I believe there are many solutions to it, but I want to know which is best (and Stack Overflow is perfect for this :) ).
Q: We have a tree-like structure and three threads. We have to support three operations: insert, delete, and lookup. How would you design this?
My approach: I would take a mutex for the insert and delete operations, as I want only one thread at a time to insert or delete. For lookup I would allow all three threads to enter the function, but keep a count (a counting semaphore) so that insert and delete operations can't run during a lookup.
Similarly, while an insert or delete is in progress, no thread is allowed to do a lookup; the same holds between insert and delete.
He then cross-questioned me: since I allow only one thread at a time to insert, two nodes on different leaves would still be inserted one at a time. This left me stuck.
Is my approach fine ?
What can be other approaches ?
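The mutex-plus-reader-count design in the question is essentially a read-write lock. A minimal Java sketch over a binary search tree (hypothetical names; note this is coarse-grained, so it still serializes writers, which is exactly the interviewer's objection, and removing that limit requires per-node locking):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Coarse-grained concurrent BST: lookups run in parallel under the read
// lock; insert and delete are exclusive under the write lock.
class RwTree {
    static final class Node { int key; Node left, right; Node(int k) { key = k; } }

    private Node root;
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    boolean lookup(int key) {
        lock.readLock().lock();
        try {
            Node n = root;
            while (n != null) {
                if (key == n.key) return true;
                n = key < n.key ? n.left : n.right;
            }
            return false;
        } finally { lock.readLock().unlock(); }
    }

    void insert(int key) {
        lock.writeLock().lock();
        try { root = insert(root, key); }
        finally { lock.writeLock().unlock(); }
    }
    private static Node insert(Node n, int key) {
        if (n == null) return new Node(key);
        if (key < n.key) n.left = insert(n.left, key);
        else if (key > n.key) n.right = insert(n.right, key);
        return n;
    }

    void delete(int key) {
        lock.writeLock().lock();
        try { root = delete(root, key); }
        finally { lock.writeLock().unlock(); }
    }
    private static Node delete(Node n, int key) {
        if (n == null) return null;
        if (key < n.key) n.left = delete(n.left, key);
        else if (key > n.key) n.right = delete(n.right, key);
        else {
            if (n.left == null) return n.right;
            if (n.right == null) return n.left;
            Node min = n.right;                  // in-order successor
            while (min.left != null) min = min.left;
            n.key = min.key;
            n.right = delete(n.right, min.key);
        }
        return n;
    }
}
```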
How about something like this? Similar to a traffic roadblock (broken paths):
Each node will have two flags say leftClear_f and rightClear_f indicating clear-path ahead
There will be only one MutEx for the tree
Lookup Operation:
If the flags are set, indicating the path ahead is under modification, go into a condition wait and wait for the signal.
After getting the signal, re-check the flags and continue.
Insert Operation
Follow the lookup procedure until you get to the insertion point.
Acquire the mutex and, after checking their state, set the relevant flags on the parent node and both child nodes.
Release the mutex so that parallel deletes/inserts can proceed on other, unbroken paths.
After the insert, acquire the mutex again and clear the relevant flags on the parent node and child nodes.
Delete Operation
Same as the insert operation, except that it deletes nodes.
PS: You could also keep the details of the nodes under insert or delete somewhere else. Other operations could then jump over the broken paths if needed. It sounds complicated, yet doable.

How to implement a conditional pop concurrency-friendly in Redis?

I am building something like a delay-line: one process RPUSHes objects into a list, another LPOPs them out in the same order.
The trick is that objects should only be popped from the list one hour after they have been added. I could use a time-stamp for that. The delay is the same for all items and never changes.
Now how do I implement the pop in a concurrency-friendly way (so that it still works when several workers access the list)? I could take an item out, check the timestamp, and put it back into the list if it's still too early; but if several workers do that simultaneously, it may mess up the order of items. I could check the first item and only pop it if it's due, but another worker might have popped it by then, so I'd pop the wrong one.
Should I use the WATCH command? How? Should I use sorted sets instead of a list? Help appreciated!
I'd suggest using a sorted set. Entries go into the zset with the normal identifier as the member and a Unix-style timestamp as the score; make each timestamp the date/time after which the entry is ready for processing. Workers then do a ZPOP, which isn't built in here (Redis 5.0 later added ZPOPMIN) but can be emulated with:
MULTI
ZRANGE <key> 0 0 WITHSCORES
ZREMRANGEBYRANK <key> 0 0
EXEC
Capture the result of the ZRANGE and you have the element with the lowest score at that moment, already removed from the set, along with its score. If it's not due yet, put it back with ZADD <key> <score> <item>.
Putting the item back after checking it won't mess up the ordering enough to matter: by accepting any concurrency at all, you accept that the order won't be strictly defined. However, that approach isn't particularly efficient, since you can't check the timestamp without some processing of the item.
With Redis scripting you could implement a conditional pop command for sorted sets, though I don't think that is the easiest solution.
There are a couple of ways to implement basic scheduling using multiple lists:
Add tasks to a sorted set and have a single worker responsible for moving them from the sorted set to a list that multiple workers access with pop.
Have a list for each minute, and have the workers read from past lists. You can minimize the number of lists checked by using another key that tracks the oldest non-empty queue.
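As an in-process analogue of this delay line (items become visible to consumers only after their due time, and multiple workers can poll safely), Java's DelayQueue does the conditional pop for you; a sketch with a hypothetical DelayedItem type:

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// An element for DelayQueue: invisible to take()/poll() until due.
class DelayedItem implements Delayed {
    final String payload;
    final long dueAtNanos;

    DelayedItem(String payload, long delayMillis) {
        this.payload = payload;
        this.dueAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMillis);
    }

    @Override public long getDelay(TimeUnit unit) {
        return unit.convert(dueAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
    }

    @Override public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.NANOSECONDS),
                            other.getDelay(TimeUnit.NANOSECONDS));
    }
}
```

A producer would do `queue.put(new DelayedItem(obj, 3_600_000))` on a shared `DelayQueue<DelayedItem>`, and each worker's `queue.take()` blocks until the oldest item's hour is up; no worker can ever pop an item early or out of due order.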

Thread locking / exclusive access improvements

I have two threaded methods running in two separate places but sharing access to a list object (let's call it PriceArray). The first thread adds and removes items from PriceArray when necessary (the content of the array gets updated from a third-party data provider); the average update rate is between 0.5 and 1 second.
The second thread only reads (for now) the content of the array every 3 seconds, using a foreach loop (it takes most items, but not all of them).
To avoid the nasty "Collection was modified; enumeration operation may not execute" exception when the second thread loops through the array, I wrapped the add and remove operations in the first thread with lock(PriceArray) to ensure exclusive access. The problem is that I've noticed a performance issue when the second method loops through the array items, as the array is locked by the add/remove thread most of the time.
Having the scenario running this way, do you have any suggestions how to improve the performance using other thread-safety/exclusive access tactics in C# 4.0?
Thanks.
Yes, there are many alternatives.
The best/easiest would be to switch to using an appropriate collection in System.Collections.Concurrent. These are all thread-safe collections, and will allow you to use them without managing your own locks. They are typically either lock-free or use very fine grained locking, so will likely dramatically improve the performance impacts you're getting from the synchronization.
Another option would be to use ReaderWriterLockSlim to allow your readers to not block each other. Since a third party library is writing this array, this may be a more appropriate solution. It would allow you to completely block during writing, but the readers would not need to block each other during reads.
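To illustrate the concurrent-collection idea (here in Java; C#'s System.Collections.Concurrent plays the same role), a copy-on-write list lets the reader enumerate a stable snapshot while the writer mutates the live list. PriceBoard is a hypothetical stand-in for the price array:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// CopyOnWriteArrayList iterators walk an immutable snapshot, so the reader's
// foreach can never throw or block while the writer adds/removes concurrently.
// Well suited here: writes every ~0.5-1 s, reads every 3 s.
class PriceBoard {
    private final List<Double> prices = new CopyOnWriteArrayList<>();

    void update(double oldPrice, double newPrice) {   // writer thread
        prices.remove(oldPrice);                      // no-op if absent
        prices.add(newPrice);
    }

    double sum() {                                    // reader thread
        double total = 0;
        for (double p : prices) total += p;           // iterates a snapshot
        return total;
    }
}
```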
My guess is that ArrayList.Remove() takes most of the time, because to perform a deletion it does two costly things:
a linear search: it compares elements one by one with the element being removed;
once the index of the element being removed is found, it shifts everything after it one position to the left.
Thus every deletion takes time proportional to the number of elements currently in the collection.
So you should try to replace the ArrayList with a structure more appropriate for this task; I'd need more information about your case to suggest which one.

Threading and iterating through changing collections

In C# (console app) I want to hold a collection of objects. All objects are of same type.
I want to iterate through the collection calling a method on each object. And then repeat the process continuously.
However during iteration objects can be added or removed from the list. (The objects themselves will not be destroyed .. just removed from the list).
I'm not sure what would happen with a foreach loop or other similar approaches.
This has to have been done 1000 times before; can you recommend a solid approach?
There is also a copy-based approach.
The algorithm goes like this:
take the lock on the shared collection
copy all items from the shared collection to a local collection
release the lock on the shared collection
iterate over the items in the local collection
The advantage of this approach is that you hold the lock on the shared collection only for a short period of time (assuming the shared collection is relatively small).
When the method you want to invoke on every item takes considerable time to complete, or can block, iterating under the shared lock can block other threads that want to add/remove items from the shared collection.
However, if the method is relatively fast, iterating under the shared lock may be preferable.
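The steps above can be sketched as follows (hypothetical Java names; the shared list is copied under the lock, then the per-item method runs outside it):

```java
import java.util.ArrayList;
import java.util.List;

// Copy-based iteration: hold the lock only while copying, then call the
// (possibly slow) method on each item with the lock released.
class Repeater {
    private final List<Runnable> shared = new ArrayList<>();
    private final Object lock = new Object();

    void add(Runnable r)    { synchronized (lock) { shared.add(r); } }
    void remove(Runnable r) { synchronized (lock) { shared.remove(r); } }

    /** One pass over the collection; safe against concurrent add/remove. */
    void runOnce() {
        List<Runnable> copy;
        synchronized (lock) {                 // 1. take the lock
            copy = new ArrayList<>(shared);   // 2. copy the items
        }                                     // 3. release the lock
        for (Runnable r : copy) r.run();      // 4. iterate the local copy
    }
}
```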
This is a classic case of synchronization in multithreading.
A solid approach would be to synchronize the loop with the addition/deletion of items from the list.
That means allowing additions/deletions only between passes of the iterating loop,
something like this:
ENTER SYNC_BLOCK
WAIT FOR SYNC_BLOCK to be available
LOOP for items/ call method on them.
LEAVE SYNC_BLOCK
ENTER SYNC_BLOCK
WAIT FOR SYNC_BLOCK to be available
Add/Delete items
LEAVE SYNC_BLOCK
What comes to mind when I read this is that you could use a C5 TreeSet/TreeBag. It requires that your items be orderable, but the advantage of the C5 tree collections is that they offer a Snapshot method (a member of C5.IPersistentSorted) that makes lightweight snapshots of the collection's state without needing a full duplicate.
e.g.:
using (var copy = mySet.Snapshot()) {
    foreach (var item in copy) {
        item.DoSomething();
    }
}
C5 also offers a simple way to "apply to all" and is compatible with .NET 2.0:
using (var copy = mySet.Snapshot()) {
    copy.Apply(i => i.DoSomething());
}
It's important to note that the snapshot should be disposed or you will incur a small performance penalty on subsequent modifications to the base collection.
This example is from the very thorough C5 Book.
