I'm writing program that parses a lot of files looking for "interesting" lines. Then it's checking if these lines were seen before. Each file is parsed using separate goroutine.
I'm wondering which approach is better:
Use sync.Map or something similar
Use channels and separate goroutine which should be responsible only for uniqueness check (probably using standard map). It would receive request and respond with something simple like "Not unique" or "Unique (and added)"
Is any of these solutions more popular or maybe both are wrong?
If you would like to have workers which can access a global map for unique checking, you can use a sync.RWMutex to be sure that the map is protected, like:
var (
mutex sync.RWMutex = sync.RWMutex{}
alreadySeen map[string]struct{} = make(map[string]struct{})
)
func Work() {
for {
Processing lines here...
//Checking
mutex.RLock() //Lock for reading only
if _, found := alreadySeen[line]; !found {
mutex.RUnLock()
mutex.Lock()
alreadySeen[line] = struct{}{}
mutex.UnLock()
} else {
mutex.RUnLock()
}
}
}
Another approach is to use a concurrent safe map to skip the whole mutexing, for example this package: https://github.com/cornelk/hashmap
Related
For the below code
Mono<String> input =
Mono.just("input")
.map {
println "inside map"
it + "added"
}
.transform {
Mono.just("hello")
}
input.subscribe {println it}
The console looks like as below.
16:11:49.056 [main] DEBUG reactor.util.Loggers$LoggerFactory - Using Slf4j logging framework
hello
The code inside the map function was never executed. I understand that transform method executes at assembly time rather than the subscription.
Why did Reactor just decide to not process my upstream map operator. Did it intelligently decide that since I am not in anyway referring to the output of the map operator that it need not execute map at all ?
Is this behaviour configurable ?
The reason is that transform does not automatically subscribe to your original Mono. It's your responsibility to chain your logic onto it. Since nothing subscribes to it, it will never get triggered.
As the example you sent is dummy, it's difficult to say what would be the right thing to do. It depends on your use case.
A few thing you can do, though:
Get rid of transform and just simply use then operator:
Mono<String> input =
Mono.just("input")
.map {
println "inside map"
it + "added"
}
.then(Mono.just("hello"))
If for some reason you need transform, then chain your logic onto your original Mono:
Mono<String> input =
Mono.just("input")
.map {
println "inside map"
it + "added"
}
.transform {
it.then(Mono.just("hello"))
}
I'm exploring Kotlin Native and have a program with a bunch of Workers doing concurrent stuff
(running on Windows, but this is a general question).
Now, I wanted to add simple logging. A component that simply logs strings by appending them as new lines to a file that is kept open in 'append' mode.
(Ideally, I'd just have a "global" function...
fun log(text:String) {...} ]
...that I would be able to call from anywhere, including from "inside" other workers and that would just work. The implication here is that it's not trivial to do this because of Kotlin Native's rules regarding passing objects between threads (TLDR: you shouldn't pass mutable objects around. See: https://github.com/JetBrains/kotlin-native/blob/master/CONCURRENCY.md#object-transfer-and-freezing ).
Also, my log function would ideally accept any frozen object. )
What I've come up with are solutions using DetachedObjectGraph:
First, I create a detached logger object
val loggerGraph = DetachedObjectGraph { FileLogger("/foo/mylogfile.txt")}
and then use loggerGraph.asCPointer() ( asCPointer() ) to get a COpaquePointer to the detached graph:
val myPointer = loggerGraph.asCPointer()
Now I can pass this pointer into the workers ( via the producer lambda of the Worker's execute function ), and use it there. Or I can store the pointer in a #ThreadLocal global var.
For the code that writes to the file, whenever I want to log a line, I have to create a DetachedObjectGraph object from the pointer again,
and attach() it in order to get a reference to my fileLogger object:
val fileLogger = DetachedObjectGraph(myPointer).attach()
Now I can call a log function on the logger:
fileLogger.log("My log message")
This is what I've come up with looking at the APIs that are available (as of Kotlin 1.3.61) for concurrency in Kotlin Native,
but I'm left wondering what a better approach would be ( using Kotlin, not resorting to C ). Clearly it's bad to create a DetachedObjectGraph object for every line written.
One could pose this question in a more general way: How to keep a mutable resource open in a separate thread ( or worker ), and send messages to it.
Side comment: Having Coroutines that truly use threads would solve this problem, but the question is about how to solve this task with the APIs currently ( Kotlin 1.3.61 ) available.
You definitely shouldn't use DetachedObjectGraph in the way presented in the question. There's nothing to prevent you from trying to attach on multiple threads, or if you pass the same pointer, trying to attach to an invalid one after another thread as attached to it.
As Dominic mentioned, you can keep the DetachedObjectGraph in an AtomicReference. However, if you're going to keep DetachedObjectGraph in an AtomicReference, make sure the type is AtomicRef<DetachedObjectGraph?> and busy-loop while the DetachedObjectGraph is null. That will prevent the same DetachedObjectGraph from being used by multiple threads. Make sure to set it to null, and repopulate it, in an atomic way.
However, does FileLogger need to be mutable at all? If you're writing to a file, it doesn't seem so. Even if so, I'd isolate the mutable object to a separate worker and send log messages to it rather than doing a DetachedObjectGraph inside an AtomicRef.
In my experience, DetachedObjectGraph is super uncommon in production code. We don't use it anywhere at the moment.
To isolate mutable state to a Worker, something like this:
class MutableThing<T:Any>(private val worker:Worker = Worker.start(), producer:()->T){
private val arStable = AtomicReference<StableRef<T>?>(null)
init {
worker.execute(TransferMode.SAFE, {Pair(arStable, producer).freeze()}){
it.first.value = StableRef.create(it.second()).freeze()
}
}
fun <R> access(block:(T)->R):R{
return worker.execute(TransferMode.SAFE, {Pair(arStable, block).freeze()}){
it.second(it.first.value!!.get())
}.result
}
}
object Log{
private val fileLogger = MutableThing { FileLogger() }
fun log(s:String){
fileLogger.access { fl -> fl.log(s) }
}
}
class FileLogger{
fun log(s:String){}
}
The MutableThing uses StableRef internally. producer makes the mutable state you want to isolate. To log something, call Log.log, which will wind up calling the mutable FileLogger.
To see a basic example of MutableThing, run the following test:
#Test
fun goIso(){
val mt = MutableThing { mutableListOf("a", "b")}
val workers = Array(4){Worker.start()}
val futures = mutableListOf<Future<*>>()
repeat(1000) { rcount ->
val future = workers[rcount % workers.size].execute(
TransferMode.SAFE,
{ Pair(mt, rcount).freeze() }
) { pair ->
pair.first.access {
val element = "ttt ${pair.second}"
println(element)
it.add(element)
}
}
futures.add(future)
}
futures.forEach { it.result }
workers.forEach { it.requestTermination() }
mt.access {
println("size: ${it.size}")
}
}
The approach you've taken is pretty much correct and the way it's supposed to be done.
The thing I would add is, instead of passing around a pointer around. You should pass around a frozen FileLogger, which will internally hold a reference to a AtomicRef<DetachedObjectGraph>, the the attaching and detaching should be done internally. Especially since DetachedObjectGraphs are invalid once attached.
Imagine we had a object like this
class Foo {
List<int> data = [];
void addAndCheck(int n){
for(int number in data){
// check something
}
data.add(n);
}
}
and the imagine we spawn a bunch of subscriptions like this
Foo foo = Foo();
for(int i = 0; i++; i<10){
subscriptions.add(api.someRandomStream().listen((response){
foo.addAndCheck(response.value);
}));
}
As it stands, if this code is run it might work but as soon as the streams start emitting around the same time we get a exception: Concurrent modification during iteration
The cause is the for loop, but how can this problem be solved? In a language like Java there are things like ConcurrentHashMap, Collections.synchronizedList(...), etc..
If you get a concurrent modification error during the iteration, then you are doing something asynchronous inside the loop. That is, your function is probably async and there is at least one await inside the loop. That will allow another event to trigger while you are awaiting, and then modify the list.
There are several ways to avoid the exception, all with different trade-offs:
Don't do anything asynchronous in the loop, and make sure that nothing you do in there will call addAndCheck again. Then there should be no problem because the loop will complete before anyone else has a chance to modify the list. That obviously only works if you don't need to do something asynchronous.
Copy the list. If you do for(int number in [...data]) { ... } (or in data.toList() as it used to be written), then the list that you iterate is a different list than the one which is modified. It also means that you might not have checked all the elements that are actually in the list at the point you reach the add call.
Don't use an iterator. If you do for (int i = 0; i < data.length; i++) { var number = data[i]; ... } instead, you will not get a concurrent modification error from the iterator. If elements are added at the end of the list, then you will eventually reach them, and all is well. If elements are removed from the list, or added in any place other than at the end, then you might be skipping elements or seeing some of them twice, which may be bad for you.
Use a mutex. If you want to be sure that all the tests on existing elements are performed before any other element is added, then you need to prevent anything from happening while you are adding. Assume a Mutex class of some sort, which would allow you to write code like:
class Foo {
List<int> data = [];
final _mutex = Mutex();
void addAndCheck(int n) async {
await _mutex.acquire();
for(int number in data){
// check something
}
data.add(n);
_mutex.release();
}
}
(I found package:mutex by searching, I have no experience with it).
This might slow down your code, though, making every operation wait for the previous one to complete entirely.
In the end, only you can say which trade-off is best for the behavior of your code.
I've been searching for a function that takes an object of type Lock
and runs a block of code with that lock taking care of locking and also unlocking.
I'd implement it as follows:
fun <T : Lock> T.runLocked(block: () -> Unit) {
lock()
try {
block()
} finally {
unlock()
}
}
Used like this:
val l = ReentrantLock()
l.runLocked {
println(l.isLocked)
}
println(l.isLocked)
//true
//false
Anything available like this? I could only find the synchronized function which cannot be used like this.
You are looking for withLock, which has the exact implementation you've written yourself, except it has a generic parameter for the result of the block instead of the receiver type.
You can find other concurrency related methods of the standard library here, in the kotlin.concurrent package.
I am writing multi-threaded server that handles async read from many tcp sockets. Here is the section of code that bothers me.
void data_recv (void) {
socket.async_read_some (
boost::asio::buffer(rawDataW, size_t(648*2)),
boost::bind ( &RPC::on_data_recv, this,
boost::asio::placeholders::error,
boost::asio::placeholders::bytes_transferred));
} // RPC::data_recvW
void on_data_recv (boost::system::error_code ec, std::size_t bytesRx) {
if ( rawDataW[bytesRx-1] == ENDMARKER { // <-- this code is fine
process_and_write_rawdata_to_file
}
else {
read_socket_until_endmarker // <-- HELP REQUIRED!!
process_and_write_rawadata_to_file
}
}
Nearly always the async_read_some reads in data including the endmarker, so it works fine. Rarely, the endmarker's arrival is delayed in the stream and that's when my program fails. I think it fails because I have not understood how boost bind works.
My first question:
I am confused with this boost totorial example , in which "this" does not appear in the handler declaration. ( Please see code of start_accept() in the example.) How does this work? Does compiler ignore the "this" ?
my second question:
In the on_data_recv() method, how do I read data from the same socket that was read in the on_data() method? In other words, how do I pass the socket as argument from calling method to the handler? when the handler is executed in another thread? Any help in form of a few lines of code that can fit into my "read_socket_until_endmarker" will be appreciated.
My first question: I am confused with this boost totorial example , in which "this" does not appear in the handler declaration. ( Please see code of start_accept() in the example.) How does this work? Does compiler ignore the "this" ?
In the example (and I'm assuming this holds for your functions as well) the start_accept() is a member function. The bind function is conveniently designed such that when you use & in front of its first argument, it interprets it as a member function that is applied to its second argument.
So while a code like this:
void foo(int x) { ... }
bind(foo, 3)();
Is equivalent to just calling foo(3)
Code like this:
struct Bar { void foo(int x); }
Bar bar;
bind(&foo, &bar, 3)(); // <--- notice the & before foo
Would be equivalent to calling bar.foo(3).
And thus as per your example
boost::bind ( &RPC::on_data_recv, this, // <--- notice & again
boost::asio::placeholders::error,
boost::asio::placeholders::bytes_transferred)
When this object is invoked inside Asio it shall be equivalent to calling this->on_data_recv(error, size). Checkout this link for more info.
For the second part, it is not clear to me how you're working with multiple threads, do you run io_service.run() from more than one thread (possible but I think is beyond your experience level)? It might be the case that you're confusing async IO with multithreading. I'm gonna assume that is the case and if you correct me I'll change my answer.
The usual and preferred starting point is to have just one thread running the io_service.run() function. Don't worry, this will allow you to handle many sockets asynchronously.
If that is the case, your two functions could easily be modified as such:
void data_recv (size_t startPos = 0) {
socket.async_read_some (
boost::asio::buffer(rawDataW, size_t(648*2)) + startPos,
boost::bind ( &RPC::on_data_recv, this,
startPos,
boost::asio::placeholders::error,
boost::asio::placeholders::bytes_transferred));
} // RPC::data_recvW
void on_data_recv (size_t startPos,
boost::system::error_code ec,
std::size_t bytesRx) {
// TODO: Check ec
if (rawDataW[startPos + bytesRx-1] == ENDMARKER) {
process_and_write_rawdata_to_file
}
else {
// TODO: Error if startPos + bytesRx == 648*2
data_recv(startPos + bytesRx);
}
}
Notice though that the above code still has problems, the main one being that if the other side sent two messages quickly one after another, we could receive (in one async_read_some call) the full first message + part of the second message, and thus missing the ENDMARKER from the first one. Thus it is not enough to only test whether the last received byte is == to the ENDMARKER.
I could go on and modify this function further (I think you might get the idea on how), but you'd be better off using async_read_until which is meant exactly for this purpose.