File writing from multiple threads - multithreading

I have an application A which calls another application B, which does some calculation and writes to a file, File.txt.
A invokes multiple instances of B through multiple threads, and each instance tries to write to the same File.txt.
Here comes the actual problem:
Since multiple threads try to access the same file, the file access throws an exception, as you would expect.
I tried an approach of using a concurrent queue in a singleton class: each instance of B adds to the queue, and another thread in this class takes care of dequeuing the items and writing them to File.txt. The queue is consumed synchronously and the write operations succeed. This works fine.
If I have too many threads and too many items in the queue, the file writing still works, but if for some reason my queue crashes or stops abruptly, all the information that was supposed to be written to the file is lost.
If I make the file writing synchronous from B without using the queue, it will be slow, as each write needs to wait on the file lock, but there is less chance of data being lost, since B writes to the file immediately.
What would be the best approach or design to handle this scenario? I don't need a response after the file writing is completed, and I can't make B wait for the file writing to complete.
Would async/await file writing be of any use here?

I think what you've done is the best that can be done. You may have to tune your producer/consumer queue solution if there are still problems, but it seems to me that you've done rather well with this approach.
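For concreteness, a minimal sketch of such a producer/consumer writer, assuming .NET (the question mentions async/await); the type and member names here are illustrative, not the asker's actual code:

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// Illustrative sketch: producers (the instances of B) enqueue lines; one
// dedicated consumer drains the queue, so only a single thread ever
// touches File.txt and there is no lock contention.
class QueuedFileWriter
{
    private readonly BlockingCollection<string> _queue = new BlockingCollection<string>();

    public QueuedFileWriter(string path)
    {
        Task.Run(() =>
        {
            // Blocks while the queue is empty; ends after CompleteAdding().
            foreach (var line in _queue.GetConsumingEnumerable())
                File.AppendAllText(path, line + System.Environment.NewLine);
        });
    }

    // Called by B; returns immediately without waiting for the disk.
    public void Write(string line) => _queue.Add(line);

    // Lets the consumer drain the remaining items and stop.
    public void Shutdown() => _queue.CompleteAdding();
}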
If an in-memory queue isn't the answer, perhaps externalizing that to a message queue and a pool of listeners would be an improvement.
Relational databases and transaction managers were born to solve this problem. Why continue with a file-based solution? Is it possible to explore an alternative?

Is there a better approach or design to handle this scenario?
You can make each producer thread write to its own rolling file instead of queuing the operation. Every X seconds the producers move on to new files, and an aggregation thread wakes up, reads the previous files (one per producer), and writes the results to the final File.txt output file. No read/write locks are required here.
This ensures safe recovery, since the rolling files exist until you process and delete them.
This also means that you always write to disk, which is much slower than queuing tasks in memory and writing to disk in bulk. But that's the price you pay for consistency.
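A hedged sketch of that rolling-file scheme, again assuming .NET; all names are illustrative, and it assumes the aggregator only runs after the producers have rolled to fresh files:

using System;
using System.IO;

// Each producer owns one .part file at a time, so appends need no locking.
class RollingFileWriter
{
    private readonly string _producerId;
    private string _currentPath;

    public RollingFileWriter(string producerId)
    {
        _producerId = producerId;
        Roll();
    }

    public void Write(string line) =>
        File.AppendAllText(_currentPath, line + Environment.NewLine);

    // Every X seconds the producer switches to a new file; the old one
    // becomes eligible for aggregation.
    public void Roll() =>
        _currentPath = $"{_producerId}-{DateTime.UtcNow.Ticks}.part";
}

class Aggregator
{
    // Merges completed .part files into File.txt, deleting each one only
    // after its contents are safely appended; that is what makes the
    // scheme recoverable after a crash.
    public void MergeOnce()
    {
        foreach (var part in Directory.GetFiles(".", "*.part"))
        {
            File.AppendAllText("File.txt", File.ReadAllText(part));
            File.Delete(part);
        }
    }
}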
Would async/await file writing be of any use here?
Using asynchronous IO has nothing to do with this. The problems you mentioned were 1) a shared resource (the output file) and 2) lack of consistency (when the queue crashes), neither of which is what async programming is about.
Why async is in the picture is because I don't want to delay B's existing work because of this file writing operation.
Async would indeed help you with that. Whatever pattern you choose to implement (to solve the original problem), it can always be made async by merely using the asynchronous IO APIs.
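For instance, the queue consumer from earlier could drain using asynchronous IO so that no thread blocks on the disk. A minimal sketch, assuming .NET Core 2.0+ where File.AppendAllTextAsync is available:

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class AsyncFileWriter
{
    private readonly BlockingCollection<string> _queue = new BlockingCollection<string>();

    // B returns immediately; the disk never delays it.
    public void Enqueue(string line) => _queue.Add(line);

    public async Task DrainAsync(string path)
    {
        foreach (var line in _queue.GetConsumingEnumerable())
        {
            // The await releases the thread while the OS performs the write.
            await File.AppendAllTextAsync(path, line + System.Environment.NewLine);
        }
    }
}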

Related

Backpressuring Snowflake using "rowStreamHighWaterMark" in snowflake-sdk?

I'm using snowflake-sdk and snowflake-promise to stream results (to avoid loading too many objects in memory).
For each streamed row, I want to process the received information (an ETL-like job that performs write-backs). My code is quite basic and similar to this simplistic snowflake-promise example.
My current problem is that .on('data', ...) is called more often than I can manage to handle. (My ETL-like job can't keep up with the received rows and my DB connection pool to perform write-backs gets exhausted).
I tried setting rowStreamHighWaterMark to various values (1, 10 [default], 100, 1000, 2000 and 4000) in an effort to slow down/backpressure stream.Readable but, unfortunately, it didn't change anything.
What did I miss? How can I better control when to consume the read data?
If this were written synchronously, you would see that "being pushed more data than you can handle writing at the same time" cannot happen, because:
while (data) {
    row = data.readRow()
    doSomethingAwesome(row)
    writeDataViaPoolThatBacksUp(row)
}

just cannot spin too fast.
Now if you are accepting data on one async thread, pushing that data onto a queue, and draining the queue on another async thread, you will get the problem you describe (that is, your queue explodes). So you need to slow or pause the reading thread's completion of reads when the writing thread gets too far behind.
Since the reader is writing to the assumed queue, when that queue gets too long, stop reading.
The other way you might be doing this is with no work queue, firing off an async write each time the conditions are met. This is bad because you have no tracking of outstanding work, and you are doing many small updates to the DB, which, if it is Snowflake, it really dislikes. A better approach would be to build a local set of data changes, call it a batch, and when your batch gets to a certain size, flush the change set in one operation (and flush the batch when the input is completed, to catch the dregs).
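A sketch of that batching idea, written here in C# to match the other threads on this page (the pattern translates directly to Node; all names are illustrative):

using System;
using System.Collections.Generic;

// Accumulates changes locally and flushes them in bulk, so the database
// sees one large write instead of many tiny ones.
class BatchWriter<T>
{
    private readonly List<T> _batch = new List<T>();
    private readonly int _flushSize;
    private readonly Action<IReadOnlyList<T>> _flush; // e.g. one bulk INSERT

    public BatchWriter(int flushSize, Action<IReadOnlyList<T>> flush)
    {
        _flushSize = flushSize;
        _flush = flush;
    }

    public void Add(T change)
    {
        _batch.Add(change);
        if (_batch.Count >= _flushSize)
            Flush();
    }

    // Call once more when the input completes, to catch the dregs.
    public void Flush()
    {
        if (_batch.Count == 0) return;
        _flush(_batch);
        _batch.Clear();
    }
}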
The Snowflake support got back to me with an answer.
They told me to create the connection this way:
var connection = snowflake.createConnection({
    account: "testaccount",
    username: "testusername",
    password: "testpassword",
    rowStreamHighWaterMark: 5
});
Full disclaimer: my project has changed and I could NOT recreate the problem in my local environment, so I couldn't assess the answer's validity. Still, I wanted to share it in case somebody can get some hints from this information.

Understanding the Event-Loop in node.js

I've been reading a lot about the event loop, and I understand the abstraction provided, whereby I can make an I/O request (let's use fs.readFile(foo.txt)) and just pass in a callback that will be executed once the event indicating completion of the file read is fired. However, what I do not understand is where the function that does the actual work of reading the file is executed. JavaScript is single-threaded, but there are two things happening at once: the execution of my Node.js file and some program/function actually reading data from the hard drive. Where does this second task take place in relation to Node?
The Node event loop is truly single threaded. When we start up a program with Node, a single instance of the event loop is created and placed into one thread.
However, for some standard library function calls, the Node C++ side and libuv decide to do expensive work outside of the event loop entirely, so they will not block the main event loop. Instead they make use of something called the thread pool: a series of (by default) four threads that can be used for running computationally intensive tasks. There are ONLY FOUR things that use this thread pool: DNS lookups, fs, crypto, and zlib. Everything else executes on the main thread.
"Of course, on the backend, there are threads and processes for DB access and process execution. However, these are not explicitly exposed to your code, so you can’t worry about them other than by knowing that I/O interactions e.g. with the database, or with other processes will be asynchronous from the perspective of each request since the results from those threads are returned via the event loop to your code. Compared to the Apache model, there are a lot less threads and thread overhead, since threads aren’t needed for each connection; just when you absolutely positively must have something else running in parallel and even then the management is handled by Node.js." via http://blog.mixu.net/2011/02/01/understanding-the-node-js-event-loop/
It's like using setTimeout(function(){ /* file reading code here */ }, 1000);. JavaScript can run multiple things side by side, like having three setInterval(function(){ /* code to execute */ }, 1000); timers going at once. So, in a way, JavaScript is multi-threading. And for actually reading from or writing to the hard drive in Node.js, you can shell out to a child process:

var child = require("child_process");

function put_text(file, text) {
    child.exec("echo " + text + " > " + file);
}

function get_text(file, callback) {
    // exec is asynchronous, so the contents arrive via a callback
    // rather than a return value
    child.exec("cat " + file, function (err, stdout, stderr) {
        callback(stdout);
    });
}

These child processes can also be used for reading and writing to/from the hard drive in Node.js.

How to have many consumer threads using BlockingCollection

I am using a producer / consumer pattern backed with a BlockingCollection to read data off a file, parse/convert and then insert into a database. The code I have is very similar to what can be found here: http://dhruba.name/2012/10/09/concurrent-producer-consumer-pattern-using-csharp-4-0-blockingcollection-tasks/
However, the main difference is that my consumer threads not only parse the data but also insert it into a database. This bit is slow, and I think it is causing the threads to block.
In the example, there are two consumer threads. I am wondering if there is a way to increase the number of threads in a somewhat intelligent way? I had thought a thread pool would do this, but I can't seem to grasp how that would be done.
Alternatively, how would you go about choosing the number of consumer threads? Two does not seem right to me, but I'm not sure what the best number would be. Thoughts on the best way to choose the number of consumer threads?
The best way to choose the number of consumer threads is math: figure out how many packets per minute are coming in from the producers, divide that by how many packets per minute a single consumer can handle, and you have a pretty good idea of how many consumers you need. For example, if the producers deliver 6,000 packets per minute and one consumer can process 1,500 per minute, you need at least four consumers.
I solved the blocking output problem (consumers blocking when trying to update the database) by adding another BlockingCollection that the consumers put their completed packets in. A separate thread reads that queue and updates the database. So it looks something like:
input thread(s) => input queue => consumer(s) => output queue => output thread
This has the added benefit of divorcing the consumers from the output, meaning that you can optimize the output or completely change the output method without affecting the consumer. That might allow you, for example, to batch the database updates so that rather than making one database call per record, you could update a dozen or a hundred (or more) records with a single call.
I show a very simple example of this (using a single consumer) in my article Simple Multithreading, Part 2. That works with a text file filter, but the concepts are the same.
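To make the shape concrete, here is a rough sketch of the two-queue pipeline with a batching output thread; Record, Parse, and SaveBatch are hypothetical stand-ins for the real work, and the queue and batch sizes are arbitrary:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class Record { public string Value; }

class Pipeline
{
    // Bounded queues give backpressure: adds block when a queue is full.
    private readonly BlockingCollection<string> _input = new BlockingCollection<string>(1000);
    private readonly BlockingCollection<Record> _output = new BlockingCollection<Record>(1000);

    private Record Parse(string line) => new Record { Value = line };
    private void SaveBatch(List<Record> batch) { /* one DB call for the whole batch */ }

    public void AddLine(string line) => _input.Add(line);
    public void Complete() => _input.CompleteAdding();

    public void Run(int consumerCount)
    {
        // N consumers parse in parallel.
        var consumers = Enumerable.Range(0, consumerCount)
            .Select(_ => Task.Run(() =>
            {
                foreach (var line in _input.GetConsumingEnumerable())
                    _output.Add(Parse(line));
            }))
            .ToArray();

        // One output thread batches the database writes.
        var writer = Task.Run(() =>
        {
            var batch = new List<Record>(100);
            foreach (var record in _output.GetConsumingEnumerable())
            {
                batch.Add(record);
                if (batch.Count == 100) { SaveBatch(batch); batch.Clear(); }
            }
            if (batch.Count > 0) SaveBatch(batch); // flush the tail
        });

        Task.WaitAll(consumers);   // consumers exit once _input completes
        _output.CompleteAdding();  // then let the writer drain and stop
        writer.Wait();
    }
}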

How to process rows of a CSV file using Groovy/GPars most efficiently?

The question is a simple one and I am surprised it did not pop up immediately when I searched for it.
I have a CSV file, a potentially really large one, that needs to be processed. Each line should be handed to a processor until all rows are processed. For reading the CSV file, I'll be using OpenCSV which essentially provides a readNext() method which gives me the next row. If no more rows are available, all processors should terminate.
For this I created a really simple groovy script, defined a synchronous readNext() method (as the reading of the next line is not really time consuming) and then created a few threads that read the next line and process it. It works fine, but...
Shouldn't there be a built-in solution that I could just use? It's not the GPars collection processing, because that always assumes there is an existing collection in memory. I cannot afford to read the whole file into memory and then process it; that would lead to out-of-memory exceptions.
So.... anyone having a nice template for processing a CSV file "line by line" using a couple of worker threads?
Concurrently accessing a file might not be a good idea, and GPars' fork/join processing is only meant for in-memory data (collections). My suggestion would be to read the file sequentially into a list. When the list reaches a certain size, process the entries in the list concurrently using GPars, clear the list, and then continue reading lines.
This might be a good problem for actors. A synchronous reader actor could hand off CSV lines to parallel processor actors. For example:
@Grab(group='org.codehaus.gpars', module='gpars', version='0.12')

import groovyx.gpars.actor.DefaultActor
import groovyx.gpars.actor.Actor

// Reads CSV rows synchronously and replies with the next row on request.
class CsvReader extends DefaultActor {
    void act() {
        loop {
            react {
                reply readCsv()   // assumed helper: returns the next row, or null at EOF
            }
        }
    }
}

// Requests rows from the reader and processes them until EOF.
class CsvProcessor extends DefaultActor {
    Actor reader
    void act() {
        loop {
            reader.send(null)     // ask for the next row
            react {
                if (it == null)
                    terminate()   // no more rows
                else
                    processCsv(it)  // assumed helper: handle one row
            }
        }
    }
}

def N_PROCESSORS = 10
def reader = new CsvReader().start()
(0..<N_PROCESSORS).collect { new CsvProcessor(reader: reader).start() }*.join()
I'm just wrapping up an implementation of a problem just like this in Grails (you don't specify whether you're using Grails, plain Hibernate, plain JDBC, or something else).
There isn't anything out of the box that I'm aware of. You could look at integrating with Spring Batch, but the last time I looked at it, it felt very heavy to me (and not very Groovy).
If you're using plain JDBC, doing what Christoph recommends probably is the easiest thing to do (read in N rows and use GPars to spin through those rows concurrently).
If you're using grails, or hibernate, and want your worker threads to have access to the spring context for dependency injection, things get a bit more complicated.
The way I solved it is by using the Grails Redis plugin (disclaimer: I'm the author) and the Jesque plugin, which is a Java implementation of Resque.
The Jesque plugin lets you create "Job" classes that have a "process" method with arbitrary parameters that are used to process work enqueued on a Jesque queue. You can spin up as many workers as you want.
I have a file upload that an admin user can post a file to; it saves the file to disk and enqueues a job for the ProducerJob that I've created. That ProducerJob spins through the file and, for each line, enqueues a message for a ConsumerJob to pick up. The message is simply a map of the values read from the CSV line.
The ConsumerJob takes those values, creates the appropriate domain object for its line, and saves it to the database.
We were already using Redis in production, so using it as a queueing mechanism made sense. We had an old synchronous load that ran through file loads serially. I'm currently using one producer worker and four consumer workers, and loading things this way is over 100x faster than the old load was (with much better progress feedback to the end user).
I agree with the original question that there is probably room for something like this to be packaged up as this is a relatively common thing.
UPDATE: I put up a blog post with a simple example doing imports with Redis + Jesque.

How do you regulate concurrency/relative process performance in Erlang?

Let's say I have to read from a directory that has many large XML files in it; I have to parse them, send them to some service over the network, and then write the responses to disk again.
If it were Java or C++ etc., I may do something like this (hope this makes sense):
(File read & xml parsing process) -> bounded-queue -> (sender process) -> service
service -> bounded-queue -> (process to parse result and write to disk)
And then I'd assign a suitable number of threads to each process. This way I can limit the concurrency of each stage to its optimal value, and the bounded queues will ensure there won't be a memory shortage, etc.
What should I do, though, when coding in Erlang? I guess I could just implement the whole flow in a function, then iterate over the directory and spawn these "start-to-end" processes as fast as possible. This sounds suboptimal, though, because if parsing the XML takes longer than reading the files, the application could run into memory shortage from having many XML documents in memory at once, and you can't keep the concurrency at its optimal level. E.g., if the "service" is most efficient at a concurrency of 4, it would be very inefficient to hit it with enormous concurrency.
How should Erlang programmers deal with such a situation? I.e., what is the Erlang substitute for a fixed thread pool and a bounded queue?
There is no real way to limit the queue size of a process except by handling all messages in a timely fashion. The best way would be to simply check available resources before spawning and wait if they are insufficient. So if you are worried about memory, check memory before spawning a new process; if disk space, check disk space; etc.
Limiting the number of processes spawned is also possible. A simple construction would be:
pool(Max) ->
    process_flag(trap_exit, true),
    pool(0, Max).

pool(Current, Max) ->
    receive
        {'EXIT', _, _} ->
            % a linked worker finished: free one slot
            pool(Current - 1, Max);
        {work, F, Pid} when Current < Max ->
            Pid ! accepted,
            spawn_link(F),
            pool(Current + 1, Max);
        {work, _, Pid} ->
            % pool is full: turn the request away
            Pid ! rejected,
            pool(Current, Max)
    end.
This is a rough sketch of how a process could limit the number of processes it spawns. It is, however, considered better to limit based on the real resources rather than on an artificial number.
You can definitely run your own process pool in Erlang, but it is a poor way to manage memory usage, since it doesn't take into account the size of the XML data being read (or the total memory used by the processes, for that matter).
I would suggest implementing the whole workflow in a functional library, as you suggested, and spawning processes that execute this workflow. Add a check for memory usage that looks at the size of the data to be read and the available memory (hint: use memsup).
I would suggest you do it in an event-driven paradigm. Imagine you start an OTP gen_server with the list of file names. Then:
1. The gen_server checks resources and spawns the next worker if permitted, removing a file name from the list and passing it to the worker.
2. The worker processes the file and casts a message back to the gen_server when ready (or you can just trap EXIT).
3. The gen_server receives such a message and performs step 1 again, until the file list is empty.
So the workers do the heavy lifting, and the gen_server controls the flow.
You can also create a distributed system, but it's a bit more complex, as you need to spawn intermediate gen_servers on each computer, query them about whether resources are available there, and then choose which computer should process the next file based on the replies. And you probably need something like NFS to avoid sending long messages.
Workers can be further split if you need more concurrency.
