how to format multiple files via multithreading

I'm using multithreading to change the case of multiple text files, and I'm comparing the time required against the number of threads I use. I don't understand how to select a set of files for processing.

What I do (I have done this in C/C++, Java, and Python):
1. Create a queue with enough space to hold all of the filenames.
2. Put all of the filenames in the queue.
3. Create and start the number of threads you want.
4. Each thread needs to know where the queue is.
5. A thread tries to get a filename from the queue.
6. If the queue is empty, the thread exits.
7. Otherwise the thread processes the file, then goes back to step 5.
8. Wait for the threads to finish.
That's it.
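A minimal Python sketch of the recipe above, in case it helps. The names `change_case`, `worker` and `process_files` are made up for illustration, and upper-casing is only an assumed stand-in for whatever case change you need:

```python
import queue
import threading
from pathlib import Path


def change_case(path: Path) -> None:
    """Hypothetical per-file job: rewrite one text file in upper case."""
    path.write_text(path.read_text().upper())


def worker(names: queue.Queue) -> None:
    while True:
        try:
            path = names.get_nowait()   # try to get a filename from the queue
        except queue.Empty:
            return                      # queue is empty: this thread exits
        change_case(path)               # process the file, then loop again


def process_files(paths, num_threads: int) -> None:
    names = queue.Queue()               # unbounded, so it can hold every filename
    for p in paths:
        names.put(Path(p))
    # Each thread is handed the queue as an argument, so it knows where it is.
    threads = [threading.Thread(target=worker, args=(names,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:                   # wait for the threads to finish
        t.join()
```

To compare thread counts, you could wrap calls such as `process_files(paths, 1)`, `process_files(paths, 4)` and so on in `time.perf_counter()` measurements. In CPython, threads mostly pay off when the per-file work is dominated by I/O, so the speed-up for small files may be modest.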

Related

Bash - how to redirect stdout of a certain thread?

Suppose I have a C program that creates threads for different tasks. I want to redirect the stdout of one particular thread from a bash script.
Assume I always have a way to get the process ID and the thread ID; I only want to know whether this is possible from a bash script, and how.
Note: this is about a thread, not a process, and I haven't found any related questions yet.

Standard output belongs to the process, not to its threads: there is only one console, not one per thread. So when 5 threads write to stdout in parallel, everything goes into a single sink and is interleaved nondeterministically.
Unless each line contains a specific string that identifies the originating thread, you can't take that output apart after the fact.
Alternatively, have your threads write to different files. When output from different sources isn't thrown together, it is much easier to trace it back to the individual threads later on.
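The "different files" idea might look like the following. The question is about a C program, so this Python sketch only illustrates the pattern; the file naming scheme `thread-<n>.log` and the function names are assumptions, not anything the original program uses:

```python
import threading
from pathlib import Path


def task(out_dir: Path, index: int) -> None:
    # Each thread opens its own log file instead of sharing stdout.
    log_path = out_dir / f"thread-{index}.log"
    with log_path.open("w") as log:
        for step in range(3):
            print(f"worker {index} step {step}", file=log)


def run_workers(out_dir: Path, count: int) -> None:
    threads = [threading.Thread(target=task, args=(out_dir, i))
               for i in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A bash script can then read or follow just the file that belongs to the thread it cares about.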

File writing from multiple threads.

I have an application A that calls another application B, which does some calculation and writes to a file, File.txt.
A invokes multiple instances of B through multiple threads, and each instance tries to write to the same file, File.txt.
Here is the actual problem:
Since multiple threads try to access the same file, the file access throws an exception, which is to be expected.
I tried using a concurrent queue in a singleton class: each instance of B adds to the queue, and another thread in this class dequeues the items and writes them to File.txt. The queue is read synchronously and the write operations succeed. This works fine.
With many threads and many items in the queue the file writing still works, but if for some reason the queue crashes or stops abruptly, all the information that was supposed to be written to the file is lost.
If I make B write to the file synchronously, without the queue, it will be slow because it has to check for file locking, but there is less chance of losing data because B writes to the file immediately.
What would be the best approach or design to handle this scenario? I don't need a response after the file writing is completed, and I can't make B wait for the file writing to complete.
Would async/await file writing be of any use here?

I think what you've done is the best that can be done. You may have to tune your producer/consumer queue solution if there are still problems, but it seems to me that you've done rather well with this approach.
If an in-memory queue isn't the answer, perhaps externalizing that to a message queue and a pool of listeners would be an improvement.
Relational databases and transaction managers were built to solve this problem. Why continue with a file-based solution? Is it possible to explore an alternative?
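For reference, the in-memory producer/consumer arrangement discussed above could be sketched like this in Python. It is only an illustration under assumptions: the names `writer` and `run_demo` and the message format are invented, and the original applications may be written in a different language entirely:

```python
import queue
import threading

_STOP = object()  # sentinel that tells the writer thread to finish


def writer(lines: queue.Queue, path: str) -> None:
    # The only thread that ever touches the output file.
    with open(path, "a") as out:
        while True:
            item = lines.get()          # blocks until a producer adds something
            if item is _STOP:
                break
            out.write(item + "\n")
            out.flush()                 # hand each line to the OS promptly


def run_demo(path: str, producers: int = 4, per_producer: int = 100) -> None:
    lines = queue.Queue()
    w = threading.Thread(target=writer, args=(lines, path))
    w.start()

    def produce(pid: int) -> None:
        for n in range(per_producer):
            lines.put(f"producer {pid} result {n}")   # returns immediately

    ps = [threading.Thread(target=produce, args=(i,)) for i in range(producers)]
    for p in ps:
        p.start()
    for p in ps:
        p.join()
    lines.put(_STOP)                    # all producers done: stop the writer
    w.join()
```

Producers never wait for the disk, which is the property the question asks for. Anything still sitting in the queue is lost if the process dies, which is exactly the weakness described in the question.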

Is there a better approach or design to handle this scenario?
You can make each producer thread write to its own rolling file instead of queuing the operation. Every X seconds the producers move to new files, and an aggregation thread wakes up, reads the previous files (one per producer) and writes the results to the final File.txt output file. No read/write locks are required here.
This ensures safe recovery, since the rolling files exist until you process and delete them.
It also means that you always write to disk, which is much slower than queuing tasks in memory and writing to disk in bulk. But that's the price you pay for consistency.
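A rough Python sketch of the rolling-file idea follows. Everything named here is an assumption made for illustration: the `producer-<id>.<generation>.part` naming scheme, the notion of a numbered "generation" standing in for each X-second window, and the helper names:

```python
import os
from pathlib import Path


def part_path(spool: Path, producer_id: int, generation: int) -> Path:
    # Hypothetical naming scheme: one file per producer per time window.
    return spool / f"producer-{producer_id}.{generation}.part"


def append_result(spool: Path, producer_id: int, generation: int, line: str) -> None:
    # Only this producer ever writes to this file, so no lock is needed.
    with part_path(spool, producer_id, generation).open("a") as part:
        part.write(line + "\n")


def aggregate(spool: Path, generation: int, final_path: Path) -> int:
    """Append all part files of a finished generation to final_path."""
    parts = sorted(spool.glob(f"producer-*.{generation}.part"))
    with final_path.open("a") as out:
        for part in parts:
            out.write(part.read_text())
        out.flush()
        os.fsync(out.fileno())          # make the merged data durable first
    for part in parts:                  # only then drop the recovery copies
        part.unlink()
    return len(parts)
```

The aggregator must only be run on a generation that every producer has already left. If it crashes after merging but before deleting, a rerun would append the same lines twice, so a real implementation would need some way to detect that.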
Would async/await file writing be of any use here?
Using asynchronous IO has nothing to do with this. The problems you mentioned were 1) shared resources (the output file) and 2) lack of consistency (when the queue crashes), neither of which is what async programming is about.
The reason async is in the picture is that I don't want to delay B's existing work because of this file writing operation.
Async would indeed help you with that. Whatever pattern you choose to implement (to solve the original problem), it can always be made async simply by using the asynchronous IO APIs.

threading the same job to work on separate files

I want to run a job on 10 threads to process 100 files. Each thread is supposed to work on a separate file. When a thread is done, it is supposed to pick up the next file.
What I'm doing right now is basically running a loop: kick off the job in the background (using &), wait for any process to end if the count of processes is greater than 10, and pick up the next file. It works, but is there a better way to achieve this?

I don't see any better solution, as long as each file has to be processed separately.

You'd be better off having each thread process its file, then try to pick up the next file itself. Maybe it won't matter in your current application, but the starting and teardown of threads is relatively expensive. Common practice is to keep a thread alive if it is otherwise just going to be replaced by a clone of itself.
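One way to get long-lived workers that each pick up the next file themselves is a fixed-size thread pool. The question uses shell background jobs, so this Python sketch is only an illustration of the pattern; `job` is a hypothetical placeholder (it just counts bytes) for the real per-file work:

```python
from concurrent.futures import ThreadPoolExecutor


def job(path: str) -> int:
    """Hypothetical per-file job: return the size of the file in bytes."""
    with open(path, "rb") as f:
        return len(f.read())


def run_all(paths, workers: int = 10):
    # The pool starts at most `workers` threads and reuses them; each thread
    # takes the next pending file as soon as it finishes the previous one.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(job, paths))   # results come back in input order
```

With 100 paths and `workers=10`, no thread is torn down and recreated between files.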

How to improve performance using multithreading?

I've got a program which receives string messages from other applications and parses them using VCL.
Messages are sent as follows:
AtomId := GlobalAddAtom(PChar(s));
SendMessage(MyProgramHandle, WM_MSG, 0, AtomID);
GlobalDeleteAtom(AtomID);
My program receives this message, parses it for some time, and then returns control to the sending application.
It takes time to parse one message, so the performance of the other applications suffers.
One possible solution is to create a form with the same caption and the same class in another thread, and rename the class of the main form.
But as far as I know, creating forms in threads isn't recommended.
So, what are possible ways to improve performance?

The typical approach would be to create a worker thread (or a pool of worker threads). The main thread will continue to receive the messages, but instead of parsing them it will just add them to a queue (a linked list, for example).
The worker thread takes the first element in the queue and processes it. When done it goes back to the queue to get the next element.
Since the queue is a shared resource between multiple threads you have to control access to it. A mutex will ensure that only one thread gets access to the queue at any given time.
Good luck.
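The queue-plus-mutex arrangement described in this answer can be sketched as follows. The question is about Delphi/VCL, so this Python version only shows the shape of the idea; `MessageQueue`, `parse` and the `;`-separated message format are assumptions. A condition variable (which wraps a mutex) is used so that the worker sleeps while the queue is empty instead of spinning:

```python
import threading
from collections import deque


class MessageQueue:
    """A queue shared between the receiving thread and the worker."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()   # wraps a mutex

    def put(self, msg):
        with self._cond:                     # only one thread in here at a time
            self._items.append(msg)
            self._cond.notify()              # wake a waiting worker

    def get(self):
        with self._cond:
            while not self._items:           # sleep until something arrives
                self._cond.wait()
            return self._items.popleft()


def parse(msg: str) -> list:
    # Hypothetical stand-in for the slow parsing step.
    return msg.split(";")


def worker(q: MessageQueue, results: list) -> None:
    while True:
        msg = q.get()                        # take the first element in the queue
        if msg is None:                      # None is used here as a stop signal
            return
        results.append(parse(msg))           # then go back for the next one
```

The receiving side only calls `q.put(message)` and returns straight away, so senders are no longer blocked for the duration of the parse.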

So the problem is that both the receiving of the messages and the VCL operations are done in the same thread (the main VCL thread)? So the receiving and processing are serialized and, as a result, the senders are blocked while your app is busy filling the grid? Then I can understand why you are asking for a way to move the receiving to a different window message loop.
I would create a window (not a VCL form) solely to receive messages, and use its message loop to add the messages to a queue. You then only need to find this (non-VCL) window and SendMessage to its handle. In the VCL thread, a timer could fetch the next "n" messages and add them to the grid.
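The "fetch the next n messages on each timer tick" part could look roughly like this. It is a language-neutral illustration written in Python, not Delphi; `drain` is a made-up helper name, and the timer itself is left out:

```python
import queue


def drain(pending: queue.Queue, limit: int) -> list:
    """Take up to `limit` messages without ever blocking the UI thread."""
    batch = []
    while len(batch) < limit:
        try:
            batch.append(pending.get_nowait())
        except queue.Empty:
            break                       # nothing left for this tick
    return batch
```

Each timer tick would call `drain(pending, n)` and add the returned batch to the grid, so the UI thread does a bounded amount of work per tick.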

Worker processes called in order azure

If multiple worker processes have to be called in order, each one only after the previous worker has finished all of its tasks (there is a queue containing pointers to blobs, and every worker has multiple instances; please see my previous questions), how should this be done?
Will the Azure fabric do this automatically, or is there a way to set this in the config file?

You just follow the same process that you've already got, but with more layers. If worker 1 reads something from queue 1 and needs to let worker 2 know that it's time to start processing the same file, worker 1 simply puts a message in queue 2.
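As a small illustration of chaining queues this way, here is a sketch that uses in-process Python queues as stand-ins for the Azure queues. It does not use any Azure API; the two processing steps and all names are invented, and it assumes a single instance per worker so that a simple end-of-input marker is enough:

```python
import queue
import threading


def stage(inbox: queue.Queue, outbox: queue.Queue, step) -> None:
    while True:
        item = inbox.get()
        if item is None:                # end-of-input marker: pass it along
            outbox.put(None)
            return
        outbox.put(step(item))          # tell the next worker this item is ready


def run_pipeline(names):
    q1, q2, done = queue.Queue(), queue.Queue(), queue.Queue()
    w1 = threading.Thread(target=stage, args=(q1, q2, str.upper))
    w2 = threading.Thread(target=stage,
                          args=(q2, done, lambda s: s + ".processed"))
    w1.start()
    w2.start()
    for n in names:
        q1.put(n)
    q1.put(None)
    w1.join()
    w2.join()
    results = []
    while True:
        item = done.get()
        if item is None:
            break
        results.append(item)
    return results
```

Here each item moves to worker 2 as soon as worker 1 is done with it, without waiting for the rest of the batch.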
Edit: OK, let me see if I fully understand what you're after here. It sounds like what you have is a batch of files that need to go through several processes, but they can't go on to the next step until they've all finished going through the previous step.
If that is the case then, no, there is nothing in Azure that will do that for you automatically.
Because of this, if possible I'd rework my workers so that each file could just be sent on without worrying about what state the other files were in.
If that is not possible, then you need some way of monitoring which files have been completed and which ones are still pending. One way to do this (and hopefully you can expand on it) is for the code that creates the batch to create a progress row in a table somewhere (SQL Azure or Azure Tables, it doesn't really matter) for each file, send a message to worker 1, and start a background task to monitor this table.
When worker 1 finishes processing a file, it updates the relevant row in the monitoring table to say "Worker 1 finished".
The background task created above waits until all of the rows have "Worker 1 finished" set to true, then creates the messages for worker 2 and starts looking at the "Worker 2 finished" flag. Rinse and repeat for as many worker steps as you have.
When all steps are finished, you'll probably want the background task to clean up this table and also have some sort of timeout in case a message gets lost somewhere.
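To make the monitoring idea concrete, here is a sketch with an in-memory table standing in for SQL Azure or Azure Tables. Nothing here is an Azure API: `ProgressTable`, `run_batch` and the `dispatch` callback (which represents "put a message on the queue for this step's workers") are all hypothetical:

```python
import threading
import time


class ProgressTable:
    """In-memory stand-in for the progress table: one row of flags per file."""

    def __init__(self, files, steps: int):
        self._lock = threading.Lock()
        self._done = {f: [False] * steps for f in files}

    def mark_finished(self, file, step: int) -> None:
        with self._lock:
            self._done[file][step] = True    # "Worker <step> finished"

    def step_complete(self, step: int) -> bool:
        with self._lock:
            return all(flags[step] for flags in self._done.values())


def run_batch(files, steps: int, dispatch,
              poll_seconds: float = 0.01, timeout: float = 5.0) -> ProgressTable:
    table = ProgressTable(files, steps)
    for step in range(steps):
        for f in files:
            dispatch(table, f, step)         # one message per file for this step
        deadline = time.monotonic() + timeout
        while not table.step_complete(step):  # wait for the whole batch
            if time.monotonic() > deadline:   # e.g. a message got lost
                raise TimeoutError(f"step {step} did not finish in time")
            time.sleep(poll_seconds)
    return table
```

Workers call `table.mark_finished(file, step)` when they are done with a file, and the monitor only releases the next step once every row has that flag set.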

Although what #knightpfhor is suggesting would do the trick, I would try to go about this in a simpler way, without referencing the names of workers :-)
Specifically, if you already know how many docs need to be processed, I would first create N rows in a table, each holding some info relevant to the current batch and each having columnKey set to the batch id. I'd then put N messages in my queue and let the worker processes pick them up. When each worker is done, it deletes the corresponding row in the table as well. A monitoring process would simply know that a batch has started and do a count every once in a while (if timing is not critical; otherwise the worker would do the count after it finishes removing its row), then spawn a new message in the relevant queue for the next worker role to process.
If you want even more control, you could have a row in your table that stores the state of your process (processing files, post-processing, etc.). In this case, I'd store the state transitions in a queue, and make sure you only make them once. But that's a whole new question altogether.
Hope it helps.
