Treadmill infinite file between producer and consumer processes - Linux

I have a "producer" executable (that I can run, but don't control the source for) that continually writes to a growing output file. (Like a log file -- it happens to be binary, but I think of it like a log, just without the nice line breaks).
And I have another "consumer" process that is continually reading from that file in decently big (10 MB) chunks.
The good news:
Once the consumer reads a chunk from the file, it is ok to discard that part of the file; we are done with it forever.
I can keep the consumer from going too fast and catching up.
I'm confident the consumer is fast enough to not fall too far behind the producer.
The bad news: if I let it run, eventually the file will get huge enough to fill up the disk (many GBs) and I have to kill off both processes, erase the file, and start everything over. :(
I'd like to have to restart less often! Is there a way to put this on a "treadmill," or ring buffer, or something similar, where I don't have to have a huge amount of disk space for the full file? The only part I actually need to keep around is the maybe 100 MB buffer between the producer and consumer. I'd even be ok with some bridge process in between them, or some pipe magic, or virtual filesystems, or ???
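One shape the bridge-process idea could take, as a hedged sketch: if the producer will accept a path to a named pipe as its "output file" (created beforehand with mkfifo), nothing ever accumulates on disk, and a small Node bridge can hold the ~100 MB of slack in RAM. Paths and sizes here are illustrative:

const fs = require('fs');
const { PassThrough } = require('stream');

// ~100 MB of in-RAM slack between producer and consumer (size illustrative)
const slack = new PassThrough({ highWaterMark: 100 * 1024 * 1024 });

fs.createReadStream('/tmp/producer.fifo') // the producer writes its "log" here
  .pipe(slack)                            // the bounded in-memory treadmill
  .pipe(process.stdout);                  // run as: node bridge.js | consumer

Backpressure comes for free: if the consumer stalls, the in-RAM slack fills and the producer's writes to the FIFO block until space frees up.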

Related

File append race condition (single thread!)

My program does something like:
Open a file (append mode)
Write some stuff
Close
[repeat]
The file is different most of the time, but on certain occasions (not uncommon, really) the same file is repeated, either consecutively or within a very close iteration.
Is there any chance the kernel can play tricks on me, with a newly opened fd not pointing to the real end of the file? Say a write is not yet completed (buffered somewhere in the kernel) and opening the file again yields an fd positioned before the real end of the file. That would potentially result in overlapping writes.
As I said, my program is single-threaded. I see no reason why this should happen, but I do not fully understand the kernel's guarantees when it comes to this.
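For concreteness, the loop in question as a minimal Node-style sketch (the 'a' flag maps to O_APPEND; the path and payload are illustrative):

const fs = require('fs');

// open (append mode) -> write some stuff -> close -> repeat
for (let i = 0; i < 3; i++) {
  const fd = fs.openSync('/tmp/out.log', 'a'); // 'a' opens with O_APPEND
  fs.writeSync(fd, 'some stuff ' + i + '\n');
  fs.closeSync(fd);
}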
Thanks!

Most efficiently downloading, unzipping, and analyzing many files in Node JS

I have to download a large number of compressed files onto my Node JS server from a third-party host, unzip them, analyze them, and store them. There are a little over 18,000 XML files, each between about 0.01 and 0.06 MB. The files are split into 8 compressed folders of greatly varying size.
Right now, this is my process:
Download the compressed files using the request library
request({ url: fileUrl, encoding: null }, function(err, resp, body) {...});
Write the downloaded files to a directory
fs.writeFile(output, body, function(err) {...});
Unzip the downloaded material using extract-zip and place in a new directory
unzip(output, { dir : directory }, function (err) {...});
Delete the downloaded zip file
fs.unlink('/a-directory/' + output, (err) => { if (err) console.log(err); });
Get the items in the directory
fs.readdir(fromDir, function(err, items) {...});
For each item (XML file), read it
fs.readFile(fromDir + '/' + item, 'utf8', function(err, xmlContents) {...});
For each read XML file, convert it to a JSON
let bill = xmlToJsonParser.toJson(xmlContents)
Will do some other stuff, but I haven't written that part yet
I can post more complete code if that would help anyone.
As you can see, there are a bunch of steps here, and I have a hunch that some of them can be removed or at least made more efficient.
What are your suggestions for improving the performance? Right now the process completes, but I hit 100% CPU every time, which I am fairly certain is bad.
Some general guidelines for scaling this type of work:
Steps that are entirely async I/O scale really well in node.js.
When doing lots of I/O operations, you will want to control how many are in flight at the same time, to limit memory usage and TCP resource usage. So you would launch several hundred requests at a time, not 18,000 requests all at once; as one finishes, you launch the next one (see the sketch after this list).
Steps that use a meaningful amount of CPU should be in a process that you can run N of (often as many as you have CPUs). This lets the CPU-bound work scale.
Try to avoid keeping more in memory than you need to. If you can pipe something directly from network to disk, that can significantly reduce memory usage versus buffering the entire file and then writing the whole thing to disk.
Figure out some way to manage a work queue of jobs waiting for the worker processes. You can either have your main app maintain a queue and use HTTP to ask it for the next job, or you can even work it all through the file system with lock files.
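A minimal sketch of that kind of in-flight cap, reusing the request() style from the question (MAX_IN_FLIGHT and the urls array are illustrative):

const request = require('request');

const urls = [ /* the ~18,000 file URLs */ ];
const MAX_IN_FLIGHT = 200; // tune to your memory/TCP budget
let next = 0, active = 0;

function launchMore() {
  while (active < MAX_IN_FLIGHT && next < urls.length) {
    const url = urls[next++];
    active++;
    request({ url: url, encoding: null }, function (err, resp, body) {
      active--;
      // ...handle err/body here...
      launchMore(); // as one download finishes, start the next
    });
  }
}
launchMore();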
So, here are some more specifics based on these guidelines:
I'd say use your main server process for steps 1 and 2. Neither of the first two steps are CPU intensive so a single server process should be able to handle a zillion of those. All they are doing is async I/O. You will have to manage how many request() operations are in flight at the same time to avoid overloading your TCP stack or your memory usage, but other than that, this should scale just fine since it's only doing async I/O.
You can reduce memory usage in steps 1 and 2 by piping the response directly to the output file so as bytes arrive, they are immediately written to disk without holding the entire file in memory.
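A sketch of that piping, reusing the fileUrl and output names from the question's steps 1-2 (the URL and path here are placeholders):

const fs = require('fs');
const request = require('request');

const fileUrl = 'https://example.com/archive.zip'; // placeholder
const output = 'archive.zip';                      // placeholder

// bytes are written to disk as they arrive; the body is never held in memory
request(fileUrl)
  .on('error', function (err) { console.log(err); })
  .pipe(fs.createWriteStream(output));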
Then write another node.js app that carries out steps 3 - 8 (steps 3 and perhaps 7 are CPU intensive). If you write them in a way that they just "check out" a file from a known directory and work on it, you should be able to run as many of these processes as you have CPUs, gaining scale while also keeping the CPU load away from your main process.
The check-out function can either be done via one central store (like a redis store or even just a simple server of your own that maintains a work queue) that keeps track of which files are available for work or you could even implement it entirely with file system logic using lock files.
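Fanning the work out could look like this minimal sketch (worker.js is a hypothetical script implementing steps 3 - 8 and the check-out loop):

const { fork } = require('child_process');
const os = require('os');

// one worker process per CPU core; each worker repeatedly checks out a
// downloaded zip from the known directory, processes it, and asks for the next
for (let i = 0; i < os.cpus().length; i++) {
  fork('worker.js');
}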
right now the process completes, but I hit 100% CPU every time, which I am fairly certain is bad.
If you only have one process and it's at 100% CPU, then you can increase scale by getting more processes involved.
As you can see, there are a bunch of steps here, and I have a hunch that some of them can be removed or at least made more efficient.
Some ideas:
As mentioned before, pipe your request directly to the next operation rather than buffer the whole file.
If you have the right unzip tools, you could even pipe the request right to an unzipper which is piped directly to a file. If you did this, you'd have to scale the main process horizontally to get more CPUs involved, but this would save reading and writing the compressed file to/from disk entirely. You could conceivably combine steps 1-4 into one stream write with an unzip transform.
If you did the transform stream described in step 2, you would then have a separate set of processes that carry out steps 5-8.
Here are a couple libraries that can be used to combine pipe and unzip:
unzip-stream
node-unzip-2
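For instance, with unzip-stream the download can feed the extractor directly; a hedged sketch (the URL and directory are placeholders):

const request = require('request');
const unzipStream = require('unzip-stream');

const fileUrl = 'https://example.com/archive.zip'; // placeholder
const directory = 'extracted';                     // placeholder

// network -> unzip -> extracted files on disk; the zip never touches disk
request(fileUrl)
  .pipe(unzipStream.Extract({ path: directory }))
  .on('close', function () {
    // all entries extracted into `directory`; ready for the worker processes
  });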

Streaming output from program to an arbitrary number of programs under Linux?

How should I stream the output from one program to an undefined number of programs in such a fashion that the data isn't buffered anywhere, and the application the stream originates from doesn't block even if nothing is reading the stream, but the programs reading the stream do block if there's no output from the first-mentioned program?
I've been trying to Google around for a while now, but all I find is methods where the program does block if nothing is reading the stream.
How should I stream the output from one program to an undefined number of programs in such a fashion that the data isn't buffered anywhere and that the application where the stream originates from doesn't block even if there's nothing reading the stream
Your requirements as stated cannot possibly be satisfied without some form of a buffer.
The most straightforward option is to write the output to a file and let consumers read that file.
Another option is a ring buffer in the form of a memory-mapped file. As the capacity of a ring buffer is normally fixed, there needs to be a policy for dealing with slow consumers. The options are: block the producer; terminate the slow consumer; or let the slow consumer somehow recover after it has missed data.
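To make the slow-consumer policies concrete, here is a sketch of the bookkeeping in JavaScript, with a plain in-memory Buffer standing in for the memory-mapped file; this one implements the "let the slow consumer miss data" policy (all names are illustrative):

class RingBuffer {
  constructor(size) {
    this.buf = Buffer.alloc(size);
    this.head = 0; // total bytes ever written; grows without bound
  }
  write(chunk) {
    // newest data silently overwrites the oldest
    for (const byte of chunk) this.buf[this.head++ % this.buf.length] = byte;
  }
  // each consumer tracks its own position; one that has fallen more than
  // `size` bytes behind has been overwritten and is snapped forward, so it
  // misses data instead of blocking the producer
  read(pos, len) {
    if (this.head - pos > this.buf.length) pos = this.head - this.buf.length;
    const end = Math.min(pos + len, this.head);
    const out = Buffer.alloc(end - pos);
    for (let i = pos; i < end; i++) out[i - pos] = this.buf[i % this.buf.length];
    return { data: out, nextPos: end };
  }
}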
Many years ago I wrote something like what you describe for an audio stream processing app (http://hewgill.com/nwr/). It's on github as splitter.cpp and has a small man page.
The splitter program currently does not support dynamically changing the set of output programs. The output programs are fixed when the command is started.
Without knowing exactly what sort of data you are talking about (how large it is, what format it is in, etc.), it is hard to come up with a concrete answer. Say, for example, you want a "ticker-tape" application that sends out information on share purchases at the stock exchange: you could quite easily have a server that accepts a socket from each application, starts a thread, and sends the relevant data as it appears from the recorder at the stock market. I'm not aware of any "multiplexer" that exists today (but Greg's splitter may be a starting point). If you use (for example) XML to package the data, a client that joins mid-stream may receive the second half of a packet, but the client code can detect that it's incomplete and throw it away.
If, on the other hand, you are sending out high-detail live-updating weather maps for the whole country, the data is probably large enough that you don't want to wait for a full new one to arrive, so you need some sort of lock'n'load protocol that sets the current updated map and then sends that one out until, say, 1 minute later you have a new one. Again, it's not that complex to write some code to do this, but it's quite a different set of code from the "ticker tape" solution above, because the packet of data is larger, and getting "half a packet" is quite wasteful and completely useless.
If you are streaming live video from the 2016 Olympics in Brazil, then you probably want a further different solution, as timing is everything with video: you need the client to buffer, pick up key-frames, throw away "stale" frames, etc., and the server will have to be different too.

Speed Up with multithreading

I have a parse method in my program which first reads a file from disk, then parses the lines and creates an object for every line. For every file, a collection with the objects from its lines is saved afterwards. The files are about 300 MB.
This takes about 2.5-3 minutes to complete.
My question: can I expect a significant speed-up if I split the task across threads, with one thread just reading files from disk, another parsing the lines, and a third saving the collections? Or would this maybe slow down the process?
How long does it typically take a modern notebook hard disk to read 300 MB? I think the bottleneck is the CPU in my task, because when I execute the method, one CPU core is always at 100% while the disk is idle more than half the time.
greetings, rain
EDIT:
private CANMessage parseLine(String line)
{
    try
    {
        CANMessage canMsg = new CANMessage();
        int offset = 0;
        int offset_add = 0;
        char[] delimiterChars = { ' ', '\t' };
        string[] elements = line.Split(delimiterChars);
        if (!isMessageLine(ref elements))
        {
            return null; // not a message line, nothing to parse
        }
        offset = getPositionOfFirstWord(ref elements);
        canMsg.TimeStamp = Double.Parse(elements[offset]);
        offset += 3;
        offset_add = getOffsetForShortId(ref elements, ref offset);
        canMsg.ID = UInt16.Parse(elements[offset], System.Globalization.NumberStyles.HexNumber);
        offset += 17; // for signs between identifier and data length number
        canMsg.DataLength = Convert.ToInt16(elements[offset + offset_add]);
        offset += 1;
        parseDataBytes(ref elements, ref offset, ref offset_add, ref canMsg);
        return canMsg;
    }
    catch (Exception exp)
    {
        MessageBox.Show(line);
        MessageBox.Show(exp.Message + "\n\n" + exp.StackTrace);
        return null;
    }
}
So this is the parse method. It works this way, but maybe you are right and it is inefficient. I have .NET Framework 4.0 and I am on Windows 7. I have a Core i7 where every core has Hyper-Threading, so I am only using about 1/8 of the CPU.
EDIT2: I am using Visual Studio 2010 Professional. It looks like the performance profiling tools are not available in this version (according to the MSDN Beginners Guide to Performance Profiling).
EDIT3: I changed the code now to use threads. It looks now like this:
foreach (string str in checkedListBoxImport.CheckedItems)
{
    toImport.Add(str);
}
for (int i = 0; i < toImport.Count; i++)
{
    // take a per-iteration copy so the lambda captures a stable value
    // rather than the loop variable
    String newString = new String(toImport.ElementAt(i).ToArray());
    Thread t = new Thread(() => importOperation(newString));
    t.Start();
}
The parsing you saw above is called inside importOperation(...).
With this code it was possible to reduce the time from about 2.5 minutes to "only" 40 seconds. I have some concurrency problems I still have to track down, but at least this is much faster than before.
Thank you for your advice.
It's unlikely that you are going to get consistent metrics for laptop hard disk performance, as we have no idea how old your laptop is, nor whether the disk is solid state or spinning.
Considering you have already done some basic profiling, I'd wager the CPU really is your bottleneck, as it is impossible for a single-threaded application to use more than 100% of a single CPU. (This is of course ignoring your operating system splitting the process over multiple cores, and other oddities.) If you were seeing 5% CPU usage instead, you would most likely be bottlenecked on IO.
That said, your best bet would be to create a new task for each file you are processing and send it to a pooled thread manager. Your thread manager should limit the number of threads you are running to either the number of cores you have available or, if memory is an issue (you did say you were handling 300 MB files, after all), the maximum amount of RAM you can use for the process.
Finally, to answer why you don't want a separate thread for each operation, consider what you already know about your performance bottlenecks. You are bottlenecked on CPU processing, not IO. That means that if you split your application into separate threads, your read and write threads would be starved most of the time, waiting for your processing thread to finish. Additionally, even if you made them process asynchronously, you have the very real risk of running out of memory as your read thread continues to consume data that your processing thread can't keep up with.
Thus, be careful not to start each thread immediately; let them instead be managed by some form of blocking queue. Otherwise you run the risk of slowing your system to a crawl as you spend more time in context switches than in processing. This is of course assuming you don't crash first.
It's unclear how many of these 300MB files you've got. A single 300MB file takes about 5 or 6 seconds to read on my netbook, with a quick test. It does indeed sound like you're CPU-bound.
It's possible that threading will help, although it's likely to complicate things significantly of course. You should also profile your current code - it may well be that you're just parsing inefficiently. (For example, if you're using C# or Java and you're concatenating strings in a loop, that's frequently a performance "gotcha" which can be easily remedied.)
If you do opt for a multi-threaded approach, then to avoid thrashing the disk, you may want to have one thread read each file into memory (one at a time) and then pass that data to a pool of parsing threads. Of course, that assumes you've also got enough memory to do so.
If you could specify the platform and provide your parsing code, we may be able to help you optimize it. At the moment all we can really say is that yes, it sounds like you're CPU bound.
That long for only 300 MB is bad.
Different things could be impacting performance depending on the situation, but typically reading from the hard disk is still likely the biggest bottleneck, unless you have something intense going on during the parsing, which seems to be the case here, because it only takes several seconds to read 300 MB from a hard disk (unless it's badly fragmented, maybe).
If you have some inefficient algorithm in the parsing, then picking or coming up with a better algorithm would probably be more beneficial. If you absolutely need that algorithm and there's no algorithmic improvement available, it sounds like you might be stuck.
Also, don't try to read and write at the same time with the multithreading; you'll likely slow things way down due to increased seeking.
Given that you think this is a CPU bound task, you should see some overall increase in throughput with separate IO threads (since otherwise your only processing thread would block waiting for IO during disk read/write operations).
Interestingly I had a similar issue recently and did see a significant net improvement by running separate IO threads (and enough calculation threads to load all CPU cores).
You don't state your platform, but I used the Task Parallel Library and a BlockingCollection for my .NET solution and the implementation was almost trivial. MSDN provides a good example.
UPDATE:
As Jon notes, the time spent on IO is probably small compared to the time spent calculating, so while you can expect an improvement, the best use of your time may be profiling and improving the calculation itself. Using multiple threads for the calculation should speed things up significantly.
Hmm.. 300MB of lines that have to be split up into a lot of CAN message objects - nasty! I suspect the trick might be to thread off the message assembly while avoiding excessive disk-thrashing between the read and write operations.
If I were doing this as a 'fresh' requirement (and of course, with my 20/20 hindsight, knowing that CPU was going to be the problem), I would probably use just one thread for reading, one for writing to disk and, initially at least, one thread for message object assembly. Using more than one thread for message assembly means the complication of resequencing the objects after processing, to prevent the output file being written out of order.
I would define a nice disk-friendly-sized chunk class of line and message-object array instances, say 1024 of them, and create a pool of chunks at startup, 16 say, and shove them onto a storage queue. This controls and caps memory use, greatly reduces new/dispose/malloc/free (it looks like you have a lot of this at the moment!), improves the efficiency of the disk r/w operations as only large reads/writes are performed (except for the last chunk which will, in general, be only partly filled), provides inherent flow control (the read thread cannot 'run away' because the pool will run out of chunks and the read thread will block on the pool until the write thread returns some chunks), and inhibits excess context switching because only large chunks are processed.
The read thread opens the file, gets a chunk from the queue, reads the disk, parses into lines and shoves the lines into the chunk. It then queues the whole chunk to the processing thread and loops around to get another chunk from the pool. Possibly, the read thread could, on start or when idle, wait on its own input queue for a message class instance that contains the read/write filespecs. The write filespec could be propagated through a field of the chunks, supplying the write thread with everything it needs via the chunks. This makes a nice subsystem to which filespecs can be queued, and which will process them all without any further intervention.
The processing thread gets chunks from its input queue, splits the lines up into the message objects in the chunk, and then queues the completed, whole chunks to the write thread.
The write thread writes the message objects to the output file and then requeues the chunk to the storage pool queue for re-use by the read thread.
All the queues should be blocking producer-consumer queues.
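Language aside, the shape of such a queue is a bounded buffer whose put blocks when full and whose take blocks when empty. A sketch in JavaScript, with async/await standing in for thread blocking (capacity and names are illustrative):

class BlockingQueue {
  constructor(capacity) {
    this.items = [];
    this.capacity = capacity;
    this.putWaiters = [];
    this.takeWaiters = [];
  }
  async put(item) {
    while (this.items.length >= this.capacity)
      await new Promise(resolve => this.putWaiters.push(resolve)); // "block" until space
    this.items.push(item);
    if (this.takeWaiters.length) this.takeWaiters.shift()(); // wake one taker
  }
  async take() {
    while (this.items.length === 0)
      await new Promise(resolve => this.takeWaiters.push(resolve)); // "block" until data
    const item = this.items.shift();
    if (this.putWaiters.length) this.putWaiters.shift()(); // wake one putter
    return item;
  }
}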
One issue with threaded subsystems is completion notification. When the write thread has written the last chunk of a file, it probably needs to do something. I would probably fire an event with the last chunk as a parameter, so the event handler knows which file has been completely written. I would probably do something similar for error notifications.
If this is not fast enough, you could try:
1) Ensure that the read and write threads cannot be preempted in favour of each other during chunk disk I/O by using a mutex. If your chunks are big enough, this probably won't make much difference.
2) Use more than one processing thread. If you do this, chunks may arrive at the write-thread 'out-of-order'. You would maybe need a local list and perhaps some sort of sequence-number in the chunks to ensure that the disk writes are correctly ordered.
Good luck, whatever design you come up with..
Rgds,
Martin

Nonblocking/asynchronous fifo/named pipe in shell/filesystem?

Is there a way to create a non-blocking/asynchronous named pipe, or something similar, in the shell? Programs could place lines in it, those lines would stay in RAM, and a program could read some lines from the pipe while leaving whatever it did not read in the FIFO. It is also very probable that programs will be writing to and reading from this FIFO at the same time. At first I thought maybe this could be done using files, but after searching the web for a bit it seems nothing good can come from a file being read and written at the same time. Named pipes would almost work; there are just two problems: first, they block reads/writes if there is no one at the other end; second, even if I let writes block and set two processes to write to the pipe while no one is reading (trying to write one line with each process), and then try head -n 1 <fifo>, I get just one line as I need, but both writing processes terminate and the second line is lost. Any suggestions?
Edit: maybe some intermediate program could be used to help with this, acting as a mediator between writers and readers?
You can use a special program for this purpose: buffer. Buffer is designed to try to keep the writer side continuously busy so that it can stream when writing to tape drives, but you can use it for other purposes. Internally, buffer is a pair of processes communicating via a large circular queue held in shared memory, so your processes will work asynchronously. Your writer process will block only when the queue is full, and your reader process only when the queue is empty. Example:
bzcat archive.bz2 | buffer -m 16000000 -b 100000 | processing_script | bzip2 > archive_processed.bz2
http://linux.die.net/man/1/buffer
