I need to concurrently process a large number of files (thousands of different files, with an average size of 2MB per file).
All the information is stored on one (1.5TB) network hard drive, and will be accessed (read) by about 30 different machines. For efficiency, each machine will be reading (and processing) different files (there are thousands of files that need to be processed).
Every machine -- after reading a file from the 'incoming' folder on the 1.5TB hard drive -- will process the information and then output the processed information back to the 'processed' folder on the 1.5TB drive. The processed information for every file is roughly the same average size as the input file (about 2MB per file).
Are there any do's and don'ts when building such an operation? Is it a problem to have 30 or so machines read (or write) information to the same network drive at the same time?
(note: existing files will only be read, not appended/written; new files will be created from scratch, so there are no issues of multiple access to the same file...).
Are there any bottlenecks that I should expect?
(I am using Linux, Ubuntu 10.04 LTS, on all machines, if it matters at all.)
Things you should think about:
If the processing to be done for each file is simple, then your real bottleneck isn't the number of files you read in parallel, but the capabilities of the hard disk drive.
Unless processing takes a long time (say, several seconds per file), you'll reach a point where adding more processes only slows things to a crawl, since every process is reading and writing results, and the disk can only do so much.
Try to minimize disk access: for example, download files and produce results locally while other processes are downloading, and send the results back when the load on the disk goes down.
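A rough per-machine sketch of that staging idea (the paths and the ./process command are placeholders, and it assumes each machine already has its own list of assigned files):

#!/bin/bash
# Hypothetical layout: the NAS is mounted at /mnt/share on every machine,
# and files_for_this_machine.txt lists only the files assigned to this host.
INCOMING=/mnt/share/incoming
PROCESSED=/mnt/share/processed
SCRATCH=$(mktemp -d)

while read -r name; do
    cp "$INCOMING/$name" "$SCRATCH/"                    # one sequential read from the NAS
    ./process "$SCRATCH/$name" > "$SCRATCH/$name.out"   # CPU work happens on the local disk
    cp "$SCRATCH/$name.out" "$PROCESSED/"               # one sequential write back
    rm "$SCRATCH/$name" "$SCRATCH/$name.out"
done < files_for_this_machine.txt
rm -rf "$SCRATCH"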
The more I write the more it boils down to how much processing needs to be done for each file. If it's simple parsing, something that takes milliseconds, 1 machine or 30 will make little difference.
You need to be careful that two worker processes don't pick up (and try to do) the same piece of work at the same time.
Unfortunately, NFS filesystems don't have semantics that allow you to easily do that.
So what I'd recommend is to use something like Gearman and a producer/consumer model, where one process gives out work to whoever is available to do it.
Another possibility is to have a database (e.g. mysql) with a table of all tasks, and have the processes atomically "claim" tasks for themselves.
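As a sketch of the database approach (the table layout, database name and ./process command are all made up for illustration, and MySQL credentials are omitted; the key point is that a single UPDATE ... LIMIT 1 is atomic, so two workers can never claim the same row):

#!/bin/bash
# Assumed table: tasks(id INT PRIMARY KEY, path VARCHAR(255),
#                      worker VARCHAR(64) NULL, done TINYINT DEFAULT 0)
WORKER="$(hostname)-$$"     # unique per worker process

while true; do
    # atomically claim one unclaimed row
    mysql -e "UPDATE tasks SET worker='$WORKER' WHERE worker IS NULL LIMIT 1" tasksdb
    # fetch the row we just claimed (each worker holds at most one unfinished task)
    path=$(mysql -N -B -e "SELECT path FROM tasks WHERE worker='$WORKER' AND done=0 LIMIT 1" tasksdb)
    [ -z "$path" ] && break     # nothing left to do
    ./process "$path"
    mysql -e "UPDATE tasks SET done=1 WHERE worker='$WORKER' AND path='$path'" tasksdb
done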
But all of this is only worthwhile if your processes are mostly CPU-bound. If you're trying to get more IO bandwidth (or operations) out of your NAS by using multiple clients, it's not going to work.
I am assuming that you will be running at least gigabit ethernet here (or it's probably not worth it).
Have you tried running multiple processes on the same machine?
For the development of an object recognition algorithm, I need to repeatedly run a detection program on a large set of volumetric image files (MR scans).
The detection program is a command line tool. If I run it on my local computer on a single file and single-threaded it takes about 10 seconds. Processing results are written to a text file.
A typical run would be:
10000 images with 300 MB each = 3TB
10 seconds on a single core = 100000 seconds = about 27 hours
What can I do to get the results faster? I have access to a cluster of 20 servers with 24 (virtual) cores each (Xeon E5, 1TByte disks, CentOS Linux 7.2).
Theoretically the 480 cores should only need 3.5 minutes for the task.
I am considering using Hadoop, but it's not designed for processing binary data and it splits input files, which is not an option.
I probably need some kind of distributed file system. I tested using NFS and the network became a serious bottleneck. Each server should only process its locally stored files.
The alternative might be to buy a single high-end workstation and forget about distributed processing.
I am not certain whether we need data locality, i.e. each node holds part of the data on a local HD and processes only its local data.
I regularly run large scale distributed calculations on AWS using Spot Instances. You should definitely use the cluster of 20 servers at your disposal.
Since your servers are Linux-based, your best friend is bash. You're also lucky that it's a command line programme. This means you can use ssh to run commands directly on the servers from one master node.
The typical sequence of processing would be:
run a script on the Master Node which sends and runs scripts via ssh on all the Slave Nodes
Each Slave Node downloads a section of the files from the master node where they are stored (via NFS or scp)
Each Slave Node processes its files, saving required data via scp, mysql or text scrape
To get started, you'll need to have ssh access to all the Slaves from the Master. You can then scp files to each Slave, like the script. If you're running on a private network, you don't have to be too concerned about security, so just set ssh passwords to something simple.
In terms of CPU cores, if the command line program you're using isn't designed for multi-core, you can just run several ssh commands to each Slave. The best thing to do is run a few tests and see what the optimal number of processes is, given that too many processes might be slow due to insufficient memory, disk access or similar. But say you find that 12 simultaneous processes gives the fastest average time, then run 12 scripts via ssh simultaneously.
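A minimal sketch of that fan-out from the master (the hostnames file, worker.sh and the log naming are assumptions; worker.sh stands for whatever per-node script you have already scp'd over):

#!/bin/bash
# slaves.txt lists one hostname per line; 12 comes from the tests described above.
PROCS_PER_NODE=12

while read -r host; do
    for i in $(seq 1 "$PROCS_PER_NODE"); do
        # each ssh blocks until its remote worker exits, so background it locally
        ssh "$host" "./worker.sh $i" > "log_${host}_${i}.txt" 2>&1 &
    done
done < slaves.txt
wait    # returns once every remote worker has finished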
It's not a small job to get it all set up; however, once it is done you will forever be able to process your data in a fraction of the time.
You can use Hadoop. Yes, the default implementations of FileInputFormat and RecordReader split files into chunks and split chunks into lines, but you can write your own implementations of FileInputFormat and RecordReader. I've created a custom FileInputFormat for another purpose (I had the opposite problem -- splitting input data more finely than the default), but there are good-looking recipes for exactly your problem: https://gist.github.com/sritchie/808035 plus https://www.timofejew.com/hadoop-streaming-whole-files/
On the other hand, Hadoop is a heavy beast. It has significant overhead for mapper start-up, so the optimal running time for a mapper is a few minutes. Your tasks are too short. Maybe it is possible to create a cleverer FileInputFormat that interprets a bunch of files as a single file and feeds the files as records to the same mapper; I'm not sure.
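If you go the Streaming route from the second link, one common workaround (sketched here; ./detect and the output format are assumptions) is to give Hadoop a plain text file listing the image paths rather than the 300MB binaries themselves, so each map task receives a handful of paths and runs the detector on whole files it can reach:

#!/bin/bash
# mapper.sh for Hadoop Streaming: the job input is a text file of image paths
# (on storage every node can reach), so stdin arrives as one path per line.
while read -r path; do
    result=$(./detect "$path")            # the existing command-line detector
    printf '%s\t%s\n' "$path" "$result"   # emit "path <TAB> result" as map output
done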
My problem is that my application takes too long to load thousands of files. Yes, I know it's going to take a long time, but I would like to make it faster by any amount of time. What I mean by "load" is open the file to get its descriptor and then read the first 100 bytes or so of it.
So, my main strategy has been to create a second thread that will open and close (without reading any contents) all the files. This seems to help because the thread runs ahead of the main thread and I'm guessing the OS is caching these file descriptors ahead of time so that when my main thread opens them it's a quick open. This has actually helped because the thread can start caching these file descriptors while my main thread is parsing the data read in from these files.
So my real question is...what else can I do to make this faster? What approaches are there? Has anyone had success doing this?
I've heard of OS prefetching calls, but that was for virtual memory pages. Is there a way to tell the OS: hey, I'm going to be needing all these files pretty soon -- I suggest that you start gathering them for me ahead of time? My lookahead thread is pretty crude.
Are there low-level disk techniques I could use? Is there possibly a pattern of file access that would help? Right now, the files that are loaded all come from the same folder. I suppose there is no way to determine where exactly on disk they lie and which ordering of file opens would be fastest for the disk. I'm also guessing that the disk has some hardware to make this as efficient as possible too.
My application is mainly for Windows, but Unix suggestions would help as well.
I am programming in C++ if that makes a difference.
Thanks,
-julian
My first thought is that this is going to be hard to work around from a programmatic level.
You'll find Linux and OSX can access thousands of files like this in a fraction of the time it takes Windows. I don't know how much control you have over the machine. If you can keep the thousands of files on a FAT partition, you should see better results than with NTFS.
How often are you scanning these files, and how often do they change? If the ratio is heavily on the reading side, it would make sense to copy the start of each file into a cache. The cache could store the filename, modification time, and the first 100 bytes of each of the thousands of files.
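On the Unix side, a crude version of that cache might look like this (directory names are invented; each cache entry keeps the first 100 bytes, and its own mtime stands in for the recorded modification time):

#!/bin/bash
DATA_DIR=./data
CACHE_DIR=./header_cache
mkdir -p "$CACHE_DIR"

for f in "$DATA_DIR"/*; do
    name=$(basename "$f")
    cached="$CACHE_DIR/$name.hdr"
    # refresh only when the original is newer than its cached header
    if [ ! -e "$cached" ] || [ "$f" -nt "$cached" ]; then
        head -c 100 "$f" > "$cached"
    fi
done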
The problem I am trying to deal with is saving a large number (millions) of small files (up to 50KB each), which are sent over the network. The saving is done sequentially: the server receives a file or a directory (over the network) and saves it on disk; the next one arrives and is saved, and so on.
Apparently, the performance is not acceptable if multiple server processes coexist (say I have 5 processes which all read from the network and write at the same time), because the I/O scheduler doesn't manage to merge the I/O writes efficiently.
A suggested solution is to implement some sort of buffering: each server process should have a 50MB cache into which it writes the current file, does a chdir, etc.; when the buffer is full, it is synced to disk, thereby obtaining an I/O burst.
My questions to you:
1) I know a buffering mechanism already exists (the disk buffer); do you think the above scheme is going to add some improvement? (The design is much more complicated and it's not easy to implement a simple test case.)
2) Do you have any suggestions on where to look if I were to implement this?
Many thanks.
You're going to need to do better than "apparently the performance is not acceptable". Specifically:
How are you measuring it? Do you have an exact, reproducible figure?
What is your target?
In order to do optimisation, you need two things: a method of measuring it (a metric) and a target (so you know when to stop, or how useful or useless a particular technique is).
Without either, you're sunk, I'm afraid.
How important are those writes? I have three suggestions (which can be combined), but one of them is a lot of work, and one of them is less safe...
Journaling
I'm guessing you're seeing some poor performance due in part to the journaling common to most modern Linux filesystems. The journaling causes barriers to be inserted into the IO queue when file metadata is written. You can try turning down the safety (and maybe turning up the speed) with mount(8) options barrier=0 and data=writeback.
But if there is a crash, the journal might not be able to prevent a lengthy fsck(8). And there's a chance the fsck(8) will wind up throwing away your data when fixing the problem. On the one hand, it's not a step to take lightly, on the other hand, back in the old days, we ran our ext2 filesystems in async mode without a journal both ways in the snow and we liked it.
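For reference, an /etc/fstab entry with those relaxed options might look like this (the device and mount point are placeholders; ext3 typically refuses to change the data= mode on a live remount, so it's easiest to set it here and remount cleanly):

/dev/sdb1   /data   ext3   defaults,barrier=0,data=writeback   0   2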
IO Scheduler elevator
Another possibility is to swap the IO elevator; see Documentation/block/switching-sched.txt in the Linux kernel source tree. The short version is that deadline, noop, as, and cfq are available. cfq is the kernel default, and probably what your system is using. You can check:
$ cat /sys/block/sda/queue/scheduler
noop deadline [cfq]
The most important parts from the file:
As of the Linux 2.6.10 kernel, it is now possible to change the
IO scheduler for a given block device on the fly (thus making it possible,
for instance, to set the CFQ scheduler for the system default, but
set a specific device to use the deadline or noop schedulers - which
can improve that device's throughput).
To set a specific scheduler, simply do this:
echo SCHEDNAME > /sys/block/DEV/queue/scheduler
where SCHEDNAME is the name of a defined IO scheduler, and DEV is the
device name (hda, hdb, sga, or whatever you happen to have).
The list of defined schedulers can be found by simply doing
a "cat /sys/block/DEV/queue/scheduler" - the list of valid names
will be displayed, with the currently selected scheduler in brackets:
# cat /sys/block/hda/queue/scheduler
noop deadline [cfq]
# echo deadline > /sys/block/hda/queue/scheduler
# cat /sys/block/hda/queue/scheduler
noop [deadline] cfq
Changing the scheduler might be worthwhile, but depending upon the barriers inserted into the queue by the journaling requirements, there might not be much reordering possible. Still, it is less likely to lose your data, so it might be the first step.
Application changes
Another possibility is to drastically change your application to bundle files itself, and write fewer, larger files to disk. I know it sounds strange, but (a) the id Software development team packaged their maps, textures, objects, etc. into giant zip files that they would read into the program with a few system calls, unpack, and run with, because they found the performance much better than reading a few hundred or a few thousand smaller files -- load times between levels were drastically shorter; and (b) the Gnome and KDE desktop teams took different approaches to loading their icons and resource files: the KDE team packaged their many small files into larger packages of some sort, and the Gnome team did not. The Gnome team had longer startup delays and were hoping the kernel could make some effort to improve their startup time. The kernel team kept suggesting the fewer, larger files approach.
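Translated to the small-files scenario above, that could be as simple as letting each server process collect a batch in a local staging directory and then writing one archive per batch to the data disk (the paths and batch naming are placeholders):

#!/bin/bash
STAGING=/tmp/staging_batch      # filled by the receiving process
DEST=/data/archives

mkdir -p "$DEST"
batch="batch_$(date +%s)_$$"
tar -cf "$DEST/$batch.tar" -C "$STAGING" .   # one large sequential write
rm -rf "${STAGING:?}"/*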
Creating/renaming a file, syncing it, having lots of files in a directory and having lots of files (with tail waste) are some of the slow operations in your scenario. The only way to avoid them, however, is to write fewer files (for example by writing out archives, concatenated files or similar). I would actually try a (limited) parallel async or sync approach. The IO scheduler and caches are typically quite good.
I wish to understand the way the kernel works when a user/app tries to create a file in a directory.
The background: we have a Java application which consumes messages over JMS, processes them and then writes the XML to an outbound queue plus a local directory. Yesterday we observed unusual delays in writing to the directory. With 'ls | wc -l' we found >300,000 files in there. We did a quick strace on the process and found it full of mutex calls (more than 3/4 of the calls in the strace were mutex).
So I thought that new file creation was taking time because the system has to check certain things every time (e.g. file names, to make sure that a new file with a specific name can be created) amongst 300,000 files, and only then create the file.
I cleared the directory and the application returned to normal service levels.
My questions
Was my analysis correct? (It seems so, because the app started working fine after the clear-down.)
More important, how does the kernel work when you try to create a new file in a directory?
Can the abnormal number of mutex calls be attributed to the high number of files in the directory?
Many thanks
J
Please read about the Linux Filesystem, i-nodes and d-nodes.
http://en.wikipedia.org/wiki/Inode_pointer_structure
The file system is organized into fixed-sized blocks. If your directory is relatively small, it fits in the direct blocks and things are fast. If your directory is not too big, it fits in the direct blocks and some indirect blocks, and is still reasonably fast. If your directory becomes too big, it spills into double indirect blocks and becomes slow.
Actual sizes depend on file system and kernel configuration.
Rule of thumb is to keep the directory under 12 blocks, depending on your block size. Many systems use 8K blocks; a fast directory is under 98,304 bytes.
A file entry is something like 16*4 bytes in size (IIRC), so plan on no more than 1500 files per directory as a practical upper limit.
Directories with large numbers of entries are often slow - how slow depends on the underlying filesystem.
The common solution is to create a hierarchy of directories, so each dir only has a few hundred entries.
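For example (BASE and the filename are invented), hashing the name and using the first hex characters as two directory levels keeps any single directory down to a manageable size:

#!/bin/bash
BASE=/data/outbound

place_file() {
    local src="$1"
    local name; name=$(basename "$src")
    local hash; hash=$(printf '%s' "$name" | md5sum | cut -c1-4)
    local dir="$BASE/${hash:0:2}/${hash:2:2}"
    mkdir -p "$dir"
    mv "$src" "$dir/$name"
}

place_file ./outgoing/message_42.xml    # ends up in e.g. /data/outbound/3a/7f/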
Mutex system calls are a result of the application (probably something in the JVM or the Java libraries) making mutex calls.
Synchronisation internal to the kernel you will not see via strace, as this only examines system calls themselves.
A directory with lots of files should not become inefficient if you are using a filesystem which uses directory indexes; most now do (ext3 does optionally but it's normally enabled nowadays).
Non-indexed directories (like those used on the bad old filesystems - ext2, vfat etc) get really bad with lots of files, and you'll see the "open" system call taking a lot longer.
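If you want to check whether dir_index is on, and enable it for existing directories, something along these lines should work on ext3 (the device is an example; run the fsck step with the filesystem unmounted):

tune2fs -l /dev/sdb1 | grep -i features   # is dir_index already listed?
tune2fs -O dir_index /dev/sdb1            # turn the feature on
e2fsck -fD /dev/sdb1                      # rebuild/optimise existing directories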
If I start copying a huge file tree from one position to another or if some other process starts doing lots of disk activity, the foreground app (GUI) slows way down. For example, take a 2gb file tree with 100k files in it. Open a console and do cp -r bigtree bigtree2. Then go to firefox and start browsing. Firefox is almost unusable. Even if I set firefox's nice level to really high priority (-20), it's still super slow with huge delays.
I remember some years ago when I worked on a Solaris box, the system behaved much better in similar circumstances.
My HD is using DMA, not PIO. It's SATA. Not mounted with the atime flag.
Linux has long had a problem with programs that hog all the system's "dirty" cache memory. What is happening is that the copy process is filling the write cache with the file data it is copying and it is doing it very quickly. So when Firefox comes along and needs to write it must first wait for dirty buffer space or an available disk queue write slot. While waiting it is competing with the copy process and the kernel's pdflush thread, which moves data from dirty buffers to the disk write queue.
Firefox has yet another problem in this scenario. It uses SQLite to store its bookmarks, history and other things. SQLite is an ACID-compliant database and it uses a transaction system with its disk writes flushed to disk. So not only does it have to wait for buffer space, it must wait for the disk queue, which is full of copied file data, to clear out before it can acknowledge a successful write.
There has been a lot of tweaking done to the Linux disk queuing and buffering system. There are changes in almost every kernel release. Try one of the newer releases. You can also try tweaking the sysctl values. I sort of like these:
vm.dirty_writeback_centisecs = 100
vm.dirty_expire_centisecs = 9000
vm.dirty_background_ratio = 4
vm.dirty_ratio = 80
You can also try tweaking the number of slots in the disk queue. This value is in /sys/block/sda/queue/nr_requests. You need to substitute sda with whatever your drive really is. More slots means more chances to merge IO requests and the CFQ IO scheduler can do a better job with priorities. Fewer slots usually means a shorter wait to get written to disk for synchronous IO like SQLite's transactions. Fewer slots also means a shorter wait to get read IO into the disk queue if a write-heavy process completely stuffs the queue with write IO.
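To experiment without rebooting, you can apply the values on the fly as root (sda and 64 are just examples):

sysctl -w vm.dirty_writeback_centisecs=100
sysctl -w vm.dirty_expire_centisecs=9000
sysctl -w vm.dirty_background_ratio=4
sysctl -w vm.dirty_ratio=80

echo 64 > /sys/block/sda/queue/nr_requests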
Try ionice-ing or nice-ing the copy process. The issue is due to the fact that IO gets the same priority as the GUI, which, for a desktop, affects perceived responsiveness.
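For example, using the copy from your test:

ionice -c3 nice -n 19 cp -r bigtree bigtree2   # idle IO class, low CPU priority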
There's an Ubuntu brainstorm about this currently.
You're not the first to notice this problem. Former kernel developer Con Kolivas (http://en.wikipedia.org/wiki/Con_Kolivas) found that a lot of companies are paying to improve Linux server performance at the expense of desktop performance. Con had an impressive set of patches for making the desktop more responsive. Unfortunately there was some sort of code war and eventually Con dropped out.
I would love to know how to petition the Linux kernel developers for better desktop performance. In the meantime, if you are willing to run kernel 2.6.22, you can run with the -ck patch set.
Make sure that DMA is enabled on all your drives that support it. Depending on your distribution this may not be the default. Read man hdparm, and look into your systems init mechanism.