I'm having trouble adding more files to a partition that already contains a huge number of them. I currently have approximately 10 million files on a Linux file system. I need to add more files, but the system keeps saying there is not enough space, even though I have 30+ GB free. Any idea why this is happening, and can it be resolved?
The most common cause is that there are too many files in a single directory; directories can hold only a finite number of entries. If that's not the problem, there are other metadata structures (such as the filesystem's inode table) that can also limit the total number of files on disk.
You can differentiate between these two problems by checking if you can add files to another directory.
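If the filesystem reports plenty of free space but still refuses new files, the usual metadata culprit is an exhausted inode table (df -i shows inode usage from the shell). Here is a minimal sketch in C that prints both free blocks and free inodes; the mount point "/data" is a placeholder:

```c
#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
    struct statvfs vfs;

    /* "/data" is a placeholder; point this at the full partition. */
    if (statvfs("/data", &vfs) != 0) {
        perror("statvfs");
        return 1;
    }

    /* Free blocks can be plentiful while free inodes are at zero,
     * which produces "no space left" errors despite 30+ GB free. */
    printf("free blocks: %llu of %llu\n",
           (unsigned long long)vfs.f_bavail,
           (unsigned long long)vfs.f_blocks);
    printf("free inodes: %llu of %llu\n",
           (unsigned long long)vfs.f_favail,
           (unsigned long long)vfs.f_files);
    return 0;
}
```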
I have hundreds of thousands of small JSON data files in one directory (not by choice). When accessing each of them, does a flat vs. pyramid directory structure make any difference? Does it help Node.js/Nginx/the filesystem retrieve them faster if the files are grouped by, e.g., first letter into corresponding directories?
In other words, is it faster to get baaaa.json from /json/b/ (only b*.json here) than to get it from /json/ (all files), when it is safe to assume that the subdirectories each contain 33 times fewer files? Does it make finding each file 33x faster? Or is there any disk read difference at all?
EDIT (in reply to jfriend00's comment): I am not sure yet what the underlying filesystem will be, but let's assume an S3 bucket.
I am writing software in C, on Linux running on AWS, that has to handle 240 terabytes of data, in 72 million files.
The data will be spread across 24 or more nodes, so there will only be 10 terabytes on each node, and 3 million files per node.
Because I have to append data to each of these three million files every 60 seconds, the easiest and fastest thing to do would be to keep all of these files open at the same time.
I can't store the data in a database, because the performance in reading/writing the data will be too slow. I need to be able to read the data back very quickly.
My questions:
1) Is it even possible to keep 3 million files open at once?
2) If it is possible, how much memory would that consume?
3) If it is possible, would performance be terrible?
4) If it is not possible, I will need to combine all of the individual files into a couple of dozen large files. Is there a maximum file size in Linux?
5) If it is not possible, what technique should I use to append data every 60 seconds and keep track of it?
The following is a very coarse description of an architecture that can work for your problem, on the assumption that the per-machine limit on open file descriptors stops mattering once you have enough instances.
First, take a look at this:
https://aws.amazon.com/blogs/aws/amazon-elastic-file-system-shared-file-storage-for-amazon-ec2/
https://aws.amazon.com/efs/
EFS provides shared storage that you can mount as a filesystem.
You can store ALL your files in a single EFS storage unit. You will then need a set of N worker machines, each keeping as many file handles open as it can. You can use a Redis queue to distribute the updates: each worker dequeues a batch of updates from Redis, opens the necessary files, and performs the updates.
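As a rough sketch of what a single worker could look like, assuming hiredis, a Redis list called "updates", and a simple "path\npayload" message format (the queue name and format are my assumptions, not part of the design above):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <hiredis/hiredis.h>

int main(void)
{
    /* Connection details are placeholders. */
    redisContext *rc = redisConnect("127.0.0.1", 6379);
    if (rc == NULL || rc->err) {
        fprintf(stderr, "redis connect failed\n");
        return 1;
    }

    for (;;) {
        /* Block until an update arrives: "path\npayload". */
        redisReply *reply = redisCommand(rc, "BLPOP updates 0");
        if (reply == NULL || reply->type != REDIS_REPLY_ARRAY ||
            reply->elements != 2)
            break;

        char *msg = reply->element[1]->str;
        char *payload = strchr(msg, '\n');
        if (payload != NULL) {
            *payload++ = '\0';   /* msg is now just the file path */

            /* The path points into the shared EFS mount, so any
             * worker can perform any append. */
            int fd = open(msg, O_WRONLY | O_CREAT | O_APPEND, 0644);
            if (fd >= 0) {
                if (write(fd, payload, strlen(payload)) < 0)
                    perror("write");
                close(fd);
            }
        }
        freeReplyObject(reply);
    }

    redisFree(rc);
    return 0;
}
```

This simplified version reopens the file for every append; in the architecture above, each worker would instead keep files open across rounds, up to its handle limit.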
Again: the maximum number of open file handles will not be a problem, because if you hit the limit you only need to increase the number of worker machines until you achieve the performance you need.
This is scalable, though I'm not sure if this is the cheapest way to solve your problem.
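If you also want to know how far a single machine can be pushed before adding workers, the per-process open-file limit can be inspected and raised up to the hard limit (the system-wide ceiling is the kernel's fs.file-max); a minimal check:

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("open files: soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);

    /* Raise the soft limit to the hard limit; going beyond the hard
     * limit needs privileges and enough room under fs.file-max. */
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("setrlimit");

    return 0;
}
```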
My Node.js application currently stores uploaded images on the file system, with the paths saved in a MongoDB database. Each document (maybe 2,000 at most in the future) has between 4 and 10 images. I don't believe I need to store the images in the database directly for my usage (I do not need to track versions, etc.); I am only concerned with performance.
Currently, I store all images in one folder, with the associated paths stored in the database. However, as the number of documents, and hence the number of images, increases, will having so many files in a single folder slow performance?
Alternatively, I could have a folder for each document. Does this extra level of folder nesting affect performance? Also, since I'm using MongoDB, the obvious folder-naming scheme would be to use the ObjectID, but do folder names of that length (24 characters) affect performance? Should I be using a custom ObjectID?
Are there more efficient ways? Thanks in advance.
For simply accessing files, the number of items in a directory does not really affect performance. However, it is common to split files across directories anyway, because listing a directory can certainly be slow when it contains thousands of files. In addition, file systems have limits on the number of files per directory (what that limit is depends on your file system).
If I were you, I'd just have a separate directory for each document, and load the images in there. If you are going to have more than 10,000 documents, you might split those a bit. Suppose your hash is 7813258ef8c6b632dde8cc80f6bda62f. It's pretty common to have a directory structure like /7/8/1/3/2/5/7813258ef8c6b632dde8cc80f6bda62f.
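For illustration only, building that kind of nested path and creating the intermediate directories could look roughly like this in C; the base directory, the six-level depth, and the helper name are just examples:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Build <base>/7/8/1/3/2/5/<hash> from the first six characters of
 * the hash, creating each intermediate directory on the way down.
 * Assumes <base> already exists. */
static int make_hashed_path(const char *base, const char *hash,
                            char *out, size_t outlen)
{
    char dir[1024];
    int len = snprintf(dir, sizeof(dir), "%s", base);

    for (int i = 0; i < 6 && hash[i] != '\0'; i++) {
        len += snprintf(dir + len, sizeof(dir) - len, "/%c", hash[i]);
        if (mkdir(dir, 0755) != 0 && errno != EEXIST)
            return -1;
    }
    snprintf(out, outlen, "%s/%s", dir, hash);
    return 0;
}

int main(void)
{
    char path[1024];
    if (make_hashed_path("/var/images",
                         "7813258ef8c6b632dde8cc80f6bda62f",
                         path, sizeof(path)) == 0)
        printf("store the file at: %s\n", path);
    return 0;
}
```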
I have an application (Endeca) that is a file-based search engine. A customer has 100 Linux servers, all attached to the same SAN (very fast, Fibre Channel). Each of those 100 servers uses the same set of files, but currently each server has its own copy of the index (approx. 4 GB, so 400 GB in total).
What I would like to do is have one directory, and 100 virtual copies of that directory. Only if the application needs to make changes to any of the files in that directory would it start creating a distinct copy of the original folder.
So my idea is this: all 100 servers start by using the same directory (but each thinks it has its own copy and doesn't know any better). As changes come in, Linux/the SAN would then potentially end up with up to 100 copies (now slightly different) of the original.
Is something like this possible?
The reason I'm investigating this approach is to reduce file transfer times and disk space. We would only have to copy the 4 GB index once to the SAN and then create the virtual copies. If no changes came in, we'd use only 4 GB instead of 400.
Thanks in advance!
The best solution here is to utilise the de-duplication functionality at the SAN level. Different vendors call it by different names, but this is what I am talking about:
https://communities.netapp.com/community/netapp-blogs/drdedupe/blog/2010/04/07/how-netapp-deduplication-works--a-primer
All 100 "virtual" copies will use the same physical disk blocks on the SAN. The SAN only needs to allocate new blocks when changes are made to a specific copy of a file; a new block is then allocated for that copy while the remaining 99 copies keep using the old block, dramatically reducing the disk space requirements.
What version of Endeca are you using? The MDEX 7 engine has clustering, where leader and follower nodes all read from the same set of files, so as long as the files are shared (say, over NAS) you can have multiple engines running on different machines backed by the same set of index files. Only the leader node has the ability to change the files, which keeps the changes consistent; the follower nodes are then notified by the cluster coordinator when the changes are ready to be "picked up".
In the MDEX 6 series you could probably achieve something similar, provided that the index files are read-only. Indexing in V6 would usually happen on another machine, and the destination set of index files would usually be replaced once the new index is ready. This won't help you, though, if you need partial updates.
NetApp deduplication sounds interesting, but Endeca has never tested that functionality, so I am not sure what kinds of problems you will run into.
A piece of software we developed generates more and more files, currently about 70,000 per day, 3-5 MB each. We store these files on a Linux server with an ext3 file system. The software creates a new directory every day and writes that day's files into it. Writing and reading such a large number of files is getting slower and slower (per file, I mean), so one of my colleagues suggested creating subdirectories for every hour. We will test whether this makes the system faster, but the problem can be generalized:
Has anyone measured the speed of writing and reading files as a function of the number of files in the target directory? Is there a file count above which it's faster to put the files into subdirectories? What are the important parameters that may influence the optimum?
Thank you in advance.
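One rough way to get numbers for the exact filesystem in question is a micro-benchmark that creates many small files in a single directory and prints the elapsed time at intervals, so the per-file cost can be compared as the directory fills up. A sketch; the target directory and the counts are placeholders:

```c
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const char *dir = "/tmp/bench";   /* must exist; placeholder path */
    const int count = 200000;         /* total files to create */
    struct timespec start, now;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < count; i++) {
        char path[256];
        snprintf(path, sizeof(path), "%s/file-%07d", dir, i);

        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd >= 0)
            close(fd);

        /* Report every 10,000 files so slowdowns show up as the
         * directory grows. */
        if ((i + 1) % 10000 == 0) {
            clock_gettime(CLOCK_MONOTONIC, &now);
            double secs = (now.tv_sec - start.tv_sec) +
                          (now.tv_nsec - start.tv_nsec) / 1e9;
            printf("%d files created, %.2f s elapsed\n", i + 1, secs);
        }
    }
    return 0;
}
```

Repeating the same run against a directory tree split by hour (or by hash prefix) gives a direct comparison on your own hardware.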