Optimal directory structure for a large number of files to display on a page - Linux

I currently have a single directory called "files" which contains 200,000 photos from about 100,000 members. When the number of members grows into the millions, I expect the number of files in the "files" directory to get very large. The file names are effectively random because the users named them. The only grouping I can do is by the user who created the files; in essence, each user would get their own sub-directory.
The server runs Linux with an ext3 file system. Should I split the files into sub-directories inside the "files" directory? Is there any benefit to splitting them into many sub-directories? I have seen arguments that it doesn't matter.
If I do need to split, I am thinking of creating directories based on the first two characters of the user ID, then a third-level sub-directory named after the user ID, like this:
files/0/0/00024userid/ (so all user IDs starting with 00 will go in files/0/0/...)
files/0/1/01auser/
files/0/2/0242myuserid/
.
files/0/a/0auser/
files/0/b/0bsomeuser/
files/0/c/0comeuser/
.
files/0/z/0zero/
files/1/0/10293832/
files/1/1/11029user/
.
files/9/z/9zl34/
files/a/0/a023user2/
..
files/z/z/zztopuser/
I will be showing 50 photos at a time. What is the most efficient (fastest) way for the server to pick up the files for static display: all from the same directory or from 50 different sub-directories? Any comments or thoughts are appreciated. Thanks.
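For reference, the two-level prefix layout described above is easy to generate in shell; a minimal sketch (the helper name userdir is made up for illustration):
# Sketch: map a user ID to its sharded directory under "files/".
# Assumes user IDs are at least two characters long.
userdir() {
    local uid="$1"
    printf 'files/%s/%s/%s\n' "${uid:0:1}" "${uid:1:1}" "${uid}"
}

mkdir -p "$(userdir 00024userid)"        # -> files/0/0/00024userid
cp photo.jpg "$(userdir 00024userid)/"   # store one of that member's photos there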

Depending on the file system, there might be an upper limit on how many files a directory can hold. This, and the performance impact of storing many files in one directory, is also discussed at some length in another question.
Also keep in mind that your file names will likely not be truly random - quite a lot might start with "DSC", "IMG" and the like. In a similar vein, different users (or, indeed, the same user) might try to store two images with the same name, necessitating a level of abstraction from the file name anyway.

Related

Compare and sync two (huge) directories - consider only filenames

I want to do a one-way sync in Linux between two directories. One contains source files; the other contains processed files with the same directory structure and the same file names, but some files might be missing.
Right now I am doing:
cd "$SOURCE"
find * -type f | while IFS= read -r fname; do
    if [ ! -e "$TARGET$fname" ]; then
        :   # process the file and copy it to the target; create directories if needed
    fi
done
which works, but is painfully slow.
Is there a better way to do this?
There are roughly 50,000,000 files, spread among directories and sub-directories. Each directory contains no more than 255 files/subdirectories.
I looked at
rsync: it seems to always do a size or timestamp comparison. This would result in every file being flagged as different, since the processing takes some time and changes the file contents.
diff -qr: I could not figure out how to make it ignore file sizes and contents.
Edit
Valid assumptions:
comparisons are being made solely on directory/file names
we don't care about file metadata and/or attributes (e.g. size, owner, permissions, date/time last modified, etc.)
we don't care about files that may reside in the target directory without a matching file in the source directory. This is only partially true, but deletions from the source are rare and happen in bulk, so I will handle that as a special case.
Assumptions:
comparisons are being made solely on directory/file names
we don't care about file metadata and/or attributes (e.g. size, owner, permissions, date/time last modified, etc.)
we don't care about files that may reside in the target directory but without a matching file in the source directory
I don't see a way around comparing two lists of ~50 million entries, but we can try to eliminate the entry-by-entry approach of a bash looping solution ...
One idea:
# obtain sorted list of all $SOURCE files
srcfiles=$(mktemp)
cd "${SOURCE}"
find * -type f | sort > "${srcfiles}"
# obtain sorted list of all $TARGET files
tgtfiles=$(mktemp)
cd "${TARGET}"
find * -type f | sort > "${tgtfiles}"
# 'comm -23' => extract list of items that only exist in the first file - ${srcfiles}
missingfiles=$(mktemp)
comm -23 "${srcfiles}" "${tgtfiles}" > "${missingfiles}"
# process list of ${SOURCE}-only files
while IFS= read -r missingfile
do
    process_and_copy "${missingfile}"
done < "${missingfiles}"
# clean up the temporary lists
rm -f "${srcfiles}" "${tgtfiles}" "${missingfiles}"
This solution is still serial in nature, so if there are a lot of missing files the overall time to process them could be appreciable.
With enough system resources (CPU, memory, disk throughput) a faster solution would look at ways of parallelizing the work, e.g.:
running parallel find/sort/comm/process threads on different $SOURCE/$TARGET subdirectories (this may work well if the missing files are evenly distributed across the subdirectories), or ...
sticking with the serial find/sort/comm but splitting ${missingfiles} into chunks and then spawning separate OS processes to process_and_copy the different chunks (sketched below)
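A rough sketch of the second option, assuming process_and_copy is a shell function (export -f makes it visible to the child shells), GNU split, and that eight workers suit the machine:
# split the list of missing files into 8 chunks without breaking lines
split -n l/8 "${missingfiles}" "${missingfiles}.chunk."
# run one worker per chunk in the background, then wait for all of them
export -f process_and_copy
for chunk in "${missingfiles}".chunk.*
do
    bash -c 'while IFS= read -r f; do process_and_copy "$f"; done < "$1"' _ "$chunk" &
done
wait
rm -f "${missingfiles}".chunk.*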

Stream definition: Ignore all files but one filetype

We have a server with a depot that does not allow committing files which are in a client mapping, so I need a stream configuration.
Now I struggle with a task which I would assume should be simple:
We have a very large stream with lots of different file types and I would like to check out the entire stream but get only a certain file type.
Can this be done with Perforce without blacklisting every file type in question?
Edit: Sorry that I (for some reason) omitted so much information in my question.
I am already setting up a virtual stream where the UI gives me three nice fields:
Paths – where I can enter import, share and isolate paths
Remapping – ignored in my case
Ignored – here I can enter wildcards to ignore directories or files
I was hoping that by creating a virtual stream I actually could define the file types I want, e.g. I could write an import statement like
import RootDir/....txt //Depot/mainline/RootDir/....txt (note the four dots: three for Perforce's "..." wildcard and the fourth being the literal dot of the .txt extension)
however the stream definition does not support this and only allows me to write
import RootDir/... //Depot/mainline/RootDir/...
Since I was not able to find a way to whitelist the files I wanted, the only option I knew was to blacklist everything I did not want, but I would like to avoid that because my Ignored list would be dozens of entries long.
Now I will look into that sync hint, because I could use the full stream spec without a filter and only sync the files I need to disk, which might be very good.
There are a few different things going on in your question, but this seems the most like a statement of what you're trying to do, so I'm going to zero in on it:
I would like to check out the entire stream but get only a certain
file type.
If by "check out" you mean you only want to sync that file type to your local workspace:
p4 sync ....TXT
If by "check out" you mean you want to open only that file type for edit:
p4 edit ....TXT
ANY operation in Perforce that operates on files accepts an arbitrary file path, because Perforce tracks all of its state per-file. This is true whether you're using classic clients or streams.
There needs to be some mechanism for telling the Helix (Perforce) server that you only want to retrieve certain files from the stream.
Virtual Streams may be a good fit here, as they allow you to filter the view of an existing stream.
This means you can sync only the files you want and when you submit you will be submitting directly back to the stream your virtual stream is based on.
More information is available here:
https://www.perforce.com/perforce/doc.current/manuals/p4v/p4v_virtual_streams.html

Zipping a folder into equal size parts

I've been using 7-Zip for a few years now and have always liked that I could zip a folder into several parts of a specific size. For example, the website Box only allows uploads under 100 MB, so for anything I wanted to put into Box I just split the zip into 95 MB parts. However, recently I've needed to do something similar, except that instead of breaking the archive into parts of a certain size, I need to split it into a specific number of files that are all the same size. Right now, 7-Zip breaks them into the maximum size you allow, and the last file is whatever data remains, anywhere from 1 KB up to the specified limit.
For example, say I have an 826 MB file and I want it zipped up into 5 files that are all the same size. Is there any program out there that will do this?
Thanks in advance!
I don't know of any program that does this, but if it's something you do regularly, you could write a script (rough sketch below) that:
Finds out the size of the file
Calculates the maximum piece size to use if you want to split it into n pieces.
Constructs a corresponding 7zip command
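A rough, untested sketch of such a script, assuming the 7z command-line tool and GNU stat. Because -v cuts the compressed stream, the sketch compresses once first to learn the final archive size; estimating from the uncompressed input would make the parts come out smaller than intended:
#!/bin/bash
# Usage (hypothetical): ./nsplit.sh <input file or folder> <number of parts> [output.7z]
INPUT="$1"
N="$2"
OUT="${3:-archive.7z}"

# compress once to learn the compressed size in bytes
tmp="$(mktemp -u).7z"
7z a "$tmp" "$INPUT" > /dev/null
size=$(stat -c %s "$tmp")
rm -f "$tmp"

# ceiling division so that N volumes are always enough
piece=$(( (size + N - 1) / N ))

# -v<size>b cuts the archive into volumes of that many bytes;
# every volume except possibly the last will be exactly this size
7z a -v"${piece}b" "$OUT" "$INPUT"
The volumes come out as archive.7z.001, archive.7z.002, and so on; 7-Zip reassembles them automatically when you open the .001 part.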

How to find the position of Central Directory in a Zip file?

I am trying to find the position of the first Central Directory file header in a Zip file.
I'm reading these:
http://en.wikipedia.org/wiki/Zip_(file_format)
http://www.pkware.com/documents/casestudies/APPNOTE.TXT
As I see it, I can only scan through the Zip data, identify from each header what kind of section I am at, and repeat that until I hit the Central Directory header. I would obviously read the File Headers before that and use the "compressed size" field to skip over the actual data rather than for-looping through every byte in the file...
If I do it like that, then I practically already know all the files and folders inside the Zip file, in which case I don't see much use for the Central Directory anymore.
To my understanding, the purpose of the Central Directory is to list file metadata and the positions of the actual data in the Zip file, so you don't need to scan the whole file?
After reading about the End Of Central Directory record, Wikipedia says:
This ordering allows a zip file to be created in one pass, but it is
usually decompressed by first reading the central directory at the
end.
How would I find the End of Central Directory record easily? Remember that it can have an arbitrarily sized comment, so I may not know how many bytes from the end of the data stream it is located. Do I just scan for it?
P.S. I'm writing a Zip file reader.
Start at the end and scan towards the beginning, looking for the End of Central Directory signature and counting the number of bytes you have scanned. When you find a candidate, read the two-byte comment length (L) at offset 20 of the record and check that L plus the 22-byte fixed record size equals the distance from the start of the candidate to the end of the file. Then check that the start of the central directory (pointed to by the four-byte offset field at byte 16 of the record) has an appropriate signature.
If you assume the bytes are fairly random wherever a signature check lands on a wild guess (e.g. a candidate falling inside a data segment), the probability of all the signature bits matching by chance is quite low. You could refine this by working out the chance of landing in a data segment versus hitting a legitimate header (as a function of the number of such headers), but even the rough estimate makes a false match unlikely. You can increase your confidence further by checking the signature of the first file record listed in the central directory, but be sure to handle the boundary case of an empty zip file.
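As a quick illustration of that scan from the shell (GNU tail and grep with PCRE support assumed; a real reader still has to perform the comment-length and central-directory checks described above):
# The EOCD record starts at most 22 + 65535 bytes before the end of the file,
# and its signature is the byte sequence 50 4B 05 06 ("PK\x05\x06").
tail -c 65557 archive.zip | grep -aboP 'PK\x05\x06' | tail -n 1
# The printed offset is relative to the start of the tailed region; add
# (file size - 65557) to it when the file is larger than 65557 bytes.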
I ended up looping through the bytes starting from the end. The loop stops when it finds a matching byte sequence, when the index goes below zero, or after it has scanned 64 KB.
Just cross your fingers and hope that there isn't an entry whose CRC, timestamp, or some other four-byte field happens to be 06054B50.

Splitting long input into multiple text files

I have some code which generates an effectively unbounded number of lines of output, so I can't store them all in a single output file.
Instead, I split the output into multiple files according to line index. The problem is that I don't know in advance how many lines there will be, so is it possible to split the output into separate files without specifying the indexes? For example:
the first 100,000 lines in m.txt
lines 100,001 to 200,000 in n.txt
If you don't need to be able to find a particular line based on the file name, you can split the output based on file size: write lines to m1.txt until the next line would push it over 1 MB, then move on to the next file, m2.txt.
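A minimal awk sketch of that idea (the generate command and the 1 MB threshold are placeholders):
./generate | awk -v max=1048576 '
    out == "" || bytes + length($0) + 1 > max {
        close(out); out = sprintf("m%d.txt", ++i); bytes = 0
    }
    { print > out; bytes += length($0) + 1 }
'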
split(1) appears to be exactly the tool for your job.
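For example (GNU split assumed; generate again stands for whatever produces the output):
./generate | split -l 100000 -d - part_
This writes 100,000 lines per file, named part_00, part_01, and so on; -b 100m would split by size instead.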
Generate files with a running index. Start by opening e.g. m_000001.txt, write a fixed number of lines to it, close it, open the next file, e.g. m_000002.txt, and continue.
Making sure that you don't overflow the disk is a housekeeping task to be done separately; here one can think of backups, compression, file rotation and so on.
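A minimal sketch of the running-index approach in awk (the file name pattern and the 100,000-line chunk size are arbitrary):
./generate | awk '
    NR % 100000 == 1 { close(out); out = sprintf("m_%06d.txt", ++i) }
    { print > out }
'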
You may want to use logrotate for this purpose. It has a lot of options: check out the man page.
Here's the introduction from the man page:
"logrotate is designed to ease administration of systems that generate
large numbers of log files. It allows automatic rotation, compression,
removal, and mailing of log files. Each log file may be handled daily,
weekly, monthly, or when it grows too large."
Four ways to split while writing:
A) A fixed number of characters (size)
B) A fixed number of lines
C) A fixed interval of time between writes
D) A fixed count of function calls before each write
Based on the chosen split, you can name the output files accordingly.
