Maximum number of images in a folder - Linux

We are working on an image gallery where we expect 1 to 40 million photos, and we are thinking of keeping them all in one photo folder.
But can one photo folder hold 40 million photos? Is there any issue if I keep them directly inside the photo folder without creating any subfolders, or do I have to create folders based on the date of upload, so that the photos uploaded on a given day go into that day's folder?
I don't have any issue creating that structure, but from a knowledge point of view I want to know what the problem is if we keep a few million photos directly in one folder. I have seen a few websites doing this; for example, if you look at this page, all the images are under the image folder.
There are about 5 million images, each stored under its respective ID, for example under 4132808, so it appears that the images directory contains more than 5 million subfolders. Is it OK to keep that many folders under one directory?
http://www.listal.com/viewimage/4132808
http://iv1.lisimg.com/image/4132808/600full-the-hobbit%3A-an-unexpected-journey-photo.jpg

It depends on the filesystem; check the file system comparison page on Wikipedia.
However, you might want to sort the files into some structure like
images/[1st 2 chars of some kind of hash]/[2nd 2 chars of hash]/...
With this you create an easily reproducible path while drastically decreasing the number of files in one folder.
You want to do this because if you (or any application) ever need to list the contents of the folder, it would cause a huge performance problem.
What you see on other sites is only how they publish those images. Of course the images can be served from what looks like a single flat URL, but in the underlying structure you want to partition the files somehow.
Some calculations:
Let's say you use the SHA-256 hash of the filename to create the path. That gives you 64 hex characters in [0-9a-f]. If you choose 2-character subfolder names, you get 256 folders on each level. Now assume you do it for 3 levels: ab/cd/ef/1234...png. That's 256^3 folders, i.e. roughly 16.7 million leaf directories, so even with a couple of billion images you would only end up with a hundred or so files per leaf folder.
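As a minimal sketch of that layout in Python (the function name, base directory and example filename are illustrative assumptions, not something from the thread):

import hashlib
import os

def hashed_path(base_dir, filename, levels=3):
    # 64-character hex digest of the file name
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    # two characters per directory level, e.g. ['ab', 'cd', 'ef']
    parts = [digest[2 * i:2 * i + 2] for i in range(levels)]
    return os.path.join(base_dir, *parts, filename)

# hashed_path("/var/www/images", "1234.png")
# -> "/var/www/images/xx/yy/zz/1234.png", where xx/yy/zz are the first six hex chars of the digest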
As for serving the files, you can do something like this with Apache + mod_rewrite:
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/images/../../../.*
RewriteRule ^/images/(..)(..)(..)(.*)$ /images/$1/$2/$3/$4 [L]
This reroutes requests for the images to the correct place.

See "How many files can I put in a directory?".
Don't put all your files into one folder; it does not scale. If you don't want to start with a deep folder hierarchy, start simple and put the logic that builds the path to the folder in one class or method. That makes it simple to rearrange the layout later if needed, as in the sketch below.
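For instance, a hedged sketch of keeping that logic in one place (the bucketing scheme and names are purely illustrative):

import os

def image_path(base_dir, image_id):
    # the single place that knows the on-disk layout; callers never build paths themselves
    # a flat layout would simply be: os.path.join(base_dir, "%d.jpg" % image_id)
    # a nested layout can be switched in later without touching any caller:
    bucket = "%03d" % (image_id % 1000)
    return os.path.join(base_dir, bucket, "%d.jpg" % image_id)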

Related

What does the .sprite file refer to?

I'm using Liferay Portal 6. The .sprite file is not specified in the source code; however, it's included in the URL with a slash-dot, and it then gets blocked by a security program.
When I delete those files in theme/docroot/images and deploy the project, they are generated again.
I would like to know how to manage those files, or how to rename them.
You can open those files: they are combined images; look up "CSS sprites" for thorough documentation. They're used to limit the number of requests that go back to the server. Without sprites, every theme image would be loaded individually; with sprites, you only need to load the sprite once, resulting in a significant performance boost. You want as few HTTP requests per page as possible, and sprites are one automatically handled way to help you achieve this.

Having 1 million folders or having 1 million files in one folder?

I would like to know which is the better solution.
This is an I/O question. My web application stores files in the file system, but I want to know which approach I should take: store all files in one folder, or store each user's files in a separate folder. The folders would be based on user ID.
Example:
User A has user ID 1
User Z has user ID 10
So in my file structure there are 2 folders, named "1" and "10", and each folder has maybe 10-50 image files.
I'm thinking about performance; which one is better? Let's say the user IDs already reach 1 million. Is there any problem with having 1 million folders on Windows Server?
Any ideas?
Thanks.
Try using a database...
Images can be stored in a database using a BLOB datatype, for example as sketched below.
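A minimal sketch of that idea with Python and SQLite (the database, table, and file names are illustrative assumptions):

import sqlite3

conn = sqlite3.connect("images.db")
conn.execute("CREATE TABLE IF NOT EXISTS images (user_id INTEGER, name TEXT, data BLOB)")

# store an image for user 1
with open("photo.jpg", "rb") as f:
    conn.execute("INSERT INTO images (user_id, name, data) VALUES (?, ?, ?)",
                 (1, "photo.jpg", sqlite3.Binary(f.read())))
conn.commit()

# read it back
row = conn.execute("SELECT data FROM images WHERE user_id = ? AND name = ?",
                   (1, "photo.jpg")).fetchone()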

Scan remote directory and find sequential images with known constants

I have a quick and dirty project I need assistance with.
Outline: A remote server uses randomized file naming for storing JPEGs. All of the JPEGs are stored within the same directory, for example "website.com/photos/". All of the images in that directory have a 10-digit (0-9) file name with the suffix .jpg. The images are sequentially named (for example 12XXXXXXXX.jpg, and much later the series becomes 13XXXXXXXX.jpg), but not every sequential number is used; most are not. One image might be 1300055000.jpg, but the next might not appear until 1300990000.jpg.
I am looking for a program to scan this directory and 'try' every file name possibility (0000000000.jpg - 9999999999.jpg) and then output a URL sheet (basic HTML) with links to the working JPEGs that are found.
All non-working JPEGs when tried return a 404 not found page. All working JPEGs return a sizable photo.
Your assistance is greatly appreciated! I'd be willing to compensate for the work. Thank you!
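A minimal Python sketch of such a scanner, assuming the requests library is available; the base URL comes from the description above, but the range shown is a tiny illustrative slice, since trying all 10^10 possible names is not practical:

import requests

BASE = "http://website.com/photos/"

def find_images(start, stop):
    # yield the URLs in [start, stop) that return a real image rather than a 404
    for n in range(start, stop):
        url = "{}{:010d}.jpg".format(BASE, n)
        try:
            r = requests.head(url, timeout=5)
        except requests.RequestException:
            continue
        if r.status_code == 200:
            yield url

# write a basic HTML sheet linking to the working JPEGs found in a small range
with open("found.html", "w") as out:
    out.write("<html><body>\n")
    for url in find_images(1300055000, 1300056000):
        out.write('<a href="{0}">{0}</a><br>\n'.format(url))
    out.write("</body></html>\n")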

Best practice for white-labeling a static web site?

I have a directory structure with files under directory 'web', which represents a white-label web site (HTML files, images, JS, etc.).
I also have some twenty different 'brands', let's call them 'web-1', 'web-2', etc., each containing specific files that should override the files in 'web' for that brand.
Apache is configured to find the files for each virtual site i in document root 'website-i'.
In order for each 'website-i' to contain content like 'web' with the overrides for brand 'web-i', my build script first copies all of 'web' to 'website-i' and then overrides it with the source files from 'web-i'.
There are several problems with this approach:
It takes time to copy the files.
It takes a lot of disk space.
Adding a new brand requires adding to the script.
Is there a best practice for doing this in a way that does not require duplicating files?
(with Apache and Linux)
Well, the best solution is pretty simple server-side code, but I'm going to assume you've already rejected that, maybe because you don't have permission to run code on the server (although if you're hacking the config then you probably do).
A solution just in config could be to make it serve from the default root but rewrite the URL if the file exists in the brand dir...
RewriteCond /web-1/a_file -f
RewriteRule ^/a_file$ /web-1/a_file [L]
But you'd have to do this for every file in every brand. You might be able to be more clever about it if the files are stored in a dir named the same as the hostname, but that's getting too complex for me.
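Purely as a hedged sketch of a more general, per-request variant (the hostname and paths here are illustrative, not from the question): use the brand dir as the document root and fall back to the white-label 'web' copy whenever the brand does not override the requested file:

<VirtualHost *:80>
    # illustrative names and paths
    ServerName brand1.example.com
    DocumentRoot /var/www/web-1
    RewriteEngine On
    # if the brand does not override the requested file, fall back to the default 'web' copy
    RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} !-f
    RewriteRule ^/(.*)$ /var/www/web/$1 [L]
</VirtualHost>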
Seriously, server-side code is the way to go on this...
Strangely enough, a work colleague was in pretty much exactly the same situation a couple of weeks ago, and he eventually rewrote it in PHP. Each site is a row in a database, and one page pulls out the changed text and image URLs, etc., falling back to a default if there's nothing there.
Having said all that, using links, as you say above, solves problem 2 and probably much of 1, and I don't think there's a way around 3 anyway.

What are some good crawlers that can help download files

For one of my statistics projects, I need to RANDOMLY download several files from a Google patent page; each file is a large zip file. The web link is the following:
http://www.google.com/googlebooks/uspto-patents-grants-text.html#2012
Specifically, I want to RANDOMLY select 5 years (the links at the top of the page) and download them (i.e. 5 files). Do you guys know if there's a good package out there for this task?
Thank you.
That page contains mostly zip files, and looking at the HTML content it should be fairly easy to determine which links will yield a zip file by simply checking for a .zip suffix among the candidate URLs, so here is what I would recommend:
fetch the page
parse the HTML
extract the anchor tags
for each anchor tag
    if href of anchor tag contains ".zip"
        add href to list of file links
while more files needed
    generate a random index i, such that i is between 0 and num links in list
    select the i-th element from the links list
    fetch the zip file
    save the file to disk or load it in memory
If you don't want to get the same file twice, just remove the URL from the list of links and then randomly select another index (until you have enough files or until you run out of links). I don't know what programming language your team codes in, but it shouldn't be very difficult to write a small program that does the above; a sketch follows.
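As a minimal sketch of that recommendation in Python, assuming the requests and BeautifulSoup libraries are available (the page URL is the one from the question, without the #2012 fragment; output filenames are whatever the links end with):

import random
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

PAGE = "http://www.google.com/googlebooks/uspto-patents-grants-text.html"

# fetch the page, parse the HTML, and keep every anchor whose href ends in .zip
html = requests.get(PAGE).text
soup = BeautifulSoup(html, "html.parser")
links = [urljoin(PAGE, a["href"]) for a in soup.find_all("a", href=True)
         if a["href"].endswith(".zip")]

# randomly pick 5 distinct zip links and download each one to disk
for url in random.sample(links, 5):
    name = url.rsplit("/", 1)[-1]
    with open(name, "wb") as out:
        out.write(requests.get(url).content)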
