Best solution for this signature generator?

Hey! I want to know the best solution for my problem.
I have a signature generator http://www.anitard.org/siggen/siggen_stripes/ where people can upload their own images for the signature. The problem is that my storage will fill up pretty fast if I don't have some kind of script that deletes the images when people are done with the signature.
What is the best solution for this?

My initial feeling on this would be to not save the uploaded files at all, but to just delete them as soon as the image is generated. However, some browsers might request the image again when the user tries to save it; I know this is true with Firefox's DownThemAll extension, for instance. So you'll probably have to store the files for a short amount of time, as JustLoren suggests.
A quick Google search for "php delete temp files" turns up at least one script explaining exactly how to delete files after a certain amount of time. This would not have to be run as an external script or cron job; it could merely be tacked on to the upload script, for instance.
One flaw in the given script is that someone could rapidly upload many files in a row, exceeding your disk quota. You might want to expand on the linked script by keeping only the newest 50 (or however many) files: check the count of the matched files, sort them by creation time, and delete any with an index greater than 50.

Personally, I would have a script that runs every hour (or day, depending on volume) that checks each file's creation date and deletes it if it is more than an hour old. Realistically, users should save their images to their hard drives within 2 minutes of creating them, but you can't count on it. An hour seems like a nice compromise.
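For what it's worth, a rough sketch of that combined policy (drop anything more than an hour old, and never keep more than the newest 50 uploads) could look like the following. It is written in Node-style JavaScript rather than the PHP of the linked script, the folder name and limits are placeholders, and mtime stands in for creation time:

const fs = require('fs');
const path = require('path');

// Delete uploads older than an hour, and cap the folder at the newest 50 files.
// (mtime is used here as a stand-in for creation time.)
function cleanUploads(dir, maxAgeMs = 60 * 60 * 1000, keep = 50) {
  const now = Date.now();
  const files = fs.readdirSync(dir)
    .map((name) => {
      const full = path.join(dir, name);
      return { full, mtime: fs.statSync(full).mtimeMs };
    })
    .sort((a, b) => b.mtime - a.mtime);   // newest first

  files.forEach((f, i) => {
    if (i >= keep || now - f.mtime > maxAgeMs) fs.unlinkSync(f.full);
  });
}

cleanUploads('./sig_uploads');   // e.g. call this from the upload script or an hourly cron job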

Related

What is the performance impact of the touch command or its equivalent system call?

In a custom-developed NodeJS web server (running on Linux) that can dynamically generate thumbnail images, I want to cache these thumbnails on the filesystem and keep track of when they are actually used. If they haven't been used for a certain period of time (say, one year), I'd consider them "orphans" and delete them.
To this end, I considered touching them each time they're requested by a client, so that I can use the modification time to check when they were last used.
I assume this would incur a significant performance hit on the web server in high-load situations, as it is an "unnecessary" filesystem write, while, apart from logging, most requests will only consist of reads.
Has anyone performed any benchmarks on how big an impact this might have and if it's worthwhile?
The hit is probably not huge, but it is still worth avoiding an update every time you open a file. That's the reason the relatime / noatime mount options were invented: to prevent the existing Unix access-time timestamp from being updated every time a file was opened.
Is your filesystem mounted with relatime? That updates atime at most once per day, when the file is opened (even for reading). The other mount option that's common on Linux is noatime: never update atime.
If you can't let the kernel track this for you for free (via atime), you might be better off making an fstat system call after opening the file and only touching it (to update the mod time) if the current mod time is older than a day or a week. (You're concerned about intervals of a year, so a week is fine.) That is, manually implement the relatime logic, but for the mod time.
Frequently accessed files will not need updates (and you're still making a total of one system call for them, plus a date-compare). Rarely accessed files will need another system call and a metadata write. If most of the accesses in your access pattern are to a smallish set of files repeatedly, this should be excellent.
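A minimal Node-style sketch of that manual relatime-for-mtime idea could look like this (the one-week threshold, the path, and the function name are placeholders):

const fs = require('fs');

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// After serving a thumbnail, bump its mtime only if the current mtime is more than a week old,
// so hot files cost one extra stat per request and cold files cost one stat plus one utimes.
async function markUsed(thumbPath) {
  const { mtimeMs } = await fs.promises.stat(thumbPath);
  if (Date.now() - mtimeMs > WEEK_MS) {
    const now = new Date();
    await fs.promises.utimes(thumbPath, now, now);   // note: this sets both atime and mtime
  }
}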
Possible reasons for not being able to use atime could include:
The filesystem is mounted with noatime and it's not worth changing that
The files sometimes get read by something other than your web server / CGI setup. (e.g. a backup job that does more than compare size / timestamps)
Of course, the other option is to not update timestamps on use, and simply let a thumbnail be regenerated once a year after your weekly cron job deleted it. That might be ok depending on your workload.
If you manually touch some of the "hottest" thumbnails so you stagger their deletion, instead of having a big load spike this time next year, you could be ok. And/or have your deleter walk your filesystem very slowly, again so you don't have a big batch of frequently-needed thumbnails deleted at once.
You could come up with schemes like enabling mod-time updates in the week before the bi-annual cleanup, so thumbnails that should stay hot in cache get their modtimes updated. But probably better to just fstat / check / update all the time since that shouldn't be too much extra load.

VBA: Coordinate batch jobs between several computers

I have a VBA script that extracts information from huge text files and does a lot of data manipulation and calculations on the extracted info. I have about 1000 files and each takes an hour to finish.
I would like to run the script on as many computers as possible (among them EC2 instances) to reduce the time needed to finish the job. But how do I coordinate the work?
I have tried two ways. First, I set up a Dropbox folder as a network drive with one txt file holding the number of the last job started; the VBA code reads it, starts the next job, and updates the number, but there is apparently too much lag before a change made on one computer shows up on the others for this to be practical. The second was to find a simple "private" counter service online that increments on each visit, so each computer would request the page, read the number, and the page would update the number for the next visit from another computer. But I have found no such service.
Any suggestions on how to coordinate such tasks between different computers in vba?
First of all, if you can, use a proper programming language, for example C#, which makes parallel processing easy.
If you must use VBA, then first optimize your code. Can you show us the code?
Edit:
If you must, you could do the following. First you need some sort of file server where all the text files are stored in one folder.
Then, in the macro, loop over every .txt file in the folder: try to open the file in exclusive mode. If it can be opened, run your code on it and, once your code has finished, move the file elsewhere; otherwise move on to the next .txt file.
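As a rough illustration of the same claim-a-file idea in code, here is a sketch in Node-style JavaScript rather than VBA, with one variation: each worker claims a job by atomically moving the file into its own folder instead of opening it exclusively. Folder names are placeholders.

const fs = require('fs');
const path = require('path');

// Each worker tries to "claim" a job by moving it into its own folder.
// Only one worker can win the rename for a given file, so no job is processed twice.
function claimNextJob(jobDir, claimedDir) {
  for (const name of fs.readdirSync(jobDir)) {
    if (!name.endsWith('.txt')) continue;
    try {
      const claimed = path.join(claimedDir, name);
      fs.renameSync(path.join(jobDir, name), claimed);   // throws if another worker already took it
      return claimed;        // process this file, then move it on to a "done" folder
    } catch (err) {
      // someone else got there first; try the next file
    }
  }
  return null;               // nothing left to claim
}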

Node .fs Working with a HUGE Directory

Picture a directory with a ton of files. As a rough gauge of magnitude I think the most that we've seen so far is a couple of million but it could technically go another order higher. Using node, I would like to read files from this directory, process them (upload them, basically), and then move them out of the directory. Pretty simple. New files are constantly being added while the application is running, and my job (like a man on a sinking ship holding a bucket) is to empty this directory as fast as it's being filled.
So what are my options? fs.readdir is not ideal: it loads all of the filenames into memory, which becomes a problem at this kind of scale, especially as new files are being added all the time and so it would require repeated calls. (As an aside for anybody referring to this in the future, there is something being proposed to address this whole issue which may or may not have been realised within your timeline.)
I've looked at the myriad of fs drop-ins (graceful-fs, chokidar, readdirp, etc.), none of which have this particular use case within their remit.
I've also come across a couple of people suggesting that this can be handled with child_process, and there's a wrapper called inotifywait which tasks itself with exactly what I am asking, but I really don't understand how this addresses the underlying problem, especially at this scale.
I'm wondering if what I really need to do is find a way to just get the first file (or, realistically, batch of files) from the directory without having the overhead of reading the entire directory structure into memory. Some sort of stream that could be terminated after a certain number of files had been read? I know Go has a parameter for reading the first n files from a directory but I can't find a node equivalent, has anybody here come across one or have any interesting ideas? Left-field solutions more than welcome at this point!
You can use your operating system's file-listing command and stream the result into Node.js.
For example, on Linux:
var cp = require('child_process');
// Use spawn rather than exec so the listing is streamed as it is produced
// (exec buffers the whole output and would hit its maxBuffer limit with millions of entries).
var ls = cp.spawn('ls');
ls.stdout.on('data', function (chunk) {
    console.log(chunk.toString());
});
RunKit: https://runkit.com/aminanadav/57da243180f3bb140059a31d
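If your Node version is recent enough, another option worth knowing about is fs.promises.opendir (added in Node 12.12), which iterates directory entries lazily, so you can take a batch and stop without listing the whole directory; the path and batch size below are placeholders:

const fs = require('fs');

// Read at most `batchSize` file names from `dirPath` without listing the whole directory.
async function readBatch(dirPath, batchSize) {
  const dir = await fs.promises.opendir(dirPath);   // entries are fetched lazily, not all at once
  const names = [];
  for await (const entry of dir) {
    if (entry.isFile()) names.push(entry.name);
    if (names.length >= batchSize) break;           // the directory handle is closed when the loop exits
  }
  return names;
}

readBatch('./huge-directory', 100).then((batch) => console.log(batch));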

How do I limit log4j files based on size (only)?

I would like to configure log4j to write only files up to a maximum specified size, e.g. 5MB. When the log file hits 5MB I want it to start writing to a new file. I'd like to put some kind of meaningful timestamp into the logfile name to distinguish one file from the next.
I do not need it to rename or manipulate the old files in any way when a new one is written (compression would be a boon, but is not a must).
I absolutely do not want it to start deleting old files after a certain period of time or number of rolled files.
Timestamped chunks of N MB logfiles seems like the absolute basic minimum simple strategy I can imagine. It's so basic that I almost refuse to believe that it's not provided out of the box, yet I have been incapable of figuring out how to do this!
In particular, I have tried all incarnations of DailyRollingFileAppender, RollingFileAppender or the 'Extras' RollingFileAppender. All of these delete old files when the backupcount is exceeded. I don't want this, I want all my log files!
TimeBasedRollingPolicy from Extras does not fit because it doesn't have a max file size option and you cannot associate a secondary SizeBasedTriggeringPolicy (as it already implements TriggeringPolicy).
Is there an out-of-the-box solution, preferably one that's in Maven Central?
I gave up trying to get this to work out of the box. It turns out that uk.org.simonsite.log4j.appender.TimeAndSizeRollingAppender is, as the young folk apparently say, the bomb.
I tried getting in touch some time back to get the author to stick this into Maven Central, but no luck so far.

Editing large data files

I'm about to start on a project wherein I foresee there being large files (mostly flat text files, but possibly CSV, fixed-width, or XML; that's what I've seen so far) that need to be edited. I need to develop the pieces to do this editing within the application.
In trying to determine a Good Way to handle editing large amounts of data (possibly into the GB range) without having to load the whole thing, I found that Audacity is able to handle large files quite well. Audacity is open source, so I thought it would make an excellent teaching tool for me in this circumstance. However, I started thinking myself in circles going through the code and now I'm thoroughly confused.
I'm hoping for two results from this question:
A good way to handle this editing without loading the whole file. I thought about loading the data as the user edits it, caching it on demand.
An explanation of how Audacity does it.
I'm using C# and .NET, but the answers don't need to be coupled to that environment.
Several tricks can make editing simpler and faster.
INDEX it for faster access. While the user is doing nothing, skim through the file and create an index so you can quickly find a specific spot in the file (see below).
Only store changes the user makes. Don't try to apply them directly to the file until the user saves.
Set a limit on how much to read into memory when the user jumps to a point. Read one or two screens of data initially so you can display it, and then if the user doesn't jump to a new spot immediately, read a bit before and a bit after the current spot.
Indexing:
When the user wants to jump to line X or timestamp T, you don't want to skim the whole file counting line breaks and characters. Skim the data, and create a record. Say, every 50 lines, record the byte offset, character count, and line number. This data can be stored in a hashtable, tree, or just an ordered list. Then when the user jumps within the file, you can find the nearest index spot and read from there until you find the requested point. This technique is especially useful when working with Unicode, where the number of bytes per character may vary. If files are so large a full index won't fit in memory, you may want to limit the index points and space them more widely, or store the index in a temporary file.
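As a rough illustration of that indexing idea (the question says answers need not be tied to C#/.NET, so this sketch uses Node-style JavaScript; the 50-line interval and the names are placeholders):

const fs = require('fs');

// Build a sparse index: every `interval` lines, remember the byte offset at which that line starts.
async function buildLineIndex(filePath, interval = 50) {
  const index = [{ line: 0, offset: 0 }];
  let line = 0;
  let offset = 0;
  for await (const chunk of fs.createReadStream(filePath)) {
    for (let i = 0; i < chunk.length; i++) {
      if (chunk[i] === 0x0a) {                                    // '\n'
        line++;
        if (line % interval === 0) index.push({ line, offset: offset + i + 1 });
      }
    }
    offset += chunk.length;
  }
  return index;   // to jump to line X: seek to the nearest entry at or before X, then scan forward
}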
Editing and altering big files:
As suggested by Harvey -- only store changes in memory (as diffs), and then apply them to the file when saved by streaming from input to output. A tree or ordered list may be helpful, so you can quickly find the next place where you need to make a change when writing from input to output.
If changes are too large to fit in memory, you may want to track them in a separate temporary file (perhaps in the same folder as the original). You can just keep writing a continuous list of changes, with new ones appended to this change file. When you save, you'll read through the change list and create a final list of alterations to apply, before deleting the temp file. For performance reasons, it may be helpful to avoid rewriting the change log file; instead, just append to the end of it, and remove redundant or cancelling edits when performing a save.
Interesting fact: the same structures you use for the change log can be used to provide Undo/Redo information.
Sound files are basically a data stream, right? So you don't actually need to deal with the whole file at once. Audacity users may only work with a small snippet of that large file at any given moment.
Hypothetically, if you are adding a 1 second snippet of sound to a large sound file, you only actually have to deal with the entire file when you have to save, at which point you splice together 3 parts: Before, 1 second snippet, and after. So the only thing that needs to actually be in memory is the 1 second snippet, and maybe a small portion of the sound before and after the snippet.
So when you save, you read, say 64 megabytes of file at a time (if you are really aggressive), and stream that out to a temporary file, until you get to your insertion point. Then you stream out the 1 second snippet, stream the remainder of the original file, close the temporary write file, delete the original file, and rename the new file to the original file name.
Of course, it's a little more complicated than this. There might be multiple edits before save, for example, and an undo buffer. But I can pretty much guarantee you that Audacity is limited in unsaved edit complexity by the amount of available RAM.
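To make the save-time splice concrete, here is a rough Node-style JavaScript sketch along those lines (the question allows answers outside .NET; names are placeholders and it assumes the insertion offset is greater than zero):

const fs = require('fs');

// Splice `snippet` (a Buffer) into `original` at byte `offset` without loading the whole file:
// stream the bytes before the offset, then the snippet, then the rest, into a temp file,
// then rename the temp file over the original.
async function spliceFile(original, snippet, offset) {
  const temp = original + '.tmp';
  const out = fs.createWriteStream(temp);
  const write = (chunk) =>
    new Promise((resolve, reject) => out.write(chunk, (err) => (err ? reject(err) : resolve())));

  for await (const chunk of fs.createReadStream(original, { start: 0, end: offset - 1 })) {
    await write(chunk);                                // everything before the insertion point
  }
  await write(snippet);                                // the new data
  for await (const chunk of fs.createReadStream(original, { start: offset })) {
    await write(chunk);                                // the rest of the original file
  }
  await new Promise((resolve) => out.end(resolve));    // flush and close the temp file
  await fs.promises.rename(temp, original);            // swap it into place
}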
