On Linux, is access() faster than stat()?

I would have assumed that access() was just a wrapper around stat(), but I've been googling around and have found some anecdotes about replacing stat calls with 'cheaper' access calls. Assuming you are only interested in checking if a file exists, is access faster? Does it completely vary by filesystem?

Theory
I doubt it.
In the lower layers of the kernel there is not much difference between the access() and stat() calls: both perform a lookup operation, mapping the file name to an entry in the dentry cache and to an inode (the actual kernel structure). Lookup is slow because it has to be done for each component of the path: for /usr/bin/cat you need to look up usr, bin and then cat, and each step can require reading from disk -- that is why inodes and dentries are cached in memory.
The major difference between the calls is that stat() converts the inode structure to the stat structure, while access() does a simple check, but that time is small compared with the lookup time.
The real performance gain can be achieved with the at-operations like faccessat() and fstatat(), which allow you to open() a directory once. Just compare:
struct stat s;
stat("/usr/bin/cat", &s);  // lookups: usr, bin and cat = 3
stat("/usr/bin/less", &s); // lookups: usr, bin and less = 3

int fd = open("/usr/bin", O_RDONLY | O_DIRECTORY); // lookups: usr, bin = 2
fstatat(fd, "cat", &s, 0);  // lookups: cat = 1
fstatat(fd, "less", &s, 0); // lookups: less = 1
Experiments
I wrote a small Python script which calls stat() and access():

import os, time, random

files = ['gzexe', 'catchsegv', 'gtroff', 'gencat', 'neqn', 'gzip',
         'getent', 'sdiff', 'zcat', 'iconv', 'not_exists', 'ldd',
         'unxz', 'zcmp', 'locale', 'xz', 'zdiff', 'localedef', 'xzcat']

access = lambda fn: os.access(fn, os.R_OK)

for i in xrange(1, 80000):
    try:
        random.choice((access, os.stat))("/usr/bin/" + random.choice(files))
    except OSError:
        continue
I traced the system with SystemTap to measure the time spent in different operations. Both the stat() and access() system calls use the user_path_at_empty() kernel function, which represents the lookup operation:
stap -ve '
    global tm, times, path;

    probe lookup = kernel.function("user_path_at_empty")
        { name = "lookup"; pathname = user_string_quoted($name); }
    probe lookup.return = kernel.function("user_path_at_empty").return
        { name = "lookup"; }

    probe stat = syscall.stat
        { pathname = filename; }

    probe stat, syscall.access, lookup {
        if(pid() == target() && isinstr(pathname, "/usr/bin")) {
            tm[name] = local_clock_ns();
        }
    }

    probe syscall.stat.return, syscall.access.return, lookup.return {
        if(pid() == target() && tm[name]) {
            times[name] <<< local_clock_ns() - tm[name];
            delete tm[name];
        }
    }
' -c 'python stat-access.py'
Here are the results:
         COUNT   AVG
lookup   80018   1.67 us
stat     40106   3.92 us
access   39903   4.27 us
Note that I disabled SELinux in my experiments, as it significantly affects the results.

Related

What is the default behavior of perf record?

It's clear to me that perf always records one or more events, and the sampling can be counter-based or time-based. But when the -e and -F switches are not given, what is the default behavior of perf record? The manpage for perf-record doesn't tell you what it does in this case.
The default event is cycles, as can be seen by running perf script after perf record. There, you can also see that the default sampling behavior is time-based, since the number of cycles is not constant. The default frequency is 4000 Hz, which can be seen in the source code and checked by comparing the file size or number of samples to a recording where -F 4000 was specified.
The perf wiki says that the rate is 1000 Hz, but this is not true anymore for kernels newer than 3.4.
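To make the default concrete, here is a rough sketch (my illustration, not perf's actual code) of the event attribute that a default perf record effectively requests: hardware cycles, sampled time-based at 4000 Hz:

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Roughly what `perf record` with no -e/-F asks the kernel for
 * (precise_ip negotiation and the cpu-clock fallback are omitted). */
int open_default_event(pid_t pid)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.freq = 1;            /* time-based: fixed rate, variable period */
    attr.sample_freq = 4000;  /* the default 4000 Hz */

    return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}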
Default event selection in perf record is done in the user-space perf tool, which is usually distributed as part of the Linux kernel. With make perf-src-tar-gz in the kernel source dir you can build a tarball for a quick rebuild, or download such a tarball from https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf. There are also several online "LXR" cross-reference viewers for the kernel source, which can be used just like grep to learn about perf internals.
The function that selects the default event list (evlist) for perf record is __perf_evlist__add_default in tools/perf/util/evlist.c:
int __perf_evlist__add_default(struct evlist *evlist, bool precise)
{
    struct evsel *evsel = perf_evsel__new_cycles(precise);

    evlist__add(evlist, evsel);
    return 0;
}
It is called from the perf record implementation when zero events were parsed from the options, in tools/perf/builtin-record.c: int cmd_record():

    rec->evlist->core.nr_entries == 0 &&
    __perf_evlist__add_default(rec->evlist, !record.opts.no_samples)
perf_evsel__new_cycles then asks for the hardware event cycles (PERF_TYPE_HARDWARE + PERF_COUNT_HW_CPU_CYCLES), with optional kernel sampling and maximum precision (check the modifiers in man perf-list; precise mode works around EIP sampling skid using PEBS or IBS):
struct evsel *perf_evsel__new_cycles(bool precise)
{
    struct perf_event_attr attr = {
        .type           = PERF_TYPE_HARDWARE,
        .config         = PERF_COUNT_HW_CPU_CYCLES,
        .exclude_kernel = !perf_event_can_profile_kernel(),
    };
    struct evsel *evsel;

    /*
     * Now let the usual logic to set up the perf_event_attr defaults
     * to kick in when we return and before perf_evsel__open() is called.
     */
    evsel = evsel__new(&attr);
    evsel->precise_max = true;

    /* use asprintf() because free(evsel) assumes name is allocated */
    if (asprintf(&evsel->name, "cycles%s%s%.*s",
                 (attr.precise_ip || attr.exclude_kernel) ? ":" : "",
                 attr.exclude_kernel ? "u" : "",
                 attr.precise_ip ? attr.precise_ip + 1 : 0, "ppp") < 0)
        return evsel;
}
In case of a failed perf_event_open() (no access to hardware cycles sampling, for example in a virtualized environment without a virtualized PMU) there is a fallback to software cpu-clock sampling in tools/perf/builtin-record.c: int record__open(), which calls perf_evsel__fallback() of tools/perf/util/evsel.c:
bool perf_evsel__fallback(struct evsel *evsel, int err,
                          char *msg, size_t msgsize)
{
    if ((err == ENOENT || err == ENXIO || err == ENODEV) &&
        evsel->core.attr.type == PERF_TYPE_HARDWARE &&
        evsel->core.attr.config == PERF_COUNT_HW_CPU_CYCLES) {
        /*
         * If it's cycles then fall back to hrtimer based
         * cpu-clock-tick sw counter, which is always available even if
         * no PMU support.
         */
        scnprintf(msg, msgsize, "%s",
                  "The cycles event is not supported, trying to fall back to cpu-clock-ticks");

        evsel->core.attr.type = PERF_TYPE_SOFTWARE;
        evsel->core.attr.config = PERF_COUNT_SW_CPU_CLOCK;
        return true;
    } ...
}

Linux OS: /proc/[pid]/smaps vs /proc/[pid]/statm

I would like to calculate the memory usage of a single process. After a little bit of research I came across smaps and statm.
First of all, what are smaps and statm? What is the difference?
statm has an RSS field, and in smaps I sum up all the Rss values. But those values are different for the same process. I know that statm measures in pages; for comparison purposes I converted its value to kB, as in smaps. But the values are still not equal.
Why do these two values differ, even though they represent the RSS of the same process?
statm
232214 80703 7168 27 0 161967 0 (measured in pages, page size is 4096)
smaps
Rss 1956
My aim is to calculate the memory usage of a single process. I am interested in two values: USS and PSS.
Can I get those two values just by using smaps? Are those values correct?
Also, I would like to report that value as a percentage.
I think statm is an approximated simplification of smaps, which is more expensive to produce. I came to this conclusion after looking at the source:
smaps
The information you see in smaps is defined in fs/proc/task_mmu.c:
static int show_smap(struct seq_file *m, void *v, int is_pid)
{
    (...)
    struct mm_walk smaps_walk = {
        .pmd_entry = smaps_pte_range,
        .mm = vma->vm_mm,
        .private = &mss,
    };

    memset(&mss, 0, sizeof mss);
    walk_page_vma(vma, &smaps_walk);

    show_map_vma(m, vma, is_pid);
    seq_printf(m,
        (...)
        "Rss: %8lu kB\n"
        (...)
        mss.resident >> 10,
The information in mss is used by walk_page_vma defined in mm/pagewalk.c. However, the mss member resident is not filled in walk_page_vma - instead, walk_page_vma calls the callback specified in smaps_walk:

.pmd_entry = smaps_pte_range,
.private = &mss,

like this:

    if (walk->pmd_entry)
        err = walk->pmd_entry(pmd, addr, next, walk);
So what does our callback, smaps_pte_range in fs/proc/task_mmu.c, do?
It calls smaps_pte_entry and smaps_pmd_entry in some circumstances, both of which call smaps_account(), which in turn... updates the resident size! All of these functions are defined in the already linked task_mmu.c, so I didn't post the relevant code snippets, as they can easily be seen in the linked sources.
PTE stands for Page Table Entry and PMD for Page Middle Directory. So basically we iterate through the page table entries associated with the given process and update the RAM usage depending on the circumstances.
statm
The information you see in statm is defined in fs/proc/array.c:

int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
                   struct pid *pid, struct task_struct *task)
{
    unsigned long size = 0, resident = 0, shared = 0, text = 0, data = 0;
    struct mm_struct *mm = get_task_mm(task);

    if (mm) {
        size = task_statm(mm, &shared, &text, &data, &resident);
        mmput(mm);
    }

    seq_put_decimal_ull(m, 0, size);
    seq_put_decimal_ull(m, ' ', resident);
    seq_put_decimal_ull(m, ' ', shared);
    seq_put_decimal_ull(m, ' ', text);
    seq_put_decimal_ull(m, ' ', 0);
    seq_put_decimal_ull(m, ' ', data);
    seq_put_decimal_ull(m, ' ', 0);
    seq_putc(m, '\n');

    return 0;
}
This time, resident is filled by task_statm. This function has two implementations: one in fs/proc/task_mmu.c and a second in fs/proc/task_nommu.c. Since they're almost surely mutually exclusive, I'll focus on the implementation in task_mmu.c (the file that also contains the smaps code above). In this implementation we see that
unsigned long task_statm(struct mm_struct *mm,
                         unsigned long *shared, unsigned long *text,
                         unsigned long *data, unsigned long *resident)
{
    *shared = get_mm_counter(mm, MM_FILEPAGES);
    (...)
    *resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
    return mm->total_vm;
}
it queries some counters, namely MM_FILEPAGES and MM_ANONPAGES. These counters are modified during various operations on memory, such as do_wp_page defined in mm/memory.c. All of the modifications seem to be done by files located in mm/, and there seem to be quite a lot of them, so I didn't include them here.
Conclusion
smaps does complicated iteration through all referenced memory regions and updates resident size using the collected information. statm uses data that was already calculated by someone else.
The most important part is that while smaps collects the data independently each time, statm uses counters that get incremented or decremented over the process's life cycle. There are a lot of places that need to do the bookkeeping, and perhaps some places don't update the counters the way they should. That's why IMO statm is inferior to smaps, even if it takes fewer CPU cycles to complete.
Please note that this is a conclusion I drew based on common sense, and I might be wrong - perhaps there are no inconsistencies in the counter decrementing and incrementing; instead, they might count some pages differently than smaps does. At this point I believe it'd be wise to take it to some experienced kernel maintainers.
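If you want to compare the two numbers yourself, a small sketch like the following (my own illustration, not kernel code) reads both files for the current process: statm's resident field scaled by the page size versus the sum of the Rss: lines in smaps:

#include <stdio.h>
#include <unistd.h>

/* Compare statm's resident field with the sum of the Rss: lines in
 * smaps for the current process. Illustration only; error handling
 * is minimal. */
int main(void)
{
    long page_kb = sysconf(_SC_PAGESIZE) / 1024;
    unsigned long size, resident;
    unsigned long smaps_rss_kb = 0;
    char line[256];

    FILE *f = fopen("/proc/self/statm", "r");
    if (!f || fscanf(f, "%lu %lu", &size, &resident) != 2)
        return 1;
    fclose(f);

    f = fopen("/proc/self/smaps", "r");
    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f)) {
        unsigned long kb;
        if (sscanf(line, "Rss: %lu kB", &kb) == 1)
            smaps_rss_kb += kb;
    }
    fclose(f);

    printf("statm resident: %lu kB\n", resident * page_kb);
    printf("smaps Rss sum:  %lu kB\n", smaps_rss_kb);
    return 0;
}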

fuse: Setting offsets for the filler function in readdir

I am implementing a virtual filesystem using FUSE, and need some understanding regarding the offset parameter in readdir.
Earlier we were ignoring the offset and passing 0 to the filler function, in which case the kernel should take care of it.
Our filesystem database stores: directory name, file length, inode number and parent inode number.
How do I calculate the offset?
Is the offset of each entry equal to its size, with entries sorted by increasing inode number? And what happens if there is a directory inside a directory - is its offset then the sum of the sizes of the files inside it?
Example: in case the dir listing is a.txt, b.txt, c.txt,
and the inode numbers are a.txt=3, b.txt=5, c.txt=7:
Offset of a.txt = directory offset
Offset of b.txt = dir offset + size of a.txt
Offset of c.txt = dir offset + size of b.txt
Is the above assumption correct?
P.S: Here are the callbacks of fuse
The selected answer is not correct
Despite the lack of upvotes on this answer, this is the correct answer. Cracking into the format of the void buffer should be discouraged; that's the intent behind declaring such things void in C code. You shouldn't write code that assumes knowledge of the format of the data behind a void pointer - use whatever API is provided properly instead.
The code below is very simple and straightforward, as it should be. No knowledge of the format of the Fuse buffer is required.
Fictitious API
This is a contrived example of what some device's API could look like. This is not part of Fuse.
// get_some_file_names() -
// returns a struct with buffers holding the names of files.
// PARAMETERS
// * path - A path of some sort that the fictitious device groks.
// * offset - Where in the list of file names to start.
// RETURNS
// * A name_list, it has some char buffers holding the file names
// and a couple other auxiliary vars.
//
name_list *get_some_file_names(char *path, size_t offset);
Listing the files in parts
Here's a Fuse callback that can be registered with the Fuse system to list the filenames provided by get_some_file_names(). It's arbitrarily named readdir_callback() so its purpose is obvious.
int readdir_callback(const char *path,
                     void *buf,              // This is meant to be "opaque".
                     fuse_fill_dir_t filler, // filler takes care of buf.
                     off_t off,              // Last value given to filler.
                     struct fuse_file_info *fi)
{
    // Call the fictitious API to get a list of file names.
    name_list *list = get_some_file_names(path, off);

    for (int i = 0; i < list->length; i++)
    {
        // Feed the file names to filler() one at a time.
        if (filler(buf, list->names[i], NULL, off + i + 1))
        {
            break; // filler() returned 1, requesting a break.
        }
        incr_num_files_listed(list);
    }

    if (all_files_listed(list))
    {
        return 1; // Tell Fuse we're done.
    }
    return 0;
}
The off (offset) value is not used by the filler function to fill its opaque buffer, buf. The off value is, however, meaningful to the callback as an offset base as it provides file names to filler(). Whatever value was last passed to filler() is what gets passed back to readdir_callback() on its next invocation. filler() itself only cares whether the off value is 0 or non-0.
Indicating "I'm done listing!" to Fuse
To signal to the Fuse system that your readdir_callback() is done listing file names in parts (when the last of the list of names has been given to filler()), simply return 1 from it.
How off Is Used
The off (offset) parameter should be non-0 to perform partial listings. That's its only requirement as far as filler() is concerned. If off is 0, that indicates to Fuse that you're going to do a full listing in one shot (see below).
Although filler() doesn't care what the off value is beyond it being non-0, the value can still be used meaningfully. The code above uses the index of the next item in its own file list as the value. Fuse will keep passing the last off value it received back to the readdir callback on each invocation until the listing is complete (when readdir_callback() returns 1).
Listing the files all at once
int readdir_callback(const char *path,
                     void *buf,
                     fuse_fill_dir_t filler,
                     off_t off,
                     struct fuse_file_info *fi)
{
    name_list *list = get_all_file_names(path);

    for (int i = 0; i < list->length; i++)
    {
        filler(buf, list->names[i], NULL, 0);
    }
    return 0;
}
Listing all the files in one shot, as above, is simpler - but not by much. Note that off is 0 for the full listing. One may wonder, 'why even bother with the first approach of reading the folder contents in parts?'
The in-parts strategy is useful when a set number of buffers for file names is allocated and the number of files within a folder may exceed it. For instance, the implementation of name_list above may only have 8 allocated buffers (char names[8][256]). Also, buf may fill up and filler() will start returning 1 if too many names are given at once. The first approach avoids this.
The offset passed to the filler function is the offset of the next item in the directory. You can have the entries in the directory in any order you want. If you don't want to return an entire directory at once, you need to use the offset to determine what gets asked for and stored. The order of items in the directory is up to you; it doesn't matter what order the names or inodes are in.
Specifically, in the readdir call, you are passed an offset. You want to start calling the filler function with entries that will be at this offset or later. In the simplest case, the length of each entry is 24 bytes + strlen(name of entry), rounded up to the nearest multiple of 8 bytes; for example, the entry for "cat" takes 24 + 3 = 27 bytes, rounded up to 32. However, see the fuse source code at http://sourceforge.net/projects/fuse/ for when this might not be the case.
I have a simple example, where I have a loop (pseudo c-code) in my readdir function:
int my_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
               off_t offset, struct fuse_file_info *fi)
{
    (a bunch of prep work has been omitted)

    struct stat st;
    int off, nextoff = 0, lenentry, i;
    char namebuf[(long enough for any one name)];

    for (i = 0; i < NumDirectoryEntries; i++)
    {
        (fill st with the stat information, including inode, etc.)
        (fill namebuf with the name of the directory entry)

        lenentry = ((24 + strlen(namebuf) + 7) & ~7);
        off = nextoff; /* offset of this entry */
        nextoff += lenentry;

        /* Skip this entry if we weren't asked for it */
        if (off < offset)
            continue;

        /* Add this to our response until we are asked to stop */
        if (filler(buf, namebuf, &st, nextoff))
            break;
    }

    /* All done because we were asked to stop or because we finished */
    return 0;
}
I tested this within my own code (I had never used the offset before), and it works fine.

How to chown filesystem element by device:inode pair in perl, or a better solution

I have a relatively complex perl script which is walking over a filesystem and storing a list of updated ownership, then going over that list and applying the changes. I'm doing this in order to update changed UIDs. Because I have several situations where I'm swapping user a's and user b's UIDs, I can't just say "everything which is now 1 should be 2 and everything which is 2 should be 1", as it's also possible that this script could be interrupted, and the system would be left in a completely busted, pretty much unrecoverable state outside of "restore from backup and start over". Which would pretty much suck.
To avoid that problem, I do the two-pass approach above, creating a structure like $changes->{path}->\%c, where c has attributes like newuid, olduid, newgid, and oldgid. I then freeze the hash, and once it's written to disk, I read the hash back in and start making changes. This way, if I'm interrupted, I can check to see if the frozen hash exists or not, and just start applying changes again if it does.
The drawback is that sometimes a changing user has literally millions of files, often with very long paths. This means I'm storing a lot of really long strings as hash keys, and I'm running out of memory sometimes. So, I've come up with two options. The one relevant to this question is to instead store the elements as device:inode pairs. That'd be way more space-efficient, and would uniquely identify filesystem elements. The drawback is that I haven't figured out a particularly efficient way to either get a device-relative path from the inode, or to just apply the stat() changes I want to the inode. Yes, I could do another find, and for each file do a lookup against my stored list of devices and inodes to see if a change is needed or not. But if there's a perl-accessible system call - which is portable across HP-UX, AIX, and Linux - from which I can directly just say "on this device make these changes to this inode", it'd be notably better from a performance perspective.
I'm running this across several thousand systems, some of which have filesystems in the petabyte range, holding trillions of files. So, while performance may not make much of a difference on my home PC, it's actually somewhat significant in this scenario. :) That performance need, BTW, is why I really don't want to do the other option - which would be to bypass the memory problem by just tie-ing a hash to a disk-based file. And it's why I'd rather do more work to avoid having to traverse the whole filesystem a second time.
Alternate suggestions which could reduce memory consumption are, of course, also welcome. :) My requirement is just that I need to record both the old and new UID/GID values, so I can back the changes out / validate changes / update files restored from backups taken prior to the cleanup date. I've considered making /path/to/file look like ${changes}->{root}->{path}->{to}->{file}, but that's a lot more work to traverse, and I don't know that it'll really save me enough memory to resolve my problem. Collapsing the whole thing to ->{device}->{inode} makes it basically just the size of two integers rather than N characters, which is substantial for any path longer than, say, 2 chars. :)
Simplified idea
When I mentioned streaming, I didn't mean uncontrolled. A database journal (e.g.) is also written in streaming mode, for comparison.
Also note that the statement that you 'cannot afford to sort even a single subdirectory' directly contradicts the use of a Perl hash to store the same info (I won't blame you if you don't have the CS background).
So here is a really simple illustration of what you could do. Note that every step on the way is streaming, repeatable and logged.
# export SOME_FIND_OPTIONS=...?
find $SOME_FIND_OPTIONS -print0 | ./generate_script.pl > chownscript.sh
# and then
sh -e ./chownscript.sh
An example of generate_script.pl (obviously, adapt it to your needs):

#!/usr/bin/perl
use strict;
use warnings;

$/ = "\0";

while (<>)
{
    chomp;    # strip the trailing NUL left by find -print0
    my ($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,
        $atime,$mtime,$ctime,$blksize,$blocks) = stat;

    # demo purpose, silly translation:
    my ($newuid, $newgid) = ($uid+1000, $gid+1000);

    print "./chmod.pl $uid:$gid $newuid:$newgid '$_'\n";
}
You could have a system-dependent implementation of chmod.pl (this helps to reduce complexity and therefore risk):
#!/usr/bin/perl
use strict;
use warnings;

my $oldown = shift;
my $newown = shift;
my $path   = shift;

($oldown and $newown and $path)
    or die "usage: $0 <uid:gid> <newuid:newgid> <path>";

my ($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,
    $atime,$mtime,$ctime,$blksize,$blocks) = stat $path;

die "file not found: $path" unless $ino;
die "precondition failed" unless ($oldown eq "$uid:$gid");

($uid, $gid) = split /:/, $newown;
chown $uid, $gid, $path or die "unable to chown: $path";
This will allow you to restart when things bork midway; it will even allow you to hand-pick exceptions if necessary. You can save the scripts so you'll have accountability. I've made a reasonable stab at making the scripts operate safely. However, this is obviously just a starting point. Most importantly, I do not deal with filesystem crossings, symbolic links, sockets, or device nodes, which you might want to pay attention to.
original response follows:
Ideas
Yeah, if performance is the issue, do it in C
Do not do persistent logging for the whole filesystem (by the way, why the need to keep them in a single hash? streaming output is your friend there)
Instead, log completed runs per directory. You could easily break the mapping up in steps:
user A: 1 -> 99
user B: 2 -> 1
user A: 99 -> 2
Ownify - what I use (code)
As long as you can reserve a range for temporary uids/gids, like the 99 there, there won't be any risk in having to restart (no more than doing this renumbering on a live filesystem entails anyway).
You could start from this nice tidbit of C code (which admittedly is not very highly optimized):
// vim: se ts=4 sw=4 et ar aw
//
// make: g++ -D_FILE_OFFSET_BITS=64 ownify.cpp -o ownify
//
// Ownify: ownify -h
//
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>

/* old habits die hard. can't stick to pure C ... */
#include <string>
#include <iostream>

#define do_stat(a,b)    lstat(a,b)
#define do_chown(a,b,c) lchown(a,b,c)

//////////////////////////////////////////////////////////
// logic declarations
//
void ownify(struct stat& file)
{
    // if (S_ISLNK(file.st_mode))
    //     return;

    switch (file.st_uid)
    {
#if defined(PASS1)
        case  1: file.st_uid = 99; break;
        case 99: fputs("Unexpected existing owned file!\n", stderr); exit(255);
#elif defined(PASS2)
        case  2: file.st_uid = 1; break;
#elif defined(PASS3)
        case 99: file.st_uid = 1; break;
#endif
    }

    switch (file.st_gid) // optionally map groups as well
    {
#if defined(PASS1)
#elif defined(PASS2)
#elif defined(PASS3)
#endif
    }
}

/////////////////////////////////////////////////////////
// driver
//
static unsigned int changed = 0, skipped = 0, failed = 0;
static bool dryrun = false;

void process(const char* const fname)
{
    struct stat s;
    if (0 == do_stat(fname, &s))
    {
        struct stat n = s;
        ownify(n);

        if ((n.st_uid != s.st_uid) || (n.st_gid != s.st_gid))
        {
            if (dryrun || 0 == do_chown(fname, n.st_uid, n.st_gid))
                printf("%u\tchanging owner %i:%i '%s'\t(was %i:%i)\n",
                       ++changed,
                       n.st_uid, n.st_gid,
                       fname,
                       s.st_uid, s.st_gid);
            else
            {
                failed++;
                int e = errno;
                fprintf(stderr, "'%s': cannot change owner %i:%i (%s)\n",
                        fname,
                        n.st_uid, n.st_gid,
                        strerror(e));
            }
        }
        else
            skipped++;
    }
    else
    {
        int e = errno;
        fprintf(stderr, "'%s': cannot stat (%s)\n", fname, strerror(e));
        failed++;
    }
}

int main(int argc, char* argv[])
{
    switch (argc)
    {
        case 0: // huh?
        case 1: break;
        case 2:
            dryrun = 0 == strcmp(argv[1], "-n") ||
                     0 == strcmp(argv[1], "--dry-run");
            if (dryrun)
                break;
        default:
            std::cerr << "Illegal arguments" << std::endl;
            std::cout <<
                argv[0] << " (Ownify): efficient bulk adjust of owner user:group for many files\n\n"
                "Goal: be flexible and a tiny bit fast\n\n"
                "Synopsis:\n"
                "   find / -print0 | ./ownify -n 2>&1 | tee ownify.log\n\n"
                "Input:\n"
                "   reads a null-delimited stream of filespecifications from the\n"
                "   standard input; links are _not_ dereferenced.\n\n"
                "Options:\n"
                "   -n/--dry-run - test run (no changes)\n\n"
                "Exit code:\n"
                "   number of failed items" << std::endl;
            return 255;
    }

    std::string fname("/dev/null");
    while (std::getline(std::cin, fname, '\0'))
        process(fname.c_str());

    fprintf(stderr, "%s: completed with %u skipped, %u changed and %u failed%s\n",
            argv[0], skipped, changed, failed, dryrun ? " (DRYRUN)" : "");

    return failed;
}
Note that this comes with quite a few safety measures
a paranoia check in the first pass (check that no files with the reserved uid already exist)
the ability to change the behaviour of do_stat and do_chown with regard to links
a dry-run option, -n/--dry-run (to observe what would be done)
The program will gladly tell you how to use it with ownify -h:
./ownify (Ownify): efficient bulk adjust of owner user:group for many files
Goal: be flexible and a tiny bit fast
Synopsis:
find / -print0 | ./ownify -n 2>&1 | tee ownify.log
Input:
reads a null-delimited stream of file specifications from the
standard input;
Options:
-n/--dry-run - test run (no changes)
Exit code:
number of failed items
A few possible solutions that come to mind:
1) Do not store a hash in the file, just a sorted list in any format that can be reasonably parsed serially. By sorting the list by filename, you should get the equivalent of running find again, without actually doing it:
# UID, GID, MODE, Filename
0,0,600,/a/b/c/d/e
1,1,777,/a/b/c/f/g
...
Since the list is sorted by filename, the contents of each directory should be bunched together in the file. You do not have to use Perl to sort the file; sort(1) will do nicely in most cases.
You can then just read in the file line-by-line - or with any delimiter that will not mangle your filenames - and just perform any changes. Assuming that you can tell which changes are needed for each file at once, it does not sound as if you actually need the random-access capabilities of a hash, so this should do.
So the process would happen in three steps:
Create the change file
Sort the change file
Perform changes per the change file
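To make step 3 concrete, here is a small sketch in C (my illustration, assuming the UID,GID,MODE,Filename format shown above) that reads the sorted change file and applies each line:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Apply a change file with lines of the form "UID,GID,MODE,Filename".
 * Illustration only: '#' comments are skipped, there is no rollback. */
int main(int argc, char *argv[])
{
    char line[4096];
    FILE *f = argc > 1 ? fopen(argv[1], "r") : stdin;
    if (!f)
        return 1;

    while (fgets(line, sizeof(line), f)) {
        if (line[0] == '#')
            continue;
        line[strcspn(line, "\n")] = '\0';

        long uid, gid, mode;
        char *p = line;
        uid  = strtol(p, &p, 10); p++;   /* skip the comma */
        gid  = strtol(p, &p, 10); p++;
        mode = strtol(p, &p, 8);  p++;   /* mode is octal */

        if (chown(p, (uid_t)uid, (gid_t)gid) != 0)
            perror(p);
        if (chmod(p, (mode_t)mode) != 0)
            perror(p);
    }
    return 0;
}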
2) If you cannot tell which changes each file needs at once, you could have multiple lines for each file, each detailing a part of the changes. Each line would be produced the moment you determine a needed change at the first step. You can then merge them after sorting.
3) If you do need random access capabilities, consider using a proper embedded database, such as BerkeleyDB or SQLite. There are Perl modules for most embedded databases around. This will not be quite as fast, though.
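For illustration, here is the same idea sketched in C with SQLite (from Perl you would use DBD::SQLite instead); the table layout is my own assumption, keyed on the device:inode pair the question proposes:

#include <sqlite3.h>
#include <stdio.h>

/* Store pending ownership changes keyed by device:inode.
 * Sketch only; real code would batch inserts in a transaction
 * and use prepared statements. */
int main(void)
{
    sqlite3 *db;
    char *err = NULL;

    if (sqlite3_open("changes.db", &db) != SQLITE_OK)
        return 1;

    const char *ddl =
        "CREATE TABLE IF NOT EXISTS changes ("
        "  dev INTEGER, ino INTEGER,"
        "  olduid INTEGER, newuid INTEGER,"
        "  oldgid INTEGER, newgid INTEGER,"
        "  PRIMARY KEY (dev, ino))";
    if (sqlite3_exec(db, ddl, NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "DDL failed: %s\n", err);
        sqlite3_free(err);
        sqlite3_close(db);
        return 1;
    }

    /* Example row: device 2049, inode 12345, uid 1 -> 2, gid unchanged. */
    sqlite3_exec(db,
        "INSERT OR REPLACE INTO changes VALUES (2049, 12345, 1, 2, 100, 100)",
        NULL, NULL, NULL);

    sqlite3_close(db);
    return 0;
}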

Tools to reduce risk regarding password security and HDD slack space

Down at the bottom of this essay is a comment about a spooky way to beat passwords. Scan the entire HDD of a user including dead space, swap space etc, and just try everything that looks like it might be a password.
The question, part 1: are there any tools around (a live CD, for instance) that will scan an unmounted file system and zero out everything that can be zeroed? (Note I'm not trying to find passwords.)
This would include:
Slack space that is not part of any file
Unused parts of the last block used by a file
Swap space
Hibernation files
Dead space inside of some types of binary files (like .DOC)
The tool (aside from the last case) would not modify anything that can be detected via the file system API. I'm not looking for a block device find/replace but rather something that just scrubs everything that isn't part of a file.
Part 2: How practical would such a program be? How hard would it be to write? How common is it for file formats to contain uninitialized data?
One (risky and costly) way to do this would be to use a file-system-aware backup tool (one that only copies the actual data) to back up the whole disk, wipe it clean, and then restore it.
I don't understand your first question (do you want to modify the file system? Why? Isn't this dead space exactly where you want to look?)
Anyway, here's an example of such a tool:
#include <stdio.h>
#include <alloca.h>
#include <string.h>
#include <ctype.h>

/* Number of bytes we read at once, >2*maxlen */
#define BUFSIZE (1024*1024)

/* Replace this with a function that tests the password consisting of the first len bytes of pw */
int testPassword(const char* pw, int len) {
    /*char* buf = alloca(len+1);
    memcpy(buf, pw, len);
    buf[len] = '\0';
    printf("Testing %s\n", buf);*/
    int rightLen = strlen("secret");
    return len == rightLen && memcmp(pw, "secret", len) == 0;
}

int main(int argc, char* argv[]) {
    int minlen = 5; /* We know the password is at least 5 characters long */
    int maxlen = 7; /* ... and at most 7. Modify to find longer ones */
    int avlen = 0;  /* available length - The number of bytes we already tested and think could belong to a password */
    int i;
    char* curstart;
    char* curp;
    FILE* f;
    size_t bytes_read;
    char* buf = alloca(BUFSIZE+maxlen);

    if (argc != 2) {
        printf("Usage: %s disk-file\n", argv[0]);
        return 1;
    }
    f = fopen(argv[1], "rb");
    if (f == NULL) {
        printf("Couldn't open %s\n", argv[1]);
        return 2;
    }
    for(;;) {
        /* Copy the rest of the buffer to the front */
        memcpy(buf, buf+BUFSIZE, maxlen);
        bytes_read = fread(buf+maxlen, 1, BUFSIZE, f);
        if (bytes_read == 0) {
            /* Read the whole file */
            break;
        }
        for (curstart = buf; curstart < buf+bytes_read;) {
            for (curp = curstart+avlen; curp < curstart + maxlen; curp++) {
                /* Let's assume the password just contains letters and digits. Use isprint() otherwise. */
                if (!isalnum(*curp)) {
                    curstart = curp + 1;
                    break;
                }
            }
            avlen = curp - curstart;
            if (avlen < minlen) {
                /* Nothing to test here, move along */
                curstart = curp+1;
                avlen = 0;
                continue;
            }
            for (i = minlen; i <= avlen; i++) {
                if (testPassword(curstart, i)) {
                    char* found = alloca(i+1);
                    memcpy(found, curstart, i);
                    found[i] = '\0';
                    printf("Found password: %s\n", found);
                }
            }
            avlen--;
            curstart++;
        }
    }
    fclose(f);
    return 0;
}
Installation:
Start a Linux Live CD
Copy the program to the file hddpass.c in your home directory
Open a terminal and type the following
su || sudo -s # Makes you root so that you can access the HDD
apt-get install -y gcc # Install gcc
This works only on Debian/Ubuntu et al, check your system documentation for others
gcc -o hddpass hddpass.c # Compile.
./hddpass /dev/YOURDISK # The disk is usually sda, hda on older systems
Look at the output
Test (copy to console, as root):

gcc -o hddpass hddpass.c
</dev/zero head -c 10000000 >testdisk        # Create an empty 10MB file
mkfs.ext2 -F testdisk                        # Create a file system
rm -rf mountpoint; mkdir -p mountpoint
mount -o loop testdisk mountpoint            # needs root rights
</dev/urandom head -c 5000000 >mountpoint/f  # Write stuff to the disk
echo asddsasecretads >> mountpoint/f         # Append the password to the file
# On some file systems, you could even remove the file.
umount mountpoint
./hddpass testdisk                           # prints secret
Test it yourself on an Ubuntu Live CD:
# Start a console and type:
wget http://phihag.de/2009/so/hddpass-testscript.sh
sh hddpass-testscript.sh
Therefore, it's relatively easy. As I found out myself, ext2 (the file system I used) overwrites deleted files. However, I'm pretty sure some file systems don't. Same goes for the pagefile.
How common is it for file formats to contain uninitialized data?
Less and less common, I would have thought. The classic "offender" is older versions of MS Office applications that (essentially) did a memory dump to disk as their "quicksave" format: no serialisation, no selection of what to dump, and a memory allocator that doesn't zero newly allocated memory pages. That led to not only juicy things from previous versions of the document (so the user could use undo), but also juicy snippets from other applications.
How hard would it be to write?
Something that clears out unallocated disk blocks shouldn't be that hard. It'd need to run either off-line or as a kernel module, so as not to interfere with normal file-system operations, but most file systems have an "allocated"/"not allocated" structure that is fairly straightforward to parse. Swap is harder, but as long as you're OK with having it cleared on boot (or shutdown), it's not too tricky. Clearing out the tail block is trickier, definitely not something I'd want to try to do on-line, but it shouldn't be TOO hard to make it work for off-line cleaning.
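A much cruder user-space approximation (not the off-line block-bitmap parser described above, just a common trick) is to fill the mounted file system's free space with zeroes and then delete the fill file, so previously freed blocks get overwritten:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Fill the free space of the file system containing `dir` with zeroes,
 * then delete the fill file. Crude: it stresses the disk and briefly
 * fills the file system completely. */
int scrub_free_space(const char *dir)
{
    char path[4096];
    static char zeros[1 << 20]; /* 1 MiB of zeroes (static => zero-filled) */

    snprintf(path, sizeof(path), "%s/.scrub_fill", dir);
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    for (;;) {
        ssize_t n = write(fd, zeros, sizeof(zeros));
        if (n < 0) {
            if (errno == ENOSPC)
                break;          /* free space exhausted: done */
            unlink(path);
            close(fd);
            return -1;
        }
    }
    fsync(fd);   /* make sure the zeroes actually hit the disk */
    close(fd);
    unlink(path);
    return 0;
}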
How practical would such a program be?
Depends on your threat model, really. I'd say that on one end it'd not give you much at all, but on the other end it's a definite help to keep information out of the wrong hands. But I can't give a hard and fast answer.
Well, if I was going to code it for a boot CD, I'd do something like this:
Say the file is 101 bytes but takes up a 4096-byte cluster.
Copy the file "A" to "B", with nulls added to the end.
Delete "A" and overwrite its (now unused) cluster.
Create "A" again and use the contents of "B" without the tail (remember the length).
Delete "B" and overwrite it.
Not very efficient, and it would need a tweak to make sure you don't try to copy the first (and therefore full) clusters in a file. Otherwise, you'll run into slowness and failure if there's not enough free space.
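A simpler variant of the same idea (my sketch, not the steps above): on many file systems you can zero a file's tail slack in place by extending the file with zeroes to the cluster boundary and then truncating it back, which avoids the copy entirely:

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Zero the slack after the end of `path` up to the next cluster
 * boundary by extending the file with zeroes and truncating back.
 * Whether the zeroes really land in the old slack bytes depends on
 * the file system's allocation behavior, so verify before relying
 * on this. */
int zero_tail_slack(const char *path, off_t cluster)
{
    struct stat st;
    static const char zeros[4096];

    int fd = open(path, O_WRONLY);
    if (fd < 0 || fstat(fd, &st) != 0)
        return -1;

    off_t slack = (cluster - st.st_size % cluster) % cluster;
    if (slack > 0) {
        if (pwrite(fd, zeros, slack, st.st_size) != slack ||
            ftruncate(fd, st.st_size) != 0) {
            close(fd);
            return -1;
        }
        fsync(fd);
    }
    close(fd);
    return 0;
}

Call it as zero_tail_slack("somefile", 4096) for the 101-byte example above; the 3995 slack bytes of the tail cluster get overwritten with zeroes while the file's length stays 101.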
Are there open-source tools that do this efficiently?
