I have a problem with a piece of Linux code (written in C++) that does something like this:
Creates a new directory with mkdir("xyz", 0755), which succeeds (the return code is 0).
It then tries to open/create a new file in the just-created directory.
This fails because the permissions on the new directory are actually 0600, NOT the 0755 requested.
The code looks like this and checks that the path prefix "/tmp/slim" exists before coming to this part:
int somefunc(const string& path)
{
if ( mkdir( path.c_str(), 0755 ) == 0 ) {
// (*) if (!access( path.c_str(), F_OK | R_OK | W_OK | X_OK ) == 0 )
// (*) chmod( path.c_str(), 0755);
string pidinfo = to_string( getpid() ) + "\n";
string pidinfofile = path + "/pid";
int fd = open( pidinfofile.c_str(), O_RDWR | O_CREAT, S_IWUSR | S_IRUSR );
if ( fd == -1 )
return 0;
ssize_t written = write( fd, pidinfo.c_str(), pidinfo.size() );
// ... do more stuff
}
}
As this strace snippet (no lines missing/redacted) shows, the openat() fails even though the mkdir() returned 0.
13661 16:32:22.068465 mkdir("/tmp/slim/testsock", 0755) = 0
13661 16:32:22.068720 getpid() = 13661
13661 16:32:22.068829 openat(AT_FDCWD, "/tmp/slim/testsock/pid", O_RDWR|O_CREAT, 0600) = -1 EACCES (Permission denied)
The result of running getfacl looks like this:
[localhost]$ getfacl /tmp/slim/
getfacl: Removing leading '/' from absolute path names
# file: tmp/slim/
# owner: stk
# group: stk
user::rwx
group::r-x
other::r-x
How can mkdir() return 0 but create a directory with permissions that differ from those specified? It's not a umask thing; I've tried setting umask to 0 before creating the directory, without any effect. If the two commented lines marked with (*) are enabled/uncommented, things work as they should - but I don't like that sort of symptom treatment that skirts the real problem. There's got to be a reasonable explanation for this seemingly weird behavior.
Part of the story is that this runs in an application with multiple threads. Each thread executes the code above (a smallish, thread-safe function), and most threads succeed, but there are always 1 or 2 (out of 5-10) that fail as described.
Well, as it turned out (and quite expectedly, right?) it was a threading problem alright, as @RobertHarvey also hinted at. But I was also a bit right :-) when I wrote in a comment "there is some hidden, shared state somewhere". The process umask is a shared, though perhaps not exactly hidden, piece of state. Here's what went wrong:
One or more threads was executing the code above, happily creating directories.
Simultaneously, another thread was fiddling with the umask to ensure the right permissions on a Unix socket it was creating, temporarily setting the umask to 0177.
Although the umask fiddling was ever so brief, Murphy's Law dictated that directories were sometimes created exactly while the umask was 0177, forcing them to get permission mask 0600 rather than 0755.
Lesson (re-)learned: Watch out for hidden shared state/variables when using more than 1 thread.
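The effect is easy to reproduce from a shell, since the umask is likewise per-process there (the directory path below is just a scratch name):

```shell
# mkdir(2) creates with (mode & ~umask). With the umask at 0177, even a
# plain mkdir (which requests 0777) yields 0600 -- the mode seen above.
dir=$(mktemp -u)          # -u only generates an unused name
(umask 0177; mkdir "$dir")
stat -c '%a' "$dir"       # prints 600 (0777 & ~0177)
rmdir "$dir"
```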
Related
I would have assumed that access() was just a wrapper around stat(), but I've been googling around and have found some anecdotes about replacing stat calls with 'cheaper' access calls. Assuming you are only interested in checking if a file exists, is access faster? Does it completely vary by filesystem?
Theory
I doubt that.
In the lower layers of the kernel there is not much difference between the access() and stat() calls: both perform a lookup operation, mapping a file name to an entry in the dentry cache and to an inode (the actual kernel structure). Lookup is a slow operation because it must be performed for each component of the path; for /usr/bin/cat you need to look up usr, bin and then cat, and each step can require reading from disk -- that is why inodes and dentries are cached in memory.
The major difference between these calls is that stat() converts the inode structure into a stat structure, while access() only does a simple permission check; but that time is small compared with the lookup time.
The real performance gain can be achieved with the at-operations such as faccessat() and fstatat(), which let you open() a directory once. Just compare:
struct stat s;
stat("/usr/bin/cat", &s); // lookups usr, bin and cat = 3
stat("/usr/bin/less", &s); // lookups usr, bin and less = 3
int fd = open("/usr/bin", O_RDONLY | O_DIRECTORY); // lookups usr, bin = 2
fstatat(fd, "cat", &s); // lookups cat = 1
fstatat(fd, "less", &s); // lookups less = 1
Experiments
I wrote a small Python script which calls stat() and access():
import os, random
files = ['gzexe', 'catchsegv', 'gtroff', 'gencat', 'neqn', 'gzip',
         'getent', 'sdiff', 'zcat', 'iconv', 'not_exists', 'ldd',
         'unxz', 'zcmp', 'locale', 'xz', 'zdiff', 'localedef', 'xzcat']
access = lambda fn: os.access(fn, os.R_OK)
for i in range(80000):
    try:
        random.choice((access, os.stat))("/usr/bin/" + random.choice(files))
    except OSError:
        continue
I traced system with SystemTap to measure time spent in different operations. Both stat() and access() system calls use user_path_at_empty() kernel function which represents lookup operation:
stap -ve ' global tm, times, path;
probe lookup = kernel.function("user_path_at_empty")
{ name = "lookup"; pathname = user_string_quoted($name); }
probe lookup.return = kernel.function("user_path_at_empty").return
{ name = "lookup"; }
probe stat = syscall.stat
{ pathname = filename; }
probe stat, syscall.access, lookup
{ if(pid() == target() && isinstr(pathname, "/usr/bin")) {
tm[name] = local_clock_ns(); } }
probe syscall.stat.return, syscall.access.return, lookup.return
{ if(pid() == target() && tm[name]) {
times[name] <<< local_clock_ns() - tm[name];
delete tm[name];
} }
' -c 'python stat-access.py'
Here are the results:
        COUNT   AVG
lookup  80018   1.67 us
stat    40106   3.92 us
access  39903   4.27 us
Note that I disabled SELinux in my experiments, as it adds significant influence on the results.
Program A (ReportHandler) calls program B (Specific Report). In order for me to get my "specific report" I need to go through program A, which then calls program B and gets me my report. My problem here is that program B has a "security" measure that checks that program B is a child process of program A. (This is because program A makes sure no-one else is running program B, that it only runs between hour x and hour y of the day, that no other programs interfere with its run, etc.)
Programs A and B are C based, but I cannot (must not) change them. I checked the code, and I cannot pass parameters to program A to run B from the console. So the only idea I have left is to try and "trick" the system so that program B shows up as a child of program A, letting me run it from the console.
The reason for me to try and automate this is that I need to dial into a dozen servers each day to get this report... I want to centralize this script so that I can remotely ssh it to each server and be done with it. It would save me an hour of my day, or more.
Check being made
if ( TRUE != child_of_Program_A() )
{
epause( win[MAIN], 1,
_("This Program Must Be Run From Program A"));
return( FAILURE );
}
STATIC BOOL child_of_Program_A()
{
FILE *fp;
char statname[32];
pid_t ppid;
char proc_name[32];
char buffer[128];
char *ptr;
ppid = getppid();
while(ppid != 1)
{
snprintf(statname, sizeof(statname), "/proc/%d/status", (int)ppid);
if (NULL == (fp = fopen(statname, "r")))
{
return(FALSE);
}
proc_name[0] = '\0';
ppid = -1;
while (NULL != fgets(buffer, sizeof(buffer), fp))
{
if (NULL != (ptr = strtok(buffer, STAT_SEP)))
{
if (strcasecmp(ptr, "name") == 0)
{
if (NULL != (ptr = strtok(NULL, STAT_SEP)))
{
if (strcmp(ptr, "Program_A") == 0)
{
fclose(fp);
return(TRUE);
}
strncpy(proc_name, ptr, sizeof(proc_name) - 1);
proc_name[sizeof(proc_name) - 1] = '\0';
}
}
else if (strcasecmp(ptr, "ppid") == 0)
{
if (NULL != (ptr = strtok(NULL, STAT_SEP)))
{
ppid = atoi(ptr);
}
}
}
if (ppid != -1 && proc_name[0] != '\0')
break;
}
fclose(fp);
}
return(FALSE);
}
If I understand - you are making this harder than it needs to be. Automate navigating the menu maze.
Unless you really want to try the very unusual thing you asked about - consider this alternative.
Driving program A in a non-interactive way will solve the problem. scp a shell script to each of the servers, in your account. The shell script can use a here document: stdin becomes the script input, in place of the keyboard.
Pretend that you have a transcript of a correct interactive session with Program A and it looks like this:
cd /foo
./programA username
password
A
/deviceA/catalog.txt
B
A
13 cows
now
The script using a here doc looks like this:
#!/bin/ksh
cd /foo
./programA username<<EOF
password
A
/deviceA/catalog.txt
B
A
13 cows
now
EOF
You may have answers that are unique to each remote server, incorporate them. scp the correct file to each remote server.
ssh remote_server 'cd /foo && chmod +x ./myscript.sh'
This sets execute permissions.
On your local desktop create a simple script.sh
ssh remote1 './foo/myscript.sh'
ssh remote2 './foo/myscript.sh'
ssh remote3 './foo/myscript.sh'
This script.sh now runs your report with no intervention from you. I am also guessing you may not have set up ssh keys on the remote servers; key-based authentication provides the passwordless access the script above needs.
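Setting up those keys is a one-time job per server. A sketch (the key path and hostnames are placeholders for your own):

```shell
# Generate a passphrase-less key once, then install its public half on
# each remote server; ssh-copy-id appends it to ~/.ssh/authorized_keys.
ssh-keygen -t ed25519 -N '' -f ~/.ssh/report_key
for host in remote1 remote2 remote3; do
    ssh-copy-id -i ~/.ssh/report_key.pub "$host"
done
ssh -i ~/.ssh/report_key remote1 'hostname'   # should not prompt for a password
```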
http://rcsg-gsir.imsb-dsgi.nrc-cnrc.gc.ca/documents/internet/node31.html
I need to test the 64-bit version of the file I/O APIs (open, creat, stat, etc.). In the process I need to create a file which has a 64-bit inode number, so that the internal 64-bit data structures/variables are exercised, and with them the APIs. How do I create a file with a 64-bit inode number?
I have written a script in which I am trying to create a nested tree of directories with 1024 files in each directory. The script takes a huge amount of time to execute and terminates abruptly. I am not able to proceed; is there any other way to achieve this?
You could simulate any inode number you want by using FUSE.
Look at the hello_ll.c example that comes with FUSE. It creates a filesystem with a single file that has inode number 2. You could modify that file pretty easily to create files with whatever inode number you want.
A quick test with 0x10000000FFFFFFL does this:
$ stat fuse/hello
File: `fuse/hello'
Size: 13 Blocks: 0 IO Block: 4096 regular file
Device: 11h/17d Inode: 4503599644147711 Links: 1
Access: (0444/-r--r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Other than FUSE, I know of no practical way of forcing an inode number on "real" filesystems.
Here's a minimal patch used to produce that:
--- hello_ll.c.orig 2011-11-14 13:22:19.000000000 +0100
+++ hello_ll.c 2011-11-14 13:20:27.000000000 +0100
@@ -9,6 +9,7 @@
*/
#define FUSE_USE_VERSION 26
+#define MYINO 0x10000000FFFFFFL
#include <fuse_lowlevel.h>
#include <stdio.h>
@@ -31,7 +32,7 @@
stbuf->st_nlink = 2;
break;
- case 2:
+ case MYINO:
stbuf->st_mode = S_IFREG | 0444;
stbuf->st_nlink = 1;
stbuf->st_size = strlen(hello_str);
@@ -65,7 +66,7 @@
fuse_reply_err(req, ENOENT);
else {
memset(&e, 0, sizeof(e));
- e.ino = 2;
+ e.ino = MYINO;
e.attr_timeout = 1.0;
e.entry_timeout = 1.0;
hello_stat(e.ino, &e.attr);
@@ -117,7 +118,7 @@
memset(&b, 0, sizeof(b));
dirbuf_add(req, &b, ".", 1);
dirbuf_add(req, &b, "..", 1);
- dirbuf_add(req, &b, hello_name, 2);
+ dirbuf_add(req, &b, hello_name, MYINO);
reply_buf_limited(req, b.p, b.size, off, size);
free(b.p);
}
@@ -126,7 +127,7 @@
static void hello_ll_open(fuse_req_t req, fuse_ino_t ino,
struct fuse_file_info *fi)
{
- if (ino != 2)
+ if (ino != MYINO)
fuse_reply_err(req, EISDIR);
else if ((fi->flags & 3) != O_RDONLY)
fuse_reply_err(req, EACCES);
@@ -139,7 +140,7 @@
{
(void) fi;
- assert(ino == 2);
+ assert(ino == MYINO);
reply_buf_limited(req, hello_str, strlen(hello_str), off, size);
}
You would have to create on the order of 4294967296 (2^32) files or directories before inode numbers exceed 32 bits.
In order to do so, you would have to prepare your file system to have space for this. Depending on which file system you use, it may or may not be possible. (I just tried with an ext4 file system, and it didn't work.)
You could use a systemtap script to simply bump up the inode number returned by a stat call.
On ext4, something like:
probe kernel.statement("ext4_getattr#fs/ext4/inode.c+21")
{
$stat->ino = $stat->ino + 4294967295;
}
probe begin { log("starting probe") }
would do the trick (you might have to adjust the "21" offset, if ext4_getattr is different in your tree).
I have a relatively complex perl script which is walking over a filesystem and storing a list of updated ownership, then going over that list and applying the changes. I'm doing this in order to update changed UIDs. Because I have several situations where I'm swapping user a's and user b's UIDs, I can't just say "everything which is now 1 should be 2 and everything which is 2 should be 1", as it's also possible that this script could be interrupted, and the system would be left in a completely busted, pretty much unrecoverable state outside of "restore from backup and start over". Which would pretty much suck.
To avoid that problem, I do the two-pass approach above, creating a structure like $changes->{path}->\%c, where c has attributes like newuid, olduid, newgid, and oldgid. I then freeze the hash, and once it's written to disk, I read the hash back in and start making changes. This way, if I'm interrupted, I can check whether the frozen hash exists, and just resume applying changes if it does.
The drawback is that sometimes a changing user has literally millions of files, often with very long paths. This means I'm storing a lot of really long strings as hash keys, and I'm running out of memory sometimes. So, I've come up with two options. The one relevant to this question is to instead store the elements as device:inode pairs. That'd be way more space-efficient, and would uniquely identify filesystem elements. The drawback is that I haven't figured out a particularly efficient way to either get a device-relative path from the inode, or to just apply the stat() changes I want to the inode. Yes, I could do another find, and for each file do a lookup against my stored list of devices and inodes to see if a change is needed or not. But if there's a perl-accessible system call - which is portable across HP-UX, AIX, and Linux - from which I can directly just say "on this device make these changes to this inode", it'd be notably better from a performance perspective.
I'm running this across several thousand systems, some of which have filesystems in the petabyte range, holding trillions of files. So, while performance may not make much of a difference on my home PC, it's actually somewhat significant in this scenario. :) That performance need, BTW, is why I really don't want to do the other option - which would be to bypass the memory problem by just tie-ing a hash to a disk-based file. And it is why I'd rather do more work to avoid having to traverse the whole filesystem a second time.
Alternate suggestions which could reduce memory consumption are, of course, also welcome. :) My requirement is just that I need to record both the old and new UID/GID values, so I can back the changes out / validate changes / update files restored from backups taken prior to the cleanup date. I've considered making /path/to/file look like ${changes}->{root}->{path}->{to}->{file}, but that's a lot more work to traverse, and I don't know that it would really save enough memory to resolve my problem. Collapsing the whole thing to ->{device}->{inode} makes it basically just the size of two integers rather than N characters, which is substantial for any path longer than, say, 2 chars. :)
Simplified idea
When I mentioned streaming, I didn't mean uncontrolled. A database journal (e.g.) is also written in streaming mode, for comparison.
Also note that the statement that you 'cannot afford to sort even a single subdirectory' directly contradicts the use of a Perl hash to store the same info (I won't blame you if you don't have the CS background).
So here is a really simple illustration of what you could do. Note that every step on the way is streaming, repeatable and logged.
# export SOME_FIND_OPTIONS=...?
find $SOME_FIND_OPTIONS -print0 | ./generate_script.pl > chownscript.sh
# and then
sh -e ./chownscript.sh
An example of generate_script.pl (obviously, adapt it to your needs):
#!/usr/bin/perl
use strict;
use warnings;
$/ = "\0";
while (<>)
{
chomp; # strip the trailing NUL record separator
my ($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,$atime,$mtime,$ctime,$blksize,$blocks) = stat;
# demo purpose, silly translation:
my ($newuid, $newgid) = ($uid+1000, $gid+1000);
print "./chmod.pl $uid:$gid $newuid:$newgid '$_'\n";
}
You could have a system-dependent implementation of chmod.pl (this helps reduce complexity and therefore risk):
#!/usr/bin/perl
use strict;
use warnings;
my $oldown = shift;
my $newown = shift;
my $path = shift;
($oldown and $newown and $path) or die "usage: $0 <uid:gid> <newuid:newgid> <path>";
my ($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,$atime,$mtime,$ctime,$blksize,$blocks) = stat $path;
die "file not found: $path" unless $ino;
die "precondition failed" unless ($oldown eq "$uid:$gid");
($uid, $gid) = split /:/, $newown;
chown $uid, $gid, $path or die "unable to chown: $path";
This will allow you to restart when things bork midway, and it will even allow you to hand-pick exceptions if necessary. You can save the scripts so you'll have accountability. I've made a reasonable stab at making the scripts operate safely. However, this is obviously just a starting point. Most importantly, I do not deal with filesystem crossings, symbolic links, sockets, or device nodes, where you might want to pay attention to them.
original response follows:
Ideas
Yeah, if performance is the issue, do it in C
Do not do persistent logging for the whole filesystem (by the way, why the need to keep them in a single hash? streaming output is your friend there)
Instead, log completed runs per directory. You could easily break the mapping up in steps:
user A: 1 -> 99
user B: 2 -> 1
user A: 99 -> 2
Ownify - what I use (code)
As long as you can reserve a range of temporary uids/gids like the 99, there won't be any risk in having to restart (no more than doing this renumbering on a live filesystem, anyway).
You could start from this nice tidbit of C code (which admittedly is not very highly optimized):
// vim: se ts=4 sw=4 et ar aw
//
// make: g++ -D_FILE_OFFSET_BITS=64 -DPASS1 ownify.cpp -o ownify   (likewise PASS2, PASS3)
//
// Ownify: ownify -h
//
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <stdlib.h> /* for exit() */
/* old habits die hard. can't stick to pure C ... */
#include <string>
#include <iostream>
#define do_stat(a,b) lstat(a,b)
#define do_chown(a,b,c) lchown(a,b,c)
//////////////////////////////////////////////////////////
// logic declarations
//
void ownify(struct stat& file)
{
// if (S_ISLNK(file.st_mode))
// return;
switch (file.st_uid)
{
#if defined(PASS1)
case 1: file.st_uid = 99; break;
case 99: fputs("Unexpected existing owned file!\n", stderr); exit(255);
#elif defined(PASS2)
case 2: file.st_uid = 1; break;
#elif defined(PASS3)
case 99: file.st_uid = 1; break;
#endif
}
switch (file.st_gid) // optionally map groups as well
{
#if defined(PASS1)
#elif defined(PASS2)
#elif defined(PASS3)
#endif
}
}
/////////////////////////////////////////////////////////
// driver
//
static unsigned int changed = 0, skipped = 0, failed = 0;
static bool dryrun = false;
void process(const char* const fname)
{
struct stat s;
if (0==do_stat(fname, &s))
{
struct stat n = s;
ownify(n);
if ((n.st_uid!=s.st_uid) || (n.st_gid!=s.st_gid))
{
if (dryrun || 0==do_chown(fname, n.st_uid, n.st_gid))
printf("%u\tchanging owner %i:%i '%s'\t(was %i:%i)\n",
++changed,
n.st_uid, n.st_gid,
fname,
s.st_uid, s.st_gid);
else
{
failed++;
int e = errno;
fprintf(stderr, "'%s': cannot change owner %i:%i (%s)\n",
fname,
n.st_uid, n.st_gid,
strerror(e));
}
}
else
skipped++;
} else
{
int e = errno;
fprintf(stderr, "'%s': cannot stat (%s)\n", fname, strerror(e));
failed++;
}
}
int main(int argc, char* argv[])
{
switch(argc)
{
case 0: //huh?
case 1: break;
case 2:
dryrun = 0==strcmp(argv[1],"-n") ||
0==strcmp(argv[1],"--dry-run");
if (dryrun)
break;
default:
std::cerr << "Illegal arguments" << std::endl;
std::cout <<
argv[0] << " (Ownify): efficient bulk adjust of owner user:group for many files\n\n"
"Goal: be flexible and a tiny bit fast\n\n"
"Synopsis:\n"
" find / -print0 | ./ownify -n 2>&1 | tee ownify.log\n\n"
"Input:\n"
" reads a null-delimited stream of filespecifications from the\n"
" standard input; links are _not_ dereferenced.\n\n"
"Options:\n"
" -n/--dry-run - test run (no changes)\n\n"
"Exit code:\n"
" number of failed items" << std::endl;
return 255;
}
std::string fname("/dev/null");
while (std::getline(std::cin, fname, '\0'))
process(fname.c_str());
fprintf(stderr, "%s: completed with %u skipped, %u changed and %u failed%s\n",
argv[0], skipped, changed, failed, dryrun?" (DRYRUN)":"");
return failed;
}
Note that this comes with quite a few safety measures
paranoia check in the first pass (check that no files with the reserved uid exist)
ability to change behaviour of do_stat and do_chown with regards to links
a -n/--dry-run option (to observe what would be done)
The program will gladly tell you how to use it with ownify -h:
./ownify (Ownify): efficient bulk adjust of owner user:group for many files
Goal: be flexible and a tiny bit fast
Synopsis:
find / -print0 | ./ownify -n 2>&1 | tee ownify.log
Input:
reads a null-delimited stream of file specifications from the
standard input;
Options:
-n/--dry-run - test run (no changes)
Exit code:
number of failed items
A few possible solutions that come to mind:
1) Do not store a hash in the file, just a sorted list in any format that can be reasonably parsed serially. By sorting the list by filename, you should get the equivalent of running find again, without actually doing it:
# UID, GID, MODE, Filename
0,0,600,/a/b/c/d/e
1,1,777,/a/b/c/f/g
...
Since the list is sorted by filename, the contents of each directory should be bunched together in the file. You do not have to use Perl to sort the file; sort(1) will do nicely in most cases.
You can then just read in the file line-by-line - or with any delimiter that will not mangle your filenames - and just perform any changes. Assuming that you can tell which changes are needed for each file at once, it does not sound as if you actually need the random-access capabilities of a hash, so this should do.
So the process would happen in three steps:
Create the change file
Sort the change file
Perform changes per the change file
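A minimal sketch of those three steps with standard shell tools. The find criteria, the CSV layout, and the uid/gid mapping are placeholders for your real logic (and note a comma-separated format breaks on filenames containing commas, so substitute a safer delimiter in practice):

```shell
# 1) Create the change file: current uid,gid,mode,path per entry.
find /data -printf '%U,%G,%m,%p\n' > changes.unsorted
# 2) Sort by the path field so each directory's entries are adjacent.
sort -t, -k4 changes.unsorted > changes.sorted
# 3) Apply serially, line by line -- no hash needed.
while IFS=, read -r uid gid mode path; do
    chown "$uid:$gid" "$path"   # substitute the *new* ids from your mapping here
done < changes.sorted
```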
2) If you cannot tell which changes each file needs at once, you could have multiple lines for each file, each detailing a part of the changes. Each line would be produced the moment you determine a needed change at the first step. You can then merge them after sorting.
3) If you do need random access capabilities, consider using a proper embedded database, such as BerkeleyDB or SQLite. There are Perl modules for most embedded databases around. This will not be quite as fast, though.
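For option 3, even the sqlite3 command line is enough to prototype the schema before reaching for a Perl module. The table and column names here are made up for illustration:

```shell
# Key the table on device:inode -- exactly the compact form discussed above.
sqlite3 changes.db 'CREATE TABLE changes (
    dev INTEGER, ino INTEGER,
    olduid INTEGER, newuid INTEGER,
    oldgid INTEGER, newgid INTEGER,
    PRIMARY KEY (dev, ino));'
sqlite3 changes.db 'INSERT INTO changes VALUES (2049, 12345, 1, 2, 1, 2);'
sqlite3 changes.db 'SELECT newuid FROM changes WHERE dev = 2049 AND ino = 12345;'
```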
Down at the bottom of this essay is a comment about a spooky way to beat passwords. Scan the entire HDD of a user including dead space, swap space etc, and just try everything that looks like it might be a password.
The question: part 1, are there any tools around (A live CD for instance) that will scan an unmounted file system and zero everything that can be? (Note I'm not trying to find passwords)
This would include:
Slack space that is not part of any file
Unused parts of the last block used by a file
Swap space
Hibernation files
Dead space inside of some types of binary files (like .DOC)
The tool (aside from the last case) would not modify anything that can be detected via the file system API. I'm not looking for a block device find/replace but rather something that just scrubs everything that isn't part of a file.
part 2, How practical would such a program be? How hard would it be to write? How common is it for file formats to contain uninitialized data?
One (risky and costly) way to do this would be to use a file system aware backup tool (one that only copies the actual data) to back up the whole disk, wipe it clean and then restore it.
I don't understand your first question (do you want to modify the file system? Why? Isn't this dead space exactly where you want to look?)
Anyway, here's an example of such a tool:
#include <stdio.h>
#include <alloca.h>
#include <string.h>
#include <ctype.h>
/* Number of bytes we read at once, >2*maxlen */
#define BUFSIZE (1024*1024)
/* Replace this with a function that tests the password consisting of the first len bytes of pw */
int testPassword(const char* pw, int len) {
/*char* buf = alloca(len+1);
memcpy(buf, pw,len);
buf[len] = '\0';
printf("Testing %s\n", buf);*/
int rightLen = strlen("secret");
return len == rightLen && memcmp(pw, "secret", len) == 0;
}
int main(int argc, char* argv[]) {
int minlen = 5; /* We know the password is at least 5 characters long */
int maxlen = 7; /* ... and at most 7. Modify to find longer ones */
int avlen = 0; /* available length - The number of bytes we already tested and think could belong to a password */
int i;
char* curstart;
char* curp;
FILE* f;
size_t bytes_read;
char* buf = alloca(BUFSIZE+maxlen);
if (argc != 2) {
printf ("Usage: %s disk-file\n", argv[0]);
return 1;
}
f = fopen(argv[1], "rb");
if (f == NULL) {
printf("Couldn't open %s\n", argv[1]);
return 2;
}
memset(buf + BUFSIZE, 0, maxlen); /* so the first carry-over copy below reads defined bytes */
for(;;) {
/* Copy the rest of the buffer to the front */
memcpy(buf, buf+BUFSIZE, maxlen);
bytes_read = fread(buf+maxlen, 1, BUFSIZE, f);
if (bytes_read == 0) {
/* Read the whole file */
break;
}
for (curstart = buf;curstart < buf+bytes_read;) {
for (curp = curstart+avlen;curp < curstart + maxlen;curp++) {
/* Let's assume the password just contains letters and digits. Use isprint() otherwise. */
if (!isalnum((unsigned char)*curp)) {
curstart = curp + 1;
break;
}
}
avlen = curp - curstart;
if (avlen < minlen) {
/* Nothing to test here, move along */
curstart = curp+1;
avlen = 0;
continue;
}
for (i = minlen;i <= avlen;i++) {
if (testPassword(curstart, i)) {
char* found = alloca(i+1);
memcpy(found, curstart, i);
found[i] = '\0';
printf("Found password: %s\n", found);
}
}
avlen--;
curstart++;
}
}
fclose(f);
return 0;
}
Installation:
Start a Linux Live CD
Copy the program to the file hddpass.c in your home directory
Open a terminal and type the following
su || sudo -s # Makes you root so that you can access the HDD
apt-get install -y gcc # Install gcc
This works only on Debian/Ubuntu et al, check your system documentation for others
gcc -o hddpass hddpass.c # Compile.
./hddpass /dev/YOURDISK # The disk is usually sda, hda on older systems
Look at the output
Test (copy to console, as root):
gcc -o hddpass hddpass.c
</dev/zero head -c 10000000 >testdisk # Create an empty 10MB file
mkfs.ext2 -F testdisk # Create a file system
rm -rf mountpoint; mkdir -p mountpoint
mount -o loop testdisk mountpoint # needs root rights
</dev/urandom head -c 5000000 >mountpoint/f # Write stuff to the disk
echo asddsasecretads >> mountpoint/f # Write password in our pagefile
# On some file systems, you could even remove the file.
umount mountpoint
./hddpass testdisk # prints secret
Test it yourself on an Ubuntu Live CD:
# Start a console and type:
wget http://phihag.de/2009/so/hddpass-testscript.sh
sh hddpass-testscript.sh
Therefore, it's relatively easy. As I found out myself, ext2 (the file system I used) overwrites deleted files. However, I'm pretty sure some file systems don't. Same goes for the pagefile.
How common is it for file formats to contain uninitialized data?
Less and less common, I would have thought. The classic "offender" is older versions of MS Office applications that (essentially) did a memory dump to disk as their "quicksave" format: no serialisation, no selection of what to dump, and a memory allocator that doesn't zero newly allocated memory pages. That led to not only juicy bits from previous versions of the document (so the user could use undo), but also juicy snippets from other applications.
How hard would it be to write?
Something that clears out unallocated disk blocks shouldn't be that hard. It would need to run either off-line or as a kernel module, so as not to interfere with normal file-system operations, but most file systems have an allocated/not-allocated structure that is fairly straightforward to parse. Swap is harder, but as long as you're OK with having it cleared on boot (or shutdown), it's not too tricky. Clearing out the tail block is trickier, definitely not something I'd want to try on-line, but it shouldn't be TOO hard to make it work for off-line cleaning.
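For the unallocated-blocks case there is also a crude on-line approximation that needs no filesystem parsing at all: fill the free space with zeros from userspace, then delete the fill file. This assumes the filesystem is mounted at $MNT and can tolerate briefly running full:

```shell
# Every block the fill file claims was, by definition, unallocated;
# dd stops with ENOSPC once the filesystem is full.
dd if=/dev/zero of="$MNT/zerofill" bs=1M || true
sync                      # make sure the zeros actually hit the disk
rm -f "$MNT/zerofill"     # give the (now zeroed) blocks back
```

It cannot touch slack space inside files or swap, so it complements rather than replaces the off-line approach.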
How practical would such a program be?
Depends on your threat model, really. I'd say that at one end it'd not give you much at all, but at the other it's a definite help in keeping information out of the wrong hands. I can't give a hard-and-fast answer.
Well, if I was going to code it for a boot CD, I'd do something like this:
File is 101 bytes but takes up a 4096-byte cluster.
Copy the file "A" to "B" which has nulls added to the end.
Delete "A" and overwrite its (now unused) cluster.
Create "A" again and use the contents of "B" without the tail (remember the length).
Delete "B" and overwrite it.
Not very efficient, and it would need a tweak to avoid copying the leading (and therefore full) clusters of a file. Otherwise, you'll run into slowness and failure if there's not enough free space.
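The copy-and-recreate part of those steps can be sketched as below. Note this only zero-pads the new copy; actually scrubbing A's old cluster (step 3) still needs block- or filesystem-level access, and the 4096-byte cluster size is an assumption:

```shell
# Rewrite file $1 so its final partial cluster is backed by zero-padded data.
zero_tail() {
    a=$1
    bs=4096                                  # assumed cluster size
    size=$(stat -c '%s' "$a")
    pad=$(( (bs - size % bs) % bs ))
    { cat "$a"; head -c "$pad" /dev/zero; } > "$a.b"  # copy A -> B with zeroed tail
    rm -f "$a"                               # delete A; its old cluster is now free
    head -c "$size" "$a.b" > "$a"            # recreate A at its original length
    rm -f "$a.b"
}
```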
Are there open-source tools that do this efficiently?