File search algorithms using indexing in Linux

I am thinking of implementing a file search program using indexing on Linux. I know that there are several other file search programs, such as beagled, but I am doing this for study purposes. I am stuck on how to do the indexing. I have the following idea, which I took from the maemo-mapper application:
For example, if you have a file named "suresh", it is indexed in the file system as
/home/$USERNAME/.file_search_index/s/u/r/e/s/h/list.txt, where list.txt contains the locations of all files with name = "suresh". Please suggest a better idea/algorithm to implement it. And if there is any material on various file search techniques, please post it.
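Just to make the scheme concrete, here is a small, hedged sketch in C of how such a per-character index path could be built; the home directory and file name are only the example values from the question, and the helper name is made up.

#include <stdio.h>

/* Hypothetical helper: build the per-character index path described above,
   e.g. "/home/user/.file_search_index/s/u/r/e/s/h/list.txt" for "suresh". */
static void index_path(const char *home, const char *name,
                       char *out, size_t outlen)
{
    size_t used = (size_t)snprintf(out, outlen, "%s/.file_search_index", home);
    for (const char *p = name; *p && used + 3 < outlen; p++)
        used += (size_t)snprintf(out + used, outlen - used, "/%c", *p);
    if (used + sizeof "/list.txt" <= outlen)
        snprintf(out + used, outlen - used, "/list.txt");
}

int main(void)
{
    char path[4096];
    index_path("/home/user", "suresh", path, sizeof path);
    printf("%s\n", path);    /* .../s/u/r/e/s/h/list.txt */
    return 0;
}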

You haven't seen the locate command that comes with findutils? Like beagled, it's free software, so you can study the code.
The findutils package is always looking for contributors.
Information on the database format is at http://www.gnu.org/software/findutils/manual/html_node/find_html/Database-Formats.html

Beagle uses a very interesting approach with inotify. It starts, establishes a watch on the parent directory and starts another thread that does a recursive scan. As more directories are accessed, the parent sees them and adds more watches, while watching what it already knows about.
So, when it's started, you're watching an entire tree quite cheaply (one watch per directory) and you have the whole thing indexed. This also helps ensure that no files are 'missed' during the scan.
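A rough sketch of that "watch, then recurse" idea in C using the inotify API (this is not Beagle's actual code, and the watched path is an arbitrary example; a real indexer would add a watch for every directory its scanning thread discovers):

#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

int main(void)
{
    int fd = inotify_init();
    if (fd < 0) { perror("inotify_init"); return 1; }

    /* one watch per directory keeps this cheap */
    int wd = inotify_add_watch(fd, "/home/user/Documents",
                               IN_CREATE | IN_MOVED_TO | IN_DELETE);
    if (wd < 0) { perror("inotify_add_watch"); return 1; }

    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    ssize_t n = read(fd, buf, sizeof buf);    /* blocks until an event arrives */
    if (n > 0) {
        struct inotify_event *ev = (struct inotify_event *)buf;
        if (ev->len)
            printf("event on %s%s\n", ev->name,
                   (ev->mask & IN_ISDIR) ? " (a directory: watch it too)" : "");
    }
    close(fd);
    return 0;
}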
So that's most of your battle: typically, FS search programs hit their sluggish point when indexing, 'updatedb' for instance.
As for storing the index, I would not favor splitting it up into directories. You'd in essence be calling stat() on every character of a file name: some-very-long-shared-object-name.so.0, for instance, would mean one call to stat() for every character in the name. You might try using a well-designed SQLite3 database instead.
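A sketch of what a flat SQLite3 index could look like (the schema, database path and sample row are assumptions for illustration, not something the answer prescribes): one row per file, indexed by base name, so a lookup is a single SELECT instead of a walk through per-character directories. Link with -lsqlite3.

#include <stdio.h>
#include <sqlite3.h>

int main(void)
{
    sqlite3 *db;
    if (sqlite3_open("/home/user/.file_search_index.db", &db) != SQLITE_OK)
        return 1;

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS files(name TEXT, path TEXT);"
        "CREATE INDEX IF NOT EXISTS idx_name ON files(name);",
        NULL, NULL, NULL);

    sqlite3_exec(db,
        "INSERT INTO files VALUES('suresh', '/home/user/docs/suresh');",
        NULL, NULL, NULL);

    /* lookup: SELECT path FROM files WHERE name = 'suresh'; */
    sqlite3_close(db);
    return 0;
}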
I'm working on something very similar, a program to provide slightly cheaper auditing means for PCI certification (credit card processor), without using kernel auditing hooks.

Related

How to protect my script from being copied and modified?

I created an Expect script for a customer, and I fear that he will customize it as he wants without coming back to me, so I tried to encrypt it, but I didn't find a way to do it.
Then I tried to convert it to an executable, but some commands, like the "send" command, were not recognized by ActiveTcl, even though the script works perfectly on Red Hat.
So is there a way to protect my script from being read?
Thanks
It's usually enough to just package the code in a form that the user can't directly look inside. Even the smallest of speed bumps stops them.
You can use sdx qwrap to parcel your script up into a starkit. Those are reasonably resistant to random user poking, while being still technically open (the sdx tool is freely available, after all). You can convert the .kit file it creates into an executable by merging it with a packaged runtime.
In short, it's basically like this (with some complexity glossed over):
tclkit sdx.kit qwrap myapp.tcl
tclkit sdx.kit unwrap myapp.kit
# Copy additional assets into myapp.vfs if you need to
tclkit sdx.kit wrap myapp.exe -runtime C:\path\to\tclkit.exe
More discussion is here, the tclkit runtimes are here, and sdx itself can be obtained in .kit-packaged form here. Note that the runtime you use to run sdx does not need to be the same that you package; you can deploy code for other platforms than the one you are running from. This is a packaging phase action, not a compilation or linking.
Against more sophisticated users (i.e., not Joe Ordinary User) you'll want the Tcl Compiler out of the ActiveState TclDevKit. It's a code-obscurer formally (it doesn't actually improve the performance of anything) and the TDK isn't particularly well supported any more, but it's the main current solution for commercial protection of Tcl code. I'm on a small team working on a true compiler that will effectively offer much stronger protection, but that's not yet released (and really isn't ready yet).
One way is to keep the essential code running on your server as a back end. Just give the user a front-end application to make the requests. This way the essential processes are under your control, and the user cannot access that code.

Linux Kernel Module that can list files and folders inside a given path

I would like to know if it is possible to list files and folders inside a given folder from within the Linux Kernel. I bet there is a way.
I have searched online and given it a few shots, but still could not do it.
Thank you!
Reacting to your comment: your question isn't about file reading, but about getting the entries of a directory. About your last sentence: yes, every filesystem implements the readdir() function, so it would be filesystem-independent.
In my opinion, you need the following steps:
Research how to write kernel modules. There are very many tutorials on the net, including step-by-step tutorials with well-commented examples.
Write a simple module which printk()s some simple text in its initialization function.
Research how you can call system calls from a kernel module. It is probably not as simple as from user space, but almost surely possible.
The simplest way is to pass the path to the directory in a module parameter. Linux kernel modules can have multiple parameters, whose processing is very well automated (essentially, you can directly bind the parameter name to static variables in the module).
Once your module can call system calls and has its input, you can open this directory in its init function with the opendir() call, then read its content (see readdir()), and finally output the result with printk(). A minimal skeleton covering the "simple module" and "module parameter" steps is sketched after this list.
There will probably be some obstacles (for example, maybe you can't use syscalls from the module init function, or similar), but none of them will be really hard.
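Purely as a hedged illustration of the "simple module" and "module parameter" steps (the module and parameter names are made up), here is a minimal skeleton. Note that the userspace opendir()/readdir() calls are not directly available inside the kernel; the in-kernel equivalent would be filp_open() together with iterate_dir(), which is only hinted at in a comment because its callback signature varies between kernel versions.

#include <linux/module.h>
#include <linux/init.h>
#include <linux/moduleparam.h>

/* Take the directory path as a module parameter and printk() it
   from the init function. */
static char *dirpath = "/tmp";
module_param(dirpath, charp, 0444);
MODULE_PARM_DESC(dirpath, "Directory whose entries should be listed");

static int __init listdir_init(void)
{
    pr_info("listdir: asked to list %s\n", dirpath);
    /* filp_open(dirpath, ...) + iterate_dir() would go here */
    return 0;
}

static void __exit listdir_exit(void)
{
    pr_info("listdir: unloaded\n");
}

module_init(listdir_init);
module_exit(listdir_exit);
MODULE_LICENSE("GPL");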

Is it possible for the same file to exist in more than one directory?

Just a simple question, borne out of learning about File Systems;
Is it possible for a single file to simultaneously exist in two or more directories?
I'd like to know if this is possible in Linux as well as Windows.
Yes, you can do this with either hard or soft links (and maybe on Windows with shortcuts; I'm not sure about that). Note that this is different from making a copy of the file! In both cases, you only store the same file once, unlike when you make a copy.
In the case of hard links, the same file (on disk) will be referenced in two different places. You cannot distinguish between the 'original' and the 'new one'. If you delete one of them, the other will be unaffected; a file will only actually be deleted when the last "reference" is removed. An important detail is that the way hard links work means that you cannot create them for directories.
Soft links, also referred to as symbolic links, are a bit similar to shortcuts in Windows, but at a lower level. If you open them for read or write operations, you'll read from the file, but you can distinguish between reading from the file directly and reading from the soft link.
In Windows, the use of soft links is fairly uncommon, but there is support for it (IDK about the filesystem APIs, but there's a tool called ln just like on Unix).
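For completeness, on Linux the two kinds of link boil down to the link(2) and symlink(2) system calls; a minimal sketch (the file names are arbitrary examples):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* hard link: a second directory entry pointing at the same inode */
    if (link("original.txt", "hardlink.txt") != 0)
        perror("link");

    /* soft (symbolic) link: a small special file that stores the target path */
    if (symlink("original.txt", "softlink.txt") != 0)
        perror("symlink");

    return 0;
}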

FS based on a database without using fuse

To serve millions of files out of a single directory, to be able to connect to a drive from hundreds of endpoints, and for some other reasons (to avoid gluster/NFS/all FS-based networking solutions), I want to evaluate the possibility of making a filesystem that's based on MongoDB (or any other database).
Basically, it works like a FUSE FS: every single file is kept in Mongo GridFS. In theory, I do
mount mongodbfs /mountPoint mongodb://localhost
and then when I say touch /mountPoint/test.txt, this file is inserted into MongoDB. This FS will also store uid/gid and perms with the file, we can throw hundreds of servers at it, and no useradd will be necessary. I'm not thinking of including all the features of an FS, just the ones we need.
My question is, how do I start my quest to find resources, books, links, people, and developers who'd help me implement this, or at least a proof of concept? Is it feasible? What should I expect as a timeline for such an undertaking?
Please only think about a gazillion small files and folders.
PS: after a few days of research, I think this is the direction I'm heading:
http://www.ibm.com/developerworks/library/l-sc12.html
http://www.flipcode.com/archives/Programming_a_Virtual_File_System-Part_I.shtml
PS2: I'm aware of the difficulty of this undertaking; however, we're willing to set aside a serious budget and to form a serious team to implement it, but only after we make sure that this isn't a black hole (thus the question).
Your most frequent piece of advice here is going to be "Use FUSE". This is excellent advice, and you would do well to heed it (As Sciurus pointed out there's already gridfs-fuse which is pretty close to what you want).
That said, if you want to take the long, hard road of pain and suffering (writing your own filesystem), you almost certainly want to take an operating systems course at a local university, or look at some online course materials ("Write a simple FS" is usually a small project. The filesystems typically suck because they're academic toys).
Follow that up with Linux File Systems (Moshe Bar) and a thorough reading of some simple filesystem drivers to see the basic skeleton of what you'll need to do.
As far as the timeline goes, if you're a decent coder you can write a basic filesystem in a few days to a week (but it will SUCK). I wouldn't even guess how long it would take to write a GOOD filesystem: UFS/FFS (the BSD filesystem) has been under continuous development since at least the late 1970s/early 1980s, and improvements/enhancements/bug fixes still pop up occasionally. Sun/Oracle's ZFS has gone through over 20 iterations in its relatively short (6-year) life, though admittedly much of that is related to volume management capabilities.

Recommended FHS compliant application test/install workflow under Linux?

I'm in the process of switching to Linux for development, and I'm puzzled about how to maintain a good FHS compliancy in my programs.
For example, under Windows, I know that all the resources (bitmaps, audio data, etc.) that my program will need can be found with relative paths from the executable, so it's the same whether I'm running the program from my development directory or from an installation (under "Program Files", for example): the program will be able to locate all its files.
Now, under Linux, I see that usually the executable goes under /usr/local/bin and its resources under /usr/local/share. (And the truth is that I'm not even sure of this.)
For convenience reasons (such as version control) I'd like to have all the files pertaining to the project under a same path, say, for example, project/src for the source and project/data for resource files.
Is there any standard or recommended way to let me just rebuild the binary for testing and use the files on the project/data directory, while also being able to locate the files when they are under /usr/local/share?
I thought, for example, of setting a symlink under /usr/local/share pointing to my resources dir, and then just hardcoding that path inside my program, but I feel it's quite hackish and not very portable.
Also, I thought of running an install script that copies all the resources to /usr/local/share every time I change or add resources, but I also feel it's not a good way to do it.
Could anyone tell me, or point me to a resource that explains, how this issue is usually resolved?
Thanks!
For convenience reasons (such as version control) I'd like to have all the files pertaining to the project under a same path, say, for example, project/src for the source and project/data for resource files.
You can organize your source tree as you wish — it need not bear any resemblance to the FHS layout desired of installed software.
I see that usually the executable goes under /usr/local/bin and its resources on /usr/local/share. (And the truth is that I'm not even sure of this)
The standard prefix is /usr. /usr/local is for, well, "local installations" as the FHS spec reiterates.
Is there any standard or recommended way to let me just rebuild the binary for testing and use the files on the project/data directory
Definitely. Running ./configure --datadir=$PWD/share, for example, is the way to point your build at the data files from the source tree (substitute the proper path), and using something like -DDATADIR="'${datadir}'" in AM_CFLAGS makes the value known to the (presumably C) code. (All of that provided you are using autoconf/automake. Similar options may be available in other build systems.)
This sort of hardcoding is what is used in practice, and it suffices. For a development build within your own working copy, having a hardcoded path should not be a problem, and final builds (those done by a packager) will simply use the standard FHS paths.
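As an illustration of how the injected value ends up being used in the code: the DATADIR macro below is assumed to come from the -DDATADIR flag above, and the fallback path and resource file name are made up.

#include <stdio.h>

#ifndef DATADIR
#define DATADIR "/usr/local/share/myapp"   /* fallback for ad-hoc builds */
#endif

int main(void)
{
    FILE *f = fopen(DATADIR "/resources/logo.bmp", "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }
    /* ... load the resource ... */
    fclose(f);
    return 0;
}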
You could just test a few locations. For example, first check if you have a data directory within the directory you're currently running the program from. If so, just go ahead and use it. If not, try /usr/local/share/yourproject/data, and so on.
For developing/testing, you can use the data directory within your project folder, and for deploying, use the stuff in /usr/local/share/. Of course, you can test for even more locations (e.g. /usr/share).
Basically the requirement for this method is that you have a function that builds the correct paths for all filesystem accesses. Instead of fopen("data/blabla.conf", "w") use something like fopen(path("blabla.conf"), "w"). path() will construct the correct path from the path determined using the directory tests when the program started. E.g. if the path was /usr/local/share/yourproject/data/, the string returned by path("blabla.conf") would be "/usr/local/share/yourproject/data/blabla.conf" - and there is your nice absolute path.
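A rough sketch of such a path() helper in C (the candidate directories and the project name are assumptions; a real program might also honour an environment variable or a build-time default):

#include <stdio.h>
#include <sys/stat.h>

static const char *candidates[] = {
    "data/",                               /* running from the project folder */
    "/usr/local/share/yourproject/data/",  /* deployed */
    "/usr/share/yourproject/data/",
};
static const char *datadir;

static int dir_exists(const char *p)
{
    struct stat st;
    return stat(p, &st) == 0 && S_ISDIR(st.st_mode);
}

/* call once at program start */
void init_datadir(void)
{
    for (size_t i = 0; i < sizeof candidates / sizeof *candidates; i++)
        if (dir_exists(candidates[i])) { datadir = candidates[i]; return; }
    datadir = candidates[0];               /* fall back to the relative path */
}

/* e.g. path("blabla.conf") -> "/usr/local/share/yourproject/data/blabla.conf" */
const char *path(const char *name)
{
    static char buf[4096];                 /* note: not thread-safe */
    snprintf(buf, sizeof buf, "%s%s", datadir, name);
    return buf;
}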
That's how I'd do it. HTH.
My preferred solution in cases like this is to use a configuration file, along with a command-line option that overrides its location.
For example, a configuration file for a fully deployed application named myapp could reside in /etc/myapp/settings.conf and a part of it could look like this:
...
confdir=/etc/myapp/
bindir=/usr/bin/
datadir=/usr/share/myapp/
docdir=/usr/share/doc/myapp/
...
Your application (or a launcher script) can parse this file to determine where to find the rest of the needed files.
I believe that you can reasonably assume in your code that the location of the configuration file is fixed under /etc/myapp - or any other location specified at compile time. Then you provide a command line option to allow that location to be overridden:
myapp --configfile=/opt/myapp/etc/settings.conf ...
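A minimal sketch of that pattern in C, assuming a key=value settings.conf like the one above; the default location, option name and key names follow the example, and the parser is deliberately simplistic.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DEFAULT_CONF "/etc/myapp/settings.conf"

static char datadir[512] = "/usr/share/myapp/";

static void load_config(const char *file)
{
    char line[1024];
    FILE *f = fopen(file, "r");
    if (!f) {
        fprintf(stderr, "myapp: cannot open %s\n", file);
        exit(1);                        /* abort loudly instead of guessing */
    }
    while (fgets(line, sizeof line, f)) {
        char *eq = strchr(line, '=');
        if (!eq)
            continue;
        *eq = '\0';
        char *val = eq + 1;
        val[strcspn(val, "\n")] = '\0';
        if (strcmp(line, "datadir") == 0)
            snprintf(datadir, sizeof datadir, "%s", val);
        /* ... handle confdir, bindir, docdir the same way ... */
    }
    fclose(f);
}

int main(int argc, char **argv)
{
    const char *conf = DEFAULT_CONF;
    for (int i = 1; i < argc; i++)
        if (strncmp(argv[i], "--configfile=", 13) == 0)
            conf = argv[i] + 13;
    load_config(conf);
    printf("data directory: %s\n", datadir);
    return 0;
}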
It might also make sense to have options for some of the directory paths as well, so that the user can easily override any of the configuration file settings. This approach has a couple of advantages:
Your users can relocate the application very easily - just by moving the files, modifying the paths in the configuration file and then using e.g. a wrapper script to call the main application with the proper --configfile option.
You can easily support FHS, as well as any other scheme you need to.
While developing, you can have your testsuite use a specially crafted configuration file with the paths being wherever you need them to be.
Some people advocate probing the system at runtime to resolve issues like this. I usually suggest avoiding such solutions for at least the following reasons:
It makes your program non-deterministic. You can never tell at a first glance which configuration file it picks up - especially if you have multiple versions of the application on your system.
At any installation mix-up, the application will remain fat and happy - and so will the user. In my opinion, the application should look at one specific and well-documented location and abort with an informative message if it cannot find what it is looking for.
It's highly unlikely that you will always get everything right. There will always be unexpected rare environments or corner cases that the application will not handle.
Such behaviour is against the Unix philosophy. Even command shells probe multiple locations because all of those locations can hold a file that should be parsed.
EDIT:
This method is not mandated by any formal standard that I know of, but it is the prevalent solution in the Unix world. Most major daemons (e.g. BIND, sendmail, postfix, INN, Apache) will look for a configuration file at a certain location, but will allow you to override that location and - through the file - any other path.
This is mostly to allow the system administrator to implement whatever scheme they want or to set up multiple concurrent installations, but it does help during testing as well. This flexibility is what makes it a best practice, if not a proper standard.
