Building a file index in Linux

I have a filesystem with deeply nested directories. Inside the bottom-level directory for any node in the tree is a directory whose name is the GUID of a record in a database. This folder contains the binary file(s) (PDF, JPG, etc.) that are attached to that record.
Two example paths:
/g/camm/MOUNT/raid_fs0/FOO/042014/27/123.456.789/04.20.30--27.04.2014--RJ123.pdf
/g/camm/MOUNT/raid_fs1/FOO/052014/22/321.654.987/04.20.30--27.04.2014--RJ123.pdf
In the above example, 123.456.789 and 321.654.987 are GUIDs.
I want to build an index of the complete filesystem so that I can create a lookup table in my database to easily map the guid of the record to the absolute path(s) of its attached file(s).
I can easily generate a straight list of files with:
find /g/camm/MOUNT -type f > /g/camm/MOUNT/files.index
but I want to parse the output of each file path into a CSV file which looks like:
GUID ABSOLUTEPATH FILENAME
123.456.789 /g/camm/MOUNT/raid_fs0/FOO/042014/27/123.456.789/04.20.30--27.04.2014--RJ123.pdf 04.20.30--27.04.2014--RJ123.pdf
321.654.987 /g/camm/MOUNT/raid_fs1/FOO/052014/22/321.654.987/04.20.30--27.04.2014--RJ123.pdf 04.20.30--27.04.2014--RJ123.pdf
I think I need to pipe the output of my find command into xargs and again into awk to process each line of the output into the desired format for the CSV output... but I can't make it work...

Wait for your long-running find to finish, then you
can pass the list of filenames through awk:
awk -F/ '{printf "%s,%s,%s\n",$(NF-1),$0,$NF}' /g/camm/MOUNT/files.index
and this will convert lines like
/g/camm/MOUNT/raid_fs0/FOO/042014/27/123.456.789/04.20.30--27.04.2014--RJ123.pdf
into
123.456.789,/g/camm/MOUNT/raid_fs0/FOO/042014/27/123.456.789/04.20.30--27.04.2014--RJ123.pdf,04.20.30--27.04.2014--RJ123.pdf
The -F/ splits the line into fields using "/" as separator, NF is the
number of fields, so $NF means the last field, and $(NF-1) the
next-to-last, which seems to be the directory you want in the first column
of the output. I used "," in the printf to separate the output columns, as
is typical in a CSV; you can replace it with any character, such as a space or ";".
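Putting it together, the whole index can be produced in one pass straight from find; a minimal sketch (the CSV header line and the files.csv name are just illustrative):
# Emit GUID,ABSOLUTEPATH,FILENAME rows for every file under the mount point.
echo "GUID,ABSOLUTEPATH,FILENAME" > /g/camm/MOUNT/files.csv
find /g/camm/MOUNT -type f \
    | awk -F/ '{printf "%s,%s,%s\n", $(NF-1), $0, $NF}' \
    >> /g/camm/MOUNT/files.csv
Keep in mind the output file lives under the search root here, so it will show up in later runs of the same find.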

I don't think there can be anything much faster than your find command, but
you may be interested in the locate package. It uses the updatedb command, usually run each night by cron, to traverse the filesystem and create a file holding all the filenames in a form that can be searched quickly by another command.
The locate command reads that database to find matching directories, files, and so on, even using glob wildcard or regex pattern matching. Once tried, it is hard to live without it.
For example, on my system locate -S lists the statistics:
Database /var/lib/mlocate/mlocate.db:
59945 directories
505330 files
30401572 bytes in file names
12809265 bytes used to store database
and I can do
locate rc-dib0700-nec.ko
locate -r rc-.*-nec.ko
locate '*/media/*rc-*-nec.ko*'
to find files like /usr/lib/modules/4.1.6-100.fc21.x86_64/kernel/drivers/media/rc/keymaps/rc-dib0700-nec.ko.xz in no time at all.
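If a nightly walk of just that mount point is enough, mlocate can also maintain a private database for a single subtree; a rough sketch, assuming the mlocate implementation of updatedb/locate (the ~/camm.db path is arbitrary):
# Build a private database limited to the mount point.
updatedb -l 0 -U /g/camm/MOUNT -o ~/camm.db
# Query that database instead of the system-wide one.
locate -d ~/camm.db '*/123.456.789/*'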

You can nearly do what you want with find's -printf option.
The difficulty is extracting the GUID.
Assuming prefixes have the same length as in your example, I would probably do:
find /g/camm/MOUNT -type f -printf "%h %p %f\n" | colrm 1 37 > /g/camm/MOUNT/files.index
Or if the number of / is constant
find /g/camm/MOUNT -type f -printf "%h %p %f\n" | cut -d '/' -f 9- > /g/camm/MOUNT/files.index
Otherwise, I would use sed:
find /g/camm/MOUNT -type f -printf "%h %p %f\n" | sed -e 's#^[^ ]*/\([^ ]*\) #\1 #' > /g/camm/MOUNT/files.index
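If neither the prefix length nor the number of slashes is fixed, another option is to let awk pull the last component out of the %h field itself; a minimal sketch (like the commands above, it assumes the paths contain no spaces):
# %h is the directory, %p the full path, %f the filename; split %h on "/" and
# keep its last component (the GUID), whatever the depth of the tree.
find /g/camm/MOUNT -type f -printf "%h %p %f\n" \
    | awk '{n=split($1, a, "/"); print a[n], $2, $3}' > /g/camm/MOUNT/files.index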

Related

List files in unix in a specific format to a text file

I want to list the files in a particular folder to a file list, but in a particular format.
For instance, I have the below files in a folder:
/path/file1.csv
/path/file2.csv
/path/file3.csv
I want to create a string in a text file that lists them as below
-a file1.csv -a file2.csv -a file3.csv
Please assist in creating a script for that.
ls /path/* > file_list.lst
The find utility can do this.
find /path/ -type f -printf "-a %P " > file_list.lst
This gives, for each thing that is a file in the given path (recursively), its path relative to the starting point, formatted as in your example.
Note that:
Linux filenames can contain spaces and newlines; this does not deal with those.
The file_list.lst file will have a trailing space but no trailing newline.
The results will not be in a particular order.
You can just printf them.
printf "-a %s " /path/*
If you plan to use the list with a command, you may want to read https://mywiki.wooledge.org/BashFAQ/050 and look into the %q printf format specifier.
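For instance, a small sketch that shell-quotes each name with %q so entries containing spaces survive being pasted back onto a command line:
# -- stops printf from treating the leading "-a" in the format as an option;
# %q shell-quotes each filename.
printf -- '-a %q ' /path/* > file_list.lst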

how to find a last updated file with the prefix name in bash?

How can I find the last updated file with a specific prefix in bash?
For example, I have three files, and I just want the file whose name contains "abc" and has the latest Last_UpdatedDateTime.
fileName Last_UpdatedDateTime
abc123 7/8/2020 10:34am
abc456 7/6/2020 10:34am
def123 7/8/2020 10:34am
You can list files sorted in the order they were modified with ls -t:
-t sort by modification time, newest first
You can use globbing (abc*) to match all files starting with abc.
Since you will get more than one match and only want the newest (that is first):
head -1
Combined:
ls -t abc* | head -1
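If you would rather not parse ls output at all, a bash-only sketch that walks the glob and keeps the newest match (it assumes the abc* files sit in the current directory):
newest=
for f in abc*; do
    [ -e "$f" ] || continue                    # skip the literal pattern if nothing matches
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
        newest=$f                              # -nt compares modification times
    fi
done
printf '%s\n' "$newest"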
If there are a lot of these files scattered across a variety of directories, find might be better.
find -name abc\* -printf "%T@ %f\n" | sort -nr | sed 's/^.* //; q'
Breaking that out -
find -name 'abc*' -printf "%T@ %f\n" |
find has a ton of options. This is the simplest case, assuming the current directory as the root of the search. You can add a lot of refinements, or just give / to search the whole system.
-name 'abc*' picks just the filenames you want. Quote it to protect any globs, but you can use normal globbing rules. -iname makes the search case-insensitive.
-printf defines the output. %f prints the filename, but you want it ordered on the date, so print that first for sorting so the filename itself doesn't change the order. %T accepts another character to define the date format; @ is the Unix epoch, seconds since 00:00:00 01/01/1970, so it is easy to sort numerically. On my git bash emulation it returns fractions as well, so the granularity is excellent.
$: find -name abc\* -printf "%T@ %f\n"
1594219755.7741618000 abc123
1594219775.5162510000 abc321
1594219734.0162554000 abc456
find may not return them in the order you want, though, so -
sort -nr |
-n makes it a numeric sort. -r sorts in reverse order, so that the latest file will pop out first and you can ignore everything after that.
sed 's/^.* //; q;'
Since the first record is the one we want, sed can just use s/^.* //; to strip off everything up to the space, which we know will be the timestamp numbers since we controlled the output explicitly. That leaves only the filename. q explicitly quits after the s/// scrubs the record, so sed spits out the filename and stops without reading the rest, which prevents the need for another process (head -1) in the pipeline.
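If the matches are scattered across different directories and you need the full path rather than just the name, the same idea works with %p; a minimal sketch (the search root /some/dir is a placeholder):
# Print "epoch-seconds full-path", sort newest first, keep the first line,
# then strip everything up to the first space so only the path remains.
find /some/dir -name 'abc*' -printf '%T@ %p\n' | sort -nr | sed 's/^[^ ]* //; q'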

Find files whose content match a line from text file

I have a text file - accessions.txt (below is a subset of this file):
KRO94967.1
KRO95967.1
KRO96427.1
KRO94221.1
KRO94121.1
KRO94145.1
WP_088442850.1
WP_088252850.1
WP_088643726.1
WP_088739685.1
WP_088283155.1
WP_088939404.1
And I have a directory with multiple files (*.align).
I want to find the filenames (*.align) whose content matches any line within my accessions.txt text file.
I know that find . -exec grep -H 'STRING' {} + works to find specific strings (e.g replacing STRING with WP_088939404.1 returns every filename where the string WP_088939404.1 is present).
Is there a way to replace STRING with "all strings inside my text file" ?
Or
Is there another (better) way to do this?
I was trying to avoid writing a loop that reads the content of all my files as there are too many of them.
Many thanks!
You're looking for grep's -f option.
find . -name '*.align' -exec grep -Fxqf accessions.txt {} \; -print
grep can take a list of patterns to match with -f.
grep -lFf accessions.txt directory/*.align
-F tells grep to interpret the lines as fixed strings, not regex patterns.
Sometimes, -w is also needed to prevent matching inside words, e.g.
abcd
might match not only abcd, but also xabcd or abcdy. Sometimes, preprocessing the input list is needed to prevent unwanted matching if the rules are more complex.
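Combining those flags with a recursive search restricted to the .align files, a sketch assuming GNU grep's -r and --include options:
# -r recurse, --include restrict to *.align, -l print only matching filenames,
# -F fixed strings, -w whole-word matches, -f read patterns from accessions.txt
grep -rlFw --include='*.align' -f accessions.txt directory/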

Finding multiple strings in a directory using Linux commands

If I have two strings, for example "class" and "btn", what is the Linux command that would allow me to search for these two strings in the entire directory?
To be more specific, let's say I have a directory that contains a few folders with a bunch of .php files. My goal is to be able to search throughout those .php files so that it prints out only files that contain "class" and "btn" on one line. Hopefully this clarifies things.
Thanks,
I normally use the following to search for strings inside my source code. It searches for the string and shows the exact line number where that text appears. Very helpful for finding strings in source files. You can always pipe the output to another grep and filter it further.
grep -rn "text_to_search" directory_name/
example:
$ grep -rn "angular" menuapp
$ grep -rn "angular" menuapp | grep some_other_string
output would be:
menuapp/public/javascripts/angular.min.js:251://# sourceMappingURL=angular.min.js.map
menuapp/public/javascripts/app.js:1:var app = angular.module("menuApp", []);
grep -rE 'class|btn' /path/to/directory
grep is used to search for a string in files. With the -r flag, it searches recursively through all files in a directory; -E enables the extended regex syntax needed for the | alternation. Note this matches lines containing either string; see the sketch below for requiring both on the same line.
Or, alternatively, use the find command to "identify" the files to be searched instead of using grep in recursive mode:
find /path/to/your/directory -type f -exec grep "text_to_search" {} +
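Since the question asks for lines that contain both strings, a sketch of two possible approaches, restricted to the .php files and assuming GNU grep:
# Chained greps: only lines containing both "class" and "btn" survive
# (note the second grep also sees the file:line prefix added by -n).
grep -rn --include='*.php' 'class' /path/to/directory | grep 'btn'
# Or list just the matching files, accepting either order on the same line:
grep -rlE --include='*.php' 'class.*btn|btn.*class' /path/to/directory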

Listing entries in a directory using grep

I'm trying to list all entries in a directory whose names contain ONLY upper-case letters. Directories need "/" appended.
#!/bin/bash
cd ~/testfiles/
ls | grep -r *.*
Since grep by default looks for upper-case letters only (right?), I'm just recursively searching through the directories under testfiles for all names that contain only upper-case letters.
Unfortunately this doesn't work.
As for appending directories, I'm not sure why I need to do this. Does anyone know where I can start with some detailed explanations on what I can do with grep? Furthermore how to tackle my problem?
No, grep does not only consider uppercase letters.
Your question is a bit unclear; for example:
from your usage of the -r option, it seems you want to search recursively, but you don't say so. For simplicity I assume you don't need to; consider looking into #twm's answer if you need recursion.
you want to look for uppercase letters only. Does that mean you don't want to accept any other (non-letter) characters that are still valid in file names (like digits, dashes, dots, etc.)?
since you don't say that one file per line is not permissible, I am assuming it is OK (thus using ls -1).
The naive solution would be:
ls -1 | grep "^[[:upper:]]\+$"
That is, print all lines containing only uppercase letters. In my TEMP directory that prints, for example:
ALLBIG
LCFEM
WPDNSE
This however would exclude files like README.TXT or FILE001, which depending on your requirements (see above) should most likely be included.
Thus, a better solution would be:
ls -1 | grep -v "[[:lower:]]\+"
That is, print all lines not containing a lowercase letter. In my TEMP directory that prints, for example:
ALLBIG
ALLBIG-01.TXT
ALLBIG005.TXT
CRX_75DAF8CB7768
LCFEM
WPDNSE
~DFA0214428CD719AF6.TMP
Finally, to "properly mark" directories with a trailing '/', you could use the -F (or --classify) option.
ls -1F | grep -v "[[:lower:]]\+"
Again, example output:
ALLBIG
ALLBIG-01.TXT
ALLBIG005.TXT
CRX_75DAF8CB7768
LCFEM/
WPDNSE/
~DFA0214428CD719AF6.TMP
Note that a different option would be to use find (e.g. find ! -regex ".*[a-z].*"), if you can live with its different output format.
The exact regular expression depends on the output format of your ls command. Assuming that you do not use an alias for ls, you can try this:
ls -R | grep -o -w "[A-Z]*"
Note that with -R, ls will recursively list directories and files under the current directory. The grep option -o tells grep to print only the matched part of the text. The -w option tells grep to match only whole words. The "[A-Z]*" is a regexp that filters only upper-cased words.
Note that this regexp will match TEST.txt as well as TEXT.TXT; in other words, it only considers the parts of names that are formed by letters.
It's ls which lists the files, not grep, so that is where you need to specify that you want "/" appended to directories. Use ls --classify to append "/" to directories.
grep is used to process the results from ls (or some other source, generally speaking) and only show lines that match the pattern you specify. It is not limited to uppercase characters. You can limit it to just upper-case characters and "/" with grep -E '^[A-Z/]*$', or if you also want numbers, periods, etc. you could instead filter out lines that contain lowercase characters with grep -v -E '[a-z]'.
As grep is not the program which lists the files, it is not where you want to perform the recursion. ls can list paths recursively if you use ls -R. However, you're just going to get the last component of the file paths that way.
You might want to consider using find to handle the recursion. This works for me:
find . -exec ls -d --classify {} \; | egrep -v '[a-z][^/]*/?$'
I should note, using ls --classify to append "/" to the end of directories may also append some other characters to other types of paths that it can classify. For instance, it may append "*" to the end of executable files. If that's not OK, but you're OK with listing directories and other paths separately, this could be worked around by running find twice - once for the directories and then again for other paths. This works for me:
find . -type d | egrep -v '[a-z][^/]*$' | sed -e 's#$#/#'
find . -not -type d | egrep -v '[a-z][^/]*$'
