Search for multiple strings (from a file) in another file and list the missing strings

I have a file with 200 student names, and another huge file that contains data for those 200 students. I want to make sure that none of the student names got missed. I'm looking for a script that takes each string from students.txt and searches for it in alldata.txt; if it is missing, list it.
I tried using
find /tmp/alldata.txt -type f -exec grep -iHFf students.txt {} +
But it lists all the matches and fails to provide the list of strings that were not found in alldata.txt.

You don't need find if you're just searching one file. But if your data file is unstructured text and the names can appear anywhere, you may need to look for them one at a time:
while IFS= read -r name; do
    grep -qF "$name" alldata.txt || echo "$name"
done < students.txt

Assuming the student names are in the first field of alldata.txt:
comm -23 <(sort students.txt) <(awk '{print $1}' alldata.txt | sort -u)
comm -23 prints all the lines that are in the first file but not in the second file. This uses process substitution to treat the output of the two commands as files.
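A tiny illustration of the comm -23 behaviour, using made-up names (alice, bob and carol are hypothetical sample data, not from your files):
comm -23 <(printf '%s\n' alice bob carol | sort) <(printf '%s\n' alice carol | sort)
# prints "bob": present in the first list, missing from the second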

grep search for pipe term Argument list too long

I have something like
grep ... | grep -f - *orders*
where the first grep ... gives a list of order numbers like
1393
3435
5656
4566
7887
6656
and I want to find those orders in multiple files (a_orders_1, b_orders_3 etc.); these files look something like
1001|strawberry|sam
1002|banana|john
...
However, when the first grep... returns too many order numbers I get the error "Argument list too long".
I also tried to give the grep command one order number at a time using a while loop but that's just way too slow. I did
grep ... | while read order; do grep $order *orders*; done
I'm very new to Unix clearly, explanations would be greatly appreciated, thanks!
The problem is the expansion of *orders* in grep ... | grep -f - *orders*. Your shell expands the pattern to the full list of files before passing that list to grep.
So we need to pass fewer "orders" files to each grep invocation. The find program is one way to do that, because it accepts wildcards and expands them internally:
find . -name '*orders*' # note this searches subdirectories too
Now that you know how to generate the list of filenames without running into the command line length limit, you can tell find to execute your second grep:
grep ... | find . -name '*orders*' -exec grep -f - {} +
The {} is where find places the filenames. The + terminates the command and tells find that you're OK with passing multiple filenames to each invocation of grep -f, while still respecting the command-line length limit: find invokes grep -f more than once if the list of files would exceed the allowed length of a single command.
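One caveat with -exec ... +: if the file list is long enough that find runs grep -f - more than once, only the first invocation gets the piped-in order numbers, because they are consumed from the shared stdin. A workaround (a minimal sketch, not tested on your data) is to store the order numbers in a temporary file first; -F additionally treats them as fixed strings rather than regexes:
tmp=$(mktemp)
grep ... > "$tmp"                                # the order numbers, one per line
find . -name '*orders*' -exec grep -Ff "$tmp" {} +
rm -f "$tmp"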

How to sort and print array listing of specific file type in shell

I am trying to write a loop that extracts the text file names in all sub-directories and appends certain strings to them. Additionally, I want the text file names sorted by the number after ^.
For example, I have three sub directories mydir1, mydir2, mydir3. I have,
in mydir1,
file223^1.txt
file221^2.txt
file666^3.txt
in mydir2,
file111^1.txt
file4^2.txt
in mydir3,
file1^4.txt
file5^5.txt
The expected result final.csv:
STRINGmydir1file223^1
STRINGmydir1file221^2
STRINGmydir1file666^3
STRINGmydir2file111^1
STRINGmydir2file4^2
STRINGmydir3file1^4
STRINGmydir3file5^5
This is the code I tried:
for dir in my*/; do
    array=(${dir}/*.txt)
    IFS=$'\n' RGBASE=($(sort <<<"${array[@]}"));
    for RG in ${RGBASE[@]}; do
        RGTAG=$(basename ${RG/.txt//})
        echo "STRING${dir}${RGTAG}" >> final.csv
    done
done
Can someone please explain what is wrong with my code? Also, there could be other better ways to do this, but I want to use the for-loop.
The output with this code:
$ cat final.csv
STRINGdir1file666^3.txt
STRINGdir2file4^2.txt
STRINGdir3file5^5.txt
As a starting point which works for your special case, here is a two-liner:
mapfile -t array < <( find my* -name "*.txt" -printf "STRING^^%H^^%f\n" | cut -d"." -f1 | LANG=C sort -t"^" -k3,3 -k6 )
printf "%s\n" "${array[#]//^^/}"
To restrict the directory depth, you can add -maxdepth with the number of subdirectory levels to search. find can also take a regex, which is matched against the whole path, so this approach can be adapted to a more complex directory tree.
The difficulty was sorting on two positions with a single delimiter.
My idea was to add a marker that can easily be removed afterwards.
Since sort can only handle one delimiter character, I used the double hat ^^ as that marker; it can be stripped later without touching the single hat in the filenames.
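For illustration, here is what one decorated line looks like before the ^^ markers are stripped (a sketch of the -printf and cut stages, using mydir1/file223^1.txt from the question):
printf 'STRING^^%s^^%s\n' mydir1 file223^1
# -> STRING^^mydir1^^file223^1
# sort -t"^" sees: f1=STRING, f2=(empty), f3=mydir1, f4=(empty), f5=file223, f6=1,
# which is why -k3,3 -k6 orders by directory first, then by the trailing number.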
A solution using decorate-sort-undecorate idiom could be:
printf "%s\n" my*/*.txt |
sed -E 's_(.*)/(.*)\^([0-9]+).*_\1\t\3\tSTRING\1\2^\3_' |
sort -t$'\t' -k1,1 -k2,2n |
cut -f3
assuming filenames don't contain tab or newline characters.
A basic explanation: The printf prints each pathname on a separate line. The sed converts the pathname dir/file^number.txt into dir\tnumber\tSTRINGdirfile^number (\t represents a tab character). The aim is to use the tab character as a field separator in the sort command. The sort sorts the lines by the first (lexicographically) and second fields (numerically). The cut discards the first and second fields; the remaining field is what we want.
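To see the decorate step on a single pathname (a quick check using mydir1/file223^1.txt from the question; GNU sed is assumed, since \t in the replacement is a GNU extension):
printf '%s\n' 'mydir1/file223^1.txt' |
sed -E 's_(.*)/(.*)\^([0-9]+).*_\1\t\3\tSTRING\1\2^\3_'
# -> mydir1<TAB>1<TAB>STRINGmydir1file223^1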

How to sort by name then date modification in BASH

Let's say I have a folder of .txt files whose names have a dd-MM-yyyy_HH-mm-ss timestamp followed by _name.txt. I want to sort by name first, then by time. Example:
BEFORE
15-2-2010_10-01-55_greg.txt
10-2-1999_10-01-55_greg.txt
10-2-1999_10-01-55_jason.txt
AFTER
greg_1_10-2-1999_10-01-55
greg_2_15-2-2010_10-01-55
jason_1_10-2-1999_10-01-55
Edit: Apologies; as my cp line shows, I meant to copy the files into another directory under a different name.
Something I tried is making a copy with a count appended, but it doesn't sort files that share the same name correctly by date:
cd data/unfilteredNames
for filename in *.txt; do
    n=${filename%.*}
    n=${filename##*_}
    filteredName=${n%.*}
    count=0
    find . -type f -name "*_$n" | while read name; do
        count=$(($count+1))
        cp -p $name ../filteredNames/"$filteredName"_"$count"
    done
done
I'm not sure that renaming the files is really one of your requirements; for just sorting the file names, at least, you don't need to rename anything.
You can do this with the GNU sort command alone:
sort -t- -k5.4 -k3.1,3.4 -k2.1,2.1 -k1.1,1.2 -k3.6,3.13 <(printf "%s\n" *.txt)
-t sets the field separator to a dash -.
-k selects sort keys based on fields. As explained in the sort man page, the syntax is -k<start>,<stop>, where <start> or <stop> is composed of <field number>.<character position>. Giving several -k options allows sorting on multiple keys; keys that appear earlier on the command line take precedence over later ones.
For example, the first key, -k5.4, sorts on the 5th field starting at its 4th character. There is no stop position because the key runs to the end of the filename.
The -k3.1,3.4 option sorts on the 3rd field, from character 1 through character 4.
The same principle applies to the other -k options.
In your example the month field only has one digit. If you have files whose months are coded with two digits, you should pad the single-digit months in the filenames with a leading 0. This can be done by adding a sed filter inside the process substitution, <(printf "%s\n" *.txt | sed 's/-0\?\([0-9]\)/-0\1/'), and changing -k2.1,2.1 to -k2.1,2.2.
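As a quick sanity check (GNU sort assumed), running the command on the three sample names from the question:
printf '%s\n' 15-2-2010_10-01-55_greg.txt 10-2-1999_10-01-55_greg.txt 10-2-1999_10-01-55_jason.txt |
sort -t- -k5.4 -k3.1,3.4 -k2.1,2.1 -k1.1,1.2 -k3.6,3.13
# 10-2-1999_10-01-55_greg.txt
# 15-2-2010_10-01-55_greg.txt
# 10-2-1999_10-01-55_jason.txt
which is the name-first, then-date order you describe.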

Copy the latest updated file based on substring from filename in bash

I have to archive some files (based on a date that is in the filename) from a folder, but there can be multiple files with the same name (substring). I have to copy only the latest one to a separate folder.
for eg.
20180730.abc.xyz2.jkl.20180729.164918.csv.gz
Here 20180730 and 20180729 represent dates; I have to search by the first date, 20180730. This part is done.
The searching part which I wrote is:
for FILE in $SOURCE_DIR/$BUSINESS_DT*
do
    # Here I have to search if this FILENAME exists and if yes, then copy that latest file
    cp "${FILE}" $TARGET_DIR/
done
Now I have to check whether the same SOURCE_DIR contains a file with a name similar to 20180730.abc.xyz2.jkl., and if it exists I have to copy it.
So basically, I have to extract the portion abc.xyz2.jkl. I can't use cut with fields, as the filename could be like either abc.xyz2.jkl or abc.xyz. The portion is variable and can also contain numbers; the last two numbers are also variable and can change.
Some eg are:
20180730.abc.xyz2.jkl.20170729.890789.csv.gz
20180730.abc.xyz2.20180729.121212.csv.gz
20180730.ab.xy.20180729.11111.csv.gz
Can anybody please help me do that? I tried find and cut but didn't get the required results.
Many Thanks
Python might be a better choice for implementing something like this, but here is a bash example. You can use sed with a capture group to extract the portion of the filename that you want, then use an associative array to store the filename of the newest file containing each substring found.
Once that's done, you can go back and do the copy operations. Here is an example which extracts the string between the two 8-digit numbers and their surrounding periods. This sed expression may not work for your complete data set, but it works for the 3 examples you gave. Also, this won't handle cases where one unique identifier is a subset of another unique identifier.
declare -A LATEST
for FILE in $SOURCE_DIR/$BUSINESS_DT*
do
    # Extract the unique identifier between the two 8-digit dates
    HASH=$(echo "${FILE}" | sed "s/[0-9]\{8\}\.\(.*\)\.[0-9]\{8\}.*$/\\1/g")
    # If this is the first time we have seen this unique identifier,
    # find the newest matching file
    if [ "${LATEST[${HASH}]}abc" == "abc" ]
    then
        LATEST[${HASH}]=$(find . -type f -name "*${HASH}*" -printf '%T@ %p\n' | sort -n | tail -1 | cut -f2- -d" ")
    fi
done
# Copy the newest file found for each unique identifier
for FILE in "${LATEST[@]}"
do
    cp "${FILE}" $TARGET_DIR/
done
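To check the sed extraction on its own against the three sample names from the question:
printf '%s\n' \
    20180730.abc.xyz2.jkl.20170729.890789.csv.gz \
    20180730.abc.xyz2.20180729.121212.csv.gz \
    20180730.ab.xy.20180729.11111.csv.gz |
sed 's/[0-9]\{8\}\.\(.*\)\.[0-9]\{8\}.*$/\1/g'
# abc.xyz2.jkl
# abc.xyz2
# ab.xy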

Building a file index in Linux

I have a filesystem with deeply nested directories. Inside the bottom level directory for any node in the tree is a directory whose name is the guid of a record in a database. This folder contains the binary file(s) (pdf, jpg, etc) that are attached to that record.
Two Example paths:
/g/camm/MOUNT/raid_fs0/FOO/042014/27/123.456.789/04.20.30--27.04.2014--RJ123.pdf
/g/camm/MOUNT/raid_fs1/FOO/052014/22/321.654.987/04.20.30--27.04.2014--RJ123.pdf
In the above example, 123.456.789 and 321.654.987 are guids
I want to build an index of the complete filesystem so that I can create a lookup table in my database to easily map the guid of the record to the absolute path(s) of its attached file(s).
I can easily generate a straight list of files with:
find /g/camm/MOUNT -type f > /g/camm/MOUNT/files.index
but I want to parse the output of each file path into a CSV file which looks like:
GUID ABSOLUTEPATH FILENAME
123.456.789 /g/camm/MOUNT/raid_fs0/FOO/042014/27/123.456.789/04.20.30--27.04.2014--RJ123.pdf 04.20.30--27.04.2014--RJ123.pdf
321.654.987 /g/camm/MOUNT/raid_fs1/FOO/052014/22/321.654.987/04.20.30--27.04.2014--RJ123.pdf 04.20.30--27.04.2014--RJ123.pdf
I think I need to pipe the output of my find command into xargs and again into awk to process each line of the output into the desired format for the CSV output... but I can't make it work...
Wait for your long-running find to finish, then you
can pass the list of filenames through awk:
awk -F/ '{printf "%s,%s,%s\n",$(NF-1),$0,$NF}' /g/camm/MOUNT/files.index
and this will convert lines like
/g/camm/MOUNT/raid_fs0/FOO/042014/27/123.456.789/04.20.30--27.04.2014--RJ123.pdf
into
123.456.789,/g/camm/MOUNT/raid_fs0/FOO/042014/27/123.456.789/04.20.30--27.04.2014--RJ123.pdf,04.20.30--27.04.2014--RJ123.pdf
The -F/ splits the line into fields using "/" as separator, NF is the
number of fields, so $NF means the last field, and $(NF-1) the
next-to-last, which seems to be the directory you want in the first column
of the output. I used "," in the printf to separate the output columns, as is typical in a CSV; you can replace it with any character such as a space or ";".
I don't think there can be anything much faster than your find command, but you may be interested in the locate package. It uses the updatedb command, usually run each night by cron, to traverse the filesystem and build a database of all the filenames that can be searched quickly by another command.
The locate command reads that database to find matching directories, files, and so on, even using glob wildcards or regex pattern matching. Once tried, it is hard to live without it.
For example, on my system locate -S lists the statistics:
Database /var/lib/mlocate/mlocate.db:
59945 directories
505330 files
30401572 bytes in file names
12809265 bytes used to store database
and I can do
locate rc-dib0700-nec.ko
locate -r rc-.*-nec.ko
locate '*/media/*rc-*-nec.ko*'
to find files like /usr/lib/modules/4.1.6-100.fc21.x86_64/kernel/drivers/media/rc/keymaps/rc-dib0700-nec.ko.xz in no time at all.
You can nearly do what you want with find's -printf option.
The difficulty is the GUID column.
Assuming the prefixes have the same length as in your example, I would probably do:
find /g/camm/MOUNT -type f -printf "%h %p %f\n" | colrm 1 37 > /g/camm/MOUNT/files.index
Or if the number of / is constant
find /g/camm/MOUNT -type f -printf "%h %p %f\n" | cut -d '/' -f 9- > /g/camm/MOUNT/files.index
Otherwise, I would use sed to strip the leading directories from the first field:
find /g/camm/MOUNT -type f -printf "%h %p %f\n" | sed -e 's#^[^ ]*/\([^ ]*\) #\1 #' > /g/camm/MOUNT/files.index
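If you prefer to avoid the sed, another option (a sketch, assuming as above that the paths contain no spaces) is to let the shell strip the leading directories from %h:
find /g/camm/MOUNT -type f -printf "%h %p %f\n" |
while read -r dir path file; do
    printf '%s %s %s\n' "${dir##*/}" "$path" "$file"
done > /g/camm/MOUNT/files.index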
