Using find command in bash in order to find multiple files with different types

Using find command in bash in order to find multiple files with different types - linux

I am running test scenarios. Each time I a scenario is executed, there are two report files in a specific directory; one is a text file and another is the an HTML file. I want to make an index file to link to all files. I have wrote a for loop to iterate over scenario files and execute them; I also want to read the final result of my test scenario from HTML file and append it to the index file. At the end of the loop, I append <a> tags for links using find command.
for line in $(grep '^scenario ' $scenarioList | cut -d' ' -f2)
do
# Scripts for running tests
find -name '*.txt' -exec sh -c 'f="`basename {}`"; echo "<br><span>Text Report: $f -- </span>" >> index.htm' \;
find -name '*.html' -exec sh -c 'f="`basename {}`"; p=`cat $f | grep -Eo "Final Result :.*\." | cut -d"." -f1`; echo "<span>HTML Report: $f</span> -- <span>$p</span><br>" >> index.htm' \;
# Other scripts
done
It should creates link to text file, link to HTML file and the final result of each scenario in a single line separated by --.
If I run this scripts over a single scenario, everything seems right:
But if I run it over more scenarios, this creates wrong links:
I know that I can use -o option for logical OR, but I don't know how to separate the text file and the HTML file from each other for creating links. Any help would appreciated.

You use the -o and -printf options of find
find . -name '*.txt' -printf '<br><span>Text Report: %f -- </span>' -o -name '*.html' -printf '<span>HTML Report: %f</span>'
I may have messed up your formatting there, but I think you get the idea.
The %p option prints the full file path, relative to the search and %f prints the filename

Perhaps something like this instead. This presumes that the text file's name can be predicted from the HTML file's name. I have also refactored your shell script where it seemed unidiomatic and/or inefficient.
grep '^scenario ' "$scenarioList" |
cut -d' ' -f2 |
while read -r line; do
# Scripts for running tests
find -name '*.html' -exec sh -c '
f="$(basename {} _style-all.html)";
echo "<br><span>Text Report: ${f}blog.txt -- </span>";
h=$(grep -Eo "Final Result :.*\." {} | cut -d"." -f1);
echo "<span>HTML Report: ${f}_style-all.html</span> -- <span>$p</span><br>"'
# Other scripts
done >index.html

Related

Grep files in subdirectories and write out files for each directory

I am working on a bioinformatics workflow in which the tool in question, 'salmon' creates multiple directories having a 'quant.sf' file. I want to find all 'lnc' entries within these files and save them as 'lnc.sf' for all directories.
I was previously running
cat quant.sf | grep 'lnc' > lnc.sf
in all directories individually that seemed to solve my problem. Now I want to write a script that goes into each directory and generates a lnc.sf file.
I have tried doing
find . -name "quant.sf" | while read A
do
cat $A | grep 'lnc' > lnc.sf
done
But this just creates a concatenated lnc.sf file in the current directory. Any help is highly appreciated.
Thank You!

If all your quant.sf files are at the same hierarchy level, the following should work, assuming a folder structure like month/day/quant.sf:
grep -h 'lnc' */*/quant.sf > lnc.sf
Otherwise, find the files, be aware of using find+read instead of exec or xargs; understand variable expansion with whitespaces, get rid of the redundant cat process, and write the file to the correct directory:
find . -name 'quant.sf' | while IFS= read -r A
do
grep 'lnc' "$A" > "${A%/*}/lnc.sf"
done
If you have GNU find + xargs, use -print0 combined with -0:
find . -name 'quant.sf' -print0 | xargs -0 -n1 sh -c 'grep "lnc" "$1" > "${1%/*}/lnc.sf"' -
Or use -exec of find, which avoids problems with weird files names:
find . -name 'quant.sf' -exec sh -c 'grep "lnc" "$1" > "${1%/*}/lnc.sf"' - ';'

How to read out a file line by line and for every line do a search with find and copy the search result to destination?

I hope you can help me with the following problem:
The Situation
I need to find files in various folders and copy them to another folder. The files and folders can contain white spaces and umlauts.
The filenames contain an ID and a string like:
"2022-01-11-02 super important file"
The filenames I need to find are collected in a textfile named ids.txt. This file only contains the IDs but not the whole filename as a string.
What I want to achieve:
I want to read out ids.txt line by line.
For every line in ids.txt I want to do a find search and copy cp the result to destination.
So far I tried:
for n in $(cat ids.txt); do find /home/alex/testzone/ -name "$n" -exec cp {} /home/alex/testzone/output \; ;
while read -r ids; do find /home/alex/testzone -name "$ids" -exec cp {} /home/alex/testzone/output \; ; done < ids.txt
The output folder remains empty. Not using -exec also gives no (search)results.
I was thinking that -name "$ids" is the root cause here. My files contain the ID + a String so I should search for names containing the ID plus a variable string (star)
As argument for -name I also tried "$ids *" "$ids"" *" and so on with no luck.
Is there an argument that I can use in conjunction with find instead of using the star in the -name argument?
Do you have any solution for me to automate this process in a bash script to read out ids.txt file, search the filenames and copy them over to specified folder?
In the end I would like to create a bash script that takes ids.txt and the search-folder and the output-folder as arguments like:
my-id-search.sh /home/alex/testzone/ids.txt /home/alex/testzone/ /home/alex/testzone/output
EDIT:
This is some example content of the ids.txt file where only ids are listed (not the whole filename):
2022-01-11-01
2022-01-11-02
2020-12-01-62
EDIT II:
Going on with the solution from tripleee:
#!/bin/bash
grep . $1 | while read -r id; do
echo "Der Suchbegriff lautet:"$id; echo;
find /home/alex/testzone -name "$id*" -exec cp {} /home/alex/testzone/ausgabe \;
done
In case my ids.txt file contains empty lines the -name "$id*" will be -name * which in turn finds all files and copies all files.
Trying to prevent empty line to be read does not seem to work. They should be filtered by the expression grep . $1 |. What am I doing wrong?

If your destination folder is always the same, the quickest and absolutely most elegant solution is to run a single find command to look for all of the files.
sed 's/.*/-o\n—name\n&*/' ids.txt |
xargs -I {} find -false {} -exec cp {} /home/alex/testzone/output +
The -false predicate is a bit of a hack to allow the list of actual predicates to start with -o (as in "or").
This could fail if ids.txt is too large to fit into a single xargs invocation, or if your sed does not understand \n to mean a literal newline.
(Here's a fix for the latter case:
xargs printf '-o\n-name\n%s*\n' <ids.txt |
...
Still the inherent problem with using xargs find like this is that xargs could split the list between -o and -name or between -name and the actual file name pattern if it needs to run more than one find command to process all the arguments.
A slightly hackish solution to that is to ensure that each pair is a single string, and then separately split them back out again:
xargs printf '-o_-name_%s*\n' <ids.txt |
xargs bash -c 'arr=("$#"); find -false ${arr[#]/-o_-name_/-o -name } -exec cp {} "$0"' /home/alex/testzone/ausgabe
where we temporarily hold the arguments in an array where each file name and its flags is a single item, and then replace the flags into separate tokens. This still won't work correctly if the file names you operate on contain literal shell metacharacters like * etc.)
A more mundane solution fixes your while read attempt by adding the missing wildcard in the -name argument. (I also took the liberty to rename the variable, since read will only read one argument at a time, so the variable name should be singular.)
while read -r id; do
find /home/alex/testzone -name "$id*" -exec cp {} /home/alex/testzone/output \;
done < ids.txt

Please try the following bash script copier.sh
#!/bin/bash
IFS=$'\n' # make newlines the only separator
set -f # disable globbing
file="files.txt" # name of file containing filenames
finish="finish" # destination directory
while read -r n ; do (
du -a | awk '{for(i=2;i<=NF;++i)printf $i" " ; print " "}' | grep $n | sed 's/ *$//g' | xargs -I '{}' cp '{}' $finish
);
done < $file
which copies recursively all the files named in files.txt from . and it's subfiles to ./finish
This new version works even if there are spaces in the directory names or file names.

Passing filename as variable from find's exec into a second exec command

From reading this stackoverflow answer I was able to remove the file extension from the files using find:
find . -name "S4*" -execdir basename {} .fastq.gz ';'
returned:
S9_S34_R1_001
S9_S34_R2_001
I'm making a batch script where I want to extract the filename with the above prefix to pass as arguments into a program. At the moment I'm currently doing this with a loop but am wondering if it can be achieved using find.
for i in $(ls | grep 'S9_S34*' | cut -d '.' -f 1); do echo "$i"_trim.log "$i"_R1_001.fastq.gz "$i"_R2_001.fastq.gz; done; >> trim_script.sh
Is it possible to do something as follows:
find . -name "S4*" -execdir basename {} .fastq.gz ';' | echo {}_trim.log {}_R1_001.fastq.gz {}_R2_001.fastq.gz {}\ ; >> trim_script.sh

You don't need basename at all, or -exec, if all you're doing is generating a series of strings that contain your file's basenames within them; the -printf action included in GNU find can do all that for you, as it provides a %P built-in to insert the basename of your file:
find . -name "S4*" \
-printf '%P_trim.log %P_R1_001.fastq.gz %P_R2_001.fastq.gz %P\n' \
>trim_script.sh
That said, be sure you only do this if you trust your filenames. If you're truly running the result as a script, there are serious security concerns if someone could create a S4$(rm -rf ~).txt file, or something with a similarly malicious name.
What if you don't trust your filenames, or don't have the GNU version of find? Then consider making find pass them into a shell (like bash or ksh) that supports the %q extension, to generate a safely-escaped version of those names (note that you should run the script with the same interpreter you used for this escaping):
find . -name "S4*" -exec bash -c '
for file do # iterates over "$#", so processes each file in turn
file=${file##*/} # get the basename
printf "%q_trim.log %q_R1_001.fastq.gz %q_R2_001.fastq.gz %q\n" \
"$file" "$file" "$file" "$file"
done
' _ {} + >trim_script.sh
Using -exec ... {} + invokes the smallest possible number of subprocesses -- not one per file found, but instead one per batch of filenames (using the largest possible batch that can fit on a command line).

using grep in single-line files to find the number of occurrences of a word/pattern

I have json files in the current directory, and subdirectories. All the files have a single line of content.
I want to a list of all files that contain the word XYZ, and the number of times it occurs in that file.
I want to print the list according to the following format:
file_name pattern_occurence_times
It should look something like:
.\x1\x2\file1.json 3
.\x1\file3.json 2
The problem is that grep counts the NUMBER of lines containing XYZ, not the number of occurrences.
Since the whole content of the files is always contained in a single line, the count is always 1 (if the pattern occurs in the file).
I used this command for that:
find . -type f -name "*.json" -exec grep --files-with-match -i 'xyz' {} \; -exec grep -wci 'xyz' {} \;
I wrote a python code, and it works, but I would like to know if there is any way of doing that using find and grep or any other command line tools.
Thanks

The classical approach to this problem is the pipeline grep -o regex file | wc -l. However, to execute a pipeline in find's -exec you have to run a shell (e.g. sh -c ... ). But all these things together will only print the number of matches, not the file names. Also, files with no matches have to be filtered out.
Because of all of this I think a single awk command would be preferable:
find ... -type f -exec awk '{$0=tolower($0); c+=gsub(/xyz/,"")}
END {if(c>0) print FILENAME " " c}' {} \;
Here the tolower($0) emulates grep's -i option. Make sure to write your search pattern xyz only in lowercase.
If you want to combine this with subsequent filters in find you can add else exit 1 at the end of the last awk block to continue (inside find) only with the printed files.

Use the -o option of grep, e.g. in conjunction with wc, e.g.
find . -name "*.json" | while read -r f ; do
echo $f : $(grep -ow XYZ "$f" | wc -l)
done

Linux piping find and md5sum not sending output

Trying to loop every file, do some cutting, extract the first 4 characters of the MD5.
Here's what I got so far:
find . -name *.jpg | cut -f4 -d/ | cut -f1 -d. | md5sum | head -c 4
Problem is, I don't see any more output at this point. How can I send output to md5sum and continue sending the result?

md5sum reads everything from stdin till end of file (eof) and outputs md5 sum of full file. You should separate input into lines and run md5sum per line, for example with while read var loop:
find . -name *.jpg | cut -f4 -d/ | cut -f1 -d. |
while read -r a;
do echo -n $a| md5sum | head -c 4;
done
read builtin bash command will read one line from input into shell variable $a; while loop will run loop body (commands between do and done) for every return from read, and $a will be the current line. -r option of read is to not convert backslash; -n option of echo command will not add newline (if you want newline, remove -n option of echo).
This will be slow for thousands of files and more, as there are several forks/execs for every file inside loop. Faster will be some scripting with perl or python or nodejs or any other scripting language with builtin md5 hash computing (or with some library).

You can do what you are attempting to do with a short "helper" script that you call from find. For example, you could create a short script to find the basename of each file passed as an argument, remove the '.jpg' extension, and then provide the remaining name w/o extension as input to md5sum on stdin to get the md5sum of the name itself. Call the script anything you like, say namemd5.sh. Example:
#!/bin/bash
[ -z "$1" ] && exit 1 ## validate single argument
fname=$(basename "$1") ## get the filename alone
fname="${fname%.jpg}" ## remove .jpg extension
fnsum=$(md5sum - <<<"$fname") ## get md5sum of name w/o .jpg
fnsum=${fnsum%% *} ## remove trailing ' -'
echo "$fnsum - $fname" ## output md5sum - name
## (remove ' - $fname' for md5sum alone)
(note: the name is provided as part of the output for example purposes, remove if you want the md5sum alone as shown in the comment above)
Example Files
$ find /home/david/img/wp/ -type f -name "*.jpg"
/home/david/img/wp/hacker_manifesto_1200x900.jpg
/home/david/img/wp/hacker_manifesto_by_otalicus.jpg
/home/david/img/wp/reflections-triple-1920x1200.jpg
/home/david/img/wp/hacker_wallpaper_1600x900.jpg
/home/david/img/wp/Zen.jpg
/home/david/img/wp/hacker_wallpaper_by_vanilla23-dot254.jpg
/home/david/img/wp/hacker_manifesto_1600x900.jpg
Example Use/Output
$ find /home/david/img/wp/ -type f -name "*.jpg" -exec ./namemd5.sh '{}' \;
0f7d2aac158eb9f7842215e14ff6573c - hacker_manifesto_1200x900
604bc695a0bb70b8db0352267caf226f - hacker_manifesto_by_otalicus
5decea0e306f185bf988ac9934ec0e2c - reflections-triple-1920x1200
82bd8e1ad3df588eb0e0848c5f764812 - hacker_wallpaper_1600x900
0f4daba431a22c03f28977f087e4c695 - Zen
0c55cd3ebd2a847e10c20d86e80e6ceb - hacker_wallpaper_by_vanilla23-dot254
e5c1da0c2db3827d2bf81c306633cc56 - hacker_manifesto_1600x900
You can also call the script with the -execdir version within find as well, e.g.
$ find /home/david/img/wp/ -type f -name "*.jpg" -execdir \
/full/path/to/namemd5.sh '{}' \;
(note: the use of the /full/path to your helper script above)

How to find all .jpg file then execute md5sum then cut first 4 caracters:
find . -name '*.jpg' -exec md5sum {} \; | cut -b 1-4

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string