Given two directory trees, how can I find which filenames are the same, considering only filenames satisfying a condition? - linux

This answer tells me how to find the files with the same filename in two directories in bash:
diff -srq dir1/ dir2/ | grep identical
Now I want to consider only files which satisfy a condition. If I use ls E*, I get back the files starting with E. I want to do the same with the above command: give me the filenames which are the same in dir1/ and dir2/, but consider only those starting with E.
I tried the following:
diff -srq dir1/E* dir2/E* | grep identical
but it did not work, I got this output:
diff: extra operand '/home/pal/konkoly/c6/elesbe3/1/EPIC_212291374-c06-k2sc.dat.flag.spline'
diff: Try 'diff --help' for more information.
((/home/pal/konkoly/c6/elesbe3/1/EPIC_212291374-c06-k2sc.dat.flag.spline is a file in the so-called dir1, but EPIC_212291374-c06-k2sc.dat.flag.spline is not in the so-called dir2))
How can I solve this?
I tried doing it in the following way, based on this answer:
DIR1=$(ls dir1)
DIR2=$(ls dir2)
for i in $DIR1; do
    for j in $DIR2; do
        if [[ $i == $j ]]; then
            echo "$i == $j"
        fi
    done
done
It works as above, but if I write DIR1=$(ls path1/E*) and DIR2=$(ls path2/E*), it does not: I get no output.

This is untested, but I'd try something like:
comm -12 <(cd dir1 && ls E*) <(cd dir2 && ls E*)
Basic idea:
Generate a list of filenames in dir1 that satisfy our condition. This can be done with ls E* because we're only dealing with a flat list of files. For subdirectories and recursion we'd use find instead (e.g. find . -name 'E*' -type f).
Put the filenames in a canonical order (e.g. by sorting them). We don't have to do anything here because E* expands in sorted order anyway. With find we might have to pipe the output into sort first.
Do the same thing to dir2.
Only output lines that are common to both lists, which can be done with comm -12.
comm expects to be passed two filenames on the command line, so we use the <( ... ) bash feature to spawn a subprocess and connect its output to a named pipe; the name of the pipe can then be given to comm.
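Putting the pieces together, here is a runnable sketch of the whole recipe; the mktemp scratch directories and the E*.dat/X1.dat filenames are invented for the demo:

```shell
#!/bin/bash
# Build two scratch directories with partially overlapping names
# (stand-ins for the real dir1/ and dir2/).
tmp=$(mktemp -d)
mkdir "$tmp/dir1" "$tmp/dir2"
touch "$tmp/dir1/E1.dat" "$tmp/dir1/E2.dat" "$tmp/dir1/X1.dat"
touch "$tmp/dir2/E2.dat" "$tmp/dir2/E3.dat" "$tmp/dir2/X1.dat"
cd "$tmp"

# Names starting with E that appear in both directories:
comm -12 <(cd dir1 && ls E*) <(cd dir2 && ls E*)
# prints: E2.dat
```

For recursive trees, each `ls E*` would be replaced by `find . -name 'E*' -type f | sort` as described above.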

The accepted answer works fine, but if someone needs a Python implementation, this also works:
import glob

dir1withpath = glob.glob("path/to/dir1/E*")
dir2withpath = glob.glob("path/to/dir2/E*")

# keep only the basename (the part after the last "/")
dir1 = []
for path in dir1withpath:
    dir1.append(path.split("/")[-1])

dir2 = []
for path in dir2withpath:
    dir2.append(path.split("/")[-1])

for each1 in dir1:
    for each2 in dir2:
        if each1 == each2:
            print(each1 + " is in both directories")

Related

Bash script that counts and prints out the files that start with a specific letter

How do I print out all the files of the current directory that start with the letter "k"? I also need to count these files.
I tried some methods but I only got errors or wrong outputs. I'm really stuck on this as a newbie in bash.
Try this Shellcheck-clean pure POSIX shell code:
count=0
for file in k*; do
    if [ -f "$file" ]; then
        printf '%s\n' "$file"
        count=$((count+1))
    fi
done
printf 'count=%d\n' "$count"
It works correctly (just prints count=0) when run in a directory that contains nothing starting with 'k'.
It doesn't count directories or other non-files (e.g. fifos).
It counts symlinks to files, but not broken symlinks or symlinks to non-files.
It works with 'bash' and 'dash', and should work with any POSIX-compliant shell.
Here is a pure Bash solution.
files=(k*)
printf "%s\n" "${files[@]}"
echo "${#files[@]} files total"
The shell expands the wildcard k* into the array, thus populating it with a list of matching files. We then print out the array's elements, and their count.
The use of an array avoids the various problems with metacharacters in file names (see e.g. https://mywiki.wooledge.org/BashFAQ/020), though the syntax is slightly hard on the eyes.
As remarked by pjh, this will include any matching directories in the count, and fail in odd ways if there are no matches (unless you enable the nullglob shell option). If avoiding directories is important, you basically have to put the directories into a separate array and exclude them.
To repeat what Dominique also said, avoid parsing ls output.
Demo of this and various other candidate solutions:
https://ideone.com/XxwTxB
To start with: never parse the output of the ls command; use find instead.
Since find descends into all subdirectories by default, you may need to limit that with the -maxdepth switch, here with value 1.
To count the results, count the number of lines in the output (find prints one result per line). Counting lines is done with the wc -l command.
So, this comes down to the following command:
find ./ -maxdepth 1 -type f -name "k*" | wc -l
Have fun!
This should work as well:
VAR="k"
COUNT=$(ls -p "${VAR}"* | grep -v ":" | wc -w)
echo -e "Total number of files: ${COUNT}\n" 1>&2
echo -e "Files that begin with ${VAR} are:\n$(ls -p "${VAR}"* | grep -v ":")" 1>&2

Batch copy and rename multiple files in the same directory

I have 20 files like:
01a_AAA_qwe.sh
01b_AAA_asd.sh
01c_AAA_zxc.sh
01d_AAA_rty.sh
...
Files have a similar format in their names: they begin with 01 and match the pattern 01*AAA*.sh.
I wish to copy and rename files in the same directory, changing the number 01 to 02, 03, 04, and 05:
02a_AAA_qwe.sh
02b_AAA_asd.sh
02c_AAA_zxc.sh
02d_AAA_rty.sh
...
03a_AAA_qwe.sh
03b_AAA_asd.sh
03c_AAA_zxc.sh
03d_AAA_rty.sh
...
04a_AAA_qwe.sh
04b_AAA_asd.sh
04c_AAA_zxc.sh
04d_AAA_rty.sh
...
05a_AAA_qwe.sh
05b_AAA_asd.sh
05c_AAA_zxc.sh
05d_AAA_rty.sh
...
I wish to copy the 20 01*.sh files to 02*.sh, 03*.sh, 04*.sh, and 05*.sh. This will bring the total number of files in the folder to 100.
I'm really not sure how I can achieve this. I was trying to use a for loop in a bash script, but I'm not even sure what I should use as the loop index.
for i in {1..4}; do
cp 0${i}*.sh 0${i+1}*.sh
done
does not work.
There are going to be a lot of ways to slice-n-dice this one ...
One idea using a for loop, printf + brace expansion, and xargs:
for f in 01*.sh
do
    printf "%s\n" {02..05} | xargs -r -I PFX cp "${f}" "PFX${f:2}"
done
The same thing but saving the printf in a variable up front:
printf -v prefixes "%s\n" {02..05}
for f in 01*.sh
do
    <<< "${prefixes}" xargs -r -I PFX cp "${f}" "PFX${f:2}"
done
Another idea using a pair of for loops:
for f in 01*.sh
do
    for i in {02..05}
    do
        cp "${f}" "${i}${f:2}"
    done
done
Starting with:
$ ls -1 0*.sh
01a_AAA_qwe.sh
01b_AAA_asd.sh
01c_AAA_zxc.sh
01d_AAA_rty.sh
All of the proposed code snippets leave us with:
$ ls -1 0*.sh
01a_AAA_qwe.sh
01b_AAA_asd.sh
01c_AAA_zxc.sh
01d_AAA_rty.sh

02a_AAA_qwe.sh
02b_AAA_asd.sh
02c_AAA_zxc.sh
02d_AAA_rty.sh

03a_AAA_qwe.sh
03b_AAA_asd.sh
03c_AAA_zxc.sh
03d_AAA_rty.sh

04a_AAA_qwe.sh
04b_AAA_asd.sh
04c_AAA_zxc.sh
04d_AAA_rty.sh

05a_AAA_qwe.sh
05b_AAA_asd.sh
05c_AAA_zxc.sh
05d_AAA_rty.sh
NOTE: blank lines added for readability
You can't do multiple copies in a single cp command, except when copying a bunch of files into a single target directory; cp will not do the name mapping automatically. Wildcards are expanded by the shell and are never seen by the commands themselves, so it's not possible for cp to do pattern matching like this.
To add 1 to a variable, use $((i+1)).
You can use the shell substring expansion operator to get the part of the filename after the first two characters.
for i in {1..4}; do
    for file in 0${i}*.sh; do
        fileend=${file:2}
        cp "$file" "0$((i+1))$fileend"
    done
done

Rename files into numbers, starting with a specific number

I want to rename all files in a directory to be sequential numbers:
1.txt
2.txt
3.txt
and so on...
Here's the code I'm currently using:
ls | cat -n | while read n f; do mv "$f" "$n.txt"; done
The code does work, but I need to start with a specific number. For example, I may want to start with the number 49 instead of the number 1.
Is there any way to do this in terminal (on a Mac)?
You could use something like nl with the -v option to set a starting line number other than 1, but instead, you can just use Bash features:
i=1
for f in *; do
    [[ -f $f ]] && mv "$f" $((i++)).txt
done
where i is set to the initial value you want.
This also avoids parsing the output of ls, which is best avoided. Instead, I use a glob (*) and a test (-f) to make sure that I'm actually manipulating files and not directories.
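For example, to start the numbering at 49 (a self-contained demo; the scratch directory and file names alpha/beta/gamma are invented):

```shell
#!/bin/bash
# Rename three files to 49.txt, 50.txt, 51.txt.
tmp=$(mktemp -d)
cd "$tmp"
touch alpha beta gamma

i=49
for f in *; do
    # the glob is expanded once, before the loop runs,
    # so the newly created *.txt files are not re-processed
    [[ -f $f ]] && mv "$f" $((i++)).txt
done

ls
# the directory now contains 49.txt, 50.txt and 51.txt
```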

How to escape square brackets in a ls output

I'm experiencing some problems escaping square brackets in file names.
I need to compare two lists: the ls output is the first, and the file ARQ02 is the second.
#!/bin/bash
exec 3< <(ls /home/lint)
while read arq <&3; do
    var=`grep -e "$arq" ARQ02`
    if [ "$?" -ne 0 ] ; then
        echo "$arq" >> result
    fi
done
exec 3<&-
Sorry for my bad english.
Your immediate problem is that you must instruct grep to interpret the search term as a literal rather than a regular expression, using the -F option:
var=$(grep -Fe "$arq" ARQ02)
That way, any regex metacharacters that happen to be in the output from ls /home/lint - such as [ and ] - will still be treated as literals and won't break the grep invocation.
That said, it looks like your command could be streamlined, such as by using the output from ls /home/lint directly as the set of search strings to pass to grep at once, using the -f option:
grep -Ff <(ls /home/lint) ARQ02 > result
<(...) is a so-called process substitution, which, simply put, presents the output from a command as if it were a (temporary) file, which is what -f expects: a file containing the search terms for grep.
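A quick way to see both ideas at work; the directory name, file names, and ARQ02 contents below are invented for the demo, including a name containing square brackets:

```shell
#!/bin/bash
tmp=$(mktemp -d)
cd "$tmp"
mkdir lint
touch "lint/file[1].txt" "lint/plain.txt"
printf '%s\n' "file[1].txt" "other.txt" > ARQ02

# -F: treat each line from ls as a literal string, not a regex,
# so the [ and ] do not break the match.
grep -Ff <(ls lint) ARQ02
# prints: file[1].txt
```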
Alternatively, if:
the lines of ARQ02 contain only filenames that fully match (some of) the filenames in the output from ls /home/lint, and
you don't mind sorting or want to sort the matches stored in result,
consider HuStmpHrrr's helpful answer.
I have to assume my interpretation is correct. Based on that, I can offer a one-liner that solves your problem. There are two assumptions I need to make here: your file names don't contain carriage returns, and you are using a modern bash:
comm -23 <(printf "%s\n" * | sort) <(sort ARQ02)
In bash, <(...) spawns a subshell and presents its stdout as a file. comm is the command to compare two sorted input streams.
To explain in detail:
comm
-23 # suppress files unique in ARQ02 and files in common
<(printf "%s\n" * | # print all the files in local folder with new line breaker
sort) # sort them
<(sort ARQ02)
It's necessary to sort because comm compares its inputs incrementally and assumes they are already sorted.

How to tell how many files match description with * in unix

Pretty simple question: say I have a set of files:
a1.txt
a2.txt
a3.txt
b1.txt
And I use the following command:
ls a*.txt
It will return:
a1.txt a2.txt a3.txt
Is there a way in a bash script to tell how many results will be returned when using the * pattern? In the above example, if I were to use a*.txt the answer should be 3, and if I used *1.txt the answer should be 2.
Comment on using ls:
I see all the other answers attempt this by parsing the output of ls. This is very unpredictable because it breaks when you have file names with "unusual characters" (e.g. spaces).
Another pitfall is that it is ls-implementation dependent: a particular implementation might format its output differently.
There is a very nice discussion on the pitfalls of parsing ls output on the bash wiki maintained by Greg Wooledge.
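To make the pitfall concrete, here is a small demonstration (scratch directory and file names invented): a filename containing a newline yields one extra line in the ls output, while a glob-populated array counts correctly:

```shell
#!/bin/bash
tmp=$(mktemp -d)
cd "$tmp"
touch "a1.txt" $'a2\n.txt'   # the second name contains a newline

ls a*.txt | wc -l            # reports 3 lines for 2 files
files=(a*.txt)
echo "${#files[@]}"          # the array correctly reports 2
```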
Solution using bash arrays
For the above reasons, using bash syntax would be the more reliable option. You can use a glob to populate a bash array with all the matching file names. Then you can ask bash the length of the array to get the number of matches. The following snippet should work.
files=(a*.txt) && echo "${#files[@]}"
To save the number of matches in a variable, you can do:
files=(a*.txt)
count="${#files[@]}"
One more advantage of this method is you now also have the matching files in an array which you can iterate over.
Note: Although I keep using bash syntax above, arrays are not part of POSIX sh, so this solution applies to shells with array support, such as bash, ksh and zsh.
You can't know ahead of time, but you can count how many results are returned. I.e.
ls -l *.txt | wc -l
ls -l will display the directory entries matching the specified wildcard, wc -l will give you the count.
You can save the value of this command in a shell variable with either
num=$(ls -l *.txt | wc -l)
or
num=`ls -l *.txt | wc -l`
and then use $num to access it. The first form is preferred.
You can use ls in combination with wc:
ls a*.txt | wc -l
The ls command lists the matching files one per line, and wc -l counts the number of lines.
I like suvayu's answer, but there's no need to use an array:
count() { echo $#; }
count *
In order to count files that might have unpredictable names, e.g. containing new-lines, non-printable characters etc., I would use the -print0 option of find and awk with RS='\0':
num=$(find . -maxdepth 1 -print0 | awk -v RS='\0' 'END { print NR }')
Adjust the options to find to refine the count, e.g. if the criteria is files starting with a lower-case a with .txt extension in the current directory, use:
find . -maxdepth 1 -type f -name 'a*.txt' -print0
