In Linux, compare files byte by byte in 2 folders and look for duplicates [closed] - linux

I have several image files (.jpg, .png and more) containing photos in 2 directories. How can I compare the files in the 2 directories byte by byte using Bash under Linux so as to:
1) highlight duplicate files in both directories on stdout or in a file
2) delete only 1 of the duplicate files, e.g. the most recent one.

You probably don't need a byte-by-byte comparison. Calculating checksums and working with those is easier, and the probability of a collision is very low. It can also save time if you want to perform this multiple times on a slow disk.
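That said, if you really do want a literal byte-by-byte check for a single pair of files, cmp does exactly that; a minimal sketch (the file names are just placeholders):
cmp -s a/photo.jpg b/photo.jpg && echo "identical" || echo "different"
With -s, cmp stays silent and reports the result only through its exit status.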
I have two directories (a and b) with these files:
$ ls *
a:
agetty agetty-2 badblocks bridge btrfs btrfs-image lvreduce lvreduce-2 resize2fs
b:
agetty agetty-2 bridge
1 Calculate checksums first
I will calculate checksums for all files and sort them:
find a b -type f | xargs sha256sum | sort > cksums
You can also use md5sum and others. md5sum is faster than sha256sum, but the probability of a collision (two different files having the same checksum) is a bit higher (still negligible in practice, though).
Content of the file:
b1a58ac886f70cb65cc124bcc8e12a52659fbf5ce841956953d70d29b74869d7 a/resize2fs
c0e532634d14783bbd2ec1a1ed9bfc0b64da4a1efea2e9936fb97c6777ac1e10 a/btrfs-image
d00cdf58189e2171e3cb6610e6290c70ba03ecc0dc46b0570595d9187d769d2e a/btrfs
fadc2874feb053947ac1a4d8f14df58dabc093fa00b92f01125497ac9a171999 a/badblocks
424cf438ac1b6db45d1f25e237f28cc22bd7098a7fdf0f9c402744dc3f6ea6f2 a/agetty
424cf438ac1b6db45d1f25e237f28cc22bd7098a7fdf0f9c402744dc3f6ea6f2 a/agetty-2
424cf438ac1b6db45d1f25e237f28cc22bd7098a7fdf0f9c402744dc3f6ea6f2 b/agetty
424cf438ac1b6db45d1f25e237f28cc22bd7098a7fdf0f9c402744dc3f6ea6f2 b/agetty-2
424cf438ac1b6db45d1f25e237f28cc22bd7098a7fdf0f9c402744dc3f6ea6f2 b/bridge
7e177d31c45ab550b27ca743e4502cc4be519de4c75b2f479f427930bcb7c7bd a/bridge
9954909c3436bef767729b8f6034e5f12ef300fad16dc0e540bfa3c89c38b9c6 a/lvreduce
9954909c3436bef767729b8f6034e5f12ef300fad16dc0e540bfa3c89c38b9c6 a/lvreduce-2
You can even compare the files visually: files with the same content have the same checksum. Notice that a SHA-256 checksum is 64 hex digits/chars long (32 bytes).
2 Find repeated lines
cat cksums | uniq -Dw 64 | sed 's/^\S*\s*//'
Output:
a/agetty
a/agetty-2
b/agetty
b/agetty-2
b/bridge
a/lvreduce
a/lvreduce-2
You can also group files with the same contents (groups are separated by a blank line):
cat cksums | uniq --group -w 64 | sed 's/^\S*\s*//'
a/resize2fs

a/btrfs-image

a/btrfs

a/badblocks

a/agetty
a/agetty-2
b/agetty
b/agetty-2
b/bridge

a/bridge

a/lvreduce
a/lvreduce-2
3 List files for deletion
count=0
# --group=append also puts a blank line after the last group,
# so the final group is processed by the loop as well
uniq --group=append -w 64 cksums | sed 's/^\S*\s*//' | while read filename
do
    if [[ -z "$filename" ]]
    then
        if [[ 1 -lt "$count" ]]
        then
            echo "$prev"
        fi
        count=0
    else
        prev="$filename"
        ((count++))
    fi
done
Delete them by appending | xargs rm -v after the final done (fine as long as the file names contain no whitespace).
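If the file names may contain spaces, a safer variant of step 1 is the following sketch (GNU find and coreutils assumed):
find a b -type f -print0 | xargs -0 sha256sum | sort > cksums
Alternatively, if installing an extra tool is an option, fdupes automates the whole checksum-and-compare workflow: fdupes a b lists the groups of duplicate files in the two directories (add -r to recurse into subdirectories).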

Related

How to search in files and output discoveries only if they match both files [closed]

I want to search for a string (which I don't know unless I look inside the files) on the Linux command line.
Example:
A file 1 with text inside
A file 2 with text inside
In both files the word "Apple" exists.
I want to echo this word (which exists in both files) into a file or store it in a variable.
How is this possible?
You can get a list of all the unique words in a file using:
grep -o -E '\w+' filename | sort -u
where -E '\w+' matches words and -o outputs only the matching parts. We can then use the join command, which identifies matching lines in two sorted files, along with process substitution to pass in the results of our word finder:
join <(grep -o -E '\w+' filename1 | sort -u) <(grep -o -E '\w+' filename2 | sort -u)
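Applied to the question's example (hypothetical names file1.txt and file2.txt, both containing the word Apple), the result can be stored in a variable or written to a file:
common=$(join <(grep -o -E '\w+' file1.txt | sort -u) <(grep -o -E '\w+' file2.txt | sort -u))
echo "$common" > common_words.txt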
If there are no duplicate lines within a single file, you could use cat file1 file2 | sort | uniq -d
$ cat input_one.txt
FIREFOX a
FIREFOX b
Firefox a
firefox b
$ cat input_two.txt
CHROME a
FIREFOX a
EXPLORER a
$ while read line; do grep "$line" input_two.txt ; done < input_one.txt
FIREFOX a
Explanation:
while loops over every line of input_one.txt and stores the current line in the line variable.
For each line it then greps input_two.txt and prints the matching lines.
EDIT: See comments
You can write a script to handle this.
Loop over the words of file 1 and, inside the loop, use grep (for example grep -w "$word" file2) to look for each word in file 2.
If it matches, echo the word.
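A minimal sketch of that idea, assuming hypothetical file names file1 and file2:
for word in $(grep -o -E '\w+' file1 | sort -u)
do
    grep -qw "$word" file2 && echo "$word"
done
Here -q keeps grep quiet and -w restricts the match to whole words.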

Create multiple files in multiple directories [closed]

I've got a tree of folders like:
00 -- 0
-- 1
...
-- 9
...
99 -- 0
-- 1
...
-- 9
What is the simplest way to create in every single subfolder a file like:
/00/0/00_0.txt
and save some kind of data to every file?
I tried with touch and with a loop, but without success.
Any ideas how to make it very simple?
List all directories using globs. Modify the listed paths with sed so that 37/4 becomes 37/4/37_4.txt. Use touch to create empty files for all modified paths.
touch $(printf %s\\n */*/ | sed -E 's|(.*)/(.*)/|&\1_\2.txt|')
This works even if 12/3 was just a placeholder and your actual paths are something like abcdef/123. However it will fail when your paths contain any special symbols like whitespaces, *, or ?.
To handle arbitrary path names use the following command. It even supports linebreaks in path names.
mapfile -td '' a < <(printf %s\\0 */*/ | sed -Ez 's|(.*)/(.*)/|&\1_\2.txt|')
touch "${a[#]}"
You may use find and then run commands using -exec:
find . -mindepth 2 -maxdepth 2 -type d -exec bash -c 'f="$1"; touch "${f}/${f%/*}_${f##*/}.txt"' _ {} \;
The bash substitution ${f%/*}_${f##*/} replaces the last / in the directory path with _, which yields file names like 00_0.txt.
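If the directory names really are 00-99 with subfolders 0-9 as in the question, a plain nested loop with brace expansion also works. A sketch, run from the directory that contains 00 ... 99 (requires bash 4+ for the zero-padded {00..99}; replace "some data" with whatever content you need):
for i in {00..99}
do
    for j in {0..9}
    do
        echo "some data" > "$i/$j/${i}_${j}.txt"
    done
done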

Rm and Egrep -v combo

I want to remove all the logs except the current log and the log before that.
These log files are created every 20 minutes, so the file names are like
abc_23_19_10_3341.log
abc_23_19_30_3342.log
abc_23_19_50_3241.log
abc_23_20_10_3421.log
where 23 is today's date (it might include yesterday's date also),
19 is the hour (7 o'clock), and 10, 30, 50, 10 are the minutes.
In this case I want to keep abc_23_20_10_3421.log, which is the current log (currently being written), and abc_23_19_50_3241.log (the previous one),
and remove the rest.
I got it to work by creating a folder, putting the first files in that folder, removing the files and then deleting it. But that's too long...
I also tried this
files_nodelete=`ls -t | head -n 2 | tr '\n' '|'`
rm *.txt | egrep -v "$files_nodelete"
but it didn't work. However, if I put ls instead of rm, it works.
I am an amateur in Linux, so please suggest a simple idea or some logic. I tried xargs rm but it didn't work.
I also read about mtime, but it seems a bit complicated since I am new to Linux.
I am working on a Solaris system.
Try the logadm tool in Solaris; it might be the simplest way to rotate logs. If you just want to get things done, it will do the job.
http://docs.oracle.com/cd/E23823_01/html/816-5166/logadm-1m.html
If you want a solution similar to your attempt (but working), try this:
ls abc*.log | sort | head -n-2 | xargs rm
ls abc*.log: lists all files matching the pattern abc*.log
sort: sorts this list lexicographically (by name) from oldest to newest logfile
head -n-2: returns all but the last two entries in the list (you can give -n a negative count)
xargs rm: composes the rm command with the entries from stdin
If there are two or fewer files in the directory, this command will return an error like
rm: missing operand
and will not delete any files.
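If that error message is a concern, a guarded variant might look like this (a sketch; it assumes the log names contain no whitespace, which holds for the naming pattern shown above):
old=$(ls abc*.log | sort | head -n -2)
[ -n "$old" ] && rm $old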
It is usually not a good idea to use ls to point to files. Some files may cause havoc (files which have a [Newline] or a weird character in their name are the usual examples).
Using shell globs: here is an interesting way: we count the files newer than the one we are about to remove!
pattern='abc*.log'
for i in $pattern ; do
    [ -f "$i" ] || break ;
    # determine if this is the most recent file, in the current directory
    # [I add -maxdepth 1 to limit the find to only that directory, no subdirs]
    if [ $(find . -maxdepth 1 -name "$pattern" -type f -newer "$i" -print0 | tr -cd '\000' | tr '\000' '+' | wc -c) -gt 1 ];
    then
        # there are 2 files more recent than $i that match the pattern
        # we can delete $i
        echo rm "$i" # remove the echo only when you are 100% sure that you want to delete all those files!
    else
        echo "$i is one of the 2 most recent files matching '${pattern}', I keep it"
    fi
done
I only use the globbing mechanism to feed filenames to "find", and just use the terminating "0" of -print0 to count the output filenames (thus I have no problems with any special characters in those filenames, I just need to know how many files were output).
tr -cd "\000" keeps only the \000, i.e. the terminating NUL characters output by -print0. Then I translate each \000 to a single + character and count them with wc -c. If I see 0, "$i" was the most recent file. If I see 1, "$i" was the one just a bit older (so the find sees only the most recent one). And if I see more than 1, it means the 2 files (matching the pattern) that we want to keep are newer than "$i", so we can delete "$i".
I'm sure someone will step in with a better one, but the idea could be reused, I guess...
Thanks guys for all the answers.
I found my answer:
files=`ls -t *.txt | head -n 2 | tr '\n' '|' | rev |cut -c 2- |rev`
rm `ls -t | egrep -v "$files"`
Thank you for the help

What is cat for and what is it doing here? [closed]

I have this script I'm studying and I would like to know what cat is doing in this section.
if cat downloaded.txt | grep "$count" >/dev/null
then
    echo "File already downloaded!"
else
    echo $count >> downloaded.txt
    cat $count | egrep -o "http://server.*(png|jpg|gif)" | nice -n -20 wget --no-dns-cache -4 --tries=2 --keep-session-cookies --load-cookies=cookies.txt --referer=http://server.com/wallpaper/$number -i -
    rm $count
fi
Like most cats, this is a useless cat.
Instead of:
if cat downloaded.txt | grep "$count" >/dev/null
It could have been written:
if grep "$count" downloaded.txt > /dev/null
In fact, because you've eliminated the pipe, you've eliminated any ambiguity about which command's exit value the if statement is testing.
Most Unix cats you'll see are of the useless variety. However, people like cats almost as much as they like using a grep/awk pipe, or using multiple grep or sed commands instead of combining everything into a single command.
The cat command stands for concatenate, which means it lets you concatenate files. It was created to be used with the split command, which splits a file into multiple parts. This was useful if you had a really big file but had to put it on floppy drives that couldn't hold the entire file:
split -b140K -a4 my_really_big_file.txt my_smaller_files.txt.
Now, I'll have my_smaller_files.txt.aaaa and my_smaller_files.txt.aaab and so forth. I can put them on the floppies, and then on the other computer. (Heck, I might go all high tech and use UUCP on you!).
Once I get my files on the other computer, I can do this:
cat my_smaller_files.txt.* > my_really_big_file.txt
And, that's one cat that isn't useless.
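Putting that advice into practice, a minimal rewrite of the snippet from the question might look like this (a sketch; grep -q replaces the explicit >/dev/null redirection):
if grep -q "$count" downloaded.txt
then
    echo "File already downloaded!"
else
    echo "$count" >> downloaded.txt
    # ... the egrep/wget pipeline from the question goes here, unchanged ...
    rm "$count"
fi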
cat prints out the contents of the file with the given name (to the standard output or to wherever it's redirected). The result can be piped to some other command (in this case, (e)grep to find something in the file contents). Concretely, here it tries to download the images referenced in that file, then adds the name of the file to downloaded.txt in order to not process it again (this is what the check in if was about).
http://www.linfo.org/cat.html
"cat" is a unix command that reads the contents of one or more files sequentially and by default prints out the information the user console ("stdout" or standard output).
In this case cat is being used to read the contents of the file "downloaded.txt", the pipe "|" is redirecting/feeding its output to the grep program, which is searching for whatever is in the variable "$count" to be matched with.

Split files using tar, gz, zip, or bzip2 [closed]

I need to compress a large file of about 17-20 GB. I need to split it into several files of around 1GB per file.
I searched for a solution via Google and found ways using split and cat commands. But they did not work for large files at all. Also, they won't work in Windows; I need to extract it on a Windows machine.
You can use the split command with the -b option:
split -b 1024m file.tar.gz
It can be reassembled on a Windows machine using @Joshua's answer.
copy /b file1 + file2 + file3 + file4 filetogether
Edit: As @Charlie stated in the comment below, you might want to set a prefix explicitly because it will use x otherwise, which can be confusing.
split -b 1024m "file.tar.gz" "file.tar.gz.part-"
// Creates files: file.tar.gz.part-aa, file.tar.gz.part-ab, file.tar.gz.part-ac, ...
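On the Linux (or Cygwin) side the pieces can be put back together with cat; a quick sketch (the shell glob expands the -aa, -ab, ... suffixes in the right order):
cat file.tar.gz.part-* > file.tar.gz
tar xzf file.tar.gz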
Edit: Updating this post because the question is closed and the most effective solution is very close to the content of this answer:
# create archives
$ tar cz my_large_file_1 my_large_file_2 | split -b 1024MiB - myfiles_split.tgz_
# uncompress
$ cat myfiles_split.tgz_* | tar xz
This solution avoids the need for an intermediate large file when (de)compressing. Use the tar -C option to put the resulting files in a different directory. By the way, if the archive consists of only a single file, tar can be avoided and gzip used on its own:
# create archives
$ gzip -c my_large_file | split -b 1024MiB - myfile_split.gz_
# uncompress
$ cat myfile_split.gz_* | gunzip -c > my_large_file
For Windows you can download ported versions of the same commands or use Cygwin.
If you are splitting from Linux, you can still reassemble in Windows.
copy /b file1 + file2 + file3 + file4 filetogether
Use tar to split into multiple archives; there are plenty of programs that will work with tar files on Windows, including Cygwin.
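A sketch of what that could look like with GNU tar's multi-volume mode (note that -M cannot be combined with compression, -L is given in units of 1024 bytes, and the volume names here are placeholders):
tar -c -M -L 1048576 -f part1.tar -f part2.tar -f part3.tar my_large_file
tar -x -M -f part1.tar -f part2.tar -f part3.tar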
Tested code; it first creates a single archive file, then splits it:
gzip -c file.orig > file.gz
CHUNKSIZE=1073741824
PARTCNT=$[$(stat -c%s file.gz) / $CHUNKSIZE]
# the remainder is taken care of, for example for
# 1 GiB + 1 bytes PARTCNT is 1 and seq 0 $PARTCNT covers
# all of file
for n in `seq 0 $PARTCNT`
do
    dd if=file.gz of=part.$n bs=$CHUNKSIZE skip=$n count=1
done
This variant omits creating a single archive file and goes straight to creating parts:
gzip -c file.orig |
( CHUNKSIZE=1073741824;
  i=0;
  while true; do
      i=$[i+1];
      head -c "$CHUNKSIZE" > "part.$i";
      [ "$CHUNKSIZE" -eq $(stat -c%s "part.$i") ] || break;
  done; )
In this variant, if the archive's file size is divisible by $CHUNKSIZE, then the last partial file will have file size 0 bytes.
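To put the pieces back together, concatenate them in numeric order and decompress; a sketch matching the part.N names produced above (it assumes no other part.* files are lying around, and file.restored is just a placeholder name):
cat $(ls part.* | sort -t. -k2 -n) | gunzip -c > file.restored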
