Comparison script help - linux

I'm trying to write a Bash script that will go through a set of directories for a cache and make some comparisons on the contents. (I want to find the two that have the smallest differences for purposes of a project I'm working on).
The structure is that there is a root directory; two subdirectories after that; under that up to 52 directories (a AA b BB etc); and under each of those a variable number of directories where the contents actually are. Basically:
root >> a/b >> a/AA/b/BB/.../z/ZZ >> <some hex-named directory>
So I need to get to that last level, then run diff on the file in that directory (the contents are always named identically) and all the other cached files and figure out what the most similar files are.
The two directories at the top never change name, so that's easy. The directories under those follow a set format (they fill sequentially starting with 'a' and 'AA' up through 'z' and 'ZZ'), so I could just hard code an array for that. The best way I can think to do the last level is to run 'ls > dirList', then read dirList into an array, and use that to go into the directories, and run diff through a loop on every other cache thing using the same algorithm (yes, run time is going to be awful, but it will save a tremendous amount of time in the long run).
Is this a reasonable approach? Is there a better, or more efficient way?
Also, is there a way to get diff to count the number of lines that are different?
I know this is a bit long, but any help would be greatly appreciated.
Thanks!

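Assuming the two directories in your root directory (a and b) are the ones to compare, I would try something like this:
min_diff=999999 # larger than any expected line count
file2remember=''
cd a || exit 1
# Read from process substitution rather than a pipe, so the loop runs in
# the current shell and the variables are still set after "done".
while IFS= read -r f
do
[ -f "../b/$f" ] || continue # skip files with no counterpart in b
n=$(diff "$f" "../b/$f" | wc -l)
if [ "$n" -lt "$min_diff" ]
then min_diff=$n ; file2remember="$f"
fi
done < <(find . -type f)
echo "$file2remember"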
NB: I do not have a linux or unix box to test that.
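For the original goal (finding the two most similar cached files across all the hex-named directories), here is a rough sketch rather than a tested solution. It assumes a made-up layout of root/<a|b>/<letter dir>/<hex dir>/data, where "data" stands in for the identically named file, and it counts only the lines diff marks with < or >, which also answers the question about counting differing lines. A glob fills the array, so there is no need for 'ls > dirList'.

```shell
#!/bin/bash
# Hypothetical demo tree: root/<a|b>/<letter dir>/<hex dir>/data
root=$(mktemp -d)
trap 'rm -rf "$root"' EXIT
mkdir -p "$root/a/AA/1f" "$root/a/BB/2e" "$root/b/AA/3d"
printf '1\n2\n3\n4\n' > "$root/a/AA/1f/data"
printf '1\n2\n3\nX\n' > "$root/a/BB/2e/data"
printf '9\n8\n7\n6\n' > "$root/b/AA/3d/data"

# Each * matches one directory level; the shell returns the list sorted.
files=("$root"/*/*/*/data)

best=-1 best_a='' best_b=''
for ((i = 0; i < ${#files[@]}; i++)); do
    for ((j = i + 1; j < ${#files[@]}; j++)); do
        # Count lines diff marks as removed (<) or added (>).
        # "|| true" keeps the assignment safe when grep finds no matches.
        n=$(diff -- "${files[i]}" "${files[j]}" | grep -c '^[<>]' || true)
        if ((best < 0 || n < best)); then
            best=$n best_a=${files[i]} best_b=${files[j]}
        fi
    done
done
echo "$best differing lines: $best_a $best_b"
```

The nested loop compares every pair once (j starts at i + 1), so the run time still grows with the square of the number of files, as the question anticipates.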

Related

Renaming All Files in a Directory

I split a large text file into 60 chunks, which are named xaa, xab, xac,...xcg. I want to rename these files so that they all end with .txt
How can I do this from the linux command line?
Looked in the split command for the ability to customize the filenames. Looked on Stack Overflow for other solutions but the ones I've come across are all too specific to the OP's situation.
Assuming that your shell is the default Bash:
for f in x??; do mv "$f" "$f.txt"; done
If you want to be more specific, you could say x[abc][a-z] instead of x??.
This is good enough for a one-liner. In a script you would want to check that "$f" exists before trying to rename it.
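A minimal sketch of the script version with that existence check. The scratch directory and the three chunk names are only there to make the example self-contained.

```shell
#!/bin/bash
# Scratch directory with three hypothetical chunk files.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
touch "$dir/xaa" "$dir/xab" "$dir/xac"

for f in "$dir"/x??; do
    # If nothing matches, the pattern is left as the literal string "x??",
    # so skip anything that is not an existing file.
    [ -e "$f" ] || continue
    mv -- "$f" "$f.txt"
done
ls "$dir"
```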

How to rename files in bash to increase number in name?

I have a few thousand files named as follows:
Cyprinus_carpio_600_nanopore_trim_reads.fasta
Cyprinus_carpio_700_nanopore_trim_reads.fasta
Cyprinus_carpio_800_nanopore_trim_reads.fasta
Cyprinus_carpio_900_nanopore_trim_reads.fasta
Vibrio_cholerae_3900_nanopore_trim_reads.fasta
There are 80 variations of the first two words (80 different species). I would like to rename all of these files so that the number is increased by 100. For example:
Vibrio_cholerae_3900_nanopore_trim_reads.fasta
would become
Vibrio_cholerae_4000_nanopore_trim_reads.fasta
or
Cyprinus_carpio_300_nanopore_trim_reads.fasta
would become
Cyprinus_carpio_400_nanopore_trim_reads.fasta
Unfortunately I can't work out how to rename them. I've had some luck following the solutions at https://unix.stackexchange.com/questions/40523/rename-files-by-incrementing-a-number-within-the-filename
But I can't get it to work for a number inside the name. I'm running Ubuntu 18.04, if that helps.
If you can get hold of the Perl-flavoured version of rename, that is simple like this:
rename -n 's/(\d+)/$1 + 100/e' *fasta
Sample Output
'Ciprianus_maximus_11_fred.fasta' would be renamed to 'Ciprianus_maximus_111_fred.fasta'
'Ciprianus_maximus_300_fred.fasta' would be renamed to 'Ciprianus_maximus_400_fred.fasta'
'Ciprianus_maximus_3900_fred.fasta' would be renamed to 'Ciprianus_maximus_4000_fred.fasta'
If you can't read Perl, that says... "Do a single substitution as follows. Wherever you see a bunch of digits next to each other in a row (\d+), remember them (because I put that in parentheses), and then replace them with the evaluated expression of that bunch of digits ($1) plus 100.".
Remove the -n if the dry-run looks correct. The only "tricky part" is the use of e at the end of the substitution which means evaluate the expression in the substitution - or I call it a "calculated replacement".
If there is only one number in the filename, the following lines should help you resolve your issue:
filename="Vibrio_cholerae_3900_nanopore_trim_reads.fasta"
var=$(echo $filename | grep -oP '\d+')
echo ${filename/${var}/$((var+100))}
Instead of echoing the changed file name, you can take it into a variable and use mv command to rename it
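Roughly, that could look like the sketch below. It swaps grep -oP for grep -oE '[0-9]+' so that it does not depend on PCRE support, and it uses a scratch directory with a single sample file; both are choices made for this example only.

```shell
#!/bin/bash
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
touch "$dir/Vibrio_cholerae_3900_nanopore_trim_reads.fasta"

for path in "$dir"/*.fasta; do
    [ -e "$path" ] || continue
    filename=${path##*/}                                  # strip the directory part
    var=$(printf '%s\n' "$filename" | grep -oE '[0-9]+')  # assumes exactly one number
    newname=${filename/$var/$((10#$var + 100))}           # 10# forces base 10
    mv -n -- "$path" "$dir/$newname"                      # -n: never overwrite
done
ls "$dir"
```

With many files, the order of the renames matters: 600 cannot become 700 while the old 700 still exists, and mv -n will silently skip such a rename instead of overwriting.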
Because renaming in increasing order can cause filename conflicts, I first thought of reversing the order, but conflicts are still possible because the alphabetical (standard) sort differs from the numerical sort.
So how about a two-step solution: in the first pass, an escape character (or any character that does not appear in the filenames) is inserted into the new filename, and in the second pass it is removed.
#!/bin/bash
esc=$'\033' # ESC character
# 1st pass: increase the number by 100 and insert a ESC before it
for f in *.fasta; do
num=${f//[^0-9]/}
num2=$((num + 100))
f2=${f/$num/$esc$num2}
mv "$f" "$f2"
done
# 2nd pass: remove the ESC from the filename
for f in *.fasta; do
f2=${f/$esc/}
mv "$f" "$f2"
done
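A possible single-pass alternative, offered as a sketch and not as part of the answer above: sort the names by the number itself, highest first, so that each target name has already been vacated. It assumes the number is always the third underscore-separated field and the only number in the name, and that sort accepts -t and -k with the n and r modifiers.

```shell
#!/bin/bash
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
for n in 600 700 800 900; do
    touch "$dir/Cyprinus_carpio_${n}_nanopore_trim_reads.fasta"
done

# Highest number first: 900 moves to 1000 before 800 moves to 900, and so on.
while IFS= read -r name; do
    num=${name//[^0-9]/}                 # assumes one number per name
    new=${name/$num/$((10#$num + 100))}  # 10# forces base 10
    mv -n -- "$dir/$name" "$dir/$new"    # -n: never overwrite
done < <(cd "$dir" && printf '%s\n' *.fasta | sort -t_ -k3,3nr)

remaining=("$dir"/*.fasta)
printf '%s\n' "${remaining[@]##*/}"
```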
Mark's perl-rename solution looks great but you should apply it twice with a bump of 50 to avoid name conflict. If you can't find this flavor of rename you could try my rene.py (https://rene-file-renamer.sourceforge.io) for which the command would be (also applied twice) rene *_*_*_* *_*_?_* B/50. rene would be a little easier because it automatically shows you the changes and asks whether you want to make them and it has an undo if you change your mind.

How to split large file to small files with prefix in Linux/Bash

I have a file in Linux called test. Now I want to split the test into say 10 small files.
The test file has more than 1000 table names. I want the small files to have equal no of lines, the last file might have the same no of table names or not.
What I want is can we add a prefix to the split files while invoking the split command in the Linux terminal.
Sample:
test_xaa test_xab test_xac and so on..............
Is this possible in Linux.
I was able to solve my question with the following statement
split -l $(($(wc -l < test.txt )/10 + 1)) test.txt test_x
With this I was able to get the desired result
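To see what that command produces, here is a small self-contained run. The 1005-line input generated with seq is a stand-in for the real list of table names.

```shell
#!/bin/bash
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
seq 1 1005 > "$dir/test.txt"          # stand-in for the table-name list

# Lines per piece: total / 10, plus 1 so the remainder never needs an 11th file.
lines=$(( $(wc -l < "$dir/test.txt") / 10 + 1 ))
split -l "$lines" "$dir/test.txt" "$dir/test_x"

parts=("$dir"/test_x*)
echo "${#parts[@]} pieces of up to $lines lines each"
```

The last operand of split is the output prefix, so the pieces come out as test_xaa, test_xab and so on, with the final piece holding whatever lines are left over.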
I would've sworn split did this on its own, but to my surprise, it does not.
To get your prefix, try something like this:
for x in /path/to/your/x*; do
mv "$x" "$(dirname "$x")/your_prefix_$(basename "$x")" # prefix the file name, not the whole path
done

Using diff for two files and send by email [closed]

I have files like below. I use crontab every 5 min to check the files to see if the system's added one file, for example like this: AIR_2015xxxxT0yyyyyyyy.cfg. Then I need to use the diff command automatically between the last one and before the last one.
AIR_20151021T163514000.cfg
AIR_20151026T103845000.cfg
AIR_2015xxxxT0yyyyyyyy.cfg
I want to do this in a script like the one below:
#!/bin/bash
/var/opt/fds/
diff AIR_2015xxxxT0yyyyyyyy.cfg AIR_20151026T103845000.cfg > Test.txt
body(){
cat body.txt
}
(echo -e "$(body)") | -a Test.txt mailx -s 'Comparison' user#email.com
Given a list of files in the directory /var/opt/fds with names in the format:
AIR_YYYYmmddTHHMMSSfff.cfg
where the letter Y represents digits for the year, m for month, d for day, H for hour, M for minute, S for second, and f for fraction (milliseconds), then you need to establish the two most recent files in the directory to compare them.
One way to do this is:
cd /var/opt/fds || exit 1
old=
new=
for file in AIR_20[0-9][0-9]????T?????????.cfg
do
old=$new
new=$file
done
if [ -n "$old" ] && [ -n "$new" ]
then
diff "$old" "$new" > test.txt
mailx -a test.txt -s 'Comparison' user#example.com < body.txt
fi
Note that if the new file has a name containing letters x and y as shown in the question and comments, it will be listed after the names containing the time stamp as digits, so it will be picked up as the new file. It also assumes permission to write in the /var/opt/fds directory, and that the mail body file is present in that directory too. Those assumptions can be trivially fixed if necessary. The test.txt file should be deleted after it is sent, too, and you could check that it is non-empty before sending the email (just in case the two most recent files are in fact identical). You could embed a time-stamp in the generated file name containing the diffs instead of using test.txt:
output="diff.$(date +'%Y%m%dT%H%M%S000').txt"
and then use $output in place of test.txt.
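Putting those suggestions together (time-stamped output file, non-empty check, cleanup) might look like the sketch below. The two sample .cfg files live in a scratch directory, and the mailx call is left as a comment because no recipient or body file exists here; the variable sent is only a marker for this example.

```shell
#!/bin/bash
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf 'limit=1\n' > "$dir/AIR_20151021T163514000.cfg"
printf 'limit=2\n' > "$dir/AIR_20151026T103845000.cfg"
old="$dir/AIR_20151021T163514000.cfg"
new="$dir/AIR_20151026T103845000.cfg"

output="$dir/diff.$(date +'%Y%m%dT%H%M%S000').txt"
# diff exits with status 1 when the files differ, which is not an error here.
diff -u "$old" "$new" > "$output" || true

sent=no
if [ -s "$output" ]; then            # -s: file exists and is not empty
    # In the real script:
    # mailx -a "$output" -s 'Comparison' <recipient> < body.txt
    echo "would mail $output"
    sent=yes
fi
rm -f "$output"                      # delete the diff once it has been handled
```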
The test ensures that there was both an old and a new name. The pattern match is sloppier than it could be, but using [0-9] or an appropriate subrange ([01], [0-3], [0-2], [0-5]) for the question marks makes the pattern unreadably long:
for file in AIR_20[0-9][0-9][01][0-9][0-3][0-9]T[0-2][0-9][0-5][0-9][0-5][0-9][0-9][0-9][0-9].cfg
It also probably provides very little extra in the way of protection. Of course, as shown, it imposes a Y2.1K crisis on the system, not that it is hard to fix that. You could also cut down the range of valid dates by basing it on today's date, but beware of the end of the year, etc. You might decide you only need entries from the last month or so.
Using globbing is generally better than trying to parse ls or find output. In this context, where the file names have a restricted set of characters in the name (no newlines, no blanks or tabs, no quotes, no dollar signs, etc), it is feasible to use either find or ls, but if you have to deal with arbitrary names created by random end users, those tools are not suitable. (The ls command does all sorts of weird stuff with weird names and basically is hard to use reliably in the face of user cussedness. The find command and its -print0 option can be used, especially if you have a sort that recognizes -z to work with null-terminated 'lines' and an xargs that supports -0 to handle such lines too, but you have to be very careful.)
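For comparison, a null-terminated variant could be sketched as follows. It assumes a sort that supports -z, and it reads the names back with Bash's read -d '' instead of xargs -0; the three sample files are invented for the demo.

```shell
#!/bin/bash
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
touch "$dir/AIR_20151021T163514000.cfg" \
      "$dir/AIR_20151026T103845000.cfg" \
      "$dir/AIR_20151101T000000000.cfg"

# Collect the names as NUL-terminated records, so any legal file name survives.
files=()
while IFS= read -r -d '' f; do
    files+=("$f")
done < <(find "$dir" -name 'AIR_*.cfg' -print0 | sort -z)

count=${#files[@]}
if ((count >= 2)); then
    old=${files[count-2]}            # second most recent by name
    new=${files[count-1]}            # most recent by name
    echo "compare $old with $new"
fi
```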
Note that this scheme does not keep a record of the last file analyzed (so if no new files appear for an hour, you might send a dozen copies of the same differences), nor does it directly report on the file names (but using diff -u or diff -c would include the file names being diffed in the output). Again, these issues can be worked around if that's appropriate (and it probably is). Keeping the record of which files have been compared is probably the hardest job; even that's not too bad:
echo "$old" "$new" >> reported.diffs
to record what's been processed; then
if grep -qxF "$old $new" reported.diffs # -F: literal text, -x: whole line
then : Already processed
else : Process $old and $new
fi
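A small sketch of how the two fragments fit together. The loop simulates three cron runs that see the same pair of files; the log lives in a scratch directory, and the counter runs exists only to show that the pair is processed once.

```shell
#!/bin/bash
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
log="$dir/reported.diffs"
old=AIR_20151021T163514000.cfg
new=AIR_20151026T103845000.cfg

runs=0
for attempt in 1 2 3; do             # simulate three cron runs
    # -x matches the whole line, -F treats the names as literal text;
    # errors are hidden because the log does not exist on the first run.
    if grep -qxF -- "$old $new" "$log" 2>/dev/null; then
        : # already processed
    else
        runs=$((runs + 1))           # the diff and mail steps would go here
        echo "$old $new" >> "$log"
    fi
done
echo "processed $runs time(s)"
```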

Compare 2 files with shell script

I was trying to find the way for knowing if two files are the same, and found this post...
Parsing result of Diff in Shell Script
I used the code in the first answer, but I think it's not working, or at least I can't get it to work properly...
I even tried to make a copy of a file and compare both (copy and original), and I still get the answer as if they were different, when they shouldn't be.
Could someone give me a hand, or explain what's happening?
Thanks so much;
peixe
Are you trying to compare if two files have the same content, or are you trying to find if they are the same file (two hard links)?
If you are just comparing two files, then try:
diff "$source_file" "$dest_file" # without -q
or
cmp "$source_file" "$dest_file" # without -s
in order to see the supposed differences.
You can also try md5sum:
md5sum "$source_file" "$dest_file"
If both files produce the same checksum, their contents are almost certainly identical (an accidental MD5 collision is extremely unlikely).
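In a script, the exit status is usually easier to test than the printed output. A minimal sketch using cmp -s (silent mode), with file names made up for the demo and a small helper, same_content, that is not part of any tool:

```shell
#!/bin/bash
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf 'hello\nworld\n' > "$dir/original"
cp "$dir/original" "$dir/copy"
printf 'hello\nthere\n' > "$dir/other"

# cmp -s prints nothing; its exit status is 0 only if the contents match.
same_content() { cmp -s -- "$1" "$2"; }

if same_content "$dir/original" "$dir/copy"; then r1=same; else r1=different; fi
if same_content "$dir/original" "$dir/other"; then r2=same; else r2=different; fi
echo "copy: $r1, other: $r2"
```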
comm is a useful tool for comparing files.
The comm utility will read file1 and file2, which should be
ordered in the current collating sequence, and produce three
text columns as output: lines only in file1; lines only in
file2; and lines in both files.
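A short illustration of those three columns, using two small sorted files invented for the example. The options -1, -2 and -3 suppress the corresponding column, so -23 leaves only the lines unique to the first file.

```shell
#!/bin/bash
export LC_ALL=C                      # fix the collating sequence for the demo
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf 'apple\nbanana\ncherry\n' > "$dir/file1"   # already sorted
printf 'banana\ncherry\ndate\n'  > "$dir/file2"   # already sorted

only1=$(comm -23 "$dir/file1" "$dir/file2")   # lines only in file1
only2=$(comm -13 "$dir/file1" "$dir/file2")   # lines only in file2
both=$(comm -12 "$dir/file1" "$dir/file2")    # lines in both files
printf 'only in file1: %s\nonly in file2: %s\n' "$only1" "$only2"
```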
