Comparing list of directories to list in a file and writing another file for comparison [duplicate] - linux

Ok, I have two related lists on my linux box in text files:
/tmp/oldList
/tmp/newList
I need to compare these lists to see what lines got added and what lines got removed. I then need to loop over these lines and perform actions on them based on whether they were added or removed.
How do I do this in bash?

Use the comm(1) command to compare the two files. They both need to be sorted, which you can do beforehand if they are large, or you can do it inline with bash process substitution.
comm can take a combination of the flags -1, -2 and -3 indicating which file to suppress lines from (unique to file 1, unique to file 2 or common to both).
To get the lines only in the old file:
comm -23 <(sort /tmp/oldList) <(sort /tmp/newList)
To get the lines only in the new file:
comm -13 <(sort /tmp/oldList) <(sort /tmp/newList)
You can feed that into a while read loop to process each line:
while read old ; do
...do stuff with $old
done < <(comm -23 <(sort /tmp/oldList) <(sort /tmp/newList))
and similarly for the new lines.
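Putting both directions together, a minimal sketch (handle_removed and handle_added are placeholder functions for whatever actions you need to perform):
handle_removed() { echo "removed: $1"; }   # placeholder action
handle_added()   { echo "added: $1"; }     # placeholder action
while IFS= read -r line; do handle_removed "$line"; done < <(comm -23 <(sort /tmp/oldList) <(sort /tmp/newList))
while IFS= read -r line; do handle_added "$line"; done < <(comm -13 <(sort /tmp/oldList) <(sort /tmp/newList))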

The diff command will do the comparing for you.
e.g.,
$ diff /tmp/oldList /tmp/newList
See the diff man page for more information. This should take care of the first part of your problem.
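If you want to stick with plain diff output, one approach (a sketch, not the only way) is to strip the < and > markers that diff puts in front of removed and added lines:
diff /tmp/oldList /tmp/newList | sed -n 's/^< //p'   # lines only in the old file
diff /tmp/oldList /tmp/newList | sed -n 's/^> //p'   # lines only in the new file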

Consider using Ruby if your scripts need readability.
To get the lines only in the old file:
ruby -e "puts File.readlines('/tmp/oldList') - File.readlines('/tmp/newList')"
To get the lines only in the new file:
ruby -e "puts File.readlines('/tmp/newList') - File.readlines('/tmp/oldList')"
You can feed that into a while read loop to process each line:
while read old ; do
...do stuff with $old
done < <(ruby -e "puts File.readlines('/tmp/oldList') - File.readlines('/tmp/newList')")

This is old, but for completeness we should say that if you have a really large set, the fastest solution would be to use diff to generate a script and then source it, like this:
#!/bin/bash
line_added() {
# code to be run for all lines added
# $* is the line
}
line_removed() {
# code to be run for all lines removed
# $* is the line
}
line_same() {
# code to be run for all lines that are the same
# $* is the line
}
sort /tmp/oldList >/tmp/oldList.sorted
sort /tmp/newList >/tmp/newList.sorted
diff >/tmp/diff_script.sh \
--new-line-format="line_added %L" \
--old-line-format="line_removed %L" \
--unchanged-line-format="line_same %L" \
/tmp/oldList.sorted /tmp/newList.sorted
source /tmp/diff_script.sh
Lines changed will appear as deleted and added. If you don't like this, you can use --changed-group-format. Check the diff manual page.
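As a sketch of that option: adding --changed-group-format='' to the diff call above should drop changed lines from the generated script entirely, so only pure additions and removals are handled:
diff >/tmp/diff_script.sh \
--changed-group-format='' \
--new-line-format="line_added %L" \
--old-line-format="line_removed %L" \
--unchanged-line-format="line_same %L" \
/tmp/oldList.sorted /tmp/newList.sorted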

I typically use:
diff /tmp/oldList /tmp/newList | grep -v "Common subdirectories"
The grep -v option inverts the match:
-v, --invert-match
Selected lines are those not matching any of the specified patterns.
So in this case it takes the diff results and omits those that are common.

Have you tried diff
$ diff /tmp/oldList /tmp/newList
$ man diff

Related

When comparing directories with diff is there a way to exclude certain file differences from the output?

I am running this command:
diff --recursive a.git b.git
It shows differences for some files that do not concern me. Is there a way to, for example, not have it show lines that end in:
".xaml.g.cs"
or lines that start in:
"Binary files"
or lines that have this text in them:
".git/objects"
A simple but effective way is to pipe output to grep. With grep -Ev you can ignore lines using regular expressions.
diff --recursive a.git b.git | grep -Ev "^< .xaml.g.cs|^> .xaml.g.cs" | grep -Ev "Binary files$" | grep -v ".git/objects"
This ignores all lines with matching text. As for the regular expressions: ^ means line starts with, $ means line ends with. But at least for ^ you have to adjust it to the diff output (where lines normally start with < or >).
Also note that diff provides a flag --ignore-matching-lines=RE, but it might not work as you would expect, as mentioned in this question/answer. Because it does not work as I would expect, I'd rather use grep for filtering.
man diff
gives following examples:
...
-x, --exclude=PAT
exclude files that match PAT
-X, --exclude-from=FILE
exclude files that match any pattern in FILE
...
Did you try using those switches (like diff -x "*.xaml.g.cs" --recursive a.git b.git)?
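Putting the two together, something along these lines (a sketch, assuming GNU diff) excludes the generated files at the diff level and filters the rest with grep:
diff --recursive -x "*.xaml.g.cs" a.git b.git | grep -v "^Binary files" | grep -v ".git/objects"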

Can you use 'less' or 'more' to output one page worth of text?

So in Linux, less is used to read files page by page for better readability. I was wondering if you can do something like less file.txt > output.txt to get one page's worth of file.txt and write it to output.txt.
Apparently this does not work: output.txt is exactly the same as the original file. I'm wondering why this is the case, and whether there are other workarounds. Thank you!
You can use the split command.
split -l 100 -d -a 3 input output
This will split the input file every 100 lines (-l 100), use numbers as suffixes (-d) and use 3 digits for the suffix (-a 3) in the output file names, something like: output000, output001, output002.
You can use head to get a specific number of lines, and tput lines to see how many lines there are on your current terminal.
Here's a script that fetches a pageful, or the standard 25 lines if no terminal is available:
#!/bin/bash
lines=$(tput lines) || lines=25
head -n "$lines" file.txt > output.txt
Use head and tail to get the first or last n lines of a file:
tail -n 20 /var/log/messages
head -n10 src/main.h

How to escape square brackets in a ls output

I'm experiencing some problems to escape square brackets in any file name.
I need to compare two lists. The first list is the output of ls, and the second is the file ARQ02.
#!/bin/bash
exec 3< <(ls /home/lint)
while read arq <&3; do
var=`grep -e "$arq" ARQ02`
if [ "$?" -ne 0 ] ; then
echo "$arq" >> result
fi
done
exec 3<&-
Sorry for my bad english.
Your immediate problem is that you must instruct grep to interpret the search term as a literal rather than a regular expression, using the -F option:
var=$(grep -Fe "$arq" ARQ02)
That way, any regex metacharacters that happen to be in the output from ls /home/lint - such as [ and ] - will still be treated as literals and won't break the grep invocation.
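A quick illustration of the difference (the file name here is made up):
printf '%s\n' 'file[1].txt' | grep 'file[1].txt'      # no match: [1] is a character class matching "1"
printf '%s\n' 'file[1].txt' | grep -F 'file[1].txt'   # matches: the brackets are taken literally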
That said, it looks like your command could be streamlined, such as by using the output from ls /home/lint directly as the set of search strings to pass to grep at once, using the -f option:
grep -Ff <(ls /home/lint) ARQ02 > result
<(...) is a so-called process substitution, which, simply put, presents the output from a command as if it were a (temporary) file, which is what -f expects: a file containing the search terms for grep.
Alternatively, if:
the lines of ARQ02 contain only filenames that fully match (some of) the filenames in the output from ls /home/lint, and
you don't mind sorting or want to sort the matches stored in result,
consider HuStmpHrrr's helpful answer.
I have to assume my interpretation is correct. Based on that, a one-liner easily solves your problem. There are 2 assumptions I need to make here: your file names don't contain newlines and you are using a modern bash:
comm -23 <(printf "%s\n" * | sort) <(sort ARQ02)
In bash, <() spawns a subshell and presents its stdout as a file. comm is the command that computes the difference of 2 input streams.
To explain in detail:
comm
-23 # suppress lines unique to ARQ02 and lines common to both
<(printf "%s\n" * | # print all the files in the local folder, one per line
sort) # sort them
<(sort ARQ02)
It's necessary to sort because comm compares its inputs line by line and requires sorted input.

Delete lines from a file matching first 2 fields from a second file in shell script

Suppose I have setA.txt:
a|b|0.1
c|d|0.2
b|a|0.3
and I also have setB.txt:
c|d|200
a|b|100
Now I want to delete from setA.txt lines that have the same first 2 fields with setB.txt, so the output should be:
b|a|0.3
I tried:
comm -23 <(sort setA.txt) <(sort setB.txt)
But equality is defined over the whole line, so it won't work. How can I do this?
$ awk -F\| 'FNR==NR{seen[$1,$2]=1;next;} !seen[$1,$2]' setB.txt setA.txt
b|a|0.3
This reads through setB.txt just once, extracts the needed information from it, and then reads through setA.txt while deciding which lines to print.
How it works
-F\|
This sets the field separator to a vertical bar, |.
FNR==NR{seen[$1,$2]=1;next;}
FNR is the number of lines read so far from the current file and NR is the total number of lines read. Thus, when FNR==NR, we are reading the first file, setB.txt. If so, set the value of associative array seen to true, 1, for the key consisting of fields one and two. Lastly, skip the rest of the commands and start over on the next line.
!seen[$1,$2]
If we get to this command, we are working on the second file, setA.txt. Since ! means negation, the condition is true if seen[$1,$2] is false which means that this combination of fields one and two was not in setB.txt. If so, then the default action is performed which is to print the line.
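To keep the filtered result rather than just printing it, you could (a sketch; setA.filtered is just a placeholder name) redirect to a temporary file and move it over the original:
awk -F\| 'FNR==NR{seen[$1,$2]=1;next;} !seen[$1,$2]' setB.txt setA.txt > setA.filtered && mv setA.filtered setA.txt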
This should work:
sed -n 's#\(^[^|]*|[^|]*\)|.*#/^\1/d#p' setB.txt |sed -f- setA.txt
How this works:
sed -n 's#\(^[^|]*|[^|]*\)|.*#/^\1/d#p'
generates an output:
/^c|d/d
/^a|b/d
which is then used as a sed script for the next sed after the pipe and outputs:
b|a|0.3
(IFS=$'|'; cat setA.txt | while read x y z; do grep -q -P "\Q$x|$y|\E" setB.txt || echo "$x|$y|$z"; done; )
Explanation: grep -q means only test whether grep can find the regexp, but produce no output; -P means use Perl syntax, so that the | is matched literally thanks to the \Q..\E construct.
IFS=$'|' makes bash use | instead of whitespace (SPC, TAB, etc.) as the token separator.
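A small illustration of that IFS splitting, independent of the files above:
IFS='|' read -r x y z <<< 'a|b|0.1'
echo "$x / $y / $z"    # prints: a / b / 0.1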

Quick unix command to display specific lines in the middle of a file?

Trying to debug an issue with a server and my only log file is a 20GB log file (with no timestamps even! Why do people use System.out.println() as logging? In production?!)
Using grep, I've found an area of the file that I'd like to take a look at, line 347340107.
Other than doing something like
head -<$LINENUM + 10> filename | tail -20
... which would require head to read through the first 347 million lines of the log file, is there a quick and easy command that would dump lines 347340100 - 347340200 (for example) to the console?
Update: I totally forgot that grep can print the context around a match ... this works well. Thanks!
I found two other solutions if you know the line number but nothing else (no grep possible):
Assuming you need lines 20 to 40,
sed -n '20,40p;41q' file_name
or
awk 'FNR>=20 && FNR<=40' file_name
When using sed it is more efficient to quit processing after having printed the last line than continue processing until the end of the file. This is especially important in the case of large files and printing lines at the beginning. In order to do so, the sed command above introduces the instruction 41q in order to stop processing after line 41 because in the example we are interested in lines 20-40 only. You will need to change the 41 to whatever the last line you are interested in is, plus one.
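Applied to the line numbers from the question, that would look something like:
sed -n '347340100,347340200p;347340201q' filename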
# print line number 52
sed -n '52p' # method 1
sed '52!d' # method 2
sed '52q;d' # method 3, efficient on large files
Method 3 is the fastest way to display specific lines and is efficient on large files.
with GNU-grep you could just say
grep --context=10 ...
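For example (the search string is whatever you already grepped for):
grep -n --context=10 'the string you searched for' filename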
No there isn't, files are not line-addressable.
There is no constant-time way to find the start of line n in a text file. You must stream through the file and count newlines.
Use the simplest/fastest tool you have to do the job. To me, using head makes much more sense than grep, since the latter is way more complicated. I'm not saying "grep is slow", it really isn't, but I would be surprised if it's faster than head for this case. That'd be a bug in head, basically.
What about:
tail -n +347340107 filename | head -n 100
I didn't test it, but I think that would work.
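Applied to the range from the question, something like this should print lines 347340100 through 347340200:
tail -n +347340100 filename | head -n 101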
I prefer just going into less and
typing 50% to go to halfway through the file,
43210G to go to line 43210
:43210 to do the same
and stuff like that.
Even better: hit v to start editing (in vim, of course!), at that location. Now, note that vim has the same key bindings!
You can use the ex command, a standard Unix editor (part of Vim now), e.g.
display a single line (e.g. 2nd one):
ex +2p -scq file.txt
corresponding sed syntax: sed -n '2p' file.txt
range of lines (e.g. 2-5 lines):
ex +2,5p -scq file.txt
sed syntax: sed -n '2,5p' file.txt
from the given line till the end (e.g. 5th to the end of the file):
ex +5,p -scq file.txt
sed syntax: sed -n '5,$p' file.txt
multiple line ranges (e.g. 2-4 and 6-8 lines):
ex +2,4p +6,8p -scq file.txt
sed syntax: sed -n '2,4p;6,8p' file.txt
Above commands can be tested with the following test file:
seq 1 20 > file.txt
Explanation:
+ or -c followed by a command - execute the (vi/vim) command after the file has been read,
-s - silent mode, which also uses the current terminal as the default output,
q given via -c is the command to quit the editor (add ! to force quit, e.g. -scq!).
I'd first split the file into few smaller ones like this
$ split --lines=50000 /path/to/large/file /path/to/output/file/prefix
and then grep on the resulting files.
If the line number you want to read is 100:
head -100 filename | tail -1
Get ack
Ubuntu/Debian install:
$ sudo apt-get install ack-grep
Then run:
$ ack --lines=$START-$END filename
Example:
$ ack --lines=10-20 filename
From $ man ack:
--lines=NUM
Only print line NUM of each file. Multiple lines can be given with multiple --lines options or as a comma separated list (--lines=3,5,7). --lines=4-7 also works.
The lines are always output in ascending order, no matter the order given on the command line.
sed will need to read the data too in order to count the lines.
The only way a shortcut would be possible is if there were context/order in the file to operate on. For example, if the log lines were prepended with a fixed-width time/date, you could use the look Unix utility to binary-search through the file for particular dates/times.
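For example (a sketch, assuming the file is already sorted and lines begin with the timestamp):
look "2009-06-08 14:" /path/to/sorted.log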
Use
x=`cat -n <file> | grep <match> | awk '{print $1}'`
Here you will get the line number where the match occurred.
Now you can use the following command to print 100 lines
awk -v var="$x" 'NR>=var && NR<=var+100{print}' <file>
or you can use "sed" as well
sed -n "${x},${x+100}p" <file>
With sed -e '1,N d; M q' you'll print lines N+1 through M. This is probably a bit better than grep -C as it doesn't try to match lines to a pattern.
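For instance, with the 20-40 range used earlier:
sed -e '1,19d;40q' file_name    # prints lines 20 through 40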
Building on Sklivvz' answer, here's a nice function one can put in a .bash_aliases file. It is efficient on huge files when printing stuff from the front of the file.
function middle()
{
startidx=$1
len=$2
endidx=$(($startidx+$len))
filename=$3
awk "FNR>=${startidx} && FNR<=${endidx} { print NR\" \"\$0 }; FNR>${endidx} { print \"END HERE\"; exit }" $filename
}
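Usage, with the numbers from the question (prints the requested range with line numbers prefixed):
middle 347340100 100 filename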
To display a line from a <textfile> by its <line#>, just do this:
perl -wne 'print if $. == <line#>' <textfile>
If you want a more powerful way to show a range of lines with regular expressions -- I won't say why grep is a bad idea for doing this, it should be fairly obvious -- this simple expression will show you your range in a single pass which is what you want when dealing with ~20GB text files:
perl -wne 'print if m/<regex1>/ .. m/<regex2>/' <filename>
(tip: if your regex has / in it, use something like m!<regex>! instead)
This would print out <filename> starting with the line that matches <regex1> up until (and including) the line that matches <regex2>.
It doesn't take a wizard to see how a few tweaks can make it even more powerful.
Last thing: perl, since it is a mature language, has many hidden enhancements favoring speed and performance. This makes it the obvious choice for such an operation, since it was originally developed for handling large log files, text, databases, etc.
print line 5
sed -n '5p' file.txt
sed '5q;d' file.txt
print everything except line 5
sed '5d' file.txt
And my own creation, put together with some help from Google:
#!/bin/bash
# removeline.sh
# removes a line from a file; with -o, the removed line is appended to another file (i.e. the line is moved)
usage() { # Function: Print a help message.
echo "Usage: $0 -l LINENUMBER -i INPUTFILE [ -o OUTPUTFILE ]"
echo "line is removed from INPUTFILE"
echo "line is appended to OUTPUTFILE"
}
exit_abnormal() { # Function: Exit with error.
usage
exit 1
}
while getopts l:i:o:b flag
do
case "${flag}" in
l) line=${OPTARG};;
i) input=${OPTARG};;
o) output=${OPTARG};;
esac
done
if [ -f tmp ]; then
echo "Temp file:tmp exist. delete it yourself :)"
exit
fi
if [ -f "$input" ]; then
re_isanum='^[0-9]+$'
if ! [[ $line =~ $re_isanum ]] ; then
echo "Error: LINENUMBER must be a positive, whole number."
exit 1
elif [ $line -eq "0" ]; then
echo "Error: LINENUMBER must be greater than zero."
exit_abnormal
fi
if [ ! -z $output ]; then
sed -n "${line}p" $input >> $output
fi
if [ ! -z $input ]; then
# delete the line from the input file (together with the append above, this moves the line)
sed "${line}d" $input > tmp && cp tmp $input
fi
fi
if [ -f tmp ]; then
rm tmp
fi
You could try this command:
egrep -n "*" <filename> | egrep "<line number>"
Easy with perl! If you want to get line 1, 3 and 5 from a file, say /etc/passwd:
perl -e 'while(<>){if(++$l~~[1,3,5]){print}}' < /etc/passwd
I am surprised only one other answer (by Ramana Reddy) suggested adding line numbers to the output. The following searches for the required line number and colours the output.
file=FILE
lineno=LINENO
wb="107"; bf="30;1"; rb="101"; yb="103"
cat -n ${file} | { GREP_COLORS="se=${wb};${bf}:cx=${wb};${bf}:ms=${rb};${bf}:sl=${yb};${bf}" grep --color -C 10 "^[[:space:]]\\+${lineno}[[:space:]]"; }
