checking equality of a part of two files - linux

Is it possible to check whether the first line of two files is equal using diff (or another simple bash command)?
[Generally checking equality of first/last k lines, or even lines i to j]

To diff the first k lines of two files:
$ diff <(head -n k file1) <(head -n k file2)
Similarly, to diff the last k lines:
$ diff <(tail -n k file1) <(tail -n k file2)
To diff lines i to j:
$ diff <(sed -n 'i,jp' file1) <(sed -n 'i,jp' file2)
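For the specific case in the question (just the first line), a minimal sketch using cmp instead of diff (cmp -s stays silent and only sets the exit status; file1 and file2 are placeholder names):
if cmp -s <(head -n 1 file1) <(head -n 1 file2); then
    echo "first lines are equal"
else
    echo "first lines differ"
fi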

My solution seems rather basic and beginner-level compared to dogbane's above, but here it is all the same!
echo "Comparing the first line of file $1 and $2 to see if they are the same."
FILE1=$(head -n 1 "$1")   # first line of the first file
FILE2=$(head -n 1 "$2")   # first line of the second file
echo "$FILE1" > tempfile1.txt
echo "$FILE2" > tempfile2.txt
if diff tempfile1.txt tempfile2.txt; then
    echo Success
else
    echo Fail
fi

My solution uses the filterdiff program from the patchutils collection. The following command shows the differences between file1 and file2 from line number j to line number k:
diff -U 0 file1 file2 | filterdiff --lines j-k
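For the first-line case from the original question, that might look like the following sketch (assuming patchutils is installed); an empty result means the first lines match:
diff -U 0 file1 file2 | filterdiff --lines 1-1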

The command below displays the first line of both files.
krithika.450> head -1 temp1.txt temp4.txt
==> temp1.txt <==
Starting CXC <...> R5x BCMBIN (c) AB 2012
==> temp4.txt <==
Starting CXC <...> R5x BCMBIN (c) AB 2012
The command below prints yes if the first lines of both files are equal.
krithika.451> head -1 temp4.txt temp1.txt | awk '{if(NR==2)p=$0;if(NR==5){q=$0;if(p==q)print "yes"}}'
yes
krithika.452>

Related

Join 2 files by common column header (without awk/sed)

Basically I want to get all records from file2, but filter out columns whose header doesn't appear in file1
Example:
file1
Name Location
file2
Name Phone_Number Location Email
Jim 032131 xyz xyz#qqq.com
Tim 037903 zzz zzz#qqq.com
Pimp 039141 xxz xxz#qqq.com
Output
Name Location
Jim xyz
Tim zzz
Pimp xxz
Is there a way to do this without awk or sed, but still using coreutils tools? I've tried doing it with join, but couldn't get it working.
ALL_COLUMNS=$(head -n1 file2)
for COLUMN in $(head -n1 file1); do
JOIN_FORMAT+="2.$(( $(echo ${ALL_COLUMNS%%$COLUMN*} | wc -w)+1 )),"
done
join -a2 -o ${JOIN_FORMAT%?} /dev/null file2
Explanation:
ALL_COLUMNS=$(head -n1 file2)
This saves all of file2's column names, which we will search through next.
for COLUMN in $(head -n1 file1); do
JOIN_FORMAT+="2.$(( $(echo ${ALL_COLUMNS%%$COLUMN*} | wc -w)+1 )),"
done
For every column in file1, we look for the position of the column with the same name in file2 and append it to JOIN_FORMAT in the form "2.<column_number>,".
join -a2 -o ${JOIN_FORMAT%?} /dev/null file2
Once the option string is complete (2.1,2.3,), we pass it to join, removing the trailing ,.
join prints the unpairable lines from the second file provided (-a2 -> file2), but only the columns specified in the -o option.
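For the example files above, the loop leaves JOIN_FORMAT as 2.1,2.3, (the Name and Location columns), so after stripping the trailing comma the final command is effectively:
join -a2 -o 2.1,2.3 /dev/null file2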
Not very efficient, but works for your example:
#!/bin/bash
# read the wanted column names from file1
read -r -a cols < file1
# print the header of the output
echo "${cols[@]}"
# read the header row of file2
read -r -a header < <(head -n1 file2)
# indices of the file2 columns we want to keep
keep=()
for (( i=0; i<${#header[@]}; i++ )) ; do
    for c in "${cols[@]}" ; do
        if [[ ${header[i]} == "$c" ]] ; then
            keep+=("$i")
        fi
    done
done
# print only the kept columns of every data row
while read -r -a data ; do
    for idx in "${keep[@]}" ; do
        printf '%s ' "${data[idx]}"
    done
    printf '\n'
done < <(tail -n+2 file2)
Tools used: head and tail. They aren't essential, though. And bash, of course.

Randomly shuffling lines in Linux / Bash

I have some files in Linux, for example 2, and I need to shuffle them together into one file.
For example
$cat file1
line 1
line 2
line 3
line 4
line 5
line 6
line 7
line 8
and
$cat file2
linea one
linea two
linea three
linea four
linea five
linea six
linea seven
linea eight
After shuffling the two files together I could obtain something like:
linea eight
line 4
linea five
line 1
linea three
line 8
linea seven
line 5
linea two
linea one
line 2
linea four
line 7
linea six
line 1
line 6
You should use the shuf command =)
cat file1 file2 | shuf
Or with Perl:
cat file1 file2 | perl -MList::Util=shuffle -wne 'print shuffle <>;'
Sort (note that identical lines will end up next to each other):
cat file1 file2 | sort -R
Shuf:
cat file1 file2 | shuf
Perl:
cat file1 file2 | perl -MList::Util=shuffle -e 'print shuffle<STDIN>'
BASH:
cat file1 file2 | while IFS= read -r line
do
printf "%06d %s\n" $RANDOM "$line"
done | sort -n | cut -c8-
Awk:
cat file1 file2 | awk 'BEGIN{srand()}{printf "%06d %s\n", rand()*1000000, $0;}' | sort -n | cut -c8-
Just a note to OS X users who use MacPorts: the shuf command is part of coreutils and is installed under the name gshuf.
$ sudo port install coreutils
$ gshuf example.txt # or cat example.txt | gshuf
You don't need to use pipes here. Sort alone does this with the file(s) as parameters. I would just do
sort -R file1
or if you have multiple files
sort -R file1 file2
Here's a one-liner that doesn't rely on shuf or sort -R, which I didn't have on my Mac:
while read -r line; do echo "$RANDOM $line"; done < my_file | sort -n | cut -f2- -d' '
This iterates over all the lines in my_file and reprints them in a randomized order.
I would use shuf too.
Another option: GNU sort has
-R, --random-sort
sort by random hash of keys
You could try:
cat file1 file2 | sort -R
This worked for me. It employs the Fisher-Yates shuffle.
randomize()
{
    arguments=("$@")            # the items to shuffle are the function's arguments
    declare -a out
    i=$(( $# - 1 ))             # index of the last unshuffled element
    j=0
    while [[ $i -ge 0 ]] ; do
        which=$(random_range 0 "$i")
        out[j]=${arguments[which]}        # pick a random remaining element
        arguments[which]=${arguments[i]}  # fill the gap with the last element
        (( i-- ))
        (( j++ ))
    done
    echo "${out[*]}"
}
random_range()
{
    low=$1
    range=$(( $2 - $1 + 1 ))    # size of the inclusive range [$1, $2]
    if [[ $range -ne 0 ]]; then
        echo $(( low + RANDOM % range ))
    else
        echo "$1"
    fi
}
This is clearly a biased shuffle (for example, about half the time the list will start with the first line), but for some basic randomization with just bash builtins I guess it is fine? Just print each line yes/no, then print the rest...
shuffle() {
    local IFS=$'\n' tail=
    while read -r l; do
        if [ $((RANDOM%2)) = 1 ]; then
            echo "$l"
        else
            tail="${tail}\n${l}"
        fi
    done < "$1"
    printf "${tail}\n"
}
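To apply it to the two files from the question, one option (just a sketch; the function reads from its first argument, and "both" is only a hypothetical temporary file name):
cat file1 file2 > both
shuffle both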

insert the contents of a file into another (at a specific line of the target file) - BASH/LINUX

I tried doing it with cat, adding | head -$line | tail -1 after the second file, but it doesn't work because cat runs first.
Any ideas? I need to do it with cat or something else.
I'd probably use sed for this job:
line=3
sed -e "${line}r file2" file1
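As a quick illustration of what the r command does (a sketch with made-up four- and two-line files; r reads the named file and inserts its contents after each line matched by the address, here line 3):
$ printf 'a\nb\nc\nd\n' > file1
$ printf 'X\nY\n' > file2
$ sed -e '3r file2' file1
a
b
c
X
Y
d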
If you're looking to overwrite file1 and you have GNU sed, add the -i option. Otherwise, write to a temporary file and then copy/move the temporary file over the original, cleaning up as necessary (that's the trap stuff below). Note: copying the temporary over the file preserves links; moving does not (but is swifter, especially if the file is big).
line=3
tmp="./sed.$$"
trap "rm -f $tmp; exit 1" 0 1 2 3 13 15
sed -e "${line}r file2" file1 > $tmp
cp $tmp file1
rm -f $tmp
trap 0
Just for fun, and just because we all love ed, the standard editor, here's an ed version. It's very efficient (ed is a genuine text editor)!
ed -s file2 <<< $'3r file1\nw'
If the line number is stored in the variable line then:
ed -s file2 <<< "${line}r file1"$'\nw'
Just to please Zack, here's one version with fewer bashisms, in case you don't like bash (personally, I don't like pipes and subshells, I prefer herestrings, but hey, as I said, that's only to please Zack):
printf "%s\n" "${line}r file1" w | ed -s file2
or (to please Sorpigal):
printf "%dr %s\nw" "$line" file1 | ed -s file2
As Jonathan Leffler mentions in a comment, if you intend to use this method in a script, use a heredoc (it's usually the most efficient):
ed -s file2 <<EOF
${line}r file1
w
EOF
Hope this helps!
P.S. Don't hesitate to leave a comment if you feel you need to express yourself about the ways to drive ed, the standard editor.
cat file1 >>file2
will append content of file1 to file2.
cat file1 file2
will concatenate file1 and file2 and send output to terminal.
cat file1 file2 >file3
will create or overwrite file3 with the concatenation of file1 and file2.
cat file1 file2 >>file3
will append concatenation of file1 and file2 to end of file3.
Edit:
To truncate file2 before appending file1:
sed -e '11,$d' -i file2 && cat file1 >>file2
or to end up with a 500-line file:
n=$((500-$(wc -l <file1)))
sed -e "1,${n}d" -i file2 && cat file1 >>file2
Lots of ways to do it, but I like to choose a way that involves making tools.
First, setup test environment
rm -rf /tmp/test
mkdir /tmp/test
printf '%s\n' {0..9} > /tmp/test/f1
printf '%s\n' {one,two,three,four,five,six,seven,eight,nine,ten} > /tmp/test/f2
Now let's make the tool, and in this first pass we'll implement it badly.
# insert contents of file $1 into file $2 at line $3
insert_at () { insert="$1" ; into="$2" ; at="$3" ; { head -n $at "$into" ; ((at++)) ; cat "$insert" ; tail -n +$at "$into" ; } ; }
Then run the tool to see the amazing results.
$ insert_at /tmp/test/f1 /tmp/test/f2 5
But wait, the result is on stdout! What about overwriting the original? No problem, we can make another tool for that.
insert_at_replace () { tmp=$(mktemp) ; insert_at "$@" > "$tmp" ; mv "$tmp" "$2" ; }
And run it
$ insert_at_replace /tmp/test/f1 /tmp/test/f2 5
$ cat /tmp/test/f2
"Your implementation sucks!"
I know, but that's the beauty of making simple tools. Let's replace insert_at with the sed version.
insert_at () { insert="$1" ; into="$2" ; at="$3" ; sed -e "${at}r ${insert}" "$into" ; }
And insert_at_replace keeps working (of course). The implementation of insert_at_replace can also be changed to be less buggy, but I'll leave that as an exercise for the reader.
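A sketch of one way to make it less buggy (my assumption about what that could mean: check that mktemp succeeds, and clean up the temporary file if insert_at fails):
insert_at_replace () {
    tmp=$(mktemp) || return 1
    if insert_at "$@" > "$tmp" ; then
        mv "$tmp" "$2"
    else
        rm -f "$tmp"
        return 1
    fi
}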
I like doing this with head and tail if you don't mind managing a new file:
head -n 16 file1 > file3 &&
cat file2 >> file3 &&
tail -n+56 file1 >> file3
You can collapse this onto one line if you like. Then, if you really need it to overwrite file1, do: mv file3 file1 (optionally include && between commands).
Notes:
head -n 16 file1 means first 16 lines of file1
tail -n+56 file1 means file1 starting from line 56 to the end
Hence, I actually skipped lines 17 through 55 from file1.
Of course, you could change 56 to 17 so no lines are skipped.
I prefer mixing simple head and tail commands rather than trying a magic sed command.

Calculate Word occurrences from file in bash

I'm sorry for the very noob question, but I'm kind of new to bash programming (started a few days ago). Basically, what I want to do is keep one file with the word-occurrence counts for another file.
I know I can do this:
sort | uniq -c | sort
The thing is that afterwards I want to take a second file, calculate the occurrences again, and update the first one. Then I take a third file, and so on.
What I'm doing at the moment works without any problem (I'm using grep, sed and awk), but it looks pretty slow.
I'm pretty sure there is a very efficient way to do this with a command or two, using uniq, but I can't figure it out.
Could you please lead me to the right way?
I'm also pasting the code I wrote:
#!/bin/bash
# count the number of word occurrences from a file and writes to another file #
# the words are listed from the most frequent to the less one #
touch .check # used to check the occurrences. Temporary file
touch distribution.txt # final file with all the occurrences calculated
page=$1 # contains the file I'm calculating
occurrences=$2 # temporary file for the occurrences
# takes all the words from the file $page and orders them by occurrences
cat "$page" | tr -cs A-Za-z\' '\n' | tr A-Z a-z > .check
# loop to update the old file with the new information
# basically what I do is check word by word and add them to the old file as an update
cat .check | while read words
do
word=${words} # word I'm calculating
strlen=${#word} # word's length
# I use a blacklist to skip banned words (for example very short words or unimportant ones, like articles and prepositions)
if ! grep -Fxq "$word" .blacklist && [ $strlen -gt 2 ]
then
# if the word was never found before it writes it with 1 occurrence
if [ `egrep -c -i "^$word: " $occurrences` -eq 0 ]
then
echo "$word: 1" | cat >> $occurrences
# else it calculates the occurrences
else
old=`awk -v words=$word -F": " '$1==words { print $2 }' $occurrences`
let "new=old+1"
sed -i "s/^$word: $old$/$word: $new/g" $occurrences
fi
fi
done
rm .check
# finally it orders the words
awk -F": " '{print $2" "$1}' $occurrences | sort -rn | awk -F" " '{print $2": "$1}' > distribution.txt
Well, I'm not sure I've fully understood what you are trying to do,
but I would do it this way:
while read file
do
cat $file | tr -cs A-Za-z\' '\n'| tr A-Z a-z | sort | uniq -c > stat.$file
done < file-list
Now you have statistics for each of your files, and you simply aggregate them:
while read file
do
cat stat.$file
done < file-list \
| sort -k2 \
| awk '{if ($2!=prev) {print s" "prev; s=0;}s+=$1;prev=$2;}END{print s" "prev;}'
Example of usage:
$ for i in ls bash cp; do man $i > $i.txt ; done
$ cat <<EOF > file-list
> ls.txt
> bash.txt
> cp.txt
> EOF
$ while read file; do
> cat $file | tr -cs A-Za-z\' '\n'| tr A-Z a-z | sort | uniq -c > stat.$file
> done < file-list
$ while read file
> do
> cat stat.$file
> done < file-list \
> | sort -k2 \
> | awk '{if ($2!=prev) {print s" "prev; s=0;}s+=$1;prev=$2;}END{print s" "prev;}' | sort -rn | head
3875 the
1671 is
1137 to
1118 a
1072 of
793 if
744 and
533 command
514 in
507 shell
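If you don't need the per-file statistics, a shorter route (just a sketch, reusing the same tr/sort/uniq pipeline from above; file1 file2 file3 stand for whatever input files you have) is to aggregate everything in one pass:
cat file1 file2 file3 | tr -cs A-Za-z\' '\n' | tr A-Z a-z | sort | uniq -c | sort -rn > distribution.txt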

Linux file splitting

I am using sed to split a file in two
I have a file that has a custom separator "/-sep-/" and I want to split the file where the separator is
Currently I have:
sed -n '1,/-sep-/ {p}' /export/data.temp > /export/data.sql.md5
sed -n '/-sep-/,$ {p}' /export/data.temp > /export/data.sql
but file one contains /-sep-/ at the end and file two begins with /-sep-/.
How can I handle this?
Note that in file one I need to remove a line break and the /-sep-/, and in file two remove the /-sep-/ and a line break :S
Reverse it: tell it what not to print instead.
sed '/-sep-/Q' /export/data.temp > /export/data.sql.md5
sed '1,/-sep-/d' /export/data.temp > /export/data.sql
(Regarding that line break, I did not understand it. A sample input would probably help.)
By the way, your original code needs only minor addition to do what you want:
sed -n '1,/-sep-/{/-sep-/!p}' /export/data.temp > /export/data.sql.md5
sed -n '/-sep-/,${/-sep-/!p}' /export/data.temp > /export/data.sql
$ cat >testfile
a
a
a
a
/-sep-/
b
b
b
b
and then
$ csplit testfile '/-sep-/' '//'
8
8
8
$ head -n 999 xx*
==> xx00 <==
a
a
a
a
==> xx01 <==
/-sep-/
==> xx02 <==
b
b
b
b
sed -n '/-sep-/q; p' /export/data.temp > /export/data.sql.md5
sed -n '/-sep-/,$ {p}' /export/data.temp | sed '1d' > /export/data.sql
Might be easier to do in one pass with awk:
awk -v out=/export/data.sql.md5 -v f2=/export/data.sql '
/-sep-/ { out=f2; next}
{ print > out }
' /export/data.temp
