Show uncommon part of the line - linux

Hi, I have two files that contain paths. I want to compare the two files and show only the part of each line that they do not have in common.
1.txt:
/home/folder_name/abc
2.txt:
/home/folder_name/abc/pqr/xyz/mnp
Output I want:
/pqr/xyz/mnp
How can I do this?

This bit of awk does the job:
$ awk 'NR==FNR {a[++i]=$0; next}
{
b[++j]=$0;
if(length(a[j])>length(b[j])) {t=a[j]; a[j]=b[j]; b[j]=t}
sub(a[j],"",b[j]);
print b[j]
}' 2.txt 1.txt # or 1.txt 2.txt, it doesn't matter
Write the line from the first file to the array a.
Write the line from the second to b.
Swap a[j] and b[j] if a[j] is longer than b[j] (this might not be necessary if the longer text is always in b).
Remove the part found in a[j] from b[j] and print b[j].
This is a general solution; it makes no assumption that the match is at the start of the line, or that the contents of one file's line should be removed from the other. If you can afford to make those assumptions, the script can be simplified.
If the match may occur more than once on the line, you can use gsub rather than sub to perform a global substitution.
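A minimal sketch of the simplified form mentioned above, assuming each line of the first file is a prefix of the line with the same number in the second file. The function name strip_prefix is made up for illustration. It uses index() and substr() rather than sub(), so the prefix is matched literally and regex metacharacters in a path cannot change the result:

```shell
# Hypothetical helper: $1 = file with the shorter paths, $2 = file with the longer paths.
strip_prefix() {
    awk 'NR == FNR { prefix[FNR] = $0; next }      # first file: remember each line
         # second file: print what follows the prefix, but only if it really is a prefix
         index($0, prefix[FNR]) == 1 { print substr($0, length(prefix[FNR]) + 1) }' "$1" "$2"
}
# Usage: strip_prefix 1.txt 2.txt
```

With the sample files from the question this prints /pqr/xyz/mnp. Lines whose counterpart is not a prefix are skipped silently, and an empty first file is not handled.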

Assuming you have one string per line in 1.txt and in 2.txt, the following code will do.
paste 1.txt 2.txt |
while IFS=$'\t' read -r a b;
do
if [[ ${#a} -gt ${#b} ]];
then
echo "${a/"$b"}";
else
echo "${b/"$a"}";
fi;
done;
This is how it works on my system,
shiplu#:~/test/bash$ cat 1.txt
/home/shiplu/test/bash
/home/shiplu/test/bash/hello/world
shiplu#:~/test/bash$ cat 2.txt
/home/shiplu/test/bash/good/world
/home/shiplu/test/bash
shiplu#:~/test/bash$ paste 1.txt 2.txt |
> while read a b;
> do
> if [[ ${#a} -gt ${#b} ]];
> then
> echo ${a/$b};
> else
> echo ${b/$a};
> fi;
> done;
/good/world
/hello/world

This script compares the two files line by line and prints only the part of each line that differs.
First it counts the number of lines in the first file.
Then it starts a loop that iterates once per line.
On each iteration it reads the same line from both files into two variables.
It compares the lines and, if they are the same, says so.
If they are not, it replaces the duplicated part of the string with nothing (effectively removing it).
I used : as the separator in sed because your variables contain /. If they may contain :, choose a different separator.
Probably not the most efficient solution, but it works.
#!/bin/bash
NUMOFLINES=$(wc -l < "1.txt")
echo $NUMOFLINES
for ((i = 1 ; i <= $NUMOFLINES ; i++)); do
f1=$(sed -n $i'p' 1.txt)
f2=$(sed -n $i'p' 2.txt)
if [[ $f1 < $f2 ]]; then
echo -n "Line $i:"
sed 's:'"$f1"'::' <<< "$f2"
elif [[ $f1 > $f2 ]]; then
echo -n "Line $i:"
sed 's:'"$f2"'::' <<< "$f1"
else
echo "Line $i: Both lines are the same"
fi
echo ""
done

If you happen to use bash, you could try this one:
echo $(diff <(grep -o . 1.txt) <(grep -o . 2.txt) \
| sed -n '/^[<>]/ {s/^..//;p}' | tr -d '\n')
It does a character-by-character comparison using diff (grep -o . puts each character on its own intermediate line so that the line-wise diff can compare them) and prints only the differences: it keeps the diff output lines that start with the markers < or >, strips the markers, and joins the remaining characters with tr.
If you have multiple lines in your input (which you did not mention in your question) then try something like this (where % is a character not contained in your input):
diff <(cat 1.txt | tr '\n' '%' | grep -o .) \
<(cat 2.txt | tr '\n' '%' | sed -e 's/%/%%/g' | grep -o .) \
| sed -n '/^[<>]/ {s/^..//;p}' | tr -d '\n' | tr '%' '\n'
This extends the single-line solution by adding line end markers (e.g. %) which diff is forced to include in its output by adding % on the left and %% on the right.

If each file always contains exactly one line, the following works:
perl -lne '$a=$_ if($.==1);print $1 if(/$a(.*)/ && $.==2)' 1.txt 2.txt
Tested Below:
> cat 1.txt
/home/folder_name/abc
> cat 2.txt
/home/folder_name/abc/pqr/xyz/mnp
> perl -lne '$a=$_ if($.==1);print $1 if(/$a(.*)/ && $.==2)' 1.txt 2.txt
/pqr/xyz/mnp
>

Related

Bash function with input fails awk command

I am writing a function in a BASH shell script, that should return lines from csv-files with headers, having more commas than the header. This can happen, as there are values inside these files, that could contain commas. For quality control, I must identify these lines to later clean them up. What I have currently:
#!/bin/bash
get_bad_lines () {
local correct_no_of_commas=$(head -n 1 $1/$1_0_0_0.csv | tr -cd , | wc -c)
local no_of_files=$(ls $1 | wc -l)
for i in $(seq 0 $(( ${no_of_files}-1 )))
do
# Check that the file exist
if [ ! -f "$1/$1_0_${i}_0.csv" ]; then
echo "File: $1_0_${i}_0.csv not found!"
continue
fi
# Search for error-lines inside the file and print them out
echo "$1_0_${i}_0.csv has over $correct_no_of_commas commas in the following lines:"
grep -o -n '[,]' "$1/$1_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk '$1 > $correct_no_of_commas {print}'
done
}
get_bad_lines products
get_bad_lines users
The output of this program is now all the comma-counts with all of the line numbers in all the files,
and I suspect this is due to the input $1 (foldername, i.e. products & users) conflicting with the call to awk with reference to $1 as well (where I wish to grab the first column being the count of commas for that line in the current file in the loop).
Is this the issue? and if so, would it be solvable by either referencing the 1.st column or the folder name by different variable names instead of both of them using $1 ?
Example, current output:
5 6667
5 6668
5 6669
5 6670
(should only show lines for that file having more than 5 commas).
I also tried declaring a variable in the call to awk, with the same effect (as in the accepted answer to "Awk field variable clash with function argument"):
get_bad_lines () {
local table_name=$1
local correct_no_of_commas=$(head -n 1 $table_name/${table_name}_0_0_0.csv | tr -cd , | wc -c)
local no_of_files=$(ls $table_name | wc -l)
for i in $(seq 0 $(( ${no_of_files}-1 )))
do
# Check that the file exist
if [ ! -f "$table_name/${table_name}_0_${i}_0.csv" ]; then
echo "File: ${table_name}_0_${i}_0.csv not found!"
continue
fi
# Search for error-lines inside the file and print them out
echo "${table_name}_0_${i}_0.csv has over $correct_no_of_commas commas in the following lines:"
grep -o -n '[,]' "$table_name/${table_name}_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk -v table_name="$table_name" '$1 > $correct_no_of_commas {print}'
done
}
You can do the whole job in awk:
get_bad_lines () {
find "$1" -maxdepth 1 -name "$1_0_*_0.csv" | while read -r my_file ; do
awk -v table_name="$1" '
NR==1 { num_comma=gsub(/,/, ""); }
/,/ { if (gsub(/,/, ",", $0) > num_comma) wrong_array[wrong++]=NR":"$0;}
END { if (wrong > 0) {
print(FILENAME" has over "num_comma" commas in the following lines:");
for (i=0;i<wrong;i++) { print(wrong_array[i]); }
}
}' "${my_file}"
done
}
Your original awk command fails to keep only the lines with too many commas because it uses the shell variable correct_no_of_commas inside a single-quoted awk program ('$1 > $correct_no_of_commas {print}'). The shell performs no substitution inside single quotes, so awk sees $correct_no_of_commas as is and looks up its own variable correct_no_of_commas, which is undefined and therefore an empty string. awk then evaluates $1 > $"" as the condition, and because $"" is equivalent to $0, it compares the count in $1 with the full input line. The full line has the form <padding><count> <line_number>, which is not a number, so awk compares the two as strings, and the digit at the start of $1 sorts after the leading blank of $0. Thus $1 > $correct_no_of_commas is always true.
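The usual way around this is to hand the shell value to awk with -v and refer to it without a dollar sign inside the program. A minimal sketch, assuming the input is the "count line_number" output of uniq -c; the function name over_limit and the awk variable max are made up for illustration:

```shell
# Hypothetical helper: $1 = allowed number of commas; stdin = "count line_number" pairs.
over_limit() {
    # max is an awk variable set from the shell; "+ 0" forces a numeric comparison
    awk -v max="$1" '$1 > max + 0 { print $2 }'
}
# Usage inside the original pipeline (sketch):
#   ... | uniq -c | over_limit "$correct_no_of_commas"
```

It prints only the line numbers whose comma count exceeds the limit.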
You can identify all the bad lines with a single awk command (ENDFILE requires GNU awk):
awk -F, 'FNR==1{print FILENAME; headerCount=NF;} NF>headerCount{print} ENDFILE{print "#######\n"}' /path/here/*.csv
If you want the line number also to be printed, use this
awk -F, 'FNR==1{print FILENAME"\nLine#\tLine"; headerCount=NF;} NF>headerCount{print FNR"\t"$0} ENDFILE{print "#######\n"}' /path/here/*.csv

Difficulty to create .txt file from loop in bash

I've this data :
cat >data1.txt <<'EOF'
2020-01-27-06-00;/dev/hd1;100;/
2020-01-27-12-00;/dev/hd1;100;/
2020-01-27-18-00;/dev/hd1;100;/
2020-01-27-06-00;/dev/hd2;200;/usr
2020-01-27-12-00;/dev/hd2;200;/usr
2020-01-27-18-00;/dev/hd2;200;/usr
EOF
cat >data2.txt <<'EOF'
2020-02-27-06-00;/dev/hd1;120;/
2020-02-27-12-00;/dev/hd1;120;/
2020-02-27-18-00;/dev/hd1;120;/
2020-02-27-06-00;/dev/hd2;230;/usr
2020-02-27-12-00;/dev/hd2;230;/usr
2020-02-27-18-00;/dev/hd2;230;/usr
EOF
cat >data3.txt <<'EOF'
2020-03-27-06-00;/dev/hd1;130;/
2020-03-27-12-00;/dev/hd1;130;/
2020-03-27-18-00;/dev/hd1;130;/
2020-03-27-06-00;/dev/hd2;240;/usr
2020-03-27-12-00;/dev/hd2;240;/usr
2020-03-27-18-00;/dev/hd2;240;/usr
EOF
I would like to create a .txt file for each filesystem (so hd1.txt, hd2.txt, hd3.txt and hd4.txt) and put in each one the sum of the values for that filesystem from each dataX.txt. It is hard for me to explain in English what I want, so here is an example of the desired result.
Expected content for the output file hd1.txt:
2020-01;/dev/hd1;300;/
2020-02;/dev/hd1;360;/
2020-03;/dev/hd1;390;/
Expected content for the file hd2.txt:
2020-01;/dev/hd2;600;/usr
2020-02;/dev/hd2;690;/usr
2020-03;/dev/hd2;720;/usr
The implementation I've currently tried:
for i in $(cat *.txt | awk -F';' '{print $2}' | cut -d '/' -f3| uniq)
do
cat *.txt | grep -w $i | awk -F';' -v date="$(cat *.txt | awk -F';' '{print $1}' | cut -d'-' -f-2 | uniq )" '{sum+=$3} END {print date";"$2";"sum}' >> $i
done
But it doesn't work.
Can you show me how to do that?
Because the format seems to be so constant, you can delimit the input with multiple separators and parse it easily in awk:
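awk -v FS='[-;/]' '
# "-" comes first in the bracket expression so it is a literal, not a range
{ key = $1 "-" $2 ";" $8 }   # group by year-month and device
key != prev {
if (length(output)) {
print output >> fileoutput
}
prev = key
sum = 0
}
{
sum += $9
output = sprintf("%s-%s;/%s/%s;%d;/%s", $1, $2, $7, $8, sum, $11)
fileoutput = $8 ".txt"
}
END {
print output >> fileoutput
}
' data*.txt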
Tested on repl generates:
+ cat hd1.txt
2020-01;/dev/hd1;300;/
2020-02;/dev/hd1;360;/
2020-03;/dev/hd1;390;/
+ cat hd2.txt
2020-01;/dev/hd2;600;/usr
2020-02;/dev/hd2;690;/usr
2020-03;/dev/hd2;720;/usr
Alternatively, you could set -v FS=';' and use split on the first and second columns to extract the year and month and the hdX name.
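A rough sketch of that split-based variant, assuming the record layout shown in the question. The function name monthly_sums and the array names are made up for illustration. Unlike the streaming version, it collects the sums in arrays, so the input does not need to be grouped:

```shell
# Hypothetical helper: arguments are the data files; writes hdX.txt into the current directory.
monthly_sums() {
    awk -F';' '
    {
        split($1, d, "-")              # d[1] = year, d[2] = month
        n = split($2, p, "/")          # p[n] = device name, e.g. hd1
        key = d[1] "-" d[2] ";" $2
        if (!(key in sum)) {           # remember first-seen order and per-key details
            order[++count] = key
            out[key] = p[n] ".txt"
            mount[key] = $4
        }
        sum[key] += $3
    }
    END {
        for (i = 1; i <= count; i++) {
            key = order[i]
            # ">" truncates each file once per run, later prints append to it
            print key ";" sum[key] ";" mount[key] > (out[key])
        }
    }' "$@"
}
# Usage: monthly_sums data*.txt
```

Because it uses > rather than >>, running it twice does not duplicate the rows.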
If you seek a bash solution, I suggest you invert the loops - first iterate over files, then over identifiers in second column.
for file in data*.txt; do # data*.txt rather than *.txt, so a re-run does not read the generated hdX.txt files
prev=
output=
while IFS=';' read -r date dev num path; do
hd=$(basename "$dev")
if [[ "$hd" != "${prev:-}" ]]; then
if ((${#output})); then
printf "%s\n" "$output" >> "$fileoutput"
fi
sum=0
prev="$hd"
fi
sum=$((sum + num))
output=$(
printf "%s;%s;%d;%s" \
"$(cut -d'-' -f1-2 <<<"$date")" \
"$dev" "$sum" "$path"
)
fileoutput="${hd}.txt"
done < "$file"
printf "%s\n" "$output" >> "$fileoutput"
done
You could also almost translate awk to bash 1:1 by doing IFS='-;/' in while read loop.
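To illustrate that last remark, here is a small sketch of how read splits one record when IFS is set to '-;/'. The function name split_record and the variable names are made up. A record has ten delimiters and therefore eleven fields, two of which are empty because ; and / are adjacent:

```shell
# Hypothetical helper: reads records on stdin and prints the fields of interest.
split_record() {
    while IFS='-;/' read -r year month day hour minute empty1 devdir hd num empty2 mount; do
        printf '%s-%s %s %s /%s\n' "$year" "$month" "$hd" "$num" "$mount"
    done
}
printf '%s\n' '2020-01-27-06-00;/dev/hd2;200;/usr' | split_record
```

This prints 2020-01 hd2 200 /usr, which are the same pieces the awk version reads from $1, $2, $8, $9 and $11.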

Removing strings from file using bash script

I want to delete specific strings from a file.
I tried to use:
for line3 in $(cat 2.txt)
do
if grep -Fxq $line3 4.txt
then
sed -i /$line3/d 4.txt
fi
done
I want this code to delete lines from 4.txt if they are also in 2.txt, but this loop deletes all lines from 4.txt and I have no idea why. Can someone tell what is wrong with this code ?
2.txt:
a
ab
abc
4.txt:
a
abc
abcdef
You can do this via single awk command:
awk 'ARGV[1] == FILENAME && FNR==NR {a[$1];next} !($1 in a)' 2.txt 4.txt
abcdef
To store output back to 4.txt use:
awk 'ARGV[1] == FILENAME && FNR==NR {a[$1];next} !($1 in a)' 2.txt 4.txt > _tmp && mv _tmp 4.txt
PS: Added ARGV[1] == FILENAME && to take care of empty file case as noted by @pjh below.
grep -F -v -x -f 2.txt 4.txt
or
grep -Fvxf 2.txt 4.txt
or
fgrep -vxf 2.txt 4.txt
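As for why the loop in the question empties 4.txt: sed /$line3/d treats the word as an unanchored regular expression, so /a/d deletes every line that contains an a anywhere, and all three lines of the sample 4.txt do. The grep -Fxq test does not prevent this, because it only checks that the exact line exists before sed runs. A small sketch of the difference, with made-up function names:

```shell
# Hypothetical helpers: $1 = word from 2.txt; stdin = contents of 4.txt.
delete_unanchored() { sed "/$1/d"; }        # what the loop does: matches anywhere in the line
delete_whole_line() { sed "/^$1\$/d"; }     # anchored: matches only if the whole line is the word

printf 'a\nabc\nabcdef\n' | delete_unanchored a    # prints nothing
printf 'a\nabc\nabcdef\n' | delete_whole_line a    # prints abc and abcdef
```

Even when anchored, the word is still interpreted as a regular expression, which is one reason the fixed-string grep -Fvxf form above is the safer choice.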
Using just Bash (4) builtins:
declare -A found
while IFS= read -r line || [[ $line ]] ; do found[$line]=1 ; done <2.txt
while IFS= read -r line || [[ $line ]] ; do
(( ${found[$line]-0} )) || printf '%s\n' "$line"
done <4.txt
The '[[ $line ]]' tests are to handle files with unterminated last lines.
Use 'printf' instead of 'echo' in case any of the output lines begin with 'echo' options.
Look ma', only sed...
sed $( sed 's,^, -e /^,;s,$,$/d,' 2.txt ) 4.txt
Transform each line in 2.txt in a sed command, e.g., abc -> -e /^abc$/d
Give the list of sed commands to an instance of sed operating on 4.txt
To store output back to 4.txt use:
sed -i $( sed 's,^, -e /^,;s,$,$/d,' 2.txt ) 4.txt
edit: while I love my answer on an aesthetic base, please don't try
this at home! see pjh comment below for a detailed rationale of the
many ways in which my microscript may fail

Remove lines containing space in unix

Below is my comma-separated input.txt file. I want to read the columns and write a line to output.txt only when none of its columns contains a space.
Content of input.txt:
1,Hello,world
2,worl d,hell o
3,h e l l o, world
4,Hello_Hello,World#c#
5,Hello,W orld
Content of output.txt:
1,Hello,world
4,Hello_Hello,World#c#
Is it possible to achieve this using awk? Please help!
A simple way to filter out lines with spaces is using inverted matching with grep:
grep -v ' ' input.txt
If you must use awk:
awk '!/ /' input.txt
Or perl:
perl -ne '/ / || print' input.txt
Or pure bash:
while IFS= read -r line; do [[ $line == *' '* ]] || printf '%s\n' "$line"; done < input.txt
# or
while IFS= read -r line; do [[ $line =~ ' ' ]] || printf '%s\n' "$line"; done < input.txt
UPDATE
To check if let's say field 2 contains space, you could use awk like this:
awk -F, '$2 !~ / /' input.txt
To check if let's say field 2 OR field 3 contains space:
awk -F, '!($2 ~ / / || $3 ~ / /)' input.txt
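If the column to check is not fixed, the field number can be passed in as an awk variable. A minimal sketch; the function name no_space_in_field and the variable col are made up for illustration:

```shell
# Hypothetical helper: $1 = field number to check; stdin = comma-separated lines.
no_space_in_field() {
    # $col is the field whose number is stored in col
    awk -F, -v col="$1" '$col !~ / /'
}
# Usage: no_space_in_field 2 < input.txt
```

With col set to 2 it behaves like the first command above and keeps lines 1, 4 and 5 of the sample input.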
For your follow-up question in comments
To do the same using sed, I only know these awkward solutions:
# remove lines if 2nd field contains space
sed -e '/^[^,]*,[^,]* /d' input.txt
# remove lines if 2nd or 3rd field contains space
sed -e '/^[^,]*,[^,]* /d' -e '/^[^,]*,[^,]*,[^,]* /d' input.txt
For your 2nd follow-up question in comments
To disregard leading spaces in the 2nd or 3rd fields:
awk -F', *' '!($2 ~ / / || $3 ~ / /)' input.txt
# or perhaps what you really want is this:
awk -F', *' -v OFS=, '!($2 ~ / / || $3 ~ / /) { print $1, $2, $3 }' input.txt
This can also be done easily with sed
sed '/ /d' input.txt
try this one-liner
awk 'NF==1' file
As @jwpat7 pointed out, it won't give correct output if the line has only a leading space. In that case this line with a regex should do, but it has already been posted in janos's answer.
awk '!/ /' file
or
awk -F' *' 'NF==1' file
Pure bash for the fun of it...
#!/bin/bash
while IFS= read -r line
do
if [[ ! $line =~ " " ]]
then
printf '%s\n' "$line"
fi
done < input.txt
columnWithSpace=2
ColumnBef=$(( columnWithSpace - 1 ))
sed "/^\([^,]*,\)\{${ColumnBef}\}[^,]* /d" input.txt
If you know the column directly (for example, the 3rd):
sed '/^\([^,]*,\)\{2\}[^,]* /d' input.txt
If you can trust the input to always have no more than three fields, simply finding a space somewhere after a comma is sufficient.
grep ',.* ' input.txt
If there can be (or usually are) more fields, you can pull that off with grep -E and a suitable ERE, but you are fast approaching the point at which the equivalent Awk solution will be more readable and maintainable.

Calculate Word occurrences from file in bash

I'm sorry for the very noob question, but I'm kind of new to bash programming (started a few days ago). Basically what I want to do is keep one file with all the word occurrences of another file
I know I can do this:
sort | uniq -c | sort
the thing is that after that I want to take a second file, calculate the occurrences again and update the first one. After I take a third file and so on.
What I'm doing at the moment works without any problem (I'm using grep, sed and awk), but it looks pretty slow.
I'm pretty sure there is a very efficient way just with a command or so, using uniq, but I can't figure out.
Could you please lead me to the right way?
I'm also pasting the code I wrote:
#!/bin/bash
# count the number of word occurrences from a file and writes to another file #
# the words are listed from the most frequent to the less one #
touch .check # used to check the occurrences. Temporary file
touch distribution.txt # final file with all the occurrences calculated
page=$1 # contains the file I'm calculating
occurrences=$2 # temporary file for the occurrences
# takes all the words from the file $page and orders them by occurrences
cat $page | tr -cs A-Za-z\' '\n'| tr A-Z a-z > .check
# loop to update the old file with the new information
# basically what I do is check word by word and add them to the old file as an update
cat .check | while read words
do
word=${words} # word I'm calculating
strlen=${#word} # word's length
# I use a blacklist to skip banned words (for example very short ones or insignificant words, like articles and prepositions)
if ! grep -Fxq $word .blacklist && [ $strlen -gt 2 ]
then
# if the word was never found before it writes it with 1 occurrence
if [ `egrep -c -i "^$word: " $occurrences` -eq 0 ]
then
echo "$word: 1" | cat >> $occurrences
# else it calculates the occurrences
else
old=`awk -v words=$word -F": " '$1==words { print $2 }' $occurrences`
let "new=old+1"
sed -i "s/^$word: $old$/$word: $new/g" $occurrences
fi
fi
done
rm .check
# finally it orders the words
awk -F": " '{print $2" "$1}' $occurrences | sort -rn | awk -F" " '{print $2": "$1}' > distribution.txt
Well, I'm not sure I understand exactly what you are trying to do,
but I would do it this way:
while read file
do
cat $file | tr -cs A-Za-z\' '\n'| tr A-Z a-z | sort | uniq -c > stat.$file
done < file-list
Now you have statistics for each of your files, and you simply aggregate them:
while read file
do
cat stat.$file
done < file-list \
| sort -k2 \
| awk '{if (NR>1 && $2!=prev) {print s" "prev; s=0;}s+=$1;prev=$2;}END{if (NR) print s" "prev;}'
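The aggregation can also be sketched with an awk associative array, which removes the need to sort by word first and makes it easy to fold a new stat file into the totals later. This assumes the stat.* files hold the "count word" lines produced by uniq -c; the function name merge_counts is made up for illustration:

```shell
# Hypothetical helper: arguments are "count word" files; prints merged totals, most frequent first.
merge_counts() {
    awk '{ total[$2] += $1 }                          # add up the counts per word
         END { for (w in total) print total[w], w }' "$@" |
    sort -k1,1nr -k2                                  # by count descending, then by word
}
# Usage: merge_counts stat.*.txt > distribution.txt
```

To update an existing result with one more file, pass the old distribution file and the new stat file together, since both have the same "count word" layout.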
Example of usage:
$ for i in ls bash cp; do man $i > $i.txt ; done
$ cat <<EOF > file-list
> ls.txt
> bash.txt
> cp.txt
> EOF
$ while read file; do
> cat $file | tr -cs A-Za-z\' '\n'| tr A-Z a-z | sort | uniq -c > stat.$file
> done < file-list
$ while read file
> do
> cat stat.$file
> done < file-list \
> | sort -k2 \
> | awk '{if ($2!=prev) {print s" "prev; s=0;}s+=$1;prev=$2;}END{print s" "prev;}' | sort -rn | head
3875 the
1671 is
1137 to
1118 a
1072 of
793 if
744 and
533 command
514 in
507 shell
