How to merge similar lines in Linux

I have a file test.txt on my Linux system which has data in the following format:
first second third fourth 10
first second third fourth 20
fifth sixth seventh eighth 10
mmm nnn ooo ppp 10
mmm nnn ooo ppp 20
I need to modify the format as below:
first second third fourth 10 20
fifth sixth seventh eighth 10 0
mmm nnn ooo ppp 10 20
I have tried following code
cat test.txt | sed 'N;s/\n/ /' | awk -F" " '{if ($1~$5){print $1" "$2" "$3" "$4" "$8} else { print $0 }}'
But this is not giving the required output. When there is a line which doesn't have a similar line below it, this command fails. Can you suggest a solution for this?

Here is one way to do it:
awk '
{
    last = $NF; $NF = ""                    # remember the trailing count, then strip it
    if ($0 == previous) {
        tail = tail " " last                # same key as the previous line: collect the count
    } else {
        if (previous != "") {
            if (split(tail, foo) == 1) tail = tail " 0"   # only one count seen: pad with 0
            print previous tail
        }
        previous = $0
        tail = last
    }
}
END {
    if (previous != "") print previous tail
}
' test.txt
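Run against the sample test.txt, this prints exactly the requested output. Note that it assumes lines sharing a key are adjacent, as they are in the sample; sort the file first if they may not be:
first second third fourth 10 20
fifth sixth seventh eighth 10 0
mmm nnn ooo ppp 10 20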

Perl solution:
perl -ne '/^(.*) (\S+)/ and push @{ $h{$1} }, $2 }{ print "$_ @{$h{$_}}\n" for keys %h' < test.txt
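Two caveats: the }{ trick ends perl -n's implicit read loop so the print runs once after all input, and keys %h come back in arbitrary order, so merged lines may not appear in input order. A sketch of a variant that remembers first-seen order (the @k and %seen names are my additions):
perl -ne '/^(.*) (\S+)/ or next; push @{ $h{$1} }, $2; push @k, $1 unless $seen{$1}++; END { print "$_ @{$h{$_}}\n" for @k }' < test.txt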

Reusing my solution (just for fun):
cat file.txt | sort | while read L; do
    # key = everything except the last field
    y=`echo $L | rev | cut -f2- -d' ' | rev`
    {
        # same key as before: append this line's last field
        test "$x" = "$y" && echo -n " `echo $L | awk '{print $NF}'`"
    } || {
        # new key: start a new output line
        x="$y"; echo -en "\n$L"
    }
done; echo   # final newline

Related

Print line numbers of duplicate entries

I have a file in the following format:
ABRA CADABRA
ABRA CADABRA
boys
girls
meds toys
I'd like to have the line number returned of any duplicate lines, so the results would look like the following:
1
2
I'd prefer a short one-line command with Linux tools. I've tried experimenting with awk and sed but have not had success yet.
This would work:
nl file.txt | uniq -f 1 -D | cut -f 1
nl prepends a line number to each line
uniq finds duplicates
-f 1 ignores the first field, i.e., the line number
-D prints (only) the lines that are duplicate
cut -f 1 keeps only the first field (the line number)
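A sample run on the five-line file above (nl right-pads its line numbers):
$ nl file.txt | uniq -f 1 -D | cut -f 1
     1
     2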
With sort and uniq you can at least list each duplicated line together with its count, though sorting discards the original line numbers:
sort File_Name | uniq -cd
Here is another approach; note that uniq -d only spots adjacent duplicates, so the file must be sorted (or, as in the sample, already have its duplicates adjacent), and grep -hn will match the text anywhere in a line:
uniq -d < $file | while read line; do grep -hn "$line" $file; done
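On the sample file (where the duplicates happen to be adjacent) this prints:
$ uniq -d < file.txt | while read line; do grep -hn "$line" file.txt; done
1:ABRA CADABRA
2:ABRA CADABRA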
Do this:
perl -e 'my $l = 0; while (<STDIN>) { chomp; $l++; if (exists $f{$_}) { if ($f{$_}->[0]++ == 1) { print "$f{$_}->[1]\n"; } print "$l\n"; } else { $f{$_} = [1,$l]; } }' < FILE
Ugly, but works for unsorted files.
$ cat in.txt
ABRA CADABRA
ABRA CADABRA
boys
girls
meds toys
girls
$ perl -e 'my $l = 0; while (<STDIN>) { chomp; $l++; if (exists $f{$_}) { if ($f{$_}->[0]++ == 1) { print "$f{$_}->[1]\n"; } print "$l\n"; } else { $f{$_} = [1,$l]; } }' < in.txt
1
2
4
6
$
EDIT: Actually it can be shortened slightly:
perl -ne '$l++; if (exists $f{$_}) { if ($f{$_}->[0]++ == 1) { print "$f{$_}->[1]\n"; } print "$l\n"; } else { $f{$_} = [1,$l]; }' < in.txt
To get all "different" duplicates together with their line numbers, you can try:
nl input.txt | sort -k 2 | uniq -D -f 1 | sort -n
This gives you not just the line numbers but also the duplicated text found on those lines (a sample run follows below). Omit the final sort to keep the duplicates grouped together.
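For the six-line in.txt shown above, the first command gives:
$ nl in.txt | sort -k 2 | uniq -D -f 1 | sort -n
     1  ABRA CADABRA
     2  ABRA CADABRA
     4  girls
     6  girls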
also try running:
nl input.txt | sort -k 2 | uniq --all-repeated=separate -f 1
This will group the various duplicates by adding an empty line between groups of duplicates.
Pipe the results through
| cut -f 1 | sed 's/ \+//g'
to get only the line numbers.
$ awk '{a[$0]=($0 in a ? a[$0] ORS : "") NR} END{for (i in a) if (a[i]~ORS) print a[i]}' file
1
2

Bash script to isolate words in a file

Here is my initial input data to be extracted:
david ex1=10 ex2=12 quiz1=5 quiz2=9 exam=99
judith ex1=8 ex2=16 quiz1=4 quiz2=10 exam=90
sam ex1=8 quiz1=5 quiz2=11 exam=85
song ex1=8 ex2=20 quiz2=11 exam=87
How do I extract each word so it is formatted this way:
david
ex1=10
ex2=12
etc...
As I eventually want to have output like this:
david 12 99
judith 16 90
sam 0 85
song 20 87
when I run my program with a command like:
./marks ex2 exam < file
Supposing your input file is named input.txt, just replace each space character with a newline character using the tr command-line tool:
tr ' ' '\n' < input.txt
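For the first line of the sample input this yields exactly the format you asked for:
$ tr ' ' '\n' < input.txt | head -n 6
david
ex1=10
ex2=12
quiz1=5
quiz2=9
exam=99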
For your second request, you have to extract specific fields from each line, so the cut and awk commands are useful. Note that my example hardcodes field positions, so it prints 5 instead of 0 for sam, whose line has no ex2 field; the script in the next answer handles that properly:
while read p; do
    echo -n "$(echo $p | cut -d ' ' -f1) "                  # name
    echo -n "$(echo $p | cut -d ' ' -f3 | cut -d '=' -f2) " # ex2 val
    echo -n $(echo $p | awk -F"exam=" '{ print $2 }')       # exam val
    echo
done < input.txt
This script does what you want:
#!/bin/bash
a=$@
awk -v a="$a" -F'[[:space:]=]+' '
BEGIN {
    n = split(a, b)                   # split field names into array b
}
{
    printf "%s ", $1                  # print first field (the name)
    for (i = 1; i <= n; i++) {        # loop through fields to search for, in argument order
        f = 0                         # unset "found" flag
        for (j = 2; j <= NF; j += 2)  # loop through remaining fields, 2 at a time
            if ($j == b[i]) {         # if field matches value in array
                printf "%s ", $(j+1)
                f = 1                 # set "found" flag
            }
        if (!f) printf "0 "           # print 0 if field not found
    }
    print ""                          # add newline
}' file
Testing it out
$ ./script.sh ex2 exam
david 12 99
judith 16 90
sam 0 85
song 20 87

Print columns from specific line of file?

I'm looking at files that all have a different version number that starts at column 18 of line 7.
What's the best way with Bash to read (into a $variable) the string on line 7, from column, i.e. "character," 18 to the end of the line? What about to the 5th to last character of the line?
sed way:
variable=$(sed -n '7s/^.\{17\}//p' file)
EDIT (thanks to commenters): If by columns you mean fields (separated with tabs or spaces), the command can be changed to
variable=$(sed -n '7s/^\s*\(\S\+\s\+\)\{17\}//p' file)
You have a number of different ways you can go about this, depending on the utilities you want to use. One of your options is to make use of Bash's substring expansion in any of the following ways:
sed
line=1
string=$(sed -n "${line}p" /etc/passwd)
echo "${string:17}"
awk
line=1
string=$(awk -v line="$line" 'NR == line { print; exit }' /etc/passwd)
echo "${string:17}"
coreutils
line=1
string=$( { head -n "$line" | tail -n 1; } < /etc/passwd )
echo "${string:17}"
Use
var=$(head -n 7 filename | tail -n 1 | cut -f 18-)
or
var=$(awk 'NR == 7 { delim = ""; for (i = 18; i <= NF; i++) { printf "%s%s", delim, $i; delim = OFS }; printf "\n" }' filename)
If you mean "characters" instead of "fields":
var=$(head -n 7 filename | tail -n 1 | cut -c 18-)
or
var=$(awk 'NR == 7 { print substr($0, 18) }' filename)
If by 'columns' you mean 'fields':
a=$( awk 'NR==7{ print $18 }' file )
If you really want the 18th byte through the end of line 7, do:
a=$( sed -n 7p file | cut -b 18- )
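As for reading up to the 5th-to-last character, which none of the answers above address: Bash's substring expansion accepts a negative length (bash 4.2 or newer) that trims characters off the end. A sketch; change -5 to -4 if the 5th-to-last character itself should be included:
line=$(sed -n 7p file)
var=${line:17:-5}   # character 18 up to, but not including, the last 5 characters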

sorting a "key/value pair" array in bash

How do I sort a "python dictionary-style" array, e.g. ( "A: 2" "B: 3" "C: 1" ), in Bash by the value? I think this code snippet will make my question a bit clearer.
State="Total 4 0 1 1 2 0 0"
W=$(echo $State | awk '{print $3}')
C=$(echo $State | awk '{print $4}')
U=$(echo $State | awk '{print $5}')
M=$(echo $State | awk '{print $6}')
WCUM=( "Owner: $W;" "Claimed: $C;" "Unclaimed: $U;" "Matched: $M" )
echo ${WCUM[@]}
This will simply print the array: Owner: 0; Claimed: 1; Unclaimed: 1; Matched: 2
How do I sort the array (or the output), eliminating any pair with "0" value, so that the result like this:
Matched: 2; Claimed: 1; Unclaimed: 1
Thanks in advance for any help or suggestions. Cheers!!
Quick and dirty idea would be (this just sorts the output, not the array):
echo ${WCUM[@]} | sed -e 's/; /;\n/g' | awk -F: '!/ 0;?/ {print $0}' | sort -t: -k 2 -r | xargs
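A sketch of a slightly sturdier variant of the same idea: printing one array element per line avoids the sed splitting, and it likewise drops zero values and sorts descending by value (the final xargs just rejoins the line, as above):
printf '%s\n' "${WCUM[@]}" | grep -v ': 0' | sort -t: -k2 -rn | xargs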
echo -e ${WCUM[@]} | tr ';' '\n' | sort -r -k2 | egrep -v ": 0$"
Sorting and filtering are independent steps, so if you only want to filter out the 0 values, it is much simpler. Append
| tr '\n' ';'
to join everything back into a single line at the end.
nonull=$(for n in ${!WCUM[@]}; do echo ${WCUM[n]} | egrep -v ": 0;"; done | tr -d "\n")
I don't see a good reason why $W, $C and $U end with a semicolon but $M does not, so instead of adapting my code to this distinction I would eliminate the special case. If that is not possible, I would temporarily append a semicolon to $M and remove it at the end.
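For the WCUM array above, this yields the filtered (but not yet sorted) result:
$ echo "$nonull"
Claimed: 1;Unclaimed: 1;Matched: 2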
Another attempt, using some of Bash's features, though it still needs sort, which is crucial:
#! /bin/bash
State="Total 4 1 0 4 2 0 0"
string=$State
for i in 1 2 ; do # remove unnecessary fields
string=${string#* }
string=${string% *}
done
# Insert labels
string=Owner:${string/ /;Claimed:}
string=${string/ /;Unclaimed:}
string=${string/ /;Matched:}
# Remove zeros
string=(${string[@]//;/; })
string=(${string[@]/*:0;/})
string=${string[@]}
# Format
string=${string//;/$'\n'}
string=${string//:/: }
# Sort (descending by value, matching the question's example)
string=$(sort -t: -rnk2 <<< "$string")
string=${string//$'\n'/;}
echo "$string"

Breakdown of how many lines have a given number of character occurrences

Is there an inbuilt command to do this or has anyone had any luck with a script that does it?
I am looking to get counts of how many lines had how many occurrences of a specific character, sorted descending by the number of occurrences.
For example, with this sample file:
gkdjpgfdpgdp
fdkj
pgdppp
ppp
gfjkl
Suggested invocation (for the 'p' character):
bash/perl some_script_name "p" samplefile
Desired output:
occs count
4 1
3 2
0 2
Update:
How would you write a solution that works on a two-character string such as 'gd', not just a single character such as 'p'?
$ sed 's/[^p]//g' input.txt | awk '{print length}' | sort -nr | uniq -c | awk 'BEGIN{print "occs", "count"}{print $2,$1}' | column -t
occs count
4 1
3 2
0 2
You could give the desired character as the field separator for awk, and do this:
awk -F 'p' '{ print NF-1 }' |
sort -k1nr |
uniq -c |
awk -v OFS="\t" 'BEGIN { print "occs", "count" } { print $2, $1 }'
For your sample data, it produces:
occs count
4 1
3 2
0 2
If you want to count occurrences of multi-character strings, just give the desired string as the separator, e.g., awk -F 'gd' ... or awk -F 'pp' ....
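One caveat: field splitting never counts overlapping matches, so this approach undercounts patterns that can overlap themselves (the Perl lookahead answer further down handles that case). A quick check:
$ echo pppp | awk -F 'pp' '{ print NF-1 }'
2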
#!/usr/bin/env perl
use strict; use warnings;

my $seq = shift @ARGV;          # the string to count (first argument)
die "usage: $0 <string> [file]\n" unless defined $seq;

my %freq;
while ( my $line = <> ) {
    last unless $line =~ /\S/;  # stop at the first blank line
    my $occurrences = () = $line =~ /(\Q$seq\E)/g;   # count non-overlapping matches
    $freq{ $occurrences } += 1;
}

for my $occurrences ( sort { $b <=> $a } keys %freq ) {
    print "$occurrences:\t$freq{$occurrences}\n";
}
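Run as the question suggests, this produces the desired breakdown (tab-separated):
$ perl some_script_name p samplefile
4:	1
3:	2
0:	2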
If you want short, you can always use:
#!/usr/bin/env perl
$x=shift;/\S/&&++$f{$a=()=/(\Q$x\E)/g}while<>
;print"$_:\t$f{$_}\n"for sort{$b<=>$a}keys%f;
or, perl -e '$x=shift;/\S/&&++$f{$a=()=/(\Q$x\E)/g}while<>;print"$_:\t$f{$_}\n"for sort{$b<=>$a}keys%f' inputfile, but now I am getting silly.
Pure Bash:
declare -a count
while read ; do
    cnt=${REPLY//[^p]/}           # remove non-p characters
    ((count[${#cnt}]++))          # use the remaining length as the array index
done < "$infile"
for idx in ${!count[*]}           # iterate over the indices that exist
do echo -e "$idx ${count[idx]}"
done | sort -nr
Output as desired:
4 1
3 2
0 2
You can do it in one gawk process (well, with a sort coprocess; the |& coprocess operator is GNU awk specific):
gawk -F p -v OFS='\t' '
{ count[NF-1]++ }                      # NF-1 = occurrences of the separator on this line
END {
    print "occs", "count"
    coproc = "sort -rn"
    for (n in count)
        print n, count[n] |& coproc    # feed the tallies to sort
    close(coproc, "to")                # close the write end so sort can finish
    while ((coproc |& getline) > 0)
        print
    close(coproc)
}
'
Shortest solution so far:
perl -nE'say tr/p//' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
For multiple characters, use a regex pattern:
perl -ple'$_ = () = /pg/g' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
This one handles overlapping matches (e.g. it finds 3 "pp" in "pppp" instead of 2):
perl -ple'$_ = () = /(?=pp)/g' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
Original cryptic but short pure-Perl version:
perl -nE'
++$c{ () = /pg/g };
}{
say "occs\tcount";
say "$_\t$c{$_}" for sort { $b <=> $a } keys %c;
'
