I want to prefix each line with a count of how many times its first column value occurs, using bash, without collapsing lines the way uniq does, like this:
input:
58311s2727 NC_000082.6 100.00 50
58311s2727 NC_000083.6 100.00 60
58311s2727 NC_000084.6 100.00 70
58310s2691 NC_000080.6 100.00 30
58310s2691 NC_000081.6 100.00 20
58308s2441 NC_000074.6 100.00 50
output:
3 58311s2727 NC_000082.6 100.00 50
3 58311s2727 NC_000083.6 100.00 60
3 58311s2727 NC_000084.6 100.00 70
2 58310s2691 NC_000080.6 100.00 30
2 58310s2691 NC_000081.6 100.00 20
1 58308s2441 NC_000074.6 100.00 50
I tried:
sort input.txt | cut -f1 | uniq -c
but the output is not what I want. Is there a simple way to solve this?
With sorted input, you can simply use awk, capturing the set of lines that have the same key and printing the previous set out when the key changes. Handling EOF is a tad messy; you have to repeat the printing. You could write an awk function to do the printing, but it is almost overkill for something this simple.
script.awk
$1 != old_key { if (n_keys > 0) for (i = 0; i < n_keys; i++) print n_keys, saved[i]; n_keys = 0 }
{ saved[n_keys++] = $0; old_key = $1 }
END { if (n_keys > 0) for (i = 0; i < n_keys; i++) print n_keys, saved[i] }
Example runs
For the sample input input.txt (which is already grouped), the output is:
$ awk -f script.awk input.txt
3 58311s2727 NC_000082.6 100.00 50
3 58311s2727 NC_000083.6 100.00 60
3 58311s2727 NC_000084.6 100.00 70
2 58310s2691 NC_000080.6 100.00 30
2 58310s2691 NC_000081.6 100.00 20
1 58308s2441 NC_000074.6 100.00 50
$
If you want it sorted, sort it first:
$ sort input.txt | awk -f script.awk
1 58308s2441 NC_000074.6 100.00 50
2 58310s2691 NC_000080.6 100.00 30
2 58310s2691 NC_000081.6 100.00 20
3 58311s2727 NC_000082.6 100.00 50
3 58311s2727 NC_000083.6 100.00 60
3 58311s2727 NC_000084.6 100.00 70
$
Note that, amongst other advantages, this can process data from a pipeline because it doesn't need to read the file twice, unlike at least one of the other solutions (the currently accepted one). It also only keeps as many lines in memory as there are in the biggest group sharing a key, so even fairly big files are unlikely to stress the memory on the system. (The sort probably imposes more memory load than the awk does.)
script2.awk
Using a function, and some white space, the code becomes:
function dump_keys( i) {
    if (n_keys > 0)
    {
        for (i = 0; i < n_keys; i++)
            print n_keys, saved[i]
    }
    n_keys = 0
}
$1 != old_key { dump_keys() }
{ saved[n_keys++] = $0; old_key = $1 }
END { dump_keys() }
The variable i is local to the function (a quirk of awk). I could simply omit it from the argument list since i is not used elsewhere in the script.
This produces the same output as script.awk.
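As a small aside, here is a minimal illustration of that scoping quirk (a hypothetical demo.awk, not part of the scripts above): a variable listed as an extra parameter stays local, anything else is global.
function f(   i) { i = 42; g = 42 }
BEGIN { f(); print "i is [" i "] and g is [" g "]" }   # prints: i is [] and g is [42]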
I would do this in awk. But as Aaron said, it will require reading the input twice, since the first time you hit a particular line, you don't know how many other times you'll hit it.
$ awk 'NR==FNR{a[$1]++;next} {print a[$1],$0}' inputfile inputfile
This goes through the file the first time, populating an array with a counter of the first field. Then it goes through a second time, printing the count along with each line.
You can adjust the print statement to suit your formatting requirements (perhaps replacing it with printf).
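For example, one possible printf variant that right-aligns the count in a fixed width (just a sketch of the idea, not the only format you could use):
$ awk 'NR==FNR{a[$1]++;next} {printf "%5d %s\n", a[$1], $0}' inputfile inputfile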
If you don't want to use awk, and really want this to work natively in bash, you could use a couple of one-liner while loops to achieve nearly the same results:
$ declare -A a
$ while read word therest; do ((a[$word]++)); done < inputfile
$ while read word therest; do printf "%5d\t%s\t%s\n" "${a[$word]}" "$word" "$therest"; done < inputfile
The declare -A is required because $a needs to be an associative array, with the first word of each line as the key. awk, on the other hand, treats every array as associative. Note that this solution does not maintain your whitespace.
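If preserving the original whitespace matters, here is one possible sketch (same two-pass idea, untested): keep the whole line with IFS= read -r and only split off the key separately.
$ declare -A a
$ while read -r word _; do ((a[$word]++)); done < inputfile
$ while IFS= read -r line; do read -r word _ <<< "$line"; printf '%5d\t%s\n' "${a[$word]}" "$line"; done < inputfile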
Without uniq, you'll have to read the input twice. There are ways to do that in pure BASH, but this is when I'd switch to a proper scripting language like Python 2:
import codecs
from collections import Counter

filename = '...'
encoding = '...'  # file encoding

counter = Counter()
with codecs.open(filename, 'r', encoding) as fh:
    for line in fh:
        parts = line.split(' ')
        counter[parts[0]] += 1

with codecs.open(filename, 'r', encoding) as fh:
    for line in fh:
        parts = line.split(' ')
        count = counter[parts[0]]
        print '%d %s' % (count, line),
I wrote a script in AWK called exc7
./exc7 file1 file2
In every file there is a matrix
file1 :
2 6 7
10 5 4
3 8 4
file2:
-60 10
10 -60
The code that I wrote is :
#!/usr/bin/awk -f
{
    for (i=1;i<=NF;i++)
        A[NR,i]=$i
}
END{
    for (i=1;i<=NR;i++){
        sum += A[i,1]
    }
    for (i=1;i<=NF;i++)
        sum2 += A[1,i]
    for (i=0;i<=NF;i++)
        sum3 += A[NR,i]
    for (i=0;i<=NR;i++)
        sum4 += A[i,NF]
    print sum,sum2,sum3,sum4
    if (sum==sum2 && sum==sum3 && sum==sum4)
        print "yes"
}
It should check, for each file, whether the sum of the first column, the last column, the first line and the last line are all the same. It should print the four sums and say yes if they are equal. Then it should print the largest sum of all numbers in all the files.
When I try it on one file it is right; for example, on file1 it prints:
15 15 15 15
yes
but when I try it on two or more files, like file1 file2, the output is:
-35 8 -50 -31
You should use FNR instead of NR, and with gawk you can use ENDFILE instead of END. However, this should work with any awk:
awk 'function sumline(last,rn) {
         n=split(last,lr);
         for(i=1;i<=n;i++) rn+=lr[i];
         return rn
     }
     function printresult(c1,r1,rn,cn) {
         print c1,r1,rn,cn;
         print (r1==rn && c1==cn && r1==c1)?"yes":"no"
     }
     FNR==1 { if(last) { rn=sumline(last); printresult(c1,r1,rn,cn) }
              rn=cn=c1=0;
              r1=sumline($0) }
     { c1+=$1; cn+=$NF; last=$0 }
     END { rn=sumline(last); printresult(c1,r1,rn,cn) }' file1 file2
15 15 15 15
yes
-50 -50 -50 -50
yes
Essentially, instead of checking for the end of a file, you check for the start of a file and print out the previous file's results. The first file needs to be treated differently, and you still need the END block to handle the last file.
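If you do have gawk, a minimal sketch of that ENDFILE route could look like the following (an illustration only, assuming every line of each matrix has the same number of fields):
#!/usr/bin/gawk -f
BEGINFILE { c1 = cn = r1 = rn = 0 }               # reset the sums for each file
FNR == 1  { for (i = 1; i <= NF; i++) r1 += $i }  # sum of the first line
{
    c1 += $1                                      # first column
    cn += $NF                                     # last column
    rn = 0; for (i = 1; i <= NF; i++) rn += $i    # overwritten until only the last line remains
}
ENDFILE {
    print c1, r1, rn, cn
    if (c1 == r1 && c1 == rn && c1 == cn)
        print "yes"
}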
UPDATE
Based on the questions you asked, I think it's better for you to keep your script as is and change the way you call it.
for file in file1 file2;
do echo "$file"; ./exc7 "$file";
done
You'll be calling the script once for each file, so all the complications go away.
I have a file with an unknown (but always even) number of lines. I want to print them side by side based on the total number of lines in that file. For example, I have a file with 16 lines like below:
asdljsdbfajhsdbflakjsdff235
asjhbasdjbfajskdfasdbajsdx3
asjhbasdjbfajs23kdfb235ajds
asjhbasdjbfajskdfbaj456fd3v
asjhbasdjb6589fajskdfbaj235
asjhbasdjbfajs54kdfbaj2f879
asjhbasdjbfajskdfbajxdfgsdh
asjhbasdf3709ddjbfajskdfbaj
100
100
150
125
trh77rnv9vnd9dfnmdcnksosdmn
220
225
sdkjNSDfasd89asdg12asdf6asdf
So now I want to print them side by side. As there are 16 lines in total, I am trying to get the result as 8:8, like below:
asdljsdbfajhsdbflakjsdff235 100
asjhbasdjbfajskdfasdbajsdx3 100
asjhbasdjbfajs23kdfb235ajds 150
asjhbasdjbfajskdfbaj456fd3v 125
asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn
asjhbasdjbfajs54kdfbaj2f879 220
asjhbasdjbfajskdfbajxdfgsdh 225
asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf
The paste command did not work for me exactly (paste - - - - - - - - < file1), nor did the awk command that I used: awk '{printf "%s" (NR%2==0?RS:FS),$1}'
Note: the number of lines in the file is dynamic. The only known thing in my scenario is that it is always an even number.
If you have the memory to hash the whole file ("max" below):
$ awk '{
    a[NR]=$0                # hash all the records
}
END {                       # after hashing
    mid=int(NR/2)           # compute the midpoint, int in case NR is uneven
    for(i=1;i<=mid;i++)     # iterate from start to midpoint
        print a[i],a[mid+i] # output
}' file
If you have the memory to hash half of the file ("mid"):
$ awk '
NR==FNR {                           # on 1st pass hash second half of records
    if(FNR>1) {                     # we dont need the 1st record ever
        a[FNR]=$0                   # hash record
        if(FNR%2)                   # if odd record
            delete a[int(FNR/2)+1]  # remove one from the past
    }
    next
}
FNR==1 {                            # on the start of 2nd pass
    if(NR%2==0)                     # if record count is uneven
        exit                        # exit as there is always even count of them
    offset=int((NR-1)/2)            # compute offset to the beginning of hash
}
FNR<=offset {                       # only process the 1st half of records
    print $0,a[offset+FNR]          # output one from file, one from hash
    next
}
{                                   # once 1st half of 2nd pass is finished
    exit                            # just exit
}' file file                        # notice filename twice
And finally, if you have awk compiled into a worm's brain (i.e. not so much memory, "min"):
$ awk '
NR==FNR {                                        # just get the NR of 1st pass
    next
}
FNR==1 {
    mid=(NR-1)/2                                 # get the midpoint
    file=FILENAME                                # filename for getline
    while(++i<=mid && (getline line < file)>0);  # jump getline to mid
}
{
    if((getline line < file)>0)                  # getline read from mid+FNR
        print $0,line                            # output
}' file file                                     # notice filename twice
Standard disclaimer on getline and no real error control implemented.
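If you do want a bit of that error control, one hedged sketch is to test getline's return value explicitly in the main block of the min version (getline returns 1 for a record read, 0 at end of file and -1 on a read error):
{
    if ((ret = (getline line < file)) > 0)
        print $0, line                               # normal case: paired line found
    else if (ret < 0) {
        print "error reading " file > "/dev/stderr"  # -1 signals a read error
        exit 1
    }
}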
Performance:
I ran seq 1 100000000 > file and tested how the above solutions performed. Output went to /dev/null; writing it to a file took around 2 s longer. The max version's performance is so-so, as its memory footprint was 88 % of my 16 GB, so it might have swapped. Well, I killed all the browsers and shaved 7 seconds off the real time of max.
+-------+----------------+------------+------------+
| which |      min       |    mid     |    max     |
+-------+----------------+------------+------------+
| time  | real 1m7.027s  | 1m30.146s  | 0m48.405s  |
|       | user 1m6.387s  | 1m27.314s  | 0m43.801s  |
|       | sys  0m0.641s  | 0m2.820s   | 0m4.505s   |
+-------+----------------+------------+------------+
| mem   | 3 MB           | 6.8 GB     | 13.5 GB    |
+-------+----------------+------------+------------+
Update:
I tested @DavidC.Rankin's and @EdMorton's solutions and they ran, respectively:
real 0m41.455s
user 0m39.086s
sys 0m2.369s
and
real 0m39.577s
user 0m37.037s
sys 0m2.541s
The memory footprint was about the same as my mid solution had. It pays to use wc, it seems.
$ pr -2t file
asdljsdbfajhsdbflakjsdff235 100
asjhbasdjbfajskdfasdbajsdx3 100
asjhbasdjbfajs23kdfb235ajds 150
asjhbasdjbfajskdfbaj456fd3v 125
asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn
asjhbasdjbfajs54kdfbaj2f879 220
asjhbasdjbfajskdfbajxdfgsdh 225
asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf
If you want just one space between columns, change it to:
$ pr -2ts' ' file
You can also do it with awk, simply by storing the first half of the lines in an array and then pairing each stored line with the corresponding line from the second half, e.g.
awk -v nlines=$(wc -l < file) -v j=0 'FNR<=nlines/2{a[++i]=$0; next} j<i{print a[++j],$1}' file
Example Use/Output
With your data in file, then
$ awk -v nlines=$(wc -l < file) -v j=0 'FNR<=nlines/2{a[++i]=$0; next} j<i{print a[++j],$1}' file
asdljsdbfajhsdbflakjsdff235 100
asjhbasdjbfajskdfasdbajsdx3 100
asjhbasdjbfajs23kdfb235ajds 150
asjhbasdjbfajskdfbaj456fd3v 125
asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn
asjhbasdjbfajs54kdfbaj2f879 220
asjhbasdjbfajskdfbajxdfgsdh 225
asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf
Extract the first half of the file and the last half of the file and merge the lines:
paste <(head -n $(($(wc -l <file.txt)/2)) file.txt) <(tail -n $(($(wc -l <file.txt)/2)) file.txt)
You can use columns utility from autogen:
columns -c2 --by-columns file.txt
You can use column, but its column count is calculated in a strange way from the width of your terminal. So, assuming your lines have 28 characters, you can also do:
column -c $((28*2+8)) file.txt
I do not want to solve this completely, but if I were you:
wc -l file.txt
gives the number of lines,
echo $(($(wc -l < file.txt)/2))
gives half of that, and
head -n $(($(wc -l < file.txt)/2)) file.txt > first.txt
tail -n $(($(wc -l < file.txt)/2)) file.txt > last.txt
creates files with the first half and the last half of the original file. Now you can merge those files together side by side.
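For example, paste can do that final merge (it joins with a tab by default; -d' ' gives a single space):
paste first.txt last.txt
paste -d' ' first.txt last.txt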
Here is my take on it, using the bash shell, wc(1) and ed(1):
#!/usr/bin/env bash
array=()
file=$1
total=$(wc -l < "$file")
half=$(( total / 2 ))
plus1=$(( half + 1 ))
for ((m=1;m<=half;m++)); do
    array+=("${plus1}m$m" "${m}"'s/$/ /' "${m}"',+1j')
done
After all of that, if you just want to print the output to stdout, add the line below to the script.
printf '%s\n' "${array[#]}" ,p Q | ed -s "$file"
If you want to write the changes directly to the file itself, use this line at the end of the script instead.
printf '%s\n' "${array[#]}" w | ed -s "$file"
Here is an example.
printf '%s\n' {1..10} > file.txt
Now running the script against that file.
./myscript file.txt
Output
1 6
2 7
3 8
4 9
5 10
Or, using the bash 4+ feature mapfile, aka readarray:
Save the file in an array named array.
mapfile -t array < file.txt
Split the array into two halves.
left=("${array[#]::((${#array[#]} / 2))}") right=("${array[#]:((${#array[#]} / 2 ))}")
Loop and print side by side:
for i in "${!left[#]}"; do
printf '%s %s\n' "${left[i]}" "${right[i]}"
done
Since, as you said, the only known thing in your scenario is that the line count is always even, that solution should work.
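Put together as one small sketch (the same steps as above, nothing new):
#!/usr/bin/env bash
mapfile -t array < file.txt          # read the whole file into an array
half=$(( ${#array[@]} / 2 ))
left=( "${array[@]:0:half}" )        # first half
right=( "${array[@]:half}" )         # second half
for i in "${!left[@]}"; do
    printf '%s %s\n' "${left[i]}" "${right[i]}"
done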
I have a sample file like
XYZAcc
ABCAccounting
Accounting firm
Accounting Aco
Accounting Acompany
Acoustical consultant
Here I need to find the most frequently occurring sequences of 3 letters within a word.
Output should be
acc = 5
aco = 3
Is that possible in Bash?
I have absolutely no idea how I can accomplish it with awk, sed or grep.
Any clue how it's possible?
PS: I have shown no attempted output because I have no idea how to do it; I don't want to write unnecessary awk -F xyz abc... attempts that are not going to help anywhere.
Here's how to get started with what I THINK you're trying to do:
$ cat tst.awk
BEGIN { stringLgth = 3 }
{
    for (fldNr=1; fldNr<=NF; fldNr++) {
        field = $fldNr
        fieldLgth = length(field)
        if ( fieldLgth >= stringLgth ) {
            maxBegPos = fieldLgth - (stringLgth - 1)
            for (begPos=1; begPos<=maxBegPos; begPos++) {
                string = tolower(substr(field,begPos,stringLgth))
                cnt[string]++
            }
        }
    }
}
END {
    for (string in cnt) {
        print string, cnt[string]
    }
}
$ awk -f tst.awk file | sort -k2,2nr
acc 5
cou 5
cco 4
ing 4
nti 4
oun 4
tin 4
unt 4
aco 3
abc 1
ant 1
any 1
bca 1
cac 1
cal 1
com 1
con 1
fir 1
ica 1
irm 1
lta 1
mpa 1
nsu 1
omp 1
ons 1
ous 1
pan 1
sti 1
sul 1
tan 1
tic 1
ult 1
ust 1
xyz 1
yza 1
zac 1
This is an alternative method to the solution of Ed Morton. It does less looping but needs a bit more memory. The idea is not to care about spaces or any non-alphabetic characters while counting; we filter them out at the end.
awk -v n=3 '{ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
END {for(s in a) if (s !~ /[^a-z]/) print s,a[s] }' file
When you use GNU awk, you can do this a bit differently and more efficiently by making each word its own record. That way the filtering at the end is not needed:
awk -v n=3 -v RS='[[:space:]]' '
(length>=n){ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
END {for(s in a) print s,a[s] }' file
This might work for you (GNU sed, sort and uniq):
sed -E 's/.(..)/\L&\n\1/;/^\S{3}/P;D' file |
sort |
uniq -c |
sort -s -k1,1rn |
sed -En 's/^\s*(\S+)\s*(\S+)/\2 = \1/;H;$!b;x;s/\n/ /g;s/.//p'
Use the first sed invocation to output 3 letter lower case words.
Sort the words.
Count the duplicates.
Sort the counts in reverse numerical order maintaining the alphabetical order.
Use the second sed invocation to manipulate the results into the desired format.
If you only want sequences that occur more than once, case-sensitive and in alphabetical order, use:
sed -E 's/.(..)/&\n\1/;/^\S{3}/P;D' file |
sort |
uniq -cd |
sed -En 's/^\s*(\S+)\s*(\S+)/\2 = \1/;H;$!b;x;s/\n/ /g;s/.//p'
Using a bash script (Ubuntu 16.04), I'm trying to compare 2 lists of ranges: does any number in any of the ranges in file1 coincide with any number in any of the ranges in file2? If so, print the row of the second file. Here each range is given as 2 tab-delimited columns (in file1, row 1 represents the range 1-4, i.e. 1, 2, 3, 4). The real files are quite big.
file1:
1 4
5 7
8 11
12 15
file2:
3 4
8 13
20 24
Desired output:
3 4
8 13
My best attempt has been:
awk 'NR=FNR { x[$1] = $1+0; y[$2] = $2+0; next};
{for (i in x) {if (x[i] > $1+0); then
{for (i in y) {if (y[i] <$2+0); then
{print $1, $2}}}}}' file1 file2 > output.txt
This returns an empty file.
I'm thinking that the script will need to involve range comparisons using if-then conditions and iterate through each line in both files. I've found examples of each concept, but can't figure out how to combine them.
Any help appreciated!
It depends on how big your files are, of course. If they are not big enough to exhaust the memory, you can try this 100% bash solution:
declare -a min=() # array of lower bounds of ranges
declare -a max=() # array of upper bounds of ranges
# read ranges in second file, store them in arrays min and max
while read a b; do
    min+=( "$a" );
    max+=( "$b" );
done < file2
# read ranges in first file
while read a b; do
    # loop over indexes of min (and max) array
    for i in "${!min[@]}"; do
        if (( max[i] >= a && min[i] <= b )); then # if ranges overlap
            echo "${min[i]} ${max[i]}"            # print range
            unset min[i] max[i]                   # performance optimization
        fi
    done
done < file1
This is just a starting point. There are many possible performance / memory footprint improvements. But they strongly depend on the sizes of your files and on the distributions of your ranges.
EDIT 1: improved the range overlap test.
EDIT 2: reused the excellent optimization proposed by RomanPerekhrest (unset already printed ranges from file2). The performance should be better when the probability that ranges overlap is high.
EDIT 3: performance comparison with the awk version proposed by RomanPerekhrest (after fixing the initial small bugs): awk is between 10 and 20 times faster than bash on this problem. If performance is important and you hesitate between awk and bash, prefer:
awk 'NR == FNR { a[FNR] = $1; b[FNR] = $2; next; }
     {
         for (i in a)
             if ($1 <= b[i] && a[i] <= $2) {
                 print a[i], b[i]; delete a[i]; delete b[i];
             }
     }' file2 file1
awk solution:
awk 'NR==FNR{ a[$1]=$2; next }
     {
         for(i in a)
             if (($1>=i+0 && $1<=a[i]) || ($2<=a[i] && $2>=i+0)) {
                 print i,a[i]; delete a[i];
             }
     }' file2 file1
The output:
3 4
8 13
awk 'FNR == 1 && NR == 1 { file=1 } FNR == 1 && NR != 1 { file=2 } file ==1 { for (q=1;q<=NF;q++) { nums[$q]=$0} } file == 2 { for ( p=1;p<=NF;p++) { for (i in nums) { if (i == $p) { print $0 } } } }' file1 file2
Break down:
FNR == 1 && NR == 1 {
    file=1
}
FNR == 1 && NR != 1 {
    file=2
}
file == 1 {
    for (q=1;q<=NF;q++) {
        nums[$q]=$0
    }
}
file == 2 {
    for ( p=1;p<=NF;p++) {
        for (i in nums) {
            if (i == $p) {
                print $0
            }
        }
    }
}
Basically we set file = 1 when we are processing the first file and file = 2 when we are processing the second file. When we are in the first file, read the line into an array keyed on each field of the line. When we are in the second file, process the array (nums) and check if there is an entry for each field on the line. If there is, print it.
For GNU awk, as I'm controlling the for scanning order to optimize the running time:
$ cat program.awk
BEGIN {
    PROCINFO["sorted_in"]="#ind_num_desc"
}
NR==FNR {                                      # hash file1 to a
    if(($2 in a==0) || $1<a[$2])               # avoid collisions
        a[$2]=$1
    next
}
{
    for(i in a) {                              # in desc order
        # print "DEBUG: For:",$0 ":", a[i], i  # remove # for debug
        if(i+0>$1) {                           # next after
            if($1<=i+0 && a[i]<=$2) {
                print
                next
            }
        }
        else
            next
    }
}
Test data:
$ cat file1
0 3 # testing for completely overlapping ranges
1 4
5 7
8 11
12 15
$ cat file2
1 2 # testing for completely overlapping ranges
3 4
8 13
20 24
Output:
$ awk -f program.awk file1 file2
1 2
3 4
8 13
and
$ awk -f program.awk file2 file1
0 3
1 4
8 11
12 15
If a Perl solution is preferred, then the below one-liner would work.
/tmp> cat marla1.txt
1 4
5 7
8 11
12 15
/tmp> cat marla2.txt
3 4
8 13
20 24
/tmp> perl -lane ' BEGIN { %kv=map{split(/\s+/)} qx(cat marla2.txt) } { foreach(keys %kv) { if($F[0]==$_ or $F[1]==$kv{$_}) { print "$_ $kv{$_}" }} } ' marla1.txt
3 4
8 13
/tmp>
If the ranges are ordered according to their lower bounds, we can use this to make the algorithms more efficient. The idea is to alternately proceed through the ranges in file1 and file2. More precisely, when we have a certain range R in file2, we take further and further ranges in file1 until we know whether these overlap with R. Once we know this, we switch to the next range in file2.
#!/bin/bash
exec 3< "$1"   # file whose ranges are checked for overlap with those ...
exec 4< "$2"   # ... from this file, and if so, are written to stdout
l4=-1          # lower bound of current range from file 2
u4=-1          # upper bound
# initialized with -1 so the first range is read on the first iteration
echo "Ranges in $1 that intersect any ranges in $2:"
while read l3 u3; do                    # read next range from file 1
    if (( u4 >= l3 )); then
        (( l4 <= u3 )) && echo "$l3 $u3"
    else  # the upper bound from file 2 is below the lower bound from file 1, so ...
        while read l4 u4; do            # ... we read further ranges from file 2 until ...
            if (( u4 >= l3 )); then     # ... their upper bound is high enough
                (( l4 <= u3 )) && echo "$l3 $u3"
                break
            fi
        done <&4
    fi
done <&3
The script can be called with ./script.sh file2 file1
I have a log file which contains a lot of entries, and the first column contains the epoch timestamp. Ideally they should be in sequence/sorted; however, if something goes wrong in the system, the timestamp resets to some default value, the sequence breaks and then restarts. I am trying to find the lines where the sequence/sorting gets broken. Please notice the sequence number in column one.
For example:
101 aaa bbb ccc
102 aa dd ff gg
103 asd asd asdas
104 something goes wrong
101 restarting the time stamp
103 new start
104 going fine
105 smae here
102 ahh something unexpected
Desired output:
104 something goes wrong
101 restarting the time stamp
105 smae here
102 ahh something unexpected
sort -c helps, but I cannot store its output in a file for further processing. Please let me know if any alternative way is possible.
With awk:
awk '$1 < prev { print saved "\n" $0 "\n" } { prev = $1; saved = $0 }' filename
There's not much to explain here, as you can see:
$1 < prev {                  # if a line is out of order
    print saved "\n" $0 "\n" # print it and the previous line
}
{                            # and for all lines:
    prev = $1                # remember the things you need to determine
    saved = $0               # when that is the case.
}
prev is initially empty and considered equal to zero, which should be fine for Epoch time stamps unless you have log entries from before 1970. If you do, replace $1 < prev with NR > 1 && $1 < prev, since the first line cannot be out of order.
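For example, with that guard in place the one-liner becomes:
awk 'NR > 1 && $1 < prev { print saved "\n" $0 "\n" } { prev = $1; saved = $0 }' filename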