Group the consecutive numbers in shell - linux

$ foo="1,2,3,6,7,8,11,13,14,15,16,17"
In shell, how can I group the numbers in $foo as 1-3,6-8,11,13-17?

Given the following function:
build_range() {
    local range_start= range_end=
    local -a result

    end_range() {
        : range_start="$range_start" range_end="$range_end"
        [[ $range_start ]] || return
        if (( range_end == range_start )); then
            # single number; just add it directly
            result+=( "$range_start" )
        elif (( range_end == (range_start + 1) )); then
            # emit 6,7 instead of 6-7
            result+=( "$range_start" "$range_end" )
        else
            # span larger than 2; emit as start-end
            result+=( "$range_start-$range_end" )
        fi
        range_start= range_end=
    }

    range_start= range_end=
    result=( )
    for number; do
        : number="$number"
        if ! [[ $range_start ]]; then
            # use the first number to initialize both values
            range_start=$number
            range_end=$number
            continue
        elif (( number == (range_end + 1) )); then
            (( range_end += 1 ))
            continue
        else
            end_range
            range_start=$number
            range_end=$number
        fi
    done
    end_range
    (IFS=,; printf '%s\n' "${result[*]}")
}
...called as follows:
# convert your string into an array
IFS=, read -r -a numbers <<<"$foo"
build_range "${numbers[@]}"
...we get the output:
1-3,6-8,11,13-17
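Note that build_range assumes its arguments arrive in ascending order. If your list might be unsorted, a minimal sketch is to sort numerically first:

IFS=, read -r -a numbers <<<"$foo"
# one number per line, numeric sort, then word-split back into arguments
build_range $(printf '%s\n' "${numbers[@]}" | sort -n)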

awk solution for an extended sample:
foo="1,2,3,6,7,8,11,13,14,15,16,17,19,20,33,34,35"
awk -F',' '{
    r = nxt = 0;
    for (i=1; i<=NF; i++)
        if ($i+1 == $(i+1)) { if (!r) r = $i"-"; nxt = $(i+1) }
        else { printf "%s%s", (r) ? r nxt : $i, (i == NF) ? ORS : FS; r = 0 }
}' <<<"$foo"
The output:
1-3,6-8,11,13-17,19-20,33-35

As an alternative, you can use this awk command:
cat series.awk
function prnt(delim) {
    printf "%s%s", s, (p > s ? "-" p : "") delim
}
BEGIN {
    RS = ","
}
NR == 1 {
    s = $1
}
p < $1-1 {
    prnt(RS)
    s = $1
}
{
    p = $1
}
END {
    prnt(ORS)
}
Now run it as:
$> foo="1,2,3,6,7,8,11,13,14,15,16,17"
$> awk -f series.awk <<< "$foo"
1-3,6-8,11,13-17
$> foo="1,3,6,7,8,11,13,14,15,16,17"
$> awk -f series.awk <<< "$foo"
1,3,6-8,11,13-17
$> foo="1,3,6,7,8,11,13,14,15,16,17,20"
$> awk -f series.awk <<< "$foo"
1,3,6-8,11,13-17,20
Here is a one-liner doing the same:
awk 'function prnt(delim){printf "%s%s", s, (p > s ? "-" p : "") delim}
BEGIN{RS=","} NR==1{s = $1} p < $1-1{prnt(RS); s = $1} {p = $1}END {prnt(ORS)}' <<< "$foo"
In this awk command we keep two variables:
- p stores the previous record's number
- s stores the start of the range that needs to be printed

How it works:
- When NR==1 we set s to the first record's number.
- When p is less than the current number minus 1, i.e. $1-1, there is a break in the sequence and we need to print the range.
- Printing is done by the function prnt, which accepts a single argument: the end delimiter. When prnt is called from the p < $1-1 {...} block we pass RS (a comma) as the delimiter; when it is called from the END {...} block we pass ORS (a newline).
- Inside p < $1-1 {...} we reset s (the range start) to $1.
- After processing each record we store $1 in the variable p.
- prnt uses printf for formatted output. It always prints the starting number s first; then, if p > s, it prints a hyphen followed by p.
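For example, tracing foo="1,3,4" record by record:

record "1":  NR==1, so s=1; then p=1
record "3":  p(1) < 3-1, so prnt(RS) prints "1,"; then s=3 and p=3
record "4":  3 < 4-1 is false, so nothing prints; p=4
END:         prnt(ORS) prints "3-4\n"

Final output: 1,3-4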

How to search for string with one variable position in the string?

I would like to find all lines in a large file that contain a given string, allowing ONE character of my string to be different and still counting it as a match.
For example I have this file:
>1 agctcaTATAAGtataagctagaagta
>2 gatgctagcgaagtaatgc
>3 atatagcgctagagccgtagta
>4 gctagcaTATCAGgatgtagtagta
...
and this string: tataag, so I get this output:
>1 agctcaTATAAGtataagctagaagta
>4 gctagcaTATCAGgatgtagtagta
Line 1 matches directly, and line 4 matches everything but the letter A, which is a C instead.
To allow one char to be different:
$ cat tst.awk
BEGIN {
    lgth = length(str)
    for (i=1; i<=lgth; i++) {
        head = esc(substr(str,1,i-1))
        tail = esc(substr(str,i+1))
        part = head "." tail
        reg  = (i>1 ? reg "|" : "") part
    }
    reg = "(" tolower(reg) ")"
    printf "Searching for string \"%s\"\n", str | "cat>&2"
    printf "Searching for regexp \"%s\"\n", reg | "cat>&2"
}
tolower($0) ~ reg
function esc(str) {
    gsub(/[^^\\]/,"[&]",str)
    gsub(/\^|\\/,"\\\\&",str)
    return str
}
$ awk -v str='tataag' -f tst.awk file
>1 agctcaTATAAGtataagctagaagta
>4 gctagcaTATCAGgatgtagtagta
Searching for string "tataag"
Searching for regexp "(.[a][t][a][a][g]|[t].[t][a][a][g]|[t][a].[a][a][g]|[t][a][t].[a][g]|[t][a][t][a].[g]|[t][a][t][a][a].)"
To allow one char to be missing:
$ cat tst.awk
BEGIN {
    lgth = length(str)
    for (i=1; i<=lgth; i++) {
        head = esc(substr(str,1,i))
        tail = esc(substr(str,i+1))
        part = head "?" tail
        reg  = (i>1 ? reg "|" : "") part
    }
    reg = "(" tolower(reg) ")"
    printf "Searching for string \"%s\"\n", str | "cat>&2"
    printf "Searching for regexp \"%s\"\n", reg | "cat>&2"
}
tolower($0) ~ reg
function esc(str) {
    gsub(/[^^\\]/,"[&]",str)
    gsub(/\^|\\/,"\\\\&",str)
    return str
}
$ awk -v str='tataag' -f tst.awk file
>1 agctcaTATAAGtataagctagaagta
>3 atatagcgctagagccgtagta
Searching for string "tataag"
Searching for regexp "([t]?[a][t][a][a][g]|[t][a]?[t][a][a][g]|[t][a][t]?[a][a][g]|[t][a][t][a]?[a][g]|[t][a][t][a][a]?[g]|[t][a][t][a][a][g]?)"
All the escaping above is to ensure that your string gets treated as a literal string even if/when it contains regexp metacharacters.
You can remove the 2 print statements when you're done testing.
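As a quick illustration (a standalone sketch, not part of the original answer) of what esc() produces when the string contains regexp metacharacters:

$ awk '
function esc(str) {
    gsub(/[^^\\]/,"[&]",str)     # wrap every character except ^ and \ in its own bracket expression
    gsub(/\^|\\/,"\\\\&",str)    # backslash-escape ^ and \, which cannot safely go inside brackets
    return str
}
BEGIN { print esc("a.b*c") }'
[a][.][b][*][c]

Each character now matches only itself, so . and * lose their regexp meaning.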
$ # generate the different combinations
$ # assumes search term doesn't have regex metacharacters
$ echo 'tataag' | awk 'BEGIN{FS=OFS=""} {orig=$0; for(i=1;i<=NF;i++)
{ $i = "."; ORS=(i==NF)?"\n":"|"; print; $0=orig }}'
.ataag|t.taag|ta.aag|tat.ag|tata.g|tataa.
$ # pass it to grep as the regex to be used
$ echo 'tataag' | awk 'BEGIN{FS=OFS=""} {orig=$0; for(i=1;i<=NF;i++)
{ $i = "."; ORS=(i==NF)?"\n":"|"; print; $0=orig }}' | grep -iEf - ip.txt
>1 agctcaTATAAGtataagctagaagta
>4 gctagcaTATCAGgatgtagtagta
You can also make it stricter by using [acgt] instead of .
$ echo 'tataag' | awk 'BEGIN{FS=OFS=""} {orig=$0; for(i=1;i<=NF;i++)
{ $i = "[acgt]"; ORS=(i==NF)?"\n":"|"; print; $0=orig }}'
[acgt]ataag|t[acgt]taag|ta[acgt]aag|tat[acgt]ag|tata[acgt]g|tataa[acgt]

Manipulating files using awk - linux

I have a 1.txt file (with field separator as ||o||):
aidagolf6@gmail.com||o||bb1e6b92d60454122037f302359d8a53||o||Aida||o||Aida||o||Muji?
aidagolf6@gmail.com||o||bcfddb5d06bd02b206ac7f9033f34677||o||Aida||o||Aida||o||Muji?
aidagolf6@gmail.com||o||bf6265003ae067b19b88fa4359d5c392||o||Aida||o||Aida||o||Garic Gara
aidagolf6@gmail.com||o||d3a6a8b1ed3640188e985f8a1efbfe22||o||Aida||o||Aida||o||Muji?
aidagolfa@hotmail.com||o||14f87ec1e760d16c0380c74ec7678b04||o||Aida||o||Aida||o||Rodriguez Puerto
2.txt (with field separator as :):
bf6265003ae067b19b88fa4359d5c392:hyworebu:#
14f87ec1e760d16c0380c74ec7678b04:sujycugu
I want a result.txt file, produced by matching the 2nd column of 1.txt against the 1st column of 2.txt; where they match, the 2nd column of 1.txt is replaced with the 2nd column of 2.txt:
aidagolf6@gmail.com||o||hyworebu:#||o||Aida||o||Aida||o||Garic Gara
aidagolfa@hotmail.com||o||sujycugu||o||Aida||o||Aida||o||Rodriguez Puerto
And a left.txt file containing the rows from 1.txt that have no match in 2.txt:
aidagolf6@gmail.com||o||d3a6a8b1ed3640188e985f8a1efbfe22||o||Aida||o||Aida||o||Muji?
aidagolf6@gmail.com||o||bb1e6b92d60454122037f302359d8a53||o||Aida||o||Aida||o||Muji?
aidagolf6@gmail.com||o||bcfddb5d06bd02b206ac7f9033f34677||o||Aida||o||Aida||o||Muji?
The script I am trying is:
awk -F '[|][|]o[|][|]' -v s1="||o||" '
NR==FNR {
    a[$2] = $1;
    b[$2] = $3 s1 $4 s1 $5;
    next
}
($1 in a) {
    $1 = "";
    sub(/:/, "")
    print a[$1] s1 $2 s1 b[$1] > "result.txt";
    next
}' 1.txt 2.txt
The problem is that the script is applying ||o|| as the separator to 2.txt as well, which gives wrong results.
EDIT
Modified script:
awk -v s1="||o||" '
NR==FNR {
    a[$2] = $1;
    b[$2] = $3 s1 $4 s1 $5;
    next
}
($1 in a) {
    $1 = "";
    sub(/:/, "")
    print a[$1] s1 $2 s1 b[$1] > "result.txt";
    next
}' FS = "||o||" 1.txt FS = ":" 2.txt
Now, I am getting the following error:
awk: fatal: cannot open file `FS' for reading (No such file or directory)
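Aside: the error occurs because the spaces around = make awk parse FS, =, and "||o||" as three separate arguments, so it tries to open a file named FS. Command-line assignments must be written as FS=value with no spaces; note also that a multi-character FS is treated as a regex, so the pipes must stay bracketed. A minimal sketch of the intended per-file assignment:

awk -v s1="||o||" '
NR==FNR {
    a[$2] = $1
    b[$2] = $3 s1 $4 s1 $5
    next
}
($1 in a) {
    r = substr($0, index($0, ":") + 1)    # everything after the first ":"
    print a[$1] s1 r s1 b[$1] > "result.txt"
}' 'FS=[|][|]o[|][|]' 1.txt FS=: 2.txt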
I've modified your original script:
awk -F'[|][|]o[|][|]' -v s1="||o||" '
NR == FNR {
    a[$2] = $1;
    b[$2] = $3 s1 $4 s1 $5;
    c[$2] = $0;                          # keep the line for left.txt
}
NR != FNR {
    split($0, d, ":");
    r = substr($0, index($0, ":") + 1);  # right side of the 1st ":"
    if (a[d[1]] != "") {
        print a[d[1]] s1 r s1 b[d[1]] > "result.txt";
        c[d[1]] = "";                    # drop from the list for left.txt
    }
}
END {
    for (var in c) {
        if (c[var] != "") {
            print c[var] > "left.txt"
        }
    }
}' 1.txt 2.txt
The next version changes the order of file reading to reduce memory consumption: 2.txt, read first, is the smaller file, so only its map has to be held in memory:
awk -F'[|][|]o[|][|]' -v s1="||o||" '
NR == FNR {
    split($0, a, ":");
    r = substr($0, index($0, ":") + 1);  # right side of the 1st ":"
    map[a[1]] = r;
}
NR != FNR {
    if (map[$2] != "") {
        print $1 s1 map[$2] s1 $3 s1 $4 s1 $5 > "result.txt";
    } else {
        print $0 > "left.txt"
    }
}' 2.txt 1.txt
and the final version makes use of a file-based database, which minimizes memory consumption, although I'm not sure whether Perl is acceptable on your system.
perl -e '
use DB_File;
$file1 = "1.txt";
$file2 = "2.txt";
$result = "result.txt";
$left = "left.txt";
my $dbfile = "tmp.db";
tie(%db, "DB_File", $dbfile, O_CREAT|O_RDWR, 0644) or die "$dbfile: $!";

open(FH, $file2) or die "$file2: $!";
while (<FH>) {
    chop;
    @_ = split(/:/, $_, 2);
    $db{$_[0]} = $_[1];
}
close FH;

open(FH, $file1) or die "$file1: $!";
open(RESULT, "> $result") or die "$result: $!";
open(LEFT, "> $left") or die "$left: $!";
while (<FH>) {
    @_ = split(/\|\|o\|\|/, $_);
    if (defined $db{$_[1]}) {
        $_[1] = $db{$_[1]};
        print RESULT join("||o||", @_);
    } else {
        print LEFT $_;
    }
}
close FH;
untie %db;
'
rm tmp.db

Whitewashing files with a whitelist using multiple instances of awk from within a bash script, troubles

I need to whitewash a set of files against a static whitelist. When I use the following commands on a small scale they seem to work, but when I run them in parallel from within a bash script I get inconsistent results: not all entries are removed as intended, so dirty data remains in the target files that need to be washed. I need a resolution; this is a life-altering problem that must be solved, so if anybody can give me a heads up it would be very much helpful.
(By the way, I split the whitelist into multiple copies hoping it would resolve the issue; it did not.)
The file*s here are each over 100,000 lines of plain-text domain names, and whitelist.txt has over 25,000 entries:
google.com
1.google.net
websitetowhitelist.org
and so on...
example:
#!/bin/bash
# Whitewash script washes blacklists against whitelist to remove domains that should never be blacklisted.
#
#
echo 'Washing file1 blacklist with whitelist.txt ...'
cat 'file1.acl' | awk '{ m=0 ; while ((getline row < "whitelist.txt") == 1) { if (row == $0) { m=1 ; break } } ; close("whitelist.txt") ; if (m == 0) { print $0 }}' > 'file1.out' &
echo 'Washing file2 blacklist with whitelist.txt ...'
cat 'file2' | awk '{ m=0 ; while ((getline row < "whitelist.txt") == 1) { if (row == $0) { m=1 ; break } } ; close("whitelist.txt") ; if (m == 0) { print $0 }}' > 'file2.out' &
echo 'Washing file3 blacklist with whitelist.txt ...'
cat 'file3.acl' | awk '{ m=0 ; while ((getline row < "whitelist.txt") == 1) { if (row == $0) { m=1 ; break } } ; close("whitelist.txt") ; if (m == 0) { print $0 }}' > 'file3.out' &
With files of this size, it is generally a good idea to look at blocks instead of single lines or, perhaps, to try Perl or another language.
So, another solution could be to:
- tag the whitelist and the dirty file
- sort them in order of the keys
- remove the duplicates
sed 's/$/;a/' < whitelist > whitelisttagged
sed 's/$/;b/' < dirtyfile > dirtyfiletagged
cat whitelisttagged dirtyfiletagged > alltagged
sort alltagged > allsorted
cat allsorted | awk -F';' 'BEGIN {a=""} /;a$/{a=$1} /;b$/ { if ($1 != a) {print $1}}'
You will notice that the awk is a lot less complicated.
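For completeness, a minimal alternative sketch (an addition, assuming you want exact-line matches): the original loop re-opens whitelist.txt with getline for every single input line, which is slow and fragile; loading the whitelist into an awk array once avoids that:

#!/bin/bash
# Wash each blacklist against the whitelist using hash lookups, one pass per file.
for f in file1.acl file2 file3.acl; do
    awk 'NR==FNR { white[$0]; next }   # first file: remember every whitelist line
         !($0 in white)                # later file: print only non-whitelisted lines
        ' whitelist.txt "$f" > "$f.out" &
done
wait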

Calculation of average with linux shell and searching for maximum with its label

I have an input like the following:
*KEYWORD
$TIME_VALUE = 9.9999993e-004
$STATE_NO = 2
$Output for State 2 at time = 0.001
*END
$NODAL_RESULTS
$RESULT OF Resultant Displacement
721810 1.7188E-2
721812 6.1973E-2
721825 1.1481E+0
721827 1.0962E+0
721852 5.1831E-1
721854 1.3085E-2
721867 1.1077E+0
. .
. .
. .
I need to find the maximum of the values in column 2 and also their average. I also need to output the number in the first column that corresponds to the maximum value.
I used the following code to calculate the maximum and the average, but a division by zero came up:
awk: cmd. line:5: fatal: division by zero attempted
The code is as follows:
# 1.k is the input file name.
sed -n '/^[0-9]\{1\}/p' 1.k > 2.k # delete all lines not starting with number
mv 2.k 1.k
sed -i -e '/^$/d' 1.k # delete all lines that are empty
#sed -i -e 's/^[ \t]*//;s/[ \t]*$//' 1.k
awk 'BEGIN{min=999}
{a[NR]=$0;if($2<min){min=$2;m[1]=NR;}if($2>max){max=$2;m[2]=NR;}m[2]+=$2;}
END{print "Min:"a[m[1]];
print "Max:"a[m[2]];
print "Number Of Nodes:" NR;
print "Avg:"m[3]/NR}' 1.k
Can anybody help me with this problem?
regards,
calculate.awk:
{
    sum += $2
    if (NR == 1) {
        min = max = $2
        minv = maxv = $1
    }
    if (min > $2) { min = $2; minv = $1 }
    if (max < $2) { max = $2; maxv = $1 }
}
END {
    print "Min: " minv ", " min
    print "Avg: " sum / NR
    print "Max: " maxv ", " max
    print "# Nodes: " NR
}
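To run it (a sketch; the whitespace strip and digit filter are assumptions about your data, mirroring the sed filtering used elsewhere in this thread):

sed 's/^[ \t]*//' 1.k | grep '^[0-9]' | awk -f calculate.awk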
If you first filter out the non-numeric info, then this awk script should do:
awk 'BEGIN{max=-999}
{
    col1[NR]=$1
    col2[NR]=$2
    if($2>max){max=$2;imax=NR}
    sum+=$2
}
END{print col1[imax]" "col2[imax]" average: "sum/NR}' yourinputfile
After trying and trying, I found a working, though I think not optimal, solution to the problem. (The division by zero happened because the data lines start with whitespace: with the whitespace strip commented out, the ^[0-9] filter deleted every line, so NR was 0 in the final division. Stripping the whitespace first fixes that; the sum is now also accumulated in m[3], matching the m[3]/NR in the END block.)
sed -i -e 's/^[ \t]*//;s/[ \t]*$//' 1.k   # strip leading/trailing whitespace first
sed -n '/^[0-9]\{1\}/p' 1.k > 2.k         # keep only lines starting with a digit
mv 2.k 1.k
sed -i -e '/^$/d' 1.k                     # delete empty lines
awk 'BEGIN{min=999}
{a[NR]=$0;if($2<min){min=$2;m[1]=NR;}if($2>max){max=$2;m[2]=NR;}m[3]+=$2;}
END{print "Max:"a[m[2]];
    print "Min:"a[m[1]];
    print "Number Of Calls:" NR;
    print "Avg:"m[3]/NR}' 1.k > result
Thanks for your valuable suggestions, folks.
Perl solution:
<1.k perl -ne '
    next unless ($key, $val) = /^([0-9]+)\s+([-+E.0-9]+)/;  # only process the important lines
    if ($val > $max) {           # new maximum
        $max = $val;
        @maxk = ($key);
    } elsif ($max == $val) {     # the maximum appears more than once
        push @maxk, $key;
    }
    $sum += $val;
    $count++;
} {
    print "MAX: $max at @maxk, AVG: ", $sum / $count, "\n";'
Try the simple-r solution:
perl -walne 'print $F[1] if /^\d/' 1.k | r summary -
Simple-r is an R wrapper for fast statistical analysis on the command line. It can be found at:
https://code.google.com/p/simple-r/

How to split values in a column into separate columns

My tab-delimited file looks like this:
ID Pop snp1 snp2 snp3 snp4 snp5
AD62 1 0/1 1/1 . 1/1 0/.
AD75 1 0/0 1/1 . ./0 1/0
AD89 1 . 1/0 1/1 0/0 1/.
I want to separate the columns (starting from column 3) so that the values separated by the "/" character are each put into a column of their own. However, there are also columns where the value is missing (they contain only the "." character); I want such a value treated as "./.", so that the two "." characters are likewise divided into their own columns. For example:
ID Pop snp1 snp2 snp3 snp4 snp5
AD62 1 0 1 1 1 . . 1 1 0 .
AD75 1 0 0 1 1 . . . 0 1 0
AD89 1 . . 1 0 1 1 0 0 1 .
Thanks
You can use sed:
sed -e 's/ \. /\.\t\. /g' -e 's/\//\t/g' <your_file>
Tried this and it works well; you can tweak it as per your requirement.
Assuming data is in data.txt file.
cat data.txt | sed 1d | tr '/' '\t'| sed 's/\./.\t./g'
This gives the output, but you will need to work around the spaces and tabs that get messed up.
This might work for you (GNU sed):
sed '1s/\t/&&/3g;s/\t\.\t/\t.\t.\t/g;y/\//\t/' file
A fairly robust way, using awk and a few if statements:
awk '{ for (i = 1; i <= NF; i++) if (i >= 3 && i < NF && NR == 1) printf "%s\t\t", $i; else if (i == NF && NR == 1) print $i; else if ($i == "." && NR >= 2) printf ".\t.\t"; else { sub ("/", "\t", $i); if (i == NF) printf "%s\n", $i; else { printf "%s\t", $i; } } }' file.txt
Broken out on multiple lines:
awk '{ for (i = 1; i <= NF; i++)
    if (i >= 3 && i < NF && NR == 1) printf "%s\t\t", $i;
    else if (i == NF && NR == 1) print $i;
    else if ($i == "." && NR >= 2) printf ".\t.\t";
    else {
        sub ("/", "\t", $i);
        if (i == NF) printf "%s\n", $i;
        else {
            printf "%s\t", $i;
        }
    }
}' file.txt
HTH
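A more compact alternative sketch (an addition, assuming strictly tab-delimited input): normalize a bare "." to "./." first, then turn every slash into a tab:

awk 'BEGIN { FS = OFS = "\t" }
{
    for (i = 3; i <= NF; i++) {
        if ($i == ".") $i = "./."   # treat a missing value as ./.
        gsub("/", "\t", $i)         # split the pair into two tab-separated columns
    }
    print
}' file.txt

The header row passes through untouched since it contains neither "." fields nor slashes.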
