I have this code in a test.awk file:
{FS=","} {gsub(/\/[0-9]{1,2}\/[0-9]{4}/,"");}1
{FS=","} {gsub(/:[0-9][0-9]/,"");}1
The code makes transformations in a dataset from a dataset.csv file.
I want the following command at the shell to return a newdataset.csv with all the modifications:
gawk -f test.awk dataset.csv
Put both commands in the same block.
BEGIN {FS=","}
{
  gsub(/\/[0-9]{1,2}\/[0-9]{4}/,"");
  gsub(/:[0-9][0-9]/,"");
}1
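To actually get a newdataset.csv as you asked, redirect the output:
gawk -f test.awk dataset.csv > newdataset.csv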
You could also do them in the same regexp with alternation, since the replacement is the same.
And since you never do anything that operates on individual fields, there's no need to set the field separator.
{gsub(/:[0-9][0-9]|\/[0-9]{1,2}\/[0-9]{4}/, "")}1
@bramaawk : what this -
echo 'abc:12/31/2046def' |
awk '{gsub(/:[0-9][0-9]|\/[0-9]{1,2}\/[0-9]{4}/, "")}1'
.. looks like to any awk is -
abcdef
# gawk profile, created Sun Jun 5 10:59:06 2022

# Rule(s)

     1  {
     1          gsub(/:[0-9][0-9]|\/[0-9]{1,2}\/[0-9]{4}/,"",$0)
        }

     1  1 { # 1
     1          print
        }
what I'm suggesting is to streamline those into 1 single block:
awk 'gsub(":[0-9]{2}|[/][0-9]{1,2}[/][0-9]{4}",_)^_'
(_ is an unset variable: as gsub()'s replacement it acts as the empty string, and as the exponent it is 0, so gsub(...)^_ evaluates to 1 for every line, making the whole expression a pattern that always triggers the default print)
so that awk only needs to see:
# gawk profile, created Sun Jun 5 10:59:34 2022

# Rule(s)

     1  gsub(":[0-9]{2}|[/][0-9]{1,2}[/][0-9]{4}",_,$0)^_ { # 1
     1          print
        }
instead of 2 boolean evaluations (or the poorly-termed "patterns") and 2 action blocks, awk now has just 1 of each.
To make your solution generic for gawk+mawk+nawk, just do
{m,n,g}awk NF++ FS=':[0-9][0-9]|[/][0-9][0-9]?[/][0-9][0-9][0-9][0-9]' OFS=
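For instance, as a quick sanity check (reusing the sample line from above; any of the three awks should behave the same):
echo 'abc:12/31/2046def' |
gawk NF++ FS=':[0-9][0-9]|[/][0-9][0-9]?[/][0-9][0-9][0-9][0-9]' OFS=
abcdef
NF++ is both the pattern (the old NF is non-zero for any non-empty line, so the line prints) and the trigger that rebuilds $0 with the empty OFS, so the matched date/time separators vanish.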
I have a directory with files that looks like this:
CCG02-215-WGS.format.flt.txt
CCG05-707-WGS.format.flt.txt
CCG06-203-WGS.format.flt.txt
CCG04-967-WGS.format.flt.txt
CCG05-710-WGS.format.flt.txt
CCG06-215-WGS.format.flt.txt
Contents of each file look like this:
1 9061390 14 93246140
1 58631131 2 31823410
1 108952511 3 110694548
1 168056494 19 23850376
etc...
Ideal output would be a file, let's call it all-samples.format.flt.txt, that contains the concatenation of all files, plus an additional column that displays which sample/file each row came from (with some minor formatting to remove the .format.flt.txt):
1 9061390 14 93246140 CCG02-215-WGS
...
1 58631131 2 31823410 CCG05-707-WGS
...
1 108952511 3 110694548 CCG06-203-WGS
...
1 168056494 19 23850376 CCG04-967-WGS
Currently, I have the following code which works for individual files.
awk 'BEGIN{OFS="\t"; split(ARGV[1],f,".")}{print $1,$2,$3,$4,f[1]}' CCG05-707-WGS.format.flt.txt
#OUTPUT
1 58631131 2 31823410 CCG05-707-WGS
...
However, when I try to apply it to all files using a glob, it adds the first filename it finds to every row as the last column.
awk 'BEGIN{OFS="\t"; split(ARGV[1],f,".")}{print $1,$2,$3,$4,f[1]}' *
#OUTPUT, the last column should be as seen in the previous code block
1 9061390 14 93246140 CCG02-215-WGS
...
1 58631131 2 31823410 CCG02-215-WGS
...
1 108952511 3 110694548 CCG02-215-WGS
...
1 168056494 19 23850376 CCG02-215-WGS
I feel like the solution may just lie in adding an additional parameter to awk... but I'm not sure where to start.
Thanks!
UPDATE
Using the OOTB awk variable FILENAME solved the issue, plus some elegant formatting logic for the file names.
Thanks @RavinderSingh13!
awk 'BEGIN{OFS="\t"} FNR==1{file=FILENAME;sub(/\..*/,"",file)} {print $0,file}' *.txt
With your shown samples, please try the following awk code. We need to use awk's built-in FILENAME variable here. Whenever the first line of any txt file (all txt files passed to this program) is being read, strip everything from the first . to the end of the value; then, in the main program, print the current line followed by file (the file's name, as per the requirement).
awk '
BEGIN { OFS="\t" }
FNR==1{
  file=FILENAME
  sub(/\..*/,"",file)
}
{
  print $0,file
}
' *.txt
Or, in one-liner form, try the following awk code:
awk 'BEGIN{OFS="\t"} FNR==1{file=FILENAME;sub(/\..*/,"",file)} {print $0,file}' *.txt
You may use:
Any version of awk:
awk -v OFS='\t' 'FNR==1{split(FILENAME, a, /\./)} {print $0, a[1]}' *.txt
Or in gnu-awk:
awk -v OFS='\t' 'BEGINFILE{split(FILENAME, a, /\./)} {print $0, a[1]}' *.txt
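A practical difference between the two (a small illustrative check; empty.txt is a hypothetical empty file created just for the demo): gawk's BEGINFILE runs even for files that contain no records, whereas FNR==1 never fires for an empty file:
printf '' > empty.txt
gawk 'BEGINFILE{print "saw", FILENAME} FNR==1{print "first record of", FILENAME}' empty.txt
saw empty.txt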
I want to run a Perl script multiple times from the command line over a folder containing .coordinates.txt files, doing multiple "actions", and as the last step I want to sort based on the first-line value.
I wrote this:
for i in ./*gb.coordinates.txt; do perl myscript $i |
awk 'NR==1 {print $2,"\t***here"; next } 1'|sed '2d'| #the output has an empty line in the second row, so I remove it, and I add "\t***here" to get an idea of the first-line value after my final sorting
if [[awk 'FNR == 1 && $1>0']] then {sort -k1nr} else {sort -k1n} fi
> $i.allvalues.txt;
done
Up to this point:
for i in ./*gb.coordinates.txt; do perl myscript $i | awk 'NR==1 {print $2,"\t***here"; next } 1'|sed '2d' > $i.allvalues.txt; done
Everything works properly.
So, as I wrote above, the final step I want is a sort like this:
if the first line of my output >= 0 then sort -k1nr else sort -k1n
The output before the if condition is:
XXXX (either a positive or a negative number) \t***here
32
4455
-2333
23
-123
And I want my output to be like:
if xxxx is positive:
xxxx (going in the correct order) \t***here
4455
32
23
-123
-2333
if xxxx is negative:
xxxx (going in the correct order) \t***here
-2333
-123
23
32
4455
So my problem is that I don't know how to connect the if statement with sort.
There's no need to use awk. Pipe the output of the perl script to a shell block that reads the first line, tests whether it's positive or negative, and then calls the appropriate sort.
for i in ./*gb.coordinates.txt; do
    perl myscript "$i" | {
        read -r _ firstline __
        if (( firstline > 0 ))
        then sort -k1nr
        else sort -k1n
        fi
    } > "$i.allvalues.txt"
done
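Note that read consumes the perl script's first line, so that line itself won't appear in the sorted output. If you also want the "\t***here" marker from your original pipeline, a variation like this should work (a sketch, assuming the value is the second whitespace-separated field of the first line, as your awk 'NR==1{print $2 ...}' implies; grep . stands in for your sed '2d' and drops blank lines before sorting):
for i in ./*gb.coordinates.txt; do
    perl myscript "$i" | {
        read -r _ firstline __
        # re-emit the first-line value with the marker, drop blank lines,
        # then sort everything together so the marker lands in order
        { printf '%s \t***here\n' "$firstline"; grep .; } |
        if (( firstline > 0 ))
        then sort -k1nr
        else sort -k1n
        fi
    } > "$i.allvalues.txt"
done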
I have a sample file like
XYZAcc
ABCAccounting
Accounting firm
Accounting Aco
Accounting Acompany
Acoustical consultant
Here I need to find the most frequently occurring sequences of 3 letters within a word.
Output should be
acc = 5
aco = 3
Is that possible in Bash?
I have absolutely no idea how I can accomplish it with awk, sed, or grep.
Any clue how it's possible...
PS: there is no attempted output because I have no idea how to do it; I don't want to write some unnecessary awk -F xyz abc... that's not going to help anywhere...
Here's how to get started with what I THINK you're trying to do:
$ cat tst.awk
BEGIN { stringLgth = 3 }
{
    for (fldNr=1; fldNr<=NF; fldNr++) {
        field = $fldNr
        fieldLgth = length(field)
        if ( fieldLgth >= stringLgth ) {
            maxBegPos = fieldLgth - (stringLgth - 1)
            for (begPos=1; begPos<=maxBegPos; begPos++) {
                string = tolower(substr(field,begPos,stringLgth))
                cnt[string]++
            }
        }
    }
}
END {
    for (string in cnt) {
        print string, cnt[string]
    }
}
$ awk -f tst.awk file | sort -k2,2nr
acc 5
cou 5
cco 4
ing 4
nti 4
oun 4
tin 4
unt 4
aco 3
abc 1
ant 1
any 1
bca 1
cac 1
cal 1
com 1
con 1
fir 1
ica 1
irm 1
lta 1
mpa 1
nsu 1
omp 1
ons 1
ous 1
pan 1
sti 1
sul 1
tan 1
tic 1
ult 1
ust 1
xyz 1
yza 1
zac 1
This is an alternative method to Ed Morton's solution. It does less looping but needs a bit more memory. The idea is to not care about spaces or any non-alphabetic characters, and filter them out at the end.
awk -v n=3 '{ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
END {for(s in a) if (s !~ /[^a-z]/) print s,a[s] }' file
With GNU awk you can do this a bit differently and a bit more optimized, by making every word its own record. This way the selection at the end is not needed:
awk -v n=3 -v RS='[[:space:]]' '
(length>=n){ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
END {for(s in a) print s,a[s] }' file
This might work for you (GNU sed, sort and uniq):
sed -E 's/.(..)/\L&\n\1/;/^\S{3}/P;D' file |
sort |
uniq -c |
sort -s -k1,1rn |
sed -En 's/^\s*(\S+)\s*(\S+)/\2 = \1/;H;$!b;x;s/\n/ /g;s/.//p'
Use the first sed invocation to output 3 letter lower case words.
Sort the words.
Count the duplicates.
Sort the counts in reverse numerical order maintaining the alphabetical order.
Use the second sed invocation to manipulate the results into the desired format.
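For example, the first invocation turns XYZAcc into its sliding 3-letter windows (a quick check):
echo 'XYZAcc' | sed -E 's/.(..)/\L&\n\1/;/^\S{3}/P;D'
xyz
yza
zac
acc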
If you only want lines with duplicates, in alphabetical order and case-sensitive, use:
sed -E 's/.(..)/&\n\1/;/^\S{3}/P;D' file |
sort |
uniq -cd |
sed -En 's/^\s*(\S+)\s*(\S+)/\2 = \1/;H;$!b;x;s/\n/ /g;s/.//p'
How to cut a specific field from a line?
The problem is I can't use cut -d ' ' -f 1,2,3,4,5,9,10,11,12,13,14, since the field changes.
Let's say I have a file called /var/log/test, and one of the lines inside the file looks like this :
Apr 12 07:48:11 172.89.92.41 %ASA-5-713120: Group = People, Username = james.robert, IP = 219.89.259.32, PHASE 2 COMPLETED (msgid=9a4ce822)
I only need to get the Username and Time/Date (please note the columns keep changing; that's why I need to match Username = james.robert and Apr 12 07:48:11).
When I use:
grep "james" /var/log/test | cut -d ' ' -f 1,2,3,4,5,9,10,11,12,13,14
it doesn't work for me. So it has to match the username and print only the username and date/time. Any suggestions?
OK, so when I use this:
awk -F'[ ,]' '$12~/username/{print $1,$2,$3,$12}' /var/log/test
it works for some users but not others, because the fields keep moving.
The sample output of this command is :
Apr 12 06:00:39 james.robert
But when I try this command on another username, it doesn't work. Here is another example where the above command shows nothing:
Apr 8 12:16:13 172.24.32.1 %ASA-6-713228: Group = people, Username = marry.tarin, IP = 209.157.190.11, Assigned private IP address 192.168.237.38 to remote user
If your file is structured consistently:
awk -F'[ ,]' '{print $1,$2,$3,$12}' file
Apr 12 07:48:11 james.robert
If you need to match the username, using your sample input:
$ awk -F'[ ,]' '$12~/james/{print $1,$2,$3,$12}' file
Apr 12 07:48:11 james.robert
UPDATE
OK, your spaces are not consistent; to fix that, change the -F:
$ awk -F' +|,' '{print $1,$2,$3,$12}' file
Apr 12 07:48:11 james.robert
Apr 8 12:16:13 marry.tarin
You can add the /pattern/ to restrict the match to users as above. Note the change in the -F option.
-F' +|,' sets the field separator to spaces (one or more) or a comma;
the rest is counting the fields and picking the right one to print.
/pattern/ will filter the lines that match the regex pattern, which can be constrained to a certain field only (e.g. 12) by $12~/pattern/
If your text may contain mixed case and you want to be case-insensitive, use the tolower() function, for example:
$ awk -F' +|,' 'tolower($12)~/patterninlowercase/{print $1,$2,$3,$12}' file
With sed:
sed -r 's/^([A-Za-z]{3} [0-9]{1,2} [0-9]{2}:[0-9]{2}:[0-9]{2}).*(Username = [^,]*).*/\1 \2/g' file
You could use awk to delimit by comma and then use substr() and length() to get at the pieces you care about:
awk -F"," '{print substr($1,1,15), substr($2, 13, length($2)-12)}' /var/log/test
With gawk
awk '{u=gensub(/.*(Username = [^,]*).*/,"\\1","g",$0);if ( u ~ "james") {print u,$1,$2,$3}}' file
The following perl will print the date and username delimited by a tab. Add additional valid username characters to [\w.]:
perl -ne '
print $+{date}, "\t", $+{user}, "\n" if
/^(?<date>([^\s]+\s+){2}[^\s]+).*\bUsername\s*=\s*(?<user>[\w.]+)/
'
Varying amounts of tabs and spaces are allowed.
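For example, run against the sample log line above, it should print something like (date, a tab, then the username):
perl -ne 'print $+{date}, "\t", $+{user}, "\n" if /^(?<date>([^\s]+\s+){2}[^\s]+).*\bUsername\s*=\s*(?<user>[\w.]+)/' /var/log/test
Apr 12 07:48:11	james.robert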
I'm trying to write a script in which I pass a shell variable into an awk command. But when I run it, nothing happens. I tried running that line alone in the shell and found that no variable expansion happened like I expected. Here's the code:
#!/bin/bash

# Created By Rafael Adel

# This script is to start dwm with customizations needed


while true;do
    datestr=`date +"%r %d/%m/%Y"`
    batterystr=`acpi | grep -oP "([a-zA-Z]*), ([0-9]*)%"`
    batterystate=`echo $batterystr | grep -oP "[a-zA-Z]*"`
    batterypercent=`echo $batterystr | grep -oP "[0-9]*"`

    for nic in `ls /sys/class/net`
    do
        if [ -e "/sys/class/net/${nic}/operstate" ]
        then
            NicUp=`cat /sys/class/net/${nic}/operstate`
            if [ "$NicUp" == "up" ]
            then
                netstr=`ifstat | awk -v interface=${nic} '$1 ~ /interface/ {printf("D: %2.1fKiB, U: %2.1fKiB",$6/1000, $8/1000)}'`
                break
            fi
        fi
    done

    finalstr="$netstr | $batterystr | $datestr"

    xsetroot -name "$finalstr"
    sleep 1
done &

xbindkeys -f /etc/xbindkeysrc

numlockx on

exec dwm
This line:
netstr=`ifstat | awk -v interface=${nic} '$1 ~ /interface/ {printf("D: %2.1fKiB, U: %2.1fKiB",$6/1000, $8/1000)}'`
is what causes the netstr variable to not get assigned at all. That's because interface is not replaced with ${nic} inside /interface/, I guess.
So could you tell me what's wrong here? Thanks.
If you want to /grep/ with your variable, you have 2 choices:
interface=eth0
awk "/$interface/{print}"
or
awk -v interface=eth0 '$0 ~ interface{print}'
See http://www.gnu.org/software/gawk/manual/gawk.html#Using-Shell-Variables
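Applied to the line from your script, the second form would look something like this (a sketch that keeps your printf and column numbers as-is, just switching the regex constant /interface/ to a dynamic match on the variable):
netstr=`ifstat | awk -v interface="${nic}" '$1 ~ interface {printf("D: %2.1fKiB, U: %2.1fKiB",$6/1000, $8/1000)}'`
If the interface name should match exactly rather than as a regexp, $1 == interface is stricter and safer.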
It's like I thought: awk substitutes variables properly, but between //, inside a regex constant, an awk variable cannot be used for substitution.
I had no issue grepping with a variable inside an awk program (for simple regexp cases):
sawk1='repo\s+module2'
sawk2='#project2\s+=\s+module2$'
awk "/${sawk1}/,/${sawk2}/"'{print}' aFile
(Here the /xxx/,/yyy/ displays everything between xxx and yyy)
(Note the double-quoted "/${sawk1}/,/${sawk2}/", followed by the single-quoted '{print}')
This works just fine, and comes from "awk: Using Shell Variables in Programs":
A common method is to use shell quoting to substitute the variable’s value into the program inside the script.
For example, consider the following program:
printf "Enter search pattern: "
read pattern
awk "/$pattern/ "'{ nmatches++ }
END { print nmatches, "found" }' /path/to/data
The awk program consists of two pieces of quoted text that are concatenated together to form the program.
The first part is double-quoted, which allows substitution of the pattern shell variable inside the quotes.
The second part is single-quoted.
It does add the caveat though:
Variable substitution via quoting works, but can potentially be messy.
It requires a good understanding of the shell’s quoting rules (see Quoting), and it’s often difficult to correctly match up the quotes when reading the program.