Awk using index with Substring - linux

I have a command that cuts a string, and I'd like to understand the details of how string indexing works in Linux "awk".
I have two different cases. In each one I want to extract the word "Test" from the example string:
1. "Test-01-02-03"
2. "01-02-03-Test-Ref1-Ref2"
For the first one I can do:
substr("Test-01-02-03", 1, index("Test-01-02-03", "-") - 1)
-> That returns only "Test".
For the second case I'm not sure how to get "Test" using the index function.
Do you have any ideas on how to do this with awk?
Thanks!

This is how to use index() to find/print a substring:
$ cat file
Test-01-02-03
01-02-03-Test-Ref1-Ref2
$ awk -v tgt="Test" 's=index($0,tgt){print substr($0,s,length(tgt))}' file
Test
Test
but that may not be the best solution for whatever your actual problem is.
For comparison here's how to do the equivalent with match() for an RE:
$ awk -v tgt="Test" 'match($0,tgt){print substr($0,RSTART,RLENGTH)}' file
Test
Test
and if you like the match() synopsis, here's how to write your own function to do it for strings:
awk -v tgt="Test" '
function strmatch(source,target) {
SSTART = index(source,target)
SLENGTH = length(target)
return SSTART
}
strmatch($0,tgt){print substr($0,SSTART,SLENGTH)}
' file

If these lines are the direct input to awk, then the following will work:
echo 'Test-01-02-03' | awk -F- '{print $1}' # First field
echo '01-02-03-Test-Ref1-Ref2' | awk -F- '{print $(NF-2)}' # Third field from the end.
If these lines are pulled out of a larger line in an awk script and need to be split again then the following snippets will do that:
str="Test-01-02-03"; split(str, a, /-/); print a[1]
str="01-02-03-Test-Ref1-Ref2"; numfields=split(str, a, /-/); print a[numfields-2]

Related

How to pass column name of a file as variable in awk

I am trying to print the data in particular columns by passing them into the awk command.
I have tried using "-v" to set the column list as a variable, but awk treats the "$" as a literal string. And my delimiter is the special character ^A (Ctrl+V Ctrl+A).
vi test_file.dat
a^Ab^Ac^Ad^Ae^Af^Ag^Ah^Ai^Aj^Ak^Al^Am^An^Ao^Ap
Working code
awk -F'^A' '{print $2,$5,$7}' test_file.dat
It's Printing
b e g
But if I try
export fields='$2,$5,$7'
export file='test_file.dat'
awk -v sample_file="$file" -v columns="$fields" -F'^A' '{print columns}' sample_file
It's printing
$2 $5 $7
I expect the output as
b e g
And I want to pass the delimiter, columns, file name as a parameter like
export fields='$2,$5,$7'
export file='test_file.dat'
export delimiter='^A'
awk -v sample_file="$file" -v columns="$fields" -v file_delimiter="$delimiter" -F'file_delimiter' '{print columns}' sample_file
In awk, the $ symbol is effectively an operator that takes a field number as its argument. The argument can be any expression, which is why $NF works for denoting the last field: NF is evaluated and its value handed to the $ operator. So, as you can see, we should not include the dollar sign in the field names.
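For example, $ applied to the value of an ordinary variable works as you'd expect:
$ echo 'a b c' | awk '{ n = 2; print $n }'
b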
If you're using the environment to pass material to Awk, the right thing to do is to have Awk pick it up from the environment.
The environment can be accessed using the ENVIRON associative array. If a variable called delimiter holds the field separator, we might do something like
BEGIN { FS = ENVIRON["delimiter"] }
in the Awk code. Then we aren't dealing with yet another round of shell parameter interpolation issues.
We can pick up the field numbers similarly. The split function can be used to get them into an array. Refer to this one-liner:
$ fields=1,3,9 awk 'BEGIN { split(ENVIRON["fields"], f, ",");
                            for (i in f)
                                printf("f[%d] = %d\n", i, f[i]) }'
f[1] = 1
f[2] = 3
f[3] = 9
In GNU Awk, the expression length(f) gives the number of elements in the array f.
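Putting those pieces together, here is a minimal, self-contained sketch of the whole approach. It uses a comma delimiter and inline input for readability; for the question's data you would export the real ^A character and pass test_file.dat instead:
export delimiter=','
export fields='2,5,7'
printf 'a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p\n' |
awk 'BEGIN { FS = ENVIRON["delimiter"]
             n = split(ENVIRON["fields"], f, ",") }
     { for (i = 1; i <= n; i++)
           printf "%s%s", $(f[i]), (i < n ? OFS : ORS) }'
b e g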
In order to get awk to see the special characters while reading the file, you could use cat -v file (there might be a built-in method, although I'm not aware of it). The key to getting the ^A (Control-A) delimiter recognized is then to escape it with a \, because otherwise the regex engines of awk, gawk, etc. treat ^ as a start-of-line anchor.
export fields='$2,$5,$7'
export test_file='test_file.dat'
export delimiter='\\^A'
awk -F "$delimiter" '{ print '"$fields"' }' < <(cat -v "$test_file")
There's also no need to set awk variables for bash variables that have already been set, so you can essentially eliminate all of them.
One thing to note, if you did want to set them in awk, is that columns wouldn't work, because awk variables set from bash variables are usually assigned individually, e.g. -v var1='$2' -v var2='$5' -v var3='$7', so you'd end up with { print var1, var2, var3 } in awk. It's doubtful a single string can be translated into three variables without additional steps.
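A hedged variant of that idea that does work is to pass plain field numbers (no $) and apply the $ operator inside awk. This sketch assumes GNU awk, which expands the \001 escape given to -F into the ^A character:
$ awk -F'\001' -v c1=2 -v c2=5 -v c3=7 '{ print $c1, $c2, $c3 }' test_file.dat
b e g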

awk Print Line Issue

I'm experiencing some issues with an awk command right now. The original script was developed with awk on macOS and was then ported to Linux, where awk shows different behavior.
What I want to do is to count the occurrences of single strings provided via /tmp/test.uniq.txt in the file /tmp/test.txt.
awk '{print $1, system("cat /tmp/test.txt | grep -o -c " $1)}' /tmp/test.uniq.txt
Mac delivers an expected output like:
test1 2
test2 1
The output is on one line: the string and the number of occurrences, separated by whitespace.
Linux delivers an output like:
2
test1 1
test2
The output is not on one line, and the output of the system command is printed first.
Sample input:
test.txt looks like:
test1 test test
test1 test test
test2 test test
test.uniq.txt looks like:
test1
test2
As the comments suggested, using grep and cat etc. via the system() function is not recommended, since awk is a complete language that can perform most of these tasks itself. (The reordering you're seeing is most likely a buffering effect: awk's own print output may be buffered, while the grep launched by system() writes directly to the terminal, and different awk implementations flush their output at different points.)
You can use following awk command to replace your cat | grep functionality:
awk 'FNR == NR {a[$1]=0; next} {for (i=1; i<=NF; i++) if ($i in a) a[$i]++}
END { for (i in a) print i, a[i] }' uniq.txt test.txt
test1 2
test2 1
Note that this output doesn't match the count of 5 your question states, as your sample data is probably different.
References:
Effective AWK Programming
Awk Tutorial
It looks to me as if you're trying to count the number of lines containing each unique string from the uniq file. But the way you're doing it is ... awkward, and as you've demonstrated, inconsistent between versions of awk.
The following might work a little better:
$ awk '
    NR==FNR {
        a[$1]
        next
    }
    {
        for (i in a) {
            if ($0 ~ i) {
                a[i]++
            }
        }
    }
    END {
        for (i in a)
            printf "%6d\t%s\n", a[i], i
    }
' test.uniq.txt test.txt
     2	test1
     1	test2
This loads your uniq file into an array, then for every line in your text file, steps through the array to count the matches.
Note that these are being compared as regular expressions, without word boundaries, so test1 will also be counted as part of test12.
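If that matters for your data, one option (GNU awk-specific, not POSIX, so treat this as a sketch) is to anchor the comparison with word-boundary operators:
if ($0 ~ ("\\<" i "\\>")) a[i]++    # \< and \> match word boundaries in GNU awk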
Another way might be to use grep+sort+uniq:
grep -o -w -F -f uniq.txt test.txt | sort | uniq -c
It's a pipeline, but a short one.
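With the sample files above, that pipeline produces:
$ grep -o -w -F -f uniq.txt test.txt | sort | uniq -c
      2 test1
      1 test2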
From man grep:
-F, --fixed-strings, --fixed-regexp
    Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched.
    (-F is specified by POSIX, --fixed-regexp is an obsoleted alias, please do not use it in new scripts.)
-f FILE, --file=FILE
    Obtain patterns from FILE, one per line. The empty file contains zero patterns and therefore matches nothing.
    (-f is specified by POSIX.)
-o, --only-matching
    Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
-w, --word-regexp
    Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.

Obtaining the field that contains a value or string on Linux shell

Case example:
$ cat data.txt
foo,bar,moo
I can obtain the field data by using cut, assuming , as separator, but only if I know which position it has. Example to obtain value bar (second field):
$ cat data.txt | cut -d "," -f 2
bar
How can I obtain that same bar (or its field number, 2) if I only know that it contains the letter a?
Something like:
$ cat data.txt | reversecut -d "," --string "a"
[the result could be either "2" or "bar"]
In other words: how can I know what is the field containing a substring in a text-delimited file using linux shell commands/tools?
Of course, programming is allowed, but do I really need looping and conditional structures? Isn't there a command that solves this?
If a specific shell matters, I would prefer Bash solutions.
A close solution here, but not exactly the same.
More scenarios based on the same example (upon request):
For a search pattern of m or mo, the result could be either 3 or moo.
For a search pattern of f or fo, the result could be either 1 or foo.
The following simple awk may also help you here.
awk -F, '$2~/a/{print $2}' data.txt
Output will be bar in this case.
Explanation:
-F,: Setting field separator for lines as comma, to identify the fields easily.
$2~/a/: checking condition here if 2nd field is having letter a in it, if yes then printing that 2nd field.
EDIT: Adding a solution as per the OP's comment and the now-edited question.
Let's say following Input_file is there
cat data.txt
foo,bar,moo
mo,too,far
foo,test,test1
fo,test2,test3
Then following is the code for same:
awk -F, '{for(i=1;i<=NF;i++){if($i ~ /fo/){print $i}}}' data.txt
foo
foo
fo
OR
awk -F, '{for(i=1;i<=NF;i++){if($i ~ /mo/){print $i}}}' data.txt
moo
mo
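Since the question also asks for the field number, a small variation that prints the index i alongside the value could be used; with the original one-line data.txt this prints "2 bar":
awk -F, '{for(i=1;i<=NF;i++){if($i ~ /a/){print i, $i}}}' data.txt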

Replace last character in specific column with value 0

How to replace the last character in column 2 with value 0
input
1232;1001;1
2231;2007;1
2234;2009;2
2003;1114;1
output desired
1232;1000;1
2231;2000;1
2234;2000;2
2003;1110;1
Modifying Input with gensub()
You can use any number of GNU awk string functions to do this, but the gensub() function is particularly useful. It has the signature:
gensub(regexp, replacement, how [, target])
which makes it extremely flexible for these sorts of transformations.
Converting Your Example
# Store your input in a shell variable for MCVE convenience, although
# you can have this data in a file or pass it on standard input if you
# prefer.
example_input='1232;1001;1
2231;2007;1
2234;2009;2
2003;1114;1'
# Use awk's gensub() string function.
echo "$example_input" | awk '{print gensub(/.;/, "0;", 2, $1)}'
This results in the following output:
1232;1000;1
2231;2000;1
2234;2000;2
2003;1110;1
awk approach:
awk -F';' '{ sub(/.$/,0,$2) }1' OFS=';' file
The output:
1232;1000;1
2231;2000;1
2234;2000;2
2003;1110;1
Or the same with the substr() function (note that awk string indexing starts at 1):
awk -F';' '{ $2=substr($2,1,length($2)-1)"0" }1' OFS=';' file
not necessarily better, but a mathematical approach for numerical data...
$ awk 'BEGIN{FS=OFS=";"} {$2=int($2/10)*10}1' file
This rounds the last digit (ones) down to zero, e.g. int(1001/10)*10 = 1000; to zero out two digits (ones and tens), replace 10 with 100.
Or, simple replacement is easier with GNU sed
$ sed 's/.;/0;/2' file
I would do that with sed:
sed -e 's/^\([^;]*;[^;]*\).;/\10;/' filename

Grep / search for a string in a csv, then parse that line and do another grep

The title is probably not very well worded, but I currently need to script a search that finds a given string in a CSV, then parses the line that's found and do another grep with an element within that line.
Example:
KEY1,TRACKINGKEY1,TRACKINGNUMBER1-1,PACKAGENUM1-1
     ,TRACKINGKEY1,TRACKINGNUMBER1-2,PACKAGENUM1-2
     ,TRACKINGKEY1,TRACKINGNUMBER1-3,PACKAGENUM1-3
     ,TRACKINGKEY1,TRACKINGNUMBER1-4,PACKAGENUM1-4
     ,TRACKINGKEY1,TRACKINGNUMBER1-5,PACKAGENUM1-5
KEY2,TRACKINGKEY2,TRACKINGNUMBER2-1,PACKAGENUM2-1
KEY3,TRACKINGKEY3,TRACKINGNUMBER3-1,PACKAGENUM3-1
     ,TRACKINGKEY3,TRACKINGNUMBER3-2,PACKAGENUM3-2
What I need to do is grep the .csv file for a given key [key1 in this example] and then grab TRACKINGKEY1 so that I can grep the remaining lines. Our shipping software doesn't output the packingslip key on every line, which is why I have to first search by KEY and then by TRACKINGKEY in order to get all of the tracking numbers.
So using KEY1 initially I eventually want to output myself a nice little string like "TRACKINGNUMBER1-1;TRACKINGNUMBER1-2;TRACKINGNUMBER1-3;TRACKINGNUMBER1-4;TRACKINGNUMBER1-5"
$ awk -v key=KEY1 -F, '$1==key{f=1} ($1!~/^ *$/)&&($1!=key){f=0} f{print $3}' file
TRACKINGNUMBER1-1
TRACKINGNUMBER1-2
TRACKINGNUMBER1-3
TRACKINGNUMBER1-4
TRACKINGNUMBER1-5
glennjackman helpfully points out that by using a "smarter" value for FS the internal logic can be simpler.
awk -v key=KEY1 -F' *,' '$1==key{f=1} $1 && $1!=key{f=0} f{print $3}' file
-v key=KEY1 assign the value KEY1 to the awk variable key
-F' *,' assign the value ' *,' (a regular expression: any number of spaces followed by a comma) to the awk FS variable (controls field splitting)
$1==key{f=1} if the first field of the line is equal to the value of the key variable (KEY1) then assign the value 1 to the variable f (find our first desired key line)
$1 && $1!=key{f=0} if the first field has a truth-y value (in awk a non-zero, non-empty string) and the value of the first field is not equal to the value of the key variable assign the value 0 to the variable f (find the end of our keyless lines)
f{print $3} if the variable f has a truth-y value (remember non-zero, non-empty string) then print the third field of the line
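None of this yet produces the single semicolon-joined string the question asked for; a small extension of the same script, collecting the fields instead of printing them, might look like:
$ awk -v key=KEY1 -F' *,' '$1==key{f=1} $1 && $1!=key{f=0} f{s = s (s ? ";" : "") $3} END{print s}' file
TRACKINGNUMBER1-1;TRACKINGNUMBER1-2;TRACKINGNUMBER1-3;TRACKINGNUMBER1-4;TRACKINGNUMBER1-5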
awk '/KEY1/ {print $3}' FS=, file
Result
TRACKINGNUMBER1-1
TRACKINGNUMBER1-2
TRACKINGNUMBER1-3
TRACKINGNUMBER1-4
TRACKINGNUMBER1-5
$ sed -nr '/^KEY1/, /^KEY/ { /^(KEY1| )/!d; s/.*(TRACKINGNUMBER[^,]+).*/\1/ p}' input
TRACKINGNUMBER1-1
TRACKINGNUMBER1-2
TRACKINGNUMBER1-3
TRACKINGNUMBER1-4
TRACKINGNUMBER1-5
One more awk
awk -F, '/KEY1/,/KEY/{print $3}' file
or, given the sample data (GNU awk for the third argument to match()):
awk 'match($0,/([^,]+NUMBER1[^,]+)/,a){print a[0]}' file
or even
awk -F, '$3~/NUMBER1/&&$0=$3' file
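For reference, that last one works because $0=$3 is an assignment whose value (the third field) is non-empty exactly on matching lines, so the default print action fires with $0 already reassigned; with the sample data it prints the five TRACKINGNUMBER1-* values.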
