Issues with the awk command - Linux

Does Awk have a limit to the amount of data it can process?
for i in "052" "064" "060" "070" "074" "076" "178"
do
awk -v f="${i}" -F, 'match ($1,f) { print $2","$3 }' uls.csv > ul$i.csv
awk -v f="${i}" -F, 'match ($1,f) { print $2","$3 }' dls.csv > dl$i.csv
awk -v n="${i}" -F, 'match ($1,n) { print $2","$3 }' dlsur.csv >> dlu$i.csv
awk -v k="${i}" -F, 'match ($1,k) { print $2","$3 }' dailyd.csv >> dla$i.csv
awk -v m="${i}" -F, 'match ($1,m) { print $2","$3 }' dailyu.csv >> ula$i.csv
done
When I run that piece of code, it pulls data from CSV files and creates new files.
That piece of code works perfectly.
But when I add an extra value to the for loop, for example "180", it will create that file, but the file will also include a few lines of data from the other files. I went over the code many times. I even checked the raw data before it goes into this loop, and it is all correct. This seems like a glitch in awk.
Do I need to apply a wait function so that it can catch up?

Also something like
for file in uls dls dlsur dailyd dailyu; do
awk -F, -v OFS=, -v file="$file" '$1 ~ /052|064|060|070|074|076|178/ {print $2, $3 >> (file $1 ".csv")}' "$file.csv"
done
is probably better, if it does what you want: far fewer invocations of awk and far fewer passes through your files. (It produces slightly different output file names. That would be fixable, but it would complicate the script more than seemed necessary for the purpose.)

No. What you think is happening cannot be happening - awk WILL NOT randomly pull data from unspecified files and put it into its output stream.
Note that in your 3rd and subsequent lines you are using '>>' instead of '>' for your output redirection - have you accounted for that?
If you update your question (i.e. do NOT try to do it in a comment!) to tell us what you're trying to do with some representative sample input and expected output (just 2 input files, not 5, should be enough to explain your problem), we can help you write a correct script to do that.
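That '>>' is the most likely culprit: append mode keeps whatever an earlier run left in the dlu, dla, and ula files, so stale lines show up as if from nowhere. As a minimal sketch, here is the same loop with truncating redirection throughout (assuming each run should start with empty output files):
for i in "052" "064" "060" "070" "074" "076" "178" "180"
do
# '>' truncates each output file on every run; the original '>>' on the
# last three appended to whatever a previous run had left behind
awk -v f="$i" -F, 'match($1,f) { print $2","$3 }' uls.csv > "ul$i.csv"
awk -v f="$i" -F, 'match($1,f) { print $2","$3 }' dls.csv > "dl$i.csv"
awk -v f="$i" -F, 'match($1,f) { print $2","$3 }' dlsur.csv > "dlu$i.csv"
awk -v f="$i" -F, 'match($1,f) { print $2","$3 }' dailyd.csv > "dla$i.csv"
awk -v f="$i" -F, 'match($1,f) { print $2","$3 }' dailyu.csv > "ula$i.csv"
done
Note also that match($1,f) is an unanchored regex match, so f="180" would also match fields like "0180" or "1800"; if exact IDs are intended, a comparison like $1 == f is safer.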


Split and write files with awk - Bash

INPUT_FILE.txt in c:\Pro\usr\folder1
ABCDEFGH123456
ABCDEFGH123456
ABCDEFGH123456
BBCDEFGH123456
BBCDEFGH123456
I used the below awk command in a .sh script, which runs from c:\Pro\usr\folder2, to split the file into multiple txt files with a _kg suffix based on the first 8 characters.
awk '{ F=substr($0,1,8) "_kg" ".txt"; print $0 >> F; close(F) }' "c:\Pro\usr\folder1\input_file.txt"
This is working well, but the files are written to the location the script runs from. How can I route the created files to another location, like c:\Pro\usr\folder3?
Thanks
The following awk code may help; it was written and tested with the shown samples in GNU awk.
awk -v outPath='c:\\Pro\\usr\\folder3' -v FPAT='^.{8}' '{outFile=($1"_kg.txt");outFile=outPath"\\"outFile;print > (outFile);close(outFile)}' Input_file
Explanation: Create an awk variable named outPath that holds the path mentioned by the OP. Then set FPAT (which defines fields by a regex describing their contents) so that the first field is the first 8 characters of each line. In the main program, build the outFile variable (1st field followed by _kg.txt), prepend outPath to it, print the whole line to that output file, and close the file after each write to avoid a "too many open files" error.
Pass the destination folder as a variable to awk:
awk -v dest='c:\\Pro\\usr\\folder3\\' '{F=dest substr($0,1,8) "_kg" ".txt"; print $0 >> F; close(F) }' "c:\Pro\usr\folder1\input_file.txt"
I think the doubled backslashes are required.
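For the sample input shown, either command should produce two files in folder3. A sketch of the expected result (forward slashes shown as in Git Bash/Cygwin; the listing is illustrative, not verified on a Windows box):
$ ls c:/Pro/usr/folder3
ABCDEFGH_kg.txt  BBCDEFGH_kg.txt
$ cat c:/Pro/usr/folder3/ABCDEFGH_kg.txt
ABCDEFGH123456
ABCDEFGH123456
ABCDEFGH123456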

Awk to read file as a whole

Consider a file with the following content:
abcdefghijklmn
pqrstuvwxyzabc
defghijklmnopq
In general, when an operation is performed using awk, it iterates line by line and performs that action on each line.
For example:
awk '{print substr($0,8,10)}' file
O/P:
hijklmn
wxyzabc
klmnopq
I would like to know an approach in which all the content of the file is treated as a single record, so that awk prints just one output.
Example Desired O/P:
hijklmnpqr
It's not that I wish for the desired output of just this example; in general, I would appreciate it if anyone could suggest an approach that provides the content of a file as a whole to awk.
This is a gawk solution
From the docs:
There are times when you might want to treat an entire data file as a single record.
The only way to make this happen is to give RS a value that you know doesn’t occur in the input file.
This is hard to do in a general way, such that a program always works for arbitrary input files.
$ cat file
abcdefghijklmn
pqrstuvwxyzabc
defghijklmnopq
RS must be set to a pattern not present in the file, following Denis Shirokov's suggestion in the docs (thanks @EdMorton):
$ gawk '{print ">>>"$0"<<<<"}' RS='^$' file
>>>abcdefghijklmn
pqrstuvwxyzabc
defghijklmnopq
<<<<
The trick is in this part of the quoted docs:
It works by setting RS to ^$, a regular expression that will never
match if the file has contents. gawk reads data from the file into
tmp, attempting to match RS. The match fails after each read, but fails quickly, such that gawk fills tmp with the entire contents of the file
So:
$ gawk '{gsub(/\n/,"");print substr($0,8,10)}' RS='^$' file
Returns:
hijklmnpqr
With GNU awk for multi-char RS (best approach):
$ awk -v RS='^$' '{print substr($0,8,10)}' file
hijklmn
pq
With other awks if your input can't contain NUL characters:
$ awk -v RS='\0' '{print substr($0,8,10)}' file
hijklmn
pq
With other awks otherwise:
$ awk '{rec = rec $0 ORS} END{print substr(rec,8,10)}' file
hijklmn
pq
Note that none of those produce the output you say you wanted:
hijklmnpqr
because they do what you say you wanted (a newline is just another character in your input file, nothing special):
"read file as a whole"
To get the output you say you want would require removing all newlines from the file first. You can do that with gsub(/\n/,"") or various other methods such as:
$ awk '{rec = rec $0} END{print substr(rec,8,10)}' file
hijklmnpqr
if that's really what you want.

How can I get past file length limit?

I am trying to parse 50+ files in a shell script in a single call like the following,
for i in {0..49}
do
_file_list="$_file_list $_srcdir01/${_date_a[$i]}.gz"
done
eval zcat "$_file_list" | awk '{sum += 1} END {print sum;}'
But when I do this, I get the 'file name too long' error with zcat.
The reason I am trying to do this in a single call is that, to my knowledge, awk cannot retain information across separate invocations, and I have to go through the entire list as a whole (e.g. to find a unique word across all of it).
I also don't want to combine the files beforehand, because each of them is already large.
Is there a clever way to solve this or Do I need to split the call and write out the intermediate results along the way?
You can pipe directly from a loop:
for date in "${_date_a[@]}"
do
zcat "$_srcdir01/$date.gz"
done | awk '{sum += 1} END {print sum;}'
In any case, that code shouldn't give that error as posted.
Since your example is not complete or self-contained, I added some code to initialize datafiles to test:
$ cat testscript
_srcdir01="./././././././././././././././././././"
_date_a=(foo{0001..0050})
for file in "${_date_a[@]}"
do
echo "hello world" | gzip > "$file.gz"
done
for i in {0..49}
do
_file_list="$_file_list $_srcdir01/${_date_a[$i]}.gz"
done
eval zcat "$_file_list" | awk '{sum += 1} END {print sum;}'
Running it generates a bunch of test data and correctly sums the number of lines:
$ bash testscript
50
I can reproduce your issue if I e.g. remove the eval:
$ bash testscript
(...)/foo0045.gz ./././././././././././././././././././/foo0046.gz ././././././.
/././././././././././././/foo0047.gz ./././././././././././././././././././/foo0
048.gz ./././././././././././././././././././/foo0049.gz ./././././././././././.
/./././././././/foo0050.gz: file name too long
So please double check that the code you post is the code you run, and not one of several other attempts you made while trying to solve it.
$ awk '{sum += 1} END {print sum}' files...
will work, but perhaps you just need to use wc -l
Manually building the file list is unnecessary,
$ zcat path/to/files{1..49} | awk ...
will work as well.
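If the argument list itself ever gets too long (many more files or longer paths), find can hand the files to zcat in batches without building any list in a shell variable. A sketch, assuming all the .gz files of interest live under $_srcdir01:
$ find "$_srcdir01" -name '*.gz' -exec zcat {} + | awk '{sum += 1} END {print sum}'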

awk output to variable [duplicate]

This question already has answers here:
How do I set a variable to the output of a command in Bash?
(15 answers)
Closed 6 years ago.
[Dd])
echo"What is the record ID?"
read rID
numA= awk -f "%" '{print $1'}< practice.txt
I cannot figure out how to set numA to the output of the awk command so that I can compare rID and numA. numA should be the first field of a text file whose fields are separated by %. Any suggestions?
You can capture the output of any command in a variable via command substitution:
numA=$(awk -F '%' '{print $1}' < practice.txt)
Unless your file contains only one line, however, the awk command you presented (as corrected above) is unlikely to be what you want to use. If the practice.txt file contains, say, answers to multiple questions, one per line, then you probably want to structure the script altogether differently.
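If practice.txt does hold one record per line, a minimal sketch of that restructuring (assuming %-separated fields and that you want the record whose first field matches rID):
# split each line on '%'; the first field lands in numA, the remainder in rest
while IFS='%' read -r numA rest
do
if [ "$numA" = "$rID" ]; then
echo "Found record: $numA"
break
fi
done < practice.txt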
You don't need to use awk, just use parameter expansion:
numA=${rID%%\%*}
This is the correct syntax:
numA=$(awk -F'%' '{print $1}' practice.txt)
However, it will be easier to do the comparison in awk by passing the bash variable in:
awk -F'%' -v r="$rID" '$1==r{... do something ...}' practice.txt
Since you didn't specify any details, it's difficult to suggest more...
To remove the line matching rID from the file, do this:
awk -F'%' -v r="$rID" '$1!=r' practice.txt > output
This will print the lines where the condition is met ($1 not equal to rID), which is equivalent to deleting the ones that are equal. You can mimic in-place replacement with
awk ... practice.txt > temp && mv temp practice.txt
where you fill in ... from the line above.
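Spelled out with the command from the line above, that becomes:
awk -F'%' -v r="$rID" '$1!=r' practice.txt > temp && mv temp practice.txt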
Try using
$ numA=`awk -F'%' '{ if($1 != $0) { print $1; exit; }}' practice.txt`
From the question, "numA is equal to the first field of a txt file which is separated by %"
-F'%', meaning % is the only separator we care about
if($1 != $0), meaning ignore lines that don't have the separator
print $1; exit;, meaning exit after printing the first field that we encounter separated by %. Remove the exit if you don't want to stop after the first field.
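For example, with a hypothetical practice.txt (contents invented for illustration), the header line has no %, so $1 equals $0 and it is skipped; the first %-separated line wins:
$ cat practice.txt
record list
1001%Alice
1002%Bob
$ numA=`awk -F'%' '{ if($1 != $0) { print $1; exit; }}' practice.txt`
$ echo "$numA"
1001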

Linux scripting: Search a specific column for a keyword

I have a large text file that contains multiple columns of data. I'm trying to write a script that accepts a column number and keyword from the command line and searches for any hits before displaying the entire row of any matches.
I've been trying something along the lines of:
grep $fileName | awk '{if ($'$columnNumber' == '$searchTerm') print $0;}'
But this doesn't work at all. Am I on the right lines? Thanks for any help!
The -v option can be used to pass shell variables to the awk command.
The following may be what you're looking for:
awk -v s=$SEARCH -v c=$COLUMN '$c == s { print $0 }' file.txt
EDIT:
I am always trying to write more elegant and tighter code. So here's what Dennis means:
awk -v s="$search" -v c="$column" '$c == s { print $0 }' file.txt
Looks reasonable enough. Try using set -x to look at exactly what's being passed to awk. You can also use different and/or more awk things, including getting rid of the separate grep:
awk -v colnum="$columnNumber" -v require="$searchTerm" \
"/$fileName/ { if (\$colnum == require) print }"
which works by setting awk variables (colnum and require, in this case) and then using the literal string $colnum to get the desired field, and the variable require to get the required string.
Note that in all cases (with or without the grep command), any regular expression meta-characters in $fileName will be meta-y, e.g., this.that will match the file named this.that but also the file named thisXthat.
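Putting the -v approach together as a runnable script (a sketch; the script name, argument order, and variable names are assumptions, not from the original post):
#!/bin/bash
# usage: ./colsearch.sh file column_number search_term
fileName=$1
columnNumber=$2
searchTerm=$3
# print every row whose given column exactly equals the search term;
# awk handles the matching itself, so no separate grep stage is needed
awk -v c="$columnNumber" -v s="$searchTerm" '$c == s { print }' "$fileName"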
