awk printing nothing when used in loop [duplicate] - linux

I have a bunch of files using the format file.1.a.1.txt that look like this:
A 1
B 2
C 3
D 4
and was using the following command to add a new column containing the name of each file:
awk '{print FILENAME (NF?"\t":"") $0}' file.1.a.1.txt > file.1.a.1.txt
which ended up making them look how I want:
file.1.a.1.txt A 1
file.1.a.1.txt B 2
file.1.a.1.txt C 3
file.1.a.1.txt D 4
However, I need to do this for many files as a job on an HPC cluster submitted with sbatch. But when I run the following job script:
#!/bin/bash
#<other SBATCH info>
#SBATCH --array=1-10
N=$SLURM_ARRAY_TASK_ID
for j in {a,b,c};
do
for i in {1,2,3}
do awk '{print FILENAME (NF?"\t":"") $0}' file.${N}."$j"."$i".txt > file.${N}."$j"."$i".txt
done
done
awk generates empty files. I have tried using cat to read the file and piping it to awk, but that hasn't worked either.

You don't need a loop, and you cannot redirect output to the same file awk is reading: the shell truncates the file before awk ever opens it, which is why you get blank files.
Try this:
#!/bin/bash
N=$SLURM_ARRAY_TASK_ID
awk '
NF {
  print FILENAME "\t" $0 > (FILENAME ".tmp")
}
ENDFILE { # requires gawk
  close(FILENAME ".tmp")
}' file."$N".{a,b,c}.{1,2,3}.txt
for file in file*.tmp; do
  mv "$file" "${file%.tmp}"
done
Note that if you don't have GNU awk for ENDFILE{}, you can remove that stanza and get away with either:
Putting a close() right after the print and switching the redirection to >> (append) so the re-opened file isn't truncated; this comes with a lot of per-line overhead.
Not calling close() at all; as long as you don't have a lot of files, you should be fine.
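If you do have GNU awk (4.1 or later), another option is its inplace extension, which handles the temporary file for you. A minimal sketch of the job script under that assumption:
#!/bin/bash
N=$SLURM_ARRAY_TASK_ID
# gawk's inplace extension rewrites each input file in place,
# so no explicit .tmp file or mv step is needed
gawk -i inplace '{print FILENAME (NF?"\t":"") $0}' file."$N".{a,b,c}.{1,2,3}.txt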

BASH Split CSV Into Multiple Files Based on Column Value [duplicate]

I have a file named fulldata.tmp which contains pipe-delimited data (I can change it to comma if needed, but I generally like using pipe). With a bash shell script I would like to split lines out to new files based on the value in column 1 and retain the header. I'm pulling this data via SQL, so I can pre-sort if needed, but I don't have direct access to the terminal running this script, so development and debugging are difficult. I've searched dozens of examples, mostly recommending awk, but I'm not connecting the dots. This is my core need; below are a couple of quality-of-life options I'd like if they're easy, along with example data.
Nice if possible: I would like to specify which columns print to the new files (my example desired output shows I want columns 1-4 out of the initial 5 columns).
Nice if possible: I would like the new files named with a prefix then the data that is being split on followed by extension: final_$col1.csv
GROUPID|LABEL|DATE|ACTIVE|COMMENT
ABC|001|2022-09-15|True|None
DEF|001|2022-09-16|False|None
GHI|002|2022-10-17|True|Future
final_ABC.csv
ABC|001|2022-09-15|True
final_DEF.csv
DEF|001|2022-09-16|False
final_GHI.csv
GHI|002|2022-10-17|True
Maybe awk
awk -F'|' -v OFS='|' 'NR>1{print $1, $2, $3, $4 > "final_"$1".csv"}' fulldata.tmp
Check the created csv files and their contents.
tail -n+1 final*.csv
Output
==> final_ABC.csv <==
ABC|001|2022-09-15|True
==> final_DEF.csv <==
DEF|001|2022-09-16|False
==> final_GHI.csv <==
GHI|002|2022-10-17|True
Here is how I would do the header.
IFS= read -r head < fulldata.tmp
Then pass the variable to awk.
awk -F'|' -v header="${head%|*}" 'NR>1{printf "%s\n%s|%s|%s|%s\n", header, $1, $2, $3, $4 > "final_"$1".csv"}' fulldata.tmp
Run tail again to check.
tail -n+1 final*.csv
Output
==> final_ABC.csv <==
GROUPID|LABEL|DATE|ACTIVE
ABC|001|2022-09-15|True
==> final_DEF.csv <==
GROUPID|LABEL|DATE|ACTIVE
DEF|001|2022-09-16|False
==> final_GHI.csv <==
GROUPID|LABEL|DATE|ACTIVE
GHI|002|2022-10-17|True
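One caveat: the printf above writes the header before every data line, which only looks right here because each GROUPID appears once. If a group can have multiple rows, a sketch that writes the header only once per output file, tracking files already created in an array:
awk -F'|' -v OFS='|' '
NR == 1 { header = $1 OFS $2 OFS $3 OFS $4; next }
{
  out = "final_" $1 ".csv"
  # print the header only the first time we touch this file
  if (!(out in written)) { print header > out; written[out] = 1 }
  print $1, $2, $3, $4 > out
}' fulldata.tmp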
You did find a solution with pure awk.
This works and preserves the header, which I believe was a requirement.
cut -d '|' -f 1 fulldata.tmp | grep -v GROUPID | sort -u | while read -r id; do grep -E "^${id}|^GROUPID" fulldata.tmp > "final_${id}.csv"; done
I think a pure awk solution is better though.
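For the first nice-to-have, the column list can also be passed in as a variable instead of being hard-coded, so the same script serves other extracts. A sketch where the 1-based field numbers come in through a cols variable:
awk -F'|' -v OFS='|' -v cols='1,2,3,4' '
BEGIN { n = split(cols, c, ",") }
# build a record from just the requested columns
function pick(    i, s) { for (i = 1; i <= n; i++) s = (i == 1 ? "" : s OFS) $(c[i]); return s }
NR == 1 { header = pick(); next }
{
  out = "final_" $1 ".csv"
  if (!(out in written)) { print header > out; written[out] = 1 }
  print pick() > out
}' fulldata.tmp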

split and write the files with AWK - Bash

INPUT_FILE.txt in c:\Pro\usr\folder1
ABCDEFGH123456
ABCDEFGH123456
ABCDEFGH123456
BBCDEFGH123456
BBCDEFGH123456
I used the below awk command in a .sh script, run from c:\Pro\usr\folder2, to split the file into multiple txt files named with a _kg suffix based on the first 8 characters.
awk '{ F=substr($0,1,8) "_kg" ".txt"; print $0 >> F; close(F) }' "c:\Pro\usr\folder1\input_file.txt"
This is working well, but the files are written to the location the script runs from. How can I route the created files to another location, like c:\Pro\usr\folder3?
Thanks
The following awk code may help; it was written and tested with the shown samples in GNU awk.
awk -v outPath='c:\\Pro\\usr\\folder3' -v FPAT='^.{8}' '{outFile=($1"_kg.txt");outFile=outPath"\\"outFile;print > (outFile);close(outFile)}' Input_file
Explanation: create an awk variable named outPath holding the path mentioned in the question. Then set FPAT (a regex describing what a field looks like) so that the first field is the first 8 characters of each line. In the main program, build the outFile variable (the first field followed by _kg.txt), prefix it with outPath, print the whole line to that file, and close it after each write to avoid a "too many open files" error.
Pass the destination folder as a variable to awk:
awk -v dest='c:\\Pro\\usr\\folder3\\' '{F=dest substr($0,1,8) "_kg" ".txt"; print $0 >> F; close(F) }' "c:\Pro\usr\folder1\input_file.txt"
I think the doubled backslashes are required.
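If the script runs under a Unix-style shell on Windows (Git Bash, MSYS2, Cygwin), forward slashes are usually accepted too and sidestep the backslash-escaping question entirely; a sketch under that assumption:
awk -v dest='c:/Pro/usr/folder3/' '{ F=dest substr($0,1,8) "_kg.txt"; print $0 >> F; close(F) }' "c:/Pro/usr/folder1/input_file.txt"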

Shell script make lines in one huge file into two separate files in one go? [duplicate]

Currently my shell script iterates over the lines of one huge file twice.
(What I want to do is just what the shell script below does.)
grep 'some_text' huge_file.txt > lines_contains_a.txt
grep -v 'some_text' huge_file.txt > lines_not_contains_a.txt
but it is slow.
How to do the same thing only iterate the lines once?
Thanks!
With awk:
awk '/some_text/ { print >> "lines_contains_a.txt" }
!/some_text/ { print >> "lines_not_contains_a.txt" }' huge_file.txt
With sed:
sed -n '/some_text/ w lines_contains_a.txt
/some_text/! w lines_not_contains_a.txt' huge_file.txt
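Both of these read the file once. The two awk rules can even be collapsed into a single print that picks its destination per line; a minimal sketch:
awk '{ print > (/some_text/ ? "lines_contains_a.txt" : "lines_not_contains_a.txt") }' huge_file.txt
Note that within one awk run, > truncates each output file on first use and appends on subsequent prints, so stale contents from a previous run are not a concern the way they would be with the shell's >>.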

Reading a number by awk

I have the following code which successfully reads the line of the file that I want:
tail -n 9 myfile | awk 'NR==1'
although I do not want this to print anything in my script; I want to capture it. So I tried to assign it to a variable, but it doesn't work this way:
this=tail -n 9 myfile | awk 'NR==1'
Eventually, I want to read the second argument, which is a number, via ${1}. Could you tell me how I can do that?
It sounds like you just want to capture the output in a variable. Thus you can do this:
awkOutput=$(tail -n 9 myfile | awk 'NR==1')
and later you can print it out:
echo "$awkOutput"
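If the end goal is just the number in that line, awk can extract the field directly; a sketch assuming the value is the second whitespace-separated field:
# NR==1 picks the first of the last 9 lines; $2 is its second field
num=$(tail -n 9 myfile | awk 'NR==1{print $2}')
echo "$num"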

renaming files using loop in unix

I have a situation here.
I have a lot of files like the ones below in Linux:
SIPTV_FIPTV_ID00$line_T20141003195717_C0000001000_FWD148_IPV_001.DATaac
SIPTV_FIPTV_ID00$line_T20141003195717_C0000001000_FWD148_IPV_001.DATaag
I want to replace the $line with a counter from 0001 to 6000 across my 6000 such files.
Also, I want to remove the trailing 3 characters from each file name after this is done.
After the fix, the files should look like:
SIPTV_FIPTV_ID0000001_T20141003195717_C0000001000_FWD148_IPV_001.DAT
SIPTV_FIPTV_ID0000002_T20141003195717_C0000001000_FWD148_IPV_001.DAT
Please help.
With some assumptions, I think this should do it:
1. list of the files is in a file named input.txt, one file per line
2. the code is running in the directory the files are in
3. bash is available
awk '{i++;printf "mv \x27"$0"\x27 ";printf "\x27"substr($0,1,16);printf "%05d", i;print substr($0,22,47)"\x27"}' input.txt | bash
from the command prompt give the following command
% printf '%s\n' *.DAT??? | awk '{
old=$0;
sub("\\$line",sprintf("%5.5d",++n));
sub("...$","");
print "mv \x27" old "\x27 \x27" $0 "\x27"}'
%
and check the output; if it looks OK, run it again piping to sh:
% printf '%s\n' *.DAT??? | awk '{
old=$0;
sub("\\$line",sprintf("%5.5d",++n));
sub("...$","");
print "mv \x27" old "\x27 \x27" $0 "\x27"}' | sh
%
A commentary: printf '%s\n' *.DAT??? is meant to give awk the list of filenames you want to modify, one per line (echo would put them all on a single line, which breaks the per-file logic); you may want something more elaborate if the example names you gave aren't representative of the whole spectrum. Regarding the awk script itself: sprintf generates a string with the correct number of zeroes to replace $line; the idiom "\\$..." with two backslashes to quote the dollar sign is required by gawk and does no harm in mawk; and the \x27 single quotes around each filename keep the shell from expanding the literal $line in the old names. As a last remark, in similar cases I prefer to do at least a dry run before passing the commands to the shell.
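For completeness, the same rename can be done in plain bash, without generating commands through awk. A sketch under the same assumptions (files in the current directory, names matching *.DAT???):
n=0
for f in *.DAT???; do
  printf -v id '%05d' "$((++n))"   # zero-padded counter
  new=${f/\$line/$id}              # replace the literal $line
  new=${new%???}                   # drop the trailing 3 characters
  echo mv -- "$f" "$new"           # remove echo after a dry run
done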
