How can I get past file length limit? - linux

I am trying to parse 50+ files in a shell script in a single call like the following,
for i in {0..49}
do
_file_list="$_file_list $_srcdir01/${_date_a[$i]}.gz"
done
eval zcat "$_file_list" | awk '{sum += 1} END {print sum;}'
But when I do this, I get the 'file name too long' error with zcat.
The reason I am trying to do this in a single call is that, to my knowledge, awk cannot retain information from a previous call, and I have to go through the entire list as a whole (e.g. finding a unique word in that list).
I also don't want to combine the files, because each of them is already large.
Is there a clever way to solve this, or do I need to split the call and write out intermediate results along the way?

You can pipe directly from a loop:
for date in "${_date_a[@]}"
do
zcat "$_srcdir01/$date.gz"
done | awk '{sum += 1} END {print sum;}'
In any case, that code shouldn't give that error as posted.
Since your example is not complete or self-contained, I added some code to initialize datafiles to test:
$ cat testscript
_srcdir01="./././././././././././././././././././"
_date_a=(foo{0001..0050})
for file in "${_date_a[@]}"
do
echo "hello world" | gzip > "$file.gz"
done
for i in {0..49}
do
_file_list="$_file_list $_srcdir01/${_date_a[$i]}.gz"
done
eval zcat "$_file_list" | awk '{sum += 1} END {print sum;}'
Running it generates a bunch of test data and correctly sums the number of lines:
$ bash testscript
50
I can reproduce your issue if I e.g. remove the eval:
$ bash testscript
(...)/foo0045.gz ./././././././././././././././././././/foo0046.gz ././././././.
/././././././././././././/foo0047.gz ./././././././././././././././././././/foo0
048.gz ./././././././././././././././././././/foo0049.gz ./././././././././././.
/./././././././/foo0050.gz: file name too long
So please double check that the code you post is the code you run, and not one of several other attempts you made while trying to solve it.

$ awk '{sum += 1} END {print sum}' files...
will work, but perhaps you just need to use wc -l
Manually building the file list is unnecessary,
$ zcat path/to/files{1..49} | awk ...
will work as well.
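If the names live in an array, as in the question, you can also expand the array itself instead of building a string. A sketch using bash parameter expansion (hedged: it assumes _date_a and _srcdir01 are set as in the question, and uses wc -l for the count):
files=("${_date_a[@]/%/.gz}")              # append .gz to every element
zcat "${files[@]/#/$_srcdir01/}" | wc -l   # prefix the directory, count lines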

Related

Length comparison of one specific field in linux

I was trying to check the length of the second field of a TSV file (hundreds of thousands of lines). However, it runs very slowly. I guess something is wrong with "echo", but I'm not sure how to fix it.
Input file:
prob name
1.0 Claire
1.0 Mark
... ...
0.9 GFGKHJGJGHKGDFUFULFD
So I need to print out the names that are wrong. I tested with a small example using "head -100" and it worked, but it just can't cope with the original file.
This is what I ran:
for title in `cat filename | cut -f2`;do
    length=`echo -n $title | wc -m`
    if [ "$length" -gt 10 ];then
        echo $title
    fi
done
awk to the rescue:
awk 'length($2)>10' file
This will print all lines having the second field length longer than 10 characters.
Note that it doesn't require any block statement {...} because if the condition is met, awk will by default print the line.
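For example, fed two of the sample rows from the question, the bare condition prints only the offending line:
$ printf '1.0\tClaire\n0.9\tGFGKHJGJGHKGDFUFULFD\n' | awk 'length($2)>10'
0.9	GFGKHJGJGHKGDFUFULFD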
Try this probably:
cat file.tsv | awk '{if (length($2) > 10) print $0;}'
This should be a bit faster, since the whole processing is done by a single awk process, while your solution starts 2 processes per loop iteration to make that comparison.
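If you'd rather stay in the shell, here is a sketch of the same check in pure bash with no per-line subprocesses (assuming tab-separated fields, as in a TSV):
while IFS=$'\t' read -r prob name; do
    if (( ${#name} > 10 )); then    # ${#name} is the character count
        printf '%s\n' "$name"
    fi
done < filename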
We can use awk if that helps.
awk '{if(length($2) > 10){print}}' filename
$2 here is the 2nd field of filename, and the check runs for every line. It will be faster than the shell loop.

Bash awk script: document splitter issue

I wrote this script to split one large document into several 500-line documents. It works, with one exception: the first rendered file is one line short (499 lines).
The first line of the master document is transferred to "file01" correctly, and line 1 of "file02" is the next sequential line after line 499 of "file01."
Below is my script. Thank you all.
To use in the terminal: Splitter.sh "filetosplit.txt"
#!/bin/bash
find $1 -type f | sort -n > $1_TapeList.txt
mkdir 500FileTL_$1
cd 500FileTL_$1
awk '{outfile=sprintf("file%02d.txt",NR/500);print > outfile}' ../$1_TapeList.txt
NR starts at 1, not 0. So you could just fix it like this
awk '{outfile=sprintf("file%02d.txt",(NR-1)/500) ...
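Put together, the corrected line would look like this (the int() is only to make the truncation explicit; awk's %d conversion truncates anyway):
awk '{outfile=sprintf("file%02d.txt",int((NR-1)/500)); print > outfile}' ../$1_TapeList.txt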
I think I'd approach it this way:
#!/bin/sh
set -e
dir="500FileTL_$1"
mkdir "$dir"
find "$1" -type f |
sort -n |
awk -v dir="$dir" \
'(NR-1) % 500 == 0 {
outfile = sprintf("%s/file%02d.txt", dir, (NR - 1)/500)
}
{ print > outfile }'
No point in slowing things down with bash, or requiring it, if you don't need it. No need to change directories. No need to create a temporary file, or wait for it to be written. If you want $1_TapeList.txt, add a print statement at the end of the awk script, and redirect that.
Also, you really want set -e. You do not want the script to proceed if mkdir fails.
The above won't run much faster, because awk can't begin until sort ends. But there's less wasted motion, and the awk will be a tad faster because it runs sprintf only once per 500 lines.
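As a quick sanity check of the (NR-1) % 500 pattern, you can feed it synthetic input and count the pieces (part%02d.txt is just a throwaway name for the test):
$ seq 1200 | awk '(NR-1) % 500 == 0 { f = sprintf("part%02d.txt", (NR-1)/500) } { print > f }'
$ wc -l part*.txt
  500 part00.txt
  500 part01.txt
  200 part02.txt
 1200 total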

renaming files using loop in unix

I have a situation here.
I have a lot of files like the ones below in Linux:
SIPTV_FIPTV_ID00$line_T20141003195717_C0000001000_FWD148_IPV_001.DATaac
SIPTV_FIPTV_ID00$line_T20141003195717_C0000001000_FWD148_IPV_001.DATaag
I want to remove the $line and put a counter from 0001 to 6000 in its place for my 6000 such files.
Also I want to remove the 3 trailing characters after this is done for each file.
After the fix, the files should look like:
SIPTV_FIPTV_ID0000001_T20141003195717_C0000001000_FWD148_IPV_001.DAT
SIPTV_FIPTV_ID0000002_T20141003195717_C0000001000_FWD148_IPV_001.DAT
Please help.
With some assumptions, I think this should do it:
1. the list of the files is in a file named input.txt, one per line
2. the code is running in the directory the files are in
3. bash is available
awk '{i++;printf "mv \x27"$0"\x27 ";printf "\x27"substr($0,1,16);printf "%05d", i;print substr($0,22,47)"\x27"}' input.txt | bash
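If bash is available anyway, a pure-shell sketch of the same renaming is possible (assumptions: the files are in the current directory, glob order gives the order you want, and the literal text $line appears in every name as in the examples):
n=0
for f in *.DAT???; do
    printf -v num '%05d' $((++n))    # 5-digit counter, as in the awk version
    new=${f//'$line'/$num}           # replace the literal $line with the counter
    mv -- "$f" "${new%???}"          # drop the 3 trailing characters
done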
From the command prompt, give the following command:
% printf '%s\n' *.DAT??? | awk '{
old=$0;
sub("\\$line",sprintf("%4.4d",++n));
sub("...$","");
print "mv", old, $1}'
%
and check the output. If it looks OK,
% printf '%s\n' *.DAT??? | awk '{
old=$0;
sub("\\$line",sprintf("%4.4d",++n));
sub("...$","");
print "mv", old, $1}' | sh
%
A commentary: printf '%s\n' *.DAT??? is meant to feed awk the filenames you want to modify, one per line (echo would put them all on a single line); you may want something more articulated if the example names you gave aren't representative of the whole spectrum. Regarding the awk script itself, I used sprintf to generate a string with the correct number of zeroes for the replacement of $line. The idiom "\\$..." with two backslashes to quote the dollar sign is required by gawk and does no harm in mawk. As a last remark, in similar cases I prefer to make at least a dry run before passing the commands to the shell.

how to pass the filename as variable to a awk command from a shell script

In my shell script I have the following line:
PO_list=$(awk -v col="$1" -F";" '!seen[$col]++ {print $col}' test.csv)
which generates a list of the values from column "col" (which comes from "$1") in the file test.csv.
There might be several files in the same location, and I would need to loop over them with a for loop. For this I have to replace the filename test.csv with a variable, $j for example, which is the index into the list of files.
Trying to fulfill this, I modified my line to
PO_list=$(awk -v col="$1" -F";" '!seen[$col]++ {print $col}' $j)
Unfortunately, I receive the error message:
awk: cannot open test.csv (No such file or directory)
Can anyone tell me why this error occurs and how I can solve it, please?
Thank you,
As you commented in your previous question, you are calling it with
abc$ ./test.sh 2
So you just need to add another parameter when you call it:
abc$ ./test.sh 2 "test.csv"
and the script can be like this:
PO_list=$(awk -v col="$1" -F";" '!seen[$col]++ {print $col}' "$2")
# ^^^^
Whenever you want to use other parameters, remember they are positional: the first one is $1, the second is $2, and so on.
In case the file happens to be in another directory, you can replace ./test.sh 2 "test.csv" by something like ./test.sh 2 "/full/path/of/test.csv" or whatever relative path you may need.
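And if the eventual goal is to loop over every CSV in a directory, a minimal sketch (the path and the echo are placeholders):
for f in /path/to/*.csv; do
    PO_list=$(awk -v col="$1" -F";" '!seen[$col]++ {print $col}' "$f")
    echo "$f: $PO_list"
done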

Issues with the AWK function

Does Awk have a limit to the amount of data it can process?
for i in "052" "064" "060" "070" "074" "076" "178"
do
awk -v f="${i}" -F, 'match ($1,f) { print $2","$3 }' uls.csv > ul$i.csv
awk -v f="${i}" -F, 'match ($1,f) { print $2","$3 }' dls.csv > dl$i.csv
awk -v n="${i}" -F, 'match ($1,n) { print $2","$3 }' dlsur.csv >> dlu$i.csv
awk -v k="${i}" -F, 'match ($1,k) { print $2","$3 }' dailyd.csv >> dla$i.csv
awk -v m="${i}" -F, 'match ($1,m) { print $2","$3 }' dailyu.csv >> ula$i.csv
done
When I run that piece of code, it pulls data from the csv files and creates new files, and it works perfectly.
But when I add an extra value to the for loop, for example "180", it creates that file, but the file also includes a few lines of data from other files. I went over the code many times. I even checked the raw data before it goes into this loop, and it is all correct. This seems like a glitch in awk.
Do I need to apply a wait function so that it can catch up?
Also something like
for file in uls dls dlsur dailyd dailyu; do
awk -F, -v OFS=, -v file="$file" '$1 ~ /052|064|060|070|074|076|178/ {print $2, $3 >> (file $1 ".csv")}' "$file.csv"
done
is probably better if it does what you want. Many fewer invocations of awk and loops through your files. (Slightly different output file names. That would be fixable but complicate the script a bit more than I thought was necessary for the purpose.)
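A side note on the redirection: within a single awk process, print > file truncates the file only the first time it is opened and appends on every later print, so > would also work in the version above and, unlike >>, it would not preserve stale data from earlier runs. A quick check:
$ seq 3 | awk '{ print > "out.txt" }'
$ cat out.txt
1
2
3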
No. What you think is happening cannot be happening - awk WILL NOT randomly pull data from unspecified files and put it in its output stream.
Note that in your 3rd and subsequent lines you are using '>>' instead of '>' for your output redirection - have you accounted for that?
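If that append is indeed the cause, clearing the old output before re-running the loop would rule it out; a minimal sketch, assuming these globs match only the loop's own output files:
rm -f dlu*.csv dla*.csv ula*.csv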
If you update your question (i.e. do NOT try to do it in a comment!) to tell us what you're trying to do with some representative sample input and expected output (just 2 input files, not 5, should be enough to explain your problem), we can help you write a correct script to do that.
