AWK very slow when splitting large file based on column value and using close command - linux

I need to split a large log file into smaller ones based on the id found in the first column. This solution worked wonders, and very fast, for months:
awk -v dir="$nome" -F\; '{print>dir"/"dir"_"$1".log"}' ${nome}.all;
Where $nome is a file and directory name.
It's very fast and worked well until the log file reached several million lines (a 2GB+ text file); then it started to show
"Too many open files"
The solution is indeed very simple, adding the close command:
awk -v dir="$nome" -F\; '{print>dir"/"dir"_"$1".log"; close(dir"/"dir"_"$1".log")}' ${nome}.all;
The problem is that now it's VERY slow; it's taking forever to do something that was done in seconds, and I need to optimize this.
AWK is not mandatory, I can use an alternative; I just don't know how.

Untested since you didn't provide any sample input/output to test with, but this should do it:
sort -t';' -k1,1 "${nome}.all" |
awk -v dir="$nome" -F\; '$1!=prev{close(out); out=dir"/"dir"_"$1".log"; prev=$1} {print > out}'
Your first script:
awk -v dir="$nome" -F\; '{print>dir"/"dir"_"$1".log"}' ${nome}.all;
had 3 problems:
It wasn't closing the output files as it went and so exceeded the open-files limit you saw.
It had an unparenthesized expression on the right side of output redirection which is undefined behavior per POSIX.
It wasn't quoting the shell variable ${nome} in the file name.
It's worth mentioning that gawk would be able to handle 1 and 2 without failing but it would slow down as the number of open files grew and it was having to manage the opens/closes internally.
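For reference, fixing problems 2 and 3 alone (parenthesizing the redirection target and quoting the shell variable) would look like the sketch below; it still hits the open-file limit without close():
awk -v dir="$nome" -F\; '{print > (dir"/"dir"_"$1".log")}' "${nome}.all"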
Your second script:
awk -v dir="$nome" -F\; '{print>dir"/"dir"_"$1".log"; close(dir"/"dir"_"$1".log")}' ${nome}.all;
though now closing the output file name, still had problems 2 and 3 and then added 2 new problems:
It was opening and closing the output files once per input line instead of only when the output file name had to change.
It was overwriting the output file for each $1 for every line written to it instead of appending to it.
The above assumes you have multiple lines of input for each $1 and so each output file will have multiple lines. Otherwise the slowdown you saw when closing the output files wouldn't have happened.
The above sort could rearrange the relative order of input lines within each $1 group. If that's a problem, add -s for a "stable sort" if you have GNU sort, or let us know, as it's easy to work around with POSIX tools.
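With GNU sort, that stable variant is just (a sketch; -s is a GNU extension, so check that your sort supports it):
sort -s -t';' -k1,1 "${nome}.all" |
awk -v dir="$nome" -F\; '$1!=prev{close(out); out=dir"/"dir"_"$1".log"; prev=$1} {print > out}'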

Related

Split large gz files while preserving rows

I have a large .gz file (2.1 GB) that I am trying to load into R, but it is large enough that I have to split it into pieces and load each individually before recombining them. However, I am having difficulty splitting the file in a way that preserves the structure of the data. The file itself, with the exception of the first two rows, is a 56318 x 9592 matrix with non-homogeneous entries.
I'm using Ubuntu 16.04. First, I tried using the split command from terminal as suggested by this link (https://askubuntu.com/questions/54579/how-to-split-larger-files-into-smaller-parts?rq=1)
$ split --lines=10000 "originalFile.gct.gz" "originalFile.gct.gz.part-"
Doing this, though, creates far more files than I would have expected (since my matrix has 57000 rows, I was hoping to output 6 files, each 10000 rows in size). When reading one of these into R and investigating the dimensions, I see that each is a matrix of 62x9592, indicating that the columns have all been preserved, but I'm getting significantly fewer rows than I would have hoped. Further, when reading it in, I get an error about an unexpected end of file. My thought is that it's not being read in the way I want it to be.
I found two possible alternatives here - https://superuser.com/questions/381394/unix-split-a-huge-gz-file-by-line
In particular, I've tried piping the output of gunzip through to split (with the assumption that perhaps the file being compressed is what led to the inconsistent line endings). I tried
$ zcat originalFile.gct.gz | split -l 10000 "originalFile.gct.gz" "originalFile.gct.gz.part-"
but, doing this, I ended up with the exact same splits that I had previously. I have the same problem replacing "zcat" with "gunzip -c", which should have sent the uncompressed output to the split command.
Another answer on that link suggested piping to head or tail with something like zcat, for example
$ zcat originalFile.gct.gz | head -n 10000 >> "originalFile.gct.gz.1"
With zcat, this works perfectly, and it's exactly what I want. The dimension for this ends up being 10000x9592, so this is the ideal solution. One thing that I'll note is that this output is an ASCII text file rather than a compressed file, and I'm perfectly OK with that.
However, I want to be able to do this until the end of the file, making an additional output file for each 10000 rows. For this particular case, it's not a huge deal to just make the six, but I have tens of files like this, some of which are >10gb. My question, then, is how can I use the split command so that it takes each successive 10000 lines of the unzipped file and outputs them, automatically updating the suffix with each new file? Basically, I want the output that I got from using "head", but with "split" so that I can do it over the entire file.
Here is the solution that ended up working for me
$ zcat originalFile.gct.gz | split -l 10000 - "originalFile.gtc.gz-"
As Guido mentioned in the comment, my original command
$ zcat originalFile.gct.gz | split -l 10000 "originalFile.gct.gz" "originalFile.gct.gz.part-"
was discarding the output of zcat, and split was once again reading from the compressed data. By including the "-" after the split argument, I was able to pass the standard output from zcat into split, and now the piping works as I was expecting it to.
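If you also want each part compressed as it is written, GNU split has a --filter option (assuming your coreutils is new enough to have it) that runs each chunk through a command with $FILE set to that part's name:
zcat originalFile.gct.gz | split -l 10000 --filter='gzip > $FILE.gz' - "originalFile.gct.gz-"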
When you want to control your splitting better, you can use awk.
You mentioned that the first two rows were special.
Try something like
zcat originalFile.gct.gz |
awk 'BEGIN {j=1} NR<3 {next} {i++} i%5==0 {j++} {print > ("originalFile.gct.part" j)}'
When you want your output files compressed, modify the awk command: let it print the names of the completed files and use xargs to gzip them.
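A rough sketch of that idea (assuming GNU xargs for the -r flag; the chunk size of 10000 and the part names are only examples):
zcat originalFile.gct.gz |
awk 'NR < 3 { next }                           # skip the two special header rows
     i % 10000 == 0 {                          # time to start a new part
         if (out) { close(out); print out }    # a part is complete: emit its name
         out = "originalFile.gct.part" ++j
     }
     { i++; print > out }
     END { if (out) { close(out); print out } }' |
xargs -r gzip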
If splitting based on the content of the file works for you, try:
zcat originalFile.gct.gz | awk -F',' '{print $0 | ("gzip > /tmp/file_" $1 ".gct.gz")}'
An example line of my file was:
2014,daniel,2,1,2,3
So I was splitting the files by year (the first column) using the variable $1,
getting an output of:
/tmp/file_2014.gct.gz
/tmp/file_2015.gct.gz
/tmp/file_2016.gct.gz
/tmp/file_2017.gct.gz
/tmp/file_2018.gct.gz

text file contains lines of bizarre characters - want to fix

I'm an inexperienced programmer grappling with a new problem in a large text file which contains data I am trying to process. Here's a screen capture of what I'm looking at (using 'less' - I am on a linux server):
https://drive.google.com/file/d/0B4VAqfRxlxGpaW53THBNeGh5N2c/view?usp=sharing
Bioinformaticians will recognize this file as a "fastq" file containing DNA sequence data. The top half of the screenshot contains data in its expected format (which I admit contains some "bizarre" characters, but that is not the issue). However, the bottom half (with many characters shaded in white) is completely messed up. If I were to scroll down the file, it eventually returns to normal text after about 500 lines. I want to fix it because it is breaking downstream operations I am trying to perform (which complain about precisely this position in the file).
Is there a way to grep for and remove the shaded lines? Or can I fix this problem by somehow changing the encoding on the offending lines?
Thanks
If you are lucky, you can use
strings file > file2
If that doesn't help, try it another way.
Determine the line length of the correct lines (I think the first two lines are different):
head -1 file | wc -c
head -2 file | tail -1 | wc -c
Note that wc also counts the line ending, so subtract 1 from both lengths.
Then try to read the file one line at a time. Use a case statement so you do not have to write a lot of else-if constructions comparing the length to the expected lengths. In the code below I will accept the lengths 20, 100 and 330.
Redirect everything to another file outside the loop (inside will overwrite each line).
while IFS= read -r line; do
case ${#line} in
20|100|330) printf '%s\n' "$line" ;;
esac
done < file > file2
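The same length check can also be done in a single awk pass (a sketch using the example lengths above), which avoids the per-line shell loop:
awk 'length($0) == 20 || length($0) == 100 || length($0) == 330' file > file2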
A totally different approach would be filtering out the wrong lines with sed, awk or grep, but that would require knowing which characters you will and won't accept.
Yes, if you are lucky, all the ugly lines will have a character in common, like '<' or maybe '#'. In that case you can use egrep:
egrep -v "<|#" file > file2
BASED ON INSPECTION OF THE SCREENSHOT
sed -r 's/<[[:alnum:]]{2}>//g;s/\^.//g;s/ESC\^*C*//g' file
to make the actual changes in the file and make a backup file with a .bak extension do
sed -r -i.bak 's/<[[:alnum:]]{2}>//g;s/\^.//g;s/ESC\^*C*//g' file

renaming files using loop in unix

I have a situation here.
I have a lot of files like the ones below in Linux:
SIPTV_FIPTV_ID00$line_T20141003195717_C0000001000_FWD148_IPV_001.DATaac
SIPTV_FIPTV_ID00$line_T20141003195717_C0000001000_FWD148_IPV_001.DATaag
I want to remove the $line and put a counter from 0001 to 6000 in its place, for my 6000 such files.
Also I want to remove the trailing 3 characters from each file name after this is done.
After the fix, the files should look like:
SIPTV_FIPTV_ID0000001_T20141003195717_C0000001000_FWD148_IPV_001.DAT
SIPTV_FIPTV_ID0000002_T20141003195717_C0000001000_FWD148_IPV_001.DAT
Please help.
With some assumptions, I think this should do it:
1. list of the files is in a file named input.txt, one file per line
2. the code is running in the directory the files are in
3. bash is available
awk '{i++;printf "mv \x27"$0"\x27 ";printf "\x27"substr($0,1,16);printf "%05d", i;print substr($0,22,47)"\x27"}' input.txt | bash
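If you'd rather not build input.txt, a plain bash loop over the glob is another option. This is only a sketch under the same assumptions (all files match the pattern shown and the glob sorts them into the order you want); the echo is left in so you can review the commands before running them:
i=0
for f in *.DAT???; do
    printf -v n '%05d' "$((++i))"   # zero-padded counter
    new=${f/\$line/$n}              # replace the literal $line with the counter
    new=${new%???}                  # drop the trailing 3 characters
    echo mv -- "$f" "$new"          # remove 'echo' to actually rename
done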
From the command prompt, give the following command:
% printf '%s\n' *.DAT??? | awk '{
old=$0;
sub("\\$line",sprintf("%5.5d",++n));
sub("...$","");
print "mv \047" old "\047 \047" $0 "\047"}'
%
and check the output; if it looks OK, run
% printf '%s\n' *.DAT??? | awk '{
old=$0;
sub("\\$line",sprintf("%5.5d",++n));
sub("...$","");
print "mv \047" old "\047 \047" $0 "\047"}' | sh
%
A commentary: printf '%s\n' *.DAT??? is meant to give awk a list of all the filenames that you want to modify, one per line; you may want something more articulated if the example names you gave aren't representative of the whole spectrum. Regarding the awk script itself, I used sprintf to generate a string with the correct number of zeroes for the replacement of $line; the "\\$..." idiom, with two backslashes to quote the dollar sign, is required by gawk and does no harm in mawk; the generated mv commands single-quote the file names (\047) because they contain a literal $ that the shell would otherwise try to expand; and as a last remark I have to say that in similar cases I prefer to make at least a dry run before passing the commands to the shell.

How to do something like grep -B to select only one line?

Everything is in the title. Basically, let's say I have this input:
some text lalala
another line
much funny wow grep
I grep for "funny" and I want my output to be "lalala".
Thank you
One possible answer is to use either ed or ex to do this (it is trivial in them):
ed - yourfile <<< 'g/funny/.-2p'
(Or replace ed with ex. You might have red, the restricted editor, too; it can't modify files.) This looks for the pattern /funny/ globally, and whenever it is found, prints the line 2 before the matching line (that's the .-2p part). Or, if you want the most recent line containing 'lalala' before the line matching 'funny':
ed - yourfile <<< 'g/funny/?lalala?p'
The only problem is if you're trying to process standard input rather than a file; then you have to save the standard input to a file and process that file, which spoils the concurrency.
You can't do negative offsets in sed (though GNU sed allows you to do positive offsets, so you could use sed -n '/lalala/,+2p' file to get the 'lalala' to 'funny' lines (which isn't quite what you want) based on finding 'lalala', but you cannot find the 'lalala' lines based on finding 'funny'). Standard sed does not allow offsets at all.
If you need to print just the IP address found on a line 8 lines before the pattern-matching line, you need a slightly more involved ed script, but it is still doable:
ed - yourfile <<< 'g/funny/.-8s/.* //p'
This uses the same basic mechanism to find the right line, then runs a substitute command to remove everything up to the last space on the line and print the modified version. Since there isn't a w command, it doesn't actually modify the file.
Since grep -B prints the full block of N lines before each match rather than just the one line you want, you'll have to pipe the output into something like grep or Awk.
grep -B 2 "funny" file|awk 'NR==1{print $NF; exit}'
You could also just use Awk.
awk -v s="funny" '/[[:space:]]lalala$/{n=NR+2; o=$NF}NR==n && $0~s{print o}' file
For the specific example of an IP address 8 lines before the match as mentioned in your comment:
awk -v s="funny" '
/[[:space:]][0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$/ {
n=NR+8
ip=$NF
}
NR==n && $0~s {
print ip
}' file
These Awk solutions first find the output field you might want, then print the output only if the word you want exists in the nth following line.
Here's an attempt at a slightly generalized Awk solution. It maintains a circular queue of the last field of each of the q most recent lines and prints the entry at the head of the queue when it sees a match.
#!/bin/sh
: ${q=8}
e=$1
shift
awk -v q="$q" -v e="$e" '{ m[(NR%q)+1] = $NF }
$0 ~ e { print m[((NR+1)%q)+1] }' "${@--}"
Adapting to a different default (I set it to 8) or proper option handling (currently, you'd run it like q=3 ./qgrep regex file) as well as remembering (and hence printing) the entire line should be easy enough.
(I also didn't bother to make it work correctly if you see a match in the first q-1 lines. It will just print an empty line then.)
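For example, against the sample from the question (assuming the script above is saved as qgrep and made executable), a queue of 3 reaches back two lines:
printf '%s\n' 'some text lalala' 'another line' 'much funny wow grep' > sample.txt
q=3 ./qgrep funny sample.txt    # prints: lalala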

Why doesn't "sort file1 > file1" work?

When I am trying to sort a file and save the sorted output in itself, like this
sort file1 > file1;
the contents of file1 get erased altogether, whereas when I try to do the same with the 'tee' command, like this
sort file1 | tee file1;
it works fine [ed: "works fine" only for small files with lucky timing; it will cause lost data on large ones or with unhelpful process scheduling], i.e. it overwrites file1 with its own sorted output and also shows it on standard output.
Can someone explain why the first case is not working?
As other people explained, the problem is that the I/O redirection is done before the sort command is executed, so the file is truncated before sort gets a chance to read it. If you think for a bit, the reason why is obvious - the shell handles the I/O redirection, and must do that before running the command.
The sort command has 'always' (since at least Version 7 UNIX) supported a -o option to make it safe to output to one of the input files:
sort -o file1 file1 file2 file3
The trick with tee depends on timing and luck (and probably a small data file). If you had a megabyte or larger file, I expect it would be clobbered, at least in part, by the tee command. That is, if the file is large enough, the tee command would open the file for output and truncate it before sort finished reading it.
It doesn't work because '>' redirection implies truncation, and to avoid keeping the whole output of sort in memory before redirecting it to the file, bash truncates and redirects output before running sort. Thus, the contents of file1 are truncated before sort has a chance to read them.
It's unwise to depend on either of these commands to work the way you expect.
The way to modify a file in place is to write the modified version to a new file, then rename the new file to the original name:
sort file1 > file1.tmp && mv file1.tmp file1
This avoids the problem of reading the file after it's been partially modified, which is likely to mess up the results. It also makes it possible to deal gracefully with errors; if the file is N bytes long, and you only have N/2 bytes of space available on the file system, you can detect the failure creating the temporary file and not do the rename.
Or you can rename the original file, then read it and write to a new file with the same name:
mv file1 file1.bak && sort file1.bak > file1
Some commands have options to modify files in place (for example, perl and sed both have -i options; note that the syntax of sed's -i option can vary). But these options work by creating temporary files; it's just done internally.
Redirection is handled first. So in the first case, > file1 takes effect before sort runs and empties the file.
The first command doesn't work (sort file1 > file1) because, when using the redirection operator (> or >>), the shell creates/truncates the file before the sort command is even invoked, since the redirection is handled first.
The second command appears to work (sort file1 | tee file1) because sort reads all of its input before writing any sorted data to standard output; as noted above, though, this depends on timing and file size.
So when using any other similar command, you should avoid using redirection operator when reading and writing into the same file, but you should use relevant in-place editors for that (e.g. ex, ed, sed), for example:
ex '+%!sort' -cwq file1
or use other utils such as sponge.
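For instance, with sponge from moreutils (assuming it is installed), the pipe soaks up all of sort's output before the file is rewritten:
sort file1 | sponge file1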
Luckily for sort there is the -o parameter, which writes the results to the file (as suggested by @Jonathan), so the solution is straightforward: sort -o file1 file1.
Bash opens a new, empty file when it sets up the redirection, and then calls sort.
In the second case, tee opens the file after sort has already read the contents.
You can use this method
sort file1 -o file1
This will sort and store the result back in the original file. Also, you can use this command to remove duplicate lines:
sort -u file1 -o file1
