awk script to process 400 .tst files - linux

I have a nice .awk script that takes the 2nd field ($2) from each line and prints it. Because the data in the .tst files only goes down 8192 lines, any lines after that are irrelevant (the script takes care of that). I have 400+ .tst files that need the same thing done, with the outputs placed into a single file. So how would I go through every .tst file in the current directory? I tried piping the cat output to a single-line version of the script, but it only processed the first file. Any suggestions?
BEGIN{
}
{
    print $2 "\n";    # print the 2nd field (the extra "\n" leaves a blank line after each value)
    if (NR==8192)     # the data only goes down 8192 lines; stop there
        exit;
}
END {
    print NR "\n";    # report how many lines were read
}

This should work -
awk 'FNR<=8192{ print $2 }' *.tst > finalfile

Just glob all the .tst files in the current directory and redirect the output to outfile:
$ awk 'FNR<=8192{print $2"\n";next}{print FNR"\n";nextfile}' *.tst > outfile
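
Both answers key off FNR rather than NR: FNR resets to 1 at the start of each input file, while NR keeps counting across all of them, so a per-file line limit has to use FNR. A minimal sketch of the difference (the two file names are just placeholders):
awk '{ print FILENAME, "FNR=" FNR, "NR=" NR }' a.tst b.tst
# a.tst FNR=1 NR=1
# a.tst FNR=2 NR=2
# b.tst FNR=1 NR=3   <- FNR restarts on the second file, NR does not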

Related

How to print the value in the third column of the line that comes after a line containing a specific string, using AWK, to a different file?

I have an output that contains something like this in the middle:
Stopping criterion = max iterations
Energy initial, next-to-last, final =
-83909.5503696 -86748.8150981 -86748.8512012
What I am trying to do is print out the last value (3rd column) of the line after the line that contains the string "Energy", to a different file, and I have to print out these values from 100 different files. Currently I have been trying with this line, which only looks at a single file:
awk -F: '/Energy/ { getline; print $0 }' inputfile > outputfile
but this gives output like:
-83909.5503696 -86748.8150981 -86748.8512012
Update - With the help of a suggestion below I was able to output the value to a file, but as it reads through different files it overwrites the output file, so only the value from the last file read is kept. What I tried was this:
#SBATCH --array=1-100
num=$SLURM_ARRAY_TASK_ID
fold=$(printf '%03d' $num)
cd $main_path/surf_$fold
awk 'f{print $3; f=0} /Energy/{f=1}' inputfile > outputfile
This would not be an appropriate job for getline (see http://awk.freeshell.org/AllAboutGetline), and I don't know why you're setting FS to ":" with -F: when your fields are space-separated, as awk assumes by default.
Here's how to do what I think you're trying to do with 1 call to awk:
awk 'f{print $3; f=0} /Energy/{f=1}' "$main_path/surf_"*"/inputfile" > outputfile
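For illustration of the f flag trick: /Energy/ sets the flag, and on the very next line (where f is still set) the 3rd field is printed and the flag is cleared. A quick check against the sample block from the question (the here-document is only there for the demo):
awk 'f{print $3; f=0} /Energy/{f=1}' <<'EOF'
Stopping criterion = max iterations
Energy initial, next-to-last, final =
-83909.5503696 -86748.8150981 -86748.8512012
EOF
# prints -86748.8512012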

Renaming Sequentially Named Files with Date Embedded

The Situation:
I have hundreds of zip files, each with an arbitrary date/time mixed into its name (4-6-2021 12-34-09 AM.zip). I need to get all of these files into date order and renamed like (0.zip, 1.zip, 2.zip, etc.) on a Linux CLI system.
What I've tried:
I've tried ls -tr | while read i; do n=$((n+1)); mv -- "$i" "$(printf '%03d' "$n").zip"; done which almost does what I want, but the result still seems to be out of order (I think it's using the files' creation order rather than the date in the filename, which is what I need).
If I can get this done, my next step would be to rename the file (yes, a single file) inside each zip to the name of the zip file. I'm not sure how I'd go about this either.
tl;dr
I have these files named with a weird date format. I need them ordered by that date and renamed sequentially, like 0.zip, 1.zip, 2.zip, etc. It's 3:00 AM and I don't know why I'm still up trying to solve this, and I have no idea how I'll rename the files inside the zips to that sequential number (read above for more detail on this).
Thanks in advance!
GNU awk is an option here, feeding the result of the file listing into awk via process substitution:
awk '{
fil=$0; # Set a variable fil to the line
gsub("-"," ",$1); # Replace "-" for " " in the first space delimited field
split($1,map," "); # Split the first field into the array map, using " " as the delimiter
if (length(map[1])==1) {
map[1]="0"map[1] # If the length of the day is 1, pad out with "0"
};
if (length(map[2])==1) {
map[2]="0"map[2] # Do the same for month
}
$1=map[1]" "map[2]" "map[3]; # Rebuilt first field based on array values
gsub("-"," ",$2); # Change "-" for " " in time
map1[mktime($1" "$2)]=fil # Get epoch format of date/time using mktime function and use this as an index for array map1 with the original line (fil) as the value
}
END {
PROCINFO["sorted_in"]="#ind_num_asc"; # At the end of processing, set the array sorting to index number ascending
cnt=0; # Initialise a cnt variable
for (i in map1) {
print "mv \""map1[i]"\" \""cnt".zip\""; # Loop through map1 array printing values and using these values along with cnt to generate and print move command
cnt++
}
}' <(for fil in *AM.zip;do echo "$fil";done)
Once you are happy with the way the mv commands are printed, pipe the result into bash, like so:
awk '{ fil=$0;gsub("-"," ",$1);split($1,map," ");if (length(map[1])==1) { map[1]="0"map[1] };if (length(map[2])==1) { map[2]="0"map[2] };$1=map[3]" "map[2]" "map[1];gsub("-"," ",$2);map1[mktime($1" "$2)]=fil} END { PROCINFO["sorted_in"]="#ind_num_asc";cnt=0;for (i in map1) { print "mv \""map1[i]"\" \""cnt".zip\"";cnt++ } }' <(for fil in *AM.zip;do echo "$fil";done) | bash
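Worth noting for anyone adapting this: gawk's mktime() expects its datespec in "YYYY MM DD HH MM SS" order, i.e. year first, which is why the date fields are rebuilt year-first before the call. A quick sanity check:
gawk 'BEGIN { print mktime("2021 06 04 12 34 09") }'   # prints epoch seconds; mktime returns -1 if the datespec cannot be parsed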

Bash script to list files periodically

I have a huge set of files, 64,000, and I want to create a Bash script that lists the names of the files using
ls -1 > file.txt
for every 4,000 files and stores the resulting file.txt in a separate folder. So, every 4,000 files have their names listed in a text file that is stored in its own folder. The result is:
folder01 contains file.txt that lists files #0-#4000
folder02 contains file.txt that lists files #4001-#8000
folder03 contains file.txt that lists files #8001-#12000
.
.
.
folder16 contains file.txt that lists files #60000-#64000
Thank you very much in advance
You can try
ls -1 | awk '
{
if (! ((NR-1)%4000)) {
if (j) close(fnn)
fn=sprintf("folder%02d",++j)
system("mkdir "fn)
fnn=fn"/file.txt"
}
print >> fnn
}'
Explanation:
NR is the current record number in awk, that is: the current line number.
NR starts at 1, on the first line, so we subtract 1 such that the if statement is true for the first line
system calls an operating system function from within awk
print in itself prints the current line to standard output, we can redirect (and append) the output to the file using >>
All uninitialized variables in awk will have a zero value, so we do not need to say j=0 in the beginning of the program
This will get you pretty close:
ls -1 | split -l 4000 -d - folder
Run the result of ls through split, breaking every 4000 lines (-l 4000), using numeric suffixes (-d), from standard input (-) and start the naming of the files with folder.
Results in folder00, folder01, ...
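If the folderNN/file.txt layout from the question is a hard requirement, the files that split produces could then be moved into place; a rough sketch, assuming the folder00, folder01, ... names generated above (the .d directory suffix is arbitrary):
for f in folder[0-9][0-9]; do
    d="${f}.d"                       # pick whatever directory naming scheme you like
    mkdir -p "$d" && mv "$f" "$d/file.txt"
done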
Here is an exact solution using awk:
ls -1 | awk '
(NR-1) % 4000 == 0 {
dir = sprintf("folder%02d", ++nr)
system("mkdir -p " dir);
}
{ print >> (dir "/file.txt") }'
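Either awk-based approach can be spot-checked once it finishes, using the folder names created above:
wc -l folder*/file.txt   # each file.txt should hold at most 4000 names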
There are already some good answers above, but I would also suggest you take a look at the watch command. This will re-run a command every n seconds, so you can, well, watch the output.

Bash: How to keep lines in a file that have fields that match lines in another file?

I have two big files with a lot of text, and what I have to do is keep all lines in file A that have a field that matches a field in file B.
file A is something like:
Name (tab) # (tab) # (tab) KEYFIELD (tab) Other fields
For file B, I managed to use cut and sed and other things to basically get it down to a single field that forms a list.
So the goal is to keep every line in file A whose 4th field (it says KEYFIELD) matches one of the lines in file B. (It does NOT have to be an exact match, so if file B had Blah and file A said Blah_blah, it'd be OK.)
I tried to do:
grep -f fileBcutdown fileA > outputfile
EDIT: OK, I give up. I just force-killed it.
Is there a better way to do this? File A is 13.7 MB and file B after cutting it down is 32.6 MB, for anyone that cares.
EDIT: This is an example line in file A:
chr21 33025905 33031813 ENST00000449339.1 0 - 33031813 33031813 0 3 1835,294,104, 0,4341,5804,
example line from file B cut down:
ENST00000111111
Here's one way using GNU awk. Run like:
awk -f script.awk fileB.txt fileA.txt
Contents of script.awk:
FNR==NR {
    array[$0]++                      # first file (fileB.txt): remember every line as a key
    next
}
{
    line = $4                        # second file (fileA.txt): take the 4th field
    sub(/\.[0-9]+$/, "", line)       # strip a trailing ".<digits>" version suffix
    if (line in array) {
        print                        # keep the line if the stripped field was seen in fileB
    }
}
Alternatively, here's the one-liner:
awk 'FNR==NR { array[$0]++; next } { line = $4; sub(/\.[0-9]+$/, "", line); if (line in array) print }' fileB.txt fileA.txt
GNU awk can also perform the pre-processing of fileB.txt that you described using cut and sed. If you would like me to build this into the above script, you will need to provide an example of what this line looks like.
UPDATE using files HumanGenCodeV12 and GenBasicV12:
Run like:
awk -f script.awk HumanGenCodeV12 GenBasicV12 > output.txt
Contents of script.awk:
FNR==NR {
gsub(/[^[:alnum:]]/,"",$12)
array[$12]++
next
}
{
line = $4
sub(/\.[0-9]+$/, "", line)
if (line in array) {
print
}
}
This successfully prints lines in GenBasicV12 that can be found in HumanGenCodeV12. The output file (output.txt) contains 65340 lines. The script takes less than 10 seconds to complete.
You're hitting the limit of the basic shell tools. Assuming about 40 characters per line, File A has about 400,000 lines in it and File B has about 1,200,000 lines in it. You're basically running grep for each line in File A and having grep plow through 1,200,000 lines with each execution. That's 480 billion line comparisons you're making. Unix tools are surprisingly quick, but even something fast done 480 billion times adds up.
You would be better off using a full programming scripting language like Perl or Python. You put all lines in File B in a hash. You take each line in File A, check to see if that fourth field matches something in the hash.
Reading in a few hundred thousand lines? Creating a 10,000,000 entry hash? Perl can parse both of those in a matter of minutes.
Something -- off the top of my head. You didn't give us much in the way of specs, so I didn't do any testing:
#! /usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw(say);
# Create your index
open my $file_b, "<", "file_b.txt";
my %index;
while (my $line = <$file_b>) {
chomp $line;
$index{$line} = $line; #Or however you do it...
}
close $file_b;
#
# Now check against file_a.txt
#
open my $file_a, "<", "file_a.txt";
while (my $line = <$file_a>) {
chomp $line;
my @fields = split /\s+/, $line;
if (exists $index{$fields[3]}) {
say "Line: $line";
}
}
close $file_a;
The hash means you only have to read through file_b once instead of 400,000 times. Start the program, go grab a cup of coffee from the office kitchen. (Yum! non-dairy creamer!) By the time you get back to your desk, it'll be done.
grep -f seems to be very slow even for medium sized pattern files (< 1MB). I guess it tries every pattern for each line in the input stream.
A solution which was faster for me was to use a while loop. This assumes that fileA is reasonably small (it is the smaller one in your example), so iterating multiple times over the smaller file is preferable to iterating multiple times over the larger one.
while read -r line; do
grep -F "$line" fileA
done < fileBcutdown > outputfile
Note that this loop will output a line several times if it matches multiple patterns. To work around this limitation use sort -u, but this might be slower by quite a bit. You have to try.
while read -r line; do
grep -F "$line" fileA
done < fileBcutdown | sort -u > outputfile
If you depend on the order of the lines, then I don't think you have any other option than using grep -f. But basically it boils down to trying m*n pattern matches.
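One more hedged suggestion before switching tools: much of grep -f's slowness typically comes from treating every pattern as a regular expression. If the cut-down file B contains plain literal strings, fixed-string matching is often dramatically faster, and it still allows the partial matches the question asks for, since fixed strings are matched as substrings:
grep -F -f fileBcutdown fileA > outputfile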
Use the command below:
awk 'FNR==NR{a[$0];next}($4 in a)' <your filtered fileB with single field> fileA
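One caveat with that exact in lookup, going by the sample lines in the question: the 4th field of file A carries a version suffix (ENST00000449339.1) while the cut-down file B line does not (ENST00000111111), so the suffix has to be stripped first, as the earlier script does. A hedged variant of the one-liner along the same lines:
awk 'FNR==NR{a[$0];next} {k=$4; sub(/\.[0-9]+$/,"",k)} (k in a)' fileBcutdown fileA > outputfile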

GAWK Script - Print filename in BEGIN section

I am writing a gawk script that begins
#!/bin/gawk -f
BEGIN { print FILENAME }
I am calling the file via ./script file1.html but the script just returns nothing. Any ideas?
You can use ARGV[1] instead of FILENAME if you really want to use it in the BEGIN block:
awk 'BEGIN{print ARGV[1]}' file
You can print the file name when you encounter line 1:
FNR == 1
If you want to be less cryptic and easier to understand:
FNR == 1 {print}
UPDATE
My first two solutions were incorrect. Thank you Dennis for pointing it out. His way is correct:
FNR == 1 {print FILENAME}
Straight from the man page (slightly reformatted):
FILENAME: The name of the current input file. If no files are specified on the command line, the value of FILENAME is “-”. However, FILENAME is undefined inside the BEGIN block (unless set by getline).
Building on Hai Vu's answer, I suggest that if you only want the filename printed once per file, it needs to be wrapped in a conditional.
{ if (FNR == 1) { print FILENAME } }
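A quick way to see the behaviour the man page describes (the file name is just the one from the question):
gawk 'BEGIN { print "BEGIN: [" FILENAME "]" } FNR == 1 { print "rule: [" FILENAME "]" }' file1.html
# BEGIN: []            <- FILENAME is empty/undefined inside BEGIN
# rule: [file1.html]   <- set once the first record has been read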
