Using Awk to process a file where each record has different fixed-width fields - linux

I have some data files from a legacy system that I would like to process using Awk. Each file consists of a list of records. There are several different record types and each record type has a different set of fixed-width fields (there is no field separator character). The first two characters of the record indicate the type, from this you then know which fields should follow. A file might look something like this:
AAField1Field2LongerField3
BBField4Field5Field6VeryVeryLongField7Field8
CCField99
Using Gawk I can set the FIELDWIDTHS, but that applies to the whole file (unless I am missing some way of setting this on a record-by-record basis), or I can set FS to "" and process the file one character at a time, but that's a bit cumbersome.
Is there a good way to extract the fields from such a file using Awk?
Edit: Yes, I could use Perl (or something else). I'm still keen to know whether there is a sensible way of doing it with Awk though.

Hopefully this will lead you in the right direction. Assuming your multi-line records are guaranteed to be terminated by a 'CC' type row, you can pre-process your text file using simple if-then logic. I have presumed you require fields 1, 5 and 7 on one row; a sample awk script would be:
BEGIN {
field1=""
field5=""
field7=""
}
{
record_type = substr($0,1,2)
if (record_type == "AA")
{
field1=substr($0,3,6)
}
else if (record_type == "BB")
{
field5=substr($0,9,6)
field7=substr($0,21,18)
}
else if (record_type == "CC")
{
print field1"|"field5"|"field7
}
}
Create an awk script file called program.awk and put that code into it. Execute the script using:
awk -f program.awk < my_multi_line_file.txt

You could use two passes (the second one relies on gawk's FIELDWIDTHS):
1step.awk
/^AA/{printf "2 6 6 12" }
/^BB/{printf "2 6 6 6 18 6"}
/^CC/{printf "2 8" }
{printf "\n%s\n", $0}
2step.awk
NR%2 == 1 {FIELDWIDTHS=$0}
NR%2 == 0 {print $2}
And then
awk -f 1step.awk sample | awk -f 2step.awk

You probably need to suppress (or at least ignore) awk's built-in field separation code, and use a program along the lines of:
awk '/^AA/ { manually process record AA out of $0 }
/^BB/ { manually process record BB out of $0 }
/^CC/ { manually process record CC out of $0 }' file ...
The manual processing will be a bit fiddly - I suppose you'll need to use the substr function to extract each field by position, so what I've got as one line per record type will be more like one line per field in each record type, plus the follow-on printing.
I do think you might be better off with Perl and its unpack feature, but awk can handle it too, albeit verbosely.

Could you use Perl and then select an unpack template based on the first two chars of the line?

Better use some fully featured scripting language like perl or ruby.

What about 2 scripts? E.g. 1st script inserts field separators based on the first characters, then the 2nd should process it?
Or first of all define some function in your AWK script, which splits the lines into variables based on the input - I would go this way, for the possible re-usage.
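The reusable-function idea could be sketched roughly as follows. This is only an illustration: the names parse_records and split_fixed are hypothetical, the width lists are the ones from the two-pass answer, and it uses plain substr so it does not depend on gawk's FIELDWIDTHS.

```shell
# Hypothetical wrapper: prints each known record with "|" between its fields.
parse_records() {
  awk '
    # split_fixed (hypothetical helper) cuts line into out[1..n] using a
    # space-separated list of widths and returns the number of fields.
    function split_fixed(line, widths, out,    n, w, i, pos) {
      n = split(widths, w, " ")
      pos = 1
      for (i = 1; i <= n; i++) {
        out[i] = substr(line, pos, w[i])
        pos += w[i]
      }
      return n
    }
    BEGIN {
      # One layout per record type; the first width covers the type itself.
      layout["AA"] = "2 6 6 12"
      layout["BB"] = "2 6 6 6 18 6"
      layout["CC"] = "2 8"
    }
    {
      type = substr($0, 1, 2)
      if (!(type in layout)) next      # skip unknown record types
      split("", f)                     # clear fields left from the previous record
      n = split_fixed($0, layout[type], f)
      joined = f[1]
      for (i = 2; i <= n; i++) joined = joined "|" f[i]
      print joined
    }
  ' "$@"
}
```

Calling parse_records my_multi_line_file.txt would print one delimited line per record; adding a record type then only means adding one layout entry.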

Related

Retrieve different information from several files to bring them together in one. BASH

I have a problem with my bash script, I would like to retrieve information contained in several files and gather them in one.
I have a file in this form which contains about 15000 lines: (file1)
1;1;A0200101C
2;2;A0200101C
3;3;A1160101A
4;4;A1160101A
5;5;A1130304G
6;6;A1110110U
7;7;A1110110U
8;8;A1030002V
9;9;A1030002V
10;10;A2120100C
11;11;A2120100C
12;12;A3410071A
13;13;A3400001A
14;14;A3385000G1
15;15;A3365070G1
I would need to retrieve the first record of each row matching the id.
My second file is this, I just need to retrieve the 3rd row: (file2)
count
-------
131
(1 row)
I would therefore like to be able to assemble the id of (file1) and the 3rd line of (file2) in order to achieve this result:
1;131
2;131
3;131
4;131
5;131
6;131
7;131
8;131
9;131
11;131
12;131
13;131
14;131
15;131
Thank you.
One possible way:
#!/usr/bin/env bash
count=$(awk 'NR == 3 { print $1 }' file2)
while IFS=';' read -r id _; do
printf "%s;%s\n" "$id" "$count"
done < file1
First, read just the third line of file2 and save that in a variable.
Then read each line of file1 in a loop, extracting the first semicolon-separated field, and print it along with that saved value.
Using the same basic approach in a purely awk script instead of shell will be much faster and more efficient. Such a rewrite is left as an exercise for the reader (Hint: In awk, FNR == NR is true when reading the first file given, and false on any later ones. Alternatively, look up how to pass a shell variable to an awk script; there are Q&As here on SO about it.)
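For reference, the awk version hinted at above might look like this sketch. The function name join_ids is made up for illustration, and it assumes the count sits alone on the third line of the first file, possibly padded with blanks.

```shell
# Hypothetical wrapper: $1 = file holding the count on line 3, $2 = id file.
join_ids() {
  awk -F';' '
    # While reading the first file, only remember line 3 (blanks stripped).
    FNR == NR { if (FNR == 3) { count = $1; gsub(/[ \t]/, "", count) } next }
    # For the second file, print the first field and the saved count.
    { print $1 ";" count }
  ' "$1" "$2"
}
```

It would be called as join_ids file2 file1. Note that the FNR == NR test misbehaves if the first file is empty.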

How to use awk for filtering(perl automation)

This is my txt file
type=0
vcpu_count=10
maste=0
h=0
p=0
memory=23.59
num=2
I want to get the vcpu_count and memory values and store them in an array through Perl (an automation script).
awk -F'=' '/vcpu_count/{printf "\n",$1}' .vmConfig.txt
I am using this command just to test on the terminal, but I am getting a blank line. How do I do it? I need to get these two values and check them against a condition.
If you are using Perl anyway, just use Perl for this too.
my %array;
open (my $config, "<", ".vmConfig.txt") or die "$0: Could not open .vmConfig.txt: $!\n";
while (<$config>) {
# Match "key = value", with or without a trailing newline on the last line
next unless /^\s*(vcpu_count|memory)\s*=\s*(.*?)\s*$/;
$array{$1} = $2;
}
close($config);
If you don't want the result to be an associative array (aka hash), refactoring should be relatively easy.
Following awk may help you on same.
Solution 1st:
awk '/vcpu_count/{print;next} /memory/{print}' Input_file
Output will be as follows:
vcpu_count=10
memory=23.59
Solution 2nd:
In case you want to print the values on a single line using printf then following may help you on same:
awk '/vcpu_count/{val=$0;next} /memory/{printf("%s AND %s\n",val,$0)}' Input_file
Output will be as follows:
vcpu_count=10 AND memory=23.59
When you use awk -F'=' '/vcpu_count/{printf "\n",$1}' .vmConfig.txt there are a couple of mistakes. Firstly, printf "\n" will only ever print a newline, as you have found. You need to add a format specifier: something like printf "%s\n", $2 will treat field 2 as a string and insert it into the printed string. Checking out man printf at the command line will explain a bit more.
Secondly, as I changed there, when you used $1 you were using the first field, which is the key in this case (while $0 is the whole line).
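Putting both corrections together, a minimal sketch could look like the following. The function name get_value is hypothetical; it compares the key exactly instead of using a regex match, so that a longer key containing vcpu_count would not match by accident.

```shell
# Hypothetical helper: $1 = key to look up, $2 = config file.
get_value() {
  # -v passes the shell value into awk; $1 is the key, $2 the text after "=".
  awk -F'=' -v key="$1" '$1 == key { printf "%s\n", $2 }' "$2"
}
```

For example, get_value vcpu_count .vmConfig.txt should print the value stored under that key, assuming the file has exactly one key=value pair per line with no spaces around the "=".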
tripleee's solution is probably the most appropriate, but if there is a particular reason to run awk before Perl, the following may help.
As you have done, it splits on =, but then outputs CSV, which you can change as appropriate. Even if the input lines are not always in the same order, it will output the values in a predictable order on a single line.
awk 'BEGIN {
FS="=";
OFS="," # tabs, etc if wanted, delete for spaces.
}
/vcpu_count/ {cpu=$2}
/memory/ {mem=$2}
END { print cpu, mem }' .vmConfig.txt
This gives
10,23.59

Filtering CSV File using AWK

I'm working on CSV file
This my csv file
Command used for filtering awk -F"," '{print $14}' out_file.csv > test1.csv
This is an example of what my data looks like; I have around 43 rows and 12,000 columns.
I planned to separate out a single column using awk, but I am not able to separate column 3 (disease) on its own.
I used the following command to get my output:
awk -F"," '{print $3}' out_file.csv > test1.csv
This is my file:
gender|gene_name |disease |1000g_oct2014|Polyphen |SNAP
male |RB1,GTF2A1L|cancer,diabetes |0.1 |0.46 |0.1
male |NONE,LOC441|diabetes |0.003 |0.52 |0.6
male |TBC1D1 |diabetes |0.940 |1 |0.9
male |BCOR |cancer |0 |0.31 |0.2
male |TP53 |diabetes |0 |0.54 |0.4
Note: I did not use "|" as a delimiter; it is only there to show the columns in order. My data looks exactly like this in the spreadsheet:
But i'm getting the output following way
Disease
GTF2A1L
LOC441
TBC1D1
BCOR
TP53
When I open the file in a spreadsheet I get the results properly, but when I use awk, the comma inside column 2 is also treated as a separator. I don't know why.
Can anyone help me with this?
The root of your problem is - you have comma separated values with embedded commas.
That makes life more difficult. I would suggest the approach is to use a csv parser.
I quite like perl and Text::CSV:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
open ( my $data, '<', 'data_file.csv' ) or die $!;
my $csv = Text::CSV -> new ( { binary => 1, sep_char => ',', eol => "\n" } );
while ( my $row = $csv -> getline ( $data ) ) {
print $row -> [2],"\n";
}
Of course, I can't tell for sure if that actually works, because the data you've linked on your google drive doesn't actually match the question you've asked. (Note: Perl starts arrays at zero, so [2] is actually the 3rd field.)
But it should do the trick - Text::CSV handles quoted comma fields nicely.
Unfortunately the link you provided ("This is my file") points to two files, neither of which (at the time of this writing) seems to correspond with the sample you gave. However, if your file really is a CSV file with commas used both for separating fields and embedded within fields, then the advice given elsewhere to use a CSV-aware tool is very sound. (I would recommend considering a command-line program that can convert CSV to TSV so the entire *nix tool chain remains at your disposal.)
Your sample output and attendant comments suggest you may already have a way to convert it to a pipe-delimited or tab-delimited file. If so, then awk can be used quite effectively. (If you have a choice, then I'd suggest tabs, since then programs such as cut are especially easy to use.)
The general idea, then, is to use awk with "|" (or tab) as the primary separator (awk -F"|" or awk -F\\t), and to use awk's split function to parse the contents of each top-level field.
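As a rough sketch of that idea, assuming the file has already been converted to pipe-delimited form with no padding around the values, and that column 2 holds the comma-separated gene names and column 3 the disease (the function name list_genes is made up for illustration):

```shell
# Hypothetical helper: $1 = pipe-delimited file with a header line.
list_genes() {
  awk -F'|' 'NR > 1 {
    # split() breaks the second top-level field on its embedded commas.
    n = split($2, genes, ",")
    for (i = 1; i <= n; i++) print genes[i] "\t" $3
  }' "$1"
}
```

This prints one tab-separated line per gene, followed by the untouched disease column, and skips the header row.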
In the end, this is what I did to get my answer in a simple way; thanks to @peak, I found the solution.
First, I used csvfilter, a Python module for filtering CSV files.
I changed my delimiter with csvfilter using the following command:
csvfilter input_file.csv --out-delimiter="|" > out_file.csv
This command changes the delimiter from ',' to '|'.
Then I used awk to filter:
awk -F"|" 'FNR == 1 {print; next} $14 < 0.01' out_file.csv > filtered_file.csv
Thanks for your help.

Bash script key/value pair regardless of bash version

I am writing a curl bash script to test webservices. I will have file_1 which would contain the URL paths
/path/to/url/1/{dynamic_path}.xml
/path/to/url/2/list.xml?{query_param}
Since the values in between {} is dynamic, I am creating a separate file, which will have values for these params. the input would be in key-value pair i.e.,
dynamic_path=123
query_param=shipment
By combining two files, the input should become
/path/to/url/1/123.xml
/path/to/url/2/list.xml?shipment
This is the background of my problem. Now my questions
I am doing it in bash script, and the approach I am using is first reading the file with parameters and parse it based on '=' and store it in key/value pair. so it will be easy to replace i.e., for each url I will find the substring between {} and whatever the text it comes with, I will use it as the key to fetch the value from the array
My approach sounds okay (at least to me), but I just realized that
declare -A input_map is only supported in bash 4.0 and later. I am not 100% sure what the target environment for my script will be, since it could run in multiple departments.
Is there anything better you could suggest ? Any other approach ? Any other design ?
P.S:
This is the first time i am working on bash script.
Here's a risky way to do it: Assuming the values are in a file named "values"
. values
eval "$( sed 's/^/echo "/; s/{/${/; s/$/"/' file_1 )"
Basically, stick a dollar sign in front of the braces and transform each line into an echo statement.
More effort, with awk:
awk '
NR==FNR {split($0, a, /=/); v[a[1]]=a[2]; next}
(i=index($0, "{")) && (j=index($0,"}")) {
key=substr($0,i+1, j-i-1)
print substr($0, 1, i-1) v[key] substr($0, j+1)
}
' values file_1
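The awk version above substitutes only the first {…} pair on each line. A variation that walks through every placeholder on a line might look like the sketch below. The function name fill_placeholders is hypothetical, and leaving unknown keys untouched is an assumption about the desired behaviour, not something the question specifies.

```shell
# Hypothetical helper: $1 = key=value file, $2 = file with {key} placeholders.
fill_placeholders() {
  awk '
    # First file: store everything after the first "=" under the key.
    NR == FNR {
      eq = index($0, "=")
      if (eq) v[substr($0, 1, eq - 1)] = substr($0, eq + 1)
      next
    }
    # Second file: consume the line one {key} at a time.
    {
      out = ""; rest = $0
      while ((i = index(rest, "{")) && (j = index(rest, "}")) && j > i) {
        key = substr(rest, i + 1, j - i - 1)
        # Unknown keys are left as they were (an assumption).
        out = out substr(rest, 1, i - 1) ((key in v) ? v[key] : "{" key "}")
        rest = substr(rest, j + 1)
      }
      print out rest
    }
  ' "$1" "$2"
}
```

It would be called as fill_placeholders values file_1 and does not need associative arrays in the shell, so the bash version does not matter.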
There are many ways to do this. You seem to think of putting all inputs in a hashmap, and then iterate over that hashmap. In shell scripting it's more common and practical to process things as a stream using pipelines.
For example, your inputs could be in a csv file:
123,shipment
345,order
Then you could process this file like this:
while IFS=, read path param; do
sed -e "s/{dynamic_path}/$path/" -e "s/{query_param}/$param/" file_1
done < input.csv
The output will be:
/path/to/url/1/123.xml
/path/to/url/2/list.xml?shipment
/path/to/url/1/345.xml
/path/to/url/2/list.xml?order
But this is just an example, there can be so many other ways.
You should definitely start by writing a proof of concept and test it on your deployment server. This example should work in old versions of bash too.

Splitting A File On Delimiter

I have a file on a Linux system that is roughly 10GB. It contains 20,000,000 binary records, but each record is separated by an ASCII delimiter "$". I would like to use the split command or some combination thereof to chunk the file into smaller parts. Ideally I would be able to specify that the command should split every 1,000 records (therefore every 1,000 delimiters) into separate files. Can anyone help with this?
The only unorthodox part of the problem seems to be the record separator. I'm sure this is fixable in awk pretty simply - but I happen to hate awk.
I would transfer it in the realm of 'normal' problems first:
tr '$' '\n' < large_records.txt | split -l 1000
This will by default create xaa, xab, xac... files; look at man split for more options
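Wrapped up as a small function, the pipeline might look like the sketch below. The name split_records and the output prefix are hypothetical, and it only gives correct chunks if the records never contain newline bytes themselves, which binary data may well violate. The -a 4 option asks split for four-letter suffixes so that thousands of output files fit.

```shell
# Hypothetical helper: $1 = input file, $2 = records per chunk, $3 = output prefix.
split_records() {
  # Turn each "$" delimiter into a newline, then let split count lines.
  # "-" tells split to read standard input.
  tr '$' '\n' < "$1" | split -l "$2" -a 4 - "$3"
}
```

For the file in the question this could be run as split_records large_records.txt 1000 chunk_, producing chunk_aaaa, chunk_aaab, and so on, with newlines in place of the original "$" delimiters.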
I love awk :)
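BEGIN { RS="$"; chunk=1; count=0; size=1000 }
{
# Build the name first: an unparenthesized concatenation after ">" is ambiguous
file = "/tmp/chunk" chunk
print $0 > file
if (++count>=size) {
close(file)   # avoid running out of open files over thousands of chunks
chunk++
count=0
}
}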
(note that the redirection operator in awk only truncates/creates the file on its first invocation - subsequent references are treated as append operations - unlike shell redirection)
Be aware that split uses two-character suffixes by default and stops with an error once they are exhausted (676 files, xaa through xzz). With 20,000 output files you need a longer suffix, for example -a 4. More info: https://www.gnu.org/software/coreutils/manual/html_node/split-invocation.html
