Retrieve different information from several files and bring it together in one (Bash - Linux)

I have a problem with my bash script: I would like to retrieve information contained in several files and gather it in one.
I have a file (file1) in this form which contains about 15,000 lines:
1;1;A0200101C
2;2;A0200101C
3;3;A1160101A
4;4;A1160101A
5;5;A1130304G
6;6;A1110110U
7;7;A1110110U
8;8;A1030002V
9;9;A1030002V
10;10;A2120100C
11;11;A2120100C
12;12;A3410071A
13;13;A3400001A
14;14;A3385000G1
15;15;A3365070G1
I need to retrieve the first field of each row, which corresponds to the id.
My second file (file2) is this; I just need to retrieve the 3rd line:
count
-------
131
(1 row)
I would therefore like to combine the id from file1 with the 3rd line of file2 in order to achieve this result:
1;131
2;131
3;131
4;131
5;131
6;131
7;131
8;131
9;131
11;131
12;131
13;131
14;131
15;131
Thank you.

One possible way:
#!/usr/bin/env bash
# Grab the value on the third line of file2 (the count).
count=$(awk 'NR == 3 { print $1 }' file2)
# For each line of file1, keep only the first ;-separated field (the id)
# and print it together with that count.
while IFS=';' read -r id _; do
    printf "%s;%s\n" "$id" "$count"
done < file1
First, read just the third line of file2 and save that in a variable.
Then read each line of file1 in a loop, extracting the first semicolon-separated field, and print it along with that saved value.
Using the same basic approach in a pure awk script instead of shell will be much faster and more efficient. Such a rewrite is left as an exercise for the reader. (Hint: in awk, FNR == NR is true while reading the first file given and false on any later ones. Alternatively, look up how to pass a shell variable to an awk script; there are Q&As here on SO about it.)
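As a starting point only (a sketch of the FNR == NR hint, not part of the answer above, assuming file2 always carries the count on its third line), the whole job can be done in one awk invocation:
awk -F';' 'NR == FNR { if (FNR == 3) count = $1; next } { print $1 FS count }' file2 file1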

Related

use awk to left outer join two csv files based on multiple columns while keeping the order of the first file's observations

I have two csv files.
File 1
ID,Name,Gender,Salary,DOB
11,Jim,M,200,90
12,David,M,100,89
12,David,M,300,89
13,Lucy,F,150,86
14,Lily,F,200,85
13,Lucy,F,100,86
File 2
DOB,Name,Children
90,Jim,2
88,Michael,4
88,Lily,1
85,Lily,0
What I want to do is to left outer join File 2 into File 1 based on DOB and Name while keeping the order of File 1 observations.
So the output is expected to be
ID,Name,Gender,Salary,DOB,Children
11,Jim,M,200,90,2
12,David,M,100,89,
12,David,M,300,89,
13,Lucy,F,150,86,
14,Lily,F,200,85,0
13,Lucy,F,100,86,
I learned that we need to sort the data if we use the join command, so I was wondering whether I could use awk to do this instead, but I am new to awk. Can anyone help me? By the way, if the data is very big, can I drop the print command in awk and simply use > *.csv to save into a new csv file? I ask because the solutions to related questions on this site often use {print ...}. Thank you.
awk to the rescue!
$ awk -F, 'NR==FNR{a[$1,$2]=$3; next} {print $0 FS a[$NF,$2]}' file2 file1
ID,Name,Gender,Salary,DOB,Children
11,Jim,M,200,90,2
12,David,M,100,89,
12,David,M,300,89,
13,Lucy,F,150,86,
14,Lily,F,200,85,0
13,Lucy,F,100,86,
join requires sorted input and you would need extra embellishments to recover the initial ordering. You can redirect the output to a file by appending > outputfile.csv to the command.
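For example, putting the command above and the redirect together (the output filename is just an example):
awk -F, 'NR==FNR{a[$1,$2]=$3; next} {print $0 FS a[$NF,$2]}' file2 file1 > outputfile.csv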

Awk can only print the whole line; it cannot access specific fields

I am currently working on my capstone project for Unix OS I. I'm very close to finishing, but I'm stuck on this part: basically, my assignment is to create a menu-based application wherein a user can enter a first and last name; I take that data, use it to create a username, translate it from lowercase to uppercase, and finally store the data as firstname:lastname:username.
When asked to display the data I must display it based on the username instead of the first name, and formatted with spaces instead of tabs. For example, it should look like: username firstname lastname. So, I've tried multiple commands, such as sort and awk, but I only seem to be able to access the fields in the file as one big field; e.g. when I do awk '{print NF}' users.txt to find the number of fields per row, it returns 1, clearly showing that my data is being entered as one field instead of the 3 I need. So my question is this: how do I go about changing the number of fields in the text file? Here is my code to add the firstname:lastname:username to the file users.txt:
userInfo=~/capstoneProject/users.txt
# make sure the strings are not empty before writing to disk
if [[ "$fname" != "" && "$lname" != "" ]]
then # write to userInfo (users.txt)
    echo "$fname:$lname:$uname" | tr a-z A-Z >> "$userInfo"
    # change to uppercase using |
fi
Is it because of the way I am entering the data into my file? Using echo "$fname:$lname:$uname" ? Because this is the way my textbook showed me how to do it, and they had no trouble later on when using the sort function with specific fields, as I am trying to do now. If more detail is necessary, please let me know; this is the last thing I need before I can submit my project, due tonight.
Your input file has :-separated fields so you need to tell awk that:
awk -F':' '{print NF}' users.txt
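For the display step the question mentions (username firstname lastname, separated by spaces), a sketch along the same lines, not part of the original answer and assuming users.txt really holds FIRSTNAME:LASTNAME:USERNAME on each line:
awk -F':' '{ print $3, $1, $2 }' users.txt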

pipe an awk object created with awk code

I would like to know if there is a method for creating awk objects inside an awk call. I need to build a key/value map and use it in an awk call. In more detail, I have a map linking some labels with a unique id (e.g. "ID1002", "External compartment"). I would like to use this map to identify a set of unique ids from another table. Here is what I was thinking about:
awk 'BEGIN{map=system(awk '{m[$1]=$2}' first.csv)}{print map[$1]}' second.csv
Obviously this doesn't work, and I was wondering how I can do something like that without building a separate awk script.
The common way this is done in awk is:
$ awk 'NR==FNR{m[$1]=$2;next}{print m[$1]}' first.csv second.csv
Explanation:
NR is a special variable that gets incremented on each record read.
FNR is similar to NR; however, it is reset for each new file read.
next instructs awk to stop executing rules for the current record and get the next record.
With the definitions set you can read the script as:
NR==FNR # Conditional that is only true when reading the first file
{m[$1]=$2;next} # Create a map and move on to the next line
{print m[$1]} # Using next in the first block means this only runs on the second file
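One extra note beyond the original answer: if first.csv and second.csv are genuinely comma-separated (the example label "External compartment" even contains a space, which would break the default whitespace splitting), set the field separator as well:
awk -F',' 'NR==FNR{m[$1]=$2;next}{print m[$1]}' first.csv second.csv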

Split ordered file in Linux

I have a large delimited file (with pipe '|' as the delimiter) which I have managed to sort (using Linux sort) according to the first column (numeric), the second (numeric) and the fourth (string ordering, since it is a timestamp value). The file is like this:
77|141|243848|2014-01-10 20:06:15.722|2.5|1389391203399
77|141|243849|2014-01-10 20:06:18.222|2.695|1389391203399
77|141|243850|2014-01-10 20:06:20.917|3.083|1389391203399
77|171|28563|2014-01-10 07:08:56|2.941|1389344702735
77|171|28564|2014-01-10 07:08:58.941|4.556|1389344702735
77|171|28565|2014-01-10 07:09:03.497|5.671|1389344702735
78|115|28565|2014-01-10 07:09:03.497|5.671|1389344702735
I was wondering if there is an easy way to split this file into multiple text files with an awk, sed, grep or perl one-liner whenever the first or second column value changes. The final result for the example file should be 3 text files like these:
77|141|243848|2014-01-10 20:06:15.722|2.5|1389391203399
77|141|243849|2014-01-10 20:06:18.222|2.695|1389391203399
77|141|243850|2014-01-10 20:06:20.917|3.083|1389391203399

77|171|28563|2014-01-10 07:08:56|2.941|1389344702735
77|171|28564|2014-01-10 07:08:58.941|4.556|1389344702735
77|171|28565|2014-01-10 07:09:03.497|5.671|1389344702735

78|115|28565|2014-01-10 07:09:03.497|5.671|1389344702735
I could do that in Java of course, but I think it would be kind of overkill if it can be done with a script. Also, is it possible for the created filenames to use those two columns' values, something like 77_141.txt for the first file, 77_171.txt for the second file and 78_115.txt for the third one?
awk is very handy for this kind of problem. This can be an approach:
awk -F"|" '{print >> $1"_"$2".txt"}' file
Explanation
-F"|" sets field separator as |.
{print > something} prints the lines into the file something.
$1"_"$2".txt" instead of something, set the output file as $1"_"$2, being $1 the first field based on the | separator. That is, 77, 78... And same for $2, being 141, 171...

Using Awk to process a file where each record has different fixed-width fields

I have some data files from a legacy system that I would like to process using Awk. Each file consists of a list of records. There are several different record types and each record type has a different set of fixed-width fields (there is no field separator character). The first two characters of the record indicate the type, from this you then know which fields should follow. A file might look something like this:
AAField1Field2LongerField3
BBField4Field5Field6VeryVeryLongField7Field8
CCField99
Using Gawk I can set the FIELDWIDTHS, but that applies to the whole file (unless I am missing some way of setting this on a record-by-record basis), or I can set FS to "" and process the file one character at a time, but that's a bit cumbersome.
Is there a good way to extract the fields from such a file using Awk?
Edit: Yes, I could use Perl (or something else). I'm still keen to know whether there is a sensible way of doing it with Awk though.
Hopefully this will lead you in the right direction. Assuming your multi-line records are guaranteed to be terminated by a 'CC' type row, you can pre-process your text file using simple if-then logic. I have presumed you require fields 1, 5 and 7 on one row; a sample awk script would be:
BEGIN {
    field1=""
    field5=""
    field7=""
}
{
    record_type = substr($0,1,2)
    if (record_type == "AA")
    {
        field1=substr($0,3,6)
    }
    else if (record_type == "BB")
    {
        field5=substr($0,9,6)
        field7=substr($0,21,18)
    }
    else if (record_type == "CC")
    {
        print field1"|"field5"|"field7
    }
}
Create an awk script file called program.awk and pop that code into it. Execute the script using:
awk -f program.awk < my_multi_line_file.txt
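For reference (not part of the original answer), run against the sample data in the question this should print a single line built from the field positions assumed above:
Field1|Field5|VeryVeryLongField7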
Maybe you can use two passes:
1step.awk
/^AA/{printf "2 6 6 12" }
/^BB/{printf "2 6 6 6 18 6"}
/^CC/{printf "2 8" }
{printf "\n%s\n", $0}
2step.awk
NR%2 == 1 {FIELDWIDTHS=$0}
NR%2 == 0 {print $2}
And then
awk -f 1step.awk sample | awk -f 2step.awk
You probably need to suppress (or at least ignore) awk's built-in field separation code, and use a program along the lines of:
awk '/^AA/ { manually process record AA out of $0 }
/^BB/ { manually process record BB out of $0 }
/^CC/ { manually process record CC out of $0 }' file ...
The manual processing will be a bit fiddly - I suppose you'll need to use the substr function to extract each field by position, so what I've got as one line per record type will be more like one line per field in each record type, plus the follow-on printing.
I do think you might be better off with Perl and its unpack feature, but awk can handle it too, albeit verbosely.
Could you use Perl and then select an unpack template based on the first two chars of the line?
Better to use some fully featured scripting language like Perl or Ruby.
What about 2 scripts? E.g. the 1st script inserts field separators based on the first characters, then the 2nd processes it?
Or, first of all, define some function in your AWK script which splits the lines into variables based on the input - I would go this way, for the sake of possible re-use.
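To make that last, function-based idea concrete (a sketch only; the record layouts are inferred from the sample data in the question, not from anything stated in these comments), a reusable splitter could look like this:
# splitrec.awk - split each fixed-width record into the array f[] according to its type
function split_record(line,    type) {
    split("", f)                 # clear the global field array
    type = substr(line, 1, 2)
    if (type == "AA") {
        f[1] = substr(line, 3, 6); f[2] = substr(line, 9, 6); f[3] = substr(line, 15)
    } else if (type == "BB") {
        f[1] = substr(line, 3, 6); f[2] = substr(line, 9, 6); f[3] = substr(line, 15, 6)
        f[4] = substr(line, 21, 18); f[5] = substr(line, 39, 6)
    } else if (type == "CC") {
        f[1] = substr(line, 3)
    }
    return type
}
{ rt = split_record($0); print rt, f[1] }    # example use: record type plus its first field
Run it with awk -f splitrec.awk my_multi_line_file.txt.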
