Split single record into Multiple records in Unix shell Script - linux

I have a record, for example:
EMP_ID|EMP_NAME|AGE|SALARY
123456|XXXXXXXXX|30|10000000
Is there a way I can split the record into multiple records? Example output should look like this:
EMP_ID|Attributes
123456|XXXXXXXXX
123456|30
123456|10000000
I want to split the same record into multiple records. Here the employee ID is my unique column, and I want to loop over the remaining 3 columns and create 3 records, like EMP_ID|EMP_NAME, EMP_ID|AGE, EMP_ID|SALARY. I may have more columns as well, but for the sample I have provided 3 columns along with the employee ID.
Please help me with any suggestions.

With bash:
record='123456|XXXXXXXXX|30|10000000'
IFS='|' read -ra fields <<<"$record"
for ((i=1; i < "${#fields[@]}"; i++)); do
    printf "%s|%s\n" "${fields[0]}" "${fields[i]}"
done
123456|XXXXXXXXX
123456|30
123456|10000000
For the whole file:
{
    IFS= read -r header
    while IFS='|' read -ra fields; do
        for ((i=1; i < "${#fields[@]}"; i++)); do
            printf "%s|%s\n" "${fields[0]}" "${fields[i]}"
        done
    done
} < filename

Records made of lines with fields separated by a special delimiter character such as | can be manipulated with basic Unix command-line tools such as awk. For example, with your input records in the file records.txt:
awk -F\| 'NR>1{for(i=2;i<=NF;i++){print $1"|"$(i)}}' records.txt
I recommend reading an awk tutorial and playing around with it. Related command-line tools worth learning include grep, sort, wc, uniq, head, tail, and cut. If you regularly process delimiter-separated files, you will likely need them on a daily basis. As soon as your data format gets more complex (e.g. CSV with the possibility of the delimiter character appearing in field values), you need more specific tools; for instance, see this question on CSV tools, or jq for processing JSON. Still, knowledge of basic Unix command-line tools will save you a lot of time.
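As a rough sketch of the awk approach that also emits the EMP_ID|Attributes header shown in the question: the function name split_records and the temporary sample file are made up for illustration, and the input is assumed to have a single header row like the sample.

```shell
#!/usr/bin/env bash
# split_records is a hypothetical helper name, not part of any tool.
# It prints a new two-column header, then one "id|value" line for
# every remaining field of each data record.
split_records() {
    awk -F'|' '
        NR == 1 { print $1 "|Attributes"; next }        # replace the header row
        { for (i = 2; i <= NF; i++) print $1 "|" $i }   # one line per attribute
    ' "$1"
}

# Sample data taken from the question, written to a temporary file.
sample=$(mktemp)
printf '%s\n' 'EMP_ID|EMP_NAME|AGE|SALARY' '123456|XXXXXXXXX|30|10000000' > "$sample"
split_records "$sample"
rm -f "$sample"
```

Because the loop runs up to NF, extra columns in the real file should be picked up without changing the script.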

Related

Retrieve different information from several files to bring them together in one. BASH

I have a problem with my bash script: I would like to retrieve information contained in several files and gather it in one.
I have a file in this form which contains about 15000 lines: (file1)
1;1;A0200101C
2;2;A0200101C
3;3;A1160101A
4;4;A1160101A
5;5;A1130304G
6;6;A1110110U
7;7;A1110110U
8;8;A1030002V
9;9;A1030002V
10;10;A2120100C
11;11;A2120100C
12;12;A3410071A
13;13;A3400001A
14;14;A3385000G1
15;15;A3365070G1
I need to retrieve the first field of each row, which is the id.
My second file looks like this; I just need to retrieve the 3rd line: (file2)
count
-------
131
(1 row)
I would therefore like to be able to assemble the id of (file1) and the 3rd line of (file2) in order to achieve this result:
1;131
2;131
3;131
4;131
5;131
6;131
7;131
8;131
9;131
11;131
12;131
13;131
14;131
15;131
Thank you.
One possible way:
#!/usr/bin/env bash
count=$(awk 'NR == 3 { print $1 }' file2)
while IFS=';' read -r id _; do
    printf "%s;%s\n" "$id" "$count"
done < file1
First, read just the third line of file2 and save that in a variable.
Then read each line of file1 in a loop, extracting the first semicolon-separated field, and print it along with that saved value.
Using the same basic approach in a purely awk script instead of shell will be much faster and more efficient. Such a rewrite is left as an exercise for the reader (Hint: In awk, FNR == NR is true when reading the first file given, and false on any later ones. Alternatively, look up how to pass a shell variable to an awk script; there are Q&As here on SO about it.)
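One possible take on that exercise, as a sketch only: the function name join_id_count and the temporary files are assumptions for illustration. It relies on the FNR == NR test from the hint, so the count file must be passed first.

```shell
#!/usr/bin/env bash
# join_id_count is a hypothetical helper name.
# Arguments: the count file (file2 in the question), then the id file (file1).
join_id_count() {
    awk '
        NR == FNR { if (FNR == 3) count = $1; next }   # first file: remember line 3
        { split($0, f, ";"); print f[1] ";" count }     # second file: print id;count
    ' "$1" "$2"
}

# Small sample files modelled on the question.
dir=$(mktemp -d)
printf '%s\n' '1;1;A0200101C' '2;2;A0200101C' '3;3;A1160101A' > "$dir/file1"
printf '%s\n' ' count' '-------' '   131' '(1 row)' > "$dir/file2"
join_id_count "$dir/file2" "$dir/file1"
rm -rf "$dir"
```

This reads each file once in a single awk process instead of running printf for every line in a shell loop.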

How to write single-valued lines from reading a multi-valued delimited file

I have a quick question, and I am sure most of you have an answer to this.
I have a delimited file with the following data:
server1;user1;role
server1;user2;role,role 2
server2;user1;role,role 2,role 3
Please note that the role 'column' is comma-delimited and possibly with multi-valued information and names using spaces, different from the rest of the file that is semicolon-delimited and single-valued.
I need to show each 'role' into a different line, but related to the server and user information. For example:
server1;user1;role
server1;user2;role
server1;user2;role 2
server2;user1;role
server2;user1;role 2
server2;user1;role 3
Instead of having all roles on one server/user line, I need one role per line.
Do you have any suggestions for doing this in a Bash script? I tried nested while read combos, and also for loops reading arrays, but so far I have been unable to accomplish it (I know I will probably have to use those constructs, but in a different manner).
This is the Bash script I have been working on:
#!/bin/bash
input="/file/input.csv"
output="/file/output.csv"
declare -a ARRAYROLES
while IFS=';' read -r f1 f2 f3
do
    ARRAYROLES=($f3)
    field1=$f1
    field2=$f2
    for element in "${ARRAYROLES[@]}"
    do
        echo "$field1;$field2;$element" >> "$output"
    done
    field1=''
    field2=''
done < "$input"
And this is the output that I have so far (pretty close but not good enough):
server1;user1;role
server1;user2;role,role
server1;user2;2
server2;user1;role,role
server2;user1;2,role
server2;user1;3
Note that the role 'column' is divided per spaces (I am sure that is because of the for element statement reading the array)
Any idea will be greatly appreciated.
Regards,
Andres.
Change
ARRAYROLES=($f3)
to
IFS=, read -ra ARRAYROLES <<< "$f3"
while IFS=';' read -r server user roles; do
    IFS=',' read -r -a arr <<< "$roles"
    printf '%s\n' "${arr[@]/#/$server;$user;}"
done < "$input" > "$output"
From help read:
-a array
assign the words read to sequential indices of the array variable ARRAY,
starting at zero
${arr[@]/#/...} is a parameter expansion that prepends a string to every array element, which in this case eliminates the need for an extra loop.
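To see that expansion in isolation, here is a small sketch using one line of the sample data; the variable name prefixed is made up for the example.

```shell
#!/usr/bin/env bash
# One roles field from the sample input.
roles='role,role 2,role 3'

# Split on commas only, so "role 2" stays a single element.
IFS=',' read -r -a arr <<< "$roles"

# In ${arr[@]/#/PREFIX} the pattern "#" matches the empty string at the
# start of each element, so PREFIX is prepended to every element.
prefixed=("${arr[@]/#/server2;user1;}")

printf '%s\n' "${prefixed[@]}"
# prints:
# server2;user1;role
# server2;user1;role 2
# server2;user1;role 3
```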

Awk can only print the whole line; cannot access the specific fields

I am currently working on my capstone project for Unix OS I. I'm very close to finishing, but I'm stuck on this part. My assignment is to create a menu-based application in which a user can enter a first and last name. I take that data, use it to create a username, translate everything from lowercase to uppercase, and finally store the data as: firstname:lastname:username.
When asked to display the data, I must sort it by username instead of first name, and format it with spaces instead of tabs. For example, it should look like: username firstname lastname. I've tried multiple commands, such as sort and awk, but I seem to be able to access the fields in the file only as one big field; e.g., when I run awk '{print NF}' users.txt to find the number of fields per row, it returns 1, clearly showing that my data is being read as one field instead of the 3 I need. So my question is this: how do I go about changing the number of fields in the text file? Here is my code to add the firstname:lastname:username to the file users.txt:
userInfo=~/capstoneProject/users.txt
# make sure the strings are not empty before writing to disk
if [[ "$fname" != "" && "$lname" != "" ]]
then
    # convert to uppercase with tr, then append to userInfo (users.txt)
    echo "$fname:$lname:$uname" | tr 'a-z' 'A-Z' >> "$userInfo"
fi
Is it because of the way I am entering the data into my file? Using echo "$fname:$lname:$uname" ? Because this is the way my textbook showed me how to do it, and they had no trouble later on when using the sort function with specific fields, as I am trying to do now. If more detail is necessary, please let me know; this is the last thing I need before I can submit my project, due tonight.
Your input file has :-separated fields so you need to tell awk that:
awk -F':' '{print NF}' users.txt
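The same separator has to be given to sort as well. A minimal sketch of the display step described in the question, assuming the firstname:lastname:username layout; the function name show_users and the sample names are invented for the example:

```shell
#!/usr/bin/env bash
# show_users is a hypothetical helper name.
# Sorts on the third colon-separated field (the username), then prints
# "username firstname lastname" separated by single spaces.
show_users() {
    sort -t ':' -k 3,3 "$1" | awk -F ':' '{ print $3, $1, $2 }'
}

# Made-up sample records in the stored format.
users=$(mktemp)
printf '%s\n' 'JOHN:SMITH:JSMITH' 'ALICE:BROWN:ABROWN' > "$users"
show_users "$users"
rm -f "$users"
```

The comma in awk's print inserts the output field separator, which is a single space by default.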

Filtering CSV File using AWK

I'm working on a CSV file.
This is my CSV file
Command used for filtering: awk -F"," '{print $14}' out_file.csv > test1.csv
This is an example of what my data looks like. I have around 43 rows and 12,000 columns.
I planned to separate out a single column using awk, but I am not able to separate column 3 (disease) on its own.
I use the following command to get my output:
awk -F"," '{print $3}' out_file.csv > test1.csv
This is my file:
gender|gene_name |disease |1000g_oct2014|Polyphen |SNAP
male |RB1,GTF2A1L|cancer,diabetes |0.1 |0.46 |0.1
male |NONE,LOC441|diabetes |0.003 |0.52 |0.6
male |TBC1D1 |diabetes |0.940 |1 |0.9
male |BCOR |cancer |0 |0.31 |0.2
male |TP53 |diabetes |0 |0.54 |0.4
Note: I did not use "|" as a delimiter; it is only there to show the columns in order. My data looks exactly like this in the spreadsheet.
But I am getting the output in the following way:
Disease
GTF2A1L
LOC441
TBC1D1
BCOR
TP53
When I open the file in a spreadsheet, I get the results in the proper manner, but when I use awk, the comma inside column 2 is also taken as a separator. I don't know why.
Can anyone help me with this?
The root of your problem is - you have comma separated values with embedded commas.
That makes life more difficult. I would suggest the approach is to use a csv parser.
I quite like perl and Text::CSV:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
open ( my $data, '<', 'data_file.csv' ) or die $!;
my $csv = Text::CSV -> new ( { binary => 1, sep_char => ',', eol => "\n" } );
while ( my $row = $csv -> getline ( $data ) ) {
    print $row -> [2],"\n";
}
Of course, I can't tell for sure if that actually works, because the data you've linked on your google drive doesn't actually match the question you've asked. (Note: perl starts arrays at zero, so [2] is actually the 3rd field.)
But it should do the trick - Text::CSV handles quoted comma fields nicely.
Unfortunately the link you provided ("This is my file") points to two files, neither of which (at the time of this writing) seems to correspond with the sample you gave. However, if your file really is a CSV file with commas used both for separating fields and embedded within fields, then the advice given elsewhere to use a CSV-aware tool is very sound. (I would recommend considering a command-line program that can convert CSV to TSV so the entire *nix tool chain remains at your disposal.)
Your sample output and attendant comments suggest you may already have a way to convert it to a pipe-delimited or tab-delimited file. If so, then awk can be used quite effectively. (If you have a choice, then I'd suggest tabs, since then programs such as cut are especially easy to use.)
The general idea, then, is to use awk with "|" (or tab) as the primary separator (awk -F"|" or awk -F\\t), and to use awk's split function to parse the contents of each top-level field.
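A minimal sketch of that idea, assuming the file has already been converted to pipe-delimited form and that the disease column is field 3 as in the sample; the function name list_diseases and the sample file are illustrative only.

```shell
#!/usr/bin/env bash
# list_diseases is a hypothetical helper name.
# "|" separates the top-level fields; split() then breaks the third
# field on commas, and each value is printed on its own line.
list_diseases() {
    awk -F'|' '
        NR > 1 {                                  # skip the header row
            n = split($3, parts, ",")
            for (i = 1; i <= n; i++) print parts[i]
        }
    ' "$1"
}

# Sample rows adapted from the question, without the padding spaces.
data=$(mktemp)
printf '%s\n' 'gender|gene_name|disease' \
              'male|RB1,GTF2A1L|cancer,diabetes' \
              'male|TBC1D1|diabetes' > "$data"
list_diseases "$data"
rm -f "$data"
```

The commas inside gene_name are never treated as field separators here, because only "|" is given to -F.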
In the end, this is what I did to get my answer in a simple way. Thanks to @peak, I found the solution.
First, I used
CSV filter, a Python module for filtering CSV files.
I changed my delimiter with csvfilter using the following command:
csvfilter input_file.csv --out-delimiter="|" > out_file.csv
This command changes the delimiter from ',' to '|'.
Then I used awk to filter:
awk -F"|" 'FNR == 1 {print; next} $14 < 0.01' out_file.csv > filtered_file.csv
Thanks for your help.

Bash script key/value pair regardless of bash version

I am writing a curl bash script to test webservices. I will have file_1 which would contain the URL paths
/path/to/url/1/{dynamic_path}.xml
/path/to/url/2/list.xml?{query_param}
Since the values between {} are dynamic, I am creating a separate file that holds the values for these parameters. The input is in key-value pairs, i.e.,
dynamic_path=123
query_param=shipment
By combining two files, the input should become
/path/to/url/1/123.xml
/path/to/url/2/list.xml?shipment
That is the background of my problem. Now my question:
I am doing this in a bash script. My approach is to first read the parameter file, split each line on '=', and store the result as key/value pairs. That makes replacement easy: for each URL I find the substring between {} and use that text as the key to fetch the value from the array.
My approach sounds okay (at least to me), but I just realized that
declare -A input_map is only supported in bash 4.0 and later. I am not 100% sure what the target environment for my script will be, since it could run in multiple departments.
Is there anything better you could suggest? Any other approach? Any other design?
P.S.:
This is the first time I am working on a bash script.
Here's a risky way to do it: Assuming the values are in a file named "values"
. values
eval "$( sed 's/^/echo "/; s/{/${/; s/$/"/' file_1 )"
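Basically, stick a dollar sign in front of the braces and transform each line into an echo statement. It is risky because both sourcing the values file and eval will execute whatever those files contain.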
More effort, with awk:
awk '
    NR==FNR {split($0, a, /=/); v[a[1]]=a[2]; next}
    (i=index($0, "{")) && (j=index($0,"}")) {
        key=substr($0,i+1, j-i-1)
        print substr($0, 1, i-1) v[key] substr($0, j+1)
    }
' values file_1
There are many ways to do this. You seem to think of putting all inputs in a hashmap, and then iterate over that hashmap. In shell scripting it's more common and practical to process things as a stream using pipelines.
For example, your inputs could be in a csv file:
123,shipment
345,order
Then you could process this file like this:
while IFS=, read -r path param; do
    sed -e "s/{dynamic_path}/$path/" -e "s/{query_param}/$param/" file_1
done < input.csv
The output will be:
/path/to/url/1/123.xml
/path/to/url/2/list.xml?shipment
/path/to/url/1/345.xml
/path/to/url/2/list.xml?order
But this is just one example; there are many other ways.
You should definitely start by writing a proof of concept and test it on your deployment server. This example should work in old versions of bash too.
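For the original key=value layout, here is a sketch that avoids declare -A entirely by re-reading the values file for each URL. The function name fill_templates and the temporary files are assumptions for illustration; as far as I know it uses only features that predate bash 4, but it is worth confirming on the oldest bash you need to support.

```shell
#!/usr/bin/env bash
# fill_templates is a hypothetical helper name.
# Arguments: a file of key=value lines, then a file of URL templates.
# Every {key} placeholder in each template is replaced by its value.
fill_templates() {
    local values_file=$1 template_file=$2
    local url key value placeholder
    while IFS= read -r url; do
        while IFS='=' read -r key value; do
            [ -n "$key" ] || continue            # skip blank lines
            placeholder="{$key}"
            # Quoting the pattern makes the braces match literally.
            url=${url//"$placeholder"/"$value"}
        done < "$values_file"
        printf '%s\n' "$url"
    done < "$template_file"
}

# Sample files from the question.
dir=$(mktemp -d)
printf '%s\n' 'dynamic_path=123' 'query_param=shipment' > "$dir/values"
printf '%s\n' '/path/to/url/1/{dynamic_path}.xml' \
              '/path/to/url/2/list.xml?{query_param}' > "$dir/file_1"
fill_templates "$dir/values" "$dir/file_1"
rm -rf "$dir"
```

Re-reading the values file for every template line is slower than a hash lookup, but it keeps the script to plain loops and string substitution, and nothing from either file is executed as code.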

Resources