How to write single-valued lines from reading a multi-valued delimited file - linux

I have a quick question, and I am sure most of you have an answer to this.
I have a delimited file with the following data:
server1;user1;role
server1;user2;role,role 2
server2;user1;role,role 2,role 3
Please note that the role 'column' is comma-delimited, possibly multi-valued, and may contain names with spaces, unlike the rest of the file, which is semicolon-delimited and single-valued.
I need to show each 'role' on a different line, but still tied to the server and user information. For example:
server1;user1;role
server1;user2;role
server1;user2;role 2
server2;user1;role
server2;user1;role 2
server2;user1;role 3
Instead of having all roles on one server/user line, I need to have one role per line.
Do you have any suggestions for doing this in a Bash script? I tried nested while read combos, and also for loops reading arrays, but so far I have been unable to accomplish it (I know I will probably have to use those constructs, but in a different manner).
This is the Bash script I have been working on:
#!/bin/bash
input="/file/input.csv"
output="/file/output.csv"
declare -a ARRAYROLES
while IFS=';' read -r f1 f2 f3
do
    ARRAYROLES=($f3)
    field1=$f1
    field2=$f2
    for element in "${ARRAYROLES[@]}"
    do
        echo "$field1;$field2;$element" >> "$output"
    done
    field1=''
    field2=''
done < "$input"
And this is the output that I have so far (pretty close but not good enough):
server1;user1;role
server1;user2;role,role
server1;user2;2
server2;user1;role,role
server2;user1;2,role
server2;user1;3
Note that the role 'column' is split on spaces (I am sure that is because of how the for loop reads the array).
Any ideas would be greatly appreciated.
Regards,
Andres.

Change
ARRAYROLES=($f3)
to
IFS=, read -ra ARRAYROLES <<< "$f3"
The unquoted $f3 in ARRAYROLES=($f3) is word-split on the current value of IFS, which inside the loop body is the default whitespace (the IFS=';' prefix applies only to the read command), so "role,role 2" splits at the space rather than at the commas. The read -ra form splits on commas only.

while IFS=';' read -r server user roles; do
    IFS=',' read -r -a arr <<< "$roles"
    printf '%s\n' "${arr[@]/#/$server;$user;}"
done < "$input" > "$output"
From help read:
-a array
assign the words read to sequential indices of the array variable ARRAY,
starting at zero
${arr[@]/#/...} is a parameter expansion that, in this case, eliminates the need for an extra loop.
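For example, here is a minimal sketch of that expansion on a hand-built array (the values mirror the sample data above):

arr=(role "role 2")
server=server1 user=user2
printf '%s\n' "${arr[@]/#/$server;$user;}"
# prints:
# server1;user2;role
# server1;user2;role 2

The # after the first slash anchors the (empty) pattern at the start of each element, so the replacement text is prefixed to every element of the array.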

Related

Best way to identify similar text inside strings?

I've a list of phrases; actually it's an Excel file, but I can extract each single line if needed.
I need to find the lines that are quite similar; for example one line can be:
ANTIBRATING SSPIRING JOINT (type 2) mod. GA160 (temp.max60°)
and some lines later I can have the same line, or this one:
ANTIBRATING SSPIRING JOINT (type 2) mod. GA200 (temp.max60°)
As you can see, these two lines are pretty much the same: not equal in this case, but about 98% similar.
The main problem is that I have to process about 45k lines; for this reason I'm searching for a way to do this in a quick and maybe visual way.
The first thing that came to my mind was to compare the very 1st line to the 2nd, then the 3rd, and so on to the end; then the same with the 2nd line against the 3rd through the last, and so on, and build a kind of score: for example, the 1st line is 100% with line 42, 99% with line 522 ... 21% with line 22142, etc.
But it's only one idea, maybe not the best.
Maybe there's already a good program/script/online service out there; I searched but I can't find one, so in the end I asked here.
Does anyone know a good way (if this is possible), or a script or an online service, to achieve this?
One thing you can do is write a script which does the following:
Extract the data from the csv file.
Define a regex which can capture the similarity; a Python example can be:
[\w\s]+\([\w]+\)[\w\s]+\([\w°]+\)
Or similar; refer to the documentation.
The problem you have is that you are not looking for an exact match, but for a "like".
This is a problem that even databases have never solved, and one that results in a full table scan.
So we're unlikely to solve it.
However, I'd like to propose that you consider alternatives:
You could decide to limit the differences to specific character sets.
In the above example, you were ignoring numbers, but respected letters.
If we can assume that this rule will always hold true, then we can perform a text replace on the string.
ANTIBRATING SSPIRING JOINT (type 2) mod. GA160 (temp.max60°) ==> ANTIBRATING SSPIRING JOINT (type _) mod. GA_ (temp.max_°)
Now, we can deal with this problem by performing an exact string comparison. This can be done by hashing. The easiest way is to feed a hashmap/hashset or a database with a hash index on the column where you will store this adjusted text.
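As a rough sketch of that idea with shell tools (the file name phrases.txt is an assumption, and the rule here is "digits are the only thing allowed to differ"):

sed -E 's/[0-9]+/_/g' phrases.txt | sort | uniq -c | sort -rn
# every run of digits is collapsed to "_", so lines that differ only
# in numbers become identical; counts greater than 1 mark near-duplicates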
You could decide to trade time for space.
For instance, you can feed the strings to a service which will build lots of different kinds of indexes on them. For example, feed Elasticsearch your data, and then perform analytic queries on it.
Fuzzy search is the key.
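If you go that route, a hedged sketch of a fuzzy match query with curl could look like this (the index name phrases and the field name text are assumptions, not something from the question):

curl -s -X GET 'localhost:9200/phrases/_search' \
  -H 'Content-Type: application/json' \
  -d '{"query": {"match": {"text": {"query": "ANTIBRATING SSPIRING JOINT", "fuzziness": "AUTO"}}}}'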
I found several projects and ideas, but the one I used is tre-agrep. I know it is quite old, but in this case it works for me. I created this little script to help me build a list of differences, so I can manually check it against my file:
#!/bin/bash
########## CONFIGURATIONS ##########
original_file=/path/jjj.txt
t_agrep_bin="$(command -v tre-agrep)"
destination_file=/path/destination_file.txt
distance=1
########## CONFIGURATIONS ##########
lines=$(grep -c "" "$original_file")  # total number of lines, for the progress display
if [[ -s "$destination_file" ]]; then
    rm -f "$destination_file"
fi
start=1
while IFS= read -r line; do
    echo "Checking line $start/$lines"
    lista=$("$t_agrep_bin" -"$distance" -B --colour -s -n -i "$line" "$original_file")
    echo "$lista" | awk -F ':' '{print $1}' ORS=' ' >> "$destination_file"  # collect the matching line numbers on one line
    echo >> "$destination_file"
    start=$((start+1))
done < "$original_file"

How to extract the first parameter from a line containing a particular string pattern

I have a file named mail_status.txt. The content of the file is as follows:
1~auth_flag~
2~download_flag~
3~copy_flag~
4~auth_flag~
5~auth_flag~
6~copy_flag~
I want to perform some operation on this file so that at the end I get three variables, with their respective values as follows:
auth_flag_ids="1,4,5"
download_flag_ids="2"
copy_flag_ids="3,6"
I am quite new to this language. Please let me know if more details are required.
Thanks
If you want to generate bash variables based on the file content,
please try the following:
# read the file and extract information line by line
declare -A hash # declare hash as an associative array
while IFS= read -r line; do
    key="${line#*~}"           # convert "1~auth_flag~" to "auth_flag~"
    key="${key%~*}_ids"        # convert "auth_flag~" to "auth_flag_ids"
    hash[$key]+="${line%%~*}," # append the value to the hash
done < "mail_status.txt"
# iterate over the hash to create variables
for r in "${!hash[@]}"; do # r is assigned "auth_flag_ids", "download_flag_ids" and "copy_flag_ids" in turn
    printf -v "$r" "%s" "${hash[$r]%,}" # create a variable named "$r" and assign it the hash value with the trailing comma trimmed off
done
# check the result
printf "%s=\"%s\"\n" "auth_flag_ids" "$auth_flag_ids"
printf "%s=\"%s\"\n" "download_flag_ids" "$download_flag_ids"
printf "%s=\"%s\"\n" "copy_flag_ids" "$copy_flag_ids"
First it reads the lines of the file and extracts the variable name
and the value, line by line. They are stored in the associative array hash.
Next it iterates over the keys of hash to create variables whose names are
"auth_flag_ids", "download_flag_ids" and "copy_flag_ids".
printf -v var assigns the formatted output to a variable named var. This
mechanism is useful for making an indirect reference to a variable.
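A minimal demonstration of printf -v on its own:

name="auth_flag_ids"
printf -v "$name" '%s' "1,4,5"  # assigns "1,4,5" to the variable whose name is stored in $name
echo "$auth_flag_ids"           # prints: 1,4,5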
I'm not going to explain the bash-specific notations
such as ${parameter#word}, ${parameter%%word} or ${!name[@]} in detail.
You can easily find references and well-explained documents, including
the bash man page.
Hope this helps.

Split a single record into multiple records in Unix shell script

I have a record. Example:
EMP_ID|EMP_NAME|AGE|SALARY
123456|XXXXXXXXX|30|10000000
Is there a way I can split the record into multiple records? The example output should look like:
EMP_ID|Attributes
123456|XXXXXXX
123456|30
123456|10000000
I want to split the same record into multiple records. Here the employee ID is my unique column, and I want to loop over the remaining 3 columns and create 3 records: EMP_ID|EMP_NAME, EMP_ID|AGE, EMP_ID|SALARY. I may have more columns as well, but as a sample I have provided 3 columns along with the employee ID.
Any suggestions would be appreciated.
With bash:
record='123456|XXXXXXXXX|30|10000000'
IFS='|' read -ra fields <<<"$record"
for ((i=1; i < "${#fields[@]}"; i++)); do
    printf "%s|%s\n" "${fields[0]}" "${fields[i]}"
done
Output:
123456|XXXXXXXXX
123456|30
123456|10000000
For the whole file:
{
    IFS= read -r header
    while IFS='|' read -ra fields; do
        for ((i=1; i < "${#fields[@]}"; i++)); do
            printf "%s|%s\n" "${fields[0]}" "${fields[i]}"
        done
    done
} < filename
Records with fields separated by a special delimiter character such as | can be manipulated with basic Unix command line tools such as awk. For example, with your input records in the file records.txt:
awk -F\| 'NR>1{for(i=2;i<=NF;i++){print $1"|"$(i)}}' records.txt
I recommend reading an awk tutorial and playing around with it. Related command line tools worth learning include grep, sort, wc, uniq, head, tail, and cut. If you regularly process delimiter-separated files, you will likely need them on a daily basis. As soon as your data format gets more complex (e.g. CSV format with the possibility of using the delimiter character inside field values), you need more specific tools; for instance, see this question on CSV tools, or jq for processing JSON. Still, knowledge of basic Unix command line tools will save you a lot of time.
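For example, a couple of throwaway one-liners on the same records.txt, just to get a feel for these tools:

cut -d'|' -f1,2 records.txt    # keep only the EMP_ID and EMP_NAME columns
tail -n +2 records.txt | sort  # drop the header line, then sort the records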

Iterating through a data file in bash

I'm new to bash and unfamiliar with it. I'm attempting to move through a file with data separated by a space and "," and store the information in a list. The only container bash seems to have is an array. Any help with this is appreciated.
Say I have a file titled sample.txt with usernames, passwords and birthdates, and I want to iterate through it and store the users, passwords and birthdates in separate lists. What would be the easiest way to accomplish this?
sample.txt
user1, password1, 081192
user2, password2, 092578
user3, password3, 020564
Bash version 4 has associative arrays, which is what I think you're looking for.
Be warned, you require a lot of noisy syntax (braces, brackets and quotes) to work with arrays in bash.
IFS+="," # add comma to the list of characters for word splitting
# you now cannot use a comma in your passwords.
declare -A passwords ids # 2 associative arrays
while read -r user password id; do
    passwords["$user"]=$password
    ids["$user"]=$id
done < sample.txt
# now that they are stored, let's print them out:
# iterate over the keys of the ids array
for user in "${!ids[@]}"; do
    printf "%s:%s:%s\n" "$user" "${passwords["$user"]}" "${ids["$user"]}"
done
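For the sample.txt above, this prints something like the following (bash does not guarantee any particular iteration order for associative array keys, so the lines may come out in a different order):

user1:password1:081192
user2:password2:092578
user3:password3:020564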
I'll provide some links to documentation in the bash manual: it is very dense reading, but it is the source of wisdom for bash.
the read command
how the shell splits text into words, using the IFS variable: Word Splitting
and the order of shell expansions is here
bash arrays
You can use pure bashisms for this like so:
# read a csv line by line and fill an array
# called "myArray" with values from somefile.csv
while IFS=$',' read -r -a myArray; do
    echo "${myArray[0]}"
    echo "${myArray[1]}"
    echo "${myArray[2]}"
done < somefile.csv
Example output:
foo
bar
baz
tum
di
dum
Example somefile.csv:
foo,bar,baz
tum,di,dum

Bash script key/value pair regardless of bash version

I am writing a curl bash script to test web services. I will have file_1, which will contain the URL paths:
/path/to/url/1/{dynamic_path}.xml
/path/to/url/2/list.xml?{query_param}
Since the values between {} are dynamic, I am creating a separate file which will have the values for these params. The input will be in key-value pairs, i.e.:
dynamic_path=123
query_param=shipment
By combining the two files, the input should become:
/path/to/url/1/123.xml
/path/to/url/2/list.xml?shipment
This is the background of my problem. Now my questions.
I am doing it in a bash script, and the approach I am using is to first read the file with the parameters, parse it on '=', and store it as key/value pairs, so it will be easy to replace; i.e., for each URL I will find the substring between {}, and whatever text comes with it I will use as the key to fetch the value from the array.
My approach sounds okay (at least to me), BUT I just realized that
declare -A input_map is only supported in bash higher than 4.0. Now, I am not 100% sure what the target environment for my script will be, since it could run in multiple departments.
Is there anything better you could suggest? Any other approach? Any other design?
P.S.:
This is the first time I am working on a bash script.
Here's a risky way to do it, assuming the values are in a file named "values":
. values
eval "$( sed 's/^/echo "/; s/{/${/; s/$/"/' file_1 )"
Basically, stick a dollar sign in front of the braces and transform each line into an echo statement.
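To see why this works, here is (roughly) what the sed generates from the two sample lines of file_1; eval then runs these commands after . values has defined the variables:

echo "/path/to/url/1/${dynamic_path}.xml"
echo "/path/to/url/2/list.xml?${query_param}"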
More effort, with awk:
awk '
    # first file (values): store the key=value pairs in array v
    NR==FNR {split($0, a, /=/); v[a[1]]=a[2]; next}
    # second file (file_1): find the {key} placeholder, if any
    (i=index($0, "{")) && (j=index($0,"}")) {
        key=substr($0, i+1, j-i-1)                       # text between the braces
        print substr($0, 1, i-1) v[key] substr($0, j+1)  # splice in the value
    }
' values file_1
There are many ways to do this. You seem to be thinking of putting all the inputs in a hashmap and then iterating over that hashmap. In shell scripting it's more common and practical to process things as a stream using pipelines.
For example, your inputs could be in a csv file:
123,shipment
345,order
Then you could process this file like this:
while IFS=, read -r path param; do
    sed -e "s/{dynamic_path}/$path/" -e "s/{query_param}/$param/" file_1
done < input.csv
The output will be:
/path/to/url/1/123.xml
/path/to/url/2/list.xml?shipment
/path/to/url/1/345.xml
/path/to/url/2/list.xml?order
But this is just an example, there can be so many other ways.
You should definitely start by writing a proof of concept and testing it on your deployment server. This example should work in old versions of bash too.
