On a day-to-day basis I need to extract bits of text from logs and other text data in various mixed formats. Is there a utility (like awk, grep, etc.) I could use to quickly perform the task without having to resort to writing long bash/perl/python scripts?
Example 1: For input text below
mylog user=UserName;password=Password;other=information
I would like to extract the user name and password values. The pseudo-utility would preferably look like this (a la awk):
cat input-text.txt | magic --delimit-by=";" --then-by="="
'{print "The username is $values[0][1] and password is $values[1][1]"}'
Where the input string delimited by ; is placed in $values array, and each value in that array is further delimited by = to form a nested array.
Even better, would be nice to have something like this:
cat input-text.txt | magic --map-entry-sep=";" --map-key-val-sep="="
'{print "The username is $[user] and password is $[password]"}'
Where the result of parsing is converted into a map for easy lookup by key.
Example 2: Would be nice to parse triple nested elements too. Consider input text like
mylog mylist=one,two,three;other=information
I would like to now extract the 2nd element of list mylist using something like:
cat input-text.txt | magic --delimit-by=";" --then-by="=" --and-then-by=","
'{print "The second element of mylist is: $values[0][1][1]}'
Of course, I would rather use some kind of JSON parser and convert the input data into its respective object/map/list format for easier extraction, but that's not possible because I am working with data in different formats.
I usually use a combination of awk, grep, cut and sed connected by several pipes, extracting each value (column) of interest one at a time, but that is tedious and requires merging the different columns into one later. Usually, I need all extracted columns in CSV format for further processing in Excel.
Would be grateful for any suggestions or comments.
$ echo 'mylog user=UserName;password=Password;other=information' |
awk -F '[ ;]' -v keysep="=" \
'{
    for (i=1; i<=NF; i++) {
        split($i, t, keysep);
        a[t[1]] = t[2]
    };
    print "The username is " a["user"] " and password is " a["password"]
}'
The username is UserName and password is Password
$ echo 'mylog mylist=one,two,three;other=information' | awk -F "[ =,;]" '{print $4}'
two
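Since the question mentions needing all extracted columns in CSV for Excel, the same map idea can emit a comma-separated row directly; a minimal sketch (the chosen keys and the single-line input are just for illustration):
$ echo 'mylog user=UserName;password=Password;other=information' |
awk -F '[ ;]' -v OFS=',' '{
    for (i = 1; i <= NF; i++) { split($i, t, "="); a[t[1]] = t[2] }
    # print the wanted keys as one CSV row
    print a["user"], a["password"]
}'
UserName,Password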
I need to analyze logs, and my end user has to be able to see them in a formatted way, as shown below. The nature of my logs is that the key variables appear in different positions rather than at fixed columns, because these log formats come from various applications.
"thread":"t1","key1":"value1","key2":"value2",......"key15":"value15"
I have a way to split and cut this to analyze only particular keys, using the following,
cat file.txt | grep 'value1' | cut -d',' -f2,7,8-
This is the command I have so far. The requirement is that I need to grep all logs which have 'key1' as 'value1'; this value1 will most likely be unique among all lines, so I am using grep directly (if required, I can grep on the key and value strings together). The main problem I am facing is the part after cut: I want to pick only key2, key7, and key8 from these lines, but they might not appear in the same column numbers or in that order; key2 might even be at column 3 or 4, or after key7/key8. So I want to pick based on the key name and get exactly
"key2":"value2", "key7":"value7", "key8":"value8"
The end user is not particularly picky about the order in which they appear; they need only these keys from each line to be displayed.
Can someone help me? I tried piping through awk/grep again, but they still match the entire line, not the individual columns.
My input is
{"#timestamp":"2021-08-05T06:38:48.084Z","level":"INFO","thread":"main","logger":"className1","message":"Message 1"}
{"#timestamp":"2021-08-05T06:38:48.092Z","level":"DEBUG","thread":"main","logger":"className2","message":"Message 2"}
{"#timestamp":"2021-08-05T06:38:48.092Z","level":"DEBUG","thread":"thead1","logger":"className2","message":"Message 2"}
Basically, I want my output to find only the "thread":"main" lines and print only the keys and values of "logger" and "message" for each line that matched, since the other keys and values are irrelevant to me. There are more than 15 or 16 keys in my file, and the key positions can be swapped; "message" could appear first and "logger" second in some log files. Of course, these keys are just an example; the real keys I am trying to find are not just "logger" and "message".
There are log analysis tools, but this is a pretty old system, and the logs I am analyzing are not real-time ones; I am displaying files which are years old.
Not sure I really understand your specification but the following awk script could be a starting point:
$ cat foo.awk
BEGIN {
    k["\"key1\""] = 1; k["\"key7\""] = 1; k["\"key8\""] = 1;
}
/"key1":"value1"/ {
    s = "";
    for(i = 1; i <= NF; i+=2)
        if($i in k)
            s = s (s ? "," : "") $i ":" $(i+1);
    print s;
}
$ awk -F',|:' -f foo.awk foo.txt
"key1":"value1","key7":"value7","key8":"value8"
Explanation:
awk is called with the -F',|:' option so that the field separator in each record is either a comma or a colon.
In the BEGIN section we declare an associative array (k) of the selected keys, including the surrounding double quotes.
The rest of the awk script applies to each record containing "key1":"value1".
Variable s is used to prepare the output string; it is initialized to "".
For each odd field (the keys) we check if it is in k. If it is, we concatenate to s:
a comma if s is not empty,
the key field,
a colon,
the following even field (the value).
We print s.
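Adapted to the JSON-like sample input above (filtering on "thread":"main" and keeping "logger" and "message"), the same idea might look like the sketch below; it is only an adaptation, and it assumes the values themselves contain no commas, colons, or braces. The loop checks every field rather than only the odd ones, because the colons inside the timestamp value shift the field positions:
$ awk -F'[,:{}]' '
BEGIN { k["\"logger\""] = 1; k["\"message\""] = 1 }
/"thread":"main"/ {
    s = ""
    for (i = 1; i < NF; i++)
        if ($i in k)
            s = s (s ? "," : "") $i ":" $(i+1)
    print s
}' file.txt
"logger":"className1","message":"Message 1"
"logger":"className2","message":"Message 2"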
I have multiple text files in this format. I would like to extract lines matching this pattern "pass filters and QC".
File1:
Before main variant filters, 309 founders and 0 nonfounders present.
0 variants removed due to missing genotype data (--geno).
9302015 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
7758518 variants and 309 people pass filters and QC.
Calculating allele frequencies... done.
I was able to grep the line, but when I tried to assign it to the line variable, it just doesn't work.
grep 'people pass filters and QC' File1
line="$(echo grep 'people pass filters and QC' File1)"
I am new to shell script and would appreciate if you could help me do this.
I want to create a tab separated file with just
"File1" "7758518 variants" "309 people"
GNU awk
gawk '
BEGIN { patt = "([[:digit:]]+ variants) .* ([[:digit:]]+ people) pass filters and QC" }
match($0, patt, m) {printf "\"%s\" \"%s\" \"%s\"\n", FILENAME, m[1], m[2]}
' File1
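If the output should be tab-separated (as the question asks) and run over several files at once, the printf separators can presumably be switched to \t and more file names appended; a small variation of the above (File2 and summary.tsv are placeholder names):
gawk '
BEGIN { patt = "([[:digit:]]+ variants) .* ([[:digit:]]+ people) pass filters and QC" }
match($0, patt, m) {printf "\"%s\"\t\"%s\"\t\"%s\"\n", FILENAME, m[1], m[2]}
' File1 File2 > summary.tsv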
You are almost there, just remove double quotes and echo from your command:
line=$(grep 'people pass filters and QC' File1)
Now view the value stored in variable:
echo $line
And if your file structure is the same, i.e., it will always be in this form: 7758518 variants and 309 people pass filters and QC, you can use awk to get selected columns from the output. So the complete command would look like this:
OIFS=$IFS;IFS=$'\n';for i in $line;do echo $i;echo '';done | awk -F "[: ]" '{print $1"\t"$2" "$3"\t"$5" "$6}';IFS=$OIFS
Explanation:
IFS means internal field separator, and we are setting it to the newline character, because we need that in the for loop.
But before that, we take a backup of it in another variable, OIFS, so we can restore it later.
We use a for loop to iterate through all the matched lines, and awk to select the 1st, 2nd, 3rd, 5th, and 6th fields as per your requirement.
But please note that if your file structure varies, we may need a different technique to extract the "7758518 variants" and "309 people" parts.
Case example:
$ cat data.txt
foo,bar,moo
I can obtain the field data by using cut, assuming , as the separator, but only if I know which position it is in. Example to obtain the value bar (second field):
$ cat data.txt | cut -d "," -f 2
bar
How can I obtain that same bar (or the field number == 2) if I only know it contains the letter a?
Something like:
$ cat data.txt | reversecut -d "," --string "a"
[results could be both "2" or "bar"]
In other words: how can I know what is the field containing a substring in a text-delimited file using linux shell commands/tools?
Of course, programming is allowed, but do I really need looping and conditional structures? Isn't there a command that solves this?
Case of specific shell, I would prefer Bash solutions.
A close solution here, but not exactly the same.
More scenarios based on the same example (added upon request):
For a search pattern of m or mo, the results could be both 3 or moo.
For a search pattern of f or fo, the results could be both 1 or foo.
The following simple awk may also help you here.
awk -F, '$2~/a/{print $2}' data.txt
Output will be bar in this case.
Explanation:
-F,: Setting the field separator to a comma, so the fields can be identified easily.
$2~/a/: Checking whether the 2nd field contains the letter a; if so, printing that 2nd field.
EDIT: Adding a solution as per the OP's comment and the edited question.
Let's say the following Input_file is given:
cat data.txt
foo,bar,moo
mo,too,far
foo,test,test1
fo,test2,test3
Then the following is the code for it:
awk -F, '{for(i=1;i<=NF;i++){if($i ~ /fo/){print $i}}}' data.txt
foo
foo
fo
OR
awk -F, '{for(i=1;i<=NF;i++){if($i ~ /mo/){print $i}}}' data.txt
moo
mo
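If the field number rather than the value is what you want (the question says either is fine), the same loop idea can print the index instead; a minimal sketch against the original one-line data.txt:
awk -F, '{for(i=1;i<=NF;i++){if($i ~ /a/){print i}}}' data.txt
2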
I have a very large CSV file that is too big to open in Excel for this operation.
I need to replace a specific string in approximately 6000 records out of the 1.5 million in the CSV; the string itself is in comma-separated format, like so:
ABC,FOO.BAR,123456
There are other columns on either side that are of no concern. I only need enough data to make sure the final string (the numbers) is unique.
I have another file with the string to replace and the replacement string like (for the above):
"ABC,FOO.BAR,123456","ABC,FOO.BAR,654321"
So in the case above, 123456 is being replaced by 654321. A simple (yet maddeningly slow) way to do this is to open both documents in Notepad++, find the first string, and replace it with the second, but with over 6000 records this isn't great.
I was hoping someone could give advice on a scripting solution, e.g.:
$file1 = base.csv
$file2 = replace.csv
For each row in $file2 {
awk '{sub(/$file2($firstcolumn)/,$file2($Secondcolumn)' $file1
}
Though I'm not entirely sure how to adapt awk to do an operation like this.
EDIT: Sorry, I should have been more specific: the data in my replacement CSV is only two columns; two raw strings!
It would be easier, of course, if your delimiter were not used within the fields...
You can do it in two steps: create a sed script from the lookup file and use it on the main data file for the replacements.
For example (this assumes there are no escaped quotes in the fields, which may not hold):
$ awk -F'","' '{print "s/" $1 "\"/\"" $2 "/"}' lookup_file > replace.sed
$ sed -f replace.sed data_file
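For the sample lookup line, the generated replace.sed would contain roughly this substitution (shown only to illustrate the two-step idea; whether the surrounding double quotes actually appear in the data file is an assumption here, and the . in FOO.BAR is treated as a regex wildcard, so it still matches the literal dot but could also match other characters):
s/"ABC,FOO.BAR,123456"/"ABC,FOO.BAR,654321"/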
awk -F\" '
NR==FNR { subst[$2]=$4; next }
{
    for (s in subst) {
        pos = index($0, s)
        if (pos) {
            $0 = substr($0, 1, pos-1) subst[s] substr($0, pos + length(s))
            break
        }
    }
    print
}
' "$file2" "$file1" # > "$file1.$$.tmp" && mv "$file1.$$.tmp" "$file1"
The part after the # shows how you could replace the input data file with the output.
The block associated with NR==FNR is only executed for the first input file, the one with the search and replacement strings.
subst[$2]=$4 builds an associative array (dictionary): the key is the search string, the value the replacement string.
Fields $2 and $4 are the search string and the replacement string, respectively, because Awk was instructed to break the input into fields by " (-F\"); note that this assumes that your strings do not contain escaped embedded " chars.
The remaining block then processes the data file:
For each input line, it loops over the search strings and looks for a match on the current line:
Once a match is found, the replacement string is substituted for the search string, and matching stops.
print simply prints the (possibly modified) line.
Note that since you want literal string replacements, regex-based functions such as sub() are explicitly avoided in favor of literal string-processing functions index() and substr().
As an aside: since you say there are columns on either side in the data file, consider making the search/replacement strings more robust by placing , on either side of them (this could be done inside the awk script).
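One possible way to apply that aside, as a sketch: change only the line that builds the dictionary so that both the key and the value carry the surrounding commas (this assumes the string to replace is never the very first or very last field on a data line):
NR==FNR { subst["," $2 ","] = "," $4 ","; next }   # comma-anchored variant; rest of the script unchanged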
I would recommend using a language with a CSV parsing library rather than trying to do this with shell tools. For example, Ruby:
require 'csv'
replacements = CSV.open('replace.csv','r').to_h
File.open('base.csv', 'r').each_line do |line|
replacements.each do |old, new|
line.gsub!(old) { new }
end
puts line
end
Note that Enumerable#to_h requires Ruby v2.1+; replace with this for older Rubys:
replacements = Hash[*CSV.open('replace.csv','r').to_a.flatten]
You only really need CSV for the replacements file; this assumes you can apply the substitutions to the other file as plain text, which speeds things up a bit and avoids having to parse the old/new strings out into fields themselves.
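If the script above is saved as, say, replace_csv.rb (a placeholder name), it could be run like this, redirecting the rewritten data to a new file rather than overwriting base.csv in place:
ruby replace_csv.rb > base_updated.csv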
I have a file containing a lot of string words, separated by pipes. I would like to have a script (written in bash or any other programming language) that is able to replace every word with an incremental unique integer (something like an ID).
From an input like this:
aaa|ccccc|ffffff|iii|j
aaa|ddd|ffffff|iii|j
bb|eeee|hhhhhh|iii|k
I'd like to have something like this
1|3|6|8|9
1|4|6|8|9
2|5|7|8|10
That is: aaa has been replaced by 1, bb has been replaced by 2, and so on.
How to do this? Thanks!
awk to the rescue...
This will do the numbering row-wise; I'm not sure it's important enough to make it columnar.
awk -F "|" -vOFS="|" '{
line=sep="";
for(i=1;i<=NF;i++) {
if(!a[$i])a[$i]=++c;
line=line sep a[$i];
sep=OFS
}
print line
}' words
1|2|3|4|5
1|6|3|4|5
7|8|9|4|10
to get the word associations into another file, you can replace
if(!a[$i])a[$i]=++c;
with
if(!a[$i]){
a[$i]=++c;
print $i"="a[$i] > "assoc"
}
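For the sample input, the assoc file written by that variant would list each word with its ID in order of first appearance, something like:
aaa=1
ccccc=2
ffffff=3
iii=4
j=5
ddd=6
bb=7
eeee=8
hhhhhh=9
k=10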
You can define an associative array:
declare -A array
use the words as keys and an incremental number as the value:
array[aaa]=$n
and then replace the original words with the values.
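Putting those pieces together, a minimal bash sketch of that approach (bash 4+ for associative arrays; input.txt is a placeholder for your pipe-separated file). Like the awk answer above, it numbers the words row-wise in order of first appearance, so the sample input produces the same 1|2|3|4|5 style output:
declare -A ids          # word -> numeric ID
n=0
while IFS='|' read -r -a words; do
    out=()
    for w in "${words[@]}"; do
        # assign a new incremental ID the first time a word is seen
        [[ -z ${ids[$w]+set} ]] && ids[$w]=$((++n))
        out+=("${ids[$w]}")
    done
    (IFS='|'; echo "${out[*]}")   # re-join the IDs with pipes
done < input.txt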