bash extract segments of a string and store in variables

I want to convert the output from cppclean into cppcheck-like xml sections, such that:
./bit_limits.cpp:25: static data 'bit_limits::max_name_length'
becomes:
<error id="static data" msg="bit_limits::max_name_length">
<location file="./bit_limits.cpp" line="25"/>
</error>
I started with some awk:
test code:
echo "./bit_limits.cpp:25: static data 'bit_limits::max_name_length'" > test.out
cat test.out | awk -F ":" '{print "<error id=\""$3"\""}
{print "msg=\""}{for(i=4;i<=NF;++i)print ":"$i}{print "\">"}
{print "<location file=\""$1"\" line=\""$2"\"/>"}
{print "</error>"}'
Note: to run this you need to put the cat command back into one line - I printed it over multi-lines for ease of reading.
Explanation:
I am using awk and delimiting by colon ":" - which splits the line into useful chunks which I try to construct into the XML:
{print "<error id=\""$3"\""} - Extract the error ID part
{print "msg=\""}{for(i=4;i<=NF;++i)print ":"$i}{print "\">"} - Extract the message: all the remaining sections, with the colons that were consumed as delimiters put back
{print "<location file=\""$1"\" line=\""$2"\"/>"} - extract the file and line, this part is easy since the colons line up nicely
{print "</error>"} - finally print the end tag
This is close, but not quite right, it produces:
<error id=" static data 'bit_limits"
msg="
:
:max_name_length'
">
<location file="./bit_limits.cpp" line="25"/>
</error>
The id field should be just "static data" and the msg field should be "'bit_limits::max_name_length'", but other than that it is OK. (I don't mind the output being split over multiple lines at the moment, though I would prefer that awk did not print a newline each time.)
Update
As @charlesduffy pointed out - for context - I want to do this in bash because I want to embed this code into a makefile (or just a normal bash script) for maximum portability (i.e. no need for python or other tools).

With bash and a regex:
x="./bit_limits.cpp:25: static data 'bit_limits::max_name_length'"
[[ $x =~ (.+):([0-9]+):\ (.+)\ \'(.+)\' ]]
declare -p BASH_REMATCH
Output:
declare -ar BASH_REMATCH='([0]="./bit_limits.cpp:25: static data '\''bit_limits::max_name_length'\''" [1]="./bit_limits.cpp" [2]="25" [3]="static data" [4]="bit_limits::max_name_length")'
The elements 1 to 4 in array BASH_REMATCH contain the searched strings.
From man bash:
BASH_REMATCH: An array variable whose members are assigned by the =~ binary operator to the [[ conditional command. The element with index 0 is the portion of the string matching the entire regular expression. The element with index n is the portion of the string matching the nth parenthesized subexpression. This variable is read-only.
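Putting the captured groups to work, a minimal sketch of the whole conversion might look like this. The function name cppclean_to_xml is made up for illustration, and the sketch assumes every input line has the shape file:line: id 'msg'. Lines that do not match are silently skipped, and no XML escaping is done.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert cppclean lines on stdin to cppcheck-like XML.
cppclean_to_xml() {
    local line
    # Keep the regex in a variable so the quoting stays readable.
    local re="^(.+):([0-9]+): (.+) '(.+)'\$"
    while IFS= read -r line; do
        if [[ $line =~ $re ]]; then
            printf '<error id="%s" msg="%s">\n' "${BASH_REMATCH[3]}" "${BASH_REMATCH[4]}"
            printf '<location file="%s" line="%s"/>\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
            printf '</error>\n'
        fi
    done
}

cppclean_to_xml <<< "./bit_limits.cpp:25: static data 'bit_limits::max_name_length'"
```

In a real script the function would more likely be fed from a pipe, e.g. the cppclean output.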

Probably more complex than it needs to be:
awk '{
    id = ""; msg = ""                      # reset per input line
    split($1, file_line, ":")              # $1 is "file:line:"
    field = 2
    # Words before the quoted part form the id.
    while (field <= NF && substr($field, 1, 1) != "'\''") {
        id = id " " $field
        ++field
    }
    id = substr(id, 2)                     # drop the leading space
    while (field <= NF) {
        msg = msg " " $field
        ++field
    }
    msg = substr(msg, 3, length(msg) - 3)  # drop leading space and both quotes
    printf("<error id=\"%s\" msg=\"%s\">\n", id, msg)
    printf(" <location file=\"%s\" line=\"%s\"/>\n", file_line[1], file_line[2])
    print "</error>"
}' test.out

Related

BASH - Extract Data from String

I have a log that returns thousands of lines of data, I want to extract a few values from that.
In the log there is only one line containing the unique unit reference so I can grep for that using:
grep "unit=Central-C152" logfile.txt
That produces a line of output similar to the following:
a3cd23e,85d58f5,53f534abef7e7,unit=Central-C152,locale=32325687-8595-9856-1236-12546975,11="School",1="Mr Green",2="Qual",3="SWE",8="report",5="channel",7="reset",6="velum"
The format of the line may change in that the order of the values won't always be in the same position.
I'm trying to work out how to get the value of 2 and 7 in to separate variables.
I had thought about cut on , or = but as the values aren't in a set order I couldn't work out the best way to do it.
I'm trying to get:
var state=value of 2 without quotes
var mode=value of 7 without quotes
Can anyone advise on the best way to do this ?
Thanks

Try the following to assign the values to variables:
state=$(awk '/unit=Central-C152/ && match($0,/2=\"[^"]*/){print substr($0,RSTART+3,RLENGTH-3)}' Input_file)
mode=$(awk '/unit=Central-C152/ && match($0,/7=\"[^"]*/){print substr($0,RSTART+3,RLENGTH-3)}' Input_file)
You can then print them as follows:
echo "$state"
echo "$mode"
Explanation:
awk '                                             ## Start of the awk program.
/unit=Central-C152/ && match($0,/2=\"[^"]*/){     ## If the line contains unit=Central-C152 and match() finds 2=" followed by everything up to the next ",
  print substr($0,RSTART+3,RLENGTH-3)             ## print the matched text minus its first 3 characters (2="), i.e. the bare value.
}
' Input_file                                      ## The input file name.

You are probably better off doing all of the processing in Awk.
awk -F, '/unit=Central-C152/ {
for(i=1;i<=NF;++i)
if($i ~ /^[27]="/) {
b[++k] = $i
sub(/^[27]="/, "", b[k])
sub(/"$/, "", b[k])
gsub(/\\/, "", b[k])
}
print "state " b[1] ", mode " b[2]
}' logfile.txt
This presupposes that the fields always occur in the same order (2 before 7). Maybe you need to change or disable the gsub to remove backslashes in the values.
If you want to do more than print the values, refactoring whatever Bash code you have into Awk is often a better approach than doing this processing in Bash.
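If the order of the fields cannot be relied upon, one option is to store each value under its key and look the keys up afterwards. The following is only a sketch: the wrapper name extract_state_mode is made up, and it assumes that values contain no commas and that only one line matches the unit (the array v is not cleared between lines).

```bash
# Hypothetical wrapper: reads the named files, or stdin if none are given.
extract_state_mode() {
    awk -F, '/unit=Central-C152/ {
        for (i = 1; i <= NF; ++i) {
            eq = index($i, "=")          # split KEY="VALUE" at the first = only
            if (eq == 0) continue        # items without = (the leading ids)
            key = substr($i, 1, eq - 1)
            val = substr($i, eq + 1)
            gsub(/^"|"$/, "", val)       # drop the enclosing quotes
            v[key] = val
        }
        print "state " v["2"] ", mode " v["7"]
    }' "$@"
}

# Field 7 deliberately comes before field 2 here.
printf '%s\n' 'a3cd23e,unit=Central-C152,7="reset",1="Mr Green",2="Qual"' | extract_state_mode
```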

Assuming you already have the line in a variable such as with:
line="$(grep 'unit=Central-C152' logfile.txt | head -1)"
You can then simply use the built-in parameter substitution features of bash:
f2=${line#*2=\"} ; f2=${f2%%\"*} ; echo "${f2}"
f7=${line#*7=\"} ; f7=${f7%%\"*} ; echo "${f7}"
The first command on each line strips everything up to and including <field-number>=". The second then strips everything from the next quote onwards. The third, of course, simply echoes the value.
When I run those commands against your input line, I see:
Qual
reset
which is, from what I can see, what you were after.
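One caveat: a pattern such as *2=" also matches inside a longer key like 12="..." if that happens to appear earlier in the line. A hedged variant anchors on the preceding comma. Here get_field is a hypothetical helper name, and the sketch assumes that values contain no embedded double quotes and that the key is never the very first item on the line.

```bash
# Hypothetical helper: print the unquoted value of key $2 found in line $1.
get_field() {
    local line=$1 key=$2 value
    # Bail out if the key is absent; otherwise the expansions would return junk.
    [[ $line == *",$key=\""* ]] || return 1
    value=${line#*,"$key"=\"}   # strip everything up to and including ,KEY="
    value=${value%%\"*}         # strip from the closing quote onwards
    printf '%s\n' "$value"
}

line='a3cd23e,85d58f5,53f534abef7e7,unit=Central-C152,locale=32325687-8595-9856-1236-12546975,11="School",1="Mr Green",2="Qual",3="SWE",8="report",5="channel",7="reset",6="velum"'
state=$(get_field "$line" 2)
mode=$(get_field "$line" 7)
echo "state=$state mode=$mode"
```

With the sample line, key 1 yields Mr Green rather than School, because ,1=" cannot match inside ,11=".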

rearranging column based on condition

I have a *.csv file. with value as below
"ASDP02","8801942183589"
"ASDP06","8801939151023"
"CSDP04","8801963981740"
"ASDP09","8801946305047"
"ASDP12","8801941195677"
"ASDP05","8801922826186"
"CSDP08","8801983008938"
"ASDP04","8801944346555"
"CSDP11","8801910831518"
or sometimes the value is as below
"8801989353984","KSDP05"
"8801957608165","ASDP11"
"8801991455848","CSDP10"
"8801981363116","CSDP07"
"8801921247870","KSDP07"
"8801965386240","CSDP06"
"8801956293036","KSDP10"
"8801984383904","KSDP11"
"8801944211742","ASDP09"
I just want to put the numeric value (e.g. 8801989353984) always in 1st column. Is it possible using BASH script?

Sed is also your friend here
Input
cat 41189347
"ASDP02","8801942183589"
"ASDP06","8801939151023"
"CSDP04","8801963981740"
"ASDP09","8801946305047"
"ASDP12","8801941195677"
"ASDP05","8801922826186"
"CSDP08","8801983008938"
"ASDP04","8801944346555"
"CSDP11","8801910831518"
Script
sed -E 's/^("[[:alpha:]]+.*"),("[[:digit:]]+")$/\2,\1/' 41189347
Output
"8801942183589","ASDP02"
"8801939151023","ASDP06"
"8801963981740","CSDP04"
"8801946305047","ASDP09"
"8801941195677","ASDP12"
"8801922826186","ASDP05"
"8801983008938","CSDP08"
"8801944346555","ASDP04"
"8801910831518","CSDP11"

awk to the rescue!
$ awk -F, -v OFS=, '$1~/[A-Z]/{t=$2;$2=$1;$1=t}1' file
if first field has alpha chars, swap first and second columns and print.

Bash can do the work, but awk might be a better choice for rearranging your file:
sample.csv:
"ASDP02","8801942183589"
"8801944211742","ASDP09"
command:
awk -F, 'BEGIN{OFS=","}{$1=$1;if(substr($1, 2, length($1) - 2) + 0 == substr($1, 2, length($1) - 2)){print $1,$2}else{print $2,$1}}' sample.csv
substr($1, 2, length($1) - 2) + 0 == substr($1, 2, length($1) - 2) checks whether the first column (without its surrounding quotes) is numeric. If it is, the line is printed as is; otherwise column 1 and column 2 are swapped.
Output:
"8801942183589","ASDP02"
"8801944211742","ASDP09"

You can create a pure bash script to generate other file which has the structure you need:
#!/bin/bash
csv_file="/path/to/your/csvfile"
output_file="/path/to/output_file"
#Optional
rm -rf "${output_file}"
readarray -t LINES < <(cat < "${csv_file}" 2> /dev/null)
for item in "${LINES[@]}"; do
if [[ $item =~ ^\"([0-9A-Z]+)\"\,\"([0-9]+)\" ]]; then
echo "\"${BASH_REMATCH[2]}\",\"${BASH_REMATCH[1]}\"" >> "${output_file}"
else
echo "$item" >> "${output_file}"
fi
done
This works even if your file is "mixed", that is, with some lines already in the right format and others in the wrong one.
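A streaming variation on the same idea avoids loading the whole file into an array. This is a sketch under the assumption that every line has exactly two quoted columns and no embedded commas; numeric_first is a made-up function name.

```bash
# Hypothetical helper: read two-column CSV on stdin and write it with the
# numeric column first.
numeric_first() {
    local first second
    local re='^"[0-9]+"$'
    while IFS=, read -r first second; do
        [[ -z $first && -z $second ]] && continue   # skip empty lines
        if [[ $first =~ $re ]]; then
            printf '%s,%s\n' "$first" "$second"      # already in order
        else
            printf '%s,%s\n' "$second" "$first"      # swap the columns
        fi
    done
}

numeric_first <<'EOF'
"ASDP02","8801942183589"
"8801944211742","ASDP09"
EOF
```

Redirecting a file into the function (numeric_first < input > output) would process a whole CSV in one pass.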

The following commands assume that the cells in the CSV files do not contain newlines and commas. Otherwise, you should write a more complicated script in Perl, PHP, or other programming language capable of parsing CSV files properly. But Bash, definitely, is not appropriate for this task.
Perl
perl -F, -nle '@F = reverse @F unless $F[0] =~ /^"\d+"$/;
print join(",", @F)' file
Beware: if the cells contain newlines or commas, use Perl's Text::CSV module, for instance. Although that is a simple task in Perl, it goes beyond the scope of the current question.
The command splits each input line on commas (-F,) and stores the result in the @F array. The items in the array are reversed unless the first field $F[0] already matches the regular expression, that is, unless it is already a quoted number. You can also swap the items this way: ($F[0], $F[1]) = ($F[1], $F[0]).
Finally, it joins the array items with commas and prints the result to standard output.
If you want to edit the file in-place, use -i option: perl -i.backup -F, ....
AWK
awk -F, -vOFS=, '/^"[0-9]+",/ {print; next}
{ t = $1; $1 = $2; $2 = t; print }' file
The input and output field separators are set to , with -F, and -vOFS=,.
If the line matches the pattern /^"[0-9]+",/ (the line begins with a "numeric" CSV column), the script prints the record and advances to the next record. Otherwise the next block is executed.
In the next block, it swaps the first two columns and prints the result to the standard output.
If you want to edit the file in-place, see answers to this question.

How to pass quoted arguments but with blank spaces in linux

I have a file with these arguments and their values this way
# parameters.txt
VAR1 001
VAR2 aaa
VAR3 'Hello World'
and another file to configure like this
# example.conf
VAR1 = 020
VAR2 = kab
VAR3 = ''
when I want to get the values in a function I use this command
while read p; do
VALUE=$(echo $p | awk '{print $2}')
done < parameters.txt
The first two parameters give the right values, but for the last one I only get 'Hello because of the space. How do I get the entire 'Hello World' value?

If you can use bash, there is no need to use awk: read and shell parameter expansion can be combined to solve your problem:
while read -r name rest; do
# Drop the '= ' part, if present.
[[ $rest == '= '* ]] && value=${rest:2} || value=$rest
# $value now contains the line's value,
# but *including* any enclosing ' chars, if any.
# Assuming that there are no *embedded* ' chars., you can remove them
# as follows:
value=${value//\'/}
done < parameters.txt
Like awk, read by default breaks a line into fields by whitespace, but unlike awk it can assign the remainder of the line to a variable, namely the last one, if fewer variables are specified than fields are found;
read's -r option is generally worth specifying to avoid unexpected interpretation of \ chars. in the input.
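To see the "remainder goes to the last variable" behaviour in isolation, a small sketch (show_fields is a name invented for this illustration):

```bash
# Hypothetical helper: show how read splits each stdin line into
# the first field and the rest of the line.
show_fields() {
    local name rest
    while read -r name rest; do
        printf 'name=%s rest=%s\n' "$name" "$rest"
    done
}

show_fields <<'EOF'
VAR1 001
VAR3 'Hello World'
EOF
```

The second line is reported as name=VAR3 rest='Hello World': the space inside the value survives because everything after the first field lands in rest.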
As for your solution attempt:
awk doesn't know about quoting in input - by default it breaks input into fields by whitespace, irrespective of quotation marks.
Thus, a string such as 'Hello World' is simply broken into fields 'Hello and World'.
However, in your case you can split each input line into its key and value using a carefully crafted FS value (FS is the input field separator, which can also be set via option -F; the command again assumes bash, this time for use of <(...), a so-called process substitution, and $'...', an ANSI C-quoted string):
while IFS= read -r value; do
# Work with $value...
done < <(awk -F$'^[[:alnum:]]+ (= )?\'?|\'' '{ print $2 }' parameters.txt)
Again the assumption is that values contain no embedded ' instances.
Field separator regex $'^[[:alnum:]]+ (= )?\'?|\'' splits each line so that $2, the 2nd field, contains the value, stripped of enclosing ' chars., if any.
xargs is the rare exception among the standard utilities in that it does understand single- and double-quoted strings (yet also without support for embedded quotes).
Thus, you could take advantage of xargs' ability to implicitly strip enclosing quotes when it passes arguments to the specified command, which defaults to echo (again assumes bash):
while read -r name rest; do
# Drop the '= ' part, if present.
[[ $rest == '= '* ]] && value=${rest:2} || value=$rest
# $value now contains the line's value, stripped of any enclosing
# single quotes by `xargs`.
done < <(xargs -L1 < parameters.txt)
xargs -L1 processes one (1) line (-L) at a time and implicitly invokes echo with all tokens found on each line, with any enclosing quotes removed from the individual tokens.
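The quote-stripping effect of xargs can be checked on its own with a short sketch; strip_quotes is a hypothetical wrapper name and the input lines are made up to mirror parameters.txt.

```bash
# Hypothetical wrapper: let xargs re-echo each stdin line, token by token,
# which removes the quotes that enclose individual tokens.
strip_quotes() {
    xargs -L1
}

printf '%s\n' "VAR3 'Hello World'" 'VAR2 "a b"' | strip_quotes
```

Note that after this step the original token boundaries are gone (the output line is simply VAR3 Hello World), which is why the loop above relies on read to separate the name from the rest.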

The default field separator in awk is whitespace, so $2 holds only the first word of the value.
You can specify the field separator on the command line with -F[field separator]
Example, setting the field separator to a comma:
$ echo "Hello World" | awk -F, '{print $1}'
Hello World

Shell Extract Text Before Digits in a String

I've found several examples of extractions before a single character and examples of extracting numbers, but I haven't found anything about extracting characters before numbers.
My question:
Some of the strings I have look like this:
NUC320 Syllabus Template - 8wk
SLA School Template - UL
CJ101 Syllabus Template - 8wk
TECH201 Syllabus Template - 8wk
Test Clone ID17
In cases where the string doesn't contain the data I want, I need it to be skipped. The desired output would be:
NUC-320
CJ-101
TECH-201
SLA School Template - UL & Test Clone ID17 would be skipped.
I imagine the process being something to the effect of:
Extract text before " "
Condition - Check for digits in the string
Extract text before digits and assign it to a variable x
Extract digits and assign to a variable y
Concatenate $x"-"$y and assign to another variable z
More information:
The strings are extracted from a line in a couple thousand text docs using a loop. They will be used to append to a hyperlink and rename a file during the loop.
Edit:
#!/bin/sh
# my files are named 1.txt through 9999.txt i both
# increments the loop and sets the filename to be searched
i=1
while [ $i -lt 10000 ]
do
x=$(head -n 31 $i.txt | tail -1 | cut -c 7-)
if [ ! -z "$x" -a "$x" != " " ]; then
# I'd like to insert the hyperlink with the output on the
# same line (1.txt;cj101 Syllabus Template - 8wk;www.link.com/cj101)
echo "$i.txt;$x" >> syllabus.txt
# else
# rm $i.txt
fi
i=`expr $i + 1`
sleep .1
done

sed for printing lines starting with capital letters followed by digits. It also adds a - between them:
sed -n 's/^\([A-Z]\+\)\([0-9]\+\) .*/\1-\2/p' input
Gives:
NUC-320
CJ-101
TECH-201

A POSIX-compliant awk solution:
awk '{ if (match($1, /[0-9]+$/)) print substr($1, 1, RSTART-1) "-" substr($1, RSTART) }' \
file |
while IFS= read -r token; do
# Process token here (append to hyperlink, ...)
echo "[$token]"
done
awk is used to extract the reformatted tokens of interest, which are then processed in a shell while loop.
match($1, /[0-9]+$/) matches the 1st whitespace-separated field ($1) against extended regex [0-9]+$, i.e., matches only if the field ends in one or more digits.
substr($1, 1, RSTART-1) "-" substr($1, RSTART) joins the part before the first digit with the run of digits using -, via the special RSTART variable, which indicates the 1-based character position where the most recent match() invocation matched.
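If the loop may run under bash rather than plain sh, the same extraction can be sketched without awk, storing the result in a variable. The function name course_code is invented here, and the sketch assumes the code is always the first space-separated word and consists of letters followed by digits.

```bash
# Hypothetical helper (bash only): print LETTERS-DIGITS for a title whose
# first word looks like NUC320; return non-zero for anything else.
course_code() {
    local first=${1%% *}               # text before the first space
    local re='^([A-Za-z]+)([0-9]+)$'
    [[ $first =~ $re ]] || return 1
    printf '%s-%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
}

while IFS= read -r title; do
    if z=$(course_code "$title"); then
        echo "$z"
    fi
done <<'EOF'
NUC320 Syllabus Template - 8wk
SLA School Template - UL
CJ101 Syllabus Template - 8wk
TECH201 Syllabus Template - 8wk
Test Clone ID17
EOF
```

Inside the original loop, z could then be appended to the hyperlink or used for the new file name.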

awk '$1 ~/[0-9]/{sub(/...$/,"-&",$1);print $1}' file
NUC-320
CJ-101
TECH-201

Parse columns with awk

I am new at AWK programming and I was wondering how to filter the following text:
Goedel - Declarative language for AI, based on many-sorted logic. Strongly
typed, polymorphic, declarative, with a module system. Supports bignums
and sets. "The Goedel Programming Language", P. M. Hill et al, MIT Press
1994, ISBN 0-262-08229-2. Goedel 1.4 - partial implementation in SICStus
Prolog 2.1.
ftp://ftp.cs.bris.ac.uk/goedel
info: goedel@compsci.bristol.ac.uk
Just to print this:
Goedel
I have used the following sentence but it just does not work as I wished:
awk -F " - " "/ - /{ print $1 }"
It shows the following:
Goedel
1994, ISBN 0-262-08229-2. Goedel 1.4
Could somebody tell me what I have to modify so I can get what I want?
Thanks in advance

awk 'BEGIN { RS = "" } { print $1 }' your_file.txt
which means: split the input into paragraphs at empty lines, split each paragraph into words using the default separator (whitespace), and print the first word ($1) of every paragraph.
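A quick way to try paragraph mode is to feed it two entries separated by an empty line. In this sketch the wrapper name first_words and the second entry ("Foo") are made up purely for illustration.

```bash
# Hypothetical wrapper: print the first word of each blank-line-separated
# paragraph from the named files, or from stdin if none are given.
first_words() {
    awk 'BEGIN { RS = "" } { print $1 }' "$@"
}

first_words <<'EOF'
Goedel - Declarative language for AI, based on many-sorted logic. Strongly
typed, polymorphic, declarative, with a module system.
ftp://ftp.cs.bris.ac.uk/goedel

Foo - A made-up second entry.
more text on a second line
EOF
```

Only the first word of each paragraph is printed (Goedel, then Foo), because with RS set to the empty string a newline inside a paragraph acts as an ordinary field separator rather than ending the record.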

this one-liner could work for your requirement:
awk -F ' - ' 'NF>1{print $1;exit}'

awk -F ' - ' ' { if (FNR % 4 != 1) next; print $1; }'
If the format is exactly the same as below, then the code above should work:
1 Author - ...
2 Year ...
3 URL
4 Extra info ...
5 Author - ...
6..N etc.
If there is a blank line between entries, you can set RS to a null string and $1 will be the author as long as the value for -F (the FS variable in an awk script) is the same. This has the advantage that if you don't have "info: ..." or a URL, you can still distinguish between entries, assuming it is not "Author - ...{newline}Year ...{newline}{newline}info: ...{newline}{newline}Author - ..." (you can't have an empty line between parts of an entry if an empty line is what separates entries.) For example:
# A blank line is what separates each entry.
BEGIN { RS = ""; }
{ print $1; }
If you have an awk that supports it, you can make RS a multiple character string if necessary (e.g. RS = "\n--\n" for entries separated by "--" on a line by itself). If you need a regex or simply don't have an awk that supports multiple character record separators, you're forced to use something like the following:
BEGIN { found_sep = 1; }
{ if (found_sep) { print $1; found_sep = 0; } }
# Entry separator is "--\n"
/^--$/ { found_sep = 1; }
More sample input will be required for something more complicated.
