sed: How to replace all content in a multi-line pattern? - linux

I have to create a SQL script from another one, altering its content. E.g.
SELECT value INTO val FROM table WHERE condition;
SELECT value2 INTO val2 FROM table WHERE condition1
OR condition2;
So I have tried
sed 's/FROM .*;/;/g'
But it returns this
SELECT value INTO val ;
SELECT value2 INTO val2 FROM table WHERE condition1
OR condition2;
instead of this, which is what I need
SELECT value INTO val ;
SELECT value2 INTO val2 ;
Any ideas? Basically, what I want to do is remove everything between 'FROM' and the next ';'.

sed ':load
# load any multiline sequence before going further
/;[[:space:]]*$/ !{ N;b load
}
# from here you have a full (multi)line to treat
s/[[:space:]]\{1,\}FROM[[:space:]].*;/ ;/
' YourFile
You need to load the whole multi-line sequence first, before removing the tail (the script keeps cycling through the load section until a terminating ; is found):
:load : address label for the branch ('goto') used later
/;[[:space:]]*$/ ! : applies when the line does not yet end with ; (possibly followed by trailing whitespace)
N : append the next input line to the working buffer
b load : branch back to the label load (the 'goto')
s/[[:space:]]\{1,\}FROM[[:space:]].*;/ ;/ : rewrites the whole current working buffer (mono- or multi-line, but always ending with ;) into your new format. sed here treats the buffer, not a single line; newlines are characters like any other in this case.
The last line must end with ; to be treated; if it does not, the last (incomplete) sequence is lost.
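Applied to the two sample statements (saved as test.sql, a file name assumed here for the demonstration), the script prints:
SELECT value INTO val ;
SELECT value2 INTO val2 ;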

I think you can remove the '\n' in your script and then use sed to remove the FROM clause.
E.g.
tr -d '\n' < test.sql | sed 's/FROM [^;]*;/;\n/g'
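This relies on GNU sed treating \n in the replacement as a newline. Note also that tr -d '\n' joins lines with no separator (condition1 and OR briefly run together as condition1OR before the clause is deleted), which is harmless here because that text is removed anyway. On the sample input this prints:
SELECT value INTO val ;
SELECT value2 INTO val2 ;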

awk is record-based, not line-based like sed, so it has no problem handling multi-line strings:
$ awk 'BEGIN{RS=ORS=";"}{gsub(/FROM .*/,"")}1' file
SELECT value INTO val ;
SELECT value2 INTO val2 ;
The above just sets the Record Separator to a ; instead of the default newline and operates on the resulting strings which can contain newlines just like any other characters.

As far as I know, you either have to get rid of newline delimiters
tr -d '\n'
or use Python's re.M | re.DOTALL flags in re.compile,
e.g. (roughly put):
import re
text = open('test.sql').read()  # read the whole file so the regex can span newlines
pattern = re.compile(r'FROM[^;]*;', re.M | re.DOTALL)
result = re.findall(pattern, text)
Usually, when I needed to regex across newlines, I always ended up with Python. Bash is so newline-based that it's hard to bend it to do this.
But replacing a '\n' with a placeholder might suffice, if you really need to use bash.
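A sketch of that placeholder idea in plain shell, assuming the control character \001 never occurs in the input:
tr '\n' '\001' < test.sql | sed 's/FROM [^;]*;/;/g' | tr '\001' '\n'
The newlines are swapped for the placeholder, the substitution then runs on what sed sees as one long line, and the surviving placeholders are turned back into newlines afterwards.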

Related

How do I concatenate each line of 2 variables in bash?

I have 2 variables, NUMS and TITLES.
NUMS contains the string
1
2
3
TITLES contains the string
A
B
C
How do I get output that looks like:
1 A
2 B
3 C
paste -d' ' <(echo "$NUMS") <(echo "$TITLES")
Having multi-line strings in variables suggests that you are probably doing something wrong. But you can try
paste -d ' ' <(echo "$nums") - <<<"$titles"
The basic syntax of paste is to read two or more file names; you can use a command substitution to replace a file anywhere, and you can use a here string or other redirection to receive one of the "files" on standard input (where the file name is then conventionally replaced with the pseudo-file -).
The default column separator from paste is a tab; you can replace it with a space or some other character with the -d option.
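For instance, with literal data standing in for the variables:
$ paste -d' ' <(printf '1\n2\n3\n') <(printf 'A\nB\nC\n')
1 A
2 B
3 C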
You should avoid upper case for your private variables; see also Correct Bash and shell script variable capitalization
Bash variables can contain even very long strings, but this is often clumsy and inefficient compared to reading straight from a file or pipeline.
Convert them to arrays, like this:
NUMS=($NUMS)
TITLES=($TITLES)
Then loop over the indexes of either array, let's say NUMS, like this:
for i in ${!NUMS[*]}; {
# and echo desired output
echo "${NUMS[$i]} ${TITLES[$i]}"
}
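A word of caution: the unquoted NUMS=($NUMS) expansion also performs globbing on the values. A slightly safer sketch of the same loop (bash 4+, assuming newline-separated values) uses mapfile:
mapfile -t nums <<< "$NUMS"
mapfile -t titles <<< "$TITLES"
for i in "${!nums[@]}"; do
    echo "${nums[i]} ${titles[i]}"
done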
Awk alternative:
awk 'FNR==NR { map[FNR]=$0; next } { print map[FNR]" "$0 }' <(echo "$NUMS") <(echo "$TITLES")
For the first file/variable (FNR==NR), set up an array called map with the record number within the file (FNR) as the index and the line as the value. Then, for the second file, print the corresponding entry in the array followed by the line, separated by a space.

Bash (or alternative) to find and replace a number of patterns in csv file using another csv file

I have a very large csv file that is too big to open in excel for this operation.
I need to replace a specific string for approx 6000 records out of the 1.5mil in the csv, the string itself is in the comma separated format like so:
ABC,FOO.BAR,123456
With other columns on either side that are of no concern. I only need enough of the row to make sure the final data string (the numbers) is unique.
I have another file with the string to replace and the replacement string like (for the above):
"ABC,FOO.BAR,123456","ABC,FOO.BAR,654321"
So in the case above 123456 is being replaced by 654321. A simple (yet maddeningly slow) way to do this is to open both docs in Notepad++, find the first string, then replace it with the second string, but with over 6000 records this isn't great.
I was hoping someone could give advice on a scripting solution? e.g.:
$file1 = base.csv
$file2 = replace.csv
For each row in $file2 {
awk '{sub(/$file2($firstcolumn)/,$file2($Secondcolumn)' $file1
}
Though I'm not entirely sure how to adapt awk to do an operation like this...
EDIT: Sorry I should have been more specific, the data in my replacement csv is only in two columns; two raw strings!
It would be easier, of course, if your delimiter were not used within the fields...
you can do in two steps, create a sed script from the lookup file and use it for the main data file for replacements
for example,
(assumes there is no escaped quotes in the fields, may not hold)
$ awk -F'","' '{print "s/" $1 "\"/\"" $2 "/"}' lookup_file > replace.sed
$ sed -f replace.sed data_file
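For the sample lookup row, the generated replace.sed would contain:
s/"ABC,FOO.BAR,123456"/"ABC,FOO.BAR,654321"/
Note that the double quotes from the lookup file are kept in both pattern and replacement, so this assumes the rows in the data file carry the same quoting; if they don't, the quotes would have to be stripped while building the script.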
awk -F\" '
NR==FNR { subst[$2]=$4; next }
{
for (s in subst) {
pos = index($0, s)
if (pos) {
$0 = substr($0, 1, pos-1) subst[s] substr($0, pos + length(s))
break
}
}
print
}
' "$file2" "$file1" # > "$file1.$$.tmp" && mv "$file1.$$.tmp" "$file1"
The part after the # shows how you could replace the input data file with the output.
The block associated with NR==FNR is only executed for the first input file, the one with the search and replacement strings.
subst[$2]=$4 builds an associative array (dictionary): the key is the search string, the value the replacement string.
Fields $2 and $4 are the search string and the replacement string, respectively, because Awk was instructed to break the input into fields by " (-F\"); note that this assumes that your strings do not contain escaped embedded " chars.
The remaining block then processes the data file:
For each input line, it loops over the search strings and looks for a match on the current line:
Once a match is found, the replacement string is substituted for the search string, and matching stops.
print simply prints the (possibly modified) line.
Note that since you want literal string replacements, regex-based functions such as sub() are explicitly avoided in favor of literal string-processing functions index() and substr().
As an aside: since you say there are columns on either side in the data file, consider making the search/replacement strings more robust by placing , on either side of them (this could be done inside the awk script).
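A minimal sketch of that idea, applied to the awk script above: pad the keys and values with commas when the dictionary is built,
NR==FNR { subst["," $2 ","] = "," $4 ","; next }
so a search string only matches when it is bounded by field separators (rows where the match sits at the very start or end of the line would need extra handling).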
I would recommend using a language with a CSV parsing library rather than trying to do this with shell tools. For example, Ruby:
require 'csv'
replacements = CSV.open('replace.csv','r').to_h
File.open('base.csv', 'r').each_line do |line|
replacements.each do |old, new|
line.gsub!(old) { new }
end
puts line
end
Note that Enumerable#to_h requires Ruby v2.1+; replace with this for older Rubys:
replacements = Hash[*CSV.open('replace.csv','r').to_a.flatten]
You only really need CSV for the replacements file; this assumes you can apply the substitutions to the other file as plain text, which speeds things up a bit and avoids having to parse the old/new strings out into fields themselves.
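The script reads base.csv from the current directory and writes to standard output, so a typical run (script name replace_fields.rb chosen here for illustration) would be:
ruby replace_fields.rb > base.fixed.csv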

Convert data into desired form using linux

I have data in a tab-separated file in the following form (filename.tsv):
#a 0 Espert A trius
#b 9 def J
I want to convert the data into the following form (I am introducing <abc> here in every second line):
##<a>
<0 Espert> <abc> <A trius>.
##<b>
<9 def> <abc> <J>.
I am introducing <abc> in every second line. I know how to do the same in Python using the csv module, but I am trying to learn Linux commands. Is there a way to do the same in the Linux terminal using commands like grep?
awk seems like the right tool for the job:
awk '{
printf "##<%s>\n<%s %s> <abc> <%s%s%s>.\n",
substr($1,2),
$2,
$3,
$4,
(length($5) ? " " : ""),
$5
}' filename.tsv
awk loops over all lines in the input file and breaks each line into fields by runs of tabs and/or spaces; $1 refers to the first field, $2, to the second, ...
printf functions the same as in C: a format (template) string containing placeholders is followed by corresponding arguments to substitute for the placeholders.
substr($1,2) returns the substring of the 1st field starting at the 2nd character (i.e., a for the 1st line, b for the 2nd) - note that indices in awk are 1-based.
(length($5) ? " " : "") is a C-style ternary expression that returns a single space if the 5th field is nonempty, and an empty string otherwise.
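Running the command on filename.tsv from the question prints exactly the requested output:
##<a>
<0 Espert> <abc> <A trius>.
##<b>
<9 def> <abc> <J>.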

How to grep/split a word in middle of %% or $$

I have a variable from which I have to grep the word in the middle of %% and the word which starts with $$. I used split and it works... but only for some scenarios.
Example:
#!/usr/bin/perl
my $lastline ="%Filters_LN_RESS_DIR%\ARC\Options\Pega\CHF_Vega\$$(1212_GV_DATE_LDN)";
my @lastline_temp = split(/%/,$lastline);
print @lastline_temp;
my @var=split("\\$\\$",$lastline_temp[2]);
print @var;
I get the output as expected. But can I get the same using a grep command? I mean, I don't want to use array[2] or array[1], so that I can replace the values easily.
I don't really see how you can get the output you expect. Because you put your data in "busy" quotes (interpolating, double, ...), it comes out being stored as:
'%Filters_LN_RESS_DIR%ARCOptionsPegaCHF_Vega$01212_GV_DATE_LDN)'
See Quote and Quote-like Operators and perhaps read Interpolation in Perl
Notice that the backslashes are gone. A backslash in interpolating quotes simply means "treat the next character as literal", so you get literal 'A', literal 'O', literal 'P', ....
That '0' is the value of $( (aka $REAL_GROUP_ID) which you unwittingly asked it to interpolate. So there is no sequence '$$' to split on.
Can you get the same using a grep command? It depends on what "the same" is. You save the results in arrays; the purpose of grep is to exclude things from the arrays. You will have neither the arrays nor the output of the arrays unless you use a trivial grep: grep { 1 } @data.
Actually you can get the exact same result with this regular expression, assuming that the single string in @var is the "result".
m/%([^%]*)$/
Of course, that's no more than
substr( $lastline, rindex( $lastline, '%' ) + 1 );
which can run 8-10 times faster.
First, be very careful in your use of quotes, I'm not sure if you don't mean
'%Filters_LN_RESS_DIR%\ARC\Options\Pega\CHF_Vega\$$(1212_GV_DATE_LDN)'
instead of
"%Filters_LN_RESS_DIR%\ARC\Options\Pega\CHF_Vega\$$(1212_GV_DATE_LDN)"
which might be a different string. For example, if evaluated, "$$" means the variable $PROCESS_ID.
After trying to solve riddles (not sure about that), and quoting your string
my $lastline =
'%Filters_LN_RESS_DIR%\ARC\Options\Pega\CHF_Vega\$$(1212_GV_DATE_LDN)'
differently, I'd use:
my ($w1, $w2) = $lastline =~ m{ % # the % char at the start
([^%]+) # CAPTURE everything until next %
[^(]+ # scan to the first brace
\( # hit the brace
([^)]+) # CAPTURE everything up to closing brace
}x;
print "$w1\n$w2";
to extract your words. Result:
Filters_LN_RESS_DIR
1212_GV_DATE_LDN
But what do you mean by "replace the values easily"? Which values?
Addendum
Now let's extract the "words" delimited by '\'. Using a simple split:
my @words = split /\\/,  # use substr to start split after the first '\\'
            substr $lastline, index($lastline,'\\');
you'll get the words between the backslashes if you drop the last entry (which is the $$(..) string):
pop @words; # remove the last element '$$(..)'
print join "\n", @words; # print the other elements
Result:
ARC
Options
Pega
CHF_Vega
Does this work better with grep? Seems to:
my @words = grep /^[^\$%]+$/, split /\\/, $lastline;
and
print join "\n", @words;
also results in:
ARC
Options
Pega
CHF_Vega
Maybe that is what you are after? What do you want to do with these?

Convert seconds to hh:mm:ss format (or whatever format Excel or LibreOffice likes) and insert it back to a csv file in Bash

I have a csv file like this:
ELAPSEDTIME_SEC;CPU_%;RSS_KBYTES
0;3.4;420012
1;3.4;420012
2;3.4;420012
3;3.4;420012
4;3.4;420012
5;3.4;420012
And I'd like to convert the values (they are seconds) in the first column to hh:mm:ss format (or whatever Excel or LibreOffice can import as time format from csv) and insert it back to the file into a new column following the first. So the output would be something like this:
ELAPSEDTIME_SEC;ELAPSEDTIME_HH:MM:SS;CPU_%;RSS_KBYTES
0;0:00:00;3.4;420012
1;0:00:01;3.4;420012
2;0:00:02;3.4;420012
3;0:00:03;3.4;420012
4;0:00:04;3.4;420012
5;0:00:05;3.4;420012
And I'd have to do this in Bash to work under Linux and OS X as well.
I hope this is what you want:
TZ=UTC awk -F';' -vOFS=';' '
{
$1 = $1 ";" (NR==1 ? "ELAPSEDTIME_HH:MM:SS" : strftime("%H:%M:%S", $1))
}1' input.csv
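Note that strftime() is a GNU awk (gawk) extension; the BSD awk that ships with OS X does not provide it. A portable sketch that computes the three fields arithmetically instead (same input and output layout assumed):
awk -F';' -v OFS=';' '
NR==1 { $1 = $1 ";ELAPSEDTIME_HH:MM:SS" }
NR>1  { $1 = $1 ";" sprintf("%d:%02d:%02d", $1/3600, ($1%3600)/60, $1%60) }
1' input.csv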
By thinking about your question I found an interesting manipulation possibility: Insert a formula into the CSV, and how to pass it to ooCalc:
cat se.csv | while read line ; do n=$((n+1)) ; ((n>1)) && echo ${line/;/';"=time(0;0;$A$'$n')";'} ||echo ${line/;/;time of A;} ;done > se2.csv
formatted:
cat se.csv | while read line ; do
n=$((n+1))
((n>1)) && echo ${line/;/';"=time(0;0;$A$'$n')";'} || echo ${line/;/;time of A;}
done > se2.csv
Remarks:
This adds a column - it doesn't replace
You have to set the import options for CSV correctly. In this case:
delimiter = semicolon (well, we had to do this for the original file as well)
text delimiter = " (wasn't the default)
deactivate checkbox "quoted field as text"
depending on your locale, the function name has to be translated. For example, in German I had to use "zeit(" instead of "time("
since formulas use semicolons themselves the approach will be simpler, not needing that much masking, if the delimiter is something else, maybe a tab.
In practice, you might treat the header line like all the other lines and correct it manually at the end, but the audience of SO expects everything to work out of the box, so the command became somewhat longer.
I would have preferred to replace the whole cat/while-read loop with just a short sed '...' command, and I found a remark in the man page of sed that = can be used for the row number, but I didn't work out how to put it to use here (see the sketch at the end of this answer).
Result:
cat se2.csv
ELAPSEDTIME_SEC;time of A;CPU_%;RSS_KBYTES
0;"=time(0;0;$A$2)";3.4;420012
1;"=time(0;0;$A$3)";3.4;420012
2;"=time(0;0;$A$4)";3.4;420012
3;"=time(0;0;$A$5)";3.4;420012
4;"=time(0;0;$A$6)";3.4;420012
5;"=time(0;0;$A$7)";3.4;420012
In this specific case, the awk-solution seems better, but I guess this approach might sometimes be useful to know.
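As for the sed '=' idea mentioned above: = prints each input line's number on a line of its own, and a second pass can merge it back, for example:
$ sed '=' se.csv | sed 'N;s/\n/:/'
1:ELAPSEDTIME_SEC;CPU_%;RSS_KBYTES
2:0;3.4;420012
3:1;3.4;420012
...
Turning that row number into the $A$n cell reference would still need more text surgery, which is why the loop above remained.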
