Partial string split in Bash

Consider this string:
00x\00x\00x\00x\00x\00x\00x\00x\00x\g09x\t20x\00x\00x\00x\
What I want to retrieve is this:
00x\00x\00x\00x\00x\00x\00x\00x\00x\g09x\t20x\
Basically, the logic is:
As long as it's 00x\ keep reading the remaining of the string.
As long as it's not 00x\ keep reading the remaining of the string.
Split there.
How can this be achieved in bash? Note that there is a "9" in the middle, and a "t", so there might be "garbage" between two 00x\ tokens. I can't just split the string into tokens, nor can I use cut (the length is not fixed). Is there any magic I can do with awk or sed?
Thanks.
Edit: The input string can have other things after the 00x\ tokens. Like this: 00x\00x\00x\00x\00x\00x\00x\00x\00x\g09x\t20x\00x\00x\00x\00x\00x\00x\00x\00x\00x\GL7Dx\00x\00x\00x\00x\00x\00x\00x\00x\00x\00x\BCx\V6Ax\00x\00x\00x\00x\00x\00x\00x\00x\00x\00x\H50x\ where what I want is still 00x\00x\00x\00x\00x\00x\00x\00x\00x\g09x\t20x\

Something in awk:
$ awk '
BEGIN {
FS=ORS="\\"
}
{
for(i=1;i<=NF;i++)
if(($i=="00x")&&p!="00x"&&p!="") {
printf "\n"
exit
} else {
p=$i
print $i
}
}' file
Output on the updated data
00x\00x\00x\00x\00x\00x\00x\00x\00x\g09x\t20x\
In perl using negative lookbehind:
$ perl -ne 's/(?<!00x)\\00x.*/\\/g;print' file
00x\00x\00x\00x\00x\00x\00x\00x\00x\g09x\t20x\
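The same logic can also be sketched in pure Bash, without awk or perl. This is only a sketch under assumptions: the names s, tokens, seen_other and out are made up for illustration, and it treats every token that is not exactly 00x as "garbage".

```shell
#!/usr/bin/env bash
# Input from the question (single quotes keep the backslashes literal).
s='00x\00x\00x\00x\00x\00x\00x\00x\00x\g09x\t20x\00x\00x\00x\'

# Split on backslashes; -r stops read from treating them as escapes.
IFS='\' read -r -a tokens <<< "$s"

out=''
seen_other=0   # becomes 1 once a token other than 00x has been read
for tok in "${tokens[@]}"; do
    if [[ $tok == 00x ]]; then
        # A 00x that follows garbage marks the split point.
        if (( seen_other )); then
            break
        fi
    else
        seen_other=1
    fi
    out+="$tok\\"   # re-append the token with its trailing backslash
done

printf '%s\n' "$out"
```

The flag plays the role of the p variable in the awk answer: leading 00x tokens are kept, and the first 00x seen after any other token ends the loop.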

Related

Replace multiline string with sed

I have a file that's basically an INI/CFG file that looks like this:
[thing-a]
attribute1=foo
attribute2=bar
attribute3=foobar
attribute4=barfoo
[thing-b]
attribute1=dog
attribute3=foofoo
attribute4=castles
[thing-c]
attribute1=foo
attribute4=barfoo
[thing-d]
attribute1=123455
attribute2=dogs
attribute3=biscuits
attribute4=1234
Each 'thing' has a set of attributes that could include all the same ones or a subset thereof.
I am trying to write a small bash script that will replace the attributes for 'thing-c' with a predefined block ($a1, $a2 and $a3 are generated elsewhere in the wider script):
NEW_BLOCK="[thing-c]
attribute1=${a1}
attribute2=${a2}
attribute3=${a3}"
I can find the right block with sed like this:
THING_BLOCK=$(sed -nr "/^\[thing-c\]/ { :l /^\s*[^#].*/ p; n; /^\[/ q; b l; }" ./myThingFile)
I'm not sure if I've gone down a rabbit hole with this, and I'm pretty sure there is a better way of doing it.
What I want to do is:
sed "s/${THING_BLOCK}/${NEW_BLOCK}/"
But I can't quite figure out the multiline aspect of this, and I'm not sure what the best route to take is.
Is there a way to do this sort of multiline find and replace with sed (or a better way with bash)?
Is there a way to do this sort of multiline find and replace ...
Yes there is indeed a better way, albeit using awk:
awk -v blk="$NEW_BLOCK" -v RS= '{ORS = RT} $1 == "[thing-c]" {$0 = blk} 1' file
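Using -v RS= (an empty record separator) puts awk in paragraph mode: records are separated by blank lines, so each [thing-x] block is read as one record, provided the blocks in the file are separated by empty lines. RT, the text that actually terminated the record, is specific to GNU awk; assigning it to ORS preserves the original spacing between blocks.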
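As a rough illustration of paragraph mode without the gawk-only RT, the following sketch builds a small sample file and swaps the [thing-c] block. It assumes the blocks are separated by blank lines; the temp file, the replacement values x and y, and passing the block through the environment (to avoid awk implementations that reject newlines in -v values) are choices made for this example only.

```shell
#!/usr/bin/env bash
# Sample file with blank lines between blocks (an assumption about the real file).
ini=$(mktemp)
printf '%s\n' '[thing-b]' 'attribute1=dog' '' \
              '[thing-c]' 'attribute1=foo' 'attribute4=barfoo' '' \
              '[thing-d]' 'attribute1=123455' > "$ini"

NEW_BLOCK='[thing-c]
attribute1=x
attribute2=y'

# RS="" gives one record per blank-line-separated block;
# ORS is set to a blank line so the blocks stay separated on output.
result=$(NEW_BLOCK="$NEW_BLOCK" awk -v RS= -v ORS='\n\n' '
    $1 == "[thing-c]" { $0 = ENVIRON["NEW_BLOCK"] }
    { print }
' "$ini")

printf '%s\n' "$result"
rm -f "$ini"
```

Unlike ORS = RT, this normalises every gap between blocks to exactly one blank line.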
Another awk. Store the replacement to file2 and:
$ awk -v RS="" '
NR==FNR {
b=$0
next
}
$1~/thing-c/ {
$0=b
}
{
print (++c==1?"":ORS) $0
}' file2 file1
Output:
[thing-a]
attribute1=foo
attribute2=bar
attribute3=foobar
attribute4=barfoo
[thing-b]
attribute1=dog
attribute3=foofoo
attribute4=castles
[thing-c]
attribute1=${a1}
attribute2=${a2}
attribute3=${a3}
[thing-d]
attribute1=123455
attribute2=dogs
attribute3=biscuits
attribute4=1234
If you want to use sed (IMHO awk is better here), you need "nice" data: no characters in the replacement that sed treats specially, and no [ inside the thing-c block.
I tested with
read -d '' -r NEW_BLOCK <<END
[thing-c]
attribute1=${a1}
attribute2=${a2}
attribute3=${a3}
END
For my solution I first need to replace newlines in $NEW_BLOCK with the two characters \n.
echo "This is the replacement string: ${NEW_BLOCK//$'\n'/\\n}"
With the "multi-line" option "-z" you can do
sed -rz "s/\[thing-c\][^[]*/${NEW_BLOCK//$'\n'/\\n}\n\n/" myThingFile

BASH - Extract Data from String

I have a log that returns thousands of lines of data, and I want to extract a few values from it.
In the log there is only one line containing the unique unit reference, so I can grep for that using:
grep "unit=Central-C152" logfile.txt
That produces a line of output similar to the following:
a3cd23e,85d58f5,53f534abef7e7,unit=Central-C152,locale=32325687-8595-9856-1236-12546975,11="School",1="Mr Green",2="Qual",3="SWE",8="report",5="channel",7="reset",6="velum"
The format of the line may change, in that the values won't always be in the same order.
I'm trying to work out how to get the values of 2 and 7 into separate variables.
I had thought about using cut on , or = but as the values aren't in a set order I couldn't work out the best way to do it.
I'm trying to get:
var state=value of 2 without quotes
var mode=value of 7 without quotes
Can anyone advise on the best way to do this?
Thanks
Could you please try the following to set the variables' values.
state=$(awk '/unit=Central-C152/ && match($0,/2=\"[^"]*/){print substr($0,RSTART+3,RLENGTH-3)}' Input_file)
mode=$(awk '/unit=Central-C152/ && match($0,/7=\"[^"]*/){print substr($0,RSTART+3,RLENGTH-3)}' Input_file)
You could print them too by doing following.
echo "$state"
echo "$mode"
Explanation:
awk ' ## Start the awk program.
/unit=Central-C152/ && match($0,/2=\"[^"]*/){ ## On the line containing unit=Central-C152, match from 2=" up to (not including) the closing quote.
print substr($0,RSTART+3,RLENGTH-3) ## Print the match without its first 3 characters (2="), i.e. the bare value.
}
' Input_file ## Name of the input file.
You are probably better off doing all of the processing in Awk.
awk -F, '/unit=Central-C152/ {
for(i=1;i<=NF;++i)
if($i ~ /^[27]="/) {
b[++k] = $i
sub(/^[27]="/, "", b[k])
sub(/"$/, "", b[k])
gsub(/\\/, "", b[k])
}
print "state " b[1] ", mode " b[2]
}' logfile.txt
This presupposes that the fields always occur in the same order (2 before 7). Maybe you need to change or disable the gsub to remove backslashes in the values.
If you want to do more than print the values, refactoring whatever Bash code you have into Awk is often a better approach than doing this processing in Bash.
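The order assumption can be dropped by keying each value on its field number instead of on its position. The following is a sketch only: the variable names out, state and mode and the array v are illustrative, and it assumes that no value contains a comma or a newline.

```shell
#!/usr/bin/env bash
# Sample line from the question; in practice awk would read logfile.txt directly.
line='a3cd23e,85d58f5,53f534abef7e7,unit=Central-C152,locale=32325687-8595-9856-1236-12546975,11="School",1="Mr Green",2="Qual",3="SWE",8="report",5="channel",7="reset",6="velum"'

out=$(printf '%s\n' "$line" | awk -F, '
/unit=Central-C152/ {
    for (i = 1; i <= NF; ++i) {
        eq = index($i, "=")          # position of the first = in this field
        if (eq == 0) continue        # fields without = carry no key
        key = substr($i, 1, eq - 1)
        val = substr($i, eq + 1)
        gsub(/^"|"$/, "", val)       # strip the surrounding quotes
        v[key] = val
    }
    print v["2"]                     # first output line: value of 2
    print v["7"]                     # second output line: value of 7
}')

state=${out%%$'\n'*}   # everything before the newline
mode=${out#*$'\n'}     # everything after the newline

printf 'state=%s mode=%s\n' "$state" "$mode"
```

Because the lookup is by key, the result is the same whichever of 2 and 7 comes first in the line.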
Assuming you already have the line in a variable such as with:
line="$(grep 'unit=Central-C152' logfile.txt | head -1)"
You can then simply use the built-in parameter expansion features of bash:
f2=${line#*,2=\"} ; f2=${f2%%\"*} ; echo "${f2}"
f7=${line#*,7=\"} ; f7=${f7%%\"*} ; echo "${f7}"
The first command on each line strips off the first part of the line up to and including ,<field-number>=" (the leading comma keeps 2=" from matching inside a key such as 12="). The second command then strips everything from the first remaining quote onward. The third, of course, simply echoes the value.
When I run those commands against your input line, I see:
Qual
reset
which is, from what I can see, what you were after.

Bash, trim a "complicated" string to obtain a new string

I have a text file that contains many strings (one string per line). A typical string has this shape:
sno_Int-INT1_Exp-INT2_INT3.fits.fz_ovsc_rms_D4_D5_D6_D7_D8_D9
In the above string, "INT1", "INT2" and "INT3" are all integers whose values may vary from string to string in the text file, and "D4" to "D9" are doubles (also not fixed values).
What I need to do is change the above string to a new string like:
INT3_ovsc_rms_D4_D5_D6_D7_D8_D9
Can anybody tell me how to do it ?
Thanks!
#!/bin/bash
input=$1
left=${input%%.*}
right=${input#*.fz_}
int3=${left##*_}
output=${int3}_${right}
echo "${output}"
Example runs:
$ ./foo.sh sno_Int-INT1_Exp-INT2_INT3.fits.fz_ovsc_rms_D4_D5_D6_D7_D8_D9
INT3_ovsc_rms_D4_D5_D6_D7_D8_D9
$ ./foo.sh sno_Int-300_Exp-1000_1051.fits.fz_ovsc_rms_10.6_2.35_53.2_0_5.92_2.14
1051_ovsc_rms_10.6_2.35_53.2_0_5.92_2.14
Depending on your real input this might break horribly, though.
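Since the question mentions a text file with one string per line, the expansions from the script can be wrapped in a function and applied line by line. This is a sketch: trim_name and the temp file are hypothetical names, the second sample line is invented, and the same caveat about unexpected input applies.

```shell
#!/usr/bin/env bash
# Hypothetical helper wrapping the expansions from the script above.
trim_name() {
    local input=$1
    local left=${input%%.*}       # drop everything from the first "." onward
    local right=${input#*.fz_}    # drop everything up to and including ".fz_"
    # INT3 is whatever follows the last "_" in the left part.
    printf '%s_%s\n' "${left##*_}" "$right"
}

names=$(mktemp)   # stand-in for the real text file
printf '%s\n' \
    'sno_Int-300_Exp-1000_1051.fits.fz_ovsc_rms_10.6_2.35_53.2_0_5.92_2.14' \
    'sno_Int-5_Exp-20_7.fits.fz_ovsc_rms_1.5_2.5_3.5_4.5_5.5_6.5' > "$names"

result=$(while IFS= read -r line; do
    trim_name "$line"
done < "$names")

printf '%s\n' "$result"
rm -f "$names"
```

No external process is started per line, which helps when the file is long.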
If you really want to do this in pure Bash, you'll need to split the string by setting IFS and then using read with a "here string". For details, see here: How do I split a string on a delimiter in Bash?
You will probably need to split it multiple times--once by underscore and then by dash, I guess.
If you don't mind awk:
echo sno_Int-INT1_Exp-INT2_INT3.fits.fz_ovsc_rms_D4_D5_D6_D7_D8_D9 | awk -F_ 'BEGIN{OFS="_"}{sub(/\.fits\.fz/,"",$4);print $4,$5,$6,$7,$8,$9,$10,$11,$12}'
INT3_ovsc_rms_D4_D5_D6_D7_D8_D9
This awk should work:
s='1000_1051.fits.fz_ovsc_rms_10.6_2.35_53.2_0_5.92_2.14'
awk -F'[_.]' 'NR==1{i3=$2;next} {printf "%s%s%s", i3, RS, $0}' RS='_ovsc_rms' <<< "$s"
1051_ovsc_rms_10.6_2.35_53.2_0_5.92_2.14

Efficient way to replace strings in one file with strings from another file

Searched for similar problems and could not find anything that suits my needs exactly:
I have a very large HTML file scraped from multiple websites and I would like to replace all
class="key->from 2nd file"
with
style="xxxx"
At the moment I use sed - it works well but only with small files
while read key; do sed -i "s/class=\"$key\"/style=\"xxxx\"/g" file_to_process; done < keys
When I'm trying to process something larger it takes ages
Example:
keys - Count: 1233 lines
file_to_process - Count: 1946 lines
It takes about 40 s to complete only 1/10 of processing I need
real 0m40.901s
user 0m8.181s
sys 0m15.253s
Untested since you didn't provide any sample input and expected output:
awk '
NR==FNR { keys = keys sep $0; sep = "|"; next }
{ gsub("class=\"(" keys ")\"","style=\"xxxx\"") }
1' keys file_to_process > tmp$$ &&
mv tmp$$ file_to_process
I think it's time to Perl (untested):
use strict;
use warnings;
my $keyfilename = 'somekeyfile'; # or pick up from script arguments
open my $keyfh, '<', $keyfilename or die "Could not open key file $keyfilename: $!\n";
chomp(my @keylist = <$keyfh>); # strip newlines so keys match class values
my %keys = map { $_ => 1 } @keylist; # construct a hash for lookup speed
close $keyfh;
my $htmlfilename = 'somehtmlfile'; # or pick up from script arguments
open my $htmlfh, '<', $htmlfilename or die "Could not open html file $htmlfilename: $!\n";
my $newchunk = qq/style="xxxx"/;
while (my $line = <$htmlfh>) {
# Replace each class="..." whose value is a known key; leave the others untouched.
$line =~ s/(class="([^"]+)")/exists $keys{$2} ? $newchunk : $1/ge;
print $line;
}
close $htmlfh;
This uses a hash for lookups of keys, which should be reasonably fast, and does this only on the key itself when the line contains a class statement.
Try to generate a very long sed script with all sub commands from the keys file, something like:
s/class=\"key1\"/style=\"xxxx\"/g; s/class=\"key2\"/style=\"xxxx\"/g ...
and use this file.
This way you will read the input file only once.
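One way to generate such a script is with sed itself. The sketch below uses made-up sample keys and HTML in temp files, and it assumes the keys contain no characters that are special to sed (such as /, ., *, [, \ or &); real keys may need escaping first.

```shell
#!/usr/bin/env bash
keys=$(mktemp); html=$(mktemp); script=$(mktemp)

# Invented sample data standing in for the real keys file and HTML file.
printf '%s\n' 'btn' 'nav-item' > "$keys"
printf '%s\n' '<a class="btn" href="#">x</a> <li class="nav-item">' \
              '<p class="other">' > "$html"

# Turn every key line into one substitute command; & stands for the key.
sed 's|.*|s/class="&"/style="xxxx"/g|' "$keys" > "$script"

# Apply all substitutions in a single pass over the HTML.
result=$(sed -f "$script" "$html")

printf '%s\n' "$result"
rm -f "$keys" "$html" "$script"
```

The HTML is read once and only two sed processes are started in total, instead of one per key.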
Here's one way using GNU awk:
awk 'FNR==NR { array[$0]++; next } { for (i in array) { a = "class=\"" i "\""; gsub(a, "style=\"xxxx\"") } }1' keys.txt file.txt
Note that the keys in keys.txt are taken as the whole line, including whitespace. If leading and trailing whitespace could be a problem, use $1 instead of $0. Unfortunately I cannot test this properly without some sample data. HTH.
First convert your keys file into a sed or-pattern which looks like this: key1|key2|key3|.... This can be done using the tr command. Once you have this pattern, you can use it in a single sed command.
Try the following:
sed -i -r "s/class=\"($(tr '\n' '|' < keys | sed 's/|$//'))\"/style=\"xxxx\"/g" file

Implement tail with awk

I am struggling with this awk code which should emulate the tail command
num=$1;
{
vect[NR]=$0;
}
END{
for(i=NR-num;i<=NR;i++)
print vect[$i]
}
What I'm trying to achieve here is a tail command emulated by awk.
For example, cat somefile | awk -f tail.awk 10
should print the last 10 lines of a text file. Any suggestions?
All of these answers store the entire source file. That's a horrible idea and will break on larger files.
Here's a quick way to store only the number of lines to be outputted (note that the more efficient tail will always be faster because it doesn't read the entire source file!):
awk -vt=10 '{o[NR%t]=$0}END{i=(NR<t?0:NR);do print o[++i%t];while(i%t!=NR%t)}'
more legibly (and with less code golf):
awk -v tail=10 '
{
output[NR % tail] = $0
}
END {
if(NR < tail) {
i = 0
} else {
i = NR
}
do {
i = (i + 1) % tail;
print output[i]
} while (i != NR % tail)
}'
Explanation of legible code:
This uses the modulo operator to store only the desired number of items (the tail variable). As each line is parsed, it is stored on top of older array values (so line 11 gets stored in output[1]).
The END stanza sets an increment variable i to either zero (if we've got fewer than the desired number of lines) or else the number of lines, which tells us where to start recalling the saved lines. Then we print the saved lines in order. The loop ends when we've returned to that first value (after we've printed it).
You can replace the if/else stanza (or the ternary clause in my golfed example) with just i = NR if you don't care about getting blank lines to fill the requested number (echo "foo" |awk -vt=10 … would have nine blank lines before the line with "foo").
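A variant of the same ring buffer that prints no blank padding and nothing at all for empty input might look like the sketch below. The function name awk_tail and the variable names n, buf and start are invented for this example, and it assumes n is at least 1.

```shell
#!/usr/bin/env bash
# Print the last N lines of stdin (default 10), keeping only N lines in memory.
awk_tail() {
    awk -v n="${1:-10}" '
        { buf[NR % n] = $0 }    # reuse the slot of the line read n lines earlier
        END {
            # Number of the first line that is still held in the buffer.
            start = (NR > n) ? NR - n + 1 : 1
            for (i = start; i <= NR; i++)
                print buf[i % n]
        }'
}

result=$(printf '%s\n' a b c d e f g | awk_tail 3)
printf '%s\n' "$result"
```

Iterating over line numbers from start to NR avoids the do/while wrap-around test, at the cost of one extra variable.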
for(i=NR-num;i<=NR;i++)
print vect[$i]
In awk, $i refers to field number i of the current record, not to the variable i. Use just plain i:
for(i=NR-num;i<=NR;i++)
print vect[i]
The full code that worked for me is:
#!/usr/bin/awk -f
BEGIN{
num=ARGV[1];
# Make that arg empty so awk doesn't interpret it as a file name.
ARGV[1] = "";
}
{
vect[NR]=$0;
}
END{
# Start at NR-num+1 so that exactly num lines are printed, not num+1.
for(i=NR-num+1;i<=NR;i++)
print vect[i]
}
You should probably add some code to the END to handle the case when NR < num.
You need to add -v num=10 to the awk commandline to set the value of num. And start at NR-num+1 in your final loop, otherwise you'll end up with num+1 lines of output.
This might work for you:
awk '{a=a b $0;b=RS;if(NR<=v)next;a=substr(a,index(a,RS)+1)}END{print a}' v=10

Resources