Bash, trim a "complicated" string to obtain a new string

Bash, trim a "complicated" string to obtain a new string - string

I have a txt file contains many strings(every string lies in a line). A typical string has this shape:
sno_Int-INT1_Exp-INT2_INT3.fits.fz_ovsc_rms_D4_D5_D6_D7_D8_D9
In the above string, "INT1", "INT2" and "INT3" are all integer types and their values might variant for each string in the text file, "D4 - 9" are double type(not fixed value also).
What I need to do is to change the above string to a new string like :
INT3_ovsc_rms_D4_D5_D6_D7_D8_D9
Can anybody tell me how to do it ?
Thanks!

#!/bin/bash
input=$1
left=${input%%.*}
right=${input#*.fz_}
int3=${left##*_}
output=${int3}_${right}
echo "${output}"
.
$ ./foo.sh sno_Int-INT1_Exp-INT2_INT3.fits.fz_ovsc_rms_D4_D5_D6_D7_D8_D9
INT3_ovsc_rms_D4_D5_D6_D7_D8_D9
$ ./foo.sh sno_Int-300_Exp-1000_1051.fits.fz_ovsc_rms_10.6_2.35_53.2_0_5.92_2.14
1051_ovsc_rms_10.6_2.35_53.2_0_5.92_2.14
Depending on your real input this might break horribly, though.

If you really want to do this in pure Bash, you'll need to split the string by setting IFS and then using read with a "here string". For details, see here: How do I split a string on a delimiter in Bash?
You will probably need to split it multiple times--once by underscore and then by dash, I guess.

If you don't mind awk:
echo sno_Int-INT1_Exp-INT2_INT3.fits.fz_ovsc_rms_D4_D5_D6_D7_D8_D9 | awk -F_ 'BEGIN{OFS="_"}{sub(/.fits.fz/,"",$4);print $4,$5,$6,$7,$8,$9,$10,$11,$12}'
INT3_ovsc_rms_D4_D5_D6_D7_D8_D9

This awk should work:
s='1000_1051.fits.fz_ovsc_rms_10.6_2.35_53.2_0_5.92_2.14'
awk -F'[_.]' 'NR==1{i3=$2;next} {printf "%s%s%s", i3, RS, $0}' RS='_ovsc_rms' <<< "$s"
1051_ovsc_rms_10.6_2.35_53.2_0_5.92_2.14

Related

How can I truncate a line of text longer than a given length?

How would you go about removing everything after x number of characters? For example, cut everything after 15 characters and add ... to it.
This is an example sentence should turn into This is an exam...

GnuTools head can use chars rather than lines:
head -c 15 <<<'This is an example sentence'
Although consider that head -c only deals with bytes, so this is incompatible with multi-bytes characters like UTF-8 umlaut ü.
Bash built-in string indexing works:
str='This is an example sentence'
echo "${str:0:15}"
Output:
This is an exam
And finally something that works with ksh, dash, zsh…:
printf '%.15s\n' 'This is an example sentence'
Even programmatically:
n=15
printf '%.*s\n' $n 'This is an example sentence'
If you are using Bash, you can directly assign the output of printf to a variable and save a sub-shell call with:
trim_length=15
full_string='This is an example sentence'
printf -v trimmed_string '%.*s' $trim_length "$full_string"

Use sed:
echo 'some long string value' | sed 's/\(.\{15\}\).*/\1.../'
Output:
some long strin...
This solution has the advantage that short strings do not get the ... tail added:
echo 'short string' | sed 's/\(.\{15\}\).*/\1.../'
Output:
short string
So it's one solution for all sized outputs.

Use cut:
echo "This is an example sentence" | cut -c1-15
This is an exam
This includes characters (to handle multi-byte chars) 1-15, c.f. cut(1)
-b, --bytes=LIST
select only these bytes
-c, --characters=LIST
select only these characters

Awk can also accomplish this:
$ echo 'some long string value' | awk '{print substr($0, 1, 15) "..."}'
some long strin...
In awk, $0 is the current line. substr($0, 1, 15) extracts characters 1 through 15 from $0. The trailing "..." appends three dots.

Todd actually has a good answer however I chose to change it up a little to make the function better and remove unnecessary parts :p
trim() {
if (( "${#1}" > "$2" )); then
echo "${1:0:$2}$3"
else
echo "$1"
fi
}
In this version the appended text on longer string are chosen by the third argument, the max length is chosen by the second argument and the text itself is chosen by the first argument.
No need for variables :)

Using Bash Shell Expansions (No External Commands)
If you don't care about shell portability, you can do this entirely within Bash using a number of different shell expansions in the printf builtin. This avoids shelling out to external commands. For example:
trim () {
local str ellipsis_utf8
local -i maxlen
# use explaining variables; avoid magic numbers
str="$*"
maxlen="15"
ellipsis_utf8=$'\u2026'
# only truncate $str when longer than $maxlen
if (( "${#str}" > "$maxlen" )); then
printf "%s%s\n" "${str:0:$maxlen}" "${ellipsis_utf8}"
else
printf "%s\n" "$str"
fi
}
trim "This is an example sentence." # This is an exam…
trim "Short sentence." # Short sentence.
trim "-n Flag-like strings." # Flag-like strin…
trim "With interstitial -E flag." # With interstiti…
You can also loop through an entire file this way. Given a file containing the same sentences above (one per line), you can use the read builtin's default REPLY variable as follows:
while read; do
trim "$REPLY"
done < example.txt
Whether or not this approach is faster or easier to read is debatable, but it's 100% Bash and executes without forks or subshells.

Trim a string up to 4th delimiter from right side

I have strings like following which should be parsed with only unix command (bash)
49_sftp_mac_myfile_simul_test_9999_4000000000000001_2017-02-06_15-15-26.49.csv.failed
I want to trim the strings like above upto 4th underscore from end/right side. So output should be
49_sftp_mac_myfile_simul_test
Number of underscores can vary in overall string. For example, The string could be
49_sftp_simul_test_9999_4000000000000001_2017-02-06_15-15-26.49.csv.failed
Output should be (after trimming up to 4th occurrence of underscore from right.
49_sftp_simul_test

Easily done using awk that decrements NF i.e. no. of fields to -4 after setting input+output field separator as underscore:
s='49_sftp_mac_myfile_simul_test_9999_4000000000000001_2017-02-06_15-15-26.49.csv.failed'
awk 'BEGIN{FS=OFS="_"} {NF -= 4; $1=$1} 1' <<< "$s"
49_sftp_mac_myfile_simul_test

You can use bash's parameter expansion for that:
string="..."
echo "${string%_*_*_*_*}"

With GNU sed:
$ sed -E 's/(_[^_]*){4}$//' <<< "49_sftp_mac_myfile_simul_test_9999_4000000000000001_2017-02-06_15-15-26.49.csv.failed"
49_sftp_mac_myfile_simul_test
From the end of line, removes 4 occurrences of _ followed by non _ characters.

Perl one-liner
echo $your-string | perl -lne '$n++ while /_/g; print join "_",((split/_/)[-$n-1..-5])'
input
49_sftp_mac_myfile_simul_test_9999_4000000000000001_2017-02-06_15-15-26.49.csv.failed
the output
49_sftp_mac_myfile_simul_test
input
49_sftp_simul_test_9999_4000000000000001_2017-02-06_15-15-26.49.csv.failed
the output
49_sftp_simul_test

Not the fastest but maybe the easiest to remember and funiest:
echo "49_sftp_mac_myfile_simul_test_9999_4000000000000001_2017-02-06_15-15-26.49.csv.failed"|
rev | cut -d"_" -f5- | rev

bash Changing every other comma to point

I am working with set of data which is written in Swedish format. comma is used instead of point for decimal numbers in Sweden.
My data set is like this:
1,188,1,250,0,757,0,946,8,960
1,257,1,300,0,802,1,002,9,485
1,328,1,350,0,846,1,058,10,021
1,381,1,400,0,880,1,100,10,418
Which I want to change every other comma to point and have output like this:
1.188,1.250,0.757,0.946,8.960
1.257,1.300,0.802,1.002,9.485
1.328,1.350,0.846,1.058,10.021
1.381,1.400,0.880,1.100,10.418
Any idea of how to do that with simple shell scripting. It is fine If I do it in multiple steps. I mean if I change first the first instance of comma and then the third instance and ...
Thank you very much for your help.

Using sed
sed 's/,\([^,]*\(,\|$\)\)/.\1/g' file
1.188,1.250,0.757,0.946,8.960
1.257,1.300,0.802,1.002,9.485
1.328,1.350,0.846,1.058,10.021
1.381,1.400,0.880,1.100,10.418

For reference, here is a possible way to achieve the conversion using awk:
awk -F, '{for(i=1;i<=NF;i=i+2) {printf $i "." $(i+1); if(i<NF-2) printf FS }; printf "\n" }' file
The for loop iterates every 2 fields separated by a comma (set by the option -F,) and prints the current element and the next one separated by a dot.
The comma separator represented by FS is printed except at the end of line.

As a Perl one-liner, using split and array manipulation:
perl -F, -e '#a = #b = (); while (#b = splice #F, 0, 2) {
push #a, join ".", #b} print join ",", #a' file
Output:
1.188,1.250,0.757,0.946,8.960
1.257,1.300,0.802,1.002,9.485
1.328,1.350,0.846,1.058,10.021
1.381,1.400,0.880,1.100,10.418

Many sed dialects allow you to specify which instance of a pattern to replace by specifying a numeric option to s///.
sed -e 's/,/./9' -e 's/,/./7' -e 's/,/./5' -e 's/,/./3' -e 's/,/./'
ISTR some sed dialects would allow you to simplify this to
sed 's/,/./1,2'
but this is not supported on my Debian.
Demo: http://ideone.com/6s2lAl

AWK: how to extract in a block of text between two "\\" without considering the back to line

I would like to extract from a big block of text certain area
by setting the Field separator as "\\" however I'm always facing a problem as my text contain some single "\" and it seems to disturb the correct text extraction
INPUT:
1\1\GINC-R1430\FOpt\RB3LYP\6-31G(d,p)\C11H8\ROOT\22-Jan-2015\0\\#N b3l
yp/6-31G** opt freq=noraman test Maxdisk=1Gb\\3\\0,1\C,-2.6997011275,0
.2415237678,0.5867242856\C,-0.844160292,1.6395735777,-0.4268479833\C,-
1.9760161741,1.2551936894,0.1361541401\C,-2.3923087914,-1.0358860734,-
0.0557643955\C,0.3235980425,0.7875682734,-0.1356859882\C,-1.1093142432
,-1.3685423936,-0.3602591004\C,0.1496925203,-0.6332454104,-0.151244509
2\H,-3.3806331312,0.2996137801,1.4332335206\H,-0.7633170455,2.45988827
32,-1.1373018124\H,1.7187287121,2.4104501712,0.0387394407\H,-3.1756548
236,-1.7742599934,-0.224548871\H,-0.9560852099,-2.3752668104,-0.747558
6451\C,1.6076580336,1.3296735593,0.0442342156\C,2.5669578833,-0.875832
9525,0.1864536297\H,3.4305876714,-1.5230597241,0.3068386649\C,1.309289
0866,-1.4290100931,-0.0026907826\H,1.2013201753,-2.5103156986,-0.02627
39389\C,2.7201916294,0.5158561201,0.2083031485\H,3.7045180838,0.956653
9373,0.3361669809\\Version=ES64L-G09RevD.01\State=1-A\HF=-423.9087698\
RMSD=8.508e-09\RMSF=5.945e-05\Dipole=0.3132737,-0.297812,-0.0202519\Qu
adrupole=2.0644665,1.7222772,-3.7867437,1.9108337,-0.4477432,-0.303338
1\PG=C01 [X(C11H8)]\\#
OUTPUT I'm looking for:
0,1\C,-2.6997011275,0
.2415237678,0.5867242856\C,-0.844160292,1.6395735777,-0.4268479833\C,-
1.9760161741,1.2551936894,0.1361541401\C,-2.3923087914,-1.0358860734,-
0.0557643955\C,0.3235980425,0.7875682734,-0.1356859882\C,-1.1093142432
,-1.3685423936,-0.3602591004\C,0.1496925203,-0.6332454104,-0.151244509
2\H,-3.3806331312,0.2996137801,1.4332335206\H,-0.7633170455,2.45988827
32,-1.1373018124\H,1.7187287121,2.4104501712,0.0387394407\H,-3.1756548
236,-1.7742599934,-0.224548871\H,-0.9560852099,-2.3752668104,-0.747558
6451\C,1.6076580336,1.3296735593,0.0442342156\C,2.5669578833,-0.875832
9525,0.1864536297\H,3.4305876714,-1.5230597241,0.3068386649\C,1.309289
0866,-1.4290100931,-0.0026907826\H,1.2013201753,-2.5103156986,-0.02627
39389\C,2.7201916294,0.5158561201,0.2083031485\H,3.7045180838,0.956653
9373,0.3361669809
The best I got so far was by using a simple:
awk 'BEGIN { FS = "\\\\" } ; {print $SELECTED AREA}'
where the selected area would be $4 if it is possible to set the field separator as "\\" without considering the "\"
Is someone have an idea how to do that?

You need all of eight backslashes to get what you want.
awk -F '\\\\\\\\' '{print $4}'
That's because you double them to get a literal backslash in a string, and double them again to get a literal backslash in a regex.
As an aside, that's an exceptionally poor choice of field delimiter.

To get correct output, you need to set record separator to nothing like this:
awk -F'\\\\\\\\' '{print $4}' RS= file
0,1\C,-2.6997011275,0
.2415237678,0.5867242856\C,-0.844160292,1.6395735777,-0.4268479833\C,-
1.9760161741,1.2551936894,0.1361541401\C,-2.3923087914,-1.0358860734,-
0.0557643955\C,0.3235980425,0.7875682734,-0.1356859882\C,-1.1093142432
,-1.3685423936,-0.3602591004\C,0.1496925203,-0.6332454104,-0.151244509
2\H,-3.3806331312,0.2996137801,1.4332335206\H,-0.7633170455,2.45988827
32,-1.1373018124\H,1.7187287121,2.4104501712,0.0387394407\H,-3.1756548
236,-1.7742599934,-0.224548871\H,-0.9560852099,-2.3752668104,-0.747558
6451\C,1.6076580336,1.3296735593,0.0442342156\C,2.5669578833,-0.875832
9525,0.1864536297\H,3.4305876714,-1.5230597241,0.3068386649\C,1.309289
0866,-1.4290100931,-0.0026907826\H,1.2013201753,-2.5103156986,-0.02627
39389\C,2.7201916294,0.5158561201,0.2083031485\H,3.7045180838,0.956653
9373,0.3361669809
Yo may need gnu awk to set record selector to nothing.

OK I got it thanks to ED Morton, Jotne and tripleee
By setting the RS i now have the correct output by using
awk 'BEGIN {FS="\\\\\\\\"; RS="\n\n";} {print $4}'
As I don't have any double blank lines it consider my block of text as one region now.
I never though about the RS before as I'm mainly working on table parsing usually.
Thanks for that

Extracting part of a string to a variable in bash

noob here, sorry if a repost. I am extracting a string from a file, and end up with a line, something like:
abcdefg:12345:67890:abcde:12345:abcde
Let's say it's in a variable named testString
the length of the values between the colons is not constant, but I want to save the number, as a string is fine, to a variable, between the 2nd and 3rd colons. so in this case I'd end up with my new variable, let's call it extractedNum, being 67890 . I assume I have to use sed but have never used it and trying to get my head around it...
Can anyone help? Cheers
On a side-note, I am using find to extract the entire line from a string, by searching for the 1st string of characters, in this case the abcdefg part.

Pure Bash using an array:
testString="abcdefg:12345:67890:abcde:12345:abcde"
IFS=':'
array=( $testString )
echo "value = ${array[2]}"
The output:
value = 67890

Here's another pure bash way. Works fine when your input is reasonably consistent and you don't need much flexibility in which section you pick out.
extractedNum="${testString#*:}" # Remove through first :
extractedNum="${extractedNum#*:}" # Remove through second :
extractedNum="${extractedNum%%:*}" # Remove from next : to end of string
You could also filter the file while reading it, in a while loop for example:
while IFS=' ' read -r col line ; do
# col has the column you wanted, line has the whole line
# # #
done < <(sed -e 's/\([^:]*:\)\{2\}\([^:]*\).*/\2 &/' "yourfile")
The sed command is picking out the 2nd column and delimiting that value from the entire line with a space. If you don't need the entire line, just remove the space+& from the replacement and drop the line variable from the read. You can pick any column by changing the number in the \{2\} bit. (Put the command in double quotes if you want to use a variable there.)

You can use cut for this kind of stuff. Here you go:
VAR=$(echo abcdefg:12345:67890:abcde:12345:abcde |cut -d":" -f3); echo $VAR
For the fun of it, this is how I would (not) do this with sed, but I'm sure there's easier ways. I guess that'd be a question of my own to future readers ;)
echo abcdefg:12345:67890:abcde:12345:abcde |sed -e "s/[^:]*:[^:]*:\([^:]*\):.*/\1/"

this should work for you: the key part is awk -F: '$0=$3'
NewVar=$(getTheLineSomehow...|awk -F: '$0=$3')
example:
kent$ newVar=$(echo "abcdefg:12345:67890:abcde:12345:abcde"|awk -F: '$0=$3')
kent$ echo $newVar
67890
if your text was stored in var testString, you could:
kent$ echo $testString
abcdefg:12345:67890:abcde:12345:abcde
kent$ newVar=$(awk -F: '$0=$3' <<<"$testString")
kent$ echo $newVar
67890

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Bash, trim a "complicated" string to obtain a new string - string

If you really want to do this in pure Bash, you'll need to split the string by setting IFS and then using read with a "here string". For details, see here: How do I split a string on a delimiter in Bash? You will probably need to split it multiple times--once by underscore and then by dash, I guess.

If you don't mind awk: echo sno_Int-INT1_Exp-INT2_INT3.fits.fz_ovsc_rms_D4_D5_D6_D7_D8_D9 | awk -F_ 'BEGIN{OFS="_"}{sub(/.fits.fz/,"",$4);print $4,$5,$6,$7,$8,$9,$10,$11,$12}' INT3_ovsc_rms_D4_D5_D6_D7_D8_D9

This awk should work: s='1000_1051.fits.fz_ovsc_rms_10.6_2.35_53.2_0_5.92_2.14' awk -F'[_.]' 'NR==1{i3=$2;next} {printf "%s%s%s", i3, RS, $0}' RS='_ovsc_rms' <<< "$s" 1051_ovsc_rms_10.6_2.35_53.2_0_5.92_2.14

Related

How can I truncate a line of text longer than a given length?

Trim a string up to 4th delimiter from right side

bash Changing every other comma to point

AWK: how to extract in a block of text between two "\\" without considering the back to line

Extracting part of a string to a variable in bash

Categories

Resources