How can I get a substring from a string in Linux? - linux

I am trying to extract a specific substring from a string in Linux.
For example, I want to extract 'android.content.pm.PackageParser.parseBaseApplication' from the string below.
The string has a regular format, and only the text within the parentheses changes.
Join point 'method-execution(boolean android.content.pm.PackageParser.parseBaseApplication(android.content.pm.PackageParser$Package, android.content.res.Resources, org.xmlpull.v1.XmlPullParser, android.util.AttributeSet, int, java.lang.String[]))' in Type
However, I am having trouble finding a proper approach.
At first I tried the sed command, but it got too complicated and I couldn't finish the work.
Could you recommend another approach?
Thanks a lot.

If the string of interest is always the second word after the first (, then
echo "..." | awk -F '[()]' '{split($2,a," "); printf "%s\n", a[2]}'
extracts it.
It splits the line using ( and ) as delimiters, so $2 is the text between the first ( and the next parenthesis. split then breaks $2 on spaces, and the second element is
android.content.pm.PackageParser.parseBaseApplication
for your example.

This looks like AOP syntax. So, under certain assumptions, this can be done as:
echo "Join point...." | cut -d'(' -f2 | cut -d' ' -f2
Explanation: the first cut splits on ( and takes the second field, which is the method signature without its parameters. Since we are not interested in the return type either, the second cut splits that signature on spaces and takes the second field, which is the method name.
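To see what each cut stage produces, here is a minimal sketch that runs the pipeline on the sample line from the question. The variable names line, signature, and method are only illustrative, and the approach assumes the return type is a single word with no spaces in it.

```shell
# Sample line from the question; \$ keeps the shell from expanding $Package.
line="Join point 'method-execution(boolean android.content.pm.PackageParser.parseBaseApplication(android.content.pm.PackageParser\$Package, android.content.res.Resources, org.xmlpull.v1.XmlPullParser, android.util.AttributeSet, int, java.lang.String[]))' in Type"

# Stage 1: field 2 of the '('-separated line is "<return type> <method name>".
signature=$(printf '%s\n' "$line" | cut -d'(' -f2)

# Stage 2: field 2 of the space-separated signature is the method name.
method=$(printf '%s\n' "$signature" | cut -d' ' -f2)

printf '%s\n' "$signature"
printf '%s\n' "$method"
```

If the return type could ever contain a space, the second cut would pick the wrong field, so one of the regex-based answers is probably the safer choice in that case.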

Based on your stated invariant, that the substring you're capturing is the only part that varies from file to file, here is a perl solution:
Extract=$(perl -ne 'print $1 if /\s*Join point \x27method-execution\(boolean\s+([^(]*)/' file_to_search)
echo "$Extract"
android.content.pm.PackageParser.parseBaseApplication
I used the full lead-in because it reduces the chance of a false positive, but if you find that other things change and want to use only a substring of it (e.g., "method-execution(boolean "), that's your choice to make.
The pattern matches up to where the variant substring starts, and that substring runs to the next invariant, the open parenthesis, so we can just capture everything that is not an open parenthesis. Since it's probably some human interaction changing the variant, I allowed for extra spaces with \s+ (one or more whitespace characters).
You could use almost the same regex with sed, but you would need to consume the entire string to keep the rest of it out of the output, e.g., in shorthand:
sed -r 's/.*LEAD_IN(CAPTURE_TEXT).*/\1/'
where LEAD_IN is the constant leader, "Join point...", and CAPTURE_TEXT is the same capture group as in the perl solution. The main difference is the leading and trailing ".*", which consume the rest of the subject.
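As a rough illustration of that shorthand, the sketch below fills in LEAD_IN with the shorter "method-execution(boolean " leader and CAPTURE_TEXT with the same "anything but an open parenthesis" group. It uses -E rather than -r (GNU sed accepts both), and the variable names are made up for the example.

```shell
# Sample line from the question; \$ keeps the shell from expanding $Package.
line="Join point 'method-execution(boolean android.content.pm.PackageParser.parseBaseApplication(android.content.pm.PackageParser\$Package, android.content.res.Resources, org.xmlpull.v1.XmlPullParser, android.util.AttributeSet, int, java.lang.String[]))' in Type"

# .*            consumes everything before the lead-in
# ([^(]*)       captures the method name, up to the next open parenthesis
# \(.*          consumes the parameter list and the rest of the line
method=$(printf '%s\n' "$line" |
    sed -E 's/.*method-execution\(boolean +([^(]*)\(.*/\1/')

printf '%s\n' "$method"
```

Lines that do not contain the lead-in pass through unchanged, so when reading from a file you may want sed -n with the p flag on the substitution to print only the lines that matched.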

Related

Insert a character before the occurrence of a substring within a string

Suppose I have a string with the following format:
This string is in a file called motif.txt
str='ATTCTCGGTGA'
Within this string is a substring, with a variable name sub_str.
sub_str='CGG'
How do I make it so that I can insert the character 'A' right before the substring?
example_output='ATTCTCAGGTGA'
In this case suppose you cannot format the string based on the position of the characters. For instance you're not allowed to just put an A after the 5th index in the string. The string could vary in length and thus this implementation needs to handle strings of various sizes.
sed 's/CGG/N&/g' motif.txt
How do I make it so that I can insert the character 'A' right before the substring?
You offered this potential solution.
$ echo ATTCTCGGTGA | sed 's/CGG/N&/g'
ATTCTNCGGTGA
But alas! It inserts an N. Let's change that to an A.
$ echo ATTCTCGGTGA | sed 's/CGG/A&/g'
ATTCTACGGTGA
Problem solved.
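Since the question keeps the substring in a variable, here is a small sketch of the same substitution driven by sub_str instead of a hard-coded CGG. It assumes sub_str contains no regex metacharacters and no slash; result is just an illustrative name.

```shell
str='ATTCTCGGTGA'
sub_str='CGG'

# Double quotes let the shell expand $sub_str before sed sees the script;
# & in the replacement stands for the text that was matched.
result=$(printf '%s\n' "$str" | sed "s/$sub_str/A&/g")

printf '%s\n' "$result"
```

For the file in the question, the same sed script can be pointed at motif.txt instead of reading from a pipe.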
I think you need to fix your question and do some research.
First, as pointed out elsewhere, you say you want to substitute in an A, but your sed uses an N. You also say you want the adenine "right before" your CGG substring, but your example output tucks it in behind the cytosine.
Rethink, clarify, edit.
If what you meant is what you showed, and it applies to every occurrence of $str in the file, then just simplify your logic.
sed 's/ATTCTCGGTGA/ATTCTCAGGTGA/g' motif.txt > new.txt
Check your output, and if you're satisfied, mv new.txt motif.txt.
If you're trying to abstract this so that several sequences can be corrected, you should still probably replace the minimum uniquely identifiable sequence with its new version in its entirety, rather than looking for a unique sequence and then scanning it for a subsequence to alter, which is going to be more error prone.

How to capitalize and replace characters in shell script in one echo

I am trying to find a way to capitalize and replace dashes of a string in one echo. I do not have the ability to use multiple lines for reassigning the string value.
For example:
string='test-e2e-uber' needs to echo $string as TEST_E2E_UBER
I currently can do one or the other by utilizing
${string^^} for capitalization
${string//-/_} for replacement
However, when I try to combine them it does not appear to work (bad substitution error).
Is there a correct syntax to achieve this?
echo ${string^^//-/_}
This does not directly answer your question, but the following script still achieves what you want:
declare -u string='test-e2e-uber'
echo ${string//-/_}
You can do that directly with the tr command, in just one echo:
echo "$string" | tr "-" "_" | tr "[:lower:]" "[:upper:]"
TEST_E2E_UBER
I used a separate tr call for each conversion and chained them with a pipe.
Or you could do something similar with awk:
echo "$string" | awk '{gsub("-","_",$0)} {print toupper($0)}'
TEST_E2E_UBER
In this case, I'm replacing the hyphens with gsub, then printing the whole record in uppercase.
Why do you dislike having two successive assignment statements so much? If you really hate it, you will have to resort to some external program to do the task for you, such as
string=$(tr a-z- A-Z_ <<<"$string")
but I would consider it a waste of resources to create a child process for such a simple operation.
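If a child process is acceptable, both conversions fit into a single tr call inside one echo. The sketch below wraps it in a helper whose name, to_upper_snake, is invented for this example.

```shell
# Hypothetical helper: uppercase the argument and turn dashes into underscores.
# In the sets, a-z maps onto A-Z and the trailing - maps onto _.
to_upper_snake() {
    printf '%s\n' "$1" | tr 'a-z-' 'A-Z_'
}

string='test-e2e-uber'
echo "$(to_upper_snake "$string")"
```

The original string variable is left untouched, which seems to match the constraint of not reassigning it.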

grep with variable numbers of whitespaces

I want to grep for a string which I know exists in my files. However, the source it comes from changed the amount of whitespace, so the content of the string is the same but its length differs, and an ordinary grep does not find it. Is there a way to adjust for this?
I don't see a pattern behind the additional whitespace.
Here's the original string
4FD0-A tr|A5ZLA0|A5ZLA0_9BACE Bacterial
and here's the modified string
4FD0-A tr|A5ZLA0|A5ZLA0_9BACE Bacterial
I believe that egrep would be your friend here. Try the following command:
egrep '4FD0-A\s+tr[|]A5ZLA0[|]A5ZLA0_9BACE\s+Bacterial' filename
I used a rather simple pattern for my example. Feel free to change it to suit.
You can use the "one or more" + operator. For example
grep '^4FD0-A\s\+tr|' myfile
The \s matches any whitespace character. If you want to limit it to spaces only, use a single space in place of the \s.
You can use all kinds of utilities to squeeze out the extra spaces.
Try this:
tr -s ' ' < file | grep '4FD0-A tr|A5ZLA0|A5ZLA0_9BACE Bacterial'
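The \s and \+ forms used above are GNU extensions. As a sketch of a more portable variant, the example below uses grep -E with the POSIX class [[:space:]] and checks it against a throwaway file containing both spacings. The file contents and the variable names are made up for the demonstration.

```shell
# Build a small test file: same text, different amounts of whitespace.
tmp=$(mktemp)
printf '4FD0-A tr|A5ZLA0|A5ZLA0_9BACE Bacterial\n' > "$tmp"
printf '4FD0-A    tr|A5ZLA0|A5ZLA0_9BACE      Bacterial\n' >> "$tmp"

# [[:space:]]+ matches one or more whitespace characters;
# [|] matches a literal pipe, which would otherwise mean alternation in -E.
pattern='4FD0-A[[:space:]]+tr[|]A5ZLA0[|]A5ZLA0_9BACE[[:space:]]+Bacterial'
count=$(grep -Ec "$pattern" "$tmp")

printf '%s\n' "$count"
rm -f "$tmp"
```

Both lines match, so the count printed is the number of lines in the test file.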

How can I remove a doubled section of a string?

I'm having trouble with data manipulation in a txt file. My file currently looks like this:
HG02239 -23.42333333
NA06985NA06985 -20.125
NA06991NA06991 -20.92
This shows some of my tab-delimited data. Half the entries are in the correct seven-characters (letterletternumbernumbernumbernumbernumber) format, but some are doubled up. I want to go into the second column (first column is empty for a reason!) and remove the repeats in the string so it would read
HG02239 -23.42333333
NA06985 -20.125
NA06991 -20.92
I can't work out how to do this with sed/awk on a per column basis. I feel like I should be able to write a regex, but because the data is a repeat, I don't want to lose the first half of the string; and I can't work out how to cut on a specific column, or I would just delete the 7th character. Any help much appreciated!
Solution
You can solve this with a backreference. For example, using GNU sed:
$ cat << EOF | sed --regexp-extended 's/(.{7})\1/\1/'
HG02239 -23.42333333
NA06985NA06985 -20.125
NA06991NA06991 -20.92
EOF
HG02239 -23.42333333
NA06985 -20.125
NA06991 -20.92
If you aren't using GNU sed, you may need to escape the capture groups. In addition, you can tune the regular expression if you need a more accurate character match.
Explanation
The cat pipeline is just a here-document to make it easy to display and test the code. You can call sed directly on your file, or use the -i flag to perform an in-place edit when you're comfortable with the results.
The sed script does the following:
It stores any group of 7 consecutive characters in a capture group using an "interval expression" (the number in the curly braces).
The \1 is a backreference that matches the first capture group.
The match looks for "a capture group followed by a copy of the capture group."
The substitution replaces the match with a single copy of the capture group.
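For a sed without --regexp-extended, the same idea can be sketched in basic regular expression syntax, where the group and the interval braces are escaped. The sample data is copied from the question with single spaces as shown here; input and deduped are illustrative names.

```shell
input='HG02239 -23.42333333
NA06985NA06985 -20.125
NA06991NA06991 -20.92'

# \(.\{7\}\)  captures any 7 characters
# \1          requires an immediate repeat of those same 7 characters
# The replacement keeps just one copy.
deduped=$(printf '%s\n' "$input" | sed 's/\(.\{7\}\)\1/\1/')

printf '%s\n' "$deduped"
```

Lines whose ID is not doubled contain no repeated 7-character run, so they come out unchanged.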
One way, using awk:
awk '{ print substr($1, 1, 7), $2 }' file.txt
Output:
HG02239 -23.42333333
NA06985 -20.125
NA06991 -20.92
You could use something like this:
sed -i 's|\([A-Z]\{2\}[0-9]\{5\}\)[A-Z0-9]*\s*\(.*\)|\1 \2|g' <your-file>

A Linux Shell Script Problem

I have a dot-separated string in a Linux shell:
example=This.is.My.String
I want to:
1. Add some string before the last dot. For example, I want to add "Good.Long" before the last dot, so I get:
This.is.My.Goood.Long.String
2. Get the part after the last dot, so I get:
String
3. Turn every dot except the last one into an underscore, so I get:
This_is_My.String
If you have time, please explain a little; I am still learning regular expressions.
Thanks a lot!
I don't know what you mean by 'Linux Shell' so I will assume bash. This solution will also work in zsh, etcetera:
example=This.is.My.String
before_last_dot=${example%.*}
after_last_dot=${example##*.}
echo ${before_last_dot}.Goood.Long.${after_last_dot}
This.is.My.Goood.Long.String
echo ${before_last_dot//./_}.${after_last_dot}
This_is_My.String
The interim variables before_last_dot and after_last_dot should explain my use of the % and ## operators. I think the // is also self-explanatory, but I'd be happy to clarify if you have any questions.
This doesn't use sed (or even regular expressions), but bash's inbuilt parameter substitution. I prefer to stick to just one language per script, with as few forks as possible :-)
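For anyone unsure how the single and doubled operators differ, this short sketch runs all four suffix and prefix removals on the same string. The variable names on the left are invented for the example.

```shell
example=This.is.My.String

# % and %% strip a matching suffix: shortest match vs longest match.
shortest_suffix_removed=${example%.*}    # This.is.My
longest_suffix_removed=${example%%.*}    # This

# # and ## strip a matching prefix: shortest match vs longest match.
shortest_prefix_removed=${example#*.}    # is.My.String
longest_prefix_removed=${example##*.}    # String

printf '%s\n' "$shortest_suffix_removed" "$longest_suffix_removed" \
    "$shortest_prefix_removed" "$longest_prefix_removed"
```

So "everything before the last dot" is the shortest suffix removal, and "everything after the last dot" is the longest prefix removal.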
Other users have given good answers for #1 and #2. There are some disadvantages to some of the answers for #3. In one case, you have to run the substitution twice. In another, if your string has other underscores they might get clobbered. This command works in one go and only affects dots:
sed 's/\(.*\)\./\1\n./;h;s/[^\n]*\n//;x;s/\n.*//;s/\./_/g;G;s/\n//'
1. It splits the line before the last dot by inserting a newline, and copies the result into hold space:
s/\(.*\)\./\1\n./;h
2. It removes everything up to and including the newline from the copy in pattern space, then swaps hold space and pattern space:
s/[^\n]*\n//;x
3. It removes everything from the newline onward from the copy that's now in pattern space:
s/\n.*//
4. It changes all dots into underscores in the copy in pattern space, and appends hold space onto the end of pattern space:
s/\./_/g;G
5. It removes the newline that the append operation adds:
s/\n//
Then the sed script is finished and the pattern space is output.
At the end of each numbered step (some consist of two actual steps):
Step  Pattern Space          Hold Space
1     This.is.My\n.String    This.is.My\n.String
2     This.is.My\n.String    .String
3     This.is.My             .String
4     This_is_My\n.String    .String
5     This_is_My.String      .String
Solution
1. Two versions of this, too:
Complex: sed 's/\(.*\)\([.][^.]*$\)/\1.Goood.Long\2/'
Simple: sed 's/.*\./&Goood.Long./' - thanks, Dennis Williamson
2. What do you want?
Complex: sed 's/.*[.]\([^.]*\)$/\1/'
Simpler: sed 's/.*\.//' - thanks, glenn jackman.
3. sed 's/\([^.]*\)[.]\([^.]*[.]\)/\1_\2/g'
With 3, you probably need to run the substitute (in its entirety) at least twice, in general.
Explanation
Remember, in sed, the notation \(...\) is a 'capture' that can be referenced as '\1' or similar in the replacement text.
1. Capture everything up to a string starting with a dot followed by a sequence of non-dots (which you also capture); replace by what came before the last dot, the new material, and the last dot and what came after it.
2. Ignore everything up to the last dot followed by a capture of a sequence of non-dots; replace with the capture only.
3. Find and capture a sequence of non-dots, a dot (not captured), followed by a sequence of non-dots and a dot; replace the first dot with an underscore. This is done globally, but the second and subsequent matches won't touch anything already matched. Therefore, I think you need floor(log2(N)) + 1 passes, where N is the number of dots to be replaced. One pass deals with 1 dot to replace; two passes deal with 2 or 3; three passes deal with 4-7, and so on.
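Instead of repeating the whole substitution by hand, one option is to let sed loop until nothing changes. This is not the substitution from the answer above but a swapped-in variant using a label and the t command; it replaces one dot per iteration, always the leftmost dot that still has another dot after it.

```shell
# :a   defines a label
# s    turns the leftmost dot that is followed by a later dot into _
# ta   jumps back to :a whenever the substitution changed something
dots_to_underscores() {
    sed -e ':a' -e 's/\.\(.*\.\)/_\1/' -e 'ta'
}

example=This.is.My.String
result=$(printf '%s\n' "$example" | dots_to_underscores)
printf '%s\n' "$result"
```

The loop stops once only the last dot remains, because that dot has no later dot to satisfy the pattern. The function name dots_to_underscores is made up for the sketch.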
Here's a version that uses Bash's regex matching (Bash 3.2 or greater).
[[ $example =~ ^(.*)\.(.*)$ ]]
echo ${BASH_REMATCH[1]//./_}.${BASH_REMATCH[2]}
Here's a Bash version that uses IFS (Internal Field Separator).
saveIFS=$IFS
IFS=.
array=($example) # * split the string at each dot
lastword=${array[@]: -1}
unset "array[${#array[@]}-1]" # *
IFS=_
echo "${array[*]}.$lastword" # The asterisk as a subscript when inside quotes causes IFS (an underscore in this case) to be inserted between each element of the array
IFS=$saveIFS
* use declare -p array after these steps to see what the array looks like.
1.
$ echo 'This.is.my.string' | sed 's}[^\.][^\.]*$}Good Long.&}'
This.is.my.Good Long.string
before: a run of non-dot characters up to the end of the line, i.e. everything after the last dot. after: & stands for whatever the pattern matched.
2.
$ echo 'This.is.my.string' | sed 's}.*\.}}'
string
sed matches greedily, so it extends the first closure (.*) as far as possible, i.e. to the last dot.
3.
$ echo 'This.is.my.string' | tr . _ | sed 's/_\([^_]*\)$/\.\1/'
This_is_my.string
convert all dots to _, then turn the last _ to a dot.
(caveat: this will turn 'This.is.my.string_foo' to 'This_is_my_string.foo', not 'This_is_my.string_foo')
You don't need regular expressions at all (those complex things hurt my eyes!) if you use Awk and are a little creative.
1. echo $example| awk -v ins="Good.long" -F . '{OFS="."; $NF = ins"."$NF;print}'
What this does:
-v ins="Good.long" tells awk to create a variable called 'ins' with "Good.long" as content,
-F . tells awk to use the dot as a separator for your fields for input,
OFS="." inside the script tells awk to use the dot as the separator between fields on output,
NF is the number of fields, so $NF represents the last field,
the $NF=... part replaces the last field, it appends the current last string to what you want to insert (the variable called "ins" declared earlier).
2. echo $example| awk -F . '{print $NF}'
$NF is the last field, so that's all!
3. echo $example| awk -F . '{OFS="_"; $(NF-1) = $(NF-1)"."$NF; NF=NF-1; print}'
Here we have to be creative, as Awk AFAIK doesn't have a command for deleting fields. Of course, we set the output field separator to an underscore.
$(NF-1) = $(NF-1)"."$NF: First, we replace the second-to-last field with itself glued to the last field, with a dot between them.
Then we fool awk into thinking that the number of fields is one less than it really is, hence deleting the last field!
Note you can't say $NF="", because then it would display two underscores.
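If changing NF feels too fragile (not every awk rebuilds the record the same way after NF is decremented), a possible alternative is to build the output string in a loop. This is a different technique from the one above, sketched under the same assumption that fields are separated by dots.

```shell
example=This.is.My.String

# Join all fields except the last with "_", then attach the last with ".".
result=$(printf '%s\n' "$example" | awk -F. '{
    out = $1
    for (i = 2; i < NF; i++)
        out = out "_" $i
    if (NF > 1)
        out = out "." $NF
    print out
}')

printf '%s\n' "$result"
```

A string with no dot at all has a single field and is printed unchanged.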

Resources