How to find the position of a character in a string in R

Suppose I have a string like:
x<-c("bv_bid_bayley_inf_development_f7r","bv_fci_family_care_indicator_f7r")
How can I find the positions of the first "_" (a) and the last "_" (b) so that I can call substr(x, a, b) in R? The desired output is:
bid_bayley_inf_development
fci_family_care_indicator

You can use regular expressions to extract the substring:
x <- c("bv_bid_bayley_inf_development_f7r", "bv_fci_family_care_indicator_f7r")
sub("[^_]*_(.*)_[^_]*", "\\1", x)
# [1] "bid_bayley_inf_development" "fci_family_care_indicator"

If you only need the positions, gregexpr returns the positions of every "_" in each element of x:
gregexpr("_", x)
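For comparison, the position-based idea from the question (find the first and last "_", then slice between them) can be sketched in Python, the language used elsewhere on this page. The helper name between_first_and_last is hypothetical, and returning an empty string when fewer than two separators exist is an assumption of this sketch, not something the question specifies.

```python
def between_first_and_last(s: str, sep: str = "_") -> str:
    """Return the text strictly between the first and last occurrence of sep."""
    first = s.find(sep)   # index of the first separator, or -1 if absent
    last = s.rfind(sep)   # index of the last separator, or -1 if absent
    if first == -1 or first == last:
        # Assumption: with fewer than two separators there is nothing in between.
        return ""
    return s[first + 1:last]


x = ["bv_bid_bayley_inf_development_f7r", "bv_fci_family_care_indicator_f7r"]
print([between_first_and_last(s) for s in x])
# ['bid_bayley_inf_development', 'fci_family_care_indicator']
```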

Related

Split string with commas while keeping numeric parts

I'm using the following function to separate strings with commas right at the capitals, as long as the capital is not preceded by a blank space.
import re
def func(x):
    y = re.findall(r'[A-Z][^A-Z\s]+(?:\s+\S[^A-Z\s]*)*', x)
    return ','.join(y)
However, when I try it on the following string, it removes the part with numbers.
Input = '49ersRiders Mapple'
Output = 'Riders Mapple'
I tried the following code but now it removes the 'ers' part.
def test(x):
    y = re.findall(r'\d+[A-Z]*|[A-Z][^A-Z\s]+(?:\s+\S[^A-Z\s]*)*', x)
    return ','.join(y)
Output = '49,Riders Mapple'
The output I'm looking for is this:
'49ers,Riders Mapple'
Is it possible to add this indication to my regex?
Thanks in advance
Maybe naive but why don't you use re.sub:
import re
def func(x):
    return re.sub(r'(?<!\s)([A-Z])', r',\1', x)
inp = '49ersRiders Mapple'
out = func(inp)
print(out)
# Output
49ers,Riders Mapple
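One caveat with the negative lookbehind above: it also succeeds at the very start of the string, so an input that begins with a capital would get a leading comma. A minimal sketch of one way around that, assuming a leading comma is unwanted, is to require a non-space character before the capital with a positive lookbehind. The function name split_on_capitals is hypothetical.

```python
import re


def split_on_capitals(text: str) -> str:
    # Insert a comma before every capital letter that directly follows
    # a non-whitespace character. At the start of the string there is no
    # preceding character, so the lookbehind fails and no comma is added.
    return re.sub(r'(?<=\S)([A-Z])', r',\1', text)


print(split_on_capitals('49ersRiders Mapple'))  # 49ers,Riders Mapple
print(split_on_capitals('RidersMapple'))        # Riders,Mapple
```

Like the original, this splits between consecutive capitals as well, so acronyms would be broken up letter by letter.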
Here is a regex re.findall approach:
import re
inp = "49ersRiders"
output = ','.join(re.findall('(?:[A-Z]|[0-9])[^A-Z]+', inp))
print(output) # 49ers,Riders
The regex pattern used here says to match:
(?:
    [A-Z]    a leading uppercase letter (try to find this first)
    |        OR
    [0-9]    a leading number (fallback for no uppercase)
)
[^A-Z]+      one or more non-capital letters following

How can I create a new string from an original string, replacing all non-instances of a character?

Let's say I have a string "Mississippi".
I want to create a new string from "Mississippi", replacing every character that is not a particular character.
For example, take the letter "S". In the new string, I want to keep all the S's in "MISSISSIPPI" and replace all the other letters with "_".
I know how to do the reverse:
word = "MISSISSIPPI"
word2 = word.replace("S", "_")
print(word2)
word2 gives me MI__I__IPPI
but I can't figure out how to get word2 to be __SS_SS____
(The classic Hangman Game)
You can use the sub function from Python's re module with a regular expression that uses a negated character set, such as
import re
line = re.sub(r"[^S]", "_", line)
This replaces any non S character with the desired character.
You could do this with str.maketrans() and str.translate(), but it would be easier with regular expressions. The trick is that you need to insert your string of valid characters into the regular expression programmatically:
import re
word = "MISSISSIPPI"
show = 'S' # augment as the game progresses
print(re.sub(r"[^{}]".format(show), "_", word))
A simpler way is to map a function across the string:
>>> ''.join(map(lambda w: '_' if w != 'S' else 'S', 'MISSISSIPPI'))
'__SS_SS____'
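For the Hangman use case, where the set of revealed letters grows over time, a regex-free version of the same idea might look like the sketch below. The name mask_word and its parameters are hypothetical, and treating guesses as case-insensitive is an assumption of this sketch. Unlike formatting the guesses into a character class, it also behaves sensibly when no letter has been guessed yet.

```python
def mask_word(word: str, guessed: str, placeholder: str = "_") -> str:
    """Keep characters of word that appear in guessed; mask all others."""
    keep = set(guessed.upper())  # assumption: guesses are case-insensitive
    return "".join(ch if ch.upper() in keep else placeholder for ch in word)


print(mask_word("MISSISSIPPI", "S"))   # __SS_SS____
print(mask_word("MISSISSIPPI", "SP"))  # __SS_SS_PP_
print(mask_word("MISSISSIPPI", ""))    # ___________
```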

string matching in matlab

I have two strings, a short one (S, of size 1x10) and a very long one (L, of size 1x1000), and I want to find the locations in L that match S.
In this specific matching, I am only interested in matching certain characters of S (the black ones). Is there any function or method in MATLAB that can match only specific characters (for example, the characters at positions 1, 5, and 9 of S)?
If I understand your question correctly, you want to find substrings in L that contain the same letters (characters) as S in certain positions (let's say given by array idx). Regular expressions are ideal here, so I suggest using regexp.
In regular expressions, a dot (.) matches any character, and curly braces ({}) optionally specify the number of desired occurrences. For example, to match a string of length 6 where the second character is 'a' and the fifth is 'b', the regular expression could be written in any of the following ways:
.a..b.
.a.{2}b.
.{1}a.{2}b.{1}
Any of these is correct. So let's construct a regular expression pattern first:
in = num2cell(diff([0; idx(:); numel(S) + 1]) - 1); %// Intervals
ch = num2cell(S(idx(:))); %// Matched characters
C = [in(:)'; ch(:)', {''}];
pat = sprintf('.{%d}%c', C{:}); %// Pattern for regexp
Now all that is left is to feed regexp with L and the desired pattern:
loc = regexp(L, pat)
and voila!
Example
Let's assume that:
S = 'wbzder'
L = 'gabcdexybhdef'
idx = [2 4 5]
First we build a pattern:
in = num2cell(diff([0; idx(:); numel(S) + 1]) - 1);
ch = num2cell(S(idx(:)));
C = [in(:)'; ch(:)', {''}];
pat = sprintf('.{%d}%c', C{:});
The pattern we get is:
pat =
.{1}b.{1}d.{0}e.{1}
Obviously we can add code that beautifies this pattern into .b.de., but this is really an unnecessary optimization (regexp can handle the former just as well).
After we do:
loc = regexp(L, pat)
we get the following result:
loc =
2 8
Seems correct.
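The same pattern-building idea can be sketched in Python. The function names build_pattern and find_matches are hypothetical; idx is kept 1-based to mirror the MATLAB example, and the results are reported as 1-based start positions for the same reason. This sketch wraps the pattern in a lookahead so that overlapping occurrences are reported too, which is a design choice here rather than something the answer above does.

```python
import re


def build_pattern(s: str, idx: list) -> str:
    # idx holds 1-based positions of s that must match exactly;
    # every other position becomes a wildcard dot.
    wanted = set(idx)
    return "".join(
        re.escape(ch) if pos in wanted else "."
        for pos, ch in enumerate(s, start=1)
    )


def find_matches(long_text: str, pattern: str) -> list:
    # A zero-width lookahead lets finditer report overlapping matches.
    # re.DOTALL makes "." match newlines as well.
    regex = re.compile(f"(?=(?:{pattern}))", re.DOTALL)
    return [m.start() + 1 for m in regex.finditer(long_text)]


S = "wbzder"
L = "gabcdexybhdef"
pat = build_pattern(S, [2, 4, 5])
print(pat)                   # .b.de.
print(find_matches(L, pat))  # [2, 8]
```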

How to extract substrings from this string?

The string is
And I want to get substrings "11","1.1","282". Can anyone show me how to do this in R? Thanks!
I believe strsplit(x," +")[[1]] will do it. The regular expression " +" denotes one or more spaces. strsplit applies to character vectors and returns a list with the split version of each element in the vector, so [[1]] extracts the first (and only) component.
> x = "11 1.1 282"
> res <- strsplit(x, " +")
> res
[[1]]
[1] "11" "1.1" "282"
>

R: How can I replace let's say the 5th element within a string?

I would like to convert a string like be33szfuhm100060 into BESZFUHM0060.
In order to replace the small letters with capital letters I've so far used the gsub function.
test1=gsub("be","BE",test)
Is there a way to tell this function to replace the 3rd and 4th string element? If not, I would really appreciate if you could tell me another way to solve this problem. Maybe there is also a more general solution to change a string element at a certain position into a capital letter whatever the element is?
A couple of observations:
Converting a string to uppercase can be done with toupper, e.g.:
> toupper('be33szfuhm100060')
[1] "BE33SZFUHM100060"
You could use substr to extract a substring by character positions and paste to concatenate strings:
> x <- 'be33szfuhm100060'
> paste(substr(x, 1, 2), substr(x, 5, nchar(x)), sep='')
[1] "beszfuhm100060"
As an alternative, if you are going to be doing this a lot:
String <- function(x="") {
  x <- as.character(paste(x, collapse=""))
  class(x) <- c("String","character")
  return(x)
}
"[.String" <- function(x,i,j,...,drop=TRUE) {
  unlist(strsplit(x,""))[i]
}
"[<-.String" <- function(x,i,j,...,value) {
  tmp <- x[]
  tmp[i] <- String(value)
  x <- String(tmp)
  x
}
print.String <- function(x, ...) cat(x, "\n")
## try it out
> x <- String("be33szfuhm100060")
> x[3:4] <- character(0)
> x
beszfuhm100060
You can use substring to remove the third and fourth elements.
x <- "be33szfuhm100060"
paste(substring(x, 1, 2), substring(x, 5), sep = "")
If you know what portions of the string you want based on their position(s), use substr or substring. As I mentioned in my comment, you can use toupper to coerce characters to uppercase.
paste(toupper(substr(test, 1, 2)),
      toupper(substr(test, 5, 10)),
      substr(test, 12, nchar(test)), sep = "")
# [1] "BESZFUHM00060"
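For readers coming from the Python threads on this page, the position-based approach (cut out a span by position, then uppercase) can be sketched with slicing. The helper name drop_span is hypothetical, and the 1-based inclusive positions are chosen only to mirror R's substr; this is an analogue of the substr/paste idea, not a translation of any one answer above.

```python
def drop_span(s: str, start: int, end: int) -> str:
    """Remove the characters at 1-based positions start..end (inclusive)."""
    # s[:start - 1] keeps everything before the span,
    # s[end:] keeps everything after it.
    return s[:start - 1] + s[end:]


x = "be33szfuhm100060"
print(drop_span(x, 3, 4))          # beszfuhm100060
print(drop_span(x, 3, 4).upper())  # BESZFUHM100060
```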
