Inconsistent behavior between str_split and strsplit

The documentation for str_split in the stringr package states that for the pattern argument:
If "" splits into individual characters.
which suggests it behaves the same as strsplit in this regard. However,
library(stringr)
str_split("abcab","")
[[1]]
[1] "" "a" "b" "c" "a" "b"
with a leading empty string. This compares with,
strsplit("abcab","")
[[1]]
[1] "a" "b" "c" "a" "b"
Leading empty strings seem to be normal behavior when splitting on non-empty strings,
strsplit("abcab","ab")
[[1]]
[1] "" "c"
but even then, str_split generates an 'extra' trailing empty string:
str_split("abcab","ab")
[[1]]
[1] "" "c" ""
Is this discrepancy a bug, a feature, an error in the documentation, or just a different notion of what's 'expected behavior'?

If you use commas as delimiters, the "expected" (your mileage may vary) result is more obvious:
# expect "" "2" "3" "4" ""
strsplit(",2,3,4,", ",")
# [[1]]
# [1] "" "2" "3" "4"
str_split(",2,3,4,", ",")
# [[1]]
# [1] "" "2" "3" "4" ""
If I have n commas then I expect (n+1) elements to be returned, so I prefer the results from str_split. However, I wouldn't necessarily call this a bug in strsplit, since it performs as advertised:
(from ?strsplit) Note that this means that if there is a match at the beginning of
a (non-empty) string, the first element of the output is ‘""’, but
if there is a match at the end of the string, the output is the
same as with the match removed.
"" is trickier, as there is no way to count the number of times "" appears in a string. Therefore treating it as a special case seems justified.
(from ?str_split) If ‘""’ splits into individual characters.
Based on this I suggest you have found a bug and should take hadley's advice and report it!
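For comparison, the (n+1) expectation can be checked in another language. The following Python sketch is only an illustration and says nothing about either R function: Python's built-in str.split keeps both the leading and the trailing empty string, and it rejects an empty separator outright, so splitting into characters is spelled list() instead.

```python
# Illustration only: Python's str.split, not R's strsplit or stringr's str_split.
text = ",2,3,4,"
parts = text.split(",")
# four commas give five pieces, including the empty ones at both ends
print(parts)

# An empty separator is rejected rather than treated as a special case
try:
    "abcab".split("")
    empty_sep_allowed = True
except ValueError:
    empty_sep_allowed = False
print(empty_sep_allowed)

# Splitting into individual characters is spelled differently
chars = list("abcab")
print(chars)
```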

Related

VBA: How to find the values after a "#" symbol in a string

I am trying to set the letters after a # symbol to a variable.
For example, x = #BAL
I want to set y = BAL
Or x = #NE
I want y = NE
I am using VBA.
Split() in my opinion is the easiest way to do it:
Dim myStr As String
myStr = "#BAL"
If InStr(myStr, "#") > 0 Then '<-- Check that the string contains the delimiter so that (1) cannot throw an error
MsgBox Split(myStr, "#")(1)
End If
As wisely pointed out by Scott Craner, you should check that the string actually contains the character. In his comment he sidesteps the problem by taking the last element: y = Split(x, "#")(UBound(Split(x, "#"))). Another way is to test with InStr() first: If InStr(x, "#") > 0 Then...
The (1) takes the text after the first instance of the character you are looking for (up to the next instance, if there is one). Had you used (0), you would get everything before the first #.
Similar but different example:
Dim myStr As String
myStr = "#BAL#TEST"
MsgBox Split(myStr, "#")(2)
The message box would have returned TEST because you used (2), which is the text after the second instance of your # character.
Then you can even split them into an array:
Dim myStr As String, splitArr() As String
myStr = "#BAL#TEST"
splitArr = Split(myStr, "#") '< -- don't append the index this time
MsgBox SplitArr(1) '< -- This would return "BAL"
MsgBox SplitArr(2) '< -- This would return "TEST"
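The indexing described above can be sketched in Python as well, assuming its str.split is an acceptable stand-in for VBA's Split here (both give a zero-based sequence in which element 0 is whatever precedes the first delimiter). The variable names are made up for the illustration.

```python
# Python stand-in for the VBA Split examples above (illustration only)
my_str = "#BAL#TEST"
split_arr = my_str.split("#")

print(split_arr[0])  # text before the first "#" (empty here)
print(split_arr[1])  # text between the first and second "#"
print(split_arr[2])  # text after the second "#"

# Guard against a missing delimiter before indexing, like the InStr check
no_hash = "BAL"
value = no_hash.split("#")[1] if "#" in no_hash else None
print(value)
```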
If you are looking for additional reading, here is more from the MSDN:
Split Function
Description: Returns a zero-based, one-dimensional array containing a specified number of substrings.
Syntax: Split( expression [ ,delimiter [ ,limit [ ,compare ]]] )
The Split function syntax has these named arguments:
expression
Required. String expression containing substrings and delimiters. If expression is a zero-length string (""), Split returns an empty array, that is, an array with no elements and no data.
delimiter
Optional. String character used to identify substring limits. If omitted, the space character (" ") is assumed to be the delimiter. If delimiter is a zero-length string, a single-element array containing the entire expression string is returned.
limit
Optional. Number of substrings to be returned; -1 indicates that all substrings are returned.
compare
Optional. Numeric value indicating the kind of comparison to use when evaluating substrings. See Settings section for values.
You can do the following to get the substring after the # symbol.
x = "#BAL"
y = Right(x,len(x)-InStr(x,"#"))
Where x can be any string, with characters before or after the # symbol.
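The Right/InStr approach has a direct counterpart in other languages. As a rough sketch in Python (the helper name after_hash is hypothetical), str.partition returns the text after the first # and gives an empty string when no # is present:

```python
def after_hash(x: str) -> str:
    """Return everything after the first '#' in x, or '' if x has no '#'."""
    _before, _sep, after = x.partition("#")
    return after

print(after_hash("#BAL"))
print(after_hash("abc#NE"))
print(after_hash("no marker"))
```

One difference to keep in mind: the VBA one-liner returns the whole string when # is absent, because InStr is then 0.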

How do I get from string "3+10" to strings "3" "+" "10"?

I'm making a graphing calculator in Unity and I have input with strings like "3+10" and I want to split it to "3","+" and "10".
I can figure out a way to deal with them once I've got them to this form, but I really need a way to split the string to the left and right of key characters such as plus, times, exponent, etc.
I'm doing this in Unity, but a way to do this in any language should help.
C#
The following code will do what you asked for (and nothing more).
// requires: using System.Text.RegularExpressions;
string input = "3+10-5";
string pattern = @"([-+^*\/])";
string[] substrings = Regex.Split(input, pattern);
// results in substrings = {"3", "+", "10", "-", "5"}
By using Regex.Split instead of String.Split you are able to retrieve the math operators as well. This is done by putting the math operators in a capture group ( ). If you're not familiar with regular expressions you should google the basics.
The code above will stubbornly use the math operators to split your string. If the string doesn't make sense, the method doesn't care and may even produce unexpected results. For example "5//10-" will result in {"5", "/", "", "/", "10", "-", ""}. Note that empty strings are added between the two slashes and after the trailing operator.
You can use more complex regular expressions to check if your string is a valid mathematical expression before you try to split it. For example ^\d+(?:\.\d+)?(?:[-+*^\/]\d+(?:\.\d+)?)*$ would check if your string consists of a decimal number followed by zero or more combinations of an operator and another decimal number.
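The two steps described above, validating first and then splitting with a capture group, can be sketched as follows. This is a Python illustration rather than Unity C#, and the function name tokenize and the exact validation pattern are assumptions made for the example; Python's re.split keeps captured delimiters in the result in the same way.

```python
import re

# number, then zero or more (operator, number) pairs; an assumed grammar
VALID = re.compile(r"\d+(?:\.\d+)?(?:[-+*/^]\d+(?:\.\d+)?)*")
OPERATORS = re.compile(r"([-+*/^])")  # the capture group keeps the operators

def tokenize(expression):
    """Split a simple arithmetic expression into numbers and operators."""
    if VALID.fullmatch(expression) is None:
        raise ValueError(f"not a simple arithmetic expression: {expression!r}")
    return OPERATORS.split(expression)

print(tokenize("3+10-5"))
```

Because the input is validated before it is split, malformed input such as "5//10-" raises an error instead of producing empty tokens.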
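Here is the C# way, which I mention because you are using Unity. Note that with a null separator, Split breaks the string on whitespace, so this only helps if the tokens are separated by spaces.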
words = phrase.Split(default(string[]),StringSplitOptions.RemoveEmptyEntries);
https://msdn.microsoft.com/en-us/library/tabh47cf%28v=vs.110%29.aspx
Here is Java code for splitting a String by math operators
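// requires: import java.util.Arrays;
String[] splitByOperators(String input) {
    // worst case: every character is an operator, giving 2 * length + 1 tokens
    String[] output = new String[2 * input.length() + 1];
    int index = 0;
    String current = "";
    for (char c : input.toCharArray()) {
        if (c == '+' || c == '-' || c == '*' || c == '/') {
            output[index] = current;            // operand collected so far
            index++;
            output[index] = String.valueOf(c);  // the operator itself
            index++;
            current = "";
        } else {
            current = current + c;
        }
    }
    output[index] = current;
    // drop the unused tail of the array
    return Arrays.copyOf(output, index + 1);
}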
Using Python regular expressions:
>>> import re
>>> match = re.search(r'(\d+)([-+*/^])(\d+)', "3+10")
>>> match.group(1)
'3'
>>> match.group(2)
'+'
>>> match.group(3)
'10'
The reason for using regular expressions is for greater flexibility in handling a variety of simple arithmetic expressions.
R: EDITED
Take your input vector as x<-c("3+10", "4/12" , "8-3" ,"12*1","1+2-3*4/8").
We can use the following string split based on regex:
> strsplit(x,split="(?<=\\d)(?=[-+*/])|(?<=[-+*/])(?=\\d)",perl=TRUE)
[[1]]
[1] "3" "+" "10"
[[2]]
[1] "4" "/" "12"
[[3]]
[1] "8" "-" "3"
[[4]]
[1] "12" "*" "1"
[[5]]
[1] "1" "+" "2" "-" "3" "*" "4" "/" "8"
How it works:
Split the string when one of two things is found:
A digit followed by an arithmetic operator. (?<=\\d) matches a position immediately preceded by a digit, while (?=[-+*/]) matches a position immediately followed by an arithmetic operator, i.e. +, *, -, or /. The hyphen comes first in the character class so that it is read literally; written as [+*-/], the *-/ part is a range that also matches "," and ".". The match in both cases is the empty string "" between a digit and an operator, and the string is split at that point.
An arithmetic operator followed by a digit. This is just the reverse of the above.
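The same zero-width split can be tried outside R. A minimal sketch in Python, assuming Python 3.7 or later (earlier versions of re.split did not split on empty matches); the name split_expression is hypothetical:

```python
import re

# a position between a digit and an operator, or between an operator and a digit
BOUNDARY = re.compile(r"(?<=\d)(?=[-+*/])|(?<=[-+*/])(?=\d)")

def split_expression(expression):
    """Split at every digit/operator boundary, keeping both sides."""
    return BOUNDARY.split(expression)

for item in ["3+10", "4/12", "8-3", "12*1", "1+2-3*4/8"]:
    print(split_expression(item))
```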

strsplit with vertical bar (pipe)

Here,
> r<-c("AAandBB", "BBandCC")
> strsplit(as.character(r),'and')
[[1]]
[1] "AA" "BB"
[[2]]
[1] "BB" "CC"
Working well, but
> r<-c("AA|andBB", "BB|andCC")
> strsplit(as.character(r),'|and')
[[1]]
[1] "A" "A" "|" "" "B" "B"
[[2]]
[1] "B" "B" "|" "" "C" "C"
Here, the result is not what I expected. How do I get "AA" and "BB" when I split on '|and'?
Thanks in advance.
As you can read on ?strsplit, the argument split in function strsplit is a regular expression. Hence either you need to escape the vertical bar (it is a special character)
strsplit(r,split='\\|and')
or you can choose fixed=TRUE to indicate that split is not a regular expression
strsplit(r,split='|and',fixed=TRUE)
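Both options, escaping the special character or switching off regular-expression matching, exist in most languages. As an illustrative sketch in Python (not R), re.escape plays the role of the manual backslashes and plain str.split is the analogue of fixed=TRUE:

```python
import re

r = ["AA|andBB", "BB|andCC"]

# Option 1: escape the metacharacter (re.escape does this for arbitrary text)
escaped = [re.split(re.escape("|and"), s) for s in r]

# Option 2: avoid regular expressions; str.split treats its argument literally
fixed = [s.split("|and") for s in r]

print(escaped)
print(fixed)
```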

How to extract substrings from this string?

The string is
And I want to get substrings "11","1.1","282". Can anyone show me how to do this in R? Thanks!
I believe strsplit(x," +")[[1]] will do it. (The regular expression " +" denotes one or more spaces; strsplit applies to character vectors and returns a list with the split version of each element in the vector, so [[1]] extracts the first (and only) component.)
> x = "11 1.1 282"
> res <- strsplit(x, " +")
> res
[[1]]
[1] "11" "1.1" "282"
>
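For readers who want to try the "one or more spaces" idea elsewhere, here is a small Python sketch. The input string is made up for the example (with runs of several spaces), since the exact string is not shown in the question.

```python
import re

x = "11   1.1  282"  # made-up input with runs of spaces

by_regex = re.split(r" +", x)  # one or more spaces, as in the R answer
by_default = x.split()         # no argument: split on any run of whitespace

print(by_regex)
print(by_default)
```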

truncate string from a certain character in R [duplicate]

I have a list of strings in R which looks like:
WDN.TO
WDR.N
WDS.AX
WEC.AX
WEC.N
WED.TO
I want to get the suffix of each string, starting from the character ".". The result should look like:
.TO
.N
.AX
.AX
.N
.TO
Anyone have any ideas?
Joshua's solution works fine. I'd use sub instead of gsub though. gsub is for substituting multiple occurrences of a pattern in a string - sub is for one occurrence. The pattern can be simplified a bit too:
> x <- c("WDN.TO","WDR.N","WDS.AX","WEC.AX","WEC.N","WED.TO")
> sub("^[^.]*", "", x)
[1] ".TO" ".N" ".AX" ".AX" ".N" ".TO"
...But if the strings are as regular as in the question, then simply stripping the first 3 characters should be enough:
> x <- c("WDN.TO","WDR.N","WDS.AX","WEC.AX","WEC.N","WED.TO")
> substring(x, 4)
[1] ".TO" ".N" ".AX" ".AX" ".N" ".TO"
Using gsub:
x <- c("WDN.TO","WDS.N")
# replace everything from the start of the string to the "." with "."
gsub("^.*\\.",".",x)
# [1] ".TO" ".N"
Using strsplit:
# strsplit returns a list; use sapply to get the 2nd obs of each list element
y <- sapply(strsplit(x,"\\."), `[`, 2)
# since we split on ".", we need to put it back
paste(".",y,sep="")
# [1] ".TO" ".N"
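strsplit might do it too, but on a larger data set it can fail with the error
subscript out of bounds
which typically happens when the second piece is extracted with [[ from a string that contains no ".". Note that strsplit returns a list (so it cannot be indexed with [,2]) and that "." has to be matched literally, hence fixed=TRUE:
x <- c("WDN.TO","WDR.N","WDS.AX","WEC.AX","WEC.N","WED.TO")
y <- sapply(strsplit(x, ".", fixed=TRUE), `[[`, 2)
# output: y = "TO" "N" "AX" "AX" "N" "TO"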
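A string without any "." is the case worth handling explicitly. As a hedged sketch in Python (the helper name suffix_from_dot is hypothetical, and returning an empty string for a missing "." is an assumption about what is wanted):

```python
def suffix_from_dot(s):
    """Return s from its first '.' onward, or '' if there is no '.'."""
    position = s.find(".")
    return s[position:] if position != -1 else ""

x = ["WDN.TO", "WDR.N", "WDS.AX", "WEC.AX", "WEC.N", "WED.TO"]
suffixes = [suffix_from_dot(s) for s in x]
print(suffixes)
```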
