strsplit with vertical bar (pipe) - string

Here,
> r<-c("AAandBB", "BBandCC")
> strsplit(as.character(r),'and')
[[1]]
[1] "AA" "BB"
[[2]]
[1] "BB" "CC"
Working well, but
> r<-c("AA|andBB", "BB|andCC")
> strsplit(as.character(r),'|and')
[[1]]
[1] "A" "A" "|" "" "B" "B"
[[2]]
[1] "B" "B" "|" "" "C" "C"
Here the result is not what I expect. How can I get "AA" and "BB" when I split on '|and'?
Thanks in advance.

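As documented in ?strsplit, the split argument of strsplit is a regular expression, and the vertical bar is a special character there (alternation): unescaped, '|and' reads as "the empty string or and", which is why the input was split character by character. So you either need to escape the vertical bar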
strsplit(r,split='\\|and')
or you can choose fixed=TRUE to indicate that split is not a regular expression
strsplit(r,split='|and',fixed=TRUE)
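A minimal sketch putting both fixes side by side on the vector from the question. The character-class form "[|]and" is an extra variant not mentioned above, added only as an illustration of another way to make the bar literal.

```r
r <- c("AA|andBB", "BB|andCC")

# Escape the bar so the regex engine treats it as a literal character
escaped <- strsplit(r, split = "\\|and")

# Alternative spelling (not from the answer above): a bar inside a
# character class is also literal
in_class <- strsplit(r, split = "[|]and")

# Or switch off regex interpretation entirely
literal <- strsplit(r, split = "|and", fixed = TRUE)

# All three give the same list: c("AA", "BB") and c("BB", "CC")
identical(escaped, literal) && identical(escaped, in_class)
```

fixed=TRUE is probably the simplest choice when the separator contains no pattern at all, since nothing needs escaping.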

Related

How to correctly read column from file when first element is empty

I have a data file data.txt
  a
5 b
3 c 7
which I would like to load and have as
julia> loaded_data
3×3 Matrix{Any}:
"" "a" ""
5 "b" ""
3 "c" 7
but it is unclear to me how to do this. Trying readdlm
julia> using DelimitedFiles
julia> readdlm("data.txt")
3×3 Matrix{Any}:
"a" "" ""
5 "b" ""
3 "c" 7
This does not identify the first element of the first row as empty and instead reads "a" as the first element (which of course makes sense). The closest I think I've gotten to what I want is using readlines
julia> readlines("data.txt")
3-element Vector{String}:
"  a  "
"5 b  "
"3 c 7"
but from here I'm not sure how to proceed. I can grab one of the rows with all the columns and split it, but not sure how that helps me identify the empty elements in other rows.
Here's a possibility:
cnv(s) = (length(s) > 0 && all(isdigit, s)) ? parse(Int, s) : s
cnv.(stack(split.(replace.(eachline("data.txt"),"  "=>" "), " "), dims=1))
If the contents of the columns are sufficiently distinguishable to make the parsing uniquely defined, I'd use a regex on each line:
julia> lines
3-element Vector{String}:
"  a  "
"5 b  "
"3 c 7"
julia> [match(r"\s*(\d*)\s*([a-z]*)\s*(\d*)", s).captures for s in lines]
3-element Vector{Vector{Union{Nothing, SubString{String}}}}:
["", "a", ""]
["5", "b", ""]
["3", "c", "7"]
You can then proceed to parse and concatenate as you wish, e.g.
julia> mapreduce(vcat, lines) do line
           x, y, z = match(r"\s*(\d*)\s*([a-z]*)\s*(\d*)", line).captures
           [tryparse(Int, x) y tryparse(Int, z)]
       end
3×3 Matrix{Any}:
nothing "a" nothing
5 "b" nothing
3 "c" 7
In Julia 1.9, I think you should be able to write this as
stack(lines; dims=1) do line
    x, y, z = match(r"\s*(\d*)\s*([a-z]*)\s*(\d*)", line).captures
    (tryparse(Int, x), y, tryparse(Int, z))
end
This problem may have many edge cases to clarify.
Here is a longer option than the other answer, but perhaps better suited to tweak for the edge cases:
function splittable(d)
    # find all non-space locations
    t = sort(union(findall.(!isspace, d)...))
    # find initial indices of fields
    tt = t[vcat(1,findall(diff(t).!=1).+1)]
    # prepare ranges to extract fields
    tr = [tt[i]:tt[i+1]-1 for i in 1:length(tt)-1]
    # extract substrings
    vs = map(s -> strip.(vcat([s[intersect(r,eachindex(s))] for r in tr],
                              tt[end]<=length(s) ? s[tt[end]:end] : "")), d)
    # fit substrings into matrix
    L = maximum(length.(vs))
    String.([j <= length(vs[i]) ? vs[i][j] : ""
             for i in 1:length(vs), j in 1:L])
end
And:
julia> d = readlines("data.txt")
3-element Vector{String}:
"  a  "
"5 b  "
"3 c 7"
julia> dd = splittable(d)
3×3 Matrix{String}:
"" "a" ""
"5" "b" ""
"3" "c" "7"
To get the partial parsing effect:
function parsewhatmay(m)
    M = tryparse.(Int, m)
    map((x,y)->isnothing(x) ? y : x, M, m)
end
and now:
julia> parsewhatmay(dd)
3×3 Matrix{Any}:
"" "a" ""
5 "b" ""
3 "c" 7

Julia: concat strings with separator (equivalent of R's paste)

I have an array of strings that I would like to concatenate together with a specific separator.
x = ["A", "B", "C"]
Expected results (with sep = ;):
"A; B; C"
The R equivalent would be paste(x, collapse="; ")
I've tried things like string(x), but the result is not what I'm looking for...
Use join. It is not clear if you want ";" or "; " as a separator.
julia> x = ["A", "B", "C"]
3-element Array{String,1}:
"A"
"B"
"C"
julia> join(x, ';')
"A;B;C"
julia> join(x, "; ")
"A; B; C"
If you want just ; as the separator, pass the character ';'. If you also want the space, you need to use a string: "; "
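For comparison on the R side, a small sketch of the call that join corresponds to. In base R it is the collapse argument, not sep, that glues the elements of a single vector together; sep only separates the different arguments passed to paste.

```r
x <- c("A", "B", "C")

# collapse joins the elements of one vector into a single string
joined <- paste(x, collapse = "; ")

# sep only applies between separate arguments, so with a single
# vector the input comes back unchanged
unchanged <- paste(x, sep = ";")

joined     # "A; B; C"
unchanged  # "A" "B" "C"
```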

How to extract substrings from this string?

The string is
And I want to get substrings "11","1.1","282". Can anyone show me how to do this in R? Thanks!
I believe strsplit(x," +")[[1]] will do it. The regular expression " +" denotes one or more spaces. strsplit applies to character vectors and returns a list with the split version of each element of the vector, so [[1]] extracts the first (and only) component.
> x = "11 1.1 282"
> res <- strsplit(x, " +")
> res
[[1]]
[1] "11" "1.1" "282"
>
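A short sketch of the same split followed by a numeric conversion, in case numbers rather than substrings are wanted in the end. The uneven spacing in x below is an assumption made for illustration, since the exact string from the question is not shown.

```r
# Hypothetical input with runs of several spaces between fields
x <- "11   1.1  282"

# Split on one or more spaces and take the only list component
parts <- strsplit(x, " +")[[1]]

# Optional: convert the substrings to numbers
nums <- as.numeric(parts)

parts  # "11" "1.1" "282"
nums   # 11.0 1.1 282.0
```

If the string could start with spaces, the first element would be an empty string, so wrapping x in trimws() first may be worth considering.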

Getting and removing the first character of a string

I would like to do some 2-dimensional walks using strings of characters by assigning different values to each character. I was planning to 'pop' the first character of a string, use it, and repeat for the rest of the string.
How can I achieve something like this?
x <- 'hello stackoverflow'
I'd like to be able to do something like this:
a <- x.pop[1]
print(a)
'h'
print(x)
'ello stackoverflow'
See ?substring.
x <- 'hello stackoverflow'
substring(x, 1, 1)
## [1] "h"
substring(x, 2)
## [1] "ello stackoverflow"
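Since a plain R function cannot modify x in place, one way to mimic a pop with the two substring calls above is to return both pieces and reassign. This is only a sketch, and pop_first is a hypothetical helper name, not an existing R function.

```r
# pop_first is a hypothetical helper: it returns the first n characters
# and the remainder, leaving the reassignment to the caller
pop_first <- function(s, n = 1) {
  list(first = substring(s, 1, n), rest = substring(s, n + 1))
}

x <- "hello stackoverflow"
res <- pop_first(x)
a <- res$first  # the popped character
x <- res$rest   # the shortened string

a  # "h"
x  # "ello stackoverflow"
```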
The idea of having a pop method that both returns a value and has a side effect of updating the data stored in x is very much a concept from object-oriented programming. So rather than defining a pop function to operate on character vectors, we can make a reference class with a pop method.
PopStringFactory <- setRefClass(
  "PopString",
  fields = list(
    x = "character"
  ),
  methods = list(
    initialize = function(x)
    {
      x <<- x
    },
    pop = function(n = 1)
    {
      if(nchar(x) == 0)
      {
        warning("Nothing to pop.")
        return("")
      }
      first <- substring(x, 1, n)
      x <<- substring(x, n + 1)
      first
    }
  )
)
x <- PopStringFactory$new("hello stackoverflow")
x
## Reference class object of class "PopString"
## Field "x":
## [1] "hello stackoverflow"
replicate(nchar(x$x), x$pop())
## [1] "h" "e" "l" "l" "o" " " "s" "t" "a" "c" "k" "o" "v" "e" "r" "f" "l" "o" "w"
There is also str_sub from the stringr package
library(stringr)
x <- 'hello stackoverflow'
str_sub(x, 2) # or
str_sub(x, 2, str_length(x))
[1] "ello stackoverflow"
Use stri_sub from the stringi package
> library(stringi)
> x <- 'hello stackoverflow'
> stri_sub(x,2)
[1] "ello stackoverflow"
substring is definitely best, but here's one strsplit alternative, since I haven't seen one yet.
> x <- 'hello stackoverflow'
> strsplit(x, '')[[1]][1]
## [1] "h"
or equivalently
> unlist(strsplit(x, ''))[1]
## [1] "h"
And you can paste the rest of the string back together.
> paste0(strsplit(x, '')[[1]][-1], collapse = '')
## [1] "ello stackoverflow"
Removing the first character:
x <- 'hello stackoverflow'
substring(x, 2, nchar(x))
The idea is to select all characters from position 2 up to the number of characters in x. This matters when the words or phrases have different lengths.
Selecting the first letter is trivial, as in the previous answers:
substring(x,1,1)
Another alternative is to use capturing sub-expressions with the regular expression functions regmatches and regexec.
# the original example
x <- 'hello stackoverflow'
# grab the substrings
myStrings <- regmatches(x, regexec('(^.)(.*)', x))
This returns the entire string, the first character, and the "popped" result in a list of length 1.
myStrings
[[1]]
[1] "hello stackoverflow" "h" "ello stackoverflow"
which is equivalent to list(c(x, substr(x, 1, 1), substr(x, 2, nchar(x)))). That is, it contains a superset of the desired elements: the full string as well as the two pieces.
Adding sapply will allow this method to work for a character vector of length > 1.
# a slightly more interesting example
xx <- c('hello stackoverflow', 'right back', 'at yah')
# grab the substrings
myStrings <- regmatches(xx, regexec('(^.)(.*)', xx))
This returns a list with the matched full string as the first element and the matching subexpressions captured by () as the following elements. So in the regular expression '(^.)(.*)', (^.) matches the first character and (.*) matches the remaining characters.
myStrings
[[1]]
[1] "hello stackoverflow" "h" "ello stackoverflow"
[[2]]
[1] "right back" "r" "ight back"
[[3]]
[1] "at yah" "a" "t yah"
Now, we can use the trusty sapply + [ method to pull out the desired substrings.
myFirstStrings <- sapply(myStrings, "[", 2)
myFirstStrings
[1] "h" "r" "a"
mySecondStrings <- sapply(myStrings, "[", 3)
mySecondStrings
[1] "ello stackoverflow" "ight back" "t yah"
Another way using the sub function.
sub('(^.).*', '\\1', 'hello stackoverflow')
[1] "h"
sub('(^.)(.*)', '\\2', 'hello stackoverflow')
[1] "ello stackoverflow"

Inconsistent behavior between str_split and strsplit

The documentation for str_split in the stringr package states that for the pattern argument:
If "" splits into individual characters.
which suggests it behaves the same as strsplit in this regard. However,
library(stringr)
str_split("abcab","")
[[1]]
[1] "" "a" "b" "c" "a" "b"
with a leading empty string. This compares with,
strsplit("abcab","")
[[1]]
[1] "a" "b" "c" "a" "b"
Leading empty strings seem to be normal behavior when splitting on non-empty strings,
strsplit("abcab","ab")
[[1]]
[1] "" "c"
but even then, str_split generates an 'extra' trailing empty string:
str_split("abcab","ab")
[[1]]
[1] "" "c" ""
Is this discrepancy a bug, feature, an error in the documentation or just a different notion of what's 'expected behavior'?
If you use commas as delimiters, the "expected" (your mileage may vary) result is more obvious:
# expect "" "2" "3" "4" ""
strsplit(",2,3,4,", ",")
# [[1]]
# [1] "" "2" "3" "4"
str_split(",2,3,4,", ",")
# [[1]]
# [1] "" "2" "3" "4" ""
If I have n commas then I expect (n+1) elements to be returned. So I prefer the results from str_split. However, I wouldn't necessarily call this a bug in strsplit, since it performs as advertised:
(from ?strsplit) Note that this means that if there is a match at the beginning of
a (non-empty) string, the first element of the output is ‘""’, but
if there is a match at the end of the string, the output is the
same as with the match removed.
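A small base-R sketch of the counting argument above. It checks how many elements strsplit returns against the number of commas, and shows one possible workaround for recovering the trailing empty field. The workaround (appending one extra delimiter before splitting) is my own assumption about how to get n+1 elements, not something the documentation recommends.

```r
s <- ",2,3,4,"

# Number of delimiters in the string
n_commas <- lengths(regmatches(s, gregexpr(",", s, fixed = TRUE)))

# Base strsplit drops the empty field after the final delimiter
base_parts <- strsplit(s, ",")[[1]]

# Workaround sketch: with one extra delimiter appended, only the match
# at the very end is swallowed, so the original trailing field survives
padded_parts <- strsplit(paste0(s, ","), ",")[[1]]

n_commas              # 4
length(base_parts)    # 4, one short of n_commas + 1
length(padded_parts)  # 5, equal to n_commas + 1
```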
"" is trickier, as there is no way to count the number of times "" appears in a string. Therefore treating it as a special case seems justified.
(from ?str_split) If ‘""’ splits into individual characters.
Based on this I suggest you have found a bug and should take hadley's advice and report it!
