Wrapping strings, but not substrings in quotes, using R

This question is related to my question about Roxygen.
I want to write a new function that does word wrapping of strings, similar to strwrap or stringr::str_wrap, but with the following twist: Any elements (substrings) in the string that are enclosed in quotes must not be allowed to wrap.
So, for example, using the following sample data
test <- "function(x=123456789, y=\"This is a long string argument\")"
cat(test)
function(x=123456789, y="This is a long string argument")
strwrap(test, width=40)
[1] "function(x=123456789, y=\"This is a long"
[2] "string argument\")"
I want the desired output of a newWrapFunction(x, width=40, ...) to be:
desired <- c("function(x=123456789, ", "y=\"This is a long string argument\")")
desired
[1] "function(x=123456789, "
[2] "y=\"This is a long string argument\")"
identical(desired, newWrapFunction(test, width=40))
[1] TRUE
Can you think of a way to do this?
PS. If you can help me solve this, I will propose this code as a patch to roxygen2. I have identified where this patch should be applied and will acknowledge your contribution.

Here's what I did to get strwrap so it would not break single quoted sections on spaces:
A) Pre-process the "even" sections after splitting by the single-quotes by substituting "~|~" for the spaces:
Define new function strwrapqt
....
zz <- strsplit(x, "\'") # only the even-numbered sections (inside quotes) are modified
for (i in seq_along(zz)) {
  n <- length(zz[[i]])
  if (n < 2) next # no quoted section here; seq(2, 1, by=2) would throw an error
  for (evens in seq(2, n, by=2)) {
    zz[[i]][evens] <- gsub("[ ]", "~|~", zz[[i]][evens])
  }
}
zz <- unlist(zz)
.... insert just before
z <- lapply(strsplit) ...........
Then at the end replace all the "~|~" with spaces. It might be necessary to do a lot more thinking about the other sorts of whitespace "events" to get a fully regular treatment.
....
y <- gsub("~\\|~", " ", y)
....
Edit: Tested @joran's suggestion. Matching single and double quotes would be a difficult task with the methods I am using, but if one were willing to treat any quote as an equally valid separator, one could just use zz <- strsplit(x, "\'|\"") as the splitting criterion in the code above.
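The steps above (protect the spaces inside quotes, wrap, then restore the spaces) can be put together as a minimal sketch. The function name wrap_keep_quotes and its placeholder argument are hypothetical, and the sketch assumes a single input string with balanced, unescaped double quotes. It uses gregexpr with regmatches<- instead of strsplit so that the quote characters themselves survive.

```r
# Hypothetical helper: wrap a single string without breaking inside "..." pairs.
wrap_keep_quotes <- function(x, width = 40, placeholder = "~|~") {
  # Locate every double-quoted substring (assumes balanced, unescaped quotes).
  m <- gregexpr('"[^"]*"', x)
  # Swap the spaces inside each quoted substring for the placeholder.
  regmatches(x, m) <- lapply(
    regmatches(x, m),
    function(quoted) gsub(" ", placeholder, quoted, fixed = TRUE)
  )
  # Wrap as usual, then restore the protected spaces.
  wrapped <- strwrap(x, width = width)
  gsub(placeholder, " ", wrapped, fixed = TRUE)
}

test <- "function(x=123456789, y=\"This is a long string argument\")"
wrap_keep_quotes(test, width = 40)
```

For the sample string this breaks before y= and keeps the quoted argument on one line. Two caveats: strwrap drops trailing whitespace, so the first element has no trailing space, and each protected space counts as three characters while the width is measured.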

Related

Split string into 100 words parts in R

How do I split a single huge "character" into smaller ones, each containing exactly 100 words?
For example, that's how I used to split it by single words.
myCharSplitByWords <- strsplit(myCharUnSplit, " ")[[1]]
I think that this can probably be done with a regex (maybe by selecting every 100th space or something), but I couldn't write a proper expression.
I'm new to R and I'm totally stuck. Thanks
Maybe there is a way using regular expressions but after strsplit it would be easier to group the words by "hand":
## example data
set.seed(1)
string <- paste0(sample(c(LETTERS[1:10], " "), 1e5, replace=TRUE), collapse="")
## split if there is at least one space
words <- strsplit(string, "\\s+")[[1]]
## build group index
group <- rep(seq(ceiling(length(words)/100)), each=100)[1:length(words)]
## split by group index
words100 <- split(words, group)
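If the goal is one string per chunk rather than a list of word vectors, the grouping above can be followed by paste. This is a small sketch under that assumption; the function name split_every_n_words is hypothetical, and it treats any run of whitespace as a word separator.

```r
# Hypothetical helper: split one string into chunks of n words each.
split_every_n_words <- function(x, n = 100) {
  # Trim first so that leading whitespace does not produce an empty first word.
  words <- strsplit(trimws(x), "\\s+")[[1]]
  # Words 1..n get group 1, words n+1..2n get group 2, and so on.
  group <- ceiling(seq_along(words) / n)
  # Glue each group back together with single spaces.
  unname(vapply(split(words, group), paste, character(1), collapse = " "))
}

split_every_n_words("a b c d e f g", n = 3)
```

The last chunk simply holds whatever words remain, so it can be shorter than n.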
You can get every 100th instances of a run of spaces preceded by a run of non-spaces (if that's your definition of a word) by:
ind<- gregexpr("([^ ]+? +){100}", string)[[1]]
and then substring your original by
hundredWords <- substring(string, ind, c(ind[-1]-1, nchar(string)))
Note that this uses substring() rather than substr(), because substr() returns only one result per input string. This will leave trailing spaces at the end of each entry. The final entry starts at the last complete batch of 100 words, so it also contains any leftover words that follow that batch. If you have another definition of word delimiter (tabs, punctuation, ...) then post that and we can change the regular expression accordingly.

Standard ML string to a list

Is there a way in ML to take in a string and output a list of its substrings, where the separator is a space, newline or EOF, while keeping quoted strings inside the string intact?
EX) hello world "my id" is 5555
-> [hello, world, my id, is, 5555]
I am working on tokenizing these and then classifying them into:
->[word, word, string, word, int]
Sure you can! Here's the idea:
If we take a string like "Hello World, \"my id\" is 5555", we can split it at the quote marks, ignoring the spaces for now. This gives us ["Hello World, ", "my id", " is 5555"]. The important thing to notice here is that the list contains three elements - an odd number. As long as the string only contains pairs of quotes (as it will if it's properly formatted), we'll always get an odd number of elements when we split at the quote marks.
A second important thing is that all the even-numbered elements of the list will be strings that were unquoted (if we start counting from 0), and the odd-numbered ones were quoted. That means that all we need to do is tokenize the ones that were unquoted, and then we're done!
I put some code together. It is only a sketch, so adapt it as needed:
fun foo s =
  let
    (* String.fields keeps empty fields, so the odd/even positions stay aligned
       even when the string starts or ends with a quote mark *)
    val quoteSep = String.fields (fn c => c = #"\"") s
    (* Char.isSpace covers spaces, tabs and newlines *)
    val spaceSep = String.tokens Char.isSpace
    fun sepEven [] = []
      | sepEven [x] = spaceSep x (* no quoted part is left *)
      | sepEven (x::y::xs) = spaceSep x @ (y :: sepEven xs) (* x was unquoted, y was quoted *)
  in
    if length quoteSep mod 2 = 0
    then raise Fail "odd number of quote marks" (* something is wrong! *)
    else sepEven quoteSep
  end
String.tokens brings you halfway there. But if you really want to handle quotes like you are sketching then there is no way around writing an actual lexer. MLlex, which comes with SML/NJ and MLton (but is usable with any SML) could help. Or you just write it by hand, which should be easy enough in this case as well.

How to split a string vector and recompose it in the original form

I would like to split a string vector, process its tokens, and then recompose it in the original form.
Please consider the following
vector.in <- c("red rum", "mur der", "red rum", "mur der")
length(vector.in)
# [1] 4
vector.splt <- strsplit(vector.in, "\\s")
vector.splt <- unlist(vector.splt)
vector.out <- paste(vector.splt, sep="", collapse=" ")
and of course
length(vector.out)
# [1] 1
How should I process it so as to output a vector with the same form and length as the original vector.in, that is, without losing any information?
The unlist is the problem. That removes the structure too early. Then you need to loop around the elements and pass to the paste function. I will use lapply for the loop:
vector.in <- c("red rum", "mur der", "red rum", "mur der")
vector.splt <- strsplit(vector.in, "\\s")
unlist(lapply(vector.splt, paste, collapse=' '))
## [1] "red rum" "mur der" "red rum" "mur der"
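To show where the token processing step would go, here is a small sketch that keeps the list structure from strsplit, transforms each token, and pastes each element back together. Reversing the letters of each word is only an illustrative choice, and rev_chars is a hypothetical helper.

```r
vector.in <- c("red rum", "mur der", "red rum", "mur der")

# Hypothetical token-level operation: reverse the characters of each word.
rev_chars <- function(w) {
  vapply(strsplit(w, ""), function(ch) paste(rev(ch), collapse = ""), character(1))
}

# Keep the list returned by strsplit: one element per input string.
tokens <- strsplit(vector.in, "\\s")

# Process the tokens of each element, then recompose one string per element.
vector.out <- vapply(
  tokens,
  function(tok) paste(rev_chars(tok), collapse = " "),
  character(1)
)

length(vector.out) == length(vector.in)
```

Because unlist is never called, vector.out has one entry per entry of vector.in.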
The gsubfn function in the gsubfn package does that. For example, here we split the input into words, apply a function (represented in formula notation) to each word where in this case the function parenthesizes each word and then we put it all back together:
> library(gsubfn)
> gsubfn("\\w+", ~ sprintf("(%s)", x), vector.in)
[1] "(red) (rum)" "(mur) (der)" "(red) (rum)" "(mur) (der)"

Extract numeric part of strings of mixed numbers and characters in R

I have a lot of strings, each of which tends to have the following format: Ab_Cd-001234.txt
I want to replace it with 001234. How can I achieve it in R?
The stringr package has lots of handy shortcuts for this kind of work:
# input data following @agstudy
data <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
# load library
library(stringr)
# prepare regular expression
regexp <- "[[:digit:]]+"
# process string
str_extract(data, regexp)
Which gives the desired result:
[1] "001234" "001234"
To explain the regexp a little:
[[:digit:]] is any digit 0 to 9
+ means the preceding item (in this case, a digit) will be matched one or more times
This page is also very useful for this kind of string processing: http://en.wikibooks.org/wiki/R_Programming/Text_Processing
Using gsub or sub you can do this :
gsub('.*-([0-9]+).*','\\1','Ab_Cd-001234.txt')
"001234"
You can use regexpr with regmatches:
m <- regexpr('[0-9]+','Ab_Cd-001234.txt')
regmatches('Ab_Cd-001234.txt',m)
"001234"
EDIT: Both methods are vectorized and work on a vector of strings.
x <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
sub('.*-([0-9]+).*','\\1',x)
"001234" "001234"
m <- gregexpr('[0-9]+',x)
regmatches(x,m)
[[1]]
[1] "001234"
[[2]]
[1] "001234"
You could use genXtract from the qdap package. This takes a left character string and a right character string and extracts the elements between.
library(qdap)
genXtract("Ab_Cd-001234.txt", "-", ".txt")
Though I much prefer agstudy's answer.
EDIT Extending answer to match agstudy's:
x <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
genXtract(x, "-", ".txt")
# $`- : .txt1`
# [1] "001234"
#
# $`- : .txt2`
# [1] "001234"
gsub Remove prefix and suffix:
gsub(".*-|\\.txt$", "", x)
tools package Use file_path_sans_ext from tools to remove extension and then use sub to remove prefix:
library(tools)
sub(".*-", "", file_path_sans_ext(x))
strapplyc Extract the digits after - and before dot. See gsubfn home page for more info:
library(gsubfn)
strapplyc(x, "-(\\d+)\\.", simplify = TRUE)
Note that if it were desired to return a numeric we could use strapply rather than strapplyc like this:
strapply(x, "-(\\d+)\\.", as.numeric, simplify = TRUE)
I'm adding this answer because it works regardless of what non-numeric characters you have in the strings you want to clean up, and because OP said that the string tends to follow the format "Ab_Cd-001234.txt", which I take to mean allows for variation.
Note that this answer takes all numeric characters from the string and keeps them together, so if the string were "4_Ab_Cd_001234.txt", your result would be "4001234".
If you want to apply this to a column in a data frame you already have:
df$clean_column<-gsub("[^0-9]", "", df$dirty_column)
This is very similar to the answer here:
https://stackoverflow.com/a/52729957/9731173.
Essentially what you are doing with my solution is replacing any non-numeric character with "", while the answer I've linked to replaces any character that is not numeric, - or .
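The behaviour described above can be checked on a small data frame. The data and column contents here are made up for illustration; only the gsub call comes from the answer.

```r
# Hypothetical data: the second name has a digit before the main number.
df <- data.frame(
  dirty_column = c("Ab_Cd-001234.txt", "4_Ab_Cd_001234.txt"),
  stringsAsFactors = FALSE
)

# Delete every character that is not a digit.
df$clean_column <- gsub("[^0-9]", "", df$dirty_column)

df$clean_column
```

The second row shows the caveat: all digits in the string are kept and run together, so a stray leading 4 ends up in front of 001234.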

Insert line breaks in long string -- word wrap

Here is a function I wrote to break a long string into lines not longer than a given length
strBreakInLines <- function(s, breakAt=90, prepend="") {
words <- unlist(strsplit(s, " "))
if (length(words)<2) return(s)
wordLen <- unlist(Map(nchar, words))
lineLen <- wordLen[1]
res <- words[1]
lineBreak <- paste("\n", prepend, sep="")
for (i in 2:length(words)) {
lineLen <- lineLen+wordLen[i]
if (lineLen < breakAt)
res <- paste(res, words[i], sep=" ")
else {
res <- paste(res, words[i], sep=lineBreak)
lineLen <- 0
}
}
return(res)
}
It works for the problem I had; but I wonder if I can learn something here. Is there a shorter or more efficient solution, especially can I get rid of the for loop?
How about this:
gsub('(.{1,90})(\\s|$)', '\\1\n', s)
It will break string "s" into lines with maximum 90 chars (excluding the line break character "\n", but including inter-word spaces), unless there is a word itself exceeding 90 chars, then that word itself will occupy a whole line.
By the way, your function seems broken --- you should replace
lineLen <- 0
with
lineLen <- wordLen[i]
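To see the regular expression above at work on a short input, it can be wrapped in a small function with the width as a parameter. The name wrap_regex is hypothetical, and removing the trailing newline is an addition that the original one-liner does not make.

```r
# Hypothetical wrapper around the gsub approach with a configurable width.
wrap_regex <- function(s, width = 90) {
  # Up to `width` characters, followed by whitespace or the end of the string.
  pattern <- sprintf("(.{1,%d})(\\s|$)", width)
  out <- gsub(pattern, "\\1\n", s)
  # The replacement also appends "\n" after the last line; drop it.
  sub("\n$", "", out)
}

s <- "The quick brown fox jumps"
cat(wrap_regex(s, width = 9), "\n")
```

With a width of 9, the sample string is split into three lines: "The quick", "brown fox" and "jumps".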
For the sake of completeness, Karsten W.'s comment points at strwrap, which is the easiest function to remember:
strwrap("Lorem ipsum... you know the routine", width=10)
and to match exactly the solution proposed in the question, the string has to be pasted afterwards:
paste(strwrap(s,90), collapse="\n")
This post is deliberately made community wiki since the honor of finding the function isn't mine.
For further completeness, there's:
stringi::stri_wrap
stringr::str_wrap (which just ultimately calls stringi::stri_wrap)
The stringi version will deal with character sets better (it's built on the ICU library) and it's in C/C++ so it'll ultimately be faster than base::strwrap. It's also vectorized over the str parameter.
You can look at, e.g., the write.dcf() function in R itself; it also uses a loop, so there is nothing to be ashamed of here.
The first goal is to get it right --- see Chambers (2008).

Resources