Here is a function I wrote to break a long string into lines no longer than a given length:
strBreakInLines <- function(s, breakAt=90, prepend="") {
  words <- unlist(strsplit(s, " "))
  if (length(words) < 2) return(s)
  wordLen <- unlist(Map(nchar, words))
  lineLen <- wordLen[1]
  res <- words[1]
  lineBreak <- paste("\n", prepend, sep="")
  for (i in 2:length(words)) {
    lineLen <- lineLen + wordLen[i]
    if (lineLen < breakAt)
      res <- paste(res, words[i], sep=" ")
    else {
      res <- paste(res, words[i], sep=lineBreak)
      lineLen <- 0
    }
  }
  return(res)
}
It works for the problem I had, but I wonder if I can learn something here. Is there a shorter or more efficient solution? In particular, can I get rid of the for loop?
How about this:
gsub('(.{1,90})(\\s|$)', '\\1\n', s)
It will break the string s into lines of at most 90 characters (excluding the line-break character "\n" but including inter-word spaces). If a single word itself exceeds 90 characters, that word will occupy a whole line by itself.
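For instance, with a short sample sentence and the same pattern at width 10 instead of 90:

```r
# Same pattern as above, with a 10-character limit for a short demo
s <- "The quick brown fox jumps over the lazy dog"
wrapped <- gsub('(.{1,10})(\\s|$)', '\\1\n', s)
cat(wrapped)
# The quick
# brown fox
# jumps over
# the lazy
# dog
```

Note the result carries a trailing "\n", since the final match also gets one appended.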
By the way, your function seems broken --- you should replace
lineLen <- 0
with
lineLen <- wordLen[i]
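For reference, here is the question's function with that one-line fix applied (behavior otherwise unchanged):

```r
strBreakInLines <- function(s, breakAt=90, prepend="") {
  words <- unlist(strsplit(s, " "))
  if (length(words) < 2) return(s)
  wordLen <- unlist(Map(nchar, words))
  lineLen <- wordLen[1]
  res <- words[1]
  lineBreak <- paste("\n", prepend, sep="")
  for (i in 2:length(words)) {
    lineLen <- lineLen + wordLen[i]
    if (lineLen < breakAt) {
      res <- paste(res, words[i], sep=" ")
    } else {
      res <- paste(res, words[i], sep=lineBreak)
      lineLen <- wordLen[i]  # the fix: count the word that starts the new line
    }
  }
  return(res)
}
strBreakInLines("aaaa bbbb cccc dddd", breakAt=10)
# [1] "aaaa bbbb\ncccc dddd"
```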
For the sake of completeness, Karsten W.'s comment points at strwrap, which is the easiest function to remember:
strwrap("Lorem ipsum... you know the routine", width=10)
and to match exactly the solution proposed in the question, the wrapped lines have to be pasted back together afterwards:
paste(strwrap(s,90), collapse="\n")
This post is deliberately made community wiki since the honor of finding the function isn't mine.
For further completeness, there's:
stringi::stri_wrap
stringr::str_wrap (which ultimately just calls stringi::stri_wrap)
The stringi version will deal with character sets better (it's built on the ICU library) and it's in C/C++ so it'll ultimately be faster than base::strwrap. It's also vectorized over the str parameter.
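A quick sketch of both (this assumes the stringi and stringr packages are installed; the sample string is arbitrary):

```r
s <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
lines <- stringi::stri_wrap(s, width = 20)  # character vector, one element per line
one <- stringr::str_wrap(s, width = 20)     # single string with embedded "\n"
```

Note the difference in return shape: stri_wrap gives you the lines as separate elements, while str_wrap gives one string per input element with the breaks embedded.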
You can look at e.g. the write.dcf() function in R itself; it also uses a loop, so nothing to be ashamed of here.
The first goal is to get it right --- see Chambers (2008).
Related
I've got a string that I need to only upcase the first letter. I also need to preserve the case of any subsequent letters. At first I thought:
String.capitalize("hyperText")
would do the trick. But in addition to fixing the first letter, it downcases the rest of the letters. What I need to end up with is "HyperText". My initial pass at this is:
<<letter::utf8, rest::binary>> = word
upcased_first_letter =
  List.to_string([letter])
  |> String.upcase()
upcased_first_letter <> rest
This works perfectly but it really seems like a lot of verbosity and a lot of work as well. I keep feeling like there's a better way. I'm just not seeing it.
You can use with/1 to keep it to a single expression, and you can avoid List.to_string by using the <<>> operator again on the resulting codepoint:
with <<first::utf8, rest::binary>> <- "hyperText", do: String.upcase(<<first::utf8>>) <> rest
Or put it in a function:
def upcaseFirst(<<first::utf8, rest::binary>>), do: String.upcase(<<first::utf8>>) <> rest
One method:
iex(10)> Macro.camelize("hyperText")
"HyperText"
This might be more UTF-8 compatible? Not sure how many letters are multiple codepoints, but this seems a little safer than assuming how many bytes a letter is going to be.
iex(6)> with [first | rest] <- String.codepoints("βool") do
...(6)> [String.capitalize(first) | rest] |> Enum.join()
...(6)> end
"Βool"
iex(7)> with [first | rest] <- String.codepoints("😂ool") do
...(7)> [String.capitalize(first) | rest] |> Enum.join()
...(7)> end
"😂ool"
If you're just upcasing the English alphabet, you could do an easy guard clause on your match. An anonymous function example, though named or a with or something would work too:
iex> cap_first = fn
...> <<first, rest::binary>> when first in ?a..?z -> <<first - 32, rest::binary>>
...> string -> string
...> end
iex> cap_first.("hyperText")
"HyperText"
Is there a way in ML to take in a string and output a list of its substrings, where the separators are spaces, newlines, or EOF, while also keeping quoted strings intact?
EX) hello world "my id" is 5555
-> [hello, world, my id, is, 5555]
I am then working on tokenizing these into:
->[word, word, string, word, int]
Sure you can! Here's the idea:
If we take a string like "Hello World, \"my id\" is 5555", we can split it at the quote marks, ignoring the spaces for now. This gives us ["Hello World, ", "my id", " is 5555"]. The important thing to notice here is that the list contains three elements - an odd number. As long as the string only contains pairs of quotes (as it will if it's properly formatted), we'll always get an odd number of elements when we split at the quote marks.
A second important thing is that all the even-numbered elements of the list will be strings that were unquoted (if we start counting from 0), and the odd-numbered ones were quoted. That means that all we need to do is tokenize the ones that were unquoted, and then we're done!
I put some code together - you can build on it from there. Note that it uses String.fields rather than String.tokens: fields keeps empty pieces, so the even/odd invariant holds even when the string starts with a quote.
fun foo s =
    let
        val quoteSep = String.fields (fn c => c = #"\"") s
        val spaceSep = String.tokens (fn c => c = #" " orelse c = #"\n")
        fun sepEven [] = []
          | sepEven [x] = spaceSep x                            (* no quotes left: just split on spaces *)
          | sepEven (x::y::xs) = spaceSep x @ (y :: sepEven xs) (* x was unquoted, y was quoted *)
    in
        if length quoteSep mod 2 = 0
        then raise Fail "unbalanced quote marks"  (* even number of pieces = odd number of quotes *)
        else sepEven quoteSep
    end
String.tokens brings you halfway there. But if you really want to handle quotes like you are sketching then there is no way around writing an actual lexer. MLlex, which comes with SML/NJ and MLton (but is usable with any SML) could help. Or you just write it by hand, which should be easy enough in this case as well.
I would like to split a string vector, process its tokens, and then recompose it in the original form.
Please consider the following
vector.in <- c("red rum", "mur der", "red rum", "mur der")
length(vector.in)
# [1] 4
vector.splt <- strsplit(vector.in, "\\s")
vector.splt <- unlist(vector.splt)
vector.out <- paste(vector.splt, sep="", collapse=" ")
and of course
length(vector.out)
# [1] 1
How should I process it so as to output a vector with the same form and length as the original vector.in, that is, without losing any information?
The unlist is the problem: it removes the structure too early. Instead, loop over the elements and pass each one to paste. I will use lapply for the loop:
vector.in <- c("red rum", "mur der", "red rum", "mur der")
vector.splt <- strsplit(vector.in, "\\s")
unlist(lapply(vector.splt, paste, collapse=' '))
## [1] "red rum" "mur der" "red rum" "mur der"
The gsubfn function in the gsubfn package does that. For example, here we split the input into words, apply a function (written in formula notation) to each word, in this case one that parenthesizes it, and then put it all back together:
> library(gsubfn)
> gsubfn("\\w+", ~ sprintf("(%s)", x), vector.in)
[1] "(red) (rum)" "(mur) (der)" "(red) (rum)" "(mur) (der)"
This question is related to my question about Roxygen.
I want to write a new function that does word wrapping of strings, similar to strwrap or stringr::str_wrap, but with the following twist: Any elements (substrings) in the string that are enclosed in quotes must not be allowed to wrap.
So, for example, using the following sample data
test <- "function(x=123456789, y=\"This is a long string argument\")"
cat(test)
function(x=123456789, y="This is a long string argument")
strwrap(test, width=40)
[1] "function(x=123456789, y=\"This is a long"
[2] "string argument\")"
I want the desired output of a newWrapFunction(x, width=40, ...) to be:
desired <- c("function(x=123456789, ", "y=\"This is a long string argument\")")
desired
[1] "function(x=123456789, "
[2] "y=\"This is a long string argument\")"
identical(desired, newWrapFunction(test, width=40))
[1] TRUE
Can you think of a way to do this?
PS. If you can help me solve this, I will propose this code as a patch to roxygen2. I have identified where this patch should be applied and will acknowledge your contribution.
Here's what I did to get strwrap so it would not break single quoted sections on spaces:
A) Pre-process the "even" sections after splitting by the single-quotes by substituting "~|~" for the spaces:
Define new function strwrapqt
....
zz <- strsplit(x, "\'")  # will only be working on even-numbered sections
for (i in seq_along(zz)) {
  for (evens in seq(2, length(zz[[i]]), by=2)) {
    zz[[i]][evens] <- gsub("[ ]", "~|~", zz[[i]][evens])
  }
}
zz <- unlist(zz)
.... insert just before
z <- lapply(strsplit) ...........
Then at the end replace all the "~|~" with spaces. It might be necessary to do a lot more thinking about the other sorts of whitespace "events" to get a fully regular treatment.
....
y <- gsub("~\\|~", " ", y)
....
Edit: Tested @joran's suggestion. Matching single and double quotes would be a difficult task with the methods I am using, but if one were willing to treat any quote as an equally valid separator, one could just use zz <- strsplit(x, "\'|\"") as the splitting criterion in the code above.
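The same idea can also be written as a small standalone wrapper around strwrap rather than a patch to it. This is only a sketch: the name wrapProtect and the "~|~" placeholder are made up, it splits on double quotes to match the question's example, and it assumes "~|~" never occurs in the input (a string ending in a quote would need extra care, since strsplit drops a trailing empty piece):

```r
wrapProtect <- function(x, width = 40) {
  zz <- strsplit(x, "\"")[[1]]                # even-numbered pieces were inside quotes
  if (length(zz) %% 2 == 0) stop("unbalanced quotes")
  quoted <- seq_along(zz) %% 2 == 0
  zz[quoted] <- gsub(" ", "~|~", zz[quoted])  # hide spaces inside quoted sections
  y <- strwrap(paste(zz, collapse = "\""), width = width)
  gsub("~\\|~", " ", y)                       # restore the hidden spaces
}
test <- "function(x=123456789, y=\"This is a long string argument\")"
wrapProtect(test, width = 40)
# [1] "function(x=123456789,"
# [2] "y=\"This is a long string argument\")"
```

The quoted section comes out on its own line because, once its spaces are hidden, strwrap treats it as one overlong word that cannot be broken.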
I am new to R and I am currently having trouble with reading a series of strings until I encounter an EOF. Not only do I not know how to detect EOF, but I also don't know how to read a single whitespace-separated string, which is trivial to do in any other language I have seen so far. In C, I would simply do:
while (scanf("%s", s) == 1) { /* do something with s */ }
If possible, I would prefer a solution which does not require knowing the maximum length of strings in advance.
Any ideas?
EDIT: I am looking for solution which does not store all the input into memory, but the one equivalent or at least similar to the C code above.
Here's a way to read one item at a time... It uses the fact that scan has an nmax parameter (and n and nlines - it's actually kind of a mess!).
# First create a sample file to read from...
writeLines(c("Hello world", "and now", "Goodbye"), "foo.txt")
# Use a file connection to read from...
f <- file("foo.txt", "r")
i <- 0L
repeat {
  s <- scan(f, "", nmax=1, quiet=TRUE)
  if (length(s) == 0) break
  i <- i + 1L
  cat("Read item #", i, ": ", s, "\n", sep="")
}
close(f)
When scan encounters EOF, it returns a zero-length vector. So a more obscure but C-like way would be:
while (length(s <- scan(f, "", nmax=1, quiet=TRUE))) {
  i <- i + 1L
  cat("Read item #", i, ": ", s, "\n", sep="")
}
In any case, the output would be:
Read item #1: Hello
Read item #2: world
Read item #3: and
Read item #4: now
Read item #5: Goodbye
Finally, if you could vectorize what you do to the strings, you should probably try to read a bunch of them at a time - just change nmax to, say, 10000.
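For example, the loop above becomes a batched reader like this (reusing the same foo.txt; 10000 is an arbitrary batch size):

```r
# Batched version: read up to 10000 items per scan() call
writeLines(c("Hello world", "and now", "Goodbye"), "foo.txt")
f <- file("foo.txt", "r")
items <- character(0)
repeat {
  batch <- scan(f, "", nmax = 10000, quiet = TRUE)
  if (length(batch) == 0) break
  items <- c(items, batch)  # vectorized processing of the batch would go here
}
close(f)
```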
> txt <- "This is an example" # could be from a file but will use textConnection()
> read.table(textConnection(txt))
V1 V2 V3 V4
1 This is an example
read.table is implemented with scan, so you can just look at the code to see how the experts did it.