How to read whitespace-delimited strings until EOF in R

I am new to R and am having trouble reading a series of strings until I encounter EOF. Not only do I not know how to detect EOF, I also don't know how to read a single whitespace-separated string, which is trivial in every other language I have seen so far. In C, I would simply do:
while (scanf("%s", s) == 1) { /* do something with s */ }
If possible, I would prefer a solution which does not require knowing the maximum length of strings in advance.
Any ideas?
EDIT: I am looking for a solution that does not store all the input in memory, but one equivalent or at least similar to the C code above.

Here's a way to read one item at a time... It uses the fact that scan has an nmax parameter (and n and nlines - it's actually kind of a mess!).
# First create a sample file to read from...
writeLines(c("Hello world", "and now", "Goodbye"), "foo.txt")
# Use a file connection to read from...
f <- file("foo.txt", "r")
i <- 0L
repeat {
  s <- scan(f, "", nmax=1, quiet=TRUE)
  if (length(s) == 0) break
  i <- i + 1L
  cat("Read item #", i, ": ", s, "\n", sep="")
}
close(f)
When scan encounters EOF, it returns a zero-length vector. So a more obscure but C-like way would be:
while (length(s <- scan(f, "", nmax=1, quiet=TRUE))) {
  i <- i + 1L
  cat("Read item #", i, ": ", s, "\n", sep="")
}
In any case, the output would be:
Read item #1: Hello
Read item #2: world
Read item #3: and
Read item #4: now
Read item #5: Goodbye
Finally, if you could vectorize what you do to the strings, you should probably try to read a bunch of them at a time - just change nmax to, say, 10000.
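As a rough sketch of that chunked approach (process_tokens, fun and chunk_size are hypothetical names introduced only for this illustration), the loop can be wrapped in a small helper that hands each batch of tokens to a callback, so that at most chunk_size strings are held in memory at a time:

```r
# Hypothetical helper: read whitespace-delimited tokens in chunks and pass
# each chunk to `fun`; stops when scan() returns a zero-length vector (EOF).
process_tokens <- function(path, fun, chunk_size = 10000L) {
  con <- file(path, "r")
  on.exit(close(con))            # close the connection even if fun() fails
  repeat {
    tokens <- scan(con, what = "", nmax = chunk_size, quiet = TRUE)
    if (length(tokens) == 0L) break
    fun(tokens)
  }
  invisible(NULL)
}

# Sample input, same content as foo.txt above, written to a temporary file
path <- tempfile(fileext = ".txt")
writeLines(c("Hello world", "and now", "Goodbye"), path)

# Example callback: accumulate the total number of characters read
total_chars <- 0L
process_tokens(path, function(tokens) {
  total_chars <<- total_chars + sum(nchar(tokens))
}, chunk_size = 2L)
print(total_chars)
unlink(path)
```

A chunk size of 2 is used here only to force several iterations on a tiny file; for real input a much larger value is probably the better trade-off.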

> txt <- "This is an example" # could be from a file but will use textConnection()
> read.table(textConnection(txt))
V1 V2 V3 V4
1 This is an example
read.table is implemented with scan, so you can just look at the code to see how the experts did it.
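If the input is small enough to hold in memory, scan can also split a string directly through its text argument. A minimal sketch (the variable name words is just for illustration):

```r
txt <- "This is an example"
# what = "" asks scan() for character tokens; whitespace is the default separator
words <- scan(text = txt, what = "", quiet = TRUE)
print(words)
```

Note that this reads everything at once, so it does not address the memory constraint from the question's edit.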

Related

Use scanf to split a string on a non-whitespace separator

I aim to scan a string containing a colon as a division and save both parts of it in a tuple.
For example:
input: "a:b"
output: ("a", "b")
My approach so far keeps getting the error message:
"scanf: bad input at char number 9: looking for ':', found '\n'".
Scanf.bscanf Scanf.Scanning.stdin "%s:%s" (fun x y -> (x,y));;
Additionally, my approach works with integers, I'm confused why it is not working with strings.
Scanf.bscanf Scanf.Scanning.stdin "%d:%d" (fun x y -> (x,y));;
4:3
- : int * int = (4, 3)
The reason for the issue you're seeing is that the first %s is going to keep consuming input until one of the following conditions holds:
a whitespace has been found,
a scanning indication has been encountered,
the end-of-input has been reached.
Note that seeing a colon isn't going to satisfy any of these (if you don't use a scanning indication). This means that the first %s is going to consume everything up to, in your case, the newline character in the input buffer, and then the : is going to fail.
You don't have this same issue for %d:%d because %d isn't going to consume the colon as part of matching an integer.
You can fix this by instead using a format string which will not consume the colon, e.g., %[^:]:%s. You could also use a scanning indication, like so: %s@:%s.
Additionally, your current method won't consume any trailing whitespace in the buffer, which might result in newlines being added to the first element on subsequent use of this, so you might prefer %s@:%s\n to consume the newline.
So, in all,
Scanf.bscanf Scanf.Scanning.stdin "%s@:%s\n" (fun x y -> (x,y));;
The %s specifier is greedy and it will read the string up to whitespace or a scanning indication. The indication is specified using @<indicator> just after the %s specifier, where <indicator> is a single character, e.g.,
let split str =
  Scanf.sscanf str "%s@:%s" (fun x y -> x,y)
This will instruct scanf to read everything up to : into the first string, drop : and then read the rest into the second string.
The string specifier %s is eager by default and will swallow all your content until the next space. You need to add a scanning indication (https://ocaml.org/api/Scanf.html#indication) to tell Scanf.sscanf that you expect the first string to end at the first colon.
For instance,
Scanf.sscanf "a:b"
  "%s@:%s"
  (fun x y -> x,y)
returns ("a", "b"). Here the scanning indication is the @: just after the first %s specifier. In general, scanning indications are written @c for a character c.

Write statement for a complex format / possibility to write more than once on the same excel line

I am currently working on a program that opens .txt documents one by one, extracts data, and finally fills an Excel document.
Because I did not know how to write several times to the same line of my Excel document after one write statement (each write jumps to the next line), I created a character string that is filled step by step:
Data (data_limite(x),x=1,8)/10, 9, 10, 7, 9, 8, 8, 9/
do file_descr = 1,nombre_fichier,1
    taille_data1 = data_limite(file_descr)
    nvari = taille_data1-7
    write (new_data1,"(A30,A3,A11,A3,F5.1,A3,A7,F4.1,<nvari>(A3))") description,char(9),'T-isotherme',char(9),T_trait,char(9),'d_gamma',taille_Gam,(char(9),i=1,nvari)
    ecriture_descr = ecriture_descr//new_data1
end do
The main issue was that I wanted to adapt the number of char(9) to the data_limite value, so I built a write statement with a variable number of char(9).
At the end of the do-loop, ecriture_descr has a very complex format with no periodic structure, because the nvari value changes.
Now I want to add this to the first line of my Excel document:
Open(Unit= 20 ,File='resultats.RES',status='replace')
write(20,100) 'param',char(9),char(9),char(9),char(9),char(9),'*',char(9),'nuances',char(9),'*',char(9),ecriture_descr
100 format (a5,5(a3),a,a3,a7,a,a3,???)
but I do not know how to write this format. It would have been easier if, at each iteration of the do-loop, I could write to the first line of my Excel document and keep filling that line with each new new_data1 value.
EDIT: maybe adding advance='no' to my write statement would help; I am currently trying it.
EDIT 2: it did not work with advance='no', but adding a '$' at the end of the format in my write statement suppresses the carriage return. By moving it into my do-loop, I guess I can solve my problem :). I am currently trying it.
First of all, your line
ecriture_descr = ecriture_descr//new_data1
is almost certainly not doing what you expect it to do. I assume that both ecriture_descr and new_data1 are of type CHARACTER(len=<some value>), that is, fixed-length strings. If you assign anything to such a string, the value is truncated to that length (if it is too long) or padded with spaces (if it is too short):
program strings
    implicit none
    character(len=8) :: h
    h = "Hello"
    print *, "|" // h // "|"    ! Prints "|Hello   |"
    h = "Hello World"
    print *, "|" // h // "|"    ! Prints "|Hello Wo|"
end program strings
And this combination will work against you: ecriture_descr will already be padded to the max with spaces, so when you append new_data1 it will be just outside the range of ecriture_descr, a bit like this:
h = "Hello"         ! h is actually "Hello   "
h = h // "World"    ! equiv to h = "Hello   " // "World"
                    !            = "Hello   World"
                    !               ^^^^^^^^
                    ! Only this is assigned to h => no change
If you want a string aggregator, you need to use the trim function which removes all trailing spaces:
h = trim(h) // " World"
Secondly, if you want to write to a file, but don't want to have a newline, you can add the option advance='no' into the write statement:
do i = 1, 100
    write(*, '(I4)', advance='no') i
end do
This should make your job a lot easier than to create one very long string in memory and then write it all out in one go.

Collapsing character vectors with sprintf instead of paste

I have mostly used paste or paste0 for my pasting tasks in the past, but I'm pretty fascinated by the speed of sprintf. Yet I feel that I'm missing some of its basics.
I wondered whether there is also a way to collapse a multi-element character vector into one of length 1, as paste does with its collapse argument, that is, without having to specify the placeholders and their values manually (with paste, I simply leave it to the function to work out how many elements should be collapsed).
x <- c("Pasted string:", "hello", "world!")
> sprintf("%s %s %s", x[1], x[2], x[3])
[1] "Pasted string: hello world!"
> paste(x, collapse=" ")
[1] "Pasted string: hello world!"
I'm looking for something like this (pseudo code)
> sprintf("<the-correct-parameter>", x)
[1] "Pasted string: hello world!"
For the interested: benchmark of sprintf vs. paste
require("microbenchmark")
t1 <- median(microbenchmark(sprintf("%s %s %s", x[1], x[2], x[3]))$time)
t2 <- median(microbenchmark(paste(x, collapse=" "))$time)
> t1/t2
[1] 0.7273114
The function sprintf recycles its format string, so for example the code
cat(sprintf("%8.4f",rnorm(5)),"\n")
prints something like
-0.5685 -0.6481 0.6296 -0.0043 -1.4763
str = sprintf("%8.4f",rnorm(5))
stores the output in a vector of strings and
str_one = paste(sprintf("%8.4f",rnorm(5)),collapse='')
stores the output in a single string. The format string does not need to specify the number of floats to be printed. This also holds for printing integers and strings with the %d and %s formats.
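Applied to the vector from the question, this could look like the sketch below. The second variant, which builds the format string from length(x) and splices the elements in with do.call, is an addition for illustration rather than something the answer proposes; fmt, pieces, collapsed and via_docall are names made up here.

```r
x <- c("Pasted string:", "hello", "world!")

# sprintf() is vectorised: one format, three inputs, three output strings
pieces <- sprintf("%s", x)
collapsed <- paste(pieces, collapse = " ")

# Alternative without paste(collapse=): generate one "%s" per element and
# pass the elements as separate arguments via do.call()
fmt <- paste(rep("%s", length(x)), collapse = " ")
via_docall <- do.call(sprintf, c(list(fmt), as.list(x)))

print(collapsed)
print(via_docall)
```

Either way sprintf itself has no collapse argument, so some call to paste (or an explicitly generated format) still seems to be needed.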

Wrapping strings, but not substrings in quotes, using R

This question is related to my question about Roxygen.
I want to write a new function that does word wrapping of strings, similar to strwrap or stringr::str_wrap, but with the following twist: Any elements (substrings) in the string that are enclosed in quotes must not be allowed to wrap.
So, for example, using the following sample data
test <- "function(x=123456789, y=\"This is a long string argument\")"
cat(test)
function(x=123456789, y="This is a long string argument")
strwrap(test, width=40)
[1] "function(x=123456789, y=\"This is a long"
[2] "string argument\")"
I want the desired output of a newWrapFunction(x, width=40, ...) to be:
desired <- c("function(x=123456789, ", "y=\"This is a long string argument\")")
desired
[1] "function(x=123456789, "
[2] "y=\"This is a long string argument\")"
identical(desired, newWrapFunction(test, width=40))
[1] TRUE
Can you think of a way to do this?
PS. If you can help me solve this, I will propose this code as a patch to roxygen2. I have identified where this patch should be applied and will acknowledge your contribution.
Here's what I did to get strwrap so it would not break single quoted sections on spaces:
A) Pre-process the "even" sections after splitting by the single-quotes by substituting "~|~" for the spaces:
Define new function strwrapqt
....
zz <- strsplit(x, "\'") # will be only working on even numbered sections
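for (i in seq_along(zz) ){
  # 2 * seq_len(n %/% 2) gives the even indices and is empty when a string
  # has no quoted section (seq(2, 1, by=2) would raise an error there)
  for (evens in 2 * seq_len(length(zz[[i]]) %/% 2)) {
    zz[[i]][evens] <- gsub("[ ]", "~|~", zz[[i]][evens])}
}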
zz <- unlist(zz)
.... insert just before
z <- lapply(strsplit) ...........
Then at the end replace all the "~|~" with spaces. It might be necessary to do a lot more thinking about the other sorts of whitespace "events" to get a fully regular treatment.
....
y <- gsub("~\\|~", " ", y)
....
Edit: Tested @joran's suggestion. Matching single and double quotes would be a difficult task with the methods I am using, but if one were willing to consider any quote as equally valid as a separator target, one could just use zz <- strsplit(x, "\'|\"") as the splitting criterion in the code above.
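The same placeholder idea can be sketched as a self-contained wrapper around the exported strwrap, without patching its internals. This is only an illustration under stated assumptions: wrap_keep_quotes and its placeholder argument are hypothetical names, only double-quoted sections are protected, escaped quotes inside a string are not handled, and the placeholder character must not occur in the input.

```r
# Hypothetical helper: wrap text but never break inside "double-quoted" parts.
wrap_keep_quotes <- function(x, width = 40, placeholder = "~") {
  # Assumption: the placeholder does not already occur in the input
  stopifnot(!any(grepl(placeholder, x, fixed = TRUE)))
  # Locate every "..." section and hide its spaces from strwrap()
  m <- gregexpr('"[^"]*"', x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(q) {
    gsub(" ", placeholder, q, fixed = TRUE)
  })
  wrapped <- strwrap(x, width = width)
  # Restore the hidden spaces
  gsub(placeholder, " ", wrapped, fixed = TRUE)
}

test <- "function(x=123456789, y=\"This is a long string argument\")"
print(wrap_keep_quotes(test, width = 40))
```

Because strwrap drops the whitespace at a break, the first element comes back as "function(x=123456789," without the trailing space shown in the desired output of the question.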

Insert line breaks in long string -- word wrap

Here is a function I wrote to break a long string into lines not longer than a given length
strBreakInLines <- function(s, breakAt=90, prepend="") {
  words <- unlist(strsplit(s, " "))
  if (length(words)<2) return(s)
  wordLen <- unlist(Map(nchar, words))
  lineLen <- wordLen[1]
  res <- words[1]
  lineBreak <- paste("\n", prepend, sep="")
  for (i in 2:length(words)) {
    lineLen <- lineLen+wordLen[i]
    if (lineLen < breakAt)
      res <- paste(res, words[i], sep=" ")
    else {
      res <- paste(res, words[i], sep=lineBreak)
      lineLen <- 0
    }
  }
  return(res)
}
It works for the problem I had; but I wonder if I can learn something here. Is there a shorter or more efficient solution, especially can I get rid of the for loop?
How about this:
gsub('(.{1,90})(\\s|$)', '\\1\n', s)
It will break string "s" into lines with maximum 90 chars (excluding the line break character "\n", but including inter-word spaces), unless there is a word itself exceeding 90 chars, then that word itself will occupy a whole line.
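To see the pattern at work on something short, here is a small sketch with the limit lowered to 10 characters. The sample sentence and the final sub call, which strips the newline that the pattern appends after the last line, are additions for illustration:

```r
s <- "The quick brown fox jumps over the lazy dog"

# Same pattern as above, with a maximum line width of 10 instead of 90
wrapped <- gsub("(.{1,10})(\\s|$)", "\\1\n", s)

# The pattern also matches at the end of the string, so drop the final "\n"
wrapped <- sub("\n$", "", wrapped)
cat(wrapped, "\n")
```

With this input the breaks should fall after "The quick", "brown fox", "jumps over" and "the lazy", leaving "dog" on the last line.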
By the way, your function seems broken --- you should replace
lineLen <- 0
with
lineLen <- wordLen[i]
For the sake of completeness, Karsten W.'s comment points at strwrap, which is the easiest function to remember:
strwrap("Lorem ipsum... you know the routine", width=10)
and to match exactly the solution proposed in the question, the string has to be pasted afterwards:
paste(strwrap(s,90), collapse="\n")
This post is deliberately made community wiki since the honor of finding the function isn't mine.
For further completeness, there's:
stringi::stri_wrap
stringr::str_wrap (which ultimately just calls stringi::stri_wrap)
The stringi version will deal with character sets better (it's built on the ICU library) and it's in C/C++ so it'll ultimately be faster than base::strwrap. It's also vectorized over the str parameter.
You can look at e.g. the write.dcf() function in R itself; it also uses a loop, so there is nothing to be ashamed of here.
The first goal is to get it right --- see Chambers (2008).

Resources