How do I get Haskell to output numbers NOT in scientific notation? - haskell

I have some items that I want to partition into a number of buckets, such that each bucket is some fraction larger than the last.
items = 500
chunks = 5
increment = 0.20
{- find the proportions -}
sizes = take chunks (iterate (+increment) 1)
base = sum sizes / items
buckets = map (base *) sizes
main = print buckets
I'm sure there is a mathematically more elegant way to do this, but that's not my question.
The end step is always printing out in scientific notation.
How do I get plain decimal output? I've looked at the Numeric package but I'm getting nowhere fast.

> putStrLn $ Numeric.showFFloat Nothing 1e40 ""
10000000000000000000000000000000000000000.0

Try printf. e.g.:
> import Text.Printf
> printf "%d\n" (23::Int)
23
> printf "%s %s\n" "Hello" "World"
Hello World
> printf "%.2f\n" pi
3.14

Related

Printf text and return value of a method call

Disclaimer: I am a total newb to haskell, but I can't find the answer. Maybe I am searching in the wrong way or it is so basic that nobody even asks that.
Here is what I try to do:
import Text.Printf
factorial n = if n < 2 then 1 else n * factorial (n-1)
main = do
let input = 22
printf "Some text... %d! = %d" input (factorial input)
But that doesn't work, a bunch of errors appear. Can you give me a quick hint, what I am doing wrong?
The only problem in your code is that input has an ambiguous type.
import Text.Printf
factorial n = if n < 2 then 1 else n * factorial (n-1)
main = do
let input = 22::Integer
printf "Some text... %d! = %d" input (factorial input)
return ()
The problem is that the compiler cannot infer the type of input. To fix this, you need to provide it explicitly:
import Text.Printf
factorial n = if n < 2 then 1 else n * factorial (n-1)
main = do
let input = 22 :: Integer
printf "Some text... %d! = %d" input (factorial input)
Note that Integer will work for very large results, whereas Int won't. Quoting the Haskell Wikibook:
"Integer" is an arbitrary precision type: it will hold any number no
matter how big, up to the limit of your machine's memory…. This means
you never have arithmetic overflows. On the other hand it also means
your arithmetic is relatively slow. Lisp users may recognise the
"bignum" type here.
"Int" is the more common 32 or 64 bit integer. Implementations vary,
although it is guaranteed to be at least 30 bits.
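To make the quoted difference concrete, here is a small sketch in Python rather than Haskell. Python's built-in int is arbitrary precision, much like Integer. The wrap64 helper is hypothetical: it only imitates how a signed 64-bit Int would wrap around, assuming a 64-bit implementation.

```python
import math

INT64_MAX = 2**63 - 1  # largest value of a signed 64-bit integer


def wrap64(n):
    """Hypothetical helper: reduce n the way signed 64-bit arithmetic would wrap."""
    return (n + 2**63) % 2**64 - 2**63


exact = math.factorial(22)     # arbitrary precision, like Haskell's Integer
print(exact)                   # 1124000727777607680000
print(exact > INT64_MAX)       # True: 22! does not fit in 64 bits
print(wrap64(exact) == exact)  # False: a 64-bit Int would silently give a wrong value
```

So for the factorial in the question, Integer is the safer annotation: 20! still fits in 64 bits, but 21! and above do not.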

Formatting %e output of sprintf

Is it possible to format the output of sprintf like the following, or should I use another function?
Say I have a variable dt = 9.765625e-05 and I want to use sprintf to build a string for use when saving, say, a figure:
fig = figure(nfig);
plot(x,y);
figStr = sprintf('NS2d_dt%e',dt);
saveas(fig,figStr,'pdf')
The punctuation mark dot presents me with problems, some systems mistake the format of the file.
using
figStr = sprintf('NS2d_dt%.2e',dt);
then
figStr = NS2d_dt9.77e-05
using
figStr = sprintf('NS2d_dt%.e',dt);
then
figStr = NS2d_dt1e-04
which is not precise enough. I would like something like this
using
figStr = sprintf('NS2d_dt%{??}e',dt);
then
figStr = NS2d_dt9765e-08
Essentially, the only way to get your desired output is to manipulate either the value or the strings. Here are two solutions: the first uses string manipulation and the second manipulates the value. Hopefully these two approaches will help you reason out solutions to other problems, particularly the number manipulation.
String Manipulation
Solution
fmt = @(x) sprintf('%d%.0fe%03d', (sscanf(sprintf('%.4e', x), '%d.%de%d').' .* [1 0.1 1]) - [0 0.5 3]);
Explanation
First I use sprintf to print the number in a defined format
>> sprintf('%.4e', dt)
ans =
9.7656e-05
then sscanf to read it back in making sure to remove the . and e
>> sscanf(sprintf('%.4e', dt), '%d.%de%d').'
ans =
9 7656 -5
before printing it back we perform some manipulation of the data to get the correct values for printing
>> (sscanf(sprintf('%.4e', dt), '%d.%de%d').' .* [1 0.1 1]) - [0 0.5 3]
ans =
9 765.1 -8
and now we print
>> sprintf('%d%.0fe%03d', (sscanf(sprintf('%.4e', dt), '%d.%de%d').' .* [1 0.1 1]) - [0 0.5 3])
ans =
9765e-08
Number Manipulation
Solution
orderof = @(x) floor(log10(abs(x)));
fmt = @(x) sprintf('%.0fe%03d', x*(10^(abs(orderof(x))+3))-0.5, orderof(x)-3);
Explanation
First I create an anonymous orderof function which tells me the order (the number after e) of the input value. So
>> dt = 9.765625e-05;
>> orderof(dt)
ans =
-5
Next we manipulate the number to convert it to a 4-digit integer; this is the effect of adding 3 in
>> floor(dt*(10^(abs(orderof(dt))+3)))
ans =
9765
Finally, before printing the value we need to figure out the new exponent with
>> orderof(dt)-3
ans =
-8
and printing will give us
>> sprintf('%.0fe%03d', floor(dt*(10^(abs(orderof(dt))+3))), orderof(dt)-3)
ans =
9765e-08
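The same number manipulation can be sketched in Python for comparison. This is only an illustration under stated assumptions: fmt_no_dot and its digits parameter are made-up names, and the shift is computed as digits - 1 - order instead of abs(order) + 3, so that it also covers values with a non-negative exponent.

```python
import math


def fmt_no_dot(x, digits=4):
    """Format x as '<integer mantissa>e<exponent>' with no decimal point."""
    if x == 0:
        raise ValueError("log10 is undefined for 0")
    order = math.floor(math.log10(abs(x)))         # the number after 'e'
    shift = digits - 1 - order                     # power of ten that moves the point
    mantissa = math.floor(abs(x) * 10.0 ** shift)  # truncate, do not round
    sign = "-" if x < 0 else ""
    # :03d pads the exponent to width 3 including its sign, like %03d
    return f"{sign}{mantissa}e{order - (digits - 1):03d}"


print(fmt_no_dot(9.765625e-05))  # 9765e-08
```

As in the MATLAB version, the mantissa is truncated, not rounded, so the result is 9765 and not 9766.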
Reading your question,
The punctuation mark dot presents me with problems, some systems mistake the format of the file.
it seems to me that your actual problem is that when you build the file name using, for example
figStr = sprintf('NS2d_dt%.2e',dt);
you get
figStr = NS2d_dt9.77e-05
and then, when you use that string as the filename, everything after the . is interpreted as the extension and .pdf is not appended, so in Explorer you cannot open the file by double-clicking on it.
Considering that changing the representation of the number dt from 9.765e-05 to 9765e-08 seems quite weird, you can try the following approach:
use the print function to save your figure as .pdf
add .pdf to the format specifier
This should give you both the right file extension and the right format for the dt value.
peaks
figStr = sprintf('NS2d_dt_%.2e.pdf',dt);
print(gcf,'-dpdf', figStr )
Hope this helps.
figStr = sprintf('NS2d_dt%1.4e',dt)
figStr =
NS2d_dt9.7656e-05
In the format %1.4e, the number before the dot (1 here) is the minimum field width and the number after the dot (4 here) is the number of digits printed after the decimal point.
Regarding your request:
A = num2str(dt); %// convert to string
B = A([1 3 4 5]); %// extract first four digits
C = A(end-2:end); %// extract power
fspec = 'NS2d_dt%de%d'; %// format spec
sprintf(fspec ,str2num(B),str2num(C)-3)
NS2d_dt9765e-8

Fast partial string matching in R

Given a vector of strings texts and a vector of patterns patterns, I want to find any matching pattern for each text.
For small datasets, this can be easily done in R with grepl:
patterns = c("some","pattern","a","horse")
texts = c("this is a text with some pattern", "this is another text with a pattern")
# for each x in patterns
lapply( patterns, function(x){
# match all texts against pattern x
res = grepl( x, texts, fixed=TRUE )
print(res)
# do something with the matches
# ...
})
This solution is correct, but it doesn't scale up. Even with moderately bigger datasets (~500 texts and patterns), this code is embarrassingly slow, solving only about 100 cases per second on a modern machine, which is ridiculous considering that this is crude partial string matching without regex (set with fixed=TRUE). Even making the lapply parallel does not solve the issue.
Is there a way to re-write this code efficiently?
Thanks,
Mulone
Use the stringi package - it's even faster than grepl. Check the benchmarks!
I used the text from @Martin-Morgan's post
require(stringi)
require(microbenchmark)
text = readLines("~/Desktop/pg100.txt")
pattern <- strsplit("all the world's a stage and all the people players", " ")[[1]]
grepl_fun <- function(){
lapply(pattern, grepl, text, fixed=TRUE)
}
stri_fixed_fun <- function(){
lapply(pattern, function(x) stri_detect_fixed(text,x,NA))
}
# microbenchmark(grepl_fun(), stri_fixed_fun())
# Unit: milliseconds
# expr min lq median uq max neval
# grepl_fun() 432.9336 435.9666 446.2303 453.9374 517.1509 100
# stri_fixed_fun() 213.2911 218.1606 227.6688 232.9325 285.9913 100
# if you don't believe me that the results are equal, you can check :)
xx <- grepl_fun()
stri <- stri_fixed_fun()
for(i in seq_along(xx)){
print(all(xx[[i]] == stri[[i]]))
}
Have you accurately characterized your problem and the performance you're seeing? Here are the Complete Works of William Shakespeare and a query against them
text = readLines("~/Downloads/pg100.txt")
pattern <-
strsplit("all the world's a stage and all the people players", " ")[[1]]
which seems to be much more performant than you imply?
> length(text)
[1] 124787
> system.time(xx <- lapply(pattern, grepl, text, fixed=TRUE))
user system elapsed
0.444 0.001 0.444
## avoid retaining memory; 500 x 500 case; no blank lines
> text = text[nzchar(text)]
> system.time({ for (p in rep(pattern, 50)) grepl(p, text[1:500], fixed=TRUE) })
user system elapsed
0.096 0.000 0.095
We're expecting linear scaling with both the length (number of elements) of pattern and text. It seems I mis-remember my Shakespeare
> idx = Reduce("+", lapply(pattern, grepl, text, fixed=TRUE))
> range(idx)
[1] 0 7
> sum(idx == 7)
[1] 8
> text[idx == 7]
[1] " And all the men and women merely players;"
[2] " cicatrices to show the people when he shall stand for his place."
[3] " Scandal'd the suppliants for the people, call'd them"
[4] " all power from the people, and to pluck from them their tribunes"
[5] " the fashion, and so berattle the common stages (so they call"
[6] " Which God shall guard; and put the world's whole strength"
[7] " Of all his people and freeze up their zeal,"
[8] " the world's end after my name-call them all Pandars; let all"
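For readers who want to see what the fixed (non-regex) matching above computes, here is a minimal Python sketch using the patterns and texts from the question. The function names match_matrix and hits_per_text are made up for illustration; the second one mirrors the Reduce("+", ...) count of patterns found per text.

```python
patterns = ["some", "pattern", "a", "horse"]
texts = ["this is a text with some pattern",
         "this is another text with a pattern"]


def match_matrix(patterns, texts):
    """One row per pattern, one boolean per text (plain substring test)."""
    return [[p in t for t in texts] for p in patterns]


def hits_per_text(patterns, texts):
    """How many of the patterns occur in each text."""
    return [sum(p in t for p in patterns) for t in texts]


print(match_matrix(patterns, texts))
# [[True, False], [True, True], [True, True], [False, False]]
print(hits_per_text(patterns, texts))  # [3, 2]
```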

algorithm/code in R to find pattern from any position in a string

I want to find the pattern from any position in any given string such that the pattern repeats for a threshold number of times at least.
For example for the string "a0cc0vaaaabaaaabaaaabaa00bvw" the pattern should come out to be "aaaab". Another example: for the string "ff00f0f0f0f0f0f0f0f0000" the pattern should be "0f".
In both cases threshold has been taken as 3 i.e. the pattern should be repeated for at least 3 times.
If someone can suggest an optimized method in R for finding a solution to this problem, please do share with me. Currently I am achieving this by using 3 nested loops, and it's taking a lot of time.
Thanks!
Use regular expressions, which are made for this type of stuff. There may be more optimized ways of doing it, but in terms of easy to write code, it's hard to beat. The data:
vec <- c("a0cc0vaaaabaaaabaaaabaa00bvw","ff00f0f0f0f0f0f0f0f0000")
The function that does the matching:
find_rep_path <- function(vec, reps) {
regexp <- paste0(c("(.+)", rep("\\1", reps - 1L)), collapse="")
match <- regmatches(vec, regexpr(regexp, vec, perl=T))
substr(match, 1, nchar(match) / reps)
}
And some tests:
sapply(vec, find_rep_path, reps=3L)
# a0cc0vaaaabaaaabaaaabaa00bvw ff00f0f0f0f0f0f0f0f0000
# "aaaab" "0f0f"
sapply(vec, find_rep_path, reps=5L)
# $a0cc0vaaaabaaaabaaaabaa00bvw
# character(0)
#
# $ff00f0f0f0f0f0f0f0f0000
# [1] "0f"
Note that with a threshold of 3, the actual longest pattern for the second string is 0f0f, not 0f (it reverts to 0f at threshold 5). In order to do this, I use back references (\\1), and repeat these as many times as necessary to reach the threshold. I then need to substr the result because, annoyingly, base R doesn't have an easy way to get just the captured sub-expressions when using Perl-compatible regular expressions. There is probably a not-too-hard way to do this, but the substr approach works well in this example.
Also, as per the discussion in @G. Grothendieck's answer, here is the version with a cap on the length of the pattern, which just adds the limit argument and slightly modifies the regexp.
find_rep_path <- function(vec, reps, limit) {
regexp <- paste0(c("(.{1,", limit,"})", rep("\\1", reps - 1L)), collapse="")
match <- regmatches(vec, regexpr(regexp, vec, perl=T))
substr(match, 1, nchar(match) / reps)
}
sapply(vec, find_rep_path, reps=3L, limit=3L)
# a0cc0vaaaabaaaabaaaabaa00bvw ff00f0f0f0f0f0f0f0f0000
# "a" "0f"
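The back-reference idea is not specific to R. As a rough sketch, the same regular expression can be built with Python's re module; find_rep_pattern is a hypothetical name, and here the captured group can be read directly, so no substr step is needed.

```python
import re


def find_rep_pattern(s, reps, limit=None):
    """Return the first unit that repeats `reps` times in a row, or ''."""
    # (.+) or (.{1,limit}) captures the unit; each \1 demands one more copy.
    unit = "(.+)" if limit is None else "(.{1,%d})" % limit
    m = re.search(unit + r"\1" * (reps - 1), s)
    return m.group(1) if m else ""


print(find_rep_pattern("a0cc0vaaaabaaaabaaaabaa00bvw", 3))      # aaaab
print(find_rep_pattern("ff00f0f0f0f0f0f0f0f0000", 3))           # 0f0f
print(find_rep_pattern("ff00f0f0f0f0f0f0f0f0000", 3, limit=3))  # 0f
```

Like regexpr, re.search returns the leftmost match and the greedy group takes the longest unit available at that position, so this is not guaranteed to be the longest repeated unit in the whole string.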
find.string finds substring of maximum length subject to (1) substring must be repeated consecutively at least th times and (2) substring length must be no longer than len.
reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times
find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
for(k in len:1) {
pat <- paste0("(.{", k, "})", reps("\\1", th-1))
r <- regexpr(pat, string, perl = TRUE)
if (attr(r, "capture.length") > 0) break
}
if (r > 0) substring(string, r, r + attr(r, "capture.length")-1) else ""
}
and here are some tests. The last test processes the entire text of James Joyce's Ulysses in 1.4 seconds on my laptop:
> find.string("a0cc0vaaaabaaaabaaaabaa00bvw")
[1] "aaaab"
> find.string("ff00f0f0f0f0f0f0f0f0000")
[1] "0f0f"
>
> joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
> joycec <- paste(joyce, collapse = " ")
> system.time(result <- find.string(joycec, len = 25))
user system elapsed
1.36 0.00 1.39
> result
[1] " Hoopsa boyaboy hoopsa!"
ADDED
Although I developed my answer before having seen BrodieG's, as he points out they are very similar to each other. I have added some features of his to the above to get the solution below and tried the tests again. Unfortunately when I added the variation of his code the James Joyce example no longer works although it does work on the other two examples shown. The problem seems to be in adding the len constraint to the code and may represent a fundamental advantage of the code above (i.e. it can handle such a constraint and such constraints may be essential for very long strings).
find.string2 <- function(string, th = 3, len = floor(nchar(string)/th)) {
pat <- paste0(c("(.", "{1,", len, "})", rep("\\1", th-1)), collapse = "")
r <- regexpr(pat, string, perl = TRUE)
ifelse(r > 0, substring(string, r, r + attr(r, "capture.length")-1), "")
}
> find.string2("a0cc0vaaaabaaaabaaaabaa00bvw")
[1] "aaaab"
> find.string2("ff00f0f0f0f0f0f0f0f0000")
[1] "0f0f"
> system.time(result <- find.string2(joycec, len = 25))
user system elapsed
0 0 0
> result
[1] "w"
REVISED The James Joyce test that was supposed to be testing find.string2 was actually using find.string. This is now fixed.
This function is not optimized (even though it is fast), but I think it is a more R-like way to do this.
1. Get all patterns of a certain length > threshold: vectorized using mapply and substr.
2. Get the occurrences of these patterns and extract the one with the maximum occurrence: vectorized using str_locate_all.
3. Repeat steps 1-2 for all lengths and take the one with the maximum occurrence.
Here is my code. I am creating 2 functions (one for steps 1-2 and one for step 3):
library(stringr)
ss = "ff00f0f0f0f0f0f0f0f0000"
ss <- "a0cc0vaaaabaaaabaaaabaa00bvw"
find_pattern_length <-
function(length=1,ss){
patt = mapply(function(x,y) substr(ss,x,y),
1:(nchar(ss)-length),
(length+1):nchar(ss))
res = str_locate_all(ss,unique(patt))
ll = unlist(lapply(res,length))
list(patt = patt[which.max(ll)],
rep = max(ll))
}
get_pattern_threshold <-
function(ss,threshold =3 ){
res <-
sapply(seq(threshold,nchar(ss)),find_pattern_length,ss=ss)
res[,which.max(res['rep',])]
}
some tests:
get_pattern_threshold('ff00f0f0f0f0f0f0f0f0000',5)
$patt
[1] "0f0f0"
$rep
[1] 6
> get_pattern_threshold('ff00f0f0f0f0f0f0f0f0000',2)
$patt
[1] "f0"
$rep
[1] 18
Since you want at least three repetitions, there is a nice O(n^2) approach.
For each possible pattern length d cut string into parts of length d. In case of d=5 it would be:
a0cc0
vaaaa
baaaa
baaaa
baa00
bvw
Now look at each pair of subsequent strings A[k] and A[k+1]. If they are equal, then there is a pattern of at least two repetitions. Then go further (k+2, k+3) and so on. Finally, you also check whether the suffix of A[k-1] and the prefix of A[k+n] fit (where k+n is the first string that doesn't match).
Repeat it for each d starting from some upper bound (at most n/3).
You have n/3 possible lengths, then n/d strings of length d to check for each d. It should give complexity O(n (n/d) d)= O(n^2).
Maybe not optimal but I found this cutting idea quite neat ;)
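A quadratic scan in the same spirit can be sketched in Python. Note that this is a swapped-in, simpler variant and not the chunking itself: instead of cutting the string into pieces of length d, it counts how many consecutive positions i satisfy s[i] == s[i+d]. A run of (th-1)*d such positions means a unit of length d is repeated th times in a row. The name find_repeated is made up for this sketch.

```python
def find_repeated(s, th=3):
    """Longest unit repeated at least `th` times consecutively, or ''."""
    n = len(s)
    for d in range(n // th, 0, -1):      # try the longest unit length first
        run = 0                          # consecutive i with s[i] == s[i + d]
        for i in range(n - d):
            if s[i] == s[i + d]:
                run += 1
                if run == (th - 1) * d:  # th copies of a length-d unit found
                    start = i - run + 1
                    return s[start:start + d]
            else:
                run = 0
    return ""


print(find_repeated("a0cc0vaaaabaaaabaaaabaa00bvw"))  # aaaab
print(find_repeated("ff00f0f0f0f0f0f0f0f0000"))       # 0f0f
```

Each value of d costs one pass over the string and there are at most n/th values of d, so the total work stays within O(n^2).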
For a bounded pattern (i.e not huge) it's best I think to just create all possible substrings first and then count them. This is if the sub-patterns can overlap. If not change the step fun in the loop.
pat="a0cc0vaaaabaaaabaaaabaa00bvw"
len=nchar(pat)
thr=3
reps=floor(len/2)
# all poss strings up to half length of pattern
library(stringr)
library(zoo) # provides rollapply
pat=str_split(pat, "")[[1]][-1]
str.vec=vector()
for(win in 2:reps)
{
str.vec= c(str.vec, rollapply(data=pat,width=win,FUN=paste0, collapse=""))
}
# the max length string repeated at least 3 times
tbl=table(str.vec)
tbl=tbl[tbl>=3]
tbl[which.max(nchar(names(tbl)))]
aaaabaa
3
NB Whilst I'm lazy and append/grow the str.vec here in a loop, for a larger problem I'm pretty sure the actual length of str.vec is predetermined by the length of the pattern if you care to work it out.
Here is my solution. It's not optimized (it grows the vector with patterns <- c() ; patterns <- c(patterns, x), for example) and can be improved, but it is simpler than yours, I think.
I can't tell exactly which pattern should be returned (I just return the most frequent one), but you can adjust the code to get exactly what you want.
str <- "a0cc0vaaaabaaaabaaaabaa00bvw"
findPatternMax <- function(str){
nb <- nchar(str):1
length.patt <- rev(nb)
patterns <- c()
for (i in 1:length(nb)){
for (j in 1:nb[i]){
patterns <- c(patterns, substr(str, j, j+(length.patt[i]-1)))
}
}
patt.max <- names(which(table(patterns) == max(table(patterns))))
return(patt.max)
}
findPatternMax(str)
> findPatternMax(str)
[1] "a"
EDIT :
Maybe you want the returned pattern to have a minimum length?
Then you can add a nchar.patt parameter, for example:
nchar.patt <- 2 #For a pattern of 2 char min
nb <- nb[length.patt >= nchar.patt]
length.patt <- length.patt[length.patt >= nchar.patt]

Matlab: Convert cell string (comma separated) to vector

I have a huge csv file (as in: more than a few gigs) and would like to read it in Matlab and process each line. Reading the file in its entirety is impossible, so I use this code to read in each line:
fileName = 'input.txt';
inputfile = fopen(fileName);
while 1
tline = fgetl(inputfile);
if ~ischar(tline)
break
end
end
fclose(inputfile);
This yields a cell array of size(1,1) with the line as a string. What I would like is to convert this cell to a normal array with just the numbers.
For example:
input.csv:
0.0,0.0,3.201,0.192
2.0,3.56,0.0,1.192
0.223,0.13,3.201,4.018
End result in Matlab for the first line:
A = [0.0,0.0,3.201,0.192]
I tried converting tline with double(tline) but this yields completely different results. Also tried using a regex but got stuck there. I got to the point where I split up all values into a different cell in one array. But converting to double with str2double yields only NaNs...
Any tips? Preferably without any loops since it already takes a while to read the entire file.
You are looking for str2num
>> A = '0.0,0.0,3.201,0.192';
>> str2num(A)
ans =
0 0 3.2010 0.1920
>> A = '0.0 0.0 3.201 0.192';
>> str2num(A)
ans =
0 0 3.2010 0.1920
>> A = '0.0 0.0 , 3.201 , 0.192';
>> str2num(A)
ans =
0 0 3.2010 0.1920
i.e., it's quite agnostic to input format.
However, I would not advise this for your use case. For your problem, I'd do
C = dlmread('input.txt',',', [1 1 1 inf]) % for first line
C = dlmread('input.txt',',') % for entire file
or
[a,b,c,d] = textread('input.txt','%f,%f,%f,%f',1) % for first line
[a,b,c,d] = textread('input.txt','%f,%f,%f,%f') % for entire file
if you want all columns in separate variables:
a = 0
b = 0
c = 3.201
d = 0.192
or
fid = fopen('input.txt','r');
C = textscan(fid, '%f %f %f %f', 1); % for first line only
C = textscan(fid, '%f %f %f %f', N); % for first N lines
C = textscan(fid, '%f %f %f %f', 1, 'headerlines', N-1); % for Nth line only
fclose(fid);
all of which are much more easily expandable (things like this, whatever they are, tend to grow bigger over time :). Especially dlmread is much less prone to errors than writing your own clauses is, for empty lines, missing values and other great nuisances very common in most data sets.
Try
data = dlmread('input.txt',',')
It will do exactly what you want to do.
If you still want to convert string to a vector:
line_data = sscanf(line,'%g,',inf)
This code will read the entire comma-separated string and convert each number.
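The read-one-line-then-convert pattern is the same in any language. As a minimal sketch in Python (not MATLAB), with made-up helper names parse_line and rows, and an in-memory io.StringIO standing in for the real input.txt:

```python
import io


def parse_line(line):
    """Convert one comma-separated line into a list of floats."""
    return [float(tok) for tok in line.split(",") if tok.strip()]


def rows(handle):
    """Yield one parsed row at a time so the whole file never sits in memory."""
    for line in handle:
        if line.strip():  # skip blank lines
            yield parse_line(line)


# io.StringIO is a stand-in for open("input.txt") in this sketch.
sample = io.StringIO("0.0,0.0,3.201,0.192\n2.0,3.56,0.0,1.192\n")
print(next(rows(sample)))  # [0.0, 0.0, 3.201, 0.192]
```

Because rows is a generator, only the current line is held in memory, which is the property the question needs for a multi-gigabyte file.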
