I have the following string:
x = 'aaabbbbbaaaaaacccccbbbbbbbbbbbbbbb'. I want to get an output like this: abaacbbb, in which "a" will be compressed with a length of 3 and "b" will be compressed with a length of 5. I used the following function, but it removes all the adjacent duplicates and the output is: abacb :
def remove_dup(x):
    if len(x) < 2:
        return x
    if x[0] != x[1]:
        return x[0] + remove_dup(x[1:])
    return remove_dup(x[1:])

x = 'aaabbbbbaaaaaacccccbbbbbbbbbbbbbbb'
print(remove_dup(x))
It would be wonderful if somebody could help me with this.
Thank you!
Unless this is a homework question with special constraints, this would be more conveniently and arguably more readably implemented with a regex substitution that replaces desired quantities of specific characters with a single instance of the captured character:
import re
def remove_dup(x):
    # collapse each run of 3 "a"s, or of 5 "b"s or "c"s, into one character
    return re.sub('(a){3}|([bc]){5}', r'\1\2', x)

x = 'aaabbbbbaaaaaacccccbbbbbbbbbbbbbbb'
print(remove_dup(x))
This outputs:
abaacbbb
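If a regex is not an option, here is a minimal non-regex sketch of the same idea using itertools.groupby. The name compress_runs and the sizes dictionary are hypothetical, chosen only for illustration. For each run it emits one character per full block plus any leftover characters, which is what the regex substitution above also leaves behind:

```python
from itertools import groupby

def compress_runs(text, sizes):
    """Collapse every full block of sizes[ch] identical characters into one."""
    out = []
    for ch, group in groupby(text):
        run = len(list(group))
        size = sizes.get(ch, 1)          # characters without a size are kept as-is
        full, rest = divmod(run, size)   # full blocks, plus leftover characters
        out.append(ch * (full + rest))
    return "".join(out)

x = 'aaabbbbbaaaaaacccccbbbbbbbbbbbbbbb'
print(compress_runs(x, {'a': 3, 'b': 5, 'c': 5}))  # abaacbbb
```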
tl;dr
I'm just looking for two functions, f from double to string and g from string to double, such that g(f(d)) == d for any double d (scalar and real double).
Original question
How do I convert a double to a string or char array in a reversible way? I mean, in such a way that afterward I can convert that string/char array back to double retrieving the original result.
I've found formattedDisplayText, and in some situations it works:
>> x = eps
x =
2.220446049250313e-16
>> double(formattedDisplayText(x, 'NumericFormat', 'long')) - x
ans =
0
But in others it doesn't
x = rand(1)
x =
0.546881519204984
>> double(formattedDisplayText(x, 'NumericFormat', 'long')) - x
ans =
1.110223024625157e-16
As regards this and other tools like num2str, mat2str, at the end they all require me to decide a precision, whereas I would like to express the idea of "use whatever precision is needed for you (MATLAB) to be able to read back your own number".
Here are two simpler solutions to convert a single double value to a string and back without loss.
I want the string to be a human-readable representation of the number
Use mat2str to obtain 17 significant decimal digits in string form, and str2double to convert back:
>> s = mat2str(x,17)
s =
'2.2204460492503131e-16'
>> y = str2double(s);
>> y==x
ans =
logical
1
Note that 17 significant decimal digits are always enough to represent any IEEE double-precision floating-point number.
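The 17-digit property belongs to IEEE doubles rather than to MATLAB, so it can be checked in other languages too. Below is a small Python sketch of the same round trip. The function names are hypothetical, and ".17g" is Python's format for 17 significant digits:

```python
import math

def double_to_text(x):
    # 17 significant decimal digits, analogous to mat2str(x, 17)
    return format(x, ".17g")

def text_to_double(s):
    return float(s)

samples = [2.0 ** -52, 0.1, math.sqrt(math.pi), 1e308, 5e-324]
for value in samples:
    # every sample must survive the trip through its text form unchanged
    assert text_to_double(double_to_text(value)) == value

print(double_to_text(2.0 ** -52))  # 2.2204460492503131e-16
```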
I want a more compact string representation of the number
Use matlab.net.base64encode to encode the 8 bytes of the number. Unfortunately you can only encode strings and integer arrays, so we type cast to some integer array (we use uint8 here, but uint64 would work too). We reverse the process to get the same double value back:
>> s = matlab.net.base64encode(typecast(x,'uint8'))
s =
'AAAAAAAAsDw='
>> y = typecast(matlab.net.base64decode(s),'double');
>> x==y
ans =
logical
1
Base64 encodes every 3 bytes in 4 characters, so the 8 bytes of a double become 12 characters including padding. This is the most compact representation you can easily create. A more complex algorithm could likely convert into a smaller UTF-8-encoded string (which uses more than 6 bits per displayable character).
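For comparison, here is a rough Python sketch of the same byte-level encoding. It assumes little-endian byte order, which is what typecast produces on common platforms. The helper names are hypothetical:

```python
import base64
import struct

def encode_double(x):
    # pack the double into its 8 raw bytes (little-endian), then Base64-encode
    return base64.b64encode(struct.pack("<d", x)).decode("ascii")

def decode_double(s):
    # reverse the process: Base64-decode, then reinterpret the 8 bytes
    return struct.unpack("<d", base64.b64decode(s))[0]

eps = 2.0 ** -52
s = encode_double(eps)
print(s)                        # AAAAAAAAsDw=
print(decode_double(s) == eps)  # True
```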
Function f: from double real-valued scalar x to char vector str
str = num2str(typecast(x, 'uint8'));
str is built as a string containing 8 numbers, which correspond to the bytes in the internal representation of x. The function typecast extracts the bytes as a numerical vector, and num2str converts to a char vector with numbers separated by spaces.
Function g: from char vector str to double real-valued scalar y
y = typecast(uint8(str2double(strsplit(str))), 'double');
The char vector is split at spaces using strsplit. The result is a cell array of char vectors, each of which is then interpreted as a number by str2double, which produces a numerical vector. The numbers are cast to uint8 and then typecast interprets them as the internal representation of a double real-valued scalar.
Note that str2double(strsplit(str)) is preferred over the simpler str2num(str), because str2num internally calls eval, which is considered bad practice.
Example
>> format long
>> x = sqrt(pi)
x =
1.772453850905516
>> str = num2str(typecast(x, 'uint8'))
str =
'106 239 180 145 248 91 252 63'
>> y = typecast(uint8(str2double(strsplit(str))), 'double')
y =
1.772453850905516
>> x==y
ans =
logical
1
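The same pair of functions can be sketched in Python to show what the byte string contains. This is only an illustration with hypothetical names, again assuming little-endian byte order:

```python
import math
import struct

def double_to_byte_text(x):
    # f: the 8 bytes of the double, written as space-separated integers
    return " ".join(str(b) for b in struct.pack("<d", x))

def byte_text_to_double(text):
    # g: parse the integers back into bytes and reinterpret them as a double
    return struct.unpack("<d", bytes(int(tok) for tok in text.split()))[0]

x = math.sqrt(math.pi)
text = double_to_byte_text(x)
print(text)
print(byte_text_to_double(text) == x)  # True
```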
So let's say I have a string that says "m * x + b". I want to find any letter chars, other than x, and surround them with text.
In this example, output should be "var['m'] * x + var['b']"
A tiny regular expression solves your problem:
import re
s = "m * x + b"
print(re.sub("([a-wyzA-Z])", r"var['\1']", s))
Output:
var['m'] * x + var['b']
Explanation:
[a-wyzA-Z] matches any single character within the brackets: a-w, y, z and A-Z (so basically every letter but lowercase x)
(...) makes the found match accessible later via \1
r"var['\1']" is the replacement, referring to the match via \1
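The pattern above wraps each letter individually, so a name like "slope" would be split into five pieces. If multi-letter names can occur (an assumption, since the question only shows single letters), a word-boundary variant might look like the sketch below. The function name wrap_names is hypothetical:

```python
import re

def wrap_names(expr, keep="x"):
    # match whole identifiers, but skip the one that is exactly `keep`
    pattern = r"\b(?!" + re.escape(keep) + r"\b)[A-Za-z_]\w*"
    # \g<0> refers to the entire match
    return re.sub(pattern, r"var['\g<0>']", expr)

print(wrap_names("m * x + b"))       # var['m'] * x + var['b']
print(wrap_names("slope * x + b0"))  # var['slope'] * x + var['b0']
```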
I want to find the pattern from any position in any given string such that the pattern repeats for a threshold number of times at least.
For example for the string "a0cc0vaaaabaaaabaaaabaa00bvw" the pattern should come out to be "aaaab". Another example: for the string "ff00f0f0f0f0f0f0f0f0000" the pattern should be "0f".
In both cases threshold has been taken as 3 i.e. the pattern should be repeated for at least 3 times.
If someone can suggest an optimized method in R for finding a solution to this problem, please do share with me. Currently I am achieving this by using 3 nested loops, and it's taking a lot of time.
Thanks!
Use regular expressions, which are made for this type of stuff. There may be more optimized ways of doing it, but in terms of easy to write code, it's hard to beat. The data:
vec <- c("a0cc0vaaaabaaaabaaaabaa00bvw","ff00f0f0f0f0f0f0f0f0000")
The function that does the matching:
find_rep_path <- function(vec, reps) {
  regexp <- paste0(c("(.+)", rep("\\1", reps - 1L)), collapse="")
  match <- regmatches(vec, regexpr(regexp, vec, perl=TRUE))
  substr(match, 1, nchar(match) / reps)
}
And some tests:
sapply(vec, find_rep_path, reps=3L)
# a0cc0vaaaabaaaabaaaabaa00bvw ff00f0f0f0f0f0f0f0f0000
# "aaaab" "0f0f"
sapply(vec, find_rep_path, reps=5L)
# $a0cc0vaaaabaaaabaaaabaa00bvw
# character(0)
#
# $ff00f0f0f0f0f0f0f0f0000
# [1] "0f"
Note that with a threshold of 3, the longest pattern for the second string is actually 0f0f, not 0f (it reverts to 0f at threshold 5). To do this, I use back references (\\1) and repeat them as many times as necessary to reach the threshold. I then need to substr the result because, annoyingly, base R doesn't have an easy way to get just the captured subexpressions when using Perl-compatible regular expressions. There is probably a not-too-hard way to do this, but the substr approach works well in this example.
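For readers who want to try the back-reference trick outside R, here is a rough Python equivalent. It is only a sketch and the function name is made up. Python's re module exposes the captured group directly, so no substr step is needed:

```python
import re

def find_rep_pattern(text, reps):
    # "(.+)" followed by reps-1 back references: the group must repeat reps times
    regexp = "(.+)" + r"\1" * (reps - 1)
    match = re.search(regexp, text)
    return match.group(1) if match else None

for s in ["a0cc0vaaaabaaaabaaaabaa00bvw", "ff00f0f0f0f0f0f0f0f0000"]:
    print(find_rep_pattern(s, 3), find_rep_pattern(s, 5))
# aaaab None
# 0f0f 0f
```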
Also, as per the discussion in @G. Grothendieck's answer, here is the version with a cap on the length of the pattern, which just adds the limit argument and slightly modifies the regexp.
find_rep_path <- function(vec, reps, limit) {
  regexp <- paste0(c("(.{1,", limit,"})", rep("\\1", reps - 1L)), collapse="")
  match <- regmatches(vec, regexpr(regexp, vec, perl=TRUE))
  substr(match, 1, nchar(match) / reps)
}
sapply(vec, find_rep_path, reps=3L, limit=3L)
# a0cc0vaaaabaaaabaaaabaa00bvw ff00f0f0f0f0f0f0f0f0000
# "a" "0f"
find.string finds substring of maximum length subject to (1) substring must be repeated consecutively at least th times and (2) substring length must be no longer than len.
reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times
find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
  for(k in len:1) {
    pat <- paste0("(.{", k, "})", reps("\\1", th-1))
    r <- regexpr(pat, string, perl = TRUE)
    if (attr(r, "capture.length") > 0) break
  }
  if (r > 0) substring(string, r, r + attr(r, "capture.length")-1) else ""
}
and here are some tests. The last test processes the entire text of James Joyce's Ulysses in 1.4 seconds on my laptop:
> find.string("a0cc0vaaaabaaaabaaaabaa00bvw")
[1] "aaaab"
> find.string("ff00f0f0f0f0f0f0f0f0000")
[1] "0f0f"
>
> joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
> joycec <- paste(joyce, collapse = " ")
> system.time(result <- find.string(joycec, len = 25))
user system elapsed
1.36 0.00 1.39
> result
[1] " Hoopsa boyaboy hoopsa!"
ADDED
Although I developed my answer before having seen BrodieG's, as he points out they are very similar to each other. I have added some features of his to the above to get the solution below and tried the tests again. Unfortunately when I added the variation of his code the James Joyce example no longer works although it does work on the other two examples shown. The problem seems to be in adding the len constraint to the code and may represent a fundamental advantage of the code above (i.e. it can handle such a constraint and such constraints may be essential for very long strings).
find.string2 <- function(string, th = 3, len = floor(nchar(string)/th)) {
  pat <- paste0(c("(.", "{1,", len, "})", rep("\\1", th-1)), collapse = "")
  r <- regexpr(pat, string, perl = TRUE)
  ifelse(r > 0, substring(string, r, r + attr(r, "capture.length")-1), "")
}
> find.string2("a0cc0vaaaabaaaabaaaabaa00bvw")
[1] "aaaab"
> find.string2("ff00f0f0f0f0f0f0f0f0000")
[1] "0f0f"
> system.time(result <- find.string2(joycec, len = 25))
user system elapsed
0 0 0
> result
[1] "w"
REVISED The James Joyce test that was supposed to be testing find.string2 was actually using find.string. This is now fixed.
This function is not optimized (even though it is fast), but I think it is a more R-like way to do this.
1. Get all patterns of a certain length > threshold: vectorized using mapply and substr.
2. Get the occurrences of these patterns and extract the one with the maximum occurrence: vectorized using str_locate_all.
3. Repeat steps 1-2 for all lengths and take the one with the maximum occurrence.
Here is my code. I am creating two functions, one for steps 1-2 and one for step 3:
library(stringr)
ss = "ff00f0f0f0f0f0f0f0f0000"
ss <- "a0cc0vaaaabaaaabaaaabaa00bvw"
find_pattern_length <-
  function(length=1,ss){
    patt = mapply(function(x,y) substr(ss,x,y),
                  1:(nchar(ss)-length),
                  (length+1):nchar(ss))
    res = str_locate_all(ss,unique(patt))
    ll = unlist(lapply(res,length))
    list(patt = patt[which.max(ll)],
         rep = max(ll))
  }

get_pattern_threshold <-
  function(ss,threshold =3 ){
    res <-
      sapply(seq(threshold,nchar(ss)),find_pattern_length,ss=ss)
    res[,which.max(res['rep',])]
  }
some tests:
get_pattern_threshold('ff00f0f0f0f0f0f0f0f0000',5)
$patt
[1] "0f0f0"
$rep
[1] 6
> get_pattern_threshold('ff00f0f0f0f0f0f0f0f0000',2)
$patt
[1] "f0"
$rep
[1] 18
Since you want at least three repetitions, there is a nice O(n^2) approach.
For each possible pattern length d, cut the string into parts of length d. In the case of d=5 it would be:
a0cc0
vaaaa
baaaa
baaaa
baa00
bvw
Now look at each pair of subsequent strings A[k] and A[k+1]. If they are equal, then there is a pattern of at least two repetitions. Then go further (k+2, k+3, and so on). Finally, also check whether a suffix of A[k-1] and a prefix of A[k+n] fit (where k+n is the first string that doesn't match).
Repeat it for each d starting from some upper bound (at most n/3).
You have n/3 possible lengths, then n/d strings of length d to check for each d. It should give complexity O(n (n/d) d)= O(n^2).
Maybe not optimal but I found this cutting idea quite neat ;)
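A closely related O(n^2) scan can be sketched as follows. Note that this is a simplified variation, not the chunk-cutting procedure itself. Instead of cutting the string into parts, it compares each character with the one d positions later and looks for a run of (threshold - 1) * d consecutive matches. Such a run means that a block of length d repeats threshold times back to back. The function name is hypothetical:

```python
def longest_repeated_block(s, threshold=3):
    n = len(s)
    # try the longest candidate length first, as described above
    for d in range(n // threshold, 0, -1):
        run = 0  # number of consecutive positions i with s[i] == s[i + d]
        for i in range(n - d):
            if s[i] == s[i + d]:
                run += 1
                if run >= (threshold - 1) * d:
                    start = i - run + 1
                    return s[start:start + d]
            else:
                run = 0
    return ""

print(longest_repeated_block("a0cc0vaaaabaaaabaaaabaa00bvw"))  # aaaab
print(longest_repeated_block("ff00f0f0f0f0f0f0f0f0000"))       # 0f0f
```

Each value of d costs one pass over the string, and there are at most n/threshold values of d, which gives the same O(n^2) bound.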
For a bounded pattern (i.e. not huge), I think it's best to just create all possible substrings first and then count them. This assumes the sub-patterns can overlap; if not, change the step in the loop.
pat="a0cc0vaaaabaaaabaaaabaa00bvw"
len=nchar(pat)
thr=3
reps=floor(len/2)
# all possible strings up to half the length of the pattern
library(stringr)
library(zoo)  # provides rollapply
pat=str_split(pat, "")[[1]][-1]
str.vec=vector()
for(win in 2:reps)
{
  str.vec= c(str.vec, rollapply(data=pat,width=win,FUN=paste0, collapse=""))
}
# the max length string repeated at least 3 times
tbl=table(str.vec)
tbl=tbl[tbl>=3]
tbl[which.max(nchar(names(tbl)))]
aaaabaa
3
NB Whilst I'm lazy and append/grow the str.vec here in a loop, for a larger problem I'm pretty sure the actual length of str.vec is predetermined by the length of the pattern if you care to work it out.
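The create-then-count idea can also be sketched compactly in Python with collections.Counter. The function name and parameters below are hypothetical, and occurrences are allowed to overlap, as in the approach above:

```python
from collections import Counter

def longest_frequent_substring(s, min_count=3, max_len=None):
    if max_len is None:
        max_len = len(s) // 2  # same bound as above: up to half the length
    # count every substring of every window width (overlaps allowed)
    counts = Counter(
        s[i:i + w]
        for w in range(1, max_len + 1)
        for i in range(len(s) - w + 1)
    )
    frequent = [sub for sub, c in counts.items() if c >= min_count]
    if not frequent:
        return None
    best = max(frequent, key=len)  # longest substring that is frequent enough
    return best, counts[best]

print(longest_frequent_substring("a0cc0vaaaabaaaabaaaabaa00bvw"))
# ('aaaabaa', 3)
```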
Here is my solution. It's not optimized (it grows a vector with patterns <- c(); patterns <- c(patterns, x), for example) and can be improved, but I think it is simpler than yours.
I don't understand exactly which pattern should be returned (I just return the most frequent one), but you can adjust the code to what you want.
str <- "a0cc0vaaaabaaaabaaaabaa00bvw"
findPatternMax <- function(str){
  nb <- nchar(str):1
  length.patt <- rev(nb)
  patterns <- c()
  for (i in 1:length(nb)){
    for (j in 1:nb[i]){
      patterns <- c(patterns, substr(str, j, j+(length.patt[i]-1)))
    }
  }
  patt.max <- names(which(table(patterns) == max(table(patterns))))
  return(patt.max)
}
findPatternMax(str)
> findPatternMax(str)
[1] "a"
EDIT :
Maybe you want the returned pattern to have a minimum length?
Then you can add an nchar.patt parameter, for example:
nchar.patt <- 2 #For a pattern of 2 char min
nb <- nb[length.patt >= nchar.patt]
length.patt <- length.patt[length.patt >= nchar.patt]
I really need help in writing this function in Haskell, I don't even know where to start. Here are the specs:
Define a function flagpattern that takes a positive Int value greater than or equal to five and returns a String that can be displayed as the following `flag' pattern of dimension n, e.g.
Main> putStr (flagpattern 7)
#######
##   ##
# # # #
#  #  #
# # # #
##   ##
#######
Assuming you want an "X" enclosed in 4 lines, you need to write a function that, given a coordinate (x,y), returns the character that should be at that position:
coordinate n x y = if x == 0 then 'X' else ' '
(This version outputs only the leftmost X's; modify it, and remember that indices start at 0.)
Now you want them nicely arranged in a matrix, use a list comprehension, described in the linked text.
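To make the coordinate idea concrete without giving away the Haskell, here is the same decomposition sketched in Python. The names flag_char and flag_pattern are hypothetical. It reads the pattern as a border plus both diagonals, which is an interpretation of the example in the question:

```python
def flag_char(n, row, col):
    # a cell is filled if it lies on the border or on either diagonal
    on_border = row in (0, n - 1) or col in (0, n - 1)
    on_diagonal = row == col or row + col == n - 1
    return "#" if on_border or on_diagonal else " "

def flag_pattern(n):
    # build each row from its cells, then end every row with a newline
    return "".join(
        "".join(flag_char(n, r, c) for c in range(n)) + "\n"
        for r in range(n)
    )

print(flag_pattern(7), end="")
```

The nested generator expressions play the role that a list comprehension would play in Haskell.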
You should start from your problem definition:
main :: IO ()
main = putStr . flagPattern $ 7
Then, you should ask yourself how many dots the flag has:
flagPattern :: Int -> String
flagPattern = magic $ [1..numberOfDots]
Then, the (hard) part of the magic function should decide for each dot whether it is " " or "#":
partOfMagic ...
| ... = "#" -- or maybe even "#\n" in some cases?
| otherwise = " "
Then, you can concatenate parts into one string and get the answer.
Start with the type signature.
flagpattern :: Int -> String
Now break the problem into subproblems. For example, suppose I told you to produce row 2 of a size 7 flag pattern. You would write:
XX   XX
Or row 3 of a size 7 flag pattern would be
X X X X
So suppose we had a function that could produce a given row. Then we'd have
flagpattern :: Int -> String
flagpattern size = unlines (??? flagrow ???)
flagrow :: Int -> Int -> String
flagrow row size = ???
unlines takes a list of Strings and turns it into a single String, appending a newline to each element of the list. See if you can define flagrow, and get it working correctly for any given row and size. Then see if you can use flagrow to define flagpattern.