How do I get the index of empty lines in a string in Haskell using pattern matching without importing anything? - haskell

For example:
nLn "This\n\nis an\nexample\n\n" == [2,5,6]
nLn "" == [1]
I have tried this but it doesn't seem to be working, why is that?
nLn :: Integral a => String -> [a]
nLn "" = [1]
nLn (x:xs) = [n | n <- lineBr (x:xs)]
where
lineBr (x:y:xs)
|x == y && y =='\n' = 1
|otherwise = 1+ lineBr (y:xs)
lineBr [x] = 0

I have to admit that I don't completely get where you are going with your attempted solution. The reason that it doesn't work, i.e., that is gets rejected by the type checker, is that while your helper function lineBr returns a single number, it is used by nLn as if it returns a list, i.e., as the generating expression in a list comprehension.
Here's an implementation (with a simpler type and an arguably more descriptive name) that does the trick:
indicesOfEmptyLines :: String -> [Int]
indicesOfEmptyLines = go 1 True
where
go _ False "" = []
go i True "" = [i]
go i False ('\n' : cs) = go (i + 1) True cs
go i True ('\n' : cs) = i : go (i + 1) True cs
go i _ (c : cs) = go i False cs
The idea is to have the heavy lifting done by a helper function go that will iterate over the characters in the input string. The helper function takes two additional arguments: the index of the line that it is currently processing (I followed your example and used 1-based indexing) and a Boolean indicating whether the currently processed line can still be empty.
We consider five cases:
If we reach the end of the string and the current line is known not to be empty, we know we are done and we have no more indices to report. Hence, we produce the empty list.
If we reach the end of the string and the current line can still be empty (i.e., we have not encountered any characters on it yet), we know we are done and that the last line was in fact an empty line. Hence, we produce a singleton list containing the index of the last line.
If we encounter a newline character and the current line is known not to be empty, we advance the index, acknowledge that the next line can still be empty (as we have not encountered any characters on it yet), and we proceed by processing the remainder of the string.
If we encounter a newline character and the current line can still be empty, then we know that the current line is in fact empty and we prepend the current index to any indices that the recursive call to go will produce. We advance the index, acknowledge that the next line can still be empty, and we proceed by processing the remainder of the string.
If we encounter any other character than a newline, we know that the current line cannot be empty (hence, we pass False to the recursive call to go) and we proceed by processing the remainder of the string.
Seeing it in action:
> indicesOfEmptyLines "This\n\nis an\nexample\n\n"
[2,5,6]
> indicesOfEmptyLines ""
[1]

Related

How convert first char to lowerCase

Try to play with string and I have string like: "Hello.Word" or "stackOver.Flow"
and i what first char convert to lower case: "hello.word" and "stackOver.flow"
For snakeCase it easy we need only change UpperCase to lower and add '_'
but in camelCase (with firs char in lower case) i dont know how to do this
open System
let convertToSnakeCase (value:string) =
String [|
Char.ToLower value.[0]
for ch in value.[1..] do
if Char.IsUpper ch then '_'
Char.ToLower ch |]
Who can help?
module Identifier =
open System
let changeCase (str : string) =
if String.IsNullOrEmpty(str) then str
else
let isUpper = Char.IsUpper
let n = str.Length
let builder = new System.Text.StringBuilder()
let append (s:string) = builder.Append(s) |> ignore
let rec loop i j =
let k =
if i = n (isUpper str.[i] && (not (isUpper str.[i - 1])
((i + 1) <> n && not (isUpper str.[i + 1]))))
then
if j = 0 then
append (str.Substring(j, i - j).ToLower())
elif (i - j) > 2 then
append (str.Substring(j, 1))
append (str.Substring(j + 1, i - j - 1).ToLower())
else
append (str.Substring(j, i - j))
i
else
j
if i = n then builder.ToString()
else loop (i + 1) k
loop 1 0
type System.String with
member x.ToCamelCase() = changeCase x
printfn "%s" ("StackOver.Flow".ToCamelCase()) //stackOver.Flow
//need stackOver.flow
I suspect there are much more elegant and concise solutions, I sense you are learning functional programming, so I think its best to do stuff like this with recursive function rather than use some magic library function. I notice in your question you ARE using a recusive function, but also an index into an array, lists and recursive function work much more easily than arrays, so if you use recursion the solution is usually simpler if its a list.
I'd also avoid using a string builder, assuming you are learning fp, string builders are imperative, and whilst they obviously work, they wont help you get your head around using immutable data.
The key then is to use the pattern match to match the scenario that you want to use to trigger the upper/lower case logic, as it depends on 2 consecutive characters.
I THINK you want this to happen for the 1st char, and after a '.'?
(I've inserted a '.' as the 1st char to allow the recursive function to just process the '.' scenario, rather than making a special case).
let convertToCamelCase (value : string) =
let rec convertListToCamelCase (value : char list) =
match value with
| [] -> []
| '.' :: second :: rest ->
'.' :: convertListToCamelCase (Char.ToLower second :: rest)
| c :: rest ->
c :: convertListToCamelCase rest
// put a '.' on the front to simplify the logic (and take it off after)
let convertAsList = convertListToCamelCase ('.' :: (value.ToCharArray() |> Array.toList))
String ((convertAsList |> List.toArray).[1..])
The piece to worry about is the recusive piece, the rest of it is just flipping an array to a list and back again.

Replace string only if all characters match (Thai)

The problem is that มาก technically is in มาก็. Because มาก็ is มาก + ็.
So when I do
"แชมพูมาก็เยอะ".replace("มาก", " X ")
I end up with
แชมพู X ็เยอะ
And what I want
แชมพู X เยอะ
What I really want is to force the last character ก็ to count as a single character, so that มาก no longer matches มาก็.
While I haven't found a proper solution, I was able to find a solution. I split each string into separate (combined) characters via regex. Then I compare those lists to each other.
# Check is list is inside other list
def is_slice_in_list(s,l):
len_s = len(s) #so we don't recompute length of s on every iteration
return any(s == l[i:len_s+i] for i in range(len(l) - len_s+1))
def is_word_in_string(w, s):
a = regex.findall(u'\X', w)
b = regex.findall(u'\X', s)
return is_slice_in_list(a, b)
assert is_word_in_string("มาก็", "พูมาก็เยอะ") == True
assert is_word_in_string("มาก", "พูมาก็เยอะ") == False
The regex will split like this:
พู ม า ก็ เ ย อ ะ
ม า ก
And as it compares ก็ to ก the function figures the words are not the same.
I will mark as answered but if there is a nice or "proper" solution I will chose that one.

Getting the largest and smallest word at a string

when I run this codes the output is (" "," "),however it should be ("I","love")!!!, and there is no errors . what should I do to fix it ??
sen="I love dogs"
function Longest_word(sen)
x=" "
maxw=" "
minw=" "
minl=1
maxl=length(sen)
p=0
for i=1:length(sen)
if(sen[i]!=" ")
x=[x[1]...,sen[i]...]
else
p=length(x)
if p<min1
minl=p
minw=x
end
if p>maxl
maxl=p
maxw=x
end
x=" "
end
end
return minw,maxw
end
As #David mentioned, another and may be better solution can be achieved by using split function:
function longest_word(sentence)
sp=split(sentence)
len=map(length,sp)
return (sp[indmin(len)],sp[indmax(len)])
end
The idea of your code is good, but there are a few mistakes.
You can see what's going wrong by debugging a bit. The easiest way to do this is with #show, which prints out the value of variables. When code doesn't work like you expect, this is the first thing to do -- just ask it what it's doing by printing everything out!
E.g. if you put
if(sen[i]!=" ")
x=[x[1]...,sen[i]...]
#show x
and run the function with
Longest_word("I love dogs")
you will see that it is not doing what you want it to do, which (I believe) is add the ith letter to the string x.
Note that the ith letter accessed like sen[i] is a character not a string.
You can try converting it to a string with
string(sen[i])
but this gives a Unicode string, not an ASCII string, in recent versions of Julia.
In fact, it would be better not to iterate over the string using
for i in 1:length(sen)
but iterate over the characters in the string (which will also work if the string is Unicode):
for c in sen
Then you can initialise the string x as
x = UTF8String("")
and update it with
x = string(x, c)
Try out some of these possibilities and see if they help.
Also, you have maxl and minl defined wrong initially -- they should be the other way round. Also, the names of the variables are not very helpful for understanding what should happen. And the strings should be initialised to empty strings, "", not a string with a space, " ".
#daycaster is correct that there seems to be a min1 that should be minl.
However, in fact there is an easier way to solve the problem, using the split function, which divides a string into words.
Let us know if you still have a problem.
Here is a working version following your idea:
function longest_word(sentence)
x = UTF8String("")
maxw = ""
minw = ""
maxl = 0 # counterintuitive! start the "wrong" way round
minl = length(sentence)
for i in 1:length(sentence) # or: for c in sentence
if sentence[i] != ' ' # or: if c != ' '
x = string(x, sentence[i]) # or: x = string(x, c)
else
p = length(x)
if p < minl
minl = p
minw = x
end
if p > maxl
maxl = p
maxw = x
end
x = ""
end
end
return minw, maxw
end
Note that this function does not work if the longest word is at the end of the string. How could you modify it for this case?

Haskell recursive nextWord function

I wanted to implement a function nextWord which given a list of characters s (i.e. a String)
returns a pair consisiting of the next word in s and the remaining characters in s after the
first word has been removed. I tried this and I got error messages:
nextWord :: String -> (String, String)
nextWord [] = ([],[])
nextWord (next:rest)
| isSpace next = ([],rest)
| otherwise = next : [restWord,restString]
where
(restWord,restString) = nextWord next
However, when I looked at the solution I noticed that there is a big resemblance:
nextWord :: String -> ( String, String )
nextWord [] = ( [], [] )
nextWord ( next : rest )
| isSpace next = ([], rest)
| otherwise = let (restWord, restString) = nextWord rest
in (next : restWord, restString)
which works perfectly.
My question is first of all, why didn't my function work? I know I'm doing something wrong when defining restWord / restString.
Also, in the solution, how does the last line work?
next : restWord , restString
restWord, restString isn't a list, therefore how does the cont work here?
Thanks!
In the function you wrote, you sometimes return ([],rest), a tuple, and other times return next : [restWord,restString], a list. Those two types are incompatible, and so the function can't be typed.
In the working function, (next : restWord, restString) is a tuple: its fst is next:restWord and its snd is restString. This way, the function's return value is always a tuple, so the types line up.

algorithm/code in R to find pattern from any position in a string

I want to find the pattern from any position in any given string such that the pattern repeats for a threshold number of times at least.
For example for the string "a0cc0vaaaabaaaabaaaabaa00bvw" the pattern should come out to be "aaaab". Another example: for the string "ff00f0f0f0f0f0f0f0f0000" the pattern should be "0f".
In both cases threshold has been taken as 3 i.e. the pattern should be repeated for at least 3 times.
If someone can suggest an optimized method in R for finding a solution to this problem, please do share with me. Currently I am achieving this by using 3 nested loops, and it's taking a lot of time.
Thanks!
Use regular expressions, which are made for this type of stuff. There may be more optimized ways of doing it, but in terms of easy to write code, it's hard to beat. The data:
vec <- c("a0cc0vaaaabaaaabaaaabaa00bvw","ff00f0f0f0f0f0f0f0f0000")
The function that does the matching:
find_rep_path <- function(vec, reps) {
regexp <- paste0(c("(.+)", rep("\\1", reps - 1L)), collapse="")
match <- regmatches(vec, regexpr(regexp, vec, perl=T))
substr(match, 1, nchar(match) / reps)
}
And some tests:
sapply(vec, find_rep_path, reps=3L)
# a0cc0vaaaabaaaabaaaabaa00bvw ff00f0f0f0f0f0f0f0f0000
# "aaaab" "0f0f"
sapply(vec, find_rep_path, reps=5L)
# $a0cc0vaaaabaaaabaaaabaa00bvw
# character(0)
#
# $ff00f0f0f0f0f0f0f0f0000
# [1] "0f"
Note that with threshold as 3, the actual longest pattern for the second string is 0f0f, not 0f (reverts to 0f at threshold 5). In order to do this, I use back references (\\1), and repeat these as many time as necessary to reach threshold. I need to then substr the result because annoyingly base R doesn't have an easy way to get just the captured sub expressions when using perl compatible regular expressions. There is probably a not too hard way to do this, but the substr approach works well in this example.
Also, as per the discussion in #G. Grothendieck's answer, here is the version with the cap on length of pattern, which is just adding the limit argument and the slight modification of the regexp.
find_rep_path <- function(vec, reps, limit) {
regexp <- paste0(c("(.{1,", limit,"})", rep("\\1", reps - 1L)), collapse="")
match <- regmatches(vec, regexpr(regexp, vec, perl=T))
substr(match, 1, nchar(match) / reps)
}
sapply(vec, find_rep_path, reps=3L, limit=3L)
# a0cc0vaaaabaaaabaaaabaa00bvw ff00f0f0f0f0f0f0f0f0000
# "a" "0f"
find.string finds substring of maximum length subject to (1) substring must be repeated consecutively at least th times and (2) substring length must be no longer than len.
reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times
find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
for(k in len:1) {
pat <- paste0("(.{", k, "})", reps("\\1", th-1))
r <- regexpr(pat, string, perl = TRUE)
if (attr(r, "capture.length") > 0) break
}
if (r > 0) substring(string, r, r + attr(r, "capture.length")-1) else ""
}
and here are some tests. The last test processes the entire text of James Joyce's Ulysses in 1.4 seconds on my laptop:
> find.string("a0cc0vaaaabaaaabaaaabaa00bvw")
[1] "aaaab"
> find.string("ff00f0f0f0f0f0f0f0f0000")
[1] "0f0f"
>
> joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
> joycec <- paste(joyce, collapse = " ")
> system.time(result <- find.string2(joycec, len = 25))
user system elapsed
1.36 0.00 1.39
> result
[1] " Hoopsa boyaboy hoopsa!"
ADDED
Although I developed my answer before having seen BrodieG's, as he points out they are very similar to each other. I have added some features of his to the above to get the solution below and tried the tests again. Unfortunately when I added the variation of his code the James Joyce example no longer works although it does work on the other two examples shown. The problem seems to be in adding the len constraint to the code and may represent a fundamental advantage of the code above (i.e. it can handle such a constraint and such constraints may be essential for very long strings).
find.string2 <- function(string, th = 3, len = floor(nchar(string)/th)) {
pat <- paste0(c("(.", "{1,", len, "})", rep("\\1", th-1)), collapse = "")
r <- regexpr(pat, string, perl = TRUE)
ifelse(r > 0, substring(string, r, r + attr(r, "capture.length")-1), "")
}
> find.string2("a0cc0vaaaabaaaabaaaabaa00bvw")
[1] "aaaab"
> find.string2("ff00f0f0f0f0f0f0f0f0000")
[1] "0f0f"
> system.time(result <- find.string2(joycec, len = 25))
user system elapsed
0 0 0
> result
[1] "w"
REVISED The James Joyce test that was supposed to be testing find.string2 was actually using find.string. This is now fixed.
Not optimized (even it is fast) function , but I think it is more R way to do this.
Get all patterns of certains length > threshold : vectorized using mapply and substr
Get the occurrence of these patterns and extract the one with maximum occurrence : vectorized using str_locate_all.
Repeat 1-2 this for all lengths and tkae the one with maximum occurrence.
Here my code. I am creating 2 functions ( steps 1-2) and step 3:
library(stringr)
ss = "ff00f0f0f0f0f0f0f0f0000"
ss <- "a0cc0vaaaabaaaabaaaabaa00bvw"
find_pattern_length <-
function(length=1,ss){
patt = mapply(function(x,y) substr(ss,x,y),
1:(nchar(ss)-length),
(length+1):nchar(ss))
res = str_locate_all(ss,unique(patt))
ll = unlist(lapply(res,length))
list(patt = patt[which.max(ll)],
rep = max(ll))
}
get_pattern_threshold <-
function(ss,threshold =3 ){
res <-
sapply(seq(threshold,nchar(ss)),find_pattern_length,ss=ss)
res[,which.max(res['rep',])]
}
some tests:
get_pattern_threshold('ff00f0f0f0f0f0f0f0f0000',5)
$patt
[1] "0f0f0"
$rep
[1] 6
> get_pattern_threshold('ff00f0f0f0f0f0f0f0f0000',2)
$patt
[1] "f0"
$rep
[1] 18
Since you want at least three repetitions, there is a nice O(n^2) approach.
For each possible pattern length d cut string into parts of length d. In case of d=5 it would be:
a0cc0
vaaaa
baaaa
baaaa
baa00
bvw
Now look at each pairs of subsequent strings A[k] and A[k+1]. If they are equal then there is a pattern of at least two repetitions. Then go further (k+2, k+3) and so on. Finally you also check if suffix of A[k-1] and prefix of A[k+n] fit (where k+n is the first string that doesn't match).
Repeat it for each d starting from some upper bound (at most n/3).
You have n/3 possible lengths, then n/d strings of length d to check for each d. It should give complexity O(n (n/d) d)= O(n^2).
Maybe not optimal but I found this cutting idea quite neat ;)
For a bounded pattern (i.e not huge) it's best I think to just create all possible substrings first and then count them. This is if the sub-patterns can overlap. If not change the step fun in the loop.
pat="a0cc0vaaaabaaaabaaaabaa00bvw"
len=nchar(pat)
thr=3
reps=floor(len/2)
# all poss strings up to half length of pattern
library(stringr)
pat=str_split(pat, "")[[1]][-1]
str.vec=vector()
for(win in 2:reps)
{
str.vec= c(str.vec, rollapply(data=pat,width=win,FUN=paste0, collapse=""))
}
# the max length string repeated more than 3 times
tbl=table(str.vec)
tbl=tbl[tbl>=3]
tbl[which.max(nchar(names(tbl)))]
aaaabaa
3
NB Whilst I'm lazy and append/grow the str.vec here in a loop, for a larger problem I'm pretty sure the actual length of str.vec is predetermined by the length of the pattern if you care to work it out.
Here is my solution, it's not optimized (build vector with patterns <- c() ; pattern <- c(patterns, x) for example) and can be improve but simpler than yours, I think.
I can't understand which pattern exactly should (I just return the max) be returned but you can adjust the code to what you want exactly.
str <- "a0cc0vaaaabaaaabaaaabaa00bvw"
findPatternMax <- function(str){
nb <- nchar(str):1
length.patt <- rev(nb)
patterns <- c()
for (i in 1:length(nb)){
for (j in 1:nb[i]){
patterns <- c(patterns, substr(str, j, j+(length.patt[i]-1)))
}
}
patt.max <- names(which(table(patterns) == max(table(patterns))))
return(patt.max)
}
findPatternMax(str)
> findPatternMax(str)
[1] "a"
EDIT :
Maybe you want the returned pattern have a min length ?
then you can add a nchar.patt parameter for example :
nchar.patt <- 2 #For a pattern of 2 char min
nb <- nb[length.patt >= nchar.patt]
length.patt <- length.patt[length.patt >= nchar.patt]

Resources