Haskell substring testing - haskell

When I use sub_string("abberr","habberyry"), it returns True, when obviously it should be False. The point of the function is to search for the first argument within the second one. Any ideas what's wrong?
sub_string :: (String, String) -> Bool
sub_string(_,[]) = False
sub_string([],_) = True
sub_string(a:x,b:y) | a /= b = sub_string(a:x,y)
                    | otherwise = sub_string(x,y)

Here are some hints on why it's not working:
Your function consumes "abber" and "habber" from the input strings in the initial phase.
Now "r" and "yry" are left.
And "r" is a subsequence of "yry", so it returns True. A simpler example that illustrates the problem:
*Main> sub_string("rz","rwzf")
True
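The behaviour described above can be reproduced directly. The sketch below restates the question's logic in curried form, with the empty-needle case checked first, under the hypothetical name isSubsequence (not a name from the question) to show what the function actually tests:

```haskell
-- The question's recursion, renamed to say what it really computes.
-- isSubsequence is a hypothetical name used only for this illustration.
isSubsequence :: String -> String -> Bool
isSubsequence [] _ = True
isSubsequence _ [] = False
isSubsequence (a:x) (b:y)
  | a /= b    = isSubsequence (a:x) y   -- skip a non-matching character
  | otherwise = isSubsequence x y       -- consume one character of each
```

Because a mismatch only skips a character of the second string and never restarts the first one, isSubsequence "abberr" "habberyry" and isSubsequence "rz" "rwzf" both evaluate to True.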

First off, you need to switch your first two lines. _ will match [] and this will matter when you're matching, say, substring "abc" "abc". Secondly, it is idiomatic Haskell to write a function with two arguments instead of one with a pair argument. So your code should start out:
substring :: String -> String -> Bool
substring [] _ = True
substring _ [] = False
substring needle (h : aystack)
  | ...
Now we get to the tricky case where both of these lists are non-empty. Here's the problem with simply recursing on the two tails after a matching character: you'll get results like "abc" being a substring of "axbxcx". "abc" will match 'a' first and then look for "bc" in the rest of the string; the algorithm will then skip past the 'x' to look for "bc" in "bxcx", which will match 'b' and look for "c" in "xcx", which will return True.
Instead your condition needs to be more thorough. If you're willing to use functions from Data.List this is:
  | isPrefixOf needle (h : aystack) = True
  | otherwise = substring needle aystack
Otherwise you need to write your own isPrefixOf, for example:
isPrefixOf needle haystack = needle == take (length needle) haystack
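Putting the pieces of this answer together gives the following minimal sketch. It uses the hand-written isPrefixOf from above rather than the one in Data.List, and the type signature on isPrefixOf is an addition:

```haskell
-- Hand-written prefix test, as in the answer above
-- (Data.List is deliberately not imported, so the name does not clash).
isPrefixOf :: String -> String -> Bool
isPrefixOf needle haystack = needle == take (length needle) haystack

-- needle is a substring of the haystack if it is a prefix of
-- the haystack itself or of any of its tails.
substring :: String -> String -> Bool
substring [] _ = True
substring _ [] = False
substring needle (h : aystack)
  | isPrefixOf needle (h : aystack) = True
  | otherwise                       = substring needle aystack
```

With this definition, substring "abberr" "habberyry" and substring "abc" "axbxcx" are both False, while substring "abc" "abc" is True.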

As Sibi already pointed out, your function tests for a subsequence. Review the previous exercise; it is probably isPrefixOf (see its documentation on Hackage), which is just a fancy way of saying startsWith and looks very similar to the function you wrote.
If that is not the previous exercise, do that now!
Then write sub_string in terms of isPrefixOf:
sub_string (x, b:y) = isPrefixOf ... ?? ???
Fill in the dots and "?" yourself.

Related

Finding if a string is a substring of another in Sml without library functions

I am trying to write a function subString : string * string -> int
that checks whether the first string is a substring of the second; the check is case sensitive.
I want to return the index, starting from 0, if the first string is a substring, or -1 if it is not. If it appears multiple times, just return the index of the first appearance.
For instance:
subString("bc","abcabc") ===>1
subString("aaa","aaaa") ===>0
subString("bc","ABC") ===>-1
I am having a lot of trouble wrapping my brain around this because I am not too familiar with SML or with using strings in SML, and I am not supposed to use any built-in functions like String.sub.
I can use helper functions, though.
All I can think of is to use explode somehow in a helper function, check the lists, and then implode them, but how do I get the indexed position?
All I have is
fun subString(s1,s2) =
  if null s2 then ~1
  else if s1 = s2 then 0
  else 1+subString(s1, tl s2);
I am thinking of using a helper function that explodes the strings and then maybe compares the two, but I can't figure out how to get that to work.
This is already a really good start, but there are some slight problems:
In your recursive case you add 1 to the recursive result, even if the recursive application did not find the substring and returned -1. You should check whether the result is -1 before adding 1.
In the second line you check whether the two strings are equal. If you do this, you will only find a substring if the string ends with that substring. So what you really want to do in line 2 is to test whether s2 starts with s1. I would recommend that you write a helper function that performs that test. For this helper function you could indeed use explode and then recursively check whether the first characters of the two lists are identical.
Once you have this helper function use it in line 2 instead of the equality test.
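The two fixes can be sketched as follows. This is written in Haskell rather than SML, to match the other examples on this page, and it works on character lists as explode would produce them. The names startsWith and subStringIndex are hypothetical; it is a sketch of the idea, not the expected SML solution:

```haskell
-- Does the first list start the second? (hypothetical helper name)
startsWith :: String -> String -> Bool
startsWith [] _ = True
startsWith _ [] = False
startsWith (x:xs) (y:ys) = x == y && startsWith xs ys

-- Index of the first occurrence of s1 in s2, or -1 when there is none.
subStringIndex :: String -> String -> Int
subStringIndex s1 s2
  | startsWith s1 s2 = 0           -- prefix test instead of equality
  | null s2          = -1
  | otherwise        =
      let r = subStringIndex s1 (tail s2)
      in if r < 0 then -1 else r + 1   -- do not add 1 to a failure
```

On the examples from the question, subStringIndex "bc" "abcabc" gives 1, subStringIndex "aaa" "aaaa" gives 0, and subStringIndex "bc" "ABC" gives -1.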
I am not supposed to use any built in functions like String.sub
What a pity! Strings have an abstract interface, whereas with lists you have direct access to the primary constructors, [] and ::, so you have to use library functions to get anywhere with strings. explode is also a library function. But okay, if your constraint is that you have to convert your string into a list to solve the exercise, so be it.
Given your current code,
fun subString(s1,s2) =
  if null s2 then ~1
  else if s1 = s2 then 0
  else 1+subString(s1, tl s2);
I sense one problem here:
subString ([#"b",#"c"], [#"a",#"b",#"c",#"d"])
~> if null ([#"a",#"b",#"c",#"d"]) then ... else
   if [#"b",#"c"] = [#"a",#"b",#"c",#"d"] then ... else
   1 + subString([#"b",#"c"], [#"b",#"c",#"d"])
~> 1 + subString([#"b",#"c"], [#"b",#"c",#"d"])
~> 1 + if null ([#"b",#"c",#"d"]) then ... else
       if [#"b",#"c"] = [#"b",#"c",#"d"] then ... else
       1 + subString([#"b",#"c"], [#"c",#"d"])
It seems that the check s1 = s2 is not exactly enough: We should have liked to say that [#"b",#"c"] is a substring of [#"b",#"c",#"d"] because it's a prefix of it, not because it is equivalent. With s1 = s2 you end up checking that something is a valid suffix, not a valid substring. So you need to change s1 = s2 into something smarter.
Perhaps you can build a helper function that determines if one list is a prefix of another and use that here?
As for solving this exercise by explode-ing your strings into lists: this is highly inefficient, so much so that Standard ML's sister language OCaml had explode removed from its library entirely:
The functions explode and implode were in older versions of Caml, but we omitted them from OCaml because they encourage inefficient code. It is generally a bad idea to treat a string as a list of characters, and seeing it as an array of characters is a much better fit to the actual implementation.
So first off, String.isSubstring already exists, so this is a solved problem. But if it weren't, and one wanted to write this compositionally, and String.sub isn't cheating (it is accessing a character in a string, comparable to pattern matching the head and tail of a list via x::xs), then let me encourage you to write efficient, composable and functional code:
(* Check that a predicate holds for all (c, i) of s, where
 * s is a string, c is every character in that string, and
 * i is the position of c in s. *)
fun alli s p =
    let val stop = String.size s
        fun go i = i = stop orelse p (String.sub (s, i), i) andalso go (i + 1)
    in go 0 end
(* needle is a prefix of haystack from the start'th index *)
fun isPrefixFrom (needle, haystack, start) =
    String.size needle + start <= String.size haystack andalso
    alli needle (fn (c, i) => String.sub (haystack, i + start) = c)
(* needle is a prefix of haystack if it is from the 0th index *)
fun isPrefix (needle, haystack) =
    isPrefixFrom (needle, haystack, 0)
(* needle is a substring of haystack if it is a prefix from any index *)
fun isSubstring (needle, haystack) =
    let fun go i =
            String.size needle + i <= String.size haystack andalso
            (isPrefixFrom (needle, haystack, i) orelse go (i + 1))
    in go 0 end
The general idea here, which you can re-use when building an isSubstring that uses list recursion rather than string index recursion, is to build the algorithm abstractly: needle being a substring of haystack can be defined in simpler terms by needle being the prefix of haystack counting from any valid position in haystack (of course not such that it exceeds haystack). And determining if something is a prefix is much easier, even easier with list recursion!
This suggestion would leave you with a template,
fun isPrefix ([], _) = ...
  | isPrefix (_, []) = ...
  | isPrefix (x::xs, y::ys) = ...
fun isSubstring ([], _) = ...
  | isSubstring (xs, ys) = ... isPrefix ... orelse ...
As for optimizing the string index recursive solution, you could avoid the double bounds checking in both isPrefixFrom and in isSubstring by making isPrefixFrom a local function only accessible to isPrefix and isSubstring; otherwise it will be unsafe.
Testing this,
- isSubstring ("bc", "bc");
> val it = true : bool
- isSubstring ("bc", "bcd");
> val it = true : bool
- isSubstring ("bc", "abc");
> val it = true : bool
- isSubstring ("bc", "abcd");
> val it = true : bool
- isSubstring ("bc", "");
> val it = false : bool

Taking a list of strings and replacing multiple words with one

I'm fairly new to Haskell. As input I want to take a list of strings, for example
["HEY", "I'LL", "BE", "RIGHT", "BACK"], look for, let's say, "BE" "RIGHT" "BACK", and replace it with a different word, let's say "CHEESE". I have a function that works for single words, but I want it to work for phrases: if the list contains a certain phrase, replace it with a word. Oh, and I don't want to use external libraries.
Code:
replace :: [String] -> [String]
replace [] = []
replace (h:t)
    | h == "WORD" = "REPLACED" : replace t
    | otherwise = h : replace t
What you have now could also be implemented as
replace ("WORD":rest) = "REPLACED" : replace rest
replace (x:rest) = x : replace rest
replace [] = []
And this could be extended to your example as
replace ("BE":"RIGHT":"BACK":rest) = "CHEESE" : replace rest
replace (x:rest) = x : replace rest
replace [] = []
But obviously this is not a great way to write it. We'd like a more general solution where we can pass in a phrase (or sub-list) to replace. To start with we know the following things:
Input is a list of n elements (decreases as we recurse)
Phrase is a list of m elements (stays constant as we recurse)
If m > n, we definitely don't have a match
If m <= n, we might have a match
If we don't have a match, keep the head and try with the tail
While there are more efficient algorithms out there, a simple one would be to check our lengths at each step along the list. This can be done pretty simply as
--             Phrase      Replacement  Sentence    New sentence
replaceMany :: [String] -> String    -> [String] -> [String]
replaceMany phrase new sentence = go sentence
    where
        phraseLen = length phrase
        go [] = []
        go sent@(x:xs)
            | sentLen < phraseLen = sent
            | first == phrase     = new : go rest
            | otherwise           = x : go xs
            where
                sentLen = length sent
                first   = take phraseLen sent
                rest    = drop phraseLen sent
Here we can take advantage of Haskell's laziness and just go ahead and define first and rest without worrying about whether it's valid to do so. If they aren't used, they never get computed. I've also opted to use some more complex pattern matching in the form sent@(x:xs). This matches a list with at least one element, assigning the entire list to sent, the first element to x, and the tail of the list to xs. Next, we just check each condition. If sentLen < phraseLen, there's no chance of a match in the rest of the list, so just return the whole thing. If the first m elements equal our phrase, replace them and keep searching; otherwise just put back the first element and keep searching.
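As a variation on the same idea, the length bookkeeping can be left to stripPrefix from Data.List, which is part of base and so not an external library. The name replacePhrase and the special case for an empty phrase are assumptions of this sketch, not part of the answer above:

```haskell
import Data.List (stripPrefix)

-- replacePhrase is a hypothetical name for this variant.
-- An empty phrase would match at every position, so the
-- sentence is returned unchanged in that case.
replacePhrase :: [String] -> String -> [String] -> [String]
replacePhrase [] _ sentence = sentence
replacePhrase phrase new sentence = go sentence
  where
    go [] = []
    go sent@(x:xs) =
      case stripPrefix phrase sent of
        Just rest -> new : go rest   -- phrase found: replace and continue
        Nothing   -> x : go xs       -- no match here: keep the head
```

For the example from the question, replacePhrase ["BE", "RIGHT", "BACK"] "CHEESE" ["HEY", "I'LL", "BE", "RIGHT", "BACK"] evaluates to ["HEY","I'LL","CHEESE"].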

Removing a string from a list of strings in Haskell

I have a question regarding Haskell that's been stumping my brain. I'm currently required to write a function that removes a string, e.g. "word", from a list of strings: given ["hi", "today", "word", "Word", "WORD"], it should return the list ["hi", "today", "Word", "WORD"]. I cannot use any higher-order functions and can only resort to primitive recursion.
Thinking about the problem, I thought maybe I could solve it by using a recursion where you search the head of the first string, if it matches "w" then compare the next head from the tail, and see if that matches "o". But then I soon realized that after all that work, you wouldn't be able to delete the complete string "word".
My question really being how do I compare a whole string in a list rather than only comparing 1 element at a time with something like: removeWord (x:xs). Is it even possible? Do I have to write a helper function to aid in the solution?
Consider the base case: removing a word from an empty list will be the empty list. This can be trivially written like so:
removeWord [] _ = []
Now consider the case where the list is not empty. You match this with x:xs. You can use a guard to select between these two conditions:
x is the word you want to remove. (x == word)
x is not the word you want to remove. (otherwise)
You don't need a helper function, though you could write one if you wanted to. You've basically got 3 conditions:
You get an empty list.
You get a list whose first element is the one you want to remove.
You get a list whose first element is anything else.
In other languages, you would do this with a set of if-else statements, or with a case statement, or a cond. In Haskell, you can do this with guards:
remove_word_recursive:: String -> [String] -> [String]
remove_word_recursive _ [] = []
remove_word_recursive test_word (x:xs) | test_word == x = what in this case?
remove_word_recursive test_word (x:xs) = what in default case?
Fill in the correct result for this function in these two conditions, and you should be done.
I think what you're looking for is a special case of the function sought in this question on string filters: "Haskell - filter string list based on some conditions". Reading some of the discussion on the accepted answer might help you understand more of Haskell.
Since you want to remove a list element, it's easy to do it with a list comprehension.
myList = ["hi", "today", "word", "Word", "WORD"]
[x | x <- myList, x /= "word"]
The result is:
["hi","today","Word","WORD"]
If library functions such as filter and isInfixOf are allowed (note that filter is itself higher-order), then
import Data.List (isInfixOf)
filter (not . isInfixOf "word") ["hi", "today", "word", "Word", "WORD"]
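The two approaches above give the same result on the question's input, but they are not equivalent in general: the comparison x /= "word" removes only exact matches, while isInfixOf removes every string that contains "word". A small sketch with a hypothetical sample list (the entry "password" is added for illustration):

```haskell
import Data.List (isInfixOf)

-- Hypothetical sample input; "password" is added to show the difference.
sample :: [String]
sample = ["hi", "password", "word", "Word"]

-- Keeps every string that is not exactly "word".
exact :: [String]
exact = [x | x <- sample, x /= "word"]

-- Drops every string that contains "word" anywhere.
infixBased :: [String]
infixBased = filter (not . isInfixOf "word") sample
```

Here exact is ["hi","password","Word"], whereas infixBased is ["hi","Word"].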

Is there a better way to write a "string contains X" method?

Just started using Haskell and realized that (as far as I can tell) there is no direct way to check a string to see if it contains a smaller string. So I figured I'd just take a shot at it.
Essentially the idea was to check if the two strings were the same size and were equal. If the string being checked was longer, recursively lop off the head and run the check again until the string being checked was the same length.
The rest of the possibilities I used pattern matching to handle them. This is what I came up with:
stringExists "" wordToCheckAgainst = False
stringExists wordToCheckFor "" = False
stringExists wordToCheckFor wordToCheckAgainst
    | length wordToCheckAgainst < length wordToCheckFor = False
    | length wordToCheckAgainst == length wordToCheckFor = wordToCheckAgainst == wordToCheckFor
    | take (length wordToCheckFor) wordToCheckAgainst == wordToCheckFor = True
    | otherwise = stringExists wordToCheckFor (tail wordToCheckAgainst)
If you search Hoogle for the signature of the function you're looking for (String -> String -> Bool) you should see isInfixOf among the top results.
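A minimal usage sketch follows; the wrapper name contains is hypothetical and only there to give the call a readable argument order:

```haskell
import Data.List (isInfixOf)

-- contains is a hypothetical wrapper around Data.List.isInfixOf.
contains :: String -> String -> Bool
contains needle haystack = needle `isInfixOf` haystack
```

One difference from stringExists above is worth knowing: isInfixOf treats the empty string as contained in every string, so contains "" "abc" is True, whereas stringExists "" "abc" is False.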
isInfixOf from Data.List will surely solve the problem, however in case of longer haystacks or perverse¹ needles you should consider more advanced string matching algorithms with a much better average and worst case complexity.
¹ Consider a really long string consisting only of a's and a needle with a lot of a's at the beginning and one b at the end.
Consider using the text package (text on Hackage, now also part of the Haskell Platform) for your text-processing needs. It provides a Unicode text type, which is more time- and space-efficient than the built-in list-based String. For string search, the text package implements a Boyer-Moore-based algorithm, which has better complexity than the naïve method used by Data.List.isInfixOf.
Usage example:
Prelude> :set -XOverloadedStrings
Prelude> import qualified Data.Text as T
Prelude Data.Text> T.breakOnAll "abc" "defabcged"
[("def","abcged")]
Prelude Data.Text> T.isInfixOf "abc" "defabcged"
True

Correct way to define a function in Haskell

I'm new to Haskell and I'm trying out a few tutorials.
I wrote this script:
lucky::(Integral a)=> a-> String
lucky 7 = "LUCKY NUMBER 7"
lucky x = "Bad luck"
I saved this as lucky.hs and ran it in the interpreter and it works fine.
But I am unsure about function definitions. It seems from the little I have read that I could equally define the function lucky as follows (function name is lucky2):
lucky2::(Integral a)=> a-> String
lucky2 x=(if x== 7 then "LUCKY NUMBER 7" else "Bad luck")
Both seem to work equally well. Clearly function lucky is clearer to read but is the lucky2 a correct way to write a function?
They are both correct. Arguably, the first one is more idiomatic Haskell because it uses its very important feature called pattern matching. In this form, it would usually be written as:
lucky::(Integral a)=> a-> String
lucky 7 = "LUCKY NUMBER 7"
lucky _ = "Bad luck"
The underscore signifies the fact that you are ignoring the exact form (value) of your parameter. You only care that it is different than 7, which was the pattern captured by your previous declaration.
The importance of pattern matching is best illustrated by a function that operates on more complicated data, such as lists. If you were to write a function that computes the length of a list, for example, you would likely start by providing a variant for empty lists:
len [] = 0
The [] clause is a pattern, which is set to match empty lists. Empty lists obviously have length of 0, so that's what we are having our function return.
The other part of len would be the following:
len (x:xs) = 1 + len xs
Here, you are matching on the pattern (x:xs). The colon : is the so-called cons operator: it prepends a value to a list. The expression x:xs is therefore a pattern that matches some element (x) prepended to some list (xs). As a whole, it matches a list that has at least one element, since xs can also be an empty list ([]).
This second definition of len is also pretty straightforward. You compute the length of the remaining list (len xs) and add 1 to it, which corresponds to the first element (x).
(The usual way to write the above definition would be:
len (_:xs) = 1 + len xs
which again signifies that you do not care what the first element is, only that it exists).
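Assembled into one definition, the two clauses look like this. The type signature is an addition; the answer above leaves it out:

```haskell
-- Length of a list by pattern matching on its two possible shapes.
len :: [a] -> Int
len []     = 0            -- the empty list has length 0
len (_:xs) = 1 + len xs   -- one element plus the length of the rest
```

For example, len "abc" evaluates to 3 and len "" to 0.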
A 3rd way to write this would be using guards:
lucky n
    | n == 7 = "lucky"
    | otherwise = "unlucky"
There is no reason to be confused about that. There is always more than 1 way to do it. Note that this would be true even if there were no pattern matching or guards and you had to use the if.
All of the forms we've covered so far use so-called syntactic sugar provided by Haskell. Guards are translated into ordinary case expressions, as are multiple function clauses and if expressions. Hence the most low-level, unsugared way to write this would perhaps be:
lucky n = case n of
    7 -> "lucky"
    _ -> "unlucky"
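To see that these are just different spellings of the same function, here are the four forms side by side. The names luckyPattern, luckyIf, luckyGuard and luckyCase are hypothetical, and the result strings are unified so that the versions can be compared:

```haskell
-- Four equivalent definitions; only the syntax differs.
luckyPattern :: (Integral a) => a -> String
luckyPattern 7 = "lucky"
luckyPattern _ = "unlucky"

luckyIf :: (Integral a) => a -> String
luckyIf x = if x == 7 then "lucky" else "unlucky"

luckyGuard :: (Integral a) => a -> String
luckyGuard n
  | n == 7    = "lucky"
  | otherwise = "unlucky"

luckyCase :: (Integral a) => a -> String
luckyCase n = case n of
  7 -> "lucky"
  _ -> "unlucky"
```

All four return "lucky" for 7 and "unlucky" for every other input.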
While it is good that you check for idiomatic ways, I'd recommend that a beginner use whatever works best for them and whatever they understand best. For example, if you do not (yet) understand point-free style, there is no reason to force it. It will come to you sooner or later.

Resources