Join multiple lazy sequences of strings in Clojure

I have several strings:
(def a "some random string")
(def b "this is a text")
Now I want to concatenate parts of them to create the string "some text". Unfortunately, neither of the expressions below works.
(clojure.string/join " " [(take 4 a) (take-last 4 b)])
(str (take 4 a) " " (take-last 4 b))
It's because functions take and take-last return lazy sequences. The question is: what is the proper way to concatenate multiple lazy sequences of strings and return one string?
Edit: I found one solution - (apply str (concat (take 4 a) " " (take-last 4 b))) - but is it the most correct way?

Rather than using sequence functions to slice the input strings, you might want to use the much more efficient subs (for substring; note there's a GC-related caveat about it, see below):
(subs "asdf" 1 2)
; => "s"
;; a and b as in the question text
(clojure.string/join " " [(subs a 0 4) (subs b (- (count b) 4))])
; => "some text"
The aforementioned caveat is that as long as the "s" returned in the first example here remains ineligible for garbage collection, so does the original "asdf" (since subs returns a "view" onto the input String without allocating fresh storage -- this is the behaviour of Java's substring method, which subs wraps, on JVMs older than Java 7 update 6; newer JVMs copy the characters, so the caveat does not apply there). This is not a problem if you immediately hand the "s" off to join and retain no other reference to it, since join will discard it after pulling the characters.
If you do end up working with lazy sequences of characters after all, there's nothing to be done but to use something like (map (partial apply str) [...your vector here...]) to turn the inputs to clojure.string/join into strings.

Try this. And yes, because of the laziness, your code does not produce the expected result.
(str (apply str (take 4 a)) " " (apply str (take-last 4 b)))

;; assumes (require '[clojure.string :as str])
(str/join " " (map (fn [f col] (f col))
                   [first last]
                   (map #(str/split % #" ") [a b])))

Related

Is implementing the words function possible without a postprocessing step after folding?

Real World Haskell, chapter 4, page 98 of the printed version asks if words can be implemented using folds, and this is my question too:
Is it possible? If not, why? If it is, how?
I came up with the following, which is based on the idea that each non-space should be prepended to the last word in the output list (this happens in the otherwise guard), and that a space should trigger the appending of an empty word to the output list if there is not one already (this is handled in the if-then-else).
myWords :: String -> [String]
myWords = foldr step [[]]
  where
    step x yss@(y:ys)
      | x == ' ' = if y == "" then yss else "":yss
      | otherwise = (x:y):ys
Clearly this solution is wrong, since leading spaces in the input string result in one leading empty string in the output list of strings.
At the link above, I've looked into several of the proposed solutions for other readers, and many of them work similarly to my solution, but they generally "post-process" the output of the fold, for instance by tailing it if there is an empty leading string.
Other approaches use tuples (actually just pairs), so that the fold deals with the pair and can well handle the leading/trailing spaces.
In all these approaches, foldr (or another fold, fwiw) is not the function that provides the final output out of the box; there's always something else that has to adjust the output somehow.
Therefore I go back to the initial question and ask if it is actually possible to implement words (in a way that it correctly handles trailing/leading/repeated spaces) using folds. By using folds I mean that the folding function has to be the outermost function:
myWords :: String -> [String]
myWords input = foldr step seed input
If I understand correctly, your requirements include
(1) words "a b c" == words " a b c" == ["a", "b", "c"]
(2) words "xa b c" == ["xa", "b", "c"] /= ["x", "a", "b", "c"] == words "x a b c"
This implies that we cannot have
words = foldr step base
for any step and base.
Indeed, if we had that, then
words "xa b c"
  = { definition of words and foldr }
step 'x' (words "a b c")
  = { by (1) }
step 'x' (words " a b c")
  = { definition of words and foldr }
words "x a b c"
and this contradicts (2).
You definitely need some post-processing after the foldr.
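As a rough illustration of what such post-processing can look like, here is a minimal sketch of the pair-accumulator idea mentioned in the question. The names wordsViaFold, flushInto and flush are made up for this example. The fold itself only builds a pair (word in progress, finished words), and the final flush is exactly the step that cannot be absorbed into the foldr:

```haskell
-- Sketch only: a foldr over a pair accumulator, followed by one
-- post-processing step (flush). All names here are hypothetical.
wordsViaFold :: String -> [String]
wordsViaFold = flush . foldr step ("", [])
  where
    -- A space closes the word in progress; any other character extends it.
    step c (cur, done)
      | c == ' '  = ("", flushInto cur done)
      | otherwise = (c : cur, done)

    -- Push the word in progress onto the finished words, unless it is empty.
    flushInto "" done = done
    flushInto w  done = w : done

    -- The post-processing step: emit whatever word is still in progress.
    flush (cur, done) = flushInto cur done
```

With this, wordsViaFold " a  b c " and wordsViaFold "a b c" both give ["a","b","c"], while wordsViaFold "xa b c" gives ["xa","b","c"].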
@chi has a wonderful argument that you cannot implement words using "a" fold, but you did say using folds.
words = filterNull . words1
  where
    filterNull = foldr (\xs -> if null xs then id else (xs:)) []
    words1 = foldr (\c -> if c == ' ' then ([]:) else consHead c) []
    consHead c [] = [[c]]
    consHead c (xs:xss) = (c:xs):xss
Both the outermost and innermost function are folds. ;-)
Yes. Even though it's a little tricky, you can still do this job properly with a single foldr and nothing else if you delve into CPS (Continuation Passing Style). I had shown a special kind of chunksOf function previously.
In this kind of fold the accumulator, and hence the result of the fold, is a function, and we have to apply it to an identity kind of input to get the final result. So this may or may not count as a final processing stage, since we are using a single fold here and its type includes the function. Open to debate :)
ws :: String -> [String]
ws str = foldr go sf str $ ""
  where
    sf :: String -> [String]
    sf s = if s == " " then [""] else [s]
    go :: Char -> (String -> [String]) -> (String -> [String])
    go c f = \pc -> let (s:ss) = f [c]
                    in case pc of
                         "" -> dropWhile (== "") (s:ss)
                         otherwise -> case (pc == " ", s == "") of
                                        (True, False) -> "":s:ss
                                        (True, True) -> s:ss
                                        otherwise -> (pc++s):ss
λ> ws " a b c "
["a","b","c"]
sf : The initial function value to start with.
go : The iterator function
We are actually not fully utilizing the power of CPS here, since we have both the previous character pc and the current character c at hand in every turn. It was very useful in the chunksOf function mentioned above, which chunks an [Int] into [[Int]] every time an ascending sequence of elements is broken.

Comparing each element of a list to each element of another list

I'm trying to write a function that takes the first character of the first string, compares it to all the characters of the second string and if it finds the same character, replaces with a "-". Then it moves on to the second character of the first string, does the same comparison with each character (except the first character - the one we already checked) on the second string and so on. I want it to return the first string, but with the repeating characters swapped with the symbol "-".
E.g. if I put in comparing "good morning" "good afternoon", I'd like it to return "-----m---i-g"
I hope I explained it clearly enough. So far I've got:
comparing :: String -> String -> String
comparing a b =
  if a == "" then ""
  else if head a == head b then "-" ++ (comparing (tail a) (tail b))
  else [head a] ++ (comparing (tail a) b)
The problem with this is it does not go through the second string character by character and I'm not sure how to implement that. I think I would need to call a recursive function on the 4th line:
if head a == ***the first character of tail b*** then "-" ++ (comparing (tail a) (tail b))
What could that function look like? Or is there a better way to do this?
First, at each recursive call, while you're iterating over the string a, you are for some reason also iterating over the string b at the same time. Look: you're passing only tail b to the next call. This means that the next call won't be able to look through the whole string b, but only through its tail. Why are you doing this?
Second, in order to see if a character is present in a string, use elem:
elem 'x' "xyz" == True
elem 'x' "abc" == False
So the second line of your function should look like this:
else if elem (head a) b then "-" ++ (comparing (tail a) b)
On a somewhat related note, use of head and tail functions is somewhat frowned upon, because they're partial: they will crash if the string is empty. Yes, I see that you have checked to make sure that the string is not empty, but the compiler doesn't understand that, which means that it won't be able to catch you when you accidentally change this check in the future.
A better way to inspect data is via pattern matching:
-- Comparing an empty string with anything results in an empty string
comparing "" _ = ""
-- Comparing a string that starts with `a` and ends with `rest`
comparing (a:rest) b =
  (if elem a b then '-' else a) : comparing rest b
Rather than writing the recursive logic manually, this looks like a classic use case for map. You just need a function that takes a character and returns either that character or '-' depending on its presence in the other list.
Written out fully, this would look like:
comparing first second = map replace first
  where replace c = if c `elem` second then '-' else c
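Note that an elem-based check replaces every character that occurs anywhere in the second string, so for the example from the question it returns "-----m---i--". If each character of the second string should only be usable once, as the expected output "-----m---i-g" suggests, one possible sketch is to remove a matched character from the second string with delete from Data.List. The name comparingOnce is invented here:

```haskell
import Data.List (delete)

-- Sketch only: like the elem-based version, but every character of the
-- second string can be matched at most once. comparingOnce is a
-- hypothetical name, not something from the question.
comparingOnce :: String -> String -> String
comparingOnce [] _ = []
comparingOnce (c:cs) pool
  | c `elem` pool = '-' : comparingOnce cs (delete c pool)  -- consume the match
  | otherwise     = c   : comparingOnce cs pool
```

Here comparingOnce "good morning" "good afternoon" gives "-----m---i-g", because the only 'g' in the second string has already been used up by the time the final 'g' is reached.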

Searching through a String

I found a good example in a book that I'm trying to tackle. I'm trying to write a function called pointer with the signature pointer :: String -> Int. It takes text containing "pointers" that look like [Int] and returns the total number of pointers found.
The text that the pointer function will examine will look like:
txt :: String
txt = "[1] and [2] are friends who grew up together who " ++
"went to the same school and got the same degrees." ++
"They eventually opened up a store named [2] which was pretty successful."
In the command line, we will run the code as follows:
> pointer txt
3
The 3 signifies the number of pointers that were found.
WHAT I UNDERSTAND:
I get that "words" will break down a string into a list with words.
Example:
words "where are all of these apples?"
["where","are","all","of","these","apples?"]
I get that "filter" will choose a specific element(s) in a list.
Example:
filter (>3) [1,5,6,4,3]
[5,6,4]
I get that "length" will return the length of a list
WHAT I THINK I NEED TO DO:
Step 1) look at txt and then break it down into single words until you have a long list of words.
Step 2) use filter to examine the list for [1] or [2]. Once found, filter will place these pointers into a list.
Step 3) call the length function on the resulting list.
PROBLEM BEING FACED:
I'm having a tough time trying to take everything I know and implementing it.
Here is a hypothetical ghci session:
ghci> words txt
[ "[1]", "and", "[2]", "are", "friends", "who", ...]
ghci> filter (\w -> w == "[1]" || w == "[2]") (words txt)
[ "[1]", "[2]", "[2]" ]
ghci> length ( filter (\w -> w == "[1]" || w == "[2]") (words txt) )
3
You can make the last expression more readable using the $ operator:
length $ filter (\w -> w == "[1]" || w == "[2]") $ words txt
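Putting the three steps together into the function the question asks for might look like the following sketch. It only recognises the literal words "[1]" and "[2]", exactly as in the session above, and relies on each pointer being surrounded by whitespace; isPointer is a helper name made up for this example:

```haskell
-- Sketch only: words, filter and length combined into one function.
pointer :: String -> Int
pointer = length . filter isPointer . words
  where
    -- Only these two literal tokens count, as in the ghci session.
    isPointer w = w `elem` ["[1]", "[2]"]
```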
If you want to be able to find all patterns of type [Int] in a string, such as [3], [465], etc., and not only [1] and [2], the easiest way would be to use a regular expression:
{-# LANGUAGE NoOverloadedStrings #-}
import Text.Regex.Posix
txt :: String
txt = "[1] and [2] are friends who grew up together who " ++
"went to the same school and got the same degrees." ++
"They eventually opened up a store named [2] which was pretty successful."
pointer :: String -> Int
pointer source = source =~ "\\[[0-9]{1,}\\]"
We can now run:
pointer txt
> 3
This works for single digit "pointers":
pointer :: String -> Int
pointer ('[':_:']':xs) = 1 + pointer xs
pointer (_: xs) = pointer xs
pointer _ = 0
This is better handled with parser combinators like those provided by e.g. Parsec, but this might be overkill.
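If pulling in a regex or parser library feels like too much, the pattern-matching version can probably be stretched to multi-digit pointers with span and isDigit from Data.Char. This is only a sketch; pointerCount is a made-up name, and it assumes a pointer is "[" followed by one or more digits and then "]":

```haskell
import Data.Char (isDigit)

-- Sketch only: counts occurrences of '[' digits ']' with at least one digit.
pointerCount :: String -> Int
pointerCount ('[' : rest) =
  case span isDigit rest of
    -- At least one digit, immediately followed by the closing bracket.
    (_ : _, ']' : after) -> 1 + pointerCount after
    -- Not a pointer: keep scanning right after the '['.
    _                    -> pointerCount rest
pointerCount (_ : rest) = pointerCount rest
pointerCount []         = 0
```

For instance, pointerCount "[1] and [23] x [] [a] [2]" gives 3, since "[]" and "[a]" contain no digits.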

What would be a good or efficient way to get the list of alphabet used in a string

Put simply, how do I get a list of the non-repeated letters from a string in Common Lisp?
e.g:
"common"
-> ("c" "o" "m" "n") or in characters, (#\c #\o #\m #\n)
I'd care less about the order and type, if it is in string or character.
"overflow" -> (o v e r f l w)
"tomtomtom" -> (t o m)
etc...
What I was thinking is to collect the first letter of the original string,
Then use the function;
(remove letter string)
collect the first letter of now, removed letter string and append it to the already collected letters from before.
It sounds like recursion, but wouldn't calling recursively lose the previously collected letters list? I also doubt whether there is any built-in function for this.
Furthermore, I don't want to use set or any of them since I want
to do this completely in functional style.
Thanks for your time.
CL-USER> (remove-duplicates (coerce "common" 'list))
(#\c #\m #\o #\n)
Or you can even do it simply as:
CL-USER> (remove-duplicates "common")
"comn"
There may be better ways to do that if you can make some assumptions about the text you are dealing with. For instance, if you are dealing with English text only, then you could implement a very simple hash function (basically, use a bit vector 128 elements long), so that you wouldn't even need to use a hash-table (which is a more complex structure). The code below illustrates the idea.
(defun string-alphabet (input)
  (loop with cache =
          (coerce (make-array 128
                              :element-type 'bit
                              :initial-element 0) 'bit-vector)
        with result = (list input)
        with head = result
        for char across input
        for code = (char-code char) do
          (when (= (aref cache code) 0)
            (setf (aref cache code) 1
                  (cdr head) (list char)
                  head (cdr head)))
        finally (return (cdr result))))
(string-alphabet "overflow")
;; (#\o #\v #\e #\r #\f #\l #\w)
Coercing to bit-vector isn't really important, but it is easier for debugging (the printed form is more compact) and some implementation may actually optimize it to contain only so many integers that the platform needs to represent so many bits, i.e. in the case of 128 bits length, on a 64 bit platform, it could be as short as 2 or 3 integers long.
Or, you could've also done it like this, using integers:
(defun string-alphabet (input)
  (loop with cache = (ash 1 128)
        with result = (list input)
        with head = result
        for char across input
        for code = (char-code char) do
          (unless (logbitp code cache)
            (setf cache (logior cache (ash 1 code))
                  (cdr head) (list char)
                  head (cdr head)))
        finally (return (cdr result))))
but in this case you would, in the worst case, create 128 big integers, which is not so expensive after all, but the bit-vector might do better. However, this might give you a hint for the situation when you can assume that, for example, only letters of the English alphabet are used (in which case it would be possible to use an integer shorter than a machine word).
Here is some code in Haskell, because I am not so familiar with Lisp; since both are functional languages, translating it should not be a problem:
doit :: String -> String
doit [] = []
doit (x:xs) = [x] ++ doit (filter (\y -> x /= y) xs)
So what does it do? You've got a String; if it's an empty String (in Haskell [] == ""), you return an empty String.
Otherwise, take the first element and prepend it to the result of recursing over the tail of the String, after filtering out those elements that are equal to the first element.
filter is an ordinary higher-order function, not special syntax; its Common Lisp counterpart is remove-if-not (or remove-if with the predicate negated), as you can read here: lisp filter out results from list not matching predicate
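For what it's worth, the Haskell standard library already has this function: nub from Data.List keeps the first occurrence of each element, which is what doit does. A small sketch (the name alphabet is just for illustration):

```haskell
import Data.List (nub)

-- Sketch only: nub removes later duplicates and keeps the first
-- occurrence of each character. alphabet is a hypothetical name.
alphabet :: String -> String
alphabet = nub
```

Here alphabet "common" gives "comn" and alphabet "overflow" gives "overflw". Like doit, nub is quadratic in the worst case, which hardly matters for short strings.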

Add character to string to get another string?

I want to add a character to a string, and get another string with the character added as a result.
This doesn't work:
(cons \a "abc")
Possible solutions, in order of preference:
Clojure core function
Clojure library function
Clojure user-defined (me!) function (such as (apply str (cons \a "abc")))
java.lang.String methods
Is there any category 1 solution before I roll-my-own?
Edit: this was a pretty dumb question. :(
How about:
(str "abc" \a)
This returns "abca" on my machine.
You can also use it for any number of strings/chars: (str "kl" \m "abc" \a \b).
You could use join from clojure.string:
(clojure.string/join [\a "abc"])
But for the simple use case you should really just use str, as @Dan Filimon suggests. join has the added benefit that you can put a separator between the joined strings, but without a separator it actually just applies str:
(defn ^String join
  "Returns a string of all elements in coll, separated by
  an optional separator. Like Perl's join."
  {:added "1.2"}
  ([coll]
     (apply str coll))
  ([separator [x & more]]
     (loop [sb (StringBuilder. (str x))
            more more
            sep (str separator)]
       (if more
         (recur (-> sb (.append sep) (.append (str (first more))))
                (next more)
                sep)
         (str sb)))))

Resources