How can I split a string by a delimiter in Common Lisp, like is done in SPLIT-SEQUENCE, but also add the delimiter in the list of strings?
For example, I could write:
(split-string-with-delimiter #\. "a.bc.def.com")
and the result would be ("a" "." "bc" "." "def" "." "com").
I've tried the following code (make-adjustable-string makes a string that can be extended with vector-push-extend):
(defun make-adjustable-string (s)
(make-array (length s)
:fill-pointer (length s)
:adjustable t
:initial-contents s
:element-type (array-element-type s)))
(defun split-str (string &key (delimiter #\ ) (keep-delimiters nil))
"Splits a string into a list of strings, with the delimiter still
in the resulting list."
(let ((words nil)
(current-word (make-adjustable-string "")))
(do* ((i 0 (+ i 1))
(x (char string i) (char string i)))
((= (+ i 1) (length string)) nil)
(if (eql delimiter x)
(unless (string= "" current-word)
(push current-word words)
(push (string delimiter) words)
(setf current-word (make-adjustable-string "")))
(vector-push-extend x current-word)))
(nreverse words)))
But this doesn't print out the last substring/word. I'm not sure what's going on.
Thanks for the help ahead of time!
If you're just looking for a solution, and not for an exercise, you can use cl-ppcre:
CL-USER> (cl-ppcre:split "(\\.)" "a.bc.def.com" :with-registers-p t)
("a" "." "bc" "." "def" "." "com")
Something like this?
copy sub-strings using subseq
using LOOP makes collecting things easier
Example:
(defun split-string-with-delimiter (string
&key (delimiter #\ )
(keep-delimiters nil)
&aux (l (length string)))
(loop for start = 0 then (1+ pos)
for pos = (position delimiter string :start start)
; no more delimiter found
when (and (null pos) (not (= start l)))
collect (subseq string start)
; while delimiter found
while pos
; some content found
when (> pos start) collect (subseq string start pos)
; optionally keep delimiter
when keep-delimiters collect (string delimiter)))
Example:
CL-USER 120 > (split-string-with-delimiter "..1.2.3.4.."
:delimiter #\. :keep-delimiters nil)
("1" "2" "3" "4")
CL-USER 121 > (split-string-with-delimiter "..1.2.3.4.."
:delimiter #\. :keep-delimiters t)
("." "." "1" "." "2" "." "3" "." "4" "." ".")
CL-USER 122 > (split-string-with-delimiter "1.2.3.4"
:delimiter #\. :keep-delimiters nil)
("1" "2" "3" "4")
CL-USER 123 > (split-string-with-delimiter "1.2.3.4"
:delimiter #\. :keep-delimiters t)
("1" "." "2" "." "3" "." "4")
Or modified to work with any sequence (lists, vectors, strings, ...):
(defun split-sequence-with-delimiter (sequence delimiter
&key (keep-delimiters nil)
&aux (end (length sequence)))
(loop for start = 0 then (1+ pos)
for pos = (position delimiter sequence :start start)
; no more delimiter found
when (and (null pos) (not (= start end)))
collect (subseq sequence start)
; while delimiter found
while pos
; some content found
when (> pos start) collect (subseq sequence start pos)
; optionally keep delimiter
when keep-delimiters collect (subseq sequence pos (1+ pos))))
The problem is after the end condition of the do* loop. When variable i reaches the end of the string, the do* loop is exited but there is still a current-word which has not been added yet to words. When the end condition is met you need to add x to current-word and then current-word to words, before exiting the loop:
(defun split-string-with-delimiter (string delimiter)
"Splits a string into a list of strings, with the delimiter still
in the resulting list."
(let ((words nil)
(current-word (make-adjustable-string "")))
(do* ((i 0 (+ i 1))
(x (char string i) (char string i)))
((>= (+ i 1) (length string)) (progn (vector-push-extend x current-word) (push current-word words)))
(if (eql delimiter x)
(unless (string= "" current-word)
(push current-word words)
(push (string delimiter) words)
(setf current-word (make-adjustable-string "")))
(vector-push-extend x current-word)))
(nreverse words)))
However, note that this version is still buggy in that if the last character of string is a delimiter, this will be included into the last word, i.e. (split-string-with-delimiter "a.bc.def." #\.) => ("a" "." "bc" "." "def.")
I'll let you add this check.
In any case, you might want to make this more efficient by looking ahead for delimiter and extracting all the characters between the current i and the next delimiter at once as one single substring.
For the case that you want to split with many delimiters, and keep them:
(defun split-string-with-delims (str delims)
(labels ((delim-p (c)
(position c delims))
(tokens (stri test)
(when (> (length stri) 0)
(let ((p (position-if test stri)))
(if p
(if (= p 0)
(cons (subseq stri 0 (1+ p))
(tokens (subseq stri (1+ p) nil) test))
(cons (subseq stri 0 p)
(tokens (subseq stri p nil) test)))
(cons stri nil))))))
(tokens str #'delim-p)))
And you can call it either:
(split-string-with-delims ".,hello world,," '(#\. #\, #\ ))
; => ("." "," "hello" " " "world" "," ",")
or:
(split-string-with-delims ".,hello world,,!!" "., ")
; => ("." "," "hello" " " "world" "," "," "!!")
Concerning your code, since there is subseq, i'd go for Rainer Joswig's way(above), instead of your make-adjustable-string + vector-push-extend.
I have been reading a little bit about functional programming on the web lately and I think I got a basic idea about the concepts behind it.
I'm curious how everyday programming problems which involve some kind of state are solved in a pure functional programing language.
For example: how would the word count program from the book 'The C programming Language' be implemented in a pure functional language?
Any contributions are welcome as long as the solution is in a pure functional style.
Here's the word count C code from the book:
#include <stdio.h>
#define IN 1 /* inside a word */
#define OUT 0 /* outside a word */
/* count lines, words, and characters in input */
main()
{
int c, nl, nw, nc, state;
state = OUT;
nl = nw = nc = 0;
while ((c = getchar()) != EOF) {
++nc;
if (c == '\n')
++nl;
if (c == ' ' || c == '\n' || c = '\t')
state = OUT;
else if (state == OUT) {
state = IN;
++nw;
}
}
printf("%d %d %d\n", nl, nw, nc);
}
Basically, in a functional styly you'll want to divide the IO operation of getting your stream of data from the pure operation of some stateful transistion based on the current character and the current state.
The Haskell solution from Tikhon is very clean but performs three passes on your input data and will result in the entire input being contained in memory until the result is computed. You can process data incrementally, which I do below using the Text package but no other advanced Haskell tools (which could clean this up at the expense of understandability by non-Haskellers).
First we have our preamble:
{-# LANGUAGE BangPatterns #-}
import Data.Text.Lazy as T
import Data.Text.Lazy.IO as TIO
Then we define our data structure to hold the state of the process (number of characters, words, and lines along with the State IN/OUT):
data Counts = Cnt { nc, nl, nw :: !Int
, state :: State }
deriving (Eq, Ord, Show)
data State = IN | OUT
deriving (Eq, Ord, Show)
Now I define a "zero" state just for easy use. I'd normally make a number of helper functions or use a package like lense to make incrementing each field in the Counts structure simple, but will go without for this answer:
zeros :: Counts
zeros = Cnt 0 0 0 OUT
And now I translate your series of if/else statements into a pure state machine:
op :: Counts -> Char -> Counts
op c '\n' = c { nc = nc c + 1, nw = nw c + 1, nl = nl c + 1, state=OUT }
op c ch | ch == ' ' || ch == '\t' = c { nc = nc c + 1, state=OUT }
| state c == OUT = c { nc = nc c + 1, nw = nw c + 1, state = IN }
| otherwise = c { nc = nc c + 1 }
Finally the main function just gets the input stream and folds our operation over the characters:
main = do
contents <- TIO.getContents
print $ T.foldl' op zeros contents
EDIT: You mentioned not understanding the syntax. Here is an even simpler version that I will explain:
import Data.Text.Lazy as T
import Data.Text.Lazy.IO as TIO
op (nc, nw, nl, st) ch
| ch == '\n' = (nc + 1, nw + 1 , nl + 1, True)
| ch == ' ' || ch == '\t' = (nc + 1, nw , nl , True)
| st = (nc + 1, nw + 1 , nl , False)
| otherwise = (nc + 1, nw , nl , st)
main = do
contents <- TIO.getContents
print $ T.foldl' op (0,0,0,True) contents
The import statements give us access to the getContents and foldl' functions we use.
The op function uses a bunch of guards - parts like | ch = '\n' - which is basically like a C if/elseif/else series.
The tuples ( ... , ... , ... , ... ) contain all our state. Haskell variables are immutable, so we make new tuples by adding one (or not) to the values of the previous variables.
A simple way to do it would be to read in the input and then use some simple functions to get the line/word/character count. Something like this would work:
count :: String -> (Int, Int, Int)
count str = (length $ lines str, length $ words str, length str)
main :: IO ()
main = fmap count getContents >>= print
This isn't exactly the same, but it's close.
This works really simply. Given a string, we can turn it into a list of lines with the standard lines function and a list of words with the standard words function. Since String is just [Char], length returns the number of characters. This is how we get the three counts. (For reference, length $ lines str is the same as length (lines str).)
The important idea is how the IO--reading the input and printing it out--is separated from the actual logic.
Also, instead of going through the input character by character keeping track of some state, we get the actual numbers by applying simple functions to the input. These functions are all just compositions of standard library functions.
In your loop there are four state variables, nc, nw, nl and state, plus the next character c. The loop remembers nc, nw, nl and state from the last time through the loop, and c changes each iteration through the loop. Imagine instead that you take those variables out of the loop and put them in a vector: [state, nc, nw, nl]. Then you change your loop construct into a function that takes two arguments, the first being a vector [state, nc, nw, nl], and the second being c, and returns a new vector with the updated values of nc, nw, nl and state. In C-ish pseudocode:
f([state, nc, nw, nl], c) {
++nc;
if (c == '\n')
++nl;
if (c == ' ' || c == '\n' || c = '\t')
state = OUT;
else if (state == OUT) {
state = IN;
++nw;
}
return [state, nc, nw, nl];
}
Now you can call that function with the vector [OUT, 0, 0, 0] and the first character in the string ("hello, world", say), and it will return a new vector [IN, 1, 0, 0]. Call f again with this new vector and the second character 'e', and it returns [IN, 2, 0, 0]. Repeat for the rest of the characters in the string, and the last call will return [IN, 12, 2, 0], identical to the values printed by the C code.
The basic idea is that you take the state variables out of the loop, turn the guts of the loop into a function, and pass the vector of state variables and the next input in as arguments to that function, and return a new state vector as a result. There is a function called reduce that does this.
Here's how you would do it in Clojure (formatted to emphasize the vectors returned):
(defn f [[state nc nw nl] c]
(let [nl (if (= c \n)(inc nl) nl)]
(cond
(or (= c \space)(= c \n)(= c \t)) [:out (inc nc) nw nl]
(= state :out) [:in (inc nc) (inc nw) nl]
true [state (inc nc) nw nl]
)))
(defn wc [s] (reduce f [:out 0 0 0] s))
(wc "hello, world")
which returns (and prints in the repl) [:in 12 2 0]
Here's my shot at a purely functional, strict, single-pass, tail-recursive solution in Scheme:
(define (word-count input-port)
(let loop ((c (read-char input-port))
(nl 0)
(nw 0)
(nc 0)
(state 'out))
(cond ((eof-object? c)
(printf "nl: ~s, nw: ~s, nc: ~s\n" nl nw nc))
((char=? c #\newline)
(loop (read-char input-port) (add1 nl) nw (add1 nc) 'out))
((char-whitespace? c)
(loop (read-char input-port) nl nw (add1 nc) 'out))
((eq? state 'out)
(loop (read-char input-port) nl (add1 nw) (add1 nc) 'in))
(else
(loop (read-char input-port) nl nw (add1 nc) state)))))
word-count receives an input port as a parameter; notice that no additional data structures are created (structs, tuples, vectors, etc.) instead, all state is kept in parameters. As an example, for counting the words in a file containing this:
hello, world
Call the procedure like this:
(call-with-input-file "/path/to/file" word-count)
> nl: 0, nw: 2, nc: 12
Common Lisp is mentioned, but it is not a pure functional programming language and it does not support TCO in its standard. Individual implementations do.
Tail recursive version, if the compiler supports it:
(defun word-count (&optional (stream *standard-input*))
(labels ((word-count-aux (in-p chars words lines)
(case (read-char stream nil :eof)
(:eof (values chars words lines))
(#\newline (word-count-aux nil (1+ chars) words (1+ lines)))
((#\space #\tab) (word-count-aux nil (1+ chars) words lines))
(otherwise (word-count-aux t
(1+ chars)
(if in-p words (1+ words))
lines)))))
(word-count-aux nil 0 0 0)))
But since TCO is not in the standard, a portable version would look more like this:
(defun word-count (&optional (stream *standard-input*)
&aux (in-p nil) (chars 0) (words 0) (lines 0) char)
(loop while (setf char (read-char stream nil nil)) do
(case char
(#\newline (setf in-p nil) (incf lines))
((#\space #\tab) (setf in-p nil))
(otherwise (unless in-p (incf words)) (setf in-p t)))
(incf chars))
(values chars words lines))
Above is no longer Functional.
We can replace the loop with a higher-order stream-map:
(defun stream-map (function stream)
(loop for char = (read-char stream nil nil)
while char do (funcall function char)))
(defun word-count (&optional (stream *standard-input*)
&aux (in-p nil) (chars 0) (words 0) (lines 0) char)
(stream-map (lambda (char)
(incf chars)
(when (eql char #\newline)
(incf lines))
(if (member char '(#\space #\newline #\tab))
(setf in-p nil)
(unless in-p
(incf words)
(setf in-p t))))
stream)
(values chars words lines))
The state is modified by the closure. To get rid of that we can implement a stream-reduce.
(defun stream-reduce (function stream &key initial-value)
(let ((value initial-value))
(loop for char = (read-char stream nil nil)
while char
do (setf value (funcall function value char)))
value))
(defun word-count (&optional (stream *standard-input*))
(rest (stream-reduce
(lambda (state char)
(destructuring-bind (in-p chars words lines) state
(case char
(#\newline (list nil (1+ chars) words (1+ lines)))
((#\space #\tab) (list nil (1+ chars) words lines))
(otherwise (list t
(1+ chars)
(if in-p words (1+ words))
lines)))))
stream
:initial-value (list nil 0 0 0))))
Here is a Scheme version of the program, from my blog, which implements the entire Unix word count program, including argument- and file-handling. The key function is wc, which is purely functional. It moves all local variables into the arguments of a local function (defined via named-let), which is the standard idiom for converting an imperative loop to functional style. The man page and code appear below:
NAME
wc -- word count
SYNOPSIS
wc [ -lwc ] [ name ... ]
DESCRIPTION
Wc counts lines, words and characters in the named files,
or in the standard input if no name appears. A word is a
maximal string of characters delimited by spaces, tabs or
newlines.
If the optional argument is present, just the specified
counts (lines, words, or characters) are selected by the
letters l, w or c.
#! /usr/bin/scheme --script
(define l-flag #t)
(define w-flag #t)
(define c-flag #t)
(define (update-flags fs)
(if (not (member #\l fs)) (set! l-flag #f))
(if (not (member #\w fs)) (set! w-flag #f))
(if (not (member #\c fs)) (set! c-flag #f)))
(define (put-dec n width)
(let* ((n-str (number->string n)))
(display (make-string (- width (string-length n-str)) #\space))
(display n-str)))
(define (wc)
(let loop ((inword #f) (c (read-char)) (ls 0) (ws 0) (cs 0))
(cond ((eof-object? c) (values ls ws cs))
((char=? c #\newline)
(loop #f (read-char) (add1 ls) ws (add1 cs)))
((not (member c '(#\space #\newline #\tab)))
(if inword
(loop #t (read-char) ls ws (add1 cs))
(loop #t (read-char) ls (add1 ws) (add1 cs))))
(else (loop #f (read-char) ls ws (add1 cs))))))
(define (main args)
(when (and (pair? args) (char=? (string-ref (car args) 0) #\-))
(update-flags (cdr (string->list (car args))))
(set! args (cdr args)))
(if (null? args)
(let-values (((ls ws cs) (wc)))
(when l-flag (display ls) (display " "))
(when w-flag (display ws) (display " "))
(when c-flag (display cs) (display " "))
(newline))
(let loop ((args args) (l-tot 0) (w-tot 0) (c-tot 0))
(if (null? args)
(begin (when l-flag (put-dec l-tot 12))
(when w-flag (put-dec w-tot 12))
(when c-flag (put-dec c-tot 12)))
(with-input-from-file (car args)
(lambda ()
(let-values (((ls ws cs) (wc)))
(when l-flag (put-dec ls 12))
(when w-flag (put-dec ws 12))
(when c-flag (put-dec cs 12))
(display " ") (display (car args)) (newline)
(loop (cdr args) (+ l-tot ls) (+ w-tot ws) (+ c-tot cs)))))))))
(main (cdr (command-line)))
Here's a solution based on the Clojure example posted here but in CL using recursion.
(defstruct (state (:constructor make-state (state chars words lines)))
state chars words lines)
(defun wc (state stream)
(symbol-macrolet ((s (state-state state))
(c (state-chars state))
(w (state-words state))
(l (state-lines state)))
(case (read-char stream nil :eof)
(:eof state)
(#\Newline (wc (make-state :out (1+ c) w (1+ l)) stream))
(#\Space (wc (make-state :out (1+ c) w l) stream))
(t (if (eq s :out)
(wc (make-state :in (1+ c) (1+ w) l) stream)
(wc (make-state :in (1+ c) w l) stream))))))
(with-input-from-string (stream "Hello Functional Programming World")
(wc (make-state :out 0 0 0) stream))
;;; => #S(STATE :STATE :IN :CHARS 34 :WORDS 4 :LINES 0)
I believe you could write this somewhat more elegantly while still only iterating over the input once, but you'll need to make GHC do more work, certainly use -O2.
I have not yet compiled this code, much less compared it's speed vs. Thomas DuBuisson's answer, but this should indicate the basic direction.
{-# LANGUAGE BangPatterns #-}
import Data.List
wordcount = snd . foldl' go (False,0)
where go (!b,!n) !c = if elem c [' ','\t','\n'] then (False,n)
else (True, n + if b then 0 else 1)
linecount = foldl' go 0
where go !n !c = n + if c == '\n' then 1 else 0
main = interact $ show . go
where go x = (linecount x, wordcount x, foldl' (\!n _ ->n+1) 0 x)
If I understand stream fusion correctly, then GHC should inline wordcount and linecount into main, merge the three foldl' commands into one, well hopefully, and start rearranging the if checks. I'd hope it'd inlined elem and foldl' too of course.
If not, you could certainly force inlining and probably create a simple fusion rule, but maybe the defaults suffice. Or maybe some simple alterations produce the desired effect.
Btw, I have written foldl' (\n _ ->n+1) 0 x only because I've heard bad storied about length, but maybe length works fine, another change worth profiling.
In Haskell using strict IO rather than lazy. Does words only but you can easily implement characters and lines on top of this. Requires the text and conduit packages:
module Main
where
import Control.Applicative
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import qualified Data.Conduit.Text as CT
import qualified Data.Text as T
import System.Environment
main :: IO ()
main = do args <- getArgs
print <$> (runResourceT $
CB.sourceFile (args !! 0)
$$ CB.lines
=$= CT.decode CT.utf8
=$= CL.map T.words
=$ CL.fold (\acc words -> acc + length words) 0)
Here's a version in Typed Racket using match and the for loop macros:
(: word-count : Input-Port -> Void)
(define (word-count in)
(define-values (nl nw nc st)
(for/fold: ([nl : Integer 0] [nw : Integer 0] [nc : Integer 0]
[state : (U 'in 'out) 'out])
([c (in-input-port-chars in)])
(match* (c state)
[(#\newline _) (values (add1 nl) nw (add1 nc) 'out)]
[((? char-whitespace?) _)
(values (add1 nl) nw (add1 nc) 'out)]
[(_ 'out) (values nl (add1 nw) (add1 nc) 'in)]
[(_ _) (values nl nw (add1 nc) state)])))
(printf "nl: ~s, nw: ~s, nc: ~s\n" nl nw nc))
Here is a Haskell implementation, where I have tried to stay close to the approach followed by the original C program. Iterations often become fold operations, with state-containing variables ending up as the first argument to the operation passed to fold.
-- Count characters, words, and lines in an input string.
wordCount::String->(Int, Int, Int)
wordCount str = (c,w,l)
where (inWord,c,w,l) = foldl op (False,0,0,1) str
where op (inWord,c,w,l) next | next == '\n' = (False,c+1,w,l+1)
| next == '\t' || next == ' ' = (False,c+1,w,l)
| inWord == False = (True,c+1,w+1,l)
| otherwise = (True,c+1,w,l)
main = interact $ show . wordCount
I fully understand the use of list in lisp but I 've got problem using string.
I try to write my own code of functions like string> or string< from common lisp to understand how lisp deals with string.
For example, abcde is bigger than abbb and returns 1.
I think that I will use the char function or do you think that I must use subseq ? or function that deals with ASCII code ?
Here are cases that I found :
-character 0 of each string are equal, we continue, with the next character.
-charactere 0 are different, one is smaller that other, we stop.
I need help about "go to the next character".
Thanks a lot !!
This is the Common Lisp version. You can just use ELT because
(type-of "a") => (SIMPLE-ARRAY CHARACTER (1))
(defun my-string< (a b &key (start 0))
(cond
((= start (length a) (length b))
nil)
((>= start (min (length a) (length b)))
(error "Undefined"))
((char= (elt a start) (elt b start))
(my-string< a b :start (1+ start)))
((char< (elt a start) (elt b start))
t)))
This is an implementation of a function that given two strings will return either -1, 0 or +1 depending on if the first is less than, equal or greater than the second.
In case one string is the initial part of the other then shorter string is considered to be "less than" the longer.
The algorithm is very simple... loops for every char until either the index gets past one of the strings or if a character is found to be different.
(defun strcmp (a b)
(do ((i 0 (1+ i))
(na (length a))
(nb (length b)))
((or (= i na) (= i nb) (char/= (elt a i) (elt b i)))
(cond
((= i na nb) 0) ;; Strings are identical
((= i na) -1) ;; a finished first
((= i nb) 1) ;; b finished first
((char< (elt a i) (elt b i)) -1) ;; Different char a < b
(t 1))))) ;; Only remaining case
(defun test (a b)
(format t "(strcmp ~s ~s) -> ~s~%"
a b (strcmp a b)))
(test "abc" "abc")
(test "ab" "abc")
(test "abc" "ab")
(test "abd" "abc")
(test "abc" "abd")
The output is
(strcmp "abc" "abc") -> 0
(strcmp "ab" "abc") -> -1
(strcmp "abc" "ab") -> 1
(strcmp "abd" "abc") -> 1
(strcmp "abc" "abd") -> -1
Your problem has been solved already but in case you run into others,
the following method might be useful:
I installed SBCL from source and keep the source around.
That allows me to run M-. on a function name like string<
and it will jump to the definition in your lisp implementation.
In my case I ended up at this macro:
;;; LESSP is true if the desired expansion is for STRING<* or STRING<=*.
;;; EQUALP is true if the desired expansion is for STRING<=* or STRING>=*.
(sb!xc:defmacro string<>=*-body (lessp equalp)
(let ((offset1 (gensym)))
`(with-two-strings string1 string2 start1 end1 ,offset1 start2 end2
(let ((index (%sp-string-compare string1 start1 end1
string2 start2 end2)))
(if index
(cond ((= (the fixnum index) (the fixnum end1))
,(if lessp
`(- (the fixnum index) ,offset1)
`nil))
((= (+ (the fixnum index) (- start2 start1))
(the fixnum end2))
,(if lessp
`nil
`(- (the fixnum index) ,offset1)))
((,(if lessp 'char< 'char>)
(schar string1 index)
(schar string2 (+ (the fixnum index) (- start2 start1))))
(- (the fixnum index) ,offset1))
(t nil))
,(if equalp `(- (the fixnum end1) ,offset1) nil))))))
) ; EVAL-WHEN
There is no direct concept of iteration ("next") with strings, in Scheme. That only applies to lists. So instead you have to iterate with indices:
(define (string<? lhs rhs)
(let* ((lhslen (string-length lhs))
(rhslen (string-length rhs))
(minlen (min lhslen rhslen)))
(let loop ((i 0))
(if (= i minlen) (< lhslen rhslen)
(let ((lhschar (string-ref lhs i))
(rhschar (string-ref rhs i)))
(cond ((char<? lhschar rhschar) #t)
((char<? rhschar lhschar) #f)
(else (loop (+ i 1)))))))))