I have been reading a little bit about functional programming on the web lately and I think I got a basic idea about the concepts behind it.
I'm curious how everyday programming problems which involve some kind of state are solved in a pure functional programing language.
For example: how would the word count program from the book 'The C programming Language' be implemented in a pure functional language?
Any contributions are welcome as long as the solution is in a pure functional style.
Here's the word count C code from the book:
#include <stdio.h>
#define IN 1 /* inside a word */
#define OUT 0 /* outside a word */
/* count lines, words, and characters in input */
main()
{
int c, nl, nw, nc, state;
state = OUT;
nl = nw = nc = 0;
while ((c = getchar()) != EOF) {
++nc;
if (c == '\n')
++nl;
if (c == ' ' || c == '\n' || c = '\t')
state = OUT;
else if (state == OUT) {
state = IN;
++nw;
}
}
printf("%d %d %d\n", nl, nw, nc);
}
Basically, in a functional styly you'll want to divide the IO operation of getting your stream of data from the pure operation of some stateful transistion based on the current character and the current state.
The Haskell solution from Tikhon is very clean but performs three passes on your input data and will result in the entire input being contained in memory until the result is computed. You can process data incrementally, which I do below using the Text package but no other advanced Haskell tools (which could clean this up at the expense of understandability by non-Haskellers).
First we have our preamble:
{-# LANGUAGE BangPatterns #-}
import Data.Text.Lazy as T
import Data.Text.Lazy.IO as TIO
Then we define our data structure to hold the state of the process (number of characters, words, and lines along with the State IN/OUT):
data Counts = Cnt { nc, nl, nw :: !Int
, state :: State }
deriving (Eq, Ord, Show)
data State = IN | OUT
deriving (Eq, Ord, Show)
Now I define a "zero" state just for easy use. I'd normally make a number of helper functions or use a package like lense to make incrementing each field in the Counts structure simple, but will go without for this answer:
zeros :: Counts
zeros = Cnt 0 0 0 OUT
And now I translate your series of if/else statements into a pure state machine:
op :: Counts -> Char -> Counts
op c '\n' = c { nc = nc c + 1, nw = nw c + 1, nl = nl c + 1, state=OUT }
op c ch | ch == ' ' || ch == '\t' = c { nc = nc c + 1, state=OUT }
| state c == OUT = c { nc = nc c + 1, nw = nw c + 1, state = IN }
| otherwise = c { nc = nc c + 1 }
Finally the main function just gets the input stream and folds our operation over the characters:
main = do
contents <- TIO.getContents
print $ T.foldl' op zeros contents
EDIT: You mentioned not understanding the syntax. Here is an even simpler version that I will explain:
import Data.Text.Lazy as T
import Data.Text.Lazy.IO as TIO
op (nc, nw, nl, st) ch
| ch == '\n' = (nc + 1, nw + 1 , nl + 1, True)
| ch == ' ' || ch == '\t' = (nc + 1, nw , nl , True)
| st = (nc + 1, nw + 1 , nl , False)
| otherwise = (nc + 1, nw , nl , st)
main = do
contents <- TIO.getContents
print $ T.foldl' op (0,0,0,True) contents
The import statements give us access to the getContents and foldl' functions we use.
The op function uses a bunch of guards - parts like | ch = '\n' - which is basically like a C if/elseif/else series.
The tuples ( ... , ... , ... , ... ) contain all our state. Haskell variables are immutable, so we make new tuples by adding one (or not) to the values of the previous variables.
A simple way to do it would be to read in the input and then use some simple functions to get the line/word/character count. Something like this would work:
count :: String -> (Int, Int, Int)
count str = (length $ lines str, length $ words str, length str)
main :: IO ()
main = fmap count getContents >>= print
This isn't exactly the same, but it's close.
This works really simply. Given a string, we can turn it into a list of lines with the standard lines function and a list of words with the standard words function. Since String is just [Char], length returns the number of characters. This is how we get the three counts. (For reference, length $ lines str is the same as length (lines str).)
The important idea is how the IO--reading the input and printing it out--is separated from the actual logic.
Also, instead of going through the input character by character keeping track of some state, we get the actual numbers by applying simple functions to the input. These functions are all just compositions of standard library functions.
In your loop there are four state variables, nc, nw, nl and state, plus the next character c. The loop remembers nc, nw, nl and state from the last time through the loop, and c changes each iteration through the loop. Imagine instead that you take those variables out of the loop and put them in a vector: [state, nc, nw, nl]. Then you change your loop construct into a function that takes two arguments, the first being a vector [state, nc, nw, nl], and the second being c, and returns a new vector with the updated values of nc, nw, nl and state. In C-ish pseudocode:
f([state, nc, nw, nl], c) {
++nc;
if (c == '\n')
++nl;
if (c == ' ' || c == '\n' || c = '\t')
state = OUT;
else if (state == OUT) {
state = IN;
++nw;
}
return [state, nc, nw, nl];
}
Now you can call that function with the vector [OUT, 0, 0, 0] and the first character in the string ("hello, world", say), and it will return a new vector [IN, 1, 0, 0]. Call f again with this new vector and the second character 'e', and it returns [IN, 2, 0, 0]. Repeat for the rest of the characters in the string, and the last call will return [IN, 12, 2, 0], identical to the values printed by the C code.
The basic idea is that you take the state variables out of the loop, turn the guts of the loop into a function, and pass the vector of state variables and the next input in as arguments to that function, and return a new state vector as a result. There is a function called reduce that does this.
Here's how you would do it in Clojure (formatted to emphasize the vectors returned):
(defn f [[state nc nw nl] c]
(let [nl (if (= c \n)(inc nl) nl)]
(cond
(or (= c \space)(= c \n)(= c \t)) [:out (inc nc) nw nl]
(= state :out) [:in (inc nc) (inc nw) nl]
true [state (inc nc) nw nl]
)))
(defn wc [s] (reduce f [:out 0 0 0] s))
(wc "hello, world")
which returns (and prints in the repl) [:in 12 2 0]
Here's my shot at a purely functional, strict, single-pass, tail-recursive solution in Scheme:
(define (word-count input-port)
(let loop ((c (read-char input-port))
(nl 0)
(nw 0)
(nc 0)
(state 'out))
(cond ((eof-object? c)
(printf "nl: ~s, nw: ~s, nc: ~s\n" nl nw nc))
((char=? c #\newline)
(loop (read-char input-port) (add1 nl) nw (add1 nc) 'out))
((char-whitespace? c)
(loop (read-char input-port) nl nw (add1 nc) 'out))
((eq? state 'out)
(loop (read-char input-port) nl (add1 nw) (add1 nc) 'in))
(else
(loop (read-char input-port) nl nw (add1 nc) state)))))
word-count receives an input port as a parameter; notice that no additional data structures are created (structs, tuples, vectors, etc.) instead, all state is kept in parameters. As an example, for counting the words in a file containing this:
hello, world
Call the procedure like this:
(call-with-input-file "/path/to/file" word-count)
> nl: 0, nw: 2, nc: 12
Common Lisp is mentioned, but it is not a pure functional programming language and it does not support TCO in its standard. Individual implementations do.
Tail recursive version, if the compiler supports it:
(defun word-count (&optional (stream *standard-input*))
(labels ((word-count-aux (in-p chars words lines)
(case (read-char stream nil :eof)
(:eof (values chars words lines))
(#\newline (word-count-aux nil (1+ chars) words (1+ lines)))
((#\space #\tab) (word-count-aux nil (1+ chars) words lines))
(otherwise (word-count-aux t
(1+ chars)
(if in-p words (1+ words))
lines)))))
(word-count-aux nil 0 0 0)))
But since TCO is not in the standard, a portable version would look more like this:
(defun word-count (&optional (stream *standard-input*)
&aux (in-p nil) (chars 0) (words 0) (lines 0) char)
(loop while (setf char (read-char stream nil nil)) do
(case char
(#\newline (setf in-p nil) (incf lines))
((#\space #\tab) (setf in-p nil))
(otherwise (unless in-p (incf words)) (setf in-p t)))
(incf chars))
(values chars words lines))
Above is no longer Functional.
We can replace the loop with a higher-order stream-map:
(defun stream-map (function stream)
(loop for char = (read-char stream nil nil)
while char do (funcall function char)))
(defun word-count (&optional (stream *standard-input*)
&aux (in-p nil) (chars 0) (words 0) (lines 0) char)
(stream-map (lambda (char)
(incf chars)
(when (eql char #\newline)
(incf lines))
(if (member char '(#\space #\newline #\tab))
(setf in-p nil)
(unless in-p
(incf words)
(setf in-p t))))
stream)
(values chars words lines))
The state is modified by the closure. To get rid of that we can implement a stream-reduce.
(defun stream-reduce (function stream &key initial-value)
(let ((value initial-value))
(loop for char = (read-char stream nil nil)
while char
do (setf value (funcall function value char)))
value))
(defun word-count (&optional (stream *standard-input*))
(rest (stream-reduce
(lambda (state char)
(destructuring-bind (in-p chars words lines) state
(case char
(#\newline (list nil (1+ chars) words (1+ lines)))
((#\space #\tab) (list nil (1+ chars) words lines))
(otherwise (list t
(1+ chars)
(if in-p words (1+ words))
lines)))))
stream
:initial-value (list nil 0 0 0))))
Here is a Scheme version of the program, from my blog, which implements the entire Unix word count program, including argument- and file-handling. The key function is wc, which is purely functional. It moves all local variables into the arguments of a local function (defined via named-let), which is the standard idiom for converting an imperative loop to functional style. The man page and code appear below:
NAME
wc -- word count
SYNOPSIS
wc [ -lwc ] [ name ... ]
DESCRIPTION
Wc counts lines, words and characters in the named files,
or in the standard input if no name appears. A word is a
maximal string of characters delimited by spaces, tabs or
newlines.
If the optional argument is present, just the specified
counts (lines, words, or characters) are selected by the
letters l, w or c.
#! /usr/bin/scheme --script
(define l-flag #t)
(define w-flag #t)
(define c-flag #t)
(define (update-flags fs)
(if (not (member #\l fs)) (set! l-flag #f))
(if (not (member #\w fs)) (set! w-flag #f))
(if (not (member #\c fs)) (set! c-flag #f)))
(define (put-dec n width)
(let* ((n-str (number->string n)))
(display (make-string (- width (string-length n-str)) #\space))
(display n-str)))
(define (wc)
(let loop ((inword #f) (c (read-char)) (ls 0) (ws 0) (cs 0))
(cond ((eof-object? c) (values ls ws cs))
((char=? c #\newline)
(loop #f (read-char) (add1 ls) ws (add1 cs)))
((not (member c '(#\space #\newline #\tab)))
(if inword
(loop #t (read-char) ls ws (add1 cs))
(loop #t (read-char) ls (add1 ws) (add1 cs))))
(else (loop #f (read-char) ls ws (add1 cs))))))
(define (main args)
(when (and (pair? args) (char=? (string-ref (car args) 0) #\-))
(update-flags (cdr (string->list (car args))))
(set! args (cdr args)))
(if (null? args)
(let-values (((ls ws cs) (wc)))
(when l-flag (display ls) (display " "))
(when w-flag (display ws) (display " "))
(when c-flag (display cs) (display " "))
(newline))
(let loop ((args args) (l-tot 0) (w-tot 0) (c-tot 0))
(if (null? args)
(begin (when l-flag (put-dec l-tot 12))
(when w-flag (put-dec w-tot 12))
(when c-flag (put-dec c-tot 12)))
(with-input-from-file (car args)
(lambda ()
(let-values (((ls ws cs) (wc)))
(when l-flag (put-dec ls 12))
(when w-flag (put-dec ws 12))
(when c-flag (put-dec cs 12))
(display " ") (display (car args)) (newline)
(loop (cdr args) (+ l-tot ls) (+ w-tot ws) (+ c-tot cs)))))))))
(main (cdr (command-line)))
Here's a solution based on the Clojure example posted here but in CL using recursion.
(defstruct (state (:constructor make-state (state chars words lines)))
state chars words lines)
(defun wc (state stream)
(symbol-macrolet ((s (state-state state))
(c (state-chars state))
(w (state-words state))
(l (state-lines state)))
(case (read-char stream nil :eof)
(:eof state)
(#\Newline (wc (make-state :out (1+ c) w (1+ l)) stream))
(#\Space (wc (make-state :out (1+ c) w l) stream))
(t (if (eq s :out)
(wc (make-state :in (1+ c) (1+ w) l) stream)
(wc (make-state :in (1+ c) w l) stream))))))
(with-input-from-string (stream "Hello Functional Programming World")
(wc (make-state :out 0 0 0) stream))
;;; => #S(STATE :STATE :IN :CHARS 34 :WORDS 4 :LINES 0)
I believe you could write this somewhat more elegantly while still only iterating over the input once, but you'll need to make GHC do more work, certainly use -O2.
I have not yet compiled this code, much less compared it's speed vs. Thomas DuBuisson's answer, but this should indicate the basic direction.
{-# LANGUAGE BangPatterns #-}
import Data.List
wordcount = snd . foldl' go (False,0)
where go (!b,!n) !c = if elem c [' ','\t','\n'] then (False,n)
else (True, n + if b then 0 else 1)
linecount = foldl' go 0
where go !n !c = n + if c == '\n' then 1 else 0
main = interact $ show . go
where go x = (linecount x, wordcount x, foldl' (\!n _ ->n+1) 0 x)
If I understand stream fusion correctly, then GHC should inline wordcount and linecount into main, merge the three foldl' commands into one, well hopefully, and start rearranging the if checks. I'd hope it'd inlined elem and foldl' too of course.
If not, you could certainly force inlining and probably create a simple fusion rule, but maybe the defaults suffice. Or maybe some simple alterations produce the desired effect.
Btw, I have written foldl' (\n _ ->n+1) 0 x only because I've heard bad storied about length, but maybe length works fine, another change worth profiling.
In Haskell using strict IO rather than lazy. Does words only but you can easily implement characters and lines on top of this. Requires the text and conduit packages:
module Main
where
import Control.Applicative
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import qualified Data.Conduit.Text as CT
import qualified Data.Text as T
import System.Environment
main :: IO ()
main = do args <- getArgs
print <$> (runResourceT $
CB.sourceFile (args !! 0)
$$ CB.lines
=$= CT.decode CT.utf8
=$= CL.map T.words
=$ CL.fold (\acc words -> acc + length words) 0)
Here's a version in Typed Racket using match and the for loop macros:
(: word-count : Input-Port -> Void)
(define (word-count in)
(define-values (nl nw nc st)
(for/fold: ([nl : Integer 0] [nw : Integer 0] [nc : Integer 0]
[state : (U 'in 'out) 'out])
([c (in-input-port-chars in)])
(match* (c state)
[(#\newline _) (values (add1 nl) nw (add1 nc) 'out)]
[((? char-whitespace?) _)
(values (add1 nl) nw (add1 nc) 'out)]
[(_ 'out) (values nl (add1 nw) (add1 nc) 'in)]
[(_ _) (values nl nw (add1 nc) state)])))
(printf "nl: ~s, nw: ~s, nc: ~s\n" nl nw nc))
Here is a Haskell implementation, where I have tried to stay close to the approach followed by the original C program. Iterations often become fold operations, with state-containing variables ending up as the first argument to the operation passed to fold.
-- Count characters, words, and lines in an input string.
wordCount::String->(Int, Int, Int)
wordCount str = (c,w,l)
where (inWord,c,w,l) = foldl op (False,0,0,1) str
where op (inWord,c,w,l) next | next == '\n' = (False,c+1,w,l+1)
| next == '\t' || next == ' ' = (False,c+1,w,l)
| inWord == False = (True,c+1,w+1,l)
| otherwise = (True,c+1,w,l)
main = interact $ show . wordCount