Can parser combinators be made efficient? - haskell

Around 6 years ago, I benchmarked my own parser combinators in OCaml and found that they were ~5× slower than the parser generators on offer at the time. I recently revisited this subject and benchmarked Haskell's Parsec vs a simple hand-rolled precedence climbing parser written in F# and was surprised to find the F# to be 25× faster than the Haskell.
Here's the Haskell code I used to read a large mathematical expression from file, parse and evaluate it:
import Control.Applicative
import Text.Parsec hiding ((<|>))
expr = chainl1 term ((+) <$ char '+' <|> (-) <$ char '-')
term = chainl1 fact ((*) <$ char '*' <|> div <$ char '/')
fact = read <$> many1 digit <|> char '(' *> expr <* char ')'
eval :: String -> Int
eval = either (error . show) id . parse expr "" . filter (/= ' ')
main :: IO ()
main = do
file <- readFile "expr"
putStr $ show $ eval file
putStr "\n"
and here's my self-contained precedence climbing parser in F#:
let rec (|Expr|) = function
| P(f, xs) -> Expr(loop (' ', f, xs))
| xs -> invalidArg "Expr" (sprintf "%A" xs)
and loop = function
| ' ' as oop, f, ('+' | '-' as op)::P(g, xs)
| (' ' | '+' | '-' as oop), f, ('*' | '/' as op)::P(g, xs) ->
let h, xs = loop (op, g, xs)
match op with
| '+' -> (+) | '-' -> (-) | '*' -> (*) | '/' | _ -> (/)
|> fun op -> loop (oop, op f h, xs)
| _, f, xs -> f, xs
and (|P|_|) = function
| '('::Expr(f, ')'::xs) -> Some(P(f, xs))
| c::_ as xs when '0' <= c && c <= '9' ->
let rec loop n = function
| c2::xs when '0' <= c2 && c2 <= '9' -> loop (10*n + int(string c2)) xs
| xs -> Some(P(n, xs))
loop 0 xs
| _ -> None
My impression is that even state-of-the-art parser combinators waste a lot of time back tracking. Is that correct? If so, is it possible to write parser combinators that generate state machines to obtain competitive performance or is it necessary to use code generation?
EDIT:
Here's the OCaml script I used to generate a ~2Mb expression for benchmarking:
open Printf
let rec f ff n =
if n=0 then fprintf ff "1" else
fprintf ff "%a+%a*(%a-%a)" f (n-1) f (n-1) f (n-1) f (n-1)
let () =
let n = try int_of_string Sys.argv.(1) with _ -> 3 in
fprintf stdout "%a\n" f n

I've come up with a Haskell solution that is 30× faster than the Haskell solution you posted (with my concocted test expression).
Major changes:
Change Parsec/String to Attoparsec/ByteString
In the fact function, change read & many1 digit to decimal
Made the chainl1 recursion strict (remove $! for the lazier version).
I tried to keep everything else you had as similar as possible.
import Control.Applicative
import Data.Attoparsec
import Data.Attoparsec.Char8
import qualified Data.ByteString.Char8 as B
expr :: Parser Int
expr = chainl1 term ((+) <$ char '+' <|> (-) <$ char '-')
term :: Parser Int
term = chainl1 fact ((*) <$ char '*' <|> div <$ char '/')
fact :: Parser Int
fact = decimal <|> char '(' *> expr <* char ')'
eval :: B.ByteString -> Int
eval = either (error . show) id . eitherResult . parse expr . B.filter (/= ' ')
chainl1 :: (Monad f, Alternative f) => f a -> f (a -> a -> a) -> f a
chainl1 p op = p >>= rest where
rest x = do f <- op
y <- p
rest $! (f x y)
<|> pure x
main :: IO ()
main = B.readFile "expr" >>= (print . eval)
I guess what I concluded from this is that the majority of the slowdown for the parser combinator was that it was sitting on an inefficient base, not that it was a parser combinator, per se.
I imagine with more time and profiling this could go faster, as I stopped when I went past the 25× mark.
I don't know if this would be faster than the precedence climbing parser ported to Haskell. Maybe that would be an interesting test?

I'm currently working on the next version of FParsec (v. 0.9), which will in many situations improve performance by up to a factor of 2 relative to the current version.
[Update: FParsec 0.9 has been released, see http://www.quanttec.com/fparsec ]
I've tested Jon's F# parser implementation against two FParsec implementations. The first FParsec parser is a direct translation of djahandarie's parser. The second one uses FParsec's embeddable operator precedence component. As the input I used a string generated with Jon's OCaml script with parameter 10, which gives me an input size of about 2.66MB. All parsers were compiled in release mode and were run on the 32-bit .NET 4 CLR. I only measured the pure parsing time and didn't include startup time or the time needed for constructing the input string (for the FParsec parsers) or the char list (Jon's parser).
I measured the following numbers (updated numbers for v. 0.9 in parens):
Jon's hand-rolled parser: ~230ms
FParsec parser #1: ~270ms (~235ms)
FParsec parser #2: ~110ms (~102ms)
In light of these numbers, I'd say that parser combinators can definitely offer competitive performance, at least for this particular problem, especially if you take into account that FParsec
automatically generates highly readable error messages,
supports very large files as input (with arbitrary backtracking), and
comes with a declarative, runtime-configurable operator-precedence parser module.
Here's the code for the two FParsec implementations:
Parser #1 (Translation of djahandarie's parser):
open FParsec
let str s = pstring s
let expr, exprRef = createParserForwardedToRef()
let fact = pint32 <|> between (str "(") (str ")") expr
let term = chainl1 fact ((str "*" >>% (*)) <|> (str "/" >>% (/)))
do exprRef:= chainl1 term ((str "+" >>% (+)) <|> (str "-" >>% (-)))
let parse str = run expr str
Parser #2 (Idiomatic FParsec implementation):
open FParsec
let opp = new OperatorPrecedenceParser<_,_,_>()
type Assoc = Associativity
let str s = pstring s
let noWS = preturn () // dummy whitespace parser
opp.AddOperator(InfixOperator("-", noWS, 1, Assoc.Left, (-)))
opp.AddOperator(InfixOperator("+", noWS, 1, Assoc.Left, (+)))
opp.AddOperator(InfixOperator("*", noWS, 2, Assoc.Left, (*)))
opp.AddOperator(InfixOperator("/", noWS, 2, Assoc.Left, (/)))
let expr = opp.ExpressionParser
let term = pint32 <|> between (str "(") (str ")") expr
opp.TermParser <- term
let parse str = run expr str

In a nutshell, parser combinators are slow for lexing.
There was a Haskell combinator library for building lexers (see "Lazy Lexing is Fast" Manuel M. T. Chakravarty) - as the tables were generated at runtime, there wasn't the hassle of code generation. The library got used a bit - it was initially used in one of the FFI preprocessors, but I don't think it ever got uploaded to Hackage, so maybe it was a little too inconvenient for regular use.
In the OCaml code above, the parser is directly matching on char-lists so it can be as fast as list destructuring is in the host language (it would be much faster than Parsec if it were re-implemented in Haskell). Christian Lindig had an OCaml library that had a set of parser combinators and a set of lexer combinators - the lexer combinators were certainly much simpler than Manuel Chakravarty's, and it might might be worthwhile tracking down this library and bench-marking it before writing a lexer generator.

Have you tried one of the known fast parser libraries? Parsec's aims have never really been speed, but ease of use and clarity. Comparing to something like attoparsec may be a more fair comparison, especially because the string types are likely to be more equal (ByteString instead of String).
I also wonder which compile flags were used. This being another trolling post by the infamous Jon Harrop, it would not surprise me if no optimisations were used at all for the Haskell code.

Related

How to use parsec to get sub-strings of specific pattern in a string

I'm a newbie to Haskell, and now I'm learning to use parsec. I get stuck in one problem, that is, I want to get all the sub-strings which satisfies some specific pattern in a string. For example, from the following string,
"I want to choose F12 or F 12 from F1(a), F2a, F5-A, F34-5 and so on,
but F alone should not be chosen, that is, choose those which start with F
followed by a digit (before the digit there could be zero or more than one space) and then by any character from ['a'..'z'] ++
['A'..'Z'] ++ ['0'..'9'] ++ ['(',')',"-"]."
the result should be [F12, F12, F1(a), F2a, F5-A, F34-5], where the space between the F and the digit should be deleted.
With the parsec, I have succeeded in getting one sub-string, such as F12, F2a. The code is as follows:
hao :: Parser Char
hao = oneOf "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ()-"
tuhao :: Parser String
tuhao = do { c <- char 'F'
; many space
; c1 <- digit
; cs <- many hao
; return (c:c1:cs)
}
parse tuhao "" str -- can parse the str and get one sub-string.
However, I am stuck at how to parse the example string above and get all the sub-strings of the specific pattern. I have an idea that if F is found, then begin parsing, else skip parsing or if parsing fails then skip parsing. But I don't know how to implement the plan. I have another idea that uses State to record the remaining string that is not parsed, and use recursion, but still fail to carry it out.
So I appreciate any tip! ^_^
F12, F 12, F1(a), F2a, F5-A, F34-5
This is an incomplete description, so I'll make some guesses.
I would start by defining a type that can contain the logical parts of these expressions. E.g.
newtype F = F (Int, Maybe String) deriving Show
That is, "F" followed by a number and an optional part that is either letters, parenthesised letters, or a dash followed by letters/digits. Since the number after "F" can have multiple digits, I assume that the optional letters/digits may be multiple, too.
Since the examples are limited, I assume that the following aren't valid: F1a(b), F1(a)b, F1a-5, F1(a)-A, F1a(a)-5, F1a1, F1-(a), etc. and that the following are valid: F1A, F1abc, F1(abc), F1-abc, F1-a1b2. This is probably not true. [1]
I would then proceed to write parsers for each of these sub-parts and compose them:
module Main where
import Text.Parsec
import Data.Maybe (catMaybes)
symbol :: String -> Parser String
symbol s = string s <* spaces
parens :: Parser a -> Parser a
parens = between (string "(") (string ")")
digits :: Parser Int
digits = read <$> many1 digit
parseF :: Parser F
parseF = curry F <$> firstPart <*> secondPart
where
firstPart :: Parser Int
firstPart = symbol "F" >> digits
secondPart :: Parser (Maybe String)
secondPart = optionMaybe $ choice
[ many1 letter
, parens (many1 letter)
, string "-" >> many1 alphaNum
]
(As Jon Purdy writes in a comment,) using this parser on a string to get multiple matches,
extract :: Parser a -> Parser [a]
extract p = do (:) <$> try p <*> extract p
<|> do anyChar >> extract p
<|> do eof >> return []
readFs :: String -> Either ParseError [F]
readFs s = parse (extract parseF) "" s
main :: IO ()
main = print (readFs "F12, F 12, F1(a), F2a, F5-A, F34-5")
This prints:
Right [F (12,Nothing),F (12,Nothing),F (1,Just "a"),F (2,Just "a"),F (5,Just "A"),F (34,Just "5")]
Takeaways:
You can parse optional whitespace using token parsing (symbol).
You can parse optional parts with option, optionMaybe or optional.
You can alternate between combinators using a <|> b <|> c or choice [a, b, c].
When alternating between choices, make sure they don't have overlapping FIRST sets. Otherwise you need to try; this is nasty but sometimes unavoidable. (In this case, FIRST sets for the three choices are letter, string "(" and string "-", i.e. not overlapping.)
[1]: For the sake of restriction, I kept to the assumptions above, but I felt that I could also have assumed that F1a-B, F1(a)-5 and F1(a)-5A are valid, in which case I might change the model to:
newtype F = F (Int, Maybe String, Maybe String)
We can get sub-strings of specific pattern in a string with the
findAll
combinator from
replace-megaparsec.
Notice that this tuhao parser doesn't actually return anything. The findAll combinator just checks for success of the parser to find sub-strings which match the pattern.
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Data.Maybe
import Data.Either
let tuhao :: Parsec Void String ()
tuhao = do
void $ single 'F'
void $ space
void $ digitChar
void $ many $ oneOf "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ()-"
input = "I want to choose F12 or F 12 from F1(a), F2a, F5-A, F34-5 and so on, but F alone should not be chosen, that is, choose those which start with F followed by a digit (before the digit there could be zero or more than one space) and then by any character from ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'] ++ ['(',')',\"-\"]."
rights $ fromJust $ parseMaybe (findAll tuhao) input
["F12","F 12","F1(a)","F2a","F5-A","F34-5"]

Cutting specific chunks from a Haskell String

I'm trying to cut chunks from a list, with a given predicate. I would have preferred to use a double character, e.g. ~/, but have resolved to just using $. What I essentially want to do is this...
A: "Hello, my $name is$ Danny and I $like$ Haskell"
What I want to turn this into is this:
B: "Hello, my Danny and I Haskell"
So I want to strip everything in between the given symbol, $, or my first preference was ~/, if I can figure it out. What I tried was this:
s1 :: String -> String
s1 xs = takeWhile (/= '$') xs
s2 :: String -> String
s2 xs = dropWhile (/= '$') xs
s3 :: String -> String
s3 xs = s3 $ s2 $ s1 xs
This solution seems to just bug my IDE out (possibly infinite looping).
Solution:
s3 :: String -> String
s3 xs
|'$' `notElem` xs = xs
|otherwise = takeWhile (/= '$') xs ++ (s3 $ s1 xs)
s1 :: String -> String
s1 xs = drop 1 $ dropWhile (/= '$') $ tail $ snd $ break ('$'==) xs
This seems like a nice application for parsers. A solution using trifecta:
import Control.Applicative
import Data.Foldable
import Data.Functor
import Text.Trifecta
input :: String
input = "Hello, my $name is$ Danny and I $like$ Haskell"
cutChunk :: CharParsing f => f String
cutChunk = "" <$ (char '$' *> many (notChar '$') <* char '$')
cutChunk matches $, followed by 0 or more (many) non-$ characters, then another $. Then we use ("" <$) to make this parser's value always be the empty string, thus discarding all the characters that this parser matches.
includeChunk :: CharParsing f => f String
includeChunk = some (notChar '$')
includeChunk matches the text that we want to include in the result, which is anything that's not the $ character. It's important that we use some (matching one or more characters) and not many (matching zero or more characters) because we're going to include this parser within another many expression next; if this parser matched on the empty string, then that could loop infinitely.
chunks :: CharParsing f => f String
chunks = fold <$> many (cutChunk <|> includeChunk)
chunks is the parser for everything. Read <|> as "or", as in "parse either a cutChunk or an includeChunk". many (cutChunk <|> includeChunk) is a parser that produces a list of chunks e.g. Success ["Hello, my ",""," Danny and I ",""," Haskell"], so we fold the output to concatenate those chunks together into a single string.
result :: Result String
result = parseString chunks mempty input
The result:
Success "Hello, my Danny and I Haskell"
Your infinite loop comes from calling s3 recursively with no base case:
s3 :: String -> String
s3 xs = s3 $ s2 $ s1 xs
Adding a base case corrects the infinite loop:
s3 xs
| '$' `notElem` xs = xs
| otherwise = ...
This is not the whole answer. Think about what s1 actually does and where you use its return value:
s1 "hello $my name is$ ThreeFx" == "hello "
For further reference, see the break function:
break :: (a -> Bool) -> [a] -> ([a], [a])
I think your logic is wrong, perhaps easier to write it in an elementary way
Prelude> let pr xs = go xs True
Prelude| where go [] _ = []
Prelude| go (x:xs) f | x=='$' = go xs (not f)
Prelude| | f = x : go xs f
Prelude| | otherwise = go xs f
Prelude|
Prelude> pr "Hello, my $name is$ Danny and I $like$ Haskell"
"Hello, my Danny and I Haskell"
Explanation The flag f keeps track of the state (either pass mode or not). If the current char is a token skip and switch state.

Haskell calculator - order of operations

I am very new to Haskell and I need to make a working calculator what will give answers to expressions like: 2+3*(5+12)
I have something that manages to calculate more or less but I have a problem with order of operations. I have no idea how to do it. Here is my code:
import Text.Regex.Posix
import Data.Maybe
oblicz :: String -> Double
oblicz str = eval (Nothing, None) $ map convertToExpression $ ( tokenize str )
eval :: (Maybe Double,Expression)->[Expression]->Double
eval (Nothing, _) ((Variable v):reszta) = eval (Just v, None) reszta
eval (Just aktualnyWynik, None) ((Operator o):reszta) = eval ((Just aktualnyWynik), (Operator o)) reszta
eval (Just aktualnyWynik, (Operator o)) ((Variable v):reszta) = eval (Just $ o aktualnyWynik v , None) reszta
eval (aktualnyWynik, operator) (LeftParenthesis:reszta)
= eval (aktualnyWynik, operator) ((Variable (eval (Nothing, None) reszta)):(getPartAfterParentheses reszta))
eval (Just aktualnyWynik, _) [] = aktualnyWynik
eval (Just aktualnyWynik, _) (RightParenthesis:_) = aktualnyWynik
data Expression = Operator (Double->Double->Double)
| Variable Double
| LeftParenthesis
| RightParenthesis
| None
tokenize :: String -> [String]
tokenize expression = getAllTextMatches(expression =~ "([0-9]+|\\(|\\)|\\+|-|%|/|\\*)" :: AllTextMatches [] String)
convertToExpression :: String -> Expression
convertToExpression "-" = Operator (-)
convertToExpression "+" = Operator (+)
convertToExpression "*" = Operator (*)
convertToExpression "/" = Operator (/)
convertToExpression "(" = LeftParenthesis
convertToExpression ")" = RightParenthesis
convertToExpression variable = Variable (read variable)
getPartAfterParentheses :: [Expression] -> [Expression]
getPartAfterParentheses [] = []
getPartAfterParentheses (RightParenthesis:expressionsList) = expressionsList
getPartAfterParentheses (LeftParenthesis:expressionsList) = getPartAfterParentheses (getPartAfterParentheses expressionsList)
getPartAfterParentheses (expression:expressionsList) = getPartAfterParentheses expressionsList
I thought maybe I could create two stacks - one with numbers and one with operators. While reading the expression, I could push numbers on one stack and operators on another. When it is an operator I would check if there is something already on the stack and if there is check if I should pop it from the stack and do the math or not - just like in onp notation.
Unfortunately, as I said, I am VERY new to haskell and have no clue how to go about writing this.
Any hints or help would be nice :)
Pushing things on different stacks sure feels very much a prcedural thing to do, and that's generally not nice in Haskell. (Stacks can be realised as lists, which works quite nice in a purely functional fashion. Even real mutable state can be fine if only as an optimisation, but if more than one object needs to be modified at a time then this isn't exactly enjoyable.)
The preferrable way would be to build up a tree representing the expression.
type DInfix = Double -> Double -> Double -- for readability's sake
data ExprTree = Op DInfix ExprTree ExprTree
| Value Double
Evaluating this tree is basically evalTree (Op c t1 t2) = c (evalTree t1) (evalTree t2), i.e. ExprTree->Double right away.
To build the tree up, the crucial point: get the operator fixities right. Different operators have different fixity. I'd put that information in the Operator field:
type Fixity = Int
data Expression = Operator (Double->Double->Double) Fixity
| ...
which then requires e.g.
...
convertToExpression "+" = Operator (+) 6
convertToExpression "*" = Operator (*) 7
...
(Those are the fixities that Haskell itself has for the operators. You can :i + in GHCi to see them.)
Then you'd build the tree.
toExprTree :: [Expression] -> ExprTree
Obvious base case:
toExprTree [Variable v] = Value v
You might continue with
toExprTree (Variable v : Operator c _ : exprs) = Op c (Value v) (toExprTree exprs)
But that's actually not right: for e.g. 4 * 3 + 2 it would give 4 * (3 + 2). We actually need to bring the 4 * down the remaining expressions tree, as deep as the fixities are lower. So the tree needs to know about that as well
data ExprTree = Op DInfix Fixity ExprTree ExprTree
| Value Double
mergeOpL :: Double -> DInfix -> Fixity -> ExprTree -> ExprTree
mergeOpL v c f t#(Op c' f' t' t'')
| c > c' = Op c' f' (mergeOpL v c f t') t''
mergeOpL v c f t = Op c f (Value v) t
What remains to be done is handling parentheses. You'd need to take a whole matching-brackets expression and assign it a tree-fixity of, say tight = 100 :: Fixity.
As a note: such a tokenisation - manual parsing workflow is pretty cumbersome, regardless how nicely functional you do it. Haskell has powerful parser-combinator libraries like parsec, which take most of the work and bookkeeping off you.
What you need to solve this problem is the Shunting-yard Algorithm of Edsger Dijstra as described at http://www.wcipeg.com/wiki/Shunting_yard_algorithm. You can see my implementation at the bottom of this file.
If you are limiting your self to just +,-,*,/ you can also solve the problem using the usual trick in most intro to compiler examples simply parsing into two different non-terminals, ofter called term and product to build the correct tree. This get unwieldy if you have to deal with a lot of operators or they are user defined.

How to make the Parsec chainl1 function follow operator precedence rules

I'm programming a standard math notation -> DC POSIX-compliant format converter. It takes the input string, parses it into an intermediary data type and then turns it into the output string by showing it.
This is the Data type used. I have no problems with the Data type -> Output String conversion, it works flawlessly:
data Expression = Expression :+ Expression
| Expression :- Expression
| Expression :* Expression
| Expression :/ Expression
| Expression :^ Expression
| Cons String
infixr 0 :+
infixr 0 :-
infixr 1 :*
infixr 1 :/
infixr 2 :^
instance Show Expression where
show (x :+ y) = unwords [show x, show y, "+"]
show (x :- y) = unwords [show x, show y, "-"]
show (x :* y) = unwords [show x, show y, "*"]
show (x :/ y) = unwords [show x, show y, "/"]
show (x :^ y) = unwords [show x, show y, "^"]
show (Cons y) = y
The Parsec parser part, however, refuses to comply with the defined operator precedency rules. Clearly because of the way chainl1 is used in the subexpression parser definition:
expression :: Parser Expression
expression = do
spaces
x <- subexpression
spaces >> eof >> return x
subexpression :: Parser Expression
subexpression = (
(bracketed subexpression) <|>
constant
) `chainl1` (
try addition <|>
try substraction <|>
try multiplication <|>
try division <|>
try exponentiation
)
addition = operator '+' (:+)
substraction = operator '-' (:-)
multiplication = operator '*' (:*)
division = operator '/' (:/)
exponentiation = operator '^' (:^)
operator :: Char -> (a -> a -> a) -> Parser (a -> a -> a)
operator c op = do
spaces >> char c >> spaces
return op
bracketed :: Parser a -> Parser a
bracketed parser = do
char '('
x <- parser
char ')'
return x
constant :: Parser Expression
constant = do
parity <- optionMaybe $ oneOf "-+"
constant <- many1 (digit <|> char '.')
return (if parity == Just '-'
then (Cons $ '_':constant)
else Cons constant)
Is there a way of making the parser take into account the operator precedence rules without having to rewrite the entirety of my code?
Well, you don't need to rewrite your entire code, but since your subexpression parser doesn't take precedence into account at all, you have to rewrite that - substantially.
One possibility is to build it from parsers for subexpressions with top-level operators with the same precedence,
atom :: Parser Expression
atom = bracketed subexpression <|> constant
-- highest precedence operator is exponentiation, usually that's
-- right-associative, hence I use chainr1 here
powers :: Parser Expression
powers = atom `chainr1` try exponentiation
-- a multiplicative expression is a product or quotient of powers,
-- left-associative
multis :: Parser Expression
multis = powers `chainl1` (try multiplication <|> try division)
-- a subexpression is a sum (or difference) of multiplicative expressions
subexpression :: Parser Expression
subexpression = multis `chainl1` (try addition <|> try substraction)
Another option is to let the precedence and associativities be taken care of by the library and use Text.Parsec.Expr, namely buildExpressionParser:
table = [ [binary "^" (:^) AssocRight]
, [binary "*" (:*) AssocLeft, binary "/" (:/) AssocLeft]
, [binary "+" (:+) AssocLeft, binary "-" (:-) AssocLeft]
]
binary name fun assoc = Infix (do{ string name; spaces; return fun }) assoc
subexpression = buildExpressionParser table atom
(which requires that bracketed parser and constant consume the spaces after the used tokens).

Is this idiomatic use of Text.Parsec?

My use of Text.Parsec is a little rusty. If I just want to return the matched string is this idiomatic?
category :: Stream s m Char => ParsecT s u m [Char]
category = concat <$> (many1 $ (:) <$> char '/' <*> (many1 $ noneOf "/\n"))
I feel like there might be an existing operator for liftM concat . many1 or (:) <$> p1 <*> p2 that I'm ignoring, but I'm not sure.
That's fine, I think. A little judicious naming would make it prettier:
category = concat <$> many1 segment
where
segment = (:) <$> char '/' <*> many1 (noneOf "/\n")
I think it would be slightly more idiomatic use of Parsec to return something more structured, for example, the list of strings:
catList :: Parser [String]
catList = char '/' *> many1 alphaNum `sepBy1` char '/'
I don't think there's a combinator like the one you were wondering there was, but this is Haskell, and roll-your-own-control-structure-or-combinator is always available:
concatMany1 :: Parser [a] -> Parser [a]
concatMany1 p = concat <$> many1 p
catConcat = concatMany1 $ (:) <$> char '/' <*> many1 alphaNum
But this next combinator is even nicer, and definitely idiomatic Haskell at least:
infixr 5 <:>
(<:>) :: Applicative f => f a -> f [a] -> f [a]
hd <:> tl = (:) <$> hd <*> tl
So now we can write
catCons :: Parser String
catCons = concatMany1 (char '/' <:> many1 alphaNum)
but incidentally also
contrivedExample :: IO String
contrivedExample = getChar <:> getLine
moreContrived :: String -> Maybe String
moreContrived name = find isLetter name <:> lookup name symbolTable
noneOf
You'll notice I've used alphaNum where you used noneOf "/\n". I think noneOf is not good practice; parsers should be really careful to accept onlt the right thing. Are you absolutely sure you want your parser to accept /qwerty/12345/!"£$%^&*()#:?><.,#{}[] \/ "/" /-=_+~? Should it really be happy with /usr\local\bin?
As it stands, your parser accepts any string as long as it starts with / and ends before \n with something that's not /. I think you should rewrite it with alphaNum <|> oneOf "_-.',~+" or similar instead of using noneOf. Using noneOf allows you to avoid thinking about what you should allow and focus on getting positive examples to parse instead of only positive examples to parse.
Parser
I've also always gone for Parser a instead of Stream s m t => ParsecT s u m a. That's just lazy typing, but let's pretend I did it to make it clearer what my code was doing, shall we? :) Use what type signature suits you, of course.

Resources