How to simulate groups/named groups with the parser? - haskell

I am using the regex-applicative library. I want to extract the file name from a Content-Disposition HTTP header that looks like:
attachment; filename="this is file name .ext"
attachment;filename=fname.ext
and similar...
It seems that the function getFile matches such fragments:
import Text.Regex.Applicative
...
getFile :: String -> Maybe (String, String, String) -- prefix, RESULT, suffix
getFile hdr =
  parse
  where
    unquotedName = many $ psym (/= ' ')
    quotedName = "\"" <> many (psym (/= '"')) <> "\""
    name = "filename" <> "=" <> (quotedName <|> unquotedName)
    parse = findFirstInfix name hdr
But how do I extract the name of the file? In a standard regexp we can use groups or named groups, as in filename=([^ ]+), so the name ends up in the first group. How do I do the same with my code above?
I tried to add something like:
newtype FN = FN String deriving Show
...
... (FN <$> many (psym (/='"')) ...
but it seems I am doing it wrong.
EDIT:
I am not sure whether this is the most convenient way to do it:
data FN = FN String | N deriving Show
instance Semigroup FN where
  N <> a = a
  a <> _ = a
getFilename1 hdr =
  parse
  where
    unquotedName = FN <$> (many $ psym (/= ' '))
    quotedName = (N <$ "\"") <> (FN <$> many (psym (/= '"'))) <> (N <$ "\"")
    name = (N <$ ("filename" <> "=")) <> (quotedName <|> unquotedName)
    parse = findFirstInfix name hdr
EDIT:
PS. Instead of FN, Maybe (First a) could of course be used.

Use *> and <* instead of <> to drop the results of the irrelevant parts. For multiple groups, you can also use <$> and <*>. Read about parser combinators to learn more about this.
getFilename1 hdr = findFirstInfix name hdr
  where
    unquotedName = many $ psym (/= ' ')
    quotedName = "\"" *> many (psym (/= '"')) <* "\""
    name = "filename" *> "=" *> (quotedName <|> unquotedName)
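The remark about multiple groups can be made concrete with a small sketch. It assumes the regex-applicative package is installed; the Disposition record and the names dispType, fileName and disposition are hypothetical and only serve the illustration. Each parser joined with <*> contributes a field ("group"), while parts joined with <* or *> are matched and then dropped.

```haskell
import Control.Applicative (many, some, (<|>))
import Text.Regex.Applicative (RE, match, psym, string, sym)

-- Hypothetical record: one field per "group" we want to capture.
data Disposition = Disposition
  { dispType :: String  -- e.g. "attachment"
  , fileName :: String  -- the captured file name, without quotes
  } deriving (Eq, Show)

-- Parts combined with <* are matched but their results are dropped;
-- parts combined with <$> / <*> become fields of the result.
disposition :: RE Char Disposition
disposition =
  Disposition
    <$> some (psym (`notElem` "; "))
    <*  sym ';'
    <*  many (sym ' ')
    <*  string "filename="
    <*> (quoted <|> unquoted)
  where
    quoted   = sym '"' *> many (psym (/= '"')) <* sym '"'
    unquoted = some (psym (`notElem` " \""))
```

Here match requires the whole string to match; the same disposition value can be passed to findFirstInfix when the header is embedded in a longer string.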

Related

Haskell: successive modifications of a text

I want to know how to make modifications to a text that is full of special characters and codes and replace those codes with strings.
I have the following text:
text=
"#chomsky/syntactic structures/chomskySynt/: published in 1957. #bloomfield/language/bloomfieldLan/: published in 1933. #chomsky/aspects of a theory of syntax/chomskyAsp/: published in 1965. ... #see/chomskySynt/ is considered the starting point of generative linguistics.... Another hypothesis was introduced in #see/chomskyAsp/."
I want to turn it into=
"Chomsky 1: Syntactic structures : published in 1957. Bloomfield 1: Language : published in 1933. Chomsky 2: Aspects of a theory of syntax : published in 1965. ... Chomsky 1 is considered the starting point of generative linguistics ... Another hypothesis was introduced in Chomsky 2..."
Explanation of the special characters and codes: the information on a book starts with # followed by the name of the author (chomsky for example) followed by / then title of the book / then the special code for the book (chomskyAsp) then /
The citation of a book starts with #see followed by / then the special code of the book (ex. chomskySyn) /
The modifications are:
To count how many times an author is cited and concatenate the number to the name: Chomsky 1, for example.
Author name will start with a capital letter
Remove the special code : chomskySynt which serves only as an identification code.
Replace the reference : #see/chomskyAsp with the Chomsky 2. That is replace the reference with the actual author and number.
Here is my code:
removeSlash = myReplace "/" " " text
removeDash = map lines $ (filter (any isLetter) . groupBy ((==) `on` (== '#'))) $ removeSlash
flattenList = concat removeDash
splitIntoWords = map words flattenList
And here is the myReplace function:
myReplace _ _ [] = []
myReplace a b s@(x:xs) = if isPrefixOf a s
  then b ++ myReplace a b (drop (length a) s)
  else x : myReplace a b xs
Here is the result so far:
[["chomsky syntactic structures chomskySynt published in 1957. "], ["bloomfield language bloomfieldLan published in 1933. "],["chomsky aspects of a theory of syntax chomskyAsp published in 1965. ... "],["see chomskySynt is considered the starting point of generative linguistics.... Another hypothesis was introduced in "],["see chomskyAsp"]]
The reason I flattened the list and split it into words is now if I do:
map head splitIntoWords
I get ["chomsky","bloomfield","chomsky","see","see"]
I am stuck at this stage. How do I count how many times an author is cited and concatenate the number to the name. I thought of using the zip function:
zipChomsky = zip [1, 2] [x | x <- diviser, (head x) == "chomsky"]
This gives:
[(1,["chomsky","syntactic","structures","chomskySynt","published","in","1957."]),(2,["chomsky","aspects","of","a","theory","of","syntax","chomskyAsp","published","in","1965.","..."])]
But the result is very different from: Chomsky 1: ...
EDIT: I didn't mean to make the answer this long, but the problem turned out to be a non-trivial task, and I'm not quite sure how much detail I should put in the answer. In case you understand all the tools I'm using, the full code is at the end of this answer.
In your case, you'll need:
an approach to parse your input document
a suitable data structure to store the input information
a way to display the data in the output format
For the parsing part, regular expressions might be enough, but I think the Parsec library is a better choice. For detailed usage of Parsec please refer to its documentation; I'll only try to show how to use it in your case:
First, import Text.ParserCombinators.Parsec.
A document is a list of
a literal string
a definition, with format #<Author>/<Title>/<Code>/, as in "#chomsky/syntactic structures/chomskySynt/"
a citation, with format #see/<Code>/, as in "#see/chomskyAsp/"
Hence we define
data Index = Index {
  getAuthor :: String,
  getTitle :: String,
  getSpecialCode :: String,
  getAuthorCount :: Int
  -- For counting author later.
  } deriving (Show)
data Content = Def Index
             | Cite String Index
             -- We'll fill in Index later.
             | Literal String
             deriving (Show)
and our input document will just be turned into [Content].
Correspondingly, we'll use the following function (actually, parser) to parse the input:
document = many (try cite <|> try def <|> literal)
literal = Literal <$> many1 (noneOf "#")
def = do
  char '#'
  author <- many1 $ noneOf "/"
  char '/'
  title <- many1 $ noneOf "/"
  char '/'
  code <- many1 $ noneOf "/"
  char '/'
  -- The author count starts at 0 and is filled in later by the fold.
  return $ Def $ Index author title code 0
cite = do
  try $ string "#see/"
  code <- many1 $ noneOf "/"
  char '/'
  return $ Cite code nullIndex
A short explanation:
A document is many (cite or def or literal), with the operator <|> combining parsers. cite is tried before def because a citation such as "#see/chomskySynt/", followed by text that contains another slash later on, would otherwise also parse as a definition.
A literal is a string that stops at '#' and has at least one character (hence many1); a parser inside many should not accept empty input, think of why!
A def is #<Author>/<Title>/<Code>/, and we can write it in do-notation since Parser is a monad.
A cite goes similarly.
def, cite, and string "#see/" each parse multiple characters, so they may fail after having consumed some input; therefore we wrap them in the try combinator, which backtracks on failure.
By the way, nullIndex is just a placeholder before we actually fill this record:
nullIndex :: Index
nullIndex = Index "" "" "" 0
Now we only need a function with signature [Content] -> String.
We can start with capitalizing the author name:
capitalizeAuthor :: Content -> Content
capitalizeAuthor (Def x) = Def (x {getAuthor = author'}) where
  author' = toUpper (head author) : tail author
  author = getAuthor x
capitalizeAuthor y = y
The other tasks are not local: they depend on the relation between Contents, so we will use a foldl across the list.
Define
import Data.Map.Strict ((!), (!?))
import qualified Data.Map.Strict as M
type CodeDict = M.Map String Index
-- Map Code Index
type AuthorDict = M.Map String Int
-- Map Author Count
type Fold = (CodeDict, AuthorDict, [Content])
emptyFold :: Fold
emptyFold = (M.empty, M.empty, [])
The Fold type will store the state when we modify along the original [Content].
(I realize that the code will be much clearer if I use the State monad, but I'm not sure if I need to explain it then ...)
In addition, a folding function for foldl
accum :: Fold -> Content -> Fold
accum (c,a,ls) (Def x) = (c',a',Def x':ls) where
  a' = M.insertWith (+) author 1 a
  c' = M.insert code x' c
  x' = x {getAuthorCount = count}
  count = maybe 1 (+1) $ a !? author
  author = getAuthor x
  code = getSpecialCode x
accum (c,a,ls) (Cite code _) = (c,a,Cite code (c ! code) : ls)
accum (c,a,ls) y = (c,a,y:ls)
After the foldl, the resulting list will contain the contents with
getAuthorCount correctly filled in
Cites filled in with the Index of the corresponding Def, since they share its output format.
The resulting list is reversed, so you'll need to reverse it (reverse is in the Prelude and is also exported by Data.List).
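As a side note, the state threading that the fold does by hand is what Data.List.mapAccumL provides, and it returns the results in the original order, so no reverse is needed. The following is only a sketch of the author-counting step in isolation; numberAuthors is a hypothetical helper and not part of the code in this answer.

```haskell
import Data.List (mapAccumL)
import qualified Data.Map.Strict as M

-- Hypothetical helper: pair every author with its running
-- occurrence number (1 for the first occurrence, 2 for the next, ...).
numberAuthors :: [String] -> [(String, Int)]
numberAuthors = snd . mapAccumL step M.empty
  where
    -- 'seen' maps each author to how often it has occurred so far.
    step seen author =
      let n = M.findWithDefault 0 author seen + 1
      in (M.insert author n seen, (author, n))
```

For the authors in the question, numberAuthors ["chomsky", "bloomfield", "chomsky"] gives [("chomsky",1),("bloomfield",1),("chomsky",2)].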
Finally, you can define your own Show instances for Index and Content, in place of the derived ones above. For example,
instance Show Index where
  show x = getAuthor x ++ " "
    ++ show (getAuthorCount x) ++ ": "
    ++ getTitle x ++ " "
instance Show Content where
  show (Def idx) = show idx
  show (Cite x idx) = getAuthor idx ++ " "
    ++ show (getAuthorCount idx)
  show (Literal x) = x
as I figured out from your output sample.
The full length code:
import Data.Char
import Data.List (reverse)
import Data.Map.Strict ((!),(!?))
import qualified Data.Map.Strict as M
import Text.ParserCombinators.Parsec
data Index = Index {
  getAuthor :: String,
  getTitle :: String,
  getSpecialCode :: String,
  getAuthorCount :: Int
  -- For counting author later.
  }
nullIndex :: Index
nullIndex = Index "" "" "" 0
instance Show Index where
  show x = getAuthor x ++ " "
    ++ show (getAuthorCount x) ++ ": "
    ++ getTitle x ++ " "
data Content = Def Index
             | Cite String Index
             | Literal String
instance Show Content where
  show (Def idx) = show idx
  show (Cite x idx) = getAuthor idx ++ " "
    ++ show (getAuthorCount idx)
  show (Literal x) = x
document = many (try cite <|> try def <|> literal)
literal = Literal <$> many1 (noneOf "#")
def = do
  char '#'
  author <- many1 $ noneOf "/"
  char '/'
  title <- many1 $ noneOf "/"
  char '/'
  code <- many1 $ noneOf "/"
  char '/'
  return $ Def $ Index author title code 0
cite = do
  try $ string "#see/"
  code <- many1 $ noneOf "/"
  char '/'
  return $ Cite code nullIndex
capitalizeAuthor :: Content -> Content
capitalizeAuthor (Def x) = Def (x {getAuthor = author'}) where
  author' = toUpper (head author) : tail author
  author = getAuthor x
capitalizeAuthor y = y
type CodeDict = M.Map String Index
-- Map Code Index
type AuthorDict = M.Map String Int
-- Map Author Count
type Fold = (CodeDict, AuthorDict, [Content])
emptyFold :: Fold
emptyFold = (M.empty, M.empty, [])
accum :: Fold -> Content -> Fold
accum (c,a,ls) (Def x) = (c',a',Def x':ls) where
  a' = M.insertWith (+) author 1 a
  c' = M.insert code x' c
  x' = x {getAuthorCount = count}
  count = maybe 1 (+1) $ a !? author
  author = getAuthor x
  code = getSpecialCode x
accum (c,a,ls) (Cite code _) = (c,a,Cite code (c ! code) : ls)
accum (c,a,ls) y = (c,a,y:ls)
main :: IO ()
main = do
  line <- getLine
  let parsed = parse document "" line
  case parsed of
    Left x -> print x
    Right cs -> do
      let cs1 = map capitalizeAuthor cs
      let (_,_,cs2) = foldl accum emptyFold cs1
      let output = concatMap show $ reverse cs2
      putStrLn output

Parser for JSON String

I'm trying to write a parser for a JSON String.
A valid example, per my parser, would be: "\"foobar\"" or "\"foo\"bar\"".
Here's what I attempted, but it does not terminate:
parseEscapedQuotes :: Parser String
parseEscapedQuotes = Parser f
  where
    f ('"':xs) = Just ("\"", xs)
    f _ = Nothing
parseStringJValue :: Parser JValue
parseStringJValue = (\x -> S (concat x)) <$>
  ((char '"') *>
    (zeroOrMore (alt parseEscapedQuotes (oneOrMore (notChar '"'))))
    <* (char '"'))
My reasoning is that I can have a repetition of either escaped quotes "\"" or characters not equal to ".
But it's not working as I expected:
ghci> runParser parseStringJValue "\"foobar\""
Nothing
I don't know what parser combinator library you are using, but here is a working example using Parsec. I'm using monadic style to make it clearer what's going on, but it is easily translated to applicative style.
import Text.Parsec
import Text.Parsec.String
jchar :: Parser Char
jchar = escaped <|> anyChar
escaped :: Parser Char
escaped = do
  char '\\'
  c <- oneOf ['"', '\\', 'r', 't' ] -- etc.
  return $ case c of
    'r' -> '\r'
    't' -> '\t'
    _ -> c
jstringLiteral :: Parser String
jstringLiteral = do
  char '"'
  cs <- manyTill jchar (char '"')
  return cs
test1 = parse jstringLiteral "" "\"This is a test\""
test2 = parse jstringLiteral "" "\"This is an embedded quote: \\\" after quote\""
test3 = parse jstringLiteral "" "\"Embedded return: \\r\""
Note the extra level of backslashes needed to represent parser input as Haskell string literals. Reading the input from a file would make creating the parser input more convenient.
The definition of the manyTill combinator is:
manyTill p end = scan
  where
    scan = do{ end; return [] }
         <|>
           do{ x <- p; xs <- scan; return (x:xs) }
and this might help you figure out why your definitions aren't working.
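For reference, here is one way the monadic version above might be translated to applicative style. It is a sketch that assumes Parsec, as in this answer; unescape and parseJString are hypothetical helper names introduced for the example. The key point is unchanged: the escape parser has to consume the backslash and the character after it, so that an escaped quote is never mistaken for the closing quote.

```haskell
import Text.Parsec (anyChar, char, manyTill, oneOf, parse, (<|>))
import Text.Parsec.String (Parser)

-- Decode the character that follows a backslash
-- (same small subset of escapes as in the monadic version).
unescape :: Char -> Char
unescape 'r' = '\r'
unescape 't' = '\t'
unescape c   = c

-- A backslash followed by one of the supported escape characters.
escaped :: Parser Char
escaped = unescape <$> (char '\\' *> oneOf "\"\\rt")

jchar :: Parser Char
jchar = escaped <|> anyChar

-- Opening quote, then characters until an unescaped closing quote.
jstringLiteral :: Parser String
jstringLiteral = char '"' *> manyTill jchar (char '"')

-- Hypothetical convenience wrapper: Nothing on a parse error.
parseJString :: String -> Maybe String
parseJString = either (const Nothing) Just . parse jstringLiteral ""
```

With this, parseJString "\"quote: \\\" end\"" yields Just "quote: \" end", while an unterminated literal yields Nothing.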

Haskell - reading from file, parse expressions, compute and return result

I'm quite new to Haskell and I want to work a bit with parsers.
I'm trying to create a small program that will read an expression from a text.txt file and will return the result.
First of all I read the text and convert what I read to a string:
module Main where
import Data.Char
convert :: String -> String
convert = unlines . (map convertLine) . lines
convertLine :: String -> String
convertLine = unwords . (map convertWord) . words
convertWord :: String -> String
convertWord s = (toUpper (head s)):(tail s)
main = do
  name <- readFile "test.txt"
  putStr $ convert name
Now, I want to parse that string and I start with:
parse :: (Read a, Num a, Fractional a) => String -> [SyntacticalElement a]
parse "" = []
parse putStr = element : (parse rest)
  where (element, rest) = next_elem putStr
Here I get the error: next_elem is not in scope
Any idea why?
Update:
module Main where
import Data.Char
convert :: String -> String
convert = unlines . (map convertLine) . lines
convertLine :: String -> String
convertLine = unwords . (map convertWord) . words
convertWord :: String -> String
convertWord s = (toUpper (head s)):(tail s)
main = do
  name <- readFile "test.txt"
  putStr $ convert name
number = ['0'..'9'] ++ ['.']
operator = ['+', '-', '*', '/']
open_brackets = ['(', '[']
close_brackets = [')', ']']
brackets = open_brackets ++ close_brackets
allowed_chars = number ++ operator ++ brackets
parse :: (Read a, Num a, Fractional a) => String -> [SyntacticalElement a]
parse "" = []
parse putStr = element : (parse rest)
  where (element, rest) = next_elem putStr
next_elem :: (Read a, Num a, Fractional a) => String -> (SyntacticalElement a, String)
next_elem s@(first:_)
  | is_open_bracket first = (to_sublist content, rest_b)
  | is_operator first = (to_operator operator, rest_o)
  | is_number first = (to_number number, rest_n)
  | is_close_bracket first = error "Unexpected closing bracket!"
  | otherwise = error $ "Invalid Expression: \"" ++ s ++ "\""
  where (number, rest_n) = span is_number s
        (operator, rest_o) = span is_operator s
        (content, rest_b) = parse_bracket s
Because the compiler doesn't know what next_elem means. Why would it? Where is it defined?
Btw for these situations it's often handy to use interact rather than bothering to read from a file.
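A minimal sketch of the interact suggestion, reusing the convert functions from the question. The name runConvert is hypothetical, and convertWord is written with pattern matching instead of head/tail so that it is total.

```haskell
import Data.Char (toUpper)

convert :: String -> String
convert = unlines . map convertLine . lines

convertLine :: String -> String
convertLine = unwords . map convertWord . words

-- Capitalize the first letter; the empty case keeps the function total.
convertWord :: String -> String
convertWord []     = []
convertWord (c:cs) = toUpper c : cs

-- interact reads all of stdin, applies the pure function,
-- and writes the result to stdout.
runConvert :: IO ()
runConvert = interact convert
```

With main = runConvert, the program can be fed a file from the shell, for example ./prog < test.txt, and the pure convert function stays easy to test on its own.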

Haskell parsec parsing to maybe

Just a simple question that I cannot solve.
I want to parse a string as either a String or a Maybe Double, where an empty string or an "n/a" is parsed as a Nothing. For example something like:
data Value = S String | N (Maybe Double)
value :: CharParser () Value
value = val <* spaces
  where val = N <$> v_number
          <|> S <$> v_string
          <|> N <$> v_nothing
I am having trouble with v_nothing (and also with leading and trailing whitespace).
Thanks.
EDIT:
v_number :: CharParser () (Maybe Double)
v_number = do s <- getInput
              case readSigned readFloat s of
                [(n, s')] -> Just n <$ setInput s'
                _ -> empty
v_string :: CharParser () String
v_string = (many1 jchar)
  where jchar = char '\\' *> (p_escape <|> p_unicode)
            <|> satisfy (`notElem` "\"\\")
I tried all sorts of things for v_nothing, to no avail.
Maybe something like this?
value = do skipMany space
           choice $ map try [
             do string "n/a" <|> (eof >> return [])
                return $ N Nothing,
             -- many1 rather than many: read would fail on an empty digit string
             do d <- many1 digit
                return $ N $ Just (read d)
             -- do ...
             ]
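In the applicative style of the question, the missing v_nothing could look roughly like the sketch below. It assumes Parsec; vNumber is a deliberately simplified stand-in for the question's v_number (unsigned digits with an optional fractional part), and the names vNothing, maybeDouble and parseMaybeDouble are hypothetical.

```haskell
import Data.Functor (void)
import Text.Parsec (char, digit, eof, many1, option, parse, spaces, string, try, (<|>))
import Text.Parsec.String (Parser)

-- "n/a" or end of input both mean "no value".
-- try makes a partial match of "n/a" backtrack instead of failing hard.
vNothing :: Parser (Maybe Double)
vNothing = Nothing <$ (void (try (string "n/a")) <|> eof)

-- Simplified number parser: digits with an optional fractional part.
vNumber :: Parser (Maybe Double)
vNumber = do
  whole <- many1 digit
  frac  <- option "" ((:) <$> char '.' <*> many1 digit)
  pure (Just (read (whole ++ frac)))

-- Skip leading and trailing whitespace around the value.
maybeDouble :: Parser (Maybe Double)
maybeDouble = spaces *> (vNumber <|> vNothing) <* spaces <* eof

-- Outer Maybe: did the parse succeed? Inner Maybe: was there a number?
parseMaybeDouble :: String -> Maybe (Maybe Double)
parseMaybeDouble = either (const Nothing) Just . parse maybeDouble ""
```

Trying the number first and the "nothing" case last keeps the empty-input alternative from shadowing the others.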

Parsing a string in haskell

I've got some strings that I want to parse into a list of "chunks". My strings look like this
"some text [[anchor]] some more text, [[another anchor]]. An isolated ["
And I expect to get back something like this
[
  TextChunk "some text ",
  Anchor "anchor",
  TextChunk " some more text, ",
  Anchor "another anchor",
  TextChunk ". An isolated ["
]
I've managed to write a function and types that do what I need, but they seem overly ugly.
Is there a nicer way to do this?
data Token = TextChunk String | Anchor String deriving (Show)
data TokenizerMode = EatString | EatAnchor deriving (Show)
tokenize :: [String] -> [Token]
tokenize xs =
  let (_,_,tokens) = tokenize' (EatString, unlines xs, [TextChunk ""])
  in reverse tokens
tokenize' :: (TokenizerMode, String, [Token]) -> (TokenizerMode, String,[Token])
-- If we're starting an anchor, add a new anchor and switch modes
tokenize' (EatString, '[':'[':xs, tokens) = tokenize' (EatAnchor, xs, (Anchor ""):tokens )
-- If we're ending an anchor, add a new text chunk and switch modes
tokenize' (EatAnchor, ']':']':xs, tokens) = tokenize' (EatString, xs, (TextChunk ""):tokens )
-- Otherwise if we've got stuff to consume append it
tokenize' (EatString, x:xs, (TextChunk t):tokens) = tokenize'( EatString, xs, (TextChunk (t++[x])):tokens)
tokenize' (EatAnchor, x:xs, (Anchor t):tokens) = tokenize'( EatAnchor, xs, (Anchor (t++[x])):tokens)
--If we've got nothing more to consume we're done.
tokenize' (EatString, [], tokens) = ( EatString, [], tokens)
--We'll only get here if we're given an invalid string
tokenize' xx = error ("Error parsing .. so far " ++ (show xx))
This should work, including lone brackets:
import Control.Applicative ((<$>), (<*), (*>))
import Text.Parsec
data Text = TextChunk String
          | Anchor String
          deriving Show
chunkChar = noneOf "[" <|> try (char '[' <* notFollowedBy (char '['))
chunk = TextChunk <$> many1 chunkChar
anchor = Anchor <$> (string "[[" *> many (noneOf "]") <* string "]]")
content = many (chunk <|> anchor)
parseS :: String -> Either ParseError [Text]
parseS input = parse content "" input
Note the use of try to allow backtracking when the chunkChar parser matches two opening brackets. Without try, the first bracket would have been consumed at that point.
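To see the effect of try concretely, the sketch below builds the same content parser twice, once with and once without try around the lone-bracket parser. It assumes Parsec; contentWith, loneBracketRaw, withTry, withoutTry and run are hypothetical names used only for this comparison.

```haskell
import Text.Parsec (char, many, many1, noneOf, notFollowedBy, parse, string, try, (<|>))
import Text.Parsec.String (Parser)

data Text = TextChunk String | Anchor String
  deriving (Eq, Show)

anchor :: Parser Text
anchor = Anchor <$> (string "[[" *> many (noneOf "]") <* string "]]")

-- Build the document parser from a given single-bracket parser.
contentWith :: Parser Char -> Parser [Text]
contentWith loneBracket = many (chunk <|> anchor)
  where
    chunk = TextChunk <$> many1 (noneOf "[" <|> loneBracket)

-- Matches '[' only when it is not followed by another '['.
-- On "[[" it fails AFTER consuming the first bracket.
loneBracketRaw :: Parser Char
loneBracketRaw = char '[' <* notFollowedBy (char '[')

withTry, withoutTry :: Parser [Text]
withTry    = contentWith (try loneBracketRaw)
withoutTry = contentWith loneBracketRaw

-- Hypothetical helper: Nothing on a parse error.
run :: Parser [Text] -> String -> Maybe [Text]
run p = either (const Nothing) Just . parse p ""
```

On the input "a [[b]] [", withTry produces a text chunk, an anchor and a final chunk containing the isolated bracket, whereas withoutTry fails on "a [[b]]" because the first '[' has already been consumed when notFollowedBy rejects the second one, so the anchor alternative is never tried.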
Here is a simplistic version using two mutually recursive functions.
module Tokens where
data Token = TextChunk String | Anchor String deriving (Show)
tokenize :: String -> [Token]
tokenize = textChunk emptyAcc
textChunk :: Acc -> String -> [Token]
textChunk acc [] = [TextChunk $ getAcc acc]
textChunk acc ('[':'[':ss) = TextChunk (getAcc acc) : anchor emptyAcc ss
textChunk acc (s:ss) = textChunk (snocAcc acc s) ss
anchor :: Acc -> String -> [Token]
anchor acc [] = error $ "Anchor not terminated"
anchor acc (']':']':ss) = Anchor (getAcc acc) : textChunk emptyAcc ss
anchor acc (s:ss) = anchor (snocAcc acc s) ss
-- This is a Hughes list (also called DList) which allows
-- efficient 'Snoc' (adding to the right end).
--
type Acc = String -> String
emptyAcc :: Acc
emptyAcc = id
snocAcc :: Acc -> Char -> Acc
snocAcc acc c = acc . (c:)
getAcc :: Acc -> String
getAcc acc = acc []
This version has a problem that it will generate empty TextChunks if the input starts or ends with an Anchor or if there are two contiguous anchors in the text.
It is straight-forward to add checks to not generate a TextChunk if the accumulator is empty but it makes the code about twice as long - maybe I would reach for Parsec after all...
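An alternative to adding checks inside the two functions is to leave them as they are and drop the empty chunks in a separate pass. This is only a sketch of that idea; dropEmptyChunks is a hypothetical helper, and the Token type is repeated (with an added Eq instance) so the snippet stands alone.

```haskell
data Token = TextChunk String | Anchor String
  deriving (Eq, Show)

-- Remove text chunks that carry no text; anchors are always kept,
-- even empty ones, since "[[]]" was written explicitly.
dropEmptyChunks :: [Token] -> [Token]
dropEmptyChunks = filter (not . isEmptyChunk)
  where
    isEmptyChunk (TextChunk "") = True
    isEmptyChunk _              = False
```

Composed as dropEmptyChunks . tokenize, this keeps the tokenizer short at the cost of one extra traversal of the token list.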
Solution using monadic Parsec.
import Text.ParserCombinators.Parsec
data Text = TextChunk String
          | Anchor String
          deriving Show
inputString = "some text [[anchor]] some more text, [[another anchor]]."
content :: GenParser Char st [Text]
content = do
  s1 <- many (noneOf "[")
  string "[["
  s2 <- many (noneOf "]")
  string "]]"
  s3 <- many (noneOf "[")
  string "[["
  s4 <- many (noneOf "]")
  string "]]."
  return $ [TextChunk s1, Anchor s2, TextChunk s3, Anchor s4]
parseS :: String -> Either ParseError [Text]
parseS input = parse content "" input
How it works:
> parseS inputString
Right [TextChunk "some text ",Anchor "anchor",TextChunk " some more text, ",Anchor "another anchor"]
it :: Either ParseError [Text]
