How can I process citations using Pandoc's Citeproc, in Haskell?

Starting with "A Simple Example" from the Pandoc documentation, I want to add citation processing. The docs for Text.Pandoc.Citeproc show a function processCitations, which supposedly processes citations. Yet given simple org-mode input and a citation [@test2022], it doesn't seem to work. It compiles and runs just fine, but the output of the code below is: <p><span class="spurious-link" target="url"><em>testing</em></span> [@test2022]</p>, i.e., the citation isn't actually processed. What am I doing wrong? And how can I get this to process my citation?
import Text.Pandoc
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Text.Pandoc.Filter
import Text.Pandoc.Citeproc

main :: IO ()
main = do
  result <- runIO $ do
    doc <- readOrg def (T.pack "#+bibliography: test.bib\n [[url][testing]]\n[@test2022]")
    processed <- processCitations doc
    writeHtml5String def processed
  html <- handleError result
  TIO.putStrLn html
For reference, here's my test.bib BibTeX file:
@Book{test2022,
  author = {Barus, Foobius},
  title = {The Very Persistent Foo or Bar},
  publisher = {Foobar Publications, Inc},
  year = {2022}
}

I figured this out myself, eventually. Turns out you have to set some extensions, and some options, and set the metadata for the document:
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Text.Pandoc.Filter
import Text.Pandoc.Citeproc
import qualified Data.Map as M
import Text.Pandoc.Builder (setMeta)

main :: IO ()
main = do
  -- Enable citation parsing in the reader (and the matching writer extension).
  let exts = extensionsFromList [ Ext_citations ]
  let readerOptions = def{ readerExtensions = exts}
  let writerOptions = def{ writerExtensions = exts}
  result <- runIO $ do
    doc <- readMarkdown readerOptions (T.pack "Testing testing\n[@test2022]\n")
    -- Point the document metadata at the bibliography file.
    let doc' = setMeta (T.pack "bibliography") (T.pack "test.bib") doc :: Pandoc
    processed <- processCitations doc'
    writeHtml5String writerOptions processed
  html <- handleError result
  TIO.putStrLn html

Related

What is the best way to get data from a URL and parse it in Haskell?

I'm having trouble parsing data from a URL.
My URL starts with "https://", so I think I should use Network.HTTP.Conduit.
But
simpleHttp url
returns an L.ByteString.
I really don't understand what I should do after that.
So far I have this code to get the data:
toStrict1 :: L.ByteString -> B.ByteString
toStrict1 = B.concat . L.toChunks

main :: IO ()
main = do
  lbs <- simpleHttp url
  let page = toStrict1 lbs
and an example of parsing:
  let lastModifiedDateTime = fromFooter $ parseTags doc
  putStrLn $ "wiki.haskell.org was last modified on " ++ lastModifiedDateTime
  where fromFooter = unwords . drop 6 . words . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")
How can I combine these two parts of code?
As you've seen, the simpleHttp function returns a lazy bytestring. There are several ways to deal with this in TagSoup.
First, it turns out that you can parse it directly. The function parseTags has signature:
parseTags :: StringLike str => str -> [Tag str]
meaning that it can parse any type str with a StringLike instance, and if you look at the Text.StringLike module documentation, you'll see that lazy ByteStrings have a StringLike instance.
However, if you go this route, you need to be aware that everything's kind of "trapped" in a ByteString world, so you have to write your code using versions of functions like words and unwords that are bytestring-compatible, and even your putStrLn needs an adapter. A full working example would look like this:
import Network.HTTP.Conduit
import Text.HTML.TagSoup
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as CL

main :: IO ()
main = do
  lbs <- simpleHttp "https://wiki.haskell.org"
  let lastModifiedDateTime = fromFooter $ parseTags lbs
  putStrLn $ "wiki.haskell.org was last modified on "
    ++ CL.unpack lastModifiedDateTime
  where fromFooter = CL.unwords . drop 6 . CL.words
          . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")
and it works fine:
> main
wiki.haskell.org was last modified on 9 September 2013, at 22:38.
>
The functions from Data.ByteString.Lazy.Char8 basically assume that the bytestring is ASCII-encoded, which is close enough for this example to work.
However, it would be more robust to decode the bytestring based on the proper character encoding to a valid text type. The two main text types in Haskell are the default String type, which is inefficient and slow, but easy to work with, and the Text type, which is highly efficient but a bit more complicated. (Like ByteString, you need to use Text-compatible versions of functions like words and so on.) Both String and Text have StringLike instances, so they both work fine with TagSoup.
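To make the ASCII assumption concrete, here is a minimal sketch that uses only the bytestring and text packages. The sample bytes are a made-up input (not taken from wiki.haskell.org): the two-byte UTF-8 encoding of a single accented letter.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as CL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Hypothetical sample: 0xC3 0xA9 is the UTF-8 encoding of U+00E9.
sample :: BL.ByteString
sample = BL.pack [0xC3, 0xA9]

main :: IO ()
main = do
  -- Char8 maps every byte to its own Char, so two Chars come back.
  print (length (CL.unpack sample))
  -- Decoding as UTF-8 recovers the single intended character.
  print (TL.length (TLE.decodeUtf8 sample))
```

This should print 2 and then 1: the Char8 view splits one character into two, while the decoded Text holds it as one.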
If we were going to write production-quality code, we'd actually consult the response headers from the HTTP request and/or check for a <meta> tag in the HTML to determine the real encoding. But, if we just assume the encoding is UTF-8 (which it is), the Text version looks like this:
import Network.HTTP.Conduit
import Text.HTML.TagSoup
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TL
import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
  lbs <- simpleHttp "https://wiki.haskell.org"
  let lastModifiedDateTime = fromFooter $ parseTags (TL.decodeUtf8 lbs)
  putStrLn $ "wiki.haskell.org was last modified on "
    ++ TL.unpack lastModifiedDateTime
  where fromFooter = TL.unwords . drop 6 . TL.words
          . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")
and a String version using Data.ByteString.Lazy.UTF8 from the utf8-string package looks like this:
import Network.HTTP.Conduit
import Text.HTML.TagSoup
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.UTF8 as BL

main :: IO ()
main = do
  lbs <- simpleHttp "https://wiki.haskell.org"
  let lastModifiedDateTime = fromFooter $ parseTags (BL.toString lbs)
  putStrLn $ "wiki.haskell.org was last modified on "
    ++ lastModifiedDateTime
  where fromFooter = unwords . drop 6 . words
          . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")

Heist not substituting templates

I have the following code, just copy-pasted and modernised (the original example does not compile with recent versions of Heist anymore) from here.
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString.Char8 as BS
import Data.Monoid
import Data.Maybe
import Data.List
import Control.Applicative
import Control.Lens
import Control.Monad.Trans
import Control.Monad.Trans.Either
import Heist
import Heist.Compiled
import Blaze.ByteString.Builder

conf :: HeistConfig IO
conf = set hcTemplateLocations [ loadTemplates "." ] $
       set hcInterpretedSplices defaultInterpretedSplices $
       emptyHeistConfig

runHeistConf :: Either [String] (HeistState IO) -> IO (HeistState IO)
runHeistConf (Right hs) = return hs
runHeistConf (Left msgs) = error . intercalate "\n" $ map ("[Heist error]: " ++) msgs

main :: IO ()
main = do
  heist <- id <$> (runEitherT $ initHeist conf) >>= runHeistConf
  output <- fst $ fromMaybe (error "xxx") $ renderTemplate heist "billy"
  BS.putStrLn . toByteString $ output
And the following template:
<!-- billy.tpl -->
<bind tag="wanted">Playstation 4</bind>
<bind tag="got">Monopoly board game</bind>
<apply template="letter">
<bind tag="kiddo">Billy</bind>
I regret to inform you the "<wanted />" you have requested is currently
unavailable. I have substituted this with "<got />". I hope this does not
disappoint you.
</apply>
Running this program outputs (almost) the whole template to the console as is. No substitutions are made. Probably there's some function call missing that modern Heist versions require. I tried to track it down in the documentation, but had no luck. Why doesn't it work?
Output:
<!-- billy.tpl --><bind tag='wanted'>Playstation 4</bind>
<bind tag='got'>Monopoly board game</bind>
<apply template='letter'>
<bind tag='kiddo'>Billy</bind>
I regret to inform you the "<wanted></wanted>" you have requested is currently
unavailable. I have substituted this with "<got></got>". I hope this does not
disappoint you.
</apply>
It looks like you are using renderTemplate from Heist.Compiled, but defining interpreted splices. I believe if you change this line:
set hcInterpretedSplices defaultInterpretedSplices
to this:
set hcLoadTimeSplices defaultLoadTimeSplices
it should work.

how to get html from blaze -- print to file

I am working through the blaze-html tutorial. I just want a simple Hello World page.
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad (forM_)
import Text.Blaze.Html5
import Text.Blaze.Html5.Attributes
import qualified Text.Blaze.Html5 as H
import qualified Text.Blaze.Html5.Attributes as A
import Text.Blaze.Html.Renderer.Text

notes :: Html
notes = docTypeHtml $ do
    H.head $ do
        H.title "John´ s Page"
    body $ do
        p "Hello World!"
Where is it? How do I get my HTML? Can I just print it to the terminal or a file? That would be a great start.
<html>
<head><title>John's Page</title></head>
<body><p>Hello World!</p></body>
</html>
And are all the import statements really necessary? I just want it to work.
I tried printing using the renderHtml function but I just get an error message:
main = (renderHtml notes) >>= putStrLn
notes.hs:21:9:
    Couldn't match expected type `IO String'
                with actual type `Data.Text.Internal.Lazy.Text'
    In the return type of a call of `renderHtml'
    In the first argument of `(>>=)', namely `(renderHtml notes)'
    In the expression: (renderHtml notes) >>= putStrLn
The result of "renderHtml" is a plain value, not one wrapped in a monad, so you don't need to use >>=.
Just print out the result:
main = putStrLn $ show $ renderHtml notes
The result is:
"<!DOCTYPE HTML>\n<html><head><title>John' s
Page</title></head><body><p>Hello World!</p></body></html>"
Generally speaking, the place to start with errors like this is to load the file into GHCI and see what the types are. Here is the session I'd use for this issue:
*Main> :t notes
notes :: Html
*Main> :t renderHtml notes
renderHtml notes :: Data.Text.Internal.Lazy.Text
You can see that the output of renderHtml notes is just a value of type Text. Text has a Show instance, so we can just call "putStrLn $ show $ renderHtml notes" to get the desired output.
However, it is usually better to use the Data.Text.[Lazy.]IO module to perform IO when using Text. Note the import for "TIO" and the last line in the code below:
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad (forM_)
import Text.Blaze.Html5
import Text.Blaze.Html5.Attributes
import qualified Text.Blaze.Html5 as H
import qualified Text.Blaze.Html5.Attributes as A
import Text.Blaze.Html.Renderer.Text
import qualified Data.Text.Lazy.IO as TIO

notes :: Html
notes = docTypeHtml $ do
    H.head $ do
        H.title "John' s Page"
    body $ do
        p "Hello World!"

--main = putStrLn $ show $ renderHtml notes
main = TIO.putStr $ renderHtml notes
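The difference between the two main definitions can be seen without blaze-html at all. The following is a small sketch using only the text package; the value rendered is a hypothetical stand-in for what renderHtml returns, not real blaze output.

```haskell
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TIO

-- Hypothetical stand-in for the lazy Text produced by renderHtml.
rendered :: TL.Text
rendered = TL.pack "<p>Hello\nWorld</p>"

main :: IO ()
main = do
  -- show wraps the text in quotes and escapes the newline as \n.
  putStrLn (show rendered)
  -- The Text IO functions write the markup itself, newline included.
  TIO.putStrLn rendered
```

The first line of output is the quoted, escaped form; the second call writes the raw markup across two lines, which is usually what you want when saving HTML to a file.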

Idiomatic io-streams directory traversal

I was discussing some code on Reddit, and it made me curious about how this would be implemented in io-streams. Consider the following code which traverses a directory structure and prints out all of the filenames:
import Control.Exception (bracket)
import qualified Data.Foldable as F
import Data.Streaming.Filesystem (FileType (..), closeDirStream,
                                  getFileType, openDirStream,
                                  readDirStream)
import System.Environment (getArgs)
import System.FilePath ((</>))

printFiles :: FilePath -> IO ()
printFiles dir = bracket
    (openDirStream dir)
    closeDirStream
    loop
  where
    -- Read one entry at a time; stop when readDirStream returns Nothing.
    loop ds = do
        mfp <- readDirStream ds
        F.forM_ mfp $ \fp' -> do
            let fp = dir </> fp'
            ftype <- getFileType fp
            case ftype of
                FTFile -> putStrLn fp
                FTFileSym -> putStrLn fp
                FTDirectory -> printFiles fp
                _ -> return ()
            loop ds

main :: IO ()
main = getArgs >>= mapM_ printFiles
Instead of simply printing the files, suppose we wanted to create some kind of streaming filepath representation. I know how this would work in enumerator, conduit, and pipes. However, since the intermediate steps require acquisition of a scarce resource (the DirStream), I'm not sure what the implementation would be for io-streams. Can someone provide an example of how that would be done?
For comparison, here's the conduit implementation, which is made possible via bracketP and MonadResource. And here's how the conduit code would be used to implement the same file printing program as above:
import Control.Monad.IO.Class (liftIO)
import Control.Monad.Trans.Resource (runResourceT)
import Data.Conduit (($$))
import Data.Conduit.Filesystem (sourceDirectoryDeep)
import qualified Data.Conduit.List as CL
import System.Environment (getArgs)

main :: IO ()
main =
    getArgs >>= runResourceT . mapM_ eachRoot
  where
    -- False means don't traverse dir symlinks
    eachRoot root = sourceDirectoryDeep False root
                 $$ CL.mapM_ (liftIO . putStrLn)
Typical style would be to do something like this:
traverseDirectory :: RawFilePath -> (InputStream RawFilePath -> IO a) -> IO a
i.e. a standard "with-" function, with the obvious implementation.
Edit: added a working example implementation: https://gist.github.com/gregorycollins/00c51e7e33cf1f9c8cc0
It's not exactly complicated but it's also not as trivial as I had first suggested.
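As a rough illustration of the "with-" shape only, here is a sketch built on base alone. It is not the implementation from the gist and does not use io-streams: the Pull type is a made-up stand-in for InputStream, and it streams the lines of a file rather than directory entries. The names withFileLines, countAll and the demo file name are all hypothetical.

```haskell
import Control.Exception (bracket)
import System.IO (IOMode (ReadMode), hClose, hGetLine, hIsEOF, openFile)

-- A pull-based stream: an action that yields Nothing once exhausted.
-- (Stand-in for io-streams' InputStream; not the library's type.)
type Pull a = IO (Maybe a)

-- "with-" style: the Handle is open only while the callback runs,
-- and bracket closes it even if the callback throws.
withFileLines :: FilePath -> (Pull String -> IO a) -> IO a
withFileLines path use =
  bracket (openFile path ReadMode) hClose $ \h ->
    use $ do
      eof <- hIsEOF h
      if eof then return Nothing else Just <$> hGetLine h

-- Drain a stream and count how many elements it produced.
countAll :: Pull a -> IO Int
countAll next = go 0
  where
    go n = do
      item <- next
      case item of
        Nothing -> return n
        Just _  -> go (n + 1)

main :: IO ()
main = do
  writeFile "with-demo.txt" "a\nb\nc\n"
  n <- withFileLines "with-demo.txt" countAll
  print n
```

The scarce resource never escapes the callback, which is the property the signature above is after. A recursive directory traversal needs one such scope per open directory, which is presumably why the real version is less trivial.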

Haskell. MongoDB driver or Aeson charset problem

Good day. I have a MongoDB database filled with some data, and I made sure the data is stored in the correct charset. To fetch the data I use the following snippet:
{-# LANGUAGE OverloadedStrings #-}
import Network.Wai
import Network.Wai.Handler.Warp (run)
import Data.Enumerator (Iteratee (..))
import Data.Either (either)
import Control.Monad (join)
import Data.Maybe (fromMaybe)
import Network.HTTP.Types (statusOK, status404)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Char8 (unpack)
import Data.ByteString.Lazy.Char8 (pack)
import qualified Data.Text.Lazy as T
import Data.Text (Text(..))
import Control.Monad.IO.Class (liftIO, MonadIO)
import Data.Aeson (encode)
import qualified Data.Map as Map
import qualified Database.MongoDB as DB

application dbpipe req = do
  case unpack $ rawPathInfo req of
    "/items" -> itemsJSON dbpipe req
    _ -> return $ responseLBS status404 [("Content-Type", "text/plain")] "404"

indexPage :: Iteratee B.ByteString IO Response
indexPage = do
  page <- liftIO $ processTemplate "templates/index.html" []
  return $ responseLBS statusOK [("Content-Type", "text/html; charset=utf-8")] page

processTemplate f attrs = do
  page <- L.readFile f
  return page

itemsJSON :: DB.Pipe -> Request -> Iteratee B.ByteString IO Response
itemsJSON dbpipe req = do
  dbresult <- liftIO $ rundb dbpipe $ DB.find (DB.select [] $ tu "table") >>= DB.rest
  let docs = either (const []) id dbresult
  -- liftIO $ L.putStrLn $ encode $ show $ map docToMap docs
  return $ responseLBS statusOK [("Content-Type", "text/plain; charset=utf-8")]
    (encode $ map docToMap docs)

docToMap doc = Map.fromList $ map (\f -> (T.dropAround (== '"') $ T.pack $ show $ DB.label f, T.dropAround (== '"') $ T.pack $ show $ DB.value f)) doc

main = do
  pipe <- DB.runIOE $ DB.connect $ DB.host "127.0.0.1"
  run 3000 $ application pipe

rundb pipe act = DB.access pipe DB.master database act

tu :: B.ByteString -> UString
tu = DB.u . C8.unpack
The result is surprising: DB.label works well, but DB.value gives me native characters as escape codes, so the result looks like this:
curl http://localhost:3000/items gives:
[{"Марка": "\1058\1080\1087 \1087\1086\1076",
"Model": "BD-W LG BP06LU10 Slim \1058\1080\1087 \1087\1086\1076\1082\1083\1102\1095\1077\1085\1080\1103"},
...
]
This happens both when I try to print the data and when I return the data encoded as JSON.
Any idea how to correctly extract values from the MongoDB driver?
The following line confirms that aeson's encoding works properly (using the utf8-string library to read UTF-8 data off the lazy bytestring back into a Haskell string):
> putStrLn $ Data.ByteString.Lazy.UTF8.toString $ encode $ ("\1058\1080\1087 \1087\1086\1076",12)
["Тип под",12]
Looking at your code more closely, I see the real problem. You're calling T.pack $ show $ DB.value, which renders non-ASCII characters as escaped numeric codepoints and then packs those escape sequences into a Text value. The fix is to switch from show to something smarter. Look at this (untested):
smartShow :: DB.Value -> Text
smartShow (String s) = Data.Text.Encoding.decodeUtf8 $ Data.CompactString.UTF8.toByteString s
smartShow x = T.pack $ show x
Obviously to handle the recursive cases, etc. you need to be smarter than that, but that's the general notion...
In fact, the "best" thing to do is to write a function of BSON -> JSON directly, rather than go through any intermediate structures at all.
Everything is working as expected -- only your expectations are wrong. =)
What you're seeing there are not raw Strings; they are Strings which have been escaped to exist purely in the printable ASCII range by the show function, called by print:
print = putStrLn . show
Never fear: in memory, the string that prints as "\1058" is in fact a single Unicode codepoint long. You can observe this by printing the length of one of the Strings you're interested in and comparing that to the number of Unicode codepoints you expect.
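A quick way to check this is the following sketch, which needs only base. The sample string reuses the first three codepoints from the output above; setting stdout to UTF-8 is an assumption made here so that the raw string can be written regardless of the terminal's locale.

```haskell
import System.IO (hSetEncoding, stdout, utf8)

main :: IO ()
main = do
  -- Assumption: force UTF-8 so that putStrLn can emit Cyrillic in any locale.
  hSetEncoding stdout utf8
  let s = "\1058\1080\1087"
  print s           -- goes through show: the escaped, quoted form
  putStrLn s        -- writes the three characters themselves
  print (length s)  -- three codepoints, not fourteen ASCII characters
```

print shows the escaped form "\1058\1080\1087", putStrLn writes Тип, and the length is 3, so the escapes exist only in the output of show.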
