Problems with Character Encoding Using Haskell's Text.Pandoc - haskell

I want to parse a LaTeX file using Pandoc and output the text, like this:
import qualified Text.Pandoc as P
import Text.Pandoc.Error (handleError)
tex2Str = do
  file <- readFile "test.tex"
  let p = handleError $ P.readLaTeX P.def file
  writeFile "A.txt" $ P.writePlain P.def p
  writeFile "B.txt" $ file
While the encoding in file B.txt seems to be "right" (i.e. UTF-8), the encoding in file A.txt is not correct.
Here the respective extracts of the files:
A.txt:
...
Der _Crawler_ läuft hierbei über die Dokumentenbasis
...
B.txt:
...
\usepackage[utf8]{inputenc}
...
Der \emph{Crawler} läuft hierbei über die Dokumentenbasis
...
Does anyone know how to fix this? Why does Pandoc use the wrong encoding (I thought it uses UTF-8 by default)?
Update:
I got a (partial) solution: using the readFile and writeFile functions from Text.Pandoc.UTF8 seems to fix some of the problems, i.e.
import qualified Text.Pandoc as P
import Text.Pandoc.Error (handleError)
import qualified Text.Pandoc.UTF8 as UTF (readFile, writeFile)
tex2Str = do
  file <- UTF.readFile "test.tex"
  let p = handleError $ P.readLaTeX P.def file
  UTF.writeFile "A.txt" $ P.writePlain P.def p
  UTF.writeFile "B.txt" $ file
However, I still don't understand what the actual problem was, since both Prelude.readFile and Prelude.writeFile seem to be UTF-8 aware...
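A possible explanation, offered as a sketch rather than a confirmed diagnosis: Prelude.readFile and Prelude.writeFile open files in text mode and use the encoding of the current locale, while the Text.Pandoc.UTF8 versions always use UTF-8. Whether that accounts for the exact output above is not verified here. The helper names readFileUtf8 and writeFileUtf8 below are hypothetical, base-only stand-ins that pin the encoding explicitly with hSetEncoding:

```haskell
import Control.Exception (evaluate)
import System.IO

-- Hypothetical helper (not part of Pandoc): read a file as UTF-8,
-- regardless of the locale encoding Prelude.readFile would pick.
readFileUtf8 :: FilePath -> IO String
readFileUtf8 path = withFile path ReadMode $ \h -> do
  hSetEncoding h utf8
  contents <- hGetContents h
  _ <- evaluate (length contents)  -- force the lazy read before the handle closes
  return contents

-- Hypothetical helper: write a file as UTF-8, regardless of the locale.
writeFileUtf8 :: FilePath -> String -> IO ()
writeFileUtf8 path str = withFile path WriteMode $ \h -> do
  hSetEncoding h utf8
  hPutStr h str

main :: IO ()
main = do
  -- "\228" is a-umlaut and "\252" is u-umlaut, written as escapes so the
  -- example does not depend on the encoding of this source file.
  let sample = "Der Crawler l\228uft hierbei \252ber die Dokumentenbasis\n"
  writeFileUtf8 "roundtrip.txt" sample
  back <- readFileUtf8 "roundtrip.txt"
  putStrLn (if back == sample then "round trip ok" else "round trip mismatch")
```

Setting the encoding on the handle keeps the behaviour the same on machines with different locale settings, which is the property the Text.Pandoc.UTF8 functions provide.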

Related

Cabal package difference between readPackageDescription and parsePackageDescription

The Haskell package Cabal-1.24.2 has the module Distribution.PackageDescription.Parse.
The module has two functions: readPackageDescription and parsePackageDescription.
When I run in ghci:
let d = readPackageDescription normal "C:\\somefile.cabal"
I get a parsed GenericPackageDescription.
But when I run in ghci:
content <- readFile "C:\\somefile.cabal"
let d = parsePackageDescription content
I get a parse error:
ParseFailed (FromString "Plain fields are not allowed in between stanzas: F 2 \"version\" \"0.1.0.0\"" (Just 2))
The example file was generated using cabal init.
parsePackageDescription expects the file contents themselves to be passed to it, not the path of the file they are stored in. You'll want to readFile first, though beware of file encoding issues: http://www.snoyman.com/blog/2016/12/beware-of-readfile

hslogger & Duplicate Log Lines

I've configured logging like so:
import System.Environment
import System.Log.Logger
import System.Log.Handler (setFormatter)
import System.Log.Handler.Simple (streamHandler)
import System.Log.Formatter
import System.IO (getLine, stdout)
main = do
  stdOutHandler <- streamHandler stdout DEBUG >>= \lh -> return $
    setFormatter lh (simpleLogFormatter "[$time : $loggername : $prio] $msg")
  updateGlobalLogger "Linker" (setLevel DEBUG . setHandlers [stdOutHandler])
  infoM "Linker" "Hello world!"
Unfortunately, every time I use infoM (or any logging function), I get duplicate lines, e.g.
infoM "Linker" "hi there"
produces:
hi there
[2016-12-05 20:23:10 GMT : Linker : INFO] hi there
I thought setHandlers removed other handlers first.
I want just the formatted lines, not the "normal" format à la putStrLn etc.
I found the error in your program. Actually, it was already in your first code, I just didn't pay enough attention to it :(
All you need is to replace logger name with rootLoggerName in
updateGlobalLogger "Linker"
to
updateGlobalLogger rootLoggerName
This did the trick for me. I don't know what happens when you don't initialize the root logger, but now it will at least work.
Also, if you're using Stack and don't mind using GitHub projects, you may wish to consider our logging library (it is not currently on Hackage), which is a wrapper around hslogger that adds some juice to it (like coloured logger names and more):
https://github.com/serokell/log-warper
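For reference, here is the question's program with that one change applied, as a minimal sketch that assumes the hslogger package is installed. The likely reason it helps, stated as an assumption rather than something verified here: the root logger starts out with its own default handler, and messages sent to "Linker" also reach the root logger's handlers, so replacing the handlers on the root logger leaves only the formatted one.

```haskell
import System.IO (stdout)
import System.Log.Formatter (simpleLogFormatter)
import System.Log.Handler (setFormatter)
import System.Log.Handler.Simple (streamHandler)
import System.Log.Logger

main :: IO ()
main = do
  -- Build a stdout handler and attach the custom format from the question.
  handler <- streamHandler stdout DEBUG
  let formatted =
        setFormatter handler
          (simpleLogFormatter "[$time : $loggername : $prio] $msg")
  -- Replace the handlers on the root logger instead of on "Linker".
  updateGlobalLogger rootLoggerName (setLevel DEBUG . setHandlers [formatted])
  infoM "Linker" "Hello world!"
```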

Text encoding - fine on Windows, not nix

I have an issue with loading data between default encoding on Win and nix machines (ISO-8859-1 and UTF-8 respectively).
Example - Windows first:
library(stringi)
dummy <- as.character("BØÅS")
write(dummy, "saveFile")
getData <- read.table("saveFile", header=F, sep="\t", quote="\"")
reEncode=function(x) {
stri_trans_general(x, "Latin-ASCII")
}
enCoded <- apply(getData, 1, reEncode)
result <- as.data.frame(enCoded)
In Windows the above produces "BOAS" as desired.
Now move to nix and use the saved file:
getData <- read.table("saveFile", header=F, sep="\t", quote="\"")
reEncode=function(x) {
stri_trans_general(x, "Latin-ASCII")
}
enCoded <- apply(getData, 1, reEncode)
result <- as.data.frame(enCoded)
Nix gives "B??S".
I believe this is a read.table encoding issue but haven't been able to figure out how to get nix to use ISO-8859-1. Any suggestions?
read.table("saveFile", header=F, sep="\t", quote="\"",encoding="latin1")

Scotty and POST params

I'm having an issue with the Scotty web server right now - rescue isn't working for unfound parameters - I'm still getting a 404 with the following code:
post "/newsletter/create" ( do
  (param "subscriber[email]") `rescue` (\msg -> text msg)
  formContent <- param "subscriber[email]"
  text $ "found! " ++ show formContent )
I can see that when I just use params instead, my data is there, and indexed with "subscriber[email]". Is there something going on with [ escaping? Any help with this would be tremendous.
With some cleanup I got it to work:
{-# LANGUAGE OverloadedStrings #-}
import Web.Scotty
import qualified Data.Text.Lazy as TL
main = scotty 3000 $ do
  post "/newsletter/create" $ do
    formContent <- (param "subscriber[email]") `rescue` (\msg -> return msg)
    text $ "found! " `TL.append` formContent
I made a bunch of modifications, but the key point was that rescue is used as a wrapper around param, not to change any internal state, hence you shouldn't call it twice. The square brackets didn't cause me any trouble.

Any example of a custom PreProcessor in Haskell?

I've walked through the Cabal Distribution.Simple* modules and learned that the PreProcessor data type can be used to define custom pre-processors. But the example provided is not very useful: I don't know how to invoke the pre-processor.
Currently, I just define my own pre-processors in the Setup.hs file.
Are there any complete examples for this feature?
[EDITED]
Check this mail-list archive I just found. But the solution involves transforming from one type of file (identified by the extension of that file) to another.
What I want to do is to inject code into existing .hs files where a custom mark is defined, e.g.
-- <inject point="foo">
-- extra Haskell code goes here
-- </inject>
One of the most important things to do is to set the Build-Type in your .cabal file to Custom. If it stays at Simple, Cabal will completely ignore the Setup.hs file.
Build-Type: Custom
Here is an example custom preprocessor from my package. It first runs cpphs and then runs hsc2hs:
#!/usr/bin/env runhaskell
> {-# LANGUAGE BangPatterns #-}
> import Distribution.Simple
> import Distribution.Simple.PreProcess
> import Distribution.Simple.Utils
> import Distribution.PackageDescription
> import Distribution.Simple.LocalBuildInfo
> import Data.Char
> import System.Exit
> import System.IO
> import System.Directory
> import System.FilePath.Windows
> main = let hooks = simpleUserHooks
>            xpp   = ("xpphs", ppXpp)
>        in defaultMainWithHooks hooks { hookedPreProcessors = xpp:knownSuffixHandlers }
>
> ppXpp :: BuildInfo -> LocalBuildInfo -> PreProcessor
> ppXpp build local =
>   PreProcessor {
>     platformIndependent = True,
>     runPreProcessor = mkSimplePreProcessor $ \inFile outFile verbosity ->
>       do info verbosity (inFile++" is being preprocessed to "++outFile)
>          let hscFile = replaceExtension inFile "hsc"
>          runSimplePreProcessor (ppCpp build local) inFile hscFile verbosity
>          handle <- openFile hscFile ReadMode
>          source <- sGetContents handle
>          hClose handle
>          let newsource = unlines $ process $ lines source
>          writeFile hscFile newsource
>          runSimplePreProcessor (ppHsc2hs build local) hscFile outFile verbosity
>          removeFile hscFile
>          return ()
>   }
This preprocessor will automatically be called by Cabal when any file with the extension .xpphs is found.
In your case, just register the preprocessor with the .hs extension. (I'm not sure if Cabal allows this. If it doesn't, you can simply rename the files that contain an injection point to .xh or something similar. That would actually be better, since you then don't process every file in your project.)
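The process function used in the preprocessor above is not shown, so here is a hypothetical sketch of what a line-based transformation for the inject markers from the question could look like. Everything in it is an assumption for illustration: the Snippets type, the helper names, and the choice to replace the whole marked region (markers included) with the registered snippet. It uses only base and does not touch the Cabal API.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (fromMaybe)

-- Hypothetical: code to splice in, keyed by the name given in point="...".
type Snippets = [(String, [String])]

-- If the line opens an inject region, return the name of its point.
injectPoint :: String -> Maybe String
injectPoint line = do
  rest <- stripPrefix "-- <inject point=\"" (dropWhile (== ' ') line)
  let (name, close) = break (== '"') rest
  if close == "\">" then Just name else Nothing

-- True for the line that closes an inject region.
isInjectEnd :: String -> Bool
isInjectEnd line = dropWhile (== ' ') line == "-- </inject>"

-- Replace each marked region, markers included, with the snippet registered
-- for its point. For an unknown point the original body is kept and only
-- the markers are dropped.
process :: Snippets -> [String] -> [String]
process snippets = go
  where
    go [] = []
    go (l:ls) = case injectPoint l of
      Nothing   -> l : go ls
      Just name ->
        let (body, rest) = break isInjectEnd ls
            replacement  = fromMaybe body (lookup name snippets)
        in replacement ++ go (drop 1 rest)  -- drop 1 skips the closing marker

main :: IO ()
main = do
  let source =
        [ "module Demo where"
        , "-- <inject point=\"foo\">"
        , "-- extra Haskell code goes here"
        , "-- </inject>"
        , "answer :: Int"
        , "answer = foo"
        ]
      snippets = [("foo", ["foo :: Int", "foo = 42"])]
  putStr (unlines (process snippets source))
```

Partially applying it as process snippets gives a function of type [String] -> [String], which is the shape the call unlines $ process $ lines source in the preprocessor expects.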
