How to get a UTF-8 RSS feed? - haskell

I'm trying to use the RSS package with a UTF-8 string, to no avail. (I don't want to use HXT, which works; I just want to understand where I'm wrong.)
In GHCi, when I evaluate test I just get garbage for characters such as "é".
If I get the string by reading a file with UTF8.readFile and send it to parseFeedString, it works; but when I download it and use getResponseBody, it doesn't.
Here is my sample code:
import Network.HTTP (simpleHTTP, getRequest, getResponseBody)
import Data.Maybe (fromJust)
import Text.Feed.Import (parseFeedString)
import Text.RSS.Syntax
import Text.Feed.Types (Feed(..))
import Prelude hiding (putStrLn)
import Data.ByteString.Char8 (putStrLn)
import Data.ByteString.UTF8 (fromString)

siteUrl = "http://radiofrance-podcast.net/podcast09/rss_11549.xml"

type Links = [(String, String, String)]

-------------------------------------------------------------------------------
-- Main function
-------------------------------------------------------------------------------
test = getLinks siteUrl >>= mapM_ (putStrLn . fromString)

-------------------------------------------------------------------------------
-- Retrieve titles
-------------------------------------------------------------------------------
getLinks :: String -> IO [String]
getLinks url = simpleHTTP (getRequest url) >>= getResponseBody >>= parseDoc

parseDoc d = do
    let RSSFeed rss = (fromJust . parseFeedString) d
        items       = rssItems . rssChannel $ rss
        titles      = map (fromJust . rssItemTitle) items
    return titles
Update:
Thanks to Roman's answer, I have modified my code. Here are the modifications for anyone who may be interested:

import Codec.Binary.UTF8.String (decodeString)  -- <-- added

getLinks :: String -> IO [String]
getLinks url = simpleHTTP (getRequest url) >>= getResponseBody >>= parseDoc . decodeString  -- <-- modified

The fact that simpleHTTP may return String-based responses is a bit confusing. In reality these are not Unicode strings, but byte strings that contain the HTTP response as-is; no automatic decoding is done.
So you need to decode the HTTP response before passing it to the feed-parsing functions (e.g. using the encoding or utf8-string package).
You probably want to extract the source encoding from the Content-Type HTTP header, or from the RSS document itself.
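For instance, here is a minimal sketch of that last suggestion, assuming the server advertises the charset in its Content-Type header (fetchDecoded is a hypothetical helper, not part of the original code):

import Codec.Binary.UTF8.String (decodeString)
import Data.Char (toLower)
import Data.List (isInfixOf)
import Network.HTTP

-- Decode the body as UTF-8 only when the Content-Type header
-- advertises it; otherwise return the raw bytes unchanged.
fetchDecoded :: String -> IO String
fetchDecoded url = do
    result <- simpleHTTP (getRequest url)
    case result of
        Left err  -> fail (show err)
        Right rsp -> do
            let raw    = rspBody rsp
                ctype  = fmap (map toLower) (findHeader HdrContentType rsp)
                isUtf8 = maybe True ("utf-8" `isInfixOf`) ctype
            return (if isUtf8 then decodeString raw else raw)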

Related

What is the best way to get data from a URL and parse it in Haskell?

I'm having trouble parsing data from a URL.
The URL uses "https://", so I think I should use Network.HTTP.Conduit. But
simpleHttp url
returns an L.ByteString, and I really don't understand what I should do after that.
So I have this code to get the data:
toStrict1 :: L.ByteString -> B.ByteString
toStrict1 = B.concat . L.toChunks

main :: IO ()
main = do
    lbs <- simpleHttp url
    let page = toStrict1 lbs
and an example of the parsing:

let lastModifiedDateTime = fromFooter $ parseTags doc
putStrLn $ "wiki.haskell.org was last modified on " ++ lastModifiedDateTime
  where fromFooter = unwords . drop 6 . words . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")
How can I combine these two parts of code?
As you've seen, the simpleHttp function returns a lazy bytestring. There are several ways to deal with this in TagSoup.
First, it turns out that you can parse it directly. The function parseTags has signature:
parseTags :: StringLike str => str -> [Tag str]
meaning that it can parse any type str with a StringLike instance, and if you look at the Text.StringLike module documentation, you'll see that lazy ByteStrings have a StringLike instance.
However, if you go this route, you need to be aware that everything's kind of "trapped" in a ByteString world, so you have to write your code using versions of functions like words and unwords that are bytestring-compatible, and even your putStrLn needs an adapter. A full working example would look like this:
import Network.HTTP.Conduit
import Text.HTML.TagSoup
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as CL

main :: IO ()
main = do
    lbs <- simpleHttp "https://wiki.haskell.org"
    let lastModifiedDateTime = fromFooter $ parseTags lbs
    putStrLn $ "wiki.haskell.org was last modified on "
            ++ CL.unpack lastModifiedDateTime
  where
    fromFooter = CL.unwords . drop 6 . CL.words
               . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")
and it works fine:
> main
wiki.haskell.org was last modified on 9 September 2013, at 22:38.
>
The functions from Data.ByteString.Lazy.Char8 basically assume that the bytestring is ASCII-encoded, which is close enough for this example to work.
However, it would be more robust to decode the bytestring based on the proper character encoding to a valid text type. The two main text types in Haskell are the default String type, which is inefficient and slow, but easy to work with, and the Text type, which is highly efficient but a bit more complicated. (Like ByteString, you need to use Text-compatible versions of functions like words and so on.) Both String and Text have StringLike instances, so they both work fine with TagSoup.
If we were going to write production-quality code, we'd actually consult the response headers from the HTTP request and/or check for a <meta> tag in the HTML to determine the real encoding. But if we just assume the encoding is UTF-8 (which it is), the Text version looks like this:
import Network.HTTP.Conduit
import Text.HTML.TagSoup
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TL
import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
    lbs <- simpleHttp "https://wiki.haskell.org"
    let lastModifiedDateTime = fromFooter $ parseTags (TL.decodeUtf8 lbs)
    putStrLn $ "wiki.haskell.org was last modified on "
            ++ TL.unpack lastModifiedDateTime
  where
    fromFooter = TL.unwords . drop 6 . TL.words
               . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")
and a String version using Data.ByteString.Lazy.UTF8 from the utf8-string package looks like this:
import Network.HTTP.Conduit
import Text.HTML.TagSoup
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.UTF8 as BL

main :: IO ()
main = do
    lbs <- simpleHttp "https://wiki.haskell.org"
    let lastModifiedDateTime = fromFooter $ parseTags (BL.toString lbs)
    putStrLn $ "wiki.haskell.org was last modified on "
            ++ lastModifiedDateTime
  where
    fromFooter = unwords . drop 6 . words
               . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")
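As an aside, here is a rough sketch of the "production-quality" route mentioned above, inspecting the Content-Type response header before choosing a decoder. The charset test and the Latin-1 fallback are illustrative assumptions, not part of the original answer:

{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Conduit
import Network.HTTP.Types (hContentType)
import Control.Monad.IO.Class (liftIO)
import qualified Data.ByteString.Char8 as B
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TL

main :: IO ()
main = withManager $ \manager -> do
    req <- parseUrl "https://wiki.haskell.org"
    res <- httpLbs req manager
    -- pick a decoder from the advertised charset, defaulting to UTF-8;
    -- a robust version would lower-case the header value first
    let ctype = lookup hContentType (responseHeaders res)
        text | maybe True (B.isInfixOf "utf-8") ctype = TL.decodeUtf8 (responseBody res)
             | otherwise = TL.decodeLatin1 (responseBody res)
    liftIO $ do
        putStrLn $ "Content-Type was: " ++ show ctype
        putStrLn $ TL.unpack (TL.take 72 text)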

Exception POSTing multipart form with http-conduit

I'm trying to POST a multipart form request to an internal website which should reply with an XML response. Using another simple script I have in Python with the requests library, everything works fine; however, using http-conduit I keep receiving the exception ExpectedBlankAfter100Continue.
If I replace the internal URL with "https://httpbin.org/post", I also receive a reply back without issue.
Is there something I'm doing wrong? It seems like either a bug in the library, or the site is not behaving as expected. If the latter is the case, is there an option for me to disable this check in http-conduit?
Sample code:
{-# LANGUAGE OverloadedStrings #-}
import Network
import Network.HTTP.Conduit
import Network.HTTP.Client.MultipartFormData
import qualified Data.ByteString.Lazy.Char8 as BL
import Data.Maybe (fromJust)
import System.Environment (getArgs)
import Control.Monad.IO.Class

main = do
    [x] <- getArgs
    --let url = "https://url.net/api.asp"
    let url = "https://httpbin.org/post"
    withSocketsDo $ withManager $ \m -> do
        r <- flip httpLbs m =<< (formDataBody (request $ BL.pack x) $ fromJust $ parseUrl url)
        liftIO $ BL.putStrLn $ responseBody r

request :: BL.ByteString -> [Part]
request x = <code removed>
This sounds like the server is returning a malformed 100-continue response. But there's not enough information here to properly debug it; it's probably better to handle this in a GitHub issue.

Haskell does not evaluate block

I am writing a simple sitemap.xml crawler. The code is below. My question is why the code at the end of main does not print anything. I suspect it's because of Haskell's laziness, but I don't know how to deal with it here:
import Network.HTTP.Conduit
import qualified Data.ByteString.Lazy as L
import Text.XML.Light
import Control.Monad.Trans (liftIO)
import Control.Monad
import Data.String.Utils
import Control.Exception

download :: Manager -> Request -> IO (Either HttpException L.ByteString)
download manager req = do
    try $
        fmap responseBody (httpLbs req manager)

downloadUrl :: Manager -> String -> IO (Either HttpException L.ByteString)
downloadUrl manager url = do
    request <- parseUrl url
    download manager request

getPages :: Manager -> [String] -> IO [Either HttpException L.ByteString]
getPages manager urls =
    sequence $ map (downloadUrl manager) urls

main = withManager $ \manager -> do
    -- I know simpleHttp is bad here
    mapSource <- liftIO $ simpleHttp "http://example.com/sitemap.xml"
    let elements = (parseXMLDoc mapSource) >>= Just . findElements (mapElement "loc")
        Just urls = liftM (map $ (replace "/#!" "?_escaped_fragment_=") . strContent) elements
        mapElement name = QName name (Just "http://www.sitemaps.org/schemas/sitemap/0.9") Nothing
    return $
        getPages manager urls >>= \pages -> do
            print "evaluate me!"
            sequence $ map print pages
You're running into the same problem I describe here, at least as far as having incorrect code that typechecks when it should actually give a type error: Why is the type of "Main.main", "IO ()" and not "IO a"?. This is why you should always give main the type signature main :: IO () explicitly.
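A minimal illustration of the trap, separate from the crawler code (my own example, not from the original question):

-- This compiles: GHC accepts main at type IO (IO ()).
-- The inner putStrLn action is constructed but never executed,
-- so running the program prints nothing.
main :: IO (IO ())
main = return (putStrLn "never printed")

-- Had we written main :: IO (), the same body would be rejected
-- with a type error instead of silently doing nothing.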
To fix the problem, you will want to replace return with lift (see http://hackage.haskell.org/package/transformers/docs/Control-Monad-Trans-Class.html#v:lift) and replace sequence $ map ... with mapM_. mapM_ f is equivalent to sequence_ . map f.
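Concretely, the end of main would become something like this (a sketch, assuming the question's other definitions and imports stay the same, with lift imported from Control.Monad.Trans):

main :: IO ()
main = withManager $ \manager -> do
    mapSource <- liftIO $ simpleHttp "http://example.com/sitemap.xml"
    let elements = parseXMLDoc mapSource >>= Just . findElements (mapElement "loc")
        Just urls = liftM (map $ replace "/#!" "?_escaped_fragment_=" . strContent) elements
        mapElement name = QName name (Just "http://www.sitemaps.org/schemas/sitemap/0.9") Nothing
    -- lift runs the IO action inside withManager's monad, so the
    -- downloads actually execute instead of being returned unevaluated
    lift $ do
        pages <- getPages manager urls
        print "evaluate me!"
        mapM_ print pages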
Substitute your last return with runResourceT (http://hackage.haskell.org/package/resourcet-1.1.1/docs/Control-Monad-Trans-Resource.html#v:runResourceT). As its type suggests, it turns a ResourceT action into an IO action.

x-oauth-basic header isn't the same in CURL and HTTP.Conduit

This is more than likely me missing some vital piece of information, but here goes.
I'm trying to insert my own header, namely x-oauth-basic, into my HTTP request using the HTTP.Conduit library. It sort of works, but not in my intended way:
submitPostRequest urlString githubKey body =
    case parseUrl urlString of
        Nothing -> return $ "URL Syntax Error"
        Just initReq -> withManager $ \manager -> do
            let req = initReq { secure = False -- Not using https (True would turn it on)
                              , method = "POST"
                              , requestHeaders = [("x-oauth-basic", (encodeUtf8 githubKey))]
                                              <> [("User-Agent", "HsCMS")]
                              , requestBody = RequestBodyBS (toStrict body)
                              , checkStatus = \_ _ _ -> Nothing
                              }
            res <- httpLbs req manager
            return $ responseBody res
the important bit being
requestHeaders = [("x-oauth-basic", (encodeUtf8 githubKey))]
<> [("User-Agent", "HsCMS")]
Using an HTTP sinkhole, I can see the header arrives as HTTP_X_OAUTH_BASIC. It shouldn't have the HTTP_ bit in front. Testing with curl,
curl -u 78y8713k1j23nkjnkjnuy366366363666gdasddd:x-oauth-basic --request POST --data '{"description":"Updated via API","files":{"file1.txt":{"filename": "newsies.txt", "content":"New Demo"}}}' http://www.posttestserver.com/post.php\?dir\=Testing
the header doesn't appear there, which suggests that the sinkhole doesn't pick up x-headers. The curl example also works with my intended endpoint which is the github API, so I know the curl approach is correct, and my HTTP.Conduit one is not.
So my question is: how do I get my HTTP.Conduit header to appear as an x-header, like curl's, instead of the http-x-header I'm currently getting?
Also, don't worry, the github key used is not an actual key...
Update and fix
So, as mentioned in a comment to Michael Snoyman's answer, it got solved by using a different header, namely ("Authorization", "token " <> (encodeUtf8 githubKey)), which apparently is close to what curl sends when doing <token>:x-oauth-basic.
I've tried to update the title to fit a little better, but am open to suggestions.
Thanks for all the help!
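For reference, the relevant lines of the request then become (a sketch based on the update above):

-- authenticate via the Authorization header instead of x-oauth-basic;
-- this matches what curl effectively sends for <token>:x-oauth-basic
requestHeaders = [ ("Authorization", "token " <> encodeUtf8 githubKey)
                 , ("User-Agent", "HsCMS")
                 ]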
I think the problem is with your sinkhole application. It appears to be printing CGI versions of the headers. I don't know what the sinkhole looks like, so I implemented a simple one in Warp, and indeed the request header is being passed through correctly. You can clone the project on FP Haskell Center to try it out yourself. For completeness, here's the code below:
{-# LANGUAGE OverloadedStrings #-}
import Control.Concurrent.Async (withAsync)
import Control.Monad.IO.Class (liftIO)
import Data.ByteString (ByteString)
import qualified Data.ByteString.Lazy as L
import Data.Monoid (mempty, (<>))
import Data.Text (Text)
import Data.Text.Encoding (encodeUtf8)
import Network.HTTP.Conduit (RequestBody (RequestBodyBS),
                             checkStatus, httpLbs, method,
                             parseUrl, requestBody,
                             requestHeaders, responseBody, secure,
                             withManager)
import Network.HTTP.Types (status200)
import qualified Network.Wai as Wai
import Network.Wai.Handler.Warp (run)
import System.Environment (getEnv)

main :: IO ()
main = do
    port <- fmap read $ getEnv "PORT"
    withAsync (run port app) $ const $ do
        submitPostRequest
            ("http://localhost:" ++ show port)
            "dummy-key"
            "dummy body" >>= print

app :: Wai.Application
app req = do
    liftIO $ mapM_ print $ Wai.requestHeaders req
    return $ Wai.responseLBS status200 [] mempty

submitPostRequest :: String -> Text -> ByteString -> IO L.ByteString
submitPostRequest urlString githubKey body =
    case parseUrl urlString of
        Nothing -> return $ "URL Syntax Error"
        Just initReq -> withManager $ \manager -> do
            let req = initReq { secure = False -- Not using https (True would turn it on)
                              , method = "POST"
                              , requestHeaders = [("x-oauth-basic", (encodeUtf8 githubKey))]
                                              <> [("User-Agent", "HsCMS")]
                              , requestBody = RequestBodyBS body
                              , checkStatus = \_ _ _ -> Nothing
                              }
            res <- httpLbs req manager
            return $ responseBody res
When I run this, the output in the console is:
("Host","localhost:8004")
("Accept-Encoding","gzip")
("Content-Length","10")
("x-oauth-basic","dummy-key")
("User-Agent","HsCMS")
Empty

Haskell design: Struggling with IO

I am a novice Haskell programmer trying to write some Haskell CGI which will read from a MySQL DB and output JSON. I am able to generate the right JSON, but I am unable to get the data types right to actually output it. I also think I am still primarily thinking imperatively. Here is my code. Note that getTopBrands provides the JSON output.
My problem is that I am unable to figure out how to return "[Char]" from getTopBrands rather than "IO [Char]". It looks to me like I am still thinking imperatively. Any pointers or suggestions to fix this would be greatly appreciated. Please let me know if I need to provide the rest of the code.
RODB.hs:
{-# LANGUAGE RecordWildCards, OverloadedStrings, PackageImports #-}
module Main where

import RODB
import ROOutput
import System.Environment
import Database.HDBC
import Network.Socket(withSocketsDo)
import Network.CGI
import Text.XHtml
import qualified "bytestring" Data.ByteString.Lazy.Char8 as LBS
import Data.Aeson

page :: Html
page = body << h1 << str

main = runCGI $ handleErrors cgiMain

cgiMain :: CGI CGIResult
cgiMain =
    do out <- getTopBrands 10 1
       setHeader "Content-type" "application/json"
       output $ renderHtml page out

getTopBrands :: Integer -> Integer -> IO [Char]
getTopBrands limit sorted =
    do let temp = 0
       dbh <- connect "127.0.0.1" "ReachOutPublicData" "root" "admin" "/tmp/mysql.sock"
       if sorted == 1
           then do brandlist <- getBrands dbh limit True
                   json <- convPublicBrandEntrytoJSON brandlist
                   return $ LBS.unpack json
           else do brandlist <- getBrands dbh limit False
                   json <- convPublicBrandEntrytoJSON brandlist
                   return $ LBS.unpack json
As Niklas B said, getTopBrands being in IO is right, since it depends on I/O. I guess your problem is that you get a type error when you try to use it directly,
cgiMain :: CGI CGIResult
cgiMain =
    do out <- getTopBrands 10 1
       setHeader "Content-type" "application/json"
       output $ renderHtml page out
since all statements in a do-block must belong to the same monad, and the rest of the block is in CGI. But, CGI is a MonadIO, thus you can simply liftIO it into CGI,
cgiMain :: CGI CGIResult
cgiMain =
    do out <- liftIO $ getTopBrands 10 1
       setHeader "Content-type" "application/json"
       output $ renderHtml page out
The next point Niklas raised is also right: the second Integer argument of getTopBrands should really be a Bool. However, even with its current type, the code duplication is entirely unnecessary; the difference between the two branches is just the Bool argument to getBrands, so
getTopBrands :: Integer -> Integer -> IO [Char]
getTopBrands limit sorted =
    do let temp = 0
       dbh <- connect "127.0.0.1" "ReachOutPublicData" "root" "admin" "/tmp/mysql.sock"
       brandlist <- getBrands dbh limit (sorted == 1)
       json <- convPublicBrandEntrytoJSON brandlist
       return $ LBS.unpack json
just pass it the condition on which you branched.
Niklas' third point
I also don't see why convPublicBrandEntrytoJSON would need to live in IO, but since you didn't provide its definition I cannot suggest an improvement here.
also looks very valid; a conversion would usually be a pure function. If the only reason it is in IO is the ability to write
json <- convPublicBrandEntrytoJSON brandlist
you should be aware that you can bind results of pure functions in a do-block
let json = convPublicBrandEntrytoJSON brandlist
using let.
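Putting the suggestions together, a sketch of the refactored function, assuming getTopBrands takes a Bool directly and convPublicBrandEntrytoJSON is made pure (its definition wasn't shown, so this part is illustrative):

getTopBrands :: Integer -> Bool -> IO String
getTopBrands limit sorted = do
    dbh <- connect "127.0.0.1" "ReachOutPublicData" "root" "admin" "/tmp/mysql.sock"
    brandlist <- getBrands dbh limit sorted
    let json = convPublicBrandEntrytoJSON brandlist  -- pure, so let instead of <-
    return $ LBS.unpack json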
