Concatenating scrapeURL results from multiple scrapes into one list

I am scraping https://books.toscrape.com using Haskell's Scalpel library. Here's my code so far:
{-# LANGUAGE OverloadedStrings #-} -- Scalpel selectors are written as string literals
import Text.HTML.Scalpel
import Data.List.Split (splitOn)
import Data.List (sortBy)
import Control.Monad (liftM2)
data Entry = Entry { entName :: String
                   , entPrice :: Float
                   , entRate :: Int
                   } deriving Eq
instance Show Entry where
    show (Entry n p r) = "Name: " ++ n ++ "\nPrice: " ++ show p ++ "\nRating: " ++ show r ++ "/5\n"
entries :: Maybe [Entry]
entries = Just []
scrapePage :: Int -> IO ()
scrapePage num = do
    items <- scrapeURL ("https://books.toscrape.com/catalogue/page-" ++ show num ++ ".html") allItems
    let sortedItems = items >>= Just . sortBy (\(Entry _ a _) (Entry _ b _) -> compare a b)
                            >>= Just . filter (\(Entry _ _ r) -> r == 5)
    maybe (return ()) (mapM_ print) sortedItems
allItems :: Scraper String [Entry]
allItems = chroots ("article" @: [hasClass "product_pod"]) $ do
    p <- text $ "p" @: [hasClass "price_color"]
    t <- attr "href" $ "a"
    star <- attr "class" $ "p" @: [hasClass "star-rating"]
    let fp = read $ flip (!!) 1 $ splitOn "£" p
    let fStar = drop 12 star
    return $ Entry t fp $ r fStar
  where
    r f = case f of
        "One" -> 1
        "Two" -> 2
        "Three" -> 3
        "Four" -> 4
        "Five" -> 5
main :: IO ()
main = mapM_ scrapePage [1..10]
Basically, allItems scrapes for each book's title, price and rating, does some formatting for price to get a float, and returns it as a type Entry. scrapePage takes a number corresponding to the result page number, scrapes that page to get IO (Maybe [Entry]), formats it - in this case, to filter for 5-star books and order by price - and prints each Entry. main performs scrapePage over pages 1 to 10.
The problem I've run into is that my code scrapes, filters and sorts each page, whereas I want to scrape all the pages then filter and sort.
What worked for two pages (in GHCi) was:
i <- scrapeURL ("https://books.toscrape.com/catalogue/page-1.html") allItems
j <- scrapeURL ("https://books.toscrape.com/catalogue/page-2.html") allItems
liftM2 (++) i j
This returns a list composed of page 1 and 2's results that I could then print, but I don't know how to implement this for all 50 result pages. Help would be appreciated.

Just return the entry list without any processing (or do the filtering at this stage):
-- needs: import Data.Maybe (maybeToList)
-- no error handling: a page that fails to scrape contributes an empty list
scrapePage :: Int -> IO [Entry]
scrapePage num =
    concat . maybeToList <$> scrapeURL ("https://books.toscrape.com/catalogue/page-" ++ show num ++ ".html") allItems
Then you can process all of them together later:
-- needs: import Data.List (sortOn)
process :: [Entry] -> [Entry]
process = filter (\e -> entRate e == 5) . sortOn entPrice
main :: IO ()
main = do
    entries <- concat <$> mapM scrapePage [1 .. 10]
    print $ process entries
Moreover, you can easily make the code concurrent with mapConcurrently from the async package:
-- needs: import Control.Concurrent.Async (mapConcurrently)
main = do
    entries <- concat <$> mapConcurrently scrapePage [1 .. 20]
    print $ process entries
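The combining step can be tried without any network access. The sketch below is only an illustration: samplePages is made-up data standing in for what scrapeURL would return for three pages (one of them failing), and mergePages is a hypothetical helper name, not part of Scalpel. It filters before sorting, which gives the same result as sorting first.

```haskell
import Data.List (sortOn)
import Data.Maybe (catMaybes)

-- Stand-in for the question's Entry type (same fields).
data Entry = Entry
  { entName  :: String
  , entPrice :: Float
  , entRate  :: Int
  } deriving (Eq, Show)

-- Merge per-page results; a page that failed to scrape (Nothing) is skipped.
mergePages :: [Maybe [Entry]] -> [Entry]
mergePages = concat . catMaybes

-- Keep five-star books, cheapest first.
process :: [Entry] -> [Entry]
process = sortOn entPrice . filter ((== 5) . entRate)

-- Hypothetical page results standing in for real scrapeURL output.
samplePages :: [Maybe [Entry]]
samplePages =
  [ Just [Entry "a" 30.0 5, Entry "b" 10.0 3]
  , Nothing
  , Just [Entry "c" 20.0 5]
  ]

main :: IO ()
main = mapM_ (putStrLn . entName) (process (mergePages samplePages))
```

Because the merge happens before process, the ordering is global across pages rather than per page.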

Related

How to call a function in haskell?

I have code that creates a list and then shuffles it, but I cannot run it because of a problem with the = in the main = do section. The error is "parse error on input".
This is the code:
import System.IO
import System.Random
shuffle :: [a] -> [a]
shuffle list = if length list < 2 then return list else do
i <- randomRIO (0, length list-1)
r <- shuffle (take i list ++ drop (i+1) list)
return (list!!i : r)
main = do --the problem is in this line
putStrLn "Enter the number:"
number <- getLine
let n = (read number :: Int)
let list = [1..n]
print list
shuffle list

Calling the function is not the problem. Your definition of shuffle has indentation problems, which the parser does not report until it reaches the main = do line.
import System.IO
import System.Random
shuffle :: [a] -> IO [a]
shuffle list = if length list < 2 then return list else do
    i <- randomRIO (0, length list-1)
    r <- shuffle (take i list ++ drop (i+1) list)
    return (list!!i : r)
main = do -- the parse error was reported on this line
    putStrLn "Enter the number:"
    number <- getLine
    let n = (read number :: Int)
    let list = [1..n]
    print list
    shuffled <- shuffle list
    print shuffled
Note the additional changes needed to work correctly with IO: shuffle now returns IO [a], and main binds its result with <- before printing it.

How can I read a sentence, separate the words and apply my function to each word? Haskell

I have a function that reads a word, keeps its first and last letters in place, shuffles the letters in between, and prints the result.
Example:
Hello -> H lle o
Now I want to read a whole sentence and do the same to each word in it. How can I do that?
import Data.List
import System.IO
import System.Random
import System.IO.Unsafe
import Data.Array.IO
import Control.Monad
import Data.Array
oracion = do
    frase <- getLine
    let pL = head frase
    let contentR = devContent frase
    charDisorder <- aleatorio contentR
    let uL = last frase
    putStrLn $ [pL] ++ charDisorder ++ [uL]
aleatorio :: [d] -> IO [d]
aleatorio xs = do
    ar <- newArray n xs
    forM [1..n] $ \i -> do
        t <- randomRIO (i,n)
        vi <- readArray ar i
        vt <- readArray ar t
        writeArray ar t vi
        return vt
  where
    n = length xs
    newArray :: Int -> [d] -> IO (IOArray Int d)
    newArray n xs = newListArray (1,n) xs
devContent :: [Char] -> [Char]
devContent x = init (drop 1 x)

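That should go like this:
doStuffOnSentence sentence = mapM aleatorio (words sentence)
Whenever you are dealing with monads (especially IO), mapM is a real lifesaver.
Note that aleatorio shuffles every letter of the word it is given; to keep the first and last letters fixed, apply it only to the interior of each word, as oracion does.
What's more, if you want to join the results back into one string, you can add (liftM comes from Control.Monad):
concatIoStrings = liftM unwords
doStuffAndConcat = concatIoStrings . doStuffOnSentence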
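To keep the first and last letter of every word in place, the per-word function can be wrapped before it is passed to mapM. The following is a rough sketch, not the answerer's code: onInterior and onSentence are hypothetical names, and reverse is used as a deterministic stand-in for the question's aleatorio so that the example runs with base alone.

```haskell
-- Apply a monadic transformation to the interior of one word,
-- leaving its first and last characters where they are.
-- Words shorter than three characters are returned unchanged.
onInterior :: Monad m => (String -> m String) -> String -> m String
onInterior f (c : rest@(_ : _ : _)) = do
  mid <- f (init rest)
  return (c : mid ++ [last rest])
onInterior _ w = return w

-- Split a sentence into words, transform each one, and join them again.
onSentence :: Monad m => (String -> m String) -> String -> m String
onSentence f = fmap unwords . mapM (onInterior f) . words

main :: IO ()
main = onSentence (return . reverse) "Hello brave world" >>= putStrLn
```

Replacing `return . reverse` with aleatorio should give the shuffled behaviour, since aleatorio already has the type String -> IO String.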

Simple Haskell program not behaving correct

I'm new to Haskell and am trying to write a simple program that finds the maximal element and its index in the input. I receive the values to compare one by one. I hold the maximal element in the maxi variable and its index in maxIdx. Here's my program:
loop = do
    let maxi = 0
    let maxIdx = 0
    let idx = 0
    let idxN = 0
    replicateM 5 $ do
        input_line <- getLine
        let element = read input_line :: Int
        if maxi < element
            then do
                let maxi = element
                let maxIdx = idx
                hPutStrLn stderr "INNER CHECK"
            else
                hPutStrLn stderr "OUTER CHECK"
    let idx = idxN + 1
    let idxN = idx
    print maxIdx
    loop
Even though I know the elements arrive in descending order (5, 4, 3, 2, 1), the program enters INNER CHECK every time (it should happen only for the first element!) and maxIdx is always 0.
What am I doing wrong?
Thanks in advance.

Anyway, let's have fun.
loop = do
    let maxi = 0
    let maxIdx = 0
    let idx = 0
    let idxN = 0
    replicateM 5 $ do
        input_line <- getLine
        let element = read input_line :: Int
        if maxi < element
            then do
                let maxi = element
                let maxIdx = idx
                hPutStrLn stderr "INNER CHECK"
            else
                hPutStrLn stderr "OUTER CHECK"
    let idx = idxN + 1
    let idxN = idx
    print maxIdx
    loop
is not particularly Haskelly code (and, as you know, is not particularly correct).
Let's make it Haskellier.
What do we do here? We have an infinite loop that reads a line five times, does something with each one, and then calls itself again for no particular reason.
Let's split it:
import Control.Monad
readFiveLines :: IO [Int]
readFiveLines = replicateM 5 readLn
addIndex :: [Int] -> [(Int, Int)]
addIndex xs = zip xs [0..]
findMaxIndex :: [Int] -> Int
findMaxIndex xs = snd (maximum (addIndex xs))
loop :: ()
loop = loop
main :: IO ()
main = do xs <- readFiveLines
          putStrLn (show (findMaxIndex xs))
snd returns the second element of a tuple; readLn is essentially fmap read getLine (it reads a line and parses it); zip takes two lists and returns a list of pairs; maximum finds the maximum value, and for pairs it compares the first components before the second.
I left loop intact in its original beauty.
You can be even Haskellier if you remember that something (huge expression) can be replaced with something $ huge expression ($ simply applies its left operand to its right operand), and the functions can be combined with .: f (g x) is the same as (f . g) x, or f . g $ x (see? it's working for the left side as well!). Additionally, zip x y can be rewritten as x `zip` y
import Control.Monad
readFiveLines :: IO [Int]
readFiveLines = replicateM 5 readLn
addIndex :: [Int] -> [(Int, Int)]
addIndex = (`zip` [0..])
findMaxIndex :: [Int] -> Int
findMaxIndex = snd . maximum . addIndex
main :: IO ()
main = do xs <- readFiveLines
          putStrLn . show . findMaxIndex $ xs
As for debug printing, there is a module called Debug.Trace (part of base) with a function traceShow, which prints its first argument (formatted with show, hence the name) to stderr and returns its second argument:
-- needs: import Debug.Trace (traceShow)
findMaxIndex :: [Int] -> Int
findMaxIndex = snd . (\xs -> traceShow xs (maximum xs)) . addIndex
That allows you to tap into any expression and see what's coming in (and what the surrounding values are: you can show tuples, lists, etc.)
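The reason the original loop never changes maxi is that a let inside the inner do block introduces a new, shadowing binding; it does not assign to the outer one. One way to express "update the running maximum" without mutation is a fold whose accumulator is the state. This is a minimal sketch with hypothetical names (step, maxWithIndex), and unlike findMaxIndex above it reports the first index on ties.

```haskell
import Data.List (foldl')

-- Running state: (largest value so far, its index).
-- Each step returns a new state instead of modifying the old one.
step :: (Int, Int) -> (Int, Int) -> (Int, Int)
step (best, bestIdx) (idx, x)
  | x > best  = (x, idx)
  | otherwise = (best, bestIdx)

-- Nothing for an empty input, otherwise Just (maximum, index of its first occurrence).
maxWithIndex :: [Int] -> Maybe (Int, Int)
maxWithIndex []       = Nothing
maxWithIndex (x : xs) = Just (foldl' step (x, 0) (zip [1 ..] xs))

main :: IO ()
main = print (maxWithIndex [5, 4, 3, 2, 1])
```

Seeding the fold with the first element avoids the question's assumption that 0 is smaller than every input.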

I think alf's answer is very good, but for what it's worth, here's how I would interpret your intention.
{-# LANGUAGE FlexibleContexts #-}
module Main where
import System.IO
import Control.Monad (replicateM_) -- mtl 2.3 and later no longer re-export Control.Monad
import Control.Monad.State
data S = S { maximum :: Int
           , maximumIndex :: Int
           , currentIndex :: Int }
update :: Int -> Int -> S -> S
update m mi (S _ _ ci) = S m mi ci
increment :: S -> S
increment (S m mi ci) = S m mi (ci+1)
next :: (MonadIO m, MonadState S m) => m ()
next = do
    S maxi maxIdx currIdx <- get
    input <- liftIO $ getLine
    let element = read input :: Int
    if maxi < element
        then do
            modify (update element currIdx)
            liftIO $ hPutStrLn stderr "INNER CHECK"
        else
            liftIO $ hPutStrLn stderr "OUTER CHECK"
    modify increment
run :: Int -> IO S
run n = execStateT (replicateM_ n next) (S 0 0 0)
main :: IO ()
main = do
    S maxi maxIdx _ <- run 5
    putStrLn $ "maxi: " ++ (show maxi) ++ " | maxIdx: " ++ (show maxIdx)
This uses a monad transformer to combine a stateful computation with IO. The get function retrieves the current state, and the modify function lets you change the state.

zip AST with bool list

I have an AST representing a Haskell program and a bit vector (a list of Bool) recording, in order, whether each pattern carries a strictness annotation. For example, 1000 represents a program with 4 Pats where the first one is a BangPat. Is there any way to turn the annotations in the AST on and off according to the list?
-- EDIT: further clarify what I want editBang to do
Based on user5042's answer:
Simple.hs :=
main = do
    case args of
        [] -> error "blah"
        [!x] -> putStrLn "one"
        (!x : xs) -> putStrLn "many"
And I want editBang "Simple.hs" [True, True, True, True] to produce
main = do
    case args of
        [] -> error "blah"
        [!x] -> putStrLn "one"
        (!(!x : !xs)) -> putStrLn "many"
given that these are the only 4 places where ! can appear.

As a first step, here's how to use transformBi:
import Data.Data
import Control.Monad
import Data.Generics.Uniplate.Data
import Language.Haskell.Exts
import Text.Show.Pretty (ppShow)
changeNames x = transformBi change x
    where change (Ident str) = Ident ("foo_" ++ str)
          change x = x
test2 = do
    content <- readFile "Simple.hs"
    case parseModule content of
        ParseFailed _ e -> error e
        ParseOk a -> do
            let a' = changeNames a
            putStrLn $ ppShow a'
The changeNames function finds every occurrence of an Ident s in the source tree and replaces it with Ident ("foo_"++s).
There is a monadic version called transformBiM, which allows the replacement function to be monadic. That lets you consume elements from your list of Bools as you encounter bang patterns.
Here is a complete working example:
import Control.Monad
import Data.Generics.Uniplate.Data
import Language.Haskell.Exts
import Text.Show.Pretty (ppShow)
import Control.Monad.State.Strict
parseHaskell path = do
    content <- readFile path
    let mode = ParseMode path Haskell2010 [EnableExtension BangPatterns] False False Nothing
    case parseModuleWithMode mode content of
        ParseFailed _ e -> error $ path ++ ": " ++ e
        ParseOk a -> return a
changeBangs bools x = runState (transformBiM go x) bools
    where go pp@(PBangPat p) = do
              flags <- get
              case flags of
                  [] -> return pp -- no flags left: leave the pattern alone
                  (b:bs) -> do
                      put bs
                      -- as written, True strips the bang and False keeps it
                      if b then return p else return pp
          go x = return x
test = do
    a <- parseHaskell "Simple.hs"
    putStrLn $ unlines . map ("before: " ++) . lines $ ppShow a
    let a' = changeBangs [True,False] a
    putStrLn $ unlines . map ("after : " ++) . lines $ ppShow a'
You might also look into using rewriteBiM.
The file Simple.hs:
main = do
    case args of
        [] -> error "blah"
        [!x] -> putStrLn "one"
        (!x : xs) -> putStrLn "many"
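The core idea, threading a list of flags through a traversal and consuming one flag per pattern, can be seen in isolation on a toy type. The sketch below does not use haskell-src-exts at all: Pat, setBang and applyFlags are hypothetical names invented for illustration, and mapAccumL from Data.Traversable plays the role that the State monad plays in changeBangs. Here True means "make the pattern strict", matching the question's 1000 example.

```haskell
import Data.Traversable (mapAccumL)

-- Toy stand-in for a pattern: either plain or carrying a bang.
data Pat = Plain String | Bang String
  deriving (Eq, Show)

-- Consume one flag per pattern; when the flags run out, leave the pattern as is.
setBang :: [Bool] -> Pat -> ([Bool], Pat)
setBang [] p = ([], p)
setBang (b : bs) p = (bs, if b then Bang (name p) else Plain (name p))
  where
    name (Plain s) = s
    name (Bang s)  = s

-- Walk the patterns left to right, threading the remaining flags along.
applyFlags :: [Bool] -> [Pat] -> [Pat]
applyFlags flags = snd . mapAccumL setBang flags

main :: IO ()
main = print (applyFlags [True, False, False, False]
                         [Plain "a", Bang "b", Plain "c", Plain "d"])
```

On a real AST the order in which patterns are visited is decided by the traversal function, so it is worth checking that it matches the order the bit vector assumes.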

How can I get the value of a Monad without System.IO.Unsafe? [duplicate]

I just started learning Haskell and got my first project working today. It's a small program that uses Network.HTTP.Conduit and Graphics.Rendering.Chart (haskell-chart) to plot the number of Google search results for a specific query with a changing number in it.
My problem is that simpleHttp from the conduit package returns a monad (I hope I understood the concept of monads right...), but I only want to use the ByteString inside it, which contains the HTML code of the website. So until now I have used download = unsafePerformIO $ simpleHttp url so that I can use the result later without caring about the monad. I guess that's not the best way to do it.
So: is there a better solution, so that I don't have to carry the monad with me through the whole evaluation? Or would it be better to leave the result the way it is returned (with the monad)?
Here's the full program; the line in question is in getResultCounter. If anything is coded poorly and could be done much better, please point that out too:
import System.IO.Unsafe
import Network.HTTP.Conduit (simpleHttp)
import qualified Data.ByteString.Lazy.Char8 as L
import Graphics.Rendering.Chart.Easy
import Graphics.Rendering.Chart.Backend.Cairo
numchars :: [Char]
numchars = "1234567890"
isNum :: Char -> Bool
isNum = (\x -> x `elem` numchars)
main = do
    putStrLn "Please input your Search (The first 'X' is going to be replaced): "
    search <- getLine
    putStrLn "X ranges from: "
    from <- getLine
    putStrLn "To: "
    to <- getLine
    putStrLn "In steps of (Only whole numbers are accepted):"
    step <- getLine
    putStrLn "Please have some patience..."
    let range = [read from,(read from + read step)..read to] :: [Int]
    let searches = map (replaceX search) range
    let res = map getResultCounter searches
    plotList search ([(zip range res)] :: [[(Int,Integer)]])
    putStrLn "Done."
-- Creates a plot from the given data
plotList name dat = toFile def (name++".png") $ do
    layout_title .= name
    plot (line "Results" dat)
-- Calls the Google-site and returns the number of results
getResultCounter :: String -> Integer
getResultCounter search = read $ filter isNum $ L.unpack parse :: Integer
    where url = "http://www.google.de/search?q=" ++ search
          download = unsafePerformIO $ simpleHttp url -- Not good
          parse = takeByteStringUntil "<"
                  $ dropByteStringUntil "id=\"resultStats\">" download
-- Drops a ByteString until the desired String is found
dropByteStringUntil :: String -> L.ByteString -> L.ByteString
dropByteStringUntil str cont = helper str cont 0
    where helper s bs n | (bs == L.empty) = L.empty
                        | (n >= length s) = bs
                        | ((s !! n) == L.head bs) = helper s (L.tail bs) (n+1)
                        | ((s !! n) /= L.head bs) = helper s (L.tail bs) 0
-- Takes a ByteString until the desired String is found
takeByteStringUntil :: String -> L.ByteString -> L.ByteString
takeByteStringUntil str cont = helper str cont 0
    where helper s bs n | bs == L.empty = bs
                        | n >= length s = L.empty
                        | s !! n == L.head bs = L.head bs `L.cons`
                                                helper s (L.tail bs) (n + 1)
                        | s !! n /= L.head bs = L.head bs `L.cons`
                                                helper s (L.tail bs) 0
-- Replaces the first 'X' in a string with the show value of the given value
replaceX :: (Show a) => String -> a -> String
replaceX str x | str == "" = ""
               | head str == 'X' = show x ++ tail str
               | otherwise = head str : replaceX (tail str) x

This is a lie:
getResultCounter :: String -> Integer
The type signature above is promising that the resulting integer only depends on the input string, when this is not the case: Google can add/remove results from one call to the other, affecting the output.
Making the type more honest, we get
getResultCounter :: String -> IO Integer
This honestly admits it's going to interact with the external world. The code then is easily adapted to:
getResultCounter search = do
    let url = "http://www.google.de/search?q=" ++ search
    download <- simpleHttp url -- perform IO here
    let parse = takeByteStringUntil "<"
                $ dropByteStringUntil "id=\"resultStats\">" download
    return (read $ filter isNum $ L.unpack parse :: Integer)
Above, I tried to preserve the original structure of the code.
Now, in main we can no longer do
let res = map getResultCounter searches
but we can do
res <- mapM getResultCounter searches
No extra import is needed for this, since mapM is exported by the Prelude.
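The difference between map and mapM here can be checked with a stand-in action that needs no network. In this sketch fakeResultCounter is a made-up replacement for getResultCounter (it just returns the length of the search string); everything else is from the Prelude.

```haskell
-- Hypothetical stand-in for the network call: "counts results" by string length.
fakeResultCounter :: String -> IO Integer
fakeResultCounter s = return (fromIntegral (length s))

main :: IO ()
main = do
  let searches = ["haskell", "io"]
  -- map only builds a list of actions ([IO Integer]); nothing has run yet.
  let actions = map fakeResultCounter searches
  -- sequence runs the actions in order and collects their results.
  res <- sequence actions
  -- mapM does both steps at once.
  res' <- mapM fakeResultCounter searches
  print (res, res')
```

Both res and res' are ordinary [Integer] values inside main, so they can be passed to pure code such as zip without any unsafePerformIO.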
