I'm writing a sort of scraper or data miner in Haskell. It consists of a main loop and other shared logic, along with a number of "adapters" each of which is designed to scrape a particular type of resource (not only web pages but possibly filesystem objects as well). The adapters all produce the same type of result but I would like them to be independent otherwise. Also, I would like them to share the main loop and other logic.
This is what I have so far, using ExistentialQuantification to hide the adapter affiliation of a scraping job. My idea is that the main loop processes a sequence of jobs, dispatching on the "process" method to find the right adapter implementation.
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ExistentialQuantification #-}
import Control.Monad.Trans.Except
import Data.Text (Text)
import qualified Data.Text as T
-- Metadata about an article
data Article = Article
{ articleSource :: Text
, articleTitle :: Text
, articleUrl :: Text
, articleText :: Text
, articleDate :: Maybe Text
}
deriving (Show)
-- Adapter for scraping a certain type of resource.
--
-- The adapter provides a seed job constructor which samples the current config
-- and creates a seed job to be run first. The seed job then generates all
-- other jobs, e.g. by scraping an index page.
--
-- Each adapter module exports a value of this type, and nothing else.
data Adapter = Adapter
{ adapterName :: Text
, adapterSeedJob :: ScraperConfig -> AnyJob
}
-- Each scrape operation produces zero or more articles, and zero or more
-- new scrape jobs.
type ScrapeResult j = ExceptT AdapterErr IO ([Article], [j])
-- Specification of a resource to be scraped by a certain adapter.
--
-- Each adapter defines its own job type, containing the necessary information.
class ScrapeJob j where
jobAdapter :: j -> Text
jobDesc :: j -> Text
jobProcess :: j -> ScrapeResult j
-- Opaque type encapsulating any adapter's job type, used outside the adapter.
data AnyJob = forall j. ScrapeJob j => AnyJob j
instance ScrapeJob AnyJob where
jobAdapter (AnyJob j) = jobAdapter j
jobDesc (AnyJob j) = jobDesc j
jobProcess (AnyJob j) = wrap <$> jobProcess j
where wrap (as, js) = (as, AnyJob <$> js)
-- Global configuration, such as which passwords to use
data ScraperConfig = ScraperConfig
{ -- ...
}
My problem is that each adapter also has some context attached to it. For example, most adapters need to perform some kind of login procedure before they can access any data. I would like this to be handled separately from the scraping itself if possible, but I can't use the trick with AnyJob for "AnyContext" (I think), since the main loop would have no way to guarantee that the types of the AnyJob and AnyContext match up correctly.
My current ideas, none of which I'm really satisfied with, are:
Fix the type of the context to a "bag of data" such as Map Text Text and add a special setup method to each adapter which creates this context.
Add the required context as a field of every ScrapeJob instance, and copy it along explicitly whenever a new job is created.
Turn the design upside down, letting each adapter run its own main loop using shared functions defined in a utility module.
Is there something I'm missing here? Any advice on how to improve on this design would be appreciated.
Thanks!
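One direction that may be worth considering is to replace the type class and existential wrapper with a plain record of closures, so that each adapter's context is captured inside the jobs it creates and the main loop never sees its type. The following is only a minimal sketch under assumptions: Article and ScraperConfig are cut down to one field each, the ExceptT layer is left out for brevity, and Session, adapterSetup, configPassword, exampleAdapter, indexJob, pageJob and runJobs are hypothetical names that do not appear in the code above.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)

-- Simplified stand-ins for the question's types (assumptions for this sketch).
newtype Article = Article { articleTitle :: Text } deriving (Show, Eq)
newtype ScraperConfig = ScraperConfig { configPassword :: Text }

-- A job is a record of closures instead of a class instance.
-- Whatever context the adapter needs is captured inside jobProcess.
data Job = Job
  { jobDesc    :: Text
  , jobProcess :: IO ([Article], [Job])
  }

data Adapter = Adapter
  { adapterName  :: Text
  , adapterSetup :: ScraperConfig -> IO Job  -- e.g. log in, then return the seed job
  }

-- Hypothetical adapter-private context; never exported from the adapter module.
newtype Session = Session { sessionToken :: Text }

exampleAdapter :: Adapter
exampleAdapter = Adapter "example" $ \cfg -> do
  -- Stands in for a real login procedure.
  let session = Session ("token-for-" <> configPassword cfg)
  pure (indexJob session)

-- The seed job produces no articles, only follow-up jobs sharing the session.
indexJob :: Session -> Job
indexJob s = Job "index" $ pure ([], [pageJob s "first", pageJob s "second"])

pageJob :: Session -> Text -> Job
pageJob s name = Job ("page " <> name) $
  pure ([Article (name <> " via " <> sessionToken s)], [])

-- Shared main loop: runs jobs in order, queueing the jobs each one produces.
runJobs :: [Job] -> IO [Article]
runJobs []       = pure []
runJobs (j : js) = do
  (articles, newJobs) <- jobProcess j
  rest <- runJobs (js ++ newJobs)
  pure (articles ++ rest)
```

Because the session only ever travels inside closures built by the same adapter, a job cannot be paired with another adapter's context. The trade-off is that jobs become opaque values that cannot be inspected or serialized.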
I'm creating an application that allows an admin to build a form, which a user can fill out. Questions can be of different types. Each kind of question corresponds to a type of response data.
Is it possible to encode this at the type level? How would you organize this?
data QuestionType = EmailText | PlainText | Numeric
-- this doesn't have any response information, it's the metadata
-- about the question itself.
data Question = Question { id :: ID, label :: Text, questionType :: QuestionType }
data Answer = Answer { questionID :: ID, response :: Response }
-- I would like to map those question types to different response types
data Response = ???
-- EmailText => Text
-- PlainText => Text
-- Numeric => Int
I've thought about Type Families, which would work perfectly except that I want to map from different data constructors to different types, and type families would require separate types for each one.
This would be a good fit for a single ADT, with the response information included in each constructor, but I need to be able to deal with question types independently from responses.
How should I approach this?
I haven't completely understood exactly what you want, but maybe this can be a starting point:
{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}
data Response (qt :: QuestionType) where
RPlainText :: Text -> Response PlainText
REmailText :: Text -> Response EmailText
RNumeric :: Int -> Response Numeric
data Answer qt = Answer {questionID :: ID, response :: Response qt}
If you do not want the qt argument in Answer qt, you probably need existential types to hide that, but at that point you'll probably want to link it with the questions somehow.
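To illustrate the existential option mentioned above, here is a minimal sketch. SomeResponse, responseType and renderResponse are hypothetical names introduced only for this example. The wrapper hides the qt index so that responses of different kinds can live in one list, and pattern matching on the GADT constructors recovers the question type at runtime.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}
import Data.Text (Text)
import qualified Data.Text as T

data QuestionType = EmailText | PlainText | Numeric
  deriving (Show, Eq)

-- Responses indexed by the (promoted) question type.
data Response (qt :: QuestionType) where
  RPlainText :: Text -> Response 'PlainText
  REmailText :: Text -> Response 'EmailText
  RNumeric   :: Int  -> Response 'Numeric

-- Existential wrapper: forgets the index (hypothetical helper type).
data SomeResponse where
  SomeResponse :: Response qt -> SomeResponse

-- The constructor still tells us which question type the response belongs to.
responseType :: SomeResponse -> QuestionType
responseType (SomeResponse r) = case r of
  RPlainText _ -> PlainText
  REmailText _ -> EmailText
  RNumeric _   -> Numeric

-- Each branch knows the payload type for its constructor.
renderResponse :: SomeResponse -> String
renderResponse (SomeResponse r) = case r of
  RPlainText t -> T.unpack t
  REmailText t -> "<" ++ T.unpack t ++ ">"
  RNumeric n   -> show n
```

A stored answer could then compare responseType against the questionType of its question when it is loaded, which is one way to link the two.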
Having types depend on values is exactly what dependent types are for. Unfortunately, Haskell doesn't have dependent types. However, you can use a sum type to capture the possible response types:
data Response = ResponseText Text | ResponseInt Int
I've decided to try functional programming and Purescript. After reading "Learn you a Haskell for great good" and "PureScript by Example" and playing with code a little I think that I can say that I understand the basics, but one thing bothers me a lot - code looks very coupled. It's usual for me to change libraries very often and in OOP I can use onion architecture to decouple my own code from the library specific one, but I have no idea how to do this in Purescript.
I've tried to find how people do this in Haskell, but all I could find were answers like "No one has ever made complex apps in Haskell, so no one knows how to do it" or "You have input and you have output, everything in between are just pure functions". But at this moment I have a toy app that uses virtual DOM, signals, web storage, and router libs, and each of them has its own effects and data structures, so it doesn't sound like one input and one output.
So my question is: how should I structure my code, or what techniques should I use, so that I can change my libs without rewriting half of my app?
Update:
Suggestion to use several layers and keep effects in the main module is quite common too and I understand why I should do so.
Here is a simple example that hopefully will illustrate the problem I'm talking about:
btnHandler :: forall ev eff. (MouseEvent ev) => ev -> Eff (dom :: DOM, webStorage :: WebStorage, trace :: Trace | eff) Unit
btnHandler e = do
btn <- getTarget e
Just btnId <- getAttribute "id" btn
Right clicks <- (getItem localStorage btnId) >>= readNumber
let newClicks = clicks + 1
trace $ "Button #" ++ btnId ++ " has been clicked " ++ (show newClicks) ++ " times"
setText (show newClicks) btn
setItem localStorage btnId $ show newClicks
-- ... maybe some other actions
return unit
-- ... other handlers for different controllers
btnController :: forall e. Node -> _ -> Eff (dom :: DOM, webStorage :: WebStorage, trace :: Trace | e) Unit
btnController mainEl _ = do
delegateEventListener mainEl "click" "#btn1" btnHandler
delegateEventListener mainEl "click" "#btn2" btnHandler
delegateEventListener mainEl "click" "#btn3" btnHandler
-- ... render buttons
return unit
-- ... other controllers
main :: forall e. Eff (dom :: DOM, webStorage :: WebStorage, trace :: Trace, router :: Router | e) Unit
main = do
Just mainEl <- body >>= querySelector "#wrapper"
handleRoute "/" $ btnController mainEl
-- ... other routes, each with its own controller
return unit
Here we have a simple counter app with routing, web storage, DOM manipulation, and console logging. As you can see, there is no single input and single output. We can get inputs from the router or event listeners and use the console or the DOM as output, so it becomes a little more complicated.
Having all this effectful code in the main module feels wrong to me for two reasons:
If I keep adding routes and controllers, this module will quickly turn into a thousand-line mess.
Keeping routing, DOM manipulation, and data storage in the same module violates the single responsibility principle (and I assume that it is important in FP too).
We can split this module into several, for example one module per controller, and create some kind of effectful layer. But then, when I have ten controller modules and want to change my DOM-specific lib, I have to edit them all.
Both of these approaches are far from ideal, so the question is: which one should I choose? Or is there some other way to go?
There's no reason you can't have a middle layer for abstracting over dependencies. Let's say you want to use a router for your application. You can define a "router abstraction" library that would look like the following:
module App.Router where
import SomeRouterLib
-- Type synonym to make it easy to change later
type Route = SomeLibraryRouteType
-- Just an alias to the Router library
makeRoute :: String -> Route -> Route
makeRoute = libMakeRoute
And then the new shiny comes out, and you want to switch your routing library. You'll need to make a new module that exposes the same API but implements it with the new library's functions -- an adapter, if you will.
module App.RouterAlt where
import AnotherRouterLib
type Route = SomeOtherLibraryType
makeRoute :: String -> Route -> Route
makeRoute = otherLibMakeRoute
In your main app, you can now swap the imports, and everything should work alright. There will likely be more massaging that needs to happen to get the types and functions working as you'd expect them, but that's the general idea.
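Swapping imports is one way to do this. Another is to pass the dependency around as a record of functions, so that application code only ever sees the interface. The sketch below is in Haskell rather than PureScript and is illustrative only: Storage, incrementClicks and inMemoryStorage are hypothetical names, and the in-memory implementation stands in for whatever real storage library would sit behind the interface.

```haskell
import Data.IORef (modifyIORef', newIORef, readIORef)
import qualified Data.Map as M

-- The only storage operations the application is allowed to use.
data Storage = Storage
  { getItem :: String -> IO (Maybe String)
  , setItem :: String -> String -> IO ()
  }

-- Application logic depends on the interface, not on a concrete library.
incrementClicks :: Storage -> String -> IO Int
incrementClicks storage key = do
  old <- getItem storage key
  let n = maybe 0 read old + 1   -- a missing key counts as zero clicks
  setItem storage key (show n)
  pure n

-- One adapter: an in-memory store. A second adapter wrapping another
-- library would only need to build the same record.
inMemoryStorage :: IO Storage
inMemoryStorage = do
  ref <- newIORef M.empty
  pure Storage
    { getItem = \k -> M.lookup k <$> readIORef ref
    , setItem = \k v -> modifyIORef' ref (M.insert k v)
    }
```

With this shape, changing the storage library means writing one new function that returns a Storage value. The controllers that call incrementClicks stay untouched.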
Your example code is very imperative in nature. It's not idiomatic functional code, and I think you're correct in noting that it's not sustainable. More functional idioms include purescript-halogen and purescript-thermite.
Consider the UI as a pure function of current application state. In other words, given the current value of things, what does my app look like? Also, consider that the current state of the application can be derived from applying a series of pure functions to some initial state.
What is your application state?
data AppState = AppState { buttons :: [Button] }
data Button = Button { numClicks :: Integer }
What kind of events are you looking at?
data Event = ButtonClick { buttonId :: Integer }
How do we handle that Event?
handleEvent :: AppState -> Event -> AppState
handleEvent state (ButtonClick id) =
let newButtons = incrementButton id (buttons state)
in AppState { buttons = newButtons }
incrementButton :: Integer -> [Button] -> [Button]
incrementButton _ [] = []
incrementButton 0 (b:bs) = Button (1 + numClicks b) : bs
incrementButton i (b:bs) = b : incrementButton (i - 1) bs
How do you render the application, based on the current state?
-- Html is assumed to be a plain String for the purposes of this example.
type Html = String
render :: AppState -> Html
render state =
  let currentButtons = buttons state
      btnList = concatMap renderButton currentButtons
      renderButton btn = "<li><button>" ++ show (numClicks btn) ++ "</button></li>"
  in "<div><ul>" ++ btnList ++ "</ul></div>"
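The idea that the current state is the result of applying pure functions to an initial state can be written as a fold over the events seen so far. This is a small self-contained sketch, not the answer's exact code: the types are reduced to newtypes, handleEvent is rewritten with an index-based list comprehension, and replay is a hypothetical helper name.

```haskell
import Data.List (foldl')

newtype Button = Button { numClicks :: Integer } deriving (Show, Eq)
newtype AppState = AppState { buttons :: [Button] } deriving (Show, Eq)
newtype Event = ButtonClick { buttonId :: Int }

-- Pure transition function: bump the click count of the button at index i.
handleEvent :: AppState -> Event -> AppState
handleEvent (AppState bs) (ButtonClick i) =
  AppState [ if j == i then Button (numClicks b + 1) else b
           | (j, b) <- zip [0 ..] bs ]

-- The current state is the initial state with every event applied in order.
replay :: AppState -> [Event] -> AppState
replay = foldl' handleEvent
```

Only the code that collects events and displays the rendered result needs to touch effects. handleEvent and replay can be tested without a browser.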
This is a bit of an open ended question, so it's hard to answer specifically without concrete examples.
You have input and you have output, everything in between are just pure functions
Statements like this are actually pretty close to the truth. Since there are no stateful objects in Haskell and PureScript, the majority of the code in an app will be based around pure functions and simple data types (or records), and therefore it is not tightly coupled to any particular library (aside from things like Maybe, Either, Tuple, and so on, which aren't really libraries in the sense you're talking about).
As much as possible you should try to push code that uses effects to the “outside”. This is where you interleave the various libraries you require to process whatever inputs and produce whatever outputs your app requires. This layering makes it easy to switch libraries in and out, as here you'll mostly be lifting your core pure code into the Eff monad to “wire it up” to the external inputs and outputs.
One way of looking at it is that if you find yourself using Eff much outside of the main module or top layer of your app, you're probably “doing it wrong”.
If you're writing Haskell, substitute anywhere I mention Eff with IO.
Is it possible to check if a function is defined, and use it as the Just value of a Maybe type if it is? And use Nothing if it's not defined, of course.
I'm writing a wrapper around atom for use with the TI MSP430 line. Part of what I'm doing is making a function that quickly compiles the code in the right format for MSP430 controllers - for example, compiling an atom to use in a timer interrupt requires a function definition like so:
#pragma vector=TIMERA0_VECTOR
__interrupt void timerAisr(void) {
...
}
At the moment, I have an object that holds references to the function the user would like to use for each different ISR. It looks a bit like this:
mspProgram = MSP430Compilation {
setupFn = Nothing,
setupFnName = "setup",
loopFn = Nothing,
loopFnName = "loop",
timerAISR = Nothing,
timerAISRName = "timerAISR",
And so on. Very configurable - you can choose the name of the function to output in C code, and the Atom to compile for that function. But I've decided I'd like to take more of a convention-over-configuration approach and basically assume some sensible function names. So instead of passing one of these configuration objects, I want the compilation code to check for definitions of sensibly-named functions.
For example, if the user defines an Atom called timerAISR, then my code should compile that atom to a C function named the same, with the appropriate #pragma matter for it to service the timer A interrupt.
So what I need to do is sort of meta-Haskell, checking if the user has defined a function and using that in my library code. I imagine this might involve template Haskell, so I'm off to research it.
EDIT:
I've realised that my original solution was too simplistic once I tried to fit it into my actual code. I hadn't absorbed Haskell's namespacing, so I didn't realise that lookupValueName would not work on values defined in user code. Here's the situation I'm dealing with:
main.hs:
module Main where
import Library
a = 1
main = libraryMain
Library.hs:
{-# LANGUAGE TemplateHaskell #-}
module Library where
import Template
libraryMain :: IO ()
libraryMain = do
$(printSomethingIfIsDefined "a")
$(printSomethingIfIsDefined "b")
Template.hs:
{-# LANGUAGE TemplateHaskell #-}
module Template where
import Language.Haskell.TH
printSomethingIfIsDefined name = do
maybeFn <- lookupValueName name
case maybeFn of
Just fn -> [| putStrLn "It's defined!" |]
Nothing -> [| return () |]
This prints nothing. If I define a in Library.hs, it will print out once, because a is defined in that scope.
Setup:
I have several collections of various data structures which represent the state of simulated objects in a virtual system. I also have a number of functions that transform these objects (that is, create a new copy of the object based on the original and zero or more parameters).
The goal is to allow a user to select some objects to apply transformations to (within the rules of the simulation), apply those functions to those objects, and update the collections by replacing the old objects with the new ones.
I would like to be able to build up a function of this type by combining smaller transformations into larger ones. Then evaluate this combined function.
Questions:
How do I structure my program to make this possible?
What kind of combinator do I use to build up a transaction like this?
Ideas:
1. Put all the collections into one enormous structure and pass this structure around.
2. Use a state monad to accomplish basically the same thing.
3. Use IORef (or one of its more potent cousins like MVar) and build up an IO action.
4. Use a Functional Reactive Programming framework.
1 and 2 seem like they carry a lot of baggage around especially if I envision eventually moving some of the collections into a database. (Darn IO Monad)
3 seems to work well but starts to look a lot like recreating OOP. I'm also not sure at what level to use the IORef. (e.g IORef (Collection Obj) or Collection (IORef Obj) or data Obj {field::IORef(Type)} )
4 feels the most functional in style, but it also seems to create a lot of code complexity without much payoff in terms of expressiveness.
Example
I have a web storefront. I maintain a collection of products with (among other things) the quantity in stock and a price. I also have a collection of users who have credit with the store.
A user comes along, selects 3 products to buy, and goes to check out using store credit. I need to create a new products collection that has the amount in stock for the 3 products reduced, and a new user collection with the user's account debited.
This means I get the following:
checkout :: Cart -> ProductsCol -> UserCol -> (ProductsCol, UserCol)
But then life gets more complicated and I need to deal with taxes:
checkout :: Cart -> ProductsCol -> UserCol -> TaxCol
-> (ProductsCol, UserCol, TaxCol)
And then I need to be sure to add the order to the shipping queue:
checkout :: Cart
-> ProductsCol
-> UserCol
-> TaxCol
-> ShipList
-> (ProductsCol, UserCol, TaxCol, ShipList)
And so forth...
What I would like to write is something like
checkout = updateStockAmount <*> applyUserCredit <*> payTaxes <*> shipProducts
applyUserCredit = debitUser <*> creditBalanceSheet
but the type-checker would go apoplectic on me. How do I structure this store such that the checkout or applyUserCredit functions remain modular and abstract? I cannot be the only one to have this problem, right?
Okay, let's break this down.
You have "update" functions with types like A -> A for various specific types A, which may be derived from partial application, that specify a new value of some type in terms of a previous value. Each such type A should be specific to what that function does, and it should be easy to change those types as the program develops.
You also have some sort of shared state, which presumably contains all the information used by any of the aforementioned update functions. Further, it should be possible to change what the state contains, without significantly impacting anything other than the functions acting directly on it.
Additionally, you want to be able to abstractly combine update functions, without compromising the above.
We can deduce a few necessary features of a straightforward design:
An intermediate layer will be necessary, between the full shared state and the specifics needed by each function, allowing pieces of the state to be projected out and replaced independently of the rest.
The types of the update functions themselves are by definition incompatible with no real shared structure, so to compose them you'll need to first combine each with the intermediate layer portion. This will give you updates acting on the entire state, which can then be composed in the obvious way.
The only operations needed on the shared state as a whole are to interface with the intermediate layer, and whatever may be necessary to maintain the changes made.
This breakdown allows each entire layer to be modular to a large extent; in particular, type classes can be defined to describe the necessary functionality, allowing any relevant instance to be swapped in.
In particular, this essentially unifies your ideas 2 and 3. There's an inherent monadic context of some sort here, and the type class interface suggested would allow multiple approaches, such as:
Make the shared state a record type, store it in a State monad, and use lenses to provide the interface layer.
Make the shared state a record type containing something like an STRef for each piece, and combine field selectors with ST monad update actions to provide the interface layer.
Make the shared state a collection of TChans, with separate threads to read/write them as appropriate to communicate asynchronously with an external data store.
Or any number of other variations.
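As a rough illustration of the intermediate layer described above, here is a minimal sketch that uses a hand-rolled getter/setter pair instead of a lens library or a monad. All names here (Store, Focus, over, stockF, creditF, shippingF, destock, debit) are hypothetical and chosen for this example. Each focused update only knows about its own piece of data, and over lifts it to an update on the whole state so the results compose with ordinary function composition.

```haskell
import qualified Data.Map as M

-- The full shared state (hypothetical fields).
data Store = Store
  { stock    :: M.Map String Int
  , credit   :: M.Map String Int
  , shipping :: [String]
  } deriving (Show, Eq)

-- Intermediate layer: how to project one piece out and put it back.
data Focus s a = Focus { view :: s -> a, set :: a -> s -> s }

-- Lift an update on a piece to an update on the whole state.
over :: Focus s a -> (a -> a) -> s -> s
over f g s = set f (g (view f s)) s

stockF, creditF :: Focus Store (M.Map String Int)
stockF  = Focus stock  (\x s -> s { stock = x })
creditF = Focus credit (\x s -> s { credit = x })

shippingF :: Focus Store [String]
shippingF = Focus shipping (\x s -> s { shipping = x })

-- Focused updates: these know nothing about Store.
destock :: String -> Int -> M.Map String Int -> M.Map String Int
destock item n = M.adjust (subtract n) item

debit :: String -> Int -> M.Map String Int -> M.Map String Int
debit user amount = M.adjust (subtract amount) user

-- Composition happens at the level of whole-state updates.
checkout :: String -> String -> Int -> Int -> Store -> Store
checkout user item n price =
    over shippingF (++ [item])
  . over creditF (debit user (n * price))
  . over stockF (destock item n)
```

Adding a new collection to Store then only requires a new Focus value. Existing updates such as destock and debit do not change.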
You can store your state in a record, and use lenses to update pieces of state. This lets you write the individual state updating components as simple, focused functions that may be composed to build more complex checkout functions.
{-# LANGUAGE TemplateHaskell #-}
import Data.Lens.Template
import Data.Lens.Common
import Data.List (foldl')
import Data.Map ((!), Map, adjust, fromList)
type User = String
type Item = String
type Money = Int -- money in pennies
type Prices = Map Item Money
type Cart = (User, [(Item,Int)])
type ProductsCol = Map Item Int
type UserCol = Map User Money
data StoreState = Store { _stock :: ProductsCol
, _users :: UserCol
, msrp :: Prices }
deriving Show
makeLens ''StoreState
updateProducts :: Cart -> ProductsCol -> ProductsCol
updateProducts (_,c) = flip (foldl' destock) c
where destock p' (item,count) = adjust (subtract count) item p'
updateUsers :: Cart -> Prices -> UserCol -> UserCol
updateUsers (name,c) p = adjust (subtract (sum prices)) name
where prices = map (\(itemName, itemCount) -> (p ! itemName) * itemCount) c
checkout :: Cart -> StoreState -> StoreState
checkout c s = (users ^%= updateUsers c (msrp s))
. (stock ^%= updateProducts c)
$ s
test = checkout cart store
where cart = ("Bob", [("Apples", 2), ("Bananas", 6)])
store = Store initialStock initialUsers prices
initialStock = fromList
[("Apples", 20), ("Bananas", 10), ("Lambdas", 1000)]
initialUsers = fromList [("Bob", 20000), ("Mary", 40000)]
prices = fromList [("Apples", 100), ("Bananas", 50), ("Lambdas", 0)]
Data.Binary is great. There is just one question I have. Let's imagine I've got a datatype like this:
import Control.Monad (liftM2)
import Data.Binary
data Ref = Ref {
refName :: String,
refRefs :: [(String, Ref)]
}
instance Binary Ref where
put a = put (refName a) >> put (refRefs a)
get = liftM2 Ref get get
It's easy to see that this is a recursive datatype, which works because Haskell is lazy. Since Haskell as a language uses neither references nor pointers, but presents the data as-is, I am not sure how this is going to be saved. I have a strong suspicion that this naive approach will lead to an infinite bytestring...
So how can this type be safely saved?
If your data has no cycles you'll be fine. But a cycle, like
r = Ref "a" [("b", r)]
is indeed going to generate an infinite result. The only way around this is for you to give unique labels to all nodes and use those to avoid cycles when converting to binary.
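The labelling idea can be sketched as follows, under the assumption that refName is unique per node and can therefore serve as the label. Flat, flatten, unflatten and roundTrip are hypothetical helper names. The graph is first flattened into a finite table keyed by name, and the table (a Map of lists of pairs of Strings, which already has a Binary instance) is what gets serialized. On the way back, the cyclic structure is rebuilt lazily from the table.

```haskell
import Data.Binary (decode, encode)
import qualified Data.Map as M

data Ref = Ref
  { refName :: String
  , refRefs :: [(String, Ref)]
  }

-- Finite representation: node name -> [(edge label, target node name)].
type Flat = M.Map String [(String, String)]

-- Walk the graph, skipping nodes that were already recorded,
-- so cycles terminate.
flatten :: Ref -> Flat
flatten = go M.empty
  where
    go seen (Ref n rs)
      | n `M.member` seen = seen
      | otherwise =
          foldl go (M.insert n [(l, refName r) | (l, r) <- rs] seen) (map snd rs)

-- Rebuild the (possibly cyclic) structure by tying the knot through a lazy map.
unflatten :: String -> Flat -> Ref
unflatten root flat = nodes M.! root
  where
    nodes = M.mapWithKey (\n es -> Ref n [(l, nodes M.! t) | (l, t) <- es]) flat

-- Serialize and deserialize the flat table.
roundTrip :: Flat -> Flat
roundTrip = decode . encode
```

If names are not unique in the real data, the nodes would need some other unique label assigned before flattening.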