How to create a diff of two complex data structures? - haskell

Problem specification:
I am currently searching for an elegant yet efficient solution to a problem that I guess is quite common. Consider the following situation:
I defined a file format based on a B-tree that is (in a simplified way) defined like this:
data FileTree = FileNode [Key] [FileOffset]
              | FileLeaf [Key] [Data]
Reading and writing this from a file into a lazy data structure is implemented and works just fine. The result is an instance of:
data MemTree = MemNode [Key] [MemTree]
             | MemLeaf [Key] [Data]
Now my goal is to have a generic function updateFile :: FilePath -> (MemTree -> MemTree) -> IO () that reads in the FileTree, converts it into a MemTree, applies the MemTree -> MemTree function, and writes the changes back to the tree structure. The problem is that the FileOffsets have to be preserved somehow.
I have two approaches to this problem. Both of them lack elegance and/or efficiency:
Approach 1: Extend MemTree to contain the offsets
This approach extends the MemTree to contain the offsets:
data MemTree = MemNode [Key] [(MemTree, Maybe FileOffset)]
             | MemLeaf [Key] [Data]
The read function then reads in the FileTree and stores the FileOffset alongside each MemTree reference. Writing checks whether a reference already has an associated offset and, if it does, simply reuses it.
Pros: easy to implement, no overhead to find the offset
Cons: exposes internals to the user, who becomes responsible for setting the offset to Nothing
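A minimal sketch of Approach 1, with Int standing in for the abstract Key, Data and FileOffset types, and a hypothetical writeFresh helper in place of the real serialization code:

```haskell
-- Sketch of Approach 1: each child reference carries Maybe FileOffset.
-- Int stands in for the abstract Key, Data and FileOffset types.
type Key        = Int
type Payload    = Int
type FileOffset = Int

data MemTree
  = MemNode [Key] [(MemTree, Maybe FileOffset)]
  | MemLeaf [Key] [Payload]

-- Writing a subtree: a Just offset marks a clean subtree whose on-disk
-- copy can be reused; Nothing marks a dirty subtree to be written out.
writeSubtree :: (MemTree, Maybe FileOffset) -> IO FileOffset
writeSubtree (_, Just off) = return off      -- clean: reuse stored offset
writeSubtree (t, Nothing)  = writeFresh t    -- dirty: serialize and write

-- Hypothetical stand-in for the real file-writing code.
writeFresh :: MemTree -> IO FileOffset
writeFresh _ = return 0
```

The con above shows up here directly: whoever edits the tree must remember to flip the offset to Nothing on every path they change.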
Approach 2: Store offsets in a secondary structure
Another way to attack this problem is to read in the FileTree and build a StableName.Map that holds onto the FileOffsets. That way (if I understand the semantics of StableName correctly) it should be possible to take the final MemTree and look up the StableName of each node in the StableName.Map. If there is an entry, the node is clean and doesn't have to be written again.
Pros: doesn't expose the internals to the user
Cons: involves overhead for lookups in the map
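The bookkeeping for Approach 2 might look like the sketch below. System.Mem.StableName provides no map type of its own, so this assumes an IntMap keyed by hashStableName, with a small collision list per bucket (StableName has an Eq instance but no Ord):

```haskell
import qualified Data.IntMap.Strict as IM
import System.Mem.StableName (StableName, makeStableName, hashStableName)

type Key        = Int
type FileOffset = Int

data MemTree
  = MemNode [Key] [MemTree]
  | MemLeaf [Key] [Key]

-- Offsets keyed by stable name: bucketed by hashStableName, with a
-- collision list per bucket.
type OffsetMap = IM.IntMap [(StableName MemTree, FileOffset)]

insertOffset :: MemTree -> FileOffset -> OffsetMap -> IO OffsetMap
insertOffset t off m = do
  sn <- makeStableName t
  return (IM.insertWith (++) (hashStableName sn) [(sn, off)] m)

-- A node read from the file has an entry (clean); a freshly built
-- node does not, so it has to be written out.
lookupOffset :: MemTree -> OffsetMap -> IO (Maybe FileOffset)
lookupOffset t m = do
  sn <- makeStableName t
  return (lookup sn =<< IM.lookup (hashStableName sn) m)
```

Note that nodes should be forced (at least to WHNF) before taking their stable names, so that evaluation doesn't change their identity between insert and lookup.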
Conclusion
These are the two approaches I can think of. The first should be more efficient; the second is more pleasant to the eye. I'd like your comments on my ideas. Maybe someone even has a better approach in mind?
[Edit] Rationale
There are two reasons I am searching for a solution like this:
On the one hand, you should try to handle errors before they arise by using the type system. The aforementioned user is of course the designer of the next layer in the system (i.e. me). By working on the pure tree representation, some kinds of bugs can't happen in the first place. All changes to the tree in the file should be in one place. That should make reasoning easier.
On the other hand, I could just implement something like insert :: FilePath -> Key -> Value -> IO () and be done with it. But then I'd lose a very nice trait that comes for free when I keep (a kind of) a log by updating the tree in place: transactions (i.e. merging several inserts) become just a matter of working on the same tree in memory and writing only the differences back to the file.

I think the package Data.Generic.Diff may do exactly what you want. It references the thesis that describes the idea behind how it works.

I am very new to Haskell, so I won't be showing code, but hopefully my explanation can help toward a solution.
First, why not expose only the MemTree to the user, since that is what they will update, and keep the FileTree completely hidden? That way, if you later switch to, say, a database backend, the user won't see any difference.
And since the FileTree is hidden, why not just read it in when you are about to update? Then you have the offsets, can perform the update, and close the file again.
One problem with keeping the offsets around is that it prevents another program from making changes to the file. In your case that may be fine, but as a general rule I think it is bad design.
The main change I see is that the MemTree shouldn't be lazy, since the file won't stay open.

Related

Persistent - how to filter on a field of a record column

I'm in the middle of trying to build my first "real" Haskell app, an API using Servant, where I'm using Persistent for the database backend. But I've run into a problem when trying to use Persistent to make a certain type of database query.
My real problem is of course a fair bit more involved, but the essence of the problem I have run up against can be explained like this. I have a record type such as:
data Foo = Foo { a :: Int, b :: Int }
derivePersistField "Foo"
which I am including in an Entity like this:
share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
Thing
foo Foo
|]
And I need to be able to query for items in my database for which their Foo value has its a field greater than some aMin that is provided. (The intention is that it will actually be provided by the API request in a query string.)
For ordinary queries, say for an Int field Bar, I could simply do selectList [ThingBar >=. aMin] [], but I'm drawing a blank as to what to put in the filter list in order to extract the field from the record and compare against it, even though this feels like the sort of thing Haskell should be able to do rather easily. It feels like there should be a Functor involved that I could just fmap the a accessor over, but the relevant type, as far as I can tell from the documentation and the tutorial, is EntityField Thing, defined by a GADT (actually generated by Template Haskell from the share call above), which in this case has just one constructor, yielding an EntityField Thing Foo; it doesn't seem possible to make a Functor instance out of that.
But without that, I'm drawing a blank as to how to deal with this, since the LHS of a combinator like >=. has to be an EntityField value, which stops me from trying to apply functions to the database value before comparing.
Since I know someone is going to say it (and most of you will be thinking it) - yes, in this toy example I could just as easily make the a and b into separate fields in my database table, and solve the problem that way. As I said, this is somewhat simplified, and in my real application doing it that way would feel unsatisfactory for a number of reasons. And it doesn't solve my wider question of, essentially, how to be able to do arbitrary transformations on the data before querying. [Edit: I have now gone with this approach, in order to move forward with my project, but as I said it's not totally satisfactory, and I'm still waiting for an answer to my general question - even if that's just "sorry it's not possible", as I increasingly suspect, I would appreciate a good explanation.]
I'm beginning to suspect this may simply be impossible, because the data is ultimately stored in an SQL database and SQL simply isn't as expressive as Haskell. But what I'm trying to do, at least with record types (I confess I don't know how derivePersistField marshals these to SQL data types), doesn't seem too unreasonable, so I feel I should ask whether there are any workarounds, or whether I really have to decompose my records into a bunch of separate fields if I want to query them individually.
[If there are any other libraries which can help then feel free to recommend them - I did look into Esqueleto but decided I didn't need it for this project, although that was before I ran into this problem. Would that be something that could help with this kind of query?]
You can use the -ddump-splices compiler flag to dump the code being generated by derivePersistField (and all the other Template Haskell calls). You may need to pass -fforce-recomp, too, if ghc doesn't think the file needs to be recompiled.
If you do, you'll see that the method persistent uses to marshal the Foo datatype to and from SQL is to use its Show and Read instances to store it as a text string. (This is actually explained in the documentation on Custom Fields.) This also means that a query like:
stuff <- selectList [ThingFoo >=. Foo 3 0] []
actually does string comparison at the SQL level, so Thing (Foo 10 2) wouldn't pass through this filter, because the string "Foo 10 2" sorts before "Foo 3 0".
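The mismatch between string order and numeric order is easy to check in plain Haskell (using the simplified "Foo 3 0" rendering from above; the actual derived show output also includes record braces, but the problem is the same):

```haskell
-- derivePersistField stores the value via show, so the database ends up
-- comparing rendered strings character by character: '1' < '3', hence
-- "Foo 10 2" sorts before "Foo 3 0" even though 10 > 3.
stringOrder :: Ordering
stringOrder = compare "Foo 10 2" "Foo 3 0"   -- LT

numericOrder :: Ordering
numericOrder = compare (10 :: Int) 3         -- GT
```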
In other words, you're pretty much out of luck here. Custom fields created by derivePersistField aren't really meant to be used for anything more sophisticated than the example from the Yesod documentation:
data Employment = Employed | Unemployed | Retired
The only way to examine their structure would be to pass raw SQL that parses the string field for use in a query. That would be much uglier than whatever you're doing now, and presumably no more efficient than querying for all records at the SQL level and doing the filtering in plain Haskell code on the result list.

Strict map with custom reading method or type with custom read instance

So, this is not so much a problem as a request for opinions on which approach is better. I need to read data from an outside source (TCP) that comes in basically this format:
key: value
okey: enum
stuff: 0.12240
amazin: 1020
And I need to parse it into a Haskell-accessible format. The two solutions I considered were either to parse it into a strict String-to-String map, or to use record-syntax type declarations.
Initially I thought of making a type synonym for my String-to-String map and writing extractor functions like amazin :: NiceSynonym -> Int, doing the necessary treatment and parsing within the function, but that felt sketchy at the time. Then I tried an actual type declaration with record syntax and a custom Read instance. That was a nightmare, because there are a lot of enums and keys with different types, and it felt... disappointing. It simply wraps the arguments and creates reader functions, not much different from the original: amazin :: TypeDeclaration -> Int.
Now I'm kind of regretting not going with reader functions as I initially envisioned. So, is there anything else I'm forgetting to consider? Any pros and cons of either side to take note of? Is one objectively better than the other?
P.S.: Some considerations that may make one or the other better:
Once read, I won't need to change it at all; it's basically a status report
No need to compare, add, etc.; again, it's just a status report, so there's no point
No real need for performance; I won't be reading hundreds per second or anything
TL;DR: Given that input example, what's the best way to turn it into a Haskell-readable format? A map, a data constructor, a dependent map...?
Both ways are valid in their own respects, but since I was also making an API to interact with the protocol, I preferred the record syntax so I could cover all the properties more easily. Also, I wasn't really going to do any checking or treatment in the getter functions, and no matter how boring writing the Read instance for my type might have seemed, I bet writing all the get functions manually would have been worse. Parsing stuff manually is inherently boring; I guess I was just looking for a magical functional one-liner to do all the work for me.
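For comparison, the map-based alternative the asker first envisioned fits in a few lines. This is a sketch assuming the simple key: value line format above; parseReport and amazin are hypothetical names:

```haskell
import qualified Data.Map.Strict as M

-- Parse "key: value" lines into a strict String-to-String map.
parseReport :: String -> M.Map String String
parseReport = M.fromList . map parseLine . lines
  where
    parseLine l =
      let (k, rest) = break (== ':') l
      in  (k, dropWhile (== ' ') (drop 1 rest))

-- Typed extractor functions then live next to the map, e.g.:
amazin :: M.Map String String -> Int
amazin m = maybe 0 read (M.lookup "amazin" m)
```

The treatment and parsing the asker worried about (here, read plus a default) stays confined to the extractors, which is essentially the "reader functions" design from the question.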

Syntax tree with parent in Haskell

I want to implement an AST in Haskell. I need a parent reference, so it seems impossible to use a purely functional data structure. I've seen the following in an article. We define a node as:
type Tree = Node -> Node
Node allows us to get an attribute by a key of type Key a.
Is there anything to read about such a pattern? Could you give me some further links?
If you want a pure data structure with cyclic self-references, then as delnan says in the comments the usual term for that is "tying the knot". Searching for that term should give you more information.
Do note that data structures built by tying the knot are difficult (or impossible) to "update" in the usual manner: with a non-cyclic structure you can keep pieces of the original when building a new structure based on it, but changing any piece of a cycle requires you to rebuild the entire cycle as well. Depending on what you're doing, this may or may not be a problem, of course.
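A minimal tying-the-knot sketch for the parent-pointer case (the Node type here is hypothetical, not the one from the article):

```haskell
-- Each node refers to its parent; laziness lets the two definitions
-- refer to each other, building the cycle in one go.
data Node = Node
  { label    :: String
  , parent   :: Maybe Node
  , children :: [Node]
  }

root :: Node
root = r
  where
    r = Node "root" Nothing [c]
    c = Node "child" (Just r) []
```

As the answer warns, "updating" c would mean rebuilding r too, since each refers to the other.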

A specific use case of algebraic data types

I was writing a generic enumerator to scrape sites as an exercise, and it is complete and works fine, but I have a question. You can find it here if you want to look at the code: https://github.com/mindreader/scrape-enumerator
The basic idea is that I wanted an enumerator that spits out site-defined entries on pages like search engines or blogs: things where you fetch a page that holds, say, 25 entries, and you want one entry at a time. At the same time I didn't want to write the plumbing for every site, so I wanted a generic interface. What I came up with is this (it uses type families):
class SiteEnum a where
  type Result a :: *
  urlSource :: a -> InputUrls (Int, Int)
  enumResults :: a -> L.ByteString -> Maybe [Result a]

data InputUrls state
  = UrlSet [URL]
  | UrlFunc state (state -> (state, URL))
  | UrlPageDependent URL (L.ByteString -> Maybe URL)
Doing this for every type of site requires a URL source of some sort. That could be a (possibly infinite) list of pregenerated URLs; or an initial state plus a function to generate URLs from it (e.g. if the URLs contain &page=1, &page=2, and so on); or, for really screwed-up pages like Google, an initial URL plus a function that searches the body for the next link. Your site makes a data type an instance of SiteEnum, gives Result a site-dependent type, and the enumerator deals with all the I/O; you don't have to think about it. This works perfectly, and I have implemented one site with it.
My question concerns an annoyance in this implementation: the InputUrls datatype. When I use UrlFunc, everything is golden. When I use UrlSet or UrlPageDependent, it isn't all fun and games, because the state type is undetermined and I have to annotate the value as :: InputUrls () to make it compile. This seems totally unnecessary: given the way the program is written, that type variable will never be used for the majority of sites. I find I want to use types like this in a lot of different contexts, and I always end up with stray type variables that are only needed in certain pieces of the datatype. It doesn't feel like I should be using it this way. Is there a better way of doing this?
Why do you need the UrlFunc case at all? From what I understand, the only thing you're doing with the state function is using it to build a list like the one in UrlSet anyway, so instead of storing the state function, just store the resulting list. That way, you can eliminate the state type variable from your data type, which should eliminate the ambiguity problems.
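Concretely, a state-plus-step pair can be collapsed into the list it generates with Data.List.unfoldr, after which the state type variable disappears from the datatype. A sketch, with String standing in for the question's URL type and the ByteString-based continuation simplified:

```haskell
import Data.List (unfoldr)

type URL = String  -- stand-in for the question's URL type

-- InputUrls without the state variable: UrlFunc is gone, because a
-- state plus a step function is just a (possibly infinite) list.
data InputUrls
  = UrlSet [URL]
  | UrlPageDependent URL (String -> Maybe URL)

-- The "&page=1, &page=2, ..." case becomes an unfoldr over the counter.
pagedUrls :: Int -> InputUrls
pagedUrls start =
  UrlSet (unfoldr (\n -> Just ("?page=" ++ show n, n + 1)) start)
```

Laziness means the infinite list costs nothing until the enumerator actually demands the next URL, so nothing is lost relative to the UrlFunc encoding.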

Is there an object-identity-based, thread-safe memoization library somewhere?

I know that memoization seems to be a perennial topic here on the haskell tag on stack overflow, but I think this question has not been asked before.
I'm aware of several different 'off the shelf' memoization libraries for Haskell:
The memo-combinators and memotrie packages, which make use of a beautiful trick involving lazy infinite data structures to achieve memoization in a purely functional way. (As I understand it, the former is slightly more flexible, while the latter is easier to use in simple cases: see this SO answer for discussion.)
The uglymemo package, which uses unsafePerformIO internally but still presents a referentially transparent interface. The use of unsafePerformIO internally results in better performance than the previous two packages. (Off the shelf, its implementation uses comparison-based search data structures rather than perhaps-slightly-more-efficient hash functions; but I think that if you find and replace Cmp with Hashable and Data.Map with Data.HashMap, and add the appropriate imports, you get a hash-based version.)
However, I'm not aware of any library that looks answers up based on object identity rather than object value. This can be important, because sometimes the kinds of object being used as keys to your memo table (that is, as input to the function being memoized) can be large: so large that fully examining the object to determine whether you've seen it before is itself a slow operation. Slow, and also unnecessary, if you will be applying the memoized function again and again to an object stored at a given 'location in memory' 1. (This might happen, for example, if we're memoizing a function which is being called recursively over some large data structure with a lot of structural sharing.) If we've already computed our memoized function on that exact object before, we can already know the answer, even without looking at the object itself!
Implementing such a memoization library involves several subtle issues, and doing it properly requires several special pieces of support from the language. Luckily, GHC provides all the special features we need, and there is a paper by Peyton Jones, Marlow and Elliott which basically works through most of these issues for you, explaining how to build a solid implementation. They don't provide all the details, but they get close.
The one detail which I can see one probably ought to worry about, but which they don't, is thread safety: their code is apparently not thread-safe at all.
My question is: does anyone know of a packaged library which does the kind of memoization discussed in the Peyton Jones, Marlow and Elliott paper, filling in all the details (and preferably proper thread-safety as well)?
Failing that, I guess I will have to code it up myself: does anyone have any ideas of other subtleties (beyond thread safety and the ones discussed in the paper) which the implementer of such a library would do well to bear in mind?
UPDATE
Following #luqui's suggestion below, here's a little more data on the exact problem I face. Let's suppose there's a type:
data Node = Node [Node] [Annotation]
This type can be used to represent a simple kind of rooted DAG in memory, where Nodes are DAG Nodes, the root is just a distinguished Node, and each node is annotated with some Annotations whose internal structure, I think, need not concern us (but if it matters, just ask and I'll be more specific.) If used in this way, note that there may well be significant structural sharing between Nodes in memory---there may be exponentially more paths which lead from the root to a node than there are nodes themselves. I am given a data structure of this form, from an external library with which I must interface; I cannot change the data type.
I have a function
myTransform :: Node -> Node
the details of which need not concern us (or at least I think so; but again I can be more specific if needed). It maps nodes to nodes, examining the annotations of the node it is given, and the annotations its immediate children, to come up with a new Node with the same children but possibly different annotations. I wish to write a function
recursiveTransform :: Node -> Node
whose output 'looks the same' as the data structure you would get from:
recursiveTransform (Node originalChildren annotations) =
  myTransform (Node recursivelyTransformedChildren annotations)
  where
    recursivelyTransformedChildren = map recursiveTransform originalChildren
except that it uses structural sharing in the obvious way so that it doesn't return an exponential data structure, but rather one on the order of the same size as its input.
I appreciate that this would all be easier if say, the Nodes were numbered before I got them, or I could otherwise change the definition of a Node. I can't (easily) do either of these things.
I am also interested in the general question of the existence of a library implementing the functionality I mention quite independently of the particular concrete problem I face right now: I feel like I've had to work around this kind of issue on a few occasions, and it would be nice to slay the dragon once and for all. The fact that SPJ et al felt that it was worth adding not one but three features to GHC to support the existence of libraries of this form suggests that the feature is genuinely useful and can't be worked around in all cases. (BUT I'd still also be very interested in hearing about workarounds which will help in this particular case too: the long term problem is not as urgent as the problem I face right now :-) )
1 Technically, I don't quite mean location in memory, since the garbage collector sometimes moves objects around a bit; what I really mean is 'object identity'. But we can think of this as being roughly the same as our intuitive idea of location in memory.
If you only want to memoize based on object identity, and not equality, you can just use the existing laziness mechanisms built into the language.
For example, if you have a data structure like this
data Foo = Foo { ... }
expensive :: Foo -> Bar
then you can just add the value to be memoized as an extra field and let laziness take care of the rest for you.
data Foo = Foo { ..., memo :: Bar }
To make it easier to use, add a smart constructor to tie the knot.
makeFoo ... = let foo = Foo { ..., memo = expensive foo } in foo
Though this is somewhat less elegant than using a library, and requires modifying the data type to really be useful, it's a very simple technique, and all thread-safety issues are already taken care of for you.
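A self-contained version of that sketch, with an Int field standing in for the elided ones:

```haskell
-- The memoized result lives inside the structure itself; laziness
-- computes it at most once, and sharing makes every later read free.
data Foo = Foo { value :: Int, memo :: Int }

expensive :: Foo -> Int
expensive f = value f * value f   -- stand-in for a costly computation

-- Smart constructor ties the knot between the record and its memo field.
makeFoo :: Int -> Foo
makeFoo v = let foo = Foo { value = v, memo = expensive foo } in foo
```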
It seems that stable-memo would be just what you need (although I'm not sure whether it can handle multiple threads):
Whereas most memo combinators memoize based on equality, stable-memo does it based on whether the exact same argument has been passed to the function before (that is, is the same argument in memory).
stable-memo only evaluates keys to WHNF.
This can be more suitable for recursive functions over graphs with cycles.
stable-memo doesn't retain the keys it has seen so far, which allows them to be garbage collected if they will no longer be used. Finalizers are put in place to remove the corresponding entries from the memo table if this happens.
Data.StableMemo.Weak provides an alternative set of combinators that also avoid retaining the results of the function, only reusing results if they have not yet been garbage collected.
There is no type class constraint on the function's argument.
stable-memo will not work for arguments which happen to have the same value but are not the same heap object. This rules out many candidates for memoization, such as the most common example, the naive Fibonacci implementation whose domain is machine Ints; it can still be made to work for some domains, though, such as the lazy naturals.
Ekmett just uploaded a library that handles this and more (produced at Hac Phi): http://hackage.haskell.org/package/intern. He assures me that it is thread safe.
Edit: Actually, strictly speaking, I realize this does something rather different. But I think you can use it for your purposes. It's really more of a stringtable-atom-style interning library that works over arbitrary data structures (including recursive ones). It uses weak pointers internally to maintain the table. However, it uses Ints to index the values in order to avoid structural equality checks, which means packing them into the data type, when what you apparently want are StableNames. So I realize this answers a related question, but it requires modifying your data type, which you want to avoid...