Data.HashMap performance outside of IO - haskell

I've discovered Helm (http://helm-engine.org/) the other day and I've been playing a bit with it. I like Elm, so Helm has been great to use thus far.
Simply put, the update function gets called every tick, gets passed the model, and has to return an updated version of that model. Another function then gets called to render that model on screen.
For small games without much in the model it seems ideal, but I've been thinking about something a bit bigger, for which a HashMap (or a lot of them) would be ideal, and I'm wondering about the performance of that.
I'm no expert, but I believe using Data.HashTable.IO would modify the hashtable in RAM in place instead of creating a new copy on each change; that, however, seems complicated to interface with Helm. It would mean using a Cmd for each lookup and each change, returning that to Helm, and then getting passed the result in a new call to update: a nightmare to use if you have more than one or two things to look up, I think.
Data.HashMap.Strict (or Lazy?) would probably work better, but I imagine each change would create a new copy, and the GC would free up the old one at some future point. Is that correct?
That would potentially mean hundreds of copy-then-free operations per frame, unless the whole thing is smart enough to realise I'm not using the old hashtable again after the change and simply doesn't copy it.
So how does this work in practice? (I'm thinking of HashMap because it seems like the easiest solution, but I guess this applies to regular lists too.)

I support the comments about avoiding premature optimization and benchmarking, instead of guessing, to determine whether performance is acceptable. That said, you had some specific questions too.
Data.HashMap.Strict (or Lazy?) would probably work better, but I imagine each change would create a new copy, and the GC would free up the old one at some future point. Is that correct?
Yes, the path to the modified node will consist of new nodes. Modulo balancing, the subtrees to the left and right of that path will all be shared (not copied) between the old and new tree structures.
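Since the old map remains fully usable after an update, this sharing behaviour is easy to observe. Here is a minimal sketch; it uses Data.Map.Strict from containers for convenience, but Data.HashMap.Strict behaves the same way for this purpose (demo is just an illustrative name):

```haskell
import qualified Data.Map.Strict as M

-- 'insert' returns a new map and leaves the old one untouched.  Internally
-- only the path to the changed node is rebuilt; the rest of the tree is
-- shared between m0 and m1, so this is far cheaper than a full copy.
demo :: (Maybe Int, Maybe Int)
demo =
  let m0 = M.fromList [(k, k) | k <- [1 .. 100 :: Int]]
      m1 = M.insert 1 999 m0   -- persistent update; m0 is still valid
  in  (M.lookup 1 m0, M.lookup 1 m1)
```

Nothing is freed per update either; the old version only becomes garbage once no live reference points at its unshared nodes.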
That would potentially mean hundreds of copy-then-free operations per frame
I'm not sure where you get "hundreds" from. Are you saying there are hundreds of updates? For some structures there are rewrite rules that allow many of the intermediate values to be used in a mutating manner. See, for example, this small examination of vector.
So how does this work in practice?
In practice people implement what they want and rework parts that are too slow. I might reach for HashMap early, instead of assuming the containers Data.Map will suffice, but I don't go beyond that without evidence.

Related

Does Pharo have immutable data structures such as maps and sets in its standard library?

I might have just missed them, but I can't seem to find any mention of immutable data structures in Pharo. Coming from functional languages, I've found immutable maps and sets useful on various occasions. Even though Pharo has a particular bias towards mutation, I'd be surprised if nobody had got around to implementing them yet.
The code at http://source.lukas-renggli.ch/container/ implements a modern container and iterator library, with mutable and immutable lists; unmodifiable views; and sorted, ordered, and unordered set and map data structures. It also supports efficient lazy iteration over all containers using common filtering, mapping, flattening, partitioning, ... operations.
I am not claiming the library has a perfect design or is more performant than the standard collection library, but it is certainly a good starting point for further exploration.
It is entirely possible that someone has implemented something like that. And maybe there will be immutable collections in the main library in the future. However, for now there is nothing like that, and it is for a very simple reason: what for? When I started to learn Pharo, I was fascinated by the null-propagation idea of Objective-C (if you have null and you send a message to null, you get null back, etc.). So the first thing I did was implement null propagation in Pharo. It was fun, it was educational, and it was completely useless. It was useless because no one uses Pharo that way; it was the wrong approach for that context. I strongly encourage you to make your own immutable collections in Pharo.
But while you do this, think about what should be immutable and why. Is it about not shrinking or growing a collection? Arrays are like that: they are fixed-size. Is it about not being able to add/remove/swap elements? But what if you get an element and modify it? Finally, consider this example:
array := #('a' 'b' 'c').
array first become: 'd'.
array = #('d' 'b' 'c')
I don't use any setters, and still I end up with a different array.
The Pharo community cares about transparency and good design. It is known that you shouldn't modify the contents of collections directly, that you shouldn't interact with the internal state of objects from outside, etc. On the other hand, no one will punch you in the face if you want to do that. I mean, what if you are prototyping? What if you are hacking? What if there is literally no other way? You are always able to choose; the question is how we can help people learn about the better choices.
P.S. My answer may sound like immutability is not important. That's not the case. There have even been prototypes of read-only objects that can be used to ensure a certain degree of security. It's just not that simple to come up with a single concept that will work for everything.

Delete orphaned shared ValueObjects?

I'm still trying to understand DDD. Let's say I have an immutable VO, PresentationSlide. Because it is immutable, it can easily be shared and used by different entities.
But what happens if I try to map this to a database? In Entity Framework I could model the PresentationSlide as a ComplexType, meaning the properties of the shared PresentationSlide are mapped into the tables using it. This is fine, but a slide might be quite large, and therefore I'm wasting space if it is used/referenced several times.
As an alternative approach, I could map a PresentationSlide to a separate table and reference it. Because it is immutable, this should also work. But now, if I modify a Presentation, I have to copy the content of the old PresentationSlide and create a new instance. If there are a lot of changes, I will soon have a lot of orphaned PresentationSlides in my database.
Isn't this a problem? Is there a solution?
Should I implement a custom periodical 'Cleanup orphaned PresentationSlides'-Task?
First, you should think about the ownership and life cycle of the PresentationSlide within your domain model. Always make sure you follow your model's semantics when doing performance or storage optimization.
I'd start out by duplicating each PresentationSlide within its entity. This is just the natural thing to do, because that's what you do in your domain model, too.
Put metrics in place that enable you to make an informed decision about the storage optimization. E.g., go with the duplication approach and measure the space wasted on duplicates after some time in production.
Only if you really have a problem should you do the storage optimization. Otherwise you've become a victim of Premature Optimization.
If you really need to make an optimization, I think the approach you mention is a reasonable one. Create a separate table, and regularly clean up PresentationSlides that are not referenced anymore.

What is the most efficient implementation of arrays with functional updates?

I need an array-like data structure with the fastest possible functional update. I've seen a few different implementations of flexible arrays that provide this property (Braun trees, random-access lists), but I'm wondering if there is an implementation that is specifically optimized for the case where we are not interested in append or prepend, just updates.
Jean-Christophe Filliâtre has a very nice implementation of persistent arrays, described in the paper linked on the same page (which is about persistent union-find, of which persistent arrays are a core component). The code is directly available there.
The idea is that "the last version" of the array is represented as an ordinary array, with O(1) access and update operations, and previous versions are represented as this last version plus a list of differences. If you try to access a previous version of the structure, the array is "rerooted" to apply the list of differences and present you the efficient representation again.
This will of course not be O(1) under all workflows (if you constantly access and modify unrelated versions of the structure, you will pay rerooting costs frequently), but for the common workflow of mainly working with one version, and occasionally backtracking to an older version that then becomes the "last version" again and receives the updates, this is very efficient. A very nice use of mutability hidden under an observationally pure interface.
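Stripped of the rerooting machinery (which needs mutation under the hood), the diff-list idea behind those persistent arrays can be sketched in a few lines of Haskell; PArr, pget, and pset are made-up names for this toy version:

```haskell
import Data.Array

-- The base version is a flat array; every 'pset' just records a
-- difference on top of an older version.  Filliâtre's real implementation
-- additionally reroots on access, so the version you use most behaves
-- like a plain mutable array.
data PArr a = Base (Array Int a)
            | Diff Int a (PArr a)

pget :: Int -> PArr a -> a
pget i (Base arr) = arr ! i
pget i (Diff j x older)
  | i == j    = x
  | otherwise = pget i older

pset :: Int -> a -> PArr a -> PArr a
pset = Diff
```

In this toy version a read costs O(d), where d is the number of diffs between the version you hold and the base; rerooting is exactly the step that collapses those diffs back into a flat array.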
I have had a very good experience with repa (nice demo): very good performance, automatic parallelism, multidimensional, polymorphic. I recommend trying it.
Which language are you using? In Haskell you can use mutable arrays with the ST monad, and in Mercury you can use mutable arrays by threading the IO state. OCaml also has an array module, which unfortunately does not maintain referential transparency, if that is what you are after.
I also needed functional arrays and stumbled on this SO question a few days ago. I was not satisfied with the solution proposed by Gasche, as creating a new array is a costly operation and I need to access older versions of the array quite frequently (I plan to use this for an AI alpha/beta implementation playing on an array).
(When I say costly, I guess it is O(n*h), where h is the history size, because in the worst case a single cell was updated repeatedly and the whole update list has to be traversed for each cell. I also expect most of the cells not to have been updated when I need to reroot the array.)
This is why I propose another approach; maybe I can get some feedback here. My idea is to store the array like in a B-tree, except that, as it is not mutable, I can access and update any value by index quite easily.
I wrote a small introduction in the project's repository: https://github.com/shepard8/ocaml-ptarray. The order is chosen so as to balance the depth and the order of the tree, so I get nice complexities depending only on the order for get/set operations, namely O(k^2).
With k = 10 I can store up to 10^10 values. In practice my arrays should not contain more than 200 values, but this is to show how robust the solution is intended to be.
Any advice welcome!

Is MonoTouch.Dialog a suitable replacement for all UITableViews?

This question is mainly targeted at Miguel as the creator of MT.Dialog, but I would like to hear the opinions of others as well.
I'm currently refactoring a project that has many table views. I'm wondering if I should replace all of them with MT.Dialog.
My pros are:
easy to use
simple code
hope that Xamarin will offer it cross platform one day
Cons:
my cells are completely custom-made. Does it make sense in that case?
performance? Is that an issue?
breaking the MVC paradigms (source no longer separated from view and controller)
Is it in general better to just use MT.Dialog, or to inherit from it for specific use cases? What are your experiences?
To address some of your questions.
The major difference between MonoTouch.Dialog and UITableView is that with the former you "load" all the data that you want to render upfront and then forget about it. You let MonoTouch.Dialog take care of rendering it, pushing views, and handling sections/elements. With UITableView you need to provide callback methods that return the number of sections, the titles for the sections, and the data itself.
UITableView has the advantage that, to render say a million rows with the same size and the same cells, you don't really have to load all the data upfront; you can just wait to be called back. That being said, this breaks down quickly if you use cells with different heights, as UITableView will have to query the sizes of all of your rows.
So in short:
(1) Yes, even if you use custom cells, you will benefit from shorter code and a simpler programming model. Whether you use its other features or not is up to you.
(2) For performance, the issue boils down to how many rows you will have. As I mentioned before, if you are browsing a potentially large data set, you would have to load all of those cells into memory upfront, or, like TweetStation, add features to load on demand.
The reality is that it will consume more memory, because you need to load your data into MonoTouch.Dialog. Your best optimization technique is to keep your Elements very lightweight. TweetStation, for example, uses a "TweetElement" that merely holds the ID of the tweet and loads the actual contents on demand, keeping the in-memory size of the TweetElement very small.
With UITableView, you do not pay that price. But if you are not using a database of some sort, the data will still be in memory.
If your application calls for the data to be in memory, then you might as well move the data to be elements and use that as your model.
(3) This is a little bit of a straw man. Your data "source" is never really independent of UIKit. I know that people like to talk about these models as being reusable, but in practice you won't ever be able to reuse a UITableViewSource as a source for anything but a UITableView. Its main use is to support scalable controls that do not require data to be loaded in memory upfront; it is not really about separating the Model from the View.
So what you really have is an adaptor class that bridges the world of UITableView with your actual data model (a database, an XML list, an in-memory array, a Redis connection).
With UITableView, your adaptor code lives in the constructor and the UITableViewSource. With MonoTouch.Dialog, your adaptor code lives in the code that populates the initial RootElement passed to the DialogViewController.
So there are reasons to use UITableView over MonoTouch.Dialog, but none of them is among those three cons.
I use MonoTouch.Dialog (and its brother QuickDialog for Objective-C) pretty much every time I use a table view. It helps a lot to simplify the code and gives you a better abstraction of a table.
There's one exception, though: when the table will have thousands and thousands of rows and the data is in a database. MT.D/QD requires you to load all the data upfront so you can create the sections, and that's simply too slow if you don't already have the objects in memory.
Regarding "breaking MVC", I kind of agree with you. I never really use the reflection bindings in MT.D because of that. Usually I end up creating the root from scratch in code, or using something like JSON (in my fork https://github.com/escoz/MonoMobile.Forms), so that my domain objects don't need to know about MT.D.

How to avoid mutable state (when multithreading)

Multithreading is hard. The only thing you can do is program very carefully and follow good advice. One great piece of advice I got from the answers on this forum is to avoid mutable state. I understand this is even enforced in the Erlang language. However, I fail to see how this can be done without a severe performance hit and huge amounts of caching.
For example: you have a big list of objects, each containing quite a lot of properties; in other words, a large data structure. Suppose you have a bunch of threads and they all need to access and modify the list. How can this be done without shared memory, and without having to cache the whole data structure in each of the threads?
Update: after reading the reactions so far, I would like to put some more emphasis on performance. Don't you think that copying the same data around will make the program slower than it would be with shared memory?
Not every algorithm can be parallelized successfully.
If your program doesn't exhibit any "parallel structure", then you're pretty much doomed to use locking and shared, mutable structures.
If your algorithm does exhibit structure, then you can express your computation in terms of patterns or a formalism (for example, a macro dataflow graph) that makes the choice of an immutable data structure trivial.
So: think in terms of the structure of the algorithm, and not just in terms of the properties of the data structure to use.
You can get a great start on thinking about immutable collections (where they are applicable, how they can actually work without requiring lots of copying, and so on) by looking through Eric Lippert's articles tagged with immutability:
http://blogs.msdn.com/ericlippert/archive/tags/Immutability/default.aspx
I guess the first question is: why do they need to modify the list? Would it be possible for them to return their changes as a list of modifications rather than actually modifying the shared list? Could they work with a list which looks like a mutable version of the original list, but is actually only locally mutable? Are you changing which elements are in the list, or just the properties of those elements?
These are just questions rather than answers, but I'm trying to encourage you to think about your problem in a different way. Look at the bigger picture as the task you want to achieve instead of thinking about the way you'd tackle it in a normal imperative, mutable fashion. Changing the way you think about problems is very difficult, but you may find you get some great "aha!" moments :)
There are many pitfalls when working with multiple threads and large sets of data. The advice to avoid mutable state is meant to make life easier for you if you can manage to follow the guideline (i.e., if you have no mutable state, then multithreading will be much easier).
If you have a large amount of data that does need to be modified, then you perhaps cannot avoid mutable state. An alternative, though, would be to partition the data into blocks, each of which is passed to a thread for manipulation. Each block can be processed and then passed back, and the controller can perform the updates where necessary. In this scenario you have moved the mutable state out of the threads.
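A rough Haskell sketch of that partition-and-hand-back scheme (mapInChunks is an invented name): each worker owns its chunk outright and communicates only by handing a fresh result back through an MVar, so no thread ever mutates shared state.

```haskell
import Control.Concurrent

-- Fork one worker per chunk; each computes a new list from its own
-- immutable chunk and delivers it via an MVar.  The controller is the
-- only place where results are combined.
mapInChunks :: (a -> b) -> [[a]] -> IO [[b]]
mapInChunks f chunks = do
    boxes <- mapM spawn chunks
    mapM takeMVar boxes        -- collect in order; blocks until each is done
  where
    spawn chunk = do
      box <- newEmptyMVar
      _ <- forkIO (putMVar box (map f chunk))
      return box
```

Because the chunks are immutable, no locking is needed beyond the MVar handshake itself.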
If this cannot be done, and each thread needs update access to the full list (i.e. it could update any item on the list at any time), then you are going to have a lot of fun trying to get your locking strategies and concurrency issues sorted. I'm sure there are scenarios where this is required, and the design pattern of avoiding mutable state may not apply.
Just using immutable data-objects is a big help.
Modifying lists sounds like a constructed argument; consider instead granular methods that are unaware of lists.
If you really need to update the structure, one way to do this is to have a single worker thread which picks up update requests from a fixed area protected by a mutex.
If you are clever, you can update the structure in place without affecting any "reading" threads. (E.g., if you are adding to the end of an array, you do all the work to add the new structure, but only as the very last instruction do you increment the NoOfMembers count; the reading threads should not see the new entry until you do. Or arrange your data as an array of references to structures: when you want to update a structure, you copy the current member, update the copy, and then, as the last operation, replace the reference in the array.)
The other threads then only need to check a single, simple "update in progress" mutex, and only when they actively want to update.
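In Haskell terms, the "replace the reference as the very last operation" idea looks roughly like this (a sketch with invented names; atomicModifyIORef' performs the single atomic swap, and a reader that already grabbed the old list keeps a complete, consistent snapshot):

```haskell
import Data.IORef

-- The writer builds the updated, immutable list off to the side; the only
-- shared mutation is the atomic reference swap at the end.
appendItem :: IORef [Int] -> Int -> IO ()
appendItem ref x = atomicModifyIORef' ref (\xs -> (xs ++ [x], ()))

-- Readers never lock: they just dereference and get whichever complete
-- version was current at that instant.
snapshot :: IORef [Int] -> IO [Int]
snapshot = readIORef
```

A reader can never observe a half-built list, because the new version only becomes visible in the single swap step.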
