I'm building a CRUD application that pulls data using Persistent and executes a number of fairly complicated queries, for instance using window functions. Since these aren't supported by either Persistent or Esqueleto, I need to use raw SQL.
A good example is that I want to select rows in which the value does not deviate strongly from the previous value, so in pseudo-SQL the condition is WHERE val - lag(val) <= x. I need to run this selection in SQL, rather than pulling all data and then filtering in Haskell, because otherwise I'd have way too much data to handle.
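To make the pseudo-SQL concrete: window functions are not allowed directly in a WHERE clause, so lag has to be computed in a subquery and filtered in the outer query. The sketch below uses Python's built-in sqlite3 module only as a convenient way to run the SQL (it assumes a SQLite build with window functions, 3.25 or later); the table name readings, its columns, and the sample data are hypothetical. The query text itself is the kind of string one would hand to a raw-SQL function.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (t INTEGER PRIMARY KEY, val REAL)")
conn.executemany(
    "INSERT INTO readings (t, val) VALUES (?, ?)",
    [(1, 10.0), (2, 11.0), (3, 25.0), (4, 26.0)],
)

# lag() is evaluated in the inner query; the outer query filters on it.
QUERY = """
SELECT t, val
FROM (
    SELECT t, val, lag(val) OVER (ORDER BY t) AS prev
    FROM readings
) AS lagged
WHERE val - prev <= ?
ORDER BY t
"""


def rows_within(connection, x):
    """Return rows whose value exceeds the previous value by at most x."""
    return connection.execute(QUERY, (x,)).fetchall()


rows = rows_within(conn, 2.0)
print(rows)
```

The first row has no predecessor, so its prev is NULL and the comparison drops it; whether that is the desired behaviour is a design choice.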
These queries return many columns. However, the RawSql instance maxes out at tuples with 8 elements. So now I am writing additional functions from9, to9, from10, to10 and so on. And after that, all these are converted using functions with type (Single a, Single b, ...) -> DesiredType. Even though this could be shortened using code generation, the approach is simply hacky and clearly doesn't feel like good Haskell. This concerns me because I think most of my queries will require rawSql.
Do you have suggestions on how to improve this? Currently, my main thought is to denormalize the database and duplicate data, e.g. by including the lagged value as a column, so that I can query the data with Esqueleto.
I have a Delta table that I want to update based on a condition that combines two column values, i.e.
delta_table.update(
    condition=is_eligible(col("name"), col("age")),
    set={"pension_eligible": lit("yes")}
)
I'm aware that I can do something similar to:
delta_table.update(
    condition=(col("name") == "Einar") & (col("age") > 65),
    set={"pension_eligible": lit("yes")}
)
But since my logic for computing this is quite complex (I need to look up the name in a database), I would like to define my own Python function for computing this (is_eligible(...)). Another reason is that this function is used elsewhere and I would like to minimize code duplication.
Is this possible at all? As I understand it, you could define it as a UDF, but they only take one parameter and I need at least two. I cannot find anything about more complex conditions in the Delta Lake documentation, so I'd really appreciate some guidance here.
I'm in the middle of trying to build my first "real" Haskell app, an API using Servant, where I'm using Persistent for the database backend. But I've run into a problem when trying to use Persistent to make a certain type of database query.
My real problem is of course a fair bit more involved, but the essence of the problem I have run up against can be explained like this. I have a record type such as:
data Foo = Foo { a :: Int, b :: Int }
derivePersistField "Foo"
which I am including in an Entity like this:
share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
Thing
    foo Foo
|]
And I need to be able to query for items in my database for which their Foo value has its a field greater than some aMin that is provided. (The intention is that it will actually be provided by the API request in a query string.)
For ordinary queries, say for an Int field Bar, I could simply do selectList [ThingBar >=. aMin] [], but I'm drawing a blank as to what to put in the filter list in order to extract the field from the record and do a comparison with it. Even though this feels like the sort of thing that Haskell should be able to do rather easily. It feels like there should be a Functor involved here that I can just fmap the a accessor over, but the relevant type, as far as I can tell from the documentation and the tutorial, is EntityField Thing defined by a GADT (actually generated by Template Haskell from the share call above), which in this case would have just one constructor yielding an EntityField Thing Foo, which it doesn't seem possible to make a Functor instance out of.
But without that, I'm drawing a blank as to how to deal with this, since the LHS of a combinator like >=. has to be an EntityField value, which stops me from trying to apply functions to the database value before comparing.
Since I know someone is going to say it (and most of you will be thinking it) - yes, in this toy example I could just as easily make the a and b into separate fields in my database table, and solve the problem that way. As I said, this is somewhat simplified, and in my real application doing it that way would feel unsatisfactory for a number of reasons. And it doesn't solve my wider question of, essentially, how to be able to do arbitrary transformations on the data before querying. [Edit: I have now gone with this approach, in order to move forward with my project, but as I said it's not totally satisfactory, and I'm still waiting for an answer to my general question - even if that's just "sorry it's not possible", as I increasingly suspect, I would appreciate a good explanation.]
I'm beginning to suspect this may simply be impossible because the data is ultimately stored in an SQL database and SQL simply isn't as expressive as Haskell - but what I'm trying to do, at least with record types (I confess I don't know how derivePersistField marshals these to SQL data types) doesn't seem too unreasonable so I feel I should ask if there are any workarounds for this, or do I really have to decompose my records into a bunch of separate fields if I want to query them individually.
[If there are any other libraries which can help then feel free to recommend them - I did look into Esqueleto but decided I didn't need it for this project, although that was before I ran into this problem. Would that be something that could help with this kind of query?]
You can use the -ddump-splices compiler flag to dump the code being generated by derivePersistField (and all the other Template Haskell calls). You may need to pass -fforce-recomp, too, if ghc doesn't think the file needs to be recompiled.
If you do, you'll see that the method persistent uses to marshal the Foo datatype to and from SQL is to use its read and show instances to store it as a text string. (This is actually explained in the documentation on Custom Fields.) This also means that a query like:
stuff <- selectList [ThingFoo >=. Foo 3 0] []
actually does string comparison at the SQL level, so Thing (Foo 10 2) wouldn't pass through this filter, because the string "Foo 10 2" sorts before "Foo 3 0".
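The ordering problem can be reproduced outside the database. The sketch below is only an illustration in Python: show_foo is a hypothetical helper that mimics the text rendering used in this answer ("Foo 10 2"), and Python's string ordering stands in for the text comparison the database would perform (the exact collation may differ, but digits compare the same way).

```python
def show_foo(a, b):
    """Mimic the stored text for Foo a b (assumed rendering, as in the answer)."""
    return f"Foo {a} {b}"


stored = show_foo(10, 2)   # the value held in the Thing row
bound = show_foo(3, 0)     # the lower bound passed to the filter

# What a text-column comparison sees: '1' sorts before '3'.
passes_as_text = stored >= bound

# What was actually intended: compare the numeric field a.
passes_as_numbers = 10 >= 3

print(passes_as_text, passes_as_numbers)
```

The two comparisons disagree, which is why the row is filtered out even though its a field is larger.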
In other words, you're pretty much out of luck here. Custom fields created by derivePersistField aren't really meant to be used for anything more sophisticated than the example from the Yesod documentation:
data Employment = Employed | Unemployed | Retired
The only way you can examine their structure would be to pass in raw SQL to parse the string field for use in a query, and that would be much uglier than whatever you're doing now and presumably no more efficient than querying for all records at the SQL level and doing the filtering in plain Haskell code on the result list.
What is the best way to implement a custom string type in F# for interning strings? I have to read large CSV files into memory. Given that most of the columns are categorical, values repeat, so it makes sense to create a new string the first time it is encountered and only refer to it on subsequent occurrences to save memory.
In C# I do this by creating a global intern pool (a concurrent dictionary) and, before setting a value, looking it up in the dictionary. If it exists, I just point to the string already in the dictionary. If not, I add it to the dictionary and set the value to the string just added.
I'm new to F# and wondering what the best way to do this in F# is. I will be using the new string type in records, named tuples etc., and it will have to work with concurrent processes.
Edit:
String.Intern uses the intern pool. My understanding is that it is not very efficient for large pools and is not garbage collected, i.e. all interned strings remain in the intern pool for the lifetime of the app. Imagine an application where you read a file, perform some operations and write data. Using the intern pool will probably work. Now imagine you have to do the same 100 times and the strings in each file have little in common. If the memory is allocated on the heap, then after processing each file we can force the garbage collector to clear unnecessary strings.
I should have mentioned that I could not really figure out how to do the C# approach in F# (other than implementing a C# type and using it in F#).
Isn't the memoization pattern slightly different from what I am looking for? We are not caching calculated results; we are ensuring each string object is created no more than once and all subsequent creations of the same string are just references to the original. Using a dictionary is one way to do this and using String.Intern is another.
Sorry if I am missing something obvious here.
I have a few things to say, so I'll post them as an answer.
First, I guess String.Intern works just as well in F# as in C#.
open System
open System.Text

let x = "abc"
let y = StringBuilder("a").Append("bc").ToString()
printfn "1 : %A" (LanguagePrimitives.PhysicalEquality x y) // false
let y2 = String.Intern y
printfn "2 : %A" (LanguagePrimitives.PhysicalEquality x y2) // true
Second, are you using a dictionary in combination with String.Intern in your C# solution? If so, why not just do s = String.Intern(s); after the string is ready following input from file?
To create a type for use in your business domain to handle string deduplication in general is a very bad idea. You don't want your business domain polluted by that kind of low level stuff.
As for rolling your own. I did that some years ago, probably to avoid that problem you mentioned with the strings not being garbage collected, but I never tested if that actually was a problem.
It might be a good idea to use a dictionary (or something) for each column (or type of column) where the same values are likely to repeat in great numbers. (This is pretty much what you said already.)
It makes sense to only keep these dictionaries live while you read the information from file, and stuff it into internal data structures. You might be thinking that you need the dictionaries for subsequent reads, but I am not so sure about that.
The important thing is to deduplicate the great majority of strings, and not necessarily every single duplicate. Because of this you can greatly simplify the solution as indicated. You most probably have nothing to gain by overcomplicating your solution to squeeze out the last fraction of memory savings.
Releasing the dictionaries after the file is read and the structures are filled has the advantage of not holding on to strings when they are no longer really needed. And of course you save memory by not holding onto the dictionaries.
I see no need to handle concurrency issues in the implementation here. String.Intern must necessarily be immune to concurrency issues. If you roll your own with the design suggested, you would not use it concurrently. Each file being read would have its own set of dictionaries for its columns.
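The design suggested above (one throwaway dictionary per column, scoped to a single file read) can be sketched roughly as follows. This is written in Python purely to illustrate the shape of the idea, not as an F# implementation; the function name read_rows_deduplicated and the sample data are hypothetical.

```python
import csv
import io


def read_rows_deduplicated(text_stream):
    """Read CSV rows, reusing one string object per distinct value per column."""
    reader = csv.reader(text_stream)
    header = next(reader)
    pools = [{} for _ in header]  # one pool per column, local to this read
    rows = []
    for record in reader:
        # setdefault returns the first string stored for this value, so later
        # duplicates are replaced by a reference to that first object.
        rows.append(tuple(pool.setdefault(v, v) for pool, v in zip(pools, record)))
    return header, rows  # the pools become garbage once this returns


sample = io.StringIO("city,status\nOslo,active\nBergen,active\nOslo,inactive\n")
header, rows = read_rows_deduplicated(sample)
print(rows[0][0] is rows[2][0])
```

Because the pools are local to one call, nothing is shared between files or threads, and the strings are only kept alive by the rows that use them.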
In Cassandra, when specifying a table and fields, one has to give each field a type (text, int, boolean, etc.). The same applies to collections: you have to lock a collection to a specific type (set<text> and such).
I need to store a list of mixed types in Cassandra. The list may contain numbers, strings and booleans. So I would need something like list<?>.
Is this possible in Cassandra and, if not, what workaround would you suggest for storing a list of mixed-type items? I sketched a few, but none of them seem the right way to go...
Cassandra's CQL interface is strictly typed, so you will not be able to create a table with an untyped collection column.
I basically see two options:
Create a list field, and convert everything to text (not too nice, I agree)
Use the Thrift API and store everything as is.
As suggested at http://www.mail-archive.com/user#cassandra.apache.org/msg37103.html I decided to encode the various values into binary and store them in a list<blob>. This still allows querying the collection values (in Cassandra 2.1+); one just needs to encode the values in the query.
In Python, the simplest way is probably to pickle and hexify when storing data:
pickle.dumps('Hello world').encode('hex')
And to load it:
pickle.loads(item.decode('hex'))
Using pickle ties the implementation to Python, but it automatically converts to the correct type (int, string, boolean, etc.) when loading, so it's convenient.
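The snippets above use Python 2 idioms; the 'hex' string codec is not available in Python 3. A minimal Python 3 sketch of the same round trip might look like this, where encode_item and decode_item are hypothetical helper names and the hex text is whatever ends up being written into the blob values:

```python
import pickle


def encode_item(value):
    """Pickle a value and render the bytes as hex text."""
    return pickle.dumps(value).hex()


def decode_item(hex_text):
    """Reverse encode_item: hex text back to bytes, then unpickle."""
    return pickle.loads(bytes.fromhex(hex_text))


mixed = [42, "Hello world", True, 3.5]
encoded = [encode_item(v) for v in mixed]
decoded = [decode_item(h) for h in encoded]
print(decoded)
```

Note that unpickling data from an untrusted source is unsafe, so this only makes sense when the application controls everything written to the column.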
In C++ and other languages, add-on libraries implement a multi-index container, e.g. Boost.Multiindex. That is, a collection that stores one type of value but maintains multiple different indices over those values. These indices provide for different access methods and sorting behaviors, e.g. map, multimap, set, multiset, array, etc. Run-time complexity of the multi-index container is generally the sum of the individual indices' complexities.
Is there an equivalent for Haskell or do people compose their own? Specifically, what is the most idiomatic way to implement a collection of type T with both a set-type of index (T is an instance of Ord) as well as a map-type of index (assume that a key value of type K could be provided for each T, either explicitly or via a function T -> K)?
I just uploaded IxSet to hackage this morning,
http://hackage.haskell.org/package/ixset
ixset provides sets which have multiple indexes.
ixset has been around for a long time as happstack-ixset. This version removes the dependencies on anything happstack specific, and is the new official version of IxSet.
Another option would be kdtree:
darcs get http://darcs.monoid.at/kdtree
kdtree aims to improve on IxSet by offering greater type-safety and better time and space usage. The current version seems to do well on all three of those aspects -- but it is not yet ready for prime time. Additional contributors would be highly welcomed.
In the trivial case where every element has a unique key that's always available, you can just use a Map and extract the key to look up an element. In the slightly less trivial case where each value merely has a key available, a simple solution would be something like Map K (Set T). Looking up an element directly would then involve first extracting the key, indexing the Map to find the set of elements that share that key, then looking up the one you want.
For the most part, if something can be done straightforwardly in the above fashion (simple transformation and nesting), it probably makes sense to do it that way. However, none of this generalizes well to, e.g., multiple independent keys or keys that may not be available, for obvious reasons.
Beyond that, I'm not aware of any widely-used standard implementations. Some examples do exist, for example IxSet from happstack seems to roughly fit the bill. I suspect one-size-kinda-fits-most solutions here are liable to have a poor benefit/complexity ratio, so people tend to just roll their own to suit specific needs.
Intuitively, this seems like a problem that might work better not as a single implementation, but rather a collection of primitives that could be composed more flexibly than Data.Map allows, to create ad-hoc specialized structures. But that's not really helpful for short-term needs.
For this specific question, you can use a Bimap. In general, though, I'm not aware of any common class for multimaps or multiply-indexed containers.
I believe that the simplest way to do this is simply with Data.Map. Although it is designed to use single indices, when you insert the same element multiple times, most compilers (certainly GHC) will make the values point to the same place. A separate implementation of a multimap wouldn't be that efficient, as you want to find elements based on their index, so you cannot naively associate each element with multiple indices - say [([key], value)] - as this would be very inefficient.
However, I have not looked at the Boost implementations of Multimaps to see, definitively, if there is an optimized way of doing so.
Have I got the problem straight? Both T and K have an order. There is a function key :: T -> K but it is not order-preserving. It is desired to manage a collection of Ts, indexed (for rapid access) both by the T order and the K order. More generally, one might want a collection of T elements indexed by a bunch of orders key1 :: T -> K1, .. keyn :: T -> Kn, and it so happens that here key1 = id. Is that the picture?
I think I agree with gereeter's suggestion that the basis for a solution is just to maintain in sync a bunch of (Map K1 T, .. Map Kn T). Inserting a key-value pair in a map duplicates neither the key nor the value, allocating only the extra heap required to make a new entry in the right place in the index. Inserting the same value, suitably keyed, in multiple indices should not break sharing (even if one of the keys is the value). It is worth wrapping the structure in an API which ensures that any subsequent modifications to the value are computed once and shared, rather than recomputed for each entry in an index.
Bottom line: it should be possible to maintain multiple maps, ensuring that the values are shared, even though the key-orders are separate.
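The shape of such a wrapper can be sketched as follows. This is an illustration in Python rather than Haskell, and it swaps in hash-based dicts for ordered maps, so it shows only the "several indices kept in sync over shared values" part, not ordered traversal. The class name MultiIndex and its methods are hypothetical, and each key function is assumed to yield a unique key per element.

```python
class MultiIndex:
    """Several key -> value indices kept in sync over the same value objects."""

    def __init__(self, **key_functions):
        self._keys = key_functions
        self._indices = {name: {} for name in key_functions}

    def insert(self, item):
        # The same object is stored under every index; nothing is copied.
        for name, key in self._keys.items():
            self._indices[name][key(item)] = item

    def delete(self, item):
        for name, key in self._keys.items():
            self._indices[name].pop(key(item), None)

    def lookup(self, index_name, key_value):
        return self._indices[index_name].get(key_value)


people = MultiIndex(by_self=lambda p: p, by_name=lambda p: p[0])
ada = ("ada", 36)
people.insert(ada)
print(people.lookup("by_name", "ada") is people.lookup("by_self", ada))
```

Routing every insert and delete through one API is what keeps the indices consistent; updating one index by hand would let them drift apart.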