Persisting only part of a data source - intake

I'm using intake to access the catalog catalog.ocean.GFDL_CM2_6.GFDL_CM2_6_control_ocean_surface.
At the moment I only work with small patches of that data, but accessing that data every single time is still quite costly (it's on Google Cloud Storage). So I want to use the persist option of intake to store that data locally. However as far as I've understood from the docs, it looks like one can only persist the whole dataset. For that specific dataset that would amount to almost 400 dollars if I take a cost of 0.1$ per GB, since the total data is 3976GB.
Hence my questions:
Is there a way (especially for a zarr file which in theory should make this quite easy) to persist only parts of the data (for instance only a subset of the variables)
This is probably more complicated, but can I push things further, by persisting regions of data I'm interested in (in terms of coordinates values for instance)?

There is no direct Intake way to do what you are asking for. Intake was conceived as a way to get your data into a format that you can then manipulate as you normally do, i.e., deal with only the loading part, so that a persisted data-set is the same as the original.
However, it is not hard to accomplish manually: you should grab the xarray, filter for the region you need, and call to_zarr to save the new dataset. You can then point a simple catalogue entry like the old one at the new location.
You could have done this manipulation in a driver directly if this was a specific pattern that would repeat a lot. In fact, we have mooted the idea of whether/how to implement such processing steps in Intake, but there is no plan yet. In the end, we may take the work on pipelines in Holoviews to describe processing steps.

Related

How do I find out right data design and right tools/database/query for below requirement

I have a kind of requirement but not able to figure out how can I solve it. I have datasets in below format
id, atime, grade
123, time1, A
241, time2, B
123, time3, C
or if I put in list format:
[[123,time1,A],[124,timeb,C],[123,timec,C],[143,timed,D],[423,timee,P].......]
Now my use-case is to perform comparison, aggregation and queries over multiple row like
time difference between last 2 rows where id=123
time difference between last 2 rows where id=123&GradeA
Time difference between first, 3rd, 5th and latest one
all data (or last 10 records for particular id) should be easily accessible.
Also need to further do compute. What format should I chose for dataset
and what database/tools should I use?
I don't Relational Database is useful here. I am not able to solve it with Solr/Elastic if you have any ideas, please give a brief.Or any other tool Spark, hadoop, cassandra any heads?
I am trying out things but any help is appreciated.
Choosing the right technology is highly dependent on things related to your SLA. things like how much can your query have latency? what are your query types? is your data categorized as big data or not? Is data updateable? Do we expect late events? Do we need historical data in the future or we can use techniques like rollup? and things like that. To clarify my answer, probably by using window functions you can solve your problems. For example, you can store your data on any of the tools you mentioned and by using the Presto SQL engine you can query and get your desired result. But not all of them are optimal. Furthermore, usually, these kinds of problems can not be solved with a single tool. A set of tools can cover all requirements.
tl;dr. In the below text we don't find a solution. It introduces a way to think about data modeling and choosing tools.
Let me take try to model the problem to choose a single tool. I assume your data is not updatable, you need a low latency response time, we don't expect any late event and we face a large volume data stream that must be saved as raw data.
Based on the first and second requirements, it's crucial to have random access (it seems you wanna query on a particular ID), so solutions like parquet or ORC files are not a good choice.
Based on the last requirement, data must be partitioned based on the ID. Both the first and second requirements and the last requirement, count on ID as an identifier part and it seems there is nothing like join and global ordering based on other fields like time. So we can choose ID as the partitioner (physical or logical) and atime as the cluster part; For each ID, events are ordered based on the time.
The third requirement is a bit vague. You wanna result on all data? or for each ID?
For computing the first three conditions, we need a tool that supports window functions.
Based on the mentioned notes, it seems we should choose a tool that has good support for random access queries. Tools like Cassandra, Postgres, Druid, MongoDB, and ElasticSearch are things that currently I can remember them. Let's check them:
Cassandra: It's great on response time on random access queries, can handle a huge amount of data easily, and does not have a single point of failure. But sadly it does not support window functions. Also, you should carefully design your data model and it seems it's not a good tool that we can choose (because of future need for raw data). We can bypass some of these limitations by using Spark alongside Cassandra, but for now, we prefer to avoid adding a new tool to our stack.
Postgres: It's great on random access queries and indexed columns. It supports window functions. We can shard data (horizontal partitioning) across multiple servers (and by choosing ID as the shard key, we can have data locality on computations). But there is a problem: ID is not unique; so we can not choose ID as the primary key and we face some problems with random access (We can choose the ID and atime columns (as a timestamp column) as a compound primary key, but it does not save us).
Druid: It's a great OLAP tool. Based on the storing manner (segment files) that Druid follows, by choosing the right data model, you can have analytic queries on a huge volume of data in sub-seconds. It does not support window functions, but with rollup and some other functions (like EARLIEST), we can answer our questions. But by using rollup, we lose raw data and we need them.
MongoDB: It supports random access queries and sharding. Also, we can have some type of window function on its computing framework and we can define some sort of pipelines for doing aggregations. It supports capped collections and we can use it to store the last 10 events for each ID if the cardinality of the ID column is not high. It seems this tool can cover all of our requirements.
ElasticSearch: It's great on random access, maybe the greatest. With some kind of filter aggregations, we can have a type of window function. It can handle a large amount of data with sharding. But its query language is hard. I can imagine we can answer the first and second questions with ES, but for now, I can't make a query in my mind. It takes time to find the right solution with it.
So it seems MongoDB and ElasticSearch can answer our requirements, but there is a lot of 'if's on the way. I think we can't find a straightforward solution with a single tool. Maybe we should choose multiple tools and use techniques like duplicating data to find an optimal solution.

Considerations for time-series

We are looking into using Azure Table Storage (ATS) together with Deedle (or other libraries with similar functionality) for our time-series storage, manipulations and calculations. From what I can read, F# also seems like a good choice for operations on arrays.
Our starting point is a set of time-series for energy consumption. The series will either be the consumption within an interval (fixed or irregular intervals) or a counter (from which we can calculate the consumption from one reading to the next). As a data point is just a tag (used as a partition key), timestamp (rowkey) and value, this should be well suited for ATS.
From a user's perspective, they want to do calculations on the series for a given period and resolution, e.g. calculate a third series as a difference between two others, for one given year with monthly resolution.
This raises a number of questions:
Will ATS together with F# be fast enough? If we have 10.000 data points? 100.000? Compared to C#?
Resampling will require calculations of points between the series' timestamps. I haven't seen any Deedle examples for (linear) interpolation, but I assume that this is just passing a function which can look at the necessary data points? Will this be fast enough for our number of points?
The calculations will be determined by the users and we must have this as configurations. My best guess so far is to have the formula in some format we can parse easily into reverse polish notation, and take special care of tags that will represent series (ie. read from ATS, resample, then do the operations).
Any comments will be highly appreciated!
I think Isaac already mentioned the most important points, but as this question involves some of the things I'm involved with, I thought I'd share a few additional remarks!
BigDeedle. As Isaac mentioned, I used Azure Table storage in BigDeedle. This is mainly useful if you want to explore data interactively using Deedle APIs and do some filtering and range restriction before getting the data in memory and running your calculations. BigDeedle loads data lazily from potentially very big external data source. That said, if you eventually need to load all data into memory, this might not be all that useful for you.
The storage model used in BigDeedle might be useful though - it partitions data based on date, so when you want to get values in a given date range, it knows in which partitions to look. In my experience, loading data from ATS works pretty well, especially if you can do it on an MBrace cluster running in Azure (which is what my NDC demo does in the end).
Efficiency. I think the combination should work well for 10k or 100k data points - there will be no difference whether you do this from F# or C#. As for Deedle, I've definitely used it with data sets of this size - we optimize the library "as needed". Most of the functions are quite efficient already, but there may be some operations that are not efficient. This is something that can be fixed if you open issue on GitHub.
Resampling. There is built-in function for linear interpolation (see here), but I suspect you may need to write your own custom interpolation. Deedle does not "hide the underlying data" from you, so this is not too hard - the last example on this page shows a custom function for filling missing data that uses linear interpolation. If you are doing something like this, you'll need to have the data in memory (so BigDeedle would not be very useful here).
Specifying calculations. I suspect this is a separate question, but F# is great for domain-specific languages. I did a talk on that at earlier NDC. Generally, you can either specify your own DSL (and parse it) or have an embedded DSL where people write subset of F#. F# has good support for both.
PS: If you wanted to get some more help with F#, Deedle and Azure tables, feel free to get in touch. I'm happy to share my experience - you should be able to find a contact via my profile.
F# versus C# will probably be basically the same perf wise unless you do something completely different between the two (for example, immutable vs mutable data sets). Both compile down to IL at the end of the day.
Azure Table Storage - make sure you pick your partition + row keys correctly. There is a lot of documentation on picking Azure Table Storage partition keys, especially over time series - make sure you group rows up at the correct level to ensure data is distributed, with partitions not too large or small. You might also want to look at the Azure Storage Type Provider and / or Azure Storage F# libraries which makes working with ATS easier than the standard .NET SDK.
Deedle AFAIK does indeed have ability to replace missing values across time series, and there's at least a project called BigDeedle which works directly over ATS (although I'm not sure how ready this project is).

Real time multi threaded max-heap for top-N geohash

There is a requirement to keep a list of top-10 localities in a city from where demand for our food service is emanating at any given instant. The city could have tens of thousands of localities.
If one has to make a near real time (lag no more than 5 minutes) datastore in memory that would
- keep count of incoming demand by locality (geo hash)
- reads by hundreds of our suppliers every minute (the ajax refresh is every minute)
I was thinking of a multi threaded synchronized max-heap. This would be a complex solution as tree locking is by itself a complex implementation.
Any recommendations for the best in-memory (replicatable master slave) data structure that can be read and updated in multi threaded environment?
We expect 10K QPS and 100K updates per second. When we scale to other cities and regions, we will need per city implementation of top-10.
Are there any off the shelf solutions available?
Persistence is not a need so no mySQL based solutions. If you recommend redis or mongo DB solution, please realize that the queries are not pointed-queries by key but a top-N query instead.
Thanks in advance.
If you're looking for exactly what you're describing, there are a few approaches that might work nicely. There are several papers describing concurrent data structures that could work as priority queues; here is one option that I'm not super familiar with but which looks promising. You might also want to check out concurrent skip lists, which should also match your requirements.
If I'm interpreting your problem statement correctly, you're hoping to maintain a top-10 list of locations based on the number of hits you receive. If that's the case, I would suspect that while the number of updates would be huge, the number of times that two locations would switch positions would not actually be all that large. In other words, most updates wouldn't actually require the data structure to change shape. Consequently, you could consider using a standard binary heap where each element uses an atomic-compare-and-set integer key and where you have some kind of locking system that's used only in the case where you need to add, move, or delete an element from the heap.
Given the scale that you're working at, you may also want to consider approximate solutions to your problem. The count-min sketch data structure, for example, was specifically designed to estimate frequent elements in a data stream and does so extremely quickly. It can easily be distributed and linked up with a priority queue in a manner similar to what I described above. There are lots of good implementations out there, and if I remember correctly this data structure is actually deployed in situations like the one you're describing.
Hope this helps!

Space efficient embedded Haskell persistence solution

I'm looking for a persistence solution (maybe a NoSQL db? or something else...) that has the following criteria:
1) Has a Haskell API
2) Is disk space efficient--the db could easily get to many gigabytes of data but I need it to run well on a typical desktop. I need something that stores the data as efficiently as possible. So, for example, storing field names in a record would be bad.
3) High performance for reading sequential records. The typical use case is start somewhere and then read forward straight through the data--reading through possibly millions of records as quickly as possible.
4) Data is basically never changed (would only be changed if it was discovered data was incorrect somehow), just logged
5) It should act directly on file(s) that can be easily moved/copied around. It should not be calling a separate running server.
If you remove the "single file" requirement with no other running process, everything else can be fulfilled by every standard RDBMS, and depending on the type of data, sometimes especially well by columnar stores in particular.
The only single-file solution I know of is sqlite. Mainly sqlite founders when a single db needs to be accessed by multiple concurrent processes. If that isn't the case, then I wouldn't be surprised if you could scale it up singificantly.
Additionally, if you're only looking for sequential scans and key-value stores, you could just go with berkeleydb, which is known to be high-performance for very large data sets.
There are high quality Haskell bindings for talking to both sqlite and berkeleydb.
Edit: For sequential access only, its also blindingly straightforward to roll your own layer with the binary or cereal packages -- you basically need to write a helper function to wrap reading records from a file sequentially rather than all at once. An abstraction for folding over them is nice as well. Then you can decide to append to a single file, or spread your writes across files as you go. Either way, that's the most lightweight and straightforward option of all. The only drawback is having to worry about durability -- safe writes in the presence of interrupts, and all that other stuff that a good DB solution should take care of for you.
CouchDB ticks most of your boxes:
1) http://hackage.haskell.org/package/CouchDB
2) Depends on how you use it. You can store any binary data in it, but its up to you to know what it means. Or you can store XML or JSON, which is less space efficient but easier to migrate as your schema evolves (which it will).
3) Don't know, but its used for big web sites.
4) CouchDB uses a CM-like concept of updates and baselines, so old data stays around. It can be purged later as obsolete, but I think thats optional.
5) No. Its written in Erlang and runs (I believe) as a separate process. But why is that a problem?

Strategies for search across disparate data sources

I am building a tool that searches people based on a number of attributes. The values for these attributes are scattered across several systems.
As an example, dateOfBirth is stored in a SQL Server database as part of system ABC. That person's sales region assignment is stored in some horrible legacy database. Other attributes are stored in a system only accessible over an XML web service.
To make matters worse, the the legacy database and the web service can be really slow.
What strategies and tips should I consider for implementing a search across all these systems?
Note: Although I posted an answer, I'm not confident its a great answer. I don't intend to accept my own answer unless no one else gives better insight.
You could consider using an indexing mechanism to retrieve and locally index the data across all the systems, and then perform your searches against the index. Searches would be an awful lot faster and more reliable.
Of course, this just shifts the problem from one part of your system to another - now your indexing mechanism has to handle failures and heterogeneous systems, but that may be an easier problem to solve.
Another factor is how often the data changes. If you have to query data in real-time that goes stale very quickly, then indexing may not be practical.
If you can get away with a restrictive search, start by returning a list based on the search criteria corresponding to the fastest data source. Then join up those records with the other systems and remove records which don't match the search criteria.
If you have to implement OR logic, this approach is not going to work.
While not an actual answer, this might at least get you partway to a workable solution. We had a similar situation at a previous employer - lots of data sources, different ways of accessing those data sources, different access permissions, military/government/civilian sources, etc. We used Mule, which is built around the Enterprise Service Bus concept, to connect these data sources to our application. My details are a bit sketchy, as I wasn't the actual implementor, just an integrator, but what we did was define a channel in Mule. Then you write a simple integration piece to go between the channel and the data source, and the application and the channel. The integration piece does the work of making the actual query, and formatting the results, so we had a generic SQL integration piece for accessing a database, and for things like web services, we had some base classes that implemented common functionality, so the actual customization of the integration piecess was a lot less work than it sounds like. The application could then query the channel, which would handle accessing the various data sources, transforming them into a normalized bit of XML, and return the results to the application.
This had a lot of advantages for our situation. We could include new data sources for existing queries by simply connecting them to the channel - the application didn't have to know or care what data sources where there, as it only looked at the data from the channel. Since data can be pushed or pulled from the channel, we could have a data source update the application when, for example, it was updated.
It took a while to get it configured and working, but once we got it going, we were pretty successful with it. In our demo setup, we ended up with 4 or 5 applications acting as both producers and consumers of data, and connecting to maybe 10 data sources.
Have you thought of moving the data into a separate structure?
For example, Lucene stores data to be searched in a schema-less inverted indexed. You could have a separate program that retrieves data from all your different sources and puts them in a Lucene index. Your search could work against this index and the search results could contain a unique identifier and the system it came from.
http://lucene.apache.org/java/docs/
(There are implementations in other languages as well)
Have you taken a look at YQL? It may not be the perfect solution but I might give you starting point to work from.
Well, for starters I'd parallelize the queries to the different systems. That way we can minimize the query time.
You might also want to think about caching and aggregating the search attributes for subsequent queries in order to speed things up.
You have the option of creating an aggregation service or middleware that aggregates all the different systems so that you can provide a single interface for querying. If you do that, this is where I'd do the previously mentioned cache and parallize optimizations.
However, with all of that it you will need weighing up the development time/deployment time /long term benefits of the effort against migrating the old legacy database to a faster more modern one. You haven't said how tied into other systems those databases are so it may not be a very viable option in the short term.
EDIT: in response to data going out of date. You can consider caching if your data if you don't need the data to always match the database in real time. Also, if some data doesn't change very often (e.g. dates of birth) then you should cache them. If you employ caching then you could make your system configurable as to what tables/columns to include or exclude from the cache and you could give each table/column a personalizable cache timeout with an overall default.
Use Pentaho/Kettle to copy all of the data fields that you can search on and display into a local MySQL database
http://www.pentaho.com/products/data_integration/
Create a batch script to run nightly and update your local copy. Maybe even every hour. Then, write your query against your local MySQL database and display the results.

Resources