I am using Hazelcast 3.6.1, set up as server/client. A Map lives on the server (a single node) and holds about 4 GB of data. My program creates a client and then needs to look up a small amount of data (roughly 30 MB). At first I fetched entries from the map and looped through all of them to search for the data of interest, and before I knew it the client process was 4 GB: each get on the map lazily loaded another piece of data into the client until everything was in memory. So I switched to aggregation, which I was under the impression runs entirely server side, with only the part I am interested in returned to the client, but the client process still grows to 350 MB!
Is aggregation solely done on the server?
Thanks
First of all, you should upgrade to one of the Hazelcast 3.8.x versions, since the new aggregation system is much faster. Apart from that it depends on what you are trying to aggregate, but if you do real aggregations such as sum, min or similar, aggregation is the way to go. The documentation for the 3.8.x fast-aggregations is available here: http://docs.hazelcast.org/docs/3.8.3/manual/html-single/index.html#fast-aggregations
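For illustration, a minimal sketch of what a 3.8.x fast-aggregation could look like from a client; the map name, value class and attribute path are assumptions, not something from your setup:

    // Hedged sketch: assumes a cluster-side IMap<String, Employee> named
    // "employees" whose values expose a numeric "salary" attribute.
    import java.io.Serializable;

    import com.hazelcast.aggregation.Aggregators;
    import com.hazelcast.client.HazelcastClient;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;

    public class FastAggregationExample {

        public static class Employee implements Serializable {
            private long salary;
            public Employee() { }
            public Employee(long salary) { this.salary = salary; }
            public long getSalary() { return salary; }
        }

        public static void main(String[] args) {
            HazelcastInstance client = HazelcastClient.newHazelcastClient();
            IMap<String, Employee> employees = client.getMap("employees");

            // The aggregation is executed on the members holding the data;
            // only the final sum is returned to the client.
            Long totalSalary = employees.aggregate(Aggregators.longSum("salary"));
            System.out.println("Total salary: " + totalSalary);

            client.shutdown();
        }
    }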
After some testing it appears that the collator portion of the aggregator is being done on the client.
Related
My previous question: Errors saving data to Google Datastore
We're running into issues writing to Datastore. Based on the previous question, we think the issue is that we're indexing a "SeenTime" attribute with YYYY-MM-DDTHH:MM:SSZ (e.g. 2021-04-29T17:42:58Z) and this is creating a hotspot (see: https://cloud.google.com/datastore/docs/best-practices#indexes).
We need to index this because we're querying the data by date and need the time for each observation in the end application. Is there a way around this issue where we can still query by date?
This answer is a bit late but:
On your previous question, before even getting to the query, the main issue seems to be the writes themselves (DEADLINE_EXCEEDED/UNAVAILABLE), and it's happening only on "some saves". So it's not completely clear whether it's due to data hot-spotting or to ingesting more data in shorter bursts, which causes contention (see "Designing for scale").
A single entity in Datastore mode should not be updated too rapidly. If you are using Datastore mode, design your application so that it will not need to update an entity more than once per second. If you update an entity too rapidly, then your Datastore mode writes will have higher latency, timeouts, and other types of error. This is known as contention.
You would need to add a prefix to the indexed value to shard the monotonically increasing timestamps (as mentioned in the best-practices doc). You can then test your queries using the GQL interface in the console. However, since you most likely want "all events", I don't think the hot-spotting is entirely avoidable, and the query will still run into hot-spotting and read latency.
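As an illustration of the prefix/sharding idea, here is a rough sketch using the Java com.google.cloud.datastore client; the Observation kind, property names, and shard count are all hypothetical:

    import com.google.cloud.datastore.Datastore;
    import com.google.cloud.datastore.DatastoreOptions;
    import com.google.cloud.datastore.Entity;
    import com.google.cloud.datastore.Key;

    public class ShardedSeenTimeWriter {

        // Hypothetical shard count: spreads the monotonically increasing
        // indexed values across several ranges to avoid one hot tablet.
        private static final int NUM_SHARDS = 10;

        public static void main(String[] args) {
            Datastore datastore = DatastoreOptions.getDefaultInstance().getService();

            String deviceId = "device-123";                 // hypothetical source id
            String seenTime = "2021-04-29T17:42:58Z";

            // Prefix the indexed value with a shard derived from something
            // already evenly distributed (here, the device id).
            int shard = Math.abs(deviceId.hashCode()) % NUM_SHARDS;
            String shardedSeenTime = shard + "|" + seenTime;

            Key key = datastore.newKeyFactory()
                    .setKind("Observation")
                    .newKey(deviceId + "|" + seenTime);

            Entity observation = Entity.newBuilder(key)
                    .set("seenTime", seenTime)               // kept for the application
                    .set("shardedSeenTime", shardedSeenTime) // the property you index on
                    .build();
            datastore.put(observation);

            // Querying a date range then means issuing NUM_SHARDS queries,
            // one per prefix, and merging the results client-side.
        }
    }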
The impression is that the latency might be unavoidable. If so, then you would need to decide if it's acceptable, depending on the frequency of your query/number-of-elements returned, along with the amount of latency (performance impact).
Consider switching to Firestore Native Mode. It has a different architecture under the hood and is the next version of Datastore. While Firestore is not perfect, it can be more forgiving about hot-spotting and contention, so it's possible that you'll have fewer issues than in Datastore.
Background
We have recently started a "Big Data" project where we want to track what users are doing with our product: how often they are logging in, which features they are clicking on, etc. - your basic user analytics. We still don't know exactly what questions we will be asking, but most of it will be of the "how often did X occur over the last Y months?" variety, so we started storing the data sooner rather than later, thinking we can always migrate or re-shape it when we need to, but if we don't store it, it is gone forever.
We are now looking at what sorts of questions we can ask. In a typical RDBMS, this stage would consist of slicing and dicing the data in many different dimensions, exporting to Excel, producing graphs, looking for trends etc - it seems that for Cassandra, this is rather difficult to do.
Currently we are using Apache Spark and submitting Spark SQL jobs to slice and dice the data. This actually works really well and we get the data we need, but it is rather cumbersome: there doesn't seem to be any native Spark API we can connect to from our workstations, so we are stuck using the spark-submit script and a Spark app that wraps some SQL from the command line and writes its output to a file, which we then have to read.
The question
In a table (or Column Family) with ~30 columns running on 3 nodes with RF 2, how bad would it be to add an INDEX to every non-PK column, so that we could simply query it using CQL across any column? Would there be a horrendous impact on the performance of writes? Would there be a large increase in disk space usage?
The other option I have been investigating is using Triggers, so that for each row inserted, we populated another handful of tables (essentially, custom secondary index tables) - is this a more acceptable approach? Does anyone have any experience of the performance impact of Triggers?
Impact of adding more indexes:
This really depends on your data structure and distribution and on how you access it; you were right earlier when you compared this process to an RDBMS. For Cassandra, it's best to define your queries first and then build the data model around them.
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by an indexed value, each node has to query its own records to build the final result set (as opposed to a primary-key query, where it is known exactly which node needs to be queried). So there's not just an impact on writes, but on read performance as well.
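To make the difference concrete, here's a small hedged sketch with the DataStax Java driver; the keyspace, table, and column names are made up for illustration:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class SecondaryIndexExample {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("analytics")) {

                // Hypothetical secondary index on a non-PK column.
                session.execute("CREATE INDEX IF NOT EXISTS ON events (feature_name)");

                // Partition-key query: routed straight to the replicas
                // that own this key.
                ResultSet byUser = session.execute(
                        "SELECT * FROM events WHERE user_id = ?", "user-42");

                // Indexed query: every node consults its own local index,
                // so the coordinator fans the request out across the cluster.
                ResultSet byFeature = session.execute(
                        "SELECT * FROM events WHERE feature_name = ?", "login");

                for (Row row : byFeature) {
                    System.out.println(row);
                }
            }
        }
    }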
In terms of working out the performance on your data model, I'd recommend using the cassandra-stress tool; you can combine it with a data modeler tool that DataStax has built to quickly generate profile YAMLs:
http://www.datastax.com/dev/blog/data-modeler
For example, I ran the basic stress profile without and then with secondary indexes on the default table, and the "with indexes" batch of writes took a little over 40% longer to complete. There was also an increase in GC operations / duration etc.
I am new to Accumulo. I know that I can write Java code to scan, insert, update and delete data using Hadoop and MapReduce. What I would like to know is whether aggregation is possible in Accumulo.
I know that in MySQL we can use GROUP BY, ORDER BY, MAX, MIN, COUNT, SUM, joins, nested queries, etc. Is there any possibility of using these functions in Accumulo, either directly or indirectly?
Accumulo does support aggregation through the use of combiner iterators (Accumulo Combiner Example).
Iterators mostly run server-side, but can be run client-side, and can perform quite a bit of computation before sending the data back to your client.
Accumulo comes packaged with many iterators; more specifically, the SummingCombiner is used to sum the values of entries. Dave Medinets has a blog with some good examples (Accumulo Blog), more specifically using the SummingCombiner to implement word count (Word Count in Accumulo). I also suggest signing up for the Accumulo user mailing lists (mailing lists).
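As a hedged sketch of what attaching a SummingCombiner looks like with the Accumulo 1.x client API (the instance, table, and column names here are placeholders):

    import java.util.Collections;
    import java.util.EnumSet;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
    import org.apache.accumulo.core.iterators.LongCombiner;
    import org.apache.accumulo.core.iterators.user.SummingCombiner;

    public class SummingCombinerExample {
        public static void main(String[] args) throws Exception {
            ZooKeeperInstance instance = new ZooKeeperInstance("myInstance", "zk1:2181");
            Connector connector = instance.getConnector("user", new PasswordToken("password"));

            // Combine values in the "stats" column family, qualifier "count".
            IteratorSetting setting = new IteratorSetting(10, "countSum", SummingCombiner.class);
            SummingCombiner.setColumns(setting,
                    Collections.singletonList(new IteratorSetting.Column("stats", "count")));
            SummingCombiner.setEncodingType(setting, LongCombiner.Type.STRING);

            // Attach at scan, minor-compaction and major-compaction scopes;
            // drop the scan scope if scan-time combining slows reads down.
            connector.tableOperations().attachIterator("wordcount", setting,
                    EnumSet.of(IteratorScope.scan, IteratorScope.minc, IteratorScope.majc));
        }
    }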
I like to think Accumulo has great aggregation functionality. I run an OLAP solution on it with hundreds of millions of keys on 40 nodes. In addition to the basic SummingCombiner, I recommend the newer StatsCombiner as well:
http://accumulo.apache.org/1.4/apidocs/org/apache/accumulo/examples/simple/combiner/StatsCombiner.html
which gives you basic stats about a set of keys.
You can set combiners to run at major compaction, minor compaction, or scan time. If you have a ton of data with a lot of keys trickling in, I don't recommend scan-time combining, because it can slow down scans (though not always).
HTH
Some aggregation is supported in Accumulo, over multiple entries, and even multiple rows, within each tablet. Aggregation across tablets would need to be done on the client side or in a MapReduce job.
Yes, aggregations are possible in Accumulo. You can achieve them by:
1) Using the built-in combiners, which aggregate data as you ingest it.
2) Writing a custom aggregation iterator and deploying it at minor or major compactions (a rough sketch of such a combiner follows below).
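For option 2, a rough sketch of a custom combiner; the class name and value encoding are made up, and you would still attach it with an IteratorSetting exactly as in the SummingCombiner example above:

    import java.util.Iterator;

    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.Combiner;

    // Hypothetical custom combiner that keeps the maximum of long values
    // encoded as plain strings. Deploy it at the minc/majc (and optionally
    // scan) scopes via an IteratorSetting.
    public class MaxCombiner extends Combiner {
        @Override
        public Value reduce(Key key, Iterator<Value> iter) {
            long max = Long.MIN_VALUE;
            while (iter.hasNext()) {
                max = Math.max(max, Long.parseLong(new String(iter.next().get())));
            }
            return new Value(Long.toString(max).getBytes());
        }
    }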
I have a MongoDB database with thousands of records holding very long vectors.
I am looking for correlations between an input vector and my MongoDB data set, using a certain algorithm.
Pseudocode:

    function find_best_correlation(input_vector):
        max_correlation = 0
        return_vector = []
        for each reference_vector in dataset:
            correlation = calculateCorrelation(input_vector, reference_vector)
            if correlation > max_correlation:
                max_correlation = correlation
                return_vector = reference_vector
        return return_vector
This is a very good candidate for the map-reduce pattern, as I don't care about the order in which the calculations are run.
The issue is that my database is on one node.
I would like to run many mappings simultaneously (I have an 8-core machine).
From what I understand, MongoDB only uses one thread of execution per node, so in practice my data set is being processed serially.
Is this correct?
If so can I configure the number of processes/threads per map-reduce run?
If I manage multiple threads running map-reduce in parallel and then aggregate the results will I have substantial performance increase (Has anybody tried)?
If not, can I have multiple replicas of my DB on the same node and "trick" MongoDB into running on two of them?
Thanks!
Map-reduce in MongoDB uses SpiderMonkey, a single-threaded JavaScript engine, so it is not possible to configure multiple processes (and there are no "tricks"). There is a JIRA ticket to use a multi-threaded JS engine, which you can follow here:
https://jira.mongodb.org/browse/SERVER-2407
If possible, I would consider looking into the new aggregation framework (available in MongoDB version 2.2), which is written in C++ instead of Javascript and may offer performance improvements:
http://docs.mongodb.org/manual/applications/aggregation/
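For illustration, a small sketch of an aggregation pipeline driven from Java; note that it uses a much newer Java driver API than the 2.2-era answer above, the collection and field names are invented, and a custom correlation function still cannot be pushed into the pipeline (that part stays client-side):

    import static com.mongodb.client.model.Accumulators.max;
    import static com.mongodb.client.model.Aggregates.group;
    import static com.mongodb.client.model.Aggregates.match;
    import static com.mongodb.client.model.Filters.eq;

    import java.util.Arrays;

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class AggregationFrameworkExample {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> vectors =
                        client.getDatabase("mydb").getCollection("vectors");

                // The pipeline runs on the server (C++, not SpiderMonkey);
                // only the grouped result crosses the network.
                for (Document doc : vectors.aggregate(Arrays.asList(
                        match(eq("category", "reference")),
                        group("$category", max("maxLength", "$length"))))) {
                    System.out.println(doc.toJson());
                }
            }
        }
    }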
I would like to stream some files in and out of Cassandra, since we already use it, rather than setting up a full Hadoop distributed file system. Are there any asynchronous puts in Astyanax or Hector that I can provide a callback for when they complete, so I can avoid the ~1 ms network delay per call across 1000 calls as I write 1000 entries (split between a few rows and columns as well, so the data is streamed to a few servers in parallel and all the responses/callbacks come back when streaming is done)? Does Hector or Astyanax support this?
It looks like Astyanax supports a query callback, so I think I can use the primary keys to stream the file back out with Astyanax?
thanks,
Dean
Cassandra doesn't actually support streaming via the Thrift API. Furthermore, breaking the file up into a single mutation batch that spreads data across multiple rows and columns can be very dangerous: it could blow the heap on Cassandra, or you may run into the 1 MB socket write buffer limit, which under certain error cases can cause your Thrift connection to hang indefinitely (although I think this may be fixed in the latest version of Cassandra).
The new chunked object store recipe in Astyanax (https://github.com/Netflix/astyanax/wiki/Chunked-Object-Store) builds on our experience at Netflix with storing large objects in Cassandra and provides a simple API that handles all the chunking and parallelization for you. It could still make thousands of calls to Cassandra (depending on your file size and chunk size), but it also handles all the retries and parallelization for you. The same goes for reading files: the API will read the chunks and reassemble them in order into an OutputStream.
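A rough sketch of that recipe, based on the wiki page linked above; the keyspace, column family, object name, and tuning values are placeholders, and the exact builder methods should be checked against your Astyanax version:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;

    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.recipes.storage.CassandraChunkedStorageProvider;
    import com.netflix.astyanax.recipes.storage.ChunkedStorage;
    import com.netflix.astyanax.recipes.storage.ChunkedStorageProvider;
    import com.netflix.astyanax.recipes.storage.ObjectMetadata;

    public class ChunkedStoreExample {

        public static void storeAndReadBack(Keyspace keyspace) throws Exception {
            ChunkedStorageProvider provider =
                    new CassandraChunkedStorageProvider(keyspace, "file_store");

            // Write: the recipe splits the stream into chunks and writes them
            // across multiple rows/columns, handling retries for you.
            try (InputStream in = new FileInputStream("/tmp/myfile.bin")) {
                ObjectMetadata meta = ChunkedStorage.newWriter(provider, "myfile.bin", in)
                        .withChunkSize(0x10000)   // 64 KB chunks (arbitrary choice)
                        .call();
                System.out.println("Stored " + meta.getObjectSize() + " bytes");
            }

            // Read: chunks are fetched (with some parallelism) and reassembled,
            // in order, into the OutputStream.
            try (OutputStream out = new FileOutputStream("/tmp/myfile.copy")) {
                ChunkedStorage.newReader(provider, "myfile.bin", out)
                        .withBatchSize(8)         // chunks fetched per batch
                        .withConcurrencyLevel(4)  // parallel chunk reads
                        .call();
            }
        }
    }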