we have a map of custom object key to custom value Object(complex Object). We set the in-memory-format as OBJECT. But IMap.get is taking more time to get the value when the retrieved object size is big. We cannot afford latency here and this is required for further processing. IMap.get is called in jvm where cluster is started. Do we have a way to get the objects quickly irrespective of its size?
This is partly the price you pay for in-memory-format==OBJECT
To confirm, try in-memory-format==BINARY and compare the difference.
Store and retrieve are slower with OBJECT, some queries will be faster. If you run enough of those queries the penalty is justified.
If you do get(X) and the value is stored deserialized (OBJECT), the following sequence occurs
1 - the object it serialized from object to byte[]
2 - the byte array is sent to the caller, possibly across the network
3 - the object is deserialized by the caller, byte[] to object.
If you change to store serialized (BINARY), step 1 isn't need.
If the caller is the same process, step 2 isn't needed.
If you can, it's worth upgrading (latest is 5.1.3) as there are some newer options that may perform better. See this blog post explaining.
You also don't necessarily have to return the entire object to the caller. A read-only EntryProcessor can extract part of the data you need to return across the network. A smaller network packet will help, but if the cost is in the serialization then the difference may not be remarkable.
If you're retrieving a non-local map entry (either because you're using client-server deployment model, or an embedded deployment with multiple nodes so that some retrievals are remote), then a retrieval is going to require moving data across the network. There is no way to move data across the network that isn't affected by object size; so the solution is to find a way to make the objects more compact.
You don't mention what serialization method you're using, but the default Java serialization is horribly inefficient ... any other option would be an improvement. If your code is all Java, IdentifiedDataSerializable is the most performant. See the following blog for some numbers:
https://hazelcast.com/blog/comparing-serialization-options/
Also, if your data is stored in BINARY format, then it's stored in serialized form (whatever serialization option you've chosen), so at retrieval time the data is ready to be put on the wire. By storing in OBJECT form, you'll have to perform the serialization at retrieval time. This will make your GET operation slower. The trade-off is that if you're doing server-side compute (using the distributed executor service, EntryProcessors, or Jet pipelines), the server-side compute is faster if the data is in OBJECT format because it doesn't have to deserialize the data to access the data fields. So if you aren't using those server-side compute capabilities, you're better off with BINARY storage format.
Finally, if your objects are large, do you really need to be retrieving the entire object? Using the SQL API, you can do a SELECT of just certain fields in the object, rather than retrieving the entire object. (You can also do this with Projections and the older Predicate API but the SQL method is the preferred way to do this). If the client code doesn't need the entire object, selecting certain fields can save network bandwidth on the object transfer.
Related
I have a nodeJS application that operates on user data in the form of JSON. The size of the JSON data is variable and depends on the user. Usually the size is around 30KB.
Every time any value of the the JSON parameter changes, I recalculate the JSON object, stringify it and encrypt the string using RSA encryption.
Calculating the json involves doing the following steps:
From the database get the data required to form the json. This involves querying atleast 4 tables in a nested for loop.
Use object.assign() to combine the data to form larger json object in the same for loop of order 2.
Once the final object is formed stringify and encrypt it using crypto module of nodejs.
So all this is causing my CPU to crash when the data is huge which means large number of iterations and lot of data to encrypt.
We use postgres database, hence I have to recalculate the entire json object even if only a single parameter value changed.
I was wondering if nodejs worker pool could be a solution to this? But I would like to know how the worker threads handles tasks under the hood. Suggestions on an alternate solution to this problem are also welcomed.
I am using hazelcast jet to aggreagte(sum) stream of data
Source is kafka where i receive integer and jet stream simply adds each incoming number.
I have few questions
1. When it receives each number along with a it saves the data in IMap, how can i access that snapshot?
#Abhishek, Hazelcast-Jet takes snapshots if you configure it, and not with each number, with a time period. If you want to access map, you cannot & even if you access, the data stored in that map uses an internal data structure, you cannot just view your numbers there.
If you can share what kind of information you're trying to get, I can help you more. (Along with your job definition to understand it a bit if possible)
For a project I am creating a queuing library and basically store URLs in a Set (it's actually an object, where I set keys to true, but one can see it as an array), so the queue only takes every url once. This works really well, however I am facing the problem that there are many URLs and so the RAM usage becomes really high.
Therefor I want to use an on-disk key-value store (actually only keys are required, no idea whether there is some different approach) with the following requirements:
No need to load the whole data set into RAM
Speedy lookups
Node.js bindings
It doesn't have to be too safe (losing data once in a while isn't a huge problem, low RAM requirements are more important) and even though I use Node.JS in this scenario this lookup doesn't necessarily need to run async.
Actually a side question would be whether there is some better way than a on-disk key-value approach. A term would be nice. Lookuptables somehow always lets me find data sets (IPs, ZIP codes, etc.)
I'd use a sql table with a single column (to store the url). Better control on memory usage than redis (which pretty much stores all in memory).
easy to check if there is already the same value
easy to insert
easy to remove one element
If it really "doesn't have to be too safe", another design would be to keep storing everything in memory but limit the number of URLs you store, for example by using an LRU cache.
You could either use a cache in node.js (easy to find via Google) or use a separate memcached server, possibly on the same machine.
I ran across a mention somewhere that doing an emit(key, doc) will increase the amount of time an index takes to build (or something to that effect).
Is there any merit to it, and is there any reason not to just always do emit(key, null) and then include_docs = true?
Yes, it will increase the size of your index, because CouchDB effectively copies the entire document in those cases. For cases in which you can, use include_docs=true.
There is, however, a race condition to be aware of when using this that is mentioned in the wiki. It is possible, during the time between reading the view data and fetching the document, that said document has changed (or has been deleted, in which case _deleted will be true). This is documented here under "Querying Options".
This is a classic time/space tradeoff.
Emitting document data into your index will increase the size of the index file on disk because CouchDB includes the emitted data directly into the index file. However, this means that, when querying your data, CouchDB can just stream the content directly from the index file on disk. This is obviously quite fast.
Relying instead on include_docs=true will decrease the size of your on-disk index, it's true. However, on querying, CouchDB must perform a document read for every returned row. This involves essentially random document lookups from the main data file, meaning that the cost and time of returning data increases significantly.
While the query time difference for small numbers of documents is slow, it will add up over every call made by the application. For me, therefore, emitting needed fields from a document into the index is usually the right call -- disk is cheap, user's attention spans less so. This is broadly similar to using covering indexes in a relational database, another widely echoed piece of advice.
I did a totally unscientific test on this to get a feel for what the difference is. I found about an 8x increase in response time and 50% increase in CPU when using include_docs=true to read 100,000 documents from a view when compared to a view where the documents were emitted directly into the index itself.
I'm converting an app from SQLitePersistentObjects to CoreData.
In the app, have a class that I generate many* instances of from an XML file retrieved from my server. The UI can trigger actions that will require me to save some* of those objects until the next invocation of the app.
Other than having a single NSManagedObjectContext for each of these objects (shared only with their subservient objects which can include blobs). I can't see a way how I can have fine grained control (i.e. at the object level) over which objects are persisted. If I try and have a single context for all newly created objects, I get an exception when I try to move one of my objects to a new context so I can persist it on ots own. I'm guessing this is because the objects it owns are left in the 'old' context.
The other option I see is to have a single context, persist all my objects and then delete the ones I don't need later - this feels like it's going to be hitting the database too much but maybe CoreData does magic.
So:
Am I missing something basic about the way my CoreData app should be architected?
Is having a context per object a good design pattern?
Is there a better way to move objects between contexts to avoid 2?
* where "many" means "tens, maybe hundreds, not thousands" and "some" is at least one order of magnitude less than "many"
Also cross posted to the Apple forums.
Core Data is really not an object persistence framework. It is an object graph management framework that just happens to be able to persist that graph to disk (see this previous SO answer for more info). So trying to use Core Data to persist just some of the objects in an object graph is going to be working against the grain. Core Data would much rather manage the entire graph of all objects that you're going to create. So, the options are not perfect, but I see several (including some you mentioned):
You could create all the objects in the Core Data context, then delete the ones you don't want to save. Until you save the context, everything is in-memory so there won't be any "going back to the database" as you suggest. Even after saving to disk, Core Data is very good at caching instances in the contexts' row cache and there is surprisingly little overhead to just letting it do its thing and not worrying about what's on disk and what's in memory.
If you can create all the objects first, then do all the processing in-memory before deciding which objects to save, you can create a single NSManagedObjectContext with a persistent store coordinator having only an in-memory persistent store. When you decide which objects to save, you can then add a persistent (XML/binary/SQLite) store to the persistent store coordinator, assign the objects you want to save to that store (using the context's (void)assignObject:(id)object toPersistentStore:(NSPersistentStore *)store) and then save the context.
You could create all the objects outside of Core Data, then copy the objects to-be-saved into a Core Data context.
You can create all the objects in a single in-memory context and write your own methods to copy those objects' properties and relationships to a new context to save just the instances you want. Unless the entities in your model have many relationships, this isn't that hard (see this page for tips on migrating objects from one store to an other using a multi-pass approach; it describes the technique in the context of versioning managed object models and is no longer needed in 10.5 for that purpose, but the technique would apply to your use case as well).
Personally, I would go with option 1 -- let Core Data do its thing, including managing deletions from your object graph.