Hazelcast Projection fetch single field

Hazelcast Projection fetch single field - hazelcast

I have question regarding hazelcast Projection API.
Lets say I want to fetch just a single field from an entry in the map using Portable serialization.
In this example a name from an employee.
I'm guessing I will be getting better performance in relation to network traffic and deserialization by using a projection like this:
public String getName(Long key) {
return map.project(
(Projection<Entry<Long, Employee>, String>) entry -> entry.getValue().getName(),
(Predicate<Long, Event>) mapEntry -> mapEntry.getKey().equals(key))
.stream()
.findFirst().orElse(null);
}
Instead of something like:
public String getName(Long key) {
return map.get(key).getName();
}

Projection is a type of predicate that comes handy if you want to return only a part of the entire value object in a result set of many value objects. For one single key based lookup, it is an overkill. A map.get is a lighter weight operation than running a predicate.
Network traffic = not sure how much savings you will have as that depends on your network bandwidth + size of the object + no. of concurrent objects traveling.
Deserialization = not much saving unless the actual object stored as value is monstrous and the field you are extracting is a tiny bit.
If you are conscious of network bandwidth and ser/des cost then keep in-memory-format to OBJECT and use EntryProcessor to update. If you do not have anything to update then use ExecutorService.

Related

Latency in IMap.get(key) when the object being retrieved is heavy

we have a map of custom object key to custom value Object(complex Object). We set the in-memory-format as OBJECT. But IMap.get is taking more time to get the value when the retrieved object size is big. We cannot afford latency here and this is required for further processing. IMap.get is called in jvm where cluster is started. Do we have a way to get the objects quickly irrespective of its size?

This is partly the price you pay for in-memory-format==OBJECT
To confirm, try in-memory-format==BINARY and compare the difference.
Store and retrieve are slower with OBJECT, some queries will be faster. If you run enough of those queries the penalty is justified.
If you do get(X) and the value is stored deserialized (OBJECT), the following sequence occurs
1 - the object it serialized from object to byte[]
2 - the byte array is sent to the caller, possibly across the network
3 - the object is deserialized by the caller, byte[] to object.
If you change to store serialized (BINARY), step 1 isn't need.
If the caller is the same process, step 2 isn't needed.
If you can, it's worth upgrading (latest is 5.1.3) as there are some newer options that may perform better. See this blog post explaining.
You also don't necessarily have to return the entire object to the caller. A read-only EntryProcessor can extract part of the data you need to return across the network. A smaller network packet will help, but if the cost is in the serialization then the difference may not be remarkable.

If you're retrieving a non-local map entry (either because you're using client-server deployment model, or an embedded deployment with multiple nodes so that some retrievals are remote), then a retrieval is going to require moving data across the network. There is no way to move data across the network that isn't affected by object size; so the solution is to find a way to make the objects more compact.
You don't mention what serialization method you're using, but the default Java serialization is horribly inefficient ... any other option would be an improvement. If your code is all Java, IdentifiedDataSerializable is the most performant. See the following blog for some numbers:
https://hazelcast.com/blog/comparing-serialization-options/
Also, if your data is stored in BINARY format, then it's stored in serialized form (whatever serialization option you've chosen), so at retrieval time the data is ready to be put on the wire. By storing in OBJECT form, you'll have to perform the serialization at retrieval time. This will make your GET operation slower. The trade-off is that if you're doing server-side compute (using the distributed executor service, EntryProcessors, or Jet pipelines), the server-side compute is faster if the data is in OBJECT format because it doesn't have to deserialize the data to access the data fields. So if you aren't using those server-side compute capabilities, you're better off with BINARY storage format.
Finally, if your objects are large, do you really need to be retrieving the entire object? Using the SQL API, you can do a SELECT of just certain fields in the object, rather than retrieving the entire object. (You can also do this with Projections and the older Predicate API but the SQL method is the preferred way to do this). If the client code doesn't need the entire object, selecting certain fields can save network bandwidth on the object transfer.

Point Read a subdocument Azure CosmosDB

I am using the Core API and the NodeJS API for Cosmos DB. I am trying to do a point read to save on RUs and latency. This documentation lead me to this as my "solution".
However, this makes no sense to me. I believe from similar systems that needs the item ID and partition key, but this makes no reference to the latter to top things off.
By modifying some update code, by mostly pure luck I ended up with what MIGHT be a point read but it gets the full item, not the "map" value I am looking for.
const { resource: updated } = await container
.item(
id = email,
partitionKeyValue = email
)
.read('SELECT c.map');
console.log(updated)
How do I read just the "map" value? The full document has much more in it and it would probably waste the benefit of a point read to get the whole thing.

There are two ways to read data: either via a query (where you can do a projection, grouping, filtering, etc) or by a point-read (a direct read, specifying id and partition key value), and that point-read bypasses the query engine.
Point-reads cost a bit less in RU, but could potentially consume a bit more bandwidth, as it returns an entire document (the actual underlying API call only allows for ID plus partition key value, and returns a single matching document, in its entirety).
Via query, you have flexibility to return as much or as little as you want, but the resulting operation will cost a bit more in RU.

Service Fabric - (reaching MaxReplicationMessageSize) Huge amount of data in a reliable dictionary

EDIT question summary:
I want to expose an endpoints, that will be capable of returning portions of xml data by some query parameters.
I have a statefull service (that is keeping the converted to DTOs xml data into a reliable dictionary)
I use a single, named partition (I just cant tell which partition holds the data by the query parameters passed, so I cant implement some smarter partitioning strategy)
I am using service remoting for communication between the stateless WEBAPI service and the statefull one
XML data may reach 500 MB
Everything is OK when the XML only around 50 MB
When data gets larger I Service Fabric complaining about MaxReplicationMessageSize
and the summary of my few questions from below: how can one achieve storing large amount of data into a reliable dictionary?
TL DR;
Apparently, I am missing something...
I want to parse, and load into a reliable dictionary huge XMLs for later queries over them.
I am using a single, named partition.
I have a XMLData stateful service that is loading this xmls into a reliable dictionary in its RunAsync method via this peace of code:
var myDictionary = await this.StateManager.GetOrAddAsync<IReliableDictionary<string, List<HospitalData>>>("DATA");
using (var tx = this.StateManager.CreateTransaction())
{
var result = await myDictionary.TryGetValueAsync(tx, "data");
ServiceEventSource.Current.ServiceMessage(this, "data status: {0}",
result.HasValue ? "loaded" : "not loaded yet, starts loading");
if (!result.HasValue)
{
Stopwatch timer = new Stopwatch();
timer.Start();
var converter = new DataConverter(XmlFolder);
List <Data> data = converter.LoadData();
await myDictionary.AddOrUpdateAsync(tx, "data", data, (key, value) => data);
timer.Stop();
ServiceEventSource.Current.ServiceMessage(this,
string.Format("Loading of data finished in {0} ms",
timer.ElapsedMilliseconds));
}
await tx.CommitAsync();
}
I have a stateless WebApi service that is communicating with the above stateful one via service remoting and querying the dictionary via this code:
ServiceUriBuilder builder = new ServiceUriBuilder(DataServiceName);
DataService DataServiceClient = ServiceProxy.Create<IDataService>(builder.ToUri(),
new Microsoft.ServiceFabric.Services.Client.ServicePartitionKey("My.single.named.partition"));
try
{
var data = await DataServiceClient.QueryData(SomeQuery);
return Ok(data);
}
catch (Exception ex)
{
ServiceEventSource.Current.Message("Web Service: Exception: {0}", ex);
throw;
}
It works really well when the XMLs do not exceeds 50 MB.
After that I get errors like:
System.Fabric.FabricReplicationOperationTooLargeException: The replication operation is larger than the configured limit - MaxReplicationMessageSize ---> System.Runtime.InteropServices.COMException
Questions:
I am almost certain that it is about the partitioning strategy and I need to use more partitions. But how to reference a particular partition while in the context of the RunAsync method of the Stateful Service? (Stateful service, is invoked via the RPC in WebApi where I explicitly point out a partition, so in there I can easily chose among partitions if using the Ranged partitions strategy - but how to do that while the initial loading of data when in the Run Async method)
Are these thoughts of mine correct: the code in a stateful service is operating on a single partition, thus Loading of huge amount of data and the partitioning of that data should happen outside the stateful service (like in an Actor). Then, after determining the partition key I just invoke the stateful service via RPC and pointing it to this particular partition
Actually is it at all a partitioning problem and what (where, who) is defining the Size of a Replication Message? I.e is the partiotioning strategy influencing the Replication Message sizes?
Would excerpting the loading logic into a stateful Actor help in any way?
For any help on this - thanks a lot!

The issue is that you're trying to add a large amount of data into a single dictionary record. When Service Fabric tries to replicate that data to other replicas of the service, it encounters a quota of the replicator, MaxReplicationMessageSize, which indeed defaults to 50MB (documented here).
You can increase the quota by specifying a ReliableStateManagerConfiguration:
internal sealed class Stateful1 : StatefulService
{
public Stateful1(StatefulServiceContext context)
: base(context, new ReliableStateManager(context,
new ReliableStateManagerConfiguration(new ReliableStateManagerReplicatorSettings
{
MaxReplicationMessageSize = 1024 * 1024 * 200
}))) { }
}
But I strongly suggest you change the way you store your data. The current method won't scale very well and isn't the way Reliable Collections were meant to be used.
Instead, you should store each HospitalData in a separate dictionary item. Then you can query the items in the dictionary (see this answer for details on how to use LINQ). You will not need to change the above quota.
PS - You don't necessarily have to use partitioning for 500MB of data. But regarding your question - you could use partitions even if you can't derive the key from the query, simply by querying all partitions and then combining the data.

Is it possible to create a Couchbase view that groups output without a reduce function?

In Couchbase or CouchDB, is it possible to group without an explicit reduce function? In my client code, I want the data to be given to me just as a reduce would receive it (assuming all mappers were used as input, even during a rereduce). Using group=true without a defined reduce function gives me the error:
$ curl http://127.0.0.1:8092/default/_design/testing1/_view/all?group=true
{"error":"query_parse_error",
"reason":"Invalid URL parameter 'group' or 'group_level' for non-reduce view."}
I can add the identity reduce function:
reduce(keys,data) {return data;}
but Couchbase complains that I'm not actually reducing anything:
{"rows":[], "errors":[
{"from":"local","reason":"{<<"reduce_overflow_error">>,
<<"Reduce output must shrink more rapidly: Current output: '...'
}]}
I'd sure like to get the complete reduction in my client.

This not a technological limitation, it's a logical limitation. You cannot logically group results without some reduction of those results. This is analogous to the GROUP BY in SQL, you can't use that unless you also have some sort of aggregate function in your SQL query.

The use of the "identity reduce" is a common pitfall in Couch-style map reduce. The Couch reduce system is designed for reduce functions that look more like:
function(keys, values) {
return values.length;
}
The identity reduce will end up storing multiple copies of the map rows, in a non-indexed data structure.
It sounds like what you want is the unique set of keys, regardless of the reduction value. In that case, I would use the _count reduce function, since it is efficient and can run on any underlying data.

Preferred way to store a child object in Azure Table Storage

I did a little expirement with storing child objects in azure table storage today.
Something like Person.Project where Person is the table entity and Person is just a POCO. The only way I was able to achieve this was by serializing the Project into byte[]. It might be what is needed, but is there another way around?
Thanks
Rasmus

Personally I would prefer to store the Project in a different table with the same partition key that its parent have, which is its Person's partition key. It ensures that the person and underlying projects will be stored in the same storage cluster. On the code side, I would like to have some attributes on top of the reference properties, for example [Reference(typeof(Person))] and [Collection(typeof(Project))], and in the data context class I can use some extension method it retrieve the child elements on demand.

In terms of the original question though, you certainly can store both parent and child in the same table - were you seeing an error when trying to do so?
One other thing you sacrifice by separating out parent and child into separate tables is the ability to group updates into a transaction. Say you created a new 'person' and added a number of projects for that person, if they are in the same table with same partition key you can send the multiple inserts as one atomic operation. With a multi-table approach, you're going to have to manage atomicity yourself (if that's a requirement of your data consistency model).

I'm presuming that when you say person is just a POCO you mean Project is just a POCO?
My preferred method is to store the child object in its own Azure table with the same partition key and row key as the parent. The main reason is that this allows you to run queries against this child object if you have to. You can't run just one query that uses properties from both parent and child, but at least you can run queries against the child entity. Another advantage is that it means that the child class can take up more space, the limit to how much data you can store in a single property is less than the amount you can store in a row.
If neither of these things are a concern for you, then what you've done is perfectly acceptable.

I have come across a similar problem and have implemented a generic object flattener/recomposer API that will flatten your complex entities into flat EntityProperty dictionaries and make them writeable to Table Storage, in the form of DynamicTableEntity.
Same API will then recompose the entire complex object back from the EntityProperty dictionary of the DynamicTableEntity.
Have a look at: https://www.nuget.org/packages/ObjectFlattenerRecomposer/
Usage:
//Flatten complex object (of type ie. Order) and convert it to EntityProperty Dictionary
Dictionary<string, EntityProperty> flattenedProperties = EntityPropertyConverter.Flatten(order);
// Create a DynamicTableEntity and set its PK and RK
DynamicTableEntity dynamicTableEntity = new DynamicTableEntity(partitionKey, rowKey);
dynamicTableEntity.Properties = flattenedProperties;
// Write the DynamicTableEntity to Azure Table Storage using client SDK
//Read the entity back from AzureTableStorage as DynamicTableEntity using the same PK and RK
DynamicTableEntity entity = [Read from Azure using the PK and RK];
//Convert the DynamicTableEntity back to original complex object.
Order order = EntityPropertyConverter.ConvertBack<Order>(entity.Properties);

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string