Does index on IMap work with Kryo serialization? - hazelcast

I've created a Hazelcast IMap and defined an index on a value field. Does the index work with Kryo serialization? I remember that in earlier versions of Hazelcast, indexes used to work only when the in-memory-format was OBJECT.

Indexes do indeed work with Kryo serialization. When an entry is being indexed, Hazelcast deserializes it to extract the indexed fields.
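As an illustration, here is a minimal sketch of declaring such an index programmatically with the Hazelcast 3.x API; the map name, the Order value type, and the customerId field are made up for the example, and in a real setup the value class would be handled by your Kryo serializer:

import com.hazelcast.config.Config;
import com.hazelcast.config.InMemoryFormat;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.MapIndexConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class IndexedMapSketch {

    // Hypothetical value type; in practice it would be (de)serialized by your Kryo serializer.
    public static class Order implements java.io.Serializable {
        private final long customerId;
        public Order(long customerId) { this.customerId = customerId; }
        public long getCustomerId() { return customerId; }
    }

    public static void main(String[] args) {
        Config config = new Config();
        MapConfig mapConfig = config.getMapConfig("orders");
        mapConfig.setInMemoryFormat(InMemoryFormat.BINARY);                   // entries stay in serialized form
        mapConfig.addMapIndexConfig(new MapIndexConfig("customerId", false)); // unordered index on a value field

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        IMap<Long, Order> orders = hz.getMap("orders");
        orders.put(1L, new Order(42L));
        // Predicate queries on customerId can now use the index; Hazelcast deserializes
        // entries as needed to extract the indexed field, regardless of the serializer.
    }
}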

Related

Queries in Hazelcast

I have a Map that uses a MapStore, so some objects are not loaded into memory. How can I search for a required object if it isn't in memory?
Does the 'read-through' feature work for queries?
You can query Hazelcast for data held in Hazelcast or for data external to Hazelcast using the same SQL,
SELECT * FROM etc.
For the latter, see the documentation link.
Unfortunately, there is currently no implementation for Mongo, so for now you are blocked, sorry.
Read-through (or query-through) would also require the remote store to have the same format as the IMap, which is not otherwise required for MapStore.
If you can't host all your Mongo data in Hazelcast (which would eliminate the need to query Mongo), then you could consider some sort of flyweight design pattern and perhaps hold a projection.
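For reference, a minimal sketch of issuing such a query from a Java client, assuming a recent Hazelcast (5.x) with the SQL service enabled and an existing mapping named my_map (both names are illustrative):

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.sql.SqlResult;
import com.hazelcast.sql.SqlRow;

public class SqlQuerySketch {
    public static void main(String[] args) {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        // For data external to Hazelcast, "my_map" would first have to be declared
        // with CREATE MAPPING against a supported connector.
        try (SqlResult result = client.getSql().execute("SELECT * FROM my_map")) {
            for (SqlRow row : result) {
                System.out.println(row);
            }
        }
        client.shutdown();
    }
}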

How do I Persist special characters (like ñ) to Cassandra with Spring Data Cassandra?

We have a use case where we need to persist and retrieve strings with special characters in Cassandra. We're using Spring Data Cassandra for this. However, while persisting Ñisson with Spring Data Cassandra, ?isson gets persisted: the special character is replaced with ? in Cassandra.
I could persist it to Cassandra using CQLSH, so the DB schema supports these special characters, but Spring Boot is not able to persist it.
I tried adding properties such as:
spring.datasource.connectionProperties=useUnicode=true;characterEncoding=utf-8;
But nothing seemed to work.
Is it possible to persist it as Ñisson into the DB and also retrieve it with Ñisson as the partition key using Spring Data Cassandra?

Iterating a GraphTraversal with GraphFrame causes UnsupportedOperationException Row to Vertex conversion

The following
GraphTraversal<Row, Edge> traversal = gf().E().hasLabel("foo").limit(5);
while (traversal.hasNext()) {}
causes the following Exception:
java.lang.UnsupportedOperationException: Row to Vertex conversion is not supported: Use .df().collect() instead of the iterator
at com.datastax.bdp.graph.spark.graphframe.DseGraphTraversal.iterator$lzycompute(DseGraphTraversal.scala:92)
at com.datastax.bdp.graph.spark.graphframe.DseGraphTraversal.iterator(DseGraphTraversal.scala:78)
at com.datastax.bdp.graph.spark.graphframe.DseGraphTraversal.hasNext(DseGraphTraversal.scala:129)
The exception says to use .df().collect(), but gf().E().hasLabel("foo") does not allow you to call .df() afterwards. In other words, the df() method is not available on the object returned by hasLabel().
I'm using the Java API via dse-graph-frames:5.1.4 along with dse-byos_2.11:5.1.4.
The short answer: you need to cast the GraphTraversal to DseGraphTraversal, which has the df() method, then use one of the Spark Dataset methods to collect the Rows:
List<Row> rows =
((DseGraphTraversal)graph.E().hasLabel("foo"))
.df().limit(5).collectAsList();
DseGraphFrame does not yet support the full TinkerPop specification, so you cannot receive TinkerPop Vertex or Edge objects (the limit() method is also not implemented in DSE 5.1.x). It is recommended to switch to the Spark Dataset API with a df() call, get a Dataset<Row>, and use Dataset-based filtering and collecting.
If you need only Edge/Vertex properties, you can still use the TinkerPop valueMap() or values():
GraphTraversal<Row, Map<String, Object>> traversal = graph.E().hasLabel("foo").valueMap();
while (traversal.hasNext()) {
    System.out.println(traversal.next()); // each element is a map of edge properties
}

Hazelcast 3.4: how to avoid deserialization from near cache and get original item

Starting from version 3.x, Hazelcast returns a copy of the original object stored in a distributed map with near cache enabled, as opposed to version 2.5, where the original object was returned.
The old behavior allowed local modifications of entries stored in the map, and GET operations were fast.
Now, with version 3.x, it stores the binary object in the near cache, which causes deserialization on every GET and significantly impacts performance.
Is it possible to configure Hazelcast 3.4.2 map's near cache to return a reference to the original object, and not a copy of the original entry?
In the <near-cache> section, if you set
<in-memory-format>OBJECT</in-memory-format>
AND
<cache-local-entries>true</cache-local-entries>
you should get the same instance returned.
This works for both client and member.
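For reference, a rough sketch of the equivalent programmatic member-side configuration with the Hazelcast 3.4 API; the map name is illustrative:

import com.hazelcast.config.Config;
import com.hazelcast.config.InMemoryFormat;
import com.hazelcast.config.NearCacheConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class NearCacheObjectSketch {
    public static void main(String[] args) {
        NearCacheConfig nearCacheConfig = new NearCacheConfig();
        nearCacheConfig.setInMemoryFormat(InMemoryFormat.OBJECT); // keep near-cache entries deserialized
        nearCacheConfig.setCacheLocalEntries(true);               // also cache entries owned by this member

        Config config = new Config();
        config.getMapConfig("my-map").setNearCacheConfig(nearCacheConfig); // "my-map" is illustrative
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}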
I do not think there is a way to get the original item.
To avoid deserialization, you could try setting
<in-memory-format>OBJECT</in-memory-format>
in the <near-cache> configuration. This way Hazelcast will store data in the near cache in object form, and deserialization would not be needed. But I guess this will work only if you configure the <near-cache> on the client side, because if the <near-cache> is on the node, you will still need serialization to pass the object from the node to the client.

Spark Cassandra connector - Range query on partition key

I'm evaluating the spark-cassandra-connector and I'm struggling to get a range query on the partition key to work.
According to the connector's documentation, it seems possible to do server-side filtering on the partition key using the equality or IN operator, but unfortunately my partition key is a timestamp, so I cannot use them.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
Also, I can confirm that there is data to be returned, since running the query in cqlsh (doing the appropriate conversion using the 'token' function) DOES return data.
I'm using Spark 1.1.0 in standalone mode. Cassandra is 2.1.2 and the connector version is the 'b1.1' branch. The Cassandra driver is the DataStax 'master' branch.
The Cassandra cluster is overlaid on the Spark cluster, with 3 servers and a replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
I think the CassandraRDD error is telling you that the query you are trying to do is not allowed in Cassandra, and that you have to load the whole table into a CassandraRDD and then apply a Spark filter operation over this CassandraRDD.
So your code (in Scala) should look something like this:
val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")
val (start, end) = (fmt.parse("2013-01-01T00:00:00.000Z").getTime, fmt.parse("2013-12-31T00:00:00.000Z").getTime)
val cassRDD = sc.cassandraTable("keyspace name", "table name")
  .filter(row => row.getDate("timestamp").getTime >= start && row.getDate("timestamp").getTime < end)
If you are interested in making this type of query, you might have to take a look at other Cassandra connectors, like the one developed by Stratio.
You have several options to get the solution you are looking for.
The most powerful one would be to use Lucene indexes integrated with Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write time will increase but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take advantage of the Lucene indexes in Cassandra through it. I would recommend using Lucene indexes when you are executing a restricted query that retrieves a small-to-medium result set; if you are going to retrieve a big piece of your data set, you should use the third option below.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look it up using an IN operator. The problem is that, as far as I know, you can't use the spark-cassandra-connector for that; you would have to use the plain Cassandra driver, which is not integrated with Spark, or have a look at the deep-spark project, where a new feature allowing this is about to be released very soon. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
But, as I said before, I don't know if it fits your needs, since you might not be able to truncate your data and group it by date/time.
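As an illustration of that second approach, here is a rough sketch with the plain DataStax Java driver (2.x); the contact point, keyspace, table, and truncated date values are assumptions:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class TruncatedKeyQuerySketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("datastore")) {
            // Works only if the partition key is truncated to whole days,
            // so a finite IN list of dates covers the wanted range.
            ResultSet rs = session.execute(
                    "SELECT * FROM data WHERE timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03')");
            for (Row row : rs) {
                System.out.println(row);
            }
        }
    }
}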
The last option you have, but the least efficient, is to bring the full data set to your Spark cluster and apply a filter on the RDD.
Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.
I hope it helps!
