I need to load around 10 million records from a flat file into a Hazelcast map. The TTL also needs to be set per map entry.
What is the most efficient way to do this?
Currently I am using IMap.putAll(). Is there a way to set a per-entry TTL using putAll()?
There isn't an API that allows you to do bulk put with individual expiry.
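For reference, the baseline is to iterate over the file and call the IMap.put overload that takes a per-entry TTL (key, value, ttl, time unit). A minimal sketch, assuming a Hazelcast client, one numeric value per line, and the same illustrative key/value/TTL derivation as the Jet pipeline further down (the input path is a placeholder):
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;
public class SimpleLoader {
    public static void main(String[] args) throws Exception {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        IMap<Long, Long> map = client.getMap("test");
        // "input_data/test" is a placeholder path with one numeric value per line
        try (Stream<String> lines = Files.lines(Paths.get("input_data/test"))) {
            lines.forEach(line -> {
                long l = Long.parseLong(line);
                // key, value and TTL derived per line, purely for illustration
                map.put(100 * l, 200 * l, 300 * l, TimeUnit.SECONDS);
            });
        }
        client.shutdown();
    }
}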
The code below shows a way to do it with Hazelcast Jet writing into Hazelcast's IMap.
The client submits this job and the grid servers process it, reading a single input file server side. The .groupingKey line partitions the input stream by
the entry key, so each server does a map.put where the key will be local, enriched with a different TTL for each entry.
This is an alternative to iterating across your input file and inserting each key individually (as in the sketch above). Whether it is faster will depend on factors such as network speed, the number of servers, and so on. It is certainly more complicated than simple iteration, so the speed gain would need to justify the complexity.
public class MyClient implements EntryExpiredListener<Long, Long> {
private static final String INPUT_DIRECTORY = System.getProperty("user.home") + "/input_data";
private static final String MAP_NAME = "test";
public static void main(String[] args) {
new MyClient().go();
}
public void go() {
JetInstance jetInstance = Jet.newJetClient();
jetInstance.getMap(MAP_NAME).addEntryListener(this, false);
Pipeline pipeline = MyClient.buildPipeline();
JobConfig jobConfig = new JobConfig();
jobConfig.addClass(MyClient.class);
try {
jetInstance.newJob(pipeline, jobConfig).join();
} catch (Exception e) {
e.printStackTrace();
}
}
/**
* Process a file that looks like <pre>
* % cat test/input
* 1
* 2
* 3
* 4
* 5
* </pre>
* #return
*/
private static Pipeline buildPipeline() {
ComparatorEx<Tuple3<Long, Long, Long>> comparatorEx = ComparatorEx.comparingLong(Tuple3::f0);
Pipeline pipeline = Pipeline.create();
BatchStage<String> input = pipeline.readFrom(MyClient.mySource(INPUT_DIRECTORY));
// Convert to trios of key, value, expiry
BatchStage<Tuple3<Long, Long, Long>> tuples
= input
.map(line -> {
long l = Long.parseLong(line);
return Tuple3.<Long, Long, Long>tuple3(100 * l, 200 * l, 300 * l);
});
// Route per JVM based on entry key
BatchStage<Entry<Long, Tuple3<Long, Long, Long>>> routedEntries
= tuples
.groupingKey(Tuple3::f0)
.rollingAggregate(AggregateOperations.maxBy(comparatorEx));
// Custom map save using expiry
routedEntries.writeTo(MyClient.mySink(MAP_NAME));
// [Optional] log all entries to stdout
routedEntries.writeTo(Sinks.logger());
return pipeline;
}
private static BatchSource<String> mySource(String directory) {
return Sources.filesBuilder(directory)
.sharedFileSystem(true)
.build();
}
private static Sink<? super Entry<Long, Tuple3<Long, Long, Long>>> mySink(String mapName) {
return SinkBuilder.sinkBuilder("mySink",
processorContext -> processorContext.jetInstance().<Long, Long>getMap(mapName))
.receiveFn((IMap<Long, Long> map, Entry<Long, Tuple3<Long, Long, Long>> entry) -> {
map.put(entry.getKey(), entry.getValue().f1(), entry.getValue().f2(), TimeUnit.SECONDS);
})
.build();
}
@Override
public void entryExpired(EntryEvent<Long, Long> entryEvent) {
System.out.println(entryEvent.getEventType() + " for " + entryEvent.getKey());
}
}
Related
I understand that Near Caches are not guaranteed to be synchronized in real time when the value is updated elsewhere on some other node.
However I do expect it to be in sync with the EntryUpdatedListener that is on the same node and therefore the same process - or am I missing something?
Sequence of events:
A cluster of one node modifies the same key, flipping its value from X to Y and back to X on a fixed interval (every second in the reproducer below).
A client connects to this cluster node and adds an EntryUpdatedListener to observe the flipping value.
Client receives the EntryUpdatedEvent and prints the value given - as expected, it gives the value recently set.
Client immediately does a map.get for the same key (which should hit the near cache), and it prints a STALE value.
I find this strange - it means that two "channels" within the same client process are showing inconsistent versions of data. I would only expect this between different processes.
Below is my reproducer code:
public class ClusterTest {
private static final int OLD_VALUE = 10000;
private static final int NEW_VALUE = 88888;
private static final int KEY = 5;
private static final int NUMBER_OF_ENTRIES = 10;
public static void main(String[] args) throws Exception {
HazelcastInstance instance = Hazelcast.newHazelcastInstance();
IMap map = instance.getMap("test");
for (int i = 0; i < NUMBER_OF_ENTRIES; i++) {
map.put(i, 0);
}
System.out.println("Size of map = " + map.size());
boolean flag = false;
while(true) {
int value = flag ? OLD_VALUE : NEW_VALUE;
flag = !flag;
map.put(KEY, value);
System.out.println("Set a value of [" + value + "]: ");
Thread.sleep(1000);
}
}
}
public class ClientTest {
public static void main(String[] args) throws InterruptedException {
HazelcastInstance instance = HazelcastClient.newHazelcastClient(new ClientConfig().addNearCacheConfig(new NearCacheConfig("test")));
IMap map = instance.getMap("test");
System.out.println("Size of map = " + map.size());
map.addEntryListener(new MyEntryListener(instance), true);
new CountDownLatch(1).await();
}
static class MyEntryListener
implements EntryAddedListener,
EntryUpdatedListener,
EntryRemovedListener {
private HazelcastInstance instance;
public MyEntryListener(HazelcastInstance instance) {
this.instance = instance;
}
@Override
public void entryAdded(EntryEvent event) {
System.out.println("Entry Added:" + event);
}
@Override
public void entryRemoved(EntryEvent event) {
System.out.println("Entry Removed:" + event);
}
@Override
public void entryUpdated(EntryEvent event) {
Object o = instance.getMap("test").get(event.getKey());
boolean equals = o.equals(event.getValue());
String s = "Event matches what has been fetched = " + equals;
if (!equals) {
s += ", EntryEvent value has delivered: " + (event.getValue()) + ", and an explicit GET has delivered:" + o;
}
System.out.println(s);
}
}
}
The output from the client:
INFO: hz.client_0 [dev] [3.11.1] HazelcastClient 3.11.1 (20181218 - d294f31) is CLIENT_CONNECTED
Jun 20, 2019 4:58:15 PM com.hazelcast.internal.diagnostics.Diagnostics
INFO: hz.client_0 [dev] [3.11.1] Diagnostics disabled. To enable add -Dhazelcast.diagnostics.enabled=true to the JVM arguments.
Size of map = 10
Event matches what has been fetched = true
Event matches what has been fetched = false, EntryEvent value has delivered: 88888, and an explicit GET has delivered:10000
Event matches what has been fetched = true
Event matches what has been fetched = true
Event matches what has been fetched = false, EntryEvent value has delivered: 10000, and an explicit GET has delivered:88888
Near Cache has an eventual consistency guarantee, while listeners work in a fire-and-forget fashion. That's why they are two separate mechanisms. Also, batching of near cache invalidation events reduces network traffic and keeps the eventing system less busy (this helps when there are many invalidations or clients); as a trade-off it may increase the delay of individual invalidations. If you are confident that your system can handle each invalidation event, you can disable batching.
You need to configure the property on the member side, since events are generated on cluster members and sent to clients.
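A minimal member-side sketch, assuming the hazelcast.map.invalidation.batch.enabled system property documented for Hazelcast 3.x (verify the property name against your version's docs):
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
public class MemberWithoutInvalidationBatching {
    public static void main(String[] args) {
        Config config = new Config();
        // member-side property: send each near cache invalidation immediately instead of batching
        config.setProperty("hazelcast.map.invalidation.batch.enabled", "false");
        HazelcastInstance member = Hazelcast.newHazelcastInstance(config);
    }
}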
In my application I'm trying to process data held in an IMap. The scenario is as follows:
the application receives a request (REST, for example) with a set of keys to be processed
the application processes the entries with the given keys and returns the result: a map where the key is the original entry key and the value is the calculated result
For this scenario IMap.executeOnKeys is almost perfect, with one problem: the entry is locked while being processed, and that really hurts throughput. The IMap is populated on startup and never modified.
Is it possible to process entries without locking them? Ideally without sending entries to another node and without causing network overhead (such as sending 1000 tasks to a single node in a for-loop).
Here is a reference implementation to demonstrate what I'm trying to achieve:
public class Main {
public static void main(String[] args) throws Exception {
HazelcastInstance instance = Hazelcast.newHazelcastInstance();
IMap<String, String> map = instance.getMap("the-map");
// populated once on startup, never modified
for (int i = 1; i <= 10; i++) {
map.put("key-" + i, "value-" + i);
}
Set<String> keys = new HashSet<>();
keys.add("key-1"); // every requst may have different key set, they may overlap
System.out.println(" ---- processing ----");
ForkJoinPool pool = new ForkJoinPool();
// to simulate parallel requests on the same entry
pool.execute(() -> map.executeOnKeys(keys, new MyEntryProcessor("first")));
pool.execute(() -> map.executeOnKeys(keys, new MyEntryProcessor("second")));
System.out.println(" ---- pool is waiting ----");
pool.shutdown();
pool.awaitTermination(5, TimeUnit.MINUTES);
System.out.println(" ------ DONE -------");
}
static class MyEntryProcessor implements EntryProcessor<String, String> {
private String name;
MyEntryProcessor(String name) {
this.name = name;
}
@Override
public Object process(Map.Entry<String, String> entry) {
System.out.println(name + " is processing " + entry);
return calculate(entry); // may take some time, doesn't modify entry
}
@Override
public EntryBackupProcessor<String, String> getBackupProcessor() {
return null;
}
}
}
Thanks in advance
In executeOnKeys the entries are not locked. Maybe you mean that the processing happens on the partition threads, so that no other processing can run for that particular key at the same time? Anyhow, here's the solution:
Your EntryProcessor should implement both of the following (see the sketch after this list):
Offloadable interface -> this means the partition thread will be used only for reading the value. The calculation will be done in the offloading thread pool.
ReadOnly interface -> in this case the EP won't hop onto the partition thread again to save the modification you might have done to the entry. Since your EP does not modify entries, this will increase performance.
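A minimal sketch of what that could look like, written against the Hazelcast 3.x API used in the question; the calculate method is a placeholder standing in for the question's own computation:
import com.hazelcast.core.Offloadable;
import com.hazelcast.core.ReadOnly;
import com.hazelcast.map.EntryBackupProcessor;
import com.hazelcast.map.EntryProcessor;
import java.util.Map;
public class MyReadOnlyEntryProcessor implements EntryProcessor<String, String>, Offloadable, ReadOnly {
    private final String name;
    MyReadOnlyEntryProcessor(String name) {
        this.name = name;
    }
    @Override
    public Object process(Map.Entry<String, String> entry) {
        // runs on the offloaded executor, not on the partition thread
        System.out.println(name + " is processing " + entry);
        return calculate(entry);
    }
    @Override
    public EntryBackupProcessor<String, String> getBackupProcessor() {
        // read-only processor, nothing to apply on backups
        return null;
    }
    @Override
    public String getExecutorName() {
        // use Hazelcast's default offloading executor
        return Offloadable.OFFLOADABLE_EXECUTOR;
    }
    private Object calculate(Map.Entry<String, String> entry) {
        // placeholder: the real calculation takes time but does not modify the entry
        return entry.getValue().length();
    }
}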
I'm using Cassandra 3.10 and I need to count how many data modifications I have. To solve this I'm going to use a Cassandra trigger, but the result of the trigger's work needs to be a collection of Mutation objects:
public interface ITrigger {
public Collection<Mutation> augment(Partition update);
}
But what I have is a CounterMutation:
@Override
public Collection<Mutation> augment(Partition update) {
String keyspaceName = update.metadata().ksName;
//CFMetaData metadata = Schema.instance.getCFMetaData(keyspaceName, cfName);
long timestamp = System.currentTimeMillis();
String cfName = "user_product_count_cf";
CFMetaData counterCfMetadata = Schema.instance.getCFMetaData(keyspaceName, cfName);
ByteBuffer key = toByteBuffer("test-user-key");
PartitionUpdate.SimpleBuilder builder = PartitionUpdate.simpleBuilder(counterCfMetadata, key);
ByteBuffer columnName = toByteBuffer("test-counter-column");
ByteBuffer countValue = CounterContext.instance().createLocal(1);
builder.timestamp(timestamp).row(Clustering.make(columnName)).add("value", countValue);
Mutation mutation = builder.buildAsMutation();
//TODO this line does not work
//return Collections.singletonList(mutation);
CounterMutation cMutation = new CounterMutation(mutation, ConsistencyLevel.ONE);
//I do not understand what I have to do next.
//Now I use the next line and it works, but I'm not sure it is good practice.
return Collections.singletonList(cMutation.applyCounterMutation());
}
So how do I convert a CounterMutation to a Mutation?
You can use the QueryProcessor class.
Example:
@Override
public Collection<Mutation> augment(Partition update) {
String keyspaceName = update.metadata().ksName;
String cfName = "user_product_count_cf";
QueryProcessor.process(
QueryBuilder.update(keyspaceName,cfName).with(incr("test-counter-column",1)).where(eq("test-user-key", bindMarker())).toString(),
ConsistencyLevel.LOCAL_QUORUM,
Arrays.asList(key) // Put the partition key value here
);
return Collections.EMPTY_LIST;
}
I am trying to match on a field that is not the key with a remote Hazelcast instance; the goal here is to create many remote instances and use them to store serialized objects.
What I noticed is that if I do both the put and the SQL-style query in the same run, the query returns the entry, as follows:
My class:
public class Provider implements Serializable {
private String resourceId;
private String first;
public String getResourceId() {
return resourceId;
}
public void setResourceId(String resourceId) {
this.resourceId = resourceId;
}
public String getFirst() {
return first;
}
public void setFirst(String first) {
this.first = first;
}
}
Code:
/*********** map initialization ************/
Config config = new Config();
config.getNetworkConfig().setPublicAddress(host + ":" + port);
instance = Hazelcast.newHazelcastInstance(config);
map = instance.getMap("providerWithIndex");
String first = "XXXX";
/***** adding item ***************/
Provider provider = new Provider();
provider.setResourceId("11111");
provider.setFirst( first);
map.put(provider);
/********** getting items ************/
EntryObject e = new PredicateBuilder().getEntryObject();
Predicate predicate = e.get( "first" ).equal( first ) ;
Collection<Provider> providers = map.values(predicate);
Once I run the put and the get in different runs, the result is 0 - the same code returns no results.
My only thought is that it only does a local fetch and not a remote one, but I hope I have a bug somewhere.
Your code looks good, apart from the fact that you can't do map.put(provider); you need to pass a key, like map.put(someKey, provider).
The query itself looks and works fine.
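A minimal sketch of the corrected put together with an optional index on the queried field, reusing the instance and the Provider class from the question; using resourceId as the map key is just an assumption for illustration:
IMap<String, Provider> map = instance.getMap("providerWithIndex");
// optional: index the non-key field the predicate filters on (Hazelcast 3.x API)
map.addIndex("first", false);
Provider provider = new Provider();
provider.setResourceId("11111");
provider.setFirst("XXXX");
// put needs an explicit key; resourceId is used as the key here purely for illustration
map.put(provider.getResourceId(), provider);
// the same predicate as before now also works across separate runs
EntryObject e = new PredicateBuilder().getEntryObject();
Predicate predicate = e.get("first").equal("XXXX");
Collection<Provider> providers = map.values(predicate);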
I want to measure the time that the execution of the combineByKey function needs. With the code below I always get a result of 20-22 ms (HashPartitioner) and ~350 ms (without partitioning), independent of the file size I use (file0: ~300 kB, file1: ~3 GB, file2: ~8 GB)! Can this be true? Or am I doing something wrong?
JavaPairRDD<Integer, String> pairRDD = null;
JavaPairRDD<Integer, String> partitionedRDD = null;
JavaPairRDD<Integer, Float> consumptionRDD = null;
boolean partitioning = true; //or false
int partitionsCount = 100; // between 1 and 200 I can't see any difference in the duration!
SparkConf conf = new SparkConf();
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> input = sc.textFile(path);
pairRDD = mapToPair(input);
partitionedRDD = partition(pairRDD, partitioning, partitionsCount);
long duration = System.currentTimeMillis();
consumptionRDD = partitionedRDD.combineByKey(createCombiner, mergeValue, mergeCombiners);
duration = System.currentTimeMillis() - duration; // Measured time always the same, independent of file size (~20ms with / ~350ms without partitioning)
// Do an action
Tuple2<Integer, Float> test = consumptionRDD.takeSample(true, 1).get(0);
sc.stop();
Some helper methods (shouldn't matter):
// merging function for a new dataset
private static Function2<Float, String, Float> mergeValue = new Function2<Float, String, Float>() {
public Float call(Float sumYet, String dataSet) throws Exception {
String[] data = dataSet.split(",");
float value = Float.valueOf(data[2]);
sumYet += value;
return sumYet;
}
};
// function to sum the consumption
private static Function2<Float, Float, Float> mergeCombiners = new Function2<Float, Float, Float>() {
public Float call(Float a, Float b) throws Exception {
a += b;
return a;
}
};
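For completeness, createCombiner is referenced above but not shown in the question; a plausible definition consistent with mergeValue, which treats column 2 as the consumption value, might be (using org.apache.spark.api.java.function.Function):
// hypothetical createCombiner: start the per-key sum from the first record's value
private static Function<String, Float> createCombiner = new Function<String, Float>() {
public Float call(String dataSet) throws Exception {
String[] data = dataSet.split(",");
return Float.valueOf(data[2]);
}
};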
private static JavaPairRDD<Integer, String> partition(JavaPairRDD<Integer, String> pairRDD, boolean partitioning, int partitionsCount) {
if (partitioning) {
return pairRDD.partitionBy(new HashPartitioner(partitionsCount));
} else {
return pairRDD;
}
}
private static JavaPairRDD<Integer, String> mapToPair(JavaRDD<String> input) {
return input.mapToPair(new PairFunction<String, Integer, String>() {
public Tuple2<Integer, String> call(String debsDataSet) throws Exception {
String[] data = debsDataSet.split(",");
int houseId = Integer.valueOf(data[6]);
return new Tuple2<Integer, String>(houseId, debsDataSet);
}
});
}
The web UI provides you with details on the jobs/stages your application has run. It shows the time for each of them, and you can filter on various details such as Scheduler Delay, Task Deserialization Time, and Result Serialization Time.
The default port for the web UI is 8080. Completed applications are listed there, and you can then click on the name, or craft the URL like this: x.x.x.x:8080/history/app-[APPID] to access those details.
I don't believe any other "built-in" methods exist to monitor the running time of a task/stage. Otherwise, you may want to go deeper and use a JVM debugging framework.
EDIT: combineByKey is a transformation, which means that it is not applied to your RDD immediately, as opposed to actions (read more about the lazy behaviour of RDDs here, chapter 3.1). I believe the time difference you're observing comes from the time Spark takes to create the actual data structure when partitioning or not.
If there is a difference, you'll see it when the action runs (takeSample here).
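To time the actual work, the measurement would have to wrap an action rather than the transformation. A minimal sketch reusing the variables from the question's code; count() is just one convenient action that forces evaluation:
long start = System.currentTimeMillis();
// builds the lineage only; no computation happens here
JavaPairRDD<Integer, Float> consumption = partitionedRDD.combineByKey(createCombiner, mergeValue, mergeCombiners);
// the action triggers the shuffle and the combine logic, so the time includes both
long keys = consumption.count();
long duration = System.currentTimeMillis() - start;
System.out.println("combineByKey + count took " + duration + " ms for " + keys + " keys");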