Losing data on bulk inserts in Cassandra

I'm losing data on inserts into my Cassandra database.
I am doing large bulk inserts from CSV files, which I read via a Stream. The data is duplicated into two tables because of different queries. Every 30,000th element I split my data into a new partition (chunkCounter).
private PersistenceInformation persist(final String period, final String tradePartner, final Integer version, Stream<Transaction> transactions) {
    int elementsInChunkCounter = 0;
    int chunkCounter = 1;
    int elementCounter = 0;
    Iterator<Transaction> iterator = transactions.filter(beanValidator).iterator();
    List<List<?>> listImportData = new ArrayList<>(30000);
    List<List<?>> listGtins = new ArrayList<>(30000);
    while (iterator.hasNext()) {
        Transaction tr = iterator.next();
        List<Object> importTemp = new ArrayList<>(9);
        importTemp.add(period);
        importTemp.add(tradePartner);
        importTemp.add(version);
        importTemp.add(chunkCounter);
        importTemp.add(tr.getMdhId());
        importTemp.add(tr.getGtin());
        importTemp.add(tr.getQuantity());
        importTemp.add(tr.getTransactionId());
        importTemp.add(tr.getTimestamp());
        listImportData.add(importTemp);
        List<Object> gtinTemp = new ArrayList<>(8);
        gtinTemp.add(period);
        gtinTemp.add(tradePartner);
        gtinTemp.add(version);
        gtinTemp.add(chunkCounter);
        gtinTemp.add(tr.getMdhId());
        gtinTemp.add(tr.getGtin());
        gtinTemp.add(tr.getQuantity());
        gtinTemp.add(tr.getTimestamp());
        listGtins.add(gtinTemp);
        elementsInChunkCounter++;
        elementCounter++;
        if (elementsInChunkCounter == 30000) {
            elementsInChunkCounter = 0;
            chunkCounter++;
            ingestImportData(listImportData);
            listImportData.clear();
            ingestGtins(listGtins);
            listGtins.clear();
        }
    }
    if (!listImportData.isEmpty()) {
        ingestImportData(listImportData);
    }
    if (!listGtins.isEmpty()) {
        ingestGtins(listGtins);
    }
    return new PersistenceInformation();
}
private void ingestImportData(List<List<?>> list) {
    String cqlIngest = "INSERT INTO import_data (pd, tp, ver, chunk, mdh_id, gtin, qty, id, ts) VALUES (?,?,?,?,?,?,?,?,?)";
    cassandraOperations.ingest(cqlIngest, list);
}

private void ingestGtins(List<List<?>> list) {
    String cqlIngest = "INSERT INTO gtins (pd, tp, ver, chunk, mdh_id, gtin, qty, ts) VALUES (?,?,?,?,?,?,?,?)";
    cassandraOperations.ingest(cqlIngest, list);
}
This worked pretty well until I noticed that sometimes a data set goes missing: there is an entry in the second table (gtins), but the corresponding row in the main table was never inserted. The application counted it, but the database did not write it.
The table is built this way:
CREATE TABLE import_data (
    tp text,
    pd text,
    ver int,
    chunk int,
    mdh_id uuid,
    gtin text,
    qty float,
    id text,
    ts timestamp,
    PRIMARY KEY ((tp, pd, ver, chunk), ts, mdh_id)
) WITH CLUSTERING ORDER BY (ts DESC);
The mdh_id is a UUID generated by my application, so that every data set has a unique key and is not accidentally overwritten.
The Cassandra log files didn't even show a warning.
At the moment I am evaluating BatchStatement, but I have to flush the batch after every 8 rows because of the 5 KB batch size limit; otherwise the database loses even more entries.
Any suggestion about what is going wrong in my application is highly appreciated. Thanks a lot.
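For reference, a minimal sketch of the BatchStatement variant I am evaluating looks like this (assuming the plain DataStax Java driver rather than cassandraOperations; the session, the prepared insert ps, and the method name are placeholders of mine):

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import java.util.List;

// Hypothetical sketch: flush a small UNLOGGED batch synchronously so a
// failed write surfaces as an exception instead of a silently missing row.
private void ingestInSmallBatches(Session session, PreparedStatement ps, List<List<?>> rows) {
    BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
    int inBatch = 0;
    for (List<?> row : rows) {
        batch.add(ps.bind(row.toArray()));
        if (++inBatch == 8) {           // ~8 rows keeps the batch under the 5 KB threshold
            session.execute(batch);     // synchronous: throws on failure
            batch.clear();
            inBatch = 0;
        }
    }
    if (inBatch > 0) {
        session.execute(batch);         // flush the remainder
    }
}

Executing each small batch synchronously at least makes a lost write visible, which is why I am testing this path.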

Related

Hazelcast Jet rolling aggregation with removing previous data and adding new

We have a use case where we receive messages from Kafka that need to be aggregated. The aggregation must work so that if an update arrives for the same id, the existing value (if any) is subtracted and the new value is added.
From various forums I learned that Jet doesn't store raw values, only the aggregated result and some internal data.
In that case, how can I achieve this?
Example
Balance 1 {id:1, amount:100} // aggregated result 100
Balance 2 {id:2, amount:200} // 300
Balance 3 {id:1, amount:400} // 600 after removing 100 and adding 400
I could implement the simple case where every amount is added, but I was not able to achieve the aggregation where the existing value is subtracted and the new value added.
rollingAggregate(AggregateOperations.summingDouble(<logic to add/remove>))
        .drainTo(Sinks.logger());
Balance 1, 2, 3 is the sequence of messages; the comments show the aggregated value Jet computes after each message.
My aim is to add the new amount if an id arrives for the first time, and to subtract the previous amount if an updated balance arrives, i.e. the id is the same as before.
You can try a custom aggregate operation which emits the previously and currently seen values, like this:
public static <T> AggregateOperation1<T, ?, Tuple2<T, T>> previousAndCurrent() {
    return AggregateOperation
            .withCreate(() -> new Object[2])
            .<T>andAccumulate((acc, current) -> {
                acc[0] = acc[1];
                acc[1] = current;
            })
            .andExportFinish(acc -> tuple2((T) acc[0], (T) acc[1]));
}
The output is a tuple of the form (previous, current). You can then apply a rolling aggregate again to that output. To simplify the problem, the input here is a stream of (id, amount) pairs.
Pipeline p = Pipeline.create();
p.drawFrom(Sources.<Integer, Long>mapJournal("map", START_FROM_OLDEST)) // (id, amount)
 .groupingKey(Entry::getKey)
 .rollingAggregate(previousAndCurrent(), (key, val) -> val)
 .rollingAggregate(AggregateOperations.summingLong(e -> {
     long prevValue = e.f0() == null ? 0 : e.f0().getValue();
     long newValue = e.f1().getValue();
     return newValue - prevValue;
 }))
 .drainTo(Sinks.logger());

JetConfig config = new JetConfig();
config.getHazelcastConfig().addEventJournalConfig(new EventJournalConfig().setMapName("map"));
JetInstance jet = Jet.newJetInstance(config);

IMapJet<Object, Object> map = jet.getMap("map");
map.put(0, 1L);
map.put(0, 2L);
map.put(1, 10L);
map.put(1, 40L);

jet.newJob(p).join();
This should produce as output: 1, 2, 12, 42.

Store countByKey result into Cassandra

I want to count the number of IndicatePresence messages for each user for any given day (out of a Cassandra table), and then store this in a separate Cassandra table to drive some dashboard pages. I managed to get the 'countByKey' working, but now I cannot figure out how to use the Spark-Cassandra 'saveToCassandra' method with a Map (it only takes an RDD).
JavaSparkContext sc = new JavaSparkContext(conf);
CassandraJavaRDD<CassandraRow> indicatePresenceTable = javaFunctions(sc).cassandraTable("mykeyspace", "indicatepresence");
JavaPairRDD<UserDate, CassandraRow> keyedByUserDate = indicatePresenceTable.keyBy(new Function<CassandraRow, UserDate>() {
    private static final long serialVersionUID = 1L;

    @Override
    public UserDate call(CassandraRow cassandraIndicatePresenceRow) throws Exception {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        return new UserDate(cassandraIndicatePresenceRow.getString("userid"), sdf.format(cassandraIndicatePresenceRow.getDate("date")));
    }
});
Map<UserDate, Object> countByKey = keyedByUserDate.countByKey();
writerBuilder("analytics", "countbykey", ???).saveToCassandra();
Is there a way to use a Map directly in a writerBuilder? Or should I write my own custom reducer that returns an RDD but essentially does the same thing as countByKey? Or should I convert each entry in the Map into a new POJO (e.g. UserDateCount, with user, date, and count), use 'parallelize' to turn the list into an RDD, and then store that?
The best thing to do is to never return the result to the driver at all (which is what countByKey does). Instead, do a reduceByKey to get another RDD back in the form of (key, count), map that RDD to the row format of your table, and then call saveToCassandra on that.
The most important strength of this approach is that the data is never serialized back to the driver application. All the information is kept on the cluster and saved from there directly to C*, rather than running through the bottleneck of the driver application.
Example (Very Similar to a Map Reduce Word Count):
1. Map each element to (key, 1)
2. Call reduceByKey to change (key, 1) -> (key, count)
3. Map each element to something writable to C*: (key, count) -> WritableObject
4. Call saveToCassandra
In Scala this would be something like:
keyedByUserDate
  .map(x => (x._1, 1))                          // Take the key portion of the tuple and replace the value portion with 1
  .reduceByKey(_ + _)                           // Combine the value portions for all elements which share a key
  .map { case (key, value) => your C* format }  // Change the Tuple2 to something that matches your C* table
  .saveToCassandra(ks, tab)                     // Save to Cassandra
In Java it is a little more convoluted (insert your types for K and V):
JavaRDD<OutputTableClass> output = pairs        // pairs: your JavaPairRDD<K, V>
    .mapToPair(new PairFunction<Tuple2<K, V>, K, Long>() {
        @Override
        public Tuple2<K, Long> call(Tuple2<K, V> input) throws Exception {
            return new Tuple2<>(input._1(), 1L);   // (key, 1)
        }
    })
    .reduceByKey(new Function2<Long, Long, Long>() {
        @Override
        public Long call(Long value1, Long value2) throws Exception {
            return value1 + value2;                // (key, count)
        }
    })
    .map(new Function<Tuple2<K, Long>, OutputTableClass>() {
        @Override
        public OutputTableClass call(Tuple2<K, Long> input) throws Exception {
            // Do some work here to build a row object for your table
            return new OutputTableClass(col1, col2, col3 ... colN);
        }
    });
javaFunctions(output).writerBuilder(ks, tab, mapToRow(OutputTableClass.class)).saveToCassandra();

Why am I getting an InvalidTypeException whenever I use row.getToken("fieldname")?

For the following piece of code, I am getting an InvalidTypeException whenever I use row.getToken("fieldname"):
Record RowToRecord(Row rw) {
    ColumnDefinitions cd = rw.getColumnDefinitions();
    Record rec = new Record();
    for (int i = 0; i < cd.size(); i++) {
        rec.fields.add(cd.getName(i));
        System.out.println(cd.getName(i));
        Token tk = rw.getToken(cd.getName(i)); // InvalidTypeException on this line.
        rec.values.add(tk.getValue());
        rec.types.add(tk.getType().toString());
    }
    return rec;
}
getToken is meant to be called on a column that contains a Cassandra token. In 99% of cases, that will be the result of a call to the token() CQL function, for example the first column in this query:
select token(id), col1 from my_table where id = ...
Your code is calling it for all columns, which will fail as soon as you have a column that doesn't match the CQL type for tokens.
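For instance, reading a token the way it is intended might look like this (a sketch assuming the DataStax Java driver; the table and column names are illustrative):

import com.datastax.driver.core.Row;
import com.datastax.driver.core.Token;

// getToken is valid here because column 0 is the result of token(id).
Row row = session.execute("select token(id), col1 from my_table where id = ?", id).one();
Token token = row.getToken(0);
System.out.println(token.getType() + " -> " + token.getValue());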
That CQL type depends on the partitioner used in your cluster:
murmur3 partitioner (the default): token(...) will return a BIGINT
random partitioner: VARINT
ordered partitioner: BLOB
In theory you can call getToken on any column with this type (although in practice it probably only makes sense for columns that are the result of a token() call, as explained above).
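If the goal is to copy every column into Record generically, rather than to read actual tokens, a variant along these lines should avoid the exception (a sketch assuming the driver's Row.getObject; Record is the class from the question):

import com.datastax.driver.core.ColumnDefinitions;
import com.datastax.driver.core.Row;

// Read each column as a plain Object instead of forcing it through
// getToken, which only works for token() results.
Record rowToRecord(Row rw) {
    ColumnDefinitions cd = rw.getColumnDefinitions();
    Record rec = new Record();
    for (int i = 0; i < cd.size(); i++) {
        rec.fields.add(cd.getName(i));
        rec.values.add(rw.getObject(i));          // the driver picks the Java type
        rec.types.add(cd.getType(i).toString());  // CQL type of the column
    }
    return rec;
}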

How to use Map<String, Long> with DataStax

I am using DataStax and want to retrieve a Map from Cassandra.
Cassandra has a map of text to bigint; bigint maps to long in Java, but I cannot have a Map with primitive types. So how do I use the Map?
Below is my table creation script:
CREATE TABLE IF NOT EXISTS dealsstorea.deals1 (
    otla text,
    dtla text,
    types map<text, bigint>,
    travelers bigint,
    pkgRatePlanCode text,
    pkgRoomTypeCode text,
    numRooms bigint,
    PRIMARY KEY ((...),
    ..));
If I understand your question correctly, you are using the Java driver to select from a CQL table that has a column of type map, and you want to load it into a Java object.
If that is the case, use a Java map with boxed types, for example a TreeMap:
Map<String, Long> myMap = new TreeMap<String, Long>();
The Java types String and Long map to the CQL types text and bigint respectively.
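For example, reading the types column from the question's table might look like this (a sketch assuming the DataStax Java driver; the method name is illustrative):

import com.datastax.driver.core.Row;
import java.util.Map;

// getMap hands back the CQL map as an immutable java.util.Map with
// boxed types: CQL text -> String, CQL bigint -> Long.
static Map<String, Long> readTypes(Row row) {
    return row.getMap("types", String.class, Long.class);
}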
Below is the code I used to convert the map data type into a string:

public static String getMapToString(Row row, String columnName) {
    if (!row.getColumnDefinitions().getType(columnName).getName().name().equalsIgnoreCase("map")) {
        return "";
    }
    // The question's column is map<text, bigint>, so read it as <String, Long>.
    Map<String, Long> map = row.getMap(columnName, String.class, Long.class);
    StringBuilder sb = new StringBuilder("{");
    for (Map.Entry<String, Long> entry : map.entrySet()) {
        if (sb.length() > 1) {
            sb.append(",");
        }
        sb.append(entry.getKey()).append(":").append(entry.getValue());
    }
    return sb.append("}").toString();
}

Need a counter query which gives all counters for a primary key using the Hector API

I am using the Hector API for Cassandra.
I create a counter column family as follows:
private void addColumnFamilyCounter(ThriftCluster cluster, String cfName, int rowCacheKeysToSave) {
    String cassandraKeyspace = this.env.getProperty("cassandra.keyspace");
    ThriftCfDef cfd = new ThriftCfDef(cassandraKeyspace, cfName, ComparatorType.UTF8TYPE);
    cfd.setRowCacheKeysToSave(rowCacheKeysToSave);
    cfd.setDefaultValidationClass(ComparatorType.COUNTERTYPE.getClassName());
    cluster.addColumnFamily(cfd);
}
and call the above method as follows:
addColumnFamilyCounter(cluster, COUNTER_CF, 0);
The format of the table is as follows:

Primary key | Counter columns
----------- | ---------------------------------------------
Munich      | jingle : 1, mingle : 2, tingle : 1, pingle : 5
Now I want to execute a query to get all the columns and their values under Munich. Is there any way I can get all the columns?
What I know so far is the following query, but it only gives me the value for one combination of primary key and column key:
@Override
public long getTagCounter(String domain, String tag) {
    CounterQuery<String, String> counter =
            new ThriftCounterColumnQuery<String, String>(keyspaceOperator,
                    StringSerializer.get(),
                    StringSerializer.get());
    counter.setColumnFamily(TAG_COUNTER_CF).setKey("p_key").setName("name");
    return counter.execute().get().getValue();
}
Okay, I found the answer myself. I hope it will be helpful to others:
CounterSlice<String> query = HFactory.createCounterSliceQuery(keyspaceOperator, StringSerializer.get(), StringSerializer.get())
        .setColumnFamily("CF")
        .setKey("PK")
        .setRange(null, null, false, Integer.MAX_VALUE)
        .execute()
        .get();
for (HCounterColumn<String> col : query.getColumns()) {
    log.info(col.getName());
    log.info(col.getValue());
}
