I am using the DataStax Java driver and want to retrieve a Map from Cassandra.
Cassandra has a map column of type map<text, bigint>. bigint maps to long in Java, but I cannot have a Map with primitive types. So how do I use the Map?
Below is my table creation script
CREATE TABLE IF NOT EXISTS dealsstorea.deals1(
otla text,
dtla text,
types map<text, bigint>,
travelers bigint,
pkgRatePlanCode text,
pkgRoomTypeCode text,
numRooms bigint,
PRIMARY KEY ((...),
..));
If I understand your question correctly, you are using the Java driver to select from a CQL table that has a column of type map, and you want to load it into a Java object.
If that is the case, use a Java TreeMap:
Map<String, Long> myMap = new TreeMap<String, Long>();
Java types String and Long map to CQL text and bigint respectively.
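For example, a minimal sketch of reading that column with the Java driver (this assumes an already-connected Session named session; the SELECT itself is only illustrative):

// Read one row and pull out the map<text, bigint> column.
Row row = session.execute("SELECT types FROM dealsstorea.deals1 LIMIT 1").one();
// getMap takes the boxed wrapper classes, because Java generics
// cannot be parameterized with primitives such as long.
Map<String, Long> types = row.getMap("types", String.class, Long.class);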
Below is the code I used to convert the map data type into a string.
--------Code-----------------------
public static String getMapToString(Row row, String columnName)
{
    StringBuilder mapStr = new StringBuilder();
    // Only handle columns whose CQL type is a map.
    if (row.getColumnDefinitions().getType(columnName).getName().name().equalsIgnoreCase("map"))
    {
        // For a map<text, bigint> column the Java types are String and Long.
        Map<String, Long> map = row.getMap(columnName, String.class, Long.class);
        mapStr.append("{");
        Iterator<Map.Entry<String, Long>> entries = map.entrySet().iterator();
        while (entries.hasNext()) {
            Map.Entry<String, Long> entry = entries.next();
            mapStr.append(entry.getKey()).append(":").append(entry.getValue());
            if (entries.hasNext()) {
                mapStr.append(",");
            }
        }
        mapStr.append("}");
    }
    return mapStr.toString();
}
I am trying to get a CQL string given a DataFrame. I came across this function,
where I can do something like this:
TableDef.fromDataFrame(df, "test", "hello", ProtocolVersion.NEWEST_SUPPORTED).cql()
It looks to me like the library uses the first column as the partition key and does not care about the clustering key, so how do I specify that a particular set of columns of a DataFrame should be the partition key and another particular set of columns should be the clustering key?
It looks like I can create a new TableDef, however I then have to do the entire mapping myself, and in some cases the necessary types like ColumnType are not accessible in Java. For example, I tried to create a new ColumnDef like below:
new ColumnDef("col5", new PartitionKeyColumn(), ColumnType is not accessible in Java)
Objective: To get a CQL create statement from a Spark DataFrame.
Input: My DataFrame can have any number of columns with their respective Spark types. Say I have a Spark DataFrame with 100 columns, where col8 and col9 of my DataFrame correspond to Cassandra partition key columns and col10 corresponds to a Cassandra clustering key column:
col1| col2| ...|col100
Now I want to use the spark-cassandra-connector library to give me a CQL create table statement, given the info above.
Desired Output
create table if not exists test.hello (
col1 bigint, -- whatever col1's type is in my DataFrame; I just picked bigint randomly
col2 varchar,
col3 double,
...
...
col100 bigint,
primary key((col8, col9), col10)
) WITH CLUSTERING ORDER BY (col10 DESC);
Because the required components (PartitionKeyColumn and the instances of ColumnType) are Scala objects, you need to use the following syntax to access their instances:
// imports
import com.datastax.spark.connector.cql.ColumnDef;
import com.datastax.spark.connector.cql.PartitionKeyColumn$;
import com.datastax.spark.connector.types.TextType$;
// actual code
ColumnDef a = new ColumnDef("col5",
PartitionKeyColumn$.MODULE$, TextType$.MODULE$);
See the code for ColumnRole and PrimitiveTypes to find the full list of object/class names.
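For example, the other column roles and types used below can be reached the same way (a sketch; exact constructors may vary slightly between connector versions):

import com.datastax.spark.connector.cql.ClusteringColumn;
import com.datastax.spark.connector.cql.RegularColumn$;
import com.datastax.spark.connector.types.BigIntType$;

// ClusteringColumn is a case class, so it is instantiated with an index:
ColumnDef b = new ColumnDef("col6", new ClusteringColumn(0), BigIntType$.MODULE$);
// RegularColumn is an object, accessed via MODULE$ like PartitionKeyColumn:
ColumnDef c = new ColumnDef("col7", RegularColumn$.MODULE$, BigIntType$.MODULE$);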
Update after additional requirements: The code is lengthy, but should work...
// additional import needed: scala.collection.JavaConverters
SparkSession spark = SparkSession.builder()
        .appName("Java Spark SQL example").getOrCreate();

Set<String> partitionKeys = new TreeSet<String>() {{
    add("col1");
    add("col2");
}};
Map<String, Integer> clusteringKeys = new TreeMap<String, Integer>() {{
    put("col8", 0);
    put("col9", 1);
}};

Dataset<Row> df = spark.read().json("my-test-file.json");
TableDef td = TableDef.fromDataFrame(df, "test", "hello",
        ProtocolVersion.NEWEST_SUPPORTED);

List<ColumnDef> partKeyList = new ArrayList<ColumnDef>();
List<ColumnDef> clusterColumnList = new ArrayList<ColumnDef>();
List<ColumnDef> regularColumnList = new ArrayList<ColumnDef>();

scala.collection.Iterator<ColumnDef> iter = td.allColumns().iterator();
while (iter.hasNext()) {
    ColumnDef col = iter.next();
    String colName = col.columnName();
    if (partitionKeys.contains(colName)) {
        partKeyList.add(new ColumnDef(colName,
                PartitionKeyColumn$.MODULE$, col.columnType()));
    } else if (clusteringKeys.containsKey(colName)) {
        int idx = clusteringKeys.get(colName);
        clusterColumnList.add(new ColumnDef(colName,
                new ClusteringColumn(idx), col.columnType()));
    } else {
        regularColumnList.add(new ColumnDef(colName,
                RegularColumn$.MODULE$, col.columnType()));
    }
}

// Convert the Java lists to Scala Seqs; a plain cast would fail at runtime.
TableDef newTd = new TableDef(td.keyspaceName(), td.tableName(),
        JavaConverters.asScalaBufferConverter(partKeyList).asScala(),
        JavaConverters.asScalaBufferConverter(clusterColumnList).asScala(),
        JavaConverters.asScalaBufferConverter(regularColumnList).asScala(),
        td.indexes(), td.isView());
String cql = newTd.cql();
System.out.println(cql);
I'm having a problem with losing data on insert into my Cassandra database.
I am doing large bulk inserts from CSV files, which I read via a Stream. The data is duplicated into two tables because of different queries. Every 30,000th element I start a new partition (chunkCounter).
private PersistenceInformation persist(final String period, final String tradePartner, final Integer version, Stream<Transaction> transactions) {

    int elementsInChunkCounter = 0;
    int chunkCounter = 1;
    int elementCounter = 0;

    Iterator<Transaction> iterator = transactions.filter(beanValidator).iterator();

    List<List<?>> listImportData = new ArrayList<>(30000);
    List<List<?>> listGtins = new ArrayList<>(30000);

    while (iterator.hasNext()) {
        Transaction tr = iterator.next();

        List<Object> importTemp = new ArrayList<>(9);
        importTemp.add(period);
        importTemp.add(tradePartner);
        importTemp.add(version);
        importTemp.add(chunkCounter);
        importTemp.add(tr.getMdhId());
        importTemp.add(tr.getGtin());
        importTemp.add(tr.getQuantity());
        importTemp.add(tr.getTransactionId());
        importTemp.add(tr.getTimestamp());
        listImportData.add(importTemp);

        List<Object> gtinTemp = new ArrayList<>(8);
        gtinTemp.add(period);
        gtinTemp.add(tradePartner);
        gtinTemp.add(version);
        gtinTemp.add(chunkCounter);
        gtinTemp.add(tr.getMdhId());
        gtinTemp.add(tr.getGtin());
        gtinTemp.add(tr.getQuantity());
        gtinTemp.add(tr.getTimestamp());
        listGtins.add(gtinTemp);

        elementsInChunkCounter++;
        elementCounter++;

        if (elementsInChunkCounter == 30000) {
            elementsInChunkCounter = 0;
            chunkCounter++;
            ingestImportData(listImportData);
            listImportData.clear();
            ingestGtins(listGtins);
            listGtins.clear();
        }
    }

    if (!listImportData.isEmpty()) {
        ingestImportData(listImportData);
    }
    if (!listGtins.isEmpty()) {
        ingestGtins(listGtins);
    }

    return new PersistenceInformation();
}

private void ingestImportData(List<List<?>> list) {
    String cqlIngest = "INSERT INTO import_data (pd, tp, ver, chunk, mdh_id, gtin, qty, id, ts) VALUES (?,?,?,?,?,?,?,?,?)";
    cassandraOperations.ingest(cqlIngest, list);
}

private void ingestGtins(List<List<?>> list) {
    String cqlIngest = "INSERT INTO gtins (pd, tp, ver, chunk, mdh_id, gtin, qty, ts) VALUES (?,?,?,?,?,?,?,?)";
    cassandraOperations.ingest(cqlIngest, list);
}
This worked pretty well until I noticed that sometimes a data set goes missing: there is an entry in the second table (gtins), but the corresponding data set in the main table was not inserted. The application counted it, but the database did not write it.
The table is built this way:
CREATE TABLE import_data (
    tp text,
    pd text,
    ver int,
    chunk int,
    mdh_id uuid,
    gtin text,
    qty float,
    id text,
    ts timestamp,
    PRIMARY KEY ((tp, pd, ver, chunk), ts, mdh_id)) WITH CLUSTERING ORDER BY (ts DESC);
The mdh_id is a UUID from my application, so that every data set has a unique key and is not accidentally overwritten.
The Cassandra log files didn't even show a warning.
At the moment I am evaluating BatchStatement, but then I have to execute a batch after every 8th data set because of the 5 kb size limit; otherwise the database loses even more entries.
Any suggestion about what is going wrong in my application is highly appreciated. Thanks a lot.
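For reference, the chunked-batch variant I am evaluating looks roughly like this (a sketch only; session and preparedImport are placeholders for my actual driver Session and the prepared INSERT for import_data):

// Flush small UNLOGGED batches so each batch stays under the size limit.
BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
int inBatch = 0;
for (List<?> rowValues : listImportData) {
    batch.add(preparedImport.bind(rowValues.toArray()));
    if (++inBatch == 8) { // execute after every 8th data set, per the 5 kb limit
        session.execute(batch); // synchronous, so write errors surface immediately
        batch.clear();
        inBatch = 0;
    }
}
if (inBatch > 0) {
    session.execute(batch);
}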
I have a problem when creating a table with a UUID field using CassandraAdminOperations.createTable. I defined the field with type uuid, but when the table was created via CassandraAdminOperations.createTable, the field was created as timeuuid. Is there any way to force the field to uuid instead of timeuuid?
Here is the field definition in the class:
@PrimaryKeyColumn(name = "id", ordinal = 0, type = PrimaryKeyType.PARTITIONED)
private UUID id;

@Column(value = "url")
private String url;
But when I call CassandraAdminOperations.createTable to create the table, the log shows the following and the id field was created as timeuuid:
00:43:23.877 [localhost-startStop-1] DEBUG o.s.d.c.core.CassandraAdminTemplate - CREATE TABLE IF NOT EXISTS id_to_url_map (id timeuuid, url text, PRIMARY KEY (id));
Thanks!
I had the same problem. But after some investigation, I found that you can add the @CassandraType annotation and explicitly specify the type:
import com.datastax.driver.core.DataType;

@PrimaryKeyColumn(name = "id", ordinal = 0, type = PrimaryKeyType.PARTITIONED)
@CassandraType(type = DataType.Name.UUID)
private UUID id;
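The same annotation should also work the other way around: if a timeuuid is actually what you want, you can request it explicitly. A hedged sketch, with a hypothetical created field:

@Column(value = "created")
@CassandraType(type = DataType.Name.TIMEUUID)
private UUID created;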
I want to count the number of IndicatePresence messages for each user for any given day (out of a Cassandra table), and then store this in a separate Cassandra table to drive some dashboard pages. I managed to get the 'countByKey' working, but now I cannot figure out how to use the Spark-Cassandra 'saveToCassandra' method with a Map (it only takes RDDs).
JavaSparkContext sc = new JavaSparkContext(conf);
CassandraJavaRDD<CassandraRow> indicatePresenceTable = javaFunctions(sc).cassandraTable("mykeyspace", "indicatepresence");

JavaPairRDD<UserDate, CassandraRow> keyedByUserDate = indicatePresenceTable.keyBy(new Function<CassandraRow, UserDate>() {
    private static final long serialVersionUID = 1L;

    @Override
    public UserDate call(CassandraRow cassandraIndicatePresenceRow) throws Exception {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        return new UserDate(cassandraIndicatePresenceRow.getString("userid"), sdf.format(cassandraIndicatePresenceRow.getDate("date")));
    }
});

Map<UserDate, Object> countByKey = keyedByUserDate.countByKey();

writerBuilder("analytics", "countbykey", ???).saveToCassandra();
Is there a way to use a Map directly with a writerBuilder? Or should I write my own custom reducer that returns an RDD but essentially does the same thing as the countByKey method? Or should I convert each entry in the Map into a new POJO (e.g. UserDateCount, with user, date, and count) and use 'parallelize' to turn the list into an RDD and then store that?
The best thing to do would be to never return the result to the driver (by using countByKey). Instead, do a reduceByKey to get another RDD back in the form of (key, count). Map that RDD to the row format of your table and then call saveToCassandra on that.
The most important strength of this approach is that we never serialize the data back to the driver application. All the information is kept on the cluster and saved from there directly to C*, rather than running through the bottleneck of the driver application.
Example (very similar to a MapReduce word count):
Map each element to (key, 1)
Call reduceByKey to change (key, 1) -> (key, count)
Map each element to something writable to C*: (key, count) -> WritableObject
Call save to C*
In Scala this would be something like:
keyedByUserDate
  .map { case (key, _) => (key, 1) }  // Take the key portion of the tuple and replace the value portion with 1
  .reduceByKey(_ + _)                 // Combine the value portions for all elements which share a key
  .map { case (key, count) => ??? }   // Change the Tuple2 to something that matches your C* table
  .saveToCassandra(ks, tab)           // Save to Cassandra
In Java it is a little more convoluted (insert your own types for K and V):

JavaRDD<OutputTableClass> resultRdd = keyedByUserDate
    .mapToPair(new PairFunction<Tuple2<K, V>, K, Long>() {
        @Override
        public Tuple2<K, Long> call(Tuple2<K, V> input) throws Exception {
            return new Tuple2<K, Long>(input._1(), 1L);
        }
    })
    .reduceByKey(new Function2<Long, Long, Long>() {
        @Override
        public Long call(Long value1, Long value2) throws Exception {
            return value1 + value2;
        }
    })
    .map(new Function<Tuple2<K, Long>, OutputTableClass>() {
        @Override
        public OutputTableClass call(Tuple2<K, Long> input) throws Exception {
            // Do some work here to turn the (key, count) pair into a row object
            return new OutputTableClass(col1, col2, col3 /* ... colN */);
        }
    });

// Save the result with the connector's Java API:
javaFunctions(resultRdd)
    .writerBuilder(ks, tab, mapToRow(OutputTableClass.class))
    .saveToCassandra();
I am using the Hector API for Cassandra.
I create a counter column family as follows:
private void addColumnFamilyCounter(ThriftCluster cluster, String cfName, int rowCacheKeysToSave) {
    String cassandraKeyspace = this.env.getProperty("cassandra.keyspace");
    ThriftCfDef cfd =
            new ThriftCfDef(cassandraKeyspace, cfName, ComparatorType.UTF8TYPE);
    cfd.setRowCacheKeysToSave(rowCacheKeysToSave);
    cfd.setDefaultValidationClass(ComparatorType.COUNTERTYPE.getClassName());
    cluster.addColumnFamily(cfd);
}
and call the above method as follows:
addColumnFamilyCounter(cluster, COUNTER_CF, 0);
The format of the column family is as follows:

Primary key | columns
------------+------------
Munich      | jingle : 1
            | mingle : 2
            | tingle : 1
            | pingle : 5
Now I want to execute a query to get all the columns and their values under Munich. Is there any way I can get all the columns?
What I know till now is the following query, but it gives me the value for only a combination of primary key and a column key:
@Override
public long getTagCounter(String domain, String tag) {
    CounterQuery<String, String> counter =
            new ThriftCounterColumnQuery<String, String>(keyspaceOperator,
                    StringSerializer.get(),
                    StringSerializer.get());
    counter.setColumnFamily(TAG_COUNTER_CF).setKey("p_key").setName("name");
    return counter.execute().get().getValue();
}
Okay, I found the answer myself. I hope it will be helpful to others.
CounterSlice<String> slice = HFactory.createCounterSliceQuery(keyspaceOperator, StringSerializer.get(), StringSerializer.get())
        .setColumnFamily("CF")
        .setKey("PK")
        .setRange(null, null, false, Integer.MAX_VALUE).execute().get();

for (HCounterColumn<String> col : slice.getColumns()) {
    log.info(col.getName());
    log.info(col.getValue());
}