Cassandra Hector: how to insert null as a column value?

A common use case with Cassandra is storing data in the column names of a dynamically created column family. In this situation the row values themselves are not needed, and a usual practice is to store nulls there.
However, when dealing with Hector, it seems there is no way to insert a null value, because Hector's HColumnImpl does an explicit null check in the column's constructor:
public HColumnImpl(N name, V value, long clock, Serializer<N> nameSerializer,
        Serializer<V> valueSerializer) {
    this(nameSerializer, valueSerializer);
    notNull(name, "name is null");
    notNull(value, "value is null");
    this.column = new Column(nameSerializer.toByteBuffer(name));
    this.column.setValue(valueSerializer.toByteBuffer(value));
    this.column.setTimestamp(clock);
}
Are there any ways to insert nulls via Hector? If not, what is the best practice in the situation when you don't care about column values and need only their names?

Try using an empty byte[], i.e. new byte[0];
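For example, a minimal sketch of inserting a name-only column whose value is an empty byte array (the keyspace variable, row key, column family, and column name below are hypothetical):
Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
// the value is an empty byte[], serialized by BytesArraySerializer
HColumn<String, byte[]> column = HFactory.createColumn(
        "myColumnName", new byte[0], StringSerializer.get(), BytesArraySerializer.get());
mutator.insert("myRowKey", "MyColumnFamily", column);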


How to add a column to a DataFrame where the value is fetched from a map, with another column from the row as the key

I'm new to Spark, and I'm trying to figure out how to add a column to a DataFrame where its value is fetched from a HashMap, using another value on the same row as the key.
For example, I have a map defined as follows:
var myMap: Map[Int, Int] = generateMap()
I want to add a new column to my DataFrame where its value is fetched from this map, using a current column's value as the key. A solution might look like this:
val newDataFrame = dataFrame.withColumn("NEW_COLUMN", lit(myMap.get(col("EXISTING_COLUMN"))))
My issue with this code is that the col function returns a Column, not an Int like the keys in my HashMap, so the map lookup cannot work.
Any suggestions?
I would create a DataFrame from the map and then do a join operation (see the sketch below). It should be faster, and the result can be reused.
A UDF (user-defined function) can also be used, but UDFs are black boxes to Catalyst, so I would be prudent in using them. Depending on where the content of the map lives, it may also be complicated to pass it to a UDF.
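A minimal sketch of the join approach in the Java API (sqlContext, the column names, and Spark 1.6+'s createDataFrame(List, StructType) are assumptions; adjust to your schema):
List<Row> mapRows = new ArrayList<Row>();
for (Map.Entry<Integer, Integer> e : myMap.entrySet()) {
    mapRows.add(RowFactory.create(e.getKey(), e.getValue()));
}
StructType mapSchema = new StructType(new StructField[] {
        DataTypes.createStructField("KEY", DataTypes.IntegerType, false),
        DataTypes.createStructField("NEW_COLUMN", DataTypes.IntegerType, false)
});
DataFrame mapDF = sqlContext.createDataFrame(mapRows, mapSchema);
// a left outer join keeps rows whose key is absent from the map (NEW_COLUMN becomes null)
DataFrame newDataFrame = dataFrame
        .join(mapDF, dataFrame.col("EXISTING_COLUMN").equalTo(mapDF.col("KEY")), "left_outer")
        .drop("KEY");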
As of the next version of the Kotlin API for Apache Spark, you will be able to simply create a udf that is usable in almost this way:
val mapUDF by udf { input: Int -> myMap[input] }
dataFrame.withColumn("NEW_COLUMN", mapUDF(col("EXISTING_COLUMN")))
You need to use a UDF.
val mapUDF = udf((i: Int) => myMap.getOrElse(i, 0))
val newDataFrame = dataFrame.withColumn("NEW_COLUMN", mapUDF(col("EXISTING_COLUMN")))

Spark DataFrame created from JavaRDD<Row> copies all columns data into first column

I have a DataFrame which I need to convert into JavaRDD<Row> and back to a DataFrame. I have the following code:
DataFrame sourceFrame = hiveContext.read().format("orc").load("/path/to/orc/file");
// I do an order by on the sourceFrame above and then convert it into a JavaRDD
JavaRDD<Row> modifiedRDD = sourceFrame.toJavaRDD().map(new Function<Row, Row>() {
    public Row call(Row row) throws Exception {
        if (row != null) {
            // updated row by creating new Row
            return RowFactory.create(updateRow);
        }
        return null;
    }
});
// now I convert the above JavaRDD<Row> back into a DataFrame using the following
DataFrame modifiedFrame = sqlContext.createDataFrame(modifiedRDD, schema);
The sourceFrame and modifiedFrame schemas are the same. When I call sourceFrame.show() the output is as expected: every column has its corresponding values and no column is empty. But when I call modifiedFrame.show(), all the column values get merged into the first column. For example, assume the source DataFrame has 3 columns as shown below:
_col1 _col2 _col3
ABC 10 DEF
GHI 20 JKL
When I print modifiedFrame, which I converted from the JavaRDD, it shows the following:
_col1 _col2 _col3
ABC,10,DEF
GHI,20,JKL
As shown above, _col1 contains all the values while _col2 and _col3 are empty. I don't know what is wrong.
As I mentioned in the question's comments, this might occur because you are passing the list as a single parameter:
return RowFactory.create(updateRow);
Looking at the Apache Spark docs and source code: in the example specifying a schema, the parameters are assigned one by one for each column. A rough look at the RowFactory.java and GenericRow classes shows that they do not unpack a single list parameter into separate columns. So try passing the row's column values individually:
return RowFactory.create(updateRow.get(0), updateRow.get(1), updateRow.get(2)); // list example
You can also convert your list to an array and then pass it as a parameter:
Object[] updatedRowArray = updateRow.toArray(new Object[updateRow.size()]);
return RowFactory.create(updatedRowArray); // the array is expanded into one value per column
By the way, the RowFactory.create() method creates Row objects. From the Apache Spark documentation on Row and RowFactory.create():
Represents one row of output from a relational operator. Allows both generic access by ordinal, which will incur boxing overhead for primitives, as well as native primitive access. It is invalid to use the native primitive interface to retrieve a value that is null; instead a user must check isNullAt before attempting to retrieve a value that might be null.
To create a new Row, use RowFactory.create() in Java or Row.apply() in Scala.
A Row object can be constructed by providing field values. Example:
import org.apache.spark.sql._
// Create a Row from values.
Row(value1, value2, value3, ...)
// Create a Row from a Seq of values.
Row.fromSeq(Seq(value1, value2, ...))
According to the documentation, you can also apply your own logic to separate the row's columns while creating the Row objects. But I think converting the list to an array and passing it as a parameter will work for you (I couldn't try it myself; please post your feedback, thanks).
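Putting it together, a minimal sketch of the corrected map call (updateRow here is a hypothetical java.util.List<Object> holding the new column values; the actual per-column update logic is elided):
JavaRDD<Row> modifiedRDD = sourceFrame.toJavaRDD().map(new Function<Row, Row>() {
    public Row call(Row row) throws Exception {
        List<Object> updateRow = new ArrayList<Object>();
        for (int i = 0; i < row.length(); i++) {
            updateRow.add(row.get(i)); // apply your per-column updates here
        }
        // expand the list into one argument per column instead of a single list argument
        return RowFactory.create(updateRow.toArray(new Object[updateRow.size()]));
    }
});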

Setting a NULL value in a BoundStatement

I'm using Cassandra Driver 2.0.0-beta2 with Cassandra 2.0.1.
I want to set a NULL value to a column of type 'int', in a BoundStatement. I don't think I can with setInt.
This is the code I'm using:
String insertStatementString = "insert into subscribers(subscriber,start_date,subscriber_id) values (?,?,?)";
PreparedStatement insertStatement = session.prepare(insertStatementString);
BoundStatement bs = new BoundStatement(insertStatement);
bs.setString("subscriber", s.getSubscriberName());
bs.setDate("start_date", startDate);
bs.setInt("subscriber_id", s.getSubscriberID());
The last line throws a null pointer exception, which can be explained: s.getSubscriberID() returns an Integer while BoundStatement accepts only ints, so when the ID is null it cannot be unboxed, hence the exception.
In my opinion, the definition should change to:
BoundStatement.setInt(String name, Integer v);
The way it is right now, I can't set NULL values for numbers.
Or am I missing something?
Is there another way to achieve this?
In cqlsh, setting null to a column of type 'int' is possible.
There is no need to bind values where the value will be empty or null. Therefore a null check might be useful, e.g.,
if (null != s.getSubscriberID()) {
    bs.setInt("subscriber_id", s.getSubscriberID());
}
As to the question of multiple instantiations of BoundStatement: creating multiple BoundStatements is cheap compared with PreparedStatements (see the CQL documentation on prepared statements). The benefit becomes clearer when you reuse the PreparedStatement, e.g., in a loop:
String insertStatementString = "insert into subscribers(subscriber,start_date,subscriber_id) values (?,?,?)";
PreparedStatement insertStatement = session.prepare(insertStatementString);
// Inside a loop, for example
for (Subscriber s : subscribersCollection) {
    BoundStatement bs = new BoundStatement(insertStatement);
    bs.setString("subscriber", s.getSubscriberName());
    bs.setDate("start_date", startDate);
    if (null != s.getSubscriberID()) {
        bs.setInt("subscriber_id", s.getSubscriberID());
    }
    session.execute(bs);
}
I decided not to set the value at all; by default it is null. It's a weird workaround. But now I have to instantiate the BoundStatement before every call, because otherwise I risk carrying over a non-null value from a previous call.
It would be great if they added more comprehensive 'null' support.
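For what it's worth, later versions of the DataStax Java driver added an explicit setter for this case; a sketch, assuming a driver version where BoundStatement.setToNull is available:
if (null != s.getSubscriberID()) {
    bs.setInt("subscriber_id", s.getSubscriberID());
} else {
    bs.setToNull("subscriber_id"); // binds NULL explicitly; note this writes a tombstone
}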

How to update multiple rows using Hector

Is there a way I can update multiple rows in a Cassandra database using a column family template, e.g., by supplying a list of keys?
Currently I am using an updater ColumnFamilyTemplate to loop through a list of keys and do an update for each row. I have seen queries like MultigetSliceQuery but I don't know their equivalent for updates.
There is no utility method in ColumnFamilyTemplate that allows you to just pass a list of keys with a list of mutations in one call.
You can implement your own using mutators.
This is the basic code for doing it in Hector:
Set<String> keys = MY_KEYS;
Map<String, String> pairsOfNameValues = MY_MUTATION_BY_NAME_AND_VALUE;

Set<HColumn<String, String>> columns = new HashSet<HColumn<String, String>>();
for (Entry<String, String> pair : pairsOfNameValues.entrySet()) {
    columns.add(HFactory.createStringColumn(pair.getKey(), pair.getValue()));
}

Mutator<String> mutator = template.createMutator();
String columnFamilyName = template.getColumnFamily();
for (String key : keys) {
    for (HColumn<String, String> column : columns) {
        mutator.addInsertion(key, columnFamilyName, column);
    }
}
mutator.execute();
Well, it should look like that. This is an example for insertion; be sure to use the following methods for batch mutations:
mutator.addInsertion
mutator.addDeletion
mutator.addCounter
mutator.addCounterDeletion
and avoid the following ones, since they execute right away without waiting for mutator.execute():
mutator.incrementCounter
mutator.deleteCounter
mutator.insert
mutator.delete
As a last note: a mutator allows you to batch mutations on multiple rows in multiple column families at once, which is why I generally prefer them over CF templates. I have a lot of denormalization for functionalities that use the 'push-on-write' pattern of NoSQL.
You can use a batch mutation to insert as much as you want (within thrift_max_message_length_in_mb). See http://hector-client.github.com/hector//source/content/API/core/1.0-1/me/prettyprint/cassandra/model/MutatorImpl.html.

Using hector, how to delete a range of super columns?

I have a super column family from which, over time, I need to remove a range of super columns. I searched around and didn't seem to find a solution for that using Hector. Can anyone please help?
You'll have to do a column slice first to get the columns you want to delete, then loop through and generate a list of mutations. You can then send all these mutations to Cassandra in one Hector call:
Mutator<..> mutator = HFactory.createMutator(keyspace, serializer);
SuperSlice<..> result = HFactory.createSuperSliceQuery(keyspace, ... serializers ...)
        .setColumnFamily(cf)
        .setKey(key)
        .setRange("", "", false, Integer.MAX_VALUE)
        .execute()
        .get();
for (HSuperColumn<..> col : result.getSuperColumns()) {
    mutator.addDeletion(key, cf, col.getName(), serializer);
}
mutator.execute();
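If you only need to delete a sub-range rather than the whole row, you can pass real bounds to setRange; a sketch with concrete string serializers (the "2019-01"/"2019-06" bounds are hypothetical super column names):
SuperSlice<String, String, String> result = HFactory
        .createSuperSliceQuery(keyspace, StringSerializer.get(), StringSerializer.get(),
                StringSerializer.get(), StringSerializer.get())
        .setColumnFamily(cf)
        .setKey(key)
        // only super columns whose names fall within these bounds are returned,
        // so only those get deletion mutations queued
        .setRange("2019-01", "2019-06", false, Integer.MAX_VALUE)
        .execute()
        .get();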
