Storm Cassandra Integration

I am a newbie to both Storm and Cassandra. I want to use a Bolt to write the strings emitted by a Spout into a column family in Cassandra. I have read the example here, which seems a little complex for me, as it uses different classes for writing to the Cassandra DB. Furthermore, I want to know how many times the strings are written to the Cassandra DB. In the example, it is not clear to me how to control the number of strings inserted into the Cassandra DB.
Simply put, I need a Bolt that writes the strings emitted by a Spout to a Cassandra column family, limited to, e.g., 200 records. How can I do that?
Thanks in advance!

You can either use the DataStax Cassandra driver, or you can use the storm-cassandra library you posted earlier.
Your requirement is unclear. Do you only want to store 200 tuples?
Either way, run the topology with sample data and, after the stream is finished, query Cassandra and see what is there.
Apache Storm and Apache Cassandra are both deep and extensive projects. There is no way around learning them and doing sample projects in order to learn.
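If the goal really is to stop after a fixed number of tuples (the "200 records" in the question), one option is a pass-through bolt between the spout and the writer that counts what it forwards and drops everything past the cap. This is a minimal sketch, not from the original answer, and the class and field names are hypothetical; note that the counter is per bolt task, so run it with parallelism 1 for an exact global cap:

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical bolt that forwards at most MAX_TUPLES tuples downstream;
// tuples beyond the cap are still acked but never forwarded, so they are
// never written to Cassandra.
public class CappedForwardBolt extends BaseBasicBolt {
    private static final int MAX_TUPLES = 200; // the cap from the question
    private int forwarded = 0;                 // per-task counter

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        if (forwarded < MAX_TUPLES) {
            collector.emit(new Values(tuple.getString(0)));
            forwarded++;
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}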

Hope this will help.
/* Main class: builds the topology and submits it to the cluster */
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.cassandra.bolt.CassandraWriterBolt;
import org.apache.storm.topology.TopologyBuilder;
import static org.apache.storm.cassandra.DynamicStatementBuilder.*;

TopologyBuilder builder = new TopologyBuilder();
Config conf = new Config();
conf.put("cassandra.keyspace", "Storm_Output");                 // keyspace name
conf.put("cassandra.nodes", "ip-address-of-cassandra-machine"); // contact point(s)
conf.put("cassandra.port", 9042);                               // port on which Cassandra is running (default: 9042)

builder.setSpout("generator", new RandomSentenceSpout(), 1);
builder.setBolt("counter", new CassandraInsertionBolt(), 1).shuffleGrouping("generator");
builder.setBolt("CassandraBolt", new CassandraWriterBolt(
        async(
            simpleQuery("INSERT INTO Storm_Output.table_name (field1, field2) VALUES (?, ?);")
                .with(fields("field1", "field2"))
        )
    ), 1).globalGrouping("counter");

conf.setDebug(true);
conf.setNumWorkers(1);
StormSubmitter.submitTopologyWithProgressBar("Cassandra-Insertion", conf, builder.createTopology());

/* CassandraInsertionBolt: emits the tuples that CassandraWriterBolt inserts into Cassandra */
public class CassandraInsertionBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector basicOutputCollector) {
        Random rand = new Random();
        basicOutputCollector.emit(new Values(rand.nextInt(20), rand.nextInt(20)));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        outputFieldsDeclarer.declare(new Fields("field1", "field2"));
    }
}
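For reference, the RandomSentenceSpout used as the "generator" above ships with the storm-starter examples; a minimal sketch of such a spout (the sentence list and sleep interval here are illustrative) looks like this:

import java.util.Map;
import java.util.Random;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class RandomSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random rand;
    private static final String[] SENTENCES = {
        "the cow jumped over the moon",
        "an apple a day keeps the doctor away"
    };

    @Override
    public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.rand = new Random();
    }

    @Override
    public void nextTuple() {
        Utils.sleep(100); // throttle emission
        collector.emit(new Values(SENTENCES[rand.nextInt(SENTENCES.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}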

Related

Schema disagreements with Cassandra 4.0 using the Java driver

We have a 3-node dev Cassandra cluster running 3.11.13 that we have upgraded to 4.0.7. We've been sending DDL statements through our Java applications using spring-data-cassandra:3.4.6, which uses the DataStax Java Driver version 4.14.1, and we had never faced any issues with it until the upgrade to 4.0.7.
The main issue we're facing with 4.0.7 is schema disagreement on tables created programmatically, which had been a non-issue for us since 3.11.x. DDL statements made through cqlsh work as expected; it's only with programmatic creation that we see the schema disagreements.
We’ve tried different cluster setups, C* versions, and Ubuntu versions, but we still face the same issue:
3-node, single-rack DC (Ubuntu 18.04, 20.04, 22.04) (4.0.x, 4.1.x)
3-node, 3-rack DC (Ubuntu 18.04, 20.04, 22.04) (4.0.x, 4.1.x) — This is the setup we’ve been using since 3.11.x
We've also tried fiddling with the driver configuration, such as adjusting the timeouts and disabling debouncing, but with no luck; we face the same issue.
advanced.control-connection {
  schema-agreement {
    interval = 500 milliseconds
    timeout = 10 seconds
    warn-on-failure = true
  }
},
advanced.metadata {
  topology-event-debouncer {
    window = 1 milliseconds
    max-events = 1
  }
  schema {
    request-timeout = 5 seconds
    debouncer {
      window = 1 milliseconds
      max-events = 1
    }
  }
}
We’re creating tables programmatically through the following snippets:
@Override
protected abstract List<String> getStartupScripts();

@Bean
SessionFactoryInitializer sessionFactoryInitializer(SessionFactory sessionFactory) {
    SessionFactoryInitializer initializer = new SessionFactoryInitializer();
    initializer.setSessionFactory(sessionFactory);
    final ResourceKeyspacePopulator resourceKeyspacePopulator = new ResourceKeyspacePopulator();
    getStartupScripts().forEach(script ->
        resourceKeyspacePopulator.addScript(scriptOf(script)));
    initializer.setKeyspacePopulator(resourceKeyspacePopulator);
    return initializer;
}
And create one like:
@Override
protected List<String> getStartupScripts() {
    return Arrays.asList(testTable());
}

private String testTable() {
    return "CREATE TABLE IF NOT EXISTS test_table ("
            + "test text, "
            + "test2 text, "
            + "createdat bigint, "
            + "PRIMARY KEY(test, test2))";
}
But we end up in a loop until it times out due to the schema disagreement, with the following errors:
DEBUG com.datastax.oss.driver.internal.core.metadata.SchemaAgreementChecker - [s1] Schema agreement not reached yet ([09989a2c-7348-3117-8b4a-d5cad549bc09, f4c8755d-6fec-38fe-984f-4083f4a0a0a0]), rescheduling in 500 ms
WARN org.springframework.context.support.GenericApplicationContext - Exception encountered during context initialization - cancelling refresh attempt: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'sessionFactoryInitializer' defined in com.bitcoin.wallet.config.CassandraConfig: Bean instantiation via factory method failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [org.springframework.data.cassandra.core.cql.session.init.SessionFactoryInitializer]: Factory method 'sessionFactoryInitializer' threw exception; nested exception is org.springframework.data.cassandra.core.cql.session.init.ScriptStatementFailedException: Failed to execute CQL script statement #1 of Byte array resource [resource loaded from byte array]: CREATE TABLE IF NOT EXISTS test_table (test text,test2 text,createdat bigint,PRIMARY KEY(test, test2)); nested exception is com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT10S
So two things come to mind when reading through this:
Schema disagreements are often a symptom of some larger issue.
Does the node have its CPU pegged at 100%? Schema disagreement. Inefficient network routing? Schema disagreement. Disk IOPS maxed-out causing write back-pressure? Schema disagreement.
I'd have a look at the activity on the nodes and see if any of the above stand out.
Programmatic schema changes are often problematic.
Each node needs to store the complete schema, so each schema change gets sent to all nodes, essentially making schema changes run at an asynchronous ALL consistency level. Because of that, there's no margin for error. And programmatic schema changes are often sent from within an application much faster than Cassandra can reconcile them.
My recommendations for making any schema changes:
Execute during off-peak times.
Only run them when all nodes are UN (Up/Normal in nodetool status).
Run them using cqlsh (not from application code).
Verify each individual change using nodetool describecluster.
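If DDL from application code is truly unavoidable, one mitigation is to serialize the statements and wait for cluster-wide schema agreement after each one. This is a minimal sketch, assuming the DataStax Java driver 4.x, whose Session interface exposes a blocking checkSchemaAgreement() method; the class and method names are otherwise hypothetical:

import com.datastax.oss.driver.api.core.CqlSession;
import java.util.List;

public class SerialSchemaChanger {
    // Applies DDL statements one at a time, polling until every node
    // reports the same schema version before sending the next statement.
    static void applySerially(CqlSession session, List<String> ddlStatements)
            throws InterruptedException {
        for (String ddl : ddlStatements) {
            session.execute(ddl);
            while (!session.checkSchemaAgreement()) {
                Thread.sleep(500); // matches the schema-agreement interval configured above
            }
        }
    }
}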

Datastax Node.js Cassandra driver When to use a Mapper vs. Query

I'm working with the Datastax Node.js driver and I can't figure out when to use a mapper vs. query. Both seem to be able to perform the same CRUD operations.
With a query:
const q = "SELECT * FROM mykeyspace.mytable WHERE id = '12345';";
client.execute(q).then(result => console.log('This is the data', result));
With mapper:
const tableRow = await tableMapper.find({ id: '12345' });
When should I use the mapper over a query and vice versa?
Mapper is a feature of cassandra-driver released in 2018. Using the Mapper, cassandra-driver can map your Cassandra table to an object in Node.js, so your application can handle rows like a set of documents.
Using the Mapper you can run selects or inserts against your database, as described in this article:
https://www.datastax.com/blog/2018/12/introducing-datastax-nodejs-mapper-apache-cassandra
With the query method, if you need to use or reuse any property from your JSON you will need to call JSON.parse().
The short answer is: whatever you find more comfortable.
The Mapper lets you deal with database data as documents (JavaScript objects), builds the CQL query for you, executes the query and maps the results.
On the other hand, the core driver only supports executing CQL queries that you have to write yourself.

Is there any way to find out which node has been used by SELECT statement in Cassandra?

I have written a custom LoadBalancerPolicy for spark-cassandra-connector and now I want to ensure that it really works!
I have a Cassandra cluster with 3 nodes and a keyspace with a replication factor of 2, so for any given record only two of the nodes hold the data.
The thing is that I want to ensure the spark-cassandra-connector (with my load-balancer-policy) is still token-aware and will choose the right node as coordinator for each "SELECT" statement.
Now I'm wondering whether we can write a trigger on the SELECT statement for each node, so that if the node does not hold the data, the trigger creates a log entry and I know the load-balancer-policy is not working properly. How can we write a trigger on SELECT in Cassandra? Is there a better way to accomplish this?
I already checked the documentation for creating triggers, but it is too limited:
Official documentation
Documentation at DataStax
Example implementation in official repo
You can do it from the program side: get the routing key for your bound statement (you must use prepared statements), find the replicas for it via the Metadata class, and then check whether the coordinator host reported in the ExecutionInfo of the ResultSet is among them.
According to what Alex said, we can do it as below:
After creating the SparkSession, we create a connector:
import com.datastax.spark.connector.cql.CassandraConnector
val connector = CassandraConnector.apply(sparkSession.sparkContext.getConf)
Now we can define a preparedStatement and do the rest:
connector.withSessionDo(session => {
  val selectQuery = "select * from test where id=?"
  val prepareStatement = session.prepare(selectQuery)
  val protocolVersion = session.getCluster.getConfiguration.getProtocolOptions.getProtocolVersion
  // We have to explicitly bind all of the parameters that the partition key
  // is based on; otherwise the routingKey will be null.
  val boundStatement = prepareStatement.bind(s"$id")
  val routingKey = boundStatement.getRoutingKey(protocolVersion, null)
  // All of the nodes that hold the row
  val replicas = session.getCluster.getMetadata.getReplicas("test", routingKey)
  val resultSet = session.execute(boundStatement)
  // The node that served the query as coordinator
  val host = resultSet.getExecutionInfo.getQueriedHost
  // Final step: check whether the replicas contain that host
  if (replicas.contains(host)) println("It works!")
})
The important thing is that we have to explicitly bind all of the parameters that the partition key is based on (i.e., we cannot hard-code them in the SELECT statement); otherwise, the routingKey will be null.

Read Cassandra metrics using JMX in Java

How can I produce live metrics of Cassandra in Java using JMX/Metrics? I want to run Cassandra JMX commands to collect Cassandra metrics. Examples will be much appreciated.
All of Cassandra's metrics exposed via JMX are documented in the official documentation. And because Cassandra uses the Metrics library, you may not need to use JMX to capture metrics; see the note at the end of the referenced page for more information (and the conf/metrics-reporter-config-sample.yaml example file from Cassandra's distribution).
P.S. Maybe I misunderstood the question - can you provide more details? Are you looking for commands to collect those metrics from Cassandra? Or code snippets in Java?
From Java you can access the particular metrics with something like this:
import java.util.Set;
import javax.management.JMX;
import javax.management.MBeanServerConnection;
import javax.management.ObjectInstance;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import org.apache.cassandra.metrics.CassandraMetricsRegistry;

// Connect to Cassandra's JMX port (default: 7199)
JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://[127.0.0.1]:7199/jmxrmi");
JMXConnector jmxc = JMXConnectorFactory.connect(url, null);
MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
// Look up the total read latency counter exposed by the ClientRequest metrics
Set<ObjectInstance> objs = mbsc.queryMBeans(ObjectName
        .getInstance("org.apache.cassandra.metrics:type=ClientRequest,scope=Read-ALL,name=TotalLatency"), null);
for (ObjectInstance obj : objs) {
    Object proxy = JMX.newMBeanProxy(mbsc, obj.getObjectName(),
            CassandraMetricsRegistry.JmxCounterMBean.class);
    if (proxy instanceof CassandraMetricsRegistry.JmxCounterMBean) {
        System.out.println("TotalLatency = " + ((CassandraMetricsRegistry.JmxCounterMBean) proxy).getCount());
    }
}
jmxc.close();
A more detailed example can be found in the JmxCollector class from the cassandra-metrics-collector project...

Simplest way to insert data into a fresh Cassandra database using the Hector API?

I've followed numerous examples on inserting data into a Cassandra database and every time I get an exception about unconfigured column families.
Exception in thread "main" me.prettyprint.hector.api.exceptions.HInvalidRequestException: InvalidRequestException(why:unconfigured columnfamily TestColumnFamily)
at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:45)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:252)
at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243)
at me.prettyprint.cassandra.model.MutatorImpl.insert(MutatorImpl.java:69)
at CassandraInterface.main(CassandraInterface.java:101)
Caused by: InvalidRequestException(why:unconfigured columnfamily TestColumnFamily)
at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:19477)
at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:1035)
at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:1009)
at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246)
at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243)
at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:246)
... 4 more
So I looked up how to configure them and found
BasicColumnFamilyDefinition cfdef = new BasicColumnFamilyDefinition();
cfdef.setKeyspaceName(keyspaceName);
cfdef.setName(columnFamilyName);
cfdef.setKeyValidationClass(ComparatorType.UTF8TYPE.getClassName());
cfdef.setComparatorType(ComparatorType.UTF8TYPE);
That didn't configure the column family.
All of the examples I have found are fragments without any context, so I don't know what to import or set up. In addition, some examples appear to mix the Hector API v2 and the original Hector API, so when I use them, I get "class not found" or "function not found" compiler errors.
Hector CassandraClusterTest.java
@Test
public void testAddDropColumnFamily() throws Exception {
    ColumnFamilyDefinition cfDef = HFactory.createColumnFamilyDefinition("Keyspace1", "DynCf");
    cassandraCluster.addColumnFamily(cfDef);
    String cfid2 = cassandraCluster.dropColumnFamily("Keyspace1", "DynCf");
    assertNotNull(cfid2);
    // Let's wait for agreement
    cassandraCluster.addColumnFamily(cfDef, true);
    cfid2 = cassandraCluster.dropColumnFamily("Keyspace1", "DynCf", true);
    assertNotNull(cfid2);
}
Long story short, the keyspace and column family need to exist before you try to insert data into them. You can either manage this in your code, checking whether they exist (using the example above as a reference; see the sketch below), or create them via the command-line interface (cassandra-cli).
Hector Unit Tests
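Along those lines, here is a minimal sketch (Hector 1.x API; the keyspace and column family names are placeholders) of checking for the schema at startup and creating it when missing:

import java.util.Arrays;
import me.prettyprint.cassandra.service.ThriftKsDef;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.ddl.ColumnFamilyDefinition;
import me.prettyprint.hector.api.ddl.KeyspaceDefinition;
import me.prettyprint.hector.api.factory.HFactory;

public class SchemaBootstrap {
    // Creates the keyspace (with one column family) only if it does not exist yet.
    static void ensureSchema(Cluster cluster) {
        KeyspaceDefinition ksDef = cluster.describeKeyspace("TestKeyspace");
        if (ksDef == null) {
            ColumnFamilyDefinition cfDef =
                    HFactory.createColumnFamilyDefinition("TestKeyspace", "TestColumnFamily");
            // The 'true' argument blocks until schema agreement is reached.
            cluster.addKeyspace(HFactory.createKeyspaceDefinition(
                    "TestKeyspace", ThriftKsDef.DEF_STRATEGY_CLASS, 1,
                    Arrays.asList(cfDef)), true);
        }
    }
}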
Hopefully you've been able to do this by now, but this is how I've done it.
I have a Cassandra install (using 1.1.4), and I'll assume you have all the necessary directories created:
/var/lib/cassandra
/var/lib/cassandra/data
/var/lib/cassandra/commitlogs
/var/lib/cassandra/saved_caches
I start it using:
bin/cassandra -f
I create a simple script called schema_create.txt:
CREATE KEYSPACE TEST
WITH strategy_class = 'org.apache.cassandra.locator.SimpleStrategy'
AND strategy_options:replication_factor='1';
use TEST;
CREATE COLUMNFAMILY TestColumnFamily(
userid varchar,
firstname varchar,
lastname varchar,
PRIMARY KEY (userid));
Then from the command line you can run this script using the new CQL tool that comes with Cassandra, as follows:
bin/cqlsh --cql3 < schema_create.txt
This will install a keyspace named test with a column family named testcolumnfamily into Cassandra.
Now from within your Java application you can simply create a test class that has a main method (I will assume your development environment has all the necessary dependencies if you are using Maven):
try {
    Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
    mutator.addInsertion("iamauser", "testcolumnfamily", HFactory.createStringColumn("firstname", "John"));
    mutator.addInsertion("iamauser", "testcolumnfamily", HFactory.createStringColumn("lastname", "Smith"));
    mutator.execute();
}
catch (HectorException hex) { hex.printStackTrace(); }
finally { cluster.getConnectionManager().shutdown(); }
Now go back to the command line and connect to Cassandra using:
$bin/cqlsh --cql3
use test;
select * from testcolumnfamily;
This will insert a row of data into your Cassandra DB with the key iamauser and the name John Smith, which you can verify as shown above using the cqlsh tool.
Hope this helps.
