A Storm topology reads data from Kafka and writes it into Cassandra tables
In Storm, I create the Cassandra cluster connection and session in the prepare method.
cassandraCluster = Cluster.builder().withoutJMXReporting().withoutMetrics()
.addContactPoints(nodes)
.withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
.withReconnectionPolicy(new ExponentialReconnectionPolicy(100L,
TimeUnit.MINUTES.toMillis(5)))
.withLoadBalancingPolicy(
new TokenAwarePolicy(new RoundRobinPolicy()))
.build();
session = cassandraCluster.connect(keyspace);
In the execute method I process the tuple and save it in a Cassandra table.
Suppose I want to write data from a single tuple into multiple tables.
Writing a separate bolt for each table seems like a good choice, but then I have to create a cluster connection and session in every bolt.
According to this link, however, a single connection per cluster is better for performance:
http://www.datastax.com/dev/blog/4-simple-rules-when-using-the-datastax-drivers-for-cassandra
Does anyone have an idea how to create the cluster connection in one bolt and use it in other bolts?
It depends on how Storm allocates the bolts and spouts to the workers. You can't assume that you can share connections between bolts, because they might be running in different workers (read: JVMs) or on different nodes entirely.
See my answer here: Mongo connection pooling for Storm topology
Might look something like this pseudocode:
public class CassandraBolt extends BaseRichBolt {
private static final long serialVersionUID = 1L;
private static final Logger LOG = LoggerFactory.getLogger(CassandraBolt.class);
OutputCollector _collector;
// Cluster and Session have to be transient because they are not serializable
protected transient Cluster cassandraCluster;
protected transient Session _session;
@SuppressWarnings("rawtypes")
@Override
public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
_collector = collector;
// maybe get properties from stormConf instead of hard coding them
cassandraCluster = Cluster.builder().withoutJMXReporting().withoutMetrics()
.addContactPoints(nodes)
.withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
.withReconnectionPolicy(new ExponentialReconnectionPolicy(100L,
TimeUnit.MINUTES.toMillis(5)))
.withLoadBalancingPolicy(
new TokenAwarePolicy(new RoundRobinPolicy()))
.build();
_session = cassandraCluster.connect(keyspace);
}
@Override
public void execute(Tuple input) {
try {
// use _session to talk to cassandra
} catch (Exception e) {
LOG.error("CassandraBolt error", e);
_collector.reportError(e);
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
// TODO Auto-generated method stub
}
}
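Bolts that Storm happens to place in the same worker share a JVM, so a static, lazily initialized holder can give them one session per worker, while bolts in other workers still open their own. The following is only a minimal sketch of that idea: `WorkerShared` and `SessionHolder` are hypothetical names, not Storm or driver classes, and a String stands in for the real session that `cassandraCluster.connect(keyspace)` would return.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Holds one lazily created resource per JVM (that is, per Storm worker). */
final class WorkerShared<T> {
    private final Supplier<T> factory;
    private volatile T instance;

    WorkerShared(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Returns the shared instance, creating it on first use (double-checked locking). */
    T get() {
        T local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    local = factory.get();
                    instance = local;
                }
            }
        }
        return local;
    }
}

/** Hypothetical holder; each bolt's prepare() would call SessionHolder.SESSION.get(). */
final class SessionHolder {
    // Counts how often the factory ran, only to make the sharing visible.
    static final AtomicInteger OPENED = new AtomicInteger();

    // Assumption: the String is a stand-in for the real Cassandra session.
    static final WorkerShared<String> SESSION =
            new WorkerShared<>(() -> "session-" + OPENED.incrementAndGet());

    private SessionHolder() {}
}
```

Every bolt instance in the same worker that calls `SessionHolder.SESSION.get()` gets the same object; nothing is shared across workers or nodes.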
Is it possible to use a custom datastax session for Spring-Data?
Hi, I know Spring-Data for Cassandra uses datastax session internally. However I have a custom datastax session object (given by another service) that I would like Spring-Data to use instead of the one prewired. Assuming the versions of both datastax sessions are the same, is this possible?
Yes, it's possible.
Depending on your setup, there are a couple of approaches. Let me explain the two most common scenarios:
Direct usage of Template API
Session yourSession = …;
CqlTemplate cqlTemplate = new CqlTemplate(yourSession);
CassandraTemplate cassandraTemplate = new CassandraTemplate(yourSession);
Exposing the Session as @Bean
This one might require a bit more setup as configuration support expects usage of CassandraSessionFactoryBean and CassandraClusterFactoryBean.
Take a look at AbstractCassandraConfiguration to see which supporting beans (CassandraConverter, CassandraMappingContext) are needed to set up Spring Data's Cassandra support.
@Configuration
class MyCassandraConfig {
private final Session mySession;
public MyCassandraConfig(Session mySession) {
this.mySession = mySession;
}
@Bean
public CassandraConverter cassandraConverter() {
MappingCassandraConverter mappingCassandraConverter = new MappingCassandraConverter(cassandraMapping());
mappingCassandraConverter.setCustomConversions(customConversions());
return mappingCassandraConverter;
}
@Bean
public CassandraMappingContext cassandraMapping() {
Cluster cluster = mySession.getCluster();
String keyspace = mySession.getLoggedKeyspace();
CassandraMappingContext mappingContext = new CassandraMappingContext(
new SimpleUserTypeResolver(cluster, keyspace), new SimpleTupleTypeFactory(cluster));
CustomConversions customConversions = customConversions();
mappingContext.setCustomConversions(customConversions);
mappingContext.setSimpleTypeHolder(customConversions.getSimpleTypeHolder());
return mappingContext;
}
@Bean
public CustomConversions customConversions() {
return new CassandraCustomConversions(Collections.emptyList());
}
@Bean
public CassandraTemplate cassandraTemplate() {
return new CassandraTemplate(mySession, cassandraConverter());
}
}
Using Spring Integration and ZooKeeper, one can implement a leader to perform activities such as polling.
However, how do we distribute the leader responsibility across all nodes in the cluster to balance the load?
With the code below, once the application starts, I see that the same node keeps the leader role and fetches events. I want to distribute this activity to every node in the cluster to balance the load better.
Is there any way I can schedule each node in the cluster to gain and revoke leadership in a round-robin manner?
@Bean
public LeaderInitiatorFactoryBean fooLeaderInitiator(CuratorFramework client) {
new LeaderInitiatorFactoryBean()
.setClient(client)
.setPath("/foofeed")
.setRole("foo");
}
@Bean
@InboundChannelAdapter(channel = "fooIncomingEvents", autoStartup = "false", poller = @Poller(fixedDelay = "5000"))
@Role("foo")
public FooTriggerMessageSource fooInboundChannelAdapter() {
new FooMessageSource("foo")
}
I could simulate load balancing with the code below, though I am not sure it is the correct approach. I see the "fetching events" log statement from only one node in the cluster at a time. The code yields leadership after gaining it and performing its job.
@Bean
public LeaderInitiator fooLeaderInitiator(CuratorFramework client,
FooPollingCandidate fooPollingCandidate) {
LeaderInitiator leader = new LeaderInitiator(client, fooPollingCandidate, zooKeeperNamespace)
leader.start()
leader
}
@Component
class FooPollingCandidate extends DefaultCandidate {
final Logger log = LoggerFactory.getLogger(this.getClass());
FooPollingCandidate() {
super("fooPoller", "foo")
}
@Override
void onGranted(Context ctx) {
log.debug("Leadership granted {}", ctx)
pullEvents()
ctx.yield();
}
@Override
void onRevoked(Context ctx) {
log.debug("Leadership revoked")
}
@Override
void yieldLeadership() {
log.debug("yielding Leadership")
}
//pull events and drop them on any channel needed
void pullEvents() {
log.debug("fetching events")
//simulate delay
sleep(5000)
}
}
What you are suggesting is an abuse of the leader election technology, which is intended for warm failover when the current leader fails. Manually yielding leadership after each event is an anti-pattern.
What you probably want is competing pollers where all pollers are active, but use a shared store to prevent duplicate processing.
For example, if you are polling a shared directory for files to process, you would use a FileSystemPersistentFileListFilter with a shared MetadataStore (such as the zookeeper implementation) to prevent multiple instances from processing the same file.
You can use the same technique (shared metadata store) for any polled message source.
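The competing-pollers idea can be sketched without any framework. In this illustration a `ConcurrentHashMap` stands in for the shared metadata store (in a real cluster it would have to be a store all nodes can reach, such as the ZooKeeper-backed one), and `CompetingPoller` is a hypothetical class, not a Spring Integration type. The only property it relies on is an atomic `putIfAbsent`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentMap;

/** One polling node; every node is active and may see the same candidate items. */
final class CompetingPoller {
    private final String nodeId;
    // Assumption: stand-in for a metadata store shared by all nodes.
    private final ConcurrentMap<String, String> sharedStore;
    private final List<String> processed = new ArrayList<>();

    CompetingPoller(String nodeId, ConcurrentMap<String, String> sharedStore) {
        this.nodeId = nodeId;
        this.sharedStore = sharedStore;
    }

    /** Processes only the items this node manages to claim in the shared store. */
    void poll(List<String> candidates) {
        for (String item : candidates) {
            // putIfAbsent is atomic: exactly one node gets null back and owns the item.
            if (sharedStore.putIfAbsent(item, nodeId) == null) {
                processed.add(item);
            }
        }
    }

    List<String> processed() {
        return processed;
    }
}
```

If node A polls `e1` and `e2` first and node B then polls `e1`, `e2`, `e3`, node B ends up processing only `e3`, so no item is handled twice and no node has to hold a leader role.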
As the Spark docs say, Spark supports Kafka as a streaming data source. But I use ZeroMQ, and there is no ZeroMQUtils, so how can I use it? And, more generally, what about other MQs? I am totally new to Spark and Spark Streaming, so I am sorry if the question is stupid. Could anyone give me a solution? Thanks.
BTW, I use Python.
Update: I finally did it in Java with a custom Receiver. Below is my solution:
// T stands for your concrete element type; replace it with a real class
public class ZeroMQReceiver extends Receiver<T> {
private static final ObjectMapper mapper = new ObjectMapper();
public ZeroMQReceiver() {
super(StorageLevel.MEMORY_AND_DISK_2());
}
@Override
public void onStart() {
// Start the thread that receives data over a connection
new Thread(this::receive).start();
}
@Override
public void onStop() {
// There is nothing much to do as the thread calling receive()
// is designed to stop by itself once isStopped() returns true
}
/** Create a socket connection and receive data until receiver is stopped */
private void receive() {
String message = null;
try {
ZMQ.Context context = ZMQ.context(1);
ZMQ.Socket subscriber = context.socket(ZMQ.SUB);
subscriber.connect("tcp://ip:port");
subscriber.subscribe("".getBytes());
// Until stopped or connection broken continue reading
while (!isStopped() && (message = subscriber.recvStr()) != null) {
List<T> results = mapper.readValue(message,
new TypeReference<List<T>>(){} );
for (T item : results) {
store(item);
}
}
// Restart in an attempt to connect again when server is active again
restart("Trying to connect again");
} catch(Throwable t) {
// restart if there is any other error
restart("Error receiving data", t);
}
}
}
I assume you are talking about Structured Streaming.
I am not familiar with ZeroMQ, but an important point in Spark Structured Streaming sources is replayability (in order to ensure fault tolerance), which, if I understand correctly, ZeroMQ doesn't deliver out-of-the-box.
A practical approach would be to buffer the data either in Kafka, using the KafkaSource, or as files in a directory (local FS/NFS, HDFS, S3), using the FileSource for reading. Cf. the Spark docs. If you use the FileSource, never append to an existing file in its input directory; instead, move complete files into the directory atomically.
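For a local or NFS directory, the atomic move can be done with the JDK's `Files.move` and `StandardCopyOption.ATOMIC_MOVE`. This is only a sketch under assumptions: `AtomicDrop`, the staging directory, and the file names are made up for illustration, and both directories must be on the same file system. HDFS and S3 need their own client APIs and are not covered here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class AtomicDrop {
    private AtomicDrop() {}

    /**
     * Writes the content to a staging directory first, then moves the finished
     * file into the input directory in one step, so a reader watching the input
     * directory never sees a half-written file.
     */
    static Path publish(Path stagingDir, Path inputDir, String fileName, String content)
            throws IOException {
        Path staged = stagingDir.resolve(fileName);
        Files.write(staged, content.getBytes(StandardCharsets.UTF_8));
        // Throws AtomicMoveNotSupportedException if the move cannot be atomic,
        // for example when the two directories are on different file systems.
        return Files.move(staged, inputDir.resolve(fileName), StandardCopyOption.ATOMIC_MOVE);
    }
}
```

The staging directory should sit outside the directory the FileSource watches, otherwise the partially written file would be picked up anyway.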
I am evaluating Spark with a MarkLogic database. I have read a CSV file, and now I have a JavaRDD object that I have to dump into the MarkLogic database.
SparkConf conf = new SparkConf().setAppName("org.sparkexample.Dataload").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> data = sc.textFile("/root/ml/workArea/data.csv");
SQLContext sqlContext = new SQLContext(sc);
JavaRDD<Record> rdd_records = data.map(
new Function<String, Record>() {
public Record call(String line) throws Exception {
String[] fields = line.split(",");
Record sd = new Record(fields[0], fields[1], fields[2], fields[3],fields[4]);
return sd;
}
});
I want to write this JavaRDD object to the MarkLogic database.
Is there any Spark API available for faster writing to MarkLogic?
If we cannot write a JavaRDD directly to MarkLogic, what is the correct approach to achieve this?
Here is the code I am using to write the JavaRDD data to MarkLogic; let me know if this is the wrong way to do it.
final DatabaseClient client = DatabaseClientFactory.newClient("localhost",8070, "MLTest");
final XMLDocumentManager docMgr = client.newXMLDocumentManager();
rdd_records.foreachPartition(new VoidFunction<Iterator<Record>>() {
public void call(Iterator<Record> partitionOfRecords) {
while (partitionOfRecords.hasNext()) {
Record record = partitionOfRecords.next();
System.out.println("partitionOfRecords - "+record.toString());
String docId = "/example/"+record.getID()+".xml";
JAXBContext context = JAXBContext.newInstance(Record.class);
JAXBHandle<Record> handle = new JAXBHandle<Record>(context);
handle.set(record);
docMgr.writeAs(docId, handle);
}
}
});
client.release();
I have used the Java Client API to write the data, but I get the exception below even though the POJO class Record implements the Serializable interface. Please let me know what the reason could be and how to solve it.
org.apache.spark.SparkException: Task not serializable
The easiest way to get data into MarkLogic is via HTTP and the client REST API - specifically the /v1/documents endpoints - http://docs.marklogic.com/REST/client/management .
There are a variety of ways to optimize this, such as via a write set, but based on your question, I think the first thing to decide is - what kind of document do you want to write for each Record? Your example shows 5 columns in the CSV - typically, you'll write either a JSON or XML document with 5 fields/elements, each named based on the column index. So you'd need to write a little code to generate that JSON/XML, and then use whatever HTTP client you prefer (and one option is the MarkLogic Java Client API) to write that document to MarkLogic.
That addresses your question of how to write a JavaRDD to MarkLogic - but if your goal is to get data from a CSV into MarkLogic as fast as possible, then skip Spark and use mlcp - https://docs.marklogic.com/guide/mlcp/import#id_70366 - which involves zero coding.
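As an illustration of the "little code to generate that JSON" step, here is a dependency-free sketch that turns one CSV line into a flat JSON object with keys named after the column index. `RecordJson` and the `col0`, `col1`, ... key names are assumptions made for this example; like the `line.split(",")` in the question, it does not handle quoted CSV fields that contain commas. In real code a JSON library would be the safer choice.

```java
final class RecordJson {
    private RecordJson() {}

    /** Turns one CSV line into a flat JSON object whose keys are col0, col1, ... */
    static String toJson(String csvLine) {
        // The -1 limit keeps trailing empty columns instead of dropping them.
        String[] fields = csvLine.split(",", -1);
        StringBuilder json = new StringBuilder("{");
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                json.append(',');
            }
            json.append("\"col").append(i).append("\":\"")
                .append(escape(fields[i])).append('"');
        }
        return json.append('}').toString();
    }

    /** Escapes the characters that JSON does not allow unescaped inside a string. */
    private static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"') {
                out.append("\\\"");
            } else if (c == '\\') {
                out.append("\\\\");
            } else if (c < 0x20) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The resulting string is what you would hand to whichever HTTP client or MarkLogic client call you use to write the document.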
This is a modified example from the Spark Streaming guide. Here you will have to implement the connection and writing logic specific to your database.
public void send(JavaRDD<String> rdd) {
rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
@Override
public void call(Iterator<String> partitionOfRecords) {
// ConnectionPool is a static, lazily initialized pool of connections
Connection connection = ConnectionPool.getConnection();
while (partitionOfRecords.hasNext()) {
connection.send(partitionOfRecords.next());
}
ConnectionPool.returnConnection(connection); // return to the pool
// for future reuse
}
});
}
I'm wondering if you just need to make sure everything you access inside your VoidFunction that was instantiated outside it is serializable (see this page). DatabaseClient and XMLDocumentManager are of course not serializable, as they're connected resources. You're right, however, to not instantiate DatabaseClient inside your VoidFunction as that would be less efficient (though it would work). I don't know if the following idea would work with spark. But I'm guessing you could create a class that keeps hold of a singleton DatabaseClient instance:
public static class MLClient {
private static DatabaseClient singleton;
private MLClient() {}
// synchronized so that concurrent tasks in one JVM cannot create two clients
public static synchronized DatabaseClient get(DatabaseClientFactory.Bean connectionInfo) {
if ( connectionInfo == null ) {
throw new IllegalArgumentException("connectionInfo cannot be null");
}
if ( singleton == null ) {
singleton = connectionInfo.newClient();
}
return singleton;
}
}
Then you just create a serializable DatabaseClientFactory.Bean outside your VoidFunction, so your auth info is still centralized:
DatabaseClientFactory.Bean connectionInfo =
new DatabaseClientFactory.Bean();
connectionInfo.setHost("localhost");
connectionInfo.setPort(8000);
connectionInfo.setUser("admin");
connectionInfo.setPassword("admin");
connectionInfo.setAuthenticationValue("digest");
Then inside your VoidFunction you could get that singleton DatabaseClient and new XMLDocumentManager like so:
DatabaseClient client = MLClient.get(connectionInfo);
XMLDocumentManager docMgr = client.newXMLDocumentManager();
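Why capturing the client fails while capturing only the connection info works can be reproduced with plain Java serialization, which is the mechanism behind the "Task not serializable" error. The classes below are stand-ins invented for this sketch (`FakeClient` plays the role of a connected, non-serializable client); none of them are Spark or MarkLogic types.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Stand-in for a connected client; deliberately NOT Serializable. */
final class FakeClient {
    final String host;
    FakeClient(String host) { this.host = host; }
}

/** Captures the client itself, so serializing the task fails. */
final class CapturingTask implements Serializable {
    private static final long serialVersionUID = 1L;
    private final FakeClient client;
    CapturingTask(FakeClient client) { this.client = client; }
}

/** Captures only serializable connection info and creates the client lazily. */
final class LazyTask implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String host;
    private transient FakeClient client; // skipped by serialization

    LazyTask(String host) { this.host = host; }

    FakeClient client() {
        if (client == null) {
            client = new FakeClient(host); // created where the task actually runs
        }
        return client;
    }
}

final class SerializationCheck {
    private SerializationCheck() {}

    /** Returns true if Java serialization accepts the object graph. */
    static boolean isSerializable(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) { // NotSerializableException is an IOException
            return false;
        }
    }
}
```

`CapturingTask` corresponds to the code in the question, where the function object drags `client` and `docMgr` along; `LazyTask` corresponds to shipping only the connection info and obtaining the client on the executor side.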
I am using HornetQ in a distributed transaction environment with MDBs. I read in the JMS documentation that we should not create Connection instances frequently; rather, we should reuse the connection and create JMS sessions as and when required. So I wrote code that creates a JMS connection and then reuses it. But I encountered the following exception while reusing the JMS connection object:
Could not create a session: Only allowed one session per connection.
See the J2EE spec, e.g. J2EE1.4 Section 6.6
I read a few blogs on this, but they are all specific to the Seam framework.
Here is my code:
public class DefaultService implements IMessageService {
private static final long serialVersionUID = 1L;
private static final Logger logger = LogManager.getLogger(DefaultService.class);
private static final String connectionFactoryJndiName = "java:/JmsXA";
private static volatile Connection connection = null;
private Session session = null;
@Override
public void sendMessage(String destinationStr, Serializable object) {
try {
Destination destination = jmsServiceLocator.getDestination(destinationStr);
ObjectMessage message = session.createObjectMessage();
message.setObject(object);
MessageProducer messageProducer = session.createProducer(destination);
messageProducer.send(destination, message);
messageProducer.close();
logger.trace("Sent JMS Message for: " + object.getClass().getName());
}
catch (NamingException e) {
throw new RuntimeException("Couldn't send jms message", e);
}
catch (JMSException e) {
throw new RuntimeException("Couldn't send jms message", e);
}
}
@Override
public void close() {
try {
if (session != null) {
session.close();
}
}
catch (Exception e) {
logger.error("Couldn't close session", e);
}
}
}
I am using JBoss EAP 6.
Did I miss any settings here?
On a JCA connection (i.e. a connection obtained from the PooledConnectionFactory) you are supposed to create only one Session per connection. That is part of the EE specification, and always has been.
This is because these connections are pooled, and it would be impossible to put them back into the pool if you were using more than one session per connection.
If you switch to non-pooled connection factories (the ones meant for remote clients), it would work the way you wanted, but you would lose the application server's pooling. EE components are usually short-lived, and opening and closing JMS connections (any connection, to be more precise) is an expensive operation.
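The practical consequence with a pooled factory is to get a connection, create its single session, and close both for every unit of work, relying on the pool to make that cheap. The toy model below only illustrates that behaviour; `ToyConnectionPool` and `Handle` are invented for this sketch and are not JMS, JCA, or HornetQ classes.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy pool: closing a handle returns the physical connection instead of destroying it. */
final class ToyConnectionPool {
    private final Deque<Integer> idle = new ArrayDeque<>();
    private int physicalCreated = 0;

    /** What the application sees as "a connection". */
    final class Handle implements AutoCloseable {
        private final int physicalId;
        private boolean sessionOpen = false;

        private Handle(int physicalId) {
            this.physicalId = physicalId;
        }

        /** Mimics the container rule: a second session on the same handle is rejected. */
        void createSession() {
            if (sessionOpen) {
                throw new IllegalStateException("Only allowed one session per connection");
            }
            sessionOpen = true;
        }

        @Override
        public void close() {
            sessionOpen = false;
            idle.push(physicalId); // back to the pool for the next caller
        }
    }

    /** Reuses an idle physical connection if there is one, otherwise creates one. */
    Handle getConnection() {
        Integer id = idle.poll();
        if (id == null) {
            id = ++physicalCreated;
        }
        return new Handle(id);
    }

    int physicalCreated() {
        return physicalCreated;
    }
}
```

Three consecutive get, use, close cycles against this toy pool create only one physical connection, which is the reason that opening and closing per send is acceptable when the application server pools the connections.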