Register Java Class in Flink Cluster - cassandra

I am running my fat JAR in a Flink cluster; it reads from Kafka and saves to Cassandra. The code is:
final Properties prop = getProperties();
final FlinkKafkaConsumer<String> flinkConsumer = new FlinkKafkaConsumer<>
(kafkaTopicName, new SimpleStringSchema(), prop);
flinkConsumer.setStartFromEarliest();
final DataStream<String> stream = env.addSource(flinkConsumer);
DataStream<Person> sensorStreaming = stream.flatMap(new FlatMapFunction<String, Person>() {
@Override
public void flatMap(String value, Collector<Person> out) throws Exception {
try {
out.collect(objectMapper.readValue(value, Person.class));
} catch (JsonProcessingException e) {
logger.error("Json Processing Exception", e);
}
}
});
savePersonDetails(sensorStreaming);
env.execute();
The Person POJO contains:
@Column(name = "event_time")
private Instant eventTime;
A codec is required on the Cassandra side to store Instant; it is registered as below:
final Cluster cluster = ClusterManager.getCluster(cassandraIpAddress);
cluster.getConfiguration().getCodecRegistry().register(InstantCodec.instance);
When I run it standalone it works fine, but when I run it on a local cluster it throws the error below:
Caused by: com.datastax.driver.core.exceptions.CodecNotFoundException: Codec not found for requested operation: [timestamp <-> java.time.Instant]
at com.datastax.driver.core.CodecRegistry.notFound(CodecRegistry.java:679)
at com.datastax.driver.core.CodecRegistry.createCodec(CodecRegistry.java:526)
at com.datastax.driver.core.CodecRegistry.findCodec(CodecRegistry.java:506)
at com.datastax.driver.core.CodecRegistry.access$200(CodecRegistry.java:140)
at com.datastax.driver.core.CodecRegistry$TypeCodecCacheLoader.load(CodecRegistry.java:211)
at com.datastax.driver.core.CodecRegistry$TypeCodecCacheLoader.load(CodecRegistry.java:208)
I read the document below about registering custom serializers,
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/custom_serializers.html
but InstantCodec is a third-party codec. How can I register it?

I solved the problem: the stream was emitting LocalDateTime, and converting with that same type produced the error above. I changed the type to java.util.Date and then it worked.
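For reference, if keeping java.time.Instant had been required: the registration shown earlier runs on the client, not on the task managers where the sink executes. Below is a hedged sketch, assuming savePersonDetails uses Flink's CassandraSink for the annotated Person POJO (that method is not shown above), of registering the codec inside the serializable ClusterBuilder so it runs wherever the connection is actually built:
// Hedged sketch: the ClusterBuilder is shipped to the task managers,
// so the codec registration happens where the sink opens its connection.
CassandraSink.addSink(sensorStreaming)
        .setClusterBuilder(new ClusterBuilder() {
            @Override
            protected Cluster buildCluster(Cluster.Builder builder) {
                Cluster cluster = builder.addContactPoint(cassandraIpAddress).build();
                // register the third-party codec on the cluster built on this node
                cluster.getConfiguration().getCodecRegistry().register(InstantCodec.instance);
                return cluster;
            }
        })
        .build();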

Related

Message history not getting deserialized

I was trying to send message history via a Kafka-backed message channel, and I am getting an error like the one below:
Caused by: java.lang.IllegalArgumentException: Incorrect type specified for header 'history'. Expected [class org.springframework.integration.history.MessageHistory] but actual type is [class org.springframework.kafka.support.DefaultKafkaHeaderMapper$NonTrustedHeaderType]
at org.springframework.messaging.MessageHeaders.get(MessageHeaders.java:216)
at org.springframework.integration.history.MessageHistory.write(MessageHistory.java:96)
Environment:
Java version: JDK8
Kafka version: 3.1.0
Spring-boot-starter-integration: 2.6.2 (integration core:5.5.7)
The message is deserialized properly without message history, but deserialization fails when message history is included.
Here is the configuration that I am setting:
Consumer:
public ConsumerFactory consumerFactory(String groupId, String clientId) {
Properties consumerProperties = new Properties();
consumerProperties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapAddress);
consumerProperties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, AUTO_OFFSET_RESET_CONFIG);
consumerProperties
.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
consumerProperties.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
consumerProperties.put(ConsumerConfig.CLIENT_ID_CONFIG, clientId);
consumerProperties.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG,MAX_POLL_INTERVAL_MS_CONFIG);
consumerProperties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG,MAX_POLL_RECORDS_CONFIG);
DefaultKafkaConsumerFactory defaultKafkaConsumerFactory = new DefaultKafkaConsumerFactory(
consumerProperties);
JsonDeserializer jsonDeserializer = new JsonDeserializer(GenericMessage.class, JacksonJsonUtils.messagingAwareMapper());
jsonDeserializer.addTrustedPackages("*");
defaultKafkaConsumerFactory.setValueDeserializer(jsonDeserializer);
return defaultKafkaConsumerFactory;
}
Producer:
public ProducerFactory<String, String> producerFactory() {
Map<String, Object> producerConfigMap = new HashMap<>();
producerConfigMap.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapAddress);
producerConfigMap.put(ProducerConfig.LINGER_MS_CONFIG, PRODUCER_LINGER_MS_CONFIG);
producerConfigMap.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, PRODUCER_COMPRESSION_TYPE_CONFIG);
producerConfigMap.put(ProducerConfig.BATCH_SIZE_CONFIG, PRODUCER_BATCH_SIZE_CONFIG);
producerConfigMap
.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
JsonSerializer<GenericMessage> jsonSerializer = new JsonSerializer(JacksonJsonUtils.messagingAwareMapper());
DefaultKafkaProducerFactory defaultKafkaProducerFactory = new DefaultKafkaProducerFactory<>(
producerConfigMap);
defaultKafkaProducerFactory.setValueSerializer(jsonSerializer);
return defaultKafkaProducerFactory;
}
ConcurrentKafkaListenerContainerFactory:
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
String groupId, String clientId) {
DefaultKafkaHeaderMapper kafkaHeaderMapper = new DefaultKafkaHeaderMapper();
kafkaHeaderMapper.addTrustedPackages("org.springframework.integration.history");
MessagingMessageConverter messagingMessageConverter = new MessagingMessageConverter();
messagingMessageConverter.setHeaderMapper(kafkaHeaderMapper);
ConcurrentKafkaListenerContainerFactory<String, String> factory =
new ConcurrentKafkaListenerContainerFactory<>();
factory.setConsumerFactory(consumerFactory(groupId, clientId));
factory.setMessageConverter(messagingMessageConverter);
return factory;
}
I have tried adding trusted packages in every possible way, but still, I am getting the above error.
Look at the exception again: org.springframework.kafka.support.DefaultKafkaHeaderMapper$NonTrustedHeaderType. It doesn't say that anything is wrong with your (de)serializers. It is the DefaultKafkaHeaderMapper that is banning that type because its package is not trusted.
You need to supply a DefaultKafkaHeaderMapper with addTrustedPackages("*") on the consumer side. If you use a KafkaMessageDrivenChannelAdapter, populate its setMessageConverter(MessageConverter messageConverter) with a MessagingMessageConverter, which in turn has a setHeaderMapper(KafkaHeaderMapper headerMapper) option where you can set that DefaultKafkaHeaderMapper.
Please raise a GitHub issue so we can add org.springframework.integration.history to the trusted packages of the default DefaultKafkaHeaderMapper in the KafkaMessageDrivenChannelAdapter.
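For reference, a hedged sketch of that consumer-side wiring (the listener container bean and the channel name are assumptions, not taken from the question):
@Bean
public KafkaMessageDrivenChannelAdapter<String, String> kafkaInboundAdapter(
        AbstractMessageListenerContainer<String, String> container) {
    DefaultKafkaHeaderMapper headerMapper = new DefaultKafkaHeaderMapper();
    headerMapper.addTrustedPackages("*");   // trust MessageHistory (and any other header type) on inbound mapping
    MessagingMessageConverter converter = new MessagingMessageConverter();
    converter.setHeaderMapper(headerMapper);
    KafkaMessageDrivenChannelAdapter<String, String> adapter =
            new KafkaMessageDrivenChannelAdapter<>(container);
    adapter.setMessageConverter(converter);     // converter carrying the trusting header mapper
    adapter.setOutputChannelName("fromKafka");  // hypothetical channel name
    return adapter;
}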

How to read messages in MQs using Spark Streaming, i.e. ZeroMQ, RabbitMQ?

As the Spark docs say, Kafka is supported as a data streaming source, but I use ZeroMQ and there is no ZeroMQUtils, so how can I use it? And, more generally, what about other MQs? I am totally new to Spark and Spark Streaming, so I am sorry if the question is stupid. Could anyone give me a solution? Thanks.
BTW, I use Python.
Update: I finally did it in Java with a custom receiver. Below is my solution:
public class ZeroMQReceiver<T> extends Receiver<T> {
private static final ObjectMapper mapper = new ObjectMapper();
public ZeroMQReceiver() {
super(StorageLevel.MEMORY_AND_DISK_2());
}
@Override
public void onStart() {
// Start the thread that receives data over a connection
new Thread(this::receive).start();
}
@Override
public void onStop() {
// There is nothing much to do, as the thread calling receive()
// is designed to stop by itself once isStopped() returns true
}
/** Create a socket connection and receive data until receiver is stopped */
private void receive() {
String message = null;
try {
ZMQ.Context context = ZMQ.context(1);
ZMQ.Socket subscriber = context.socket(ZMQ.SUB);
subscriber.connect("tcp://ip:port");
subscriber.subscribe("".getBytes());
// Until stopped or connection broken continue reading
while (!isStopped() && (message = subscriber.recvStr()) != null) {
List<T> results = mapper.readValue(message,
new TypeReference<List<T>>(){} );
for (T item : results) {
store(item);
}
}
// Restart in an attempt to connect again when server is active again
restart("Trying to connect again");
} catch(Throwable t) {
// restart if there is any other error
restart("Error receiving data", t);
}
}
}
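To plug the receiver above into a job, something along these lines should work (a minimal sketch; MyEvent is a stand-in for whatever type the JSON payload maps to, and the batch interval is an arbitrary choice):
// Hedged sketch of driving the custom receiver from a streaming context.
SparkConf conf = new SparkConf().setAppName("zeromq-receiver").setMaster("local[2]");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));
JavaReceiverInputDStream<MyEvent> events = jssc.receiverStream(new ZeroMQReceiver<MyEvent>());
events.print();          // replace with real processing
jssc.start();
jssc.awaitTermination();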
I assume you are talking about Structured Streaming.
I am not familiar with ZeroMQ, but an important point in Spark Structured Streaming sources is replayability (in order to ensure fault tolerance), which, if I understand correctly, ZeroMQ doesn't deliver out-of-the-box.
A practical approach would be to buffer the data either in Kafka (and use the KafkaSource) or as files in a directory (local FS/NFS, HDFS, S3) and use the FileSource for reading; cf. the Spark docs. If you use the FileSource, make sure not to append to an existing file in the FileSource's input directory, but to move complete files into the directory atomically. A sketch of the file-source variant follows.
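A hedged sketch of that file-source approach in Java (the input path and schema are assumptions; start() and awaitTermination() may declare checked exceptions depending on the Spark version):
// Structured Streaming reading buffered JSON files as a stream.
SparkSession spark = SparkSession.builder().appName("zmq-buffered").getOrCreate();
StructType schema = new StructType()              // file sources need an explicit schema
        .add("id", DataTypes.StringType)
        .add("payload", DataTypes.StringType);
Dataset<Row> input = spark.readStream()
        .schema(schema)
        .json("/data/zmq-buffer");                // directory that complete files are moved into atomically
StreamingQuery query = input.writeStream()
        .format("console")
        .start();
query.awaitTermination();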

Adding Cassandra as sink in Flink error: All host(s) tried for query failed

I was following the example at https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/connectors/cassandra.html to connect Cassandra as a sink in Flink.
My code is shown below:
public class writeToCassandra {
private static final String CREATE_KEYSPACE_QUERY = "CREATE KEYSPACE test WITH replication= {'class':'SimpleStrategy', 'replication_factor':1};";
private static final String createTable = "CREATE TABLE test.cassandraData(id varchar, heart_rate varchar, PRIMARY KEY(id));" ;
private final static Collection<String> collection = new ArrayList<>(50);
static {
for (int i = 1; i <= 50; ++i) {
collection.add("element " + i);
}
}
public static void main(String[] args) throws Exception {
//setting the env variable to local
StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironment(1);
DataStream<Tuple2<String, String>> dataStream = environment
.fromCollection(collection)
.map(new MapFunction<String, Tuple2<String, String>>() {
final String mapped = " mapped ";
String[] splitted;
@Override
public Tuple2<String, String> map(String s) throws Exception {
splitted = s.split("\\s+");
return Tuple2.of(
UUID.randomUUID().toString(),
splitted[0] + mapped + splitted[1]
);
}
});
CassandraSink.addSink(dataStream)
.setQuery("INSERT INTO test.cassandraData(id,heart_rate) values (?,?);")
.setHost("127.0.0.1")
.build();
environment.execute();
} //main
} //writeToCassandra
I am getting the following error
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.1:9042 (com.datastax.driver.core.exceptions.TransportException: [/127.0.0.1] Cannot connect))
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:231)
Not sure if this is always required, but the way that I set up my CassandraSink is like this:
CassandraSink
.addSink(dataStream)
.setClusterBuilder(new ClusterBuilder() {
@Override
protected Cluster buildCluster(Cluster.Builder builder) {
return Cluster.builder()
.addContactPoints(myListOfCassandraUrlsString.split(","))
.withPort(portNumber)
.build();
}
})
.build();
I have annotated POJOs that are returned by the dataStream so I don't need the query, but you would just include ".setQuery(...)" after the ".addSink(...)" line.
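For reference, a hedged sketch of what such an annotated POJO might look like for the test.cassandraData table from the question (the annotations come from the DataStax object mapper):
import com.datastax.driver.mapping.annotations.Column;
import com.datastax.driver.mapping.annotations.Table;
import java.io.Serializable;

@Table(keyspace = "test", name = "cassandradata")
public class CassandraData implements Serializable {

    @Column(name = "id")
    private String id;

    @Column(name = "heart_rate")
    private String heartRate;

    public CassandraData() {}   // the mapper needs a no-arg constructor

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getHeartRate() { return heartRate; }
    public void setHeartRate(String heartRate) { this.heartRate = heartRate; }
}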
The exception simply indicates that the example program cannot reach the C* database.
The flink-cassandra-connector offers a streaming API to connect to a designated C* database, so you need to have a C* instance running.
Each streaming job is pushed/serialized to the node that the Task Manager runs on. In your example, you assume C* is running on the same node as the TM node. An alternative is to change the C* address from 127.0.0.1 to a publicly reachable address.

Kafka-Streams throwing NullPointerException when consuming

I have this problem:
When consuming from a topic using the Processor API, Kafka Streams throws a NullPointerException as soon as the processor calls context().forward(K, V).
This is the stacktrace of it:
Exception in thread "StreamThread-1" java.lang.NullPointerException
at org.apache.kafka.streams.processor.internals.StreamTask.forward(StreamTask.java:336)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:187)
at org.apache.kafka.streams.processor.ProcessorContext$forward.call(Unknown Source)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:133)
at com.bnsf.ltf.processor.ConversionProcessor.process(ConversionProcessor.groovy:23)
at com.bnsf.ltf.processor.ConversionProcessor.process(ConversionProcessor.groovy)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:68)
at org.apache.kafka.streams.processor.internals.StreamTask.forward(StreamTask.java:338)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:187)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:64)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:174)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:320)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:218)
My Gradle dependencies look like this:
compile('org.codehaus.groovy:groovy-all')
compile('org.apache.kafka:kafka-streams:0.10.0.0')
Update: I tried with version 0.10.0.1 and it still throws the same error.
This is the code of the Topology I'm building...
topologyBuilder.addSource('inboundTopic', stringDeserializer, stringDeserializer, conversionConfiguration.inTopic)
.addProcessor('conversionProcess', new ProcessorSupplier() {
@Override
Processor get() {
return conversionProcessor
}
}, 'inboundTopic')
.addSink('outputTopic', conversionConfiguration.outTopic, stringSerializer, stringSerializer, 'conversionProcess')
stream = new KafkaStreams(topologyBuilder, streamConfig)
stream.start()
My processor looks like this:
@Override
void process(String key, String message) {
// Call to a service and the return of the service is set on the
// converted local variable named converted
context().forward(key, converted)
context().commit()
}
Provide your Processor directly.
.addProcessor('conversionProcess', () -> new MyProcessor(), 'inboundTopic')
MyProcessor should, in turn, inherit from AbstractProcessor.
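A hedged sketch of what that processor might look like (the actual conversion is stubbed out):
import org.apache.kafka.streams.processor.AbstractProcessor;

public class MyProcessor extends AbstractProcessor<String, String> {

    @Override
    public void process(String key, String message) {
        // the real code would call the conversion service here
        String converted = message;
        // context() is set by AbstractProcessor.init(), so forwarding is safe
        context().forward(key, converted);
        context().commit();
    }
}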

How to write JavaRDD to MarkLogic database

I am evaluating Spark with the MarkLogic database. I have read a CSV file, and now I have a JavaRDD object which I have to dump into the MarkLogic database.
SparkConf conf = new SparkConf().setAppName("org.sparkexample.Dataload").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> data = sc.textFile("/root/ml/workArea/data.csv");
SQLContext sqlContext = new SQLContext(sc);
JavaRDD<Record> rdd_records = data.map(
new Function<String, Record>() {
public Record call(String line) throws Exception {
String[] fields = line.split(",");
Record sd = new Record(fields[0], fields[1], fields[2], fields[3],fields[4]);
return sd;
}
});
I want to write this JavaRDD object to the MarkLogic database.
Is there any Spark API available for faster writing to the MarkLogic database?
Let's say we cannot write the JavaRDD directly to MarkLogic; what is the correct approach to achieve this?
Here is the code I am using to write the JavaRDD data to the MarkLogic database; let me know if it is the wrong way to do that.
final DatabaseClient client = DatabaseClientFactory.newClient("localhost",8070, "MLTest");
final XMLDocumentManager docMgr = client.newXMLDocumentManager();
rdd_records.foreachPartition(new VoidFunction<Iterator<Record>>() {
public void call(Iterator<Record> partitionOfRecords) throws Exception {
while (partitionOfRecords.hasNext()) {
Record record = partitionOfRecords.next();
System.out.println("partitionOfRecords - "+record.toString());
String docId = "/example/"+record.getID()+".xml";
JAXBContext context = JAXBContext.newInstance(Record.class);
JAXBHandle<Record> handle = new JAXBHandle<Record>(context);
handle.set(record);
docMgr.writeAs(docId, handle);
}
}
});
client.release();
I have used the Java Client API to write the data, but I am getting the exception below even though the POJO class Record implements the Serializable interface. Please let me know what could be the reason and how to solve it.
org.apache.spark.SparkException: Task not serializable
The easiest way to get data into MarkLogic is via HTTP and the client REST API - specifically the /v1/documents endpoints - http://docs.marklogic.com/REST/client/management .
There are a variety of ways to optimize this, such as via a write set, but based on your question, I think the first thing to decide is - what kind of document do you want to write for each Record? Your example shows 5 columns in the CSV - typically, you'll write either a JSON or XML document with 5 fields/elements, each named based on the column index. So you'd need to write a little code to generate that JSON/XML, and then use whatever HTTP client you prefer (and one option is the MarkLogic Java Client API) to write that document to MarkLogic.
That addresses your question of how to write a JavaRDD to MarkLogic - but if your goal is to get data from a CSV into MarkLogic as fast as possible, then skip Spark and use mlcp - https://docs.marklogic.com/guide/mlcp/import#id_70366 - which involves zero coding.
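To make the document-per-record idea concrete, a hedged sketch using the MarkLogic Java Client API (the field names, URI pattern, and the line variable are assumptions; client is the DatabaseClient from the question's code):
// Hedged sketch: write one JSON document per CSV record.
JSONDocumentManager jsonMgr = client.newJSONDocumentManager();
String[] fields = line.split(",");   // one CSV line, as in the map function above
// naive formatting; a real implementation would use a JSON library to escape values
String json = String.format(
        "{\"col0\":\"%s\",\"col1\":\"%s\",\"col2\":\"%s\",\"col3\":\"%s\",\"col4\":\"%s\"}",
        fields[0], fields[1], fields[2], fields[3], fields[4]);
jsonMgr.write("/example/" + fields[0] + ".json",
        new StringHandle(json).withFormat(Format.JSON));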
Here is a modified example from the Spark Streaming guide; you will have to implement the connection and writing logic specific to your database.
public void send(JavaRDD<String> rdd) {
rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
@Override
public void call(Iterator<String> partitionOfRecords) {
// ConnectionPool is a static, lazily initialized pool of connections
Connection connection = ConnectionPool.getConnection();
while (partitionOfRecords.hasNext()) {
connection.send(partitionOfRecords.next());
}
ConnectionPool.returnConnection(connection); // return to the pool
// for future reuse
}
});
}
I'm wondering if you just need to make sure everything you access inside your VoidFunction that was instantiated outside it is serializable (see this page). DatabaseClient and XMLDocumentManager are of course not serializable, as they're connected resources. You're right, however, to not instantiate DatabaseClient inside your VoidFunction as that would be less efficient (though it would work). I don't know if the following idea would work with spark. But I'm guessing you could create a class that keeps hold of a singleton DatabaseClient instance:
public static class MLClient {
private static DatabaseClient singleton;
private MLClient() {}
public static DatabaseClient get(DatabaseClientFactory.Bean connectionInfo) {
if ( connectionInfo == null ) {
throw new IllegalArgumentException("connectionInfo cannot be null");
}
if ( singleton == null ) {
singleton = connectionInfo.newClient();
}
return singleton;
}
}
Then you just create a serializable DatabaseClientFactory.Bean outside your VoidFunction so your auth info is still centralized:
DatabaseClientFactory.Bean connectionInfo =
new DatabaseClientFactory.Bean();
connectionInfo.setHost("localhost");
connectionInfo.setPort(8000);
connectionInfo.setUser("admin");
connectionInfo.setPassword("admin");
connectionInfo.setAuthenticationValue("digest");
Then inside your VoidFunction you could get that singleton DatabaseClient and new XMLDocumentManager like so:
DatabaseClient client = MLClient.get(connectionInfo);
XMLDocumentManager docMgr = client.newXMLDocumentManager();
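Putting the pieces together, a hedged sketch of how the original foreachPartition body could look with that singleton client (reusing the Record and JAXB handling from the question):
rdd_records.foreachPartition(new VoidFunction<Iterator<Record>>() {
    @Override
    public void call(Iterator<Record> partitionOfRecords) throws Exception {
        // connectionInfo is the serializable bean created on the driver;
        // the DatabaseClient itself is created lazily on each executor
        DatabaseClient client = MLClient.get(connectionInfo);
        XMLDocumentManager docMgr = client.newXMLDocumentManager();
        JAXBContext context = JAXBContext.newInstance(Record.class);   // once per partition, not per record
        while (partitionOfRecords.hasNext()) {
            Record record = partitionOfRecords.next();
            JAXBHandle<Record> handle = new JAXBHandle<>(context);
            handle.set(record);
            docMgr.writeAs("/example/" + record.getID() + ".xml", handle);
        }
    }
});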
