Kafka Sink Connector for Cassandra failed

I'm creating a Kafka Sink Connector for Cassandra via Lenses. My configuration is:
connector.class=com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector
connect.cassandra.key.space=space1
connect.cassandra.contact.points=cassandra1
tasks.max=1
topics=demo-1206-enriched-clicks-v0.1
connect.cassandra.port=9042
connect.cassandra.kcql=INSERT INTO space1.CLicks_Test SELECT ClicksId from demo-1206-enriched-clicks-v0.1
name=test_cassandra
but I'm getting this error:
org.apache.kafka.common.config.ConfigException: Mandatory `topics` configuration contains topics not set in connect.cassandra.kcql: Set(demo-1206-enriched-clicks-v0.1)
at com.datamountaineer.streamreactor.connect.config.Helpers$.checkInputTopics(Helpers.scala:107)
at com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector.start(CassandraSinkConnector.scala:65)
at org.apache.kafka.connect.runtime.WorkerConnector.doStart(WorkerConnector.java:100)
at org.apache.kafka.connect.runtime.WorkerConnector.start(WorkerConnector.java:125)
at org.apache.kafka.connect.runtime.WorkerConnector.transitionTo(WorkerConnector.java:182)
at org.apache.kafka.connect.runtime.Worker.startConnector(Worker.java:210)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.startConnector(DistributedHerder.java:872)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.processConnectorConfigUpdates(DistributedHerder.java:324)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.tick(DistributedHerder.java:296)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:199)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Any ideas why?

Jelena, as already discussed, the problem is that the Kafka Connect framework cannot parse the .1 in topics=demo-1206-enriched-clicks-v0.1. A bug should be reported upstream to the Apache Kafka project.
@Alex, what you see there is not KSQL. The configuration SQL we use for all our connectors is KCQL (Kafka Connect Query Language).
Furthermore, we have our own SQL equivalent for Apache Kafka called LSQL: http://www.landoop.com/docs/lenses/

This was discussed at https://launchpass.com/datamountaineers, where KCQL and the connector were verified to properly support the . character in the KCQL configuration. The root cause was identified as an upstream issue in the Kafka Connect framework.
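If renaming the topic is an option while the upstream fix lands, a dot-free topic name sidesteps the parsing problem entirely. A sketch of the adjusted settings (the new topic name is only an example):
topics=demo-1206-enriched-clicks-v0-1
connect.cassandra.kcql=INSERT INTO space1.CLicks_Test SELECT ClicksId FROM demo-1206-enriched-clicks-v0-1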

Related

How to disable 'spark.security.credentials.${service}.enabled' in Structured streaming while connecting to a kafka cluster

I am trying to read data from a secured Kafka cluster using Spark Structured Streaming.
I am also using the library below to read the data, "spark-sql-kafka-0-10_2.12":"3.0.0-preview", since it supports specifying a custom group ID (instead of Spark setting its own).
Dependency used in code:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
  <version>3.0.0-preview</version>
</dependency>
I am getting the below error even after specifying the required JAAS configuration in the Spark options.
Caused by: java.lang.IllegalArgumentException: requirement failed: Delegation token must exist for this connector.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.kafka010.KafkaTokenUtil$.isConnectorUsingCurrentToken(KafkaTokenUtil.scala:299)
at org.apache.spark.sql.kafka010.KafkaDataConsumer.getOrRetrieveConsumer(KafkaDataConsumer.scala:533)
at org.apache.spark.sql.kafka010.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:275)
The following document specifies that we can disable the feature of obtaining a delegation token: https://spark.apache.org/docs/3.0.0-preview/structured-streaming-kafka-integration.html
I tried setting this property spark.security.credentials.kafka.enabled to false in spark config, but it is still failing with the same error.
Apparently there is a bug in the preview release that has been fixed in the GA Spark 3.x release.
Reference:
https://issues.apache.org/jira/browse/SPARK-30495
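With the fix in the GA release, the dependency can presumably be bumped to the released 3.0.0 artifact (verify the exact version against your cluster):
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
  <version>3.0.0</version>
</dependency>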
Now we can specify a custom consumer group name when fetching data from Kafka (even though it is not recommended, and a warning message is logged when we do).
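For illustration, a minimal Scala sketch of a Spark 3.x Structured Streaming read with a custom group ID; the broker list, topic, and JAAS credentials are placeholders, and the exact security options depend on how your cluster is secured:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("secured-kafka-read").getOrCreate()

// Read from a SASL-secured Kafka cluster with an explicit consumer group id.
// Spark logs a warning when kafka.group.id is set, as noted above.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9093,broker2:9093") // placeholder brokers
  .option("subscribe", "my-topic")                                // placeholder topic
  .option("kafka.group.id", "my-custom-group")                    // custom group id (Spark 3.0+)
  .option("kafka.security.protocol", "SASL_PLAINTEXT")
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("kafka.sasl.jaas.config",
    "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"user\" password=\"pass\";")
  .load()

df.writeStream.format("console").start().awaitTermination()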

Apache Spark on k8s: securing RPC communication between driver and executors is not working

I have been trying out a Spark 2.4 deployment on k8s and want to establish a secured RPC communication channel between driver and executors. I was using the following configuration parameters as part of spark-submit:
spark.authenticate true
spark.authenticate.secret good
spark.network.crypto.enabled true
spark.network.crypto.keyFactoryAlgorithm PBKDF2WithHmacSHA1
spark.network.crypto.saslFallback false
The driver and executors were not able to communicate on a secured channel and were throwing the following errors.
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
... 4 more
Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: Unknown challenge message.
at org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:109)
at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:181)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:103)
at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
Can someone guide me on this?
Disclaimer: I do not have a very deep understanding of the Spark implementation, so be careful when using the workaround described below.
AFAIK, Spark 2.4.0 does not support RPC auth/encryption on k8s.
There is a ticket that has already been fixed and will likely be released in a future Spark version: https://issues.apache.org/jira/browse/SPARK-26239
The problem is that Spark executors open a connection to the driver, and the configuration is sent only over that connection. However, the executor creates that connection with the default config plus any system properties starting with "spark.".
For reference, here is the place where executor opens the connection: https://github.com/apache/spark/blob/5fa4384/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L201
Theoretically, setting spark.executor.extraJavaOptions=-Dspark.authenticate=true -Dspark.network.crypto.enabled=true ... should help, but the driver checks that no Spark parameters are set in extraJavaOptions.
However, there is a workaround (a little bit hacky): you can set spark.executorEnv.JAVA_TOOL_OPTIONS=-Dspark.authenticate=true -Dspark.network.crypto.enabled=true .... Spark does not check this parameter, but the JVM reads this environment variable and adds these flags to its system properties.
Also, instead of using JAVA_TOOL_OPTIONS to pass the secret, I would recommend using spark.executorEnv._SPARK_AUTH_SECRET=<secret>.
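Putting the workaround together, the extra configuration passed to spark-submit would look roughly like this (the secret value is a placeholder):
spark.authenticate true
spark.network.crypto.enabled true
spark.executorEnv.JAVA_TOOL_OPTIONS -Dspark.authenticate=true -Dspark.network.crypto.enabled=true
spark.executorEnv._SPARK_AUTH_SECRET <secret>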

KairosDB failed to discover other Cassandra nodes in ring with Hector client

I have a multi-node Cassandra cluster (2.2.6) and a separate KairosDB server (1.1.1-1). I configured KairosDB with two Cassandra seed nodes and enabled auto-discovery of the other Cassandra nodes in the ring.
After setting the KairosDB log level to DEBUG, I see that only those two seed nodes are in the host pool (and working well). The Hector discovery process fails with an NPE, so in the end only these two seed nodes are used by KairosDB.
There might be a few solutions:
Add all nodes to the kairos properties (see the sketch after this list), but that is harder to maintain.
Build a custom KairosDB binary with a later version of Hector (2.0.0), but I prefer to stay on official releases if possible.
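For reference, a sketch of what the first option could look like in kairosdb.properties; the host names are placeholders, and the exact property name should be checked against your KairosDB version:
kairosdb.datastore.cassandra.host_list=cassandra-seed1:9160,cassandra-seed2:9160,cassandra-node3:9160,cassandra-node4:9160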
Do you know a way to get around this? Thanks.
08-04|18:54:57.755 [Hector.me.prettyprint.cassandra.connection.NodeAutoDiscoverService-1] DEBUG [NodeDiscovery.java:50] - Node discovery running...
08-04|18:54:57.756 [Hector.me.prettyprint.cassandra.connection.NodeAutoDiscoverService-1] DEBUG [NodeDiscovery.java:74] - using existing hosts [cassandra-seed1(172.16.109.43):9160, cassandra-seed2(172.16.108.51):9160]
08-04|18:54:57.756 [Hector.me.prettyprint.cassandra.connection.NodeAutoDiscoverService-1] ERROR [NodeDiscovery.java:105] - Discovery Service failed attempt to connect CassandraHost
java.lang.NullPointerException: null
at me.prettyprint.cassandra.connection.NodeDiscovery.discoverNodes(NodeDiscovery.java:79) [hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.connection.NodeDiscovery.doAddNodes(NodeDiscovery.java:52) [hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.connection.NodeAutoDiscoverService.doAddNodes(NodeAutoDiscoverService.java:45) [hector-core-1.1-4.jar:na]
at me.prettyprint.cassandra.connection.NodeAutoDiscoverService$QueryRing.run(NodeAutoDiscoverService.java:51) [hector-core-1.1-4.jar:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_101]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [na:1.7.0_101]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [na:1.7.0_101]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.7.0_101]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_101]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_101]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_101]
08-04|18:54:57.756 [Hector.me.prettyprint.cassandra.connection.NodeAutoDiscoverService-1] DEBUG [NodeDiscovery.java:62] - Node discovery run complete.

Connecting to Cassandra 3.0.3 from WSO2 and JDBC

I am trying to connect to a Cassandra DB from WSO2 ESB and WSO2 DSS; both approaches give the same error.
1) Approach 1: Connecting to Cassandra from WSO2 ESB
We are trying to connect to Cassandra 3.0.3 from WSO2 ESB 4.9.
Below are the jars copied to components/lib folder.
(jar listing)
The configuration below was added in master-datasource.xml.
<datasource>
  <name>CassandraDB</name>
  <description>The datasource used for cassandra</description>
  <jndiConfig>
    <name>cassandraWSO2DB</name>
  </jndiConfig>
  <definition type="RDBMS">
    <configuration>
      <url>jdbc:cassandra://127.0.0.1:9042/sample</url> <!-- also tried port 9160 -->
      <username>cassandra</username>
      <password>cassandra</password>
      <driverClassName>org.apache.cassandra.cql.jdbc.CassandraDriver</driverClassName>
      <maxActive>50</maxActive>
      <maxWait>60000</maxWait>
      <testOnBorrow>true</testOnBorrow>
      <validationQuery>SELECT COUNT(*) from sample.users</validationQuery>
      <validationInterval>30000</validationInterval>
      <defaultAutoCommit>true</defaultAutoCommit>
    </configuration>
  </definition>
</datasource>
2) Approach 2: Connecting to Cassandra from WSO2 DSS
We are trying to connect to Cassandra from WSO2 DSS 3.5.0.
Below are the jars copied to the components/lib folder.
We created the data service and added the data source; below is the configuration for the same:
<config enableOData="false" id="CassandraSampleId">
  <property name="url">jdbc:cassandra://127.0.0.1:9042/sample</property>
  <property name="driverClassName">org.apache.cassandra.cql.jdbc.CassandraDriver</property>
</config>
<query id="SampleQuery" useConfig="CassandraSampleId">
  <expression>select * from users</expression>
</query>
In the above configuration, “sample” is the keyspace created in Cassandra.
In both cases, i.e. 1) and 2), we are facing the same error below. Can you please suggest how to resolve the issue?
java.sql.SQLNonTransientConnectionException: org.apache.thrift.transport.TTransportException: Read a negative frame size (-2080374784)!
at org.wso2.carbon.dataservices.core.description.config.RDBMSConfig.<init>(RDBMSConfig.java:45)
at org.wso2.carbon.dataservices.core.description.config.ConfigFactory.getRDBMSConfig(ConfigFactory.java:92)
at org.wso2.carbon.dataservices.core.description.config.ConfigFactory.createConfig(ConfigFactory.java:60)
at org.wso2.carbon.dataservices.core.DataServiceFactory.createDataService(DataServiceFactory.java:150)
at org.wso2.carbon.dataservices.core.DBDeployer.createDBService(DBDeployer.java:785)
at org.wso2.carbon.dataservices.core.DBDeployer.processService(DBDeployer.java:1139)
at org.wso2.carbon.dataservices.core.DBDeployer.deploy(DBDeployer.java:195)
... 8 more
Caused by: java.sql.SQLNonTransientConnectionException: org.apache.thrift.transport.TTransportException: Read a negative frame size (-2080374784)!
at org.apache.cassandra.cql.jdbc.CassandraConnection.<init>(CassandraConnection.java:159)
at org.apache.cassandra.cql.jdbc.CassandraDriver.connect(CassandraDriver.java:92)
at org.apache.tomcat.jdbc.pool.PooledConnection.connectUsingDriver(PooledConnection.java:278)
at org.apache.tomcat.jdbc.pool.PooledConnection.connect(PooledConnection.java:182)
at org.apache.tomcat.jdbc.pool.ConnectionPool.createConnection(ConnectionPool.java:701)
at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:635)
at org.apache.tomcat.jdbc.pool.ConnectionPool.getConnection(ConnectionPool.java:188)
at org.apache.tomcat.jdbc.pool.DataSourceProxy.getConnection(DataSourceProxy.java:127)
at org.wso2.carbon.dataservices.core.description.config.SQLConfig.createConnection(SQLConfig.java:187)
at org.wso2.carbon.dataservices.core.description.config.SQLConfig.createConnection(SQLConfig.java:173)
at org.wso2.carbon.dataservices.core.description.config.SQLConfig.initSQLDataSource(SQLConfig.java:151)
at org.wso2.carbon.dataservices.core.description.config.RDBMSConfig.<init>(RDBMSConfig.java:43)
... 14 more
Caused by: org.apache.thrift.transport.TTransportException: Read a negative frame size (-2080374784)!
at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:133)
at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.cassandra.thrift.Cassandra$Client.recv_describe_cluster_name(Cassandra.java:1101)
at org.apache.cassandra.thrift.Cassandra$Client.describe_cluster_name(Cassandra.java:1089)
at org.apache.cassandra.cql.jdbc.CassandraConnection.<init>(CassandraConnection.java:130)
<query id="SampleQuery" useConfig="CassandraSampleId">
<expression>select * from users</expression>
<validationQuery>SELECT COUNT(*) from sample.users</validationQuery>
First thing I noticed: unbound queries are not a good idea in Cassandra. You should always query with a partition key. This won't fix your problem, but you definitely do not want to do that in production with millions of rows.
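For example, assuming users is partitioned by a user_id column (an assumption about your schema), a keyed query looks like:
SELECT * FROM sample.users WHERE user_id = 1234;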
We are trying to connect Cassandra DB 3.0.3 from Wso2 ESB 4.9
<definition type="RDBMS">
<configuration>
<url>jdbc:cassandra://127.0.0.1:9042/sample</url>(tried with port 9160)
org.apache.thrift.transport.TTransportException:
Ok, a couple of things I noticed here. I've never used WSO2 (actually, I have no idea what it is) but it looks like it's making a couple of (bad) assumptions here.
I don't know what your options for "type" are, but Cassandra is definitely not an RDBMS.
I see you're using JDBC. There are MANY drivers out there that interact with Cassandra from Java. JDBC was originally designed for relational databases and was later augmented to be used with Cassandra. You will have the highest chance of success by using the DataStax Java Driver, especially if you are running Cassandra 3.x.
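As a sketch of what that looks like from Scala with the DataStax Java driver 3.x (the contact point, keyspace, and user_id column are assumptions):
import com.datastax.driver.core.Cluster

// Connect over the native binary protocol (9042), not Thrift.
val cluster = Cluster.builder()
  .addContactPoint("127.0.0.1")
  .withPort(9042)
  .build()
val session = cluster.connect("sample")

// Always query by partition key, as noted above.
val row = session.execute("SELECT * FROM users WHERE user_id = ?", Int.box(1234)).one()
println(row)

cluster.close()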
I see it throwing a thrift.transport.TTransportException. Cassandra 3.0.3 ships with the Thrift (9160) protocol disabled by default. Changing the port to 9042 isn't enough; you need to be using the native binary protocol for that port to work.
The first thing to try is to re-enable Thrift inside your Cassandra node. Inside your cassandra.yaml, find the start_rpc property and set it to true:
start_rpc: true
At the very least, that will get Thrift running on 9160, and you'll be ready to try connecting again. However, I don't know if JDBC will even work with Cassandra 3.x. And if it does, you certainly won't get all of the available features.
The biggest problem I see here, is that you are trying to use a new version of Cassandra with technologies that it really wasn't designed to interact with. But try starting Thrift in Cassandra, and see if that helps. If anything, it should get you to the next problem.

Couldn't find leaders for Set([TOPICNAME,0]) when using Apache Spark

We are using Apache Spark 1.5.1 with kafka_2.10-0.8.2.1 and the Kafka DirectStream API to fetch data from Kafka.
We created the topics in Kafka with the following settings:
ReplicationFactor: 1 and Replicas: 1
When all of the Kafka instances are running, the Spark job works fine. When one of the Kafka instances in the cluster is down, however, we get the exception reproduced below. After some time, we restarted the disabled Kafka instance and tried to finish the Spark job, but Spark had already terminated because of the exception. Because of this, we could not read the remaining messages in the Kafka topics.
ERROR DirectKafkaInputDStream:125 - ArrayBuffer(org.apache.spark.SparkException: Couldn't find leaders for Set([normalized-tenant4,0]))
ERROR JobScheduler:96 - Error generating jobs for time 1447929990000 ms
org.apache.spark.SparkException: ArrayBuffer(org.apache.spark.SparkException: Couldn't find leaders for Set([normalized-tenant4,0]))
at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:123)
at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:145)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339)
at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:120)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:247)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:245)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:245)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:181)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:87)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:86)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Thanks in advance. Please help resolve this issue.
This is expected behaviour. You have requested that each topic be stored on one machine by setting ReplicationFactor to one. When the one machine that happens to store the topic normalized-tenant4 is taken down, the consumer cannot find the leader of the topic.
See http://kafka.apache.org/documentation.html#intro_guarantees.
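To tolerate a broker going down, the topic needs a replication factor greater than one. With the Kafka 0.8.x tooling, that would look roughly like this (the ZooKeeper address is a placeholder):
kafka-topics.sh --create --zookeeper zk1:2181 --topic normalized-tenant4 --partitions 1 --replication-factor 3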
One of the reasons for this type of error, where the leader cannot be found for the specified topic, is a problem with the Kafka server configs.
Open your Kafka server configs:
vim ./kafka/kafka-<your-version>/config/server.properties
In the "Socket Server Settings" section , provide IP for your host if its missing :
listeners=PLAINTEXT://{host-ip}:{host-port}
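For example, with a host IP of 10.0.0.12 (a placeholder):
listeners=PLAINTEXT://10.0.0.12:9092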
I was using the Kafka setup provided with the MapR sandbox and was trying to access Kafka via Spark code. I was getting the same error since my configuration was missing the IP.
