I am looking for a Debezium MySQL connector setup that streams CDC records to Kafka with the key as a string (not Avro) and the value as an Avro record. By default it makes the key an Avro record as well. Any suggestions?
You can try setting key.converter to org.apache.kafka.connect.storage.StringConverter while keeping value.converter set to the Avro converter.
Or you can use the JSON converter as it also serializes to text.
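For reference, a minimal sketch of how that could look in the connector's JSON config (the connector name, database settings, and Schema Registry URL below are placeholders, and the usual Debezium history/offset settings are omitted):

```json
{
  "name": "mysql-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql-host",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "secret",
    "database.server.name": "dbserver1",

    "key.converter": "org.apache.kafka.connect.storage.StringConverter",

    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}
```

Note that StringConverter simply calls toString on the key struct, so the key comes out looking like Struct{id=42} rather than a bare primary-key value; if you only want the key column itself, an ExtractField key transform is the usual companion to this.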
Is it possible to extract the schema of Kafka input data in Spark Streaming?
Even though I was able to extract the schema from the RDD, streaming works fine when there is data in the Kafka topics but fails when the RDD is empty.
Data in Kafka is stored as JSON.
JSON is just another format for data written to Kafka. You can use the built-in from_json function together with the expected schema to convert the binary value column into a Spark SQL struct.
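A minimal Structured Streaming sketch of that idea (the bootstrap servers, topic name, and the two-field schema are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object KafkaJsonStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("kafka-json").getOrCreate()

    // The schema you expect the JSON payload to have -- adjust to your data.
    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType)
    ))

    val parsed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:port")
      .option("subscribe", "input-topic")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")         // the Kafka value column is binary
      .select(from_json(col("json"), schema).as("data"))   // parse against the expected schema
      .select("data.*")

    parsed.writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
```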
I need to export data from Hive to Kafka topics based on events in another Kafka topic. I know I can read data from Hive in a Spark job using HQL and write it to Kafka from Spark, but is there a better way?
This can be achieved with Spark Streaming. The steps are outlined below (a code sketch follows the list):
Create a Spark Streaming job that connects to the control topic and fetches the data-export information.
From the stream, do a collect and capture the export requirements in driver variables.
Create a DataFrame from Hive using the specified condition.
Write the DataFrame to the required topic using the Kafka writer.
Choose a polling interval based on your data volume and Kafka write throughput.
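A rough sketch of those steps, assuming the control topic is called export-events, each event carries a WHERE-clause filter, and the output topic is export-output (these names and the Hive table are placeholders):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object HiveToKafkaExport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("hive-to-kafka-export")
      .enableHiveSupport()
      .getOrCreate()

    // Polling interval -- tune to your data volume and Kafka write throughput.
    val ssc = new StreamingContext(spark.sparkContext, Seconds(60))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "host:port",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "hive-export-driver"
    )

    // Step 1: connect to the control topic that carries the export events.
    val controlStream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("export-events"), kafkaParams)
    )

    controlStream.foreachRDD { rdd =>
      // Step 2: collect the (small) export instructions into driver variables.
      val filters = rdd.map(_.value()).collect()

      filters.foreach { condition =>
        // Step 3: build a DataFrame from Hive using the condition from the event.
        val df = spark.sql(s"SELECT * FROM my_db.my_table WHERE $condition")

        // Step 4: write the DataFrame to the target topic with the Kafka sink.
        df.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
          .write
          .format("kafka")
          .option("kafka.bootstrap.servers", "host:port")
          .option("topic", "export-output")
          .save()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```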
Typically, you do this the other way around (Kafka to HDFS/Hive).
But you are welcome to try using the Kafka Connect JDBC plugin to read from a Hive table on a scheduled basis, which converts the rows into structured key-value Kafka messages.
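If you do try that route, a JDBC source connector config could look roughly like the sketch below. This assumes a Hive JDBC driver is on the Connect worker's classpath; the connection URL, query, topic prefix, and polling interval are placeholders:

```json
{
  "name": "hive-jdbc-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:hive2://hive-server:10000/default",
    "mode": "bulk",
    "query": "SELECT * FROM my_table",
    "topic.prefix": "hive-export",
    "poll.interval.ms": "3600000",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}
```

In query mode the topic.prefix is used as the target topic name, and bulk mode re-reads the whole result each poll, which matches the "scheduled basis" idea but is only sensible for reasonably small tables.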
Otherwise, I would re-evaluate other tools, because Hive is slow. Couchbase or Cassandra offer much better CDC features for ingestion into Kafka. Or rewrite the upstream applications that insert into Hive so that they write directly to Kafka instead, from which you can join with other topics, for example.
Does Spark Structured Streaming's Kafka writer support writing data to a particular partition? The Spark Structured Streaming documentation nowhere says that writing to a specific partition is unsupported.
I also can't see an option to pass a "partition id" in the "Writing Data to Kafka" section.
If it is not supported, are there any future plans to support it, or reasons why it is not?
No, you can't hard-code a partition value within Spark's write methods; the keys determine which partition each record is written to.
Spark does allow you to configure kafka.partitioner.class, though, which lets you define the partition number based on the keys of the data.
Kafka’s own configurations can be set via DataStreamReader.option with kafka. prefix, e.g, stream.option("kafka.bootstrap.servers", "host:port"). For possible kafka parameters, see ... Kafka producer config docs for parameters related to writing data.
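A hedged sketch of that: a custom Partitioner (the class name and the routing rule here are invented for illustration), built into a jar that is available on the executors' classpath and registered via the kafka.-prefixed option.

```scala
package com.example

import java.util.{Map => JMap}

import org.apache.kafka.clients.producer.Partitioner
import org.apache.kafka.common.Cluster

// Hypothetical rule: pin keys starting with "hot-" to partition 0, hash everything else.
class KeyPrefixPartitioner extends Partitioner {

  override def partition(topic: String, key: AnyRef, keyBytes: Array[Byte],
                         value: AnyRef, valueBytes: Array[Byte], cluster: Cluster): Int = {
    val numPartitions = cluster.partitionCountForTopic(topic)
    val k = if (keyBytes == null) "" else new String(keyBytes, "UTF-8")
    if (k.startsWith("hot-")) 0
    else math.abs(k.hashCode % numPartitions)
  }

  override def close(): Unit = {}

  override def configure(configs: JMap[String, _]): Unit = {}
}
```

On the writer you would then add .option("kafka.partitioner.class", "com.example.KeyPrefixPartitioner") next to the usual kafka.bootstrap.servers and topic options, and ship the jar containing the class to the executors (e.g. with --jars).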
I have managed to set up the open-source Confluent Platform to work with Cassandra using the Cassandra Sink, and sending some simple data from Kafka REST to Cassandra worked. However, I would like to send data that contains a timestamp. Using a schema with a timestamp field did not work from Kafka REST, and neither did a string field instead. Is it possible to send timestamp data like that, and if so, what should be modified: the KCQL or the Avro message?
Ideally, I would send only the non-timestamp data and have Kafka or Cassandra fill in the current timestamp for the timestamp field.
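Purely as an illustration of what a timestamp field typically looks like in an Avro value (the record and field names here are made up), it is carried as a long with the timestamp-millis logical type; whether the sink maps that onto a Cassandra timestamp column depends on the connector version:

```json
{
  "type": "record",
  "name": "SensorReading",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "value", "type": "double" },
    { "name": "created_at",
      "type": { "type": "long", "logicalType": "timestamp-millis" } }
  ]
}
```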
I want to convert XML files to Avro. The data will be in XML format and will hit the Kafka topic first. From there, I can use either Flume or Spark Streaming to ingest it, convert it from XML to Avro, and land the files in HDFS. I have a Cloudera environment.
Once the Avro files land in HDFS, I want to be able to read them into Hive tables later.
What is the best method to do this? I have tried automated schema conversion with spark-avro (without Spark Streaming), but the problem is that spark-avro converts the data and Hive then cannot read it. spark-avro converts the XML to a DataFrame and then from DataFrame to Avro, but the resulting Avro file can only be read by my Spark application. I am not sure I am using it correctly.
I think I will need to define an explicit Avro schema, but I am not sure how to go about this for the XML file: it has multiple namespaces and is quite massive.
If you are on Cloudera (since you have Flume, you may already have it), you can use Morphlines to do the conversion at the record level. It works in both batch and streaming. See here for more info.
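If you stay on the Spark route the question describes instead, a rough sketch is below; the paths, element names, and the assumption of one XML document per file are all placeholders. The idea is to parse the XML into a DataFrame with an explicit set of columns using scala-xml and write it with spark-avro, so that Hive can read the output through an external table created STORED AS AVRO over that directory.

```scala
import org.apache.spark.sql.SparkSession
import scala.xml.XML

object XmlToAvro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("xml-to-avro").getOrCreate()
    import spark.implicits._

    // One XML document per file under the landing directory (adjust to how your
    // ingest step actually lands the raw XML).
    val raw = spark.sparkContext
      .wholeTextFiles("hdfs:///landing/xml/")
      .map(_._2)
      .toDS()

    // Pull out just the fields you need with scala-xml; the element names below
    // are placeholders for your own (namespaced) structure.
    val parsed = raw.map { doc =>
      val xml = XML.loadString(doc)
      ((xml \ "id").text, (xml \ "name").text, (xml \ "timestamp").text)
    }.toDF("id", "name", "timestamp")

    // Land the result as Avro so Hive can read it through an external table
    // defined over this directory.
    parsed.write
      .format("com.databricks.spark.avro")
      .save("hdfs:///warehouse/events_avro/")
  }
}
```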