I have managed to set up the open-source Confluent Platform to work with Cassandra using the Cassandra Sink, and I was able to send some simple data from Kafka REST to Cassandra. However, I would like to send data that contains a timestamp. Sending a record from Kafka REST with a timestamp field in the schema did not work, and neither did using a string field instead. Is it possible to send timestamp data this way, and if so, what should be modified: the KCQL or the Avro message?
It would be preferable to send only the non-timestamp data and have Kafka or Cassandra insert the current timestamp into the timestamp field.
The setup:
Azure Event Hub -> raw delta table -> agg1 delta table -> agg2 delta table
The data is processed by spark structured streaming.
Updates on target delta tables are done via foreachBatch using merge.
As a result I'm getting this error:
java.lang.UnsupportedOperationException: Detected a data update (for
example
partKey=ap-2/part-00000-2ddcc5bf-a475-4606-82fc-e37019793b5a.c000.snappy.parquet)
in the source table at version 2217. This is currently not supported.
If you'd like to ignore updates, set the option 'ignoreChanges' to
'true'. If you would like the data update to be reflected, please
restart this query with a fresh checkpoint directory.
Basically, I'm not able to read the agg1 delta table via any kind of streaming. If I switch the sink of the last stream from delta to memory, I get the same error message. The first stream works without any problems.
Notes.
Between aggregations I'm changing the granularity: agg1 delta table (trunc date to minutes), agg2 delta table (trunc date to days).
If I turn off all other streams, the last one still doesn't work.
The agg2 delta table is a fresh table with no data.
How the streaming works on the source table:
It reads the files that belong to the source table. It's not able to handle changes in these files (updates, deletes). If anything like that happens, you get the error above. In other words, DML operations such as UPDATE, DELETE and MERGE rewrite the underlying files. The only exception is INSERT: new data arrives in new files unless configured otherwise.
To get past the error, you would need to set the option ignoreChanges to true.
With this option, all records from a rewritten file are emitted again. So you receive the same records as before plus the modified one.
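As a minimal sketch of how the option is set on a Delta streaming read, meant for a notebook cell or spark-shell. The table name agg1 and the session setup are assumptions for illustration; reading by path with .load(...) takes the option the same way.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = SparkSession.builder.getOrCreate()

// Stream from the Delta table and tolerate rewritten files.
// With ignoreChanges, every row of a rewritten file is emitted again,
// so downstream logic has to cope with duplicates.
val agg1Stream: DataFrame = spark.readStream
  .format("delta")
  .option("ignoreChanges", "true")
  .table("agg1") // hypothetical table name
```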
The problem: we have aggregations, and the aggregated values are stored in the checkpoint. If an unmodified record arrives again, the query treats it as new input and adds it to the aggregate for its grouping key a second time.
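The double counting can be illustrated without Spark. The sketch below is a plain Scala simulation, not Delta or Structured Streaming code: the Map stands in for the aggregation state kept in the checkpoint, and the key and values are made up.

```scala
object DoubleCountDemo {
  // (groupingKey, value) rows as they would arrive from the source table
  type Row = (String, Long)

  // Stateful sum: `state` plays the role of the aggregation state in the checkpoint.
  def update(state: Map[String, Long], batch: Seq[Row]): Map[String, Long] =
    batch.foldLeft(state) { case (acc, (key, value)) =>
      acc.updated(key, acc.getOrElse(key, 0L) + value)
    }

  def main(args: Array[String]): Unit = {
    // Batch 1: the file is read for the first time.
    val file = Seq("ap-2" -> 10L, "ap-2" -> 5L)
    val afterFirstRead = update(Map.empty, file)

    // A merge rewrites the file: one row is added, the others are unchanged.
    // With ignoreChanges, the whole rewritten file is emitted again.
    val rewrittenFile = file :+ ("ap-2" -> 1L)
    val afterReread = update(afterFirstRead, rewrittenFile)

    println(s"sum of the current file: ${rewrittenFile.map(_._2).sum}") // 16
    println(s"sum held in the state:   ${afterReread("ap-2")}")         // 31
  }
}
```

The unchanged rows (10 and 5) are added twice, which is why the downstream aggregate drifts away from the real total.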
Solution: we can't read the agg table to build another aggregation on top of it. We need to read the raw table instead.
reference: https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes
Note: I'm working on Databricks Runtime 10.4, so low shuffle merge is enabled by default.
I am using a Kafka sink connector to write data from Kafka to S3. The output data is partitioned into hourly buckets: year=yyyy/month=MM/day=dd/hour=hh. This data is used by a batch job downstream. So, before starting the downstream job, I need to be sure that no additional data will arrive in a given partition once processing for that partition has started.
What is the best way to design this? How can I mark a partition as complete, i.e. guarantee that no additional data will be written to it once it is marked?
EDIT: I am using RecordField as the timestamp.extractor. My Kafka messages are guaranteed to be sorted within partitions by the partition field.
It depends on which timestamp extractor you are using in the sink config.
You would have to guarantee that no record can have a timestamp earlier than the time at which you consume it.
AFAIK, the only way to get that guarantee is the Wallclock timestamp extractor. Otherwise, you are consuming either the Kafka record timestamp or some timestamp within each message, and both can be set on the producer end to some time in the past.
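A sketch of the partitioning settings this would correspond to in the S3 sink connector config. The locale and timezone values are assumptions, and all other required connector settings are omitted.

```properties
# Hourly, time-based output partitions
partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
partition.duration.ms=3600000
locale=en-US
timezone=UTC
# Partition by the time the connector processes the record,
# not by a timestamp carried in the record.
timestamp.extractor=Wallclock
```

With wall-clock partitioning a record lands in the partition for the hour in which it is processed, so the partitions no longer reflect event time. The downstream job would presumably still have to wait until the connector has flushed its files for the previous hour.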
I am using Spark Structured Streaming to send records to a Kafka topic. The Kafka topic is created with the config message.timestamp.type=CreateTime.
This is done so that the records in the target Kafka topic have the same timestamp as the original records.
My Kafka streaming code:
kafkaRecords
  .selectExpr("CAST(key AS STRING)", "CAST(value AS BINARY)", "CAST(timestamp AS TIMESTAMP)")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "IP Of kafka")
  .option("topic", targetTopic)
  .option("kafka.max.in.flight.requests.per.connection", "1")
  .option("checkpointLocation", checkPointLocation)
  .save()
However, this does not preserve the original timestamp, which is 2018/11/04. Instead, the timestamp reflects the latest date, 2018/11/09.
On another note, just to confirm that the Kafka config is functioning: when I explicitly create a Kafka producer, produce records that carry the timestamp and send them across, the original timestamp is preserved.
How can I get the same behaviour in Spark Structured Streaming as well?
With CreateTime on a topic, the timestamp you get is the time at which the producer created the record.
It's not clear where you're reading the data and seeing the timestamps. If you are running the producer code "today", that's the time the records get, not an earlier one.
If you want timestamps of the past, you'll need to actually make your ProducerRecords contain that timestamp by using the constructor that includes a timestamp parameter, but Spark does not expose it.
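For reference, a minimal sketch of a plain Kafka producer that sets the record timestamp explicitly. The broker address, topic name, key and value are placeholders, and the timestamp is simply the date mentioned in the question.

```scala
import java.time.Instant
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // assumed broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")

val producer = new KafkaProducer[String, Array[Byte]](props)

// Original event time as epoch milliseconds (illustrative value).
val originalTimestamp: Long = Instant.parse("2018-11-04T00:00:00Z").toEpochMilli

val record = new ProducerRecord[String, Array[Byte]](
  "target-topic",              // hypothetical topic name
  null,                        // partition: let the partitioner decide
  Long.box(originalTimestamp), // explicit record timestamp
  "some-key",
  "some-value".getBytes("UTF-8")
)

producer.send(record)
producer.close()
```

With message.timestamp.type=CreateTime on the topic, the broker keeps the timestamp supplied here instead of replacing it with the append time.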
If you put the timestamp in the payload value, as you're doing, that is probably the time you'll want to run your analysis on, rather than ConsumerRecord.timestamp().
If you want to copy data exactly from one topic to another, Kafka provides MirrorMaker for this. Then you only need config files, rather than writing and deploying Spark code.
I am using spark-sql_2.11-2.3.1 version with Cassandra 3.x.
I need to provide a validation feature which has the following columns:
column_family_name text,
oracle_count bigint,
cassandra_count bigint,
create_timestamp timestamp,
last_update_timestamp timestamp,
update_user text
For this I need to count the successfully inserted records, i.e. populate cassandra_count, and I want to use a Spark accumulator for that. Unfortunately, I am not able to find the required API samples for spark-sql_2.11-2.3.1.
Below is my snippet for saving to Cassandra:
o_model_df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> columnFamilyName, "keyspace" -> keyspace))
  .mode(SaveMode.Append)
  .save()
How can I increment an accumulator for each row that is successfully saved into Cassandra?
Any help would be much appreciated.
Spark's accumulators are usually used in the transformations that you write yourself. Don't expect the Spark Cassandra Connector to provide anything like that.
But overall, if your job finished without error, it means that the data was written correctly to the database.
If you want to check how many rows are really in the database, you need to count the data in the database. You can use the cassandraCount method of the Spark Cassandra Connector for that. The main reason to count there: your DataFrame may contain multiple rows that map to a single Cassandra row (for example, if the primary key is defined incorrectly, so that multiple rows share it).
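A minimal sketch of such a count, meant for a notebook cell or spark-shell with the connector on the classpath. The keyspace and table values are placeholders standing in for the keyspace and columnFamilyName variables from the question.

```scala
import com.datastax.spark.connector._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

val keyspace = "my_keyspace"      // placeholder
val columnFamilyName = "my_table" // placeholder

// Counts the rows on the Cassandra side instead of fetching them into Spark.
val cassandraCount: Long = spark.sparkContext
  .cassandraTable(keyspace, columnFamilyName)
  .cassandraCount()
```

The result could then be written to the cassandra_count column of the validation table after the save has finished.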
I am looking for a way to have the Debezium MySQL connector stream CDC records to Kafka with the key as a string (not Avro) and the value as an Avro record. By default the key is written as an Avro record. Any suggestions?
You can try setting key.converter to org.apache.kafka.connect.storage.StringConverter and keeping value.converter set to the Avro one.
Or you can use the JSON converter for the key, as it also serializes to text.
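As a sketch, the converter settings for the first option could look like this in the connector (or worker) config. The Schema Registry URL is an assumption.

```properties
# Key as plain string, value as Avro
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
```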
J.