Auto-table generation in Cassandra with Kafka Connect Cassandra sink

I am using io.confluent.connect.cassandra.CassandraSinkConnector for the Kafka Connect Cassandra sink.
I would like to know whether it is possible to auto-generate Cassandra tables from the Kafka topic using io.confluent.connect.cassandra.CassandraSinkConnector as the connector.
If it is possible, which configuration do I need to set to enable this feature? I have tried all the configurations mentioned in the documentation, but I have not been able to get a table created.
This is the configuration I am using:
{
  "name": "cassandra-test4",
  "config": {
    "connector.class": "io.confluent.connect.cassandra.CassandraSinkConnector",
    "tasks.max": "3",
    "topics": "orders-topic2",
    "cassandra.contact.points": "my_ip",
    "cassandra.keyspace": "test_cas",
    "cassandra.write.mode": "Insert",
    "cassandra.table.manage.enabled": "true",
    "cassandra.sink.route": "test_cas.orders",
    "key.converter.schema.registry.url": "http://localhost:8081",
    "value.converter.schema.registry.url": "http://localhost:8081",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "flush.size": "1",
    "cassandra.keyspace.create.enabled": "true",
    "name": "cassandra-test4"
  },
  "tasks": [
    {
      "connector": "cassandra-test4",
      "task": 0
    },
    {
      "connector": "cassandra-test4",
      "task": 1
    },
    {
      "connector": "cassandra-test4",
      "task": 2
    }
  ],
  "type": null
}

This should be done by setting the cassandra.keyspace.create.enabled and cassandra.table.manage.enabled properties to true; see the documentation.
But be careful: it is very easy to end up with schema disagreement in your cluster, and then you need additional steps to recover from it. It is better to pre-create the tables before starting the connector.
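If you do pre-create them, a minimal sketch with the Python cassandra-driver could look like this; the column names are hypothetical and must mirror the Avro value schema of orders-topic2, and the primary key must match the record key the connector writes with:
from cassandra.cluster import Cluster

# Same contact point as cassandra.contact.points in the connector config
cluster = Cluster(["my_ip"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS test_cas
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Hypothetical columns -- mirror the Avro schema of orders-topic2
session.execute("""
    CREATE TABLE IF NOT EXISTS test_cas.orders (
        order_id text PRIMARY KEY,
        product text,
        quantity int,
        price double
    )
""")

cluster.shutdown()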

Related

Glue Catalog w/ Delta Tables Connected to Databricks SQL Engine

I am trying to query Delta tables from my AWS Glue Catalog on the Databricks SQL engine. They are stored in Delta Lake format, and I have Glue crawlers automating the schemas. The catalog is set up and functioning with non-Delta tables. Databricks loads the available tables per database via the catalog, but the query fails because Databricks reads them with Hive instead of Delta.
Incompatible format detected.
A transaction log for Databricks Delta was found at `s3://COMPANY/club/attachment/_delta_log`,
but you are trying to read from `s3://COMPANY/club/attachment` using format("hive"). You must use
'format("delta")' when reading and writing to a delta table.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html
The SQL Warehouse's Data Access Configuration is set to:
spark.databricks.hive.metastore.glueCatalog.enabled : true
The crawler, using the Delta Lake setup from AWS, produces the following table metadata:
{
  "StorageDescriptor": {
    "cols": {
      "FieldSchema": [
        {
          "name": "id",
          "type": "string",
          "comment": ""
        },
        {
          "name": "media",
          "type": "string",
          "comment": ""
        },
        {
          "name": "media_type",
          "type": "string",
          "comment": ""
        },
        {
          "name": "title",
          "type": "string",
          "comment": ""
        },
        {
          "name": "type",
          "type": "smallint",
          "comment": ""
        },
        {
          "name": "clubmessage_id",
          "type": "string",
          "comment": ""
        }
      ]
    },
    "location": "s3://COMPANY/club/attachment/_symlink_format_manifest",
    "inputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
    "outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    "compressed": "false",
    "numBuckets": "-1",
    "SerDeInfo": {
      "name": "",
      "serializationLib": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
      "parameters": {}
    },
    "bucketCols": [],
    "sortCols": [],
    "parameters": {
      "UPDATED_BY_CRAWLER": "CRAWLER_NAME",
      "CrawlerSchemaSerializerVersion": "1.0",
      "CrawlerSchemaDeserializerVersion": "1.0",
      "classification": "parquet"
    },
    "SkewedInfo": {},
    "storedAsSubDirectories": "false"
  },
  "parameters": {
    "UPDATED_BY_CRAWLER": "CRAWLER_NAME",
    "CrawlerSchemaSerializerVersion": "1.0",
    "CrawlerSchemaDeserializerVersion": "1.0",
    "classification": "parquet"
  }
}
I am facing the same problem. It seems you cannot use Spark SQL to query a Delta table through Glue, because setting
spark.databricks.hive.metastore.glueCatalog.enabled : true
implies that the table will be treated as a Hive table.
You would need to read the table directly from S3, losing the advantages of the metadata catalog (a direct read is sketched below the policy).
You can still read it through the catalog, though, by blocking your cluster from accessing the _delta_log folder with the following IAM policy:
{ "Sid": "BlockDeltaLog", "Effect": "Deny", "Action": "s3:*", "Resource": [ "arn:aws:s3:::BUCKET" ], "Condition": { "StringLike": { "s3:prefix": [ "_delta_log/" ] } } }
I was able to query a Delta table created by the Glue crawlers after updating its location. In your case it would need to be changed from:
s3://COMPANY/club/attachment/_symlink_format_manifest
to
s3://COMPANY/club/attachment
This is because Delta on Spark doesn't look at _symlink_format_manifest the way Hive and Presto do; it just needs to know the table's root directory.
The command in Databricks to update the location looks like this:
ALTER TABLE my_db.my_table
SET LOCATION "s3://COMPANY/club/attachment"
Note: your database's location has to be set as well for that command to work; see the sketch below.
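For the database location itself, a sketch (the database name and root path are hypothetical; if the database already exists without a location, it may need to be recreated or altered depending on your runtime):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical database rooted one level above the table path
spark.sql("CREATE DATABASE IF NOT EXISTS my_db LOCATION 's3://COMPANY/club'")

# Then point the table at the Delta table's root directory
spark.sql("ALTER TABLE my_db.my_table SET LOCATION 's3://COMPANY/club/attachment'")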

Spark streaming Dynamic Schema Evolution from Kafka Eventhub on Microbatch

We are streaming data from a Kafka-enabled Event Hub. The records may have a nested structure. The schema is inferred dynamically from the data, and the Delta table is created with the schema of the first incoming batch of data.
Note: the data read from the Kafka topic is a whole JSON string. Hence:
When we apply a schema and convert to a DataFrame, we lose the values of fields whose data types mismatch, as well as newly added fields.
When we use spark.read.json, all field values are converted to String.
We encounter situations where the source data has schema changes. Some of the scenarios we faced are:
The data type changes at the parent level
The data type changes at a nested level
Duplicate keys appear with different casing
New fields are added
Sample source data with the actual schema:
{
  "Id": "101",
  "Name": "John",
  "Department": {
    "Id": "Dept101",
    "Name": "Technology",
    "EmpId": "10001"
  },
  "Experience": 2,
  "Organization": [
    {
      "Id": "Org101",
      "Name": "Google"
    },
    {
      "Id": "Org102",
      "Name": "Microsoft"
    }
  ]
}
Sample source data illustrating the four points mentioned above:
{
  "Id": "102",
  "name": "Rambo",            --- Point 3
  "Department": {
    "Id": "Dept101",
    "Name": "Technology",
    "EmpId": 10001            --- Point 2
  },
  "Experience": "2",          --- Point 1
  "Organization": [
    {
      "Id": "Org101",
      "Name": "Google",
      "Experience": "2"       --- Point 4
    },
    {
      "Id": "Org102",
      "Name": "Microsoft",
      "Experience": "2"
    }
  ]
}
We need a solution to overcome the above issues. Even if it is difficult to merge the new schema into the existing Delta table, we should at least be able to separate the records with schema changes without losing the original data.
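One way to at least separate such records without losing the payload is to keep the raw JSON string next to the parsed struct; below is a minimal sketch, where the schema is a trimmed stand-in for whatever was inferred from the first batch and the Kafka options are placeholders:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# Stand-in for the schema inferred from the first batch (trimmed)
expected_schema = StructType([
    StructField("Id", StringType()),
    StructField("Name", StringType()),
    StructField("Experience", LongType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "<eventhub-namespace>.servicebus.windows.net:9093")  # placeholder
       .option("subscribe", "employee-topic")                                                   # placeholder
       .load())

parsed = (raw
          .withColumn("raw_json", F.col("value").cast("string"))          # original payload, never dropped
          .withColumn("data", F.from_json("raw_json", expected_schema)))

# Records whose payload no longer fits the expected schema (whole parse failed
# or a mandatory field came back null) are routed aside with their raw JSON
# intact, so they can be reprocessed once the schema change is handled.
drifted = parsed.filter(F.col("data").isNull() | F.col("data.Id").isNull())
conforming = parsed.filter(F.col("data.Id").isNotNull())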

Debezium MongoDB source JSON sink to Cassandra (LENSES.IO)

I need to sink a JSON object from MongoDB into a column in Cassandra. I'm using ExtractNewDocumentState and AvroConverter, but it doesn't seem to work. Also, is AvroConverter used on the source or the sink side? If I use it in the source, do I have to use it in the sink too?
{
  "name": "mongodb_source_connector",
  "config": {
    "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
    "tasks.max": "1",
    "mongodb.hosts": "rs0/mongo:27017",
    "mongodb.name": "dbserver1",
    "mongodb.user": "scorpion",
    "mongodb.password": "123123123",
    "database.whitelist": "ladiform",
    "database.history.kafka.bootstrap.servers": "kafka_3:9093",
    "transforms": "route,unwrap",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
    "transforms.route.replacement": "$3",
    "transforms.unwrap.type": "io.debezium.connector.mongodb.transforms.ExtractNewDocumentState",
    "transforms.unwrap.drop.tombstones": "false",
    "transforms.unwrap.delete.handling.mode": "drop",
    "transforms.unwrap.operation.header": "true",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}
{
  "name": "cassandra_sink_connector",
  "config": {
    "connector.class": "com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector",
    "tasks.max": "1",
    "topics": "test34",
    "connect.cassandra.port": "9042",
    "connect.cassandra.key.space": "test",
    "connect.cassandra.contact.points": "cassandra",
    "connect.cassandra.username": "cassandra",
    "connect.cassandra.password": "123123123",
    "connect.cassandra.kcql": "INSERT INTO test33 SELECT id, data FROM test34"
  }
}
The converters must be the same on both the source and the sink side, since serialization is done on the source side and deserialization on the sink side. A sketch of what that could look like for the sink above follows.
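A sketch of aligning the sink with the source's Avro converters, applied through the Kafka Connect REST API; the worker URL is an assumption, and the converter settings mirror the Debezium source config above:
import requests

# Same properties as the sink above, plus the Avro converters used by the source
sink_config = {
    "connector.class": "com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector",
    "tasks.max": "1",
    "topics": "test34",
    "connect.cassandra.port": "9042",
    "connect.cassandra.key.space": "test",
    "connect.cassandra.contact.points": "cassandra",
    "connect.cassandra.username": "cassandra",
    "connect.cassandra.password": "123123123",
    "connect.cassandra.kcql": "INSERT INTO test33 SELECT id, data FROM test34",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
}

# Assumed Connect worker address; PUT creates or updates the connector config
resp = requests.put(
    "http://localhost:8083/connectors/cassandra_sink_connector/config",
    json=sink_config,
)
resp.raise_for_status()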

Azure Digital Twins: what does the "GetOntologies" response mean?

I am trying to understand the provisioning process for Digital Twins and I am reading this doc: https://learn.microsoft.com/en-us/azure/digital-twins/tutorial-facilities-setup
But I cannot follow one point in that section: I don't understand the response of "dotnet run GetOntologies".
Can anyone help me understand what those values are and how they relate to "models are available"?
In Azure Digital Twins, the Ontology entity contains the set of all types and subtypes that can be used in your application. In your example, the "Required" and "Default" ontologies are enabled (this is the default). If you use the REST API to see what the "Default" ontology contains, you get the following:
{
  "id": 2,
  "name": "Default",
  "loaded": true,
  "types": [
    {
      "id": 17,
      "category": "SensorDataType",
      "name": "Humidity",
      "disabled": false,
      "logicalOrder": 0
    },
    {
      "id": 18,
      "category": "SensorDataType",
      "name": "Temperature",
      "disabled": false,
      "logicalOrder": 0
    },
    {
      "id": 19,
      "category": "SensorDataSubtype",
      "name": "RoomHumidity",
      "disabled": false,
      "logicalOrder": 0,
      "friendlyName": "Room Humidity"
    }, // etc etc
As you can see in the example above, the ontology holds the basic definitions of the sensor, space, and data types used in Smart Building scenarios. The BACnet and Advanced ontologies just add different, more specific types. Once you set an ontology to enabled, you can start using its types/subtypes. You can check them out in the REST API with:
https://your-url.your-region.azuresmartspaces.net/management/api/v1.0/ontologies/3?includes=Types
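For example, from Python (a sketch; acquiring the Azure AD bearer token for the management API is assumed and out of scope here):
import requests

token = "<AAD access token for the Digital Twins management API>"  # assumed to be acquired separately
base = "https://your-url.your-region.azuresmartspaces.net/management/api/v1.0"

resp = requests.get(
    f"{base}/ontologies/2",                      # 2 = the "Default" ontology shown above
    params={"includes": "Types"},
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()

# Print the type catalogue the ontology makes available
for t in resp.json().get("types", []):
    print(t["id"], t["category"], t["name"])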

kafka-connect: error in Cassandra sink connector

I get a runtime error from a Cassandra sink connector. I am trying to pick data from Kafka and store it in Cassandra. You can find the error stack below:
{
  "name": "cassandraSinkConnector2",
  "connector": {
    "state": "RUNNING",
    "worker_id": "localhost:8083"
  },
  "tasks": [
    {
      "id": 0,
      "state": "FAILED",
      "worker_id": "localhost:8083",
      "trace": "org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:560)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:321)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)\n\tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: org.apache.kafka.connect.errors.DataException: Key must be a struct or map. This connector requires that records from Kafka contain the keys for the Cassandra table. Please use a transformation like org.apache.kafka.connect.transforms.ValueToKey to create a key with the proper fields.\n\tat io.confluent.connect.cassandra.CassandraSinkTask.put(CassandraSinkTask.java:94)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:538)\n\t... 10 more\n"
    }
  ],
  "type": "sink"
}
I used the distributed configuration below for my connector:
{
  "name": "cassandraSinkConnector2",
  "config": {
    "connector.class": "io.confluent.connect.cassandra.CassandraSinkConnector",
    "tasks.max": "1",
    "topics": "appartenance_de",
    "cassandra.contact.points": "localhost",
    "cassandra.kcql": "INSERT INTO app_test SELECT * FROM app_de",
    "cassandra.port": "9042",
    "cassandra.keyspace": "dev_dkks",
    "cassandra.username": "superuser",
    "cassandra.password": "superuser",
    "cassandra.write.mode": "upsert",
    "value.converter.schemas.enable": "true",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://localhost:8081",
    "transforms": "createKey,extractInt",
    "transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
    "transforms.createKey.fields": "id",
    "transforms.extractInt.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.extractInt.field": "id",
    "name": "cassandraSinkConnector2"
  },
  "tasks": [
    {
      "connector": "cassandraSinkConnector2",
      "task": 0
    }
  ],
  "type": "sink"
}
Per my answer here, the error you're seeing is:
org.apache.kafka.connect.errors.DataException: Key must be a struct or map.
This connector requires that records from Kafka contain the keys for the Cassandra table.
Please use a transformation like org.apache.kafka.connect.transforms.ValueToKey to create a key with the proper fields.
I'd suggest using a Single Message Transform, as the error suggests, to correctly key your data. You can see an example of doing this here and the documentation for the transform here.
