Debezium MongoDB source JSON sink to Cassandra (Lenses.io)

I need to sink a JSON object from MongoDB into a column in Cassandra. I'm using ExtractNewDocumentState and AvroConverter, but it seems I'm doing something wrong. Should AvroConverter be used in the source or the sink? If I use it in the source, do I have to use it in the sink too?
{
"name": "mongodb_source_connector",
"config": {
"connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
"tasks.max": "1",
"mongodb.hosts": "rs0/mongo:27017",
"mongodb.name": "dbserver1",
"mongodb.user": "scorpion",
"mongodb.password": "123123123",
"database.whitelist": "ladiform",
"database.history.kafka.bootstrap.servers": "kafka_3:9093",
"transforms": "route,unwrap",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
"transforms.route.replacement": "$3",
"transforms.unwrap.type": "io.debezium.connector.mongodb.transforms.ExtractNewDocumentState",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.delete.handling.mode": "drop",
"transforms.unwrap.operation.header": "true",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://schema-registry:8081",
"value.converter.schema.registry.url": "http://schema-registry:8081"
}
}
{
"name": "cassandra_sink_connector",
"config": {
"connector.class": "com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector",
"tasks.max": "1",
"topics": "test34",
"connect.cassandra.port": "9042",
"connect.cassandra.key.space": "test",
"connect.cassandra.contact.points": "cassandra",
"connect.cassandra.username": "cassandra",
"connect.cassandra.password": "123123123",
"connect.cassandra.kcql": "INSERT INTO test33 SELECT id, data FROM test34"
}
}

The converters must be the same on both the source and sink side, because serialization is done by the source connector and deserialization by the sink connector. Since the source above produces Avro with AvroConverter, the sink also needs AvroConverter (and the same Schema Registry URL) configured, which the sink config above does not set explicitly.
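For example, a hedged sketch of the sink configuration with matching converters added (the Schema Registry URL is simply copied from the source config and may differ in your environment):
{
"name": "cassandra_sink_connector",
"config": {
"connector.class": "com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector",
"tasks.max": "1",
"topics": "test34",
"connect.cassandra.port": "9042",
"connect.cassandra.key.space": "test",
"connect.cassandra.contact.points": "cassandra",
"connect.cassandra.username": "cassandra",
"connect.cassandra.password": "123123123",
"connect.cassandra.kcql": "INSERT INTO test33 SELECT id, data FROM test34",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://schema-registry:8081",
"value.converter.schema.registry.url": "http://schema-registry:8081"
}
}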

Related

Glue Catalog w/ Delta Tables Connected to Databricks SQL Engine

I am trying to query Delta tables from my AWS Glue Catalog on the Databricks SQL engine. They are stored in Delta Lake format, and I have Glue crawlers automating the schemas. The catalog is set up and functioning with non-Delta tables. Databricks loads the available tables per database via the catalog, but queries fail because Databricks reads them with Hive instead of Delta.
Incompatible format detected.
A transaction log for Databricks Delta was found at `s3://COMPANY/club/attachment/_delta_log`,
but you are trying to read from `s3://COMPANY/club/attachment` using format("hive"). You must use
'format("delta")' when reading and writing to a delta table.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html
SQL Warehouse settings => Data Access Configuration
spark.databricks.hive.metastore.glueCatalog.enabled : true
The crawler, using the Delta Lake setup from AWS, produces the following table metadata:
{
"StorageDescriptor": {
"cols": {
"FieldSchema": [
{
"name": "id",
"type": "string",
"comment": ""
},
{
"name": "media",
"type": "string",
"comment": ""
},
{
"name": "media_type",
"type": "string",
"comment": ""
},
{
"name": "title",
"type": "string",
"comment": ""
},
{
"name": "type",
"type": "smallint",
"comment": ""
},
{
"name": "clubmessage_id",
"type": "string",
"comment": ""
}
]
},
"location": "s3://COMPANY/club/attachment/_symlink_format_manifest",
"inputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
"outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
"compressed": "false",
"numBuckets": "-1",
"SerDeInfo": {
"name": "",
"serializationLib": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
"parameters": {}
},
"bucketCols": [],
"sortCols": [],
"parameters": {
"UPDATED_BY_CRAWLER": "CRAWLER_NAME",
"CrawlerSchemaSerializerVersion": "1.0",
"CrawlerSchemaDeserializerVersion": "1.0",
"classification": "parquet"
},
"SkewedInfo": {},
"storedAsSubDirectories": "false"
},
"parameters": {
"UPDATED_BY_CRAWLER": "CRAWLER_NAME",
"CrawlerSchemaSerializerVersion": "1.0",
"CrawlerSchemaDeserializerVersion": "1.0",
"classification": "parquet"
}
}
I am facing the same problem. It seems you cannot use Spark SQL to query a Delta table through Glue, because setting
spark.databricks.hive.metastore.glueCatalog.enabled : true
implies that the table will be treated as a Hive table.
You will need to access the table in S3 directly by its path, losing the advantages of the metadata catalog (see the sketch below).
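For the direct route, a minimal sketch of reading the table by its root path (the path is taken from the error message above; spark is the ambient SparkSession in a Databricks notebook):
// Read the Delta table by path, bypassing the Glue/Hive metastore entirely
val attachments = spark.read.format("delta").load("s3://COMPANY/club/attachment")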
You can read from it though, by blocking your cluster from accessing the _delta_log folder with the following IAM policy:
{ "Sid": "BlockDeltaLog", "Effect": "Deny", "Action": "s3:*", "Resource": [ "arn:aws:s3:::BUCKET" ], "Condition": { "StringLike": { "s3:prefix": [ "_delta_log/" ] } } }
I was able to query a delta table created by glue crawlers after updating the location. In your case it would need to be changed from:
s3://COMPANY/club/attachment/_symlink_format_manifest
to
s3://COMPANY/club/attachment
This is because Delta on Spark doesn't look at _symlink_format_manifest the way Hive and Presto do; it just needs to know the table's root directory.
The command in Databricks to update the location is something like this:
ALTER TABLE my_db.my_table
SET LOCATION "s3://COMPANY/club/attachment"
Note: your database location has to be set as well in order for that command to work.
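A hedged sketch of that database-level change, assuming ALTER DATABASE ... SET LOCATION is available on your runtime (the bucket prefix here is illustrative, not taken from the original setup):
-- Point the database at its S3 prefix so the table-level ALTER above succeeds
ALTER DATABASE my_db SET LOCATION "s3://COMPANY/club"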

Spark streaming Dynamic Schema Evolution from Kafka Eventhub on Microbatch

We are streaming data from the Kafka Eventhub. The records may have a nested structure. The schema is inferred dynamically from the data and the Delta table is formed with the schema of the first incoming batch of data.
Note: the data read from the Kafka topic is one whole JSON string per record. Hence:
When we apply a schema and convert to a DataFrame, we lose the values of fields with mismatched datatypes or newly added fields.
When we use spark.read.json, all field values are converted to String (see the sketch below).
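For context, a minimal sketch of how such a micro-batch is parsed; the topic name, bootstrap servers, and firstBatchSchema are illustrative placeholders rather than our actual job, and the Event Hub SASL options are omitted:
import org.apache.spark.sql.functions.{col, from_json}

// firstBatchSchema stands for the schema inferred from the first incoming batch (illustrative)
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "EVENTHUB_HOST:9093") // security/SASL options omitted
  .option("subscribe", "events")
  .load()

// Each Kafka value is a single JSON string; fields whose datatype does not match
// the supplied schema come back as null, which is the data loss described above.
val parsed = raw
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json(col("json"), firstBatchSchema).alias("data"))
  .select("data.*")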
We encounter situations where the source data has schema changes. Some of the scenarios we faced are:
The datatype changes at the parent level
The datatype changes at the nested level
There are duplicate keys in a different case
New fields are added
A sample of the source data with the actual schema:
{
"Id": "101",
"Name": "John",
"Department": {
"Id": "Dept101",
"Name": "Technology",
"EmpId": "10001"
},
"Experience": 2,
"Organization": [
{
"Id": "Org101",
"Name": "Google"
},
{
"Id": "Org102",
"Name": "Microsoft"
}
]
}
A sample of the source data illustrating the 4 points mentioned above:
{
"Id": "102",
"name": "Rambo", --- Point 3
"Department": {
"Id": "Dept101",
"Name": "Technology",
"EmpId": 10001 ---- Point 2
},
"Experience": "2", --- Point 1
"Organization": [
{
"Id": "Org101",
"Name": "Google",
"Experience": "2", --- Point 4
},
{
"Id": "Org102",
"Name": "Microsoft",
"Experience": "2",
}
]
}
We need a solution to overcome the above issues. Though it's difficult to merge the new schema into the existing Delta table, at least we should be able to separate the records with schema changes without losing the original data.

How to store Dataframe value as rowkey as well as column using Hortonworks-spark shc?

I am using shc-core to write a Spark Dataset to HBase; for more details see here.
This is my current shc catalog:
def catalog = s"""{
|"table":{"namespace":"default", "name":"table1"},
|"rowkey":"key",
|"columns":{
|"col0":{"cf":"rowkey", "col":"key", "type":"string"},
|"col1":{"cf":"cf1", "col":"col1", "type":"boolean"},
|"col2":{"cf":"cf2", "col":"col2", "type":"double"},
|"col3":{"cf":"cf3", "col":"col3", "type":"float"},
|"col4":{"cf":"cf4", "col":"col4", "type":"int"},
|"col5":{"cf":"cf5", "col":"col5", "type":"bigint"},
|"col6":{"cf":"cf6", "col":"col6", "type":"smallint"},
|"col7":{"cf":"cf7", "col":"col7", "type":"string"},
|"col8":{"cf":"cf8", "col":"col8", "type":"tinyint"}
|}
|}""".stripMargin
Because of the Stack Overflow rule that code cannot be too long, I can only give you part of it.
This is my HBase catalog:
{
"columns": {
"RXSJ": {
"col": "RXSJ",
"cf": "info",
"type": "bigint"
},
"LATITUDE": {
"col": "LATITUDE",
"cf": "info",
"type": "float"
},
"ZJHM": {
"col": "ZJHM",
"cf": "rowkey",
"type": "string"
},
"AGE": {
"col": "AGE",
"cf": "info",
"type": "int"
}
},
"rowkey": "ZJHM",
"table": {
"namespace": "default",
"name": "mongo_hbase_spark_out"
}
}
The other fields are written normally, but the rowkey column is not.
How can I additionally output the rowkey as a regular column?
You will not get the rowkey visible in the same way as the other columns. In the description of the HBase Catalog it is mentioned:
Note that the rowkey also has to be defined in details as a column (col0), which has a specific cf (rowkey).
Therefore, it will not show up although you have specified it in the columns section of your catalog.
The rowkey is only visible as actual rowkey as your screenshot also shows.
After testing, I solved the problem.
The whole idea is to write the same column twice.
This is my newly generated SHC catalog:
{
"columns": {
"rowkey_ZJHM": {
"col": "ZJHM",
"cf": "rowkey",
"type": "string"
},
"ZJHM": {
"col": "ZJHM",
"cf": "info",
"type": "string"
},
"AGE": {
"col": "AGE",
"cf": "info",
"type": "int"
}
},
"rowkey": "ZJHM",
"table": {
"namespace": "default",
"name": "mongo_hbase_spark_out"
}
}
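A hedged sketch of the corresponding DataFrame write (df is assumed to already contain a ZJHM column, catalog is the JSON string above, and the duplicated column name must match the catalog's rowkey entry):
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog
import org.apache.spark.sql.functions.col

// Duplicate ZJHM into the column that the catalog maps to cf "rowkey"
val withRowkey = df.withColumn("rowkey_ZJHM", col("ZJHM"))

withRowkey.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()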
I think the rowkey column is treated specially by hortonworks-spark shc: it is only ever written as the row key. The only option I can think of is to also output the value to another cf, as above.
Let me know if you have any better suggestions.
Thanks!

kafka-connect: Error in Cassandra sink connector

I get a runtime error for a Cassandra sink connector. I'm trying to pick data from Kafka and store it in Cassandra. You can find the error stack below:
{
"name": "cassandraSinkConnector2",
"connector": {
"state": "RUNNING",
"worker_id": "localhost:8083"
},
"tasks": [
{
"id": 0,
"state": "FAILED",
"worker_id": "localhost:8083",
"trace": "org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:560)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:321)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)\n\tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: org.apache.kafka.connect.errors.DataException: Key must be a struct or map. This connector requires that records from Kafka contain the keys for the Cassandra table. Please use a transformation like org.apache.kafka.connect.transforms.ValueToKey to create a key with the proper fields.\n\tat io.confluent.connect.cassandra.CassandraSinkTask.put(CassandraSinkTask.java:94)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:538)\n\t... 10 more\n"
}
],
"type": "sink"
}
I used the distributed configuration below for my connector:
{
"name": "cassandraSinkConnector2",
"config": {
"connector.class": "io.confluent.connect.cassandra.CassandraSinkConnector",
"tasks.max": "1",
"topics": "appartenance_de",
"cassandra.contact.points": "localhost",
"cassandra.kcql": "INSERT INTO app_test SELECT * FROM app_de",
"cassandra.port": "9042",
"cassandra.keyspace": "dev_dkks",
"cassandra.username": "superuser",
"cassandra.password": "superuser",
"cassandra.write.mode": "upsert",
"value.converter.schemas.enable": "true",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://localhost:8081",
"transforms": "createKey,extractInt",
"transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields": "id",
"transforms.extractInt.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field": "id",
"name": "cassandraSinkConnector2"
},
"tasks": [
{
"connector": "cassandraSinkConnector2",
"task": 0
}
],
"type": "sink"
Per my answer here, the error you're seeing is:
Key must be a struct or map. This connector requires that records from Kafka contain the keys for the Cassandra table.
Please use a transformation like org.apache.kafka.connect.transforms.ValueToKey to create a key with the proper fields.
I'd suggest using a Single Message Transform as suggested in the error to correctly key your data. You can see an example of doing this here and the documentation for the transform here.

Auto-Table generation in Cassandra with kafka connect cassandra sink

I am using io.confluent.connect.cassandra.CassandraSinkConnector for the Kafka Connect Cassandra sink.
I wanted to know whether it is possible to auto-generate Cassandra tables from the Kafka topic using this connector.
If it is possible, can you please suggest which configuration to set to enable this feature? I have tried all the configurations mentioned in the documentation, but I was not successful in creating a table.
This is the configuration I am using:
{
"name": "cassandra-test4",
"config": {
"connector.class": "io.confluent.connect.cassandra.CassandraSinkConnector",
"tasks.max": "3",
"topics": "orders-topic2",
"cassandra.contact.points": "my_ip",
"cassandra.keyspace": "test_cas",
"cassandra.write.mode": "Insert",
"cassandra.table.manage.enabled": "true",
"cassandra.sink.route": "test_cas.orders",
"key.converter.schema.registry.url": "http://localhost:8081",
"value.converter.schema.registry.url": "http://localhost:8081",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"flush.size": "1",
"cassandra.keyspace.create.enabled": "true",
"name": "cassandra-test4"
},
"tasks": [
{
"connector": "cassandra-test4",
"task": 0
},
{
"connector": "cassandra-test4",
"task": 1
},
{
"connector": "cassandra-test4",
"task": 2
}
],
"type": null
}
This should be done by setting the cassandra.keyspace.create.enabled and cassandra.table.manage.enabled properties to true. See the documentation.
But be really careful: it's very easy to get a schema disagreement in your cluster, and then you need additional steps to recover from it. It's better to pre-create the tables before starting the connector.
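For example, a hedged sketch of pre-creating the keyspace and table from the configuration above (the column list and replication settings are purely illustrative and must match the records on orders-topic2 and your cluster):
-- Keyspace and table names come from cassandra.keyspace / cassandra.sink.route above;
-- the columns are placeholders for whatever the Avro records actually contain.
CREATE KEYSPACE IF NOT EXISTS test_cas
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS test_cas.orders (
id int PRIMARY KEY,
product text,
quantity int,
price double
);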
