I want to set up Kafka Connect to move data from a Kafka topic to Cassandra.
The problem is simple: say I have a demo topic in Kafka with JSON data like
{"id":"1", "name":"Alex", "clicks":2}
I would like to automatically push it into a Cassandra table with columns id, name, and clicks.
I'm looking into kafka-connect-cassandra, but the only example I can find reads from Cassandra and writes to another Cassandra table, with Kafka in the middle.
My question is: how can I make it read from Kafka and not Cassandra?
I'm looking for an open-source connector with an example of doing that.
The example you are referring to showcases both the source and sink features of the connector together. If your use case is to push data from a Kafka topic to a Cassandra table, then all you need is the sink. Follow these steps:
Create your own sink properties file (a sketch is shown after these steps) and save it as my-sink.properties
Go to the home directory of the installation and execute the command CLASSPATH=<<path-to-connector-jar>> ./bin/connect-standalone connect-standalone.properties my-sink.properties
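For reference, here is a minimal sketch of what my-sink.properties could look like, assuming the DataMountaineer Cassandra sink discussed in the next answer; the table and keyspace names are placeholders, the topic is the demo topic from the question, and the exact property keys (e.g. connect.cassandra.kcql vs. connect.cassandra.sink.kcql) depend on the connector version you use.
name=cassandra-sink-demo
connector.class=com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector
tasks.max=1
topics=demo
# KCQL maps the Kafka topic onto the Cassandra table and its columns
connect.cassandra.kcql=INSERT INTO clicks_table SELECT id, name, clicks FROM demo
connect.cassandra.contact.points=localhost
connect.cassandra.port=9042
connect.cassandra.key.space=demo_ks
connect.cassandra.username=cassandra
connect.cassandra.password=cassandra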
If you are interested in an example with more detailed steps, see https://github.com/yaravind/kafka-connect-jenkins#standalone-mode (full disclosure: I maintain that connector for Jenkins).
I had the same issue, and I followed what's on https://www.confluent.io/blog/kafka-connect-cassandra-sink-the-perfect-match/. I'm using the DataMountaineer driver (http://docs.datamountaineer.com/en/latest/cassandra-sink.html), and setting it up in distributed mode.
Once you have it set up, your JSON configuration for the Cassandra connector (uploaded via the Kafka Connect REST API) should look something like:
{
  "name": "cassandra.sink.yourConfigName",
  "config": {
    "connector.class": "com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector",
    "tasks.max": "1",
    "topics": "<your topic>",
    "connect.cassandra.sink.kcql": "INSERT INTO <your_table> SELECT * FROM <your_kafka_topic>;",
    "connect.cassandra.contact.points": "<cassandra nodes>",
    "connect.cassandra.port": "<cassandra port>",
    "connect.cassandra.key.space": "<cassandra keyspace>",
    "connect.cassandra.username": "cassandra",
    "connect.cassandra.password": "cassandra"
  }
}
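To upload that configuration to a Connect worker running in distributed mode, you can POST it to the worker's REST API; a sketch, assuming the JSON above is saved as cassandra-sink.json and the worker listens on the default port 8083:
curl -X POST -H "Content-Type: application/json" \
     --data @cassandra-sink.json \
     http://localhost:8083/connectors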
Related
I'm setting up my log server. I'm forwarding logs using Fluentd to Kafka and then storing them in Cassandra for later use. For this I'm using the Kafka Cassandra sink connector. I have to store the data chronologically, for which I need to add a timestamp to my messages in Cassandra. How can this be done?
The DataMountaineer connector uses KCQL, which I don't think supports inserting a timestamp into a log.
My connector configuration is as follows:
name=cassandra-sink
connector.class=com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector
tasks.max=1
topics=test_AF1
connect.cassandra.kcql=INSERT INTO test_event1 SELECT now() as id, message as msg FROM test_AF1 TIMESTAMP=sys_time()
connect.cassandra.port=9042
connect.cassandra.contact.points=localhost
connect.cassandra.key.space=demo
Kafka Connect's Single Message Transform can do this. Here's an example:
{
  "connector.class": "com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector",
  "topics": "test_AF1",
  …
  "transforms": "addTS",
  "transforms.addTS.type": "org.apache.kafka.connect.transforms.InsertField$Value",
  "transforms.addTS.timestamp.field": "op_ts"
}
This adds a field to the message payload called op_ts with the timestamp of the Kafka message.
I don't know how this interacts with KCQL; you might want to check out the other two Cassandra sinks that I'm aware of:
https://www.confluent.io/hub/confluentinc/kafka-connect-cassandra
https://www.confluent.io/hub/datastax/kafka-connect-dse
After trying several methods for monitoring Structured Streaming performance and input/output metrics, I found that a reliable way is to attach a StreamingQueryListener and output the StreamingQueryProgress to get the input/output numbers.
Besides the Spark UI,
Is there any better way to monitor Structured Streaming performance?
What's the best way to output the queryProgress to a file or Kafka?
What's an efficient way to compare performance (speed, input/output records) between Spark Streaming and Spark Structured Streaming?
One way is to use the ELK stack.
The Spark application can sink its metrics via JMX to Logstash, which aggregates the data and sends it to Elasticsearch for indexing.
Kibana can then display the data from Elasticsearch with its visualization capabilities.
1) You need to either include the spark-sql-kafka dependency in build.sbt,
(at this moment, I am using Spark 2.2.0 with spark-sql-kafka 0.10)
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql-kafka-0-10
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.0"
or include --packages when doing spark-submit:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
2) In order for the Spark application to expose JMX metrics, all the lines related to the JMX sink need to be uncommented in the file metrics.properties (see the sketch after the command below).
Then, during spark-submit, point to the file like below:
--files=metrics.properties --conf spark.metrics.conf=metrics.properties
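For reference, the line to uncomment in metrics.properties is the JMX sink class; the extra JVM options shown after it are an illustrative way to open the driver's JMX endpoint on port 9000 so that it matches the jmx.conf sample further down.
# metrics.properties: enable the JMX sink for all instances (master, worker, driver, executor)
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink

# spark-submit (illustrative): expose the driver's JMX endpoint on port 9000
--conf "spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9000 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"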
3) Install Elasticsearch, Logstash, and Kibana.
If you are on Windows, start the ELK stack like below:
C:\elasticsearch> .\bin\elasticsearch.bat
C:\logstash> bin\logstash.bat -f .\jmx.conf
C:\kibana> .\bin\kibana.bat
In jmx.conf, the JMX path and the polling frequency need to be configured:
input {
  jmx {
    path => "C:/logstash/config/jmx/"
    polling_frequency => 15
    type => "jmx"
    nb_thread => 2
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
In the folder at that JMX path, a JSON file needs to be created listing the object_names and attributes that you want Logstash to retrieve.
(Logstash rereads this JSON file at every polling_frequency interval, so any update you make to it while the Spark applications are running will be picked up; there is no need to restart Logstash.)
You can list the available object_names and attributes in JConsole after you submit the Spark application.
A sample file is as follows:
{
  "host" : "localhost",
  "port" : 9000,
  "alias" : "spark.jmx.sample1",
  "queries" : [
    {
      "object_name" : "kafka.consumer:type=consumer-metrics,client-id=*",
      "attributes" : ["incoming-byte-rate","outgoing-byte-rate"],
      "object_alias" : "byteRate"
    },
    {
      "object_name" : "metrics:name=local-1528139073457.driver.spark.streaming.e6c4b9da-a7d1-479f-b28f-ba2b9b1397d0.inputRate-total",
      "attributes" : ["Value"],
      "object_alias" : "somethingTeste1"
    }
  ]
}
4) Finally, access Kibana via http://localhost:5601.
Set up the index pattern first. (You should see the data index by date.)
Then go to the visualization page and create the metrics from the object_names and attributes that you listed in the JMX JSON file.
How to migrate data from SQL Server to Neo4j without using LOAD CSV?
Other tools do not give a complete transformation according to the schema, just a sample (e.g. the ETL tool). If there is something that could be done with a Node.js application, that would be great.
Also, how can I keep the data in sync between SQL Server and Neo4j?
You have a few options here: apoc.load.jdbc is one, and there is also the neo4j-etl-components project.
apoc.load.jdbc
The APOC procedure library provides the ability to connect to a relational database and stream the results of SQL queries into Neo4j from within Cypher.
For example:
// Create Product nodes
CALL apoc.load.jdbc("jdbc:mysql://localhost:3306/northwind?user=root","SELECT * FROM products") YIELD row
CREATE (p:Product {ProductID: row.ProductID})
SET p.ProductName = row.ProductName,
p.CategoryID = row.CategoryID,
p.SupplierID = row.SupplierID
More info is available in the docs.
neo4j-etl-components
neo4j-etl-components is a tool that will
Inspect an existing relational database schema
Apply rules to this schema to convert it to a property graph model
Optionally, provide a GUI for editing how this translation should be applied
Stream an export from the relational database to create a new Neo4j database (offline batch), or perform an online incremental insert into Neo4j
neo4j-etl-components can be used as a command line tool. For example, to run an initial batch import:
~: $NEO4J_HOME/bin/neo4j-etl export \
--rdbms:url jdbc:oracle:thin:@localhost:49161:XE \
--rdbms:user northwind --rdbms:password northwind \
--rdbms:schema northwind \
--using bulk:neo4j-import \
--import-tool $NEO4J_HOME/bin \
--csv-directory /tmp/northwind \
--options-file /tmp/northwind/options.json \
--quote '"' --force
See the documentation and the GitHub page for the project.
Keeping in sync
Once you've done the initial import, you of course need to keep the data in sync. One option is to handle this at the application layer: write to a queue, with workers responsible for writing to both databases. Another is to run incremental versions of your import with apoc.load.jdbc or neo4j-etl-components, as in the sketch below.
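For example, an incremental pass with apoc.load.jdbc could look roughly like this; the updated_at column and the cutoff value are assumptions about your relational schema, and MERGE makes the pass idempotent:
// Upsert only the products that changed since the last sync (updated_at is hypothetical)
CALL apoc.load.jdbc("jdbc:mysql://localhost:3306/northwind?user=root",
  "SELECT * FROM products WHERE updated_at > '2017-01-01 00:00:00'") YIELD row
MERGE (p:Product {ProductID: row.ProductID})
SET p.ProductName = row.ProductName,
    p.CategoryID = row.CategoryID,
    p.SupplierID = row.SupplierID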
I cannot find an answer via Google, MSDN (and other Microsoft) documentation, or SO.
In Azure Data Factory you can get data from a dataset by using a Copy Activity in a pipeline. The pipeline definition includes a query. All the queries I have seen in the documentation are simple, single-table queries with no joins. In that case, a dataset is defined as a table in the database with "TableName" = "mytable". Additionally, one could retrieve data from a stored procedure, presumably allowing more complex SQL.
Is there a way to define a more complex query in a pipeline, one that includes joins and/or transformation logic that alters the data, using a query rather than a stored procedure? I know that you can specify fields in a dataset, but I don't know how to get around the "TableName" property.
If there is a way, what would that method be?
The input is on-premises SQL Server; the output is an Azure SQL database.
UPDATED for clarity.
Yes, the sqlReaderQuery can be much more complex than what is provided in the examples, and it doesn't have to only use the Table Name in the Dataset.
In one of my pipelines, I have a Dataset with the TableName "dbo.tbl_Build", but my sqlReaderQuery looks at several tables in that database. Here's a heavily truncated example:
with BuildErrorNodes as (select infoNode.BuildId, ...) as MessageValue from dbo.tbl_BuildInformation2 as infoNode inner join dbo.tbl_BuildInformationType as infoType on (infoNode.PartitionId = infoType), BuildInfo as ...
It's a bit confusing to list a single table name in the Dataset, then use multiple tables in the query, but it works just fine.
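For illustration, a source section with a join in the query might look like the sketch below; the join columns are made up, since the real schema isn't shown here:
"source": {
  "type": "SqlSource",
  "sqlReaderQuery": "SELECT infoNode.BuildId, infoType.TypeName FROM dbo.tbl_BuildInformation2 AS infoNode INNER JOIN dbo.tbl_BuildInformationType AS infoType ON infoNode.InformationTypeId = infoType.InformationTypeId"
}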
There's a way to move data from on-premises SQL Server to Azure SQL using Data Factory.
You can use a Copy Activity; check the code sample for your case specifically (the GitHub link to the ADF Activity source).
Basically, you need to create a Copy Activity whose typeProperties have SqlSource and SqlSink sections that look like this:
"typeProperties": {
"source": {
"type": "SqlSource",
"SqlReaderQuery": "select * from [Source]"
},
"sink": {
"type": "SqlSink",
"WriteBatchSize": 1000000,
"WriteBatchTimeout": "00:05:00"
}
},
Also worth mentioning: you can use not only SELECTs from tables or views; table-valued functions work as well (see the sketch below).
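For instance, the reader query can call a table-valued function instead of naming a table; the function below is purely hypothetical:
"source": {
  "type": "SqlSource",
  "SqlReaderQuery": "SELECT * FROM dbo.fn_GetChangedOrders('2016-01-01')"
}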
I'm trying to figure out how to use the new DataFrameWriter to write data back to a JDBC database. I can't seem to find any documentation for this, although looking at the source code it seems like it should be possible.
A trivial example of what I'm trying looks like this:
sqlContext.read.format("jdbc").options(Map(
"url" -> "jdbc:mysql://localhost/foo", "dbtable" -> "foo.bar")
).select("some_column", "another_column")
.write.format("jdbc").options(Map(
"url" -> "jdbc:mysql://localhost/foo", "dbtable" -> "foo.bar2")
).save("foo.bar2")
This doesn't work — I end up with this error:
java.lang.RuntimeException: org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:200)
I'm not sure if I'm doing something wrong (why is it resolving to DefaultSource instead of JDBCRDD for example?) or if writing to an existing MySQL database just isn't possible using Spark's DataFrames API.
Update
Current Spark versions (2.0 or later) support table creation on write, as in the sketch below.
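A minimal sketch of the 2.x write path (connection details, credentials, and save mode are illustrative; df is an existing DataFrame):
// With mode "overwrite" (or the default ErrorIfExists when the table is missing),
// the Spark 2.x JDBC writer creates the target table itself
df.write
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost/foo")
  .option("dbtable", "foo.bar2")
  .option("user", "root")        // illustrative credentials
  .option("password", "secret")
  .mode("overwrite")
  .save()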
The original answer
It is possible to write to an existing table, but it looks like at this moment (Spark 1.5.0) creating a table using the JDBC data source is not supported yet*. You can check SPARK-7646 for reference.
If the table already exists, you can simply use the DataFrameWriter.jdbc method:
val prop: java.util.Properties = ???
df.write.jdbc("jdbc:mysql://localhost/foo", "foo.bar2", prop)
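The ??? above is left to the reader; typically the Properties carry at least the credentials and, if needed, the JDBC driver class, for example (values illustrative):
val prop = new java.util.Properties()
prop.setProperty("user", "root")
prop.setProperty("password", "secret")
prop.setProperty("driver", "com.mysql.jdbc.Driver") // only needed if the driver isn't picked up automatically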
* Interestingly, PySpark seems to support table creation using the jdbc method.