Our team wants to build a solution for real-time series data coming from sensors. We need to stream that data. What are the viable options for streaming? We also need to do some transformations before the data is stored alongside the raw data.
I have come across Apache Kafka as a solution, and Kafka Streams could help us transform the data as well.
Please let me know about other viable options that can be integrated with Microsoft Azure, as our machine learning models are being built there.
I have data in two Kafka topics, sourced from MySQL using the debezium-connector-mysql plugin.
Now I want to aggregate this data at a daily level and store it in another MySQL table.
Please suggest an approach.
Thanks.
You haven't really laid out your requirements, other than commenting that you don't want to use Confluent Platform (without saying why).
In general, with data in Kafka (regardless of where it comes from) you have different options for processing it:
Bespoke consumer (probably a bad idea, given the availability of stream processing frameworks)
KSQL (use SQL to do your joins etc) - part of Confluent Platform
Kafka Streams - a Java library for doing stream processing. Part of Apache Kafka.
Flink, Spark Streaming, Samza, Heron, etc.
It's up to you which you use, and it's going to come down to factors such as
Existing technology in use (no point deploying a Spark cluster if you don't need to; conversely, if you already use Spark and have lots of developers trained on it then it could make sense to use it)
Language familiarity of your developers - does it have to be a Java API, or is SQL more accessible?
Capabilities of the framework/tool - do you need tight security integration, exactly-once processing, CEP, etc.? Some of these will rule particular tools in or out.
Once you've joined and aggregated your data, a good pattern to follow is to write it back to Kafka (which keeps your design loosely coupled and separates the responsibilities of the components) and from there write it to MySQL using Kafka Connect and the JDBC Sink. Kafka Connect is part of Apache Kafka.
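If you go the Kafka Streams route, a minimal sketch of the daily aggregation could look like the following. The topic names ("orders" and "orders-daily"), the String serdes, and the simple count-per-key are assumptions for illustration only; your keys, values, and aggregation logic will differ.

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class DailyAggregation {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "daily-aggregation");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

    StreamsBuilder builder = new StreamsBuilder();

    builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
        .groupByKey()
        // Tumbling 24-hour windows give one bucket per key per day.
        .windowedBy(TimeWindows.of(Duration.ofDays(1)))
        .count()
        .toStream()
        // Flatten the windowed key so the sink topic has a plain string key.
        .map((windowedKey, count) -> KeyValue.pair(
            windowedKey.key() + "@" + windowedKey.window().startTime(), count.toString()))
        .to("orders-daily", Produced.with(Serdes.String(), Serdes.String()));

    new KafkaStreams(builder.build(), props).start();
  }
}
```

The Kafka Connect JDBC Sink can then read "orders-daily" and write it into the target MySQL table.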
One final consideration: if you're taking data from MySQL, processing it, and then writing it back into MySQL… do you even need Kafka? Is there an appropriate reason to be using it, rather than just doing this processing in MySQL itself?
Disclaimer: I work for Confluent.
My requirement is:
I have log files that I need to process, and I would also like to enrich the log information with some data that I have in a Postgres database.
Step 1. I plan to feed data from the above two sources (log files and database) into Kafka topics, using Logstash.
Step 2. I plan to use Kafka Streams to join data across the different Kafka topics and push it to Elasticsearch via API calls.
My doubt is about step 2:
Is Kafka Streams the way to go, or can I use Apache Spark, which I believe can do the same?
Any help on this is appreciated.
Step 1. I plan to feed data from the above two sources (log files and database) into Kafka topics, using Logstash.
If you're already using Apache Kafka, then note that you can use Kafka Connect for integrating systems, including databases, into Kafka. For information on integrating databases, see this article.
Step 2. I plan to use Kafka Streams to join data across the different Kafka topics and push it to Elasticsearch via API calls.
My doubt is about step 2: Is Kafka Streams the way to go, or can I use Apache Spark, which I believe can do the same? Any help on this is appreciated.
Yes, Kafka Streams is a good fit for this. It can enrich events as they flow through a topic, using data from other topics. Those topics can be sourced from any system, including log files, databases, etc. Here is example code for such a join, and the documentation for it.
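As a rough sketch of that kind of enrichment (this is not the linked example; the topic names "logs", "customers", "logs-enriched" and the plain String serdes are made up for illustration), a KStream-KTable join looks roughly like this:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class LogEnrichment {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "log-enrichment");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

    StreamsBuilder builder = new StreamsBuilder();

    // Log events, keyed by some id that also exists in the Postgres data.
    KStream<String, String> logs =
        builder.stream("logs", Consumed.with(Serdes.String(), Serdes.String()));

    // Reference data originally from Postgres, keyed the same way.
    KTable<String, String> reference =
        builder.table("customers", Consumed.with(Serdes.String(), Serdes.String()));

    // Enrich each log event with the matching reference record and write the
    // result back to Kafka; from there it can be pushed to Elasticsearch.
    logs.leftJoin(reference, (logEvent, refRecord) -> logEvent + " | " + refRecord)
        .to("logs-enriched", Produced.with(Serdes.String(), Serdes.String()));

    new KafkaStreams(builder.build(), props).start();
  }
}
```

If you would rather not hand-roll the Elasticsearch API calls, there is also a Kafka Connect Elasticsearch sink connector you could evaluate for the final hop.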
BTW you might also want to check out KSQL. KSQL is built on Kafka Streams, so you get the same scalability and elasticity, but with a SQL abstraction that you can run directly (no coding needed). For an example of using KSQL to enrich streams of data, see this talk or this article.
(Disclosure: I work for Confluent, who lead the open-source KSQL project)
For my project, I need to harvest data from Twitter.
I am currently facing two design choices:
What is the best software architecture? I read that Spark has Twitter support, but I am not familiar with Scala. On the other hand, Apache Kafka seems a good option, but then I'm not sure how to save the data to a common sink.
I have some budget constraints. I surely need one server to do the sink and the processing. However, for the data harvesting, I don't know if several VMs/containers offer a better performance/cost ratio than a bunch of Raspberry Pis running Kafka producers.
Take a look at the Confluent Platform and especially Kafka Connect [1].
There is a Twitter connector out of the box. All the Twitter data will be streamed into Kafka.
[1] https://www.confluent.io/blog/using-ksql-to-analyse-query-and-transform-data-in-kafka
I agree with #leshkin that Kafka Connect is the most natural fit. However, the Twitter connector (available on GitHub here) does not require Confluent Platform, only Kafka Connect, which is a standard part of the Apache Kafka distribution: https://kafka.apache.org/documentation/#connect
If you choose, you can run Kafka Connect workers in distributed mode to divide the load across several VMs/containers/boxes, and these don't have to be the same boxes that run your Kafka brokers (they only need the relevant libraries from Kafka, the libraries for the connector, and Java, of course).
Dear all,
I need to get data from a Graylog2 server into Druid (e.g. CPU, memory, and disk utilisation of several machines).
I've searched for plugins in the Graylog Marketplace and in the Tranquility documentation, and I did not find any solution for retrieving data from Graylog2.
I believe the solution is to use the Graylog2 REST API, but how can this be "automated" from the Druid/Tranquility side?
I looked quickly into Graylog2 and couldn't find any docs or links about the REST API that you mention, nor a documented way to extract data from Graylog2 in general. But if you can come up with something that can read from Graylog2 and dump the data to Kafka, you can point the Druid cluster at those Kafka topics and ingest the data.
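A rough sketch of that "read from Graylog2 and dump to Kafka" idea is below. The Graylog URL, search endpoint, credentials, topic name, and polling interval are all placeholders you would need to check against your Graylog version; this is only meant to show the shape of the bridge.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class GraylogToKafka {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", StringSerializer.class.getName());
    KafkaProducer<String, String> producer = new KafkaProducer<>(props);

    HttpClient http = HttpClient.newHttpClient();
    String auth = Base64.getEncoder().encodeToString("admin:password".getBytes());

    while (true) {
      // Hypothetical relative-search query against the Graylog REST API.
      HttpRequest request = HttpRequest.newBuilder()
          .uri(URI.create("http://graylog:9000/api/search/universal/relative?query=*&range=60"))
          .header("Authorization", "Basic " + auth)
          .header("Accept", "application/json")
          .build();
      HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());

      // Forward the raw JSON response to a Kafka topic that Druid can ingest from.
      producer.send(new ProducerRecord<>("graylog-metrics", response.body()));

      Thread.sleep(60_000);  // poll once a minute
    }
  }
}
```

Druid's Kafka ingestion (or Tranquility reading from Kafka) can then consume that topic.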
I am using Spark Streaming to stream data from a Kafka broker, and I am performing transformations on the data with Spark Streaming. Can someone suggest a visualization tool that I can use to show real-time graphs and charts which update as the data streams in?
You could store your results in Elasticsearch and then use Kibana to build visualizations.
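A minimal sketch of that approach using the elasticsearch-hadoop (elasticsearch-spark) integration is below. The socket source, index name, master setting, and batch interval are stand-ins; in practice you would substitute your Kafka direct stream and your transformed DStream of JSON documents.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.elasticsearch.spark.streaming.api.java.JavaEsSparkStreaming;

public class StreamToEs {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf()
        .setAppName("stream-to-es")
        .setMaster("local[2]")                 // placeholder; use your real master
        .set("es.nodes", "localhost:9200");    // Elasticsearch endpoint (placeholder)
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

    // Stand-in for your transformed stream of JSON documents; in your case this
    // would be the DStream produced by your Kafka + transformation pipeline.
    JavaDStream<String> results = jssc.socketTextStream("localhost", 9999);

    // Index each micro-batch into Elasticsearch; the exact resource string
    // (index vs. index/type) depends on your ES and es-hadoop versions.
    JavaEsSparkStreaming.saveJsonToEs(results, "streaming-results");

    jssc.start();
    jssc.awaitTermination();
  }
}
```

Kibana can then read the "streaming-results" index and auto-refresh dashboards as new documents arrive.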
Apart from looking at Spark's own Streaming UI tab, I highly recommend using Graphite sinks. Spark Streaming is a long-running application, so for monitoring purposes this can be really handy.
Using Graphite dashboards, you will be able to start monitoring your Spark Streaming application in no time.
The best literature I know of is here, in the monitoring section, and [here too](https://www.inovex.de/blog/247-spark-streaming-on-yarn-in-production/).
They cover configuration and other details. You will find some ready-made dashboards in JSON format on various GitHub repositories, but again, I found these two posts the most useful in my production application.
I hope this helps you visualize and monitor the internals of your Spark Streaming application.
You can use WebSockets for building real-time streaming graphs.
As such, there are no BI tools for this, but there are JS libraries which can help with building real-time graphs - http://www.pubnub.com/blog/tag/d3-js/
Check out Lightning: A Data Visualization Server
http://lightning-viz.org/
The server is designed for making web-based interactive visualizations using D3. It is built for large data sets and continuously updating data streams.
You can use professional BI tools like Tableau or Power BI, or even MS Excel. For testing, I use MS Excel with a one-minute auto refresh.
You can also write Python code for this.