How can I do real-time log analysis using Spark Streaming? (architecture image attached) - apache-spark

A Spark Streaming application receives data in real time from many IoT devices, but each message is only a small amount of data.
The overall flow looks like this: IoT -> Kafka (1 topic / all data) -> Spark Streaming (filter error logs) -> DB (save) -> alert screen
Is there a good way to do real-time log analysis using Spark or Python?

You can use the Spark-Kafka connector to stream the data from the Kafka topic.
This doc is a good place to start with Structured Streaming and Kafka: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
Once you get the streaming DataFrame from Kafka, you can apply Spark's filter() function to your incoming data set; a minimal sketch is shown below.
This Databricks reference application also shows how to implement a log analysis application with Spark Streaming:
https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/app/index.html
You can use the above as a reference!
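For the Kafka -> filter part of your flow, a minimal sketch might look like the following. It assumes a broker at localhost:9092, a topic named iot-logs, and plain-text log lines containing the word ERROR (all placeholders for your actual setup), and it needs the spark-sql-kafka package available to Spark; the console sink is just for demonstration and would be replaced by your DB writer.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("IotErrorLogFilter").getOrCreate()

# Read the raw stream from Kafka; the value column arrives as bytes, so cast it to string
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")   # assumed broker
       .option("subscribe", "iot-logs")                       # assumed topic
       .load())

logs = raw.selectExpr("CAST(value AS STRING) AS line")

# Keep only error lines (simple contains-match; replace with your own parsing/filtering)
errors = logs.filter(col("line").contains("ERROR"))

# Console sink for demonstration; use foreachBatch here to save the filtered rows to your DB
query = (errors.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()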

Related

Design Question Kafka Consumer/Producer vs Kafka Stream

I'm working with Node.js microservices; so far they communicate through Kafka Consumer/Producer. Now I need to build a Logger MS which must record all the messages and do some processing (parse and save to the DB), but I'm not sure if the current approach could be improved using Kafka Streams or if I should continue using Consumers.
The Streams API is a higher level abstraction that sits on top of the Consumer/Producer APIs. The Streams API allows you to filter and transform messages, and build a topology of processing steps.
For what you're describing, if you're just picking up messages and doing a single processing step, the Consumer API is probably fine. That said, you could do the same thing with the Streams API too and simply not use the other features.
build a Logger MS which must record all the messages and do some processing (parse and save to the DB)
I would suggest using something like the Streams API or a Node.js Producer + Consumer to parse the messages and write them back to Kafka.
From your parsed/filtered/sanitized messages, you can run a Kafka Connect cluster to sink the data into a DB.
could be improved using Kafka Streams or if I should continue using Consumers
Ultimately, it depends on what you need. The peek and foreach methods of the Streams DSL are functionally equivalent to a Consumer; a sketch of the plain Consumer approach is shown below.
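For illustration only, here is a minimal consume-parse-save loop using the plain Consumer API. It is written in Python with the kafka-python client rather than Node.js, and the topic name, broker address, and save_to_db helper are placeholders, not anything from the question.

import json
from kafka import KafkaConsumer

def save_to_db(record):
    # Placeholder: insert the parsed record into your database here
    print("saving:", record)

consumer = KafkaConsumer(
    "service-logs",                      # assumed topic name
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    group_id="logger-ms",
)

# Single processing step: pick up a message, parse it, save it
for message in consumer:
    save_to_db(message.value)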

How to implement Change Data Capture (CDC) using apache spark and kafka?

I am using spark-sql-2.4.1v with Java 1.8, and the Kafka dependencies spark-sql-kafka-0-10_2.11_2.4.3 and kafka-clients_0.10.0.0.
I need to join streaming data with metadata which is stored in RDS,
but the RDS metadata can be added to or changed.
If I read and load the RDS table data in the application, it will become stale for joining with the streaming data.
I understand I need to use Change Data Capture (CDC).
How can I implement Change Data Capture (CDC) in my scenario?
Any clues or a sample way to implement Change Data Capture (CDC)?
Thanks a lot.
You can stream a database into Kafka so that the contents of a table plus every subsequent change is available on a Kafka topic. From here it can be used in stream processing.
You can do CDC in two different ways:
Query-based: poll the database for changes, using Kafka Connect JDBC Source
Log-based: extract changes from the database's transaction log using e.g. Debezium
For more details and examples see http://rmoff.dev/ksny19-no-more-silos
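As a hedged sketch of how the log-based route could then be consumed on the Spark side: assuming Debezium publishes JSON change events for the RDS table to a Kafka topic, Structured Streaming can read and unwrap them as shown below. The topic name, broker address, and metadata schema are all assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("RdsCdcReader").getOrCreate()

# Assumed shape of the RDS metadata table
meta_schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
])

# Debezium's JSON envelope carries the new row state under payload.after
envelope = StructType([
    StructField("payload", StructType([
        StructField("after", meta_schema),
    ])),
])

cdc = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")   # assumed broker
       .option("subscribe", "rds.mydb.metadata")              # assumed CDC topic
       .load()
       .select(from_json(col("value").cast("string"), envelope).alias("e"))
       .select("e.payload.after.*"))

# cdc is now a streaming DataFrame of metadata changes that you can join with
# your fact stream instead of relying on a one-time static load that goes stale.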

SPARK STREAMING: I want to do some streaming exercises, how to get a good stream data source?

I want to do some streaming exercises; how can I get a good streaming data source?
I am looking for both structured and unstructured streaming data sources.
Will Twitter work?
Local files can be used as sources in structured streaming, e.g.:
stream = spark.readStream.schema(mySchema).option("maxFilesPerTrigger", 1).json("/my/data")
With this you can experiment with data transformations and output very easily, and there are many sample datasets online, e.g. on Kaggle.
If you want something more production-like, the Twitter API is a good option. You will need some sort of messaging middleware though, like Kafka or Azure Event Hubs: a simple app can send tweets there and you will be able to pick them up easily from Spark. You can also generate data yourself on the input side instead of depending on Twitter.
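Expanding the file-source one-liner above into a self-contained sketch (the schema fields and the /my/data path are placeholders for your own dataset):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("FileSourceExercise").getOrCreate()

# Placeholder schema; match it to whatever sample dataset you download
mySchema = StructType([
    StructField("device", StringType()),
    StructField("timestamp", LongType()),
    StructField("message", StringType()),
])

stream = (spark.readStream
          .schema(mySchema)
          .option("maxFilesPerTrigger", 1)   # feed one file per micro-batch
          .json("/my/data"))

# Try a simple transformation and print the running result to the console
query = (stream.groupBy("device").count()
         .writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()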

App server Log process

I have a requirement from my client to process the application (Tomcat) server log files for a back-end REST-based app server which is deployed on a cluster. The client wants to generate "access" and "frequency" reports from that data with different parameters.
My initial plan is to get the data from the app server logs --> push it to Spark Streaming via Kafka and process it --> store it in Hive --> use Zeppelin to get back the processed, centralized log data and generate reports per the client's requirements.
But as far as I know, Kafka does not have any feature that can read data from a log file and post it to a Kafka broker on its own. In that case we would have to write a scheduled job that reads the log from time to time and sends it to a Kafka broker, which I would prefer not to do: it would not be real time, and we would have to worry about synchronization issues since we have 4 instances of the application server.
Another option we have in this case, I think, is Apache Flume.
Can anyone suggest which would be the better approach here, or whether Kafka has any way to read data from a log file on its own, and what advantages or disadvantages each has?
I guess another option is Flume + Kafka together, but I cannot say much about that as I have almost no knowledge of Flume.
Any help will be highly appreciated...... :-)
Thanks a lot ....
You can use Kafka Connect (the file source connector) to read/consume Tomcat log files and push them to Kafka. Spark Streaming can then consume from the Kafka topics and churn the data.
tomcat -> logs -> kafka connect -> kafka -> spark -> Hive
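A hedged sketch of the spark -> Hive leg of that pipeline, assuming Kafka Connect is already writing the raw log lines to a topic; the topic name, broker address, checkpoint path, and table name below are assumptions.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("TomcatLogToHive")
         .enableHiveSupport()
         .getOrCreate())

lines = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")   # assumed broker
         .option("subscribe", "tomcat-logs")                    # topic fed by Kafka Connect
         .load()
         .selectExpr("CAST(value AS STRING) AS line", "timestamp"))

# Append each micro-batch into a Hive table; parse "line" into proper columns here
# first if you need structured access/frequency reports in Zeppelin.
query = (lines.writeStream
         .foreachBatch(lambda batch_df, batch_id:
                       batch_df.write.mode("append").saveAsTable("logs.tomcat_access"))
         .option("checkpointLocation", "/tmp/chk/tomcat-logs")
         .start())
query.awaitTermination()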

Pyspark Streaming - How to set up custom logging?

I have a PySpark streaming application that runs on YARN in a Hadoop cluster. The streaming application reads from a Kafka queue every n seconds and makes a REST call.
I have a logging service in place to provide an easy way to collect and store data, send data to Logstash and visualize data in Kibana. The data needs to conform to a template (JSON with specific keys) provided by this service.
I want to send logs from the streaming application to Logstash using this service. For this, I need to:
- Collect some data while the streaming app is reading from Kafka and making the REST call.
- Format it according to the logging service template.
- Forward the log to the Logstash host.
Any guidance related to this would be very helpful.
Thanks!
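One possible direction, purely as a sketch: wrap the formatting and forwarding steps in a custom Python logging handler that sends newline-delimited JSON to a Logstash TCP input. The template keys, host, and port below are placeholders for whatever your logging service actually expects.

import json
import logging
import socket

class LogstashTcpHandler(logging.Handler):
    """Send newline-delimited JSON log events to a Logstash TCP input."""

    def __init__(self, host="logstash.example.com", port=5000):   # assumed endpoint
        super().__init__()
        self.address = (host, port)

    def emit(self, record):
        # Map the log record onto the (assumed) template required by the logging service
        event = {
            "app": "pyspark-streaming-job",
            "level": record.levelname,
            "message": record.getMessage(),
            "timestamp": record.created,
        }
        try:
            with socket.create_connection(self.address, timeout=5) as sock:
                sock.sendall((json.dumps(event) + "\n").encode("utf-8"))
        except OSError:
            self.handleError(record)

logger = logging.getLogger("streaming_app")
logger.setLevel(logging.INFO)
logger.addHandler(LogstashTcpHandler())

# Call the logger from the driver-side code that reads from Kafka and makes the REST call,
# e.g. inside your foreachRDD / foreachBatch function.
logger.info("processed batch")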
