PySpark Streaming - How to set up custom logging? - apache-spark

I have a PySpark streaming application that runs on YARN in a Hadoop cluster. The streaming application reads from a Kafka queue every n seconds and makes a REST call.
I have a logging service in place that provides an easy way to collect and store data, send it to Logstash, and visualize it in Kibana. The data needs to conform to a template (JSON with specific keys) provided by this service.
I want to send logs from the streaming application to Logstash using this service. For this, I need to do three things:
- Collect some data while the streaming app is reading from Kafka and making the REST call.
- Format it according to the logging service template.
- Forward the log to the Logstash host.
Any guidance related to this would be very helpful.
Thanks!
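The three steps listed in the question can be sketched in plain Python. This is only a minimal sketch under assumptions: the JSON keys below are hypothetical placeholders (the real ones come from the logging service template), and it assumes a Logstash tcp input with a json_lines codec is listening on the given host and port, which the question does not confirm.

```python
import json
import socket
from datetime import datetime, timezone


def build_log_record(app_name, kafka_topic, batch_size, rest_status, elapsed_ms):
    """Build one log document. The keys are hypothetical placeholders
    standing in for the keys required by the logging service template."""
    return {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "application": app_name,
        "kafka_topic": kafka_topic,
        "batch_size": batch_size,
        "rest_status": rest_status,
        "elapsed_ms": elapsed_ms,
    }


def send_to_logstash(record, host, port, timeout=5.0):
    """Send one newline-terminated JSON document over TCP.

    Assumes a Logstash tcp input with a json_lines codec on host:port.
    """
    payload = (json.dumps(record) + "\n").encode("utf-8")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
```

In the streaming job, build_log_record would be called once per batch after the REST call returns, and the result passed to send_to_logstash. Opening a connection per record keeps the sketch simple; a long-lived connection or a standard logging handler may be preferable for high volumes.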

Related

Read from multiple sources using Nifi, group topics in Kafka and subscribe with Spark

We use Apache NiFi to get data from multiple sources, such as Twitter and Reddit, at a specific interval (for example, every 30s). We would then like to send it to Apache Kafka, ideally grouping the Twitter and Reddit messages into one topic so that Spark always receives data from both sources for a given interval at once.
Is there any way to do that?
@Sebastian What you describe is basic NiFi routing. You would just route both Twitter and Reddit to the same downstream Kafka producer and the same topic. After you get data into NiFi from each service, run it through UpdateAttribute and set the attribute topicName to what you want for each source. If there are additional steps per data source, do them after UpdateAttribute and before PublishKafka.
If you build all the upstream routes as above, you can route all the different data sources to a single PublishKafka processor that uses ${topicName} dynamically.

Periodic refresh of static data in Structured Streaming and Stateful Streaming

I am trying to implement 5-minute batch monitoring using Spark Structured Streaming. The job reads from Kafka, looks up two different static datasets (one huge, one smaller) as part of the ETL logic, and calls a REST API to send the final results to an external application (out of billions of records from Kafka, fewer than 100 will go out to the REST API after ETL).
How can the static lookups be refreshed without restarting the whole streaming application? (A StreamingQueryListener registered through StreamingQueryManager.addListener, with our own logic to refresh or recreate the static DataFrame around StreamingQuery.awaitTermination? Or persist and unpersist the cache? Or any better ideas?)
Note: I went through the article below, but I am not sure whether HBase is the better option, as it's an old one.
https://medium.com/@anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc
Once a record has been enriched with lookup information and some rules/conditions have been applied, we need to keep track of it and send updates via the REST API until it completes its event lifecycle according to custom logic. I am hoping a flatMapGroupsWithState implementation helps here to keep track of event state. Please suggest the best options.
Managing group state in HDFS vs. using HBase: please suggest the best option from an operationalization and monitoring point of view in a production environment where the support team has minimal knowledge of Spark. If we use HDFS for state maintenance, how do we keep event state tracking consistent when the REST API fails to send updates to the end user/system?
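One of the options raised above, reloading the static lookup after a fixed age instead of restarting the application, can be sketched in plain Python. This is a minimal sketch, not a Spark API: RefreshingLookup and its loader argument are hypothetical names, and in a real job the loader would presumably re-read the static dataset while the driver calls get() once per micro-batch.

```python
import time


class RefreshingLookup:
    """Hold a static lookup and reload it once it is older than max_age_seconds."""

    def __init__(self, loader, max_age_seconds, clock=time.monotonic):
        self._loader = loader            # callable that returns a fresh lookup
        self._max_age = max_age_seconds
        self._clock = clock              # injectable so the logic can be tested
        self._value = None
        self._loaded_at = None

    def get(self):
        """Return the cached lookup, reloading it first if it has expired."""
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._max_age:
            self._value = self._loader()
            self._loaded_at = now
        return self._value
```

The reload happens lazily on the next get() after expiry, so a batch never sees a half-loaded lookup; the trade-off is that the batch that triggers the reload pays the full load time.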

How to implement Change Data Capture (CDC) using apache spark and kafka?

I am using spark-sql-2.4.1v with Java 1.8, and the Kafka artifacts spark-sql-kafka-0-10_2.11_2.4.3 and kafka-clients_0.10.0.0.
I need to join streaming data with metadata that is stored in RDS.
But the RDS metadata can be added to or changed.
If I read and load the RDS table data once in the application, it will become stale for joining with the streaming data.
I understand that I need to use Change Data Capture (CDC).
How can I implement Change Data Capture (CDC) in my scenario?
Any clues or a sample way to implement Change Data Capture (CDC)?
Thanks a lot.
You can stream a database into Kafka so that the contents of a table, plus every subsequent change, are available on a Kafka topic. From there the data can be used in stream processing.
You can do CDC in two different ways:
- Query-based: poll the database for changes, using the Kafka Connect JDBC Source.
- Log-based: extract changes from the database's transaction log, using e.g. Debezium.
For more details and examples see http://rmoff.dev/ksny19-no-more-silos
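To illustrate the query-based approach only, here is a minimal sketch of the polling idea, using an in-memory SQLite table in place of RDS. The table name metadata, the updated_at column, and poll_changes are all hypothetical; this is not how the Kafka Connect JDBC Source is implemented, just the same high-water-mark technique in miniature.

```python
import sqlite3


def poll_changes(conn, last_seen):
    """Return rows modified after last_seen, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM metadata "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark


# Hypothetical metadata table with a modification timestamp column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.execute("INSERT INTO metadata VALUES (1, 'a', 10), (2, 'b', 20)")

changes, mark = poll_changes(conn, 0)        # first poll sees both rows
conn.execute("UPDATE metadata SET name = 'b2', updated_at = 30 WHERE id = 2")
changes2, mark2 = poll_changes(conn, mark)   # second poll sees only the update
```

Polling like this cannot see hard deletes and only observes the latest state of a row between polls, which is the usual reason for preferring the log-based approach when every change matters.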

How to link logstash output to spark input

I am processing some logs. I am using Logstash to read them from log files and filter them before pushing them to an Elasticsearch database.
However, I would like to enrich the log information with some data that I am storing in a PostgreSQL database, so I am thinking of using Spark in between.
Is it possible to feed the Logstash output to Spark, enrich my data there, and then push it to Elasticsearch?
Any help is appreciated.
Use Logstash's Kafka output plugin, read the data from Kafka into Spark with the Spark Kafka receiver, and enrich it there. After enrichment you can index the documents through the Elasticsearch REST API, either in bulk or one document at a time.
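For the last step, the Elasticsearch _bulk endpoint expects a newline-delimited body in which each document is preceded by an action line. A minimal sketch of building that body follows; build_bulk_body and the index name "logs" are hypothetical, and the HTTP call itself is left out.

```python
import json


def build_bulk_body(index_name, documents):
    """Serialize documents into the newline-delimited body of a _bulk request."""
    lines = []
    for doc in documents:
        # Action line: index the following document into index_name.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the enriched document itself.
        lines.append(json.dumps(doc))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


body = build_bulk_body("logs", [{"msg": "a"}, {"msg": "b"}])
```

The resulting string would be sent with POST to /_bulk using the Content-Type application/x-ndjson header, typically once per partition of enriched records rather than once per record.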

App server Log process

My client requires me to process the application server (Tomcat) log files for a back-end REST-based app server that is deployed on a cluster. The client wants to generate "access" and "frequency" reports from that data with different parameters.
My initial plan is: get the data from the app server logs --> push it to Spark Streaming through Kafka and process it --> store the results in Hive --> use Zeppelin to read back the processed, centralized log data and generate reports as the client requires.
But as far as I know, Kafka has no feature that can read data from a log file and post it to a Kafka broker on its own. In that case we would have to write a scheduled job that reads the log from time to time and sends it to the Kafka broker, which I would prefer not to do: it would not be real time, and there could be synchronization issues to worry about, since we have 4 instances of the application server.
Another option we have in this case, I think, is Apache Flume.
Can anyone suggest which approach would be better in this case, or whether Kafka has any way to read data from a log file on its own, and what advantages or disadvantages we can expect in the future in both cases?
I guess another option is Flume + Kafka together, but I cannot speculate much about what would happen, as I have almost no knowledge of Flume.
Any help will be highly appreciated...... :-)
Thanks a lot ....
You can use Kafka Connect (file source connector) to read the Tomcat log files and push them to Kafka. Spark Streaming can then consume from the Kafka topics and process the data.
tomcat -> logs ---> kafka connect -> kafka -> spark -> Hive
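As a rough illustration of the processing step, the "frequency" report mentioned in the question could start from something like the sketch below. It assumes the Tomcat access log is written in Common Log Format, which the question does not state, and count_requests is a hypothetical helper; in the pipeline above the same parsing would run inside the Spark job.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "request" status bytes
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)$'
)


def count_requests(lines):
    """Count requests per (method, path), skipping lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match.group("method"), match.group("path"))] += 1
    return counts
```

Lines that do not match the pattern are skipped silently here; a production job would more likely route them to a separate output so malformed entries can be inspected.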
