How to link logstash output to spark input - apache-spark

I am processing some logs. I am using Logstash to read the logs from log files and filter them before pushing them to an Elasticsearch DB.
However, I would like to enrich the log information with some data that I am storing in a Postgres DB, so I am thinking of using Spark in between.
Is it possible to feed Logstash output to Spark, enrich my data there, and then push it to Elasticsearch?
Any help is appreciated.

Use Logstash's Kafka output plugin, read the data from Kafka into a Spark Kafka receiver, and enrich your data there. After enrichment you can index the documents into Elasticsearch through its REST API, either with the bulk endpoint or one document at a time.
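A minimal PySpark Structured Streaming sketch of that pipeline, assuming a Kafka topic named logs, a Postgres lookup table named users joined on a user_id field, and an Elasticsearch index named enriched-logs; all hosts, ports, topic, table, column, and index names are placeholders:

```python
# Sketch: Kafka (from Logstash) -> Spark -> enrich from Postgres -> Elasticsearch.
# Broker, topic, table, column, and index names below are placeholders.
import json
import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType

spark = SparkSession.builder.appName("logstash-enrich").getOrCreate()

log_schema = StructType().add("message", StringType()).add("user_id", StringType())

# 1) Read the log events that Logstash published to Kafka
logs = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")
        .option("subscribe", "logs")
        .load()
        .select(from_json(col("value").cast("string"), log_schema).alias("log"))
        .select("log.*"))

# 2) Lookup data from Postgres; for frequently changing tables consider
#    re-reading it periodically or using CDC instead of a one-off read
users = (spark.read.format("jdbc")
         .option("url", "jdbc:postgresql://postgres:5432/mydb")
         .option("dbtable", "users")
         .option("user", "spark")
         .option("password", "secret")
         .load())

enriched = logs.join(users, "user_id", "left")

# 3) Push each micro-batch to Elasticsearch through the bulk REST API
def index_batch(batch_df, batch_id):
    actions = []
    for doc in batch_df.toJSON().collect():   # collects to the driver; fine for modest batches
        actions.append(json.dumps({"index": {"_index": "enriched-logs"}}))
        actions.append(doc)
    if actions:
        requests.post("http://elasticsearch:9200/_bulk",
                      data="\n".join(actions) + "\n",
                      headers={"Content-Type": "application/x-ndjson"})

(enriched.writeStream
 .foreachBatch(index_batch)
 .option("checkpointLocation", "/tmp/checkpoints/logstash-enrich")
 .start()
 .awaitTermination())
```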

Related

Read from multiple sources using Nifi, group topics in Kafka and subscribe with Spark

We use Apache NiFi to get data from multiple sources like Twitter and Reddit at a specific interval (for example 30 s). Then we would like to send it to Apache Kafka, and it should probably somehow group both the Twitter and Reddit messages into one topic so that Spark would always receive the data from both sources for a given interval at once.
Is there any way to do that?
@Sebastian What you describe is basic NiFi routing. You would just route both Twitter and Reddit to the same downstream Kafka producer and the same topic. After you get data into NiFi from each service, run it through UpdateAttribute and set the attribute topicName to whatever you want for each source. If there are additional steps per data source, do them after UpdateAttribute and before PublishKafka.
If you configure all the upstream routes as above, you can route all the different data sources to a single PublishKafka processor and reference the topic as ${topicName} dynamically.
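On the Spark side, a minimal Structured Streaming sketch of subscribing to that single merged topic; the broker address, topic name, and 30-second trigger are assumptions:

```python
# Sketch: subscribe to the single merged topic that NiFi publishes to.
# Broker address and topic name are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nifi-merged-feed").getOrCreate()

feed = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")
        .option("subscribe", "social-feed")   # one topic carrying both Twitter and Reddit
        .load()
        # the Kafka key (or a header set in NiFi) can carry the original source
        .selectExpr("CAST(key AS STRING) AS source", "CAST(value AS STRING) AS payload"))

(feed.writeStream
 .format("console")                       # replace with your real sink
 .trigger(processingTime="30 seconds")    # matches the 30 s collection interval
 .start()
 .awaitTermination())
```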

How to implement Change Data Capture (CDC) using apache spark and kafka?

I am using spark-sql-2.4.1v with Java 1.8, and the Kafka versions spark-sql-kafka-0-10_2.11_2.4.3 and kafka-clients_0.10.0.0.
I need to join streaming data with metadata that is stored in RDS, but the RDS metadata can be added to or changed.
If I read and load the RDS table data in the application, it would become stale for joining with the streaming data.
I understand that I need to use Change Data Capture (CDC).
How can I implement Change Data Capture (CDC) in my scenario?
Any clues or a sample way to implement Change Data Capture (CDC)?
Thanks a lot.
You can stream a database into Kafka so that the contents of a table plus every subsequent change is available on a Kafka topic. From here it can be used in stream processing.
You can do CDC in two different ways:
Query-based: poll the database for changes, using Kafka Connect JDBC Source
Log-based: extract changes from the database's transaction log using e.g. Debezium
For more details and examples see http://rmoff.dev/ksny19-no-more-silos
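On the consuming side, a minimal PySpark sketch of joining the event stream with a CDC topic produced by either of those approaches; the topic names, broker address, schemas, and join key are all placeholders:

```python
# Sketch: join an event stream with a CDC stream of the RDS metadata table.
# Topic names, broker address, schemas, and the join key are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType

spark = SparkSession.builder.appName("cdc-join").getOrCreate()

event_schema = StructType().add("id", StringType()).add("value", StringType())
meta_schema = StructType().add("id", StringType()).add("label", StringType())

def kafka_stream(topic, schema):
    """Read a Kafka topic as a streaming DataFrame with a JSON value payload."""
    return (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "kafka:9092")
            .option("subscribe", topic)
            .load()
            .select(from_json(col("value").cast("string"), schema).alias("d"))
            .select("d.*"))

events = kafka_stream("events", event_schema)
metadata = kafka_stream("rds.metadata", meta_schema)   # CDC changes land here

# Inner stream-stream join on the metadata key. Without watermarks Spark keeps
# all join state, so in practice add watermarks or compact to the latest change per key.
enriched = events.join(metadata, "id")

(enriched.writeStream
 .format("console")            # replace with your real sink
 .outputMode("append")
 .option("checkpointLocation", "/tmp/checkpoints/cdc-join")
 .start()
 .awaitTermination())
```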

Best way to send Spark dataframe as JSON body over HTTP POST

I have a Spark dataframe that I need to send as the body of an HTTP POST request. The storage system is Apache Solr. We create the Spark dataframe by reading a Solr collection. I can use the Jackson library to create the JSON and send it over HTTP POST. Also, the dataframe may have millions of records, so the preferred way is to send them in batches over HTTP.
Below are the two approaches I can think of.
1) We can use the foreach/foreachPartition operations of the Spark dataframe and call HTTP POST, which means the HTTP call will happen within each executor (if I am not wrong). Is this approach right? Also, it means that if I have 3 executors then there will be 3 HTTP calls that we can make in parallel. Right? But won't opening and closing the HTTP connection so many times cause issues?
2) After getting the Spark dataframe, we can save it into some other Solr collection (using Spark), then read the data from that collection in batches using the Solr API (using the rows and start parameters), create JSON out of it, and send it over HTTP requests.
I would like to know which of the above two approaches is preferred.
Out of your two approaches the second one is best, since SolrJ gives you paging:
1) Save your dataframe as Solr documents, with indexes.
2) Use the SolrJ API, which will interact with your Solr collections and return Solr documents based on your criteria.
3) Convert them to JSON using any parser and present them to UIs or user queries.
In fact this is not a new approach; people who use HBase with Solr do it the same way (since querying HBase is really slow compared to querying Solr collections): each HBase table is a Solr collection that can be queried via SolrJ and presented to dashboards such as AngularJS ones. Roughly:
hbase table -> solr collection -> solrj query -> json -> dashboard
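A small Python sketch of that paging loop, using the plain Solr REST select handler rather than SolrJ (the Java equivalent of the same rows/start paging) and posting each page as a JSON body; the Solr host, collection name, target endpoint, and page size are placeholders:

```python
# Sketch: page through a Solr collection with rows/start and POST each page as JSON.
# The Solr host, collection name, target endpoint, and page size are placeholders.
import requests

SOLR_SELECT = "http://solr:8983/solr/my_collection/select"
TARGET_URL = "http://downstream-service/ingest"
PAGE_SIZE = 500

start = 0
with requests.Session() as session:          # reuse one HTTP connection
    while True:
        resp = session.get(SOLR_SELECT, params={
            "q": "*:*",
            "rows": PAGE_SIZE,
            "start": start,
            "wt": "json",
        })
        resp.raise_for_status()
        docs = resp.json()["response"]["docs"]
        if not docs:
            break
        # Send this page of documents as the JSON body of one HTTP POST
        session.post(TARGET_URL, json=docs).raise_for_status()
        start += PAGE_SIZE
```

For millions of documents, Solr's cursorMark-based paging is usually preferable to large start offsets, which get slower as the offset grows.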

App server Log process

I have a requirement from my client to process the application (Tomcat) server log files for a back-end REST-based app server which is deployed on a cluster. The client wants to generate "access" and "frequency" reports from that data with different parameters.
My initial plan is to get the data from the app server logs --> push it to Spark Streaming using Kafka and process the data --> store the data in Hive --> use Zeppelin to get back the processed and centralized log data and generate reports as per the client's requirements.
But as far as I know, Kafka does not have any feature that can read data from a log file and post it to a Kafka broker on its own. In that case we would have to write a scheduler job which reads the log from time to time and sends it to a Kafka broker, which I would prefer not to do, as it would not be real time and there could be synchronization issues we would have to worry about, since we have 4 instances of the application server.
Another option, I think, is Apache Flume.
Can anyone suggest which would be the better approach in this case, or whether Kafka has any way to read data from a log file on its own, and what advantages or disadvantages each option has?
I guess another option is Flume + Kafka together, but I cannot speculate much on what would happen, as I have almost no knowledge of Flume.
Any help will be highly appreciated...... :-)
Thanks a lot ....
You can use Kafka Connect (the file source connector) to read/consume the Tomcat log files and push them to Kafka. Spark Streaming can then consume from the Kafka topics and churn through the data:
tomcat -> logs ---> kafka connect -> kafka -> spark -> Hive
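A minimal PySpark Structured Streaming sketch of the spark -> Hive leg of that pipeline, assuming the connector publishes the raw log lines to a topic named tomcat-logs and the SparkSession is created with Hive support; the topic, broker, checkpoint path, database, and table names are placeholders:

```python
# Sketch: consume Tomcat log lines from Kafka and append them to a Hive table.
# Topic, broker, checkpoint path, database, and table names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("tomcat-log-pipeline")
         .enableHiveSupport()
         .getOrCreate())

lines = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "kafka:9092")
         .option("subscribe", "tomcat-logs")
         .load()
         .select(col("value").cast("string").alias("raw_line"),
                 col("timestamp").alias("ingest_time")))

def to_hive(batch_df, batch_id):
    # Parse raw_line into the access/frequency fields here before writing.
    batch_df.write.mode("append").saveAsTable("logs_db.tomcat_access_logs")

(lines.writeStream
 .foreachBatch(to_hive)
 .option("checkpointLocation", "/tmp/checkpoints/tomcat-logs")
 .start()
 .awaitTermination())
```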

Pyspark Streaming - How to set up custom logging?

I have a pyspark streaming application that runs on yarn in a Hadoop cluster. The streaming application reads from a Kafka queue every n seconds and makes a REST call.
I have a logging service in place to provide an easy way to collect and store data, send data to Logstash and visualize data in Kibana. The data needs to conform to a template (JSON with specific keys) provided by this service.
I want to send logs from the streaming application to Logstash using this service. For this, I need to do three things:
- Collect some data while the streaming app is reading from Kafka and making the REST call.
- Format it according to the logging service template.
- Forward the log to the Logstash host.
Any guidance related to this would be very helpful.
Thanks!
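A minimal sketch using only the Python standard library: a custom Formatter that renders records as JSON (the template keys shown are placeholders for whatever your service requires) and a handler that ships each record as a JSON line to Logstash over TCP, assuming a Logstash tcp input with a json_lines codec:

```python
# Sketch: custom Python logging that formats records as the service's JSON
# template and ships them to Logstash over TCP as newline-delimited JSON.
# The template keys, host, and port are placeholders.
import json
import logging
import socket


class TemplateJsonFormatter(logging.Formatter):
    """Render log records as the JSON structure the logging service expects."""
    def format(self, record):
        payload = {
            "app": "pyspark-streaming-job",         # placeholder template keys
            "level": record.levelname,
            "message": record.getMessage(),
            "timestamp": self.formatTime(record),
            "extra": getattr(record, "extra", {}),  # e.g. Kafka offsets, REST status
        }
        return json.dumps(payload)


class LogstashTcpHandler(logging.Handler):
    """Send one JSON line per record to a Logstash TCP input."""
    def __init__(self, host, port):
        super().__init__()
        self.addr = (host, port)

    def emit(self, record):
        try:
            with socket.create_connection(self.addr, timeout=5) as sock:
                sock.sendall((self.format(record) + "\n").encode("utf-8"))
        except OSError:
            self.handleError(record)


logger = logging.getLogger("streaming_app")
logger.setLevel(logging.INFO)
handler = LogstashTcpHandler("logstash-host", 5000)   # placeholder host/port
handler.setFormatter(TemplateJsonFormatter())
logger.addHandler(handler)

# Inside the micro-batch / RDD processing code:
logger.info("processed batch", extra={"extra": {"records": 1234, "rest_status": 200}})
```

Note that anything logged inside RDD/DataFrame operations runs on the executors, so this logging setup has to be applied there as well (for example at the top of the function passed to foreachPartition), not only in the driver.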
