Apache Livy REST API (/batches/) - How to return data back to the client - apache-spark

We are using Apache Livy 0.6.0-incubating and calling a custom Spark JAR through the /batches/ endpoint of its REST API.
The custom Spark code reads data from HDFS and does some processing. The job completes successfully and the REST response also reports 'SUCCESS'. We want the data to be returned to the client, the way the /sessions/ API returns data. Is there a way to do this?
Note: the /sessions/ API can only accept Spark Scala code.

I have a similar setup. I return the data by writing the Spark result to HDFS, and when I receive a SUCCESS I read the result from HDFS on the client machine.
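A minimal client-side sketch of that flow, assuming the job writes a single result file and that WebHDFS is enabled on the cluster. The host names, ports, and the `fetch_batch_result` helper are hypothetical; only the Livy `GET /batches/{id}/state` call and the WebHDFS `op=OPEN` operation come from the respective REST APIs.

```python
import json
import time
import urllib.request

LIVY_URL = "http://livy-host:8998"      # hypothetical Livy endpoint
WEBHDFS_URL = "http://namenode:9870"    # hypothetical WebHDFS endpoint
TERMINAL_STATES = {"success", "dead", "killed", "error"}


def http_get(url):
    """Fetch a URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def wait_for_batch(batch_id, fetch=http_get, poll_seconds=5.0, max_polls=120):
    """Poll GET /batches/{id}/state until the batch reaches a terminal state."""
    for _ in range(max_polls):
        body = fetch(f"{LIVY_URL}/batches/{batch_id}/state")
        state = json.loads(body)["state"].lower()
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"batch {batch_id} did not finish in time")


def read_result(hdfs_path, fetch=http_get):
    """Read one result file through WebHDFS (op=OPEN)."""
    return fetch(f"{WEBHDFS_URL}/webhdfs/v1{hdfs_path}?op=OPEN")


def fetch_batch_result(batch_id, hdfs_path, fetch=http_get, poll_seconds=5.0):
    """Wait for the batch, then return the bytes the job wrote to HDFS."""
    state = wait_for_batch(batch_id, fetch, poll_seconds)
    if state != "success":
        raise RuntimeError(f"batch {batch_id} ended in state {state!r}")
    return read_result(hdfs_path, fetch)
```

Spark normally writes a directory of part files, so in practice the client would list the directory and read each part, or the job would coalesce its output to one file first.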

Related

Periodic refresh of static data in Structured Streaming and Stateful Streaming

I am trying to implement 5-minute batch monitoring using Spark Structured Streaming. The job reads from Kafka, looks up two different static datasets (one huge, one smaller) as part of the ETL logic, and calls a REST API to send the final results to an external application (out of billions of records from Kafka, fewer than 100 will go out to the REST API after ETL).
How can I refresh the static lookups without restarting the whole streaming application? (Register a StreamingQueryListener via StreamingQueryManager.addListener and put our own logic for refreshing/recreating the static DataFrame around StreamingQuery.awaitTermination? Or use persist and unpersist on the cache? Or are there better ideas?)
Note: I went through the article below, but I am not sure whether HBase is the better option, as the article is an old one.
https://medium.com/#anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc
Once a record is enriched with lookup information and some rules/conditions have been applied, we need to keep track of it and send updates via the REST API until the event completes its lifecycle as per our custom logic. I am hoping a flatMapGroupsWithState implementation helps here to keep track of event state. Please suggest the best options.
Managing group state in HDFS vs. using HBase: please suggest the best option from an operationalization and monitoring point of view for a production environment where the support team has minimal knowledge of Spark. If we use HDFS for state maintenance, how do we keep the event state tracking consistent when the REST API fails to send updates to the end user/system?

How to implement Change Data Capture (CDC) using Apache Spark and Kafka?

I am using spark-sql 2.4.1 with Java 1.8, and the Kafka libraries spark-sql-kafka-0-10_2.11 (2.4.3) and kafka-clients (0.10.0.0).
I need to join streaming data with metadata stored in RDS, but the RDS metadata can be added to or changed.
If I read and load the RDS table data once in the application, it becomes stale for joining with the streaming data.
I understand that I need to use Change Data Capture (CDC).
How can I implement CDC in my scenario? Any clues or a sample implementation?
Thanks a lot.
You can stream a database into Kafka so that the contents of a table plus every subsequent change are available on a Kafka topic. From there they can be used in stream processing.
You can do CDC in two different ways:
- Query-based: poll the database for changes, using the Kafka Connect JDBC Source connector
- Log-based: extract changes from the database's transaction log, using e.g. Debezium
For more details and examples see http://rmoff.dev/ksny19-no-more-silos
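To make the query-based idea concrete, here is a small sketch of polling on an incrementing key. It is an illustration of the principle only, not the Kafka Connect JDBC connector itself; the `metadata` table, its columns, and the in-memory SQLite database are hypothetical stand-ins for the RDS table.

```python
import sqlite3


def poll_changes(conn, last_seen_id):
    """Return rows with id > last_seen_id plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name, value FROM metadata WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    new_mark = rows[-1][0] if rows else last_seen_id
    return rows, new_mark


def demo():
    """Run two polls against a hypothetical in-memory metadata table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata (id INTEGER PRIMARY KEY, name TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO metadata (name, value) VALUES (?, ?)",
        [("a", "1"), ("b", "2")],
    )
    first, mark = poll_changes(conn, 0)       # initial snapshot
    conn.execute("INSERT INTO metadata (name, value) VALUES ('c', '3')")
    second, mark = poll_changes(conn, mark)   # only the row added since
    conn.close()
    return first, second, mark
```

Polling on an incrementing key like this only notices inserted rows. Updates need a last-modified timestamp column as well, and deletes are not visible to polling at all, which is one reason to consider the log-based route.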

Best way to send Spark dataframe as JSON body over HTTP POST

I have a Spark DataFrame that I need to send as the body of an HTTP POST request. The storage system is Apache Solr, and we create the DataFrame by reading a Solr collection. I can use the Jackson library to create the JSON and send it over HTTP POST. The DataFrame may have millions of records, so the preferred way is to send them in batches over HTTP.
Below are the two approaches I can think of.
1) Use the foreach/foreachPartition operations of the Spark DataFrame and make the HTTP POST call there, which means the HTTP call happens within each executor (if I am not wrong). Is this approach right? It also means that with 3 executors we can make 3 HTTP calls in parallel, right? But won't opening and closing an HTTP connection so many times cause issues?
2) After getting the Spark DataFrame, save it in another Solr collection (using Spark), then read the data from that collection in batches using the Solr API (using the rows and start parameters), create JSON out of it, and send it over an HTTP request.
I would like to know which of the above two approaches is preferred.
Of your two approaches, the second is the better one, since SolrJ gives you a paging feature:
1) Save your DataFrame as Solr documents with indexes.
2) Use SolrJ, the API that interacts with your Solr collections and returns Solr documents based on your criteria.
3) Convert them into JSON using any parser and present them in UIs or to user queries.
In fact this is not a new approach. People who use HBase with Solr do it the same way (since querying HBase is really slow compared to querying Solr collections): each HBase table is a Solr collection that can be queried via SolrJ and presented in dashboards built with, for example, AngularJS.
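The paging loop behind the second approach could look roughly like the sketch below. Both `fetch_page` and `post` are hypothetical callables: the first stands in for a Solr query that takes the `start` and `rows` parameters, the second for whatever HTTP client sends the JSON body.

```python
import json


def iter_pages(fetch_page, rows=1000):
    """Yield pages of documents, advancing start until a page is empty."""
    start = 0
    while True:
        docs = fetch_page(start, rows)
        if not docs:
            return
        yield docs
        start += len(docs)


def send_in_batches(fetch_page, post, rows=1000):
    """POST each page as one JSON array and return the number of docs sent."""
    sent = 0
    for docs in iter_pages(fetch_page, rows):
        post(json.dumps(docs))
        sent += len(docs)
    return sent
```

With millions of documents, large `start` offsets tend to get slow in Solr, so cursor-based paging (`cursorMark`) may be worth a look instead of plain `start`/`rows`.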

App server Log process

I have a requirement from my client to process the application server (Tomcat) log files for a back-end REST-based app server deployed on a cluster. The client wants to generate "access" and "frequency" reports from that data with different parameters.
My initial plan is: get the data from the app server logs --> push it to Spark Streaming using Kafka and process it --> store the data in Hive --> use Zeppelin to read back the processed, centralized log data and generate reports as per the client's requirements.
But as far as I know, Kafka does not have any feature that can read data from a log file and post it to a Kafka broker on its own. In that case we would have to write a scheduled job that reads the log from time to time and sends it to the Kafka broker. I would prefer not to do that, because it would not be real time and there could be synchronization issues to worry about, as we have 4 instances of the application server.
Another option I think we have in this case is Apache Flume.
Can anyone suggest which would be the better approach in this case, or whether Kafka has any mechanism to read data from a log file on its own, and what advantages or disadvantages we may face in the future in both cases?
I guess another option is Flume + Kafka together, but I cannot speculate much about what will happen as I have almost no knowledge of Flume.
Any help will be highly appreciated. :-)
Thanks a lot.
You can use Kafka Connect (file source connector) to read the Tomcat log files and push them to Kafka. Spark Streaming can then consume from the Kafka topics and process the data.
tomcat -> logs ---> kafka connect -> kafka -> spark -> Hive
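As a rough sketch of the first hop in that pipeline, the snippet below builds the request that registers a file source connector through the Kafka Connect REST API (`POST /connectors`). The connector name, log path, topic, and Connect URL are hypothetical placeholders; the connector class and the `file`, `topic`, and `tasks.max` settings belong to the file source connector that ships with Kafka.

```python
import json
import urllib.request


def file_source_connector(name, log_path, topic):
    """Describe a connector that tails one log file into one Kafka topic."""
    return {
        "name": name,
        "config": {
            "connector.class":
                "org.apache.kafka.connect.file.FileStreamSourceConnector",
            "tasks.max": "1",
            "file": log_path,   # hypothetical path to a Tomcat log file
            "topic": topic,     # hypothetical target topic
        },
    }


def build_register_request(connect_url, connector):
    """Build (but do not send) the POST /connectors request."""
    return urllib.request.Request(
        f"{connect_url}/connectors",
        data=json.dumps(connector).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it would be `urllib.request.urlopen(build_register_request(...))` against a running Connect worker. Each connector instance reads a single file, so with four application server instances you would presumably register one connector per log file.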

Pyspark Streaming - How to set up custom logging?

I have a PySpark streaming application that runs on YARN in a Hadoop cluster. The streaming application reads from a Kafka queue every n seconds and makes a REST call.
I have a logging service in place that provides an easy way to collect and store data, send data to Logstash, and visualize data in Kibana. The data needs to conform to a template (JSON with specific keys) provided by this service.
I want to send logs from the streaming application to Logstash using this service. For this, I need to do three things:
- Collect some data while the streaming app is reading from Kafka and making the REST call.
- Format it according to the logging service template.
- Forward the log to the Logstash host.
Any guidance related to this would be very helpful.
Thanks!
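One possible starting point for the three steps above, using only Python's standard `logging` module. The template keys (`timestamp`, `level`, `logger`, `message`, `details`) are made up for illustration, since the real template is not shown, and the handler assumes a Logstash TCP input that accepts one JSON document per line.

```python
import json
import logging
import socket


class TemplateFormatter(logging.Formatter):
    """Render each record as one JSON line with a fixed set of keys."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Extra data collected by the app, passed via extra={"details": ...}
            "details": getattr(record, "details", {}),
        }
        return json.dumps(payload)


class TcpLineHandler(logging.Handler):
    """Send each formatted record as a newline-terminated line over TCP."""

    def __init__(self, host, port):
        super().__init__()
        self.address = (host, port)  # hypothetical Logstash host and port

    def emit(self, record):
        try:
            line = self.format(record) + "\n"
            # One connection per record keeps the sketch short, not fast.
            with socket.create_connection(self.address, timeout=5) as sock:
                sock.sendall(line.encode("utf-8"))
        except Exception:
            self.handleError(record)


def build_logger(name, handler):
    """Attach the JSON template formatter and the given handler to a logger."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler.setFormatter(TemplateFormatter())
    logger.addHandler(handler)
    return logger
```

Usage would be along the lines of `log = build_logger("stream", TcpLineHandler("logstash-host", 5000))` followed by `log.info("REST call finished", extra={"details": {"status": 200}})`. This configures logging in the process where it runs, so code that executes on the executors would need to set up its own logger there.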
