I have checked the Hadoop documentation but was not able to find anything that fetches the number of completed batches of Spark jobs.
Is anyone aware of an API that could help me?
Small question regarding an integration between Splunk and Apache Spark.
Currently, I run a search query in Splunk. The result is quite big, and I export it as a CSV to share with several teams for downstream work.
Each downstream team ends up loading the CSV as part of an Apache Spark job, converting it to a Dataset, and doing map-reduce work on it.
The Spark jobs from each team are different, so simply plugging every team's computation into Splunk directly is not really scalable.
This leads us to ask: instead of each team having to download a copy of the CSV, is there an API, or some other way, to connect to a Splunk search result from Apache Spark directly?
Thank you
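For reference, the per-team loading step described above currently looks roughly like this in the Spark Java API (the file path is illustrative):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class LoadSplunkCsv {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("SplunkCsvJob")
                    .getOrCreate();

            // Illustrative path: each team currently points this at its own copy of the exported CSV.
            Dataset<Row> results = spark.read()
                    .option("header", "true")
                    .csv("/data/splunk_export.csv");

            // Team-specific transformations / aggregations follow here.
            results.show(10);

            spark.stop();
        }
    }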
Splunk does not have an API specifically for Spark. There is a REST API, a few SDKs, and (perhaps best for you) support for ODBC. With an ODBC/JDBC driver installed on your Spark server and a few saved searches defined on Splunk, you should be able to export results from Splunk to Spark for analysis. See https://www.cdata.com/kb/tech/splunk-jdbc-apache-spark.rst for more information.
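As a rough sketch only: with a Splunk JDBC driver on the classpath, reading a saved search into Spark could look like the code below. The driver class name, JDBC URL format, and saved-search name are assumptions about whichever driver you install, so check its documentation for the real values.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SplunkJdbcRead {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("SplunkJdbcRead")
                    .getOrCreate();

            // All values below are placeholders; consult your JDBC driver's
            // documentation for the real driver class, URL format, and table naming.
            Dataset<Row> searchResults = spark.read()
                    .format("jdbc")
                    .option("driver", "cdata.jdbc.splunk.SplunkDriver")           // assumed driver class
                    .option("url", "jdbc:splunk:URL=https://splunk-host:8089;User=admin;Password=***") // assumed URL format
                    .option("dbtable", "MySavedSearch")                           // assumed saved-search name
                    .load();

            searchResults.show(10);
            spark.stop();
        }
    }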
I want to get the lineage information of a Spark job in the same way Spark shows the lineage when we use toDebugString().
I don't want to use the web UI to see the DAG; I would like the same information somewhere else, for example in a log file.
I am querying the Spark REST API (http://:18080/api/v1) to get some metrics, and I would like to get hold of the Spark job's lineage information, but there is no API for it.
Does Spark log the lineage information somewhere?
Is there a way to get hold of the lineage information of a completed or failed job in something like the toDebugString() format, apart from using the web UI?
I am using the Spark Java API.
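For context, this is roughly how the lineage string can be captured from the Java API while the job is running; the RDD pipeline and output path below are illustrative:

    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LogLineage {
        public static void main(String[] args) throws IOException {
            SparkConf conf = new SparkConf().setAppName("LogLineage");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Illustrative pipeline; the real job builds a much larger DAG.
                JavaRDD<Integer> filtered = sc.parallelize(Arrays.asList(1, 2, 3))
                        .map(x -> x * 2)
                        .filter(x -> x > 2);

                // toDebugString() returns the lineage of the RDD as text;
                // here it is written to a local file instead of being viewed in the web UI.
                try (FileWriter out = new FileWriter("/tmp/lineage.log")) {
                    out.write(filtered.toDebugString());
                }
            }
        }
    }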
How can I see the number of executors and partitions for a Spark ingestion job in Cloudera Manager? I just can't seem to find this information.
I looked under YARN > Applications and found the job, but did not find the information; I also looked in the log files.
I also looked in Hue under the workflows for the specific job and just couldn't find it. Maybe I'm missing it somewhere.
Thanks
You can find all the details in the Spark History Server web UI, e.g. www.hostname:18088/history
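The same History Server also exposes this over REST under /api/v1. Assuming the host above and a known application ID (both placeholders below), the /executors endpoint lists the executors and the /stages endpoint shows each stage's number of tasks, which corresponds to the number of partitions:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class HistoryRestCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical host and application ID; replace with your own.
            String base = "http://www.hostname:18088/api/v1/applications/application_1234_0001";

            // /executors lists every executor for the application;
            // /stages lists each stage with its number of tasks (one task per partition).
            for (String path : new String[] {"/executors", "/stages"}) {
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(new URL(base + path).openStream()))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        System.out.println(line);
                    }
                }
            }
        }
    }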
Can anyone please let me know how to delete completed Spark applications from the Spark master URL?
More properties for the History Server cleaner can be found here: https://spark.apache.org/docs/1.6.2/monitoring.html#viewing-after-the-fact
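For example, the cleaner itself is controlled by a few History Server properties in spark-defaults.conf (the values below are illustrative; see the linked page for the defaults):

    # Illustrative values: remove event logs (and thus completed applications
    # shown by the History Server) once they are older than maxAge.
    spark.history.fs.cleaner.enabled   true
    spark.history.fs.cleaner.interval  1d
    spark.history.fs.cleaner.maxAge    7d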
I have a Hadoop job running on HDInsight that sources data from Azure DocumentDB. This job runs once a day, and as new data comes into DocumentDB every day, my Hadoop job filters out old records and only processes the new ones (this is done by storing a timestamp somewhere). However, if new records come in while the Hadoop job is running, I don't know what happens to them. Are they fed to the running job or not? What role does the throttling mechanism in DocumentDB play here?
"If new records come in while the Hadoop job is running, I don't know what happens to them. Are they fed to the running job or not?"
The answer to this depends on what phase or step the Hadoop job is in. Data gets pulled once at the beginning. Documents added while data is being pulled will be included in the Hadoop job results. Documents added after data has finished being pulled will not be included in the Hadoop job results.
Note: ORDER BY _ts is needed for consistent behavior, as the Hadoop job simply follows the continuation token when paging through query results.
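In other words, the query feeding the job would have roughly this shape; the timestamp filter stands in for the stored timestamp mentioned in the question, and the literal value is only a placeholder:

    SELECT * FROM c WHERE c._ts > 1490000000 ORDER BY c._ts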
"How the throttling mechanism in DocumentDB play roles here?"
The DocumentDB Hadoop connector will automatically retry when throttled.