spark-submit python app with named argument via spark REST api - apache-spark

I am using the following as the body of a request to Spark's native REST API to run a Python app with a named argument, --filename, via spark-submit. The driver is created successfully, but the execution fails with very little information. Is the --filename argument causing this? Any ideas where or how to get more information about the failure?
{ "action": "CreateSubmissionRequest",
"appArgs": [
"/opt/bitnami/spark/hostfiles/bronze.py --filename 'filename.json'"
],
"appResource": "file:/opt/bitnami/spark/hostfiles/bronze.py",
"clientSparkVersion": "3.2.0",
"environmentVariables": {
"SPARK_ENV_LOADED": "1"
},
"mainClass": "org.apache.spark.deploy.SparkSubmit",
"sparkProperties": {
"spark.driver.supervise": "false",
"spark.app.name": "Spark REST API - Bronze Load",
"spark.submit.deployMode": "client",
"spark.master": "spark://spark:6066",
"spark.eventLog.enabled":"true"
}
}
These are the logs from the worker container for the driver.
> 22/01/18 23:37:48 INFO Worker: Asked to launch driver driver-20220118233748-0006
> 22/01/18 23:37:48 INFO DriverRunner: Copying user jar file:/opt/bitnami/spark/hostfiles/bronze.py to /opt/bitnami/spark/work/driver-20220118233748-0006/bronze.py
> 22/01/18 23:37:48 INFO Utils: Copying /opt/bitnami/spark/hostfiles/bronze.py to /opt/bitnami/spark/work/driver-20220118233748-0006/bronze.py
> 22/01/18 23:37:48 INFO DriverRunner: Launch Command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx1024M" "-Dspark.eventLog.enabled=true" "-Dspark.app.name=Spark REST API - Bronze Load" "-Dspark.driver.supervise=false" "-Dspark.master=spark://spark:7077" "-Dspark.submit.deployMode=client" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker@192.168.32.4:35803" "/opt/bitnami/spark/work/driver-20220118233748-0006/bronze.py" "org.apache.spark.deploy.SparkSubmit" "/opt/bitnami/spark/hostfiles/bronze.py --filename 'filename.json'"
> 22/01/18 23:37:51 WARN Worker: Driver driver-20220118233748-0006 exited with failure
Meanwhile, running spark-submit directly executes the app successfully.
bin/spark-submit --deploy-mode "client" file:///opt/bitnami/spark/hostfiles/bronze.py --filename "filename.json"

The following change to the body of the API request, combined with reading the argument value in the Python script without requiring '--filename', results in the Python app being executed.
The updated appArgs and appResource use file:/// to reference the Python app, and each argument is a separate element of appArgs.
"appArgs": [
"file:///opt/bitnami/spark/hostfiles/bronze.py", "filename.json"
],
"appResource": "file:///opt/bitnami/spark/hostfiles/bronze.py",
"clientSparkVersion": "3.2.0",
"environmentVariables": {
"SPARK_ENV_LOADED": "1"
},
"mainClass": "org.apache.spark.deploy.SparkSubmit",
"sparkProperties": {
"spark.driver.supervise": "false",
"spark.app.name": "Spark REST API - Bronze Load",
"spark.submit.deployMode": "client",
"spark.master": "spark://spark:7077",
"spark.eventLog.enabled":"true"
}
}
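As a minimal sketch of the working request body above, the snippet below builds the same structure in Python so that every argument ends up as its own element of appArgs instead of one space-separated string. The helper name build_submission and its parameters are hypothetical, and the values are copied from the request shown above. Whether passing "--filename" and its value as two separate elements would also work is not confirmed in this thread.

```python
import json


def build_submission(app_path, app_args, master="spark://spark:7077"):
    """Build a CreateSubmissionRequest body (hypothetical helper).

    Each application argument becomes a separate appArgs element,
    mirroring the request body that worked in the thread.
    """
    resource = "file://" + app_path  # "/opt/..." becomes "file:///opt/..."
    return {
        "action": "CreateSubmissionRequest",
        "appArgs": [resource, *app_args],
        "appResource": resource,
        "clientSparkVersion": "3.2.0",
        "environmentVariables": {"SPARK_ENV_LOADED": "1"},
        "mainClass": "org.apache.spark.deploy.SparkSubmit",
        "sparkProperties": {
            "spark.driver.supervise": "false",
            "spark.app.name": "Spark REST API - Bronze Load",
            "spark.submit.deployMode": "client",
            "spark.master": master,
            "spark.eventLog.enabled": "true",
        },
    }


body = build_submission("/opt/bitnami/spark/hostfiles/bronze.py", ["filename.json"])
print(json.dumps(body["appArgs"]))
```

With this body, the Python script would read the file name as a positional argument (for example sys.argv[1]) rather than through a --filename option.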

Related

Log all failed attempts in testcafe quarantine mode?

I have quarantine mode enabled in my testcafe configuration.
"ci-e2e": {
"browsers": [
"chrome:headless"
],
"debugOnFail": false,
"src": "./tests/e2e/*.test.ts",
"concurrency": 1,
"quarantineMode": true,
"reporters": [
{
"name": "nunit3",
"output": "results/e2e/testResults.xml"
},
{
"name": "spec"
}
],
"screenshots": {
"takeOnFails": true,
"path": "results/ui/screenshots",
"pathPattern": "${DATE}_${TIME}/${FIXTURE}/${TEST}/Screenshot-${QUARANTINE_ATTEMPT}.png"
},
"video": {
"path": "results/ui/video",
"failedOnly": true,
"pathPattern": "${DATE}_${TIME}/${FIXTURE}/${TEST}/Video-${QUARANTINE_ATTEMPT}"
}
},
Now, when an attempt fails, the log (the NUnit XML file) contains an entry listing the failed runs but only one stack trace. I do have a screenshot for each failed run.
<failure>
<message>
<![CDATA[ ❌ AssertionError: ... Run 1: Failed Run 2: Failed Run 3: Failed ]]>
</message>
<stack-trace>
here we have stack-trace for only one failed run
</stack-trace>
</failure>
I want a log entry with a stack trace for each failed run of each failed test. Is it possible to configure TestCafe this way? If not, what do I need to do?
There is a mistake in the config file. The option name for reporters should be reporter, but the config uses reporters. This means that TestCafe doesn't use these reporters at all, and you may just be looking at an outdated results file.
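To illustrate the fix, here is a small sketch that renames a misspelled reporters key to reporter in a configuration loaded as a dictionary. The function name fix_reporter_key is hypothetical, and the sample dictionary contains only a fragment of the configuration from the question.

```python
def fix_reporter_key(config):
    """Return a copy of the config with 'reporters' renamed to 'reporter'."""
    fixed = dict(config)
    if "reporters" in fixed and "reporter" not in fixed:
        fixed["reporter"] = fixed.pop("reporters")
    return fixed


# Fragment of the configuration from the question.
config = {
    "quarantineMode": True,
    "reporters": [
        {"name": "nunit3", "output": "results/e2e/testResults.xml"},
        {"name": "spec"},
    ],
}

fixed = fix_reporter_key(config)
print(sorted(fixed))
```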

Azure container fails to configure and then 'terminated'

I have a Docker container with an ASP.NET (.NET 4.7) web application. The Docker image works perfectly using our local docker deployment, but will not start on Azure and I cannot find any information or diagnostics on why that might be.
From the log stream I get
31/05/2019 11:05:34.487 INFO - Site: ip-app-develop-1 - Creating container for image: 3tcsoftwaredockerdevelop.azurecr.io/irs-plus-app:latest-develop.
31/05/2019 11:05:34.516 INFO - Site: ip-app-develop-1 - Create container for image: 3tcsoftwaredockerdevelop.azurecr.io/irs-plus-app:latest-develop succeeded. Container Id 1ea16ee9f5f128f14246fefcd936705bb8a655dc6cdbce184fb11970ef7b1cc9
31/05/2019 11:05:40.151 INFO - Site: ip-app-develop-1 - Start container succeeded. Container: 1ea16ee9f5f128f14246fefcd936705bb8a655dc6cdbce184fb11970ef7b1cc9
31/05/2019 11:05:43.745 INFO - Site: ip-app-develop-1 - Application Logging (Filesystem): On
31/05/2019 11:05:44.919 INFO - Site: ip-app-develop-1 - Container ready
31/05/2019 11:05:44.919 INFO - Site: ip-app-develop-1 - Configuring container
31/05/2019 11:05:57.448 ERROR - Site: ip-app-develop-1 - Error configuring container
31/05/2019 11:06:02.455 INFO - Site: ip-app-develop-1 - Container has exited
31/05/2019 11:06:02.456 ERROR - Site: ip-app-develop-1 - Container customization failed
31/05/2019 11:06:02.470 INFO - Site: ip-app-develop-1 - Purging pending logs after stopping container
31/05/2019 11:06:02.456 INFO - Site: ip-app-develop-1 - Attempting to stop container: 1ea16ee9f5f128f14246fefcd936705bb8a655dc6cdbce184fb11970ef7b1cc9
31/05/2019 11:06:02.470 INFO - Site: ip-app-develop-1 - Container stopped successfully. Container Id: 1ea16ee9f5f128f14246fefcd936705bb8a655dc6cdbce184fb11970ef7b1cc9
31/05/2019 11:06:02.484 INFO - Site: ip-app-develop-1 - Purging after container failed to start
After several restart attempts (manual or as a result of re-configuration) I will simply get:
2019-05-31T10:33:46 The application was terminated.
The application then refuses to even attempt to start regardless of whether I use the az cli or the portal.
My current logging configuration is:
{
"applicationLogs": {
"azureBlobStorage": {
"level": "Off",
"retentionInDays": null,
"sasUrl": null
},
"azureTableStorage": {
"level": "Off",
"sasUrl": null
},
"fileSystem": {
"level": "Verbose"
}
},
"detailedErrorMessages": {
"enabled": true
},
"failedRequestsTracing": {
"enabled": false
},
"httpLogs": {
"azureBlobStorage": {
"enabled": false,
"retentionInDays": 2,
"sasUrl": null
},
"fileSystem": {
"enabled": true,
"retentionInDays": 2,
"retentionInMb": 35
}
},
"id": "/subscriptions/XXX/resourceGroups/XXX/providers/Microsoft.Web/sites/XXX/config/logs",
"kind": null,
"location": "North Europe",
"name": "logs",
"resourceGroup": "XXX",
"type": "Microsoft.Web/sites/config"
}
Further info on the app:
- deployed using a docker container
- docker base image mcr.microsoft.com/dotnet/framework/aspnet:4.7.2
- image entrypoint c:\ServiceMonitor.exe w3svc
- app developed in ASP.NET 4.7
- using IIS as a web server
Questions:
How can I get some diagnostics on what is going on to enable me to determine why the app is not starting?
Why does the app refuse to even attempt to restart after a few failed attempts?
We had the same issue. We eventually found that App Service mounts a directory containing a specially modified version of ServiceMonitor.exe, which reads events from the App Service backend. If you change your Docker image to use this version of ServiceMonitor, it will work. We created a small PowerShell script and changed the entrypoint from this:
#WORKDIR /LogMonitor
SHELL ["C:\\LogMonitor\\LogMonitor.exe", "powershell.exe"]
# Start IIS Remote Management and monitor IIS
ENTRYPOINT Start-Service WMSVC; C:/ServiceMonitor.exe w3svc;
to this:
ENTRYPOINT ["C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe","-File","C:\\start-iis-environment.ps1"]
and the PowerShell script looks like this:
if (Test-Path -Path 'C:\AppService\Util\ServiceMonitor.exe' -PathType Leaf) {
& C:\AppService\Util\ServiceMonitor.exe w3svc
}
else{
& C:\ServiceMonitor.exe w3svc
}

How to run spark + cassandra + mesos (dcos) with dynamic resource allocation?

On every slave node we run the Mesos External Shuffle Service through Marathon. When we submit a Spark job via the dcos CLI in coarse-grained mode without dynamic allocation, everything works as expected. But when we submit the same job with dynamic allocation, it fails.
16/12/08 19:20:42 ERROR OneForOneBlockFetcher: Failed while starting block fetches
java.lang.RuntimeException: java.lang.RuntimeException: Failed to open file:/tmp/blockmgr-d4df5df4-24c9-41a3-9f26-4c1aba096814/30/shuffle_0_0_0.index
at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:234)
...
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
...
Caused by: java.io.FileNotFoundException: /tmp/blockmgr-d4df5df4-24c9-41a3-9f26-4c1aba096814/30/shuffle_0_0_0.index (No such file or directory)
Full description:
We installed Mesos (DCOS) with Marathon using the Azure Portal.
Via Universe Packages we installed Cassandra, Spark and Marathon-lb.
We generated test data in Cassandra.
On my laptop I installed the dcos CLI.
When I submit the job as below, everything works as expected:
./dcos spark run --submit-args="--properties-file coarse-grained.conf --class portal.spark.cassandra.app.ProductModelPerNrOfAlerts http://marathon-lb-default.marathon.mesos:10018/jars/spark-cassandra-assembly-1.0.jar"
Run job succeeded. Submission id: driver-20161208185927-0043
cqlsh:sp> select count(*) from product_model_per_alerts_by_date ;
count
-------
476
coarse-grained.conf:
spark.cassandra.connection.host 10.32.0.17
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.cores 1
spark.executor.memory 1g
spark.executor.instances 2
spark.submit.deployMode cluster
spark.cores.max 4
portal.spark.cassandra.app.ProductModelPerNrOfAlerts:
package portal.spark.cassandra.app
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.{SparkConf, SparkContext}
object ProductModelPerNrOfAlerts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setAppName("cassandraSpark-ProductModelPerNrOfAlerts")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val df = sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "asset_history", "keyspace" -> "sp"))
      .load()
      .select("datestamp","product_model","nr_of_alerts")
    val dr = df
      .groupBy("datestamp","product_model")
      .avg("nr_of_alerts")
      .toDF("datestamp","product_model","nr_of_alerts")
    dr.write
      .mode(SaveMode.Overwrite)
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "product_model_per_alerts_by_date", "keyspace" -> "sp"))
      .save()
    sc.stop()
  }
}
Dynamic Allocation
Through Marathon we run Mesos External Shuffle Service:
{
"id": "spark-mesos-external-shuffle-service-tt",
"container": {
"type": "DOCKER",
"docker": {
"image": "jpavt/mesos-spark-hadoop:mesos-external-shuffle-service-1.0.4-2.0.1",
"network": "BRIDGE",
"portMappings": [
{ "hostPort": 7337, "containerPort": 7337, "servicePort": 7337 }
],
"forcePullImage":true,
"volumes": [
{
"containerPath": "/tmp",
"hostPath": "/tmp",
"mode": "RW"
}
]
}
},
"instances": 9,
"cpus": 0.2,
"mem": 512,
"constraints": [["hostname", "UNIQUE"]]
}
Dockerfile for jpavt/mesos-spark-hadoop:mesos-external-shuffle-service-1.0.4-2.0.1:
FROM mesosphere/spark:1.0.4-2.0.1
WORKDIR /opt/spark/dist
ENTRYPOINT ["./bin/spark-class", "org.apache.spark.deploy.mesos.MesosExternalShuffleService"]
Now when I submit the job with dynamic allocation, it fails:
./dcos spark run --submit-args="--properties-file dynamic-allocation.conf --class portal.spark.cassandra.app.ProductModelPerNrOfAlerts http://marathon-lb-default.marathon.mesos:10018/jars/spark-cassandra-assembly-1.0.jar"
Run job succeeded. Submission id: driver-20161208191958-0047
select count(*) from product_model_per_alerts_by_date ;
count
-------
5
dynamic-allocation.conf:
spark.cassandra.connection.host 10.32.0.17
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.cores 1
spark.executor.memory 1g
spark.submit.deployMode cluster
spark.cores.max 4
spark.shuffle.service.enabled true
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.maxExecutors 5
spark.dynamicAllocation.cachedExecutorIdleTimeout 120s
spark.dynamicAllocation.schedulerBacklogTimeout 10s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 20s
spark.mesos.executor.docker.volumes /tmp:/tmp:rw
spark.local.dir /tmp
logs from mesos:
16/12/08 19:20:42 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 18.0 KB, free 366.0 MB)
16/12/08 19:20:42 INFO TorrentBroadcast: Reading broadcast variable 7 took 21 ms
16/12/08 19:20:42 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 38.6 KB, free 366.0 MB)
16/12/08 19:20:42 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 0, fetching them
16/12/08 19:20:42 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@10.32.0.4:45422)
16/12/08 19:20:42 INFO MapOutputTrackerWorker: Got the output locations
16/12/08 19:20:42 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 58 blocks
16/12/08 19:20:42 INFO TransportClientFactory: Successfully created connection to /10.32.0.11:7337 after 2 ms (0 ms spent in bootstraps)
16/12/08 19:20:42 INFO ShuffleBlockFetcherIterator: Started 1 remote fetches in 13 ms
16/12/08 19:20:42 ERROR OneForOneBlockFetcher: Failed while starting block fetches java.lang.RuntimeException: java.lang.RuntimeException: Failed to open file: /tmp/blockmgr-d4df5df4-24c9-41a3-9f26-4c1aba096814/30/shuffle_0_0_0.index
at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:234)
...
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
...
Caused by: java.io.FileNotFoundException: /tmp/blockmgr-d4df5df4-24c9-41a3-9f26-4c1aba096814/30/shuffle_0_0_0.index (No such file or directory)
logs from marathon spark-mesos-external-shuffle-service-tt:
...
16/12/08 19:20:29 INFO MesosExternalShuffleBlockHandler: Received registration request from app 704aec43-1aa3-4971-bb98-e892beeb2c45-0008-driver-20161208191958-0047 (remote address /10.32.0.4:49710, heartbeat timeout 120000 ms).
16/12/08 19:20:31 INFO ExternalShuffleBlockResolver: Registered executor AppExecId{appId=704aec43-1aa3-4971-bb98-e892beeb2c45-0008-driver-20161208191958-0047, execId=2} with ExecutorShuffleInfo{localDirs=[/tmp/blockmgr-14525ef0-22e9-49fb-8e81-dc84e5fba8b2], subDirsPerLocalDir=64, shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
16/12/08 19:20:38 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id 8157825166903585542
java.lang.RuntimeException: Failed to open file: /tmp/blockmgr-14525ef0-22e9-49fb-8e81-dc84e5fba8b2/16/shuffle_0_55_0.index
at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:234)
...
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
Caused by: java.io.FileNotFoundException: /tmp/blockmgr-14525ef0-22e9-49fb-8e81-dc84e5fba8b2/16/shuffle_0_55_0.index (No such file or directory)
...
But the file exists on the given slave box:
$ ls -l /tmp/blockmgr-14525ef0-22e9-49fb-8e81-dc84e5fba8b2/16/shuffle_0_55_0.index
-rw-r--r-- 1 root root 1608 Dec 8 19:20 /tmp/blockmgr-14525ef0-22e9-49fb-8e81-dc84e5fba8b2/16/shuffle_0_55_0.index
stat shuffle_0_55_0.index
File: 'shuffle_0_55_0.index'
Size: 1608 Blocks: 8 IO Block: 4096 regular file
Device: 801h/2049d Inode: 1805493 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2016-12-08 19:20:38.163188836 +0000
Modify: 2016-12-08 19:20:38.163188836 +0000
Change: 2016-12-08 19:20:38.163188836 +0000
Birth: -
I am not familiar with DCOS, Marathon, or Azure, but I use dynamic resource allocation (Mesos external shuffle service) on Mesos and Aurora with Docker.
Does each Mesos agent node have its own external shuffle service (that is, one external shuffle service per Mesos agent)?
Is the spark.local.dir setting exactly the same string, pointing to the same directory, for both? Your spark.local.dir for the shuffle service is /tmp, but I don't know the DCOS setting.
Is the spark.local.dir directory readable and writable by both? If both the Mesos agent and the external shuffle service are launched by Docker, spark.local.dir on the host MUST be mounted into both containers.
EDIT
If the SPARK_LOCAL_DIRS environment variable (Mesos or standalone) is set, it overrides spark.local.dir.
There was an error in the Marathon external shuffle service config: instead of the path container.docker.volumes, we should use the path container.volumes.
Correct configuration:
{
"id": "mesos-external-shuffle-service-simple",
"container": {
"type": "DOCKER",
"docker": {
"image": "jpavt/mesos-spark-hadoop:mesos-external-shuffle-service-1.0.4-2.0.1",
"network": "BRIDGE",
"portMappings": [
{ "hostPort": 7337, "containerPort": 7337, "servicePort": 7337 }
],
"forcePullImage":true
},
"volumes": [
{
"containerPath": "/tmp",
"hostPath": "/tmp",
"mode": "RW"
}
]
},
"instances": 9,
"cpus": 0.2,
"mem": 512,
"constraints": [["hostname", "UNIQUE"]]
}
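The difference between the two Marathon definitions is only where volumes sits. As a rough sketch, the check below detects a volumes list nested under container.docker and moves it to container.volumes. The function name move_docker_volumes is hypothetical, and the sample definition is a shortened version of the one in the question.

```python
import copy


def move_docker_volumes(app):
    """Return a copy of a Marathon app definition with volumes moved
    from container.docker.volumes to container.volumes."""
    fixed = copy.deepcopy(app)
    container = fixed.get("container", {})
    docker = container.get("docker", {})
    misplaced = docker.pop("volumes", None)
    if misplaced:
        container.setdefault("volumes", []).extend(misplaced)
    return fixed


# Shortened version of the failing definition from the question.
broken = {
    "id": "spark-mesos-external-shuffle-service-tt",
    "container": {
        "type": "DOCKER",
        "docker": {
            "network": "BRIDGE",
            "volumes": [
                {"containerPath": "/tmp", "hostPath": "/tmp", "mode": "RW"}
            ],
        },
    },
}

fixed = move_docker_volumes(broken)
print(fixed["container"]["volumes"])
print("volumes" in fixed["container"]["docker"])
```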

How to set PYTHONHASHSEED on AWS EMR

Is there any way to set an environment variable on all nodes of an EMR cluster?
I am getting an error regarding the hash seed when trying to use reduceByKey() in Python 3 PySpark. I can see this is a known error, and that the environment variable PYTHONHASHSEED needs to be set to the same value on all nodes of the cluster, but I haven't had any luck with it.
I have tried adding a variable to spark-env through the cluster configuration:
[
{
"Classification": "spark-env",
"Configurations": [
{
"Classification": "export",
"Properties": {
"PYSPARK_PYTHON": "/usr/bin/python3",
"PYTHONHASHSEED": "123"
}
}
]
},
{
"Classification": "spark",
"Properties": {
"maximizeResourceAllocation": "true"
}
}
]
but this doesn't work. I have also tried adding a bootstrap script:
#!/bin/bash
export PYTHONHASHSEED=123
but this also doesn't seem to do the trick.
I believe that /usr/bin/python3 isn't picking up the PYTHONHASHSEED environment variable that you define in the cluster configuration under the spark-env scope.
You should use python34 instead of /usr/bin/python3 and set the configuration as follows:
[
{
"classification":"spark-defaults",
"properties":{
// [...]
}
},
{
"configurations":[
{
"classification":"export",
"properties":{
"PYSPARK_PYTHON":"python34",
"PYTHONHASHSEED":"123"
}
}
],
"classification":"spark-env",
"properties":{
// [...]
}
}
]
Now, let's test it. I define a bash script that calls both Pythons:
#!/bin/bash
echo "using python34"
for i in `seq 1 10`;
do
python -c "print(hash('foo'))";
done
echo "----------------------"
echo "using /usr/bin/python3"
for i in `seq 1 10`;
do
/usr/bin/python3 -c "print(hash('foo'))";
done
The verdict :
[hadoop@ip-10-0-2-182 ~]$ bash test.sh
using python34
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
----------------------
using /usr/bin/python3
8867846273747294950
-7610044127871105351
6756286456855631480
-4541503224938367706
7326699722121877093
3336202789104553110
3462714165845110404
-5390125375246848302
-7753272571662122146
8018968546238984314
PS1: I am using AMI release emr-4.8.2.
PS2: Snippet inspired from this answer.
EDIT: I have tested the following using pyspark.
16/11/22 07:16:56 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/application_1479798580078_0001
16/11/22 07:16:56 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 1.6.2
/_/
Using Python version 3.4.3 (default, Sep 1 2016 23:33:38)
SparkContext available as sc, HiveContext available as sqlContext.
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
Also created a simple application (simple_app.py):
from pyspark import SparkContext
sc = SparkContext(appName = "simple-app")
numbers = [hash('foo') for i in range(10)]
print(numbers)
Which also seems to work perfectly :
[hadoop@ip-*** ~]$ spark-submit --master yarn simple_app.py
Output (truncated) :
[...]
16/11/22 07:28:42 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
[-5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594] // THE RELEVANT LINE IS HERE.
16/11/22 07:28:42 INFO SparkContext: Invoking stop() from shutdown hook
[...]
As you can see it also works returning the same hash each time.
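The underlying behaviour can also be reproduced without a cluster. The sketch below, which assumes only a local Python 3 interpreter, starts several child interpreters and compares hash('foo') with and without PYTHONHASHSEED in their environment. The helper name hash_in_child is hypothetical.

```python
import os
import subprocess
import sys


def hash_in_child(seed=None):
    """Run hash('foo') in a fresh interpreter, optionally with PYTHONHASHSEED set."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # start from a clean state
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('foo'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# With a fixed seed every process computes the same value.
seeded = {hash_in_child(123) for _ in range(3)}
# Without a seed, Python 3 randomizes string hashes per process,
# so the values will almost always differ.
unseeded = {hash_in_child() for _ in range(3)}

print(len(seeded))
print(len(unseeded))
```

This is why the seed has to reach every Python worker process: each executor starts its own interpreter, and their string hashes only agree when they all see the same PYTHONHASHSEED.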
EDIT 2: From the comments, it seems you are computing hashes on the executors and not on the driver, so you'll need to set spark.executorEnv.PYTHONHASHSEED in your Spark application configuration so that it is propagated to the executors (this is one way to do it).
Note: Setting environment variables for executors works the same way in YARN client mode: use spark.executorEnv.[EnvironmentVariableName].
Hence the following minimal example with simple_app.py:
from pyspark import SparkContext, SparkConf
conf = SparkConf().set("spark.executorEnv.PYTHONHASHSEED","123")
sc = SparkContext(appName="simple-app", conf=conf)
numbers = sc.parallelize(['foo']*10).map(lambda x: hash(x)).collect()
print(numbers)
And now let's test it again. Here is the truncated output :
16/11/22 14:14:34 INFO DAGScheduler: Job 0 finished: collect at /home/hadoop/simple_app.py:6, took 14.251514 s
[-5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594]
16/11/22 14:14:34 INFO SparkContext: Invoking stop() from shutdown hook
I think that this covers all.
From the spark docs
Note: When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file. Environment variables that are set in spark-env.sh will not be reflected in the YARN Application Master process in cluster mode. See the YARN-related Spark Properties for more information.
Properties are listed here so I think you want this:
Add the environment variable specified by EnvironmentVariableName to the Application Master process launched on YARN.
spark.yarn.appMasterEnv.PYTHONHASHSEED="XXXX"
EMR docs for configuring spark-defaults.conf are here.
[
{
"Classification": "spark-defaults",
"Properties": {
"spark.yarn.appMasterEnv.PYTHONHASHSEED": "XXX"
}
}
]
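As a sketch, the EMR configuration above can be generated programmatically. The helper below is hypothetical. It emits a spark-defaults classification that sets the Application Master property quoted from the Spark docs and, as an assumption that combines it with the earlier answer, the spark.executorEnv.PYTHONHASHSEED property as well.

```python
import json


def spark_defaults_with_hash_seed(seed):
    """Build an EMR 'spark-defaults' classification that fixes PYTHONHASHSEED
    for the YARN Application Master and for the executors."""
    value = str(seed)  # EMR property values are strings
    return [
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.yarn.appMasterEnv.PYTHONHASHSEED": value,
                "spark.executorEnv.PYTHONHASHSEED": value,
            },
        }
    ]


configurations = spark_defaults_with_hash_seed(123)
print(json.dumps(configurations, indent=2))
```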
Just encountered the same problem, adding the following configuration solved it:
# Some settings...
Configurations=[
{
"Classification": "spark-env",
"Properties": {},
"Configurations": [
{
"Classification": "export",
"Properties": {
"PYSPARK_PYTHON": "python34"
},
"Configurations": []
}
]
},
{
"Classification": "hadoop-env",
"Properties": {},
"Configurations": [
{
"Classification": "export",
"Properties": {
"PYTHONHASHSEED": "0"
},
"Configurations": []
}
]
}
],
# Some more settings...
Note: we do not use YARN as a cluster manager; for the moment the cluster only runs Hadoop and Spark.
EDIT: Following Tim B's comment, this also seems to work with YARN installed as a cluster manager.
You could probably do it via the bootstrap script but you'll need to do something like this:
echo "export PYTHONHASHSEED=XXXX" >> /home/hadoop/.bashrc
(or possibly .profile)
So that it's picked up by the spark processes when they are launched.
Your configuration looks reasonable though, it might be worth setting it in the hadoop-env section instead?

Spark: Monitoring a cluster mode application

Right now I'm using spark-submit to launch an application in cluster mode. The response from the master server gives a json object with a submissionId which I use to identify the application and kill it if necessary. However, I haven't found a simple way to retrieve the worker rest url from the master server response or the driver id (probably could web scrape the master web ui but that would be ugly). Instead, I have to wait until the application finishes, then look up the application statistics from the history server.
Is there any way to use the driver-id to identify the worker url from an application deployed in cluster mode (usually at worker-node:4040)?
16/08/12 11:39:47 INFO RestSubmissionClient: Submitting a request to launch an application in spark://192.yyy:6066.
16/08/12 11:39:47 INFO RestSubmissionClient: Submission successfully created as driver-20160812114003-0001. Polling submission state...
16/08/12 11:39:47 INFO RestSubmissionClient: Submitting a request for the status of submission driver-20160812114003-0001 in spark://192.yyy:6066.
16/08/12 11:39:47 INFO RestSubmissionClient: State of driver driver-20160812114003-0001 is now RUNNING.
16/08/12 11:39:47 INFO RestSubmissionClient: Driver is running on worker worker-20160812113715-192.xxx-46215 at 192.xxx:46215.
16/08/12 11:39:47 INFO RestSubmissionClient: Server responded with CreateSubmissionResponse:
{
"action" : "CreateSubmissionResponse",
"message" : "Driver successfully submitted as driver-20160812114003-0001",
"serverSparkVersion" : "1.6.1",
"submissionId" : "driver-20160812114003-0001",
"success" : true
}
EDIT: Here's what a typical output looks like with log4j console output at DEBUG
Spark-submit command:
./apps/spark-2.0.0-bin-hadoop2.7/bin/spark-submit --master mesos://masterurl:7077
--verbose --class MainClass --deploy-mode cluster
~/path/myjar.jar args
Spark-submit output:
Using properties file: null
Parsed arguments:
master mesos://masterurl:7077
deployMode cluster
executorMemory null
executorCores null
totalExecutorCores null
propertiesFile null
driverMemory null
driverCores null
driverExtraClassPath null
driverExtraLibraryPath null
driverExtraJavaOptions null
supervise false
queue null
numExecutors null
files null
pyFiles null
archives null
mainClass MyApp
primaryResource file:/path/myjar.jar
name MyApp
childArgs [args]
jars null
packages null
packagesExclusions null
repositories null
verbose true
Spark properties used, including those specified through
--conf and those from the properties file null:
Main class:
org.apache.spark.deploy.rest.RestSubmissionClient
Arguments:
file:/path/myjar.jar
MyApp
args
System properties:
SPARK_SUBMIT -> true
spark.driver.supervise -> false
spark.app.name -> MyApp
spark.jars -> file:/path/myjar.jar
spark.submit.deployMode -> cluster
spark.master -> mesos://masterurl:7077
Classpath elements:
16/08/17 13:26:49 INFO RestSubmissionClient: Submitting a request to launch an application in mesos://masterurl:7077.
16/08/17 13:26:49 DEBUG RestSubmissionClient: Sending POST request to server at http://masterurl:7077/v1/submissions/create:
{
"action" : "CreateSubmissionRequest",
"appArgs" : [ args ],
"appResource" : "file:/path/myjar.jar",
"clientSparkVersion" : "2.0.0",
"environmentVariables" : {
"SPARK_SCALA_VERSION" : "2.10"
},
"mainClass" : "SimpleSort",
"sparkProperties" : {
"spark.jars" : "file:/path/myjar.jar",
"spark.driver.supervise" : "false",
"spark.app.name" : "MyApp",
"spark.submit.deployMode" : "cluster",
"spark.master" : "mesos://masterurl:7077"
}
}
16/08/17 13:26:49 DEBUG RestSubmissionClient: Response from the server:
{
"action" : "CreateSubmissionResponse",
"serverSparkVersion" : "2.0.0",
"submissionId" : "driver-20160817132658-0004",
"success" : true
}
16/08/17 13:26:49 INFO RestSubmissionClient: Submission successfully created as driver-20160817132658-0004. Polling submission state...
16/08/17 13:26:49 INFO RestSubmissionClient: Submitting a request for the status of submission driver-20160817132658-0004 in mesos://masterurl:7077.
16/08/17 13:26:49 DEBUG RestSubmissionClient: Sending GET request to server at http://masterurl:7077/v1/submissions/status/driver-20160817132658-0004.
16/08/17 13:26:49 DEBUG RestSubmissionClient: Response from the server:
{
"action" : "SubmissionStatusResponse",
"driverState" : "RUNNING",
"serverSparkVersion" : "2.0.0",
"submissionId" : "driver-20160817132658-0004",
"success" : true
}
16/08/17 13:26:49 INFO RestSubmissionClient: State of driver driver-20160817132658-0004 is now RUNNING.
16/08/17 13:26:49 INFO RestSubmissionClient: Server responded with CreateSubmissionResponse:
{
"action" : "CreateSubmissionResponse",
"serverSparkVersion" : "2.0.0",
"submissionId" : "driver-20160817132658-0004",
"success" : true
}
Does the master server's response not provide application-id?
I believe all you need is the master-URL and application-id of your application for this problem. Once you have the application-id, use the port 4040 at master-URL and append your intended endpoint to it.
For example, if your application id is application_1468141556944_1055
To get the list of all jobs
http://<master>:4040/api/v1/applications/application_1468141556944_1055/jobs
To get the list of stored RDDs
http://<master>:4040/api/v1/applications/application_1468141556944_1055/storage/rdd
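The two URLs above follow one pattern, so they can be built with a small helper. This is only a sketch: the function name rest_endpoint and the host name master.example are hypothetical, while the port and path layout are taken from the examples above.

```python
def rest_endpoint(host, app_id, resource, port=4040):
    """Build a monitoring REST URL of the form shown above."""
    return f"http://{host}:{port}/api/v1/applications/{app_id}/{resource}"


app_id = "application_1468141556944_1055"
jobs_url = rest_endpoint("master.example", app_id, "jobs")
rdd_url = rest_endpoint("master.example", app_id, "storage/rdd")
print(jobs_url)
print(rdd_url)
```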
However if you don't have application-id, I would probably start with following:
Set verbose mode (--verbose) while launching spark job to get application id on console. You can then parse for application-id in log output. The log output usually looks like:
16/08/12 08:50:53 INFO Client: Application report for application_1468141556944_3791 (state: RUNNING)
thus, application-id is application_1468141556944_3791
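Parsing the application id out of the log output can be sketched with a regular expression. The pattern below assumes ids of the form application_<digits>_<digits>, as in the log line above, and the function name find_application_id is hypothetical.

```python
import re

# Assumed id format, based on the log line quoted above.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")


def find_application_id(log_text):
    """Return the first application id found in the log text, or None."""
    match = APP_ID_PATTERN.search(log_text)
    return match.group(0) if match else None


line = (
    "16/08/12 08:50:53 INFO Client: Application report for "
    "application_1468141556944_3791 (state: RUNNING)"
)
app_id = find_application_id(line)
print(app_id)
```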
You can also find master-url and application-id through tracking URL in the log output, which looks like
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.50.0.33
ApplicationMaster RPC port: 0
queue: ns_debug
start time: 1470992969127
final status: UNDEFINED
tracking URL: http://<master>:8088/proxy/application_1468141556944_3799/
These messages are at INFO log level so make sure you set log4j.rootCategory=INFO, console in log4j.properties file so that you can see them.
I had to scrape the Spark master web UI for an application id that's close (within the same minute and with the same suffix, e.g. 20161010025XXX-0005 with X as a wildcard), then look for the worker URL in the link tag after it. It is not pretty, reliable, or secure, but it will work for now. I am leaving this open for a bit in case someone has another approach.
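One alternative to scraping may be the status endpoint that the submission client already polls (/v1/submissions/status/<submissionId>, as in the DEBUG output in the question). The sketch below builds that URL and reads the worker location from a status response. The workerId and workerHostPort field names are an assumption based on the "Driver is running on worker ... at ..." client log line for the standalone master, and the Mesos response shown in the question does not contain them. The sample response is constructed for illustration from values in that log line, and both function names are hypothetical.

```python
import json


def status_url(host, port, submission_id):
    """Build the submission status URL used by the REST submission client."""
    return f"http://{host}:{port}/v1/submissions/status/{submission_id}"


def driver_location(status_json):
    """Extract (driverState, workerId, workerHostPort) from a status response.

    Fields that are absent come back as None.
    """
    status = json.loads(status_json)
    return (
        status.get("driverState"),
        status.get("workerId"),
        status.get("workerHostPort"),
    )


# Illustrative response; workerId and workerHostPort are assumed fields.
sample = json.dumps(
    {
        "action": "SubmissionStatusResponse",
        "driverState": "RUNNING",
        "submissionId": "driver-20160812114003-0001",
        "success": True,
        "workerId": "worker-20160812113715-192.xxx-46215",
        "workerHostPort": "192.xxx:46215",
    }
)

url = status_url("192.yyy", 6066, "driver-20160812114003-0001")
state, worker_id, worker_host_port = driver_location(sample)
print(url)
print(state, worker_id, worker_host_port)
```

The host part of workerHostPort would identify the node running the driver. The port there is the worker's own port, not the application UI port (usually 4040).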
