Using PySpark to run a job on an on-premises Spark cluster - apache-spark

I have a tiny on-premises Spark 3.2.0 cluster, with one machine acting as master and two others as slaves. The cluster is deployed on bare metal, and everything works fine when I run pyspark from the master machine.
The problem happens when I try to run anything from another machine. Here is my code:
import pandas as pd
from datetime import datetime
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName("extrair_comex").config("spark.executor.memory", "1g").master("spark://srvsparkm-dev:7077").getOrCreate()
link = 'https://www.stats.govt.nz/assets/Uploads/International-trade/International-trade-September-2021-quarter/Download-data/overseas-trade-indexes-September-2021-quarter-provisional-csv.csv'
arquivo = pd.read_csv(link)
df_spark = spark.createDataFrame(arquivo.astype(str))
df_spark.write.mode('overwrite').parquet('hdfs://srvsparkm-dev:9000/lnd/arquivo_extraido_comex.parquet')
Where "srvsparkm-dev" is an alias for the spark master IP.
Checking the logs for the "extrair_comex" job, I see this:
The Spark Executor Command:
Spark Executor Command: "/usr/lib/jvm/java-8-openjdk-amd64/bin/java" "-cp" "/home/spark/spark/conf/:/home/spark/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=38571" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@srvairflowcelery-dev:38571" "--executor-id" "157" "--hostname" "srvsparksl1-dev" "--cores" "2" "--app-id" "app-20220204183041-0031" "--worker-url" "spark://Worker@srvsparksl1-dev:37383"
Where "srvairflowcelery-dev" is the machine where the pyspark script is running.
The error:
Caused by: java.io.IOException: Failed to connect to srvairflowcelery-dev/xx.xxx.xxx.xx:38571
Where xx.xxx.xxx.xx is the srvairflowcelery-dev's IP.
It seems to me that the master is assigning the task to the client machine, and that is why it fails.
What can I do about this? Can't I submit jobs from another machine?

I solved the problem. The cause was that srvairflowcelery runs in Docker, so only some ports are open. On top of that, the Spark master tries to communicate with the driver (srvairflowcelery) on a random port, so having some ports closed is a problem.
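Before changing any configuration, it can help to confirm from a worker machine that the driver port really is unreachable. The following is a minimal sketch using only the Python standard library; the function name can_connect is hypothetical, and the host and port in the comment are the ones from the error message above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example call, to be run on a worker machine (values from the error above):
# can_connect("srvairflowcelery-dev", 38571)
```

If this returns False on the worker while the driver is running, the port is most likely blocked or not published by the container.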
What I did was:
Opened a range of ports on my Airflow workers with:
  airflow-worker:
    <<: *airflow-common
    command: celery worker
    hostname: ${HOSTNAME}
    ports:
      - 8793:8793
      - "51800-51900:51800-51900"
Set fixed ports in my PySpark code:
spark = SparkSession.builder.appName("extrair_comex_sb") \
    .config("spark.executor.memory", "1g") \
    .config("spark.driver.port", "51810") \
    .config("spark.fileserver.port", "51811") \
    .config("spark.broadcast.port", "51812") \
    .config("spark.replClassServer.port", "51813") \
    .config("spark.blockManager.port", "51814") \
    .config("spark.executor.port", "51815") \
    .master("spark://srvsparkm-dev:7077") \
    .getOrCreate()
That fixed the problem.
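A quick sanity check for this kind of setup is to verify that every fixed Spark port falls inside the range published by the container. The sketch below is plain Python and does not need Spark; parse_port_range and ports_outside are hypothetical helper names. It checks only spark.driver.port and spark.blockManager.port, because as far as I can tell those are the two properties from the list above that the Spark 3.x configuration reference still documents. That selection is my assumption, not part of the original answer.

```python
def parse_port_range(mapping: str) -> range:
    """Parse the host side of a Docker mapping such as "51800-51900:51800-51900"."""
    host_part = mapping.split(":")[0]
    first, _, last = host_part.partition("-")
    # A single port such as "8793" has no "-", so last is empty.
    return range(int(first), int(last or first) + 1)


def ports_outside(conf: dict, published: range) -> dict:
    """Return the port settings whose value is not inside the published range."""
    return {key: int(value) for key, value in conf.items()
            if key.endswith(".port") and int(value) not in published}


published = parse_port_range("51800-51900:51800-51900")
spark_conf = {
    "spark.driver.port": "51810",
    "spark.blockManager.port": "51814",
}
print(ports_outside(spark_conf, published))  # {}
```

An empty result means all fixed ports are covered by the published range; any entry in the result is a port that the cluster machines could not reach.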

Related

What are the configurations needed to launch a PySpark standalone cluster?

I'm new to PySpark and I'm looking to deploy my program on a cluster.
I followed a tutorial with these steps:
Start the master: bin\spark-class2.cmd org.apache.spark.deploy.master.Master
Start a worker: bin\spark-class2.cmd org.apache.spark.deploy.worker.Worker -c 2 -m 2G spark://192.168.43.78:7077
Launch the app with python myapp.py, which contains:
import findspark
findspark.init(r'C:\spark\spark-3.0.3-bin-hadoop2.7')  # raw string keeps the backslashes literal
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
conf = SparkConf()
conf.setMaster('spark://192.168.43.78:7077')
conf.setAppName('firstapp')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
My question is: what configuration is needed beyond that to launch a PySpark standalone cluster?

Start Spark master on the IP instead of Hostname

I'm trying to set up a remote Spark 2.4.5 cluster on Ubuntu 18. After I run ./sbin/start-master.sh, the WebUI is available at <INSTANCE-IP>:8080, but it shows "Spark Master at spark://spark-master:7077", where spark-master is my hostname on the remote machine.
I'm able to start a worker only with ./sbin/start-slave.sh spark://spark-master:7077, but <INSTANCE-IP>:4040 doesn't work. When I try ./sbin/start-slave.sh spark://<INSTANCE-IP>:7077, I can see the process, but the worker is not visible in the WebUI.
As a result, I cannot connect to the cluster from my local machine with spark-shell --master spark://<INSTANCE-IP>:7077. The error is:
StandaloneAppClient$ClientEndpoint: Failed to connect to master <INSTANCE-IP>:7077

Spark 2 broadcast inside Docker uses random port

I'm trying to run Spark 2 inside Docker containers and I'm finding it rather hard. So far I think I have come a long way, having been able to deploy a Standalone master to host A and a worker to host B. I configured the /etc/hosts of the Docker containers so that master and worker can reach their respective hosts. They see each other and everything looks fine.
I deploy master with this set of opened ports:
docker run -ti --rm \
    --name sparkmaster \
    --hostname=host.a \
    --add-host host.a:xx.xx.xx.xx \
    --add-host host.b:xx.xx.xx.xx \
    -p 18080:18080 \
    -p 7001:7001 \
    -p 7002:7002 \
    -p 7003:7003 \
    -p 7004:7004 \
    -p 7005:7005 \
    -p 7006:7006 \
    -p 4040:4040 \
    -p 7077:7077 \
    malkab/spark:ablative_alligator
Then I set these Spark config options:
export SPARK_MASTER_OPTS="-Dspark.driver.port=7001
-Dspark.fileserver.port=7002
-Dspark.broadcast.port=7003 -Dspark.replClassServer.port=7004
-Dspark.blockManager.port=7005 -Dspark.executor.port=7006
-Dspark.ui.port=4040 -Dspark.broadcast.blockSize=4096"
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=18080
I deploy the worker on its host with the SPARK_WORKER_XXX version of the aforementioned env variables, and with an analogous docker run.
Then I enter into the master container and spark-submit a job:
spark-submit --master spark://host.a:7077 /src/Test06.py
Everything starts fine: I can see the job being distributed to the worker. But when the Block Manager tries to register the block, it seems to be using a random port, which is not accessible from outside the container:
INFO BlockManagerMasterEndpoint: Registering block manager host.b:39673 with 366.3 MB RAM, BlockManagerId(0, host.b, 39673, None)
Then I get this error:
WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, host.b, executor 0): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_0_piece0 of broadcast_0
And the worker reports:
java.io.IOException: Failed to connect to host.a/xx.xx.xx.xx:55638
So far I've been able to avoid the use of random ports with the previous settings, but this Block Manager port in particular seems to be random. I thought it was controlled by spark.blockManager.port, but this seems not to be the case. I've reviewed all configuration options to no avail.
So, final question: what is this port, and can it be prevented from being random?
Thanks in advance.
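One way to spot which processes still pick random ports is to scan the driver log for block manager registrations and compare their ports with the ones published by docker run. This is only a sketch: the regular expression is written against the single log line quoted above, unpublished_block_managers is a hypothetical helper name, and the published set is copied from the -p flags in the docker run command.

```python
import re

# Matches the "host:port" part of a "Registering block manager" log line.
BLOCK_MANAGER_RE = re.compile(r"Registering block manager (\S+):(\d+)")


def unpublished_block_managers(log_lines, published_ports):
    """Yield (host, port) for block managers listening on unpublished ports."""
    for line in log_lines:
        match = BLOCK_MANAGER_RE.search(line)
        if match and int(match.group(2)) not in published_ports:
            yield match.group(1), int(match.group(2))


log = [
    "INFO BlockManagerMasterEndpoint: Registering block manager host.b:39673 "
    "with 366.3 MB RAM, BlockManagerId(0, host.b, 39673, None)",
]
published = {18080, 7001, 7002, 7003, 7004, 7005, 7006, 4040, 7077}
print(list(unpublished_block_managers(log, published)))  # [('host.b', 39673)]
```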
EDIT:
This is the executor launched. As you can see, random ports are open both on the driver (which is on the same host as the master) and on the worker:
[Image: Random ports for executors]
I understand that a worker instantiates many executors, and that each should have a port assigned. Is there any way to at least limit the range of ports for these communications? How do people handle Spark behind tight firewalls?
EDIT 2:
I finally got it working. If I pass the property spark.driver.host=host.a in SPARK_MASTER_OPTS and SPARK_WORKER_OPTS, as I mentioned before, it does not work. However, if I configure it in my code when setting up the context:
conf = SparkConf().setAppName("Test06") \
    .set("spark.driver.port", "7001") \
    .set("spark.driver.host", "host.a") \
    .set("spark.fileserver.port", "7002") \
    .set("spark.broadcast.port", "7003") \
    .set("spark.replClassServer.port", "7004") \
    .set("spark.blockManager.port", "7005") \
    .set("spark.executor.port", "7006") \
    .set("spark.ui.port", "4040") \
    .set("spark.broadcast.blockSize", "4096") \
    .set("spark.local.dir", "/tmp") \
    .set("spark.driver.extraClassPath", "/classes/postgresql.jar") \
    .set("spark.executor.extraClassPath", "/classes/postgresql.jar")
it somehow worked. Why is the setting not honored in SPARK_MASTER_OPTS or SPARK_WORKER_OPTS?
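A plausible explanation, which I have not verified against this exact setup: SPARK_MASTER_OPTS and SPARK_WORKER_OPTS are documented as applying to the master and worker daemons themselves, whereas spark.driver.* and similar settings are application properties, which the driver reads from SparkConf, from spark-submit --conf flags, or from spark-defaults.conf. Under that assumption, the same properties could also be passed on the command line instead of in code. The sketch below only builds the argument list; to_submit_args is a hypothetical helper, and the three properties are a subset of those shown above.

```python
def to_submit_args(conf: dict) -> list:
    """Turn a dict of Spark properties into spark-submit "--conf key=value" arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


conf = {
    "spark.driver.host": "host.a",
    "spark.driver.port": "7001",
    "spark.blockManager.port": "7005",
}
command = ["spark-submit", "--master", "spark://host.a:7077",
           *to_submit_args(conf), "/src/Test06.py"]
print(" ".join(command))
```

The resulting list can be handed to subprocess.run unchanged, which avoids shell quoting issues.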

Running spark application in local mode

I'm trying to start my Spark application in local mode using spark-submit. I am using Spark 2.0.2, Hadoop 2.6 & Scala 2.11.8 on Windows. The application runs fine from within my IDE (IntelliJ), and I can also start it on a cluster with actual, physical executors.
The command I'm running is
spark-submit --class [MyClassName] --master local[*] target/[MyApp]-jar-with-dependencies.jar [Params]
Spark starts up as usual, but then terminates with
java.io.Exception: Failed to connect to /192.168.88.1:56370
What am I missing here?
Check which master and port you are using. If you are running on a cluster, log in to the master node and include:
--master spark://XXXX:7077
You can always find the master URL in the Spark UI on port 8080.
Also check whether your SparkSession builder config already sets a master, because it takes priority at launch, e.g.:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("myapp")
  .master("local[*]") // a master set here overrides the --master flag
  .getOrCreate()
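The precedence mentioned in this answer follows Spark's documented order: properties set directly in code win over spark-submit flags, which in turn win over spark-defaults.conf. As a rough illustration in plain Python (effective_master is a hypothetical function, not a Spark API):

```python
def effective_master(code_master=None, submit_master=None, defaults_master=None):
    """Return the master URL that wins, checking the highest-precedence source first."""
    for candidate in (code_master, submit_master, defaults_master):
        if candidate is not None:
            return candidate
    return None


# .master("local[*]") in code overrides --master on the command line.
print(effective_master(code_master="local[*]", submit_master="spark://XXXX:7077"))
# local[*]
```

So a hard-coded master in the builder silently overrides whatever is passed on the command line, which is worth ruling out before debugging the network.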

Flume is not able to send the event when submitting the job on cluster with yarn-client

I am using a Hortonworks cluster (2-node cluster) to run Spark and Flume. When I run the job with --master "local[*]", Flume is able to send the events and Spark is able to receive them; checking localhost:4040, I can see the events being received from Flume. (We are pumping 100 events/sec from Flume using the flume-ng-sql source, with an approximate size of ~1 KB each.)
Whereas when I run the same example with --master "yarn-client", I get the error below in Flume, and Spark does not receive any events.
2015-08-13 18:24:24,927 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:160)] Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: Failed to send events
at org.apache.flume.sink.AbstractRpcSink.process(AbstractRpcSink.java:403)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flume.FlumeException: NettyAvroRpcClient { host: localhost, port: 55555 }: RPC connection error
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:182)
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:121)
at org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:638)
at org.apache.flume.api.RpcClientFactory.getInstance(RpcClientFactory.java:88)
at org.apache.flume.sink.AvroSink.initializeRpcClient(AvroSink.java:127)
at org.apache.flume.sink.AbstractRpcSink.createConnection(AbstractRpcSink.java:222)
at org.apache.flume.sink.AbstractRpcSink.verifyConnection(AbstractRpcSink.java:283)
at org.apache.flume.sink.AbstractRpcSink.process(AbstractRpcSink.java:360)
... 3 more
Caused by: java.io.IOException: Error connecting to localhost/127.0.0.1:55555
at org.apache.avro.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:261)
at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:203)
at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:152)
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:168)
... 10 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:496)
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:452)
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:365)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
I also observed the following on the cluster:
-- Memory consumption under YARN is much higher than in local mode.
-- When I pump 100 events per 30 seconds, Flume and Spark are able to connect and process the events with yarn-client as well as with local.
Below is the command which I am using for flume and spark.
Flume:
sudo -u hdfs flume-ng agent --conf conf/ -f conf/flume_mysql_spark.conf -n agent1 -Dflume.root.logger=INFO,console > flumelog.txt
Spark:
sudo -u hdfs spark-submit --master "yarn-client" --class "org.paladion.atm.FlumeEventCount" target/atm-1.1-jar-with-dependencies.jar > sparklog.txt
sudo -u hdfs spark-submit --master "local[*]" --class "org.paladion.atm.FlumeEventCount" target/atm-1.1-jar-with-dependencies.jar > sparklog.txt
Kindly let me know what could be wrong here.
It was solved as follows:
1 - If running in local mode, give the IP of the local machine in Flume as well as in Spark.
2 - If running on the cluster (yarn-client or yarn-cluster), give, in Flume as well as in Spark, the IP of the cluster machine where you want to send the events. This should be a machine other than the one where you are executing the program, so perhaps a node that is not the master node.
Let me know if I am wrong, if this worked for some other reason, or if there is a better solution.
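The stack trace above shows the Flume sink targeting localhost:55555, which can only work when the receiver runs on the same machine as the Flume agent. A small check like the following can flag a sink host that resolves to a loopback address before the job is submitted to the cluster. It is a sketch using the standard library only; is_loopback_target is a hypothetical name and 10.0.0.12 is a made-up example address.

```python
import ipaddress
import socket


def is_loopback_target(host: str) -> bool:
    """Return True if host resolves to a loopback address such as 127.0.0.1."""
    address = socket.gethostbyname(host)
    return ipaddress.ip_address(address).is_loopback


# The sink in the stack trace above targets localhost:55555.
print(is_loopback_target("127.0.0.1"))  # True
print(is_loopback_target("10.0.0.12"))  # False (hypothetical cluster node IP)
```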
