I am trying to run this GitHub project on an AWS EMR Spark cluster:
https://github.com/pran4ajith/spark-twitter-streaming.git
I've succeeded in running the first two scripts:
tweet_stream_producer.py
sparkml_train_model.py
But when I run the consumer part with the command
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0,io.delta:delta-core_2.12:0.7.0 tweet_stream_consumer.py
I get the following file path error:
Py4JJavaError: An error occurred while calling o137.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://ip-10-0-0-61.ec2.internal:8020/home/hadoop/spark-twitter-streaming/TwitterStreaming/src/app/models/metadata
It seems that the problem lies in the mapping between the local file system path and the Hadoop (HDFS) file system path:
model_path = str(SRC_DIR / 'models')
pipeline_model = PipelineModel.load(model_path)
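For reference, a minimal sketch of two ways around this (both assume the models directory was written by sparkml_train_model.py to the local path shown in the error, which is not confirmed by the question): either point Spark explicitly at the local filesystem, or push the trained model into HDFS and load it from there.
# Sketch only, not verified against the repo.
# 1) Force the local-filesystem scheme so the path is not resolved against HDFS.
#    Note: if executors run on other nodes, the directory must exist on every node.
from pyspark.ml import PipelineModel

model_path = 'file://' + str(SRC_DIR / 'models')   # SRC_DIR as defined in the original script
pipeline_model = PipelineModel.load(model_path)

# 2) Or copy the trained model into HDFS first and load it from there:
#    hdfs dfs -put /home/hadoop/spark-twitter-streaming/TwitterStreaming/src/app/models /models
# pipeline_model = PipelineModel.load('hdfs:///models')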
I am running Spark on Windows using winutils.
In the Spark shell I am trying to load a CSV file, but it says "Path does not exist". I have the file at location E:/data.csv.
I am executing:
scala> val df = spark.read.option("header","true").csv("E:\\data.csv")
Error:
org.apache.spark.sql.AnalysisException: Path does not exist: file:/E:/data.csv;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:558)
I can't figure out why it is appending a "/E:", when it should be only E:.
How should I access the file?
In my case I am able to read the file as below
val input = spark.sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true").option("delimiter", ";")
  .option("quoteAll", "true").option("inferSchema", "false")
  .load("C:/Work/test.csv").toDF()
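If you are in pyspark rather than the Scala shell, the same idea can be sketched as follows (assuming the file really does exist at E:\data.csv); the scheme-qualified file:/// URI removes any ambiguity about which filesystem the path belongs to:
# Explicit local-filesystem scheme with forward slashes on Windows.
df = spark.read.option("header", "true").csv("file:///E:/data.csv")
df.show(5)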
I am new to Airflow and Spark and I am struggling with the SparkSubmitOperator.
Our Airflow scheduler and our Hadoop cluster are not set up on the same machine (first question: is this good practice?).
We have many automatic procedures that need to call pyspark scripts. Those pyspark scripts are stored in the hadoop cluster (10.70.1.35). The airflow dags are stored in the airflow machine (10.70.1.22).
Currently, when we want to spark-submit a pyspark script with airflow, we use a simple BashOperator as follows:
cmd = "ssh hadoop@10.70.1.35 spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --executor-memory 2g \
    --executor-cores 2 \
    /home/hadoop/pyspark_script/script.py"
t = BashOperator(task_id='Spark_datamodel', bash_command=cmd, dag=dag)
It works perfectly fine, but we would like to start using the SparkSubmitOperator to spark-submit our pyspark scripts.
I tried this:
from airflow import DAG
from datetime import timedelta, datetime
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from airflow.operators.bash_operator import BashOperator
from airflow.models import Variable
dag = DAG('SPARK_SUBMIT_TEST', start_date=datetime(2018, 12, 10),
          schedule_interval='@daily')
sleep = BashOperator(task_id='sleep', bash_command='sleep 10',dag=dag)
_config = {'application': 'hadoop@10.70.1.35:/home/hadoop/pyspark_script/test_spark_submit.py',
           'master': 'yarn',
           'deploy-mode': 'cluster',
           'executor_cores': 1,
           'EXECUTORS_MEM': '2G'
           }
spark_submit_operator = SparkSubmitOperator(
task_id='spark_submit_job',
dag=dag,
**_config)
sleep.set_downstream(spark_submit_operator)
The syntax should be OK, as the DAG does not show up as broken. But when it runs, it gives me the following error:
[2018-12-14 03:26:42,600] {logging_mixin.py:95} INFO - [2018-12-14 03:26:42,600] {base_hook.py:83} INFO - Using connection to: yarn
[2018-12-14 03:26:42,974] {logging_mixin.py:95} INFO - [2018-12-14 03:26:42,973] {spark_submit_hook.py:283} INFO - Spark-Submit cmd: ['spark-submit', '--master', 'yarn', '--executor-cores', '1', '--name', 'airflow-spark', '--queue', 'root.default', 'hadoop@10.70.1.35:/home/hadoop/pyspark_script/test_spark_submit.py']
[2018-12-14 03:26:42,977] {models.py:1760} ERROR - [Errno 2] No such file or directory: 'spark-submit'
Traceback (most recent call last):
  File "/home/dataetl/anaconda3/lib/python3.6/site-packages/airflow/models.py", line 1659, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/home/dataetl/anaconda3/lib/python3.6/site-packages/airflow/contrib/operators/spark_submit_operator.py", line 168, in execute
    self._hook.submit(self._application)
  File "/home/dataetl/anaconda3/lib/python3.6/site-packages/airflow/contrib/hooks/spark_submit_hook.py", line 330, in submit
    **kwargs)
  File "/home/dataetl/anaconda3/lib/python3.6/subprocess.py", line 707, in __init__
    restore_signals, start_new_session)
  File "/home/dataetl/anaconda3/lib/python3.6/subprocess.py", line 1326, in _execute_child
    raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'spark-submit'
Here are my questions:
Should I install Spark/Hadoop on my Airflow machine? I'm asking because in this topic I read that I need to copy hdfs-site.xml and hive-site.xml. But as you can imagine, I have neither /etc/hadoop/ nor /etc/hive/ directories on my Airflow machine.
a) If no, where exactly should I copy hdfs-site.xml and hive-site.xml on my airflow machine?
b) If yes, does it mean that I need to configure my airflow machine as a client? A kind of edge node that does not participate in jobs but can be used to submit actions?
Then, will I be able to spark-submit from my airflow machine? If yes, then I don't need to create a connection on Airflow like I do for a mysql database for example, right?
Oh, and the cherry on the cake: will I be able to store my pyspark scripts on my Airflow machine and spark-submit them from that same machine? That would be amazing!
Any comment would be very useful, even if you're not able to answer all my questions...
Thanks in advance anyway! :)
To answer your first question, yes it is a good practice.
For how you can use SparkSubmitOperator, please refer to my answer on https://stackoverflow.com/a/53344713/5691525
Yes, you need the Spark binaries on the Airflow machine.
b) Yes.
No -> you still need a connection to tell Airflow where you have installed your Spark binaries. Similar to https://stackoverflow.com/a/50541640/5691525
Should work.
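To make that last point concrete, here is a minimal sketch of what the operator could look like once the Spark binaries are on the Airflow machine and a Spark connection has been created. The connection id spark_default and the application path are assumptions, and note that the operator takes executor_memory / executor_cores as keyword arguments rather than the spark-submit flag spellings used in _config above:
from airflow import DAG
from datetime import datetime
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

dag = DAG('SPARK_SUBMIT_TEST', start_date=datetime(2018, 12, 10),
          schedule_interval='@daily')

spark_submit = SparkSubmitOperator(
    task_id='spark_submit_job',
    # Path as seen from the Airflow machine: the script would have to live there
    # (or on a shared mount). This exact path is illustrative, not the real one.
    application='/home/dataetl/pyspark_script/test_spark_submit.py',
    conn_id='spark_default',       # Airflow connection pointing at the YARN cluster;
                                   # deploy-mode and queue go in the connection's "extra" field
    executor_cores=1,
    executor_memory='2g',
    name='airflow-spark',
    dag=dag)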
mymaster:
$ ./sbin/start-master.sh
myworker:
$ ./sbin/start-slave.sh spark://mymaster:7077
myclient:
$ ./bin/spark-shell --master spark://mymaster:7077
At this moment, the log of myworker says the following, indicating that it has accepted the job:
16/06/01 02:22:41 INFO Worker: Asked to launch executor app-20160601022241-0007/0 for Spark shell
myclient:
scala> sc.textFile("mylocalfile.txt").map(_.length).sum
res0: Double = 3264.0
It works if the file mylocalfile.txt is available on myclient. However, according to the docs, the file should need to be available on myworker, not on myclient.
If using a path on the local filesystem, the file must also be
accessible at the same path on worker nodes. Either copy the file to
all workers or use a network-mounted shared file system.
what am I missing here?
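As an aside, here is a minimal pyspark sketch of a workaround that only needs the file on the client: read it on the driver and parallelize the lines, so the workers never touch the local path. This assumes the file is small enough to fit in driver memory.
# Read the file locally on the driver (myclient), then ship the lines to the executors.
with open("mylocalfile.txt") as f:
    lines = f.read().splitlines()

rdd = sc.parallelize(lines)
total = rdd.map(len).sum()   # roughly the same computation as the textFile version above
print(total)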
We are running some PySpark processes on YARN. As the datasets increase in size, we are getting this error in the YARN log:
Traceback (most recent call last):
File "/home/hadoop/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
File "/home/hadoop/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
File "/home/hadoop/spark/python/lib/pyspark.zip/pyspark/worker.py", line 136, in main
if read_int(infile) == SpecialLengths.END_OF_STREAM:
File "/home/hadoop/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 544, in read_int
raise EOFError
java.net.SocketException: Socket is closed
at java.net.Socket.shutdownOutput(Socket.java:1496)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3$$anonfun$apply$2.apply$mcV$sp(PythonRDD.scala:256)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3$$anonfun$apply$2.apply(PythonRDD.scala:256)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3$$anonfun$apply$2.apply(PythonRDD.scala:256)
at org.apache.spark.util.Utils$.tryLog(Utils.scala:1785)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:256)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)
We are running on an EMR setup of 3 m3.xlarge instances, each with 4 vCPUs, 15 GiB of RAM and 2x40 GB of storage.
The job is executed with the following sh script:
export SPARK_HOME=/home/hadoop/spark
JARS="/home/hadoop/avro-1.7.7.jar,/home/hadoop/spark-avro-master/target/scala-2.10/spark-avro_2.10-1.0.0.jar"
$SPARK_HOME/bin/spark-submit --master yarn-cluster --py-files deploy.zip --jars $JARS main.py
where deploy.zip contains some utility methods and lambda functions
No other configuration changes were made to the cluster.
Looking at the UI, it seems that all the jobs are finishing with a SUCCESS status; nevertheless, we would like to get rid of this issue, or at the very least understand what's causing it.
Do you have any idea what might be the origin of this error?
Thanks!