I'm using a clustered Airflow environment with four AWS EC2 instances as the servers.
EC2 instances
Server 1: Webserver, Scheduler, Redis Queue, PostgreSQL Database
Server 2: Webserver
Server 3: Worker
Server 4: Worker
My setup has been working perfectly fine for three months, but sporadically, about once a week, I get a Broken Pipe exception when Airflow attempts to log something.
*** Log file isn't local.
*** Fetching here: http://ip-1-2-3-4:8793/log/foobar/task_1/2018-07-13T00:00:00/1.log
[2018-07-16 00:00:15,521] {cli.py:374} INFO - Running on host ip-1-2-3-4
[2018-07-16 00:00:15,698] {models.py:1197} INFO - Dependencies all met for <TaskInstance: foobar.task_1 2018-07-13 00:00:00 [queued]>
[2018-07-16 00:00:15,710] {models.py:1197} INFO - Dependencies all met for <TaskInstance: foobar.task_1 2018-07-13 00:00:00 [queued]>
[2018-07-16 00:00:15,710] {models.py:1407} INFO -
--------------------------------------------------------------------------------
Starting attempt 1 of 1
--------------------------------------------------------------------------------
[2018-07-16 00:00:15,719] {models.py:1428} INFO - Executing <Task(OmegaFileSensor): task_1> on 2018-07-13 00:00:00
[2018-07-16 00:00:15,720] {base_task_runner.py:115} INFO - Running: ['bash', '-c', 'airflow run foobar task_1 2018-07-13T00:00:00 --job_id 1320 --raw -sd DAGS_FOLDER/datalake_digitalplatform_arl_workflow_schedule_test_2.py']
[2018-07-16 00:00:16,532] {base_task_runner.py:98} INFO - Subtask: [2018-07-16 00:00:16,532] {configuration.py:206} WARNING - section/key [celery/celery_ssl_active] not found in config
[2018-07-16 00:00:16,532] {base_task_runner.py:98} INFO - Subtask: [2018-07-16 00:00:16,532] {default_celery.py:41} WARNING - Celery Executor will run without SSL
[2018-07-16 00:00:16,534] {base_task_runner.py:98} INFO - Subtask: [2018-07-16 00:00:16,533] {__init__.py:45} INFO - Using executor CeleryExecutor
[2018-07-16 00:00:16,597] {base_task_runner.py:98} INFO - Subtask: [2018-07-16 00:00:16,597] {models.py:189} INFO - Filling up the DagBag from /home/ec2-user/airflow/dags/datalake_digitalplatform_arl_workflow_schedule_test_2.py
[2018-07-16 00:00:16,768] {cli.py:374} INFO - Running on host ip-1-2-3-4
[2018-07-16 00:16:24,931] {logging_mixin.py:84} WARNING - --- Logging error ---
[2018-07-16 00:16:24,931] {logging_mixin.py:84} WARNING - Traceback (most recent call last):
[2018-07-16 00:16:24,931] {logging_mixin.py:84} WARNING - File "/usr/lib64/python3.6/logging/__init__.py", line 996, in emit
self.flush()
[2018-07-16 00:16:24,932] {logging_mixin.py:84} WARNING - File "/usr/lib64/python3.6/logging/__init__.py", line 976, in flush
self.stream.flush()
[2018-07-16 00:16:24,932] {logging_mixin.py:84} WARNING - BrokenPipeError: [Errno 32] Broken pipe
[2018-07-16 00:16:24,932] {logging_mixin.py:84} WARNING - Call stack:
[2018-07-16 00:16:24,933] {logging_mixin.py:84} WARNING - File "/usr/bin/airflow", line 27, in <module>
args.func(args)
[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING - File "/usr/local/lib/python3.6/site-packages/airflow/bin/cli.py", line 392, in run
pool=args.pool,
[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING - File "/usr/local/lib/python3.6/site-packages/airflow/utils/db.py", line 50, in wrapper
result = func(*args, **kwargs)
[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING - File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 1488, in _run_raw_task
result = task_copy.execute(context=context)
[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING - File "/usr/local/lib/python3.6/site-packages/airflow/operators/sensors.py", line 78, in execute
while not self.poke(context):
[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING - File "/home/ec2-user/airflow/plugins/custom_plugins.py", line 35, in poke
directory = os.listdir(full_path)
[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING - File "/usr/local/lib/python3.6/site-packages/airflow/utils/timeout.py", line 36, in handle_timeout
self.log.error("Process timed out")
[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING - Message: 'Process timed out'
Arguments: ()
[2018-07-16 00:16:24,942] {models.py:1595} ERROR - Timeout
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 1488, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.6/site-packages/airflow/operators/sensors.py", line 78, in execute
while not self.poke(context):
File "/home/ec2-user/airflow/plugins/custom_plugins.py", line 35, in poke
directory = os.listdir(full_path)
File "/usr/local/lib/python3.6/site-packages/airflow/utils/timeout.py", line 37, in handle_timeout
raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: Timeout
[2018-07-16 00:16:24,942] {models.py:1624} INFO - Marking task as FAILED.
[2018-07-16 00:16:24,956] {models.py:1644} ERROR - Timeout
Sometimes the error will also say
*** Log file isn't local.
*** Fetching here: http://ip-1-2-3-4:8793/log/foobar/task_1/2018-07-12T00:00:00/1.log
*** Failed to fetch log file from worker. 404 Client Error: NOT FOUND for url: http://ip-1-2-3-4:8793/log/foobar/task_1/2018-07-12T00:00:00/1.log
I'm not sure why the logs work ~95% of the time but fail randomly at other times. Here are the logging settings in my airflow.cfg file:
# The folder where airflow should store its log files
# This path must be absolute
base_log_folder = /home/ec2-user/airflow/logs
# Airflow can store logs remotely in AWS S3 or Google Cloud Storage. Users
# must supply an Airflow connection id that provides access to the storage
# location.
remote_log_conn_id =
encrypt_s3_logs = False
# Logging level
logging_level = INFO
# Logging class
# Specify the class that will specify the logging configuration
# This class has to be on the python classpath
# logging_config_class = my.path.default_local_settings.LOGGING_CONFIG
logging_config_class =
# Log format
log_format = [%%(asctime)s] {%%(filename)s:%%(lineno)d} %%(levelname)s - %%(message)s
simple_log_format = %%(asctime)s %%(levelname)s - %%(message)s
# Name of handler to read task instance logs.
# Default to use file task handler.
task_log_reader = file.task
# Log files for the gunicorn webserver. '-' means log to stderr.
access_logfile = -
error_logfile =
# The amount of time (in secs) webserver will wait for initial handshake
# while fetching logs from other worker machine
log_fetch_timeout_sec = 5
# When you start an airflow worker, airflow starts a tiny web server
# subprocess to serve the workers local log files to the airflow main
# web server, who then builds pages and sends them to users. This defines
# the port on which the logs are served. It needs to be unused, and open
# visible from the main web server to connect into the workers.
worker_log_server_port = 8793
# How often should stats be printed to the logs
print_stats_interval = 30
child_process_log_directory = /home/ec2-user/airflow/logs/scheduler
I'm wondering if maybe I should try a different technique for my logging such as writing to an S3 Bucket or if there is something else I can do to fix this issue.
Update:
Writing the logs to S3 did not resolve this issue. Also, the error is more consistent now (still sporadic); it's happening about 50% of the time. One thing I noticed is that the task it happens on is my AWS EMR creation task. Starting an AWS EMR cluster takes about 20 minutes, and then the task has to wait for the Spark commands to run on the EMR cluster, so the single task runs for about 30 minutes. I'm wondering if this is too long for an Airflow task to be running and if that's why it starts to fail writing to the logs. If so, I could break up the EMR task so that there is one task for the EMR creation and another task for the Spark commands on the EMR cluster (a rough sketch below).
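To visualize the split, here is a minimal, hypothetical sketch of what the two-task version could look like, assuming boto3 and PythonOperator on Airflow 1.9; the cluster configuration, region, bucket, and task names are all placeholders, not my actual setup:
import boto3
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def create_emr_cluster(**context):
    emr = boto3.client("emr", region_name="us-east-1")
    response = emr.run_job_flow(
        Name="example-cluster",                      # placeholder cluster config
        ReleaseLabel="emr-5.13.0",
        Instances={
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    # Returning the value pushes the cluster id to XCom for the next task.
    return response["JobFlowId"]

def add_spark_step(**context):
    emr = boto3.client("emr", region_name="us-east-1")
    cluster_id = context["ti"].xcom_pull(task_ids="create_emr_cluster")
    emr.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[{
            "Name": "spark-job",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/my_job.py"],  # placeholder script
            },
        }],
    )

with DAG("emr_split_example", start_date=datetime(2018, 7, 1),
         schedule_interval="@daily", catchup=False) as dag:
    create_cluster = PythonOperator(task_id="create_emr_cluster",
                                    python_callable=create_emr_cluster,
                                    provide_context=True)
    run_spark = PythonOperator(task_id="add_spark_step",
                               python_callable=add_spark_step,
                               provide_context=True)
    create_cluster >> run_spark
The second task would still need something to wait for the step to finish (a sensor or a polling loop), but the point is that the long-running work is no longer inside one 30-minute task.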
Note:
I've also created a bug ticket on Airflow's Jira: https://issues.apache.org/jira/browse/AIRFLOW-2844
This issue turned out to be a symptom of another issue I just resolved: AirflowException: Celery command failed - The recorded hostname does not match this instance's hostname.
I didn't see the AirflowException: Celery command failed error for a while because it only showed up in the airflow worker logs. It wasn't until I watched the airflow worker logs in real time that I noticed that whenever that error was thrown, I also got the BrokenPipeException in my task.
It gets somewhat weirder though. I would only see the BrokenPipeException thrown if I used print("something to log") and the AirflowException: Celery command failed... error happened on the worker node. When I changed all of my print statements to use import logging ... logging.info("something to log"), I no longer saw the BrokenPipeException, but the task still failed because of the AirflowException: Celery command failed... error. Had I not seen the BrokenPipeException thrown in my Airflow task logs, though, I wouldn't have known why the task was failing, because once I eliminated the print statements I never saw any error in the Airflow task logs (only in the airflow worker logs).
So, long story short, there are a few takeaways.
Don't use print("something to log"). Use Airflow's built-in logging instead: import logging and then call logging.info("something to log") (see the sketch after this list).
If you're using an AWS EC2 instance as your server for Airflow, you may be hitting this issue: https://github.com/apache/incubator-airflow/pull/2484. A fix for it has already been integrated into Airflow 1.10 (I'm currently using Airflow 1.9), so upgrade your Airflow version to 1.10. You can also use the command pip install git+git://github.com/apache/incubator-airflow.git#v1-10-stable. If you don't want to upgrade your Airflow version, you can follow the steps in the GitHub issue to either manually update the file with the fix or fork Airflow and cherry-pick the commit that fixes it.
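As a minimal sketch of the logging-based approach (the callable name and message here are placeholders, not from my DAG):
import logging

def my_task_callable(**context):
    # Routed through Airflow's logging handlers instead of raw stdout,
    # so it does not depend on the pipe that print() writes to.
    logging.info("something to log")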
Related
I deployed a DAG with a SparkSubmitOperator task. I unpaused it so that it would create a task run, and then paused it again so that it wouldn't create new ones. After that, I started running the task, and about 5 minutes after launch it always received a SIGTERM with this log:
[2023-01-27 12:50:04,783] {local_task_job.py:188} WARNING - State of this instance has been externally set to None. Terminating instance.
[2023-01-27 12:50:04,798] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 968
[2023-01-27 12:50:04,802] {taskinstance.py:1265} ERROR - Received SIGTERM. Terminating subprocesses.
[2023-01-27 12:50:04,804] {spark_submit.py:657} INFO - Sending kill signal to spark-submit
[2023-01-27 12:50:15,985] {spark_submit.py:674} INFO - YARN app killed with return code: 0
[2023-01-27 12:50:16,122] {taskinstance.py:1482} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1138, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1341, in _execute_task
result = task_copy.execute(context=context)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/providers/apache/spark/operators/spark_submit.py", line 183, in execute
self._hook.submit(self._application)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 440, in submit
self._process_spark_submit_log(iter(self._submit_sp.stdout)) # type: ignore
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 494, in _process_spark_submit_log
for line in itr:
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1267, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
This error was reproducible; I tried more than 5 times. But then I unpaused the DAG, launched it, and it worked. Why is this happening?
DAGs are supposed to stay unpaused for their tasks to run correctly. :)
You can limit how many DAG runs are created by setting the DAG parameter max_active_runs to 1, and limit how many instances of a specific task can run at any time by setting the task-level parameter max_active_tis_per_dag to 1. Also make sure you set the DAG parameter catchup to False, to avoid many runs being scheduled if your start_date is a while back in the past.
This DAG should only create one running t1 task every day (based on the @daily schedule).
from airflow import DAG
from pendulum import datetime
from airflow.operators.empty import EmptyOperator
with DAG(
    dag_id="simple_classic_dag",
    start_date=datetime(2022, 12, 10),
    schedule="@daily",
    catchup=False,
    max_active_runs=1,
    tags=["simple", "debug"],
) as dag:
    t1 = EmptyOperator(task_id="t1", max_active_tis_per_dag=1)
You might find these guides helpful:
DAG scheduling and timetables in Airflow explains scheduling basics
Scaling Airflow to optimize performance goes over scaling and limit parameters
I'm running apache-airflow on my Mac M1 to insert data from a staging table (a temporary table created from S3) into AWS Redshift.
The following error was raised, and it doesn't tell me what the underlying issue is. Could anybody please help me debug?
[2022-07-30 16:16:58,226] {subdag.py:186} INFO - Execution finished. State is failed
[2022-07-30 16:16:58,252] {taskinstance.py:1909} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1471, in _run_raw_task
self._execute_task_with_callbacks(context, test_mode)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1621, in _execute_task_with_callbacks
self.task.post_execute(context=context, result=result)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/subdag.py", line 189, in post_execute
raise AirflowException(f"Expected state: SUCCESS. Actual state: {dag_run.state}")
airflow.exceptions.AirflowException: Expected state: SUCCESS. Actual state: failed
[2022-07-30 16:16:58,268] {taskinstance.py:1415} INFO - Marking task as FAILED. dag_id=sparkify_dag, task_id=load_artist_dim_table, execution_date=20220729T174100, start_date=20220730T161557, end_date=20220730T161658
[2022-07-30 16:16:58,298] {standard_task_runner.py:92} ERROR - Failed to execute job 399 for task load_artist_dim_table (Expected state: SUCCESS. Actual state: failed; 55013)
[2022-07-30 16:16:58,356] {local_task_job.py:156} INFO - Task exited with return code 1
[2022-07-30 16:16:58,426] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check
ubuntu#vps-9b30a7d3:~/Database/yugabyte-2.1.8.1$ ./bin/yb-ctl create
Creating cluster.
Waiting for cluster to be ready.
Viewing file /home/ubuntu/yugabyte-data/node-1/disk-1/tserver.err:
sh: 1: /home/ubuntu/Database/yugabyte-2.1.8.1/bin/yb-tserver: not found
Viewing file /home/ubuntu/yugabyte-data/node-1/disk-1/master.err:
sh: 1: /home/ubuntu/Database/yugabyte-2.1.8.1/bin/yb-master: not found
Traceback (most recent call last):
File "./bin/yb-ctl", line 2021, in <module>
control.run()
File "./bin/yb-ctl", line 1998, in run
self.args.func()
File "./bin/yb-ctl", line 1755, in create_cmd_impl
self.wait_for_cluster_or_raise()
File "./bin/yb-ctl", line 1598, in wait_for_cluster_or_raise
raise RuntimeError("Timed out waiting for a YugaByte DB cluster!")
RuntimeError: Timed out waiting for a YugaByte DB cluster!
Viewing file /tmp/tmpfY6csf:
2020-08-02 10:15:38,864 INFO: Starting master-1 with:
/home/ubuntu/Database/yugabyte-2.1.8.1/bin/yb-master --fs_data_dirs "/home/ubuntu/yugabyte-data/node-1/disk-1" --webserver_interface 127.0.0.1 --rpc_bind_addresses 127.0.0.1 --v 0 --version_file_json_path=/home/ubuntu/Database/yugabyte-2.1.8.1 --webserver_doc_root "/home/ubuntu/Database/yugabyte-2.1.8.1/www" --replication_factor=1 --yb_num_shards_per_tserver 2 --ysql_num_shards_per_tserver=2 --default_memory_limit_to_ram_ratio=0.35 --master_addresses 127.0.0.1:7100 --enable_ysql=true >"/home/ubuntu/yugabyte-data/node-1/disk-1/master.out" 2>"/home/ubuntu/yugabyte-data/node-1/disk-1/master.err" &
2020-08-02 10:15:38,871 INFO: Starting tserver-1 with:
/home/ubuntu/Database/yugabyte-2.1.8.1/bin/yb-tserver --fs_data_dirs "/home/ubuntu/yugabyte-data/node-1/disk-1" --webserver_interface 127.0.0.1 --rpc_bind_addresses 127.0.0.1 --v 0 --version_file_json_path=/home/ubuntu/Database/yugabyte-2.1.8.1 --webserver_doc_root "/home/ubuntu/Database/yugabyte-2.1.8.1/www" --tserver_master_addrs=127.0.0.1:7100 --yb_num_shards_per_tserver=2 --redis_proxy_bind_address=127.0.0.1:6379 --cql_proxy_bind_address=127.0.0.1:9042 --local_ip_for_outbound_sockets=127.0.0.1 --use_cassandra_authentication=false --ysql_num_shards_per_tserver=2 --default_memory_limit_to_ram_ratio=0.65 --enable_ysql=true --pgsql_proxy_bind_address=127.0.0.1:5433 >"/home/ubuntu/yugabyte-data/node-1/disk-1/tserver.out" 2>"/home/ubuntu/yugabyte-data/node-1/disk-1/tserver.err" &
2020-08-02 10:15:38,873 INFO: Waiting for master and tserver processes to come up.
2020-08-02 10:15:49,111 INFO: PIDs found: {'tserver': [None], 'master': [None]}
2020-08-02 10:15:49,113 ERROR: Failed waiting for master and tserver processes to come up.
^^^ Encountered errors ^^^
I have a test server on which I am trying to install YugabyteDB. Every time I try to create a cluster it throws this error. On my local server I encountered the same error, but when I checked the cluster status it showed the cluster was created. On the test server, however, the status shows no node created, although the yugabyte-data folder is being created.
I ran yb-ctl create as specified at https://download.yugabyte.com/local#linux and ran into these errors
13:10 $ bin/yb-ctl create
Creating cluster.
Waiting for cluster to be ready.
Viewing file /net/dev-server-sanketh-3/share/yugabyte-data/node-1/disk-1/tserver.err:
/tmp/pkg1/yugabyte-2.0.7.0/bin/yb-tserver: error while loading shared libraries: libatomic.so.1: cannot open shared object file: No such file or directory
Viewing file /net/dev-server-sanketh-3/share/yugabyte-data/node-1/disk-1/master.err:
/tmp/pkg1/yugabyte-2.0.7.0/bin/yb-master: error while loading shared libraries: libatomic.so.1: cannot open shared object file: No such file or directory
Traceback (most recent call last):
File "bin/yb-ctl", line 1968, in <module>
control.run()
File "bin/yb-ctl", line 1945, in run
self.args.func()
File "bin/yb-ctl", line 1707, in create_cmd_impl
self.wait_for_cluster_or_raise()
File "bin/yb-ctl", line 1552, in wait_for_cluster_or_raise
raise RuntimeError("Timed out waiting for a YugaByte DB cluster!")
RuntimeError: Timed out waiting for a YugaByte DB cluster!
Viewing file /tmp/tmp3NIbj3:
2019-12-06 13:10:18,634 INFO: Starting master-1 with:
/tmp/pkg1/yugabyte-2.0.7.0/bin/yb-master --fs_data_dirs "/net/dev-server-sanketh-3/share/yugabyte-data/node-1/disk-1" --webserver_interface 127.0.0.1 --rpc_bind_addresses 127.0.0.1 --v 0 --version_file_json_path=/tmp/pkg1/yugabyte-2.0.7.0 --webserver_doc_root "/tmp/pkg1/yugabyte-2.0.7.0/www" --callhome_enabled=false --replication_factor=1 --yb_num_shards_per_tserver 2 --ysql_num_shards_per_tserver=2 --master_addresses 127.0.0.1:7100 --enable_ysql=true >"/net/dev-server-sanketh-3/share/yugabyte-data/node-1/disk-1/master.out" 2>"/net/dev-server-sanketh-3/share/yugabyte-data/node-1/disk-1/master.err" &
2019-12-06 13:10:18,658 INFO: Starting tserver-1 with:
/tmp/pkg1/yugabyte-2.0.7.0/bin/yb-tserver --fs_data_dirs "/net/dev-server-sanketh-3/share/yugabyte-data/node-1/disk-1" --webserver_interface 127.0.0.1 --rpc_bind_addresses 127.0.0.1 --v 0 --version_file_json_path=/tmp/pkg1/yugabyte-2.0.7.0 --webserver_doc_root "/tmp/pkg1/yugabyte-2.0.7.0/www" --callhome_enabled=false --tserver_master_addrs=127.0.0.1:7100 --yb_num_shards_per_tserver=2 --redis_proxy_bind_address=127.0.0.1:6379 --cql_proxy_bind_address=127.0.0.1:9042 --local_ip_for_outbound_sockets=127.0.0.1 --use_cassandra_authentication=false --ysql_num_shards_per_tserver=2 --enable_ysql=true --pgsql_proxy_bind_address=127.0.0.1:5433 >"/net/dev-server-sanketh-3/share/yugabyte-data/node-1/disk-1/tserver.out" 2>"/net/dev-server-sanketh-3/share/yugabyte-data/node-1/disk-1/tserver.err" &
2019-12-06 13:10:18,662 INFO: Waiting for master and tserver processes to come up.
2019-12-06 13:10:29,126 INFO: PIDs found: {'tserver': [None], 'master': [None]}
2019-12-06 13:10:29,127 ERROR: Failed waiting for master and tserver processes to come up.
^^^ Encountered errors ^^^
Could you please let me know as to how I could fix this?
Did you run ./bin/post_install.sh from the setup?
If yes, maybe you're missing apt-get install libatomic1?
I am running a Spark (1.2.1) standalone cluster on my virtual machine (Ubuntu 12.04). I can run examples such as als.py and pi.py successfully, but I can't run the wordcount.py example because a connection error occurs.
bin/spark-submit --master spark://192.168.1.211:7077 /examples/src/main/python/wordcount.py ~/Documents/Spark_Examples/wordcount.py
The error message is as below:
15/03/13 22:26:02 INFO BlockManagerMasterActor: Registering block manager a12:45594 with 267.3 MB RAM, BlockManagerId(0, a12, 45594)
15/03/13 22:26:03 INFO Client: Retrying connect to server: a11/192.168.1.211:9000. Already tried 4 time(s).
......
Traceback (most recent call last):
File "/home/spark/spark/examples/src/main/python/wordcount.py", line 32, in <module>
.reduceByKey(add)
File "/home/spark/spark/lib/spark-assembly-1.2.1 hadoop1.0.4.jar/pyspark/rdd.py", line 1349, in reduceByKey
File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 1559, in combineByKey
File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 1942, in _defaultReducePartitions
File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 297, in getNumPartitions
......
py4j.protocol.Py4JJavaError: An error occurred while calling o23.partitions.
java.lang.RuntimeException: java.net.ConnectException: Call to a11/192.168.1.211:9000 failed on connection exception: java.net.ConnectException: Connection refused
......
I didn't use YARN or ZooKeeper, and all the virtual machines can connect to each other via ssh without a password. I also set SPARK_LOCAL_IP for the master and workers.
I think the wordcount.py example is accessing HDFS to read lines from a file (and then count the words).
Something like:
sc.textFile("hdfs://<master-hostname>:9000/path/to/whatever")
Port 9000 is usually used for HDFS.
Please be sure that this file is accessible, or do not use HDFS for that example :) (see the local-file sketch below).
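For reference, a minimal sketch of the same word count reading a local file instead of HDFS; the file path is an assumption, so pick one that exists on every node:
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="PythonWordCount")

# file:// forces Spark to read from the local filesystem instead of HDFS,
# so no NameNode on port 9000 is needed. Placeholder path below.
lines = sc.textFile("file:///home/spark/Documents/Spark_Examples/words.txt")
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(add))

for word, count in counts.collect():
    print("%s: %i" % (word, count))

sc.stop()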
I hope it helps.