My environment:
I'm using Hortonworks HDP 2.4 with Spark 1.6.1 on a small AWS EC2 cluster of 4 g2.2xlarge instances with Ubuntu 14.04. Each instance has CUDA 7.5, Anaconda Python 3.5, and Pycuda 2016.1.1.
In /etc/bash.bashrc I've set:
CUDA_HOME=/usr/local/cuda
CUDA_ROOT=/usr/local/cuda
PATH=$PATH:/usr/local/cuda/bin
On all 4 machines I can access nvcc from the command line for the ubuntu user, the root user, and the yarn user.
My problem:
I have a Python-Pycuda project I've adapted to run on Spark. It runs great on my local Spark installation on my Mac, but when I run it on AWS I get:
FileNotFoundError: [Errno 2] No such file or directory: 'nvcc'
Since it runs on my Mac in local mode, my guess is that it is a configuration issue with CUDA/Pycuda in the worker processes, but I'm really stumped as to what it could be.
Any ideas?
Edit: Below is a stack trace from one of the jobs failing:
16/11/10 22:34:54 INFO ExecutorAllocationManager: Requesting 13 new executors because tasks are backlogged (new desired total will be 17)
16/11/10 22:34:57 INFO TaskSetManager: Starting task 16.0 in stage 2.0 (TID 34, ip-172-31-26-35.ec2.internal, partition 16,RACK_LOCAL, 2148 bytes)
16/11/10 22:34:57 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-26-35.ec2.internal:54657 (size: 32.2 KB, free: 511.1 MB)
16/11/10 22:35:03 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 18, ip-172-31-26-35.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pytools/prefork.py", line 46, in call_capture_output
popen = Popen(cmdline, cwd=cwd, stdin=PIPE, stdout=PIPE, stderr=PIPE)
File "/home/ubuntu/anaconda3/lib/python3.5/subprocess.py", line 947, in __init__
restore_signals, start_new_session)
File "/home/ubuntu/anaconda3/lib/python3.5/subprocess.py", line 1551, in _execute_child
raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'nvcc'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/hadoop/yarn/local/usercache/ubuntu/appcache/application_1478814770538_0004/container_e40_1478814770538_0004_01_000009/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "/hadoop/yarn/local/usercache/ubuntu/appcache/application_1478814770538_0004/container_e40_1478814770538_0004_01_000009/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/hdp/2.4.2.0-258/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
File "/usr/hdp/2.4.2.0-258/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
File "/usr/hdp/2.4.2.0-258/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
File "/home/ubuntu/pycuda-euler/src/cli_spark_gpu.py", line 36, in <lambda>
hail_mary = data.mapPartitions(lambda x: ec.assemble2(k, buffer=x, readLength = dataLength,readCount=dataCount)).saveAsTextFile('hdfs://172.31.26.32/genome/sra_output')
File "./eulercuda.zip/eulercuda/eulercuda.py", line 499, in assemble2
lmerLength, evList, eeList, levEdgeList, entEdgeList, readCount)
File "./eulercuda.zip/eulercuda/eulercuda.py", line 238, in constructDebruijnGraph
lmerCount, h_kmerKeys, h_kmerValues, kmerCount, numReads)
File "./eulercuda.zip/eulercuda/eulercuda.py", line 121, in readLmersKmersCuda
d_lmers = enc.encode_lmer_device(buffer, partitionReadCount, d_lmers, readLength, lmerLength)
File "./eulercuda.zip/eulercuda/pyencode.py", line 78, in encode_lmer_device
""")
File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pycuda/compiler.py", line 265, in __init__
arch, code, cache_dir, include_dirs)
File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pycuda/compiler.py", line 255, in compile
return compile_plain(source, options, keep, nvcc, cache_dir, target)
File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pycuda/compiler.py", line 78, in compile_plain
checksum.update(preprocess_source(source, options, nvcc).encode("utf-8"))
File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pycuda/compiler.py", line 50, in preprocess_source
result, stdout, stderr = call_capture_output(cmdline, error_on_nonzero=False)
File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pytools/prefork.py", line 197, in call_capture_output
return forker[0].call_capture_output(cmdline, cwd, error_on_nonzero)
File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pytools/prefork.py", line 54, in call_capture_output
% ( " ".join(cmdline), e))
pytools.prefork.ExecError: error invoking 'nvcc --preprocess -arch sm_30 -I/home/ubuntu/anaconda3/lib/python3.5/site-packages/pycuda/cuda /tmp/tmpkpqwoaxf.cu --compiler-options -P': [Errno 2] No such file or directory: 'nvcc'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
To close the loop on this, I finally worked my way through the problem.
Note: I know this is neither a good nor a permanent answer for most people. In my case, however, I am running POC code for my dissertation, and as soon as I get some final results I'm decommissioning the servers. I doubt this answer will be suitable or appropriate for most users.
I ended up hardcoding the full path to nvcc into compile_plain() in Pycuda's compiler.py file.
Partial listing:
def compile_plain(source, options, keep, nvcc, cache_dir, target="cubin"):
    from os.path import join

    assert target in ["cubin", "ptx", "fatbin"]
    nvcc = '/usr/local/cuda/bin/' + nvcc  # hardcoded directory prepended to the nvcc command
    if cache_dir:
        checksum = _new_md5()
Hopefully this points someone else in the proper direction.
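A less invasive variant of the same idea is to work out the absolute path to nvcc in your own code instead of editing Pycuda. The sketch below is only an illustration: resolve_nvcc and its default_dirs argument are hypothetical names, and /usr/local/cuda/bin is assumed from the environment described in the question.

```python
import os
import shutil


def resolve_nvcc(default_dirs=("/usr/local/cuda/bin",)):
    """Return an absolute path to nvcc, or None if it cannot be found."""
    # 1. Honour whatever PATH the current process actually has.
    found = shutil.which("nvcc")
    if found:
        return found
    # 2. Fall back to CUDA_HOME / CUDA_ROOT, then to the assumed default dirs.
    candidates = [
        os.path.join(os.environ[var], "bin")
        for var in ("CUDA_HOME", "CUDA_ROOT")
        if os.environ.get(var)
    ]
    candidates.extend(default_dirs)
    for directory in candidates:
        path = os.path.join(directory, "nvcc")
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None
```

Pycuda's SourceModule takes an nvcc keyword argument (it is the nvcc value that reaches compile_plain in the traceback), so something like SourceModule(kernel_source, nvcc=resolve_nvcc() or "nvcc") should have the same effect as the patch without touching site-packages. I have not run this on the cluster in question.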
The error means that nvcc is not in PATH for the process that runs the code.
Amazon ECS Container Agent Configuration - Amazon EC2 Container Service has instructions on how to set up environment variables for the cluster.
For the same in Hadoop, there's Configuring Environment of Hadoop Daemons – Hadoop Cluster Setup.
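One likely reason the setting in /etc/bash.bashrc does not help is that this file is normally read only by interactive bash shells, and YARN containers are not started that way. As a rough sketch of fixing PATH from inside the worker process itself (ensure_on_path and with_cuda_path are hypothetical helper names, and /usr/local/cuda/bin is assumed from the question):

```python
import os


def ensure_on_path(directory, environ=os.environ):
    """Prepend `directory` to PATH in `environ` unless it is already listed."""
    entries = [p for p in environ.get("PATH", "").split(os.pathsep) if p]
    if directory not in entries:
        entries.insert(0, directory)
        environ["PATH"] = os.pathsep.join(entries)
    return environ["PATH"]


def with_cuda_path(partition_func, cuda_bin="/usr/local/cuda/bin"):
    """Wrap a mapPartitions function so PATH is fixed inside the worker."""
    def wrapper(iterator):
        ensure_on_path(cuda_bin)  # runs in the executor's Python process
        return partition_func(iterator)
    return wrapper
```

In the job from the question this would be used roughly as data.mapPartitions(with_cuda_path(lambda x: ec.assemble2(...))). Spark also documents a spark.executorEnv.[EnvironmentVariableName] setting for passing environment variables to executors; whether overriding PATH that way behaves well on this HDP/YARN setup is untested here.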
Related
I am new to Python and trying to install Airflow on my Mac by following this tutorial.
While these two commands work fine:
$ airflow initdb
$ airflow webserver -p 8080
The scheduler command (airflow scheduler) throws the following error:
[2020-02-18 13:18:09,012] {scheduler_job.py:1382} ERROR - Exception when executing execute_helper Traceback (most recent call last):
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1380, in _execute
self._execute_helper()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1413, in _execute_helper
self.processor_agent.start()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/dag_processing.py", line 554, in start
self._process.start()
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 283, in _Popen
return Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'SchedulerJob._execute.<locals>.processor_factory'
[2020-02-18 13:18:09,035] {helpers.py:322} INFO - Sending Signals.SIGTERM to GPID None
Traceback (most recent call last): File "/Users/mac/Workspace/airflow/airflow_venv/bin/airflow", line 37, in <module>
args.func(args) File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/cli.py", line 75, in wrapper
return f(*args, **kwargs) File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/bin/cli.py", line 1040, in scheduler
job.run() File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 221, in run
self._execute() File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1384, in _execute
self.processor_agent.end() File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/dag_processing.py", line 707, in end
reap_process_group(self._process.pid, log=self.log) File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/helpers.py", line 324, in reap_process_group
signal_procs(sig) File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/helpers.py", line 293, in signal_procs
os.killpg(pgid, sig)
TypeError: an integer is required (got type NoneType)
EDIT: Python 3.8 is supported now https://github.com/apache/airflow#requirements. So this answer might not be relevant now.
This is due to the Python version you are using. Airflow doesn't support Python 3.8 yet https://github.com/apache/airflow#stable-version-1109.
Downgrade your Python to 3.7 and check.
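For background on why the Python version matters: starting with Python 3.8, multiprocessing on macOS defaults to the "spawn" start method, which pickles the process object, and functions defined inside another function cannot be pickled. The snippet below only illustrates that class of error with the standard library; it is not Airflow's code, and make_local_function and can_pickle are made-up names.

```python
import pickle


def make_local_function():
    # Defined inside another function, like processor_factory in the traceback.
    def processor_factory():
        return "processor"
    return processor_factory


def can_pickle(obj):
    """Return True if the standard pickle module can serialize `obj`."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError, TypeError):
        return False
```

Here can_pickle(len) succeeds, while can_pickle(make_local_function()) fails for the same reason as the "Can't pickle local object" line in the scheduler log.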
Maybe there are some compatibility problems?
Using Python 3.6.10 and airflow v1.10.4, I can get airflow running. Maybe you could try some other versions?
This worked for me!
1- Make sure you are using a Celery version that supports your other packages, such as RabbitMQ (V5 doesn't support AMQP in its usual format); my advice is to use V4.6.X.
2- This has nothing to do with the Python version if you are using Airflow V2.0.
3- Simply reset the metadata database with airflow db reset (the command may differ if you are using an Airflow version below 2.0).
4- Avoid deleting a DAG the way you would delete a file; use the airflow dag ... commands instead. (Deleting the file makes a mess in your environment that you won't like, trust me on this.)
Wish you luck bearing Python stuff.
I have a Spark job which at the end uses saveAsTable to write the dataframe into an internal table with a given name.
The dataframe is built in several steps, one of which uses the "beta" distribution from scipy, imported via from scipy.stats import beta. It's running on Google Cloud with 20 worker nodes, but I get the following error, which complains about the scipy package:
Caused by: org.apache.spark.SparkException:
Job aborted due to stage failure:
Task 14 in stage 7.0 failed 4 times, most recent failure:
Lost task 14.3 in stage 7.0 (TID 518, name-w-3.c.somenames.internal,
executor 23): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 364, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
command = serializer._read_with_length(file)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 172, in
_read_with_length
return self.loads(obj)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 583, in loads
return pickle.loads(obj)
ImportError: No module named scipy.stats._continuous_distns
Any idea or solutions?
I also tried passing the library to the Spark job:
"spark.driver.extraLibraryPath" : "/usr/lib/spark/python/lib/pyspark.zip",
"spark.driver.extraClassPath" :"/usr/lib/spark/python/lib/pyspark.zip"
Is the library installed on all the nodes in the cluster?
You can simply run
pip install --user scipy
I do it on AWS EMR using a bootstrap action. There should be a similar way on Google Cloud as well.
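To check which nodes are actually missing the package, you could run a small probe on the executors. This is a minimal sketch; probe_module is a hypothetical helper, and the Spark usage in the trailing comment assumes an existing SparkContext named sc.

```python
import importlib
import socket


def probe_module(module_name):
    """Return (hostname, importable, detail) for the current Python process."""
    host = socket.gethostname()
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # detail is the import error message
        return (host, False, str(exc))
    # detail is the module's version string when it exposes one
    return (host, True, getattr(module, "__version__", "unknown"))


# Assumed usage on a cluster (not run here):
# results = (sc.parallelize(range(100), 20)
#              .map(lambda _: probe_module("scipy.stats"))
#              .distinct()
#              .collect())
```

Any tuple with False in the second position names a host whose worker Python cannot import the module, which is where the install step still needs to happen.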
I am trying to configure the Airflow webserver and scheduler to run. It is a Python application.
I installed it with "python setup.py install" and then started the processes with the shell command:
(start-stop-daemon --start --quiet --exec airflow webserver)
Everything works OK.
But when I create a daemon script in init.d, I get:
2015-12-09 13:41:29,808 - root - INFO - Filling up the DagBag from /home/pedro/airflow/dags
2015-12-09 13:41:29,810 - root - INFO - Importing /home/pedro/airflow/dags/simple_ecs_dag.py
2015-12-09 13:41:29,830 - root - INFO - Loaded DAG
Running the Gunicorn server with 4 syncworkers on host 0.0.0.0 and port 8080...
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 4, in <module>
__import__('pkg_resources').run_script('airflow==1.6.1', 'airflow')
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 742, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1667, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/airflow-1.6.1-py2.7.egg/EGG-INFO/scripts/airflow", line 17, in <module>
args.func(args)
File "/usr/local/lib/python2.7/dist-packages/airflow-1.6.1-py2.7.egg/airflow/bin/cli.py", line 338, in webserver
'airflow.www.app:cached_app()'])
File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
I suspect that start-stop-daemon and the Python process are running under different users.
Can anyone help me?
I'm trying to use OpsCenter with my local multi-node development cluster created with CCM. I have manually installed and configured the Agents for each node using these instructions. I created my custom keyspace and its column families by uploading a SOURCE file in the CQLSH interface.
I get the following error when clicking on Data > MyKeySpace > MyColumnFamily:
Error loading column family: Call to /test_cluster/keyspaces/flashcardsapp/cf/tag timed out.
I am however able to view the column families in the OpsCenter keyspace.
I am seeing the following in the OpsCenter log:
2015-03-14 07:58:35-0600 [] Unhandled Error
Traceback (most recent call last):
File "/Users/justinrobbins/Documents/dev/cassandra/opscenter-5.1.0/lib/py-osx/2.7/amd64/twisted/internet/defer.py", line 1076, in gotResult
_inlineCallbacks(r, g, deferred)
File "/Users/justinrobbins/Documents/dev/cassandra/opscenter-5.1.0/lib/py-osx/2.7/amd64/twisted/internet/defer.py", line 1063, in _inlineCallbacks
deferred.callback(e.value)
File "/Users/justinrobbins/Documents/dev/cassandra/opscenter-5.1.0/lib/py-osx/2.7/amd64/twisted/internet/defer.py", line 361, in callback
self._startRunCallbacks(result)
File "/Users/justinrobbins/Documents/dev/cassandra/opscenter-5.1.0/lib/py-osx/2.7/amd64/twisted/internet/defer.py", line 455, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/Users/justinrobbins/Documents/dev/cassandra/opscenter-5.1.0/lib/py-osx/2.7/amd64/twisted/internet/defer.py", line 542, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "build/lib/python2.7/site-packages/opscenterd/TwistedRouter.py", line 226, in controllerSucceeded
File "build/lib/python2.7/site-packages/opscenterd/WebServer.py", line 3953, in default_write
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 250, in dumps
sort_keys=sort_keys, **kw).encode(obj)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 270, in iterencode
return _iterencode(o, 0)
File "build/lib/python2.7/site-packages/opscenterd/WebServer.py", line 261, in default
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 184, in default
raise TypeError(repr(o) + " is not JSON serializable")
exceptions.TypeError: UUID('457d5450-ca0b-11e4-a99a-53fff8597215') is not JSON serializable
My environment is as follows:
Cassandra: dsc-cassandra-2.1.2
OpsCenter: opscenter-5.1.0
Agents: datastax-agent-5.1.0
OS: OSX 10.10.1
There’s a known bug in OpsCenter where UUID columns in Cassandra 2.1.x are not handled properly. I am not aware of any workarounds (switching from UUID columns or downgrading C* to 2.0.x should work, but it might be a bit too much work.)
It’s going to be fixed in the upcoming patch release of OpsCenter 5.1 (not 5.1.1 though)
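For anyone wondering what the TypeError in the log means in general: Python's json module has no built-in encoding for UUID objects, so the caller has to supply one. The sketch below illustrates that with the standard library only; it is not OpsCenter's code or its eventual fix, and to_json is a made-up name.

```python
import json
import uuid


def to_json(payload):
    """Serialize `payload`, rendering any UUID as its string form."""
    def default(obj):
        # Called by json.dumps only for objects it cannot encode itself.
        if isinstance(obj, uuid.UUID):
            return str(obj)
        raise TypeError(repr(obj) + " is not JSON serializable")
    return json.dumps(payload, default=default)
```

Calling json.dumps on a dict containing a uuid.UUID without such a default hook raises the same kind of TypeError as in the OpsCenter log.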
I'm running into the exact issue as described here https://issues.apache.org/jira/browse/CASSANDRA-4363 but with Cassandra 1.1.2, cqlsh --cql3.
When I try to create a column family, the error I get is
Traceback (most recent call last):
File "./cqlsh", line 1008, in perform_statement
self.cursor.execute(statement, decoder=decoder)
File "./../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cursor.py", line 117, in execute
response = self.handle_cql_execution_errors(doquery, prepared_q, compress)
File "./../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cursor.py", line 132, in handle_cql_execution_errors
return executor(*args, **kwargs)
File "./../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cassandra/Cassandra.py", line 1583, in execute_cql_query
self.send_execute_cql_query(query, compression)
File "./../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cassandra/Cassandra.py", line 1593, in send_execute_cql_query
self._oprot.trans.flush()
File "./../lib/thrift-python-internal-only-0.7.0.zip/thrift/transport/TTransport.py", line 293, in flush
self.__trans.write(buf)
File "./../lib/thrift-python-internal-only-0.7.0.zip/thrift/transport/TSocket.py", line 117, in write
plus = self.handle.send(buff)
error: [Errno 32] Broken pipe
and sometimes I simply get TSocket read 0 bytes.
The server side log is also the same as mentioned in the JIRA ticket.
ERROR [Thrift:12] 2012-09-05 12:06:10,999 CustomTThreadPoolServer.java (line 204) Error occurred during processing of message.
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.NullPointerException
at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:373)
at org.apache.cassandra.service.MigrationManager.announce(MigrationManager.java:188)
at org.apache.cassandra.service.MigrationManager.announceNewColumnFamily(MigrationManager.java:139)
at org.apache.cassandra.cql3.statements.CreateColumnFamilyStatement.announceMigration(CreateColumnFamilyStatement.java:83)
at org.apache.cassandra.cql3.statements.SchemaAlteringStatement.execute(SchemaAlteringStatement.java:99)
at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:108)
at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:121)
at org.apache.cassandra.thrift.CassandraServer.execute_cql_query(CassandraServer.java:1237)
at org.apache.cassandra.thrift.Cassandra$Processor$execute_cql_query.getResult(Cassandra.java:3542)
at org.apache.cassandra.thrift.Cassandra$Processor$execute_cql_query.getResult(Cassandra.java:3530)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:32)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34)
at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:186)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:369)
... 15 more
Caused by: java.lang.NullPointerException
at org.apache.cassandra.utils.ByteBufferUtil.string(ByteBufferUtil.java:167)
I tried deleting /var/lib/cassandra directory and restarting the server. I still get the error.
This bug in JIRA has been marked as Cannot Reproduce, Fixed version None.
So what do I do to get my Cassandra working again?
https://issues.apache.org/jira/browse/CASSANDRA-4526 suggests that this was fixed in 1.1.3. I'd upgrade (to 1.1.4, the most recent release).