Using Pycuda with PySpark - nvcc not found - apache-spark

My environment:
I'm using Hortonworks HDP 2.4 with Spark 1.6.1 on a small AWS EC2 cluster of 4 g2.2xlarge instances with Ubuntu 14.04. Each instance has CUDA 7.5, Anaconda Python 3.5, and Pycuda 2016.1.1.
In /etc/bash.bashrc I've set:
CUDA_HOME=/usr/local/cuda
CUDA_ROOT=/usr/local/cuda
PATH=$PATH:/usr/local/cuda/bin
On all 4 machines I can access nvcc from the command line for the ubuntu user, the root user, and the yarn user.
My problem:
I have a Python-Pycuda project I've adapted to run on Spark. It runs great on my local Spark installation on my Mac, but when I run it on AWS I get:
FileNotFoundError: [Errno 2] No such file or directory: 'nvcc'
Since it runs on my Mac in local mode, my guess is that this is a configuration issue with CUDA/Pycuda in the worker processes, but I'm really stumped as to what it could be.
Any ideas?
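One way to test that guess is to look at what the executor Python processes actually see. A minimal diagnostic sketch, assuming an existing SparkContext named sc (the variable name and the partition count are assumptions):

def probe(_):
    # Runs inside an executor's Python worker, not on the driver.
    import os
    from shutil import which
    return [(os.environ.get("PATH", ""), which("nvcc"))]

# Spread a few trivial tasks across the cluster and collapse duplicate answers.
print(sc.parallelize(range(16), 16).mapPartitions(probe).distinct().collect())

If which("nvcc") comes back as None on the workers even though nvcc resolves in an interactive shell, the executors are not inheriting the PATH set in /etc/bash.bashrc.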
Edit: Below is a stack trace from one of the jobs failing:
16/11/10 22:34:54 INFO ExecutorAllocationManager: Requesting 13 new executors because tasks are backlogged (new desired total will be 17)
16/11/10 22:34:57 INFO TaskSetManager: Starting task 16.0 in stage 2.0 (TID 34, ip-172-31-26-35.ec2.internal, partition 16,RACK_LOCAL, 2148 bytes)
16/11/10 22:34:57 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-26-35.ec2.internal:54657 (size: 32.2 KB, free: 511.1 MB)
16/11/10 22:35:03 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 18, ip-172-31-26-35.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pytools/prefork.py", line 46, in call_capture_output
popen = Popen(cmdline, cwd=cwd, stdin=PIPE, stdout=PIPE, stderr=PIPE)
File "/home/ubuntu/anaconda3/lib/python3.5/subprocess.py", line 947, in __init__
restore_signals, start_new_session)
File "/home/ubuntu/anaconda3/lib/python3.5/subprocess.py", line 1551, in _execute_child
raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'nvcc'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/hadoop/yarn/local/usercache/ubuntu/appcache/application_1478814770538_0004/container_e40_1478814770538_0004_01_000009/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "/hadoop/yarn/local/usercache/ubuntu/appcache/application_1478814770538_0004/container_e40_1478814770538_0004_01_000009/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/hdp/2.4.2.0-258/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
File "/usr/hdp/2.4.2.0-258/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
File "/usr/hdp/2.4.2.0-258/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
File "/home/ubuntu/pycuda-euler/src/cli_spark_gpu.py", line 36, in <lambda>
hail_mary = data.mapPartitions(lambda x: ec.assemble2(k, buffer=x, readLength = dataLength,readCount=dataCount)).saveAsTextFile('hdfs://172.31.26.32/genome/sra_output')
File "./eulercuda.zip/eulercuda/eulercuda.py", line 499, in assemble2
lmerLength, evList, eeList, levEdgeList, entEdgeList, readCount)
File "./eulercuda.zip/eulercuda/eulercuda.py", line 238, in constructDebruijnGraph
lmerCount, h_kmerKeys, h_kmerValues, kmerCount, numReads)
File "./eulercuda.zip/eulercuda/eulercuda.py", line 121, in readLmersKmersCuda
d_lmers = enc.encode_lmer_device(buffer, partitionReadCount, d_lmers, readLength, lmerLength)
File "./eulercuda.zip/eulercuda/pyencode.py", line 78, in encode_lmer_device
""")
File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pycuda/compiler.py", line 265, in __init__
arch, code, cache_dir, include_dirs)
File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pycuda/compiler.py", line 255, in compile
return compile_plain(source, options, keep, nvcc, cache_dir, target)
File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pycuda/compiler.py", line 78, in compile_plain
checksum.update(preprocess_source(source, options, nvcc).encode("utf-8"))
File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pycuda/compiler.py", line 50, in preprocess_source
result, stdout, stderr = call_capture_output(cmdline, error_on_nonzero=False)
File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pytools/prefork.py", line 197, in call_capture_output
return forker[0].call_capture_output(cmdline, cwd, error_on_nonzero)
File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pytools/prefork.py", line 54, in call_capture_output
% ( " ".join(cmdline), e))
pytools.prefork.ExecError: error invoking 'nvcc --preprocess -arch sm_30 -I/home/ubuntu/anaconda3/lib/python3.5/site-packages/pycuda/cuda /tmp/tmpkpqwoaxf.cu --compiler-options -P': [Errno 2] No such file or directory: 'nvcc'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

To close the loop on this, I finally worked my way through the problem.
Note: I know this is not really a good or permanent answer for most people; however, in my case I am running proof-of-concept code for my dissertation, and as soon as I get some final results I'm decommissioning the servers. I doubt this answer will be suitable or appropriate for most users.
I ended up hardcoding the full path to nvcc into compile_plain() in Pycuda's compiler.py file.
Partial listing:
def compile_plain(source, options, keep, nvcc, cache_dir, target="cubin"):
    from os.path import join

    assert target in ["cubin", "ptx", "fatbin"]

    nvcc = '/usr/local/cuda/bin/' + nvcc

    if cache_dir:
        checksum = _new_md5()
Hopefully this points someone else in the proper direction.
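A lighter-weight variant that avoids patching Pycuda would be to repair PATH inside the worker code itself, before any kernel gets compiled. This is only a sketch and assumes CUDA is installed at /usr/local/cuda on every worker; I have not verified it in this exact setup:

import os

def ensure_nvcc_on_path():
    # Assumption: every worker node has CUDA at /usr/local/cuda.
    cuda_bin = "/usr/local/cuda/bin"
    path = os.environ.get("PATH", "")
    if cuda_bin not in path.split(os.pathsep):
        os.environ["PATH"] = path + os.pathsep + cuda_bin
    os.environ.setdefault("CUDA_ROOT", "/usr/local/cuda")

Calling this at the top of the function passed to mapPartitions should be enough, since Pycuda invokes nvcc through a plain subprocess call that uses the executor process's PATH.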

The error means that nvcc is not in PATH for the process that runs the code.
Amazon ECS Container Agent Configuration - Amazon EC2 Container Service has instructions on how to set up environment variables for the cluster.
For the same in Hadoop, there's Configuring Environment of Hadoop Daemons – Hadoop Cluster Setup.
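For Spark specifically, the executor environment can also be set per application through spark.executorEnv.* (and, on YARN, spark.yarn.appMasterEnv.* for the application master). A sketch, with the CUDA paths from the question filled in as assumptions:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("pycuda-job")
        # Each spark.executorEnv.<VAR> entry becomes an environment variable
        # named <VAR> in the executor processes.
        .set("spark.executorEnv.PATH",
             "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/cuda/bin")
        .set("spark.executorEnv.CUDA_HOME", "/usr/local/cuda")
        .set("spark.executorEnv.CUDA_ROOT", "/usr/local/cuda"))
sc = SparkContext(conf=conf)

The same properties can be passed with --conf to spark-submit instead of being set in code.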

Related

Unable to Start Scheduler

I am new to Python and am trying to install Airflow on my Mac by following this tutorial.
While these two commands work fine:
$ airflow initdb
$ airflow webserver -p 8080
The scheduler command (airflow scheduler) throws the following error:
[2020-02-18 13:18:09,012] {scheduler_job.py:1382} ERROR - Exception when executing execute_helper
Traceback (most recent call last):
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1380, in _execute
self._execute_helper()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1413, in _execute_helper
self.processor_agent.start()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/dag_processing.py", line 554, in start
self._process.start()
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 283, in _Popen
return Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'SchedulerJob._execute.<locals>.processor_factory'
[2020-02-18 13:18:09,035] {helpers.py:322} INFO - Sending Signals.SIGTERM to GPID None
Traceback (most recent call last):
File "/Users/mac/Workspace/airflow/airflow_venv/bin/airflow", line 37, in <module>
args.func(args)
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/cli.py", line 75, in wrapper
return f(*args, **kwargs)
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/bin/cli.py", line 1040, in scheduler
job.run()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 221, in run
self._execute()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1384, in _execute
self.processor_agent.end()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/dag_processing.py", line 707, in end
reap_process_group(self._process.pid, log=self.log)
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/helpers.py", line 324, in reap_process_group
signal_procs(sig)
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/helpers.py", line 293, in signal_procs
os.killpg(pgid, sig)
TypeError: an integer is required (got type NoneType)
EDIT: Python 3.8 is supported now (https://github.com/apache/airflow#requirements), so this answer might no longer be relevant.
This is due to the Python version you are using. Airflow doesn't support Python 3.8 yet (https://github.com/apache/airflow#stable-version-1109).
Downgrade your Python to 3.7 and check.
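For context, the failure above is a pickling problem: Python 3.8 changed the default multiprocessing start method on macOS from fork to spawn, spawn has to pickle the process target, and locally defined functions like processor_factory cannot be pickled. A minimal illustration, with nothing Airflow-specific in it:

import pickle

def outer():
    def processor_factory():  # a local object, like the one named in the traceback
        pass
    return processor_factory

try:
    pickle.dumps(outer())
except (AttributeError, pickle.PicklingError) as err:
    print(err)  # Can't pickle local object 'outer.<locals>.processor_factory'

Downgrading to Python 3.7 avoids this because the older default (fork) never needs to pickle the target.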
Maybe there are some compatibility problems?
Using Python 3.6.10 and airflow v1.10.4, I can get airflow running. Maybe you could try some other versions?
This worked for me!
1- Make sure you are using a Celery version that supports your other packages, such as RabbitMQ (Celery v5 doesn't support AMQP in its usual format); my advice is to use v4.6.x.
2- This has nothing to do with the Python version if you are using Airflow v2.0.
3- Simply run airflow db reset (the command may differ if you are using an Airflow version below 2.0).
4- Avoid deleting a DAG the way you would delete a file; use the airflow dag ... commands to do so. (It makes a mess in your environment that you won't like, trust me on this.)
Wish you luck with the Python stuff.

ImportError: No module named scipy.stats._continuous_distns

I have a Spark job which at the end uses saveAsTable to write the dataframe into an internal table with a given name.
The dataframe is created in several steps, one of which uses the "beta" method from scipy (imported via from scipy.stats import beta). The job runs on Google Cloud with 20 worker nodes, but I get the following error complaining about the scipy package:
Caused by: org.apache.spark.SparkException:
Job aborted due to stage failure:
Task 14 in stage 7.0 failed 4 times, most recent failure:
Lost task 14.3 in stage 7.0 (TID 518, name-w-3.c.somenames.internal,
executor 23): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 364, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
command = serializer._read_with_length(file)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 172, in
_read_with_length
return self.loads(obj)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 583, in loads
return pickle.loads(obj)
ImportError: No module named scipy.stats._continuous_distns
Any ideas or solutions?
I also tried passing the library to the Spark job:
"spark.driver.extraLibraryPath" : "/usr/lib/spark/python/lib/pyspark.zip",
"spark.driver.extraClassPath" :"/usr/lib/spark/python/lib/pyspark.zip"
Is the library installed on all the nodes in the cluster?
You can simply do a
pip install --user scipy
I do it in AWS EMR using a bootstrap action; there should be a similar way on Google Cloud as well.
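Before changing anything, it can be worth confirming which Python interpreter the executors run and whether scipy is importable there; a small diagnostic sketch, assuming an existing SparkContext sc:

def check_scipy(_):
    import socket, sys
    try:
        import scipy
        ver = scipy.__version__
    except ImportError as err:
        ver = "missing: %s" % err
    return [(socket.gethostname(), sys.executable, ver)]

# A handful of tasks so several worker nodes get touched.
print(sorted(set(sc.parallelize(range(40), 40).mapPartitions(check_scipy).collect())))

Nodes that report "missing", or a different interpreter than the driver, are the ones where pip install --user scipy (or a startup/bootstrap script) needs to run.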

Process as Daemon on Init.d

I am trying to configure the airflow webserver and scheduler to run. It is a Python application.
I used "python setup.py install" and then, using the shell command:
(start-stop-daemon --start --quiet --exec airflow webserver) started the processes.
Everything works ok.
But when I create a daemon script in init.d, I get:
2015-12-09 13:41:29,808 - root - INFO - Filling up the DagBag from /home/pedro/airflow/dags
2015-12-09 13:41:29,810 - root - INFO - Importing /home/pedro/airflow/dags/simple_ecs_dag.py
2015-12-09 13:41:29,830 - root - INFO - Loaded DAG
Running the Gunicorn server with 4 syncworkers on host 0.0.0.0 and port 8080...
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 4, in
import('pkg_resources').run_script('airflow==1.6.1', 'airflow')
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/init.py", line 742, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/init.py", line 1667, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/airflow-1.6.1-py2.7.egg/EGG-INFO/scripts/airflow", line 17, in
args.func(args)
File "/usr/local/lib/python2.7/dist-packages/airflow-1.6.1-py2.7.egg/airflow/bin/cli.py", line 338, in webserver
'airflow.www.app:cached_app()'])
File "/usr/lib/python2.7/subprocess.py", line 710, in init
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
I imagine that start-stop-daemon or the Python process is running under a different user.
Can anyone help me?
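One way to check that hypothesis is to have the init.d-launched process record which user and PATH it actually runs with, and compare that against an interactive shell. A hypothetical diagnostic (the log path is made up), runnable with the same Python 2.7 interpreter just before the airflow command is invoked:

import getpass
import os

# Record the effective user and the PATH visible to this process; if the
# gunicorn executable is not on this PATH, the webserver's subprocess call
# fails with "[Errno 2] No such file or directory".
with open("/tmp/airflow_env_check.log", "w") as f:
    f.write("user=%s\n" % getpass.getuser())
    f.write("PATH=%s\n" % os.environ.get("PATH", ""))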

Why "Error loading column family" in OpsCenter when reading Column Family?

I'm trying to use OpsCenter with my local multi-node development cluster created with CCM. I have manually installed and configured the agents for each node using these instructions. I created my custom keyspace and its column families by uploading a SOURCE file in the cqlsh interface.
I get the following error when clicking on Data > MyKeySpace > MyColumnFamily:
Error loading column family: Call to /test_cluster/keyspaces/flashcardsapp/cf/tag timed out.
I am however able to view the column families in the OpsCenter keyspace.
I am seeing the following in the OpsCenter log:
2015-03-14 07:58:35-0600 [] Unhandled Error
Traceback (most recent call last):
File "/Users/justinrobbins/Documents/dev/cassandra/opscenter-5.1.0/lib/py-osx/2.7/amd64/twisted/internet/defer.py", line 1076, in gotResult
_inlineCallbacks(r, g, deferred)
File "/Users/justinrobbins/Documents/dev/cassandra/opscenter-5.1.0/lib/py-osx/2.7/amd64/twisted/internet/defer.py", line 1063, in _inlineCallbacks
deferred.callback(e.value)
File "/Users/justinrobbins/Documents/dev/cassandra/opscenter-5.1.0/lib/py-osx/2.7/amd64/twisted/internet/defer.py", line 361, in callback
self._startRunCallbacks(result)
File "/Users/justinrobbins/Documents/dev/cassandra/opscenter-5.1.0/lib/py-osx/2.7/amd64/twisted/internet/defer.py", line 455, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/Users/justinrobbins/Documents/dev/cassandra/opscenter-5.1.0/lib/py-osx/2.7/amd64/twisted/internet/defer.py", line 542, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "build/lib/python2.7/site-packages/opscenterd/TwistedRouter.py", line 226, in controllerSucceeded
File "build/lib/python2.7/site-packages/opscenterd/WebServer.py", line 3953, in default_write
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 250, in dumps
sort_keys=sort_keys, **kw).encode(obj)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 270, in iterencode
return _iterencode(o, 0)
File "build/lib/python2.7/site-packages/opscenterd/WebServer.py", line 261, in default
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 184, in default
raise TypeError(repr(o) + " is not JSON serializable")
exceptions.TypeError: UUID('457d5450-ca0b-11e4-a99a-53fff8597215') is not JSON serializable
My environment is as follows:
Cassandra: dsc-cassandra-2.1.2
OpsCenter: opscenter-5.1.0
Agents: datastax-agent-5.1.0
OS: OSX 10.10.1
There's a known bug in OpsCenter where UUID columns in Cassandra 2.1.x are not handled properly. I am not aware of any workarounds (switching away from UUID columns or downgrading C* to 2.0.x should work, but it might be a bit too much work).
It's going to be fixed in the upcoming patch release of OpsCenter 5.1 (not 5.1.1, though).
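For reference, the exception at the bottom of that log is just the standard-library json module refusing to serialize a UUID value; a tiny illustration (not OpsCenter code):

import json
import uuid

try:
    json.dumps({"cf_id": uuid.uuid4()})
except TypeError as err:
    # e.g. "UUID('...') is not JSON serializable" on Python 2.7
    print(err)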

Cassandra - NullPointerException on new ColumnFamily creation

I'm running into the exact issue described here (https://issues.apache.org/jira/browse/CASSANDRA-4363), but with Cassandra 1.1.2 and cqlsh --cql3.
When I try to create a column family, the error I get is:
Traceback (most recent call last):
File "./cqlsh", line 1008, in perform_statement
self.cursor.execute(statement, decoder=decoder)
File "./../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cursor.py", line 117, in execute
response = self.handle_cql_execution_errors(doquery, prepared_q, compress)
File "./../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cursor.py", line 132, in handle_cql_execution_errors
return executor(*args, **kwargs)
File "./../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cassandra/Cassandra.py", line 1583, in execute_cql_query
self.send_execute_cql_query(query, compression)
File "./../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cassandra/Cassandra.py", line 1593, in send_execute_cql_query
self._oprot.trans.flush()
File "./../lib/thrift-python-internal-only-0.7.0.zip/thrift/transport/TTransport.py", line 293, in flush
self.__trans.write(buf)
File "./../lib/thrift-python-internal-only-0.7.0.zip/thrift/transport/TSocket.py", line 117, in write
plus = self.handle.send(buff)
error: [Errno 32] Broken pipe
and sometimes I simply get TSocket read 0 bytes.
The server side log is also the same as mentioned in the JIRA ticket.
ERROR [Thrift:12] 2012-09-05 12:06:10,999 CustomTThreadPoolServer.java (line 204) Error occurred during processing of message.
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.NullPointerException
at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:373)
at org.apache.cassandra.service.MigrationManager.announce(MigrationManager.java:188)
at org.apache.cassandra.service.MigrationManager.announceNewColumnFamily(MigrationManager.java:139)
at org.apache.cassandra.cql3.statements.CreateColumnFamilyStatement.announceMigration(CreateColumnFamilyStatement.java:83)
at org.apache.cassandra.cql3.statements.SchemaAlteringStatement.execute(SchemaAlteringStatement.java:99)
at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:108)
at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:121)
at org.apache.cassandra.thrift.CassandraServer.execute_cql_query(CassandraServer.java:1237)
at org.apache.cassandra.thrift.Cassandra$Processor$execute_cql_query.getResult(Cassandra.java:3542)
at org.apache.cassandra.thrift.Cassandra$Processor$execute_cql_query.getResult(Cassandra.java:3530)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:32)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34)
at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:186)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:369)
... 15 more
Caused by: java.lang.NullPointerException
at org.apache.cassandra.utils.ByteBufferUtil.string(ByteBufferUtil.java:167)
I tried deleting the /var/lib/cassandra directory and restarting the server, but I still get the error.
The bug in JIRA has been marked as Cannot Reproduce, with Fixed Version set to None.
So what do I do to get my Cassandra working again?
https://issues.apache.org/jira/browse/CASSANDRA-4526 suggests that this was fixed in 1.1.3. I'd upgrade (to 1.1.4, the most recent release).
