PySpark - The system cannot find the path specified - apache-spark

Hy,
I have been run Spark multiple times (Spyder IDE).
Today I got this error (the code it's the same)
from py4j.java_gateway import JavaGateway
gateway = JavaGateway()
os.environ['SPARK_HOME']="C:/Apache/spark-1.6.0"
os.environ['JAVA_HOME']="C:/Program Files/Java/jre1.8.0_71"
sys.path.append("C:/Apache/spark-1.6.0/python/")
os.environ['HADOOP_HOME']="C:/Apache/spark-1.6.0/winutils/"
from pyspark import SparkContext
from pyspark import SparkConf
conf = SparkConf()
The system cannot find the path specified.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Apache\spark-1.6.0\python\pyspark\conf.py", line 104, in __init__
SparkContext._ensure_initialized()
File "C:\Apache\spark-1.6.0\python\pyspark\context.py", line 245, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "C:\Apache\spark-1.6.0\python\pyspark\java_gateway.py", line 94, in launch_gateway
raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
What's go wrong?
thanks for your time.

Ok... Someone install a new java version in VirtualMachine. I'm only change this
os.environ['JAVA_HOME']="C:/Program Files/Java/jre1.8.0_91"
and works again.
thks for your time.

Related

Unable to run index migration in OpenSearch

I have a docker compose running where a django backend, opensearch & opensearch dashboard are running. I have connected the backend to talk to opensearch and I'm able to query it successfully. I'm trying to create indexes using this command inside the docker container.
./manage.py opensearch --rebuild
Reference: https://django-opensearch-dsl.readthedocs.io/en/latest/getting_started/#create-and-populate-opensearchs-indices
I get the following error when I run the above command
root#ed186e462ca3:/app# ./manage.py opensearch --rebuild
/usr/local/lib/python3.6/site-packages/OpenSSL/crypto.py:8: CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team. Therefore, support for it is deprecated in cryptography and will be removed in a future release.
from cryptography import utils, x509
Traceback (most recent call last):
File "./manage.py", line 10, in <module>
execute_from_command_line(sys.argv)
File "/usr/local/lib/python3.6/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
utility.execute()
File "/usr/local/lib/python3.6/site-packages/django/core/management/__init__.py", line 375, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/usr/local/lib/python3.6/site-packages/django/core/management/__init__.py", line 224, in fetch_command
klass = load_command_class(app_name, subcommand)
File "/usr/local/lib/python3.6/site-packages/django/core/management/__init__.py", line 37, in load_command_class
return module.Command()
File "/usr/local/lib/python3.6/site-packages/django_opensearch_dsl/management/commands/opensearch.py", line 32, in __init__
if settings.TESTING: # pragma: no cover
File "/usr/local/lib/python3.6/site-packages/django/conf/__init__.py", line 80, in __getattr__
val = getattr(self._wrapped, name)
AttributeError: 'Settings' object has no attribute 'TESTING'
Sentry is attempting to send 1 pending error messages
Waiting up to 2 seconds
Press Ctrl-C to quit
I'm not sure where I'm going wrong. Any help would be greatful.
TIA
For future reference, this was indeed a bug in django-opensearch-dsl which was fix in the 0.3.0 release.

MODULE MASKED IN PYSPARK . NOT ACCESSIBLE

I am a novice to Pyspark. I was trying to run a pyspark code. I ran a code with name "time.py" because of which the pyspark is not able to run now. I get the below error.
Traceback (most recent call last):
File "/home/VAL_CODE/test.py", line 1, in <module>
from pyspark import SparkContext,HiveContext,SparkConf
File "/opt/cloudera/parcels/CDH-6.3.3-1.cdh6.3.3.p0.1796617/lib/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 51, in <module>
File "/opt/cloudera/parcels/CDH-6.3.3-1.cdh6.3.3.p0.1796617/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 24, in <module>
File "/usr/lib64/python2.7/threading.py", line 14, in <module>
from time import time as _time, sleep as _sleep
File "/home/VAL_CODE/time.py", line 1, in <module>
ImportError: cannot import name SparkContext
20/08/13 19:04:16 INFO util.ShutdownHookManager: Shutdown hook called
20/08/13 19:04:16 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-f104da1f-ba70-4c45-8a19-6ffc55b609aa
The error is that the script "/usr/lib64/python2.7/threading.py" is referring to the local script "/home/VAL_CODE/time.py" which I created. Now, I have deleted the "/home/VAL_CODE/time.py" script. But still face the issue when I run a new code "/home/VAL_CODE/test.py". Please help to resolve.

Unable to Start Scheduler

I am new to Python and trying to install Airflow in my Mac, by following this tutorial
While these two commands work fine:
$ airflow initdb
$ airflow webserver -p 8080
The scheduler command (airflow scheduler) throws the following error:
[2020-02-18 13:18:09,012] {scheduler_job.py:1382} ERROR - Exception when executing execute_helper Traceback (most recent call last):
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1380, in _execute
self._execute_helper()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1413, in _execute_helper
self.processor_agent.start()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/dag_processing.py", line 554, in start
self._process.start()
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 283, in _Popen
return Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'SchedulerJob._execute.<locals>.processor_factory'
[2020-02-18 13:18:09,035] {helpers.py:322} INFO - Sending Signals.SIGTERM to GPID None
Traceback (most recent call last): File "/Users/mac/Workspace/airflow/airflow_venv/bin/airflow", line 37, in <module>
args.func(args) File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/cli.py", line 75, in wrapper
return f(*args, **kwargs) File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/bin/cli.py", line 1040, in scheduler
job.run() File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 221, in run
self._execute() File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1384, in _execute
self.processor_agent.end() File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/dag_processing.py", line 707, in end
reap_process_group(self._process.pid, log=self.log) File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/helpers.py", line 324, in reap_process_group
signal_procs(sig) File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/helpers.py", line 293, in signal_procs
os.killpg(pgid, sig)
TypeError: an integer is required (got type NoneType)
EDIT: Python 3.8 is supported now https://github.com/apache/airflow#requirements. So this answer might not be relevant now.
This due to the Python version you are using. Airflow doesn't support Python 3.8 yet https://github.com/apache/airflow#stable-version-1109.
Downgrade your Python to 3.7 and check.
Maybe there are some compatibility problems?
Using Python 3.6.10 and airflow v1.10.4, I can get airflow running. Maybe you could try some other versions?
This worked for me!
1- Make sure you are using the correct celery version that supports your other packages like RabbitMQ ( as V5 doesn't support AMQP in its usual format), my advice is to use V4.6.X
2-THIS HAS NOTHING TO DO WITH PYTHON VERSION IF YOU ARE USING AIRFLOW V2.0
3- simply make yourself happy with airflow db reset (command may differ if you are using airflow Version X<2.0 )
4- Avoid deleting any dag like you delete a file and use airflow dag ... commands to do so. (it makes up a mess in your environment that you wont like, trust me on this..)
Wish you luck bearing python stuff..

Airflow scheduler starts up with exception when parallelism is set to a large number

I am new to Airflow and I am trying to use airflow to build a data pipeline, but it keeps getting some exceptions. My airflow.cfg look like this:
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow#localhost/airflow
sql_alchemy_pool_size = 5
parallelism = 96
dag_concurrency = 96
worker_concurrency = 96
max_threads = 96
broker_url = postgresql+psycopg2://airflow:airflow#localhost/airflow
result_backend = postgresql+psycopg2://airflow:airflow#localhost/airflow
When I started up airflow webserver -p 8080 in one terminal and then airflow scheduler in another terminal, the scheduler run will have the following execption(It failed when I set the parallelism number greater some amount, it works fine otherwise, this may be computer-specific but at least we know that it is resulted by the parallelism). I have tried run 1000 python processes on my computer and it worked fine, I have configured Postgres to allow maximum 500 database connections but it is still giving me the errors.
[2019-11-20 12:15:00,820] {dag_processing.py:556} INFO - Launched DagFileProcessorManager with pid: 85050
Process QueuedLocalWorker-18:
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/managers.py", line 811, in _callmethod
conn = self._tls.connection
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/Users/edward/.local/share/virtualenvs/avat-utils-JpGzQGRW/lib/python3.7/site-packages/airflow/executors/local_executor.py", line 111, in run
key, command = self.task_queue.get()
File "<string>", line 2, in get
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/managers.py", line 815, in _callmethod
self._connect()
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/managers.py", line 802, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 492, in Client
c = SocketClient(address)
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 619, in SocketClient
s.connect(address)
ConnectionRefusedError: [Errno 61] Connection refused
Thanks
Updated: I tried run in Pycharm, and it worked fine in Pycharm but sometimes failed in the terminal and sometimes it's not
I had the same issue. Turns out I had set max_threads=10 in airflow.cfg in combination with LocalExecutor. Switching max_threads=2 solved the issue.
Found out few days ago, Airflow actually starts up all the parallel process when starting up, I was thinking max_sth and parallelism as the capacity but it is the number of processes it will run when start up. So it looks like this issue is caused by the insufficient resources of the computer.

memsql-ops start error in thread-7

After installing and starting memsql-ops, it shows the following error:
# ./memsql-ops start
Starting MemSQL Ops...
Exception in thread Thread-7:
Traceback (most recent call last):
File "/usr/local/updated-openssl/lib/python3.4/threading.py", line 921, in _bootstrap_inner
File "/usr/local/updated-openssl/lib/python3.4/threading.py", line 869, in run
File "/memsql_platform/memsql_platform/agent/daemon/manage.py", line 200, in startup_watcher
File "/memsql_platform/memsql_platform/network/api_client.py", line 34, in call
File "/usr/local/updated-openssl/lib/python3.4/site-packages/simplejson/__init__.py", line 516, in loads
File "/usr/local/updated-openssl/lib/python3.4/site-packages/simplejson/decoder.py", line 370, in decode
File "/usr/local/updated-openssl/lib/python3.4/site-packages/simplejson/decoder.py", line 400, in raw_decode
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Does anyone know the issue?
OS : CentOS 6.7
Memsql : 5.1.0 Enterprise Trial
It is likely that you have another server running on the port Ops is trying to start with (default 9000) that is returning data which is not JSON decodable. The solution is to either start MemSQL Ops on a different port, or kill the server running at that port.
We will fix this bug in an upcoming release of Ops! Thanks for pointing it out.

Resources