apache spark "Py4JError: Answer from Java side is empty" - apache-spark

I get this error every time...
I use Sparkling Water...
My conf file:
spark.driver.memory 65g
spark.python.worker.memory 65g
spark.master local[*]
The amount of data is about 5 GB.
There is no other information about this error...
Does anybody know why it happens? Thank you!
***"ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent call last):
File "/data/analytics/Spark1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 746, in send_command
raise Py4JError("Answer from Java side is empty")
Py4JError: Answer from Java side is empty
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
File "/data/analytics/Spark1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start
self.socket.connect((self.address, self.port))
File "/usr/local/anaconda/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
File "/data/analytics/Spark1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start
self.socket.connect((self.address, self.port))
File "/usr/local/anaconda/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
File "/data/analytics/Spark1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start
self.socket.connect((self.address, self.port))
File "/usr/local/anaconda/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused

Have you tried setting spark.executor.memory and spark.driver.memory in your Spark configuration file?
See https://stackoverflow.com/a/22742982/5453184 for more info.
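For example, a hedged sketch of what those lines might look like in spark-defaults.conf (the 8g values are placeholders to tune for your machine, not recommendations):
spark.driver.memory 8g
spark.executor.memory 8g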

Usually, you'll see this error when the Java process gets silently killed by the OOM Killer.
The OOM Killer (Out of Memory Killer) is a Linux kernel mechanism that kicks in when the system becomes critically low on memory. It selects a process based on its "badness" score and kills it to reclaim memory.
Read more on the OOM Killer here.
Increasing spark.executor.memory and/or spark.driver.memory values will only make things worse in this case, i.e. you may want to do the opposite!
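To confirm that the OOM Killer really is what is terminating the Java process, you can check the kernel log. A minimal sketch (the exact log wording varies by kernel and distro):
$ dmesg | grep -i -E "killed process|out of memory"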
Other options would be to:
increase the number of partitions if you're working with very big data sources (see the PySpark sketch below);
increase the number of worker nodes;
add more physical memory to worker/driver nodes.
Or, if you're running your driver/workers using Docker:
increase the Docker memory limit;
set --oom-kill-disable on your containers, but make sure you understand the possible consequences!
Read more on --oom-kill-disable and other Docker memory settings here.
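As an illustration of the partitioning suggestion above, here is a minimal, hypothetical PySpark sketch (the input path and partition counts are placeholders, not tuned values):
from pyspark import SparkContext

sc = SparkContext("local[*]", "repartition-example")
# Ask for more partitions up front so each task holds a smaller slice of the data in memory
rdd = sc.textFile("/path/to/big_input.txt", minPartitions=200)
# Or split an existing RDD further before a memory-heavy stage
rdd = rdd.repartition(400)
print(rdd.getNumPartitions())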

Another point to note if you are on WSL 2 using PySpark: make sure your WSL 2 config file allows enough memory.
# Settings apply across all Linux distros running on WSL 2
[wsl2]
# Limits how much memory the VM may use; accepts whole numbers with GB or MB
memory=12GB # This was originally set to 3GB, which made my jobs fail since spark.executor.memory and spark.driver.memory could never exceed 3GB regardless of how high I set them.
# Sets the VM to use eight virtual processors
processors=8
For reference, your .wslconfig file should be located in C:\Users\USERNAME.
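Note that, as far as I know, WSL has to be restarted for .wslconfig changes to take effect, e.g. from PowerShell:
wsl --shutdown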

Related

Python script on ubuntu - OSError: [Errno 12] Cannot allocate memory

I am running a script on an AWS (Ubuntu) EC2 instance. It's a web scraper that uses selenium/chromedriver and headless Chrome to scrape some webpages. I've had this script running previously with no problems, but today I'm getting an error. Here's the script:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--no-sandbox')
options.add_argument('--window-size=1420,1080')
options.add_argument('--headless')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-gpu')
options.add_argument("--disable-notifications")
options.binary_location = '/usr/bin/chromium-browser'
driver = webdriver.Chrome(chrome_options=options)

# Set base url (SAN FRANCISCO)
base_url = 'https://www.bandsintown.com/en/c/san-francisco-ca?page='
events = []
for i in range(1, 90):
    # cycle through pages in range
    driver.get(base_url + str(i))
    pageURL = base_url + str(i)
    print(pageURL)
When I run this script from ubuntu, I get this error:
Traceback (most recent call last):
File "BandsInTown_Scraper_SF.py", line 91, in <module>
driver = webdriver.Chrome(chrome_options=options)
File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 73, in __init__
self.service.start()
File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 76, in start
stdin=PIPE)
File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.6/subprocess.py", line 1295, in _execute_child
restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory
I confirmed that I'm running the same version of Chromedriver/Chromium Browser:
ChromeDriver 79.0.3945.130 (e22de67c28798d98833a7137c0e22876237fc40a-refs/branch-heads/3945@{#1047})
Chromium 79.0.3945.130 Built on Ubuntu, running on Ubuntu 18.04
For what it's worth, I have this running on a mac, and I do have multiple web scraping scripts like this one running on the same EC2 instance (only 2 scripts so far, so not that much).
Update
I'm now getting these errors as well when trying to run this script on ubuntu:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 141, in _new_conn
(self.host, self.port), self.timeout, **extra_kw)
File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 60, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 601, in urlopen
chunked=chunked)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 346, in _make_request
self._validate_conn(conn)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 852, in _validate_conn
conn.connect()
File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 284, in connect
conn = self._new_conn()
File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 150, in _new_conn
self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f90945757f0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
^[[B File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 639, in urlopen
^[[B^[[A^[[A _stacktrace=sys.exc_info()[2])
File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 398, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.bandsintown.com', port=443): Max retries exceeded with url: /en/c/san-francisco-ca?page=6 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f90945757f0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "BandsInTown_Scraper_SF.py", line 39, in <module>
res = requests.get(url)
File "/home/ubuntu/.local/lib/python3.6/site-packages/requests/api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "/home/ubuntu/.local/lib/python3.6/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/home/ubuntu/.local/lib/python3.6/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/home/ubuntu/.local/lib/python3.6/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/home/ubuntu/.local/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.bandsintown.com', port=443): Max retries exceeded with url: /en/c/san-francisco-ca?page=6 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f90945757f0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))
Finally, here's my current monthly AWS usage, which doesn't show any memory quota being exceeded.
This error message...
restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory
...implies that the operating system was unable to allocate memory to initiate/spawn a new session.
Additionally, this error message...
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.bandsintown.com', port=443): Max retries exceeded with url: /en/c/san-francisco-ca?page=6 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f90945757f0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))
...implies that your program successfully iterated through Page 5 and you see this error while on Page 6.
I don't see any issues in your code block as such. I have taken your code, made some minor adjustments and here is the execution result:
Code Block:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
base_url = 'https://www.bandsintown.com/en/c/san-francisco-ca?page='
for i in range(1, 10):
    # cycle through pages in range
    driver.get(base_url + str(i))
    pageURL = base_url + str(i)
    print(pageURL)
Console Output:
https://www.bandsintown.com/en/c/san-francisco-ca?page=1
https://www.bandsintown.com/en/c/san-francisco-ca?page=2
https://www.bandsintown.com/en/c/san-francisco-ca?page=3
https://www.bandsintown.com/en/c/san-francisco-ca?page=4
https://www.bandsintown.com/en/c/san-francisco-ca?page=5
https://www.bandsintown.com/en/c/san-francisco-ca?page=6
https://www.bandsintown.com/en/c/san-francisco-ca?page=7
https://www.bandsintown.com/en/c/san-francisco-ca?page=8
https://www.bandsintown.com/en/c/san-francisco-ca?page=9
Deep dive
This error is coming from subprocess.py:
self.pid = _posixsubprocess.fork_exec(
args, executable_list,
close_fds, tuple(sorted(map(int, fds_to_keep))),
cwd, env_list,
p2cread, p2cwrite, c2pread, c2pwrite,
errread, errwrite,
errpipe_read, errpipe_write,
restore_signals, start_new_session, preexec_fn)
However, as per the discussion in OSError: [Errno 12] Cannot allocate memory, the error OSError: [Errno 12] Cannot allocate memory is related to RAM / swap.
Swap Space
Swap space is space on the system hard drive that has been designated as a place for the OS to temporarily store data it can no longer hold in RAM. This gives you the ability to increase the amount of data your program can keep in its working memory. Swap space is used primarily when there is no longer sufficient space in RAM to hold in-use application data. Information written to disk is significantly slower than information kept in RAM, so the operating system prefers to keep running application data in memory and use swap for older data. Keeping swap space as a fallback for when your system's RAM is depleted is a safety measure against out-of-memory issues, particularly on systems with non-SSD storage available.
System Check
To check if the system already has some swap space available, you need to execute the following command:
$ sudo swapon --show
If you don’t get any output, that means your system does not have swap space available currently. You can also verify that there is no active swap using the free utility as follows:
$ free -h
If there is no active swap in the system you will see an output as:
Output
total used free shared buff/cache available
Mem: 488M 36M 104M 652K 348M 426M
Swap: 0B 0B 0B
Creating Swap File
In this case you need to allocate space for swap, either as a separate partition devoted to the task or as a swap file that resides on an existing partition. To create a 1 gigabyte file, execute the following command:
$ sudo fallocate -l 1G /swapfile
You can verify that the correct amount of space was reserved by executing the following command:
$ ls -lh /swapfile
#Output
-rw-r--r-- 1 root root 1.0G Mar 08 10:30 /swapfile
This confirms the swap file has been created with the correct amount of space set aside.
Enabling the Swap Space
Once a file of the correct size is available, we need to actually turn it into swap space. First you need to lock down the permissions of the file so that only users with specific privileges can read its contents. This prevents unintended users from accessing the file, which would have significant security implications. So follow the steps below:
Make the file accessible only to a specific user, e.g. root, by executing the following command:
$ sudo chmod 600 /swapfile
Verify the permissions change by executing the following command:
$ ls -lh /swapfile
#Output
-rw------- 1 root root 1.0G Apr 25 11:14 /swapfile
This confirms only the root user has the read and write flags enabled.
Now you need to mark the file as swap space by executing the following command:
$ sudo mkswap /swapfile
#Sample Output
Setting up swapspace version 1, size = 1024 MiB (1073737728 bytes)
no label, UUID=6e965805-2ab9-450f-aed6-577e74089dbf
Next you need to enable the swap file, allowing the system to start utilizing it, by executing the following command:
$ sudo swapon /swapfile
You can verify that the swap is available by executing the following command:
$ sudo swapon --show
#Sample Output
NAME TYPE SIZE USED PRIO
/swapfile file 1024M 0B -1
Finally check the output of the free utility again to validate the settings by executing the following command:
$ free -h
#Sample Output
total used free shared buff/cache available
Mem: 488M 37M 96M 652K 354M 425M
Swap: 1.0G 0B 1.0G
Conclusion
Once the Swap Space has been set up successfully the underlying operating system will begin to use it as necessary.
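If you also want to confirm from inside Python that swap is now visible, here is a small optional sketch (assuming the psutil package is installed; it is not part of the standard library):
import psutil

# Report total and used swap in MiB; the numbers you see will depend on your setup
swap = psutil.swap_memory()
print("swap total: %.0f MiB, used: %.0f MiB" % (swap.total / 2**20, swap.used / 2**20))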
Probably what has happened is that the Chromium browser has been updated and now takes more memory (or perhaps leaks memory worse; you don't say how many URLs it gets through before dying).
As a workaround, launch a larger instance size. You don't say what instance size you are using, but if you have a t3.micro, try a t3.medium instead.
There is an easy-to-understand chart here: https://www.ec2instances.info/?region=eu-west-1
If you have launched an instance and want to resize it without rebuilding from scratch, use the console to take it to the stopped state, alter the size, and start it again.

Airflow scheduler starts up with exception when parallelism is set to a large number

I am new to Airflow and I am trying to use it to build a data pipeline, but it keeps hitting exceptions. My airflow.cfg looks like this:
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost/airflow
sql_alchemy_pool_size = 5
parallelism = 96
dag_concurrency = 96
worker_concurrency = 96
max_threads = 96
broker_url = postgresql+psycopg2://airflow:airflow@localhost/airflow
result_backend = postgresql+psycopg2://airflow:airflow@localhost/airflow
When I start airflow webserver -p 8080 in one terminal and then airflow scheduler in another terminal, the scheduler run throws the following exception. (It fails when I set the parallelism number above a certain amount and works fine otherwise; this may be computer-specific, but at least we know it is caused by the parallelism.) I have tried running 1000 Python processes on my computer and it worked fine, and I have configured Postgres to allow a maximum of 500 database connections, but it is still giving me the errors.
[2019-11-20 12:15:00,820] {dag_processing.py:556} INFO - Launched DagFileProcessorManager with pid: 85050
Process QueuedLocalWorker-18:
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/managers.py", line 811, in _callmethod
conn = self._tls.connection
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/Users/edward/.local/share/virtualenvs/avat-utils-JpGzQGRW/lib/python3.7/site-packages/airflow/executors/local_executor.py", line 111, in run
key, command = self.task_queue.get()
File "<string>", line 2, in get
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/managers.py", line 815, in _callmethod
self._connect()
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/managers.py", line 802, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 492, in Client
c = SocketClient(address)
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 619, in SocketClient
s.connect(address)
ConnectionRefusedError: [Errno 61] Connection refused
Thanks
Update: I tried running it in PyCharm and it worked fine there, but in the terminal it sometimes fails and sometimes does not.
I had the same issue. It turns out I had set max_threads=10 in airflow.cfg in combination with LocalExecutor. Switching to max_threads=2 solved the issue.
I found out a few days ago that Airflow actually starts all of the parallel processes when it starts up. I had been thinking of max_* and parallelism as a capacity limit, but they determine the number of processes launched at startup. So it looks like this issue is caused by insufficient resources on the computer.
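For example, a hedged airflow.cfg excerpt with more conservative values, in the same style as the configuration above (the numbers are placeholders to tune for your machine):
parallelism = 32
dag_concurrency = 16
max_threads = 2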

Can't load python NMSLIB index?

I'm trying to load an index of about 8.6 GB using nmslib with the following commands:
import nmslib

search_index = nmslib.init(method='hnsw', space='cosinesimil')
search_index.loadIndex('./data/search/search_index.nmslib')
But, the following error happens:
Check failed: data_level0_memory_
Traceback (most recent call last):
search_index.loadIndex("./data/search/search_index.nmslib")
RuntimeError: Check failed: it's either a bug or inconsistent data!
My computer has only 8 GB of main memory. So, did this error happen because the index is over 8 GB and could not be loaded into memory?
Thank you for any help.

'Cannot allocate memory' when submit Spark job

I got an error when trying to submit a Spark job to YARN.
But I can't understand which JVM threw this error.
How can I avoid this error?
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006bff80000, 3579314176, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 3579314176 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /opt/meituan/warehouse/ETL/sqlweaver/hs_err_pid6272.log
[DBUG]: AppID NOT FOUND
Traceback (most recent call last):
File "/opt/meituan/warehouse/ETL/sqlweaver/../sqlweaver/step.py", line 119, in act
self.actionclass.run(step = self)
File "/opt/meituan/warehouse/ETL/sqlweaver/../sqlweaver/action/base.py", line 60, in run
cls.__run__(step)
File "/opt/meituan/warehouse/ETL/sqlweaver/../sqlweaver/action/sparkjobsubmit.py", line 43, in __run__
handler.submit_job(step.template, step.args['submit_user'])
File "/opt/meituan/warehouse/ETL/sqlweaver/../sqlweaver/handler/SparkBaseHandler.py", line 115, in submit_job
raise RuntimeError(msg_f)
RuntimeError: <StringIO.StringIO instance at 0x5a929e0>
Traceback (most recent call last):
File "/opt/meituan/warehouse/ETL/sqlweaver/../sqlweaver/engine/koala.py", line 34, in process
step.act()
File "/opt/meituan/warehouse/ETL/sqlweaver/../sqlweaver/step.py", line 128, in act
raise e
RuntimeError: <StringIO.StringIO instance at 0x5a929e0>
Fail in Action Step(SparkJobSubmit)
"

Spark standalone mode: failed on connection exception

I am running a Spark (1.2.1) standalone cluster on my virtual machine (Ubuntu 12.04). I can run examples such as als.py and pi.py successfully, but I can't run the wordcount.py example because a connection error occurs.
bin/spark-submit --master spark://192.168.1.211:7077 /examples/src/main/python/wordcount.py ~/Documents/Spark_Examples/wordcount.py
The error message is as below:
15/03/13 22:26:02 INFO BlockManagerMasterActor: Registering block manager a12:45594 with 267.3 MB RAM, BlockManagerId(0, a12, 45594)
15/03/13 22:26:03 INFO Client: Retrying connect to server: a11/192.168.1.211:9000. Already tried 4 time(s).
......
Traceback (most recent call last):
File "/home/spark/spark/examples/src/main/python/wordcount.py", line 32, in <module>
.reduceByKey(add)
File "/home/spark/spark/lib/spark-assembly-1.2.1 hadoop1.0.4.jar/pyspark/rdd.py", line 1349, in reduceByKey
File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 1559, in combineByKey
File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 1942, in _defaultReducePartitions
File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 297, in getNumPartitions
......
py4j.protocol.Py4JJavaError: An error occurred while calling o23.partitions.
java.lang.RuntimeException: java.net.ConnectException: Call to a11/192.168.1.211:9000 failed on connection exception: java.net.ConnectException: Connection refused
......
I didn't use Yarn or ZooKeeper. And all the virtual machines can connect to each other via ssh without password. I also set the SPARK_LOCAL_IP for master and workers.
I think the wordcount.py example is accessing HDFS to read lines from a file (and then count the words).
Something like:
sc.textFile("hdfs://<master-hostname>:9000/path/to/whatever")
Port 9000 is usually used for hdfs.
Please make sure that this file is accessible, or do not use HDFS for that example :).
I hope it helps.
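For instance, here is a minimal hypothetical sketch of a word count that reads from an explicit local path instead of HDFS (the file path is a placeholder):
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="PythonWordCount")
# file:// forces the local filesystem instead of the default fs (often hdfs://<master>:9000)
lines = sc.textFile("file:///home/spark/words.txt")
counts = lines.flatMap(lambda line: line.split(' ')) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(add)
print(counts.take(10))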
