'Cannot allocate memory' when submitting a Spark job - apache-spark

I got an error when trying to submit a Spark job to YARN,
but I can't tell which JVM threw this error.
How can I avoid it?
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006bff80000, 3579314176, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 3579314176 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /opt/meituan/warehouse/ETL/sqlweaver/hs_err_pid6272.log
[DBUG]: AppID NOT FOUND
Traceback (most recent call last):
File "/opt/meituan/warehouse/ETL/sqlweaver/../sqlweaver/step.py", line 119, in act
self.actionclass.run(step = self)
File "/opt/meituan/warehouse/ETL/sqlweaver/../sqlweaver/action/base.py", line 60, in run
cls.__run__(step)
File "/opt/meituan/warehouse/ETL/sqlweaver/../sqlweaver/action/sparkjobsubmit.py", line 43, in __run__
handler.submit_job(step.template, step.args['submit_user'])
File "/opt/meituan/warehouse/ETL/sqlweaver/../sqlweaver/handler/SparkBaseHandler.py", line 115, in submit_job
raise RuntimeError(msg_f)
RuntimeError: <StringIO.StringIO instance at 0x5a929e0>
Traceback (most recent call last):
File "/opt/meituan/warehouse/ETL/sqlweaver/../sqlweaver/engine/koala.py", line 34, in process
step.act()
File "/opt/meituan/warehouse/ETL/sqlweaver/../sqlweaver/step.py", line 128, in act
raise e
RuntimeError: <StringIO.StringIO instance at 0x5a929e0>
Fail in Action Step(SparkJobSubmit)
"

Related

Unpickling error when running fairseq on AML using multiple GPUs

I am trying to run a fairseq translation task on AML using 4 GPUs (P100) and it fails with the following error:
-- Process 2 terminated with the following error: Traceback (most recent call last): File
"/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/distributed_utils.py",
line 174, in all_gather_list
result.append(pickle.loads(bytes(out_buffer[2 : size + 2].tolist())))
_pickle.UnpicklingError: invalid load key, '\xad'.
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File
"/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/torch/multiprocessing/spawn.py",
line 19, in _wrap
fn(i, *args) File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py",
line 272, in distributed_main
main(args, init_distributed=True) File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py",
line 82, in main
train(args, trainer, task, epoch_itr) File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py",
line 123, in train
log_output = trainer.train_step(samples) File "/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/trainer.py",
line 305, in train_step
[logging_outputs, sample_sizes, ooms, self._prev_grad_norm], File
"/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/distributed_utils.py",
line 178, in all_gather_list
'Unable to unpickle data from other workers. all_gather_list requires all ' Exception: Unable to unpickle data from other workers.
all_gather_list requires all workers to enter the function together,
so this error usually indicates that the workers have fallen out of
sync somehow. Workers can fall out of sync if one of them runs out of
memory, or if there are other conditions in your training script that
can cause one worker to finish an epoch while other workers are still
iterating over their portions of the data.
2019-09-18 17:28:44,727|azureml.WorkerPool|DEBUG|[STOP]
Error occurred: User program failed with Exception: [same traceback as above]
The same code with the same parameters runs fine on a single local GPU. How do I resolve this issue?

pexpect telnet on windows

I've used pexpect on Linux successfully to telnet/SSH into Cisco switches for CLI scraping. I'm trying to convert this code over to Windows and having some issues, since Windows doesn't support the pexpect.spawn() command.
I read some online documentation and it suggested using the pexpect.popen_spawn.PopenSpawn command. Can someone please point out what I'm doing wrong here? I removed all my exception handling to simplify the code. Thanks.
import pexpect
from pexpect.popen_spawn import PopenSpawn
child = pexpect.popen_spawn.PopenSpawn('C:/Windows/System32/telnet 192.168.1.1')
child.expect('Username:')
child.sendline('cisco')
child.expect('Password:')
child.sendline('cisco')
child.expect('>')
child.close()
Error:
Traceback (most recent call last):
File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\pexpect\expect.py", line 98, in expect_loop
incoming = spawn.read_nonblocking(spawn.maxread, timeout)
File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\pexpect\popen_spawn.py", line 68, in read_nonblocking
raise EOF('End Of File (EOF).')
pexpect.exceptions.EOF: End Of File (EOF).
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python\windows scripts\telnet\telnet.py", line 37, in <module>
main()
File "C:\Python\windows scripts\telnet\telnet.py", line 20, in main
child.expect('Username:')
File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\pexpect\spawnbase.py", line 327, in expect
timeout, searchwindowsize, async_)
File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\pexpect\spawnbase.py", line 355, in expect_list
return exp.expect_loop(timeout)
File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\pexpect\expect.py", line 104, in expect_loop
return self.eof(e)
File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\pexpect\expect.py", line 50, in eof
raise EOF(msg)
pexpect.exceptions.EOF: End Of File (EOF).
<pexpect.popen_spawn.PopenSpawn object at 0x000000000328C550>
searcher: searcher_re:
0: re.compile("b'Username:'")
The solution is to use plink.exe, which is part of the PuTTY installation. You can download it from https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
Enable the Telnet Client on your Windows system and put plink.exe in the same folder from which you are running pexpect. Then you can telnet using pexpect as shown in the example below.
Also, use the timeout flag with PopenSpawn to wait for the connection to be established. The above error occurs because the timeout flag is not set.
p = pexpect.popen_spawn.PopenSpawn('plink.exe -telnet 192.168.0.1 -P 23', timeout=1)
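For reference, a minimal end-to-end sketch of the plink approach (the address, credentials, prompts, and command below are placeholders for your own device):
from pexpect.popen_spawn import PopenSpawn

# plink.exe must be on PATH or in the working directory; host and credentials are placeholders
child = PopenSpawn('plink.exe -telnet 192.168.0.1 -P 23', timeout=10)
child.expect('Username:')
child.sendline('cisco')
child.expect('Password:')
child.sendline('cisco')
child.expect('>')
child.sendline('show clock')
child.expect('>')
print(child.before.decode())   # output captured before the last prompt
child.sendline('exit')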

CherryPy Autoreloader dies with bcrypt

I have a pure CherryPy project and want AuthTool to use the bcrypt module for password processing.
But when I put an "import bcrypt" line into AuthTool.py, CherryPy tells me:
[17/Nov/2017:17:55:57] ENGINE Error in background task thread function <bound method Autoreloader.run of <cherrypy.process.plugins.Autoreloader object at 0x0000026948C82E80>>.
AttributeError: cffi library '_bcrypt' has no function, constant or global variable named '__loader__'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\Serge\AppData\Local\Programs\Python\Python36\lib\site-packages\cherrypy\process\plugins.py", line 519, in run
self.function(*self.args, **self.kwargs)
File "C:\Users\Serge\AppData\Local\Programs\Python\Python36\lib\site-packages\cherrypy\process\plugins.py", line 651, in run
for filename in self.sysfiles() | self.files:
File "C:\Users\Serge\AppData\Local\Programs\Python\Python36\lib\site-packages\cherrypy\process\plugins.py", line 635, in sysfiles
hasattr(m, '__loader__') and
SystemError: <built-in function hasattr> returned a result with an error set
Exception in thread Autoreloader:
AttributeError: cffi library '_bcrypt' has no function, constant or global variable named '__loader__'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\Serge\AppData\Local\Programs\Python\Python36\lib\threading.py", line 916, in _bootstrap_inner
self.run()
File "C:\Users\Serge\AppData\Local\Programs\Python\Python36\lib\site-packages\cherrypy\process\plugins.py", line 519, in run
self.function(*self.args, **self.kwargs)
File "C:\Users\Serge\AppData\Local\Programs\Python\Python36\lib\site-packages\cherrypy\process\plugins.py", line 651, in run
for filename in self.sysfiles() | self.files:
File "C:\Users\Serge\AppData\Local\Programs\Python\Python36\lib\site-packages\cherrypy\process\plugins.py", line 635, in sysfiles
hasattr(m, '__loader__') and
SystemError: <built-in function hasattr> returned a result with an error set
After that everything works fine except the Autoreloader. I have bcrypt (3.1.4) and CherryPy (11.1.0).
Can I do something to fix this autoreloader issue? Thanks.
You could disable the autoreloader by setting 'engine.autoreload.on' to False. Of course you will lose automatic reloading when monitored files change, so you'll need to restart the server manually after each change. Another way would be to put the server in production mode. See the available modes and what they do at: http://docs.cherrypy.org/en/latest/config.html#id14
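A minimal sketch of disabling it in code (assuming a typical quickstart-style app; the Root class here is just a placeholder):
import cherrypy

class Root:
    @cherrypy.expose
    def index(self):
        return 'Hello'

cherrypy.config.update({
    'engine.autoreload.on': False,   # disable the Autoreloader plugin
    # 'environment': 'production',   # alternative: production mode also turns autoreload off
})
cherrypy.quickstart(Root(), '/')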
BTW: I am seeing a similar issue with cherrypy (13.0.0) and cryptography (2.1.4)

memsql-ops start error in thread-7

After installing and starting memsql-ops, it shows the following error:
# ./memsql-ops start
Starting MemSQL Ops...
Exception in thread Thread-7:
Traceback (most recent call last):
File "/usr/local/updated-openssl/lib/python3.4/threading.py", line 921, in _bootstrap_inner
File "/usr/local/updated-openssl/lib/python3.4/threading.py", line 869, in run
File "/memsql_platform/memsql_platform/agent/daemon/manage.py", line 200, in startup_watcher
File "/memsql_platform/memsql_platform/network/api_client.py", line 34, in call
File "/usr/local/updated-openssl/lib/python3.4/site-packages/simplejson/__init__.py", line 516, in loads
File "/usr/local/updated-openssl/lib/python3.4/site-packages/simplejson/decoder.py", line 370, in decode
File "/usr/local/updated-openssl/lib/python3.4/site-packages/simplejson/decoder.py", line 400, in raw_decode
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Does anyone know the issue?
OS: CentOS 6.7
MemSQL: 5.1.0 Enterprise Trial
It is likely that you have another server running on the port Ops is trying to start with (default 9000) that is returning data which is not JSON decodable. The solution is to either start MemSQL Ops on a different port, or kill the server running at that port.
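One quick way to check what is already answering on that port is a short socket probe (a sketch using only the Python standard library; localhost and the default port 9000 are assumptions):
import socket

# Connect to the default MemSQL Ops port and print whatever is already answering there.
with socket.create_connection(('127.0.0.1', 9000), timeout=5) as s:
    s.sendall(b'GET / HTTP/1.0\r\n\r\n')
    print(s.recv(4096).decode(errors='replace'))  # non-JSON output here points to a port conflict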
We will fix this bug in an upcoming release of Ops! Thanks for pointing it out.

apache spark "Py4JError: Answer from Java side is empty"

I get this error every time. I use Sparkling Water. My conf file:
spark.driver.memory 65g
spark.python.worker.memory 65g
spark.master local[*]
The amount of data is about 5 GB. There is no other information about this error.
Does anybody know why it happens? Thank you!
***"ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent call last):
File "/data/analytics/Spark1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 746, in send_command
raise Py4JError("Answer from Java side is empty")
Py4JError: Answer from Java side is empty
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
File "/data/analytics/Spark1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start
self.socket.connect((self.address, self.port))
File "/usr/local/anaconda/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
File "/data/analytics/Spark1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start
self.socket.connect((self.address, self.port))
File "/usr/local/anaconda/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
File "/data/analytics/Spark1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start
self.socket.connect((self.address, self.port))
File "/usr/local/anaconda/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
Have you tried setting spark.executor.memory and spark.driver.memory in your Spark configuration file?
See https://stackoverflow.com/a/22742982/5453184 for more info.
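For reference, a sketch of what that might look like in conf/spark-defaults.conf (the sizes below are placeholders and should fit within the host's physical memory):
spark.driver.memory 8g
spark.executor.memory 8g
The same values can also be passed per job with spark-submit --driver-memory 8g --executor-memory 8g.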
Usually, you'll see this error when the Java process gets silently killed by the OOM Killer.
The OOM Killer (Out of Memory Killer) is a Linux kernel mechanism that kicks in when the system becomes critically low on memory. It selects a process based on its "badness" score and kills it to reclaim memory.
Read more on the OOM Killer here.
Increasing spark.executor.memory and/or spark.driver.memory values will only make things worse in this case, i.e. you may want to do the opposite!
Other options would be to:
increase the number of partitions if you're working with very big data sources;
increase the number of worker nodes;
add more physical memory to worker/driver nodes;
Or, if you're running your driver/workers using Docker (see the sketch after this list):
increase the Docker memory limit;
set --oom-kill-disable on your containers, but make sure you understand the possible consequences!
Read more on --oom-kill-disable and other Docker memory settings here.
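For example, a sketch of those Docker options on the command line (the image name and sizes are placeholders; --oom-kill-disable should only be combined with a hard memory limit):
docker run --memory=8g --memory-swap=8g --oom-kill-disable my-spark-worker-image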
Another point to note if you are on WSL 2 using PySpark: make sure your WSL 2 config file allows enough memory.
# Settings apply across all Linux distros running on WSL 2
[wsl2]
# Limits VM memory; this can be set as a whole number using GB or MB
memory=12GB # Originally set to 3GB, which caused failures: spark.executor.memory and spark.driver.memory were capped at 3GB regardless of how high I set them.
# Sets the VM to use eight virtual processors
processors=8
For reference, your .wslconfig config file should be located in C:\Users\USERNAME.
