Submit PySpark job with virtual environment using Livy to AWS EMR - python-3.x

I have created an EMR cluster with the configuration below, following the AWS documentation:
https://aws.amazon.com/premiumsupport/knowledge-center/emr-pyspark-python-3x/
```
{
    "Classification": "livy-conf",
    "Properties": {
        "livy.spark.deploy-mode": "cluster",
        "livy.impersonation.enabled": "true",
        "livy.spark.yarn.appMasterEnv.PYSPARK_PYTHON": "/usr/bin/python3"
    }
},
```
When I submit the PySpark job using Livy with the following POST request:
```
payload = {
    'file': self.py_file,
    'pyFiles': self.py_files,
    'name': self.job_name,
    'archives': ['s3://test.test.bucket/venv.zip#venv', 's3://test.test.bucket/requirements.pip'],
    'proxyUser': 'hadoop',
    "conf": {
        "PYSPARK_PYTHON": "./venv/bin/python",
        "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./venv/bin/python",
        "spark.yarn.executorEnv.PYSPARK_PYTHON": "./venv/bin/python",
        "spark.yarn.appMasterEnv.VIRTUAL_ENV": "./venv/bin/python",
        "spark.yarn.executorEnv.VIRTUAL_ENV": "./venv/bin/python",
        "livy.spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./venv/bin/python",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type": "native",
        "spark.pyspark.virtualenv.requirements": "s3://test.test.bucket/requirements.pip",
        "spark.pyspark.virtualenv.path": "./venv/bin/python"
    }
}
```
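For reference, a minimal sketch of how such a payload gets sent to Livy's /batches endpoint (assumptions: the `requests` library, a placeholder master hostname; Livy's default port is 8998):
```
import json
import requests

# Hypothetical standalone version of the payload above; 'emr-master' is a
# stand-in for the actual EMR master node hostname.
payload = {
    "file": "s3://test.test.bucket/main.py",
    "archives": ["s3://test.test.bucket/venv.zip#venv"],
    "conf": {"spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./venv/bin/python"},
}

resp = requests.post(
    "http://emr-master:8998/batches",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.json())
```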
I get the following error message:
```
Log Type: stdout
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Fatal Python error: Py_Initialize: Unable to get the locale encoding
ImportError: No module named 'encodings'
Current thread 0x00007efc72b57740 (most recent call first)
```
I also tried setting PYTHONHOME and PYTHONPATH to the folder containing the Python binary in the virtual environment, but nothing worked:
```
"spark.yarn.appMasterEnv.PYTHONPATH": "./venv/bin/",
"spark.yarn.executorEnv.PYTHONPATH": "./venv/bin/",
"livy.spark.yarn.appMasterEnv.PYTHONPATH": "./venv/bin/",
"livy.spark.yarn.executorEnv.PYTHONPATH": "./venv/bin/",
#
"spark.yarn.appMasterEnv.PYTHONHOME": "./venv/bin/",
"spark.yarn.executorEnv.PYTHONHOME": "./venv/bin/",
"livy.spark.yarn.appMasterEnv.PYTHONHOME": "./venv/bin/",
"livy.spark.yarn.executorEnv.PYTHONHOME": "./venv/bin/",
```
Error:
```
Fatal Python error: Py_Initialize: Unable to get the locale encoding
ImportError: No module named 'encodings'
Current thread 0x00007f7351d53740 (most recent call first):
```
This is how I created the virtual environment:
```
python3 -m venv venv/
source venv/bin/activate
python3 -m pip install -r requirements.pip
deactivate
pushd venv/
zip -rq ../venv.zip *
popd
```
Virtual environment structure:
```
drwxrwxr-x 2 4096 Oct 15 12:37 bin/
drwxrwxr-x 2 4096 Oct 15 12:37 include/
drwxrwxr-x 3 4096 Oct 15 12:37 lib/
lrwxrwxrwx 1 3 Oct 15 12:37 lib64 -> lib/
-rw-rw-r-- 1 59 Oct 15 12:37 pip-selfcheck.json
-rw-rw-r-- 1 69 Oct 15 12:37 pyvenv.cfg
drwxrwxr-x 3 4096 Oct 15 12:37 share/
```
bin dir:
```
activate activate.csh activate.fish chardetect easy_install easy_install-3.5 pip pip3 pip3.5 python python3
```
lib dir:
```
python3.5/site-packages/
```
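If it helps diagnose the failure: a venv's python finds its standard library through the base interpreter recorded under the `home` key in pyvenv.cfg, so a zipped venv is generally not relocatable to a host where that base path is missing, and startup then dies before the `encodings` module can be located. A minimal sketch to check this (the venv path is taken from the layout above):
```
from pathlib import Path

# Read the `home` key from pyvenv.cfg; the venv's python needs this base
# interpreter directory to exist on the executor host, otherwise startup
# fails before the standard library (e.g. `encodings`) can be found.
cfg = Path("venv/pyvenv.cfg")
for line in cfg.read_text().splitlines():
    key, _, value = line.partition("=")
    if key.strip() == "home":
        base = Path(value.strip())
        print("base interpreter dir:", base, "exists:", base.exists())
```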
AWS support says it's an ongoing bug:
https://issues.apache.org/jira/browse/SPARK-13587
https://issues.apache.org/jira/browse/ZEPPELIN-2233
Any suggestions?
Thanks!

I needed to submit a PySpark job with a virtual environment. To use a virtualenv on EMR with the 5.x distribution I did this:
Go to the root of your code folder (for example /home/hadoop) and run:
```
virtualenv -p /usr/bin/python3 <your-venv_name>
source <your-venv_name>/bin/activate
```
Go into <your-venv_name>/bin and run:
```
./pip3 freeze    # ensure that it is empty
sudo ./pip3 install -r <CODE FOLDER PATH>/requirements.txt
./pip3 freeze    # ensure that it is populated
```
To submit my job I used (with basic config) this command:
```
spark-submit --conf spark.pyspark.virtualenv.bin.path=<path-to-your-venv_name> --conf spark.pyspark.python=<path-to-your-venv_name>/bin/python3 --conf spark.pyspark.driver.python=<path-to-your-venv_name>/bin/python3 <path-to-your-main.py>
```
In the main.py code you also have to set PYSPARK_PYTHON in the environment:
```
import os
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
```
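A minimal runnable sketch of such a main.py (the app name and the executor-side interpreter check are illustrative additions, not part of the original answer):
```
import os

# Set before the SparkSession/SparkContext is created so it takes effect.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("venv-example").getOrCreate()
# Verify which interpreter actually runs Python tasks on an executor:
print(spark.sparkContext.parallelize([0])
      .map(lambda _: __import__("sys").executable)
      .collect())
spark.stop()
```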

Related

Unable to locate JVM fatal error log file (hs_err_pid.log) after Dataproc Spark Job crash

After an Apache Spark executor JVM crash in a C++ library, I'm unable to locate the hs_err_pid.log file that is specified in the executor JVM output log. Here's an example of the executor output log:
```
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f6326dce8b0, pid=28580, tid=0x00007f630ea57700
#
# JRE version: OpenJDK Runtime Environment (8.0_212-b01) (build 1.8.0_212-8u212-b01-1~deb9u1-b01)
# Java VM: OpenJDK 64-Bit Server VM (25.212-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libessence-jni.so+0x18b0] Java_com_evernote_service_nts_indexer_lib_Essence_EssProcess+0x0
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1559573462307_0002/container_1559573462307_0002_01_000005/hs_err_pid28580.log
10:50:00:562 [Executor task launch worker for task 41] INFO .....NtsLibInternalIndexerProcessor(NtsLibInternalIndexerProcessor.java:50) process Process for user: 18432
[thread 140063422109440 also had an error]
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
```
But when I SSH to the target worker machine to locate hs_err_pid28580.log, I can't find any trace of this file. I've tried:
```
vglazkov@reindex-cluster-vg-w-0:~$ sudo find / -name hs_err_pid28580.log
vglazkov@reindex-cluster-vg-w-0:~$
vglazkov@reindex-cluster-vg-w-0:~$ sudo ls -la /hadoop/yarn/nm-local-dir/usercache/root/appcache/
total 12
drwx--x--- 3 yarn yarn 4096 Jun 4 10:46 .
drwxr-x--- 4 yarn yarn 4096 May 15 15:47 ..
drwx--x--- 3 yarn yarn 4096 Jun 4 10:48 application_1557935076075_0097
```
But in the last case the directory application_1557935076075_0097 does not match my application ID application_1559573462307_0002 and does not contain any hs_err_pid.log files.
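Worth noting: YARN removes a container's local directories once its application finishes (cleanup can be delayed with yarn.nodemanager.delete.debug-delay-sec), which would explain why no directory for application_1559573462307_0002 remains. As a sketch, a wider scan over whatever app caches are still present (path pattern assumed from the listing above):
```
import glob

# Look for JVM fatal-error logs across all users' YARN app caches on this worker.
pattern = "/hadoop/yarn/nm-local-dir/usercache/*/appcache/*/*/hs_err_pid*.log"
for path in glob.glob(pattern):
    print(path)
```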

AOSP does not have tools/vendor/google3 project

When I build Android Studio from source using 'bazel build //tools/adt/idea/...', the command can't find the 'tools/vendor/google3' module. Has Google not open-sourced this project?
```
zhangyang@zhangyang-OptiPlex-7040:~/aosp/gradle_3.1.2$ bazel build //tools/adt/idea/...
WARNING: ignoring http_proxy in environment.
Starting local Bazel server and connecting to it...
..............................
ERROR: error loading package '': Encountered error while reading extension file 'binds.bzl': no such package '@blaze//': /home/zhangyang/.cache/bazel/_bazel_zhangyang/e54d4cb13781c1d72b64dc99700261fe/external/blaze must be an existing directory
INFO: Elapsed time: 0.621s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
```
The Bazel WORKSPACE:
```
load("//tools/base/bazel:repositories.bzl", "setup_external_repositories")
setup_external_repositories()

local_repository(
    name = "blaze",
    path = "tools/vendor/google3/blaze",
)

load("@blaze//:binds.bzl", "blaze_binds")
blaze_binds()

http_archive(
    name = "bazel_toolchains",
    urls = [
        "https://mirror.bazel.build/github.com/bazelbuild/bazel-toolchains/archive/b49ba3689f46ac50e9277dafd8ff32b26951f82e.tar.gz",
        "https://github.com/bazelbuild/bazel-toolchains/archive/b49ba3689f46ac50e9277dafd8ff32b26951f82e.tar.gz",
    ],
    strip_prefix = "bazel-toolchains-b49ba3689f46ac50e9277dafd8ff32b26951f82e",
    sha256 = "1266f1e27b4363c83222f1a776397c7a069fbfd6aacc9559afa61cdd73e1b429",
)
```
But AOSP does not have the tools/vendor/google3 project.
TL;DR:
bazel build is broken in AOSP
Use <studio-master-dev>/tools/idea/build_studio.sh instead
Or, if you just want to build a submodule inside tools/base, simply run the Gradle build. You might have to remove some dead dependencies from build.gradle, but this shouldn't be difficult to fix.
Long version:
I had encountered the same error message, and took a look in the external directory:
```
ls -lah ~/.cache/bazel/_bazel_xxx/89112fe8516b5fa5b01df0651312df31/external/
total 16K
drwxrwxr-x 2 xxx xxx 4.0K Dec 12 14:04 .
drwxrwxr-x 7 xxx xxx 4.0K Dec 12 14:04 ..
-rw-rw-r-- 1 xxx xxx 33 Dec 12 14:04 @bazel_tools.marker
lrwxrwxrwx 1 xxx xxx 110 Dec 12 14:04 bazel_tools -> /home/xxx/.cache/bazel/_bazel_xxx/install/35f799b1c96ee2522d30a28ff4ef485a/_embedded_binaries/embedded_tools
lrwxrwxrwx 1 xxx xxx 55 Dec 12 14:04 blaze -> /home/xxx/studio-master-dev/tools/vendor/google3/blaze
```
What is actually missing is /tools/vendor/google3/blaze. A quick Google search shows that blaze is an internal version of bazel, exclusively used within Google.
A thread in Android Studio's issue tracker also confirms that bazel build is broken in AOSP, with some bonus hints that build instructions in the studio-master-dev branch are all outdated (ouch). The issue is still open at the point of writing so if you are building Android Studio (or related tools), you might want to take a look at the latest discussion there.
Remove all references to tools/vendor/google from tools/base/bazel/toplevel.WORKSPACE:
https://android.googlesource.com/platform/tools/idea/+/refs/heads/studio-master-dev/RELEASE.md
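As a sketch, applying that to the WORKSPACE shown in the question would leave something like this (only the blaze-related entries are dropped; the rest is unchanged):
```
load("//tools/base/bazel:repositories.bzl", "setup_external_repositories")
setup_external_repositories()

# The @blaze local_repository, the load of @blaze//:binds.bzl and the
# blaze_binds() call are removed: tools/vendor/google3 is not part of AOSP.

http_archive(
    name = "bazel_toolchains",
    urls = [
        "https://mirror.bazel.build/github.com/bazelbuild/bazel-toolchains/archive/b49ba3689f46ac50e9277dafd8ff32b26951f82e.tar.gz",
        "https://github.com/bazelbuild/bazel-toolchains/archive/b49ba3689f46ac50e9277dafd8ff32b26951f82e.tar.gz",
    ],
    strip_prefix = "bazel-toolchains-b49ba3689f46ac50e9277dafd8ff32b26951f82e",
    sha256 = "1266f1e27b4363c83222f1a776397c7a069fbfd6aacc9559afa61cdd73e1b429",
)
```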

Unable to Connect Sqoop to Oracle TimesTen through JDBC on Linux

I have installed the TimesTen database (full version) on Linux (Linux is a guest OS installed through Oracle VirtualBox with the Cloudera VM).
I am trying to run the following Sqoop command on Linux and am getting the errors below.
Command:
```
sqoop list-tables --connect jdbc:timesten:direct:dsn=sampledb_1122 --driver com.timesten.jdbc.TimesTenDriver
```
Error:
```
ERROR manager.SqlManager: Error reading database metadata: java.sql.SQLException: Problems with loading native library/missing methods: no ttJdbc in java.library.path
java.sql.SQLException: Problems with loading native library/missing methods: no ttJdbc in java.library.path
    at com.timesten.jdbc.JdbcOdbcConnection.connect(JdbcOdbcConnection.java:1809)
    at com.timesten.jdbc.TimesTenDriver.connect(TimesTenDriver.java:305)
    at com.timesten.jdbc.TimesTenDriver.connect(TimesTenDriver.java:161)
    at java.sql.DriverManager.getConnection(DriverManager.java:571)
    at java.sql.DriverManager.getConnection(DriverManager.java:233)
    at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:878)
    at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
    at org.apache.sqoop.manager.SqlManager.listTables(SqlManager.java:520)
    at org.apache.sqoop.tool.ListTablesTool.run(ListTablesTool.java:49)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
Could not retrieve tables list from server
18/02/18 18:56:04 ERROR tool.ListTablesTool: manager.listTables() returned null
```
TimesTen bin and lib folder locations:
```
/home/cloudera/timesten/TimesTen/tt1122_64/bin
/home/cloudera/timesten/TimesTen/tt1122_64/lib
```
The following values are set in my environment, among other parameters:
```
USERNAME=cloudera
DESKTOP_SESSION=gnome
MAIL=/var/spool/mail/cloudera
PATH=/var/lib/sqoop:/home/cloudera/timesten/TimesTen/tt1122_64/bin:/home/cloudera/timesten/TimesTen/tt1122_64/lib:/home/cloudera/anaconda3/bin:/var/lib/sqoop:/home/cloudra/timesten/TimesTen/tt1122_64/bin:/home/cloudera/timesten/TimesTen/tt1122_64/lib:/home/cloudera/anaconda3/bin:/home/cloudera/anaconda3/bin:/usr/local/firefox:/sbin:/usr/java/jdk1.7.0_67-cloudera/bin:/usr/local/apache-ant/apache-ant-1.9.2/bin:/usr/local/apache-maven/apache-maven-3.0.4/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/cloudera/bin
PWD=/home/cloudera
THREAD_FLAGS=native
HOME=/home/cloudera
SHLVL=2
M2_HOME=/usr/local/apache-maven/apache-maven-3.0.4
GNOME_DESKTOP_SESSION_ID=this-is-deprecated
LOGNAME=cloudera
CVS_RSH=ssh
CLASSPATH=/home/cloudera/timesten/TimesTen/tt1122_64/lib/ttjdbc6.jar
[cloudera@quickstart ~]$ echo $LD_LIBRARY_PATH
/home/cloudera/timesten/TimesTen/tt1122_64/lib:/home/cloudera/timesten/TimesTen/tt1122_64/lib:
```
```
[cloudera@quickstart ~]$ java -version
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
[cloudera@quickstart bin]$ ./ttversion
TimesTen Release 11.2.2.8.0 (64 bit Linux/x86_64) (tt1122_64:53396) 2015-01-20T08:36:31Z
Instance admin: cloudera
Instance home directory: /home/cloudera/timesten/TimesTen/tt1122_64
World accessible
Daemon home directory: /home/cloudera/timesten/TimesTen/tt1122_64/info
PL/SQL enabled.
```
In addition to the above, the ttjdbc6.jar file is located at the following locations:
```
[cloudera@quickstart sqoop]$ pwd
/var/lib/sqoop
[cloudera@quickstart sqoop]$ ls -ltr
total 0
lrwxrwxrwx 1 root root 40 Jun 9 2015 mysql-connector-java.jar -> /usr/share/java/mysql-connector-java.jar
lrwxrwxrwx 1 root root 58 Feb 16 21:37 ttjdbc6.jar -> /home/cloudera/timesten/TimesTen/tt1122_64/lib/ttjdbc6.jar
[cloudera@quickstart timesten]$ pwd
/usr/lib/timesten
[cloudera@quickstart timesten]$ ls -ltr
total 276
-rwxrwxrwx 1 root root 279580 Feb 18 11:33 ttjdbc6.jar
```
java.library.path output:
```
[cloudera@quickstart timesten]$ java -XshowSettings:properties
Property settings:
    awt.toolkit = sun.awt.X11.XToolkit
    file.encoding = UTF-8
    file.encoding.pkg = sun.io
    file.separator = /
    java.awt.graphicsenv = sun.awt.X11GraphicsEnvironment
    java.awt.printerjob = sun.print.PSPrinterJob
    java.class.path = /home/cloudera/timesten/TimesTen/tt1122_64/lib/ttjdbc6.jar
    java.class.version = 51.0
    java.endorsed.dirs = /usr/java/jdk1.7.0_67-cloudera/jre/lib/endorsed
    java.ext.dirs = /usr/java/jdk1.7.0_67-cloudera/jre/lib/ext
        /usr/java/packages/lib/ext
    java.home = /usr/java/jdk1.7.0_67-cloudera/jre
    java.io.tmpdir = /tmp
    java.library.path = /home/cloudera/timesten/TimesTen/tt1122_64/lib
        /home/cloudera/timesten/TimesTen/tt1122_64/lib
        /usr/java/packages/lib/amd64
        /usr/lib64
        /lib64
        /lib
        /usr/lib
    java.runtime.name = Java(TM) SE Runtime Environment
    java.runtime.version = 1.7.0_67-b01
```
I executed the ttenv.sh script, but it did not set up any parameters when I checked the environment, so I had to do it manually.
Gurus and experts, please help me here; I'm not sure what the issue is or why I am getting the above error.
Thanks for your help!
The key line here is this:
```
java.sql.SQLException: Problems with loading native library/missing methods:
no ttJdbc in java.library.path
```
The TimesTen JDBC driver is a type 1/2 driver, and it relies on the underlying TimesTen native libraries. Specifically, it needs several shared libraries located in <TimesTen_install_dir>/lib, such as libttJdbc.so (the one the error is complaining about), libtten.so, etc. Typically you need to make sure that java.library.path includes this directory (which appears to be the case) and that the CLASSPATH includes the ttjdbc7.jar file in that directory.
Another possibility is that your TimesTen installation is a 'client only' installation, in which case you cannot use the 'direct' driver; if you try to do so, you would get this exact error. I suggest checking whether you actually have the files libttJdbc.so and libtten.so in <TimesTen_install_dir>/lib. If not, you have a client-only install and need to configure and use client/server connectivity instead.
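A quick way to run that check, as a sketch (install path taken from the question; interpretation per the explanation above):
```
import os

# Presence of these native libraries distinguishes a full, direct-mode-capable
# TimesTen install from a client-only one.
tt_lib = "/home/cloudera/timesten/TimesTen/tt1122_64/lib"
for lib in ("libttJdbc.so", "libtten.so"):
    path = os.path.join(tt_lib, lib)
    print(path, "->", "found" if os.path.exists(path) else "MISSING (client-only install?)")
```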

Bower - ENOGIT git is not installed or not in the PATH using Symfony Process

We are currently running into an issue where running bower update --allow-root via Symfony's Process component results in an error of:
```
bower jquery#* <3.0 * * * * * * * * ENOGIT git is not installed or not in the PATH
```
Running git or bower via the component returns the expected response.
It seems as if bower/node can't find the git path. Running the following via Symfony Process:
```
echo $PATH
```
results in:
```
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
```
Inside /usr/bin is:
```
lrwxrwxrwx 1 root root 35 Feb 21 09:06 bower -> ../lib/node_modules/bower/bin/bower
lrwxrwxrwx 1 root root 22 Jan 23 11:49 node -> /etc/alternatives/node
-rwxr-xr-x 1 root root 25163808 Jan 6 00:12 nodejs
lrwxrwxrwx 1 root root 38 Jan 6 00:11 npm -> ../lib/node_modules/npm/bin/npm-cli.js
-rwxr-xr-x 1 root root 1862800 Mar 23 2016 git
```
Running which git via Symfony Process returns /usr/bin/git
Running which bower via Symfony Process returns /usr/bin/bower
Running which node via Symfony Process returns /usr/bin/node
Stack trace:
```
Error: git is not installed or not in the PATH
    at createError (/usr/lib/node_modules/bower/lib/util/createError.js:4:15)
    at GitHubResolver.GitResolver (/usr/lib/node_modules/bower/lib/core/resolvers/GitResolver.js:45:15)
    at GitHubResolver.GitRemoteResolver (/usr/lib/node_modules/bower/lib/core/resolvers/GitRemoteResolver.js:10:17)
    at new GitHubResolver (/usr/lib/node_modules/bower/lib/core/resolvers/GitHubResolver.js:13:23)
    at /usr/lib/node_modules/bower/lib/core/resolverFactory.js:18:16
From previous event:
    at PackageRepository.fetch (/usr/lib/node_modules/bower/lib/core/PackageRepository.js:46:6)
    at Manager._fetch (/usr/lib/node_modules/bower/lib/core/Manager.js:323:51)
    at Array.forEach (native)
    at Manager.resolve (/usr/lib/node_modules/bower/lib/core/Manager.js:116:23)
    at Project._bootstrap (/usr/lib/node_modules/bower/lib/core/Project.js:559:6)
    at /usr/lib/node_modules/bower/lib/core/Project.js:193:21
Console trace:
Error
    at StandardRenderer.error (/usr/lib/node_modules/bower/lib/renderers/StandardRenderer.js:81:37)
    at Logger.<anonymous> (/usr/lib/node_modules/bower/lib/bin/bower.js:110:26)
    at emitOne (events.js:96:13)
    at Logger.emit (events.js:188:7)
    at Logger.emit (/usr/lib/node_modules/bower/lib/node_modules/bower-logger/lib/Logger.js:29:39)
    at /usr/lib/node_modules/bower/lib/commands/index.js:48:20
    at _rejected (/usr/lib/node_modules/bower/lib/node_modules/q/q.js:844:24)
    at /usr/lib/node_modules/bower/lib/node_modules/q/q.js:870:30
    at Promise.when (/usr/lib/node_modules/bower/lib/node_modules/q/q.js:1122:31)
    at Promise.promise.promiseDispatch (/usr/lib/node_modules/bower/lib/node_modules/q/q.js:788:41)
```
Server:
NGINX
1 GB
30 GB Disk
Ubuntu 16.04.1 x64
Linux 4.4.0-62-generic x64
Has anyone else run into this problem, or could you suggest more tests to debug?
Regards
EDIT
I've debugged it back to /usr/lib/node_modules/bower/lib/util/createError.js:
```
var which = require('which');
var hasGit;

// Check if git is installed
try {
    which.sync('git');
    hasGit = true;
} catch (ex) {
    hasGit = false;
}
```
EDIT 2
I tracked it down to the node module 'which':
https://github.com/npm/node-which/blob/master/which.js#L21
where the line reading process.env.PATH is returning undefined.
The fix:
```
export PATH=$PATH:/usr/bin; bower update
```
process.env.PATH was returning undefined from the which node module:
https://github.com/npm/node-which/blob/master/which.js#L21
Adding the PATH prefix to the command fixed this issue.
Being more specific to my issue, Symfony's Process class takes a third parameter for environment variables:
```
$process = new Process( 'php composer.phar update', $path, [ 'PATH' => '/usr/bin', 'HOME' => '/home/forge' ] );
```

Spark-submit not finding main class although it does exist in the jar file

First, here is the application jar file to be submitted:
```
$ ls -rlta /shared/ysgood/target/yardstick-spark-uber-0.0.1.jar
-rw-r--r-- 1 steve staff 138611565 Aug 6 01:41 /shared/ysgood/target/yardstick-spark-uber-0.0.1.jar
```
Here is the class to be submitted:
```
01:55:02/ysgood $ jar -tvf target/yardstick-spark-uber-0.0.1.jar | grep SparkCoreRDDBenchmark.class
15091 Thu Aug 06 01:36:30 PDT 2015 org/yardstick/spark/SparkCoreRDDBenchmark.class
```
Here is the attempt at submitting:
```
$ spark-submit --master $MASTER --class org.yardstick.spark.SparkCoreRDDBenchmark target/yardstick-spark-uber-0.0.1.jar
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
Error: Cannot load main class from JAR
file:/shared/ysgood/org.yardstick.spark.SparkCoreRDDBenchmark
```
Regarding the error: notice that the path to the jar is incorrect. The following
```
/shared/ysgood/org.yardstick.spark.SparkCoreRDDBenchmark
```
does not make sense: it is missing the path to the jar file target/yardstick-spark-uber-0.0.1.jar.
After --class you have to put the fully qualified name of your main class, not the path to it. So check that in your code the SparkCoreRDDBenchmark class is in the package org.yardstick.spark.
If it is, then try running your jar without Spark and see if you get the "Cannot load main class" error. Maybe there were some problems when the jar was created.
Good luck!
