Issues while using setConf in SparkLauncher from Windows - apache-spark

I am trying to trigger Pyspark code using SparkLauncher from Windows.
When I use
.setConf(SparkLauncher.DRIVER_MEMORY, "1G")
or any other configuration, the following error is thrown:
--conf "spark.driver.memory' is not recognized as an internal or external command
Also, I need to add multiple dependency jars. For example, when I use
addJar("D:\\jars\\elasticsearch-spark-20_2.11-6.0.0-rc2.jar")
it works. But when it is called multiple times,
.addJar("D:\\jars\\elasticsearch-spark-20_2.11-6.0.0-rc2.jar")
.addJar("D:\\jars\\mongo-spark-connector_2.11-2.2.0.jar")
the following error is thrown:
The filename, directory name, or volume label syntax is incorrect.
The same code works in Linux environment.
Could someone please help me with this?
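For reference, here is a minimal Java sketch of the launcher call in question, using only standard org.apache.spark.launcher.SparkLauncher methods; the Spark home and PySpark script path below are placeholders. On Windows the launcher ultimately shells out to spark-submit.cmd, and the cmd-level quoting of --conf values and of the comma-joined --jars list is where errors like the ones above tend to originate.

import org.apache.spark.launcher.SparkLauncher;

public class LaunchPySparkJob {
    public static void main(String[] args) throws Exception {
        // Paths below are placeholders -- adjust for your environment.
        Process spark = new SparkLauncher()
                .setSparkHome("C:\\spark")                  // assumed Spark installation directory
                .setAppResource("D:\\scripts\\job.py")      // the PySpark script to launch
                .setConf(SparkLauncher.DRIVER_MEMORY, "1g") // rendered as --conf spark.driver.memory=1g
                .addJar("D:\\jars\\elasticsearch-spark-20_2.11-6.0.0-rc2.jar")
                .addJar("D:\\jars\\mongo-spark-connector_2.11-2.2.0.jar") // jars are joined with commas into --jars
                .launch();                                  // builds and runs the spark-submit command

        spark.waitFor();
    }
}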

Related

What is wrong? Running a JAR file on Linux

I am using Java 11.
When I run the source code in IntelliJ it works very well, but when I package it into a JAR file and then run that JAR, it shows an error like
"java.lang.ClassNotFoundException: org.apache.hive.jdbc.HiveDriver"
When I run the JAR I use this command:
"java -jar test.jar"
The relevant part of the source code is Class.forName("org.apache.hive.jdbc.HiveDriver").
I don't understand why it works fine locally (IntelliJ) but shows the above error when I run it in the Linux environment.
If you know what is wrong, please give me a hand.
I wanted to add an image but I can't because of my company policy... thanks.
I have searched all over the web.
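A likely reason is that java -jar uses only the JAR's manifest Class-Path, so the Hive JDBC driver that IntelliJ put on the classpath during development is missing at runtime. Below is a minimal, hypothetical sketch of the driver usage (host, port, database and credentials are made up); however the program is started, the jar containing org.apache.hive.jdbc.HiveDriver and its dependencies must be on the runtime classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcCheck {
    public static void main(String[] args) throws Exception {
        // Throws java.lang.ClassNotFoundException when the hive-jdbc jar is not on the
        // runtime classpath, e.g. when running plain "java -jar test.jar".
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical connection details -- replace host, port, database and credentials.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}

Running it as, for example, java -cp "test.jar:lib/*" HiveJdbcCheck (use ; instead of : as the separator on Windows), or packaging everything into a single fat/shaded JAR, are the usual ways to make the driver visible at runtime.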

Execute databricks magic command from PyCharm IDE

With databricks-connect we can successfully run code written in Databricks notebooks from many IDEs. Databricks has also created many magic commands to support running multiple languages within a notebook by adding commands like %sql or %md to a cell. One issue I am currently facing when I try to execute Databricks notebooks in PyCharm is as follows:
How do I execute a Databricks-specific magic command from PyCharm?
E.g.
Importing a script or notebook in Databricks is done using this command:
%run
'./FILE_TO_IMPORT'
Whereas in the IDE, from FILE_TO_IMPORT import XYZ works.
Also, every time I download a Databricks notebook it comments out the magic commands, which makes the notebook impossible to use anywhere outside the Databricks environment.
It's really inefficient to convert all the Databricks magic commands every time I want to do any development.
Is there any configuration I could set that automatically detects Databricks-specific magic commands?
Any solution to this will be helpful. Thanks in advance!
Unfortunately, as of databricks-connect version 6.2.0:
"Magic commands cannot be used directly outside the Databricks environment. Working around this would require creating custom functions, and even then that only works for Jupyter, not PyCharm."
Also, since importing .py files requires the %run magic command, this becomes a major issue. One solution is to convert the set of files to be imported into a Python package, add it to the cluster via the Databricks UI, and then import and use it in PyCharm. But this is a very tedious process.

How to implement spark.ui.filter

I have a Spark cluster set up on 2 CentOS machines. I want to secure the web UI of my cluster (master node). I have written a BasicAuthenticationFilter servlet filter. I am unable to understand:
how I should use spark.ui.filters to secure my web UI, and
where I should place the servlet/JAR file.
Kindly help.
I also needed to handle this security problem to prevent unauthorized access to the Spark standalone UI. I eventually fixed it after searching the web; the procedure is:
1. Write and compile a Java filter that uses the standard HTTP basic authentication protocol (a sketch follows below). I referred to this blog: http://lambda.fortytools.com/post/26977061125/servlet-filter-for-http-basic-auth
2. Package the filter class as a JAR file and put it in $SPARK_HOME/jars/.
3. Add config lines to $SPARK_HOME/conf/spark-defaults.conf:
spark.ui.filters xxx.BasicAuthFilter # the full class name
spark.test.BasicAuthFilter.params user=foo,password=cool,realm=some
4. The username and password must then be provided to access the Spark UI; "realm" is insignificant, whatever you type.
5. Restart all master and worker processes and verify that it works.
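A minimal sketch of such a filter (Java, javax.servlet API), along the lines of the blog above; the package/class name is up to you and must match the value given to spark.ui.filters, and the init-parameter names here assume they line up with the user/password/realm values configured above.

import java.io.IOException;
import java.util.Base64;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Class name is a placeholder -- it must match what you configure in spark.ui.filters.
public class BasicAuthFilter implements Filter {

    private String user;
    private String password;
    private String realm;

    @Override
    public void init(FilterConfig filterConfig) {
        // Init parameters are expected to come from the spark.<FilterClass>.params line
        // in spark-defaults.conf (user=..., password=..., realm=...).
        user = filterConfig.getInitParameter("user");
        password = filterConfig.getInitParameter("password");
        realm = filterConfig.getInitParameter("realm");
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest req = (HttpServletRequest) request;
        HttpServletResponse resp = (HttpServletResponse) response;

        String header = req.getHeader("Authorization");
        if (header != null && header.startsWith("Basic ")) {
            // The header carries base64("user:password") after the "Basic " prefix.
            String decoded = new String(Base64.getDecoder().decode(header.substring(6)));
            if (decoded.equals(user + ":" + password)) {
                chain.doFilter(request, response);
                return;
            }
        }
        // Missing or wrong credentials: ask the client for basic auth.
        resp.setHeader("WWW-Authenticate", "Basic realm=\"" + realm + "\"");
        resp.sendError(HttpServletResponse.SC_UNAUTHORIZED);
    }

    @Override
    public void destroy() {
        // nothing to clean up
    }
}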
Hi, place the JAR file on all the nodes in the folder /opt/spark/conf/. Then, in a terminal:
Navigate to the directory /usr/local/share/jupyter/kernels/pyspark/.
Edit the file kernel.json.
Add the following arguments to PYSPARK_SUBMIT_ARGS: --jars /opt/spark/conf/filterauth.jar --conf spark.ui.filters=authenticate.MyFilter
Here, filterauth.jar is the JAR file created and authenticate.MyFilter represents <package name>.<class name>.
Hope this answers your query. :)

When trying to register a UDF using Python I get an error about Spark BUILD with HIVE

Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o54))
This happens whenever I create a UDF on a second notebook in Jupyter on IBM Bluemix Spark as a Service.
If you are using IBM Bluemix Spark as a Service, execute the following command in a cell of the Python notebook:
!rm -rf /gpfs/global_fs01/sym_shared/YPProdSpark/user/spark_tenant_id/notebook/notebooks/metastore_db/*.lck
Replace spark_tenant_id with the actual one. You can find the tenant id using the following command in a cell of the notebook:
!whoami
I've run into these errors as well. Only the first notebook you launch will have access to the Hive context. From here:
By default Hive(Context) is using embedded Derby as a metastore. It is intended mostly for testing and supports only one active user.

What is the proper way of running a Spark application on YARN using Oozie (with Hue)?

I have written an application in Scala that uses Spark.
The application consists of two modules - the App module which contains classes with different logic, and the Env module which contains environment and system initialization code, as well as utility functions.
The entry point is located in Env, and after initialization, it creates a class in App (according to args, using Class.forName) and the logic is executed.
The modules are exported into 2 different JARs (namely, env.jar and app.jar).
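(For illustration, a minimal sketch of that entry-point pattern; Java is used here for brevity, the real project is in Scala, and the Task interface and names are hypothetical.)

// Sketch of the Env module's entry point: the first program argument names the class
// to load from the App module (e.g. app.AggBlock1Task), which is then instantiated
// via Class.forName and executed.
public class Main {

    // Hypothetical interface the task classes in app.jar are assumed to implement.
    public interface Task {
        void run(String[] args);
    }

    public static void main(String[] args) throws Exception {
        // ... environment / SparkConf initialization would happen here ...
        String taskClassName = args[0];
        Task task = (Task) Class.forName(taskClassName)
                .getDeclaredConstructor()
                .newInstance();
        task.run(java.util.Arrays.copyOfRange(args, 1, args.length));
    }
}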
When I run the application locally, it executes well. The next step is to deploy the application to my servers. I use Cloudera's CDH 5.4.
I used Hue to create a new Oozie workflow with a Spark task with the following parameters:
Spark Master: yarn
Mode: cluster
App name: myApp
Jars/py files: lib/env.jar,lib/app.jar
Main class: env.Main (in Env module)
Arguments: app.AggBlock1Task
I then placed the 2 JARs inside the lib folder in the workflow's folder (/user/hue/oozie/workspaces/hue-oozie-1439807802.48).
When I run the workflow, it throws a FileNotFoundException and the application does not execute:
java.io.FileNotFoundException: File file:/cloudera/yarn/nm/usercache/danny/appcache/application_1439823995861_0029/container_1439823995861_0029_01_000001/lib/app.jar,lib/env.jar does not exist
However, when I leave the Spark master and mode parameters empty, it all works properly, but when I check spark.master programmatically it is set to local[*] and not yarn. Also, when observing the logs, I encountered this under Oozie Spark action configuration:
--master
null
--name
myApp
--class
env.Main
--verbose
lib/env.jar,lib/app.jar
app.AggBlock1Task
I assume I'm not doing it right - not setting the Spark master and mode parameters and running the application with spark.master set to local[*]. As far as I understand, creating a SparkConf object within the application should set the spark.master property to whatever I specify in Oozie (in this case yarn), but it just doesn't work when I do that.
Is there something I'm doing wrong or missing?
Any help will be much appreciated!
I managed to solve the problem by putting the two JARs in the user directory /user/danny/app/ and specifying the Jar/py files parameter as ${nameNode}/user/danny/app/env.jar. Running it caused a ClassNotFoundException to be thrown, even though the JAR was located in the same folder in HDFS. To work around that, I had to go to the settings and add the following to the options list: --jars ${nameNode}/user/danny/app/app.jar. This way the App module is referenced as well and the application runs successfully.
