spark-shell avoid typing spark.sql(""" query """) - apache-spark

I use spark-shell a lot, often to run SQL queries against a database. The only way I know to run a SQL query there is to wrap it in spark.sql(""" query """).
Is there a way to switch to spark-sql directly and avoid the wrapper code? For example, beeline gives a direct SQL interface.

The spark-sql CLI ships with the Spark distribution:
$SPARK_HOME/bin/spark-sql
$ spark-sql
spark-sql> select 1-1;
0
Time taken: 6.368 seconds, Fetched 1 row(s)
spark-sql> select 1=1;
true
Time taken: 0.095 seconds, Fetched 1 row(s)
spark-sql>
Notes:
Spark SQL CLI cannot talk to the Thrift JDBC server
Configuration of Hive is done by placing your hive-site.xml, core-site.xml and hdfs-site.xml files in conf/
spark-sql --help
Usage: ./bin/spark-sql [options] [cli option]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME User to impersonate when submitting the application.
This argument does not work with --principal / --keytab.
--help, -h Show this help message and exit.
--verbose, -v Print additional debug output.
--version, Print the version of current Spark.
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode,
or all available cores on the worker in standalone mode)
YARN-only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on
secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above. This keytab will be copied to
the node running the Application Master via the Secure
Distributed Cache, for renewing the login tickets and the
delegation tokens periodically.
CLI options:
-d,--define <key=value> Variable subsitution to apply to hive
commands. e.g. -d A=B or --define A=B
--database <databasename> Specify the database to use
-e <quoted-query-string> SQL from command line
-f <filename> SQL from files
-H,--help Print help information
--hiveconf <property=value> Use value for given property
--hivevar <key=value> Variable subsitution to apply to hive
commands. e.g. --hivevar A=B
-i <filename> Initialization SQL file
-S,--silent Silent mode in interactive shell
-v,--verbose Verbose mode (echo executed SQL to the
console)
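The -e and -f options listed above also let you run SQL without entering the interactive prompt. The following is a minimal sketch, not an official wrapper: the function names run_query and run_file and the SPARK_SQL_BIN variable are made up for this example, and it assumes spark-sql is on your PATH (otherwise point SPARK_SQL_BIN at $SPARK_HOME/bin/spark-sql).

```shell
# SPARK_SQL_BIN is a hypothetical override; it defaults to the spark-sql
# launcher found on PATH.
SPARK_SQL_BIN="${SPARK_SQL_BIN:-spark-sql}"

# Run one SQL string from the command line (-e) and exit.
run_query() {
  "$SPARK_SQL_BIN" -e "$1"
}

# Run the statements stored in a SQL file (-f) and exit.
run_file() {
  "$SPARK_SQL_BIN" -f "$1"
}

# Example calls, commented out so that sourcing this file starts nothing.
# The file path is a placeholder.
# run_query "select 1=1"
# run_file /path/to/queries.sql
```

Source these functions from your shell profile and plain SQL strings or files can be run without the spark.sql(""" ... """) wrapper.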

Related

hadoop multi node with spark sample job

I have just configured Spark on my Hadoop cluster and I want to run the Spark sample job.
Before that, I want to understand what the command below stands for.
spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10
You can see all possible parameters for submitting a Spark job in the Spark documentation. I have summarized the ones in your submit command below:
spark-submit
--deploy-mode client # client or cluster (default: client). Whether the driver runs on a worker node (cluster) or locally as an external client (client)
--class org.apache.spark.examples.SparkPi # The entry point for your application
$SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10 # jar file path, followed by the arguments passed to the application
--master is another parameter usually defined in submit scripts. On my HDP cluster the default value of master is yarn. The Spark documentation also lists all possible values for master.

Submitting application on Spark Cluster using spark submit

I am new to Spark.
I want to run a Spark Structured Streaming application on a cluster.
The master and workers have the same configuration.
I have a few questions about submitting the app to the cluster using spark-submit.
You may find them comical or strange.
How can I give a path for third-party jars, like lib/*? (The application has 30+ jars.)
Will Spark automatically distribute the application and required jars to the workers?
Does the application need to be hosted on all the workers?
How can I check the status of my application, given that I am working from the console?
I am using the following script for spark-submit.
spark-submit
--class <class-name>
--master spark://master:7077
--deploy-mode cluster
--supervise
--conf spark.driver.extraClassPath <jar1, jar2..jarn>
--executor-memory 4G
--total-executor-cores 8
<running-jar-file>
But the code is not running as expected.
Am I missing something?
To pass multiple jar files to spark-submit, you can set the following properties in SPARK_HOME_PATH/conf/spark-defaults.conf (create the file if it does not exist).
Don't forget the * at the end of the paths:
spark.driver.extraClassPath /fullpath/to/jar/folder/*
spark.executor.extraClassPath /fullpath/to/jar/folder/*
spark-submit reads the properties in spark-defaults.conf every time it runs.
Copy your jar files into that directory (on every node, since these properties only extend the classpath and do not ship files); when you submit your Spark app to the cluster, the jar files in the specified paths will be loaded too.
spark.driver.extraClassPath: Extra classpath entries to prepend
to the classpath of the driver. Note: In client mode, this config
must not be set through the SparkConf directly in your application,
because the driver JVM has already started at that point. Instead,
please set this through the --driver-class-path command line option or
in your default properties file.
--jars transfers your jar files to the worker nodes and makes them available on both the driver and executor classpaths.
See the link below for more details.
http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
Alternatively, you can build a fat jar containing all dependencies. The link below explains how.
https://community.hortonworks.com/articles/43886/creating-fat-jars-for-spark-kafka-streaming-using.html
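Since --jars expects a comma-separated list rather than a wildcard such as lib/*, one way to handle 30+ jars is to build that list from the directory. This is a small sketch under assumptions: the function name jars_csv is made up here, the lib directory is the one mentioned in the question, and jar paths are assumed not to contain commas.

```shell
# Join every .jar in a directory into the comma-separated form that
# --jars expects. The body runs in a subshell so its variables stay local.
jars_csv() (
  out=""
  for jar in "$1"/*.jar; do
    [ -e "$jar" ] || continue   # skip the unexpanded pattern if no jars exist
    out="${out:+$out,}$jar"     # prepend a comma only after the first entry
  done
  printf '%s\n' "$out"
)

# Example call, commented out so that sourcing this file submits nothing:
# spark-submit --class <class-name> --master spark://master:7077 \
#   --jars "$(jars_csv lib)" <running-jar-file>
```

With this approach nothing needs to be copied to the workers by hand, because --jars ships the listed files with the application.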

can't add alluxio.security.login.username to spark-submit

I have a spark driver program which I'm trying to set the alluxio user for.
I read this post: How to pass -D parameter or environment variable to Spark job? and although helpful, none of the methods in there seem to do the trick.
My environment:
- Spark-2.2
- Alluxio-1.4
- packaged jar passed to spark-submit
The spark-submit job is being run as root (under supervisor), and alluxio only recognizes this user.
Here's where I've tried adding "-Dalluxio.security.login.username=alluxio":
spark.driver.extraJavaOptions in spark-defaults.conf
on the command line for spark-submit (using --conf)
within the sparkservices conf file of my jar application
within a new file called "alluxio-site.properties" in my jar application
None of these set the user for Alluxio, though I can easily set this property in a different (non-Spark) client application that also writes to Alluxio.
Anyone able to make this setting apply in spark-submit jobs?
If spark-submit is in client mode, you should use --driver-java-options instead of --conf spark.driver.extraJavaOptions=... in order for the driver JVM to be started with the desired options. Therefore your command would look something like:
./bin/spark-submit ... --driver-java-options "-Dalluxio.security.login.username=alluxio" ...
This should start the driver with the desired Java options.
If the Spark executors also need the option, you can set that with:
--conf "spark.executor.extraJavaOptions=-Dalluxio.security.login.username=alluxio"

how to execute spark program efficient in cluster

I have a 2-node Hadoop cluster. Each node has 16GB RAM and a 512GB hard disk.
I have written a Spark program like the one below.
Code:
val input = sc.wholeTextFiles("folderpath/*")
Do some operations on input.
Convert it to a DataFrame, register a temp table, then execute an insert command to write the DataFrame values to a Hive table.
Then I open a terminal on host 1 (the namenode of the cluster) and run spark-submit like this:
>spark-submit --class com.sample.parser --master yarn Parser.jar
But it takes more than 50 minutes to process 25 files totalling around 1GB. When I check the Spark UI, the executor list contains only host 2; host 1 is listed as the driver.
So in practice only one node (host 2) is executing the program. Why?
Is there a way to have my driver node also execute the program, so that it runs a little faster? Am I doing something wrong? Basically, I want my driver node to also act as an executor (both machines have 8 cores).
Thanks in Advance.
By default, spark-submit runs in client deploy mode, where the driver runs on the machine you submit from. To submit the Spark job in cluster mode, use --deploy-mode:
spark-submit \
--class com.sample.parser \
--master yarn \
--deploy-mode cluster \
Parser.jar
--deploy-mode: Whether to deploy your driver on the worker nodes
(cluster) or locally as an external client (client) (default: client)
Also, experiment with --num-executors <n>, trying different values of <n>, and see whether it makes any difference to the performance of your app.
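The experiment with --num-executors can be scripted. The sketch below is only one way to do it: the function name try_executors and the SPARK_SUBMIT variable are made up for this example, the class and jar names are taken from the question, and the timing assumes spark-submit blocks until the application finishes (which, as far as I know, is the default for YARN cluster mode).

```shell
# SPARK_SUBMIT is a hypothetical override; it defaults to spark-submit on PATH.
SPARK_SUBMIT="${SPARK_SUBMIT:-spark-submit}"

# Submit the same job once per executor count given as an argument and
# print the wall-clock seconds each run took.
try_executors() {
  for n in "$@"; do
    start=$(date +%s)
    "$SPARK_SUBMIT" --class com.sample.parser --master yarn \
      --deploy-mode cluster --num-executors "$n" Parser.jar
    end=$(date +%s)
    echo "num-executors=$n took $((end - start))s"
  done
}

# Example call, commented out so that sourcing this file submits nothing:
# try_executors 1 2 4
```

Comparing the printed timings shows whether adding executors helps for this workload.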

Apache Spark Multi Node Clustering

I am currently working on log analysis using Apache Spark, and I am new to it. I have tried Apache Spark's standalone mode. I can run my code by submitting the jar with client deploy mode, but I cannot run it on a multi-node cluster. The worker nodes are on different machines.
sh spark-submit --class Spark.LogAnalyzer.App --deploy-mode cluster --master spark://rishon.server21:7077 /home/rishon/loganalyzer.jar "/home/rishon/apache-tomcat-7.0.63/LogAnalysisBackup/"
When I run this command, it shows the following error:
15/10/20 18:04:23 ERROR ClientEndpoint: Exception from cluster was: java.io.FileNotFoundException: /home/rishon/loganalyzer.jar (No such file or directory)
java.io.FileNotFoundException: /home/rishon/loganalyzer.jar (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
at org.spark-project.guava.io.ByteSource.copyTo(ByteSource.java:202)
at org.spark-project.guava.io.Files.copy(Files.java:436)
at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:514)
at org.apache.spark.util.Utils$.copyFile(Utils.scala:485)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:562)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:369)
at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:79)
As I understand it, the driver program sends the data and application code to the worker nodes. I don't know whether my understanding is correct, so please help me run the application on a cluster.
I have tried to run the jar on the cluster and now there is no exception, but why are tasks not assigned to the worker nodes?
I have also tried without clustering, and it works fine, as shown in the following figure.
The image above shows tasks assigned to worker nodes. But I have one more problem when analysing the log files: they are in a folder on the master node (e.g. '/home/visva/log'), but the worker nodes search for the files on their own file systems.
I ran into the same problem.
My solution was to upload my .jar file to HDFS.
Then submit it with a command line like this:
spark-submit --class com.example.RunRecommender --master spark://Hadoop-NameNode:7077 --deploy-mode cluster --executor-memory 6g --executor-cores 3 hdfs://Hadoop-NameNode:9000/spark-practise-assembly-1.0.jar
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
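The upload step described above can be sketched as a small helper. This is an illustration under assumptions, not a standard tool: the function name stage_jar, the HDFS_BIN variable, and the /jars target directory are made up for this example, and the target directory is assumed to exist in HDFS already. It uses hdfs dfs -put -f to copy the jar and prints the resulting hdfs:// URL.

```shell
# HDFS_BIN is a hypothetical override; it defaults to the hdfs launcher on PATH.
HDFS_BIN="${HDFS_BIN:-hdfs}"

# Copy a local jar into HDFS and print the URL to hand to spark-submit.
# $1 = local jar, $2 = namenode URL prefix, $3 = existing target directory in HDFS
stage_jar() {
  # -f overwrites an older copy; tool output goes to stderr so that
  # stdout carries only the URL.
  "$HDFS_BIN" dfs -put -f "$1" "$3/" >&2 || return 1
  printf '%s%s/%s\n' "$2" "$3" "$(basename "$1")"
}

# Example calls, commented out so that sourcing this file uploads nothing:
# jar_url=$(stage_jar spark-practise-assembly-1.0.jar hdfs://Hadoop-NameNode:9000 /jars)
# spark-submit --class com.example.RunRecommender --master spark://Hadoop-NameNode:7077 \
#   --deploy-mode cluster "$jar_url"
```

Because the jar then lives at an hdfs:// URL, whichever worker ends up running the driver can fetch it.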
If you use cluster deploy mode with spark-submit, use port 6066 (the default port of the REST submission server in Spark):
spark-submit --class Spark.LogAnalyzer.App --deploy-mode cluster --master spark://rishon.server21:6066 /home/rishon/loganalyzer.jar "/home/rishon/apache-tomcat-7.0.63/LogAnalysisBackup/"
In my case, I uploaded the application jar to every node in the cluster, because I do not know how spark-submit transfers the application automatically, and I do not know how to specify which node acts as the driver node.
Note: the application jar path must be a path that exists on every node of the cluster.
There are two deploy modes in Spark:
1. client (default): the driver is launched directly within the spark-submit process, which acts as a client to the cluster (for example, on the master node).
2. cluster: if your application is submitted from a machine far from the worker machines, it is common to use cluster mode to minimize network latency between the driver and the executors.
Reference: Spark documentation on submitting a JAR
