Getting error while setting spark.ext.h2o.backend.cluster.mode=external in pysparkling standalone cluster - apache-spark

Code:
import pandas as pd
from pyspark.sql import SparkSession
from pysparkling import *
import h2o
from pysparkling.ml import H2OAutoML
spark = SparkSession.builder.appName('SparkApplication').getOrCreate()
hc = H2OContext.getOrCreate()
Spark-submit Command:
spark-submit --master spark://local:7077 \
  --py-files sparkling-water-3.36.1.3-1-3.2/py/h2o_pysparkling_3.2-3.36.1.3-1-3.2.zip \
  --conf "spark.ext.h2o.backend.cluster.mode=external" \
  --conf spark.ext.h2o.external.start.mode="auto" \
  --conf spark.ext.h2o.external.h2o.driver="/home/whiz/spark/h2odriver-3.36.1.3.jar" \
  --conf spark.ext.h2o.external.cluster.size=2 \
  spark_h20/h2o_script.py
Error Logs:
py4j.protocol.Py4JJavaError: An error occurred while calling o58.getOrCreate.
: java.io.IOException: Cannot run program "hadoop": error=2, No such file or directory

The automatic start of the Sparkling Water external backend is only supported in Hadoop or Kubernetes (K8s) environments. In a standalone deployment, you need to deploy the external backend manually, following the tutorial in the Sparkling Water documentation.
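For reference, a minimal sketch of what the manual external backend looks like on the PySpark side, assuming an H2O cluster has already been started by hand; the cloud name and representative address below are placeholders, and the exact spark.ext.h2o.* property names should be checked against the Sparkling Water documentation for your version:
from pyspark.sql import SparkSession
from pysparkling import H2OContext

# Placeholder values: cloud name and leader-node address of the externally
# started H2O cluster. Verify the property names against the SW docs.
spark = SparkSession.builder \
    .appName('SparkApplication') \
    .config("spark.ext.h2o.backend.cluster.mode", "external") \
    .config("spark.ext.h2o.external.start.mode", "manual") \
    .config("spark.ext.h2o.cloud.name", "my-h2o-cloud") \
    .config("spark.ext.h2o.cloud.representative", "10.0.0.1:54321") \
    .getOrCreate()

hc = H2OContext.getOrCreate()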

Related

spark-submit command with --py-files fails if the driver class path or executor class path is not set

I have a main script as below
from pyspark.sql.session import SparkSession
..............
..............
..............
import callmodule as cm  # imported from another pyspark script packaged in callmod.zip
..............
..............
..............
When I submit the Spark command as below, it fails with Error: No module named Callmodule:
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip mainscript.py
When I submit the Spark command with the driver class path (without the executor class path) as below, it runs successfully:
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --driver-class-path C:\pyspark\scripts\callmod.zip mainscript.py
(or)
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.driver.extraClassPath=C:\pyspark\scripts\callmod.zip mainscript.py
(or)
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.driver.extraLibraryPath=C:\pyspark\scripts\callmod.zip mainscript.py
When I submit the Spark command with the executor class path (without the driver class path), it also runs successfully:
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.executor.extraClassPath=C:\pyspark\scripts\callmod.zip mainscript.py
(or)
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.executor.extraLibraryPath=C:\pyspark\scripts\callmod.zip mainscript.py
Can you explain where the import statement below is executed: on the driver or on the executors?
import callmodule as cm
Why is the code not failing with Error: No Module Named callmodule when only the driver classpath is set or only the executor classpath is set?
You are using --master local, so the driver is the same as the executor. Therefore, setting the classpath on either the driver or the executor produces the same behaviour, and neither setting would cause an error.
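As a side note (not part of the original answer), here is a minimal PySpark sketch of shipping the same dependency from inside the script with SparkContext.addPyFile, which makes the zip importable on both the driver and the executors; the path is the one from the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mainscript").getOrCreate()

# Ship the zipped module to the driver and every executor; comparable in
# effect to passing it with --py-files on spark-submit.
spark.sparkContext.addPyFile(r"C:\pyspark\scripts\callmod.zip")

import callmodule as cm  # resolvable on the driver and inside executor tasks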

How to connect to remote Cassandra server through pyspark for write operation?

I am trying to connect to a remote Cassandra server through pyspark, but it does not perform the write operation in Cassandra when run as a cron job. The same code works on the server in a Jupyter notebook, but not through the cron job.
import os
from pyspark import SparkContext
from pyspark.sql import SQLContext

os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[*] pyspark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:2.5.0 --conf spark.cassandra.connection.host=127.0.0.1 pyspark-shell --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions'
sc = SparkContext("local", "keyspace_name")
sqlContext = SQLContext(sc)
Data_to_Write.write.format("org.apache.spark.sql.cassandra").mode('append').options(table="tablename", keyspace="keyspace_name").save()
I see this error in the Cassandra logs:
ERROR [Messaging-EventLoop-3-3] 2020-08-05 09:24:36,606 OutboundConnectionInitiator.java:373 - Failed to handshake with peer xx.xxx.xxx.xxx:9042(xx.xxx.xxx.xxx:9042) org.apache.cassandra.net.Crc$InvalidCrc
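For comparison, a minimal sketch (not from the question) that puts the same connector settings on the SparkSession builder instead of PYSPARK_SUBMIT_ARGS; the package version and host come from the question, and the keyspace/table are assumed to already exist:
from pyspark.sql import SparkSession

# Configure the Cassandra connector on the session itself rather than through
# PYSPARK_SUBMIT_ARGS, which must be set before the JVM is launched.
spark = SparkSession.builder \
    .appName("cassandra_write") \
    .config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.12:2.5.0") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions") \
    .getOrCreate()

# Small stand-in DataFrame; in the question this is Data_to_Write.
df = spark.createDataFrame([(1, "example")], ["id", "value"])
df.write.format("org.apache.spark.sql.cassandra") \
    .mode("append") \
    .options(table="tablename", keyspace="keyspace_name") \
    .save()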

spark (pyspark) UnicodeEncodeError on yarn cluster mode

I'm running some simple pyspark code that tries to print a "shrug" at the end. I'm running pyspark on YARN, and when I run my code in client mode everything works properly. When I run my code in cluster mode (--deploy-mode cluster), I get a UnicodeEncodeError.
Obviously I cannot print the "shrug", but I feel like I'm missing something major here. Should I be worried that my code isn't handling Unicode properly?
Attempted solution:
I've set PYTHONIOENCODING=utf8 and PYSPARK_PYTHON=python3.
Code:
#!/usr/bin/env python3
# Spark Imports
from pyspark import SparkContext
from pyspark.sql import SparkSession
...
if __name__ == '__main__':
    sc = SparkContext()
    session = SparkSession\
        .builder\
        .appName("myApp")\
        .getOrCreate()
    ...
    print("¯\_(ツ)_/¯")
No error: ./bin/spark-submit --master yarn myApp.py
Throws error: ./bin/spark-submit --master yarn --deploy-mode cluster myApp.py
Error:
Traceback (most recent call last):
File "myApp.py", line 435, in <module>
print("\xaf\_(\u30c4)_/\xaf")
UnicodeEncodeError: 'ascii' codec can't encode character '\xaf' in position 46: ordinal not in range(128)
Settings:
I have the following in my .zshrc:
alias sparkclient="/opt/mapr/spark/spark-2.0.1/bin/spark-submit --master yarn"
alias sparkcluster="/opt/mapr/spark/spark-2.0.1/bin/spark-submit --master yarn --deploy-mode cluster --driver-memory 6g"
export SPARK_HOME=/opt/mapr/spark/spark-2.0.1
export PYSPARK_PYTHON=python3
export PYTHONHASHSEED=0
export PYTHONIOENCODING=utf8
export SPARK_YARN_USER_ENV=PYTHONHASHSEED=0
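Not an answer from the thread, just a defensive sketch: if the YARN container's locale leaves stdout with an ASCII encoding, the stream can be re-wrapped in UTF-8 from inside the script, independently of the environment variables above:
import io
import sys

# If stdout was opened with an ASCII encoding (common when the container has
# no locale configured), re-wrap it so Unicode output does not raise.
if (sys.stdout.encoding or '').lower() not in ('utf-8', 'utf8'):
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8', errors='replace')

print("¯\\_(ツ)_/¯")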

A master URL must be set in your configuration gives lot of confusion

I have compiled my Spark Scala code in Eclipse.
I am trying to run my jar on EMR (5.9.0, Spark 2.2.0) using spark-submit.
But when I run it, I get an error:
Details: Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
After reading lots of Stack Overflow solutions I got confused and did not find a correct explanation of how and why to set the app master.
This is how I run my jar. I have tried all of the options below:
spark-submit --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master yarn --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master yarn-client --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --deploy-mode cluster --master yarn --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --deploy-mode cluster --master yarn-client --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[*] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[1] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[2] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[3] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[4] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[5] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
I am not setting any app master in my Scala code:
package financialLineItem
import org.apache.spark.SparkConf
import org.apache.spark._
import org.apache.spark.sql.SQLContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{ Date, Timestamp }
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.functions.regexp_extract
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.expressions._
object FinancialLineItem {
def main(args: Array[String]) {
println("Enterin In to Spark Mode ")
val conf = new SparkConf().setAppName("FinanicalLineItem");
println("After conf")
val sc = new SparkContext(conf); //Creating spark context
println("After SC")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val get_cus_val = sqlContext.udf.register("get_cus_val", (filePath: String) => filePath.split("\\.")(3))
val rdd = sc.textFile("s3://path/FinancialLineItem/MAIN")
val header = rdd.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema = StructType(header.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data = sqlContext.createDataFrame(rdd.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema)
val schemaHeader = StructType(header.map(cols => StructField(cols.replace(".", "."), StringType)).toSeq)
val dataHeader = sqlContext.createDataFrame(rdd.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schemaHeader)
val df1resultFinal = data.withColumn("DataPartition", get_cus_val(input_file_name))
val rdd1 = sc.textFile("s3://path/FinancialLineItem/INCR")
val header1 = rdd1.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema1 = StructType(header1.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data1 = sqlContext.createDataFrame(rdd1.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema1)
val windowSpec = Window.partitionBy("LineItem_organizationId", "LineItem_lineItemId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = data1.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")
val dfMainOutput = df1resultFinal.join(latestForEachKey, Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")
.select($"LineItem_organizationId", $"LineItem_lineItemId",
when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition").as("DataPartition"),
when($"FinancialConceptCodeGlobalSecondaryId_1".isNotNull, $"FinancialConceptCodeGlobalSecondaryId_1").otherwise($"FinancialConceptCodeGlobalSecondaryId").as("FinancialConceptCodeGlobalSecondaryId"),
when($"FFAction_1".isNotNull, $"FFAction_1").otherwise($"FFAction|!|").as("FFAction|!|"))
.filter(!$"FFAction|!|".contains("D|!|"))
val dfMainOutputFinal = dfMainOutput.na.fill("").select($"DataPartition", $"StatementTypeCode", concat_ws("|^|", dfMainOutput.schema.fieldNames.filter(_ != "DataPartition").map(c => col(c)): _*).as("concatenated"))
val headerColumn = dataHeader.columns.toSeq
val headerLast = headerColumn.mkString("", "|^|", "|!|").dropRight(3)
val dfMainOutputFinalWithoutNull = dfMainOutputFinal.withColumn("concatenated", regexp_replace(col("concatenated"), "|^|null", "")).withColumnRenamed("concatenated", headerLast)
dfMainOutputFinalWithoutNull.repartition(1).write.partitionBy("DataPartition", "StatementTypeCode")
.format("csv")
.option("nullValue", "")
.option("delimiter", "\t")
.option("quote", "\u0000")
.option("header", "true")
.option("codec", "gzip")
.save("s3://path/FinancialLineItem/output")
}
}
I even tried setting the master URL in my Spark Scala code.
This EMR example for Spark works:
spark-submit --deploy-mode cluster --class org.apache.spark.examples.JavaSparkPi /usr/lib/spark/examples/jars/spark-examples.jar 5
If this works, then why is my jar not working?
I tried a print statement in my Scala class before creating the Spark context, and it prints, so there is no issue with the jar file creation.
I don't know what I am missing.
I am also updating my Eclipse IDE setup.
I followed the docs below:
AWS add steps document
This is my observation:
A master URL like "spark://..." is for Spark Standalone, but EMR uses Spark on YARN, so the master URL should be just "yarn". This is already configured for you in spark-defaults.conf.
More findings:
When I tried to submit from spark-shell, I got the error below:
User class threw exception: java.lang.UnsupportedOperationException: empty collection.
I think there might be some issue with my code as well.
But I am getting the correct result when I run it from Zeppelin.
There's a lot of confusion going on here in the question and in the first answer. If you're running on EMR, which runs Spark on YARN, you do not need to set a master URL at all. It automatically defaults to "yarn", which is the correct value when running Spark on YARN (as opposed to Spark Standalone, which would have a master URL like spark://<master-host>:7077).
As mentioned in one of the other answers, "--master local" and "--deploy-mode cluster" also don't make sense together. "--master local" should only be used for local development and testing purposes and doesn't make sense to use on a cluster of machines such as on EMR. All it does is run your entire application in a single JVM; it won't run on YARN, it won't be distributed across the cluster, and there won't even be a separation between your driver code and the tasks.
As for "--deploy-mode cluster", as also stated in the other answer, this means that your driver runs in a YARN container on the cluster along with the executors, as opposed to the default of "--deploy-mode client", where the driver runs on the master node outside of YARN.
For more information, please see the Spark documentation, mainly https://spark.apache.org/docs/latest/submitting-applications.html and https://spark.apache.org/docs/latest/running-on-yarn.html.
As explained in the documentation, --deploy-mode cluster asks spark-submit to run the driver on one of the executors.
That, however, isn't applicable to your execution, as you're running locally. You should be using the client deploy mode. For that, just remove the --deploy-mode parameter altogether.
You have to choose one of the following calls, depending on how you want to run the driver program (or executors, for the last option). It's important to understand the differences as they are consequential.
If you want to run the driver program on the cluster (cluster mode, master chooses where on the cluster):
spark-submit --master spark://master.address.com:7077 --deploy-mode cluster #other options
If you want to run the driver program on the machine that is calling spark-submit (client mode, executors remain on the cluster):
spark-submit --master spark://master.address.com:7077 --deploy-mode client #other options
If you are running all locally (driver and executors), then your local master is appropriate:
spark-submit --master local[*] #other options
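As an illustrative sketch in PySpark (the question's code is Scala, but the pattern is the same): leave the master out of the application code so spark-submit or EMR's spark-defaults.conf can supply it, and use setIfMissing only as a local-testing fallback:
from pyspark import SparkConf
from pyspark.sql import SparkSession

# No hardcoded master: spark-submit (or EMR's spark-defaults.conf, which sets
# "yarn") provides it. setIfMissing is only a fallback so the same script
# still runs with plain `python` on a laptop.
conf = SparkConf().setAppName("FinancialLineItem")
conf.setIfMissing("spark.master", "local[*]")

spark = SparkSession.builder.config(conf=conf).getOrCreate()
print(spark.sparkContext.master)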

java.lang.ClassNotFoundException: com.ibm.db2.jcc.DB2Driver exception for connecting BigSQL using Python

I'm new to pyspark. I'm using Python 3.5 and Spark 2.2.0 on Ubuntu 16.0. I wrote the following code to connect to BigSQL using pyspark:
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_train_df = spark.read.jdbc("jdbc:db2://my bigsq url :port number:sslConnection=true;sslTrustStoreLocation=ibm-truststore.jks;sslTrustStorePassword=*password123;","schema.Table Name",
properties={"user": username,
"password": password,
'driver' : 'com.ibm.db2.jcc.DB2Driver'}) # Trust store location is defined in .bashrc
spark_train_df.registerTempTable('data_table')
train_df = spark.sql('select * from data_table')
Also I have added my trust store & driver path in my .bashrc file
But while running this code, I'm getting the error message:
java.lang.ClassNotFoundException: com.ibm.db2.jcc.DB2Driver
Can you experts please guide me on how to solve this problem?
You need to add the DB2 JDBC driver jar to your spark-submit (or spark-shell) command. For example, for Postgres:
spark-shell --master local[*] --packages org.postgresql:postgresql:9.4.1207.jre7
or, for DB2:
spark-shell --master local[*] --jars /path/to/db2/jdbc/db2.jar
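Not part of the original answer, but a hedged PySpark sketch of the same idea using a local copy of the DB2 driver jar via spark.jars; the jar path, host, port, database, and credentials below are placeholders:
from pyspark.sql import SparkSession

# Placeholder path to the IBM DB2 JDBC driver jar.
spark = SparkSession.builder \
    .appName("bigsql_read") \
    .config("spark.jars", "/path/to/db2jcc4.jar") \
    .getOrCreate()

# Placeholder connection details, mirroring the URL format from the question.
df = spark.read.jdbc(
    url="jdbc:db2://bigsql-host:50001/dbname:sslConnection=true;",
    table="schema.table_name",
    properties={"user": "username",
                "password": "password",
                "driver": "com.ibm.db2.jcc.DB2Driver"})
df.show()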
