Unable to connect to Elasticsearch from Spark after creating the SparkSession; it connects if the ES conf is set before SparkSession creation - apache-spark

I am experiencing an issue while connecting to Elasticsearch (ES) from Spark.
It connects to ES if we pass the ES details through spark-submit (i.e. prior to SparkSession creation), as below:
spark-submit --conf spark.es.nodes=<es_address_url> --conf spark.es.port=<port_number> --conf spark.es.net.ssl=true --conf spark.es.nodes.wan.only=true --class <> <jar_location>
and create the SparkSession in code as:
val spark: SparkSession = SparkSession.builder().appName(conf.get("spark.app.name"))
.enableHiveSupport().getOrCreate()
It throws an exception if we set the Elasticsearch details on the SparkSession object in code as below and do not pass them from spark-submit:
spark.conf.set("spark.es.nodes", <ES_URL>)
spark.conf.set("spark.es.port", <PortNumber>)
spark.conf.set("spark.es.net.ssl", "true")
spark.conf.set("spark.es.nodes.wan.only", "true")
val esTableData = spark.read
.format("org.elasticsearch.spark.sql")
.option("pushdown", "true").option("es.ignoreNulls","true")
.load(<PathOfTheIndexToRead>)
Spark-submit: spark-submit --class <class_name> <jar_location>
We get the exception below:
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:348)
org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:338)
... 40 more
The same issue also happens from spark-shell:
spark-shell --packages org.elasticsearch:elasticsearch-spark-30_2.12:8.1.0
scala> spark.conf.set("spark.es.nodes","https://<URL>/")
scala> spark.conf.set("spark.es.port", "<pot_number>")
scala> spark.conf.set("spark.es.net.ssl", "true")
scala> spark.conf.set("spark.es.nodes.wan.only", "true")
scala> val DF = spark.read.format("org.elasticsearch.spark.sql").option("pushdown", "true").option("es.ignoreNulls","true").option("es.field.read.empty.as.null", "no").load(<name_of_the_index>)
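One plausible explanation, assuming the connector resolves its settings from the SparkContext configuration rather than from the session's runtime conf: spark.conf.set(...) after getOrCreate only updates the runtime SQL conf, so spark.es.* values set that way are never seen by the connector. In that case the connection settings need to be in place before the session is created, or passed directly to the reader. A minimal Scala sketch of both variants (URL, port and index name are placeholders):

import org.apache.spark.sql.SparkSession

// Variant 1: set the ES details on the builder, before the session exists.
val spark = SparkSession.builder()
  .appName("es-read")
  .config("spark.es.nodes", "<es_address_url>")  // placeholder
  .config("spark.es.port", "<port_number>")      // placeholder
  .config("spark.es.net.ssl", "true")
  .config("spark.es.nodes.wan.only", "true")
  .enableHiveSupport()
  .getOrCreate()

// Variant 2: pass the settings as reader options (note: no "spark." prefix here).
val esTableData = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "<es_address_url>")
  .option("es.port", "<port_number>")
  .option("es.net.ssl", "true")
  .option("es.nodes.wan.only", "true")
  .option("pushdown", "true")
  .load("<name_of_the_index>")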

Related

Spark Kafka Streaming not displaying data on spark-submit on EMR

I am trying to stream data from a Kafka topic. It works in the spark shell, but if I create a .py file and use spark-submit for the same, it fails:
Code:
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("TestApp").enableHiveSupport().getOrCreate()
kafka_bootstrap_server = BOOTSTRAP_SERVERS
topic = 'ota-impactreportsync'
starting_offsets = 'earliest'
df = spark_session.readStream.format("kafka").option("kafka.bootstrap.servers", kafka_bootstrap_server).option(
"subscribe", topic).option("startingOffsets", starting_offsets).option("failOnDataLoss", "false").load()
df.writeStream.format("console").outputMode("append").start()
Commands used:
pyspark --master local --packages io.delta:delta-core_2.12:2.1.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,org.apache.spark:spark-avro_2.12:3.3.1 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
spark-submit --master local --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,io.delta:delta-core_2.12:2.1.1,org.apache.spark:spark-avro_2.12:3.3.1 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" test.py
If I use a batch read with spark-submit it works.
The streaming job fails in about 10 seconds every time, and the logs are not helpful either. No errors.
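One thing the snippet shows: the streaming query is started but never awaited, so under spark-submit the application can exit as soon as the script finishes (in the shell, the REPL keeps the process alive), which would match a job that ends after about 10 seconds with no errors. A minimal Scala sketch of the same pipeline with the driver blocking on the query; the bootstrap servers value is a placeholder:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("TestApp").enableHiveSupport().getOrCreate()

// Same Kafka source as in the question.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "<bootstrap_servers>")  // placeholder
  .option("subscribe", "ota-impactreportsync")
  .option("startingOffsets", "earliest")
  .option("failOnDataLoss", "false")
  .load()

// Start the console sink and block until the query terminates,
// so the driver does not exit as soon as the main method returns.
val query = df.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()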

Spark-submit fails with return code 13 for example of wordCount

My spark-submit command is :
spark-submit --class com.sundogsoftware.spark.WordCountBetterDataset --master yarn --deploy-mode cluster SparkCourse.jar
And for defining the SparkSession, I use this:
val spark = SparkSession
.builder
.master("spark://youness:7077")
.appName("WordCount")
.getOrCreate()
But at the end, my job fails with return code 13.
You need to leave the master unset in the code. It is preferable to set it later when you issue spark-submit (spark-submit --master yarn ...), and you are already doing that above. Just remove .master("spark://youness:7077") from your code.
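A minimal sketch of the corrected builder, with the master supplied only by spark-submit (the app name is kept from the question):

import org.apache.spark.sql.SparkSession

// No .master(...) here: spark-submit (--master yarn --deploy-mode cluster) decides where the job runs.
val spark = SparkSession
  .builder
  .appName("WordCount")
  .getOrCreate()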

How to connect to remote Cassandra server through pyspark for write operation?

I am trying to connect to a remote Cassandra server through pyspark, but it is not performing the write operation in Cassandra when running as a cron job. The same code works on the server in a Jupyter notebook, but not through the cron job.
import os

os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[*] pyspark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:2.5.0 --conf spark.cassandra.connection.host=127.0.0.1 pyspark-shell --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions'

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "keyspace_name")
sqlContext = SQLContext(sc)
Data_to_Write.write.format("org.apache.spark.sql.cassandra").mode('append').options(table="tablename", keyspace="keyspace_name").save()
I see this error in the Cassandra logs: ERROR [Messaging-EventLoop-3-3] 2020-08-05 09:24:36,606 OutboundConnectionInitiator.java:373 - Failed to handshake with peer xx.xxx.xxx.xxx:9042(xx.xxx.xxx.xxx:9042) org.apache.cassandra.net.Crc$InvalidCrc
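For reference, a minimal Scala sketch of the same write path, assuming the connection host should point at the remote Cassandra node rather than 127.0.0.1; the host, table and keyspace values are placeholders, and the DataFrame below is a stand-in for Data_to_Write:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-write")
  .config("spark.cassandra.connection.host", "<remote_cassandra_host>")  // placeholder
  .getOrCreate()

import spark.implicits._

// Stand-in DataFrame; in the question this is Data_to_Write, built elsewhere in the job.
val dataToWrite = Seq(("a", 1)).toDF("id", "value")

dataToWrite.write
  .format("org.apache.spark.sql.cassandra")
  .mode("append")
  .options(Map("table" -> "tablename", "keyspace" -> "keyspace_name"))
  .save()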

How to use Apache Spark to query Hive table with Kerberos?

I am attempting to use Scala with Apache Spark locally to query a Hive table which is secured with Kerberos. I have no issues connecting and querying the data programmatically without Spark. However, the problem comes when I try to connect and query in Spark.
My code when run locally without Spark:
Class.forName("org.apache.hive.jdbc.HiveDriver")
System.setProperty("kerberos.keytab", keytab)
System.setProperty("kerberos.principal", keytab)
System.setProperty("java.security.krb5.conf", krb5.conf)
System.setProperty("java.security.auth.login.config", jaas.conf)
val conf = new Configuration
conf.set("hadoop.security.authentication", "Kerberos")
UserGroupInformation.setConfiguration(conf)
UserGroupInformation.createProxyUser("user", UserGroupInformation.getLoginUser)
UserGroupInformation.loginUserFromKeytab(user, keytab)
UserGroupInformation.getLoginUser.checkTGTAndReloginFromKeytab()
if (UserGroupInformation.isLoginKeytabBased) {
UserGroupInformation.getLoginUser.reloginFromKeytab()
}
else if (UserGroupInformation.isLoginTicketBased) UserGroupInformation.getLoginUser.reloginFromTicketCache()
val con = DriverManager.getConnection("jdbc:hive://hdpe-hive.company.com:10000", user, password)
val ps = con.prepareStatement("select * from table limit 5").executeQuery();
Does anyone know how I could include the keytab, krb5.conf and jaas.conf into my Spark initialization function so that I am able to authenticate with Kerberos to get the TGT?
My Spark initialization function:
conf = new SparkConf().setAppName("mediumData")
.setMaster(numCores)
.set("spark.driver.host", "localhost")
.set("spark.ui.enabled","true") //enable spark UI
.set("spark.sql.shuffle.partitions",defaultPartitions)
sparkSession = SparkSession.builder.config(conf).enableHiveSupport().getOrCreate()
I do not have files such as hive-site.xml, core-site.xml.
Thank you!
Looking at your code, you need to set the following properties in the spark-submit command on the terminal.
spark-submit --master yarn \
--principal YOUR_PRINCIPAL_HERE \
--keytab YOUR_KEYTAB_HERE \
--conf spark.driver.extraJavaOptions="-Djava.security.auth.login.config=JAAS_CONF_PATH" \
--conf spark.driver.extraJavaOptions="-Djava.security.krb5.conf=KRB5_PATH" \
--conf spark.executor.extraJavaOptions="-Djava.security.auth.login.config=JAAS_CONF_PATH" \
--conf spark.executor.extraJavaOptions="-Djava.security.krb5.conf=KRB5_PATH" \
--class YOUR_MAIN_CLASS_NAME_HERE code.jar

A master URL must be set in your configuration gives a lot of confusion

I have compiled my Spark Scala code in Eclipse.
I am trying to run my jar on EMR (5.9.0, Spark 2.2.0) using the spark-submit option.
But when I run it I get an error:
Details : Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
After reading lots of StackOverflow solutions I got confused and did not find a correct explanation of how and why to set the app master.
This is how I run my jar. I have tried all the options below:
spark-submit --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master yarn --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master yarn-client --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --deploy-mode cluster --master yarn --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --deploy-mode cluster --master yarn-client --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[*] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[1] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[2] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[3] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[4] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
spark-submit --master local[5] --deploy-mode cluster --class financialLineItem.FinancialLineItem s3://trfsmallfffile/AJAR/SparkJob-0.1-jar-with-dependencies.jar
I am not setting any app master in my Scala code.
package financialLineItem
import org.apache.spark.SparkConf
import org.apache.spark._
import org.apache.spark.sql.SQLContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{ Date, Timestamp }
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.functions.regexp_extract
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.expressions._
object FinancialLineItem {
def main(args: Array[String]) {
println("Enterin In to Spark Mode ")
val conf = new SparkConf().setAppName("FinanicalLineItem");
println("After conf")
val sc = new SparkContext(conf); //Creating spark context
println("After SC")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val get_cus_val = sqlContext.udf.register("get_cus_val", (filePath: String) => filePath.split("\\.")(3))
val rdd = sc.textFile("s3://path/FinancialLineItem/MAIN")
val header = rdd.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema = StructType(header.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data = sqlContext.createDataFrame(rdd.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema)
val schemaHeader = StructType(header.map(cols => StructField(cols.replace(".", "."), StringType)).toSeq)
val dataHeader = sqlContext.createDataFrame(rdd.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schemaHeader)
val df1resultFinal = data.withColumn("DataPartition", get_cus_val(input_file_name))
val rdd1 = sc.textFile("s3://path/FinancialLineItem/INCR")
val header1 = rdd1.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema1 = StructType(header1.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data1 = sqlContext.createDataFrame(rdd1.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema1)
val windowSpec = Window.partitionBy("LineItem_organizationId", "LineItem_lineItemId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = data1.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")
val dfMainOutput = df1resultFinal.join(latestForEachKey, Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")
.select($"LineItem_organizationId", $"LineItem_lineItemId",
when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition").as("DataPartition"),
when($"FinancialConceptCodeGlobalSecondaryId_1".isNotNull, $"FinancialConceptCodeGlobalSecondaryId_1").otherwise($"FinancialConceptCodeGlobalSecondaryId").as("FinancialConceptCodeGlobalSecondaryId"),
when($"FFAction_1".isNotNull, $"FFAction_1").otherwise($"FFAction|!|").as("FFAction|!|"))
.filter(!$"FFAction|!|".contains("D|!|"))
val dfMainOutputFinal = dfMainOutput.na.fill("").select($"DataPartition", $"StatementTypeCode", concat_ws("|^|", dfMainOutput.schema.fieldNames.filter(_ != "DataPartition").map(c => col(c)): _*).as("concatenated"))
val headerColumn = dataHeader.columns.toSeq
val headerLast = headerColumn.mkString("", "|^|", "|!|").dropRight(3)
val dfMainOutputFinalWithoutNull = dfMainOutputFinal.withColumn("concatenated", regexp_replace(col("concatenated"), "|^|null", "")).withColumnRenamed("concatenated", headerLast)
dfMainOutputFinalWithoutNull.repartition(1).write.partitionBy("DataPartition", "StatementTypeCode")
.format("csv")
.option("nullValue", "")
.option("delimiter", "\t")
.option("quote", "\u0000")
.option("header", "true")
.option("codec", "gzip")
.save("s3://path/FinancialLineItem/output")
I even tried setting the master URL in the Spark Scala code.
This EMR example for Spark works:
spark-submit --deploy-mode cluster --class org.apache.spark.examples.JavaSparkPi /usr/lib/spark/examples/jars/spark-examples.jar 5
If this works, then why is my jar not working?
I tried a print statement in my Scala class before creating the Spark context and it prints, so there is no issue with the jar file creation.
I don't know what I am missing.
I am also updating my Eclipse IDE setup.
I followed the docs below:
AWS Add Steps document
This is my observation:
A master URL like "spark://..." is for Spark Standalone, but EMR uses Spark on YARN, so the master URL should be just "yarn". This is already configured for you in spark-defaults.conf.
More findings:
When I tried to submit from spark-shell I got the error below:
User class threw exception: java.lang.UnsupportedOperationException: empty collection.
I think there might be some issue with my code also.
But I am getting the correct result when I run it from Zeppelin.
There's a lot of confusion going on here in the question and in the first answer. If you're running on EMR, which runs Spark on YARN, you do not need to set a master URL at all. It automatically defaults to "yarn", which is the correct value when running Spark on YARN (as opposed to Spark Standalone, which would have a master URL like spark://<host>:7077).
As mentioned in one of the other answers, "--master local" and "--deploy-mode cluster" also don't make sense together. "--master local" should only be used for local development and testing purposes and doesn't make sense to use on a cluster of machines such as on EMR. All it does is run your entire application in a single JVM; it won't run on YARN, it won't be distributed across the cluster, and there won't even be a separation between your driver code and the tasks.
As for "--deploy-mode cluster", as also stated in the other answer, this means that your driver runs in a YARN container on the cluster along with the executors, as opposed to the default of "--deploy-mode client", where the driver runs on the master node outside of YARN.
For more information, please see the Spark documentation, mainly https://spark.apache.org/docs/latest/submitting-applications.html and https://spark.apache.org/docs/latest/running-on-yarn.html.
As explained in the documentation, --deploy-mode cluster asks spark-submit to run the driver on the cluster, alongside the executors.
That, however, isn't applicable to your execution, as you're running locally. You should be using the client deploy mode. For that, just remove the --deploy-mode parameter altogether.
You have to choose one of the following calls, depending on how you want to run the driver program (or executors, for the last option). It's important to understand the differences as they are consequential.
If you want to run the driver program on the cluster (cluster mode, master chooses where on the cluster):
spark-submit --master spark://master.address.com:7077 --deploy-mode cluster #other options
If you want to run the driver program on the compute that is calling spark-submit (client mode, executors remain on the cluster):
spark-submit --master spark://master.address.com:7077 --deploy-mode client #other options
If you are running all locally (driver and executors), then your local master is appropriate:
spark-submit --master local[*] #other options
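If the same jar needs to work both under spark-submit on EMR (where --master yarn is supplied for you) and in a local IDE run, one option is to set a master in code only when none was provided; a minimal sketch, reusing the question's app name:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Use the master passed by spark-submit if there is one; fall back to local[*]
// only for IDE or test runs, so the jar still works on the EMR cluster.
val base = new SparkConf().setAppName("FinanicalLineItem")
val conf = if (base.contains("spark.master")) base else base.setMaster("local[*]")
val spark = SparkSession.builder().config(conf).getOrCreate()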
