I am trying to submit code to the Spark master directly from IntelliJ:
def main(args: Array[String]): Unit = {
  println("***hello spark-test***")
  val spark = SparkSession
    .builder()
    .master("spark://172.22.208.1:7077")
    .appName("Spark-Test Application")
    .getOrCreate()
  import spark.implicits._
  val rawData = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
  val rdd = spark.sparkContext.parallelize(rawData)
  val df = rdd.toDF()
  val result = df.filter($"value" % 2 === 1).count()
  println(s"***Result odd numbers count: $result ***")
  spark.stop()
}
Result:
The application log hangs while connecting to the master:
22/12/29 12:38:55 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://172.22.208.1:7077...
22/12/29 12:38:55 INFO TransportClientFactory: Successfully created connection to /172.22.208.1:7077 after 38 ms (0 ms spent in bootstraps)
22/12/29 12:39:15 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://172.22.208.1:7077...
Driver log:
22/12/29 12:39:35 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
java.io.InvalidClassException: scala.collection.immutable.ArraySeq; local class incompatible: stream classdesc serialVersionUID = -8615987390676041167, local class serialVersionUID = 2701977568115426262
Spark version: spark-3.3.0-hadoop3-scala2.13
IntelliJ's Scala version: 2.13.5
However, when I remove the line 'master("spark://172.22.208.1:7077")', build a jar, and submit it via spark-submit, it works fine.
Usually this error means that you are not running the same Spark version or the same Scala version on both sides.
You need to make sure that the code you run from IntelliJ uses the same Scala and Spark versions as the ones running on your cluster.
Since you mentioned you are running Scala 2.13 in IntelliJ, I suspect your local Scala/Spark build is not exactly the same as the one on the cluster.
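For reference, here is a minimal build.sbt sketch that pins the project to the cluster's versions. The exact Scala patch version is my assumption: as far as I know, Spark 3.3.0's Scala 2.13 distribution is compiled against 2.13.8 rather than 2.13.5, which would explain the ArraySeq serialVersionUID mismatch; verify it against the scala-library jar shipped with your cluster.
// build.sbt -- sketch only; make these match the jars on the cluster exactly
scalaVersion := "2.13.8" // assumed: the Scala patch version of Spark 3.3.0's Scala 2.13 build

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.3.0",
  "org.apache.spark" %% "spark-sql"  % "3.3.0"
)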
I am trying to write data to a Cassandra table (Cosmos DB) via an Azure Databricks (DBR) job using Spark Streaming, and I am getting the exception below:
Query [id = , runId = ] terminated with exception: Failed to open native connection to Cassandra at {<name>.cassandra.cosmosdb.azure.com:10350} :: Method com/microsoft/azure/cosmosdb/cassandra/CosmosDbConnectionFactory$.createSession(Lcom/datastax/spark/connector/cql/CassandraConnectorConf;)Lcom/datastax/oss/driver/api/core/CqlSession; is abstract
Caused by: IOException: Failed to open native connection to Cassandra at {<name>.cassandra.cosmosdb.azure.com:10350} :: Method com/microsoft/azure/cosmosdb/cassandra/CosmosDbConnectionFactory$.createSession(Lcom/datastax/spark/connector/cql/CassandraConnectorConf;)Lcom/datastax/oss/driver/api/core/CqlSession; is abstract
Caused by: AbstractMethodError: Method com/microsoft/azure/cosmosdb/cassandra/CosmosDbConnectionFactory$.createSession(Lcom/datastax/spark/connector/cql/CassandraConnectorConf;)Lcom/datastax/oss/driver/api/core/CqlSession; is abstract
What I did to get here:
created cosmos DB account
created cassandra keyspace
created cassandra table
created DBR job
added com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.2.0 to the job cluster
added com.microsoft.azure.cosmosdb:azure-cosmos-cassandra-spark-helper:1.2.0 to the job cluster
What I tried:
different versions of the connector and of the Azure Cosmos DB helper library, but every combination fails with one ClassNotFoundException or method-not-found error or another
Code Snippet:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.log4j.Logger
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.cassandra._
import java.time.LocalDateTime
def writeDelta(spark: SparkSession, dataFrame: DataFrame, sourceName: String, checkpointLocation: String, dataPath: String, loadType: String, log: Logger): Boolean = {
  spark.conf.set("spark.cassandra.output.batch.size.rows", "1")
  spark.conf.set("spark.cassandra.connection.remoteConnectionsPerExecutor", "10")
  spark.conf.set("spark.cassandra.connection.localConnectionsPerExecutor", "10")
  spark.conf.set("spark.cassandra.output.concurrent.writes", "100")
  spark.conf.set("spark.cassandra.concurrent.reads", "512")
  spark.conf.set("spark.cassandra.output.batch.grouping.buffer.size", "1000")
  spark.conf.set("spark.cassandra.connection.keepAliveMS", "60000000") // Increase this number as needed
  spark.conf.set("spark.cassandra.output.ignoreNulls", "true")
  spark.conf.set("spark.cassandra.connection.host", "*******.cassandra.cosmosdb.azure.com")
  spark.conf.set("spark.cassandra.connection.port", "10350")
  spark.conf.set("spark.cassandra.connection.ssl.enabled", "true")
  // spark.cassandra.auth.username and password are set in cluster conf

  val write = dataFrame.writeStream
    .format("org.apache.spark.sql.cassandra")
    .options(Map("table" -> "****", "keyspace" -> "****"))
    .foreachBatch(upsertToDelta _)
    .outputMode("update")
    .option("mergeSchema", "true")
    .option("mode", "PERMISSIVE")
    .option("checkpointLocation", checkpointLocation)
    .start()

  write.awaitTermination()
}

def upsertToDelta(newBatch: DataFrame, batchId: Long): Unit = {
  try {
    val spark = SparkSession.active
    println(LocalDateTime.now())
    println("BATCH ID = " + batchId + " REC COUNT = " + newBatch.count())
    newBatch.persist()
    val userWindow = Window.partitionBy(keyColumn).orderBy(col(timestampCol).desc)
    val deDup = newBatch.withColumn("rank", row_number().over(userWindow)).where(col("rank") === 1).drop("rank")
    deDup.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "****", "keyspace" -> "****"))
      .mode("append")
      .save()
    newBatch.unpersist()
  } catch {
    case e: Exception =>
      throw e
  }
}
Update:
After implementing the solution suggested by @theo-van-kraay, I am getting the following error in the executor logs (the job keeps running even after this error):
23/02/13 07:28:55 INFO CassandraConnector: Connected to Cassandra cluster.
23/02/13 07:28:56 INFO DataWritingSparkTask: Commit authorized for partition 9 (task 26, attempt 0, stage 6.0)
23/02/13 07:28:56 INFO DataWritingSparkTask: Committed partition 9 (task 26, attempt 0, stage 6.0)
23/02/13 07:28:56 INFO Executor: Finished task 9.0 in stage 6.0 (TID 26). 1511 bytes result sent to driver
23/02/13 07:28:56 INFO DataWritingSparkTask: Commit authorized for partition 7 (task 24, attempt 0, stage 6.0)
23/02/13 07:28:56 INFO DataWritingSparkTask: Commit authorized for partition 1 (task 18, attempt 0, stage 6.0)
23/02/13 07:28:56 INFO DataWritingSparkTask: Commit authorized for partition 3 (task 20, attempt 0, stage 6.0)
23/02/13 07:28:56 INFO DataWritingSparkTask: Commit authorized for partition 5 (task 22, attempt 0, stage 6.0)
23/02/13 07:28:56 ERROR Utils: Aborting task
java.lang.IllegalArgumentException: Unable to get Token Metadata
at com.datastax.spark.connector.cql.LocalNodeFirstLoadBalancingPolicy.$anonfun$tokenMap$1(LocalNodeFirstLoadBalancingPolicy.scala:86)
at scala.Option.orElse(Option.scala:447)
at com.datastax.spark.connector.cql.LocalNodeFirstLoadBalancingPolicy.tokenMap(LocalNodeFirstLoadBalancingPolicy.scala:86)
at com.datastax.spark.connector.cql.LocalNodeFirstLoadBalancingPolicy.replicasForRoutingKey$1(LocalNodeFirstLoadBalancingPolicy.scala:103)
at com.datastax.spark.connector.cql.LocalNodeFirstLoadBalancingPolicy.$anonfun$getReplicas$8(LocalNodeFirstLoadBalancingPolicy.scala:107)
at scala.Option.flatMap(Option.scala:271)
at com.datastax.spark.connector.cql.LocalNodeFirstLoadBalancingPolicy.$anonfun$getReplicas$7(LocalNodeFirstLoadBalancingPolicy.scala:107)
at scala.Option.orElse(Option.scala:447)
at com.datastax.spark.connector.cql.LocalNodeFirstLoadBalancingPolicy.$anonfun$getReplicas$3(LocalNodeFirstLoadBalancingPolicy.scala:107)
at scala.Option.flatMap(Option.scala:271)
...
...
23/02/13 07:28:56 ERROR Utils: Aborting task
You can remove:
com.microsoft.azure.cosmosdb:azure-cosmos-cassandra-spark-helper:1.2.0
It is not required with the Spark 3 Cassandra Connector; it was created for Spark 2 only. Also remove any references to it in the code.
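As a rough sketch (not a verified configuration), the connection setup with only the Spark 3 connector on the classpath is just the plain connector settings you already have, with no custom connection factory; the host and port below are taken from the question:
// Sketch: Cosmos DB Cassandra API through spark-cassandra-connector 3.x only,
// without azure-cosmos-cassandra-spark-helper and without a custom connection factory.
spark.conf.set("spark.cassandra.connection.host", "<name>.cassandra.cosmosdb.azure.com")
spark.conf.set("spark.cassandra.connection.port", "10350")
spark.conf.set("spark.cassandra.connection.ssl.enabled", "true")
// spark.cassandra.auth.username / password stay in the cluster config, as in the question.
// If spark.cassandra.connection.factory is set anywhere (code or cluster config) to
// com.microsoft.azure.cosmosdb.cassandra.CosmosDbConnectionFactory, remove that setting:
// that factory class is the one raising the AbstractMethodError against the v3 connector.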
The "Unable to get Token Metadata" error is a known issue that affects Spark 3 (Java 4 driver) and Cosmos DB API for Apache Cassandra in certain scenarios. It has been fixed recently but is still in the process of being rolled out across the service. If resolution is urgent, you can raise a support case in Azure and we can expedite by enabling the fix explicitly on your account until it has been fully deployed. Feel free to mention this Stack Overflow question when raising the support case so that the engineer who handles it will have context.
In our Spark app, we use Spark Structured Streaming. It uses Kafka as the input stream and HiveAcid as the writeStream sink to a Hive table.
HiveAcid is an open-source library from Qubole called spark-acid: https://github.com/qubole/spark-acid
Below is our code:
import za.co.absa.abris.avro.functions.from_confluent_avro
....
val spark = SparkSession
  .builder()
  .appName("events")
  .config("spark.sql.streaming.metricsEnabled", true)
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._

val input_stream_df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("startingOffsets", """{"events":{"0":2310384922,"1":2280420020,"2":2278027233,"3":2283047819,"4":2285647440}}""")
  .option("maxOffsetsPerTrigger", 10000)
  .option("subscribe", "events")
  .load()

// schema registry config
val srConfig = Map(
  "schema.registry.url" -> "http://schema-registry:8081",
  "value.schema.naming.strategy" -> "topic.name",
  "schema.registry.topic" -> "events",
  "value.schema.id" -> "latest"
)

val data = input_stream_df
  .withColumn("value", from_confluent_avro(col("value"), srConfig))
  .withColumn("timestamp_s", from_unixtime($"value.timestamp" / 1000))
  .select(
    $"value.*",
    year($"timestamp_s") as 'year,
    month($"timestamp_s") as 'month,
    dayofmonth($"timestamp_s") as 'day
  )

// format "HiveAcid" is provided by the spark-acid lib from Qubole
val output_stream_df = data.writeStream.format("HiveAcid")
  .queryName("hiveSink")
  .option("database", "default")
  .option("table", "events_sink")
  .option("checkpointLocation", "/user/spark/events/checkpoint")
  .option("spark.acid.streaming.log.metadataDir", "/user/spark/events/checkpoint/spark-acid")
  .option("metastoreUri", "thrift://hive-metastore:9083")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()

output_stream_df.awaitTermination()
We were able to deploy the app to production and redeployed it several times (~10 times) without issue. Then it ran into the following error:
Query hiveSink [id = 080a9f25-23d2-4ec8-a8c0-1634398d6d29, runId = 990d3bba-0f7f-4bae-9f41-b43db6d1aeb3] terminated with exception: Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 42, 10.236.7.228, executor 3): org.apache.hadoop.fs.FileAlreadyExistsException: /warehouse/tablespace/managed/hive/events/year=2020/month=5/day=18/delta_0020079_0020079/bucket_00003 for client 10.236.7.228 already exists (...)
at com.qubole.shaded.orc.impl.PhysicalFsWriter.<init>(PhysicalFsWriter.java:95)
at com.qubole.shaded.orc.impl.WriterImpl.<init>(WriterImpl.java:177)
at com.qubole.shaded.hadoop.hive.ql.io.orc.WriterImpl.<init>(WriterImpl.java:94)
at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcFile.createWriter(OrcFile.java:334)
at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcRecordUpdater.initWriter(OrcRecordUpdater.java:602)
at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcRecordUpdater.addSimpleEvent(OrcRecordUpdater.java:423)
at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcRecordUpdater.addSplitUpdateEvent(OrcRecordUpdater.java:432)
at com.qubole.shaded.hadoop.hive.ql.io.orc.OrcRecordUpdater.insert(OrcRecordUpdater.java:484)
at com.qubole.spark.hiveacid.writer.hive.HiveAcidFullAcidWriter.process(HiveAcidWriter.scala:295)
at com.qubole.spark.hiveacid.writer.TableWriter$$anon$1$$anonfun$6.apply(TableWriter.scala:153)
at com.qubole.spark.hiveacid.writer.TableWriter$$anon$1$$anonfun$6.apply(TableWriter.scala:153)
(...)
at com.qubole.spark.hiveacid.writer.TableWriter$$anon$1.apply(TableWriter.scala:153)
at com.qubole.spark.hiveacid.writer.TableWriter$$anon$1.apply(TableWriter.scala:139)
Each time the app is restarted, it shows a "file already exists" error for a different delta/bucket file. However, those files are (most probably) newly created each time it starts, so we have no clue why the error is thrown.
Any pointers will be much appreciated.
I discovered the actual root cause from the worker's error log. It was due to code changes I made in one of the libraries used, which caused an out-of-memory issue.
What I posted before was the error log from the driver, after several failures on the worker node.
I have installed Scala.
I have installed Java 8.
All environment variables have been set for Spark, Java, and Hadoop.
I am still getting this error while running the spark-shell command. Please help; I have googled a lot but didn't find anything.
[Screenshots: spark-shell error, spark-shell error 2]
Spark's shell provides a simple way to learn the API. Start the shell by running the following in the Spark directory:
./bin/spark-shell
Then run the Scala code snippet below:
import org.apache.spark.sql.SparkSession
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
val logData = spark.read.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println(s"Lines with a: $numAs, Lines with b: $numBs")
If the error still persists, then we have to look into the environment setup.
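For example, once the shell does start (or from any Scala REPL), this small snippet prints the environment variables a local Spark setup usually depends on; which variables your machine actually needs is an assumption on my part:
// Sketch: check that the usual Spark-related environment variables are visible to the JVM.
Seq("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME", "PATH").foreach { name =>
  println(s"$name = ${sys.env.getOrElse(name, "<not set>")}")
}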
The SparkContext is not serializable. It is meant to be used on the driver, so can someone explain the following?
Using the spark-shell, on yarn, and Spark version 1.6.0
val rdd = sc.parallelize(Seq(1))
rdd.foreach(x => print(sc))
Nothing happens on the client (the print happens on the executor side)
Using the spark-shell, local master, and Spark version 1.6.0
val rdd = sc.parallelize(Seq(1))
rdd.foreach(x => print(sc))
Prints "null" on the client
Using pyspark, local master, and Spark version 1.6.0
rdd = sc.parallelize([1])
def _print(x):
    print(x)

rdd.foreach(lambda x: _print(sc))
Throws an Exception
I also tried the following:
Using the spark-shell, and Spark version 1.6.0
class Test(val sc:org.apache.spark.SparkContext) extends Serializable{}
val test = new Test(sc)
rdd.foreach(x => print(test))
Now it finally throws a java.io.NotSerializableException: org.apache.spark.SparkContext
Why does it work in Scala when I only print sc? Why do I get a null reference when it should have thrown a NotSerializableException (or so I thought)?
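For what it's worth, here is a plain-JVM sketch (no Spark involved) showing how a @transient reference quietly comes back as null after serialization instead of raising NotSerializableException; as far as I know, spark-shell declares sc as a transient value in its REPL wrapper, which would be consistent with the null you see in local mode:
import java.io._

// Sketch: a @transient field is skipped by Java serialization,
// so writing succeeds and the field deserializes as null.
class Holder(@transient val ref: AnyRef) extends Serializable

val buffer = new ByteArrayOutputStream()
val out = new ObjectOutputStream(buffer)
out.writeObject(new Holder(new Object)) // new Object is not Serializable, but the field is skipped
out.close()

val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
val restored = in.readObject().asInstanceOf[Holder]
println(restored.ref) // prints null -- no NotSerializableException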
I'm attempting to run a pyspark script on BigInsights on Cloud 4.2 Enterprise that accesses a Hive table.
First I create the hive table:
[biadmin@bi4c-xxxxx-mastermanager ~]$ hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 2.147 seconds
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loading data to table default.pokes
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0]
OK
Time taken: 0.49 seconds
hive>
Then I create a simple pyspark script:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ cat test_pokes.py
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import HiveContext
hc = HiveContext(sc)
pokesRdd = hc.sql('select * from pokes')
print( pokesRdd.collect() )
I attempt to execute with:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit \
--master yarn-cluster \
--deploy-mode cluster \
--jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
test_pokes.py
However, I encounter the error:
Traceback (most recent call last):
File "test_pokes.py", line 8, in <module>
pokesRdd = hc.sql('select * from pokes')
File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql
File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/pyspark.zip/pyspark/sql/utils.py", line 51, in deco
pyspark.sql.utils.AnalysisException: u'Table not found: pokes; line 1 pos 14'
End of LogType:stdout
If I run spark-submit standalone, I can see the table exists ok:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit test_pokes.py
…
…
16/12/21 13:09:13 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 18962 bytes result sent to driver
16/12/21 13:09:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 168 ms on localhost (1/1)
16/12/21 13:09:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/12/21 13:09:13 INFO DAGScheduler: ResultStage 0 (collect at /home/biadmin/test_pokes.py:9) finished in 0.179 s
16/12/21 13:09:13 INFO DAGScheduler: Job 0 finished: collect at /home/biadmin/test_pokes.py:9, took 0.236558 s
[Row(foo=238, bar=u'val_238'), Row(foo=86, bar=u'val_86'), Row(foo=311, bar=u'val_311')
…
…
See my previous question related to this issue: hive spark yarn-cluster job fails with: "ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory"
This question is similar to this other question: Spark can access Hive table from pyspark but not from spark-submit. However, unlike that question I am using HiveContext.
Update: see here for the final solution https://stackoverflow.com/a/41272260/1033422
This is because the spark-submit job is unable to find the hive-site.xml, so it cannot connect to the Hive metastore. Please add --files /usr/iop/4.2.0.0/hive/conf/hive-site.xml to your spark-submit command.
It looks like you are affected by this bug: https://issues.apache.org/jira/browse/SPARK-15345.
I had a similar issue with Spark 1.6.2 and 2.0.0 on HDP-2.5.0.0:
My goal was to create a DataFrame from a Hive SQL query, under these conditions:
the Python API,
cluster deploy mode (driver program running on one of the executor nodes),
YARN to manage the executor JVMs (instead of a standalone Spark master instance).
The initial tests gave these results:
spark-submit --deploy-mode client --master local ... => WORKING
spark-submit --deploy-mode client --master yarn ... => WORKING
spark-submit --deploy-mode cluster --master yarn ... => NOT WORKING
In case #3, the driver running on one of the executor nodes could not find the database. The error was:
pyspark.sql.utils.AnalysisException: 'Table or view not found: `database_name`.`table_name`; line 1 pos 14'
Fokko Driesprong's answer listed above worked for me.
With the command listed below, the driver running on the executor node was able to access a Hive table in a database other than default:
$ /usr/hdp/current/spark2-client/bin/spark-submit \
--deploy-mode cluster --master yarn \
--files /usr/hdp/current/spark2-client/conf/hive-site.xml \
/path/to/python/code.py
The Python code I used to test with Spark 1.6.2 and Spark 2.0.0 is:
(Change SPARK_VERSION to 1 to test with Spark 1.6.2. Make sure to update the paths in the spark-submit command accordingly.)
SPARK_VERSION = 2
APP_NAME = 'spark-sql-python-test_SV,' + str(SPARK_VERSION)

def spark1():
    from pyspark.sql import HiveContext
    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName(APP_NAME)
    sc = SparkContext(conf=conf)
    hc = HiveContext(sc)

    query = 'select * from database_name.table_name limit 5'
    df = hc.sql(query)
    printout(df)

def spark2():
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName(APP_NAME).enableHiveSupport().getOrCreate()
    query = 'select * from database_name.table_name limit 5'
    df = spark.sql(query)
    printout(df)

def printout(df):
    print('\n########################################################################')
    df.show()
    print(df.count())
    df_list = df.collect()
    print(df_list)
    print(df_list[0])
    print(df_list[1])
    print('########################################################################\n')

def main():
    if SPARK_VERSION == 1:
        spark1()
    elif SPARK_VERSION == 2:
        spark2()

if __name__ == '__main__':
    main()
For me, the accepted answer (--files /usr/iop/4.2.0.0/hive/conf/hive-site.xml) did not work.
Adding the code below at the top of the code file solved it:
import findspark
findspark.init('/usr/share/spark-2.4') # for 2.4