How to connect to a remote Hive from Spark with authentication - apache-spark

I have to use my local Spark to connect to a remote Hive with authentication.
I am able to connect via Beeline:
beeline> !connect jdbc:hive2://bigdatamr:10000/default
Connecting to jdbc:hive2://bigdatamr:10000/default
Enter username for jdbc:hive2://bigdatamr:10000/default: myusername
Enter password for jdbc:hive2://bigdatamr:10000/default: ********
Connected to: Apache Hive (version 1.2.0-mapr-1703)
Driver: Hive JDBC (version 1.2.0-mapr-1703)
Transaction isolation: TRANSACTION_REPEATABLE_READ
How can I do the same from Spark?
I tried Thrift and JDBC, but neither works.
My Thrift attempt (I don't know how to pass the authentication):
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder.master("yarn") \
    .appName("my app") \
    .config("hive.metastore.uris", "thrift://bigdatamr:10000") \
    .enableHiveSupport() \
    .getOrCreate()
My JDBC attempt, which throws "Method not supported":
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:hive2://bigdatamr:10000") \
    .option("dbtable", "default.tmp") \
    .option("user", "myusername") \
    .option("password", "xxxxxxx") \
    .load()
Py4JJavaError: An error occurred while calling o183.load.
: java.sql.SQLException: Method not supported

You need to specify the driver you are using in the options of spark.read:
.option("driver", "org.apache.hive.jdbc.HiveDriver")
Also, you have to specify the database in the JDBC URL and only the table name in the dbtable option; for some reason it does not work to simply set dbtable to database.table.
It would look like this:
jdbcDF = spark.read \
    .format("jdbc") \
    .option("driver", "org.apache.hive.jdbc.HiveDriver") \
    .option("url", "jdbc:hive2://bigdatamr:10000/default") \
    .option("dbtable", "tmp") \
    .option("user", "myusername") \
    .option("password", "xxxxxxx") \
    .load()
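If the Hive JDBC driver classes are not already on Spark's classpath, you may also need to add the driver jar when building the session. A minimal sketch; the jar path is hypothetical, point it at wherever the hive-jdbc standalone jar lives on your machine:
from pyspark.sql import SparkSession

# Hypothetical path; replace with your hive-jdbc standalone jar
spark = SparkSession.builder \
    .appName("my app") \
    .config("spark.jars", "/path/to/hive-jdbc-standalone.jar") \
    .getOrCreate()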

Apparently this is a configuration problem.
If you have access to your server's /PATH/TO/HIVE/hive-site.xml file, copy it to your local Spark configuration folder /PATH/TO/SPARK/conf/ and then retry running your application.
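Once hive-site.xml is in place, a Hive-enabled session should see the remote metastore without any JDBC options. A minimal sketch, reusing the default.tmp table from the question:
from pyspark.sql import SparkSession

# Assumes the cluster's hive-site.xml has been copied into /PATH/TO/SPARK/conf/
spark = SparkSession.builder \
    .appName("my app") \
    .enableHiveSupport() \
    .getOrCreate()

# Hive tables can then be queried directly through Spark SQL
spark.sql("SELECT * FROM default.tmp LIMIT 10").show()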

Replace the official Hive JDBC driver with the Cloudera Hive JDBC driver; that works.
The driver can be downloaded here:
https://www.cloudera.com/downloads/connectors/hive/jdbc/2-6-15.html
I uploaded it to the Databricks libraries and changed the connection code.
Here is my code:
sql=f"SELECT * FROM (select column from db.table where column = 'condition'"
print(sql)
print("\nget Hive data\n")
spark_df = spark.read \
.format("jdbc")\
.option("driver", "com.cloudera.hive.jdbc41.HS2Driver") \
.option("url", "url") \
.option("query", "sql") \
.load()
Here is my blog:
https://blog.8owe.com/
It might help you more.

Related

I am facing "java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate" error while working with pyspark

I am using this tech stack:
Spark version: 3.3.1
Scala Version: 2.12.15
Hadoop Version: 3.3.4
Kafka Version: 3.3.1
I am trying to get data from a Kafka topic through Spark Structured Streaming, but I am facing the mentioned error. The code I am using is:
For reading data from the Kafka topic:
result_1 = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "sampleTopic1") \
    .option("startingOffsets", "latest") \
    .load()
For writing data to the console:
trans_detail_write_stream = result_1 \
    .writeStream \
    .trigger(processingTime='1 seconds') \
    .outputMode("update") \
    .option("truncate", "false") \
    .format("console") \
    .start() \
    .awaitTermination()
For execution I am using the following command:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1 streamer.py
I am facing this error "java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)"
and in later logs it gives me this exception too:
"StreamingQueryException: Query [id = 600dfe3b-6782-4e67-b4d6-97343d02d2c0, runId = 197e4a8b-699f-4852-a2e6-1c90994d2c3f] terminated with exception: Writing job aborted"
Please suggest a solution.
Edit: Screenshot for Spark Version

Spark Structured Streaming + pyspark app returns "Initial job has not accepted any resources"

RunCode
spark-submit --master spark://{SparkMasterIP}:7077 \
  --deploy-mode cluster \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,com.datastax.spark:spark-cassandra-connector_2.12:3.2.0,com.github.jnr:jnr-posix:3.1.15 \
  --conf spark.dynamicAllocation.enabled=false \
  --conf com.datastax.spark:spark.cassandra.connectiohost={SparkMasterIP==CassandraIP},spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions \
  test.py
Source Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import SQLContext

# Spark Bridge local to spark_master == Connect master
spark = SparkSession.builder \
    .master("spark://{SparkMasterIP}:7077") \
    .appName("Spark_Streaming+kafka+cassandra") \
    .config('spark.cassandra.connection.host', '{SparkMasterIP==CassandraIP}') \
    .config('spark.cassandra.connection.port', '9042') \
    .getOrCreate()

# Parse Schema of json
schema = StructType() \
    .add("col1", StringType()) \
    .add("col2", StringType()) \
    .add("col3", StringType()) \
    .add("col4", StringType()) \
    .add("col5", StringType()) \
    .add("col6", StringType()) \
    .add("col7", StringType())

# Read Stream From {TOPIC} at BootStrap
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "{KAFKAIP}:9092") \
    .option('startingOffsets', 'earliest') \
    .option("subscribe", "{TOPIC}") \
    .load() \
    .select(from_json(col("value").cast("String"), schema).alias("parsed_value")) \
    .select("parsed_value.*")

df.printSchema()

# write Stream at cassandra
ds = df.writeStream \
    .trigger(processingTime='15 seconds') \
    .format("org.apache.spark.sql.cassandra") \
    .option("checkpointLocation", "./checkPoint") \
    .options(table='{TABLE}', keyspace="{KEY}") \
    .outputMode('append') \
    .start()

ds.awaitTermination()
Error Code
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I checked the Spark UI; the workers have no problem.
Here is my Spark status (screenshot of the Spark UI):
My plan is:
kafka(DBIP) --readStream--> LOCAL(DriverIP) --writeStream--> Spark&Kafka&Cassandra(MasterIP)
DBIP, DriverIP, and MasterIP are different IPs.
LOCAL has no Spark installed, so I use PySpark in a Python virtualenv.
Edit
Your app can't run because there are no resources available in your Spark cluster.
If you look closely at the Spark UI screenshot you posted, all the cores are in use on all 3 workers. That means there are no cores left for any other apps, so any newly submitted app will have to wait until resources are freed before it can be scheduled. Cheers!
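If you want several applications to share the standalone cluster, one common option is to cap how many cores each app may grab, e.g. via spark.cores.max. A rough sketch; the value 2 is only illustrative:
from pyspark.sql import SparkSession

# Limit this app to 2 cores so other apps can still be scheduled on the cluster
spark = SparkSession.builder \
    .master("spark://{SparkMasterIP}:7077") \
    .appName("Spark_Streaming+kafka+cassandra") \
    .config("spark.cores.max", "2") \
    .getOrCreate()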

Pyspark is unable to find bigquery datasource

This is my PySpark configuration. I've followed the steps mentioned here and didn't create a SparkContext.
spark = SparkSession \
    .builder \
    .appName(appName) \
    .config(conf=spark_conf) \
    .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.22.0') \
    .config('spark.jars.packages', 'com.google.cloud.bigdataoss:gcsio:1.5.4') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar,spark-bigquery-with-dependencies_2.12-0.21.1.jar,spark-bigquery-latest_2.11.jar') \
    .config('spark.jars', 'postgresql-42.2.23.jar,bigquery-connector-hadoop2-latest.jar') \
    .getOrCreate()
Then, when I try to write a demo Spark dataframe to BigQuery:
df.write.format('bigquery') \
    .mode(mode) \
    .option("credentialsFile", "creds.json") \
    .option('table', table) \
    .option("temporaryGcsBucket", bucket) \
    .save()
it throws an error:
File "c:\sparktest\vnenv\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o60.save.
: java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
My problem was with faulty jar versions. I am using Spark 3.1.2 and Hadoop 3.2; these are the Maven packages, with the code that worked for me:
# You will have to download the jars listed in spark.jars manually.
spark = SparkSession \
    .builder \
    .master('local') \
    .appName('spark-read-from-bigquery') \
    .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.22.0,com.google.cloud.bigdataoss:gcs-connector:hadoop3-1.9.5,com.google.guava:guava:r05') \
    .config('spark.jars', 'guava-11.0.1.jar,gcsio-1.9.0-javadoc.jar') \
    .getOrCreate()

Error loading spark sql context for redshift jdbc url in glue

Hello, I am trying to fetch month-wise data from a bunch of heavy Redshift tables in a Glue job.
As far as I know, the Glue documentation on this is very limited.
The query works fine in SQL Workbench, which I have connected using the same JDBC connection (myjdbc_url) being used in Glue.
Below is what I have tried, and the error I am seeing:
from pyspark.context import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sql_context = SQLContext(sc)

df1 = sql_context.read \
    .format("jdbc") \
    .option("url", myjdbc_url) \
    .option("query", mnth_query) \
    .option("forward_spark_s3_credentials", "true") \
    .option("tempdir", "s3://my-bucket/sprk") \
    .load()

print("Total recs for month :" + str(mnthval) + " df1 -> " + str(df1.count()))
However, it shows me a driver error in the logs, as below:
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
at scala.Option.getOrElse(Option.scala:121)
I have used the following too, but to no avail; it ends up in a "Connection refused" error.
sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", myjdbc_url) \
    .option("query", mnth_query) \
    .option("forward_spark_s3_credentials", "true") \
    .option("tempdir", "s3://my-bucket/sprk") \
    .load()
What is the correct driver to use? I am using Glue, which is a managed service with a transient cluster in the background, and I am not sure what I am missing.
Please help me figure out the right driver.
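Not a definitive answer, but "No suitable driver" usually means Spark could not infer the JDBC driver class from the URL, so explicitly setting the driver option is worth trying. A sketch that assumes the Amazon Redshift JDBC 4.2 driver jar is attached to the Glue job (the exact class name depends on the driver version you use):
# Sketch only: assumes the Redshift JDBC 4.2 driver jar is available to the job
df1 = sql_context.read \
    .format("jdbc") \
    .option("driver", "com.amazon.redshift.jdbc42.Driver") \
    .option("url", myjdbc_url) \
    .option("query", mnth_query) \
    .load()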

Spark Cassandra Connector Error: java.lang.NoClassDefFoundError: com/datastax/spark/connector/TableRef

Spark version: 3.0.0
Scala: 2.12
Cassandra: 3.11.4
Connector: spark-cassandra-connector_2.12-3.0.0-alpha2.jar
I am not using DSE. Below is my test code to write the dataframe into my Cassandra database.
spark = SparkSession \
    .builder \
    .config("spark.jars", "spark-streaming-kafka-0-10_2.12-3.0.0.jar,spark-sql-kafka-0-10_2.12-3.0.0.jar,kafka-clients-2.5.0.jar,commons-pool2-2.8.0.jar,spark-token-provider-kafka-0-10_2.12-3.0.0.jar,spark-cassandra-connector_2.12-3.0.0-alpha2.jar") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .config('spark.cassandra.output.consistency.level', 'ONE') \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

streamingInputDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "192.168.56.1:9092") \
    .option("subscribe", "def") \
    .load()
## Dataset operations
def write_to_cassandra(streaming_df, E):
    streaming_df \
        .write \
        .format("org.apache.spark.sql.cassandra") \
        .options(table="a", keyspace="abc") \
        .save()

q1 = sites_flat.writeStream \
    .outputMode('update') \
    .foreachBatch(write_to_cassandra) \
    .start()

q1.awaitTermination()
I am able to do some operations on the dataframe and print it to the console, but I am not able to save to, or even read from, my Cassandra database. The error I am getting is:
File "C:\opt\spark-3.0.0-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o70.load.
: java.lang.NoClassDefFoundError: com/datastax/spark/connector/TableRef
at org.apache.spark.sql.cassandra.DefaultSource$.TableRefAndOptions(DefaultSource.scala:142)
at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:56)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:203)
I have tried another Cassandra connector version (2.5) but am getting the same error.
Please help!
The problem is that you're using the spark.jars option, which puts only the listed jars onto the classpath. But the TableRef case class is in the spark-cassandra-connector-driver package, which is a dependency of spark-cassandra-connector. To fix this problem, it's better to start pyspark or spark-submit with --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2 (same for Kafka support); in this case Spark will fetch all necessary dependencies and put them onto the classpath.
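A rough PySpark equivalent of the --packages approach is to set spark.jars.packages when building the session, so the connector's transitive dependencies are also resolved from Maven (coordinate taken from this answer; add the Kafka coordinates the same way if you need them):
from pyspark.sql import SparkSession

# spark.jars.packages resolves the connector plus its transitive dependencies,
# unlike spark.jars, which only adds the jars listed explicitly
spark = SparkSession.builder \
    .appName("StructuredNetworkWordCount") \
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .getOrCreate()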
P.S. With the alpha2 release you may get problems with fetching some dependencies, like ffi, groovy, etc.; this is a known bug (mostly in Spark), SPARKC-599, that is already fixed, and we'll hopefully get a beta drop very soon.
Update (14.03.2021): It's better to use the assembly version of SCC, which includes all necessary dependencies.
P.P.S. For writing to Cassandra from Spark Structured Streaming, don't use foreachBatch, just use it as a normal data sink:
val query = streamingCountsDF.writeStream
  .outputMode(OutputMode.Update)
  .format("org.apache.spark.sql.cassandra")
  .option("checkpointLocation", "webhdfs://192.168.0.10:5598/checkpoint")
  .option("keyspace", "test")
  .option("table", "sttest_tweets")
  .start()
I ran into the same problem; try this:
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>2.4.3</version>
</dependency>
Version compatibility is presumed to be the cause.
