How to connect spark with hive using pyspark? - python-3.x

I am trying to read Hive tables remotely using PySpark, but it fails with an error saying it is unable to connect to the Hive Metastore client.
I have read multiple answers on SO and other sources. They were mostly about configuration, and none of them explained why I am unable to connect remotely. From the documentation, I understood that Spark can connect to Hive without changes to any configuration file. Note: I have port-forwarded the machine where Hive is running and made it available at localhost:10000. I also connected to the same endpoint using Presto and was able to run queries on Hive.
The code is:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext
SparkContext.setSystemProperty("hive.metastore.uris", "thrift://localhost:9083")
sparkSession = (SparkSession
.builder
.appName('example-pyspark-read-and-write-from-hive')
.enableHiveSupport()
.getOrCreate())
data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)
df.write.saveAsTable('example')
I expected an acknowledgment that the table was saved, but instead I get this error.
The abridged error is:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 775, in saveAsTable
self._jwrite.saveAsTable(name)
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
I set up the port forwarding with this command:
ssh -i ~/.ssh/id_rsa_sc -L 9000:A.B.C.D:8080 -L 9083:E.F.G.H:9083 -L 10000:E.F.G.H:10000 ubuntu@I.J.K.l
When I check ports 10000 and 9083 with the following commands, both connections succeed:
aviral@versinator:~/testing-spark-hive$ nc -zv localhost 10000
Connection to localhost 10000 port [tcp/webmin] succeeded!
aviral@versinator:~/testing-spark-hive$ nc -zv localhost 9083
Connection to localhost 9083 port [tcp/*] succeeded!
Upon running the script, I get the following error:
Caused by: java.net.UnknownHostException: ip-172-16-1-101.ap-south-1.compute.internal
... 45 more
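The two checks described above (ports reachable through the tunnel, and the host name from the `UnknownHostException`) can be sketched in Python with the standard library. The helper names `port_is_open` and `host_resolves` are hypothetical and not part of PySpark. This is only a diagnostic sketch, not a fix.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds (similar to `nc -zv`)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


def host_resolves(hostname):
    """Return True if this machine can resolve hostname to an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False
```

For the setup in the question, `port_is_open("localhost", 9083)` corresponds to the `nc` check, and `host_resolves("ip-172-16-1-101.ap-south-1.compute.internal")` would show whether the host name from the stack trace can be resolved on the local machine. One possible reading of the trace is that the tunnel forwards the ports but the internal host name returned by the remote side still does not resolve locally.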

The catch is to set the Hive configuration while creating the Spark session itself.
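# Pass only the key and value: when conf= is also given, the builder copies
# the SparkConf entries and ignores the key/value pair.
sparkSession = (SparkSession
    .builder
    .appName('example-pyspark-read-and-write-from-hive')
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .enableHiveSupport()
    .getOrCreate()
)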
Note that no changes to the Spark configuration files are required; even serverless services like AWS Glue can make such connections.
The full code:
from pyspark.sql import SparkSession

# Java equivalent, for reference:
# SparkSession ss = SparkSession
#     .builder()
#     .appName(" Hive example")
#     .config("hive.metastore.uris", "thrift://localhost:9083")
#     .enableHiveSupport()
#     .getOrCreate();

# Set the metastore URI on the builder itself (key and value only;
# a conf= argument would make the builder ignore the key/value pair).
sparkSession = (SparkSession
    .builder
    .appName('example-pyspark-read-and-write-from-hive')
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .enableHiveSupport()
    .getOrCreate()
)
data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)
# Write into Hive
#df.write.saveAsTable('example')
# Read back from Hive
df_load = sparkSession.sql('SELECT * FROM example')
df_load.show()  # show() prints the rows itself and returns None
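As a small sketch of the idea in this answer, the metastore setting can be built once and applied to any session builder. The helpers `metastore_options` and `apply_options` are hypothetical names introduced for illustration, and the default host and port are just the values used in the question.

```python
def metastore_options(host="localhost", port=9083):
    """Build the Hive metastore setting as a plain dict of Spark options."""
    return {"hive.metastore.uris": "thrift://{}:{}".format(host, port)}


def apply_options(builder, options):
    """Apply each option through builder.config(key, value) and return the builder."""
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder


# Intended use with PySpark (not executed here, requires a Spark installation):
# from pyspark.sql import SparkSession
# spark = (apply_options(SparkSession.builder.appName("hive-example"),
#                        metastore_options())
#          .enableHiveSupport()
#          .getOrCreate())
```

Keeping the options in a dict makes it easy to print or log exactly what is handed to the builder before the session is created.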

Related

Error while connecting big query in GCP using Spark

I was trying to connect to Google BigQuery from PySpark using the code below:
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("GCP")
sc = SparkContext(conf=conf)
master = "yarn"
spark = SparkSession.builder \
.master("local")\
.appName("GCP") \
.getOrCreate()
spark._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile","key.json")
df = spark.read.format('bigquery') \
.option("parentProject", "project_name") \
.option('table', 'project_name.table_name') \
.load()
df.show()
My Spark version is 2.3 and the BigQuery JAR is spark-bigquery-latest_2.12.
Although my service account has the "BigQuery Job User" permission at the project level and "BigQuery Data Viewer" and "BigQuery User" at the dataset level, I still get the error below when I run the code above:
Traceback (most recent call last):
File "/home/lo815/GCP/gcp.py", line 23, in <module>
df.show()
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 350, in show
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o93.showString.
: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.PermissionDeniedException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: PERMISSION_DENIED: request failed: the user does not have 'bigquery.readsessions.create' permission for 'projects/GCP'
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:53)

Hive query from pyspark

I want to connect to Hive from PySpark using the following code:
from pyspark import SparkContext, SparkConf
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession, HiveContext
sparkSession = (SparkSession
.builder
.master('spark://spark_host:7077')
.appName('example-pyspark-read-and-write-from-hive')
.config("spark.sql.warehouse.dir", "hdfs://spark_host:9000/user/hive/warehouse", conf=SparkConf())
.enableHiveSupport()
.getOrCreate()
)
Output:
raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
My other attempts have also failed. Please help me configure PySpark and Hive to work together.
Spark version: 2.4.5, Hive version: 3.1.2

TypeError: 'JavaPackage' object is not callable & Spark Streaming's Kafka libraries not found in class path

I use PySpark Streaming to read Kafka data, but it fails:
import os
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext
from pyspark import SparkContext
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell'
sc = SparkContext(appName="test")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
kafkaStream.map(lambda x: x.split(" ")).pprint()
ssc.start()
ssc.awaitTermination()
________________________________________________________________________________________________
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.3 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.3.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
________________________________________________________________________________________________
Traceback (most recent call last):
File "/home/docs/dp_model/dp_algo_platform/dp_algo_core/test/test.py", line 29, in <module>
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 78, in createStream
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 217, in _get_helper
TypeError: 'JavaPackage' object is not callable
My Spark version is 2.4.3 and my Kafka version is 2.1.0. Replacing os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell' with os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.4.3 pyspark-shell' does not work either. How can I fix this?
I think you should reorder your imports so that the environment variable is set before you import and initialize Spark.
You also definitely need to use a package version that matches your Spark version.
import os
sparkVersion = '2.4.3' # update this accordingly
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:{} pyspark-shell'.format(sparkVersion)
# import Spark core
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
# import extra packages
from pyspark.streaming.kafka import KafkaUtils
# begin application
spark = SparkSession.builder.appName("test").getOrCreate()
sc = spark.sparkContext
Note: Kafka 0.8 support is deprecated as of Spark 2.3.0
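The version-matching advice can be sketched as a small helper that derives `PYSPARK_SUBMIT_ARGS` from the Spark version. The function names are hypothetical. Spark artifacts on Maven Central normally carry a Scala-version suffix such as `_2.11`, which the coordinates above omit, so the `scala_version` parameter and its default are assumptions to check against your own Spark build.

```python
import os


def kafka_submit_args(spark_version, scala_version="2.11"):
    """Build the PYSPARK_SUBMIT_ARGS value for the Kafka 0.8 streaming package."""
    coordinate = "org.apache.spark:spark-streaming-kafka-0-8_{}:{}".format(
        scala_version, spark_version
    )
    return "--packages {} pyspark-shell".format(coordinate)


def configure_environment(spark_version, scala_version="2.11", environ=os.environ):
    """Store the submit arguments in the given environment mapping and return them."""
    environ["PYSPARK_SUBMIT_ARGS"] = kafka_submit_args(spark_version, scala_version)
    return environ["PYSPARK_SUBMIT_ARGS"]
```

Calling `configure_environment("2.4.3")` at the top of the script, before any Spark context is created, keeps the package version and the Spark version in one place.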

How to connect to a secured Kafka cluster from Zeppelin ("Failed to construct kafka consumer")?

I am trying to read some data from a Kafka broker using structured streaming to display it in a Zeppelin note. I am using Spark 2.4.3, Scala 2.11, Python 2.7, Java 9 and Kafka 2.2 with SSL enabled hosted on Heroku, but get the StreamingQueryException: 'Failed to construct kafka consumer'.
I am using the following dependencies (set in the Spark interpreter settings):
org.apache.spark:spark-streaming-kafka-0-10_2.11:2.4.3
org.apache.spark:spark-streaming_2.11:2.4.3
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3
I have tried older and newer versions, but these should match Spark/Scala versions I am using.
I have successfully written and read from Kafka using simple Python producer and consumer.
The code I am using:
%pyspark
from pyspark.sql.functions import from_json
from pyspark.sql.types import *
from pyspark.sql.functions import col, expr, when
schema = StructType().add("power", IntegerType()).add("colorR", IntegerType()).add("colorG",IntegerType()).add("colorB",IntegerType()).add("colorW",IntegerType())
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", brokers) \
.option("kafka.security.protocol", "SSL") \
.option("kafka.ssl.truststore.location", "/home/ubuntu/kafka/truststore.jks") \
.option("kafka.ssl.keystore.location", "/home/ubuntu/kafka/keystore.jks") \
.option("kafka.ssl.keystore.password", password) \
.option("kafka.ssl.truststore.password", password) \
.option("kafka.ssl.endpoint.identification.algorithm", "") \
.option("startingOffsets", "earliest") \
.option("subscribe", topic) \
.load()
schema = ArrayType(
StructType([StructField("power", IntegerType()),
StructField("colorR", IntegerType()),
StructField("colorG", IntegerType()),
StructField("colorB", IntegerType()),
StructField("colorW", IntegerType())]))
readDF = df.select( \
col("key").cast("string"),
from_json(col("value").cast("string"), schema))
query = readDF.writeStream.format("console").start()
query.awaitTermination()
And the error I get:
Fail to execute line 43: query.awaitTermination()
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-2171412221151055324.py", line 380, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 43, in <module>
File "/home/ubuntu/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 103, in awaitTermination
return self._jsq.awaitTermination()
File "/home/ubuntu/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/ubuntu/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 75, in deco
raise StreamingQueryException(s.split(': ', 1)[1], stackTrace)
StreamingQueryException: u'Failed to construct kafka consumer\n=== Streaming Query ===\nIdentifier: [id = 2ee20c47-8293-469a-bc0b-ef71a1f118bc, runId = 72422290-090a-4b6d-bd66-088a5a534240]\nCurrent Committed Offsets: {}\nCurrent Available Offsets: {}\n\nCurrent State: ACTIVE\nThread State: RUNNABLE\n\nLogical Plan:\nProject [cast(key#7 as string) AS key#22, jsontostructs(ArrayType(StructType(StructField(power,IntegerType,true), StructField(colorR,IntegerType,true), StructField(colorG,IntegerType,true), StructField(colorB,IntegerType,true), StructField(colorW,IntegerType,true)),true), cast(value#8 as string), Some(Etc/UTC)) AS jsontostructs(CAST(value AS STRING))#21]\n+- StreamingExecutionRelation KafkaV2[Subscribe[tanana-44614.lightbulb]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]\n'
When I use read and write instead of readStream and writeStream I do not get any errors, but nothing appears on the console when I send some data to Kafka.
What else should I try?
It looks like the Kafka consumer cannot access ~/kafka/truststore.jks, hence the exception. Replace ~ with the fully specified path (without the tilde) and the issue should go away.
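The JVM does not expand `~` the way a shell does, so one option is to resolve the path in Python before passing it to the Kafka options. The helper `resolve_store_path` below is a hypothetical name. It is a minimal sketch that assumes the key store files live on the machine running this code.

```python
import os


def resolve_store_path(path):
    """Expand a leading ~ and return an absolute path to a readable file."""
    full_path = os.path.abspath(os.path.expanduser(path))
    if not os.path.isfile(full_path):
        raise FileNotFoundError("Key store not found: {}".format(full_path))
    if not os.access(full_path, os.R_OK):
        raise PermissionError("Key store is not readable: {}".format(full_path))
    return full_path
```

The result could then be passed as the value of `kafka.ssl.truststore.location` and `kafka.ssl.keystore.location`, which fails early with a clear message if a file is missing.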

PySpark 2.2.0 Write DataFrame to S3 AmazonServiceException Class Not Found

I'm trying to write a Spark DataFrame to S3 with pyspark. I'm using Spark version 2.2.0.
sc = SparkContext('local', 'Test')
sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", aws_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", aws_secret)
sc._jsc.hadoopConfiguration().set("fs.s3a.multipart.uploads.enabled", "true")
spark = sql.SparkSession \
.builder \
.appName("TEST") \
.getOrCreate()
sql_context = sql.SQLContext(sc, spark)
filename = 'gerrymandering'
s3_uri = 's3a://mybucket/{}'.format(filename)
print(s3_uri)
df = sql_context.createDataFrame([('1', '4'), ('2', '5'), ('3', '6')], ["A", "B"])
df.write.parquet(s3_uri)
The traceback I get is:
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o48.save.
: java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException
I'm not sure but there seems to be a jar dependency error. I've tried multiple versions of hadoop-aws-X.jar as well as aws-java-sdk-X.jar but they all produce this same error.
As of writing this my command was:
spark-submit --jars hadoop-aws-2.9.0.jar,aws-java-sdk-1.7.4.jar test.py
Any ideas on how I can resolve this NoClassDefFoundError?
Don't try to use a hadoop-aws JAR and an AWS SDK different from the ones your Hadoop version ships with; the AWS SDK changes too much between versions. For hadoop-2.9.0 you need aws-java-sdk-bundle version 1.11.199.
See mvnrepo/hadoop-aws
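The pairing in this answer can be sketched as a lookup that produces a matching `--packages` value. The table contains only the pair stated above; any other Hadoop version would need to be looked up in the hadoop-aws dependency list. The names `SDK_FOR_HADOOP_AWS` and `s3a_packages` are hypothetical.

```python
# hadoop-aws version -> (AWS SDK artifact, SDK version).
# Only the pairing stated in the answer is listed here.
SDK_FOR_HADOOP_AWS = {
    "2.9.0": ("aws-java-sdk-bundle", "1.11.199"),
}


def s3a_packages(hadoop_version):
    """Return a comma-separated Maven coordinate list for spark-submit --packages."""
    try:
        artifact, sdk_version = SDK_FOR_HADOOP_AWS[hadoop_version]
    except KeyError:
        raise ValueError(
            "No known AWS SDK pairing for hadoop-aws {}".format(hadoop_version)
        )
    return ",".join([
        "org.apache.hadoop:hadoop-aws:{}".format(hadoop_version),
        "com.amazonaws:{}:{}".format(artifact, sdk_version),
    ])
```

The returned string could be used as `spark-submit --packages <value> test.py` instead of listing individual JAR files, so that both artifacts are resolved together.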
