Error while connecting to BigQuery in GCP using Spark

I was trying to connect to Google BigQuery from PySpark using the code below:
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("GCP")
sc = SparkContext(conf=conf)
master = "yarn"
spark = SparkSession.builder \
.master("local")\
.appName("GCP") \
.getOrCreate()
spark._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile","key.json")
df = spark.read.format('bigquery') \
.option("parentProject", "project_name") \
.option('table', 'project_name.table_name') \
.load()
df.show()
My Spark version is 2.3 and the BigQuery connector jar is spark-bigquery-latest_2.12.
My service account has the "BigQuery Job User" role at the project level, and the "BigQuery Data Viewer" and "BigQuery User" roles at the dataset level, but I still get the error below when I run the code above:
Traceback (most recent call last):
File "/home/lo815/GCP/gcp.py", line 23, in <module>
df.show()
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 350, in show
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o93.showString.
: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.PermissionDeniedException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: PERMISSION_DENIED: request failed: the user does not have 'bigquery.readsessions.create' permission for 'projects/GCP'
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:53)
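The exception message names both the missing permission and the resource it was checked on. Here is a minimal sketch for pulling the two out of such a message, for example when logging connector failures. The helper name missing_permission is made up for illustration and is not part of the connector, and the regular expression assumes the message wording shown in the traceback above.

```python
import re

# Message text copied from the traceback above.
ERROR = ("PERMISSION_DENIED: request failed: the user does not have "
         "'bigquery.readsessions.create' permission for 'projects/GCP'")


def missing_permission(message):
    """Return (permission, resource) from a PERMISSION_DENIED message, or None.

    Hypothetical helper: it only pattern-matches the message text and
    does not call any Google Cloud API.
    """
    match = re.search(r"does not have '([^']+)' permission for '([^']+)'", message)
    return match.groups() if match else None


print(missing_permission(ERROR))
```

In this case the resource is the project ('projects/GCP'), which suggests the permission is being checked at the project level rather than on the dataset.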

Related

Spark expecting HDFS location instead of Local Dir

I am trying to run Spark streaming, but I am getting the error below. Please help.
from pyspark.sql import SparkSession
if __name__ == "__main__":
    print("Application started")
    spark = SparkSession \
        .builder \
        .appName("Socker streaming demo") \
        .master("local[*]")\
        .getOrCreate()
    # The stream is returned as an unbounded table
    stream_df = spark\
        .readStream\
        .format("socket")\
        .option("host","localhost")\
        .option("port","1100")\
        .load()
    print(stream_df.isStreaming)
    stream_df.printSchema()
    write_query = stream_df \
        .writeStream\
        .format("console")\
        .start()
    # This call blocks, so the streaming application keeps running until it is stopped
    write_query.awaitTermination()
    print("Application Completed")
This is the error I get:
22/07/31 00:13:16 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: C:\Users\786000702\AppData\Local\Temp\temporary-9bfc22f8-6f1a-49e5-a3fb-3e4ac2c1de54. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
Traceback (most recent call last):
File "D:\PySparkProject\pySparkStream\socker_streaming.py", line 23, in <module>
write_query = stream_df \
File "D:\PySparkProject\venv\lib\site-packages\pyspark\sql\streaming.py", line 1202, in start
return self._sq(self._jwrite.start())
File "D:\PySparkProject\venv\lib\site-packages\py4j\java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "D:\PySparkProject\venv\lib\site-packages\pyspark\sql\utils.py", line 111, in deco
return f(*a, **kw)
File "D:\PySparkProject\venv\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o36.start.
: org.apache.hadoop.fs.InvalidPathException: Invalid path name Path part /C:/Users/786000702/AppData/Local/Temp/temporary-9bfc22f8-6f1a-49e5-a3fb-3e4ac2c1de54 from URI hdfs://0.0.0.0:19000/C:/Users/786000702/AppData/Local/Temp/temporary-9bfc22f8-6f1a-49e5-a3fb-3e4ac2c1de54 is not a valid filename.
at org.apache.hadoop.fs.AbstractFileSystem.getUriPath(AbstractFileSystem.java:427)
at org.apache.hadoop.fs.Hdfs.mkdir(Hdfs.java:366)
at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:809)
at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:805)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:812)
at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createCheckpointDirectory(CheckpointFileManager.scala:368)
at org.apache.spark.sql.execution.streaming.ResolveWriteToStream$.resolveCheckpointLocation(ResolveWriteToStream.scala:121)
at org.apache.spark.sql.execution.streaming.ResolveWriteToStream$$anonfun$apply$1.applyOrElse(ResolveWriteToStream.scala:42)
at
You can change the default filesystem that Spark uses by editing fs.defaultFS in the core-site.xml file located in either your Spark or Hadoop conf directory.
Based on the error, you seem to have set it to hdfs://0.0.0.0:19000/ rather than a file:// URI, so the local temporary checkpoint path is being resolved against HDFS.
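If editing core-site.xml is not an option, another approach that may work is to give the query an explicit file:// checkpoint location, so the path carries its own scheme instead of being resolved against fs.defaultFS. This is a minimal sketch and not a tested fix for this exact setup: the helper names and the D:\PySparkProject\checkpoints directory are made up for illustration, while checkpointLocation is the standard Structured Streaming option.

```python
from pathlib import PureWindowsPath


def local_checkpoint_uri(windows_path):
    """Turn an absolute Windows path into a file:// URI."""
    return PureWindowsPath(windows_path).as_uri()


def start_console_query(stream_df, checkpoint_uri):
    """Start the console sink with an explicit checkpoint location.

    stream_df is expected to be the streaming DataFrame from the question.
    """
    return (stream_df.writeStream
            .format("console")
            .option("checkpointLocation", checkpoint_uri)
            .start())


# Hypothetical directory, chosen only for illustration.
print(local_checkpoint_uri(r"D:\PySparkProject\checkpoints"))
# prints file:///D:/PySparkProject/checkpoints
```

In the question's script, write_query = start_console_query(stream_df, local_checkpoint_uri(...)) would replace the plain writeStream call.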

Pyspark general import problems

I successfully installed Spark and PySpark on my machine, added the path variables, etc., but I keep running into import problems.
This is the code:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.config("spark.hadoop.hive.exec.dynamic.partition", "true") \
.config("spark.hadoop.hive.exec.dynamic.partition.mode", "nonstrict") \
.enableHiveSupport() \
.getOrCreate()
And this is the error message:
"C:\...\Desktop\Clube\venv\Scripts\python.exe" "C:.../Desktop/Clube/services/ce_modelo_analise.py"
Traceback (most recent call last):
File "C:\...\Desktop\Clube\services\ce_modelo_analise.py", line 1, in <module>
from pyspark.sql import SparkSession
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\__init__.py", line 51, in <module>
from pyspark.context import SparkContext
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\context.py", line 31, in <module>
from pyspark import accumulators
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\accumulators.py", line 97, in <module>
from pyspark.serializers import read_int, PickleSerializer
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\serializers.py", line 71, in <module>
from pyspark import cloudpickle
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\cloudpickle.py", line 145, in <module>
_cell_set_template_code = _make_cell_set_template_code()
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\cloudpickle.py", line 126, in _make_cell_set_template_code
return types.CodeType(
TypeError: 'bytes' object cannot be interpreted as an integer
If I remove the import line, the problem disappears. As I said before, my path variables are set, and Spark runs correctly in cmd.
Digging deeper, I found the problem: I am using Spark 2.4, which supports Python only up to 3.7.
Since I was running Python 3.10, the import failed.
So if you run into the same kind of issue, check that your Python and Spark versions are compatible and change one of them if they are not.
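A quick way to catch this kind of mismatch before importing pyspark is a small version guard. The sketch below is an illustration only: the helper name and the table are made up, and the single entry (Spark 2.4 with Python 3.7 at most) is the one stated in this answer. Other Spark releases would have to be filled in from the Spark documentation.

```python
import sys

# Highest supported Python (major, minor) per Spark release line.
# Only the 2.4 entry comes from the answer above; the table is an assumption
# to be extended from the Spark docs.
MAX_PYTHON = {"2.4": (3, 7)}


def python_supported(spark_version, python_version=None):
    """Return True if python_version is within the known limit for spark_version."""
    if python_version is None:
        python_version = sys.version_info[:2]
    major_minor = ".".join(spark_version.split(".")[:2])
    limit = MAX_PYTHON.get(major_minor)
    if limit is None:
        raise KeyError(f"no compatibility entry for Spark {major_minor}")
    return tuple(python_version[:2]) <= limit


print(python_supported("2.4.0", (3, 10)))  # False: the combination from the question
print(python_supported("2.4.0", (3, 7)))   # True
```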

Running Pyspark app on Hadoop cluster yields java.lang.NoClassDefFoundError

Whenever I run my app, this error shows up:
Traceback (most recent call last):
File "~/test-tung/spark_tf.py", line 69, in <module>
'spark_tf').master('yarn').getOrCreate()
File "~/main-projects/spark/spark-3.0.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/session.py", line 186, in getOrCreate
File "~/main-projects/spark/spark-3.0.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/context.py", line 371, in getOrCreate
File "~/main-projects/spark/spark-3.0.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/context.py", line 131, in __init__
File "~/main-projects/spark/spark-3.0.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/context.py", line 193, in _do_init
File "~/main-projects/spark/spark-3.0.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/context.py", line 310, in _initialize_context
File "~/main-projects/spark/spark-3.0.0-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1569, in __call__
File "~/main-projects/spark/spark-3.0.0-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: org/spark_project/guava/base/Preconditions
A part of my python app spark_tf.py:
spark = SparkSession.builder.appName(
'spark_tf').master('yarn').getOrCreate()
model = tf.keras.models.load_model('./model/kdd_binary.h5')
weights = model.get_weights()
config = model.get_config()
bc_weights = spark.sparkContext.broadcast(weights)
bc_config = spark.sparkContext.broadcast(config)
scheme = StructType().add('#timestamp', StringType()).add('#address', StringType())
stream = spark.readStream.format('kafka') \
.option('kafka.bootstrap.servers', 'my-host:9092') \
.option('subscribe', 'dltest') \
.load() \
.selectExpr("CAST(value AS STRING)") \
.select(from_json('value', scheme).alias('json'),
online_predict('value').alias('result')) \
.select(to_json(struct('result', 'json.#timestamp', 'json.#address'))
.alias('value'))
x = stream.writeStream \
.format('kafka') \
.option("kafka.bootstrap.servers", 'my-host:9092') \
.option('topic', 'dlpred') \
.option('checkpointLocation', './kafka_checkpoint') \
.start()
x.awaitTermination()
My submit line: spark-submit --deploy-mode client --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 spark_tf.py
I think it is probably caused by an improper Spark setup, but I don't know what exactly is wrong.
EDIT: I think this code actually runs on the client instead of the Hadoop cluster, but running it on the cluster yields the same error.

How to read Druid data using JDBC driver with spark?

How can I read data from Druid using Spark and the Avatica JDBC driver?
This is the Avatica JDBC documentation.
Reading data from Druid with Python and the JayDeBeApi module works, as in the code below.
$ python
import jaydebeapi
conn = jaydebeapi.connect("org.apache.calcite.avatica.remote.Driver",
"jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/",
{"user": "druid", "password":"druid"},
"/root/avatica-1.17.0.jar",
)
cur = conn.cursor()
cur.execute("SELECT * FROM INFORMATION_SCHEMA.TABLES")
cur.fetchall()
output is:
[('druid', 'druid', 'wikipedia', 'TABLE'),
('druid', 'INFORMATION_SCHEMA', 'COLUMNS', 'SYSTEM_TABLE'),
('druid', 'INFORMATION_SCHEMA', 'SCHEMATA', 'SYSTEM_TABLE'),
('druid', 'INFORMATION_SCHEMA', 'TABLES', 'SYSTEM_TABLE'),
('druid', 'sys', 'segments', 'SYSTEM_TABLE'),
('druid', 'sys', 'server_segments', 'SYSTEM_TABLE'),
('druid', 'sys', 'servers', 'SYSTEM_TABLE'),
('druid', 'sys', 'supervisors', 'SYSTEM_TABLE'),
('druid', 'sys', 'tasks', 'SYSTEM_TABLE')] -> default tables
But I want to read the data using Spark and JDBC.
I tried, but Spark fails, as shown in the code below.
$ pyspark --jars /root/avatica-1.17.0.jar
df = spark.read.format('jdbc') \
.option('url', 'jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/') \
.option("dbtable", 'INFORMATION_SCHEMA.TABLES') \
.option('user', 'druid') \
.option('password', 'druid') \
.option('driver', 'org.apache.calcite.avatica.remote.Driver') \
.load()
output is:
Traceback (most recent call last):
File "<stdin>", line 8, in <module>
File "/root/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 172, in load
return self._df(self._jreader.load())
File "/root/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/root/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/root/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o2999.load.
: java.sql.SQLException: While closing connection
...
Caused by: java.lang.RuntimeException: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "rpcMetadata" (class org.apache.calcite.avatica.remote.Service$CloseConnectionResponse), not marked as ignorable (0 known properties: ])
at [Source: {"response":"closeConnection","rpcMetadata":{"response":"rpcMetadata","serverAddress":"172.18.0.7:8082"}}
; line: 1, column: 46]
...
Caused by: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "rpcMetadata" (class org.apache.calcite.avatica.remote.Service$CloseConnectionResponse), not marked as ignorable (0 known properties: ])
at [Source: {"response":"closeConnection","rpcMetadata":{"response":"rpcMetadata","serverAddress":"172.18.0.7:8082"}}
; line: 1, column: 46]
...
Notes:
I downloaded the Avatica jar file (avatica-1.17.0.jar) from the Maven repository.
I installed the Druid server using docker-compose with the default settings.
I found another way to solve this problem: I used spark-druid-connector to connect Druid with Spark.
However, I changed some of its code to make it work in my environment.
This is my environment:
spark: 2.4.4
scala: 2.11.12
python: python 3.6.8
druid:
zookeeper: 3.5
druid: 0.17.0
However, it has a problem.
Once you have used spark-druid-connector at least once, all subsequent SQL queries, such as spark.sql("select * from temp_view"), go through the connector's planner.
If you use the DataFrame API instead, like df.distinct().count(), there is no problem. I have not solved this yet.
I tried with spark-shell:
./bin/spark-shell --driver-class-path avatica-1.17.0.jar --jars avatica-1.17.0.jar
val jdbcDF = spark.read.format("jdbc")
.option("url", "jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/")
.option("dbtable", "INFORMATION_SCHEMA.TABLES")
.option("user", "druid")
.option("password", "druid")
.load()

How to fix "No FileSystem for scheme: gs" in pyspark?

I am trying to read a json file from a google bucket into a pyspark dataframe on a local spark machine. Here's the code:
import pandas as pd
import numpy as np
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, SQLContext
conf = SparkConf().setAll([('spark.executor.memory', '16g'),
('spark.executor.cores','4'),
('spark.cores.max','4')]).setMaster('local[*]')
spark = (SparkSession.
builder.
config(conf=conf).
getOrCreate())
sc = spark.sparkContext
import glob
import bz2
import json
import pickle
from google.cloud import storage  # needed for storage.Client below
bucket_path = "gs://<SOME_PATH>/"
client = storage.Client(project='<SOME_PROJECT>')
bucket = client.get_bucket('<SOME_PATH>')
blobs = bucket.list_blobs()
theframes = []
for blob in blobs:
    print(blob.name)
    testspark = spark.read.json(bucket_path + blob.name).cache()
    theframes.append(testspark)
It's reading files from the bucket fine (I can see the print out from blob.name), but then crashes like this:
Traceback (most recent call last):
File "test_code.py", line 66, in <module>
testspark = spark.read.json(bucket_path + blob.name).cache()
File "/home/anaconda3/envs/py37base/lib/python3.6/site-packages/pyspark/sql/readwriter.py", line 274, in json
return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/home/anaconda3/envs/py37base/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/anaconda3/envs/py37base/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/anaconda3/envs/py37base/lib/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o51.json.
: java.io.IOException: No FileSystem for scheme: gs
I've seen this type of error discussed on stackoverflow, but most solutions seem to be in Scala while I have pyspark, and/or involve messing with core-site.xml, which I've done to no effect.
I am using spark 2.4.1 and python 3.6.7.
Help would be much appreciated!
Some config parameters are required for Spark to recognize "gs" as a distributed filesystem.
Use this setting for the Google Cloud Storage connector, gcs-connector-hadoop2-latest.jar:
spark = SparkSession \
.builder \
.config("spark.jars", "/path/to/gcs-connector-hadoop2-latest.jar") \
.getOrCreate()
Other configs that can be set from PySpark:
spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
# Required if you are using a service account; set it to 'true'
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "/path/to/keyfile")
# The following are required if you are using OAuth
spark._jsc.hadoopConfiguration().set('fs.gs.auth.client.id', 'YOUR_OAUTH_CLIENT_ID')
spark._jsc.hadoopConfiguration().set('fs.gs.auth.client.secret', 'OAUTH_SECRET')
Alternatively you can set up these configs in core-site.xml or spark-defaults.conf.
Hadoop Configuration on Command Line
You can also use spark.hadoop-prefixed configuration properties to set things up when launching pyspark (or spark-submit in general), e.g.
--conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
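When several Hadoop settings are involved, the --conf arguments can be generated instead of typed by hand. This is a minimal sketch: hadoop_conf_args is a hypothetical helper, the property names are the ones already listed in this answer, and /path/to/keyfile is a placeholder.

```python
def hadoop_conf_args(settings):
    """Turn Hadoop settings into spark-submit style --conf arguments.

    Each key gets the spark.hadoop. prefix so Spark copies it into the
    Hadoop configuration.
    """
    args = []
    for key, value in settings.items():
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


# Property names taken from the answer above; the keyfile path is a placeholder.
settings = {
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "fs.gs.auth.service.account.enable": "true",
    "google.cloud.auth.service.account.json.keyfile": "/path/to/keyfile",
}

print(" ".join(["pyspark"] + hadoop_conf_args(settings)))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run together with the pyspark or spark-submit executable.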
