How to fix "No FileSystem for scheme: gs" in pyspark? - apache-spark

I am trying to read a json file from a google bucket into a pyspark dataframe on a local spark machine. Here's the code:
import pandas as pd
import numpy as np
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, SQLContext
conf = SparkConf().setAll([('spark.executor.memory', '16g'),
('spark.executor.cores','4'),
('spark.cores.max','4')]).setMaster('local[*]')
spark = (SparkSession.
builder.
config(conf=conf).
getOrCreate())
sc = spark.sparkContext
import glob
import bz2
import json
import pickle
bucket_path = "gs://<SOME_PATH>/"
client = storage.Client(project='<SOME_PROJECT>')
bucket = client.get_bucket ('<SOME_PATH>')
blobs = bucket.list_blobs()
theframes = []
for blob in blobs:
print(blob.name)
testspark = spark.read.json(bucket_path + blob.name).cache()
theframes.append(testspark)
It's reading files from the bucket fine (I can see the print out from blob.name), but then crashes like this:
Traceback (most recent call last):
File "test_code.py", line 66, in <module>
testspark = spark.read.json(bucket_path + blob.name).cache()
File "/home/anaconda3/envs/py37base/lib/python3.6/site-packages/pyspark/sql/readwriter.py", line 274, in json
return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/home/anaconda3/envs/py37base/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/anaconda3/envs/py37base/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/anaconda3/envs/py37base/lib/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o51.json.
: java.io.IOException: No FileSystem for scheme: gs
I've seen this type of error discussed on stackoverflow, but most solutions seem to be in Scala while I have pyspark, and/or involve messing with core-site.xml, which I've done to no effect.
I am using spark 2.4.1 and python 3.6.7.
Help would be much appreciated!

Some config params are required to recognize "gs" as a distributed filesystem.
Use this setting for google cloud storage connector, gcs-connector-hadoop2-latest.jar
spark = SparkSession \
.builder \
.config("spark.jars", "/path/to/gcs-connector-hadoop2-latest.jar") \
.getOrCreate()
Other configs that can be set from pyspark
spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
# This is required if you are using service account and set true,
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "/path/to/keyfile")
# Following are required if you are using oAuth
spark._jsc.hadoopConfiguration().set('fs.gs.auth.client.id', 'YOUR_OAUTH_CLIENT_ID')
spark._jsc.hadoopConfiguration().set('fs.gs.auth.client.secret', 'OAUTH_SECRET')
Alternatively you can set up these configs in core-site.xml or spark-defaults.conf.
Hadoop Configuration on Command Line
You can also use spark.hadoop-prefixed configuration properties to set things up when pyspark (or spark-submit in general), e.g.
--conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

Related

Spark expecting HDFS location instead of Local Dir

I am trying to run spark streaming, but getting this issue. Please help
from pyspark.sql import SparkSession
if __name__ == "__main__":
print("Application started")
spark = SparkSession \
.builder \
.appName("Socker streaming demo") \
.master("local[*]")\
.getOrCreate()
# Steam will return unbounded table
stream_df = spark\
.readStream\
.format("socket")\
.option("host","localhost")\
.option("port","1100")\
.load()
print(stream_df.isStreaming)
stream_df.printSchema()
write_query = stream_df \
.writeStream\
.format("console")\
.start()
# this line of code will turn to streaming application into never ending
write_query.awaitTermination()
print("Application Completed")
Error is getting
22/07/31 00:13:16 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: C:\Users\786000702\AppData\Local\Temp\temporary-9bfc22f8-6f1a-49e5-a3fb-3e4ac2c1de54. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
Traceback (most recent call last):
File "D:\PySparkProject\pySparkStream\socker_streaming.py", line 23, in <module>
write_query = stream_df \
File "D:\PySparkProject\venv\lib\site-packages\pyspark\sql\streaming.py", line 1202, in start
return self._sq(self._jwrite.start())
File "D:\PySparkProject\venv\lib\site-packages\py4j\java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "D:\PySparkProject\venv\lib\site-packages\pyspark\sql\utils.py", line 111, in deco
return f(*a, **kw)
File "D:\PySparkProject\venv\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o36.start.
**: org.apache.hadoop.fs.InvalidPathException: Invalid path name Path part /C:/Users/786000702/AppData/Local/Temp/temporary-9bfc22f8-6f1a-49e5-a3fb-3e4ac2c1de54 from URI hdfs://0.0.0.0:19000/C:/Users/786000702/AppData/Local/Temp/temporary-9bfc22f8-6f1a-49e5-a3fb-3e4ac2c1de54 is not a valid filename.
at org.apache.hadoop.fs.AbstractFileSystem.getUriPath(AbstractFileSystem.java:427)
at org.apache.hadoop.fs.Hdfs.mkdir(Hdfs.java:366)
at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:809)
at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:805)
at**
org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:812)
at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createCheckpointDirectory(CheckpointFileManager.scala:368)
at org.apache.spark.sql.execution.streaming.ResolveWriteToStream$.resolveCheckpointLocation(ResolveWriteToStream.scala:121)
at org.apache.spark.sql.execution.streaming.ResolveWriteToStream$$anonfun$apply$1.applyOrElse(ResolveWriteToStream.scala:42)
at
You can modify the FS path that Spark defaults by editing fs.defaultFS in core-site.xml file located either in your Spark or Hadoop conf directorie
You seem to have set that at hdfs://0.0.0.0:19000/ rather than some file:// URI path, based on the error

Error while connecting big query in GCP using Spark

I was trying to connect Google big query using pySpark using the below code :
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("GCP")
sc = SparkContext(conf=conf)
master = "yarn"
spark = SparkSession.builder \
.master("local")\
.appName("GCP") \
.getOrCreate()
spark._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile","key.json")
df = spark.read.format('bigquery') \
.option("parentProject", "project_name") \
.option('table', 'project_name.table_name') \
.load()
df.show()
my spark version 2.3 and big query jar : spark-bigquery-latest_2.12
Though my service account was having "BigQuery Job User" permission at project level and bigquery data viewer and bigquery user at dataset level , but still I am getting the below error when trying to execute the above code
Traceback (most recent call last):
File "/home/lo815/GCP/gcp.py", line 23, in <module>
df.show()
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 350, in show
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o93.showString.
: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.PermissionDeniedException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: PERMISSION_DENIED: request failed: the user does not have 'bigquery.readsessions.create' permission for 'projects/GCP'
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:53)

Pyspark general import problems

I succesfully instaled Spark and Pyspark in my machine, added path variables, etc. but keeps facing import problems.
This is the code:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.config("spark.hadoop.hive.exec.dynamic.partition", "true") \
.config("spark.hadoop.hive.exec.dynamic.partition.mode", "nonstrict") \
.enableHiveSupport() \
.getOrCreate()
And this is the error message:
"C:\...\Desktop\Clube\venv\Scripts\python.exe" "C:.../Desktop/Clube/services/ce_modelo_analise.py"
Traceback (most recent call last):
File "C:\...\Desktop\Clube\services\ce_modelo_analise.py", line 1, in <module>
from pyspark.sql import SparkSession
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\__init__.py", line 51, in <module>
from pyspark.context import SparkContext
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\context.py", line 31, in <module>
from pyspark import accumulators
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\accumulators.py", line 97, in <module>
from pyspark.serializers import read_int, PickleSerializer
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\serializers.py", line 71, in <module>
from pyspark import cloudpickle
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\cloudpickle.py", line 145, in <module>
_cell_set_template_code = _make_cell_set_template_code()
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\cloudpickle.py", line 126, in _make_cell_set_template_code
return types.CodeType(
TypeError: 'bytes' object cannot be interpreted as an integer
If I remove the import line, those problems disappear. As I said before, my path variables are set:
and
Also, Spark is running correctly in cmd:
Going deeper I found the problem: I'm using Spark in version 2.4, which works with Python 3.7 tops.
As I was using Python 3.10, the problem was happening.
So if you're experiencing the same kind of issue, try to change your versions.

Azure databricks: KafkaUtils createDirectStream causes Py4JNetworkError("Answer from Java side is empty") error

In Azure databricks, I tried to create a kafka stream in notebook and used it to create a spark
job. Databricks throw error at the line KafkaUtils.createDirectStream(). Attached the correponding code below.
from kazoo.client import KazooClient
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition
sc = spark.sparkContext
ssc = StreamingContext(sc, 30)
print('SSC created:: {}'.format(ssc))
zk = KazooClient(hosts=kafka_host)
print(kafka_host)
zk.start()
_offset_directory = "/" + topic + "/" + "DA_DAINT" + "/partitions"
print(_offset_directory)
if zk.exists(_offset_directory):
partitions = zk.get_children(_offset_directory)
print(partitions)
partition_offsets_dict = {}
for partition in partitions:
offset, stat = zk.get((_offset_directory + '/' + partition))
partition_offsets_dict[partition] = offset.decode()
print(partition_offsets_dict)
from_offset = {}
for _partition in partitions:
offset = partition_offsets_dict[_partition]
topic_partition = TopicAndPartition(topic, int(_partition))
from_offset[topic_partition] = int(offset)
print(from_offset)
print("\nCreate kafka direct stream ...")
kafka_stream = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": broker_list},
fromOffsets=from_offset)
Attaching the error stack traces.
Traceback (most recent call last):
File "/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
response = connection.send_command(command)
File "/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
An error occurred while calling
o581.createTopicAndPartition Traceback (most recent call last):
File "<command-3832551107104577>", line 77, in <module> fromOffsets=from_offset)
File "/databricks/spark/python/pyspark/streaming/kafka.py", line 141, in createDirectStream v) for (k, v) in fromOffsets.items()])
File "/databricks/spark/python/pyspark/streaming/kafka.py", line 141, in <listcomp> v) for (k, v) in fromOffsets.items()])
File "/databricks/spark/python/pyspark/streaming/kafka.py",
line 314, in _jTopicAndPartition return helper.createTopicAndPartition(self._topic, self._partition)
File "/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
line 1257, in __call__ answer, self.gateway_client, self.target_id, self.name)
File "/databricks/spark/python/pyspark/sql/utils.py",
line 63, in deco return f(*a, **kw)
File "/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o581.createTopicAndPartition
In Azure databricks, when using Kafka stream in python notebook, I have installed kafka-python and org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.1 libraries and added them as a dependencies to the spark-job in databricks.
Note 1:
Also i am able to receive data from Kafka when i use simple kafka consumer in databricks notebook.
from kafka import KafkaConsumer
if __name__ == "__main__":
consumer_ = KafkaConsumer(group_id='test', bootstrap_servers=['my_kafka_server:9092'])
print(consumer_.topics())
consumer_.subscribe(topics=['dev_test'])
for m in consumer_:
print(m)
The problem arises only, if i try to create Kafka direct stream using KafkaUtils.createDirectStream() in azure databricks python notebook.
Another minimal set of code for reproducing this issue,
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext
broker = "broker:9092"
topic = "dev_topic"
sc = spark.sparkContext
ssc = StreamingContext(sc, 30)
dks = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": broker})
print("Direct stream created...")
parsed = dks.map(lambda v: v[1])
summary_dstream = parsed.count().map(lambda x: 'Words in this batch: %s' % x)
print(summary_dstream)
NOTE 2:
Kafka version: 0.10
Scala version: 2.11
Spark version: 2.4.3
Still i am unable to get the root cause of the issue.
But using the jar org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.4.3 fixed the issue.
UPDATE 1:
Got the following update from microsoft support team:
Below is the update from databricks engineering.
We see the customer
is using DStreams API
(https://learn.microsoft.com/en-us/azure/databricks/spark/latest/rdd-streaming/)
which is outdated and we don't support it anymore. Also, we strongly recommend them switch
to Structured Streaming, you can follow this doc for doing it -
https://learn.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/kafka

PySpark: ModuleNotFoundError: No module named 'app'

I am saving a dataframe to a CSV file in PySpark using below statement:
df_all.repartition(1).write.csv("xyz.csv", header=True, mode='overwrite')
But i am getting below error
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 218, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 138, in read_udfs
arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 118, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 58, in read_command
command = serializer._read_with_length(file)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 170, in _read_with_length
return self.loads(obj)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 559, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'app'
i am using PySpark version 2.3.0
I am getting this error while trying to write to a file.
import json, jsonschema
from pyspark.sql import functions
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType, FloatType
from datetime import datetime
import os
feb = self.filter_data(self.SRC_DIR + "tl_feb19.csv", 13)
apr = self.filter_data(self.SRC_DIR + "tl_apr19.csv", 15)
df_all = feb.union(apr)
df_all = df_all.dropDuplicates(subset=["PRIMARY_ID"])
create_emi_amount_udf = udf(create_emi_amount, FloatType())
df_all = df_all.withColumn("EMI_Amount", create_emi_amount_udf('Sanction_Amount', 'Loan_Type'))
df_all.write.csv(self.DST_DIR + "merged_amounts.csv", header=True, mode='overwrite')
The error is very clear, there is not the module 'app'. Your Python code runs on driver, but you udf runs on executor PVM. When you call the udf, spark serializes the create_emi_amount to sent it to the executors.
So, somewhere in your method create_emi_amount you use or import the app module.
A solution to your problem is to use the same environment in both driver and executors. In spark-env.sh set the save Python virtualenv in PYSPARK_DRIVER_PYTHON=... and PYSPARK_PYTHON=....
Thanks to ggeop! He helped me out. ggeop has explained the problem. But the solution may not be correct if the 'app' is his own package.
My solution is to add the file in sparkcontext:
sc = SparkContext()
sc.addPyFile("app.zip")
But you have to zip app package first, and you have to make sure the zipped packaged get app directory.
i.e. if your app is at:/home/workplace/app
then you have to do the zip under workplace, which will zip all directories under workplace including app.
The other way is to send the file in spark-submit, as below:
--py-files app.zip
--py-files myapp.py

Resources