I am trying to write DataFrames to HBase using PySpark.
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
sqlc = SQLContext(sc)
data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'
df = sc.parallelize([('a', '1.0'), ('b', '2.0')]).toDF(schema=['col0', 'col1'])
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"testtable"},
    "rowkey":"key",
    "columns":{
        "col0":{"cf":"rowkey", "col":"key", "type":"string"},
        "col1":{"cf":"cf", "col":"col1", "type":"string"}
    }
}""".split())
df.write.options(catalog=catalog).format(data_source_format).save()
I am executing the command in the following format:
sudo spark-submit --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/ --files /home/chenxx/hbase/conf/hbase-site.xml sogou4.py
Spark version: 2.3.0, Hadoop version: 2.7.6, HBase version: 1.1.5, Scala version: 2.11.6
ERROR:
Traceback (most recent call last):
File "/home/chenxx/PycharmProjects/SoGou/sogou4.py", line 52, in <module>
df.write.options(catalog = catalog1).format(data_source_format).save()
File "/usr/local/lib/python2.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 701, in save
File "/usr/local/lib/python2.7/dist-packages/pyspark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/usr/local/lib/python2.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/lib/python2.7/dist-packages/pyspark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o61.save.
: com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input: expected close marker for OBJECT (from [Source: {"table":{"namespace":"default","name":"testtable"},"rowkey":"key","columns":{"col0":{"cf":"rowkey","col":"key","type":"string"},"col1":{"cf":"cf","col":"col1","type":"string"}}; line: 1, column: 0])
at [Source: {"table":{"namespace":"default","name":"testtable"},"rowkey":"key","columns":{"col0":{"cf":"rowkey","col":"key","type":"string"},"col1":{"cf":"cf","col":"col1","type":"string"}}; line: 1, column: 355]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1581)
Does anyone know what the problem might be? I would appreciate any suggestions. Thanks!
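One thing worth checking: the parser complains about an unterminated object, and the traceback references a catalog1 variable rather than the catalog shown above, so the string that actually reaches the writer may be missing its closing brace. A minimal sketch that builds the catalog with json.dumps instead of a hand-written string, so the JSON cannot end up malformed (table and column names taken from the question):
import json

# Build the SHC catalog as a Python dict and serialize it,
# so a missing brace cannot slip into the JSON.
catalog = json.dumps({
    "table": {"namespace": "default", "name": "testtable"},
    "rowkey": "key",
    "columns": {
        "col0": {"cf": "rowkey", "col": "key", "type": "string"},
        "col1": {"cf": "cf", "col": "col1", "type": "string"}
    }
})

df.write.options(catalog=catalog).format(data_source_format).save()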
Related
I was trying to connect to Google BigQuery using PySpark with the code below:
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("GCP")
sc = SparkContext(conf=conf)
master = "yarn"
spark = SparkSession.builder \
    .master("local") \
    .appName("GCP") \
    .getOrCreate()
spark._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile", "key.json")
df = spark.read.format('bigquery') \
    .option("parentProject", "project_name") \
    .option('table', 'project_name.table_name') \
    .load()
df.show()
My Spark version is 2.3 and the BigQuery jar is spark-bigquery-latest_2.12.
Although my service account has the "BigQuery Job User" role at the project level and the "BigQuery Data Viewer" and "BigQuery User" roles at the dataset level, I am still getting the error below when executing the above code:
Traceback (most recent call last):
File "/home/lo815/GCP/gcp.py", line 23, in <module>
df.show()
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 350, in show
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o93.showString.
: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.PermissionDeniedException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: PERMISSION_DENIED: request failed: the user does not have 'bigquery.readsessions.create' permission for 'projects/GCP'
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:53)
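For what it's worth, bigquery.readsessions.create belongs to the BigQuery Storage Read API, which the spark-bigquery connector uses for reads; it is not included in the roles listed above, so the usual fix is to also grant the service account the "BigQuery Read Session User" role on the parent project. Below is a hedged sketch of the read itself, passing the key file through a connector option instead of the Hadoop configuration (the project, dataset, and table names are placeholders):
# Sketch only: credentialsFile points at the service-account key used above.
df = spark.read.format("bigquery") \
    .option("credentialsFile", "key.json") \
    .option("parentProject", "project_name") \
    .option("table", "project_name.dataset_name.table_name") \
    .load()
df.show()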
I successfully installed Spark and PySpark on my machine, added the path variables, etc., but I keep running into import problems.
This is the code:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config("spark.hadoop.hive.exec.dynamic.partition", "true") \
    .config("spark.hadoop.hive.exec.dynamic.partition.mode", "nonstrict") \
    .enableHiveSupport() \
    .getOrCreate()
And this is the error message:
"C:\...\Desktop\Clube\venv\Scripts\python.exe" "C:.../Desktop/Clube/services/ce_modelo_analise.py"
Traceback (most recent call last):
File "C:\...\Desktop\Clube\services\ce_modelo_analise.py", line 1, in <module>
from pyspark.sql import SparkSession
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\__init__.py", line 51, in <module>
from pyspark.context import SparkContext
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\context.py", line 31, in <module>
from pyspark import accumulators
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\accumulators.py", line 97, in <module>
from pyspark.serializers import read_int, PickleSerializer
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\serializers.py", line 71, in <module>
from pyspark import cloudpickle
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\cloudpickle.py", line 145, in <module>
_cell_set_template_code = _make_cell_set_template_code()
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\cloudpickle.py", line 126, in _make_cell_set_template_code
return types.CodeType(
TypeError: 'bytes' object cannot be interpreted as an integer
If I remove the import line, those problems disappear. As I said before, my path variables are set, and Spark runs correctly in cmd.
Going deeper, I found the problem: I am using Spark 2.4, which supports Python 3.7 at most.
Since I was running Python 3.10, the import was failing.
So if you are hitting the same kind of issue, check and align your versions.
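If you have to stay on Spark 2.4, a small guard like the one below makes the incompatibility explicit before the cryptic cloudpickle TypeError appears; it only encodes the version limit stated above:
import sys

# Spark 2.4 ships a cloudpickle that breaks on Python 3.8+,
# so fail fast with a readable message instead of the TypeError above.
if sys.version_info >= (3, 8):
    raise RuntimeError(
        "Spark 2.4 requires Python 3.7 or lower; found Python %d.%d"
        % sys.version_info[:2]
    )

from pyspark.sql import SparkSession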
In Azure Databricks, I tried to create a Kafka stream in a notebook and used it to create a Spark job. Databricks throws an error at the line KafkaUtils.createDirectStream(). The corresponding code is attached below.
from kazoo.client import KazooClient
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition
sc = spark.sparkContext
ssc = StreamingContext(sc, 30)
print('SSC created:: {}'.format(ssc))
zk = KazooClient(hosts=kafka_host)
print(kafka_host)
zk.start()
_offset_directory = "/" + topic + "/" + "DA_DAINT" + "/partitions"
print(_offset_directory)
if zk.exists(_offset_directory):
    partitions = zk.get_children(_offset_directory)
    print(partitions)
    partition_offsets_dict = {}
    for partition in partitions:
        offset, stat = zk.get((_offset_directory + '/' + partition))
        partition_offsets_dict[partition] = offset.decode()
    print(partition_offsets_dict)
    from_offset = {}
    for _partition in partitions:
        offset = partition_offsets_dict[_partition]
        topic_partition = TopicAndPartition(topic, int(_partition))
        from_offset[topic_partition] = int(offset)
    print(from_offset)
    print("\nCreate kafka direct stream ...")
    kafka_stream = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": broker_list},
                                                 fromOffsets=from_offset)
Attaching the error stack traces.
Traceback (most recent call last):
File "/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
response = connection.send_command(command)
File "/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
An error occurred while calling o581.createTopicAndPartition
Traceback (most recent call last):
File "<command-3832551107104577>", line 77, in <module>
    fromOffsets=from_offset)
File "/databricks/spark/python/pyspark/streaming/kafka.py", line 141, in createDirectStream
    v) for (k, v) in fromOffsets.items()])
File "/databricks/spark/python/pyspark/streaming/kafka.py", line 141, in <listcomp>
    v) for (k, v) in fromOffsets.items()])
File "/databricks/spark/python/pyspark/streaming/kafka.py", line 314, in _jTopicAndPartition
    return helper.createTopicAndPartition(self._topic, self._partition)
File "/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
File "/databricks/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
File "/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
    format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o581.createTopicAndPartition
In Azure Databricks, to use the Kafka stream in the Python notebook, I have installed the kafka-python and org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.1 libraries and added them as dependencies of the Spark job in Databricks.
Note 1:
I am also able to receive data from Kafka when I use a simple Kafka consumer in the Databricks notebook.
from kafka import KafkaConsumer
if __name__ == "__main__":
    consumer_ = KafkaConsumer(group_id='test', bootstrap_servers=['my_kafka_server:9092'])
    print(consumer_.topics())
    consumer_.subscribe(topics=['dev_test'])
    for m in consumer_:
        print(m)
The problem arises only if I try to create a Kafka direct stream using KafkaUtils.createDirectStream() in the Azure Databricks Python notebook.
Another minimal set of code for reproducing this issue:
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext
broker = "broker:9092"
topic = "dev_topic"
sc = spark.sparkContext
ssc = StreamingContext(sc, 30)
dks = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": broker})
print("Direct stream created...")
parsed = dks.map(lambda v: v[1])
summary_dstream = parsed.count().map(lambda x: 'Words in this batch: %s' % x)
print(summary_dstream)
Note 2:
Kafka version: 0.10
Scala version: 2.11
Spark version: 2.4.3
I am still unable to find the root cause of the issue, but using the jar org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.4.3 fixed it.
UPDATE 1:
Got the following update from the Microsoft support team:
Below is the update from Databricks engineering. We see the customer is using the DStreams API (https://learn.microsoft.com/en-us/azure/databricks/spark/latest/rdd-streaming/), which is outdated and no longer supported. We strongly recommend that they switch to Structured Streaming; you can follow this doc for doing it: https://learn.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/kafka
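For reference, a minimal sketch of the Structured Streaming equivalent the support team points to; the broker, topic, and starting offsets are placeholders based on the variables used in the minimal repro above:
# Structured Streaming replacement for KafkaUtils.createDirectStream.
# startingOffsets accepts "earliest", "latest", or a per-partition JSON string.
stream_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", broker)
             .option("subscribe", topic)
             .option("startingOffsets", "latest")
             .load())

# The value column arrives as bytes, so cast it, much like parsed = dks.map(lambda v: v[1]) did.
parsed = stream_df.selectExpr("CAST(value AS STRING) AS value")

query = (parsed.writeStream
         .format("console")
         .outputMode("append")
         .start())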
I am new to using Spark, and I am trying to run this code in PySpark:
from pyspark import SparkConf, SparkContext
import collections
conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
sc = SparkContext(conf = conf)
but it gives me this error message:
Using Python version 3.5.2 (default, Jul 5 2016 11:41:13)
SparkSession available as 'spark'.
>>> from pyspark import SparkConf, SparkContext
>>> import collections
>>> conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
>>> sc = SparkContext(conf = conf)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\spark\python\pyspark\context.py", line 115, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "C:\spark\python\pyspark\context.py", line 275, in _ensure_initialized
callsite.function, callsite.file, callsite.linenum))
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=PySparkShell, master=local[*]) created by getOrCreate at C:\spark\bin\..\python\pyspark\shell.py:43
>>>
I have Spark 2.1.1 and Python 3.5.2. I searched and found that the problem is with sc, which could not be created, but I have not found out why. Can anyone help here?
You can try this:
sc = SparkContext.getOrCreate()
You can try:
sc = SparkContext.getOrCreate(conf=conf)
Your previous session is still active. You can run:
sc.stop()
It can also run through JupyterLab, but since your previous session is still running and local mode cannot run two contexts at a time, you have to use:
sc = SparkContext.getOrCreate(conf=conf)
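In short, inside the PySpark shell (or a notebook) a context already exists, so either reuse it or stop it before building your own. A minimal sketch combining the suggestions above:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")

# Reuse the shell's existing context if there is one, otherwise create a new one.
sc = SparkContext.getOrCreate(conf=conf)

# Alternatively, stop the existing context first and start a fresh one:
# sc.stop()
# sc = SparkContext(conf=conf)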
This simple PySpark snippet runs fine with normal spark-submit but fails with Apache Zeppelin on the cast call. Any ideas?
%pyspark
import pyspark.sql.functions as spark_functions
from pyspark.sql.types import StringType
col1 = spark_functions.lit(None)
print("type(col1)={}".format(type(col1)))
col2 = col1.cast(StringType())
The error is:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6046223946582899049.py", line 252, in <module>
eval(compiledCode)
File "<string>", line 14, in <module>
File "/usr/lib/spark/python/pyspark/sql/column.py", line 334, in cast
jdt = ctx._ssql_ctx.parseDataType(dataType.json())
AttributeError: 'JavaMember' object has no attribute 'parseDataType'
This is a known bug with Spark 2.0 on Zeppelin 0.6.1 that is targeted to be fixed in Zeppelin 0.6.2: https://issues.apache.org/jira/browse/ZEPPELIN-1411
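Until that fix is available, a possible workaround (an assumption on my part, not taken from the linked ticket) is to pass the type name as a string, since Column.cast only goes through parseDataType when it is given a DataType object:
import pyspark.sql.functions as spark_functions

col1 = spark_functions.lit(None)
col2 = col1.cast("string")  # skips the parseDataType call that fails on Zeppelin 0.6.1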