I am trying to write a DataFrame to any kind of file format. I have reinstalled Spark several times, in different ways and with different versions, but I receive the same error every time, even on another machine. I am currently using Spark 3.3.1 on Hadoop 2.7, locally on Windows 11:
data = [[1, 43, 41], [2, 43, 41], [3, 43, 4]]
x = spark.createDataFrame(data)
x.write.csv('qqq')
And I receive this error:
File "D:\venvs\spark2\spark_hw.py", line 77, in <module>
x.write.csv('qqq')
File "D:\venvs\spark2\lib\site-packages\pyspark\sql\readwriter.py", line 1240, in csv
self._jwrite.csv(path)
File "D:\venvs\spark2\lib\site-packages\py4j\java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "D:\venvs\spark2\lib\site-packages\pyspark\sql\utils.py", line 190, in deco
return f(*a, **kw)
File "D:\venvs\spark2\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o44.csv.
: org.apache.spark.SparkException: Job aborted.
x.write.format("csv").save("path/where/file/should/go")
will write your DataFrame as CSV to the path specified in the save method.
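If you also want a header row and to overwrite an existing output directory, a slightly fuller sketch using standard DataFrameWriter options is:
x.write.format("csv") \
    .option("header", "true") \
    .mode("overwrite") \
    .save("path/where/file/should/go")
Note that Spark writes a directory of part files at that path rather than a single CSV file.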
Minimal reproducible example (MRE):
Created a dataset from Fusion
Created a transformation in Code Workbook
def unnamed(src):
    src.createOrReplaceTempView('view_src')
    df = spark.sql(f"""SELECT * FROM view_src""")
    return df
Traceback (most recent call last):
File "unnamed", line 1, in <module>
File "unnamed", line 3, in unnamed
File "/opt/conda/lib/python3.7/site-packages/pyspark/sql/session.py", line 649, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/opt/conda/lib/python3.7/site-packages/pyspark/sql/utils.py", line 128, in deco
return f(*a, **kw)
File "/opt/conda/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o651.sql.
: com.palantir.foundry.spark.api.errors.DatasetPathNotFoundException: view_src
...
How can I get this table back as a DataFrame from the temp view?
I have also tried createOrReplaceGlobalTempView.
In Code Repositories the given code snippet works fine.
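For reference, the temp-view round trip the snippet is attempting looks like this in plain PySpark (a minimal sketch with a made-up source DataFrame; it illustrates the pattern only and does not address the Foundry-specific DatasetPathNotFoundException):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# hypothetical input data standing in for the Fusion-backed dataset
src = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])

# register the DataFrame under a view name, then query it back by that name
src.createOrReplaceTempView("view_src")
df = spark.sql("SELECT * FROM view_src")
df.show()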
I am a newbie trying to learn the fastai data block API.
Here is the code, which is exactly the same as in the tutorial:
from fastai.vision import *   # the tutorial's star import: untar_data, URLs, ObjectItemList, etc.

coco = untar_data(URLs.COCO_TINY)
path = coco/'train.json'
images, lbl_bbox = get_annotations(coco/'train.json')
img2bbox = dict(zip(images, lbl_bbox))
get_y_func = lambda o: img2bbox[o.name]

data = (ObjectItemList.from_folder(coco)
        .split_by_rand_pct()
        .label_from_func(get_y_func)
        .transform(get_transforms(), tfm_y=True)
        .databunch(bs=1, num_workers=0, collate_fn=bb_pad_collate))
data.show_batch(rows=2, ds_type=DatasetType.Valid, figsize=(6,6))
Then the error is:
File "D:\Anaconda3\envs\pytorch-gpu\lib\site-packages\IPython\core\interactiveshell.py", line
3326, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-7-25e60680c0ba>", line 15, in <module>
data.show_batch(rows=2, ds_type=DatasetType.Valid, figsize=(6,6))
File "D:\Anaconda3\envs\pytorch-gpu\lib\site-packages\fastai\basic_data.py", line 185, in show_batch
x,y = self.one_batch(ds_type, True, True)
File "D:\Anaconda3\envs\pytorch-gpu\lib\site-packages\fastai\basic_data.py", line 168, in one_batch
try: x,y = next(iter(dl))
File "D:\Anaconda3\envs\pytorch-gpu\lib\site-packages\fastai\basic_data.py", line 75, in __iter__
for b in self.dl: yield self.proc_batch(b)
File "D:\Anaconda3\envs\pytorch-gpu\lib\site-packages\torch\utils\data\dataloader.py", line 348,__next__
data = _utils.pin_memory.pin_memory(data)
File "D:\Anaconda3\envs\pytorch-gpu\lib\site-packages\torch\utils\data\_utils\pin_memory.py", line
55, in pin_memory
return [pin_memory(sample) for sample in data]
File "D:\Anaconda3\envs\pytorch-gpu\lib\site-packages\torch\utils\data\_utils\pin_memory.py", line
55, in <listcomp>
return [pin_memory(sample) for sample in data]
File "D:\Anaconda3\envs\pytorch-gpu\lib\site-packages\torch\utils\data\_utils\pin_memory.py", line
47, in pin_memory
return data.pin_memory()
RuntimeError: CUDA error: unknown error
Regarding this error, the forum posts I found online all suggest changing the DataLoader parameters, but the DataLoader does not seem to be used directly here.
How would I go about fixing this?
I also got this error. For me it was fixed by adding the following two lines at the top of the notebook:
import os
os.environ['CUDA_VISIBLE_DEVICES']='2'
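Note that CUDA_VISIBLE_DEVICES='2' exposes only the third GPU (indices are zero-based); on a single-GPU machine you would use '0', and the variable should be set before CUDA is first initialized, which is why it goes at the head of the notebook. If the unknown CUDA error persists, a quick sanity check with standard PyTorch calls (a minimal sketch, independent of fastai) shows whether CUDA is usable at all:
import os
# assumption: a single-GPU machine, so expose device index 0
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch

# purely diagnostic: confirms the driver/runtime pair works and how many GPUs are visible
print(torch.cuda.is_available())
print(torch.cuda.device_count())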
I'm trying to send a DStream with my computed results to an MQTT broker, but foreachRDD keeps crashing.
I'm running Spark 2.4.3 with Bahir for MQTT subscribe compiled from git master. Everything works up to this point. Before trying to publish my results with MQTT, I tried to saveAsFiles(), and that worked (but isn't exactly what I want).
def sendPartition(part):
    # code for publishing with MQTT here
    return 0

mydstream = MQTTUtils.createStream(ssc, brokerUrl, topic)

mydstream = packets.map(change_format) \
    .map(lambda mac: (mac, 1)) \
    .reduceByKey(lambda a, b: a + b)

mydstream.foreachRDD(lambda rdd: rdd.foreachPartition(sendPartition))  # line 56
The resulting error I get is this:
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/streaming/util.py", line 68, in call
r = self.func(t, *rdds)
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 161, in <lambda>
func = lambda t, rdd: old_func(rdd)
File "/path/to/my/code.py", line 56, in <lambda>
mydstream.foreachRDD(lambda rdd: rdd.foreachPartition(sendPartition))
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 806, in foreachPartition
self.mapPartitions(func).count() # Force evaluation
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 1055, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 1046, in sum
return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 917, in fold
vals = self.mapPartitions(func).collect()
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect
sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/SPARK_HOME/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/SPARK_HOME/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.IllegalArgumentException: Unsupported class file major version 55
with lots of java errors following, but I suspect the error is in my code.
Are you able to run other Spark commands? At the end of your stack trace, you see java.lang.IllegalArgumentException: Unsupported class file major version 55. This indicates that you are running Spark on an unsupported version of Java.
Spark is not yet compatible with Java 11 (due to limitations imposed by Scala, I think). Try configuring Spark to use Java 8. The specifics vary a bit based on what platform you're on: you'll probably need to install Java 8 and change the JAVA_HOME environment variable to point to the new installation, for example as in the sketch below.
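A minimal sketch of what that can look like from a PySpark script (the Java path below is a placeholder; adjust it to wherever Java 8 is installed on your machine). The key point is that JAVA_HOME must be set before the SparkContext launches its JVM, since the spark-submit process it spawns inherits the Python process's environment:
import os

# placeholder path: point this at your actual Java 8 installation
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

# create the contexts only after JAVA_HOME is set
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="mqtt-stream")
ssc = StreamingContext(sc, 10)  # 10-second batch interval
Setting JAVA_HOME in the shell (or in spark-env.sh) before running spark-submit achieves the same thing.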
I am using PySpark (Python 3.7 with Spark 2.4) and have a small line of code to collect a date from one of the attributes in a DataFrame. I can run the same code from the pyspark command line; however, in my production code it errors out.
Here is the line of code where I read a DataFrame "df" and collect the date from the field "job_id":
>>> run_dt = map( lambda r:r[0], df.filter((df['delivery_date'] == '2017-12-31')).select(max(substring(df['job_id'], 9, 10).cast("integer")).alias('last_run')).collect())[0]
>>> print(run_dt)
2017123101
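For readability, the same expression written out with explicit imports looks like this (a sketch assuming max and substring are meant to be the pyspark.sql.functions versions rather than the Python built-ins):
from pyspark.sql import functions as F

run_dt = (df
          .filter(df['delivery_date'] == '2017-12-31')
          .select(F.max(F.substring(df['job_id'], 9, 10).cast("integer")).alias('last_run'))
          .collect()[0][0])
print(run_dt)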
The same line of code gives me an error in my production code during evaluation. The error message is:
File "C:\Users\spark-2.4.2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\dataframe.py", line 533, in collect
File "C:\Users\spark-2.4.2-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
File "C:\Users\spark-2.4.2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\utils.py", line 63, in deco
File "C:\Users\spark-2.4.2-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o68.collectToPython.
Today I want to try some new features in Spark 2.0; here is my program:
#coding:utf-8
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName('test 2.0').config(conf=SparkConf()).getOrCreate()
df = spark.read.json("/Users/lyj/Programs/Apache/Spark2/examples/src/main/resources/people.json")
df.show()
but it fails with the following error:
Traceback (most recent call last):
File "/Users/lyj/Programs/kiseliugit/MyPysparkCodes/test/spark2.0.py", line 5, in <module>
spark = SparkSession.builder.master("local").appName('test 2.0').config(conf=SparkConf()).getOrCreate()
File "/Users/lyj/Programs/Apache/Spark2/python/pyspark/conf.py", line 104, in __init__
SparkContext._ensure_initialized()
File "/Users/lyj/Programs/Apache/Spark2/python/pyspark/context.py", line 243, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "/Users/lyj/Programs/Apache/Spark2/python/pyspark/java_gateway.py", line 116, in launch_gateway
java_import(gateway.jvm, "org.apache.spark.SparkConf")
File "/Library/Python/2.7/site-packages/py4j/java_gateway.py", line 90, in java_import
return_value = get_return_value(answer, gateway_client, None, None)
File "/Library/Python/2.7/site-packages/py4j/protocol.py", line 306, in get_return_value
value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
KeyError: u'y'
What's wrong with these few lines of code? Is there a problem with my Java environment? Also, I am using the PyCharm IDE for development.
Try upgrading py4j: pip install py4j --upgrade
It worked for me.
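After upgrading, a minimal re-test of the original program (a sketch; the people.json path is the example file shipped with Spark, so adjust it to your installation):
from pyspark.sql import SparkSession

# the same builder call that failed before the py4j upgrade
spark = (SparkSession.builder
         .master("local")
         .appName("test 2.0")
         .getOrCreate())

df = spark.read.json("examples/src/main/resources/people.json")
df.show()
spark.stop()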