Error when attempting to read Parquet in Spark - apache-spark

I am using PySpark 2.4.3.
I read the CSV, make a DataFrame from it, and write it to Parquet just fine. The third line below is what breaks:
df = spark.read.csv("file.csv", header=True)
df.write.parquet("result_parquet")
parquetFile = spark.read.parquet("result_parquet")
I am getting this:
Py4JJavaError: An error occurred while calling o1312.parquet.
: java.lang.IllegalArgumentException: Unsupported class file major version 55
What am I doing wrong? I got the line straight from the Spark documentation https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#loading-data-programmatically

The problem was that I was using Java 11, which is not fully supported by Spark. I uninstalled it and installed Java 8, and now it works.
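For anyone hitting the same thing, a minimal sketch of pointing PySpark at a Java 8 install before the session is created (the JDK path below is an assumption; substitute the location of your own Java 8 installation):
import os
# Hypothetical path; Spark picks up JAVA_HOME when it launches its JVM.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# Verify with `java -version` (it should report 1.8.x) before building the SparkSession.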

Related

Writing avro files using Spark 2.3

I'm somewhat new to Spark. I understand that Avro read/write support was built into Spark 2.4, but unfortunately I'm limited to version 2.3 right now. I'm having trouble writing to Avro and keep getting errors. Am I not installing the package properly?
I have used this in my Spark session setup:
import os  # needed for os.environ below
avro_loc = "com.databricks:spark-avro_2.11:4.0.0"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages ' + avro_loc + ' pyspark-shell'
I've tried these two versions of the write code:
df.write.mode('overwrite') \
    .option('batchsize', 10000) \
    .avro('{}/df.avro'.format(HDFS_LOC))
df.write.format('avro').save('/user/Data/df.avro')
I get these errors for the first and second snippets above, respectively:
AttributeError: 'DataFrameWriter' object has no attribute 'avro'
AnalysisException: 'Failed to find data source: avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;
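For what it's worth, with the Databricks spark-avro package on Spark 2.3 the data source is usually addressed by its fully qualified name rather than the short 'avro' alias; a sketch, assuming the avro_loc package above actually made it onto the classpath:
# Use the fully qualified data source name provided by the spark-avro package.
df.write.mode('overwrite') \
    .format('com.databricks.spark.avro') \
    .save('{}/df.avro'.format(HDFS_LOC))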

How do I disable PyArrow in Databricks

I'm trying to convert a PySpark DataFrame to a pandas DataFrame in Databricks. My Databricks Runtime version is 7.3 LTS (Scala 2.12, Spark 3.0.1).
So I wrote the following code:
df_temp=spark_temp.toPandas()
But I'm getting this warning message:
UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true,
So I tried the following to disable PyArrow:
spark.conf.set(“spark.sql.execution.arrow.enabled”, “false”)
But I'm getting this error message:
SyntaxError: invalid character in identifier
It points at spark.sql.
Can you help me resolve the issue?
The issue comes from the curly (typographic) double quotation marks; replace them with plain ASCII quotes:
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
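With the plain quotes in place, the conversion can be retried; a minimal sketch using the names from the question:
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
# toPandas() now falls back to the non-Arrow conversion path.
df_temp = spark_temp.toPandas()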

How to run CreateIndex function in Hyperspace (spark)

I am trying to create an index using Hyperspace in PySpark, but I am getting an error. Here is my code, followed by the error:
sample_data = [(1, "name1"), (2, "name2")]
spark.createDataFrame(sample_data, ['id','name']).write.mode("overwrite").parquet("table")
df = spark.read.parquet("table")
from hyperspace import *
# Create an instance of Hyperspace
hyperspace = Hyperspace(spark)
hyperspace.createIndex(df, IndexConfig("index", ["id"], ["name"]))
java.lang.ClassCastException: org.apache.spark.sql.execution.datasources.SerializableFileStatus cannot be cast to org.apache.hadoop.fs.FileStatus
I am running in an Azure Databricks environment:
Spark 3.0.0, Scala 2.12
When I try to do the same on Spark 2.4.2 with Scala 2.12 or Scala 2.11, I get an error in the same function (createIndex).
There I get the following error:
Py4JJavaError: An error occurred while calling None.com.microsoft.hyperspace.index.IndexConfig.
: java.lang.NoClassDefFoundError:
Can anyone suggest a solution?
Per the last comment on https://github.com/microsoft/hyperspace/discussions/285, this is a known issue with the Databricks runtime.
If you use open-source Spark, it should work.
A solution is being sought with the Databricks team.
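On open-source Spark, Hyperspace is typically attached at launch time with --packages; a sketch, where the Maven coordinate and version are assumptions to be checked against the Hyperspace releases matching your Spark/Scala build:
# Artifact and version below are assumptions; pick the ones matching your Spark/Scala build.
pyspark --packages com.microsoft.hyperspace:hyperspace-core_2.12:0.4.0
The createIndex call from the question should then run unchanged.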

Saving data in Elasticsearch using PySpark [duplicate]

This question already has answers here:
How to save dataframe to Elasticsearch in PySpark?
(3 answers)
Closed 3 years ago.
I have a program that takes a dataframe and should save it into Elasticsearch. Here's what it looks like when I save the dataframe:
model_df.write.format("org.elasticsearch.spark.sql") \
    .option("pushdown", True) \
    .option("es.nodes", "example.server:9200") \
    .option("es.index.auto.create", True) \
    .mode('append') \
    .save("EPTestIndex/")
When I run my program, I get this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o96.save.
: java.lang.ClassNotFoundException: Failed to find data source:
org.elasticsearch.spark.sql. Please find packages at
http://spark.apache.org/third-party-projects.html
I did some research and thought I needed a jar, so I added these configurations to my SparkSession:
spark = SparkSession.builder.config("jars", "/Users/public/ProjectDirectory/lib/elasticsearch-spark-20_2.11-6.0.1.jar")\
.getOrCreate()
sqlContext = SQLContext(spark)
I initialize the SparkSession in main and write to ES in another package. The package takes the dataframe and runs the write command above. However, even with this I am still getting the same ClassNotFoundException. What might be the issue?
I am running this program in PyCharm; how can I make it so that PyCharm is able to run it?
Elasticsearch exposes a JSON API, and a pandas DataFrame is not a JSON-supported type.
If you had to insert it, you could serialize the DataFrame using dataframe.to_json().
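If you go that route, a minimal sketch of the serialization step (note that toPandas() collects the whole DataFrame to the driver, so this is only reasonable for small results):
# Convert the Spark DataFrame to pandas, then serialize to a JSON string.
records_json = model_df.toPandas().to_json(orient="records")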

Unable to read parquet file locally in spark

I am running PySpark locally and trying to read a Parquet file into a DataFrame from a notebook.
df = spark.read.parquet("metastore_db/tmp/userdata1.parquet")
I am getting this exception:
An error occurred while calling o738.parquet.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Does anyone know how to do it?
Assuming that you are running Spark locally, you should be doing something like:
df = spark.read.parquet("file:///metastore_db/tmp/userdata1.parquet")