How to run CreateIndex function in Hyperspace (spark) - apache-spark

I am trying to create an index using hyperspace in pyspark.
But I am getting this error
sample_data = [(1, "name1"), (2, "name2")]
spark.createDataFrame(sample_data, ['id','name']).write.mode("overwrite").parquet("table")
df = spark.read.parquet("table")
from hyperspace import *
# Create an instance of Hyperspace
hyperspace = Hyperspace(spark)
hs.createIndex(df, IndexConfig("index", ["id"], ["name"]))
java.lang.ClassCastException: org.apache.spark.sql.execution.datasources.SerializableFileStatus cannot be cast to org.apache.hadoop.fs.FileStatus
I am running on Azure databricks environment-
Spark 3.0.0 scala 2.12
When I try to do the same on spark 2.4.2 scala 2.12 or scala 2.11
I get the error in the same function (CreateIndex)
Here I get the following error-
.Py4JJavaError: An error occurred while calling None.com.microsoft.hyperspace.index.IndexConfig.
: java.lang.NoClassDefFoundError:
Can anyone suggest some solutions.

Per the last comment of https://github.com/microsoft/hyperspace/discussions/285, it is a known issue with Databricks runtime.
If you use open source spark, it should work.
Seeking a solution with Databricks team.

Related

Writing avro files using Spark 2.3

I'm somewhat new to Spark, but I understand that read/write of avro files was built into Spark 2.4, but unfortunately I'm limited to version 2.3 right now. I'm having trouble writing to avro and keep getting errors. Am I not installing this properly?
Have used this in spark session setup:
avro_loc = "com.databricks:spark-avro_2.11:4.0.0"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages ' + avro_loc + ' pyspark-shell'
And I've tried these two versions for the write code I'm attempting:
df.write.mode('overwrite')\
.option('batchsize',10000) \
.avro('{}/df.avro' \
.format(HDFS_LOC))
df.write.format('avro').save('/user/Data/df.avro')
I get these errors for the 1st and 2nd bit of code above, respectively:
AttributeError: 'DataFrameWriter' object has no attribute 'avro'
AnalysisException: 'Failed to find data source: avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;

How do I disable pyarrow using data bricks

I'm trying to convert a pyspark dataframe to pandas data frame in databricks. My databricks Runtime version is 7.3 LTS (Scala 2.12, Spark 3.0.1)
So I wrote following code
df_temp=spark_temp.toPandas()
But I'm getting error message
UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true,
So I tried following to disable the pyarrow
spark.conf.set(“spark.sql.execution.arrow.enabled”, “false”)
But I'm getting error message
SyntaxError: invalid character in identifier
And it's pointing to spark.sql
Can you help me to resolve the issue
The issue is from those double quotation marks, try this:
spark.conf.set("spark.sql.execution.arrow.enabled", "false")

Error when attempting to read Parquet in Spark

I am using Python Spark 2.4.3
I read the CSV and make a dataframe from it and write it to Parquet just fine. The 3rd line is what breaks.
df = spark.read.csv("file.csv", header=True)
df.write.parquet("result_parquet")
parquetFile = spark.read.parquet("result_parquet")
I am getting this:
Py4JJavaError: An error occurred while calling o1312.parquet.
: java.lang.IllegalArgumentException: Unsupported class file major version 55
What am I doing wrong? I got the line straight from the Spark documentation https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#loading-data-programmatically
The problem is I was using Java 11 (not supported fully by Spark). I uninstalled and Installed Java 8 and now it works

Spark org.apache.spark.sql.catalyst.analysis.UnresolvedException error in loading Hive table

While trying to load data from a dataset into Hive table getting the error:
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid
call to dataType on unresolved object, tree: 'ipl_appl_signed_date
My dataset contains same columns as the Hive table and the column for which am getting the error has Date datatype in my code(Java) as well as in Hive.
java code:
Date IPL_APPL_SIGNED_DATE =rs.getDate("DTL.IPL_APPL_SIGNED_DATE"); //using jdbc to get record.
Encoder<DimPolicy> encoder = Encoders.bean(Foo.class);
Dataset<DimPolicy> test=spark.createDataset(allRows,encoder); //spark is the spark session
test.write().mode("append").insertInto("someSchema.someTable"); //
I think the issue is due to a bug in Spark i.e. [SPARK-26379] Use dummy TimeZoneId for CurrentTimestamp to avoid UnresolvedException in CurrentBatchTimestamp, that got fixed in 2.3.3, 2.4.1, 3.0.0.
A solution is to downgrade to the version of Spark that is unaffected by the bug (or wait for a new version).

spark connecting to Phoenix NoSuchMethod Exception

I am trying to connect to Phoenix through Spark/Scala to read and write data as a DataFrame. I am following the example on GitHub however when I try the very first example Load as a DataFrame using the Data Source API I get the below exception.
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.setWriteToWAL(Z)Lorg/apache/hadoop/hbase/client/Put;
There are couple of things that are driving me crazy from those examples:
1)The import statement import org.apache.phoenix.spark._ gives me below exception in my code:
cannot resolve symbol phoenix
I have included below jars in my sbt
"org.apache.phoenix" % "phoenix-spark" % "4.4.0.2.4.3.0-227" % Provided,
"org.apache.phoenix" % "phoenix-core" % "4.4.0.2.4.3.0-227" % Provided,
2) I get the deprecated warning for symbol load.
I googled about that warnign but didn't got any reference and I was not able to find any example of the suggested method. I am not able to find any other good resource which guides on how to connect to Phoenix. Thanks for your time.
please use .read instead of load as shown below
val df = sparkSession.sqlContext.read
.format("org.apache.phoenix.spark")
.option("zkUrl", "localhost:2181")
.option("table", "TABLE1").load()
Its late to answer but here's what i did to solve a similar problem(Different method not found and deprecation warning):
1.) About the NoSuchMethodError: I took all the jars from hbase installation lib folder and add it to your project .Also add pheonix spark jars .Make sure to use compatible versions of spark and pheonix spark.Spark 2.0+ is compatible with pheonix-spark-4.10+
maven-central-link.This resolved the NoSuchMethodError
2.) About the load - The load method has long since been deprecated .Use sqlContext.phoenixTableAsDataFrame.For reference see this Load as a DataFrame directly using a Configuration object

Resources