Apache Spark 1.3.1: getting ClassNotFoundException when a UDTF is used with LATERAL VIEW - apache-spark

When executing a Hive query with a UDTF on Apache Spark 1.3.1, the column alias is not applied.
And if I use the UDTF with LATERAL VIEW, I get ClassNotFoundException for the custom UDTF class.

This issue was reported against Spark 1.1.0, 1.1.1, 1.2.1 and 1.3.0; it has been resolved and will be available in Spark 1.4.0.
https://issues.apache.org/jira/browse/SPARK-4811
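For reference, the failing pattern looks roughly like this (a sketch only; the jar path, function name and table are placeholders for the custom UDTF being used):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("udtf-lateral-view"))
val hiveContext = new HiveContext(sc)

// Register the custom UDTF shipped in a jar (placeholders).
hiveContext.sql("ADD JAR /path/to/custom-udtf.jar")
hiveContext.sql("CREATE TEMPORARY FUNCTION explode_items AS 'com.example.MyUDTF'")

// Without LATERAL VIEW the query runs but the column alias is dropped;
// with LATERAL VIEW, Spark 1.3.1 fails with ClassNotFoundException (SPARK-4811).
val result = hiveContext.sql(
  """SELECT id, item
    |FROM some_table
    |LATERAL VIEW explode_items(items) t AS item""".stripMargin)
result.show()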

Related

How to fix 'Failed to convert the JSON string 'varchar(2)' to a data type.'

We want to move from Spark 3.0.1 to 3.1.2. According to the migration guide, varchar data types are now supported in table schemas. Unfortunately, data onboarded with the new version can't be queried by older Spark versions, which treated varchar as string in the table schema. According to the migration guide, setting spark.sql.legacy.charVarcharAsString to true in the Spark session configuration should do the trick, but we still get the varchar data type instead of string in the Hive table schema.
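Roughly what we are doing when building the session (a minimal sketch; table and column names are placeholders):
import org.apache.spark.sql.SparkSession

// Enable the legacy behaviour so CHAR/VARCHAR columns are treated as STRING in the table schema.
val spark = SparkSession.builder()
  .appName("varchar-as-string")
  .config("spark.sql.legacy.charVarcharAsString", "true")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE demo_tbl (code VARCHAR(2)) STORED AS PARQUET")
// Expectation: the Hive schema shows `code` as string; on 3.1.2 it still shows varchar(2).
spark.sql("DESCRIBE demo_tbl").show(false)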
What are we missing here?
You should upgrade your Spark version; see https://issues.apache.org/jira/browse/SPARK-37452. There is a bug that affects versions 3.1.2 and 3.2.0, and it was fixed in versions 3.1.3, 3.2.1 and 3.3.0.

How to run the createIndex function in Hyperspace (Spark)

I am trying to create an index using Hyperspace in PySpark, but I am getting an error. Here is the code:
sample_data = [(1, "name1"), (2, "name2")]
spark.createDataFrame(sample_data, ['id','name']).write.mode("overwrite").parquet("table")
df = spark.read.parquet("table")
from hyperspace import *
# Create an instance of Hyperspace
hyperspace = Hyperspace(spark)
hyperspace.createIndex(df, IndexConfig("index", ["id"], ["name"]))
java.lang.ClassCastException: org.apache.spark.sql.execution.datasources.SerializableFileStatus cannot be cast to org.apache.hadoop.fs.FileStatus
I am running on an Azure Databricks environment:
Spark 3.0.0, Scala 2.12
When I try to do the same on Spark 2.4.2 with Scala 2.12 or Scala 2.11, I get an error in the same function (createIndex):
Py4JJavaError: An error occurred while calling None.com.microsoft.hyperspace.index.IndexConfig.
: java.lang.NoClassDefFoundError:
Can anyone suggest some solutions?
Per the last comment of https://github.com/microsoft/hyperspace/discussions/285, this is a known issue with the Databricks runtime.
If you use open-source Spark, it should work.
A solution is being sought with the Databricks team.
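On open-source Spark the equivalent flow in Scala looks roughly like this (a sketch; the package version is an assumption, so pick the Hyperspace release that matches your Spark/Scala build):
// Launch with the Hyperspace package, e.g.:
//   spark-shell --packages com.microsoft.hyperspace:hyperspace-core_2.12:0.4.0
import com.microsoft.hyperspace._
import com.microsoft.hyperspace.index._

val df = spark.read.parquet("table")

val hs = new Hyperspace(spark)
// Index on "id" and include "name" so covered queries can be answered from the index.
hs.createIndex(df, IndexConfig("index", indexedColumns = Seq("id"), includedColumns = Seq("name")))

// Let the optimizer consider Hyperspace indexes for subsequent queries.
spark.enableHyperspace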

Spark org.apache.spark.sql.catalyst.analysis.UnresolvedException error in loading Hive table

While trying to load data from a dataset into a Hive table, I get the error:
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid
call to dataType on unresolved object, tree: 'ipl_appl_signed_date
My dataset contains the same columns as the Hive table, and the column for which I am getting the error has the Date data type both in my code (Java) and in Hive.
java code:
Date IPL_APPL_SIGNED_DATE = rs.getDate("DTL.IPL_APPL_SIGNED_DATE"); // using JDBC to get the record
Encoder<DimPolicy> encoder = Encoders.bean(DimPolicy.class);
Dataset<DimPolicy> test = spark.createDataset(allRows, encoder); // spark is the SparkSession
test.write().mode("append").insertInto("someSchema.someTable");
I think the issue is due to a bug in Spark, i.e. [SPARK-26379] Use dummy TimeZoneId for CurrentTimestamp to avoid UnresolvedException in CurrentBatchTimestamp, which was fixed in 2.3.3, 2.4.1 and 3.0.0.
A solution is to downgrade to the version of Spark that is unaffected by the bug (or wait for a new version).
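If you are unsure which build you are running against, a quick check from the Scala shell (a sketch) is:
// SPARK-26379 is fixed in 2.3.3, 2.4.1 and 3.0.0; anything older on those lines is affected.
println(s"Running Spark ${spark.version}")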

How to implement rdd.bulkSaveToCassandra in DataStax

I am using a DataStax cluster with DSE 5.0.5.
[cqlsh 5.0.1 | Cassandra 3.0.11.1485 | DSE 5.0.5 | CQL spec 3.4.0 | Native proto
using spark-cassandra-connector 1.6.8
I tried to implement the code below, but the import is not working.
val rdd: RDD[SomeType] = ... // create some RDD to save
import com.datastax.bdp.spark.writer.BulkTableWriter._
rdd.bulkSaveToCassandra(keyspace, table)
Can someone suggest how to implement this code? Are there any dependencies required for this?
The Cassandra Spark Connector has a saveToCassandra method that can be used like this (taken from the documentation):
val collection = sc.parallelize(Seq(("cat", 30), ("fox", 40)))
collection.saveToCassandra("test", "words", SomeColumns("word", "count"))
There is also saveAsCassandraTableEx, which allows you to control schema creation, among other things - it's also described in the documentation referenced above.
To use them you need to import com.datastax.spark.connector._, as described in the "Connecting to Cassandra" document.
And you need to add the corresponding dependency - the exact form depends on which build system you use.
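With sbt, for example, it would look roughly like this (a sketch; match the connector version to your Spark version - the 1.6.x line used in the question matches Spark 1.6):
// build.sbt (sketch): open-source Spark Cassandra Connector dependency
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.8"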
The bulkSaveToCassandra method is available only when you're using DSE's connector. You need to add the corresponding dependencies - see the documentation for more details. But even the primary developer of the Spark connector says that it's better to use saveToCassandra instead.

Spark connecting to Phoenix: NoSuchMethodError exception

I am trying to connect to Phoenix through Spark/Scala to read and write data as a DataFrame. I am following the example on GitHub; however, when I try the very first example, Load as a DataFrame using the Data Source API, I get the exception below.
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.setWriteToWAL(Z)Lorg/apache/hadoop/hbase/client/Put;
There are a couple of things from those examples that are driving me crazy:
1) The import statement import org.apache.phoenix.spark._ gives me the below error in my code:
cannot resolve symbol phoenix
I have included the below jars in my sbt:
"org.apache.phoenix" % "phoenix-spark" % "4.4.0.2.4.3.0-227" % Provided,
"org.apache.phoenix" % "phoenix-core" % "4.4.0.2.4.3.0-227" % Provided,
2) I get a deprecation warning for the symbol load.
I googled that warning but didn't find any reference, and I was not able to find any example of the suggested method. I am not able to find any other good resource that explains how to connect to Phoenix. Thanks for your time.
Please use .read instead of .load, as shown below:
val df = sparkSession.sqlContext.read
  .format("org.apache.phoenix.spark")
  .option("zkUrl", "localhost:2181")
  .option("table", "TABLE1")
  .load()
It's late to answer, but here's what I did to solve a similar problem (a different NoSuchMethodError and the deprecation warning):
1) About the NoSuchMethodError: I took all the jars from the HBase installation's lib folder and added them to the project, and also added the phoenix-spark jars. Make sure to use compatible versions of Spark and phoenix-spark: Spark 2.0+ is compatible with phoenix-spark 4.10+ (maven-central-link). This resolved the NoSuchMethodError.
2) About load: the load method has long since been deprecated. Use sqlContext.phoenixTableAsDataFrame instead. For reference, see Load as a DataFrame directly using a Configuration object.
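A rough sketch of that approach (table name, columns and ZooKeeper quorum are placeholders):
import org.apache.hadoop.conf.Configuration
import org.apache.phoenix.spark._ // adds phoenixTableAsDataFrame to the SQLContext

val configuration = new Configuration()
configuration.set("hbase.zookeeper.quorum", "localhost:2181")

// Load the Phoenix table directly through a Hadoop Configuration object.
val df = sqlContext.phoenixTableAsDataFrame(
  "TABLE1",
  Seq("ID", "COL1"),
  conf = configuration
)
df.show()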
