Read Avro in Azure HDI4.0 - azure

I'm trying to read an Avro file using a Jupyter notebook in Azure HDInsight 4.0 with Spark 2.4, but I'm not able to provide the required .jar file to the Spark session properly.
I've tried the approach suggested in "How to use Avro on HDInsight Spark/Jupyter?" and in https://learn.microsoft.com/en-in/azure/hdinsight/spark/apache-spark-jupyter-notebook-use-external-packages, but I guess they relate to Spark 2.3:
%%configure
{ "conf": {"spark.jars.packages": "com.databricks:spark-avro_2.11:4.0.0" }}
This produces the error message:
pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'

The solution that seems to work is:
%%configure -f
{ "conf": {"spark.jars.packages": "org.apache.spark:spark-avro_2.11:2.4.0" }}

Related

How to read a pickle file in pyspark on Databricks

I have a pickle file on Azure Blob Storage that I want to read in Spark. Reading the file gives an error.
df = spark.read.format('pickle').load(path)
The following is the error I am receiving:
java.lang.ClassNotFoundException: Failed to find data source: pickle. Please find packages at
http://spark.apache.org/third-party-projects.html
Version details
Spark 3.0.1, Scala 2.12
Any help would be appreciated.
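Spark has no built-in 'pickle' data source, which is exactly what the ClassNotFoundException says. A common workaround (a sketch of my own, not from this thread) is to read the blob's raw bytes with sc.binaryFiles and unpickle them; the deserialization step itself is plain Python:

```python
import pickle

def unpickle_bytes(raw: bytes):
    """Deserialize one pickle payload (the per-file step of the pattern below)."""
    return pickle.loads(raw)

# With a SparkContext `sc` and a hypothetical blob path, the full pattern
# would be (each binaryFiles record is a (path, bytes) pair):
#   rdd = sc.binaryFiles("wasbs://container@account.blob.core.windows.net/data.pkl")
#   objs = rdd.map(lambda kv: unpickle_bytes(kv[1])).collect()
```

Note this materializes each file as a whole on an executor, which is fine for small pickles but not for very large ones.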

Missing library to ingest data into Azure Data Explorer with PySpark

I am trying to ingest data into Azure Data Explorer through PySpark using the PyCharm IDE. However, I am having a lot of problems related to missing libraries when running my code.
According to the Azure Data Explorer connector's page, I need to install the connector's jar and the two dependency jars, kusto-ingest and kusto-data.
After downloading all three jars and importing them into PySpark, I still can't proceed with my data ingestion; it keeps returning missing-library errors. The first one is the azure-storage lib; after I install and import that jar, it asks for the adal4j lib; I do the same and it asks for the oauth2 lib, then the json lib, the azure-client-authentication lib, the javax.mail lib, and so on.
I've installed more than 10 jars and I still can't run this ingestion. Am I doing something wrong?
My PySpark version is 2.4. You can see my code below:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master('local[*]') \
.appName("Teste") \
.config('spark.jars', 'kusto-spark_2.4_2.11-2.5.2.jar,kusto-data-2.4.1.jar,kusto-ingest-2.4.1.jar,azure-storage-8.3.0.jar,json-20180813.jar,adal4j-1.6.5.jar') \
.getOrCreate()
# loading a test csv file
df = spark.read.csv('MOCK_DATA.csv', header=True, sep=',')
df.write.format("com.microsoft.kusto.spark.datasource")\
.option("kustoCluster", "myclustername")\
.option("kustoDatabase", "mydatabase")\
.option("kustoTable", "mytable")\
.option("kustoAadAppId", "myappid")\
.option("kustoAadAppSecret", "mysecret")\
.option("kustoAadAuthorityID", "myautorityid")\
.mode("Append")\
.save()
When working with a non-Maven installation you need to use a JAR that bundles all the dependencies.
You can take it from the GitHub releases:
https://github.com/Azure/azure-kusto-spark/releases
or build it yourself, if it's missing for your specific version, by cloning the repo and running:
mvn assembly:single
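Alternatively (an assumption on my part, not from the answer above), you can let Spark resolve the connector and all of its transitive dependencies from Maven instead of chasing jars one at a time. The coordinate below is illustrative, inferred from the jar name in the question; take the exact one from the connector's releases page:

```shell
# Sketch: Maven resolution pulls azure-storage, adal4j, etc. automatically,
# so none of the manually downloaded jars are needed.
# `my_ingest_job.py` is a placeholder for your script.
spark-submit \
  --packages com.microsoft.azure.kusto:kusto-spark_2.4_2.11:2.5.2 \
  my_ingest_job.py
```

The same coordinate can go in `spark.jars.packages` on the SparkSession builder instead of `spark.jars`.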

Read/Load avro file from s3 using pyspark

Using an AWS Glue developer endpoint. Spark version: 2.4, Python version: 3.
Code:
df=spark.read.format("avro").load("s3://dataexport/users/prod-users.avro")
Getting the following error message while trying to read the avro file:
Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;
I found the following links, but they did not help resolve my issue:
Apache Avro Data Source Guide (https://spark.apache.org/docs/latest/sql-data-sources-avro.html)
Apache Avro as a Built-in Data Source in Apache Spark 2.4
You just need to import the spark-avro package matching your Spark and Scala versions, e.g. for Spark 2.4:
org.apache.spark:spark-avro_2.11:2.4.0
Check which version you need in the Apache Avro Data Source Guide.
Have you imported the package while starting the shell? If not, you need to start the shell as below. Note that the com.databricks package applies to Spark versions before 2.4:
pyspark --packages com.databricks:spark-avro_2.11:4.0.0
Also write it as below inside read.format:
df=spark.read.format("com.databricks.spark.avro").load("s3://dataexport/users/prod-users.avro")
Note: with the Databricks package you need to write 'com.databricks.spark.avro' instead of 'avro'. On Spark 2.4+, use org.apache.spark:spark-avro_2.11:2.4.0 and format 'avro' instead.
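The version split that trips people up here can be summarized in a small helper (a sketch for illustration; the helper name and the pinned Databricks version are my own, so check Maven Central for the exact artifact you need):

```python
def avro_package(spark_version: str, scala_version: str = "2.11") -> str:
    """Return the --packages coordinate for spark-avro.

    Spark 2.4+ ships Avro as an external but Apache-maintained module
    (org.apache.spark:spark-avro, versioned with Spark itself); earlier
    versions rely on the Databricks package instead.
    """
    major, minor = (int(x) for x in spark_version.split(".")[:2])
    if (major, minor) >= (2, 4):
        return f"org.apache.spark:spark-avro_{scala_version}:{spark_version}"
    return f"com.databricks:spark-avro_{scala_version}:4.0.0"
```

For example, `avro_package("2.4.0")` gives the coordinate used in the accepted fix above.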

How can I read a XML file Azure Databricks Spark

I was looking for some info on the MSDN forums but couldn't find a good one. While reading on the Spark site I got the hint that I would have better chances here.
So, bottom line: I want to read from Blob storage, where there is a continuous feed of XML files (all small files); finally, we store these files in an Azure DW.
Using Azure Databricks I can use Spark and Python, but I can't find a way to 'read' the XML type. Some sample scripts used the library xml.etree.ElementTree, but I can't get it imported.
So any help pushing me in a good direction is appreciated.
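As an aside on the xml.etree.ElementTree route mentioned in the question: it is part of the Python standard library, so no install is needed. A minimal driver-side parse (a sketch with a made-up document, not a distributed Spark reader) looks like:

```python
import xml.etree.ElementTree as ET

def parse_note(xml_text: str) -> dict:
    """Flatten one small, single-level XML document into a tag -> text dict."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}

# parse_note("<note><to>Ana</to><body>hi</body></note>")
# -> {"to": "Ana", "body": "hi"}
```

This works for inspecting individual small files, but for a feed of many files the spark-xml library described below scales better.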
One way is to use the Databricks spark-xml library:
Import the spark-xml library into your workspace: https://docs.databricks.com/user-guide/libraries.html#create-a-library (search for spark-xml in the Maven/Spark package section and import it).
Attach the library to your cluster: https://docs.databricks.com/user-guide/libraries.html#attach-a-library-to-a-cluster
Use the following code in your notebook to read the XML file, where "note" is the root tag of my XML file:
xmldata = spark.read.format('xml').option("rootTag","note").load('dbfs:/mnt/mydatafolder/xmls/note.xml')
I found this one really helpful:
https://github.com/raveendratal/PysparkTelugu/blob/master/Read_Write_XML_File.ipynb
He has a YouTube video walking through the steps as well.
In summary, two approaches:
install it in your Databricks cluster on the 'Libraries' tab, or
install it by launching spark-shell in the notebook itself.
I got one solution for reading XML files in Databricks:
Install this library: com.databricks:spark-xml_2.12:0.11.0,
using this cluster configuration: 10.5 (includes Apache Spark 3.2.1, Scala 2.12).
Using the command %fs head "" you will get the rootTag and rowTag.
df = spark.read.format('xml').option("rootTag","orders").option("rowTag","purchase_item").load("dbfs:/databricks-datasets/retail-org/purchase_orders/purchase_orders.xml")
display(df)

Query Hive table created with built-in Serde from Spark app

I have a Hadoop cluster deployed using Hortonworks HDP 2.2 (Spark 1.2.1 & Hive 0.14).
I have developed a simple Spark app that is supposed to retrieve the content of a Hive table, perform some actions, and output to a file. The Hive table was imported using Hive's built-in SerDe.
When I run the app on the cluster I get the following exception:
ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class org.apache.hadoop.hive.serde2.OpenCSVSerde not found
java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.serde2.OpenCSVSerde not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1982)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:337)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:281)
at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:631)
at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:189)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1017)
...
Basically, Spark doesn't find Hive's SerDe (org.apache.hadoop.hive.serde2.OpenCSVSerde).
I couldn't find any jar to include at the app's execution, and no mention of a similar problem anywhere. I have no idea how to tell Spark where to find it.
Make a shaded JAR of your application which includes the hive-serde JAR.
Add the jar file to the Spark config spark.driver.extraClassPath.
Any external jar must be added there; the Spark environment will then load it automatically.
Or use the spark-shell --jars option.
Example:
spark.executor.extraClassPath /usr/lib/hadoop/lib/csv-serde-0.9.1.jar
The .jar was in Hive's lib folder; I just had to add it on launch with --jars and know where to look!
--jars /usr/hdp/XXX/hive/lib/hive-serde-XXX.jar
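Putting the answers above together, a submit-time sketch might look like the following (jar path taken from the csv-serde example above; `my_app.jar` is a placeholder, and whether you need both extraClassPath settings depends on where the SerDe is instantiated):

```shell
# Ship the SerDe jar with the app and put it on both driver and executor
# classpaths, so the Hive metastore client can instantiate it anywhere.
spark-submit \
  --jars /usr/lib/hadoop/lib/csv-serde-0.9.1.jar \
  --conf spark.driver.extraClassPath=/usr/lib/hadoop/lib/csv-serde-0.9.1.jar \
  --conf spark.executor.extraClassPath=/usr/lib/hadoop/lib/csv-serde-0.9.1.jar \
  my_app.jar
```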
