Using an AWS Glue development endpoint, Spark version 2.4, Python version 3.
Code:
df=spark.read.format("avro").load("s3://dataexport/users/prod-users.avro")
Getting the following error message while trying to read an Avro file:
Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;
I found the following links, but they were not helpful in resolving my issue:
Apache Avro Data Source Guide: https://spark.apache.org/docs/latest/sql-data-sources-avro.html
Apache Avro as a Built-in Data Source in Apache Spark 2.4
You just need to include the Avro package:
org.apache.spark:spark-avro_2.11:2.4.0
Check which version you need for your Spark and Scala versions.
Did you include the package when starting the shell? If not, start the shell as below. The package below is applicable for Spark 2.4+:
pyspark --packages com.databricks:spark-avro_2.11:4.0.0
Also write the format as below inside read.format:
df=spark.read.format("com.databricks.spark.avro").load("s3://dataexport/users/prod-users.avro")
Note: for PySpark with this package you need to write 'com.databricks.spark.avro' instead of 'avro'.
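Alternatively, on Spark 2.4 you can use the built-in Avro source directly by supplying the matching org.apache.spark artifact when starting the shell. A minimal sketch, reusing the S3 path from the question:

# Start PySpark with the built-in Avro module matching Spark 2.4 / Scala 2.11:
# pyspark --packages org.apache.spark:spark-avro_2.11:2.4.0
df = spark.read.format("avro").load("s3://dataexport/users/prod-users.avro")
df.printSchema()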
I am trying to ingest data into Azure Data Explorer through PySpark with the PyCharm IDE. However, I am having a lot of problems with missing libraries when running my code.
According to the Azure Data Explorer connector's page, I need to install the connector's JAR and its two dependency JARs, kusto-ingest and kusto-data.
After downloading all three JARs and importing them into PySpark, I still can't proceed with my data ingestion; it keeps returning missing-library errors. The first one is the azure-storage lib; after I installed and imported that JAR, it asked for the adal4j lib; I did the same and it asked for the oauth2 lib, then the json lib, the azure-client-authentication lib, the javax mail lib, and so on.
I've installed more than 10 JARs and I still can't run this ingestion. Am I doing something wrong?
My PySpark version is 2.4. You can see my code below:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master('local[*]') \
.appName("Teste") \
.config('spark.jars', 'kusto-spark_2.4_2.11-2.5.2.jar,kusto-data-2.4.1.jar,kusto-ingest-2.4.1.jar,azure-storage-8.3.0.jar,json-20180813.jar,adal4j-1.6.5.jar') \
.getOrCreate()
# loading a test csv file
df = spark.read.csv('MOCK_DATA.csv', header=True, sep=',')
df.write.format("com.microsoft.kusto.spark.datasource")\
.option("kustoCluster", "myclustername")\
.option("kustoDatabase", "mydatabase")\
.option("kustoTable", "mytable")\
.option("kustoAadAppId", "myappid")\
.option("kustoAadAppSecret", "mysecret")\
.option("kustoAadAuthorityID", "myautorityid")\
.mode("Append")\
.save()
When working with a non-Maven installation you need to use a JAR with all the dependencies.
You can take it from the GitHub releases:
https://github.com/Azure/azure-kusto-spark/releases
or build it yourself, if it's missing for the specific version, by cloning the repo and running
mvn assembly:single
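For illustration, a minimal sketch of pointing spark.jars at the single assembled JAR instead of listing each dependency. The exact JAR file name below is an assumption; replace it with whatever the release download or the assembly build actually produces:

from pyspark.sql import SparkSession

# The "jar-with-dependencies" file name is hypothetical; use the assembled JAR
# taken from the GitHub releases page or produced by mvn assembly:single.
spark = SparkSession.builder \
    .master('local[*]') \
    .appName("Teste") \
    .config('spark.jars', 'kusto-spark_2.4_2.11-2.5.2-jar-with-dependencies.jar') \
    .getOrCreate()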
I have created a Spark application to integrate with Kafka and get a stream of data from it.
But when I try to import org.apache.spark.streaming.kafka._ an error occurs: Cannot resolve symbol kafka. What should I do to import this library?
Depending on your Spark and Scala versions, you need to include the Spark-Kafka integration library in your dependencies.
Spark Structured Streaming
If you plan to use Spark Structured Streaming you need to add the following to your dependencies as described here:
For Scala/Java applications using SBT/Maven project definitions, link your application with the following artifact:
groupId = org.apache.spark
artifactId = spark-sql-kafka-0-10_2.12
version = 3.0.1
Please note that to use the headers functionality, your Kafka client version should be version 0.11.0.0 or up. For Python applications, you need to add this above library and its dependencies when deploying your application. See the Deploying subsection below. For experimenting on spark-shell, you need to add this above library and its dependencies too when invoking spark-shell. Also, see the Deploying subsection below.
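As a minimal PySpark sketch of what this looks like once the package is supplied at launch (the broker address and topic name below are placeholders, not from the question):

# Launch with the matching package, e.g.:
# pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaStructuredStreamingExample").getOrCreate()

# Read a Kafka topic as a streaming DataFrame; broker and topic are hypothetical placeholders.
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "my-topic") \
    .load()

# Kafka keys and values arrive as binary, so cast them to strings before processing.
messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")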
Spark Streaming
If you plan to work with Spark Streaming (Direct API), you can follow the guidance given here:
For Scala/Java applications using SBT/Maven project definitions, link your streaming application with the following artifact (see Linking section in the main programming guide for further information).
groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.12
version = 3.0.1
I'm trying to read an Avro file using Jupyter notebook in Azure HDInsight 4.0 with Spark 2.4.
I'm not able to properly provide the required .jar file to the Spark session.
I've tried the approaches suggested in How to use Avro on HDInsight Spark/Jupyter? and in https://learn.microsoft.com/en-in/azure/hdinsight/spark/apache-spark-jupyter-notebook-use-external-packages, but I guess they are tied to Spark 2.3:
%%configure
{ "conf": {"spark.jars.packages": "com.databricks:spark-avro_2.11:4.0.0" }}
This produces the error message:
pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
The solution that seems to work is:
%%configure -f
{ "conf": {"spark.jars.packages": "org.apache.spark:spark-avro_2.11:2.4.0" }}
I was looking for some info on the MSDN forums but couldn't find a good forum. While reading on the Spark site I got the hint that I would have better chances here.
So, bottom line: I want to read from Blob storage where there is a continuous feed of XML files, all small files; finally we store these files in an Azure DW.
Using Azure Databricks I can use Spark and Python, but I can't find a way to 'read' the XML type. Some sample scripts used the xml.etree.ElementTree library, but I can't get it imported.
So any help pushing me in a good direction is appreciated.
One way is to use the Databricks spark-xml library:
Import the spark-xml library into your workspace
https://docs.databricks.com/user-guide/libraries.html#create-a-library (search spark-xml in the maven/spark package section and import it)
Attach the library to your cluster https://docs.databricks.com/user-guide/libraries.html#attach-a-library-to-a-cluster
Use the following code in your notebook to read the XML file, where "note" is the root tag of my XML file.
xmldata = spark.read.format('xml').option("rootTag","note").load('dbfs:/mnt/mydatafolder/xmls/note.xml')
I found this one really helpful:
https://github.com/raveendratal/PysparkTelugu/blob/master/Read_Write_XML_File.ipynb
He has a YouTube video walking through the steps as well.
In summary, there are two approaches:
install it in your Databricks cluster from the 'Libraries' tab, or
install it by launching spark-shell in the notebook itself.
I got one solution for reading an XML file in Databricks:
Install this library: com.databricks:spark-xml_2.12:0.11.0
using this cluster configuration: 10.5 (includes Apache Spark 3.2.1, Scala 2.12).
Using this command (%fs head "") you will get the rootTag and rowTag.
df = spark.read.format('xml').option("rootTag","orders").option("rowTag","purchase_item").load("dbfs:/databricks-datasets/retail-org/purchase_orders/purchase_orders.xml")
display(df)
I have a Hadoop cluster deployed using Hortonworks HDP 2.2 (Spark 1.2.1 & Hive 0.14).
I have developed a simple Spark app that is supposed to retrieve the content of a Hive table, perform some actions, and output to a file. The Hive table was imported using Hive's built-in SerDe.
When I run the app on the cluster I get the following exception:
ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class org.apache.hadoop.hive.serde2.OpenCSVSerde not found
java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.serde2.OpenCSVSerde not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1982)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:337)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:281)
at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:631)
at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:189)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1017)
...
Basically, Spark doesn't find Hive's SerDe (org.apache.hadoop.hive.serde2.OpenCSVSerde).
I didn't find any JAR to include at the app's execution, nor any mention of a similar problem anywhere. I have no idea how to tell Spark where to find it.
Make a shaded JAR of your application which includes the hive-serde JAR. Refer to this.
Add the JAR file to the Spark config spark.driver.extraClassPath.
Any external JARs must be added there; the Spark environment will then load them automatically.
Or use the spark-shell --jars command.
Example:
spark.executor.extraClassPath /usr/lib/hadoop/lib/csv-serde-0.9.1.jar
The .jar was in Hive's lib folder; I just had to add it on launch with --jars and know where to look!
--jars /usr/hdp/XXX/hive/lib/hive-serde-XXX.jar
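For completeness, a minimal PySpark sketch of that launch-plus-query pattern on Spark 1.2. The table name and output path are hypothetical, and the XXX segments in the JAR path are left as in the answer above:

# Submit with the SerDe JAR on the classpath, e.g.:
# spark-submit --jars /usr/hdp/XXX/hive/lib/hive-serde-XXX.jar my_app.py
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="HiveOpenCSVSerdeExample")
hive_ctx = HiveContext(sc)

# "my_csv_table" is a hypothetical Hive table declared with OpenCSVSerde.
result = hive_ctx.sql("SELECT * FROM my_csv_table")

# Write the rows back out as plain text (output path is a placeholder).
result.map(lambda row: ",".join(str(c) for c in row)).saveAsTextFile("/tmp/my_csv_table_out")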