How can I read an XML file in Azure Databricks with Spark - azure

I was looking for some info on the MSDN forums but couldn't find a good forum. While reading on the Spark site I got the hint that I would have better chances here.
So, bottom line: I want to read from Blob storage, where there is a continuous feed of XML files (all small files), and finally store these files in an Azure DW.
Using Azure Databricks I can use Spark and Python, but I can't find a way to 'read' the XML type. Some sample scripts used the xml.etree.ElementTree library, but I can't get it imported.
So any help pushing me in a good direction is appreciated.

One way is to use the Databricks spark-xml library:
Import the spark-xml library into your workspace:
https://docs.databricks.com/user-guide/libraries.html#create-a-library (search for spark-xml in the Maven/Spark package section and import it)
Attach the library to your cluster: https://docs.databricks.com/user-guide/libraries.html#attach-a-library-to-a-cluster
Use the following code in your notebook to read the XML file, where "note" is the root tag of my XML file.
xmldata = spark.read.format('xml').option("rootTag","note").load('dbfs:/mnt/mydatafolder/xmls/note.xml')
Example:
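For a fuller picture, here is a minimal sketch, assuming the library is attached to the cluster and the mounted path from the question exists (both assumptions); rowTag names the element that becomes one DataFrame row:
xmldata = (spark.read.format("xml")
    .option("rowTag", "note")  # element that maps to one DataFrame row
    .load("dbfs:/mnt/mydatafolder/xmls/note.xml"))
xmldata.printSchema()
display(xmldata)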

I found this one really helpful:
https://github.com/raveendratal/PysparkTelugu/blob/master/Read_Write_XML_File.ipynb
He has a YouTube video that walks through the steps as well.
In summary, there are 2 approaches:
Install the library on your Databricks cluster via the 'Library' tab.
Install it by passing the package when launching spark-shell itself (see the line below).
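For the second approach, the launch command looks roughly like this (the coordinate and version are assumptions; pick the build that matches your cluster's Scala version):
spark-shell --packages com.databricks:spark-xml_2.12:0.14.0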

I got one solution for reading an XML file in Databricks:
Install this library: com.databricks:spark-xml_2.12:0.11.0
using this cluster configuration: 10.5 (includes Apache Spark 3.2.1, Scala 2.12).
Using this command (%fs head "") with the file path, you can inspect the beginning of the file and find the rootTag and rowTag.
df = spark.read.format('xml').option("rootTag","orders").option("rowTag","purchase_item").load("dbfs:/databricks-datasets/retail-org/purchase_orders/purchase_orders.xml")
display(df)
(Reference screenshot of the solution for reading the XML file in Databricks omitted.)
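As a Python sketch of the inspection step mentioned above (dbutils is the built-in Databricks utility object; the path is the sample dataset used in this answer):
# Peek at the first 500 bytes to spot the rootTag ("orders") and rowTag ("purchase_item").
print(dbutils.fs.head("dbfs:/databricks-datasets/retail-org/purchase_orders/purchase_orders.xml", 500))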

Related

Apache Spark Connector - where to install on Databricks

The Apache Spark connector: SQL Server & Azure SQL article from the Azure team describes how to use this connector.
Question: If you want to use the above connector in Azure Databricks, where will you install it?
Remarks: The above article tells you to install it from here and import it in, say, your notebook using com.microsoft.azure:spark-mssql-connector_2.12:1.2.0. But it does not tell you where to install it. I'm probably not understanding the article correctly. I need to use it in Azure Databricks and would like to know where to install the connector jar (compiled) file.
You can do this in the cluster setup. See this documentation: https://databricks.com/blog/2015/07/28/using-3rd-party-libraries-in-databricks-apache-spark-packages-and-maven-libraries.html
In short, when setting up the cluster, you can add third party libraries by their Maven coordinates - "com.microsoft.azure:spark-mssql-connector_2.12:1.2.0" is an example of a Maven coordinate.
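Once the coordinate is attached to the cluster, usage from a notebook looks roughly like this (a sketch only; the server, database, table, and credentials are placeholders, and the format name is the one used in the connector's documentation):
# Placeholder connection details - replace with your own.
url = "jdbc:sqlserver://<your-server>.database.windows.net;databaseName=<your-database>"

df = (spark.read
    .format("com.microsoft.sqlserver.jdbc.spark")  # format exposed by spark-mssql-connector
    .option("url", url)
    .option("dbtable", "<your-table>")
    .option("user", "<username>")
    .option("password", "<password>")
    .load())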

import org.apache.spark.streaming.kafka._ Cannot resolve symbol kafka

I have created a Spark application to integrate with Kafka and get a stream of data from Kafka.
But when I try to import org.apache.spark.streaming.kafka._, an error occurs: Cannot resolve symbol kafka. So what should I do to import this library?
Depending on your Spark and Scala versions, you need to include the Spark-Kafka integration library in your dependencies.
Spark Structured Streaming
If you plan to use Spark Structured Streaming you need to add the following to your dependencies as described here:
For Scala/Java applications using SBT/Maven project definitions, link your application with the following artifact:
groupId = org.apache.spark
artifactId = spark-sql-kafka-0-10_2.12
version = 3.0.1
Please note that to use the headers functionality, your Kafka client version should be version 0.11.0.0 or up. For Python applications, you need to add this above library and its dependencies when deploying your application. See the Deploying subsection below. For experimenting on spark-shell, you need to add this above library and its dependencies too when invoking spark-shell. Also, see the Deploying subsection below.
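As an illustration in PySpark, once the spark-sql-kafka package is on the classpath (the broker address and topic name below are made-up placeholders):
# Read a Kafka topic as a streaming DataFrame.
df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "my_topic")                       # placeholder topic
    .load())
# Kafka keys and values arrive as binary; cast them to strings to work with them.
lines = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")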
Spark Streaming
If you plan to work with Spark Streaming (Direct API), you can follow the guidance given here:
For Scala/Java applications using SBT/Maven project definitions, link your streaming application with the following artifact (see Linking section in the main programming guide for further information).
groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.12
version = 3.0.1

Read/Load avro file from s3 using pyspark

Using an AWS Glue developer endpoint, Spark version 2.4, Python version 3.
Code:
df=spark.read.format("avro").load("s3://dataexport/users/prod-users.avro")
Getting the following error message while trying to read avro file:
Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;
I found the following links, but they were not helpful in resolving my issue:
Apache Avro Data Source Guide: https://spark.apache.org/docs/latest/sql-data-sources-avro.html
Apache Avro as a Built-in Data Source in Apache Spark 2.4
You just need to include the spark-avro package, for example:
org.apache.spark:spark-avro_2.11:2.4.3
Check which version you need here (the artifact version should match your Spark release).
Have you imported the package while starting the shell? If not, you need to start the shell as below. The package below is applicable for Spark 2.4+.
pyspark --packages com.databricks:spark-avro_2.11:4.0.0
Also, write it as below inside read.format:
df=spark.read.format("com.databricks.spark.avro").load("s3://dataexport/users/prod-users.avro")
Note: With this Databricks package, in pyspark you need to write 'com.databricks.spark.avro' instead of 'avro'.
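Alternatively, with the built-in Apache module the plain "avro" format name from the question works; a sketch (the package version is an assumption; match it to your Spark 2.4.x release):
# Start the session with the built-in Avro module on the classpath, for example:
#   pyspark --packages org.apache.spark:spark-avro_2.11:2.4.3
df = spark.read.format("avro").load("s3://dataexport/users/prod-users.avro")
df.printSchema()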

Load external jars to Zeppelin from s3

Pretty simple objective: load my custom/local jars from S3 into a Zeppelin notebook (using Zeppelin on AWS EMR).
Location of the Jar
s3://my-config-bucket/process_dataloader.jar
Following the Zeppelin documentation, I opened the interpreter settings (as in the following image) and set spark.jars in the properties, with the value s3://my-config-bucket/process_dataloader.jar.
I restarted the interpreter and then, in the notebook, I tried to import the class from the jar using the following:
import com.org.dataloader.DataLoader
but it throws the following
<console>:23: error: object org is not a member of package com
import com.org.dataloader.DataLoader
Any suggestions for solving this problem?
A bit late, but for anyone else who might need this in the future, try the option below.
"https://bucket/dev/jars/RedshiftJDBC41-1.2.12.1017.jar" here is basically your S3 object URL.
%spark.dep
z.reset()
z.load("https://bucket/dev/jars/RedshiftJDBC41-1.2.12.1017.jar")

HDInsight and Talend Open Studio for Big Data

I am currently working on a project in which I need to connect Talend Open Studio for Big Data (v6.3.1) to an Azure HDInsight (3.5) Hadoop cluster. So far, I am trying a simple example which consists of creating a Hive table.
For that, I am using the following diagram:
The Hive connection was configured as follows:
… and please find below the specifications of the tHiveCreateTable_1 node:
By running this process:
· The specified container and deployment blob are created (see image below), which makes me believe that everything is OK with the Windows Storage configuration.
· However, the tHiveCreateTable_1 node throws an error (see image below).
· I strongly believe that it's something related to the hostname and port.
· I tried to use the hostname of the cluster and the hostname of the Hive server that we can find in Ambari (see image below),
· but none of them worked as expected.
Has any one tried something similar to this?
Note: It seems reasonably important to say that the HDInsight version supported by Talend is 3.4; however, I am using 3.5, so that might be the issue.
Many thanks for your help in advance.
According to the official document about the differences between Hadoop components and versions available with HDInsight, HDInsight 3.5 is based on Hortonworks Data Platform (HDP) 2.5, while HDI 3.4 is based on HDP 2.4. However, there is no big version difference in their Hive components or other components. So my suggestion is to try creating an HDI 3.4 cluster using the same Azure Storage account as your current HDI 3.5 cluster, which should meet your needs without much additional effort.
