How to read a pickle file in pyspark on Databricks - apache-spark

I have a pickle file on Azure Storage Blob, that I want to read in spark. While reading the file it is giving some error.
df = spark.read.format('pickle').load(path)
Following is the error that I am receiving
java.lang.ClassNotFoundException: Failed to find data source: pickle. Please find packages at
http://spark.apache.org/third-party-projects.html
Version details
Spark 3.0.1, Scala 2.12
Any help would be appreciated.

Related

Missing library to ingest data into Azure Data Explorer with PySpark

I am trying to ingest data into Azure Data Explorer through PySpark with PyCharm IDE. However, I am having a lot of problems related to missing libraries when running my code.
According to Azure Data Explorer connector's page, I need to install the connector's jar and the two dependencies jar kusto-ingest and kusto-data.
After download all these 3 jar's and importing them to PySpark, I can't proceed with my data ingestion, it keeps returning me missing library errors. The first one is the azure-storage lib, then I've installed and imported the jar, it asks for adal4j lib, I do the same and it asks oauth2 lib, then json lib, azure-client-authentication lib, javax mail lib, and so on.
I've installed more than 10 jars and I still can't run this ingestion. Am I doing something wrong?
My PySpark version is 2.4. You can see my code below:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master('local[*]') \
.appName("Teste") \
.config('spark.jars', 'kusto-spark_2.4_2.11-2.5.2.jar,kusto-data-2.4.1.jar,kusto-ingest-2.4.1.jar,azure-storage-8.3.0.jar,json-20180813.jar,adal4j-1.6.5.jar') \
.getOrCreate()
# loading a test csv file
df = spark.read.csv('MOCK_DATA.csv', header=True, sep=',')
df.write.format("com.microsoft.kusto.spark.datasource")\
.option("kustoCluster", "myclustername")\
.option("kustoDatabase", "mydatabase")\
.option("kustoTable", "mytable")\
.option("kustoAadAppId", "myappid")\
.option("kustoAadAppSecret", "mysecret")\
.option("kustoAadAuthorityID", "myautorityid")\
.mode("Append")\
.save()
When working with a non-maven installation you need to use a JAR with all the dependencies.
You can take it from the github releases:
https://github.com/Azure/azure-kusto-spark/releases
or build yourself if its missing from the specific version by cloning the repo and running
mvn assembly:single

Read/Load avro file from s3 using pyspark

Using AWS glue developer endpoint Spark Version - 2.4 Python Version- 3
Code:
df=spark.read.format("avro").load("s3://dataexport/users/prod-users.avro")
Getting the following error message while trying to read avro file:
Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;
Found the following links, but not helpful to resolve my issue
https://spark.apache.org/docs/latest/sql-data-sources-avro.html[Apache Avro Data Source Guide][1]
Apache Avro as a Built-in Data Source in Apache Spark 2.4
You just need to import that package
org.apache.spark:spark-avro_2.11:4.0.0
Check which version you need here
Have you imported the package while starting the shell? If not you need to start a shell as below. Below package is applicable for spark 2.4+ version.
pyspark --packages com.databricks:spark-avro_2.11:4.0.0
Also write as below inside read.format:
df=spark.read.format("com.databricks.spark.avro").load("s3://dataexport/users/prod-users.avro")
Note: For pyspark you need to write 'com.databricks.spark.avro' instead of 'avro'.

Read Avro in Azure HDI4.0

I'm trying to read an Avro file using Jupyter notebook in Azure HDInsight 4.0 with Spark 2.4.
I'm not able to provide properly the .jar file to
I've tried the approach suggested in How to use Avro on HDInsight Spark/Jupyter? and in https://learn.microsoft.com/en-in/azure/hdinsight/spark/apache-spark-jupyter-notebook-use-external-packages but I guess they are related to Spark 2.3
%%configure
{ "conf": {"spark.jars.packages": "com.databricks:spark-avro_2.11:4.0.0" }}
This produce the error message:
pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
The solution that seem to work is
%%configure -f
{ "conf": {"spark.jars.packages": "org.apache.spark:spark-avro_2.11:2.4.0" }}

Load external jars to Zeppelin from s3

Pretty simple objective. Load my custom/local jars from s3 to zeppelin notebook (using zeppelin from AWS EMR).
Location of the Jar
s3://my-config-bucket/process_dataloader.jar
Following zeppelin documentation I opened the interpreter like in the following image and spark.jars in the properties name and its value is s3://my-config-bucket/process_dataloader.jar
I restarted the interpreter and then in the notebook I tried to import the jar using the following
import com.org.dataloader.DataLoader
but it throws the following
<console>:23: error: object org is not a member of package com
import com.org.dataloader.DataLoader
Any suggestions for solving this problem?
A bit late thought but for anyone else who might need this in future try below option,
https://bucket/dev/jars/RedshiftJDBC41-1.2.12.1017.jar" is basically your s3 object url.
%spark.dep
z.reset()
z.load("https://bucket/dev/jars/RedshiftJDBC41-1.2.12.1017.jar")

How can I read a XML file Azure Databricks Spark

I was looking for some info on the MSDN forums but couldn't find a good forum/ While reading on the spark site I've the hint that here I would have better chances.
So bottom line, I want to read a Blob storage where there is a contiguous feed of XML files, all small files, finaly we store these files in a Azure DW.
Using Azure Databricks I can use Spark and python, but I can't find a way to 'read' the xml type. Some sample script used a library xml.etree.ElementTree but I can't get it imported..
So any help pushing me a a good direction is appreciated.
One way is to use the databricks spark-xml library :
Import the spark-xml library into your workspace
https://docs.databricks.com/user-guide/libraries.html#create-a-library (search spark-xml in the maven/spark package section and import it)
Attach the library to your cluster https://docs.databricks.com/user-guide/libraries.html#attach-a-library-to-a-cluster
Use the following code in your notebook to read the xml file, where "note" is the root of my xml file.
xmldata = spark.read.format('xml').option("rootTag","note").load('dbfs:/mnt/mydatafolder/xmls/note.xml')
Example :
I found this one is really helpful.
https://github.com/raveendratal/PysparkTelugu/blob/master/Read_Write_XML_File.ipynb
he has a youtube to walk through the steps as well.
in summary, 2 approaches:
install in your databricks cluster at the 'library' tab.
install it via launching spark-shell in the notebook itself.
I got one solution of reading xml file in databricks:
install this library : com.databricks:spark-xml_2.12:0.11.0
using this (10.5 (includes Apache Spark 3.2.1, Scala 2.12)) cluster configuration.
Using this command (%fs head "") you will get the rootTag and rowTag.
df = spark.read.format('xml').option("rootTag","orders").option("rowTag","purchase_item").load("dbfs:/databricks-datasets/retail-org/purchase_orders/purchase_orders.xml")
display(df)
reference image for solution to read xml file in databricks

Resources