Reading AVRO from Azure Datalake in Databricks - apache-spark

I am trying to read Event Hub data in AVRO format, but I am having issues loading the data into a dataframe in Databricks.
Here's the code I am using. Please let me know if I am doing anything wrong.
path = '/mnt/datastore/origin/zone=raw/subject=customer_events/source=EventHub/ver=1.0/*.avro'
df = spark.read.format("com.databricks.spark.avro") \
    .load(path)
Error
IllegalArgumentException: 'java.net.URISyntaxException: Relative path in absolute URI:
I did try some code to work around the error, but I am getting syntax errors:
import org.apache.spark.sql.SparkSession
SparkSession spark = SparkSession
    .builder()
    .config("spark.sql.warehouse.dir", "/mnt/datastore/origin/zone=raw/subject=customer_events/source=EventHub/ver=1.0/")
    .getOrCreate()
SyntaxError: invalid syntax
File "<command-265213674761208>", line 2
SparkSession spark = SparkSession

Relative path in absolute URI
You need to specify the protocol rather than use /mnt.
For example, wasb://some/path/ if reading from Azure Blob storage.
You can also exclude *.avro, since the Avro reader should already pick up all Avro files in the path:
https://docs.databricks.com/data/data-sources/read-avro.html#python-api
And if you want to read from Event Hub directly, that exposes a Kafka API, not a file path, AFAIK.
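As a side note, the second snippet in the question fails with SyntaxError because SparkSession spark = SparkSession ... is Java/Scala syntax pasted into a Python cell; on Databricks the session already exists as spark. A minimal Python sketch of the approach described above, where the container and storage account names are placeholders, not values from the question:

# Hypothetical container/account; replace with your own, or keep using a mounted path.
path = "wasbs://events@mystorageaccount.blob.core.windows.net/subject=customer_events/source=EventHub/ver=1.0/"

df = (spark.read
      .format("com.databricks.spark.avro")  # plain "avro" also works on Spark 2.4+ runtimes
      .load(path))                          # no *.avro glob needed; the reader picks up Avro files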

Related

Writing avro files using Spark 2.3

I'm somewhat new to Spark. I understand that Avro read/write support was built into Spark 2.4, but unfortunately I'm limited to version 2.3 right now. I'm having trouble writing to Avro and keep getting errors. Am I not installing this properly?
Have used this in spark session setup:
avro_loc = "com.databricks:spark-avro_2.11:4.0.0"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages ' + avro_loc + ' pyspark-shell'
And I've tried these two versions of the write code:
df.write.mode('overwrite') \
    .option('batchsize', 10000) \
    .avro('{}/df.avro'.format(HDFS_LOC))

df.write.format('avro').save('/user/Data/df.avro')
I get these errors for the first and second snippet above, respectively:
AttributeError: 'DataFrameWriter' object has no attribute 'avro'
AnalysisException: 'Failed to find data source: avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;
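No answer is recorded here, but a hedged sketch of what usually works on Spark 2.3 with that package: the .avro(...) writer shortcut is a Scala-only implicit, so from PySpark the external package has to be addressed by its full format name, and PYSPARK_SUBMIT_ARGS must be set before the SparkSession is created so the package actually reaches the classpath.

import os

# Set before any SparkSession/SparkContext is created.
avro_loc = "com.databricks:spark-avro_2.11:4.0.0"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages ' + avro_loc + ' pyspark-shell'

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("avro-write").getOrCreate()

df = spark.createDataFrame([(1, "a")], ["id", "val"])  # toy dataframe for illustration

# On Spark 2.3 the external package is referenced by its full format name.
df.write.mode('overwrite') \
    .format('com.databricks.spark.avro') \
    .save('/user/Data/df.avro')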

How to resolve invalid column name on parquet file read itself in PySpark

I setup a standalone spark and a standalone HDFS.
I installed pyspark and was able to create spark session.
I uploaded one parquet file to HDFS under /data : hdfs://localhost:9000/data
I tried to create a dataframe out of this directory using PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').appName("test").getOrCreate()
df = spark.read.parquet("hdfs://localhost:9000/data").withColumnRenamed("Wafer ID", "Wafer_ID")
I am getting an invalid column name error even with withColumnRenamed.
I tried the following code as well, but I got the same error:
from pyspark.sql.functions import col
df = spark.read.parquet("hdfs://localhost:9000/data").select(col("Wafer ID").alias("Wafer_ID"))
I have means to change the column names manually (pandas) or use a different file entirely, but I want to know if there is a way to solve this problem in Spark itself.
What am I doing wrong?
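No answer is recorded here either, but one workaround along the lines the asker already mentions (renaming via pandas) is to rewrite the file once with Spark-safe column names and point Spark at the cleaned copy. A minimal sketch, assuming pandas plus pyarrow are installed and using hypothetical file paths:

import pandas as pd

# pandas/pyarrow tolerate spaces in column names, so rename outside Spark once.
pdf = pd.read_parquet("data.parquet")                     # hypothetical local copy of the file
pdf.columns = [c.replace(" ", "_") for c in pdf.columns]  # "Wafer ID" -> "Wafer_ID"
pdf.to_parquet("data_clean.parquet", index=False)

# After uploading the cleaned file back to HDFS:
df = spark.read.parquet("hdfs://localhost:9000/data_clean")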

Py4JJavaError: An error occurred while calling o389.csv

I'm new to PySpark. I'm running PySpark on Databricks, and my data is stored in Azure Data Lake Storage (ADLS). I'm trying to read a CSV file from ADLS into a PySpark data frame, so I wrote the following code:
import pyspark
from pyspark import SparkContext
from pyspark import SparkFiles
df = sqlContext.read.csv(SparkFiles.get("dbfs:mycsv path in ADSL/Data.csv"),
                         header=True, inferSchema=True)
But I'm getting this error message:
Py4JJavaError: An error occurred while calling o389.csv.
Can you suggest how to rectify this error?
The SparkFiles class is intended for accessing files shipped as part of the Spark job. If you just need to access a CSV file that is already on ADLS, use spark.read.csv directly, like:
df = spark.read.csv("dbfs:mycsv path in ADSL/Data.csv",
                    header=True, inferSchema=True)
Also, it's better not to use sqlContext; it's kept only for compatibility reasons.
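If the file is not under a DBFS mount, another option is to read it over the abfss:// scheme directly. A sketch, assuming an ADLS Gen2 account accessed with an account key; the container, account, and path below are placeholders rather than values from the question:

# Placeholders: replace the container, account name, key, and path with your own.
spark.conf.set(
    "fs.azure.account.key.mystorageaccount.dfs.core.windows.net",
    "<storage-account-access-key>")

df = spark.read.csv(
    "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/path/to/Data.csv",
    header=True, inferSchema=True)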

java.lang.NoSuchMethodError: org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders while reading from Azure Blob Storage

I am trying to read a CSV file stored in an Azure Storage Account. For that, I installed Spark on my virtual machine and am trying to read the CSV file into a dataframe from PySpark.
I read somewhere how to do that, followed the steps, and copied the latest hadoop-azure and azure-storage JAR files into my jars directory. Then I got this error:
NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities
I searched for this error and found that I need to use hadoop-azure-2.8.5.jar instead of the latest hadoop-azure JAR. So I replaced the latest hadoop-azure JAR with that one and executed my pyspark code again.
After executing my code, I encountered another error:
: java.lang.NoSuchMethodError:
org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/Class;)Lorg/apache/hadoop/conf/Configuration;
Also, below is my pyspark code:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Window
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.getOrCreate()

storage_account_name = "<storage_account_name>"
storage_account_access_key = "<storage_account_access_key>"

spark.conf.set("fs.azure.account.key." + storage_account_name + ".blob.core.windows.net", storage_account_access_key)
spark._jsc.hadoopConfiguration().set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark._jsc.hadoopConfiguration().set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark._jsc.hadoopConfiguration().set("fs.azure.account.key.my_account.blob.core.windows.net", "storage_account_access_key")

df = spark.read.format("csv").option("inferSchema", "true").load("wasbs://<container_name>@<storage_account_name>.blob.core.windows.net/<path_to_csv>/sample_file.csv")
df.show()
I searched for this and tried various hadoop-azure JAR versions. The one which worked for me was hadoop-azure-2.7.0.jar.
With this JAR version, I was able to read the CSV file from Blob storage.
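A related way to avoid matching JARs by hand (a sketch, not part of the original answer) is to let --packages resolve hadoop-azure and its matching azure-storage dependency from Maven, picking the version that corresponds to the Hadoop build bundled with Spark (2.7.0 in the answer above):

import os

# Must be set before the SparkSession is created; Maven pulls the transitive
# azure-storage dependency that matches this hadoop-azure version.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.hadoop:hadoop-azure:2.7.0 pyspark-shell')

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()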

spark.table fails with java.io.IOException: No FileSystem for scheme: abfs

We have a custom file system class which is an extension of hadoop.fs.FileSystem. This file system uses the abfs:// URI scheme. External Hive tables have been created over this data:
CREATE EXTERNAL TABLE testingCustomFileSystem (a string, b int, c double)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 'abfs://<host>:<port>/user/name/path/to/data/'
Using beeline, I'm able to query the table and it fetches the results.
Now I'm trying to load the same table into a Spark dataframe using spark.table('testingCustomFileSystem'), and it throws the following exception:
java.io.IOException: No FileSystem for scheme: abfs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.spark.sql.execution.datasources.CatalogFileIndex$$anonfun$2.apply(CatalogFileIndex.scala:77)
at org.apache.spark.sql.execution.datasources.CatalogFileIndex$$anonfun$2.apply(CatalogFileIndex.scala:75)
at scala.collection.immutable.Stream.map(Stream.scala:418)
The JAR containing the CustomFileSystem (defining the abfs:// scheme) was on the classpath and available.
How does spark.table parse a Hive table definition in the metastore and resolve the URI?
After looking into the Spark configuration, I noticed that setting the following Hadoop configuration resolved the issue:
hadoopConfiguration.set("fs.abfs.impl", <fqcn of the FileSystem implementation>)
In Spark, this setting is applied right after the SparkSession is created (the builder only needs the appName and master), like:
val spark = SparkSession
    .builder()
    .appName("Name")
    .master("yarn")
    .getOrCreate()

spark.sparkContext
    .hadoopConfiguration.set("fs.abfs.impl", <fqcn of the FileSystem implementation>)
and it worked!
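An equivalent way to apply the same setting at session creation time (a sketch; the fully qualified class name below is a placeholder for the custom FileSystem implementation) is the spark.hadoop.* prefix, which Spark copies into the underlying Hadoop configuration before any table is resolved. Shown here in PySpark for consistency with the other snippets:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Name")
         .master("yarn")
         .config("spark.hadoop.fs.abfs.impl",
                 "com.example.fs.CustomAbfsFileSystem")  # placeholder FQCN
         .getOrCreate())

df = spark.table("testingCustomFileSystem")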
