spark.table fails with java.io.IOException: No FileSystem for scheme: abfs - apache-spark

We have a custom file system class which extends hadoop.fs.FileSystem. This file system has a URI scheme of abfs://, and external Hive tables have been created over this data.
CREATE EXTERNAL TABLE testingCustomFileSystem (a string, b int, c double) PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 'abfs://<host>:<port>/user/name/path/to/data/'
Using beeline, I'm able to query the table and it fetches the results.
Now I'm trying to load the same table into a Spark DataFrame using spark.table("testingCustomFileSystem"), and it throws the following exception:
java.io.IOException: No FileSystem for scheme: abfs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.spark.sql.execution.datasources.CatalogFileIndex$$anonfun$2.apply(CatalogFileIndex.scala:77)
at org.apache.spark.sql.execution.datasources.CatalogFileIndex$$anonfun$2.apply(CatalogFileIndex.scala:75)
at scala.collection.immutable.Stream.map(Stream.scala:418)
The jar containing the custom FileSystem (defining the abfs:// scheme) was on the classpath and available at runtime.
How does spark.table parse a Hive table definition in the metastore and resolve the URI?

After looking into the Spark configuration, I noticed that setting the following Hadoop property resolved the issue:
hadoopConfiguration.set("fs.abfs.impl",<fqcn of the FileSystemImplementation>)
In Spark, this setting is applied right after the SparkSession is created (only the appName and master were used), like so:
val spark = SparkSession
  .builder()
  .appName("Name")
  .master("yarn")
  .getOrCreate()
spark.sparkContext
  .hadoopConfiguration.set("fs.abfs.impl", <fqcn of the FileSystemImplementation>)
and it worked!
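For reference, the same Hadoop property can also be supplied while the session is being built, since options prefixed with spark.hadoop. are copied into the Hadoop configuration Spark uses. A minimal PySpark sketch of that variant (the implementation class name below is a placeholder, not the actual FQCN):

from pyspark.sql import SparkSession

# Hedged sketch: "spark.hadoop.*" options are copied into the Hadoop Configuration,
# so the abfs scheme is registered before the first table location is resolved.
# com.example.fs.CustomAbfsFileSystem is a placeholder for the real implementation class.
spark = (SparkSession.builder
    .appName("Name")
    .master("yarn")
    .config("spark.hadoop.fs.abfs.impl", "com.example.fs.CustomAbfsFileSystem")
    .enableHiveSupport()  # needed so spark.table can see the Hive metastore
    .getOrCreate())

df = spark.table("testingCustomFileSystem")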

Related

Class org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1 not found while trying to write dataframe to Hive native parquet table

Conf
spark.conf.set('spark.sql.hive.convertMetastoreParquet', "true")
Hive table
spark.sql("create table table_name (ip string, user string) PARTITIONED BY (date date) STORED AS PARQUET")
InsertInto
df.write.insertInto("table_name", overwrite=True)
Error
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1
By the way, inserting into an ORC table works fine. Running on a cluster in client mode.
Is your hive-site.xml file present in the Spark config folder?
Edit:
Can you try with:
df.write.mode("overwrite").partitionBy("date").saveAsTable("db.table_name")
It should not be necessary to set any configuration beforehand or to run the SQL create statement.
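A minimal runnable sketch of that suggestion (the sample schema, the db.table_name target, and the assumption that the database already exists are placeholders, not taken from the question):

from datetime import date
from pyspark.sql import SparkSession

# Hedged sketch: let saveAsTable create the partitioned Parquet table itself,
# with no prior CREATE TABLE statement and no convertMetastoreParquet setting.
spark = SparkSession.builder.appName("hive-write").enableHiveSupport().getOrCreate()

df = spark.createDataFrame(
    [("1.2.3.4", "alice", date(2020, 1, 1))],
    ["ip", "user", "date"])

df.write.mode("overwrite").partitionBy("date").format("parquet").saveAsTable("db.table_name")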

Reading AVRO from Azure Datalake in Databricks

I am trying to read Event Hub data in AVRO format, and I am having issues loading the data into a DataFrame in Databricks.
Here's the code I am using. Please let me know if I am doing anything wrong
path='/mnt/datastore/origin/zone=raw/subject=customer_events/source=EventHub/ver=1.0/*.avro'
df = spark.read.format("com.databricks.spark.avro") \
.load(path)
Error
IllegalArgumentException: 'java.net.URISyntaxException: Relative path in absolute URI:
I did try some code to work around the error, but I am getting syntax errors:
import org.apache.spark.sql.SparkSession
SparkSession spark = SparkSession
.builder()
.config("spark.sql.warehouse.dir","/mnt/datastore/origin/zone=raw/subject=customer_events/source=EventHub/ver=1.0/")
.getOrCreate()
SyntaxError: invalid syntax
File "<command-265213674761208>", line 2
SparkSession spark = SparkSession
Relative path in absolute URI
You need to specify the protocol rather than use /mnt
For example, wasb://some/path/ if reading from Azure blobstore
You can also exclude *.avro since the Avro reader should already pick up all Avro files in the path
https://docs.databricks.com/data/data-sources/read-avro.html#python-api
And if you want to read from Event Hubs directly, it exposes a Kafka API, not a file path, AFAIK.
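If you do read the captured files as files, a hedged sketch of the suggested fix (container, storage account, and directory names below are placeholders; spark and display are predefined in Databricks notebooks):

# Point the reader at the storage protocol directly instead of the /mnt path,
# and drop the *.avro glob since the Avro reader picks up the files in the directory.
path = "wasbs://container@account.blob.core.windows.net/subject=customer_events/source=EventHub/ver=1.0/"

df = (spark.read
    .format("com.databricks.spark.avro")  # newer runtimes can use the built-in "avro" format
    .load(path))

display(df)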

Get data from subfolders of an unpartitioned hive table into a dataframe in spark

There is an external table in Hive pointing to an S3 location, and the table is not partitioned. The table points to a folder in S3, but the data is in multiple subfolders inside that folder.
This table can be queried in Hive, even though it is not partitioned, by setting a few properties like below:
set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
However, when the same table is used in Spark to load the data into a DataFrame with a SQL statement like df = sqlContext.sql("select * from table_name"), the action fails saying that the subfolders in the external S3 location are not files.
I tried setting the above Hive properties in Spark using sc.hadoopConfiguration.set("mapred.input.dir.recursive", "true"), but it did not help. It looks like this only helps for sc.textFile-style loading.
This can be achieved by setting the following property in Spark:
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true")
Note here that the property is set using sqlContext instead of sparkContext.
I tested this in Spark 1.6.2.
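Putting the two pieces together, a minimal sketch (assuming the pyspark shell on Spark 1.6, where sqlContext is already a HiveContext):

# Hedged sketch: enable recursive listing on the SQL context, then query the
# unpartitioned external table as usual.
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")

df = sqlContext.sql("select * from table_name")
df.show()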

save Spark dataframe to Hive: table not readable because "parquet not a SequenceFile"

I'd like to save data in a Spark (v 1.3.0) dataframe to a Hive table using PySpark.
The documentation states:
"spark.sql.hive.convertMetastoreParquet: When set to false, Spark SQL will use the Hive SerDe for parquet tables instead of the built in support."
Looking at the Spark tutorial, it seems that this property can be set:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
# code to create dataframe
my_dataframe.saveAsTable("my_dataframe")
However, when I try to query the saved table in Hive it returns:
hive> select * from my_dataframe;
OK
Failed with exception java.io.IOException:java.io.IOException:
hdfs://hadoop01.woolford.io:8020/user/hive/warehouse/my_dataframe/part-r-00001.parquet
not a SequenceFile
How do I save the table so that it's immediately readable in Hive?
I've been there...
The API is kinda misleading on this one.
DataFrame.saveAsTable does not create a Hive table, but rather an internal Spark data source table.
It also stores something into the Hive metastore, but not what you intend.
This remark was made on the spark-user mailing list regarding Spark 1.3.
If you wish to create a Hive table from Spark, you can use this approach:
1. Use CREATE TABLE ... via Spark SQL to register the table in the Hive metastore.
2. Use DataFrame.insertInto(tableName, overwriteMode) to write the actual data (Spark 1.3).
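A minimal sketch of that two-step approach in Spark 1.3-era PySpark (sc is the SparkContext provided by the pyspark shell; the table name and schema here are made-up examples):

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)

# Step 1: register the table in the Hive metastore with an explicit schema.
sqlContext.sql("CREATE TABLE IF NOT EXISTS my_hive_table (name STRING, value INT) STORED AS PARQUET")

# Step 2: write the DataFrame's rows into the existing Hive table.
my_dataframe = sqlContext.createDataFrame([("a", 1), ("b", 2)], ["name", "value"])
my_dataframe.insertInto("my_hive_table", overwrite=True)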
I hit this issue last week and was able to find a workaround
Here's the story:
I can see the table in Hive if I create the table without partitionBy:
spark-shell>someDF.write.mode(SaveMode.Overwrite)
.format("parquet")
.saveAsTable("TBL_HIVE_IS_HAPPY")
hive> desc TBL_HIVE_IS_HAPPY;
OK
user_id string
email string
ts string
But Hive can't understand the table schema (the schema is empty...) if I do this:
spark-shell>someDF.write.mode(SaveMode.Overwrite)
.partitionBy("ts")
.format("parquet")
.saveAsTable("TBL_HIVE_IS_NOT_HAPPY")
hive> desc TBL_HIVE_IS_NOT_HAPPY;
# col_name data_type from_deserializer
[Solution]:
spark-shell>sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
spark-shell>df.write
.partitionBy("ts")
.mode(SaveMode.Overwrite)
.saveAsTable("Happy_HIVE")//Suppose this table is saved at /apps/hive/warehouse/Happy_HIVE
hive> DROP TABLE IF EXISTS Happy_HIVE;
hive> CREATE EXTERNAL TABLE Happy_HIVE (user_id string,email string,ts string)
PARTITIONED BY(day STRING)
STORED AS PARQUET
LOCATION '/apps/hive/warehouse/Happy_HIVE';
hive> MSCK REPAIR TABLE Happy_HIVE;
The problem is that the data source table created through the DataFrame API (partitionBy + saveAsTable) is not compatible with Hive (see this link). With spark.sql.hive.convertMetastoreParquet set to false, as suggested in the doc, Spark only puts the data onto HDFS but won't create the table in Hive. You can then go into the Hive shell manually and create an external table with the proper schema and partition definition pointing to the data location.
I've tested this in Spark 1.6.1 and it worked for me. I hope this helps!
I have done this in PySpark, Spark version 2.3.0:
Create an empty table where we need to save/overwrite the data, like:
create table databaseName.NewTableName like databaseName.OldTableName;
Then run the command below:
df1.write.mode("overwrite").partitionBy("year","month","day").format("parquet").saveAsTable("databaseName.NewTableName");
The issue is that you can't read this table with Hive, but you can read it with Spark.
(Regarding MSCK REPAIR TABLE above: it registers partition metadata that doesn't already exist. In other words, it will add any partitions that exist on HDFS but not in the metastore to the Hive metastore.)
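For what it's worth, the same partition recovery can be triggered from Spark SQL as well; a hedged one-liner, assuming Spark 2.x and the table name used above:

# MSCK REPAIR TABLE scans the table location and registers partition directories
# that exist on storage but are missing from the metastore.
spark.sql("MSCK REPAIR TABLE databaseName.NewTableName")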

Spark Sql 1.3.0 + parquet

USING SPARK-SQL:
I've created a table without Parquet in HDFS and everything is OK.
I've created the same table structure but with "stored as parquet"; I've also created the Parquet files, uploaded them to HDFS, and run "load inpath 'hdfs://server/parquet_files'".
But when I try to execute "select * from table_name";
I get this exception:
Exception in thread "main" java.sql.SQLException: java.lang.IllegalArgumentException: Wrong FS: hdfs://server:8020/user/hive/warehouse/table_name, expected: file:///
Any tip?
Fixed by including the Hadoop configuration files (core-site.xml and hdfs-site.xml) in the Spark configuration.
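For context, a hedged sketch of why that works: without core-site.xml visible to Spark, Hadoop falls back to file:/// as the default filesystem, which is exactly what the "expected: file:///" error says. Setting fs.defaultFS by hand has the same effect; the hdfs://server:8020 address and app name are placeholders:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="parquet-check")
# Equivalent of shipping core-site.xml: tell Hadoop which filesystem is the default.
# (_jsc is PySpark's handle to the underlying Java SparkContext.)
sc._jsc.hadoopConfiguration().set("fs.defaultFS", "hdfs://server:8020")

sqlContext = HiveContext(sc)
sqlContext.sql("select * from table_name").show()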
