Spark SQL 1.3.0 + Parquet - apache-spark

USING SPARK-SQL:
I've created a table without Parquet in HDFS and everything is OK.
I've created the same table structure but with "stored as parquet"; I've also created the Parquet files, uploaded them to HDFS and run "load inpath 'hdfs://server/parquet_files'".
But when I try to execute "select * from table_name";
I get this exception:
Exception in thread "main" java.sql.SQLException: java.lang.IllegalArgumentException: Wrong FS: hdfs://server:8020/user/hive/warehouse/table_name, expected: file:///
Any tip?

Fixed by including the Hadoop configuration files (core-site.xml and hdfs-site.xml) in the Spark configuration.
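The "Wrong FS ... expected: file:///" message means Spark never sees the cluster's fs.defaultFS and falls back to the local file system. A minimal sketch of making those files visible from code (the /etc/hadoop/conf path is an assumption, and a modern SparkSession is used here for brevity):
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-read").getOrCreate()

// Make the cluster configuration visible to the Hadoop Configuration Spark uses
// (adjust the path to wherever core-site.xml/hdfs-site.xml live on the driver).
spark.sparkContext.hadoopConfiguration.addResource(new Path("/etc/hadoop/conf/core-site.xml"))
spark.sparkContext.hadoopConfiguration.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"))

// Shortcut for this particular error: set the default FS explicitly so the
// hdfs:// warehouse path no longer resolves against file:///.
spark.sparkContext.hadoopConfiguration.set("fs.defaultFS", "hdfs://server:8020")

spark.sql("select * from table_name").show()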

Related

Class org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1 not found while trying to write dataframe to Hive native parquet table

Conf
spark.conf.set('spark.sql.hive.convertMetastoreParquet', "true")
Hive table
spark.sql("create table table_name (ip string, user string) PARTITIONED BY (date date) STORED AS PARQUET")
InsertInto
df.write.insertInto("table_name", overwrite=True)
Error
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1
Btw, inserting into an ORC table works fine. Running on a cluster in client mode.
Is your hive-site.xml file present in the Spark config folder?
Edit:
Can you try with:
df.write.mode("overwrite").partitionBy("date").saveAsTable("db.table_name")
It should not be necessary to set any configuration beforehand or to run the SQL create statement.
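For reference, a minimal end-to-end sketch of that approach (enableHiveSupport() and the sample rows are assumptions added for illustration; table and column names follow the question):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-insert")
  .enableHiveSupport() // Hive-backed catalog; needs hive-site.xml visible to Spark
  .getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")

// Hypothetical sample rows matching the table layout from the question.
val df = Seq(("1.2.3.4", "someone", java.sql.Date.valueOf("2020-01-01")))
  .toDF("ip", "user", "date")

// Creates (or overwrites) the partitioned Parquet table in one step.
df.write.mode("overwrite").partitionBy("date").saveAsTable("db.table_name")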

spark.table fails with java.io.Exception: No FileSystem for Scheme: abfs

We have a custom file system class which is an extension of hadoop.fs.FileSystem. This file system has a URI scheme of abfs://. External Hive tables have been created over this data.
CREATE EXTERNAL TABLE testingCustomFileSystem (a string, b int, c double) PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 'abfs://<host>:<port>/user/name/path/to/data/'
Using beeline, I'm able to query the table and it fetches the results.
Now I'm trying to load the same table into a Spark DataFrame using spark.table('testingCustomFileSystem'), and it throws the following exception:
java.io.IOException: No FileSystem for scheme: abfs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.spark.sql.execution.datasources.CatalogFileIndex$$anonfun$2.apply(CatalogFileIndex.scala:77)
at org.apache.spark.sql.execution.datasources.CatalogFileIndex$$anonfun$2.apply(CatalogFileIndex.scala:75)
at scala.collection.immutable.Stream.map(Stream.scala:418)
The jar containing the CustomFileSystem (defining the abfs:// scheme) was loaded into the classpath and was also available.
How does spark.table parse a Hive table definition in the metastore and resolve the URI?
After looking into the Spark configurations, I noticed that setting the following Hadoop configuration resolved the issue:
hadoopConfiguration.set("fs.abfs.impl",<fqcn of the FileSystemImplementation>)
In Spark, this setting is applied right after the SparkSession creation (only the appName and master are set), like:
val spark = SparkSession
  .builder()
  .appName("Name")
  .master("yarn")
  .getOrCreate()
spark.sparkContext
.hadoopConfiguration.set("fs.abfs.impl",<fqcn of the FileSystemImplementation>)
and it worked!
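An alternative sketch is to pass the property through the builder with the spark.hadoop.* prefix, so it lands in the Hadoop Configuration before the catalog resolves any table locations (the FQCN below is a placeholder, not the actual class from the question):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Name")
  .master("yarn")
  // spark.hadoop.* properties are copied into the Hadoop Configuration at startup.
  .config("spark.hadoop.fs.abfs.impl", "com.example.fs.CustomAbfsFileSystem") // hypothetical FQCN
  .getOrCreate()

val df = spark.table("testingCustomFileSystem")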

Failed to open HDFS file after load data from Spark

I'm using Spark with Java.
I'm loading Parquet data into a Hive table as follows:
ds.write().mode("append").format("parquet").save(path);
Then I run:
spark.catalog().refreshTable("mytable");//mytable is External table
And when I then try to see the data from Impala, I get the following exception:
Failed to open HDFS file
No such file or directory. root cause: RemoteException: File does not exist
After I run refresh mytable in Impala, I can see the data.
How can I issue the refresh command from Spark?
I also tried
spark.sql("msck repair table mytable");
and it still doesn't work for me.
Any suggestions?
Thanks.
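One thing worth noting: spark.catalog().refreshTable() and msck repair table only touch the Spark/Hive side, while Impala keeps its own catalog cache, so its REFRESH statement has to reach an Impala daemon. A rough sketch of doing that over Impala's HiveServer2-compatible JDBC endpoint (host, port, security settings and the presence of the Hive JDBC driver on the classpath are all assumptions):
import java.sql.DriverManager

// Hypothetical host; 21050 is Impala's default HiveServer2-compatible port.
val conn = DriverManager.getConnection("jdbc:hive2://impala-host:21050/;auth=noSasl")
try {
  conn.createStatement().execute("REFRESH mytable")
} finally {
  conn.close()
}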

Spark in docker parquet error No predefined schema found

I have a local Spark test cluster (including R) based on https://github.com/gettyimages/docker-spark. In particular, this image is used: https://hub.docker.com/r/possibly/spark/
When trying to read a Parquet file with SparkR, the exception below occurs. Reading a Parquet file works without any problems on a local Spark installation.
myData.parquet <- read.parquet(sqlContext, "/mappedFolder/myFile.parquet")
16/03/29 20:36:02 ERROR RBackendHandler: parquet on 4 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
java.lang.AssertionError: assertion failed: No predefined schema found, and no Parquet data files or summary files found under file:/mappedFolder/myFile.parquet.
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$MetadataCache$$readSchema(ParquetRelation.scala:512)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun$12.apply(ParquetRelation.scala:421)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun$12.apply(ParquetRelation.scala:421)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.refresh(ParquetRelation.scala:421)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCac
Strangely, the error is the same even for files that don't exist.
However in the terminal I can see that the files are there:
/mappedFolder/myFile.parquet
root@worker:/mappedFolder/myFile.parquet# ls
_common_metadata part-r-00097-e5221f6f-e125-4f52-9f6d-4f38485787b3.gz.parquet part-r-00196-e5221f6f-e125-4f52-9f6d-4f38485787b3.gz.parquet
....
My initial Parquet file seems to have been corrupted during my test runs of the dockerized Spark.
To solve it: re-create the Parquet files from the original sources.
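As a quick sanity check (a sketch, assuming a Scala spark-shell on the same cluster), reading the directory directly shows whether the footer metadata is intact; a corrupted or empty directory reproduces the "No predefined schema found" assertion:
// Run in spark-shell (Spark 1.x API, to match the stack trace above).
val df = sqlContext.read.parquet("/mappedFolder/myFile.parquet")
df.printSchema() // fails with the same assertion if no readable footers are found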

Does Presto support Parquet format?

Running a CDH4 cluster with Impala, I created a Parquet table, and after adding the Parquet jar files to Hive, I can query the table using Hive.
I added the same set of jars to /opt/presto/lib and restarted the coordinator and workers:
parquet-avro-1.2.4.jar
parquet-cascading-1.2.4.jar
parquet-column-1.2.4.jar
parquet-common-1.2.4.jar
parquet-encoding-1.2.4.jar
parquet-format-1.0.0.jar
parquet-generator-1.2.4.jar
parquet-hadoop-1.2.4.jar
parquet-hive-1.2.4.jar
parquet-pig-1.2.4.jar
parquet-scrooge-1.2.4.jar
parquet-test-hadoop2-1.2.4.jar
parquet-thrift-1.2.4.jar
I still get this error when running a Parquet select query from Presto:
> select * from test_pq limit 2;
Query 20131116_144258_00002_d3sbt failed : org/apache/hadoop/hive/serde2/SerDe
Presto now supports Parquet automatically.
Try adding the jars to the Presto plugin directory instead of the Presto lib directory.
Presto automatically loads jars from the plugin directories.
