Unable to read parquet file locally in spark - apache-spark

I am running PySpark locally and trying to read a Parquet file into a DataFrame from a notebook.
df = spark.read.parquet("metastore_db/tmp/userdata1.parquet")
I am getting this exception:
An error occurred while calling o738.parquet.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Does anyone know how to do it?

Assuming that you are running Spark on your local machine, use an explicit file:// URI. The URI must be absolute, so adjust the path below to the file's full location:
df = spark.read.parquet("file:///metastore_db/tmp/userdata1.parquet")
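If the file actually lives under the notebook's working directory (an assumption based on the relative path in the question), the absolute file:// URI can be built instead of typed by hand. A minimal sketch using only the Python standard library; the helper name to_file_uri is made up for illustration, and the commented-out read assumes an active SparkSession called spark, as in the question:

```python
from pathlib import Path


def to_file_uri(path):
    """Return an absolute file:// URI for a local path.

    Relative paths are resolved against the current working directory.
    The path does not have to exist for the URI to be built.
    """
    return Path(path).resolve().as_uri()


# Path taken from the question; adjust it to where the file really is.
uri = to_file_uri("metastore_db/tmp/userdata1.parquet")
print(uri)

# With an active SparkSession named `spark` (not created here):
# df = spark.read.parquet(uri)
```

Printing the URI first makes it easy to confirm that Spark will be pointed at the directory you expect.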

Related

Problem when calling createOrReplaceGlobalTempView on a dataframe

On my new Windows machine, I am using Spark 3.1.1 (with winutils for Hadoop) to create a global temp view from a CSV file, like so:
DF.createOrReplaceGlobalTempView("firstTable");
where DF is a Dataset<Row> containing the CSV data.
I get the following error when the global view is created:
Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: Error while running command to get file permissions : ExitCodeException exitCode=-1073741515:
What's the problem?
Thanks

Py4JJavaError: An error occurred while calling o389.csv

I'm new to PySpark and I'm running it on Databricks. My data is stored in Azure Data Lake Storage (ADLS). I'm trying to read a CSV file from ADLS into a PySpark DataFrame, so I wrote the following code:
import pyspark
from pyspark import SparkContext
from pyspark import SparkFiles
df = sqlContext.read.csv(SparkFiles.get("dbfs:mycsv path in ADSL/Data.csv"),
header=True, inferSchema= True)
But I'm getting this error message:
Py4JJavaError: An error occurred while calling o389.csv.
Can you suggest how to fix this error?

The SparkFiles class is intended for accessing files that were shipped as part of the Spark job. If you only need to read a CSV file that is already available on ADLS, pass its path directly to spark.read.csv, like:
df = spark.read.csv("dbfs:mycsv path in ADSL/Data.csv",
header=True, inferSchema=True)
It's better not to use sqlContext; it is kept only for compatibility reasons.

Error encountered writing to Kudu using Spark / Scala

I'm trying to write data into Kudu from Spark and I'm getting this error:
java.lang.UnsupportedOperationException
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:296)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:174)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
Code example:
df.write.options(Map("kudu.master"-> "ip:7051",
"kudu.table"-> "test_kudu")).mode("append").kudu
Used libraries:
org.apache.kudu:kudu-spark2_2.11:1.8.0
org.apache.kudu:kudu-client:1.8.0
Thanks!

Unable to save data frame as Hive table, throwing file not found exception

When I try to save a DataFrame as a Hive table in PySpark:
df_writer.saveAsTable('hive_table', format='parquet', mode='overwrite')
I get the following error:
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://hostname:8020/apps/hive/warehouse/testdb.db/hive_table
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
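The path exists only up to 'hdfs://hostname:8020/apps/hive/warehouse/testdb.db/'.
Any suggestions would be appreciated.

Try using DataFrameWriter.insertInto instead. The snippet below is Scala; dbName and t.table are placeholders for your database and table names: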
df.write.mode(SaveMode.Append).insertInto(s"${dbName}.${t.table}")

Spark in docker parquet error No predefined schema found

I have a local Spark test cluster, including R, based on https://github.com/gettyimages/docker-spark. In particular, this image is used: https://hub.docker.com/r/possibly/spark/
Reading a Parquet file with SparkR throws the exception below. Reading a Parquet file works without any problems on a local Spark installation.
myData.parquet <- read.parquet(sqlContext, "/mappedFolder/myFile.parquet")
16/03/29 20:36:02 ERROR RBackendHandler: parquet on 4 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
java.lang.AssertionError: assertion failed: No predefined schema found, and no Parquet data files or summary files found under file:/mappedFolder/myFile.parquet.
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$MetadataCache$$readSchema(ParquetRelation.scala:512)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun$12.apply(ParquetRelation.scala:421)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun$12.apply(ParquetRelation.scala:421)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.refresh(ParquetRelation.scala:421)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCac
Strangely, the error is the same even for non-existent files.
However, in the terminal I can see that the files are there:
/mappedFolder/myFile.parquet
root@worker:/mappedFolder/myFile.parquet# ls
_common_metadata part-r-00097-e5221f6f-e125-4f52-9f6d-4f38485787b3.gz.parquet part-r-00196-e5221f6f-e125-4f52-9f6d-4f38485787b3.gz.parquet
....

My initial Parquet file seems to have been corrupted during my test runs of the dockerized Spark.
To solve it, re-create the Parquet files from the original sources.
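One way to look for this kind of damage before re-creating the data: an unencrypted Parquet file begins and ends with the 4-byte magic number PAR1, so a truncated or overwritten part file can often be spotted without Spark. A rough sketch in Python (standard library only); the function names are hypothetical, the demo files are fake and created in a temporary directory purely for illustration, and passing this check does not prove that a file is fully readable:

```python
import tempfile
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # marker at the start and end of a Parquet file


def has_parquet_magic(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    path = Path(path)
    # Anything shorter than magic + 4-byte footer length + magic cannot be valid.
    if path.stat().st_size < 12:
        return False
    with path.open("rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # position 4 bytes before the end of the file
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def check_dataset_dir(dataset_dir):
    """Map each *.parquet part file in a directory to its magic-byte check."""
    root = Path(dataset_dir)
    return {p.name: has_parquet_magic(p) for p in sorted(root.glob("*.parquet"))}


# Demo with fake files (not real Parquet data), only to exercise the check.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "part-r-00000.gz.parquet"
    bad = Path(tmp) / "part-r-00001.gz.parquet"
    good.write_bytes(b"PAR1" + b"\x00" * 8 + b"PAR1")
    bad.write_bytes(b"PAR1" + b"\x00" * 8)  # truncated: trailing magic missing
    report = check_dataset_dir(tmp)

print(report)
```

In the setup from the question, this would be run inside the container as check_dataset_dir("/mappedFolder/myFile.parquet"); any part file reported as False is a candidate for re-creation.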
