External table on parquet files. .count() works, .show() fails - apache-spark

I defined an external table on a group of partitioned parquet files as such:
CREATE EXTERNAL TABLE foobarbaz (
src_file string,
[...]
temperature string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION '{1}'
If I then run
df = spark.table("foobarbaz")
print(df.count())
I get the correct non-zero result.
If I run
df = spark.table("foobarbaz")
df.show()
PySpark raises
py4j.protocol.Py4JJavaError: An error occurred while calling o95.showString. [...] Caused by: java.lang.UnsupportedOperationException
Why?

I found an issue specific to my situation that may still be relevant to future readers. I extracted the schema using parquet-tools. One column was listed as int96, so in the schema definition I naively used the int type for this column. Closer investigation revealed that the column actually held datetime values (int96 is how Parquet files commonly store timestamps). Changing the schema definition accordingly resolved the issue.
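As a rough illustration of the mismatch described above, the following sketch maps the physical types that parquet-tools prints to Hive column types. The helper name hive_type_for and the mapping table are illustrative assumptions, not part of Spark or parquet-tools, and logical-type annotations (for example DATE on int32 or UTF8 on binary) can change the right choice. The entry that matters here is int96.

```python
# Hypothetical helper: suggest a Hive column type for a Parquet physical type
# as printed by parquet-tools. The mapping is a simplification and ignores
# logical-type annotations (DATE, DECIMAL, UTF8, ...).
PHYSICAL_TO_HIVE = {
    "boolean": "boolean",
    "int32": "int",
    "int64": "bigint",
    "int96": "timestamp",  # legacy timestamp encoding, not an integer
    "float": "float",
    "double": "double",
    "binary": "string",    # assumes the column holds UTF-8 text
}


def hive_type_for(physical_type):
    """Return the suggested Hive type, or raise ValueError if unknown."""
    key = physical_type.strip().lower()
    if key not in PHYSICAL_TO_HIVE:
        raise ValueError(f"no suggestion for Parquet type {physical_type!r}")
    return PHYSICAL_TO_HIVE[key]


print(hive_type_for("int96"))  # timestamp
```

This would also be consistent with count() succeeding while show() fails: counting rows most likely does not need to decode the column values, whereas displaying them does.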

Related

Reading Parquet file with Pyspark returns java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths

I am trying to load parquet files in the following directories:
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-1
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-2
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-3
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-4
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-5
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-6
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-7
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-8
This is what I wrote in PySpark:
s3_bucket_location_of_data = "s3://dir1/model=m1/version=newest/versionnumber=3/scores/"
df = spark.read.parquet(s3_bucket_location_of_data)
but I received the following error:
Py4JJavaError: An error occurred while calling o109.parquet.
: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-1
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-2
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-3
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-4
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-5
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-6
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-7
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-8
After reading other Stack Overflow posts about this error, I tried the following:
base_path="s3://dir1/" # I have tried to set this to "s3://dir1/model=m1/version=newest/versionnumber=3/scores/" as well, but it didn't work
s3_bucket_location_of_data = "s3://dir1/model=m1/version=newest/versionnumber=3/scores/"
df = spark.read.option("basePath", base_path).parquet(s3_bucket_location_of_data)
but that returned an error message similar to the one above. I am new to Spark/PySpark and I don't know what I am doing wrong here. Thank you in advance for your answers!
You don't need to specify the detailed path. Just load the files from the base_path.
df = spark.read.parquet("s3://dir1")
df = df.filter("model = 'm1' and version = 'newest' and versionnumber = 3")
The directory structure is already partitioned by three columns: model, version and versionnumber. Read from the base path and filter on those partition columns; Spark then only needs to read the Parquet files under the matching partition path.
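To see why model, version and versionnumber are treated as partition columns while the marketplace_id-N directories are not, here is a small sketch that splits a path into key=value segments. The function name split_partition_path is hypothetical, the path is assumed to have the form scheme://bucket/..., and this is a simplification of Spark's actual partition discovery.

```python
def split_partition_path(path):
    """Return (partition_values, other_segments) for a scheme://bucket/... path."""
    # Drop the scheme ("s3://") and the bucket name, keep the directories.
    segments = path.split("://", 1)[-1].strip("/").split("/")[1:]
    partitions, others = {}, []
    for segment in segments:
        key, sep, value = segment.partition("=")
        if sep and key and value:
            partitions[key] = value   # looks like a partition directory
        else:
            others.append(segment)    # plain directory, not key=value
    return partitions, others


path = "s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-1"
partitions, others = split_partition_path(path)
print(partitions)  # {'model': 'm1', 'version': 'newest', 'versionnumber': '3'}
print(others)      # ['scores', 'marketplace_id-1']
```

The values come back as strings here; Spark may infer a numeric type for a value such as versionnumber=3.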

Why is my Glue table created with the wrong path?

I'm creating a table in AWS Glue using a Spark job orchestrated by Airflow. The job reads JSON and writes a table. The command I use within the job is the following:
spark.sql(s"CREATE TABLE IF NOT EXISTS $database.$table using PARQUET LOCATION '$path'")
The odd thing is that other tables created by the same job (with different names) are created without problems; for example, they have the location
s3://bucket_name/databases/my_db/my_perfectly_created_table
Exactly one table is created with this location:
s3://bucket_name/databases/my_db/my_problematic_table-__PLACEHOLDER__
I don't know where that -__PLACEHOLDER__ is coming from. I already tried deleting the table and recreating it but it always does the same thing on this exact table. The data is in parquet format in the path:
s3://bucket_name/databases/my_db/my_problematic_table
So I know the problem is only in how the table is created, because all I get is a single column col (array<string>) when I query it in Athena (as there is no data in /my_problematic_table-__PLACEHOLDER__).
Have any of you guys dealt with this before?
Upon closer inspection in AWS Glue, this problematic table had the following config, which is specific to CSV files and custom delimiters:
Input Format org.apache.hadoop.mapred.SequenceFileInputFormat
Output Format org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Serde serialization library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
while my other tables had the config specific to Parquet:
Input Format org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
Output Format org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Serde serialization library org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
I tried to create the table forcing the config for parquet with the following command:
val path = "s3://bucket_name/databases/my_db/my_problematic_table/"
val my_table = spark.read.format("parquet").load(path)
val ddlSchema = my_table.toDF.schema.toDDL
spark.sql(s"""
|CREATE TABLE IF NOT EXISTS my_db.manual_myproblematic_table($ddlSchema)
|ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
|STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
|OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
|LOCATION '$path'
|""".stripMargin
)
but it threw the following error:
org.apache.spark.SparkException: Cannot recognize hive type string: struct<1:string,2:string,3:string>, column: problematic_column
So the problem was the naming of the fields "1", "2" and "3" within that struct.
Given that this struct did not contain valuable info, I ended up dropping it and creating the table again. Now it works like a charm and has the correct (Parquet) config in Glue.
Hope this helps someone.
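Since the failure came from struct field names, a quick pre-check of the type string can flag such columns before the DDL is run. The sketch below is only an illustration: the helper name suspicious_field_names is made up, and the rule it applies (a safe field name starts with a letter or underscore) is a conservative guess rather than Hive's documented grammar. The post above only confirms that the names "1", "2" and "3" were rejected.

```python
import re

# Field names appear right after "<" or "," and before ":" in a Hive type string.
FIELD_NAME = re.compile(r"[<,]\s*([^:<>,\s]+)\s*:")
# Conservative assumption: a safe name starts with a letter or underscore.
SAFE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def suspicious_field_names(type_string):
    """Return struct field names that do not look like plain identifiers."""
    names = FIELD_NAME.findall(type_string)
    return [name for name in names if not SAFE_NAME.match(name)]


print(suspicious_field_names("struct<1:string,2:string,3:string>"))  # ['1', '2', '3']
print(suspicious_field_names("struct<id:int,tags:array<string>>"))   # []
```

Columns flagged this way could then be renamed or dropped before creating the table, as was done above.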

Writing spark.sql dataframe result to parquet file

I enabled the following spark.sql session:
# creating Spark context and connection
spark = (SparkSession.builder.appName("appName").enableHiveSupport().getOrCreate())
and am able to see the results of the following query:
spark.sql("select year(plt_date) as Year, month(plt_date) as Mounth, count(build) as B_Count, count(product) as P_Count from first_table full outer join second_table on key1=CONCAT('SS',key_2) group by year(plt_date), month(plt_date)").show()
However, when I try to write the resulting dataframe from this query to hdfs, I get the following error:
I am able to save the resulting dataframe of a simpler version of this query to the same path. The problem appears when I add functions such as count(), year(), etc.
What is the problem, and how can I save the results to HDFS?
The error is caused by the '(' in the column name 'year(CAST(plt_date AS DATE))'.
Rename the column with an alias:
data = data.selectExpr("year(CAST(plt_date AS DATE)) as nameofcolumn")
See: Rename Spark Column
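If several columns are affected, the names can be cleaned up in one pass before writing. The sketch below replaces the characters that Spark's Parquet writer has traditionally rejected in column names (space, comma, semicolon, braces, parentheses, newline, tab and the equals sign) with underscores. The helper name parquet_safe_name is hypothetical, and newer Spark versions may be less strict.

```python
import re

# Characters rejected in Parquet column names by older Spark versions.
INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]+")


def parquet_safe_name(name):
    """Replace runs of invalid characters with one underscore and trim the ends."""
    return INVALID_CHARS.sub("_", name).strip("_")


print(parquet_safe_name("year(CAST(plt_date AS DATE))"))  # year_CAST_plt_date_AS_DATE
```

With a DataFrame df, something like df.toDF(*[parquet_safe_name(c) for c in df.columns]) would apply the renaming to every column before the write.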

PySpark cannot insertInto Hive table because "Can only write data to relations with a single path"

I have a Hive Orc table with a definition similar to the following definition
CREATE EXTERNAL TABLE `example.example_table`(
...
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'path'='s3a://path/to/table')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3a://path/to/table'
TBLPROPERTIES (
...
)
I am attempting to use PySpark to append a dataframe to this table using "df.write.insertInto("example.example_table")". When running this, I get the following error:
org.apache.spark.sql.AnalysisException: Can only write data to relations with a single path.;
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:188)
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:134)
...
When looking at the underlying Scala code, the condition that throws this error is checking to see if the table location has multiple "rootPaths". Obviously, my table is defined with a single location. What else could cause this?
It is the path you are defining that causes the error. I just ran into this same problem myself. Hive generates a location path based on the hive.metastore.warehouse.dir property, so you have that default location plus the path you specified, which causes the linked code to fail.
If you want to pick a specific path other than the default, then try using LOCATION.
Try running a describe extended example.example_table query to see more detailed information on the table. One of the output rows will be Detailed Table Information, which contains a lot of useful information:
Table(
tableName:
dbName:
owner:
createTime:1548335003
lastAccessTime:0
retention:0
sd:StorageDescriptor(cols:
location:[*path_to_table*]
inputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
outputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
compressed:false
numBuckets:-1
serdeInfo:SerDeInfo(
name:null
serializationLib:org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
parameters:{
serialization.format=1
path=[*path_to_table*]
}
)
bucketCols:[]
sortCols:[]
parameters:{}
skewedInfo:SkewedInfo(skewedColNames:[]
skewedColValues:[]
skewedColValueLocationMaps:{})
storedAsSubDirectories:false
)
partitionKeys:[]
parameters:{transient_lastDdlTime=1548335003}
viewOriginalText:null
viewExpandedText:null
tableType:MANAGED_TABLE
rewriteEnabled:false
)
We had the same problem in a project when migrating from Spark 1.x and HDFS to Spark 3.x and S3. We solved this issue by setting the following Spark property to false:
spark.sql.hive.convertMetastoreParquet
You can just run
spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")
Or
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
Here spark is the SparkSession object. The explanation for this is in the Spark documentation.

Spark: write timestamp to parquet and read it from Hive / Impala

I need to write a timestamp into parquet, then read it with Hive and Impala.
To write it, I tried, for example:
my.select(
...,
unix_timestamp() as "myts"
)
.write
.parquet(dir)
Then to read I created an external table in Hive:
CREATE EXTERNAL TABLE IF NOT EXISTS mytable (
...
myts TIMESTAMP
)
Doing so, I get the error
HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
I also tried replacing unix_timestamp() with
to_utc_timestamp(lit("2018-05-06 20:30:00"), "UTC")
and got the same problem. Impala returns:
Column type: TIMESTAMP, Parquet schema: optional int64
whereas timestamps are supposed to be int96.
What is the correct way to write timestamp into parquet?
Found a workaround: use a UDF that returns a java.sql.Timestamp object, with no casting; Spark will then save it as int96.
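For context, unix_timestamp() returns epoch seconds as a long, which is why the file ends up with an int64 column. An int96 timestamp is laid out differently: 8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian. The sketch below encodes a datetime that way in plain Python purely to illustrate the layout; the function names are hypothetical and this is not how Spark is meant to be used to write the file.

```python
import calendar
import struct
from datetime import datetime, timezone

JULIAN_DAY_OF_EPOCH = 2440588  # Julian day number of 1970-01-01
SECONDS_PER_DAY = 86400


def to_int96(ts):
    """Encode a datetime (interpreted in UTC) as a 12-byte INT96 timestamp."""
    epoch_seconds = calendar.timegm(ts.utctimetuple())
    days, seconds_of_day = divmod(epoch_seconds, SECONDS_PER_DAY)
    nanos_of_day = seconds_of_day * 1_000_000_000 + ts.microsecond * 1_000
    # 8 bytes of nanoseconds within the day, then 4 bytes of Julian day.
    return struct.pack("<qi", nanos_of_day, JULIAN_DAY_OF_EPOCH + days)


def from_int96(raw):
    """Decode 12 INT96 bytes into (julian_day, nanos_of_day)."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return julian_day, nanos_of_day


encoded = to_int96(datetime(2018, 5, 6, 20, 30, 0, tzinfo=timezone.utc))
print(len(encoded))  # 12
print(from_int96(encoded))
```

A bare epoch-seconds long carries none of this structure, so a reader expecting a TIMESTAMP column cannot interpret it.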
