HIVE Parquet error - apache-spark

I'm trying to insert the contents of a DataFrame into a partitioned, Parquet-formatted Hive table using
df.write.mode(SaveMode.Append).insertInto(myTable)
with hive.exec.dynamic.partition = 'true' and hive.exec.dynamic.partition.mode = 'nonstrict'.
I keep getting a parquet.io.ParquetEncodingException saying that
empty fields are illegal, the field should be ommited completely
instead.
The schema includes arrays (array<struct<int, string>>), and the DataFrame does contain some empty entries for these fields.
However, when I insert the DataFrame's content into a non-partitioned table, I do not get an error.
How can I fix this issue?
I have attached the error.
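For context, here is a minimal PySpark sketch of the setup being described; the source table name is hypothetical, and it assumes the partition column comes last in the DataFrame:
# Minimal sketch (PySpark); "staging_table" and "myTable" are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dynamic-partition-insert")
         .enableHiveSupport()
         .getOrCreate())

# Enable dynamic partitioning before the insert.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# insertInto resolves columns by position, so the partition column(s)
# must be the last column(s) of the DataFrame, in table-definition order.
df = spark.table("staging_table")
df.write.mode("append").insertInto("myTable")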

Related

Databricks schema enforcement issues

As suggested in the article about schema enforcement, a declared schema helps detect issues early.
However, the two issues described below are preventing me from creating a descriptive schema.
Comments on a table column are seen as a difference in the schema
# Get data
test_df = spark.createDataFrame([('100000146710',)], ['code'])
# ... save
test_df.write.format("delta").mode("append").save('/my_table_location')
# Create table: ... BOOM
spark.sql("""
CREATE TABLE IF NOT EXISTS my_table (
code STRING COMMENT 'Unique identifier'
) USING DELTA LOCATION '/my_table_location'
""")
This will fail with AnalysisException: The specified schema does not match the existing schema at /my_table_location. The only solution I found is to drop the column comments.
Not null struct field shows as nullable
from pyspark.sql.types import StructType, StructField, StringType

json_schema = StructType([
StructField("code", StringType(), False)
])
json_df = (spark.read
.schema(json_schema)
.json('/my_input.json')
)
json_df.printSchema()
will show
root
|-- code: string (nullable = true)
So despite the schema declaration stating that the field is not null, the field shows as nullable in the DataFrame. Because of this, adding a NOT NULL constraint on the table column will trigger an AnalysisException.
Any comments or suggestions are welcome.
With the execution of
test_df.write.format("delta").mode("append").save('/my_table_location')
you have already created a new Delta table with its specific schema as defined by test_df. This new table delta.`/my_table_location` already has the schema code STRING.
If you would like to include a comment in the schema, first create the table as you would like it defined, e.g.
spark.sql("""
CREATE TABLE my_table (
code STRING COMMENT 'unique identifier'
) USING DELTA LOCATION '/my_table_location'
""")
And then insert your data from your test_df into it, e.g.
test_df.createOrReplaceTempView("test_df_view")
spark.sql("""
INSERT INTO my_table (code) SELECT code FROM test_df_view
""")

How to insert value in already created Database table through pandas `df.to_sql()`

I'm creating a new table and then inserting values into it, because the TSV file doesn't have headers, so I need to create the table structure first and then insert the values. I'm using the df.to_sql function to insert the TSV values into the already-created database table, but it only creates the table; it does not insert any values into it, and it doesn't give any kind of error either.
I tried creating a new table through SQLAlchemy and inserting values, and that worked, but it didn't work for the already-created table.
import csv
import sys
import pandas as pd
from sqlalchemy import create_engine

conn, cur = create_conn()
# '#' inside the password must be percent-encoded, and the host is separated by '@'
engine = create_engine('postgresql://postgres:Shubham%23123@localhost:5432/walmart')
create_query = '''create table if not exists new_table(
    "item_id" TEXT, "product_id" TEXT, "abstract_product_id" TEXT,
    "product_name" TEXT, "product_type" TEXT, "ironbank_category" TEXT,
    "primary_shelf" TEXT, "apparel_category" TEXT, "brand" TEXT)'''
cur.execute(create_query)
conn.commit()
file_name = 'new_table'
new_file = "C:\\Users\\shubham.shinde\\Desktop\\wallll\\new_file.txt"
data = pd.read_csv(new_file, delimiter="\t", chunksize=500000, error_bad_lines=False, quoting=csv.QUOTE_NONE, dtype="unicode", iterator=True)
# redirect stderr so that skipped bad rows are logged to a file
with open(file_name + '_bad_rows.txt', 'w') as f1:
    sys.stderr = f1
    for df in data:
        df.to_sql('new_table', engine, if_exists='append')
data.close()
I want df.to_sql() to insert the values into the already-created database table.
Not 100% certain this argument works with PostgreSQL, but I had a similar issue when doing it on MSSQL. .to_sql() already creates the table named by its first argument, new_table. if_exists='append' also doesn't check for duplicate values: if the data in new_file is overwritten, or run through your function again, it will just add to the table. As to why you're seeing the table name but no data in it, that might be due to the size of the df; try passing fast_executemany=True to create_engine.
My suggestion: get rid of create_query and handle the data types after to_sql(). Once the SQL table is created, you can use your actual SQL table and join against this staging table for duplicate testing. The non-duplicates can be written to the actual table, converting data types on UPDATE to match the actual table's data type structure.
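To make that concrete, a rough sketch of the suggestion, assuming the nine column names from your CREATE statement match the TSV column order; the connection credentials are placeholders:
import csv
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.types import Text

# Assumed column list; adjust to the TSV's actual column order.
columns = ["item_id", "product_id", "abstract_product_id", "product_name",
           "product_type", "ironbank_category", "primary_shelf",
           "apparel_category", "brand"]

engine = create_engine('postgresql://postgres:<password>@localhost:5432/walmart')

reader = pd.read_csv("C:\\Users\\shubham.shinde\\Desktop\\wallll\\new_file.txt",
                     delimiter="\t", names=columns, header=None,
                     chunksize=500000, quoting=csv.QUOTE_NONE, dtype="unicode")

for chunk in reader:
    # The first chunk creates the table; later chunks append to it.
    # Pinning every column to TEXT replaces the manual create_query.
    chunk.to_sql('new_table', engine, if_exists='append', index=False,
                 dtype={c: Text() for c in columns})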

Spark DataFrame ORC Hive table reading issue

I am trying to read a Hive table in Spark. Below is the Hive Table format:
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \u0001
serialization.format \u0001
When I try to read it using Spark SQL with the command below:
val c = hiveContext.sql("""select
a
from c_db.c cs
where dt >= '2016-05-12' """)
c.show
I get the following warning:
18/07/02 18:02:02 WARN ReaderImpl: Cannot find field for: a in _col0,
_col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34, _col35, _col36, _col37, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49, _col50, _col51, _col52, _col53, _col54, _col55, _col56, _col57, _col58, _col59, _col60, _col61, _col62, _col63, _col64, _col65, _col66, _col67,
The read starts but it is very slow and eventually hits a network timeout.
When I try to read the Hive table directory directly, I get the following error:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
val c = hiveContext.read.format("orc").load("/a/warehouse/c_db.db/c")
c.select("a").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input
columns: [_col18, _col3, _col8, _col66, _col45, _col42, _col31,
_col17, _col52, _col58, _col50, _col26, _col63, _col12, _col27, _col23, _col6, _col28, _col54, _col48, _col33, _col56, _col22, _col35, _col44, _col67, _col15, _col32, _col9, _col11, _col41, _col20, _col2, _col25, _col24, _col64, _col40, _col34, _col61, _col49, _col14, _col13, _col19, _col43, _col65, _col29, _col10, _col7, _col21, _col39, _col46, _col4, _col5, _col62, _col0, _col30, _col47, trans_dt, _col57, _col16, _col36, _col38, _col59, _col1, _col37, _col55, _col51, _col60, _col53];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
I can convert the Hive table to TextInputFormat, but that should be my last option, as I would like to keep the benefit of OrcInputFormat's compression of the table.
I'd really appreciate your suggestions.
I found a workaround: read the table this way:
val schema = spark.table("db.name").schema
spark.read.schema(schema).orc("/path/to/table")
The issue generally occurs with large tables, as it fails to read up to the maximum field length. I enabled the metastore read (set spark.sql.hive.convertMetastoreOrc=true;) and it worked for me.
I think the table doesn't have named columns, or if it does, Spark probably isn't able to read the names.
You can use the default column names that Spark assigned, as shown in the error, or set the column names yourself in the Spark code.
Use printSchema and the toDF method to rename the columns; you will need the name mappings, though. This might require selecting and showing columns individually, as in the sketch below.
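Something along these lines might work (a PySpark sketch, assuming the metastore still has the real column names and that the directory read returns the data columns first and the partition column last, matching the table definition):
# Real names from the metastore (data columns followed by partition columns).
hive_names = spark.table("c_db.c").schema.names

orc_df = spark.read.orc("/a/warehouse/c_db.db/c")   # columns come back as _col0, _col1, ..., trans_dt
named_df = orc_df.toDF(*hive_names)                 # positional rename; counts must match

named_df.select("a").show()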
Setting the (set spark.sql.hive.convertMetastoreOrc=true;) conf works, but it tries to modify the metadata of the Hive table. Can you please explain what it is going to modify, and does it affect the table?
Thanks

Hive - insert into table partition throwing error

I am trying to create a partitioned table in Hive on Spark and load it with data available in another table in Hive.
I am getting the following error while loading the data:
Error: org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException:
Partition spec {cardsuit=, cardcolor=, cardSuit=SPA, cardColor=BLA}
contains non-partition columns;
The following are the commands used to execute the task:
create table if not exists hive_tutorial.hive_table(color string, suit string,value string) comment 'first hive table' row format delimited fields terminated by '|' stored as TEXTFILE;
LOAD DATA LOCAL INPATH 'file:///E:/Kapil/software-study/Big_Data_NoSql/hive/deckofcards.txt' OVERWRITE INTO TABLE hive_table; --data is correctly populated(52 rows)
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
create table if not exists hive_tutorial.hive_table_partitioned(color string, suit string,value int) comment 'first hive table' partitioned by (cardSuit string,cardColor string) row format delimited fields terminated by '|' stored as TEXTFILE;
INSERT INTO TABLE hive_table_partitioned PARTITION (cardSuit,cardColor) select color,suit,value,substr(suit, 1, 3) as cardSuit,substr(color, 1, 3) as cardColor from hive_table;
-- alternatively I tried
INSERT OVERWRITE TABLE hive_table_partitioned PARTITION (cardSuit,cardColor) select color,suit,value,substr(suit, 1, 3) as cardSuit,substr(color, 1, 3) as cardColor from hive_table;
Sample of the data:
BLACK|SPADE|2
BLACK|SPADE|3
BLACK|SPADE|4
BLACK|SPADE|5
BLACK|SPADE|6
BLACK|SPADE|7
BLACK|SPADE|8
BLACK|SPADE|9
I am using Spark 2.2.0 and Java version 1.8.0_31.
I have checked and tried the answers given in a similar thread but could not solve my problem:
SemanticException Partition spec {col=null} contains non-partition columns
Am I missing something here?
After carefully re-reading the scripts above: while creating the tables, the value column is of type int in the partitioned table whereas it is a string in the original. The error was misleading! A corrected version is sketched below.
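For completeness, a sketch of the corrected flow run through spark.sql; the only substantive change from the scripts above is keeping value as string in both tables (alternatively, cast it in the SELECT):
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("""
  create table if not exists hive_tutorial.hive_table_partitioned(
    color string, suit string, value string)  -- value stays string, matching hive_table
  comment 'first hive table'
  partitioned by (cardSuit string, cardColor string)
  row format delimited fields terminated by '|'
  stored as textfile
""")

spark.sql("""
  insert into table hive_tutorial.hive_table_partitioned partition (cardSuit, cardColor)
  select color, suit, value,
         substr(suit, 1, 3)  as cardSuit,
         substr(color, 1, 3) as cardColor
  from hive_tutorial.hive_table
""")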

Getting ValidationFailureSemanticException on 'INSERT OVEWRITE'

I am creating a DataFrame and registering it as a temp view using df.createOrReplaceTempView('mytable'). After that I try to write the content from 'mytable' into a Hive table (which has a partition) using the following query:
insert overwrite table
myhivedb.myhivetable
partition(testdate) // ( 1) : Note here : I have a partition named 'testdate'
select
Field1,
Field2,
...
TestDate //(2) : Note here : I have a field named 'TestDate' ; Both (1) & (2) have the same name
from
mytable
When I execute this query, I get the following error:
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.Table$ValidationFailureSemanticException: Partition spec
{testdate=, TestDate=2013-01-01}
It looks like I am getting this error because of the identical field names, i.e. testdate (the partition in Hive) and TestDate (the field in the temp view 'mytable').
Whereas if my partition name testdate is different from the field name (i.e. TestDate), the query executes successfully. Example:
insert overwrite table
myhivedb.myhivetable
partition(my_partition) //Note here the partition name is not 'testdate'
select
Field1,
Field2,
...
TestDate
from
mytable
My guess is that this is a bug in Spark, but I would like a second opinion. Am I missing something here?
@DuduMarkovitz, @dhee: apologies for the late response. I was finally able to resolve the issue. Earlier I was creating the table using camelCase (in the CREATE statement), which seems to have been the reason for the exception. Now I have created the table using a DDL where the field names are in lower case, and this has resolved my issue; see the sketch below.
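Roughly, the working pattern looks like the sketch below. Column names other than testdate and the stored-as format are placeholders, and df stands for the DataFrame registered earlier:
spark.sql("""
  create table if not exists myhivedb.myhivetable (
    field1 string,
    field2 string)
  partitioned by (testdate string)
  stored as parquet
""")

df.createOrReplaceTempView("mytable")

spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("""
  insert overwrite table myhivedb.myhivetable partition (testdate)
  select Field1   as field1,
         Field2   as field2,
         TestDate as testdate
  from mytable
""")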
