I am using Hive, and the IDE is Hue. I am trying different key combinations to choose for my partition key(s).
The definition of my original table is as follows:
CREATE External Table `my_hive_db`.`my_table`(
`col_id` bigint,
`result_section__col2` string,
`result_section_col3` string ,
`result_section_col4` string,
`result_section_col5` string,
`result_section_col6__label` string,
`result_section_col7__label_id` bigint ,
`result_section_text` string ,
`result_section_unit` string,
`result_section_col` string ,
`result_section_title` string,
`result_section_title_id` bigint,
`col13` string,
`timestamp` bigint,
`date_day` string
)
PARTITIONED BY (
`date_year` string,
`date_month` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3a://some/where/in/amazon/s3';
The above code is working properly. But when I create a new table with date_day as partition key, the table is empty and I need to run MSCK Repair Table. However, I am getting the following error:
Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask
When the partion keys were date_year, date_month, MSCK worked properly.
Table definition of the table I am getting the error for is as follows:
CREATE External Table `my_hive_db`.`my_table`(
`col_id` bigint,
`result_section__col2` string,
`result_section_col3` string ,
`result_section_col4` string,
`result_section_col5` string,
`result_section_col6__label` string,
`result_section_col7__label_id` bigint ,
`result_section_text` string ,
`result_section_unit` string,
`result_section_col` string ,
`result_section_title` string,
`result_section_title_id` bigint,
`col13` string,
`timestamp` bigint,
`date_year` string,
`date_month` string
)
PARTITIONED BY (
`date_day` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3a://some/where/in/amazon/s3';
After this the following query is empty:
Select * From `my_hive_db`.`my_table` Limit 10;
I therefore ran the following command:
MSCK REPAIR TABLE `my_hive_db`.`my_table`;
And I get the error: Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask
I checked this link as it is exactly the error I am getting, but by using the solution provided:
set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;
I get a different error:
Error while processing statement: Cannot modify hive.msck.path.validation at runtime. It is not in list of params that are allowed to be modified at runtime.
I think the reason I am getting these errors is that there are more than 200 million records with date_day having null value.
There are 31 distinct date-day not null values. I would like to partition my table with 32 partitions, each for a distinct value of date_day field, and all the null values get into a different partition. Is there a way to do so (partitioning by a column with null values)?
If this can be achieved by spark, I am also open to use it.
This is part of a bigger problem of changing partition keys by recreating a table as mentioned in this link in answer to my other question.
Thank you for your help.
You seem to not understand how Hive's partitioning work.
Hive stores data into files on HDFS (or S3, or some other distributed folders).
If you create a non-partitioned parquet table called my_schema.my_table, you will see in your distributed storage files stored in a folder
hive/warehouse/my_schema.db/my_table/part_00001.parquet
hive/warehouse/my_schema.db/my_table/part_00002.parquet
...
If you create a table partitioned by a column p_col, the files will look like
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00002.parquet
...
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00002.parquet
...
The command MSCK repair table allows you to automatically reload the partitions, when you create an external table.
Let's say you have folders on s3 that look like this:
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value3/part_00001.parquet
You create an external table with
CREATE External Table my_schema.my_table(
... some columns ...
)
PARTITIONED BY (p_col STRING)
the table will be created but empty, because Hive hasn't detected the partitions yet. You run MSCK REPAIR TABLE my_schema.my_table, and Hive will recognize that your partition p_col matches the partitioning scheme on s3 (/p_col=value1/).
From what I could understand from your other question, you are trying to change the partitioning scheme of the table by doing
CREATE External Table my_schema.my_table(
... some columns ...
)
PARTITIONED BY (p_another_col STRING)
and you are getting an error message because p_another_col doesn't match with the column used in s3, which was p_col.
And this error is perfectly normal, since what you are doing doesn't make sense.
As stated in the other question's answer, you need to create a copy of the first table, with a different partitioning scheme.
You should instead try something like this:
CREATE External Table my_hive_db.my_table_2(
`col_id` bigint,
`result_section__col2` string,
`result_section_col3` string ,
`result_section_col4` string,
`result_section_col5` string,
`result_section_col6__label` string,
`result_section_col7__label_id` bigint ,
`result_section_text` string ,
`result_section_unit` string,
`result_section_col` string ,
`result_section_title` string,
`result_section_title_id` bigint,
`col13` string,
`timestamp` bigint,
`date_year` string,
`date_month` string
)
PARTITIONED BY (`date_day` string)
and then populate your new table with dynamic partitioning
INSERT OVERWRITE TABLE my_hive_db.my_table_2 PARTITION(date_day)
SELECT
col_id,
result_section__col2,
result_section_col3,
result_section_col4,
result_section_col5,
result_section_col6__label,
result_section_col7__label_id,
result_section_text,
result_section_unit,
result_section_col,
result_section_title,
result_section_title_id,
col13,
timestamp,
date_year,
date_month,
date_day
FROM my_hive_db.my_table_1
I have a csv file with contents as below which has a header in the 1st line .
id,name
1234,Rodney
8984,catherine
Now I was able create a table in hive to skip header and read the data appropriately.
Table in Hive
CREATE EXTERNAL TABLE table_id(
`tmp_id` string,
`tmp_name` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-testing/test/data/'
tblproperties ("skip.header.line.count"="1");
Results in Hive
select * from table_id;
OK
1234 Rodney
8984 catherine
Time taken: 1.219 seconds, Fetched: 2 row(s)
But, when I use the same table in pyspark (Ran the same query) I see even the headers from file in pyspark results as below.
>>> spark.sql("select * from table_id").show(10,False)
+------+---------+
|tmp_id|tmp_name |
+------+---------+
|id |name |
|1234 |Rodney |
|8984 |catherine|
+------+---------+
Now, how can I ignore these showing up in the results in pyspark.
I'm aware that we can read the csv file and add .option("header",True) to achieve this but, I wanna know if there's a way to do something similar in pyspark while querying tables.
Can someone suggest me a way.... Thanks 🙏 in Advance !!
u can use below two properties:
serdies properties and table properties, you will be able to access table from hive and spark by skipping header in both env.
CREATE EXTERNAL TABLE `student_test_score_1`(
student string,
age string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'delimiter'=',',
'field.delim'=',',
'header'='true',
'skip.header.line.count'='1',
'path'='hdfs:<path>')
LOCATION
'hdfs:<path>'
TBLPROPERTIES (
'spark.sql.sources.provider'='CSV')
This is know issue in Spark-11374 and closed as won't fix.
In query you can have where clause to select all records except 'id' and 'name'.
spark.sql("select * from table_id where tmp_id <> 'id' and tmp_name <> 'name'").show(10,False)
#or
spark.sql("select * from table_id where tmp_id != 'id' and tmp_name != 'name'").show(10,False)
Another way would be using reading files from HDFS with .option("header","true").
I am trying to read a Hive table in Spark. Below is the Hive Table format:
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \u0001
serialization.format \u0001
When I am trying to read it using the Spark SQL with the below command:
val c = hiveContext.sql("""select
a
from c_db.c cs
where dt >= '2016-05-12' """)
c. show
I am getting the below warning:-
18/07/02 18:02:02 WARN ReaderImpl: Cannot find field for: a in _col0,
_col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34, _col35, _col36, _col37, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49, _col50, _col51, _col52, _col53, _col54, _col55, _col56, _col57, _col58, _col59, _col60, _col61, _col62, _col63, _col64, _col65, _col66, _col67,
The read starts but it is very slow and getting network time out.
When i am trying to read the Hive table directory directly i am getting the below error.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
val c = hiveContext.read.format("orc").load("/a/warehouse/c_db.db/c")
c.select("a").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input
columns: [_col18, _col3, _col8, _col66, _col45, _col42, _col31,
_col17, _col52, _col58, _col50, _col26, _col63, _col12, _col27, _col23, _col6, _col28, _col54, _col48, _col33, _col56, _col22, _col35, _col44, _col67, _col15, _col32, _col9, _col11, _col41, _col20, _col2, _col25, _col24, _col64, _col40, _col34, _col61, _col49, _col14, _col13, _col19, _col43, _col65, _col29, _col10, _col7, _col21, _col39, _col46, _col4, _col5, _col62, _col0, _col30, _col47, trans_dt, _col57, _col16, _col36, _col38, _col59, _col1, _col37, _col55, _col51, _col60, _col53];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
I can convert the Hive table to TextInputFormat but that should be my last option as i would like to get the benefit of OrcInputFormat to compress the table size.
Really appreciate your suggestion.
I found workaround with reading table such way:
val schema = spark.table("db.name").schema
spark.read.schema(schema).orc("/path/to/table")
The issue occurs generally with large tables, as it fails to read to max field length. I added meta-store read as true (set spark.sql.hive.convertMetastoreOrc=true;) and it worked for me.
I think the table doesnt have named columns or if it has, Spark isnt able to read the names probably.
You can use the default column names that Spark has given as mentioned in the Error. Or also set column names in the Spark code.
Use printSchema and toDF method to rename the columns. But yes, you will need the mappings. This might require selecting and showing columns individually.
Setting (set spark.sql.hive.convertMetastoreOrc=true;) conf is working. But its trying to modify metadata of hive table. Can you please explain me, What is going to modify and does it effect the table.
Thanks
I'm trying to insert the content of a dataframe to a partitioned parquet-formatted hive table using
df.write.mode(SaveMode.Append).insertInto(myTable)
with hive.exec.dynamic.partition = 'true' and hive.exec.dynamic.partition.mode = 'nonstrict'.
I keep getting an parquet.io.ParquetEncodingException saying that
empty fields are illegal, the field should be ommited completely
instead.
The schema includes arrays (array<<struct<<int, string>>>>), and the df do contain some empty entries for these fields.
However, when I insert the df content into a non-partitioned table, I do not get an error.
How to this fix this issue.
I have attached error
I am creating a DataFrame and registering that DataFrame as temp view using df.createOrReplaceTempView('mytable'). After that I try to write the content from 'mytable' into Hive table(It has partition) using the following query
insert overwrite table
myhivedb.myhivetable
partition(testdate) // ( 1) : Note here : I have a partition named 'testdate'
select
Field1,
Field2,
...
TestDate //(2) : Note here : I have a field named 'TestDate' ; Both (1) & (2) have the same name
from
mytable
when I execute this query, I am getting the following error
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.Table$ValidationFailureSemanticException: Partition spec
{testdate=, TestDate=2013-01-01}
Looks like I am getting this error because of the same field names ; ie testdate(the partition in Hive) & TestDate (The field in temp table 'mytable')
Whereas if my partition name testdate is different from the fieldname(ie TestDate), the query executes successuflly. Example...
insert overwrite table
myhivedb.myhivetable
partition(my_partition) //Note here the partition name is not 'testdate'
select
Field1,
Field2,
...
TestDate
from
mytable
My guess is it looks like a Bug in Spark...but would like to have second opinion...Am I missing something here?
#DuduMarkovitz #dhee ; apologies for being too late for the response. I am finally able to resolve the issue. Earlier I was creating the table using cameCase(in the CREATE statement) which seems to be the reason for the Exception. Now i have created the table using the DDL where field names are in lower case. This has resolved my issue