Creating hive table on spark output on HDFS - apache-spark

I have my Spark job which is running every 30 minutes and writing output to hdfs-(/tmp/data/1497567600000). I have this job continuously running in the cluster.
How can I create a Hive table on top of this data? I have seen one solution in StackOverFlow which creates a hive table on top of data partitioned by date field. which is like,
CREATE EXTERNAL TABLE `mydb.mytable`
(`col1` string,
`col2` decimal(38,0),
`create_date` timestamp,
`update_date` timestamp)
PARTITIONED BY (`my_date` string)
STORED AS ORC
LOCATION '/tmp/out/'
and the solution suggests to Alter the table as,
ALTER TABLE mydb.mytable ADD PARTITION (my_date=20160101) LOCATION '/tmp/out/20160101'
But, in my case, I have no idea on how the output directories are being written, and so I clearly can't create the partitions as suggested above.
How can I handle this case, where the output directories are being randomly written in timestamp basis and is not in format (/tmp/data/timestamp= 1497567600000)?
How can I make Hive pick the data under the directory /tmp/data?

I can suggest two solutions:
If you can change your Spark job, then you can partition your data by hour (e.g. /tmp/data/1, /tmp/data/2), add Hive partitions for each hour and just write to relevant partition
you can write bash script responsible for adding Hive partitions which can be achieved by:
listing HDFS subdirectories using command hadoop fs -ls /tmp/data
listing hive partitions for table using command: hive -e 'show partitions table;'
comparing above lists to find missing partitions
adding new Hive partitions with command provided above: ALTER TABLE mydb.mytable ADD PARTITION (my_date=20160101) LOCATION '/tmp/out/20160101'

Related

DATABRICKS SQL - can't read data from partitioned parquet file

I'm trying to read parquet files structured as:
filename/year=2020/month=12/day=1
files are under the following Mounted AzureStorage as following logic: /mnt/silver/root_folder/folder_A/parquet/year=2020/month=01/day=1
I'm trying to create a table, using this sintax:
CREATE TABLE tablename
(
FIELD1 string,
...
,FIELDn Date
,Year INT
,Month INT
,Day INT
)
USING org.apache.spark.sql.parquet
LOCATION '/mnt/silver/root_folder/folder_A/parquet/'
OPTIONS( 'compression'='snappy')
PARTITIONED BY (Year, Month, Day)
But all options I tried for LOCATION gets no Results.
I already tried:
/mnt/silver/folder/folder/parquet/* and also many variations of it.
Any suggestion please?
You need to execute MSCK REPAIR TABLE <table_name> or ALTER TABLE <table_name> RECOVER PARTITIONS - any of them forces to re-discover data in partitions.
From documentation:
When creating a table using PARTITIONED BY clause, partitions are generated and registered in the Hive metastore. However, if the partitioned table is created from existing data, partitions are not registered automatically in the Hive metastore
P.S. when you use Delta, that's done automatically, so that's one of the good reasons for using it :-)

what caused different pattern in hive table partition?

We have spark job but also randomly run hive query in current hadoop cluster
I have seen the same hive table has different partition pattern like below:
i.e. if the table is partition by date, so
hdfs dfs -ls /data/hive/warehouse/db_name/table_name/part_date=2019-12-01/
gave result
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-00001
....
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06669
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06670
however if find data from different partition date
hdfs dfs -ls /data/hive/warehouse/db_name/table_name/part_date=2020-01-01/
list files with different name patter
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000007_0
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000008_0
....
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000010_0
What I can tell the difference not only in one partition the data files come with part- prefix and the other is like 00000n_0, also there are a lot more amount of files for part- file but each file is quite small.
I also found aggregation on part- files are a lot slower than 00000n_0 files
what could be the possible cause of the file pattern difference and what could be the configuration to change from one to another?
When spark streaming writes data in Hive it creates lots of small files named as part- in Hive and which keep on the increase. This will give performance issue while querying on Hive table. Hive takes too much time to give result due to large no of small files in the partition.
When spark job write data in Hive it looks like -
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-00001
....
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06669
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06670
But here different file pattern is due to compaction logic on the partition's file to compact the small file into a large. Here n in 00000n_0 is the no of reducer.
Sample compaction script, which compacts the small file into a big file within partition for example table under-sample database -
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.reducers.bytes.per.reducer=268435456; --256MB reducer size.
CREATE TABLE example_tmp
STORED AS parquet
LOCATION '/user/hive/warehouse/sample.db/example_tmp'
AS
SELECT * FROM example
INSERT OVERWRITE table sample.example PARTITION (part_date) select * from sample.example_tmp;
DROP TABLE IF EXISTS sample.example_tmp PURGE;
The above script will compact the small files into some big file within the partition. And filename will be 00000n_0
what could be the possible cause of the file pattern difference and what could be the configuration to change from one to another?
There might be someone run compaction logic on the partition using Hive. Or might be reload the partition data using Hive. This is not an issue, data remains the same.

Spark - Stream kafka to file that changes every day?

I have a kafka stream I will be processing in spark. I want to write the output of this stream to a file. However, I want to partition these files by day, so everyday it will start writing to a new file. Can something like this be done? I want this to be left running and when a new day occurs, it will switch to write to a new file.
val streamInputDf = spark.readStream.format("kafka")
.option("kafka.bootstrapservers", "XXXX")
.option("subscribe", "XXXX")
.load()
val streamSelectDf = streamInputDf.select(...)
streamSelectDf.writeStream.format("parquet)
.option("path", "xxx")
???
Adding partition from spark can be done with partitionBy provided in
DataFrameWriter for non-streamed or with DataStreamWriter for
streamed data.
Below are the signatures :
public DataFrameWriter partitionBy(scala.collection.Seq
colNames)
DataStreamWriter partitionBy(scala.collection.Seq colNames)
Partitions the output by the given columns on the file system.
DataStreamWriter partitionBy(String... colNames) Partitions the
output by the given columns on the file system.
Description :
partitionBy public DataStreamWriter partitionBy(String... colNames)
Partitions the output by the given columns on the file system. If
specified, the output is laid out on the file system similar to Hive's
partitioning scheme. As an example, when we partition a dataset by
year and then month, the directory layout would look like:
- year=2016/month=01/
- year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize
physical data layout. It provides a coarse-grained index for skipping
unnecessary data reads when queries have predicates on the partitioned
columns. In order for partitioning to work well, the number of
distinct values in each column should typically be less than tens of
thousands.
Parameters: colNames - (undocumented) Returns: (undocumented) Since:
2.0.0
so if you want to partition data by year and month spark will save the data to folder like:
year=2019/month=01/05
year=2019/month=02/05
Option 1 (Direct write):
You have mentioned parquet - you can use saving as a parquet format with:
df.write.partitionBy('year', 'month','day').format("parquet").save(path)
Option 2 (insert in to hive using same partitionBy ):
You can also insert into hive table like:
df.write.partitionBy('year', 'month', 'day').insertInto(String tableName)
Getting all hive partitions:
Spark sql is based on hive query language so you can use SHOW PARTITIONS
To get list of partitions in the specific table.
sparkSession.sql("SHOW PARTITIONS partitionedHiveParquetTable")
Conclusion :
I would suggest option 2 ... since Advantage is later you can query data based on partition (aka query on raw data to know what you have received) and underlying file can be parquet or orc.
Note :
Just make sure you have .enableHiveSupport() when you are creating session with SparkSessionBuilder and also make sure whether you have hive-conf.xml etc. configured properly.
Based on this answer spark should be able to write to a folder based on the year, month and day, which seems to be exactly what you are looking for. Have not tried it in spark streaming, but hopefully this example gets you on the right track:
df.write.partitionBy("year", "month", "day").format("parquet").save(outPath)
If not, you might be able to put in a variable filepath based on current_date()

Write files inside Hive table hdfs folder and make them available to be queried from Hive

I am using Spark 2.2.1 which has a useful option to specify how many records I want to save in each partition of a file; this feature allows to avoid a repartition before writing a file.
However, it seems this option is usable only with the FileWriter interface and not with the DataFrameWriter one:
in this way the option is ignored
df.write.mode("overwrite")
.option("maxRecordsPerFile", 10000)
.insertInto(hive_table)
while in this way it works
df.write.option("maxRecordsPerFile", 10000)
.mode("overwrite").orc(path_hive_table)
so I am directly writing orc files in the HiveMetastore folder of the specified table. The problem is that if I query the Hive table after the insertion, this data is not recognized by Hive.
Do you know if there's a way to write directly partition files inside the hive metastore and make them available also through the Hive table?
Debug steps :
1 . Check the type of file your hive table consumes
Show create table table_name
and check "STORED AS " ..
For better efficiency saves your output in parquet and on the partition location (you can see that in "LOCATION" in above query) ..If there are any other specific types create file as that type.
2 . If you are saving data in any partition and manually creating the partition folder , avoid that .. Create partition using
alter table {table_name} add partition ({partition_column}={value});
3 .After creating the output files in spark .. You can reload those and check for "_corrupt_record" (you can print the dataframe and check this)
Adding to this, I also found out that the command 'MSCK REPAIR TABLE' automatically discovers new partitions inside the hive table folder

What is the metastore for in Spark?

I am using SparkSQL in python. I have created a partitioned table (~few hundreds of partitions) stored it into Hive Internal Table using the hiveContext. The hive warehouse is located in S3.
When I simply do "df = hiveContext.table("mytable"). It would take over a minute to going through all the partitions the first time. I thought the metastore stored all the metadata. Why would spark still need to going through each partition? Is it possible to avoid this step so my startup can be faster?
The key here is that it takes this long to load the file metadata only on the first query. The reason is that SparkSQL doesn't store the partition metadata in the Hive metastore. For Hive partitioned tables, the partition information needs to be stored in the metastore. Depending on how the table is created will dictate how this behaves. From the information provided, it sounds like you created a SparkSQL table.
SparkSQL stores the table schema (which includes partition information) and the root directory of your table, but still discovers each partition directory on S3 dynamically when the query is run. My understanding is that this is a tradeoff so you don't need to manually add new partitions whenever the table is updated.

Resources