Create Hive ORC table from ORC files of another server - Azure

We have two clusters, one MapR and one our own. We want to create a new setup on our own hardware using the MapR data.
1. I copied all the ORC files from the MapR cluster and kept the same folder structure.
2. Created an ORC-formatted table with its location pointing to the folder from step 1.
3. Executed the command "MSCK REPAIR TABLE <table name>".
The above steps passed without error, but when I query the partitions the job fails with the error below:
java.lang.IllegalArgumentException: Buffer size too small. size = 262144 needed = 4958903
at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:193)
at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:238)
Can someone tell me whether we can create Hive ORC partitioned tables directly from the ORC files?
My storage is Azure Data Lake.

According to your description, my understanding is that you want to copy all the ORC files from one cluster to another and load these ORC files as a Hive table.
To do that, please try the command below to create an external table over the ORC file data.
CREATE EXTERNAL TABLE IF NOT EXISTS <table name> (<column_name column_type>, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS ORC
LOCATION '<orcfile path>';
If you are not aware of the column list of an ORC file, you can refer to the Hive ORC File Dump Utility, which prints the ORC file metadata in JSON format via hive --orcfiledump -j -p <location-of-orc-file-or-directory>.
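Since the question involves a partitioned table on Azure Data Lake, here is a minimal sketch of the whole flow driven from Spark SQL with Hive support; the table name, columns, partition column dt, and the adl:// path are placeholders, not taken from the question.

// Sketch only: placeholder names throughout; assumes a SparkSession with Hive support.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("register-copied-orc-files")
  .enableHiveSupport()
  .getOrCreate()

// Inspect the schema of the copied ORC files if the column list is unknown
// (an alternative to hive --orcfiledump).
spark.read.orc("adl://<store>.azuredatalakestore.net/<orc folder>").printSchema()

// Create an external, partitioned ORC table over the copied folder structure.
spark.sql("""
CREATE EXTERNAL TABLE IF NOT EXISTS my_db.my_orc_table (col1 STRING, col2 INT)
PARTITIONED BY (dt STRING)
STORED AS ORC
LOCATION 'adl://<store>.azuredatalakestore.net/<orc folder>'
""")

// Register the existing partition directories in the metastore.
spark.sql("MSCK REPAIR TABLE my_db.my_orc_table")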

Related

Use Unmanaged table in Delta lake on Top of ADLS Gen2

I use ADF to ingest the data from SQL Server to ADLS Gen2 in Parquet Snappy format, but the size of the file in the sink goes up to 120 GB. The size causes me a lot of problems when I read this file in Spark and join the data from this file with many other Parquet files.
I am thinking of using a Delta Lake unmanaged table with the location pointing to the ADLS location. I am able to create an unmanaged table if I don't specify any partition, using this:
CONVERT TO DELTA parquet.`<path to folder containing the Parquet file(s)>`
But if I want to partition this file for query optimization:
CONVERT TO DELTA parquet.`<path to folder containing the Parquet file(s)>` PARTITIONED BY (<partitioned_column> <datatype>)
It gives me an error like the one mentioned in the screenshot (error text below):
org.apache.spark.sql.AnalysisException: Expecting 1 partition column(s): [<PARTITIONED_COLUMN>], but found 0 partition column(s): [] from parsing the file name: abfss://mydirectory#myADLS.dfs.core.windows.net/level1/Level2/Table1.parquet.snappy;
There is no way I can create this Parquet file using ADF with partition details (I am open to suggestions).
Am I using the wrong syntax, or can this even be done?
OK, I found the answer to this. When you convert Parquet files to Delta using the above approach, Delta looks for the correct directory structure with partition information matching the name of the column mentioned in the "PARTITIONED BY" clause.
For example, I have a folder called /Parent; inside it is a directory structure with partition information, and the partitioned Parquet files are kept one level further down inside the partition folders. The folder names look like this:
/Parent/Subfolder=0/part-00000-62ef2efd-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
/Parent/Subfolder=1/part-00000-fsgvfabv-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
/Parent/Subfolder=2/part-00000-fbfdfbfe-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
/Parent/Subfolder=3/part-00000-gbgdbdtb-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
In this case, Subfolder is the partition column created inside /Parent.
CONVERT TO DELTA parquet.`/Parent/` PARTITIONED BY (Subfolder INT)
will just take this directory structure, convert the whole partitioned data set to Delta, and store the partition information in the metastore.
Summary: this command is only for utilizing already partitioned Parquet files. To create partitions from a single Parquet file you would have to take a different route, which I can explain later if you are interested ;)
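One possible version of that "different route" (my own sketch, not necessarily what the answerer had in mind): read the single Parquet file with Spark, rewrite it into a partitioned folder layout, and then run the partitioned CONVERT TO DELTA on the new folder. The paths and the Subfolder column below are placeholders.

// Hypothetical sketch: turn a single Parquet file into a Hive-style partitioned layout,
// then convert the partitioned folder to Delta. Assumes Delta Lake is available to Spark.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-single-parquet")
  .getOrCreate()

// Read the single large Parquet file produced by ADF.
val df = spark.read.parquet("abfss://<container>@<account>.dfs.core.windows.net/level1/Level2/Table1.parquet.snappy")

// Rewrite it with partition folders (Subfolder=<value>/part-*.parquet).
df.write
  .partitionBy("Subfolder")
  .mode("overwrite")
  .parquet("abfss://<container>@<account>.dfs.core.windows.net/level1/Level2/Parent")

// Now the partitioned conversion succeeds on the new folder.
spark.sql("CONVERT TO DELTA parquet.`abfss://<container>@<account>.dfs.core.windows.net/level1/Level2/Parent` PARTITIONED BY (Subfolder INT)")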

Can Hive Read data from Delta lake file format?

I started going through the Delta Lake file format. Is Hive capable of reading data from this newly introduced Delta file format? If so, could you please let me know the SerDe you are using?
Hive support is available for the Delta Lake file format. The first step is to add the jars from https://github.com/delta-io/connectors to our Hive path, and then create a table using the following format.
CREATE EXTERNAL TABLE test.dl_attempts_stream
(
...
)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION '<path to delta table>'
The Delta format picks up the partitions by default, so there is no need to mention the partitioning while creating the table.
NOTE: If data is being inserted via a Spark job, please provide hive-site.xml and call enableHiveSupport in the Spark job, to create the Delta Lake table in Hive.
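A minimal sketch of that Spark-job side, assuming hive-site.xml is visible to Spark (for example in $SPARK_CONF_DIR), the Delta jars are on the classpath, and the source and target paths are placeholders.

// Sketch: Spark job that writes Delta data for the Hive table defined above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("write-delta-for-hive")
  .enableHiveSupport()   // use the same Hive metastore as the table above
  .getOrCreate()

val df = spark.read.parquet("<path to source data>")   // placeholder source

// Write in Delta format to the location the Hive table points at.
df.write
  .format("delta")
  .mode("append")
  .save("<path to delta table>")   // same path as LOCATION in the DDL above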

Hive - Copy database schema with partitions and recreate in another hive instance

I have copied the data and folder structure for a database with partitioned Hive tables from one HDFS instance to another.
How can I do the same with the Hive metadata? I need the new HDFS instance's Hive to have this database and its tables defined with their existing partitioning, just as in the original location. And, of course, they need to maintain their original schemas in general, with the HDFS external-table locations updated.
I am happy to use direct Hive commands, Spark, or any general CLI utilities that are open source and readily available. I don't have an actual Hadoop cluster (this is cloud storage), so please avoid answers that depend on MapReduce etc. (like Sqoop).
Use Hive command:
SHOW CREATE TABLE tablename;
This will print the CREATE TABLE statement. Copy it, change the table type to external, the location, the schema, and column names if necessary, and execute it.
After you have created the table, use this command to create the partition metadata:
MSCK [REPAIR] TABLE tablename;
The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:
ALTER TABLE tablename RECOVER PARTITIONS;
This will add the Hive partition metadata. See the manual here: RECOVER PARTITIONS
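Since the question allows Spark as well, here is a minimal sketch of the same steps driven from a Spark session with Hive support; the database, table, columns, and location below are placeholders.

// Sketch: recreate the table definition on the new metastore and register its partitions.
// Assumes a SparkSession with enableHiveSupport() pointing at the NEW Hive metastore.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("recreate-hive-metadata")
  .enableHiveSupport()
  .getOrCreate()

// On the old cluster: capture the DDL (the Spark equivalent of SHOW CREATE TABLE).
// spark.sql("SHOW CREATE TABLE mydb.mytable").collect().foreach(println)

// On the new cluster: execute the edited DDL (external, updated location).
spark.sql("""
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (col1 STRING, col2 INT)
PARTITIONED BY (dt STRING)
STORED AS ORC
LOCATION '<new external table location>'
""")

// Register the copied partition directories in the new metastore.
spark.sql("MSCK REPAIR TABLE mydb.mytable")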

Unable to Merge Small ORC Files using Spark

I have an external ORC table with a large number of small files, which arrive from the source on a daily basis. I need to merge these files into larger files.
I tried to load the ORC files into Spark and save them with the overwrite method:
val fileName = "/user/db/table_data/" // This table contains multiple partitions on the date column with small data files.
val df = hiveContext.read.format("orc").load(fileName)
df.repartition(1).write.mode(SaveMode.Overwrite).partitionBy("date").orc("/user/db/table_data/")
But mode(SaveMode.Overwrite) deletes all the data from HDFS, and when I tried it without mode(SaveMode.Overwrite), it threw an error that the file already exists.
Can anyone help me proceed?
As suggested by @Avseiytsev, I stored the merged ORC files in a different folder than the source in HDFS and moved the data to the table path after the completion of the job.
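A minimal sketch of that approach, with a made-up staging path and the same hiveContext and partition column as in the question.

// Sketch: write the merged files to a staging folder, then swap them into the table path.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

val tablePath   = "/user/db/table_data/"
val stagingPath = "/user/db/table_data_merged/"   // hypothetical staging location

// Merge the small ORC files, writing to the staging folder instead of the source.
val df = hiveContext.read.format("orc").load(tablePath)
df.repartition(1)
  .write.mode(SaveMode.Overwrite)
  .partitionBy("date")
  .orc(stagingPath)

// After the job completes, replace the table data with the merged files.
val fs = FileSystem.get(hiveContext.sparkContext.hadoopConfiguration)
fs.delete(new Path(tablePath), true)                  // remove the old small files
fs.rename(new Path(stagingPath), new Path(tablePath)) // move the merged data into place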

Write files inside Hive table hdfs folder and make them available to be queried from Hive

I am using Spark 2.2.1, which has a useful option to specify how many records I want to save in each output file; this feature allows avoiding a repartition before writing a file.
However, it seems this option is usable only with the file writer interface (writing to a path) and not with the DataFrameWriter's insertInto:
used this way, the option is ignored:
df.write.mode("overwrite")
.option("maxRecordsPerFile", 10000)
.insertInto(hive_table)
while used this way it works:
df.write.option("maxRecordsPerFile", 10000)
.mode("overwrite").orc(path_hive_table)
so I am writing ORC files directly into the HDFS folder of the specified table. The problem is that if I query the Hive table after the insertion, this data is not recognized by Hive.
Do you know if there is a way to write partition files directly inside the Hive table folder and make them available through the Hive table as well?
Debug steps:
1. Check the file format your Hive table consumes:
SHOW CREATE TABLE table_name;
and check the "STORED AS" clause.
For better efficiency, save your output in Parquet at the partition location (you can see it under "LOCATION" in the above query). If the table uses any other specific format, write the files in that format.
2. If you are saving data into a partition and manually creating the partition folder, avoid that. Create the partition using:
ALTER TABLE {table_name} ADD PARTITION ({partition_column}={value});
3. After creating the output files in Spark, you can reload them and check for a "_corrupt_record" column (you can print the DataFrame and check this).
Adding to this, I also found out that the command 'MSCK REPAIR TABLE' automatically discovers new partitions inside the Hive table folder.
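Putting those pieces together, here is a minimal sketch of writing files straight into the table's location with maxRecordsPerFile and then making them visible to Hive; the table name, path, and partition column dt are placeholders.

// Sketch: write ORC files under the table's LOCATION, then register the partitions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("write-into-hive-table-folder")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Hypothetical data with a partition column "dt".
val df = Seq(("a", "2018-01-01"), ("b", "2018-01-02")).toDF("value", "dt")

// Write ORC files directly under the table's LOCATION, honouring maxRecordsPerFile.
df.write
  .option("maxRecordsPerFile", 10000)
  .mode("overwrite")
  .partitionBy("dt")
  .orc("<location of the Hive table>")   // placeholder: the path shown by SHOW CREATE TABLE

// Make the new partition directories visible to Hive.
spark.sql("MSCK REPAIR TABLE mydb.hive_table")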
