Loading 600 billion records from one Hive table into another - apache-spark

I have a Hive external table in one database with around 600 billion records and 100 columns. I need to copy the data as-is to the same table in another database. I am trying to write Spark code, but it is taking forever. Is there any recommendation on how I should write the code? I am new to Spark!

Do not copy the data; leave it where it is. Create an external table in the other database with its location pointing at the existing data:
USE YOUR_DATABASE;
CREATE EXTERNAL TABLE abc ... LOCATION 'hdfs://your/data';
If the table is partitioned, recover the partitions with MSCK REPAIR TABLE abc; (or ALTER TABLE abc RECOVER PARTITIONS; if you are on EMR).
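Since you are already writing Spark code, the same DDL can be issued through spark.sql. A minimal sketch with hypothetical stand-ins (a two-column Parquet schema, a dt partition column, and the names source_db.abc / target_db in place of your real 100-column table and path):

from pyspark.sql import SparkSession

# Hive support is needed so the DDL is executed against the Hive metastore
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Find where the source table's data actually lives (look at the Location row)
spark.sql("DESCRIBE FORMATTED source_db.abc").show(200, truncate=False)

# Register an external table in the other database over the same files -- no data is moved
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS target_db.abc (id BIGINT, name STRING)
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION 'hdfs://your/data'
""")

# Load the partition metadata (ALTER TABLE target_db.abc RECOVER PARTITIONS on EMR)
spark.sql("MSCK REPAIR TABLE target_db.abc")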
If you absolutely need to copy the data to another location (and if you are on a paid Amazon EC2 cluster, you need a good reason to spend money on this), use distcp (the distributed copy tool):
hadoop distcp hdfs://your/data hdfs://your/data2

Related

How to perform MSCK REPAIR TABLE to load only specific partitions

I have more than 2 months of data in AWS S3, partitioned and stored by day. I want to start using the data through the external table that I created.
Currently I see only a couple of partitions, and I want to make sure my metadata picks up all of them. I tried msck repair table tablename in Hive after logging in to the EMR cluster's master node. However, perhaps due to the data volume, that command is taking a very long time to execute.
Can I run msck repair table so that it loads only a specific day? Does msck allow loading specific partitions?
You can use
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];
...as described in the Hive DDL documentation.
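For instance, to register only a specific day's partition without scanning the whole bucket, something like this should do (the table name, partition column, and S3 path are made up; the embedded statement can equally be run in the Hive CLI):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Add just one day's partition instead of repairing the whole table
spark.sql("""
    ALTER TABLE mytable ADD IF NOT EXISTS
    PARTITION (day = '2019-05-01') LOCATION 's3://my-bucket/mytable/day=2019-05-01/'
""")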

Hive - Copy database schema with partitions and recreate in another hive instance

I have copied the data and folder structure for a database with partitioned hive tables from one HDFS instance to another.
How can I do the same with the Hive metadata? The new HDFS instance's Hive needs to have this database and its tables defined with their existing partitioning, just as in the original location. And, of course, they need to keep their original schemas, with the HDFS external table locations updated to the new instance.
Happy to use direct Hive commands, Spark, or any general CLI utilities that are open source and readily available. I don't have an actual Hadoop cluster (this is cloud storage), so please avoid answers that depend on MapReduce etc. (like Sqoop).
Use the Hive command:
SHOW CREATE TABLE tablename;
This will print the CREATE TABLE statement. Copy it, change the table type to external, the location, and the schema/column names if necessary, etc., and execute it.
After you have created the table, use this command to create the partition metadata:
MSCK [REPAIR] TABLE tablename;
The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:
ALTER TABLE tablename RECOVER PARTITIONS;
This will add the Hive partition metadata. See the manual here: RECOVER PARTITIONS
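If you would rather script the whole round trip (the question says Spark is fine), here is a rough sketch with hypothetical names and paths; depending on your Hive/Spark versions you may need SHOW CREATE TABLE ... AS SERDE or some hand-editing of the generated DDL, and the capture and recreate steps run in separate sessions against the source and target metastores:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Step 1 (run against the source metastore): capture the DDL
ddl = spark.sql("SHOW CREATE TABLE mydb.events").collect()[0][0]

# Rewrite it for the target instance: force EXTERNAL and point at the new storage location
ddl = ddl.replace("CREATE TABLE", "CREATE EXTERNAL TABLE", 1)
ddl = ddl.replace("hdfs://old-cluster/warehouse/mydb.db/events",
                  "hdfs://new-cluster/warehouse/mydb.db/events")

# Step 2 (run against the target metastore): recreate the table, then load the partition metadata
spark.sql("CREATE DATABASE IF NOT EXISTS mydb")
spark.sql(ddl)
spark.sql("MSCK REPAIR TABLE mydb.events")  # or ALTER TABLE mydb.events RECOVER PARTITIONS on EMR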

SparkSQL on hive partitioned external table on amazon s3

I am planning to use SparkSQL (not pySpark) on top of data in Amazon S3. So I believe I need to create a Hive external table and can then use SparkSQL. But the S3 data is partitioned, and I want the partitions reflected in the Hive external table as well.
What is the best way to manage the Hive table on a daily basis? Every day, new partitions can be created or old partitions overwritten, so what should I do to keep the Hive external table up to date?
Create an intermediate (staging) table and load into your Hive table with an INSERT OVERWRITE ... PARTITION on the date.
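A rough sketch of that daily step, with hypothetical names (a staging table stage_db.events_raw, a target table mydb.events partitioned by dt, and three data columns stand in for your real schema); it is driven here through spark.sql, but the embedded statement is plain SparkSQL/HiveQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

run_date = "2019-06-01"  # the day being (re)loaded

# Overwrite only that day's partition in the external table from the staging table
spark.sql(f"""
    INSERT OVERWRITE TABLE mydb.events PARTITION (dt = '{run_date}')
    SELECT col1, col2, col3            -- every column except the partition column
    FROM stage_db.events_raw
    WHERE dt = '{run_date}'
""")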

How do I find if a Hive table is defined as an external table through PySpark?

For context - data lives on S3, written as Hive tables. I'm running some Jupyter notebooks on my local machine that are supposed to point at the S3 data as Hive tables, with the metadata stored in a relational DB on a Spark cluster.
When I run some local scripts/Jupyter notebooks on my local machine to create and load some tables, it's saying that I've created some external tables even though I didn't create them as external tables.
When I run spark.sql("show tables in target_db").show(20, False) I see nothing. Then I create the table without the external option and run the show command again, which outputs:
+----------+-------------------+-----------+
|database |tableName |isTemporary|
+----------+-------------------+-----------+
|target_db |mytable |false |
+----------+-------------------+-----------+
and run my script, which errors out, saying: org.apache.spark.sql.AnalysisException: Operation not allowed: TRUNCATE TABLE on external tables: ``target_db``.``mytable``;
I dropped the table on the cluster itself, so I think there's no issue with that. How is Spark thinking that my table is an external table? Do I need to change how I'm creating the table?
You should access the data in S3 by creating an external table. Say this table is called T1.
If the T1 definition uses partitions, then you need to repair the table to load the partitions.
You cannot truncate the external table T1. You should only read from it.
Tables created with a CREATE EXTERNAL TABLE ... statement are external.
Tables created with CREATE TABLE are not.
You can check which one you have with SHOW CREATE TABLE table_name,
or with DESCRIBE FORMATTED table_name, where the field called Type is either MANAGED or EXTERNAL.
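From PySpark you can check the same thing programmatically; a small sketch using the database and table names from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Option 1: the catalog API reports MANAGED vs EXTERNAL directly
for t in spark.catalog.listTables("target_db"):
    print(t.name, t.tableType)

# Option 2: pull the Type row out of DESCRIBE FORMATTED
spark.sql("DESCRIBE FORMATTED target_db.mytable") \
    .filter("col_name = 'Type'") \
    .show(truncate=False)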

HDFS memory not deleting when table dropped HIVE

Hi, I am relatively new to Hive and HDFS, so apologies in advance if I am not wording this correctly.
I have used Microsoft Azure to create a virtual machine. I am then logging into this using PuTTY and the Ambari Sandbox.
In Ambari I am using Hive; all is working fine, but I am having major issues with memory allocation.
When I drop a table in Hive, I then go into my 'Hive View' and delete the table from the trash folder. However, this is not freeing up any memory within HDFS.
The table is now gone from my Hive database and also from the trash folder, but no memory has been freed.
Is there somewhere else where I should be deleting the table from?
Thanks in advance.
According to your description, as @DuduMarkovitz said, I also don't know what the 'HDFS memory' you mention is, but I think what you mean is the table's data files on HDFS.
In my experience, I think the table you dropped in Hive is an external table, not an internal table. You can see the behavior described below in the official Hive documentation for External Tables.
External Tables
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
For the difference between internal and external tables, you can refer to here.
So if you want to reclaim the external table's data from HDFS after dropping the table, you need to use the command below to remove it from HDFS manually.
hadoop fs -rm -f -r <your-hdfs-path-url>/apps/hive/warehouse/<database name>/<table-name>
Hope it helps.
Try the DESCRIBE FORMATTED <table_name> command. It should show you the location of the table's files in HDFS. Check whether this location is empty.
