Hi, I am relatively new to Hive and HDFS, so apologies in advance if I am not wording this correctly.
I have used Microsoft Azure to create a virtual machine, and I am logging into it using PuTTY and the Ambari sandbox.
In Ambari I am using Hive. All is working fine, but I am having major issues with memory allocation.
When I drop a table in Hive, I then go into my 'Hive View' and delete the table from the trash folder. However, this frees up no memory within HDFS.
The table is now gone from my Hive database and also from the trash folder, but no memory has been freed.
Is there somewhere else where I should be deleting the table from?
Thanks in advance.
According to your description, as @DuduMarkovitz said, I am also not sure what you mean by HDFS memory; I think you are referring to the table's data files on HDFS.
Per my experience, the table you dropped in Hive is probably an external table, not an internal table. The following is from the official Hive documentation on external tables.
External Tables
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
For the difference between internal and external tables, you can refer to the Hive documentation.
So if you want to reclaim the external table's data from HDFS after dropping the table, you need to remove it manually with the command below:
hadoop fs -rm -f -r <your-hdfs-path-url>/apps/hive/warehouse/<database name>/<table-name>
Hope it helps.
Try the DESCRIBE FORMATTED <table_name> command. It should show you the location of the table's files in HDFS. Check whether that location is empty.
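As a rough sketch, you can pull the Location and Table Type lines out of the DESCRIBE FORMATTED output programmatically. The sample output below is hypothetical, not from a real cluster:

```python
# Hypothetical sketch: parse the Location and Table Type lines out of
# `DESCRIBE FORMATTED <table_name>` output captured as plain text.
# The sample output below is illustrative only.
sample_output = """\
# Detailed Table Information
Database:           mydb
Location:           hdfs://sandbox:8020/apps/hive/warehouse/mydb.db/mytable
Table Type:         EXTERNAL_TABLE
"""

def table_info(describe_output):
    """Return (location, table_type) parsed from DESCRIBE FORMATTED text."""
    location = table_type = None
    for line in describe_output.splitlines():
        if line.startswith("Location:"):
            location = line.split(":", 1)[1].strip()
        elif line.startswith("Table Type:"):
            table_type = line.split(":", 1)[1].strip()
    return location, table_type

loc, ttype = table_info(sample_output)
print(ttype)  # EXTERNAL_TABLE -> dropping the table will NOT delete the files
```

If the reported type is EXTERNAL_TABLE, the files at the reported location survive a DROP TABLE and must be removed by hand.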
I have copied the data and folder structure for a database with partitioned Hive tables from one HDFS instance to another.
How can I do the same with the Hive metadata? I need the new HDFS instance's Hive to have this database and its tables defined using their existing partitioning, just as in the original location. And, of course, they need to maintain their original schemas in general, with the HDFS external table locations being updated.
I am happy to use direct Hive commands, Spark, or any general CLI utilities that are open source and readily available. I don't have an actual Hadoop cluster (this is cloud storage), so please avoid answers that depend on MapReduce etc. (like Sqoop).
Use the Hive command:
SHOW CREATE TABLE tablename;
This prints the CREATE TABLE statement. Copy it; change the table type to external, and the location, schema, and column names if necessary; and execute it.
After you have created the table, use this command to create the partition metadata:
MSCK [REPAIR] TABLE tablename;
The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:
ALTER TABLE tablename RECOVER PARTITIONS;
This will add Hive partitions metadata. See manual here: RECOVER PARTITIONS
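The "copy and change" step above can also be scripted. Here is a rough sketch that turns a SHOW CREATE TABLE dump into an external-table DDL pointing at a new location; the sample DDL, table names, and regexes are illustrative only, not a full HiveQL parser:

```python
import re

# Hypothetical sketch: rewrite a `SHOW CREATE TABLE` dump so it creates an
# EXTERNAL table at a new location on the target HDFS instance.
# The DDL below and the regexes are illustrative, not a real parser.
original_ddl = """\
CREATE TABLE `mydb.events`(
  `id` bigint,
  `ts` string)
PARTITIONED BY (`dt` string)
STORED AS ORC
LOCATION
  'hdfs://old-cluster/apps/hive/warehouse/mydb.db/events'
"""

def to_external(ddl, new_location):
    # Force the EXTERNAL keyword (a no-op if it is already there).
    ddl = re.sub(r"^CREATE TABLE", "CREATE EXTERNAL TABLE", ddl, count=1)
    # Point the LOCATION clause at the new data directory.
    ddl = re.sub(r"LOCATION\s*\n\s*'[^']*'",
                 "LOCATION\n  '%s'" % new_location, ddl)
    return ddl

new_ddl = to_external(original_ddl, "hdfs://new-cluster/data/mydb.db/events")
print(new_ddl)
```

After executing the rewritten DDL on the target cluster, MSCK REPAIR TABLE (or RECOVER PARTITIONS on EMR) fills in the partition metadata as described above.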
I have a Hive external table in one database with around 600 billion records and 100 columns. I need to copy the data as-is to the same table in another database. I am trying to write Spark code, but it is taking forever. Is there any recommendation for how I should write the code? I am new to Spark!
Do not copy the data; let it sit where it is. Create an external table in the other database with its location pointing to the existing data location.
USE YOUR_DATABASE;
CREATE EXTERNAL TABLE abc ... LOCATION 'hdfs://your/data';
Recover partitions if necessary using MSCK REPAIR TABLE abc; or, if you are on EMR, ALTER TABLE abc RECOVER PARTITIONS;.
If you absolutely need to copy the data to another location (and if you are on a paid Amazon EC2 cluster, you need a good reason for spending money on this), use distcp (the distributed copy tool):
hadoop distcp hdfs://your/data hdfs://your/data2
I am using Spark 2.2.1, which has a useful option to specify how many records I want saved in each output file; this feature makes it possible to avoid a repartition before writing.
However, it seems this option is usable only with the FileWriter interface and not with the DataFrameWriter one:
used this way, the option is ignored
df.write.mode("overwrite")
.option("maxRecordsPerFile", 10000)
.insertInto(hive_table)
while used this way it works
df.write.option("maxRecordsPerFile", 10000)
.mode("overwrite").orc(path_hive_table)
so I am writing ORC files directly into the warehouse folder of the specified table. The problem is that if I query the Hive table after the insertion, the data is not recognized by Hive.
Do you know if there is a way to write partition files directly inside the table's warehouse folder and also make them available through the Hive table?
Debug steps:
1. Check the file format your Hive table consumes:
Show create table table_name
and check the "STORED AS" clause.
For better efficiency, save your output in Parquet and at the partition location (you can see that under "LOCATION" in the above query). If the table uses some other specific format, write the files in that format.
2. If you are saving data into a partition and manually creating the partition folder, avoid that. Create the partition using:
alter table {table_name} add partition ({partition_column}={value});
3. After creating the output files in Spark, you can reload them and check for "_corrupt_record" (print the DataFrame and check this).
Adding to this, I also found that the 'MSCK REPAIR TABLE' command automatically discovers new partitions inside the Hive table folder.
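For intuition, the maxRecordsPerFile option discussed above simply caps how many rows land in each output file written by a task. Conceptually it behaves like this pure-Python sketch (this is an illustration, not Spark's actual implementation):

```python
# Conceptual sketch of what `maxRecordsPerFile` does: split one task's rows
# into output files of at most `max_records` rows each.
# Pure-Python illustration only, not Spark's real implementation.
def split_into_files(rows, max_records):
    files = []
    for i in range(0, len(rows), max_records):
        files.append(rows[i:i + max_records])
    return files

rows = list(range(25000))
files = split_into_files(rows, 10000)
print([len(f) for f in files])  # [10000, 10000, 5000]
```

This is why no explicit repartition is needed: the writer itself rolls over to a new file once the record cap is reached.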
As part of a data integration process I am working on, I have a need to persist a Spark SQL DataFrame as an external Hive table.
My constraints at the moment:
Currently limited to Spark 1.6 (v1.6.0)
Need to persist the data in a specific location, retaining the data even if the table definition is dropped (hence external table)
I have found what appears to be a satisfactory solution for writing the DataFrame, df, as follows:
df.write.saveAsTable('schema.table_name',
format='parquet',
mode='overwrite',
path='/path/to/external/table/files/')
Doing a describe extended schema.table_name against the resulting table confirms that it is indeed external. I can also confirm that the data is retained (as desired) even if the table itself is dropped.
My main concern is that I can't really find a documented example of this anywhere, nor much mention of it in the official docs,
particularly the use of path to enforce the creation of an external table
(https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter).
Is there a better/safer/more standard way to persist the dataframe?
I prefer creating the Hive tables myself (e.g. CREATE EXTERNAL TABLE IF NOT EXISTS) exactly as I need them, and then in Spark just doing: df.write.saveAsTable('schema.table_name', mode='overwrite').
This way you have control over the table creation and don't depend on the HiveContext doing what you need. In the past there were issues with Hive tables created this way, and the behavior can change in the future, since that API is generic and cannot guarantee the underlying implementation by the HiveContext.
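As a sketch, the pre-created table in this approach might look like the following; the table name, columns, storage format, and path here are all hypothetical placeholders:

```sql
-- Hypothetical example: define the external table yourself, once, with the
-- exact schema and location you need. Spark then only writes into it.
CREATE EXTERNAL TABLE IF NOT EXISTS schema.table_name (
  id   BIGINT,
  name STRING
)
STORED AS PARQUET
LOCATION '/path/to/external/table/files/';
```

With the table defined up front, the Spark write becomes a plain data operation rather than an implicit table creation.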
I am trying to copy my Cassandra data from one cluster to another; for that I am using sstableloader.
Everything goes fine in the process except for the data files created for my secondary-index columns. Whenever I try to load them, it fails with "COLUMN FAMILY DOES NOT EXIST".
I have created the schema from the source cluster.
I know the format of the .db files is keyspace-columnfamily-generation-number-Data.db, but the .db files for my indexed CF are named keyspace-columnfamily-index_name-generation-number-Data.db, so the loader searches for a CF whose name includes the index name.
How can I load these files using sstableloader?
You should not load the index files. Your target cluster will rebuild the secondary indexes after it loads the SSTables, provided the schema definition in the target cluster is the same as in the source cluster. The reason you see that message is that, under the hood, secondary indexes are implemented as local column families.
So: load only the base tables' "Data.db" files, define your schema in the target cluster, load the SSTables, and restart. The rest should be taken care of for you.
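As a rough sketch, selecting only the base-table Data.db files from a directory listing could look like the following. Per the naming described in the question, index files carry an extra index_name component; the filenames below are illustrative only:

```python
# Rough sketch: pick out only the base-table Data.db files, skipping the
# secondary-index files. Per the naming described in the question, index
# files carry an extra index_name component. Filenames are illustrative.
def is_index_sstable(filename):
    # base:  keyspace-columnfamily-generation-Data.db            -> 4 parts
    # index: keyspace-columnfamily-index_name-generation-Data.db -> 5 parts
    return filename.endswith("-Data.db") and len(filename.split("-")) > 4

listing = [
    "myks-users-1-Data.db",
    "myks-users-by_email_idx-1-Data.db",
    "myks-users-2-Data.db",
]
to_load = [f for f in listing
           if f.endswith("-Data.db") and not is_index_sstable(f)]
print(to_load)  # ['myks-users-1-Data.db', 'myks-users-2-Data.db']
```

Only the files in to_load would then be fed to sstableloader; the index SSTables are left behind and rebuilt by the target cluster.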