I am trying to dump my Cassandra data from one cluster to another, and for that I am using sstableloader.
Everything goes fine in the loading process except for the data files created for my secondary index columns. Whenever I try to load those, it fails with "COLUMN FAMILY DOES NOT EXISTS".
I have created the schema from the source cluster.
I know the usual format of a .db file is keyspace-columnfamily-generation-number-Data.db, but the .db files for my indexed CF have the form keyspace-columnfamily-index_name-generation-number-Data.db. So it is searching for a CF whose name includes the index_name.
How can I dump these files using sstableloader?
You should not dump the index files. Your target cluster will rebuild the secondary indexes after it loads the SSTables, provided the schema definition in the target cluster is the same as in the source cluster. The reason you see that message is that, under the hood, secondary indexes are implemented as a local column family.
So: dump the "Data.db" files, define your schema in the target cluster, load the SSTables, and restart. The rest should be taken care of for you.
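As a rough sketch (the paths, keyspace/table names, and hosts below are placeholders, and the exact data-directory layout and file naming depend on your Cassandra version):

# Stage only the base SSTables (all components: Data, Index, Filter, Statistics, ...)
# in a directory named <keyspace>/<table>; skip any files whose names include the
# secondary index name: those belong to the index's internal column family.
mkdir -p /tmp/load/my_keyspace/my_table
# ...copy the base SSTable files of my_table here...

# With the same schema (including the index definition) already created on the target,
# stream the staged SSTables to the target cluster:
sstableloader -d target-node-1,target-node-2 /tmp/load/my_keyspace/my_table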
Related
I am trying to understand how data is stored and managed in the Databricks environment. I have a fairly decent understanding of what is going on under the hood, but I have seen some conflicting information online, so I would like a detailed explanation to solidify my understanding. To frame my questions, I'd like to summarize what I did as part of one of the exercises in the Apache Spark Developer course.
As part of the exercise, I followed these steps on the Databricks platform:
Started my cluster
Read a parquet file as a DataFrame
Stored the DataFrame as a Delta Table in my user directory in DBFS
Made some changes to the Delta Table created in the previous step
Partitioned the same Delta table by a specific column (e.g. State) and saved it to the same user directory in DBFS using overwrite mode
After following the above steps, here's how my DBFS directory looks:
DBFS Delta Log Directory
In the root folder which I used to store the Delta table (picture above), I have the following types of folders/files:
Delta log folder
Folders named after each 'State' value (from step 5 in the previous section); each state folder also contains 4 parquet files, which I suspect are partitions of the dataset
Four separate parquet files, which I suspect are from when I first created this Delta table (in step 3 of the previous section)
Based on the above exercise, here are my questions:
Is the data that I see in the above directory (the State-named folders that contain the partitions, the parquet files, the delta log, etc.) distributed across my nodes? (The answer, I presume, is yes.)
The four parquet files in the root folder (from when I created the Delta table, before partitioning): assuming they are distributed across my nodes, are they stored in my nodes' RAM?
Where is the data from the delta_log folder stored? If it's across my nodes, is it stored in RAM or on disk?
Where is the data (the parquet files/partitions under each state-named folder in the screenshot above) stored? If this is also distributed across my nodes, is it in memory (RAM) or on disk?
Some of the answers I looked at online say that all the partitions are stored in memory (RAM). By that logic, once I turn off my cluster they should be removed from memory, right?
However, even when I turn off my cluster I am still able to view all the data in DBFS (exactly as in the picture I have included above). I would expect that once the cluster is turned off the RAM is cleared, so I should not be able to see any data that was held in RAM. Is my understanding incorrect?
I would appreciate it if you could answer my questions in order, with as much detail as possible.
When you write data out to DBFS, it is stored in some form of permanent object storage separate from your cluster. This is why it is still there after the cluster shuts down. Exactly which storage it is depends on the cloud your Databricks workspace runs in.
This is the main idea of separating compute and storage: your clusters are the compute, and the storage lives elsewhere. Only when you read in and process the data is it distributed across your nodes for processing. Once your cluster shuts down, any data on the nodes (in RAM or on local disk) is gone unless you've written it out to some form of permanent storage.
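To illustrate with a minimal sketch (the source path, user directory, and partition column are assumed from the exercise description, not taken from your notebook):

# Runs in a Databricks notebook; spark is provided by the runtime.
df = spark.read.parquet("dbfs:/databricks-datasets/some/source.parquet")  # hypothetical source path

# Steps 3-5: write a Delta table, partitioned by State, into the user directory on DBFS.
# This lands in the workspace's cloud object storage, not on the cluster nodes.
(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("State")
   .save("dbfs:/user/me@example.com/delta/my_table"))

# After a cluster restart, the data is still there and is read back from object storage:
df2 = spark.read.format("delta").load("dbfs:/user/me@example.com/delta/my_table")

The _delta_log JSON files and the parquet data files all live under that same object-storage path, which is why DBFS still shows them after the cluster is gone.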
I need to replicate a local node that uses SimpleStrategy to a remote node in another Cassandra database. Does anyone have any idea where I should begin?
The main complexity here, if you're writing data into both clusters, is how to avoid overwriting data that has changed in the cloud more recently than in your local setup. There are several possibilities:
If the structure of the tables is the same (including the names of the keyspaces, if user-defined types are used), then you can just copy the SSTables from your local machine to the cloud and use sstableloader to replay them. In this case Cassandra will obey the actual writetime and won't overwrite data that changed later. Also, if you're doing deletes from tables, you need to copy the SSTables before the tombstones expire. You don't have to copy all SSTables every time, just the files that have changed since the last upload, but you always need to copy SSTables from every node you're uploading from.
If the structure isn't the same, then you can look at using either DSBulk or the Spark Cassandra Connector. In both cases you'll need to export the data together with its writetime and then load it with that timestamp. Please note that in both cases, if different columns have different writetimes, you will need to load that data separately, because Cassandra allows you to specify only one timestamp when updating/inserting data.
In the case of DSBulk, you can follow example 19.4 for exporting the data from this blog post, and example 11.3 for loading it (from another blog post), so this may require some shell scripting. You'll also need enough disk space to keep the exported data (though you can use compression).
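As a rough sketch of the DSBulk route (the host names, keyspace, table, and column names below are made up; the linked blog posts are the authoritative examples):

# Export, capturing the writetime of the value column alongside the data
dsbulk unload -h source-host \
  -query "SELECT id, value, writetime(value) AS w FROM ks.tbl" \
  -url /tmp/export

# Load into the target, applying the exported writetime as the mutation timestamp
dsbulk load -h target-host \
  -query "INSERT INTO ks.tbl (id, value) VALUES (:id, :value) USING TIMESTAMP :w" \
  -url /tmp/export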
In the case of the Spark Cassandra Connector, you can move the data without intermediate storage if both nodes are accessible from Spark, but you'll need to write some Spark code that reads and writes the data using the RDD or DataFrame APIs.
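A minimal PySpark sketch of the connector approach, just to show the shape of the code (hosts, keyspace, table, and the connector version are placeholders, and passing the connection host per read/write assumes a recent connector release). Note that this plain copy does not carry the writetime over, which is the per-column handling discussed above:

from pyspark.sql import SparkSession

# The Spark Cassandra Connector must be available on the cluster, e.g. via
# --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.0
spark = SparkSession.builder.appName("cassandra-copy").getOrCreate()

# Read the table from the source (local) cluster
src = (spark.read
       .format("org.apache.spark.sql.cassandra")
       .option("spark.cassandra.connection.host", "local-node")
       .options(keyspace="ks", table="tbl")
       .load())

# Append it into the same table on the target (remote) cluster
(src.write
    .format("org.apache.spark.sql.cassandra")
    .option("spark.cassandra.connection.host", "remote-node")
    .options(keyspace="ks", table="tbl")
    .mode("append")
    .save())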
We have Spark jobs but also occasionally run Hive queries on our current Hadoop cluster.
I have seen the same Hive table use different file patterns in different partitions, like below.
The table is partitioned by date, so
hdfs dfs -ls /data/hive/warehouse/db_name/table_name/part_date=2019-12-01/
gives this result:
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-00001
....
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06669
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06670
However, if I list a different partition date,
hdfs dfs -ls /data/hive/warehouse/db_name/table_name/part_date=2020-01-01/
it lists files with a different name pattern:
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000007_0
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000008_0
....
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000010_0
From what I can tell, the difference is not only that in one partition the data files have the part- prefix while in the other they look like 00000n_0; there are also far more of the part- files, and each one is quite small.
I also found that aggregations over the part- files are a lot slower than over the 00000n_0 files.
What could be the possible cause of the file pattern difference, and what configuration could I change to go from one pattern to the other?
When Spark Streaming writes data into Hive, it creates lots of small files with the part- prefix, and their number keeps increasing. This causes performance issues when querying the Hive table: Hive takes too long to return results because of the large number of small files in the partition.
When a Spark job writes data into Hive, it looks like this:
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-00001
....
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06669
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06670
The different file pattern here comes from compaction logic run on the partition's files, which compacts the small files into larger ones. The n in 00000n_0 is the reducer number.
Here is a sample compaction script, which compacts the small files into bigger files within the partitions of the example table under the sample database:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.reducers.bytes.per.reducer=268435456; -- 256MB per reducer

-- Stage a copy of the current data in a temporary table
CREATE TABLE sample.example_tmp
STORED AS parquet
LOCATION '/user/hive/warehouse/sample.db/example_tmp'
AS
SELECT * FROM sample.example;

-- Rewrite the partitions from the staged copy; the reducers produce the larger 00000n_0 files
INSERT OVERWRITE TABLE sample.example PARTITION (part_date) SELECT * FROM sample.example_tmp;

-- Clean up the temporary table
DROP TABLE IF EXISTS sample.example_tmp PURGE;
The above script will compact the small files into a few big files within each partition, and the filenames will follow the 00000n_0 pattern.
"What could be the possible cause of the file pattern difference, and what configuration could I change to go from one pattern to the other?"
Someone might have run compaction logic on the partition using Hive, or might have reloaded the partition data using Hive. This is not an issue; the data remains the same.
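If you want the Spark job itself to write fewer, larger part- files in the first place (rather than compacting afterwards), one common approach, sketched here with hypothetical table and DataFrame names, is to reduce the number of output partitions before the write:

# Hypothetical Spark batch job appending into the Hive table; coalescing to a small
# number of partitions before the write limits how many output files are produced.
(df.coalesce(4)
   .write
   .mode("append")
   .insertInto("db_name.table_name"))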
I have copied the data and folder structure for a database with partitioned hive tables from one HDFS instance to another.
How can I do the same with the Hive metadata? I need the new HDFS instance's Hive to have this database and its tables defined with their existing partitioning, just as in the original location. And, of course, they need to keep their original schemas in general, with the HDFS external table locations updated.
Happy to use direct Hive commands, Spark, or any general CLI utilities that are open source and readily available. I don't have an actual Hadoop cluster (this is cloud storage), so please avoid answers that depend on MapReduce etc. (like Sqoop).
Use the Hive command:
SHOW CREATE TABLE tablename;
This will print the CREATE TABLE statement. Copy it, change the table type to external, adjust the location, schema, and column names if necessary, and execute it.
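For example, the edited statement might end up looking like the sketch below (the column list, storage format, and location are hypothetical placeholders for whatever SHOW CREATE TABLE printed):

CREATE EXTERNAL TABLE db_name.table_name (
  id BIGINT,
  value STRING
)
PARTITIONED BY (part_date STRING)
STORED AS PARQUET
LOCATION 'hdfs://new-namenode:8020/data/hive/warehouse/db_name/table_name';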
After you have created the table, use this command to create the partition metadata:
MSCK [REPAIR] TABLE tablename;
The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:
ALTER TABLE tablename RECOVER PARTITIONS;
This will add the Hive partition metadata. See the manual here: RECOVER PARTITIONS
Hi, I am relatively new to Hive and HDFS, so apologies in advance if I am not wording this correctly.
I have used Microsoft Azure to create a virtual machine, and I am logging into it using PuTTY and the Ambari Sandbox.
In Ambari I am using Hive; all is working fine, but I am having major issues with memory allocation.
When I drop a table in Hive, I then go into my 'Hive View' and delete the table from the trash folder. However, this frees up no memory within HDFS.
The table is now gone from my Hive database and also from the trash folder, but no memory has been freed.
Is there somewhere else where I should be deleting the table from?
Thanks in advance.
Based on your description, and as @DuduMarkovitz said, I also don't know what "HDFS memory" you mean, but I think what you are referring to is the table's data files on HDFS.
In my experience, the table you dropped in Hive is probably an external table, not an internal table. You can see this behaviour described in the official Hive documentation for external tables:
External Tables
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
For the difference between internal and external tables, you can refer to here.
So if you want to reclaim the external table's data on HDFS after dropping the table, you need to use the command below to remove it from HDFS manually:
hadoop fs -rm -f -r <your-hdfs-path-url>/apps/hive/warehouse/<database name>/<table-name>
Hope it helps.
Try the DESCRIBE FORMATTED <table_name> command. It should show you the location of the table's files in HDFS. Check whether this location is empty.
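For example, using the location printed by DESCRIBE FORMATTED (the path below is just a typical default warehouse path):

# Shows how much space the table's files still occupy on HDFS
hdfs dfs -du -h /apps/hive/warehouse/mydb.db/mytable

If files are still listed there, the space has not actually been released.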