Apache Hive Add TIMESTAMP partition using alter table statement - apache-spark

I'm currently running MSCK REPAIR TABLE schema.tablename for all my tables after the data is loaded.
As the partitions grow, this statement takes much longer (sometimes more than 5 minutes) for a single table. I know it scans and parses all partitions in S3 (where my data is) and then adds the latest partitions to the Hive metastore.
I want to replace MSCK REPAIR with an ALTER TABLE ADD PARTITION statement. MSCK REPAIR works perfectly fine for adding the latest partitions, but I'm facing a problem with the TIMESTAMP value in the partition when using ALTER TABLE ADD PARTITION.
I have a table with four partition columns (part_dt STRING, part_src STRING, part_src_file STRING, part_ldts TIMESTAMP).
After running MSCK REPAIR, the SHOW PARTITIONS command gives me the output below:
hive> show partitions hub_cont;
OK
part_dt=20181016/part_src=asfs/part_src_file=kjui/part_ldts=2019-05-02 06%3A30%3A39
But when I drop the above partition from the metastore and recreate it using ALTER TABLE ADD PARTITION:
hive> alter table hub_cont add partition(part_dt='20181016',part_src='asfs',part_src_file='kjui',part_ldts='2019-05-02 06:30:39');
OK
Time taken: 1.595 seconds
hive> show partitions hub_cont;
OK
part_dt=20181016/part_src=asfs/part_src_file=kjui/part_ldts=2019-05-02 06%3A30%3A39.0
Time taken: 0.128 seconds, Fetched: 1 row(s)
It is adding .0 at the end of the timestamp value. When I query the table for this partition, it gives me 0 records.
Is there a way to add a partition with a timestamp value without this zero getting appended at the end? I'm unable to figure out what MSCK REPAIR does to handle this case that the ALTER TABLE statement does not.

The same thing happens if you insert dynamic partitions: new partitions are created with .0 because the default string representation of a timestamp includes the milliseconds part. MSCK REPAIR TABLE just finds the new folders and adds the partitions to the metastore, and it also works correctly because a timestamp string without milliseconds is still compatible with the timestamp type.
The solution is to use STRING instead of TIMESTAMP for the partition column and remove the milliseconds explicitly.
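As a hedged sketch of that approach (assuming the table is recreated with part_ldts as a STRING partition column and that a Hive-enabled SparkSession is available as spark), the load timestamp can be formatted without milliseconds before issuing the ALTER TABLE statement:
// Sketch only: part_ldts is assumed to be redefined as STRING.
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
// Format the load timestamp without the .SSS millisecond part.
val partLdts = LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"))
// Add the partition with the same value MSCK REPAIR would discover on S3.
spark.sql(
  s"""ALTER TABLE hub_cont ADD IF NOT EXISTS PARTITION (
     |  part_dt='20181016', part_src='asfs', part_src_file='kjui',
     |  part_ldts='$partLdts')""".stripMargin)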
But first of all, double-check that you really have millions of rows in a single partition and really need a timestamp-grain partition rather than a DATE, and that this partition column is really significant (for example, if it is functionally dependent on another partition column such as part_src_file, you can get rid of it completely). Too many partitions will cause performance degradation.

Related

Performance of pyspark + hive when a table has many partition columns

I am trying to understand the performance impact of the partitioning scheme when Spark is used to query a Hive table. As an example:
Table 1 has 3 partition columns, and data is stored in paths like
year=2021/month=01/day=01/...data...
Table 2 has 1 partition column
date=20210101/...data...
Anecdotally I have found that queries on the second type of table are faster, but I don't know why. I'd like to understand this so I know how to design the partitioning of larger tables that could have more partitions.
Queries being tested:
select * from table limit 1
I realize this won't benefit from any kind of query pruning.
The above is meant as an example query to demonstrate what I am trying to understand. But in case the details are important:
This is using S3, not HDFS.
The data in the table is very small, and there is not a large number of partitions.
The time for running the query is ~2 minutes on the first table and ~10 seconds on the second.
Data is stored as Parquet.
Setting aside all the other factors you did not mention (storage type, configuration, cluster capacity, the number of files in each case), your partitioning scheme does not correspond to the use case.
A partitioning scheme should be chosen based on how the data will be selected, how it will be written, or both. In your case, partitioning by year, month, and day separately is over-partitioning. Partitions in Hive are hierarchical folders, and all of them have to be traversed (even if only the metadata is used) to determine the data path; with a single date partition, only one directory level is read. The two additional folder levels (year + month + day instead of date) do not help with partition pruning, because the columns are related and are always used together in the WHERE clause.
Also, partition pruning probably does not work at all with 3 partition columns and a predicate like this: where date = concat(year, month, day)
Use EXPLAIN to check it, and compare with a predicate like this: where year='some year' and month='some month' and day='some day'
If most of your queries have one more column in the WHERE clause, say category, which does not correlate with date, and the data is big, then additionally partitioning by it makes sense; you will then benefit from partition pruning.
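To make the EXPLAIN check above concrete, here is a hedged sketch from Spark (the table name my_table and the literal values are hypothetical); partition filters should show up only in the second plan:
// Concatenating partition columns usually defeats partition pruning...
spark.sql(
  "EXPLAIN EXTENDED SELECT * FROM my_table " +
  "WHERE concat(year, month, day) = '20210101'").show(truncate = false)
// ...while constraining each partition column directly allows pruning.
spark.sql(
  "EXPLAIN EXTENDED SELECT * FROM my_table " +
  "WHERE year = '2021' AND month = '01' AND day = '01'").show(truncate = false)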

Cassandra failure during read query

I have a Cassandra table with ~500 columns and primary key ((userId, version, shredId), rowId), where shredId is used to distribute data evenly across partitions. The table also has a default TTL of 2 days to expire data, since the data is used for real-time aggregation. The compaction strategy is TimeWindowCompactionStrategy.
The workflow is:
write data to input table (with consistency EACH_QUORUM)
Run Spark aggregation (on rows with the same userId and version)
write aggregated data to output table.
But I'm getting a Cassandra failure during read query when the size of the data gets large; more specifically, once there are more than 210 rows in one partition, read queries fail.
How can I tune my database and change properties to fix this?
After investigation and research, the issue turned out to be caused by null values being inserted for some empty columns. This creates a large number of tombstones and eventually times out the query.
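A hedged way to avoid those tombstones from Spark (a sketch; the keyspace and table names are hypothetical, and it assumes a spark-cassandra-connector version that supports the ignoreNulls output option) is to leave null columns unset instead of writing them:
// Sketch only: write the aggregated DataFrame without inserting nulls for
// empty columns, so no tombstones are created for them.
aggregatedDf.write
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "my_keyspace")                    // hypothetical keyspace
  .option("table", "output_table")                      // hypothetical table
  .option("spark.cassandra.output.ignoreNulls", "true") // treat nulls as unset
  .mode("append")
  .save()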

Replace a hive partition from Spark

Is there a way I can replace an existing Hive partition from a Spark program? I want to replace only the latest partition; the rest of the partitions remain the same.
Below is the idea I am trying to work on.
We get transactional data from our RDBMS systems coming into HDFS every minute. There will be a Spark program (running every 5 or 10 minutes) which reads the data, performs the ETL and writes the output into a Hive table.
Since overwriting the entire Hive table would be huge, we would like to overwrite only today's partition in the Hive table.
At the end of the day, the source and destination partitions would change to the next day.
Thanks in advance
Since you know the Hive table location, append the current date to the location (as your table is partitioned on date) and overwrite that HDFS path:
df.write.format(source).mode("overwrite").save(path)
Then run MSCK REPAIR TABLE once the write is completed.
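As a slightly more concrete, hedged sketch of that answer (the database, table, location and partition column names are hypothetical, assuming a Parquet table partitioned on a string date column):
// Sketch only: overwrite just today's partition directory, then refresh the metastore.
val currentDate = "20210101"                               // today's partition value
val tableLocation = "hdfs:///warehouse/my_db.db/my_table"  // hypothetical table location
val partitionPath = s"$tableLocation/part_dt=$currentDate"
df.write
  .format("parquet")   // assumed storage format
  .mode("overwrite")   // replaces only this partition's path
  .save(partitionPath)
// Register the (possibly new) partition in the Hive metastore.
spark.sql("MSCK REPAIR TABLE my_db.my_table")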

Total row count in Cassandra

I understand that select count(*) from table where partitionId = 'test' will return the count of the rows. I can see that it takes about the same time as select * from table where partitionId = 'test'.
Is there any other alternative in Cassandra to retrieve the count of the rows in an efficient way?
You can compare the results of select * and select count(*) if you run cqlsh and enable tracing there with the tracing on command; it will print the time required to execute the corresponding command. The only difference between the two queries is in how much data is returned back.
But either way, to find the number of rows Cassandra needs to hit the SSTable(s) and scan the entries. Performance could be different if the partition is spread across multiple SSTables; this may depend on the compaction strategy for your tables, which is selected based on your read/write patterns.
As Alex Ott mentioned, COUNT(*) needs to go through the entire partition to know the total.
The fact is that Cassandra wants to avoid locks, and as a result it does not maintain a row count in its SSTables. Each time you do an INSERT, UPDATE, or DELETE, you may actually overwrite another entry, which is just marked with a tombstone (i.e. it's not an in-place overwrite; instead the new data is saved at the end of the SSTable and the old data is marked as dead).
The COUNT(*) will go through the SSTables and count all the entries not marked as a tombstone. That's very costly. We're used to SQL keeping the total number of rows in a table or an index, so COUNT(*) on those is instantaneous... not here.
One solution I've used is to have Elasticsearch installed on your Cassandra cluster. One of the figures Elasticsearch keeps in its stats is the number of rows in a table. I don't remember the exact query, but more or less you can just send a count request and get a result in about 100 ms, always, whatever the number is, even in the tens of millions of rows. Just like with a SELECT COUNT(*), the result will always be an approximation if you have many writes happening in parallel. It will stabilize if the writes stop for long enough (possibly about 1 or 2 seconds).

Spark and cassandra, range query on clustering key

I have a Cassandra table with the following structure:
CREATE TABLE table (
key int,
time timestamp,
measure float,
primary key (key, time)
);
I need to create a Spark job which will read data from the previous table, do some processing on the rows within a specified start and end timestamp, and flush the results back to Cassandra.
So my spark-cassandra-connector will have to do a range query on the clustering column of the Cassandra table.
Are there any performance differences if I do:
sc.cassandraTable(keyspace, table).
as(caseClassObject).
filter(a => a.time.after(startTime) && a.time.before(endTime)).....
so what I am doing is loading all the data into Spark and applying filtering
OR if I do this:
sc.cassandraTable(keyspace, table).
where(s"time>$startTime and time<$endTime)......
which filters all the data in Cassandra and then loads the smaller subset into Spark.
The selectivity of the range query will be around 1%.
It is impossible to include the partition key in the query.
Which of these two solutions is preferred?
sc.cassandraTable(keyspace, table).where(s"time>$startTime and time<$endTime")
Will be MUCH faster. Compared to the full grab in the first command, you are doing only a fraction of the work to get the same data (if you only pull 5% of the data, roughly 5% of the total work).
In the first case you are
Reading all of the data from Cassandra.
Serializing every object and then moving it to Spark.
Then finally filtering everything.
In the second case you are
Reading only the data you actually want from C*
Serializing only this tiny subset
There is no step 3
As an additional comment, you can also put your case class type right in the call:
sc.cassandraTable[CaseClassObject](keyspace, table)
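Putting both suggestions together, here is a hedged sketch of the server-side-filtered read (the Measurement case class is hypothetical, mirroring the table definition above):
import com.datastax.spark.connector._   // enables sc.cassandraTable
// Hypothetical case class matching the table's columns.
case class Measurement(key: Int, time: java.util.Date, measure: Float)
// The where clause is pushed down to Cassandra, so only the requested
// time range is read, serialized and shipped to Spark.
val measurements = sc
  .cassandraTable[Measurement](keyspace, table)
  .where("time > ? and time < ?", startTime, endTime)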
