Hive/Impala performance with string partition key vs Integer partition key - apache-spark

Are numeric columns recommended for partition keys? Will there be any performance difference when we do a select query on numeric column partitions vs string column partitions?

Well, it makes a difference if you look up the official Impala documentation.
Instead of elaborating, I will paste the section from the doc, as I think it states it quite well:
"Although it might be convenient to use STRING columns for partition keys, even when those columns contain numbers, for performance and scalability it is much better to use numeric columns as partition keys whenever practical. Although the underlying HDFS directory name might be the same in either case, the in-memory storage for the partition key columns is more compact, and computations are faster, if partition key columns such as YEAR, MONTH, DAY and so on are declared as INT, SMALLINT, and so on."
Reference: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/impala_string.html

No, there is no such recommendation. Consider this:
The thing is that partition representation in Hive is a folder with a name like 'key=value' or it can be just 'value' but anyway it is string folder name. So it is being stored as string and is being cast during read/write. Partition key value is not packed inside data files and not compressed.
Due to the distributed/parallel nature of map-reduce and Impalla, you will never notice the difference in query processing performance. Also all data will be serialized to be passed between processing stages, then again deserialized and cast to some type, this can happen many times for the same query.
There are a lot of overhead created by distributed processing and serializing/deserializing data. Practically only the size of data matters. The smaller the table (it's files size) the faster it works. But you will not improve performance by restricting types.
Big string values used as partition keys can affect metadata DB performance, as well as the number of partitions being processed also can affect performance. Again the same: only the size of data matters here, not types.
1, 0 can be better than 'Yes', 'No' just because of size. And compression and parallelism can make this difference negligible in many cases.

Related

How does Cassandra store variable data types like text

assumption is, Cassandra will store fixed length data in column family. like a column family: id(bigint), age(int), description(text), picture(blob). Now description and picture have no limit. How does it store that? Does Cassandra externalize through an ID -> location way?
For example, looks like, in relational databases, a pointer is used to point to the actual location of large texts. See how it is done
Also, looks like, in mysql, it is recommended to use char instead of varchar for better performance. I guess simply because, there is no need for an "id lookup". See: mysql char vs varchar
enter code here
`
Cassandra stores individual cells (column values) in its on-disk files ("sstables") as a 32-bit length followed by the data bytes. So string values do not need to have a fixed size, nor are stored as pointers to other locations - the complete string appears as-is inside the data file.
The 32-bit length limit means that each "text" or "blob" value is limited to 2GB in length, but in practice, you shouldn't use anything even close to that - with Cassandra documentation suggesting you shouldn't use more than 1MB. There are several problems with having very large values:
Because values are not stored as pointers to some other storage, but rather stored inline in the sttable files, these large strings get copied around every time sstable files get rewritten, namely during compaction. It would be more efficient to keep the huge string on disk in a separate files and just copy around pointers to it - but Cassandra doesn't do this.
The Cassandra query language (CQL) does not have any mechanism for store or retrieving a partial cell. So if you have a 2GB string, you have to retrieve it entirely - there is no way to "page" through it, nor a way to write it incrementally.
In Scylla, large cells will result in large latency spikes because Scylla will handle the very large cell atomically and not context-switch to do other work. In Cassandra this problem will be less pronounced but will still likely cause problems (the thread stuck on the large cell will monopolize the CPU until preempted by the operating system).

Wide partition pattern in Cassandra

What is 'wide partition pattern' in Cassandra? In book 'Defiinitive Cassandra' it seems its a recommended thing, but in some online articles I see its something to be avoided.
So what actually it is and it is preferrable or not?
Partition in Cassandra represent grouping of similar kind of rows. In Cassandra it is recommended to model your data such that you should have similar kind of rows fall in same partition. This is called wide partition pattern.
Searching in Cassandra is super fast using partition key. So wide partition pattern is recommened. But with this recommendation comes a warning that your wide partitions should not become too large.
The reason for warning (avoiding large partitions) is that searching becomes too slow as search within partition is slow. Also it puts lot of pressure on heap.
For better understanding, would recommend reading this blog https://thelastpickle.com/blog/2019/01/11/wide-partitions-cassandra-3-11.html
wide partition refers a partition which contains many cells/values (number_of_cells = column * row)
The commonly used time series design pattern is a great example for wide partition table design: You use the date bucket as the partition key, each bucket will contain all timestamps for the date range defined by that bucket.
This model illustrates why it's called "wide" partition
another dynamodb example, the idea is the same (sort key in dynamodb is the same as clustering key in cassandra)

Align Dataset partitioning to table partitioning scheme

I am writing to a table partitioned by month. I know that my data is ≈100MB per partition, no skew - it is going to fit within single HDFS block and I want to ensure that every partition gets a single file written. I also know the exact number of months in my dataset (which is something between 1 and 10), therefore:
ds.repartition(nMonths, $"month").write.<options>.insertInto(<...>)
This works. However I'm thinking from here... As Spark uses key's hash to determine the partition, I have no guarantee that every partition will receive a single month's data. The more partitions I have, the less likely this actually is - right?
Does it make sense then to increase the number of partitions above number of distinct keys?
ds.repartition(nMonths * 3, $"month").write.<options>.insertInto(<...>)
Lots of partitions will be empty, but this shouldn't be that much of a pain (should it?) and we're reducing the probability that some unlucky partitions get 3x/4x data, increasing overall execution time. Does this make sense? Is there any rule of thumb regarding the factor? Or any other approach to achieve the same?
If you want to be super-safe you can use range partitioning, something like:
ds.repartitionByRange(nMonths,$"month").write...
This way you also won't be having empty partitions, which in turn means you won't produce zero-size files in HDFS too.

What is better in cassandra many tables of same structure or one table with many rows

Suppose that I have 1000 entities with exactly the same structure. For example all entities have three fields:
String id;
String name;
int amount;
Also I expect that there will be huge amount of every type of entity in the system.
So I have two variants right now:
For each entity create separate table which looks like:
CREATE TABLE <SOME_ENTITY_NAME> (
id text PRIMARY KEY,
name text,
amount int
)
I'll create only one table but with composite priamry key:
CREATE TABLE ALL_ENTITIES_TABLE (
entity_name text,
id text,
name text,
amount int,
PRIMARY KEY ((entity_name, id))
);
Of course, supporting only one table is more simplier, but what is with performance?
So, the question is what variant is better in terms of performance, taking into account that each type of entity will have millions(may be billions) of records?
There is a limitation on the number of the tables that can be created in the Cassandra cluster. Usual recommendation is too keep this number lower than 200, with ~500 is like a "hard stop"...
The reason for this is that every table requires allocation of additional memory, and other resources to keep auxiliary data, like, key/row caches, bloom filters, etc. Depending on the Cassandra version, every table may require 1-2Mb of memory.
So in your case, the 2nd design is better because you keep all data in single table, and your partition key will allow to spread data evenly between nodes of cluster.
In my opinion the first approach is incorrect in terms of maintainability. Too much of dynamically created tables should be tough to maintain. Also, If you use your partitioning/clustering order properly (as per the need of data retrieval) it should be easier and efficient to query. Also if you are using 3.x version of Cassandra, secondary indexes can come in handy.
NOTE: Secondary indexes don't allow sorting.
Cassandra was designed around the fact that disk space is the cheapest resource among all. You must build your data model around the queries that you will be using the most regardless of whether this model would consume more disk space or not - as long as it serves the purpose of your queries in the most efficient way. I wouldn't be able to answer your question without taking a look at the queries you will be using. In general, you must feel free to create as many tables as needed as long as it serves the purpose of your queries. I would recommend having a look here.

Real time complex queries on Cassandra

We're looking for a tool (preferably open source) which helps us to perform complex queries (advanced filtering and joins, no need full SQL) in real time.
Assume that all the data needed fits in memory, and we want to avoid, if possible, the overhead of map reduce tools.
To be more specific, we need to load n partitions of a single table, and join them by clustering column.
Variables Table:
Variable ID: Partition key
Person ID: Clustering key
Variable Value
Desired output columns:
Person ID, Variable 1 Value, Variable 2 Vale, ..., Variable N Value
We can achieve it by an in-memory load-filter-join process, but we were wondering if there's any tool out there with this use case covered out of the box and with a fair performance.
We've tested Spark, but the partitioning of Spark C* connector is based on the primary key, so each Variable ID would be loaded in a different Spark node, and the join process would be really slow (all the data would travel all over the Spark cluster).
Any tips? known tools?
I believe that you have a number of options to perform this task:
Rethink your database schema, denormalize it. var_id:person_id:value rows are not the best table schema if you want to query by person_id (and it smells really bad as an entity-attribute-value db antipattern):
EAV gives a flexibility to the developer to define the schema as needed and this is good in some circumstances. On the other hand it performs very poorly in the case of an ill-defined query and can support other bad practices. In other words, EAV gives you enough rope to hang yourself and in this industry, things should be designed to the lowest level of complexity because the guy replacing you on the project will likely be an idiot.
You can use schema with multiple columns (cassandra can handle a lot of them):
create table person_data (
person_id int primary key,
var1 text,
var2 text,
var3 text,
var4 text,
....
);
If you don't have a predefined set of variables, you can use cql3 collections like map for storing the data in a more flexible way.
Create a secondary index on person_id (even it's a clustering key already). You can query for all data for a specific user without using joins, but with some issues:
As your query will hit multiple partitions, it will require not a single disk seek, but a series of them, so your query latency may be higher than you're expecting.
secondary indexes are not free: C* must perform more work under the hood if you insert a row to a table with indexed columns.
Use external index like ElasticSearch/Solr if you plan to have a lot of complex queries which do not fit well into cql3.

Resources