How to check the number of partitions that a table is distributed across in a DolphinDB database?

How can I find out how many partitions of a DolphinDB database a table is actually distributed across? For example, if I created a database with 100 partitions and a table in that database only has data in 4 of them, how do I get the number 4?

This will do it: sqlDS generates one data source per partition that contains data for the query, so the size of the result is the number of partitions the table's data actually occupies:
sqlDS(<select * from t>).size()

Related

Performance of pyspark + hive when a table has many partition columns

I am trying to understand the performance impact on the partitioning scheme when Spark is used to query a hive table. As an example:
Table 1 has 3 partition columns, and data is stored in paths like
year=2021/month=01/day=01/...data...
Table 2 has 1 partition column
date=20210101/...data...
Anecdotally I have found that queries on the second type of table are faster, but I don't know why. I'd like to understand this so I know how to design the partitioning of larger tables that could have more partitions.
Queries being tested:
select * from table limit 1
I realize this won't benefit from any kind of query pruning.
The above is meant as an example query to demonstrate what I am trying to understand. But in case the details are important:
This is using s3 not HDFS
The data in the table is very small, and there are not a large number of partitions
The time for running the query on the first table is ~2 minutes, and ~10 seconds on the second
Data is stored as parquet
Leaving aside all the other factors you did not mention (storage type, configuration, cluster capacity, the number of files in each case): your partitioning scheme does not correspond to the use case.
A partitioning scheme should be chosen based on how the data will be selected, how it will be written, or both. In your case, partitioning by year, month and day separately is over-partitioning. Partitions in Hive are hierarchical folders, and all of the levels have to be traversed (even if only metadata is used) to determine the data path; with a single date partition, only one directory level is read. The two extra folder levels (year + month + day instead of date) do not help with partition pruning, because the columns are correlated and are always used together in the WHERE clause.
Also, partition pruning probably does not work at all with 3 partition columns and a predicate like this: where date = concat(year, month, day)
Use EXPLAIN to check it, and compare with a predicate like this: where year='some year' and month='some month' and day='some day'
If most of your queries also have one more column in the WHERE clause, say category, which does not correlate with date, and the data is big, then additionally partitioning by it makes sense; you will benefit from partition pruning then.
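For example, here is a minimal PySpark sketch of the comparison suggested above (hypothetical table name, and it assumes a Hive-enabled Spark session): run EXPLAIN on both predicate styles and look for partition filters / pruned partitions in the plan.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Predicate on each partition column separately: pruning should show up in the plan.
spark.sql("""
    EXPLAIN EXTENDED
    SELECT * FROM table1
    WHERE year = '2021' AND month = '01' AND day = '01'
""").show(truncate=False)

# Derived predicate over the partition columns: pruning is likely lost,
# so the metastore/file listing has to cover every partition.
spark.sql("""
    EXPLAIN EXTENDED
    SELECT * FROM table1
    WHERE concat(year, month, day) = '20210101'
""").show(truncate=False)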

How to repartition into fixed number of partition per column in Spark?

I need to read data from one Hive table and insert it into another Hive table. The schema of both tables is the same. The table is partitioned by date & country. The size of each partition is ~500 MB. I want to insert this data into a new table where the files inside each partition are roughly 128 MB (i.e. 4 files).
Step 1: Read data from the source table in Spark.
Step 2: Repartition by columns (country, date) with the number of partitions set to 4.
df.repartition(4, col("country_code"), col("record_date"))
I am getting only 1 partition per country_code & record_date.
Whatever you are doing in step 2 will repartition your data into 4 partitions in memory, but it won't save 4 files when you do df.write.
In order to do that you can use the code below:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

df.repartition(4, col("country_code"), col("record_date"))
  .write
  .partitionBy("country_code", "record_date") // partitionBy takes column names, not Column objects
  .mode(SaveMode.Append)
  .saveAsTable("TableName")
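As a quick check of the distinction drawn above, here is a sketch in PySpark (hypothetical source and target table names): repartition() only controls the number of in-memory partitions, while the directory layout on disk comes from partitionBy() on the writer.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("source_table")  # hypothetical source table

repartitioned = df.repartition(4, col("country_code"), col("record_date"))
print(repartitioned.rdd.getNumPartitions())  # 4 partitions in memory

# country_code=/record_date= directories are created by partitionBy at write time
repartitioned.write \
    .partitionBy("country_code", "record_date") \
    .mode("append") \
    .saveAsTable("TableName")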

Cassandra (DSE) - Need suggestion on using PER PARTITION LIMIT on huge data

I have a table with around 4M partitions and each partition contains 4 rows, so the table holds about 16M rows in total (wide columns). Since our table is a time-series database, we only need the latest row or version for each partition key. I can achieve my desired result with the query below, but it puts load on the cluster and is time-consuming. I would like to know whether there is a better way to achieve this, or whether this is the only way.
SELECT some_value FROM some_table PER PARTITION LIMIT 1;
Using PER PARTITION LIMIT won't have an adverse impact on performance. In fact, it's efficient for achieving what you need from each partition, since only the first row is returned and Cassandra doesn't have to iterate over the other rows in the partition. Cheers!
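For reference, a minimal sketch of running that query from the Python driver (hypothetical contact point, keyspace and fetch size); note that PER PARTITION LIMIT 1 returns the first row of each partition, which is the latest one only if the clustering order puts the newest row first.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # hypothetical contact point
session = cluster.connect("my_keyspace")  # hypothetical keyspace

# Paging keeps the scan over ~4M partitions in manageable chunks.
stmt = SimpleStatement(
    "SELECT some_value FROM some_table PER PARTITION LIMIT 1",
    fetch_size=1000,
)
for row in session.execute(stmt):
    print(row.some_value)

cluster.shutdown()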

Cassandra - Composite Partition Keys and Performance

I am working on the keyspace and tables for a Cassandra environment. I understand Cassandra's size limitations and how to deal with partition keys to keep things optimized. However, I am having a disagreement with a developer about how to handle the keys. Is there any downside to having a key that groups a large amount of data per partition rather than a small amount? For example:
I have 100k records. I can create a key that will partition this into partitions of 10k records; I could also create a key that will partition this into partitions of 10 records (by day). So I either store 10k records in each of 10 partitions, or 10 records in each of 10,000 partitions.
Keep in mind that having more columns in the key requires you to specify those columns in your select statements, which sometimes isn't desired. The more partitions the better - whether by picking a better single column or having multiple columns.
Cassandra reads data via the partition key, and clustering columns can further help read performance. If you have a large partition, the entire partition must be read (memory and disk) and then merged for the output, so large partitions will definitely slow you down.
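To make the trade-off concrete, here is a sketch of the two designs (hypothetical table and column names, issued through the Python driver): the more columns in the partition key, the smaller each partition stays, but every read then has to supply all of those columns.
import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical

# Option A: coarse partition key -> few, large partitions (~10k rows each)
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_source (
        source_id text, event_time timestamp, payload text,
        PRIMARY KEY ((source_id), event_time))
""")

# Option B: day included in the partition key -> many small partitions (~10 rows each)
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_source_day (
        source_id text, event_day date, event_time timestamp, payload text,
        PRIMARY KEY ((source_id, event_day), event_time))
""")

# With option B, every SELECT must supply both partition key columns.
rows = session.execute(
    "SELECT payload FROM events_by_source_day WHERE source_id = %s AND event_day = %s",
    ("sensor-1", datetime.date(2021, 1, 1)))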

Cassandra failure during read query

I have a Cassandra table with ~500 columns and primary key ((userId, version, shredId), rowId), where shredId is used to distribute data evenly across partitions. The table also has a default TTL of 2 days to expire data, as the data is used for real-time aggregation. The compaction strategy is TimeWindowCompactionStrategy.
The workflow is:
write data to input table (with consistency EACH_QUORUM)
Run Spark aggregation (on rows with the same userId and version)
write aggregated data to output table.
But I'm getting "Cassandra failure during read query" when the size of the data gets large; more specifically, once there are more than 210 rows in one partition, read queries fail.
How can I tune my database and change properties to fix this?
After investigation and research, the issue turned out to be caused by null values being inserted for some empty columns. This creates a large number of tombstones and eventually times out the query.
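One way to avoid those tombstones on the write path, sketched with the Python driver (hypothetical table and column names): bind UNSET_VALUE instead of None for columns that have no data, so the insert skips them instead of writing an explicit null. If the writes go through the Spark Cassandra Connector, it has an analogous option for ignoring nulls on write.
from cassandra.cluster import Cluster
from cassandra.query import UNSET_VALUE

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical

insert = session.prepare(
    "INSERT INTO input_table (userId, version, shredId, rowId, col_a, col_b) "
    "VALUES (?, ?, ?, ?, ?, ?)")

# None would write an explicit null (a tombstone); UNSET_VALUE leaves the column untouched.
def bind_or_unset(value):
    return UNSET_VALUE if value is None else value

record = {"col_a": "some value", "col_b": None}   # col_b is empty for this row
session.execute(insert, ("user-1", 3, 7, 42,
                         bind_or_unset(record["col_a"]),
                         bind_or_unset(record["col_b"])))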
