Getting total number of key-value pairs in RocksDB

Is it possible to efficiently get the number of key-value pairs stored in a RocksDB key-value store?
I have looked through the wiki, and haven't seen anything discussing this topic thus far. Is such an operation even possible?

In code, you could use db->GetProperty("rocksdb.estimate-num-keys", &num) to obtain the estimated number of keys stored in a RocksDB database.
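For reference, here is a minimal sketch of the same lookup from Python. It assumes the third-party python-rocksdb binding and a hypothetical database path; the property name is the same string passed to GetProperty in the C++ call above.
import rocksdb  # third-party python-rocksdb binding (assumed to be installed)

# Open (or create) a database at a hypothetical path and ask for the
# "rocksdb.estimate-num-keys" property; the value comes back as digits.
db = rocksdb.DB("/tmp/mydb", rocksdb.Options(create_if_missing=True))
num_keys = int(db.get_property(b"rocksdb.estimate-num-keys"))
print("estimated number of keys:", num_keys)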
Another option is to use the sst_dump tool with the --show_properties argument to get the number of entries, although the result is reported on a per-file basis. For example, the following command will show the properties of each SST file under the specified RocksDB directory:
sst_dump --file=/tmp/rocksdbtest-691931916/dbbench --show_properties --command=none
And here's the sample output:
Process /tmp/rocksdbtest-691931916/dbbench/000005.sst
Sst file format: block-based
Table Properties:
------------------------------
# data blocks: 845
# entries: 27857
raw key size: 668568
raw average key size: 24.000000
raw value size: 2785700
raw average value size: 100.000000
data block size: 3381885
index block size: 28473
filter block size: 0
(estimated) table size: 3410358
filter policy name: N/A
# deleted keys: 0
Process /tmp/rocksdbtest-691931916/dbbench/000008.sst
Sst file format: block-based
Table Properties:
------------------------------
# data blocks: 845
# entries: 27880
raw key size: 669120
...
Combined with some shell commands, you can get the total number of entries:
sst_dump --file=/tmp/rocksdbtest-691931916/dbbench --show_properties --command=none | grep entries | cut -c 14- | awk '{x+=$0}END{print "total number of entries: " x}'
And this will generate the following output:
total number of entries: 111507

There is no way to get an exact count. However, RocksDB 3.4, which was released recently, exposes a way to get an estimated key count; you can try it.
https://github.com/facebook/rocksdb/releases

Related

If you run a scan on DynamoDB with an AttributesToGet argument are you charged for the data footprint of every item or just the requested attributes?

Suppose you run the following code on a table with 1,000 items that are each 400 KB in size, and suppose that the attribute name 'column1' plus its actual data come to 10 bytes:
import boto3

def get_column_1_items():
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('DynamoTable')
    resp = table.scan(AttributesToGet=['column1'])
    return resp['Items']
Will you be charged for retrieving 1,000 * 400 KB = 400 MB of data, or for retrieving 1,000 * 10 B = 10 KB, by running this query?
Based on the doc,
Note that AttributesToGet has no effect on provisioned throughput consumption. DynamoDB determines capacity units consumed based on item size, not on the amount of data that is returned to an application.
You will be charged for retrieving 400 MB of data.
Also be aware that a single Scan request can retrieve a maximum of 1 MB of data. So in order to retrieve 400 MB of data, you need multiple requests.
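To make the "multiple requests" point concrete, here is a hedged sketch of a paginated scan (using the same hypothetical table and attribute as in the question); it keeps calling Scan with the returned LastEvaluatedKey until the whole table has been read:
import boto3

def get_column_1_items():
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('DynamoTable')
    items = []
    resp = table.scan(AttributesToGet=['column1'])
    items.extend(resp['Items'])
    # Each response holds at most 1 MB of data; keep paginating until
    # DynamoDB stops returning a LastEvaluatedKey. The capacity consumed
    # is still based on full item size, not on the projected bytes.
    while 'LastEvaluatedKey' in resp:
        resp = table.scan(AttributesToGet=['column1'],
                          ExclusiveStartKey=resp['LastEvaluatedKey'])
        items.extend(resp['Items'])
    return items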

Percona Toolkit query digest not reading all queries in slow query log

I have a collection of slow query logs from RDS that I have put together into a single file. I am trying to run it through pt-query-digest, following the instructions here, but it reads the whole file as a single query.
Command:
pt-query-digest --group-by fingerprint --order-by Query_time:sum collider-slow-query.log > slow-query-analyze.txt
Output, showing that it only analyzed one query:
# Overall: 1 total, 1 unique, 0 QPS, 0x concurrency ______________________
Here's just a short snippet containing 2 queries from the file being analyzed to demonstrate there are many queries:
2019-05-03T20:44:21.828Z # Time: 2019-05-03T20:44:21.828954Z
# User#Host: username[username] # [ipaddress] Id: 19
# Query_time: 17.443164 Lock_time: 0.000145 Rows_sent: 5 Rows_examined: 121380
SET timestamp=1556916261;
SELECT wp_posts.ID FROM wp_posts LEFT JOIN wp_term_relationships ON (wp_posts.ID = wp_term_relationships.object_id) WHERE 1=1 AND wp_posts.ID NOT IN (752921) AND (
wp_term_relationships.term_taxonomy_id IN (40)
) AND wp_posts.post_type = 'post' AND ((wp_posts.post_status = 'publish')) GROUP BY wp_posts.ID ORDER BY wp_posts.post_date DESC LIMIT 0, 5;
2019-05-03T20:44:53.597Z # Time: 2019-05-03T20:44:53.597137Z
# User#Host: username[username] # [ipaddress] Id: 77
# Query_time: 35.757909 Lock_time: 0.000054 Rows_sent: 2 Rows_examined: 199008
SET timestamp=1556916293;
SELECT post_id, meta_value FROM wp_postmeta
WHERE meta_key = '_wp_attached_file'
AND meta_value IN ( 'family-guy-vestigial-peter-slice.jpg','2015/08/bobs-burgers-image.jpg','2015/08/bobs-burgers-image.jpg' );
Why isn't it reading all the queries? Is there a problem with my concatenation?
I had the same problem. It turns out that an additional timestamp is written in front of each query (controlled by this variable: https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_log_timestamps).
This prevents pt-query-digest from recognizing the individual queries. It is easily fixed by either turning off the timestamp variable or removing the leading timestamp:
2019-10-28T11:17:18.412 # Time: 2019-10-28T11:17:18.412214
# User#Host: foo[foo] # [192.168.8.175] Id: 467836
# Query_time: 5.839596 Lock_time: 0.000029 Rows_sent: 1 Rows_examined: 0
use foo;
SET timestamp=1572261432;
SELECT COUNT(*) AS `count` FROM `foo`.`invoices` AS `Invoice` WHERE 1 = 1;
By removing the first timestamp (the 2019-10-28T11:17:18.412 part), it works again.
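If you would rather clean up the files you already have than change the server variable, a small filter can strip the duplicated leading timestamp before the log is fed to pt-query-digest. This is only a sketch, assuming the extra timestamp always sits directly in front of a "# Time:" marker as in the samples above:
import re
import sys

# Remove the leading ISO timestamp (e.g. "2019-05-03T20:44:21.828Z ") that is
# duplicated in front of "# Time:" lines, so pt-query-digest can split events.
pattern = re.compile(r'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z?\s+(?=# Time:)')

for line in sys.stdin:
    sys.stdout.write(pattern.sub('', line))
Pipe the concatenated log through this filter and feed the output to pt-query-digest as before.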

Pandas - Adding dataframe with same name using to_hdf doubled file size

I am a newbie with the Pandas module. I created a dataframe and saved it under the name "dirtree" using to_hdf:
df.to_hdf("d:/datatree full.h5", "dirtree")
I repeated the actions above. After that, when I checked the file size, it had doubled. I guess my second dataframe was appended to the old one, but checking the dataframes in the store and counting rows shows there are no extra dataframes or rows. How can this be?
My codes to check store:
store = pd.HDFStore('d:/datatree.h5')
print(store)
df = pd.read_hdf('d:/datatree.h5', 'dirtree')
df.text.count() # text is one of the columns in df
I could reproduce this issue in the following way:
Original sample DF:
In [147]: df
Out[147]:
a b c
0 0.163757 -1.727003 0.641793
1 1.084989 -0.958833 0.552059
2 -0.419273 -1.037440 0.544212
3 -0.197904 -1.106120 -1.117606
4 0.891187 1.094537 100.000000
let's save it to HDFStore:
In [149]: df.to_hdf('c:/temp/test_dup.h5', 'x')
file size: 6992 bytes
let's do it one more time:
In [149]: df.to_hdf('c:/temp/test_dup.h5', 'x')
file size: 6992 bytes NOTE: it did NOT change
now let's open HDFStore:
In [150]: store = pd.HDFStore('c:/temp/test_dup.h5')
In [151]: store
Out[151]:
<class 'pandas.io.pytables.HDFStore'>
File path: c:/temp/test_dup.h5
/x frame (shape->[5,3])
file size: 6992 bytes NOTE: it did NOT change
let's save DF to HDFStore one more time, but notice that the store is open:
In [156]: df.to_hdf('c:/temp/test_dup.h5', 'x')
In [157]: store.close()
file size: 12696 bytes # BOOM !!!
Root cause:
when we do store = pd.HDFStore('c:/temp/test_dup.h5'), the file is opened with the default mode 'a' (append), so the store is ready to modify the file. When you then write to the same file without going through this store object, a copy is made in order to protect the open store...
How to avoid it:
use mode='r' when you open a store:
In [158]: df.to_hdf('c:/temp/test_dup2.h5', 'x')
In [159]: store2 = pd.HDFStore('c:/temp/test_dup2.h5', mode='r')
In [160]: df.to_hdf('c:/temp/test_dup2.h5', 'x')
...
skipped
...
ValueError: The file 'c:/temp/test_dup2.h5' is already opened, but in read-only mode. Please close it before reopening in append mode.
Or, a better way to manage your HDF files is to work with the store object directly:
store = pd.HDFStore(filename)  # opened in append mode ('a') by default
store.append('key_name', df, data_columns=True)  # append() writes in the 'table' format by default
...
store.close() # don't forget to flush changes to disk !!!
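For completeness, here is a small sketch of that pattern with the file path and key name from the question (the sample dataframe is made up), using the store as a context manager so it is always closed:
import pandas as pd

df = pd.DataFrame({'text': ['a', 'b', 'c']})  # stand-in for the real dataframe

# Route every read and write through one HDFStore handle so nothing else
# reopens the file behind its back; the context manager closes it for us.
with pd.HDFStore('d:/datatree.h5') as store:
    store.put('dirtree', df, format='table', data_columns=True)  # overwrites the existing key
    print(store.get_storer('dirtree').nrows)  # row count without loading the whole frame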

Graphite storage schema not working

I have configured the following storage schema in Graphite's /etc/carbon/storage-schemas.conf file with the assumption that it would allow me to keep data at 60s precision for 365 days. However, when I fetch the data back using whisper-fetch, I get 60s precision for only one week of data. Any idea whether I need to set this up in another file, or am I missing something?
Storage schema
[collectd]
retentions = 60s:365d
Whisper info
whisper-info memory-buffered.wsp
maxRetention: 31536000
xFilesFactor: 0.5
aggregationMethod: average
fileSize: 855412
Archive 0
retention: 86400
secondsPerPoint: 10
points: 8640
size: 103680
offset: 52
Archive 1
retention: 604800
secondsPerPoint: 60
points: 10080
size: 120960
offset: 103732
Archive 2
retention: 31536000
secondsPerPoint: 600
points: 52560
size: 630720
offset: 224692
Your whisper-info.py output shows that the file is not using the schema you configured. The most likely explanation is that the Whisper file was created before you changed the schema; storage-schemas.conf is only applied when a new file is created. In that case you need to either delete the file (and let it be created again with the new schema) or use whisper-resize.py to apply the new retention to the existing file.
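For example, something along these lines (assuming the whisper-resize.py utility that ships with Whisper and the file shown above) would rewrite the existing file with the intended retention:
whisper-resize.py memory-buffered.wsp 60s:365d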

CouchDB sorting - Collation Specification

Using a CouchDB view it seems my keys aren't sorted as per the collation specification.
rows:
[0] key: ["bylatest", -1294536544000] value: 1
[1] key: ["bylatest", -1298817134000] value: 1
[2] key: ["bylatest", -1294505612000] value: 1
I would have expected the second entry to come after the third.
Why is this happening?
I get this result for a view emitting the values you indicate.
{"total_rows":3,"offset":0,"rows":[
{"id":"29e86c6bf38b9068c56ab1cd0100101f","key":["bylatest",-1298817134000],"value":1},
{"id":"29e86c6bf38b9068c56ab1cd0100101f","key":["bylatest",-1294536544000],"value":1},
{"id":"29e86c6bf38b9068c56ab1cd0100101f","key":["bylatest",-1294505612000],"value":1}
]}
The rows differ from both of your examples. They fit the collation specification: numbers sort numerically, so the order begins with the smallest value (the largest-magnitude negative number) and ends with the greatest value (the smallest-magnitude negative number in this case).
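Just to illustrate the numeric part of the collation, sorting those three timestamps as plain numbers gives the same order CouchDB returns (a trivial sketch, nothing CouchDB-specific):
# Numbers collate numerically, so the most negative timestamp sorts first.
keys = [-1294536544000, -1298817134000, -1294505612000]
print(sorted(keys))  # [-1298817134000, -1294536544000, -1294505612000]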
Would you perhaps include the documents and map/reduce functions you are using?
