I have configured the following storage schema in Graphite /etc/carbon/storage-schemas.conf file with the assumption that it would allow me to keep data with 60s precision during 356 days. Although when I convert data back using Whisper-Fetch, I get 60s precision for only one week of data. Any idea if I need to set this up in another file or am I missing something?
Storage schema
[collectd]
retentions = 60s:365d
Whisper info
whisper-info memory-buffered.wsp
maxRetention: 31536000
xFilesFactor: 0.5
aggregationMethod: average
fileSize: 855412
Archive 0
retention: 86400
secondsPerPoint: 10
points: 8640
size: 103680
offset: 52
Archive 1
retention: 604800
secondsPerPoint: 60
points: 10080
size: 120960
offset: 103732
Archive 2
retention: 31536000
secondsPerPoint: 600
points: 52560
size: 630720
offset: 224692
Your whisper-info.py output shows that it's not using the schema you claim. The most likely answer is that the Whisper file was created before changing the schema. In this case you need to either delete the file (and let it get created again) or use whisper-resize.py to apply the new schema.
Related
I want to insert data from S3 parquet files to Redshift.
Files in parquet comes from a process that reads JSON files, flatten them out, and store as parquet. To do it we use pandas dataframes.
To do so, I tried two different things. The first one:
COPY schema.table
FROM 's3://parquet/provider/A/2020/11/10/11/'
IAM_ROLE 'arn:aws:iam::XXXX'
FORMAT AS PARQUET;
It returned:
Invalid operation: Spectrum Scan Error
error: Spectrum Scan Error
code: 15001
context: Unmatched number of columns between table and file. Table columns: 54, Data columns: 41
I understand the error but I don't have an easy option to fix it.
If we have to do a reload from 2 months ago the file will only have for example 40 columns, because on that given data we needed just this data but table already increased to 50 columns.
So we need something automatically, or that we can specify the columns at least.
Then I applied another option which is to do a SELECT with AWS Redshift Spectrum. We know how many columns the table have using system tables, and we now the structure of the file loading again to a Pandas dataframe. Then I can combine both to have the same identical structure and do the insert.
It works fine but it is slow.
The select looks like:
SELECT fields
FROM schema.table
WHERE partition_0 = 'A'
AND partition_1 = '2020'
AND partition_2 = '11'
AND partition_3 = '10'
AND partition_4 = '11';
The partitions are already added as I checked using:
select *
from SVV_EXTERNAL_PARTITIONS
where tablename = 'table'
and schemaname = 'schema'
and values = '["A","2020","11","10","11"]'
limit 1;
I have around 170 files per hour, both in json and parquet file. The process list all files in S3 json path, and process them and store in S3 parquet path.
I don't know how to improve execution time, as the INSERT from parquet takes 2 minutes per each partition_0 value. I tried the select alone to ensure its not an INSERT issue, and it takes 1:50 minutes. So the issue is to read data from S3.
If I try to select a non existent value for partition_0 it takes again around 2 minutes, so there is some kind of problem to access data. I don't know if partition_0 naming and others are considered as Hive partitioning format.
Edit:
AWS Glue Crawler table specification
Edit: Add SVL_S3QUERY_SUMMARY results
step:1
starttime: 2020-12-13 07:13:16.267437
endtime: 2020-12-13 07:13:19.644975
elapsed: 3377538
aborted: 0
external_table_name: S3 Scan schema_table
file_format: Parquet
is_partitioned: t
is_rrscan: f
is_nested: f
s3_scanned_rows: 1132
s3_scanned_bytes: 4131968
s3query_returned_rows: 1132
s3query_returned_bytes: 346923
files: 169
files_max: 34
files_avg: 28
splits: 169
splits_max: 34
splits_avg: 28
total_split_size: 3181587
max_split_size: 30811
avg_split_size: 18825
total_retries:0
max_retries:0
max_request_duration: 360496
avg_request_duration: 172371
max_request_parallelism: 10
avg_request_parallelism: 8.4
total_slowdown_count: 0
max_slowdown_count: 0
Add query checks
Query: 37005074 (SELECT in localhost using pycharm)
Query: 37005081 (INSERT in AIRFLOW AWS ECS service)
STL_QUERY Shows that both queries takes around 2 min
select * from STL_QUERY where query=37005081 OR query=37005074 order by query asc;
Query: 37005074 2020-12-14 07:44:57.164336,2020-12-14 07:46:36.094645,0,0,24
Query: 37005081 2020-12-14 07:45:04.551428,2020-12-14 07:46:44.834257,0,0,3
STL_WLM_QUERY Shows that no queue time, all in exec time
select * from STL_WLM_QUERY where query=37005081 OR query=37005074;
Query: 37005074 Queue time 0 Exec time: 98924036 est_peak_mem:0
Query: 37005081 Queue time 0 Exec time: 100279214 est_peak_mem:2097152
SVL_S3QUERY_SUMMARY Shows that query takes 3-4 seconds in s3
select * from SVL_S3QUERY_SUMMARY where query=37005081 OR query=37005074 order by endtime desc;
Query: 37005074 2020-12-14 07:46:33.179352,2020-12-14 07:46:36.091295
Query: 37005081 2020-12-14 07:46:41.869487,2020-12-14 07:46:44.807106
stl_return Comparing min start for to max end for each query. 3-4 seconds as says SVL_S3QUERY_SUMMARY
select * from stl_return where query=37005081 OR query=37005074 order by query asc;
Query:37005074 2020-12-14 07:46:33.175320 2020-12-14 07:46:36.091295
Query:37005081 2020-12-14 07:46:44.817680 2020-12-14 07:46:44.832649
I dont understand why SVL_S3QUERY_SUMMARY shows just 3-4 seconds to run query in spectrum, but then STL_WLM_QUERY says the excution time is around 2 minutes as i see in my localhost and production environtments... Neither how to improve it, because stl_return shows that query returns few data.
EXPLAIN
XN Partition Loop (cost=0.00..400000022.50 rows=10000000000 width=19608)
-> XN Seq Scan PartitionInfo of parquet.table (cost=0.00..22.50 rows=1 width=0)
Filter: (((partition_0)::text = 'A'::text) AND ((partition_1)::text = '2020'::text) AND ((partition_2)::text = '12'::text) AND ((partition_3)::text = '10'::text) AND ((partition_4)::text = '12'::text))
-> XN S3 Query Scan parquet (cost=0.00..200000000.00 rows=10000000000 width=19608)
" -> S3 Seq Scan parquet.table location:""s3://parquet"" format:PARQUET (cost=0.00..100000000.00 rows=10000000000 width=19608)"
svl_query_report
select * from svl_query_report where query=37005074 order by segment, step, elapsed_time, rows;
Just like in your other question you need to change your keypaths on your objects. It is not enough to just have "A" in the keypath - it needs to be "partition_0=A". This is how Spectrum knows that the object is or isn't in the partition.
Also you need to make sure that your objects are of reasonable size or it will be slow if you need to scan many of them. It takes time to open each object and if you have many small objects the time to open them can be longer than the time to scan them. This is only an issue if you need to scan many many files.
Suppose you run the following code on a table with 1,000 items that are 400KB in size, and suppose that the attribute name for 'column1' + the actual data are 10 bytes:
import boto3
def get_column_1_items():
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('DynamoTable')
resp = table.scan(AttributesToGet=['column1'])
return resp['Items']
Will you be charged for retrieving 1000 * 400 KB = 400 MB of data retrieval, or for retrieving 1,000 * 10B = 10KB by running this query?
Based on the doc,
Note that AttributesToGet has no effect on provisioned throughput consumption. DynamoDB determines capacity units consumed based on item size, not on the amount of data that is returned to an application.
You will be charged for retrieving 400 MB of data.
Also be aware that a single Scan request can retrieve a maximum of 1 MB of data. So in order to retrieve 400 MB of data, you need multiple requests.
Hi tensorflow experts,
I see the following training speed profile using dataset API and prefetching of 128, 256, 512, or 1024 batches (each of 128 examples):
INFO:tensorflow:Saving checkpoints for 0 into
INFO:tensorflow:loss = 0.969178, step = 0
INFO:tensorflow:global_step/sec: 70.3812
INFO:tensorflow:loss = 0.65544295, step = 100 (1.422 sec)
INFO:tensorflow:global_step/sec: 178.33
INFO:tensorflow:loss = 0.47716027, step = 200 (0.560 sec)
INFO:tensorflow:global_step/sec: 178.626
INFO:tensorflow:loss = 0.53073615, step = 300 (0.560 sec)
INFO:tensorflow:global_step/sec: 132.039
INFO:tensorflow:loss = 0.4849593, step = 400 (0.757 sec)
INFO:tensorflow:global_step/sec: 121.437
INFO:tensorflow:loss = 0.4055175, step = 500 (0.825 sec)
INFO:tensorflow:global_step/sec: 122.379
INFO:tensorflow:loss = 0.28230205, step = 600 (0.817 sec)
INFO:tensorflow:global_step/sec: 122.163
INFO:tensorflow:loss = 0.4917924, step = 700 (0.819 sec)
INFO:tensorflow:global_step/sec: 122.509
The initial spike of 178 steps per second is reproducible across multiple runs and different prefetching amount. I am trying to understanding the underlying multi-threading mechanism on why that happens.
Additional information:
my cpu usage peaks at 1800% on a 48 core machine. My gpu usage is consistently at only 9%. So it's pretty amazing that both of these are not exhausted. So I am wondering if the mutex in queue_runner is causing the cpu processing to not realize its full potential, as described here?
Thanks,
John
[update] I also observed the same spike when I use prefetch_to_device(gpu_device, ..), with similar buffer sizes. Surprisingly, prefetch_to_device only slows things down, by about 10%.
NFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into
INFO:tensorflow:loss = 1.3881096, step = 0
INFO:tensorflow:global_step/sec: 52.3374
INFO:tensorflow:loss = 0.48779136, step = 100 (1.910 sec)
INFO:tensorflow:global_step/sec: 121.154
INFO:tensorflow:loss = 0.3451385, step = 200 (0.827 sec)
INFO:tensorflow:global_step/sec: 89.3222
INFO:tensorflow:loss = 0.37804496, step = 300 (1.119 sec)
INFO:tensorflow:global_step/sec: 80.4857
INFO:tensorflow:loss = 0.49938473, step = 400 (1.242 sec)
INFO:tensorflow:global_step/sec: 79.1798
INFO:tensorflow:loss = 0.5120025, step = 500 (1.263 sec)
INFO:tensorflow:global_step/sec: 81.2081
It's common to see spikes in steps per second at the start of each training run, as the cpu had time to fill up the buffer. Your step per seconds are very reasonable compared to the start, but the lack of cpu usage might indicate a bottleneck.
First question is, whether or not you are using the Dataset API in combination with the estimator. From your terminal output I suspect you do, if not I would start by changing your code to use the Estimator class. If you are already using the Estimator class, then make sure you are following the best performance practices as documented here.
If your are doing all of the above already, then there is a bottleneck in you pipeline. Due to the low CPU usage I would guess you are experiencing an I/O bottleneck. You might have your Dataset on a slow medium (hard-drive) or you aren't using a serialized format and are saturating the IOPS (again hard-drive or network storage). In either case, start by using a serialized data format such as TF-records and upgrade your storage to SSD or multiple hard drives in raid 1,0,10 your pick.
I'm trying to import historical data for 60 day per hour, but data succsessfully importing only for last 24 hours, configuration bellow:
Storage schema in Graphite /etc/carbon/storage-schemas.conf
[default]
pattern = .*
retentions = 5m:15d,15m:1y,1h:10y,1d:100y
Storage aggregation /etc/carbon/storage-aggregation.conf
[all_sum]
pattern = .*
xFilesFactor = 0.0
aggregationMethod = sum
Restarting carbon-cache and removing old whisper data is not solving problem.
I checked .wsp files with wisper-info.py:
# whisper-info /var/lib/graphite/whisper/ran/3g/newerlang.wsp
maxRetention: 3153600000
xFilesFactor: 0.0
aggregationMethod: sum
fileSize: 1961584
Archive 0
retention: 1296000
secondsPerPoint: 300
points: 4320
size: 51840
offset: 64
Archive 1
retention: 31536000
secondsPerPoint: 900
points: 35040
size: 420480
offset: 51904
Archive 2
retention: 315360000
secondsPerPoint: 3600
points: 87600
size: 1051200
offset: 472384
Archive 3
retention: 3153600000
secondsPerPoint: 86400
points: 36500
size: 438000
offset: 1523584
Any idea if I need to set this up in another file or am I missing something?
Is it possible to efficiently get the number of key-value pairs stored in a RocksDB key-value store?
I have looked through the wiki, and haven't seen anything discussing this topic thus far. Is such an operation even possible?
Codewisely, you could use db->GetProperty("rocksdb.estimate-num-keys", &num) to obtain the estimated number of keys stored in a rocksdb.
Another option is to use the sst_dump tool with --show_properties argument to get the number of entries, although the result would be per file basis. For example, the following command will show the properties of each SST file under the specified rocksdb directory:
sst_dump --file=/tmp/rocksdbtest-691931916/dbbench --show_properties --command=none
And here's the sample output:
Process /tmp/rocksdbtest-691931916/dbbench/000005.sst
Sst file format: block-based
Table Properties:
------------------------------
# data blocks: 845
# entries: 27857
raw key size: 668568
raw average key size: 24.000000
raw value size: 2785700
raw average value size: 100.000000
data block size: 3381885
index block size: 28473
filter block size: 0
(estimated) table size: 3410358
filter policy name: N/A
# deleted keys: 0
Process /tmp/rocksdbtest-691931916/dbbench/000008.sst
Sst file format: block-based
Table Properties:
------------------------------
# data blocks: 845
# entries: 27880
raw key size: 669120
...
Combine with some shell commands, you will be able to get the total number of entries:
sst_dump --file=/tmp/rocksdbtest-691931916/dbbench --show_properties --command=none | grep entries | cut -c 14- | awk '{x+=$0}END{print "total number of entries: " x}'
And this will generate the following output:
total number of entries: 111507
There is no way to get the count exactly. But in rocksdb 3.4 which released recently, it expose an way to get an estimate count for keys, you can try it.
https://github.com/facebook/rocksdb/releases