Graphite importing historical data only for 1 day - linux

I'm trying to import 60 days of hourly historical data, but only the last 24 hours are imported successfully. Configuration below:
Storage schema in Graphite /etc/carbon/storage-schemas.conf
[default]
pattern = .*
retentions = 5m:15d,15m:1y,1h:10y,1d:100y
Storage aggregation /etc/carbon/storage-aggregation.conf
[all_sum]
pattern = .*
xFilesFactor = 0.0
aggregationMethod = sum
Restarting carbon-cache and removing the old Whisper data does not solve the problem.
I checked the .wsp files with whisper-info.py:
# whisper-info /var/lib/graphite/whisper/ran/3g/newerlang.wsp
maxRetention: 3153600000
xFilesFactor: 0.0
aggregationMethod: sum
fileSize: 1961584
Archive 0
retention: 1296000
secondsPerPoint: 300
points: 4320
size: 51840
offset: 64
Archive 1
retention: 31536000
secondsPerPoint: 900
points: 35040
size: 420480
offset: 51904
Archive 2
retention: 315360000
secondsPerPoint: 3600
points: 87600
size: 1051200
offset: 472384
Archive 3
retention: 3153600000
secondsPerPoint: 86400
points: 36500
size: 438000
offset: 1523584
Any idea if I need to set this up in another file or am I missing something?
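One way to sanity-check the file against the schema is to recompute the archive layout by hand. The sketch below is only an illustration in Python, not part of Whisper itself: the helper names are made up, it assumes every retention token carries a unit suffix, and it assumes an on-disk layout of a 16-byte metadata header, 12 bytes of archive info per archive, and 12 bytes per point. Under those assumptions it reproduces the points, size, and offset values printed by whisper-info above, which suggests the file was created with the configured retentions.

```python
# Hypothetical helper names; unit lengths follow the retention syntax used above.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

METADATA_SIZE = 16      # assumed size of the file-level header
ARCHIVE_INFO_SIZE = 12  # assumed size of each archive's header entry
POINT_SIZE = 12         # assumed: 4-byte timestamp plus 8-byte double


def to_seconds(token):
    """Convert a token such as '5m' or '15d' to seconds."""
    return int(token[:-1]) * UNIT_SECONDS[token[-1]]


def archive_layout(retentions):
    """Return one dict per archive with the fields whisper-info prints."""
    pairs = [part.strip().split(":") for part in retentions.split(",")]
    offset = METADATA_SIZE + ARCHIVE_INFO_SIZE * len(pairs)
    layout = []
    for precision, duration in pairs:
        seconds_per_point = to_seconds(precision)
        retention = to_seconds(duration)
        points = retention // seconds_per_point
        size = points * POINT_SIZE
        layout.append({
            "retention": retention,
            "secondsPerPoint": seconds_per_point,
            "points": points,
            "size": size,
            "offset": offset,
        })
        offset += size  # the next archive starts where this one ends
    return layout


for index, archive in enumerate(archive_layout("5m:15d,15m:1y,1h:10y,1d:100y")):
    print("Archive", index, archive)
```

If the computed values match the whisper-info output, the schema is being applied to new files, so the cause of the missing data probably lies elsewhere.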

Spark SQL launches unequal number of jobs for identical queries

I have two tables finance.movies and finance.dummytable_3.
Both were created using Spark SQL and their metadata is the same.
Both have two files.
The total file size of finance.movies is 312363, while that of finance.dummytable_3 is 1209.
But when I execute a simple "select *" on both tables, finance.dummytable_3, whose total file size is smaller, executes two jobs, while finance.movies executes only one.
hive> describe formatted finance.movies;
OK
# col_name data_type comment
movieid int
title_movie string
genres string
# Detailed Table Information
Database: finance
Owner: root
CreateTime: Thu Apr 16 02:18:51 EDT 2020
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://test.com:8020/GenNext_Finance_DL/Finance.db/movies
Table Type: MANAGED_TABLE
Table Parameters:
numFiles 2
spark.sql.create.version 2.3.0.2.6.5.0-292
spark.sql.sources.provider parquet
spark.sql.sources.schema.numParts 1
spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[{\"name\":\"movieid\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"title_movie\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"genres\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}
totalSize 312363
transient_lastDdlTime 1587017931
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
path hdfs://test.com:8020/GenNext_Finance_DL/Finance.db/movies
serialization.format 1
Time taken: 3.877 seconds, Fetched: 35 row(s)
hive> describe formatted finance.dummytable_3;
OK
# col_name data_type comment
colname1 string
colname2 string
colname3 string
# Detailed Table Information
Database: finance
Owner: root
CreateTime: Wed Apr 15 02:40:18 EDT 2020
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://test.com:8020/GenNext_Finance_DL/Finance.db/dummytable_3
Table Type: MANAGED_TABLE
Table Parameters:
numFiles 2
spark.sql.create.version 2.3.0.2.6.5.0-292
spark.sql.sources.provider parquet
spark.sql.sources.schema.numParts 1
spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[{\"name\":\"colName1\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"colName2\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"colName3\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}
totalSize 1209
transient_lastDdlTime 1586932818
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
path hdfs://test.com:8020/GenNext_Finance_DL/Finance.db/dummytable_3
serialization.format 1
Time taken: 0.373 seconds, Fetched: 35 row(s)
When I execute the following query, two jobs are executed:
spark-sql --master yarn --name spark_load_test -e "select * from finance.dummytable_3 limit 2;"
When I execute the following query, however, only one job is executed, even though both tables have two files and an identical structure:
spark-sql --master yarn --name spark_load_test -e "select * from finance.movies limit 2;"

Why do I see a spike of steps per second in tensorflow training initially?

Hi tensorflow experts,
I see the following training speed profile using dataset API and prefetching of 128, 256, 512, or 1024 batches (each of 128 examples):
INFO:tensorflow:Saving checkpoints for 0 into
INFO:tensorflow:loss = 0.969178, step = 0
INFO:tensorflow:global_step/sec: 70.3812
INFO:tensorflow:loss = 0.65544295, step = 100 (1.422 sec)
INFO:tensorflow:global_step/sec: 178.33
INFO:tensorflow:loss = 0.47716027, step = 200 (0.560 sec)
INFO:tensorflow:global_step/sec: 178.626
INFO:tensorflow:loss = 0.53073615, step = 300 (0.560 sec)
INFO:tensorflow:global_step/sec: 132.039
INFO:tensorflow:loss = 0.4849593, step = 400 (0.757 sec)
INFO:tensorflow:global_step/sec: 121.437
INFO:tensorflow:loss = 0.4055175, step = 500 (0.825 sec)
INFO:tensorflow:global_step/sec: 122.379
INFO:tensorflow:loss = 0.28230205, step = 600 (0.817 sec)
INFO:tensorflow:global_step/sec: 122.163
INFO:tensorflow:loss = 0.4917924, step = 700 (0.819 sec)
INFO:tensorflow:global_step/sec: 122.509
The initial spike of 178 steps per second is reproducible across multiple runs and different prefetch amounts. I am trying to understand the underlying multi-threading mechanism that causes it.
Additional information:
My CPU usage peaks at 1800% on a 48-core machine. My GPU usage is consistently at only 9%. So it's pretty surprising that neither of them is exhausted. I am wondering whether the mutex in queue_runner is keeping the CPU processing from reaching its full potential, as described here?
Thanks,
John
[update] I also observed the same spike when I use prefetch_to_device(gpu_device, ..), with similar buffer sizes. Surprisingly, prefetch_to_device only slows things down, by about 10%.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into
INFO:tensorflow:loss = 1.3881096, step = 0
INFO:tensorflow:global_step/sec: 52.3374
INFO:tensorflow:loss = 0.48779136, step = 100 (1.910 sec)
INFO:tensorflow:global_step/sec: 121.154
INFO:tensorflow:loss = 0.3451385, step = 200 (0.827 sec)
INFO:tensorflow:global_step/sec: 89.3222
INFO:tensorflow:loss = 0.37804496, step = 300 (1.119 sec)
INFO:tensorflow:global_step/sec: 80.4857
INFO:tensorflow:loss = 0.49938473, step = 400 (1.242 sec)
INFO:tensorflow:global_step/sec: 79.1798
INFO:tensorflow:loss = 0.5120025, step = 500 (1.263 sec)
INFO:tensorflow:global_step/sec: 81.2081
It's common to see a spike in steps per second at the start of a training run, because the CPU has had time to fill the prefetch buffer before training starts draining it. Your steady-state steps per second are reasonable compared to the start, but the low CPU usage might indicate a bottleneck.
The first question is whether you are using the Dataset API in combination with the Estimator. From your terminal output I suspect you are; if not, I would start by changing your code to use the Estimator class. If you are already using the Estimator class, make sure you are following the performance best practices as documented here.
If you are already doing all of the above, then there is a bottleneck in your pipeline. Given the low CPU usage, I would guess you are experiencing an I/O bottleneck. Your dataset might be on a slow medium (hard drive), or you aren't using a serialized format and are saturating the IOPS (again, hard drive or network storage). In either case, start by using a serialized data format such as TFRecords and upgrade your storage to an SSD or to multiple hard drives in RAID 0, 1, or 10, your pick.
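The buffer explanation for the initial spike can be illustrated with a toy model. This is a minimal Python sketch, not TensorFlow code: the function name and all numbers are made up for illustration, the buffer capacity is ignored, and the producer and consumer are assumed to run at fixed rates. The consumer runs at its own maximum rate while the prefilled buffer lasts, then drops to the producer's rate.

```python
def simulate(prefilled, produce_per_tick, consume_per_tick, ticks):
    """Return the number of items consumed in each tick of a toy pipeline."""
    buffered = prefilled
    consumed = []
    for _ in range(ticks):
        buffered += produce_per_tick             # input pipeline adds items
        taken = min(buffered, consume_per_tick)  # trainer takes what it can
        buffered -= taken
        consumed.append(taken)
    return consumed


# Illustrative numbers only: the trainer could do 180 steps per tick,
# but the input pipeline sustains only 120.
rates = simulate(prefilled=128, produce_per_tick=120, consume_per_tick=180, ticks=5)
print(rates)
```

In this model the throughput stays at the consumer's limit for the first two ticks and then settles at the producer's rate, which has the same shape as the logs in the question. A steady-state rate pinned to the producer side is what an input bottleneck would look like.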

Graphite storage schema not working

I have configured the following storage schema in the Graphite /etc/carbon/storage-schemas.conf file, assuming that it would let me keep data with 60s precision for 365 days. However, when I read the data back using whisper-fetch, I get 60s precision for only one week of data. Any idea if I need to set this up in another file or am I missing something?
Storage schema
[collectd]
retentions = 60s:365d
Whisper info
whisper-info memory-buffered.wsp
maxRetention: 31536000
xFilesFactor: 0.5
aggregationMethod: average
fileSize: 855412
Archive 0
retention: 86400
secondsPerPoint: 10
points: 8640
size: 103680
offset: 52
Archive 1
retention: 604800
secondsPerPoint: 60
points: 10080
size: 120960
offset: 103732
Archive 2
retention: 31536000
secondsPerPoint: 600
points: 52560
size: 630720
offset: 224692
Your whisper-info.py output shows that it's not using the schema you claim. The most likely answer is that the Whisper file was created before changing the schema. In this case you need to either delete the file (and let it get created again) or use whisper-resize.py to apply the new schema.
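The mismatch can be made explicit by comparing the archives the schema would produce with the ones whisper-info reports. This is a small Python sketch with hypothetical helper names; it assumes retention tokens always carry a unit suffix, and it uses (secondsPerPoint, points) pairs copied by hand from the output above rather than reading the .wsp file.

```python
# Hypothetical helper; unit lengths follow the retention syntax used above.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def expected_archives(retentions):
    """Return [(secondsPerPoint, points), ...] for a retentions string."""
    archives = []
    for part in retentions.split(","):
        precision, duration = part.strip().split(":")
        step = int(precision[:-1]) * UNIT_SECONDS[precision[-1]]
        span = int(duration[:-1]) * UNIT_SECONDS[duration[-1]]
        archives.append((step, span // step))
    return archives


# Pairs copied from the whisper-info output in the question.
reported = [(10, 8640), (60, 10080), (600, 52560)]
configured = expected_archives("60s:365d")
matches = configured == reported

print("configured:", configured)
print("reported:  ", reported)
print("file matches schema:", matches)
```

The reported archives are what a retention string equivalent to 10s:1d,60s:7d,600s:365d would produce, which is consistent with the file having been created under a different schema.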

Spark reduceByKeyAndWindow emits windows at an unexpected interval

I'm using Spark Streaming (stream 2.1, Spark 1.6) from a Scala program; the relevant part of the code is listed below.
According to the documentation, with a window size of 3 and a slide interval of 2 the windowed stream should appear at time 3 and time 5, but I get it at time 4 and time 6. A window size of 3 with a slide interval of 1 works as expected.
val inputData: mutable.Queue[RDD[String]] = mutable.Queue()
var outputCollector = new ArrayBuffer[(String, Int)]
val inputStream = ssc.queueStream(inputData)
val patternStream: DStream[(String, Int)] = inputStream.flatMap(line => {
  line.replace(",", "").map(x => (x.toString(), 1))
})
val groupStream = patternStream.reduceByKeyAndWindow(_+_, _-_, Seconds(wt1), Seconds(st1))
inputStream.print()
patternStream.foreachRDD(rdd => {
  rdd.collect().foreach(print)
  println("\n")
})
groupStream.foreachRDD(rdd => {
  println("window stream")
  rdd.filter(s => s._2 > 0).sortByKey().collect().foreach(i => { outputCollector += i })
})
With window size 3 and slide interval 1, the output is as expected:
Time: 1000 ms
window stream (a,1)(b,1)(f,1)(g,1)
Time: 2000 ms
window stream (a,1)(b,1)(d,1)(e,1)(f,1)(g,1)
Time: 3000 ms
window stream (a,1)(b,1)(c,1)(d,2)(e,1)(f,1)(g,1)
Time: 4000 ms
window stream (a,1)(c,1)(d,3)(e,1)
Time: 5000 ms
window stream (a,1)(c,2)(d,3)
Time: 6000 ms
window stream (a,2)(c,2)(d,2)(f,1)(g,1)
With window size 3 and slide interval 2, the windowed stream appears at times 2, 4, and 6, which is not what I expect:
Time: 1000 ms
Time: 2000 ms
window stream (a,1)(b,1)(d,1)(e,1)(f,1)(g,1)
Time: 3000 ms
############## no windowed stream here
Time: 4000 ms
window stream (a,1)(c,1)(d,3)(e,1)
Time: 5000 ms
############ no windowed stream here
Time: 6000 ms
window stream (a,2)(c,2)(d,2)(f,1)(g,1)
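The two outputs above are consistent with windows being emitted only at multiples of the slide interval, counted from the start of the stream, with each window covering the preceding window-size seconds. The Python sketch below works through that arithmetic. It is not Spark code: the function name is hypothetical, and the emission rule is an assumption that happens to fit the output shown.

```python
def window_schedule(window, slide, batch, until):
    """Map each emit time to the batch times that its window covers.

    All arguments are in seconds. Assumes windows are emitted at multiples
    of `slide` and that a window emitted at time t covers (t - window, t].
    """
    schedule = {}
    t = slide
    while t <= until:
        start = t - window
        schedule[t] = [b for b in range(batch, t + 1, batch) if b > start]
        t += slide
    return schedule


for emit_time, batches in window_schedule(window=3, slide=2, batch=1, until=6).items():
    print("time", emit_time, "covers batches", batches)
```

Under this assumption a slide interval of 2 yields windows at times 2, 4, and 6, and each of them contains the same batches as the window emitted at that time with a slide interval of 1. That matches the counts in both outputs.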

Getting total number of key-value pairs in RocksDB

Is it possible to efficiently get the number of key-value pairs stored in a RocksDB key-value store?
I have looked through the wiki, and haven't seen anything discussing this topic thus far. Is such an operation even possible?
In code, you can use db->GetProperty("rocksdb.estimate-num-keys", &num) to obtain the estimated number of keys stored in a RocksDB database.
Another option is to use the sst_dump tool with --show_properties argument to get the number of entries, although the result would be per file basis. For example, the following command will show the properties of each SST file under the specified rocksdb directory:
sst_dump --file=/tmp/rocksdbtest-691931916/dbbench --show_properties --command=none
And here's the sample output:
Process /tmp/rocksdbtest-691931916/dbbench/000005.sst
Sst file format: block-based
Table Properties:
------------------------------
# data blocks: 845
# entries: 27857
raw key size: 668568
raw average key size: 24.000000
raw value size: 2785700
raw average value size: 100.000000
data block size: 3381885
index block size: 28473
filter block size: 0
(estimated) table size: 3410358
filter policy name: N/A
# deleted keys: 0
Process /tmp/rocksdbtest-691931916/dbbench/000008.sst
Sst file format: block-based
Table Properties:
------------------------------
# data blocks: 845
# entries: 27880
raw key size: 669120
...
Combined with some shell commands, this gives you the total number of entries:
sst_dump --file=/tmp/rocksdbtest-691931916/dbbench --show_properties --command=none | grep entries | cut -c 14- | awk '{x+=$0}END{print "total number of entries: " x}'
And this will generate the following output:
total number of entries: 111507
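The same summation can be done without the shell pipeline. Below is a minimal Python sketch that adds up the "# entries" lines from sst_dump output that has already been captured as text. The function name is hypothetical, the parsing assumes the output format shown in the sample above, and the sample string contains only the two files visible there, so its total differs from the full run.

```python
import re

# Matches lines such as "# entries: 27857", allowing leading whitespace.
ENTRY_LINE = re.compile(r"^\s*# entries:\s*(\d+)\s*$")


def total_entries(sst_dump_output):
    """Sum the per-file entry counts found in sst_dump --show_properties output."""
    total = 0
    for line in sst_dump_output.splitlines():
        match = ENTRY_LINE.match(line)
        if match:
            total += int(match.group(1))
    return total


# Abbreviated copy of the sample output from the answer (two SST files).
sample = """\
Process /tmp/rocksdbtest-691931916/dbbench/000005.sst
Table Properties:
# data blocks: 845
# entries: 27857
raw key size: 668568
Process /tmp/rocksdbtest-691931916/dbbench/000008.sst
Table Properties:
# data blocks: 845
# entries: 27880
raw key size: 669120
"""

print("total number of entries:", total_entries(sample))
```

As with the shell version, this is a sum over SST files, so it does not account for keys that are overwritten or deleted across files, or for data still in the memtable.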
There is no way to get the exact count. However, RocksDB 3.4, which was released recently, exposes a way to get an estimated key count; you can try it.
https://github.com/facebook/rocksdb/releases
