I have some data flows that need to be processed, and I am thinking about using Spark Streaming for the job. There is one thing I am unsure and worried about, though.
My requirements are as follows:
Data arrives as CSV files every 5 minutes. I need reports on the most recent 5 minutes, 1 hour, and 1 day of data. So if I set up a Spark Streaming job for this calculation, I need a batch interval of 5 minutes, and I also need to set up two windows: 1 hour and 1 day.
Every 5 minutes, 1 GB of data comes in. So the one-hour window will cover 12 GB (60/5) of data and the one-day window will cover 288 GB (24*60/5) of data.
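The arithmetic behind those two figures, as a quick Python sketch. The constant and function names are made up for illustration; the 5-minute interval and 1 GB per batch come from the numbers above.

```python
# Figures from the question: one 1 GB batch arrives every 5 minutes.
BATCH_MINUTES = 5
GB_PER_BATCH = 1


def window_gb(window_minutes):
    """Raw data volume (in GB) covered by a window of the given length."""
    batches_in_window = window_minutes // BATCH_MINUTES
    return batches_in_window * GB_PER_BATCH


hour_gb = window_gb(60)       # 12 batches in the one-hour window
day_gb = window_gb(24 * 60)   # 288 batches in the one-day window
print(hour_gb, day_gb)
```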
I do not have much experience with Spark, so this worries me.
Can Spark handle such a big window?
How much RAM do I need to process those 288 GB of data? More than 288 GB of RAM? (I know this may depend on my disk I/O, CPU, and the calculation pattern, but I just want an estimate based on experience.)
If calculating over one day or one hour of data is too expensive in a streaming job, do you have a better suggestion?
I'm currently using a 1-node cluster with DataStax OpsCenter 5.2.1 (Cassandra 2.2.3) installed on Windows.
Not much data is being sent to the cluster, and here is the graph (last 20 minutes) of write requests that I see in OpsCenter. The graph looks normal and as expected to me:
write_requests(20min)
However, when I switched the time range to the last 1 hour, it turned out there were many more write requests (according to the cluster(max) line):
write_requests(1h)
I'm confused. Could someone clarify what cluster(max) means in my case? Why are these values so big in comparison with cluster(total) or cluster(min)?
The first graph (20 minute) uses an average. The 1h graph will have 3 lines - min per sample, average, and max per sample.
What you're likely seeing is that something (perhaps OpsCenter itself) is doing a flood of writes, about 700/second for a few seconds. On the 20 minute graph it gets averaged out, but with the min/max lines you'll see the outliers.
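To make the averaging effect concrete, here is a small Python sketch. It is not how OpsCenter computes its rollups internally (that is an assumption-free toy, with a made-up `rollup` helper and invented sample values); it only shows how a short burst nearly vanishes in an average while still dominating the max line.

```python
def rollup(samples, bucket_size):
    """Collapse raw samples into one (min, average, max) tuple per bucket."""
    buckets = []
    for start in range(0, len(samples), bucket_size):
        bucket = samples[start:start + bucket_size]
        buckets.append((min(bucket), sum(bucket) / len(bucket), max(bucket)))
    return buckets


# Hypothetical minute of traffic: 5 writes/s, with a 3-second burst of 700/s.
samples = [5] * 60
samples[30:33] = [700] * 3

lowest, average, highest = rollup(samples, 60)[0]
print(lowest, average, highest)
```

The average for that minute stays far below the burst rate, while the max per sample records the burst in full, which matches the gap between the two graphs.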
Is there any way to get a count of how often DynamoDB throughput (write units/read units) was downscaled within the last 24 hours?
My idea is to downscale as soon as a huge drop (e.g. 50%) in the needed provisioned write units occurs. I have really peaky traffic, so it is interesting to me to downscale after every peak. However, I have an analytics job running at night that provisions a huge amount of read units, which makes it necessary to be able to downscale after it. Thus I need to limit downscales to 3 times within 24 hours.
The number of decreases is returned in a DescribeTable result as part of the ProvisionedThroughputDescription.
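A minimal Python sketch of reading that value, assuming boto3 as the client library (the question does not say which SDK is in use). The field is NumberOfDecreasesToday inside the table's ProvisionedThroughput block. As far as I know it counts decreases in the current UTC calendar day rather than a rolling 24 hours, so treat that as something to verify. The helper names and the default limit of 3 are illustrative, taken from the question's own budget.

```python
def decreases_today(describe_table_response):
    """Extract NumberOfDecreasesToday from a DescribeTable response dict."""
    throughput = describe_table_response["Table"]["ProvisionedThroughput"]
    return throughput["NumberOfDecreasesToday"]


def fetch_decreases_today(table_name):
    """Call DescribeTable for `table_name` and return its decrease count."""
    import boto3  # imported lazily so the pure helpers work without AWS access

    client = boto3.client("dynamodb")
    return decreases_today(client.describe_table(TableName=table_name))


def can_downscale(decreases_so_far, limit=3):
    """True while the self-imposed downscale budget is not used up."""
    return decreases_so_far < limit


# Hypothetical response fragment, shaped like a real DescribeTable result.
sample_response = {
    "Table": {
        "ProvisionedThroughput": {
            "ReadCapacityUnits": 100,
            "WriteCapacityUnits": 50,
            "NumberOfDecreasesToday": 2,
        }
    }
}
print(can_downscale(decreases_today(sample_response)))
```

Calling fetch_decreases_today("my-table") with a real table name (the one shown is a placeholder) before each planned decrease would let the scaling logic skip a downscale once the budget is spent.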
I've got pretty unusual latency patterns in my production setup:
the whole cluster (3 machines: 48 gig ram, 7500 rpm disk, 6 cores) shows latency spikes every 10 minutes, all machines at the same time.
See this screenshot.
I checked the logfiles, and it seems that no compactions are taking place at that time.
I've got 2k reads and 5k reads/sec. No optimizations have been made so far.
Caching is set to "ALL", hit rate for row cache is at ~0,7.
Any ideas? Is tuning memtable size an option?
Best,
Tobias
We have a metric that we increment every time a user performs a certain action on our website, but the graphs don't seem to be accurate.
So going off this hunch, we investigated the updates.log of carbon and discovered that the action had happened over 4 thousand times today (using grep and wc), but according to the integral result of the graph it happened only 220ish times.
What could be the cause of this? Data is reported to StatsD using the StatsD PHP library by calling statsd::increment('metric'); and as stated above, the log confirms that 4,000+ updates to this key happened today.
We are using:
graphite 0.9.6 with statsD (etsy)
After some research through the documentation, and some conversations with others, I've found the problem - and the solution.
The way the whisper file format is designed, it expects you (or your application) to publish updates no faster than the minimum interval in your storage-schemas.conf file. This file configures how much data you retain at different time resolutions.
My storage-schemas.conf file was set with a minimum interval of 1 minute. The default StatsD daemon (from etsy) is designed to flush to carbon (the graphite daemon) every 10 seconds. The reason this is a problem is that over a 60 second period StatsD reports 6 times, and each write overwrites the last one (within that 60 second interval, because you're updating faster than once per minute). This produces really weird results on your graph, because the last 10 seconds in a minute could be completely dead and report a 0 for the activity during that period, which wipes out all of the data you had written for that minute.
To fix this, I had to re-configure my storage-schemas.conf file to store data at a maximum resolution of 10 seconds, so every update from StatsD would be saved in the whisper database without being overwritten.
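The overwrite behaviour can be imitated with a few lines of Python. This is only a toy model of "one value per time slot, last write wins", not whisper's actual code; the `store` helper and the flush values are invented for illustration.

```python
def store(points, step):
    """Keep one value per `step`-second slot; a later write replaces an earlier one."""
    slots = {}
    for timestamp, value in points:
        slot_start = timestamp - timestamp % step
        slots[slot_start] = value
    return slots


# Hypothetical StatsD flushes: one (timestamp, value) pair every 10 seconds.
flushes = [(0, 12), (10, 7), (20, 30), (30, 4), (40, 9), (50, 0)]

coarse = store(flushes, 60)  # 1-minute slots: only the final flush survives
fine = store(flushes, 10)    # 10-second slots: every flush is kept
print(coarse, sum(fine.values()))
```

With 60 second slots the quiet final flush is all that remains for the minute, while 10 second slots keep all six values.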
Etsy published the storage-schemas.conf configuration that they were using for their installation of carbon, which looks like this:
[stats]
priority = 110
pattern = ^stats\..*
retentions = 10:2160,60:10080,600:262974
This has a 10 second minimum interval and stores 6 hours' worth of those data points. However, due to my next problem, I extended the retention periods significantly.
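For anyone reading that retentions line for the first time, each entry is seconds-per-point:number-of-points. A short Python sketch (the `parse_retentions` helper is made up for illustration, not part of carbon) turns it into durations:

```python
def parse_retentions(spec):
    """Turn 'seconds_per_point:points,...' into (step_seconds, total_seconds) pairs."""
    archives = []
    for part in spec.split(","):
        step, points = (int(field) for field in part.split(":"))
        archives.append((step, step * points))
    return archives


for step, total_seconds in parse_retentions("10:2160,60:10080,600:262974"):
    # Print each archive's resolution and how many hours of data it covers.
    print(step, total_seconds / 3600)
```

The first archive works out to 21,600 seconds (6 hours), the second to 604,800 seconds (7 days), and the third to roughly 5 years.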
As I let this data collect for a few days, I noticed that it still looked off (and was under-reporting). This was due to 2 problems.
StatsD (older versions) only reported an average number of events per second for each 10 second reporting period. This means that if you incremented a key 100 times in 1 second and 0 times for the next 9 seconds, at the end of the 10th second StatsD would report 10 to graphite instead of 100 (100/10 = 10). This failed to report the total number of events for a 10 second period.
Newer versions of StatsD fix this problem, as they introduced the stats_counts bucket, which logs the total number of events per metric for each 10 second period (so instead of reporting 10 in the previous example, it reports 100).
After I upgraded StatsD, I noticed that the last 6 hours of data looked great, but as I looked beyond the last 6 hours, things looked weird, and the next reason is why:
As graphite stores data, it moves data from high precision retention to lower precision retention. This means, using the etsy storage-schemas.conf example, that after 6 hours of 10 second precision, data was moved to 60 second (1 minute) precision. In order to move 6 data points from 10s to 60s precision, graphite averages the 6 data points: it takes the total value of the oldest 6 data points and divides it by 6. This gives an average number of events per 10 seconds for that 60 second period (and not the total number of events, which is what we care about specifically).
This is just how graphite is designed, and for some cases it might be useful, but in our case it's not what we wanted. To "fix" this problem, I increased our 10 second precision retention time to 60 days. Beyond 60 days, I store the minutely and 10-minutely precisions, but they're essentially there for no reason, as that data isn't as useful to us.
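Both under-reporting effects are easy to reproduce in a few lines of Python. This is a toy model, not StatsD's or graphite's real code: the `flush` and `roll_up` helpers and the sample counts are invented for illustration, and the "sum" option is included only for contrast with averaging.

```python
FLUSH_INTERVAL = 10  # seconds between StatsD flushes, as described above


def flush(increments):
    """Summarise one flush period as a per-second rate and as a total count."""
    total = sum(increments)
    return {"per_second": total / FLUSH_INTERVAL, "count": total}


def roll_up(points, factor, method):
    """Merge every `factor` fine-grained points into one coarse point."""
    merged = []
    for start in range(0, len(points), factor):
        chunk = points[start:start + factor]
        total = sum(chunk)
        merged.append(total / len(chunk) if method == "average" else total)
    return merged


# Problem 1: 100 increments in the first second, nothing for the next 9.
period = flush([1] * 100)
print(period)

# Problem 2: six hypothetical 10-second counts rolled into one 60-second point.
counts = [100, 0, 0, 0, 0, 20]
print(roll_up(counts, 6, "average"), roll_up(counts, 6, "sum"))
```

The per-second figure hides the true total for the flush period, and averaging during the rollup hides it again for the minute, which is why an integral over the older data comes out too low.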
I hope this helps someone, I know it annoyed me for a few days - and I know there isn't a huge community of people that are using this stack of software for this purpose, so it took a bit of research to really figure out what was going on and how to get a result that I wanted.
After posting my comment above I found Graphite 0.9.9 has a (new?) configuration file, storage-aggregation.conf, in which one can control the aggregation method per pattern. The available options are average, sum, min, max, and last.
http://readthedocs.org/docs/graphite/en/latest/config-carbon.html#storage-aggregation-conf