Packet profile from netflow - statistics

I have NetFlow data from the previous month, stored in one file per 5 minutes, and I would like to build a packet profile of all this traffic. I need a percentage breakdown of 1-packet flows, 2-packet flows, etc. It could also be done in categories like 1-packet flows, 1-100 packet flows, 100 and more... that part is not so important. My question is how to do it: how do I compute a percentage breakdown over data that I can't add together? Something like computing the percentages for every file and then taking some kind of average of them?

What do you mean by "I can't add together"? You actually can do that with nfdump, if you look at the manual: -R expr /dir/file1:file2 reads all files from file1 to file2. For instance,
nfdump -R /yournetflowfolder/nfcapd.201204051609:nfcapd.201204051639
will gather NetFlow information from 16:09 to 16:39. Then you can run whatever query you need on that data.

It sounds like you're describing a histogram: you create 'bins' of the sizes you describe and fill them with the raw counts. The sum of the counts across the bins is the total number of flows. To get the percentages of total traffic, you just normalize by dividing each bin by that total flow count.
So, if you build a two-bin histogram where the first bin counts all flows with fewer than 100 packets and the second counts flows with 100 or more (note that the bins can't have gaps or overlaps), and it works out to 30 flows in the former and 60 in the latter, then the total number of flows is 90 and 33% of the flows have fewer than 100 packets.
When working with multiple files, the trick is to always use the same bin delineations and to store and work with the raw counts as long as possible and only derive the %s as the very last step. You can add together histograms with no trouble as long as their bins mean the same thing, and then when you normalize the result, you have for each bin the total percent for all files. If you're going to need to add a file, just keep track of the raw counts so that you can re-normalize when there's new data.
You can do this in a tool like Matlab pretty easily, but be careful because many of these tools will very kindly auto-determine bin widths for you. So, the histogram for one file might have bins {x < 100, 100 <= x < 200, x >= 200} and another file, {x < 90, 90 <= x < 180, x >=180} and you won't be able to add the results together.
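For illustration, a rough Python sketch of that per-file accumulation (the nfdump invocation, its CSV output layout and the packet-count column are assumptions to adapt; only the fixed-bin bookkeeping matters):
# Accumulate fixed-bin counts of packets-per-flow across many nfcapd files,
# and only turn them into percentages at the very end.
import subprocess

BIN_EDGES = [1, 2, 100]          # bins: exactly 1, 2-99, 100+ packets (same for every file)
counts = [0] * len(BIN_EDGES)    # one raw counter per bin

def bin_index(packets):
    """Return the index of the bin this flow's packet count falls into."""
    for i, edge in enumerate(BIN_EDGES[1:], start=1):
        if packets < edge:
            return i - 1
    return len(BIN_EDGES) - 1

def add_file(path):
    """Read one 5-minute nfcapd file with nfdump and add its flows to the shared bins."""
    # "-o csv" and the column holding the packet count are assumptions: check your nfdump version.
    out = subprocess.run(["nfdump", "-r", path, "-o", "csv", "-q"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        fields = line.split(",")
        try:
            packets = int(fields[13])   # hypothetical packet-count column
        except (IndexError, ValueError):
            continue                    # skip headers / summary lines
        counts[bin_index(packets)] += 1

# Call add_file() for every 5-minute file, then normalise exactly once at the end:
total = sum(counts)
percentages = [100.0 * c / total for c in counts] if total else []
Because the raw counts are kept, adding next month's files later only means calling add_file() again and re-normalising.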

Related

How to aggregate data by period in a rrdtool graph

I have an RRD file with average ping times to a server (GAUGE) every minute, and when the server is offline (which is very frequent, for reasons that don't matter now) it stores a NaN/unknown.
I'd like to create a graph showing the percentage of each hour that the server is offline, which I think can be achieved by counting the NaNs within every 60 samples and then dividing by 60.
So far I've got to the point where I define a variable that is 1 when the server is offline and 0 otherwise, but I've already read the docs and don't know how to aggregate this:
DEF:avg=server.rrd:rtt:AVERAGE CDEF:offline=avg,UN,1,0,IF
Is it possible to do this when creating a graph, or will I have to store that info in another RRD?
I don't think you can do exactly what you want, but you have a couple of options.
You can define a sliding-window average that shows the percentage of the previous hour that was unknown, and graph that using TREND.
DEF:avg=server.rrd:rtt:AVERAGE:step=60
CDEF:offline=avg,UN,100,0,IF
CDEF:pcoffline=offline,3600,TREND
LINE:pcoffline#ff0000:Offline
This defines avg as the 1-minute time series of ping data. Note we use step=60 to ensure we get the best data resolution even in a smaller graph. Then we define offline as 100 when the value is unknown (server offline) and 0 when it is not. Then pcoffline is a 1-hour sliding-window average of this, which is in effect the percentage of the previous hour during which the server was offline (subtract it from 100 if you want availability instead).
However, there's a problem: RRDTool will silently summarise the source data before you get your hands on it if there are many data points per pixel in the graph (this won't happen when doing a fetch, of course). To get around that, you'd need the offline calculation to be done at store time, i.e. have a COMPUTE-type DS that is 100 when the rtt DS is unknown and 0 otherwise. Then any averaging will preserve the data (normal averaging omits the unknowns, or the xff setting makes the whole CDP unknown).
rrdtool create ...
DS:rtt:GAUGE:120:0:9999
DS:offline:COMPUTE:rtt,UN,100,0,IF
rrdtool graph ...
DEF:offline=server.rrd:offline:AVERAGE:step=3600
LINE:offline#ff0000:Offline
If you are able to modify your RRD, and do not need historical data, then use of a COMPUTE in this way will allow you to display your data in a 1-hour stepped graph as you wanted.

Optimization of a Packet Generation Script

I have a dataset of 250+ million entries of NetFlow data. My goal is to develop an efficient way of generating packets for this NetFlow data, and I've decided to use Scapy as my means of emulating packets. Looking at the entries in my dataset, I've found that the vast majority of packets are sent within a time frame capped at five minutes, and some of these entries have packet totals as high as hundreds of thousands!
So here's the issue: I'll need a way to keep track of thousands of different entries at once to ensure that each entry's packets are sent out at the correct delta time until all packets have been sent.
For example:
Let's say that we have 3 entries in our dataset. The first is 100 packets spanning 20 seconds. The next is 200 packets spanning 20 seconds. The last is 500 packets spanning 60 seconds. The delta times between consecutive packets are therefore 0.2 s, 0.1 s, and 0.12 s respectively (20/100, 20/200, 60/500). Now I have only three entries that need to deliver packets at the rate of these delta times. The first few packets of each entry, appended to a list in send order, would be: [0.0s from entry 1, 0.0s from entry 2, 0.0s from entry 3, 0.1s from entry 2, 0.12s from entry 3, 0.2s from entry 1, 0.2s (0.1 x 2) from entry 2, 0.24s (0.12 x 2) from entry 3, 0.3s (0.1 x 3) from entry 2, 0.36s (0.12 x 3) from entry 3, 0.4s (0.2 x 2) from entry 1]
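To make that ordering concrete, here is a rough Python sketch of how such an interleaving can be produced lazily with a heap merge (entry fields and the packet construction step are placeholders, not the code from my repo):
# Merge per-entry packet schedules into one globally time-ordered stream
# without ever building and sorting a single huge list.
import heapq

def schedule(entry_id, n_packets, duration):
    """Yield (send_time, entry_id) pairs for one entry, evenly spaced."""
    delta = duration / n_packets
    for i in range(n_packets):
        yield (i * delta, entry_id)

entries = [
    (1, 100, 20.0),   # 100 packets over 20 s -> 0.2 s apart
    (2, 200, 20.0),   # 200 packets over 20 s -> 0.1 s apart
    (3, 500, 60.0),   # 500 packets over 60 s -> 0.12 s apart
]

# heapq.merge pulls one pending item per entry and always yields the
# smallest timestamp next, so the output is already in send order.
merged = heapq.merge(*(schedule(eid, n, d) for eid, n, d in entries))

for send_time, entry_id in merged:
    pass   # build/emit the Scapy packet for entry_id here
Since heapq.merge only keeps one pending timestamp per entry in memory, the interleaved order comes out directly and no sorting pass is needed.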
What is the best system I can come up with that will allow this process to happen with all packets being outputted to the same list concurrently?
Thus far, I've tried simply going entry by entry (adding packets with the correct delta times), appending everything to one huge list of packets, then moving on to the next entry, and so on and so forth. This method works; however, not only are the packets out of time order and in need of a sorting pass, but it's extremely slow and takes 15+ minutes to process just the first 100 entries of this 250 MILLION entry dataset! I think I could definitely pipeline this process so as to work on appending multiple entries to the list at once, but I'm simply not sure how to go about doing so.
The current code is on my github, https://github.com/NolanRudolph/UONetflow/blob/master/PyScripts/netflowPackager.py, but I don't believe one needs to read over my code to solve this problem. I believe someone with deep knowledge of Python could offer an optimized approach quickly. I've heard about utilizing sets, which use hashing, but I'm not sure how I'd integrate that into exporting a multitude of entries' packets in an efficient manner.
I can see how this question could become very confusing really quickly, so please don't hesitate to ask clarifying questions. Any help would be greatly appreciated. Thank you so much!

Understanding audio file spectrogram values

I am currently struggling to understand how the power spectrum is stored in the kaldi framework.
I seem to have successfully created some data files using
$cmd JOB=1:$nj $logdir/spect_${name}.JOB.log \
compute-spectrogram-feats --verbose=2 \
scp,p:$logdir/wav_spect_${name}.JOB.scp ark:- \| \
copy-feats --compress=$compress $write_num_frames_opt ark:- \
ark,scp:$specto_dir/raw_spectogram_$name.JOB.ark,$specto_dir/raw_spectogram_$name.JOB.scp
This gives me a large file with data points for different audio files, like this.
The problem is that I am not sure how I should interpret this data set. I know that prior to this an FFT is performed, which I guess is a good thing.
The output example given above is from a file which is 1 second long.
All the standard settings have been used for computing the spectrogram, so the sample frequency should be 16 kHz, frame length = 25 ms and frame shift = 10 ms.
The number of data points in the first set is 25186.
Given this information, can I interpret the output in some way?
Usually when one performs fft, the frequency bin size can be extracted by F_s/N=bin_size where F_s is the sample frequency and N is the FFT length. So is this the same case? 16000/25186 = 0.6... Hz/bin?
Or am I interpreting it incorrectly?
Usually when one performs fft, the frequency bin size can be extracted by F_s/N=bin_size where F_s is the sample frequency and N is the FFT length.
So is this the same case? 16000/25186 = 0.6... Hz/bin?
The formula F_s/N is indeed what you would use to compute the frequency bin size. However, as you mention, N is the FFT length, not the total number of samples. Based on the approximately 25 ms frame length, the 10 ms hop size and the fact that your generated output data file has 98 lines of 257 values for a presumably real-valued input (257 = 512/2 + 1 bins per frame, and roughly 98 frames fit in 1 second at a 10 ms hop), it would seem that the FFT length used was 512. This gives a frequency bin size of 16000/512 = 31.25 Hz/bin.
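A quick Python check of that arithmetic, using the values quoted above (frame-counting conventions vary slightly between tools, so treat the frame count as approximate):
# Sanity-check the spectrogram dimensions implied by the standard settings.
fs = 16000                 # sample rate (Hz)
window = int(fs * 0.025)   # 25 ms frame length -> 400 samples
hop = int(fs * 0.010)      # 10 ms frame shift  -> 160 samples
n_fft = 512                # next power of two >= 400
n_bins = n_fft // 2 + 1    # 257 values per frame for real-valued input
n_frames = 1 + (fs * 1 - window) // hop   # 98 frames for 1 second of audio

print(n_frames, n_bins, n_frames * n_bins)   # 98 257 25186
print(fs / n_fft)                            # 31.25 Hz per bin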
Based on this scaling, plotting your raw data with the following Matlab script (with the data previously loaded in the Z matrix):
fs = 16000; % 16 kHz sampling rate
hop_size = 0.010; % 10 millisecond
[X,Y]=meshgrid([0:size(Z,1)-1]*hop_size, [0:size(Z,2)-1]*fs/512);
surf(X,Y,transpose(Z),'EdgeColor','None','facecolor','interp');
view(2);
xlabel('Time (seconds)');
ylabel('Frequency (Hz)');
gives this graph (the dark red regions are the areas of highest intensity):

Calculating the throughput (requests/sec) and plot it

I'm using the JMeter client to test the throughput of a certain workload (PHP+MySQL, 1 page) on a certain server. Basically I'm doing a "capacity test" with an increasing number of threads over time.
I installed the "Statistical Aggregate Report" JMeter plugin and this was the result (ignore the "Response time" line):
At the same time I used the "Simple Data Writer" listener to write a log file ("JMeter.csv"). Then I tried to "manually" calculate the throughput for every second of the test.
Each line of "JMeter.csv" has this format:
timestamp elapsedtime responsecode success bytes
1385731020607 42 200 true 325
... ... ... ... ...
The timestamp refers to the time when the request is made by the client, not when the request is served by the server. So I simply did: totaltime = timestamp + elapsedtime.
In the next step I converted the totaltime to a date format, like: 13:17:01.
I have more than 14K samples and with Excel I was able to do this quickly.
Then I counted how many samples there were for each second. Example:
totaltime samples (requestsServed/second)
13:17:01 204
13:17:02 297
... ...
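The same per-second counting can also be done straight from the CSV with a short script; a rough Python sketch assuming the column order shown above (skip or adapt the header line if your file has one):
# Count completed requests per second from the Simple Data Writer CSV.
import csv
from collections import Counter
from datetime import datetime

per_second = Counter()
with open("JMeter.csv", newline="") as f:
    for row in csv.reader(f):
        try:
            timestamp_ms = int(row[0])   # request start time (epoch ms)
            elapsed_ms = int(row[1])     # elapsed time (ms)
        except (IndexError, ValueError):
            continue                     # skip header or malformed lines
        totaltime = timestamp_ms + elapsed_ms
        second = datetime.fromtimestamp(totaltime / 1000).strftime("%H:%M:%S")
        per_second[second] += 1

for second in sorted(per_second):
    print(second, per_second[second])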
When I tried to plot the results I obtained the following graphic:
As you can see, it is quite different from the first graph.
Given that the first graph is correct, what is the mistake in my formula/procedure for calculating the throughput?
It turns out that this plugin is plotting something I can't identify... I tried many times and my calculations were actually correct. Be careful with this plugin (or check its source code).
Throughput can be viewed in the JMeter Summary Report, and you can calculate it yourself from the saved test results file that feeds the Summary Report.
Throughput = (Number of samples / (Max(ts + t) - Min(ts))) * 1000
That is, the number of samples divided by the difference between the latest response end time (max of ts + t) and the earliest request start time (min of ts), multiplied by 1000 to convert from milliseconds to requests per second.
With this formula you can calculate the throughput for each and every HTTP request in the Summary Report.
Example:
Max end time = 1485538701633 + 569 = 1485538702202
Min start time = 1485538143112
Throughput = (2 / (1485538702202 - 1485538143112)) * 1000
Throughput = (2 / 559090) * 1000
Throughput ≈ 0.0036/sec
You can read more, with examples, at http://www.wikishown.com/how-to-calculate-throughput-in-jmeter/, which is where I got a good idea of how throughput is calculated.
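For completeness, the Summary Report formula above can be reproduced from the same CSV; a rough Python sketch assuming the first two columns are the start timestamp and elapsed time in milliseconds:
# Overall throughput = samples / (max(ts + t) - min(ts)) * 1000, as in the Summary Report.
import csv

starts, ends = [], []
with open("JMeter.csv", newline="") as f:
    for row in csv.reader(f):
        try:
            ts, elapsed = int(row[0]), int(row[1])
        except (IndexError, ValueError):
            continue                     # skip header or malformed lines
        starts.append(ts)
        ends.append(ts + elapsed)

throughput = len(starts) / (max(ends) - min(starts)) * 1000   # requests per second
print(round(throughput, 2), "req/s")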

Creating a measure that combines a percentage with a low decimal number?

I'm working on a project in Tableau (which uses functions very similar to Excel, if that helps) where I need a single measure derived from two different measurements, one of which is a low decimal number (2.95 on the high end, 0.00667 on the low end) and the other is a percentage (ranging from 29.8 to 100 percent).
Put another way, I have two tables detailing bus punctuality: one is for high-frequency routes and measured in Excess Waiting Time (EWT, in minutes), the other for low-frequency routes and measured in terms of percent on time. I have a map of all the routes, and want to colour the lines based on how punctual each route is (thinner lines for routes with a low EWT or a high percentage on time; thicker lines for routes with a high EWT or a low percentage on time). In preparation for this, I've combined both tables and zeroed out the non-existent values.
I thought I'd do something like log(EWT + PercentOnTime), but am realizing that might not give the value I want (especially because I ultimately need to invert one or the other, since a low EWT is favourable and a high % on time is favourable).
Any idea how I'd do this? Thanks!
If you are combining/comparing the metrics in an even manner and the data is relatively linear, then all you need to do is normalise them.
Say you have the expected EWT range (e.g. 0.00667 to 2.95). Then an EWT of 2 would be
(2 - 0.00667)/(2.95 - 0.00667) = 0.67723, but because EWT is semantically the inverse of punctuality we need to use 1 - 0.67723 = 0.32277.
If you do the same for the Punctuality percentage range:
Eg. 80%
(80 - 29.8)/(100 - 29.8) = 0.7151
You can compare these metrics because they are normalised (between 0 and 1; multiply by 100 to get percentages), assuming the underlying metrics, the inverted EWT and the on-time percentage (OTP), are analogous.
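As a quick illustration, here is the same min-max normalisation (with the EWT inversion) in Python, using the range constants quoted above; the function name is purely illustrative:
# Min-max normalise both metrics onto a 0-1 scale so they can be combined.
EWT_MIN, EWT_MAX = 0.00667, 2.95    # Excess Waiting Time range (minutes)
OTP_MIN, OTP_MAX = 29.8, 100.0      # on-time percentage range

def punctuality_score(ewt=None, otp=None):
    """Return a 0-1 punctuality score; higher means more punctual."""
    if otp is not None and otp > 0:
        return (otp - OTP_MIN) / (OTP_MAX - OTP_MIN)
    # EWT is inverted: a low waiting time means high punctuality.
    return 1 - (ewt - EWT_MIN) / (EWT_MAX - EWT_MIN)

print(punctuality_score(ewt=2))    # ~0.323, matching the EWT example above
print(punctuality_score(otp=80))   # ~0.715, matching the percentage example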
Thus you can combine these into a single table. You will want to ignore all zeroed values, as a zero is actually an indication that you have no data at that point.
You'll have to use an IF statement, something like:
IF [OTP] > 0 THEN ([OTP] - 29.8)/(100 - 29.8) ELSE 1 - (([EWT] - 0.00667)/(2.95 - 0.00667)) END
Hope this helps.
