Strategy for data handling/custom colors in ShaderMaterial - colors

I need strategic help to get going.
I need to animate a large network of THREE.LineSegments (about 150k segments, static) with custom (vertex) colors (RGBA). For each segment, I have 24 measured values over 7 days[so 5.04 × 10^7 measured values, or 2.016 × 10^8 vec4 color buffer array size, about 770Mb in float32array size].
The animation is going through each hour for each day in 2.5 second steps and needs to apply an interpolated color to each segment on a per-frame-basis (via time delta). To be able to apply an alpha value to the vertex colors I need to use a THREE.ShaderMaterial with a vec4 color attribute.
What I don´t get my head around is how to best handle that amount of data per vertex. Some ideas are to
calculate the RGBA values in render loop (between current color array and the one for the coming hour via interpolation with a time delta) and update the color buffer attribute[I expect a massive drop in framerate]
have a currentColor and nextColor buffer attribute (current hour and next hour), upload both anew in every step (2.5 sec) to the GPU and do the interpolation in the shader (with aditional time delta uniform)[seems the best option to me]
upload all data to the GPU initially, either with multiple custom attributes or one very large buffer array, and do the iteration and interpolation in the shader[might be even better if possible; I know I can set the offset for the buffer array to read in for each vertex, but not sure if that works as I think it does...]
do something in between, like upload the data for one day in a chunk instead of either all data or hourly data
Do the scenario and the ideas make sense? If so:
What would be the most suitable way of doing this?
I appreciate any additional advice.

Related

How to aggregate data by period in a rrdtool graph

I have a rrd file with average ping times to a server (GAUGE) every minute and when the server is offline (which is very frequent for reasons that doesn't matter now) it stores a NaN/unknown.
I'd like to create a graph with the percentage the server is offline each hour which I think can be achieved by counting every NaN within 60 samples and then dividing by 60.
For now I get to the point where I define a variable that is 1 when the server is offline and 0 otherwise, but I already read the docs and don't know how to aggregate this:
DEF:avg=server.rrd:rtt:AVERAGE CDEF:offline=avg,UN,1,0,IF
Is it possible to do this when creating a graph? Or I will have to store that info in another rrd?
I don't think you can do exactly what you want, but you have a couple of options.
You can define a sliding window average, that shows the percentage of the previous hour that was unknown, and graph that, using TRENDNAN.
DEF:avg=server.rrd:rtt:AVERAGE:step=60
CDEF:offline=avg,UN,100,0,IF
CDEF:pcavail=offline,3600,TREND
LINE:pcavail#ff0000:Availability
This defines avg as the 1-min time series of ping data. Note we use step=60 to ensure we get the best resolution of data even in a smaller graph. Then we define offline as 100 when the server is there, 0 when not. Then, pcavail is a 1-hour sliding window average of this, which will in effect be the percentage of time during the previous hour during which the server was available.
However, there's a problem in that RRDTool will silently summarise the source data before you get your hands on it, if there are many data points to a pixel in the graph (this won't happen if doing a fetch of course). To get around that, you'd need to have the offline CDEF done at store time -- IE, have a COMPUTE type DS that is 100 or 0 depending on if the avg DS is known. Then, any averaging will preserve data (normal averaging omits the unknowns, or the xff setting makes the whole cdp unknown).
rrdtool create ...
DS:rtt:GAUGE:120:0:9999
DS:offline:COMPUTE:rtt,UN,100,0,IF
rrdtool graph ...
DEF:offline=server.rrd:offline:AVERAGE:step=3600
LINE:offline#ff0000:Availability
If you are able to modify your RRD, and do not need historical data, then use of a COMPUTE in this way will allow you to display your data in a 1-hour stepped graph as you wanted.

The 95% of non-normally distributed points around a mean/median

I asked users to tap a location repeatedly. To calculate the size of a target in that location, such that 95% of users will hit that target successfully, I usually measure 2 std of the tap offsets from the centroid. That works if the tap offsets are normally distributed, but my data now is not distributed normally. How can I figure out the equivalent of a 2 std around the mean/median?
If you're only measuring in one dimension, the region encompassed by +/-2 std in a Normal distribution corresponds fairly well to the central 95% of the distribution. Perhaps it's worth working with quantiles instead - take the interval corresponding to that within the 2.5th and 97.5th percentiles - this will be robust to skew or any other departure from normality.

Understanding audio file spectrogram values

I am currently struggling to understand how the power spectrum is stored in the kaldi framework.
I seem to have successfully created some data files using
$cmd JOB=1:$nj $logdir/spect_${name}.JOB.log \
compute-spectrogram-feats --verbose=2 \
scp,p:$logdir/wav_spect_${name}.JOB.scp ark:- \| \
copy-feats --compress=$compress $write_num_frames_opt ark:- \
ark,scp:$specto_dir/raw_spectogram_$name.JOB.ark,$specto_dir/raw_spectogram_$name.JOB.scp
Which gives me a large file with data point for different audio files, like this.
The problem is that I am not sure on how I should interpret this data set, I know that prior to this an fft is performed, which I guess is a good thing.
The output example given above is from a file which is 1 second long.
all the standard has been used for computing the spectogram, so the sample frequency should be 16 kHz, framelength = 25 ms and overlap = 10 ms.
The number of data points in the first set is 25186.
Given these informations, can I interpret the output in some way?
Usually when one performs fft, the frequency bin size can be extracted by F_s/N=bin_size where F_s is the sample frequency and N is the FFT length. So is this the same case? 16000/25186 = 0.6... Hz/bin?
Or am I interpreting it incorrectly?
Usually when one performs fft, the frequency bin size can be extracted by F_s/N=bin_size where F_s is the sample frequency and N is the FFT length.
So is this the same case? 16000/25186 = 0.6... Hz/bin?
The formula F_s/N is indeed what you would use to compute the frequency bin size. However, as you mention N is the FFT length, not the total number of samples. Based on the approximate 25ms framelength, 10ms hop size and the fact that your generated output data file has 98 lines of 257 values for some presumably real-valued input, it would seem that the FFT length used was 512. This would give you a frequency bin size of 16000/512 = 31.25 Hz/bin.
Based on this scaling, plotting your raw data with the following Matlab script (with the data previously loaded in the Z matrix):
fs = 16000; % 16 kHz sampling rate
hop_size = 0.010; % 10 millisecond
[X,Y]=meshgrid([0:size(Z,1)-1]*hop_size, [0:size(Z,2)-1]*fs/512);
surf(X,Y,transpose(Z),'EdgeColor','None','facecolor','interp');
view(2);
xlabel('Time (seconds)');
ylabel('Frequency (Hz)');
gives this graph (the dark red regions are the areas of highest intensity):

Array of RDDs? One RDD for a time window

I have a question about bucketing time events with Spark, and the best way to handle it.
So I'm ingesting a very large dataset, with specific start/stop times for each event.
For instance, I might load in three weeks of data. Within the main time window, I divide that into buckets of smaller intervals. So 3 weeks divided into 24 hour time buckets, with an array that looks like [(start_epoch, stop_epoch), (start_epoch, stop_epoch), ...]
Within each time bucket I map/reduce my events down into a smaller set.
I'd like to keep the events split up by the time bucket they belong to.
What is the best way to handle this? Each map/reduce operation results in a new RDD so I'm effectively left with a large array of RDDs.
Is it "safe" to just loop over that array from the driver, and then do other transformations/actions on each RDD to get results each time window?
Thanks!
I would suggest to think about it a bit differently:
You want to read your data, and then "keyBy" time rounded to hour resolution. And then you can reduceByKey(or combineByKey if you want another type in output).
While working with spark it's not necessary to collect items into arrays by some key(even antipattern)
RDD[Event] -> keyBy ts rounded to hour -> RDD[(hour, event)] -> reduceByKey(i.e. hour) -> RDD[(hour, aggregated view of all events in this hour)]

Creating a measure that combines a percentage with a low decimal number?

I'm working on a project in Tableau (which uses functions very similar to Excel, if that helps) where I need a single measurement derived from two different measurements, one of which is a low decimal number (2.95 on the high end, 0.00667 on the low) and the other is a percentage (ranging from 29.8 to 100 percent).
Put another way, I have two tables detailing bus punctuality -- one is for high frequency routes and measured in Excess Waiting Time (EWT, in minutes), the other for low frequency routes and measured in terms of percent on time. I have a map of all the routes, and want to colour the lines based on how punctual that route is (thinner lines for routes with a low EWT or a high percentage on time; thicker lines for routes with high EWT or low percentage on time). In preparation for this, I've combined both tables and zero'd out the non-existent value.
I thought I'd do something like log(EWT + PercentOnTime), but am realizing that might not give the value I'm wanting (especially because I ultimately need an inverse of one or the other, since low EWT is favourable and high % on time favourable).
Any idea how I'd do this? Thanks!
If you are combining/comparing the metrics in an even manner and the data is relatively linear then all you need to do is normalise them.
If you have the EWT expected ranges (eg. 0.00667 to 2.95). Then a 2 would be
(2 - 0.00667)/(2.95 - 0.00667) = 0.67723 but because EWT is the inverse semantically to punctuality we need to use 1-0.67723 = 0.32277.
If you do the same for the Punctuality percentage range:
Eg. 80%
(80 - 29.8)/(100-29.8) = 0.7169
You can compare these metrics because they are normalised (between 0-1 : multiply by 100 to get percentages) if you are assuming the underlying metrics (1-EWT) and on time percentage (OTP) are analogous.
Thus you can combine these into a single table. You will want to ignore all zero'd values as this is actually an indication you have no data at these points.
you'll have to use an if statement to say something like :
if OTP > 0 then ([OTP-29.8])/(100-29.8) else (1-(([EWT]-0.00667)/(2.95-0.00667)))
Hope this helps.

Resources