I have a dataset:
I want to apply a clustering technique to create clusters of every 5 minutes of data, and to calculate the average of the last column, i.e. percentage congestion.
How can I create such 5-minute clusters? I want to use this analysis for decision making later on; the decision will be based on the calculated average percentage.
That is simple aggregation, not clustering.
Use a loop: read one record at a time, and every 5 minutes output the average and reinitialize the accumulators.
Or round every timestamp down to 5-minute granularity, then take the average over the now-identical keys. In SQL that would be a GROUP BY.
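A minimal sketch of the rounding approach in pandas (the column names timestamp and percentage_congestion are assumptions; adjust them to the actual dataset):

import pandas as pd

# Hypothetical data -- replace with the real dataset.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-01 10:01", "2023-01-01 10:03",
        "2023-01-01 10:07", "2023-01-01 10:09",
    ]),
    "percentage_congestion": [40.0, 60.0, 30.0, 50.0],
})

# Round every timestamp down to its 5-minute bucket, then average
# the now-identical keys (the GROUP BY approach described above).
df["bucket"] = df["timestamp"].dt.floor("5min")
avg_per_bucket = df.groupby("bucket")["percentage_congestion"].mean()
print(avg_per_bucket)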
I currently have some code that computes the overall time taken to run the count operation on a dataframe. I have another implementation which measures the time taken to run count on a sampled version of this dataframe.
sampled_df = df.sample(withReplacement=False, fraction=0.1)
sampled_df.count()
I then extrapolate the overall count from the sampled count. But I do not see an overall decrease in the time taken to calculate this sampled count compared to doing a count on the whole dataset; both seem to take around 40 seconds. Is there a reason this happens? Also, is there any improvement in terms of memory when using a sampled count over a count on the whole dataframe?
You can use countApprox. It lets you choose how long you're willing to wait for an approximate count at a given confidence level.
sample still needs to access all partitions to produce a uniform sample, so you aren't really saving any time by counting a sample.
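A minimal sketch (countApprox lives on the RDD API, so a DataFrame has to go through .rdd; the timeout is in milliseconds):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10_000_000)  # placeholder DataFrame

# Return whatever count is available after 1 second, at 95% confidence.
approx_count = df.rdd.countApprox(timeout=1000, confidence=0.95)
print(approx_count)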
In Azure TSI, we have 3 main query types: getSeries, getEvents, and getAggregate. I am trying to query all of the historical data for many series. I already found out that I must run these queries in a loop, one per series, which is terrible. However, I now need to be able to aggregate within the query using an interval. For example, if I am sending TSI data every 5 seconds and I want to get one month's worth of data, I don't need a data point every 5 seconds; one per day would do. If I use getAggregate with filter: null and interval: "P1D", it returns a ton of null values, one per day, and doesn't return any data. The same thing happens if I reduce the interval to 60M or even 1M. I then used getEvents, and it returns all the data points. I could write a function to aggregate those myself, but the query would be much slower, so I would prefer to do the aggregation in the query itself. Is there a way to achieve this?
Ideally, if there are 20 data points 5 seconds apart and nothing else for that day, it would average them into one data point for the day. Currently, getAggregate returns null values instead.
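For reference, a sketch of the raw Gen2 Aggregate Series request that getAggregate corresponds to, sent directly over REST (the environment FQDN, token, series ID, and the $event.value.Double expression are all placeholders or assumptions about the event schema):

import requests

url = ("https://YOUR_ENV.env.timeseries.azure.com"
       "/timeseries/query?api-version=2020-07-31")
headers = {"Authorization": "Bearer YOUR_TOKEN"}
body = {
    "aggregateSeries": {
        "timeSeriesId": ["my-series-id"],
        "searchSpan": {"from": "2021-01-01T00:00:00Z",
                       "to": "2021-02-01T00:00:00Z"},
        "interval": "P1D",  # one bucket per day
        "inlineVariables": {
            "AvgValue": {
                "kind": "numeric",
                "value": {"tsx": "$event.value.Double"},  # assumed schema
                "aggregation": {"tsx": "avg($value)"},
            }
        },
        "projectedVariables": ["AvgValue"],
    }
}
resp = requests.post(url, json=body, headers=headers)
print(resp.json())

Note that buckets containing no events still come back as null, which matches the behaviour described above.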
I have data on customer care executives that records the calls they attended, each with a start time and an end time. I need to find out whether a particular executive is busy or free in a particular period. Office hours are 10:00 to 17:00, so I have sliced the day into one-hour slices from 10:00 to 17:00.
The data I have looks like this:
Note:
The data given here is a subset of the original data: we have 20 executives, each with 5 to 10 rows of data. For simplicity, we use 3 executives with fewer than 5 rows each.
The start timings do not follow any ascending or descending order.
Please suggest formulas that work without any sorting or filtering of the given data.
Required: the result table should show whether each executive is busy or free in every hour. If he is on a call for even one minute of an hour, that entire hour should be marked busy.
The result should look like this:
The same file is attached here:
Thanks in advance!
You need to put an extra logical test in your OR function that catches calls which start before the interval starts and end after it ends. So in cell G31 your formula should read:
=IF(OR(COUNTIFS($A$3:$A$14,A31,$C$3:$C$14,">0",$D$3:$D$14,">=14:00",$D$3:$D$14,"<15:00"),COUNTIFS($A$3:$A$14,A31,$C$3:$C$14,">0",$E$3:$E$14,">=14:00",$E$3:$E$14,"<15:00"),COUNTIFS($A$3:$A$14,A31,$C$3:$C$14,">0",$D$3:$D$14,"<14:00",$E$3:$E$14,">=15:00")),"Busy","Free")
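The three COUNTIFS branches together implement a standard interval-overlap test: the call starts inside the hour slot, ends inside it, or spans it entirely. Purely as an illustration (not part of the original answer, and boundary handling may differ slightly from the Excel version), the same logic in Python:

from datetime import time

def overlaps(call_start: time, call_end: time,
             hour_start: time, hour_end: time) -> bool:
    # A call touches the slot iff it starts before the slot ends
    # and ends after the slot starts.
    return call_start < hour_end and call_end > hour_start

# Example: a 13:50-14:05 call makes the 14:00-15:00 slot "Busy".
status = "Busy" if overlaps(time(13, 50), time(14, 5),
                            time(14, 0), time(15, 0)) else "Free"
print(status)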
I'm wondering if there is a way to automatically select the amount of past data when calculating features.
For example, I might want to predict when a customer is going to make their next purchase, so it would be good to know the count of purchases or the average purchase price at different date cutoffs, e.g. purchases in the last 12 months, last 3 months, last 7 days, etc.
What is the best way to approach this with featuretools?
You can create a feature matrix that uses only a certain amount of historical data by using the training_window parameter in featuretools.dfs. When training_window is set, Featuretools will use only the historical data between the cutoff time and cutoff_time - training_window. Here's the example from the documentation:
import featuretools as ft

# es and cutoff_times are assumed to be defined as in the documentation example.
window_fm, window_features = ft.dfs(entityset=es,
                                    target_entity="customers",
                                    cutoff_time=cutoff_times,
                                    cutoff_time_in_index=True,
                                    training_window="1 hour")
When determining which data is valid for use, the training window will check if the time in the time_index column is within the training window.
I have a question about a statistical method that I can't find in my textbook. I want to compare data from two groups. For example, both groups have data for day 0, but one group has data for day 2 and the other for day 6. How can I analyse the outcome given both the data and the dates? I.e. I want to show that if the data taken on day XX are YY, it has an impact on the outcome.
Thanks in advance.
I'd use a repeated-measures ANOVA in this case. However, since you don't have a complete dataset, days X and Y would just be operationalized as the endpoint of your dependent variable. If you had measures for all days, I'd include all of them in the analysis in order to fully compare the two timelines. You could then also compare the days of interest directly using post-hoc tests (e.g. with Bonferroni correction).
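A minimal sketch of that design in Python, assuming the pingouin package (my choice, not mentioned in the answer) and following the suggestion above to recode each group's follow-up day as a common "endpoint" level:

import pandas as pd
import pingouin as pg

# Hypothetical long-format data: one row per subject per measurement.
# Group A was measured on days 0 and 2, group B on days 0 and 6;
# per the answer above, both follow-ups are recoded as "endpoint".
df = pd.DataFrame({
    "subject": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "group":   ["A"] * 6 + ["B"] * 6,
    "time":    ["baseline", "endpoint"] * 6,
    "outcome": [5.1, 6.3, 4.8, 6.0, 5.2, 6.1,
                5.0, 7.2, 5.3, 7.5, 4.9, 7.0],
})

# Mixed ANOVA: "time" is the within-subject factor, "group" the between-subject one.
aov = pg.mixed_anova(data=df, dv="outcome", within="time",
                     subject="subject", between="group")
print(aov)

# Bonferroni-corrected post-hoc comparisons, as suggested above.
posthoc = pg.pairwise_tests(data=df, dv="outcome", within="time",
                            subject="subject", between="group",
                            padjust="bonf")
print(posthoc)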