Mobile Data Analysis in Excel

I collected mobile data consumption using the Data Usage feature on Android, spread over the days of the week (Monday to Sunday). I want to analyse two apps, Facebook and Messenger, to check whether there was a significant difference in data usage depending on the day of the week. Should I be using a t-test or some other method? What's the best method that can be used in Excel to analyse this?
P.S. Help will be much appreciated. Thanks.

If you believe your data is normally distributed then, statistically speaking, it sounds like you want a t-test. You don't know the population's standard deviation, so that would be my choice. However, the data should be collected over at least 30 weeks if you want the figures for each weekday to be reasonably accurate.
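For reference, Excel's =T.TEST(range1, range2, 2, 3) runs a two-tailed, two-sample t-test assuming unequal variances. Below is a minimal cross-check of the same test in Python with SciPy; the daily megabyte figures are made up purely for illustration, not real measurements.

```python
# Welch's two-sample t-test (does not assume equal variances),
# equivalent to Excel's =T.TEST(range1, range2, 2, 3).
# The MB values below are made-up sample data.
from scipy import stats

facebook = [120, 95, 130, 110, 140, 180, 160]   # Mon..Sun, MB
messenger = [40, 35, 55, 45, 50, 70, 65]        # Mon..Sun, MB

t_stat, p_value = stats.ttest_ind(facebook, messenger, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

A p-value below 0.05 would suggest the two apps' usage differs significantly, but with only one week of data the test has very little power, which is exactly why collecting many weeks matters.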

Related

statistical test for an effect across groups, data are nested, not of normal distribution

What is the best statistical test for an effect across groups when the data are nested but may not be normally distributed? I get a highly significant effect using the Kruskal-Wallis test, but it does not account for the fact that the data points come from several locations, each contributing for several years, and that in every year the data were pooled into age groups.
I think you can categorize the data by year and restructure it so that it is no longer nested, making it easier to process. I agree that the Kruskal-Wallis test is a good choice for testing the cross-group effect.
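As a sketch of what that looks like in practice, here is a minimal Kruskal-Wallis test in Python with the data grouped by year. The column names and values are hypothetical.

```python
# Kruskal-Wallis test across year groups; "year" and "value" are
# hypothetical column names standing in for the restructured data.
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "year":  [2018, 2018, 2018, 2019, 2019, 2019, 2020, 2020, 2020],
    "value": [1.2, 1.5, 1.1, 2.3, 2.1, 2.6, 0.9, 1.0, 0.8],
})

# One sample of values per year, then the non-parametric test
samples = [grp["value"].values for _, grp in df.groupby("year")]
h_stat, p_value = stats.kruskal(*samples)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```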

How to get unsampled data from Google Analytics API in a specific day

I am building a package that uses the Google Analytics API for Python.
But in several cases, when I have multiple dimensions, the extraction by day is sampled.
I know that if I use sampling_level = LARGE the API will use a more accurate sample.
But does anybody know a way to reduce a request so that you can extract one day without sampling?
Grateful for any help.
Setting sampling_level to LARGE is the only control we have over the amount of sampling, but as you already know, it doesn't prevent it.
The only way to reduce the chances of sampling is to request less data. A reduced number of dimensions and metrics, as well as a shorter date range, are the best ways to ensure that you don't get sampled data.
This is probably not the answer you want to hear, but one way of getting unsampled data from Google Analytics is to use unsampled reports. However, this requires that you sign up for Google Marketing Platform. With it you can create an unsampled report request using the API or the UI.
There is also a way to export the data to BigQuery, but you lose the analysis that Google provides and will have to do it yourself. This too requires that you sign up for Google Marketing Platform.
There are several tactics for building unsampled reports; the most popular is splitting your report into shorter time ranges, down to hours. Mark Edmondson did great work on anti-sampling in his R package, so you might find it useful. You may start with this blog post: https://code.markedmondson.me/anti-sampling-google-analytics-api/
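As a sketch of the same time-splitting tactic in Python: issue one request per day instead of one request for the whole range. fetch_report() below is a hypothetical placeholder for whatever GA client call you already use (e.g. a v4 reports().batchGet wrapper); only the date-splitting logic is the point.

```python
# Anti-sampling by date splitting: many one-day requests instead of
# one big request. fetch_report() is a hypothetical placeholder.
from datetime import date, timedelta

def fetch_report(start_date: str, end_date: str) -> list:
    """Placeholder: call your GA client here and return its rows."""
    return []

def daily_ranges(start: date, end: date):
    """Yield one (start, end) pair per day, covering start..end inclusive."""
    day = start
    while day <= end:
        yield day, day
        day += timedelta(days=1)

def fetch_unsampled(start: date, end: date) -> list:
    rows = []
    for day_start, day_end in daily_ranges(start, end):
        # Each one-day request covers far less data, so it is much less
        # likely to trip Google Analytics' sampling threshold.
        rows.extend(fetch_report(day_start.isoformat(), day_end.isoformat()))
    return rows

rows = fetch_unsampled(date(2019, 1, 1), date(2019, 1, 31))
```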

Node.js: Not getting the details using dataSources with datasets

I tried to get the step count by date. I took the data from Google Fit using this
API:
https://www.googleapis.com/fitness/v1/users/me/dataSources/derived:com.google.step_count.delta:com.google.android.gms:estimated_steps/datasets/1457548200000000000-1457631000000000000&token=1111111111
I get only a limited step count, not all the steps on that date. Why does this kind of problem occur when getting the Google Fit data?
Can anyone suggest a better way to get all the data from Google Fit?
Using the derived:com.google.step_count.delta:com.google.android.gms:estimated_steps data source will give you varying results depending on the scenario, mainly because of the sensors used. Maybe this is the reason why you think you have limited results.
estimated_steps also takes into account activity, and estimates steps when there are none. For instance, assume the user walked for 30 minutes, but the hardware step counter only recorded 10 steps. We know that number is inaccurate, so instead we estimate, say, 3000 steps during that time.
This was noted and discussed in this SO post.
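If the goal is daily step totals, the Fit REST API also has an aggregate endpoint that merges the sensor data server-side and buckets it per day. A minimal sketch in Python follows; the access token is a placeholder, and the millisecond timestamps correspond to the nanosecond range in the question's URL.

```python
# Daily step totals via the Fit REST aggregate endpoint.
# ACCESS_TOKEN is a placeholder for a real OAuth2 token.
import requests

ACCESS_TOKEN = "ya29...."  # placeholder
body = {
    "aggregateBy": [{
        "dataTypeName": "com.google.step_count.delta",
        "dataSourceId": "derived:com.google.step_count.delta:"
                        "com.google.android.gms:estimated_steps",
    }],
    "bucketByTime": {"durationMillis": 86400000},  # one bucket per day
    "startTimeMillis": "1457548200000",            # question's range, in ms
    "endTimeMillis": "1457631000000",
}
resp = requests.post(
    "https://www.googleapis.com/fitness/v1/users/me/dataset:aggregate",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json=body,
)
resp.raise_for_status()

# Each bucket holds one day's merged points; sum the step deltas.
for bucket in resp.json().get("bucket", []):
    for ds in bucket["dataset"]:
        for point in ds.get("point", []):
            print(point["value"][0]["intVal"], "steps")
```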

MANOVA or regression, or...? How to do it in Excel

I need to analyse one variable: the number of days per project. I have around 30,000 projects with the number of days for each.
The projects are grouped by category (there are 10 categories), scale (A/B/C), region (EU or Asia), month (12, within one year), and also one 0-1 factor.
I need to analyse this whole database to find out which factors are important for the number of days and how they influence it.
I think linear regression is one way to do it, but I don't know how to use it (I'm going to work in Excel).
I'm not sure whether MANOVA is the right method, or how to conduct the analysis using it.
Are these methods correct, and is there a guide on how to run them in Excel? Are there any more useful methods?
Since there is only one dependent variable, I think it would be ANOVA rather than MANOVA. Excel's Analysis ToolPak has a nice ANOVA utility. Here is a nice tutorial on using it.
Over the years there have been repeated criticisms of the numerical stability of Excel's statistical computations. The worst of the problems seem to have been largely fixed by Excel 2010, but many statisticians still regard Excel with suspicion. With 30,000 data points split between so many categories, I really don't think Excel would be a good tool. It might be fine for a preliminary analysis, but with data on that scale you should consider using R.
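If you do move beyond Excel, R is a natural choice; an equivalent sketch in Python with pandas and statsmodels is below. The file and column names are hypothetical; C() tells the formula API to treat a column as categorical, matching the grouping factors described in the question.

```python
# Linear regression with categorical predictors, an alternative to
# Excel for 30,000 rows. "projects.csv" and the column names are
# hypothetical stand-ins for the real data.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("projects.csv")  # one row per project

model = smf.ols(
    "days ~ C(category) + C(scale) + C(region) + C(month) + flag",
    data=df,
).fit()
print(model.summary())  # coefficients and p-values show each factor's influence
```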

How to implement different aggregations / rollups on time-series data in Cassandra

I have a situation where I will be collecting many time-series metrics (electricity used, hours used, hours idle) from operating equipment in a manufacturing plant. I need to create many different rollup numbers on individual and grouped assets. For example, I need the min, max, and average electricity used over 1, 5, 10, and 30 days for a given machine, and the same types of metrics for different groups of machines. Many of the calculated values are derived from the raw values retrieved from the assets.
What is the best approach for calculating these values within a Cassandra environment?
Do I need to create 'batch jobs' that execute the calculations?
There seem to be some built-in data types (counter) in Cassandra, but also some issues with them (judging from comments on Stack Overflow).
Has anyone integrated Cassandra with Twitter Storm or something similar to constantly update the counters?
Thanks
There's an open-source project called Blueflood that does exactly this. You could likely use it out of the box for your use case, or fork the repo and modify it as necessary.
Documentation and homepage: http://blueflood.io/
Source-code: https://github.com/rackerlabs/blueflood
IRC: #blueflood on Freenode
(Disclaimer: I am a contributor to the project)
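For comparison, here is a minimal sketch of the 'batch job' approach with the Python cassandra-driver. The keyspace, table, and column names are hypothetical, and the schema assumes ts is a clustering column; Blueflood handles this kind of rollup for you out of the box.

```python
# A hypothetical rollup batch job: read raw readings for a window,
# compute min/max/avg in the client, write to a rollup table.
# Keyspace/table/column names are made up for illustration.
from datetime import datetime, timedelta, timezone
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("metrics")

def rollup(machine_id: str, days: int) -> None:
    since = datetime.now(timezone.utc) - timedelta(days=days)
    # Assumes PRIMARY KEY (machine_id, ts) so the range scan is valid.
    rows = session.execute(
        "SELECT value FROM raw_metrics WHERE machine_id = %s AND ts >= %s",
        (machine_id, since),
    )
    values = [r.value for r in rows]
    if not values:
        return
    session.execute(
        "INSERT INTO rollups (machine_id, window_days, min_value, max_value, avg_value) "
        "VALUES (%s, %s, %s, %s, %s)",
        (machine_id, days, min(values), max(values), sum(values) / len(values)),
    )

# One rollup per required window, run on a schedule (cron, Storm, etc.)
for window in (1, 5, 10, 30):
    rollup("machine-42", window)
```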
