I am wondering if the statistical analysis I did makes any sense

I am helping with a retrospective study, and the data isn't very well organized. I am also new to statistics, so I took a stab at analyzing the data myself. We will be getting the help of a statistician later on, but we're not sure when yet.
We are looking at about 100 patients, and each patient was followed up for a variable length of time. Throughout each patient's follow-up, a variable number of observations were made at various timepoints. The observations included a set of lab values, anthropometric data, and demographic data. To conduct the analysis, we split the observations into time bins (e.g., 6-month follow-up, 1-year follow-up, etc.). Then, at each timepoint, we categorized each patient into one of 3 groups based on the outcome of interest. Also, for each timepoint, we selected one observation to represent each patient (since there could be many within the same time bin). For the analysis, we did the following:
1. ANOVA within each timepoint to compare the 3 outcome groups, looking at selected independent variables of interest.
2. For the same variables of interest, a repeated-measures ANOVA to see whether each changes over time.
3. Tests for correlations between the variables of interest mentioned above and the other independent variables.
4. A univariate binomial logistic regression for each independent variable to see if it predicts outcome. Since there were 3 groups, we did pairwise regressions (e.g., (outcome 1 + 2) vs. (outcome 3), and (outcome 1) vs. (outcome 2 + 3)).
5. A multivariable binomial logistic regression with forward selection, using only the significant independent variables retained from step 4.
6. If any independent variables of interest are retained in the multivariable regression, run it again testing for potential interactions with any variables they were correlated with in step 3. We did this by creating a new variable that is the product of the two variables and adding it to the regression.
What I'm trying to show with this analysis is that one key independent variable explains the difference in outcomes among the patients. So far the analysis seems to support this: that variable is one of the few retained at step 6, with a good significance value. Sorry if this is confusing to read.

Related

How to deal with non-triplicated data in a triplicated dataset

My problem is the following. I have a dataset with 10 variables and 8 samples. Each sample has been analysed in triplicate, so I have a dataset of 24 rows. However, some analyses (variables) were not performed in triplicate. Where an analysis was only done once, I have to introduce NAs to fill the blanks. Where an analysis was performed more than three times, I have to introduce new rows that add NAs for the analyses which were in fact done three times.
My ultimate goal is to apply ANOVA to this dataset. I have thought about repeating the value where I only have 1 analysis, and randomly eliminating values where I have more than 3 analyses, but I have the feeling this is not the most orthodox way to proceed.
I hope it is clear enough.
Thanks in advance!

Predict yearly harvest - Regression

Hey guys, I need your help. I want to predict rice production in India using a simple regression. For this I have a dataset with the yield and production data for the last 40 years. As explanatory variables I have daily data on rainfall, temperature, etc. Now to my problem. Obviously the number of observations of the y-variable (40) does not match the number of observations of the x-variables (about 15,000), so a regression is not feasible as-is. What is the best way to proceed?
Average the weather data over the year, i.e., a kind of downsampling of the x-variables. Of course, this means important information such as outliers is lost.
Assign the annual production value to each weather entry in the associated year. This would repeat the same y value 365 times, which doesn't sound reasonable to me either.
What other ideas do you guys have? If interested, I'll be happy to attach the datasets as well.
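One common refinement of the first option is to aggregate the daily weather into several yearly features rather than a single average, so extremes are not entirely lost. A sketch with simulated data (the column names, distributions, and feature choices are all assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical daily weather over 40 years.
dates = pd.date_range("1980-01-01", "2019-12-31", freq="D")
daily = pd.DataFrame({
    "rain": rng.gamma(2.0, 3.0, len(dates)),
    "temp": rng.normal(25, 5, len(dates)),
}, index=dates)

# One row per year, several features per variable, so that totals,
# averages, and extremes all survive the aggregation.
yearly = daily.groupby(daily.index.year).agg(
    rain_total=("rain", "sum"),
    rain_max=("rain", "max"),
    temp_mean=("temp", "mean"),
    temp_max=("temp", "max"),
)
print(yearly.shape)  # → (40, 4)
```

The 40 rows now line up with the 40 production values, and you can regress yield on whichever yearly features matter agronomically (e.g., growing-season rainfall rather than calendar-year rainfall).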

Small data anomaly detection algo

I have the following 3 cases of a numeric metric on a time series (t, t1, t2, etc. denote different hourly comparisons across periods).
If you look at the 3 graphs, t (the period of interest) clearly has a drop-off in image 1 but not so much in images 2 and 3. Assume this is some numeric metric (raw or derived); I want to create a system/algorithm that specifically catches case 1 but not cases 2 or 3, with t being the point of interest. While this makes sense visually and is very intuitive, I am trying to design a way to do this in Python using the dataframes shown in the picture.
Generally the problem is how do I detect when the time series is behaving very differently from any of the prior weeks.
Edit: When I say "different," what I really mean is: my metric trends together across periods t1 to t4, but if a series separates out of that envelope, that to me is an anomaly. In chart 1 you can see t trying to split away from the rest of the tn; that is an anomaly for me. In the other cases t stays within the bounds of the other time periods. Hope this helps.
With small data, the best approach is to come up with a good transformation into a simpler representation.
In this case I would try the following:
- Distance to the median along the time axis, then a summary of that (median, mean squared error, etc.)
- Median of the cross-correlation of the signals
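The first suggestion (distance to the median along the time axis, then a summary) could be sketched like this. The column layout and the drop-off are simulated, and the threshold of 2 is an arbitrary assumption to tune on your own data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Hypothetical hourly metric: columns are the comparison periods,
# rows are the 24 hours of the day; "t" is the period of interest.
df = pd.DataFrame({f"t{i}": rng.normal(100, 5, 24) for i in range(1, 5)})
df["t"] = df[["t1", "t2", "t3", "t4"]].median(axis=1)
df.loc[15:, "t"] -= 50  # simulate the case-1 drop-off in later hours

# Distance of "t" to the hour-wise median of the prior periods,
# scaled by their hour-wise spread, then summarised by the maximum.
baseline = df[["t1", "t2", "t3", "t4"]]
med = baseline.median(axis=1)
spread = baseline.max(axis=1) - baseline.min(axis=1)
score = ((df["t"] - med).abs() / spread.replace(0, np.nan)).max()

# Flag an anomaly when "t" leaves the envelope of the prior weeks
# by more than twice their spread at any hour.
print("anomaly" if score > 2 else "normal")
```

Scaling by the spread is what makes this catch case 1 but not cases 2 and 3: a deviation only counts if it escapes the envelope the other periods define, not merely if it is large in absolute terms.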

Regression Analysis for Net Promoter Score (NPS)

Issue: I am trying to run a regression analysis on NPS scores in Excel, but I am sure the outcome is not right. I am not sure I am using the right variable.
Background:
The Net Promoter Score (NPS) is an index ranging from -100 to 100 that measures the willingness of customers to recommend a company's products or services to others. Customers are asked: "On a scale of 0 to 10, how likely are you to recommend this company's product or service to a friend or a colleague?" Based on their rating, customers are classified into 3 categories: detractors (0-6), passives (7-8), and promoters (9-10), and NPS is calculated as the percentage of promoters minus the percentage of detractors.
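As a quick illustration, NPS can be computed from raw 0-10 ratings like this (the ratings below are made up; the standard cut-offs of 0-6 detractors, 7-8 passives, 9-10 promoters are assumed):

```python
import pandas as pd

# Hypothetical 0-10 ratings from a survey.
ratings = pd.Series([10, 9, 9, 8, 7, 6, 3, 10, 2, 9])

promoters = (ratings >= 9).mean() * 100   # share of 9-10 ratings
detractors = (ratings <= 6).mean() * 100  # share of 0-6 ratings
nps = promoters - detractors              # passives (7-8) drop out
print(round(nps))  # → 20
```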
What I am doing:
Using the 0-10 score as the dependent variable.
Using the time the service lasts (in days) as the independent variable.
When I use an Excel scatter plot to see the dispersion of the points, it basically shows 10 horizontal lines of dots above the X axis. (I think I am doing something wrong.)
Any idea what might be happening? Am I using the wrong dependent variable?
Thanks!
I am not sure what you mean by "Using time a service lasts (In days)."
Regression analysis fits the equation y = f(x), where the x's are the independent/explanatory (predictor) variables and y is the dependent/response variable.
If you are trying to relate NPS scores to service duration, you need to record the duration alongside each score.
In that case, the column with the "time" values would be the independent variable (to be put in the X range in Excel), and the column with the scores would be the dependent variable (to be put in the Y range).

Time-Series Process, Definition

The definition of a time series is as follows:
Let's say, for example, there are data (say, monthly sales) collected per month over 20 years. How many random variables are there? Is it 12?
The phrase "sequence of random variables" is very confusing. Can someone please explain?
A sequence is just an ordered, countable set. As such it can be indexed (i.e., mapped one-to-one and onto) by the integers. So a "sequence of FOO" is just a set of FOO such that you have FOO[1], FOO[2], FOO[3], ..., FOO[n] (and potentially extending through negative indices too).
A reasonable model for monthly sales is that each month's sales is a random variable, so in 20 years you have 240 variables. However, bear in mind that identifying each month as a separate variable is a modeling choice, not a mathematical choice. Depending on the problem to be solved, maybe each calendar month is a variable, so you have 12 variables and 20 observations for each variable. Or maybe all that matters is the annual sum, so effectively you have 20 variables. Whether or not any of these is appropriate, or none, can only be answered by considering the problem you are trying to solve -- there is no way to prove one modeling choice is better than another.
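The three modeling choices above can be made concrete with a small array sketch (the sales numbers are simulated):

```python
import numpy as np

rng = np.random.default_rng(4)
monthly = rng.normal(100, 10, 240)  # 20 years of monthly sales

# Choice 1: 240 variables, one observation of each.
as_240 = monthly                 # shape (240,)

# Choice 2: 12 calendar-month variables, 20 observations of each.
as_12 = monthly.reshape(20, 12)  # row = year, column = calendar month
month_means = as_12.mean(axis=0)

# Choice 3: 20 annual variables (here, annual sums).
annual = as_12.sum(axis=1)       # shape (20,)

print(as_240.shape, as_12.shape, annual.shape)
```

The same 240 numbers support all three views; which one is "the" set of random variables depends entirely on the question being asked.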
