Which statistical method to choose?

I want to find out whether the level of education has an effect on the answer to the question: "Do you think the climate is changing?"
My level of education variable has 3 levels, and there are 5 different possible answers to the question ("probably changing", "definitely changing", etc.).
I am not sure which statistical method is appropriate here.

This could depend on how you record your "climate change opinion" variable. If you keep it as an ordinal categorical variable, you could use ordinal logistic regression.
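For instance, a minimal sketch in R using polr() from the MASS package (the data frame survey and its column names are hypothetical):

library(MASS)
# Ordinal logistic regression: the answer must be coded as an ordered factor.
# The level labels below are assumptions about your answer categories.
survey$opinion <- ordered(survey$opinion,
                          levels = c("definitely not changing", "probably not changing",
                                     "unsure", "probably changing", "definitely changing"))
fit <- polr(opinion ~ education, data = survey, Hess = TRUE)
summary(fit)  # coefficients and cut-points for the education effect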
Alternatively, you could keep both variables categorical and conduct a Chi-Square Test of Homogeneity.
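A minimal sketch of that test in R, on the same hypothetical survey data frame:

tab <- table(survey$education, survey$opinion)  # 3 x 5 contingency table
chisq.test(tab)                                 # chi-square test on the counts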
A path to more specific interpretations would be to assign a numerical value to this variable, such as definitely not changing = 1, maybe changing = 3, definitely changing = 5.
Your null hypothesis could be: "The mean climate change opinion is the same for each education level."
Alternative: "At least one education level has a different mean climate change opinion."
You can perform an F-test on your 3 education groups to reveal whether there is evidence of at least one group being significantly different from the others. From there you can use the Tukey HSD method to make comparisons between each pair of education levels; this is like performing a t-test between each pair of groups, with an adjustment for multiple comparisons.
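In R, that last path could look like this sketch, with opinion_score holding the hypothetical 1-5 coding:

fit <- aov(opinion_score ~ education, data = survey)
summary(fit)   # overall F-test across the 3 education groups
TukeyHSD(fit)  # pairwise comparisons with adjusted p-values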

What kind of Multi Criteria Decision Making method do I need for my problem?

I'm making an application to find the best products to buy based on several criteria; it could be called a decision support system.
Some examples of the criteria I use are:
location: the closer the shipping location is to my city, the better. I have already determined the weights for location: my own city gets a weight of 100, and the farther the shipping city is from mine, the smaller the weight.
the number of reviews a product has: more is better.
rating: the higher the rating, the better.
price: the cheaper, the better.
I was recommended a method called AHP. I have read about it, and although I think AHP is a good method, in my opinion it cannot entirely fulfil what I want, because it does not take the nominal values of the rating and price into account; it only weighs the importance of one criterion against another.
My questions are:
Given these criteria requirements, what MCDM method should I use?
Can AHP actually accommodate my needs? If yes, how? Is it by using Fuzzy-AHP? If so, I will start learning fuzzy logic and related topics.
Thanks for the question. AHP [1] is a method used in decision-making (DM) to methodically assign weights to the different criteria. In order to score, rank, and select the most desirable alternative, you need to complement AHP with another MCDM method that fulfils those tasks.
There are several methods to do that. TOPSIS and ELECTRE, for instance, are commonly used for that purpose [2, 3]. I leave you links to papers and tutorials on those methods so you can understand how they work; see the resources below.
Regarding the use of fuzzy logic in AHP: while there are several proposals for FAHP [4], Saaty himself, the creator of AHP, states that this is redundant [5-7], since the scale on which criteria are assessed for weighting in AHP already operates with a fuzzy logic.
However, if your criteria are based on qualitative data, and you are therefore dealing with uncertainty and potentially incomplete information, you can use fuzzy numbers in TOPSIS for those variables [8]. You can check the tutorials in the resources to understand how to apply those methods.
In recent years, some researchers have argued that fuzzy TOPSIS considers only the membership function (that is, how close an imprecise parameter is to reality) and ignores the non-membership and indeterminacy degrees [9, 10], i.e. how false or how undeterminable that parameter is. Neutrosophic theory was mainly pioneered by Smarandache [11].
In response, neutrosophic TOPSIS is nowadays being used to deal with uncertainty. I recommend reading the papers below to understand the concept.
In summary, I would personally recommend applying AHP together with fuzzy or neutrosophic TOPSIS to address your problem.
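To make the TOPSIS step concrete, here is a minimal sketch of the classical (crisp) algorithm in R; the decision matrix, weights, and criterion directions are made-up placeholders, and in practice the weights would come from your AHP step:

# Rows = products; columns = location weight, reviews, rating, price
X <- matrix(c( 80, 120,  60,    # location weight (benefit)
              150,  30, 200,    # number of reviews (benefit)
              4.5, 3.8, 4.9,    # rating (benefit)
               25,  18,  40),   # price (cost)
            nrow = 3)
w <- c(0.3, 0.2, 0.3, 0.2)              # criteria weights from AHP (sum to 1)
benefit <- c(TRUE, TRUE, TRUE, FALSE)   # price is a cost criterion

V <- sweep(X, 2, sqrt(colSums(X^2)), "/")  # vector-normalize each column
V <- sweep(V, 2, w, "*")                   # apply the weights

ideal <- ifelse(benefit, apply(V, 2, max), apply(V, 2, min))
anti  <- ifelse(benefit, apply(V, 2, min), apply(V, 2, max))

d_pos <- sqrt(rowSums(sweep(V, 2, ideal)^2))  # distance to the ideal point
d_neg <- sqrt(rowSums(sweep(V, 2, anti)^2))   # distance to the anti-ideal
closeness <- d_neg / (d_pos + d_neg)          # higher is better
order(closeness, decreasing = TRUE)           # ranking of the products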
Resources:
Manoj Mathew. Tutorial Youtube FAHP. Fuzzy Analytic Hierarchy Process (FAHP) - Using Geometric Mean. Retrieved from: https://www.youtube.com/watch?v=5k3Wz1AfVWs
Manoj Mathew. Tutorial Youtube FTOPSIS. Fuzzy TOPSIS. Retrieved from: https://www.youtube.com/watch?v=z188EQuWOGU
Manoj Mathew. TOPSIS - Technique for Order Preference by Similarity to Ideal Solution. Retrieved from: https://www.youtube.com/watch?v=kfcN7MuYVeI
MCDM in R: https://www.rdocumentation.org/packages/MCDA/versions/0.0.19
MCDM in JS: https://www.npmjs.com/package/electre-js
MCDM in Python: https://github.com/pyAHP/pyAHP
References:
[1] Saaty, R. W. (1987). The analytic hierarchy process—what it is and how it is used. Mathematical Modelling, 9(3-5), 167. doi:10.1016/0270-0255(87)90473-8
[2] Hwang, C. L., & Yoon, K. (1981). Methods for multiple attribute decision making. In Multiple Attribute Decision Making (pp. 58-191). Springer, Berlin, Heidelberg.
[3] Figueira, J., Mousseau, V., & Roy, B. (2005). ELECTRE methods. In Multiple Criteria Decision Analysis: State of the Art Surveys (pp. 133-153). Springer, New York, NY.
[4] Mardani, A., Nilashi, M., Zavadskas, E. K., Awang, S. R., Zare, H., & Jamal, N. M. (2018). Decision making methods based on fuzzy aggregation operators: Three decades review from 1986 to 2017. International Journal of Information Technology & Decision Making, 17(02), 391-466. doi:10.1142/s021962201830001x
[5] Saaty, T. L. (1986). Axiomatic foundation of the analytic hierarchy process. Management Science, 32(7), 841. doi:10.1287/mnsc.32.7.841
[6] Saaty, R. W. (1987). The analytic hierarchy process—what it is and how it is used. Mathematical Modelling, 9(3-5), 167. doi:10.1016/0270-0255(87)90473-8
[7] Aczél, J., & Saaty, T. L. (1983). Procedures for synthesizing ratio judgements. Journal of Mathematical Psychology, 27(1), 93-102. doi:10.1016/0022-2496(83)90028-7
[8] Wang, Y. M., & Elhag, T. M. (2006). Fuzzy TOPSIS method based on alpha level sets with an application to bridge risk assessment. Expert Systems with Applications, 31(2), 309-319.
[9] Zhang, Z., & Wu, C. (2014). A novel method for single-valued neutrosophic multi-criteria decision making with incomplete weight information. Neutrosophic Sets and Systems, 4, 35-49.
[10] Biswas, P., Pramanik, S., & Giri, B. C. (2018). Neutrosophic TOPSIS with group decision making. Studies in Fuzziness and Soft Computing, 543-585. doi:10.1007/978-3-030-00045-5_21
[11] Smarandache, F. (1998). A Unifying Field in Logics. Neutrosophy: Neutrosophic Probability, Set and Logic. American Research Press, Rehoboth.

How does SAS pick reference group when using CLASS statement?

I have a categorical variable that can take on about 200 different values. Is it good practice to create dummies for only specific values of this variable? I know that the other values are rarely used, and in a correlation analysis they are not significant in predicting Y. The example: there are about 200 different add-ons, the outcome variable is Sale (success vs. no success), and the model is a logistic regression. I want to see whether any of these add-ons seem to be more popular among customers and are therefore more likely to lead to a sale. Other IVs are: how much the customer already pays on a monthly basis, where the customer comes from, and which location the sales agent comes from.
How does SAS pick reference group when using CLASS statement?
By default, the first value in sort order is picked as the reference level. This can be changed with the ref= option:
class var(ref='B');
Is it good practice to create dummies for only specific characteristics of this variable?
That's a question better asked on Cross Validated.

How to deal with violation of proportional hazards assumption in Cox PH, R 3.1.3 survfit

I'm performing survival analysis in R using the 'survival' package and coxph. My goal is to compare survival between individuals with different chronic diseases. My data are structured like this:
id, time, event, disease, age.at.dx
1, 342, 0, A, 8247
2, 2684, 1, B, 3879
3, 7634, 1, A, 3847
where 'time' is the number of days from diagnosis to event, 'event' is 1 if the subject died, 0 if censored, 'disease' is a factor with 8 levels, and 'age.at.dx' is the age in days when the subject was first diagnosed. I am new to using survival analysis. Looking at the cox.zph output for a model like this:
combi.age <- coxph(Surv(time, event) ~ disease + age.at.dx, data = combi)
Two of the disease levels violate the PH assumption, having p-values < 0.05. Plotting the Schoenfeld residuals over time shows that for one disease the hazard falls steadily over time, while for the second the line is essentially flat, with a small upswing at the extreme left of the graph.
My question is how to deal with these disease levels? I'm aware from my reading that I should attempt to add a time interaction to the disease whose hazard drops steadily, but I'm unsure how to do this, given that most examples of coxph I've come across only compare two groups, whereas I am comparing 8. Also, can I safely ignore the assumption violation of the disease level with the high hazard at early time points?
I wonder whether this is an inappropriate way to structure my data, because it does not preclude a single individual appearing multiple times in the data - is this a problem?
Thanks for any help, please let me know if more information is needed to answer these questions.
I'd say you have a fairly good understanding of the data already and should present what you found. This sounds like a descriptive study rather than one where you will be presenting to the FDA with a request to honor your p-values. Since your audience will (or should) be expecting that the time-course of risk for different diseases will be heterogeneous, I'd think you can just describe these results and talk about the biological/medical reasons why the first "non-conformist" disease becomes less important with time and why the other non-conforming condition might become more potent over time. You have already done a more thorough analysis than most descriptive articles in the medical literature exhibit; I rarely see a description of the nature of non-proportionality.
The last question, about the data structure that "does not preclude a single individual appearing multiple times in the data", may require some more thorough discussion. A first approach would be to account for within-subject correlation across patient IDs with the cluster() function.
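A minimal sketch of those options in R, assuming the combi data frame from the question (the tt() form of the time interaction is one common choice, not the only one):

library(survival)

# Robust (sandwich) variance when the same individual can appear more than once:
combi.cl <- coxph(Surv(time, event) ~ disease + age.at.dx + cluster(id),
                  data = combi)

# Time interaction for a factor whose effect drifts over time: interact the
# disease dummy variables with log(t) via the tt() argument of coxph():
combi.tt <- coxph(Surv(time, event) ~ disease + age.at.dx + tt(disease),
                  data = combi,
                  tt = function(x, t, ...) model.matrix(~ x)[, -1] * log(t))

# Alternatively, stratify on the factor so each level gets its own baseline
# hazard; no PH assumption is then made for 'disease', but you lose its
# hazard ratios:
combi.st <- coxph(Surv(time, event) ~ strata(disease) + age.at.dx,
                  data = combi)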

Determine coefficients for some function

I have a task that is probably related to data analysis or even neural networks.
We have a data source from one of our partners, a job portal. The source values are arrays of different attributes related to a particular employee:
his/her gender,
age,
years of experience,
portfolio (number of projects done),
profession and specialization (web design, web programming, management, etc.),
many others (around 20-30 in total).
Every employee has his or her own (hourly) salary rate. So, mathematically, we have some function
F(attr1, attr2, attr3, ...) = A*attr1 + B*attr2 + C*attr3 + ...
with unknown coefficients. But we know the result of the function for specified arguments (let's say, we know that a male programmer with 20 years of experience and 10 works in his portfolio has a rate of $40 per hour).
So we have to find somehow these coefficients (A, B, C...), so we can predict the salary of any employee. This is the most important goal.
Another goal is to find out which arguments are most important - in other words, which of them cause significant changes to the result of the function. In the end we should get something like: "The most important attribute is years of experience; then portfolio; then age, etc."
There may be a situation when different professions vary too much from each other - for example, we simply may not be able to compare web designers with managers. In this case, we have to split them by groups and calculate these ratings for every group separately. But in the end we need to find 'shared' arguments that will be common for every group.
I'm thinking about neural networks because this is the kind of thing they may deal with, but I'm completely new to them and have no idea where to start.
I'd very much appreciate any help - which instruments to use, what algorithms, or even pseudo-code samples, etc.
Thank you very much.
That is the most basic example of (linear) regression. You are using a linear function to model your data and need to estimate the parameters.
Note that this is actually part of classical mathematical statistics; not data mining, but much, much older.
There are various methods. Given that there will likely be outliers, I would suggest using RANSAC.
As for the importance: once the attributes are on comparable scales, doesn't this boil down to "which is largest, A, B, or C"?
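A minimal sketch in R of both ideas, with a hypothetical employees data frame (the column names are invented) and a hand-rolled RANSAC loop around lm():

# Ordinary least squares: the fitted coefficients are your A, B, C, ...
fit <- lm(rate ~ experience + portfolio + age, data = employees)
summary(fit)  # estimates plus significance tests for each attribute

# Simple RANSAC: fit on random minimal subsets, keep the model with the
# most inliers, then refit on those inliers. 'threshold' is in $/hour.
ransac_lm <- function(formula, data, n_iter = 500, threshold = 5) {
  y <- all.vars(formula)[1]
  p <- length(all.vars(formula))  # response + predictors = no. of coefficients
  best_fit <- NULL
  best_count <- 0
  for (i in seq_len(n_iter)) {
    idx <- sample(nrow(data), p)  # random minimal subset
    cand <- lm(formula, data = data[idx, ])
    resid <- data[[y]] - predict(cand, newdata = data)
    inlier <- abs(resid) < threshold
    if (sum(inlier) > best_count) {
      best_count <- sum(inlier)
      best_fit <- lm(formula, data = data[inlier, ])
    }
  }
  best_fit
}
fit.r <- ransac_lm(rate ~ experience + portfolio + age, employees)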

Learning Optimal Parameters to Maximize a Reward

I have a set of examples, which are each annotated with feature data. The examples and features describe the settings of an experiment in an arbitrary domain (e.g. number-of-switches, number-of-days-performed, number-of-participants, etc.). Certain features are fixed (i.e. static), while others I can manually set (i.e. variable) in future experiments. Each example also has a "reward" feature, which is a continuous number bounded between 0 and 1, indicating the success of the experiment as determined by an expert.
Based on this example set, and given a set of static features for a future experiment, how would I determine the optimal value to use for a specific variable so as to maximise the reward?
Also, does this process have a formal name? I've done some research, and this sounds similar to regression analysis, but I'm still not sure if it's the same thing.
The process is called "design of experiments." There are various techniques that can be used depending on the number of parameters, and whether you are able to do computations between trials or if you have to pick all your treatments in advance.
full factorial - try each combination, the brute force method
fractional factorial - eliminate some of the combinations in a pattern and use regression to fill in the missing data
Plackett-Burman, response surface - more sophisticated methods, trading off statistical effort for experimental effort
...and many others. This is an active area of statistical research.
Once you've built a regression model from the data in your experiments, you can find an optimum by applying the usual numerical optimization techniques.
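For that last step, a minimal sketch in R, assuming a hypothetical trials data frame with one free variable x, two static features s1 and s2, and the observed reward:

# Fit a simple response-surface model: quadratic in the free variable x
surface <- lm(reward ~ poly(x, 2) + s1 + s2, data = trials)

# For given static features, find the x that maximizes the predicted reward
pred <- function(x) {
  predict(surface, newdata = data.frame(x = x, s1 = 0.5, s2 = 1.0))
}
opt <- optimize(pred, interval = range(trials$x), maximum = TRUE)
opt$maximum    # the optimal setting of x
opt$objective  # the predicted reward at that setting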
