Pandas apply kruskal-wallis to numeric columns - python-3.x

I have a dataframe of 27 columns (26 are numeric variables and the 27th column tells me which group each row is associated with). There are 7 groups in total I'm trying to apply the Kruskal-Wallis test to each variable, split by group, to determine if there is a significant difference or not.
I have tried:
df.groupby(['treatment']).apply(kruskal)
which throws an error "Need at least 2 groups two groups in stats.kruskal()".
My other attempts haven't produced an output either. I'll be doing similar analyses on a regular basis and with larger datasets. Can someone help me understand this issue and how to fix it?

With Scipy, you could do like that for each variable:
scipy.stats.kruskal(*[group["variable"].values for name, group in df.groupby("treatment")])

Related

How to split a Pandas dataframe into multiple csvs according to when the value of a column changes

So, I have a dataframe with 3D point cloud data (X,Y,Z,Color):
dataframe sample
Basically, I need to group the data according to the color column (which takes values of 0,0.5 and 1). However, I don't need an overall grouping (this is easy). I need it to create new dataframes every time the value changes. That is, I'd like a new dataframe for every set of rows that are followed by and preceded by 5 zeros (because single zeros are sometimes erroneously present in chunks of data that I'm interested in).
Basically, the zero values (black) are meaningless for me; I'm only interested in the 0.5 (red) and 1 values (green). What I want to accomplish is to segment the original point cloud into smaller clusters that I can then visualize. I hope this is clear. I can't seem to find answers to my question anywhere.
First of all, you should understand the for loop well. Python is a great programming language for using the code of any library inside functions and loops. Let's say you have a dataset and you want to navigate and control column a. First, let's start the loop with the "for i in dataset:" code. When you move to the bottom line, you have now specified the criteria you want with the code if "i[a] > 0.5:" in each for loop. Now if the value is greater than 0.5, you can write the necessary codes to create a new dataset with all the data of the row you are in. In terms of personal training, I did not write ready-made code.

ordering a list of names based upon related values, accounting for duplicate values, excel

I have a sheet that pulls metrics regarding employees from an internal source, I am trying to create an end of day report card and would like to display the names in ranking order based upon the metrics.
a small example for demonstration:
Columns A and B: An example of the kind of data in my sheet.
column D: I have used a variation of
=INDEX(A$2:A$6,MATCH(E2,B$2:B$6,0))
Column E:
=LARGE(B$2:B$6,1)
as you can see I am running into trouble with duplicate 'total' values creating an incorrect index number causing the first name that matches to be the result.
column G i have attempted to get around this by using:
=INDEX(A$16:A$20,MATCH(E19,B$16:B$20,0)+COUNTIF(B$16:B$20,E19)-1)
to attempt to account for the duplicates by adding a countif to the index number, of course, is an incorrect approach.
Not shown I also tried adding 0.5/'employee id number' to the total (0.5/an integer gives me a decimal between 0 and 0.5 allowing me to have each number be unique without causing any rounding problems to the displayed total. However, I think the decimal was ignored by the MATCH and it made no effect.
A weird problem has had me puzzled and I appreciate any help!
Have you tried something like this
=INDEX($A$2:$A$6,AGGREGATE(15,6,(ROW($A$2:$A$6)-ROW($A$2)+1)/($E2=$B$2:$B$6),COUNTIF($E$2:E2,E2)))
=INDEX($A$9:$A$13,AGGREGATE(15,6,(ROW($A$9:$A$13)-ROW($A$9)+1)/($E9=$B$9:$B$13),COUNTIF($E$9:E9,E9)))
=INDEX($A$16:$A$20,AGGREGATE(15,6,(ROW($A$16:$A$20)-ROW($A$16)+1)/($E16=$B$16:$B$20),COUNTIF($E$16:E16,E16)))

Deal with Ties when Using Index/Match

I'm currently pulling the top (5) number of numerical values from one sheet and inputting them into a different sheet. Each number is within its own column and there is a name matching that column, EX:
And so, having a tie is common with the data that I'm working with, so it nearly deprecates my formulas.
For getting the name:
=INDEX('Total Cases by Categories'!$B$18:$B$50, MATCH(LARGE('Total Cases by Categories'!$H$18:$H$50, A39),'Total Cases by Categories'!$H$18:$H$50, 0))
For getting the numerical value associated with the name:
=LARGE('Total Cases by Categories'!$H$18:$H, A39)
And so, when there are 2 people with the same numerical value associated within a category, then that person appears twice, I assume because of their position within the sheet.
So something like this happens:
So in the event of a tie, I would want to list both names that have the same amount of points instead of the first name that shows up with the duplicated value.
Any help would be appreciated!
Actually, LARGE will give you both of tied names. It's MATCH that can't look beyond the first. To the best of my knowledge there is no way around that (the difficult one being not to use MATCH). Therefore the solution is to have no ties.
This is achieved with helper columns that contain no identical numbers. This can be achieved by adding an insignificant decimal. Since you are dealing with integers, adding 0.1 would be insignificant for your purposes but 13.1 is different from 13.2. If you need to extract the "real" number from this use INT(13.2).
Using the row number to generate an insignificant decimal is popular for this purpose. In row 1 ROW()/10 will return 0.1. But in row 10 ROW()/10 will return 1.0 which isn't an insignificant number anymore. Therefore you have to work with ROW()/100 or an even larger divisor, depending upon how many rows you have. Try ROW()/10^6 - any decimal will do the tie-breaking job.
You may not like that using ROW() will list tied participants in the order in which they appear in the worksheet. The differentiating decimals can be created by any other means that doesn't create ties in itself.
Normally, the helper columns with the decimals added will be hidden. They contain a formula like =D23 + (ROW()/10000) which manages itself. You can then use that column for the MATCH function to list all participants in the order of LARGE using the helper column or the original. Just make sure that MATCH refers to the helper column.

identify when values in a column change in spotfire

I am trying to create a calculated column that flags/counts the changes in values across rows in another column, in Spotfire. Below is an example of the data types I'm looking at and the desired results.
My hope is that for each Location, and ordered along Time, I can identify when the values of "colors" changes and have running count so that each cluster of similar values between changes is given the same label (Cluster Desire 1) for each Location. It would be best if the running count of clusters can restart at each location but this is not crucial. Any help would be more than appreciated!
I thought of a way to do it, relying on one intermediate column (I used two just to make it a bit clearer).
First: the concatenation of values for each row within its Location: called [concatString]
Concatenate(Concatenate([Color]) over (Intersect([Location],AllPrevious([Time]))),', ')
Spotfire defaults to comma followed by space as a separator: I could not find a way of changing that in this kind of expression.
Then within each [concatString] I remove repeated values. The complication is that the last one did not have the comma+space, and I did not manage to make the regular expression I am using understand that. So my workaround was to add a final comma+space to [concatString]. Hence the extra Concatenate(..).
The formula for the column without repetitions, [consolidatString] is:
RXReplace([concatString],"(\\w+\,\\s)\\1+","$1","g")
Then what we have achieved is an individual value for each line we want to group. We can then simply rank [consolidatString] to achieve the desired column:
DenseRank([consolidatString],[Location])

Stata tab over entire dataset

In Stata is there any way to tabulate over the entire data set as opposed to just over one variable/column? This would give you the tabulation over all the columns.
Related - is there a way to find particular values in Stata if one does not know which column they occur in? The output would be which column and row they are located in or at least which column.
Stata does not use row and column terminology except with reference to matrices and vectors. It uses the terminology of observations and variables.
You could stack or reshape the entire dataset into one variable if and only if all variables are numeric or all are string. If that assumption is incorrect, then you would need to convert numeric variables to string, at least temporarily, before you could do that. I guess wildly that you are only interested in blocks of variables that are all either numeric or string.
When you say "tabulate" you may mean the tabulate command. That has limits on the number of rows and/or columns it can show that might bite, but with a small amount of work list could be used for a simple table with many more values.
tabm from tab_chi on SSC may be what you seek.
For searching across several variables, you could automate a loop.
I'd say that if this is a felt need, it is quite probable that you have the wrong data structure for at least some of what you want to do and should reshape. But further details might explode that.

Resources