normalization of categorical variable - python-3.x

I have a dataset that contains gender as Male and Female. I converted Male to 1 and Female to 0 using pandas, so the column now has data type int8. Now I want to normalize columns such as weight and height. What should be done with the gender column: should it be normalized or not? I am planning to use it in a linear regression.

I think you are mixing up normalization with standardization.
Normalization:
rescales your data into the range [0, 1].
Standardization:
rescales your data to have a mean of 0 and a standard deviation of 1.
Back to your question:
The values in your gender column already lie between 0 and 1, so that column is already "normalized". The real question is whether you can standardize it, and the answer is: yes, you could, but it doesn't really make sense. This was already discussed here: Should you ever standardise binary variables?
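The two definitions above can be sketched in a few lines. This is only an illustration using the standard library: the sample values and the helper names min_max_normalize and standardize are made up for the example, and plain lists stand in for the pandas columns from the question.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical sample data, not taken from the question.
height = [150.0, 160.0, 170.0, 180.0]
gender = [1, 0, 0, 1]  # already coded 0/1

height_norm = min_max_normalize(height)  # smallest -> 0.0, largest -> 1.0
height_std = standardize(height)         # mean 0, standard deviation 1
gender_norm = min_max_normalize(gender)  # unchanged apart from int -> float
```

Normalizing the 0/1 gender column returns the same values, which is why it can simply be left as it is.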

Related

Calculate risk using Cox model coefficients and mean values

I'm trying to understand the example presented in Appendix C here
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6481149/
Equation C1 is clear to me.
But in Equation C2 they use the mean values.
Such mean values are clear to me in the case of categorical variables; for example, 1.548 is the mean value of the Sex variable (as shown in Table 3). Please correct me if I'm wrong.
But for numerical variables I don't understand which mean values they are using. For example, for the Age variable they use 3.768. If I understand correctly, that value should be the log of the mean age, log(44.15)=1.64, but the value used is 3.768.
Could anybody please clarify where this value comes from?
In statistics log often means the natural logarithm, sometimes denoted ln. The four values they take the logarithms of are:
Variable     Reported Mean   ln(Mean)   Reported
Age          44.15           3.788      3.768
BMI          25.61           3.243      3.230
BP Syst      138.6           4.932      4.913
Pulse Rate   75.61           4.326      4.311
The calculated values are not exactly equal to the reported values. But it looks close enough that this is probably the calculation they used. Without the data and/or code they used it's hard to say why the results are different. The study mentions excluding 130 participants because of ethics protections. So, perhaps one table was calculated using a slightly different group of participants than the other table?
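The comparison can be reproduced with a short script. The means and reported values are copied from the table above; the list of ages at the end is hypothetical and only illustrates one further possible explanation, which is an assumption and not something the paper states: if the authors averaged the log-transformed values instead of taking the log of the average, the result would come out slightly lower, because the mean of logs is never larger than the log of the mean.

```python
import math
from statistics import mean

# (reported mean, reported log value) pairs copied from the table above.
table = {
    "Age": (44.15, 3.768),
    "BMI": (25.61, 3.230),
    "BP Syst": (138.6, 4.913),
    "Pulse Rate": (75.61, 4.311),
}

ln_of_mean = {name: math.log(m) for name, (m, _) in table.items()}
for name, (m, reported) in table.items():
    # math.log is the natural logarithm; math.log10 is base 10.
    print(f"{name}: ln={ln_of_mean[name]:.3f} "
          f"log10={math.log10(m):.3f} reported={reported}")

# Hypothetical ages, only to show that mean(ln x) <= ln(mean x).
ages = [25, 35, 45, 55, 65]
mean_of_ln = mean(math.log(a) for a in ages)
ln_of_mean_age = math.log(mean(ages))
```

The base-10 column shows where the 1.64 in the question comes from, while the natural-log column lands close to the reported values.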

Which is the best Data Mining model to extrapolate known values to missing values in a table? (General question)

I am working on a little data mining project (I am still a Data Science student, not a professional). Maybe you can help me to choose a proper model for my task.
So, let's say we have a table with three columns and around 4000 rows:
YEAR   COLOR    NAME
1900   Green    David
1901   Yellow   Sarah
1902   Green    ???
1902   Red      Sarah
…      …        …
2020   Purple   John
Any value for any field can be repeated in the dataset (also Year values).
In the first two columns we don't have missing values, but we only have around 20% of the Name values in the third column. The Name value depends somewhat on the first two columns (not a causal relation).
My goal is to extrapolate the available Name values to the whole table and get a range of occurrences for each name value (for example in a boxplot).
I have imagined a process like this, although I am not sure it makes sense statistically (any objections and suggestions are appreciated):
For every unknown NAME value, the algorithm randomly chooses one of the already known NAME values. The odds of a particular NAME value being chosen depend on the variables YEAR and COLOR. For instance, if 'David' values tend to be correlated with low Year values AND with 'Green' or 'Purple' values for Color, the algorithm gives 'David' a higher probability of being chosen if the input values for Year and Color are "1900, Purple".
When the above process ends, the number of occurrences for each name is counted.
The above process is applied 30 times and the results for each name are displayed in a boxplot.
However, I don't know which is the best model to implement an idea similar to this. I have drawn the process in a simple paint drawing:
Possible output for the task
What do you think would be a good approach to this task? I appreciate any help.
I think you have the process down; converting the data may be the first hurdle.
I would look at using OrdinalEncoder (from sklearn.preprocessing import OrdinalEncoder) to convert the data from categorical to numeric.
You could then use a random number generator to produce a number within the range defined by the encoding, which would randomly select a name.
Loop through this 30 times with a for loop to achieve the result.
It also looks like you will need to provide the ranking values for year and colour before building out your code. From there you would just provide bands within your for loop, for example if year > 1985, to specify the names.
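One way to sketch the repeated random fill-in described in the question is shown here. Note that it swaps the encoder-and-bands idea for a simpler weighting, which is an assumption made for illustration: each candidate name gets a weight of 1 plus the number of known rows with that name, the same colour, and a year within year_window. The function name impute_once, the window size, and the five sample rows are all hypothetical.

```python
import random
from collections import Counter

def impute_once(rows, rng, year_window=10):
    """Fill each missing NAME by weighted random choice; return name counts."""
    known = [r for r in rows if r[2] is not None]
    names = sorted({r[2] for r in known})
    counts = Counter(r[2] for r in known)
    for year, color, name in rows:
        if name is not None:
            continue
        # Weight = 1 (so every name stays possible) + matching known rows.
        weights = [
            1 + sum(1 for y, c, n in known
                    if n == cand and c == color and abs(y - year) <= year_window)
            for cand in names
        ]
        counts[rng.choices(names, weights=weights, k=1)[0]] += 1
    return counts

# Tiny sample in the shape of the question's table; None marks a missing name.
rows = [
    (1900, "Green", "David"),
    (1901, "Yellow", "Sarah"),
    (1902, "Green", None),
    (1902, "Red", "Sarah"),
    (2020, "Purple", "John"),
]

rng = random.Random(0)  # fixed seed so the runs are repeatable
runs = [impute_once(rows, rng) for _ in range(30)]
# runs[i][name] is the count of `name` in run i; the 30 counts per name
# are the values a boxplot would summarise.
```

Whether this weighting is statistically appropriate depends on the data; it is meant only to show the loop structure of sampling, counting, and repeating.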

Python based multi-label Classification

I have a data set like the one shown below, which in a real scenario will have between 10,000 and 1,000,000 rows.
There would be more columns, but the core problem revolves around these two fields.
Known Labels
I have known categories -'Apple', 'Blueberry','Orange','Lettuce'
Dataset
import pandas as pd

df = pd.DataFrame({
    'ROWID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Category': ['Apple', 'Blueberry', 'Orange', 'Lettuce', 'Fruit',
                 'Salad', 'xyz', 'Fruit', 'Leaf', 'Avocado'],
    'Details': ['Eat one a day ,doctors keep away', 'Like it in a muffin',
                'Tastes yummy', 'Like it with salmon', 'Glass of a juice',
                'Ceser dressing on lettuce', 'Nothing in my basket',
                'Like it in a muffin', 'I like it it with salami',
                'Comes from Mexico'],
})
Problem:
I have to create one or more metrics using groupby on Category.
When the Category column has an unknown value, I need to read the text from 'Details' and predict the best-suited label for the category.
For example:
Salad -> Lettuce
Fruit (Row #5) -> Orange
Fruit (Row #8) -> Blueberry
Leaf (Row #9) -> Lettuce
It is understood that some of the rows cannot be categorized.
Help Needed:
I am a newbie in data science algorithms and am looking for some guidance to identify the right model to solve the problem.
Use Naive Bayes on the Details column. Before that, do a simple filter on the Category column to set aside the rows that already have a known category value; those need no prediction and can serve as the labeled training examples.
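A minimal sketch of that idea follows, written without third-party libraries so it stays self-contained. It is a hand-rolled multinomial Naive Bayes with add-one smoothing, not the answerer's code; the helper names tokenize, train and predict are hypothetical, and the rows are the ten from the question.

```python
import math
import re
from collections import Counter, defaultdict

KNOWN = {"Apple", "Blueberry", "Orange", "Lettuce"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def train(rows):
    """rows: (category, details) pairs whose category is a known label."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    for label, text in rows:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counter in word_counts.values() for w in counter}
    return doc_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log posterior for `text`."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for w in tokenize(text):
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

categories = ["Apple", "Blueberry", "Orange", "Lettuce", "Fruit",
              "Salad", "xyz", "Fruit", "Leaf", "Avocado"]
details = ["Eat one a day ,doctors keep away", "Like it in a muffin",
           "Tastes yummy", "Like it with salmon", "Glass of a juice",
           "Ceser dressing on lettuce", "Nothing in my basket",
           "Like it in a muffin", "I like it it with salami",
           "Comes from Mexico"]

# Rows with a known category are the training set; the rest get predicted.
model = train([(c, d) for c, d in zip(categories, details) if c in KNOWN])
row8_prediction = predict(model, details[7])
```

With only four training rows, most predictions carry little meaning (a text sharing no words with the training data falls back to the class priors), so in practice a confidence threshold would be needed to leave some rows uncategorized, as the question anticipates.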

Normalization of data in Excel from 0 to 1

For my class we are learning normalization of data. My professor gave an example of how to do it with the data below (first table below), but he didn't really show how to get the numbers of the second table in Excel. Can someone show me how he got those numbers? The numbers in the second table are from 0 to 1, where 0 is the worst and 1 is the best.
https://i.stack.imgur.com/cBs1c.png
DRAW BEST TO WORST CHART ON BOARD – WITH LINEARY NORMALIZATION BETWEEN
https://i.stack.imgur.com/mnmoP.png
To normalize, you only need to divide each value by the greatest value in its column. Thus the biggest value will be equal to one.
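As a rough illustration with a made-up column (the screenshots are not available here, so the numbers are hypothetical), dividing by the maximum maps the best value to 1 but only maps the worst value to 0 if the column minimum is 0. If the second table really runs from exactly 0 to exactly 1, a min-max rescaling may be what was used instead; that is a guess. In Excel, with the data assumed to sit in A2:A5, the two variants would be =A2/MAX(A$2:A$5) and =(A2-MIN(A$2:A$5))/(MAX(A$2:A$5)-MIN(A$2:A$5)).

```python
def scale_by_max(values):
    """Divide every value by the column maximum."""
    top = max(values)
    return [v / top for v in values]

def scale_min_max(values):
    """Map the column minimum to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scores = [20, 50, 80, 100]  # hypothetical column

by_max = scale_by_max(scores)    # [0.2, 0.5, 0.8, 1.0]
min_max = scale_min_max(scores)  # [0.0, 0.375, 0.75, 1.0]
```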

Compute statistical significance with Excel

I have 2 columns and multiple rows of data in Excel. Each column represents an algorithm, and the values in the rows are the results of these algorithms with different parameters. I want to run a statistical significance test on these two algorithms with Excel. Can anyone suggest a function?
As a result, it will be nice to state something like "Algorithm A performs 8% better than Algorithm B with .9 probability (or 95% confidence interval)"
The wikipedia article explains accurately what I need:
http://en.wikipedia.org/wiki/Statistical_significance
It seems like a very easy task but I failed to find a scientific measurement function.
Any advice on a built-in Excel function, or function snippets, would be appreciated.
Thanks..
Edit:
After tharkun's comments, I realized I should clarify some points:
The results are merely real numbers between 1 and 100 (they are percentage values). As each row represents a different parameter, the values in a row represent each algorithm's result for this parameter. The results do not depend on each other.
When I take the average of all values for Algorithm A and Algorithm B, I see that the mean of all results that Algorithm A produced is 10% higher than Algorithm B's. But I don't know if this is statistically significant or not. In other words, maybe for one parameter Algorithm A scored 100 percent higher than Algorithm B and for the rest Algorithm B has higher scores, but just because of this one result the difference in average is 10%.
And I want to do this calculation using just Excel.
Thanks for the clarification. In that case you want an independent-samples t-test, meaning you want to compare the means of two independent data sets.
Excel has a function TTEST (named T.TEST in Excel 2010 and later); that's what you need.
For your example you should probably use two tails and type 2.
The formula outputs a probability value known as the probability of alpha error (the p-value). This is the probability of the error you would make if you assumed the two datasets are different when they aren't. The lower the alpha error probability, the stronger the evidence that your sets are different.
You should only accept the difference between the two datasets if the value is lower than 0.01 (1%), or for critical outcomes even 0.001 or lower. You should also know that the t-test needs at least around 30 values per dataset to be reliable enough, and that the type 2 test assumes equal variances of the two datasets. If the variances are not equal, you should use the type 3 test.
http://depts.alverno.edu/nsmt/stats.htm
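To make the difference between type 2 and type 3 concrete, here is a small sketch of the two t statistics computed by hand with the standard library. It stops at the statistic and degrees of freedom; turning those into the probability that TTEST returns requires the t distribution, which is not part of Python's standard library. The function names and the sample numbers are hypothetical.

```python
import math
from statistics import mean, variance

def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances (Excel type 2)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2

def welch_t(a, b):
    """Two-sample t statistic without that assumption (Excel type 3)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical results for the two algorithms.
algo_a = [1, 2, 3, 4, 5]
algo_b = [2, 3, 4, 5, 6]

t2, df2 = pooled_t(algo_a, algo_b)
t3, df3 = welch_t(algo_a, algo_b)
```

When both samples have the same size and the same variance, as in this toy data, the two variants agree; they diverge as the variances or sample sizes differ.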

Resources