I am trying to use StreamingLogisticRegressionWithSGD to build a CTR prediction model.
The documentation (linked here) mentions that numFeatures should be constant.
The problem that I am facing is this:
Since most of my variables are categorical, numFeatures should be the final number of features after encoding and parsing the categorical variables into labeled point format.
Suppose that for a categorical variable x1 I have 10 distinct values in the current window.
But in the next window some new values/items get added to x1 and the number of distinct values increases. How should I handle the numFeatures variable in this case, since it will now change?
Basically, my question is how I should handle new values of the categorical variables in a streaming model.
Thanks,
Kundan
You should fill the missing columns with zero values and discard any newly encountered values in each window, to make sure the number of features remains the same as when the model was trained.
Let's consider a column city having the values [NewYork, Paris, Tokyo] in the training set. This would result in three columns.
If during prediction you encounter the values [NewYork, Paris, Chicago, RioDeJaneiro], you should discard Chicago and RioDeJaneiro, then fill a zero value for the column corresponding to Tokyo, so that the result still has three columns (one for each of [NewYork, Paris, Tokyo]).
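A minimal sketch of this idea, shown with scikit-learn's OneHotEncoder rather than Spark (just to illustrate the shape of the encoding); the column values are taken from the example above:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Pin the vocabulary to the training-time categories so the feature count
# never changes between windows; unknown categories encode to all zeros.
enc = OneHotEncoder(categories=[["NewYork", "Paris", "Tokyo"]],
                    handle_unknown="ignore")
enc.fit(np.array([["NewYork"], ["Paris"], ["Tokyo"]]))

# "Chicago" and "RioDeJaneiro" were never seen in training, so their rows
# are all zeros and the output still has exactly three columns.
print(enc.transform(np.array([["NewYork"], ["Chicago"], ["RioDeJaneiro"]])).toarray())
# [[1. 0. 0.]
#  [0. 0. 0.]
#  [0. 0. 0.]]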
I'm trying to understand the example presented in Appendix C here
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6481149/
Equation C1 is clear to me.
But in Equation C2 they use the mean values.
Such mean values are clear to me in the case of categorical variables; for example, 1.548 is the mean value of the Sex variable (as shown in Table 3). Please correct me if I'm wrong.
But for numerical variables I don't understand which mean values they are using. For example, for the Age variable they use 3.768; if I understand correctly, that value is the log of the mean age, which should be log(44.15) = 1.64. Instead, the value used is 3.768.
Could anybody please clarify where this value comes from?
In statistics log often means the natural logarithm, sometimes denoted ln. The four values they take the logarithms of are:
Variable      Reported Mean   ln(Mean)   Reported
Age           44.15           3.788      3.768
BMI           25.61           3.243      3.230
BP Syst       138.6           4.932      4.913
Pulse Rate    75.61           4.326      4.311
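These are quick to verify; for instance, in Python (values copied from the table above):

import math

# ln of each reported mean; compare with the ln(Mean) column above
for name, mean in [("Age", 44.15), ("BMI", 25.61),
                   ("BP Syst", 138.6), ("Pulse Rate", 75.61)]:
    print(f"{name}: ln({mean}) = {math.log(mean):.3f}")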
The calculated values are not exactly equal to the reported values. But it looks close enough that this is probably the calculation they used. Without the data and/or code they used it's hard to say why the results are different. The study mentions excluding 130 participants because of ethics protections. So, perhaps one table was calculated using a slightly different group of participants than the other table?
I am working on a little data mining project (I am still a Data Science student, not a professional). Maybe you can help me to choose a proper model for my task.
So, let's say we have a table with three columns and around 4000 rows:
YEAR   COLOR    NAME
1900   Green    David
1901   Yellow   Sarah
1902   Green    ???
1902   Red      Sarah
…      …        …
2020   Purple   John
Any value for any field can be repeated in the dataset (also Year values).
In the first two columns we don't have missing values, but we only have around 20% of the Name values in the third column. The Name value depends somewhat on the first two columns (not a causal relation).
My goal is to extrapolate the available Name values to the whole table and get a range of occurrences for each name value (for example, in a boxplot).
I have imagined a process like the following, although I am not very sure it makes sense statistically (any objections and suggestions are appreciated):
For every unknown NAME value, the algorithm randomly chooses one of the already known NAME values. The odds of a particular NAME value being chosen depend on the variables YEAR and COLOR. For instance, if 'David' values tend to be correlated with low Year values AND with 'Green' or 'Purple' values for Color, the algorithm gives 'David' a higher probability of being chosen if the input values for Year and Color are "1900, Purple".
When the above process ends, the number of occurrences of each name is counted.
The above process is applied 30 times and the results for each name are displayed in a boxplot.
However, I don't know which model would be best to implement an idea like this. I have drawn the process in a simple Paint drawing:
Possible output for the task
Which do you think it could be a good approach to this task? I appreciate any help.
I think you have the process down; it's converting the data that may be the first hurdle.
I would look at using from sklearn.preprocessing import OrdinalEncoder to encode the data, converting it from categorical to numeric.
You could then use a random number generator to produce a number within the range defined by the encoding, which would randomly select a name.
Loop through this 30 times with a for loop to achieve the result.
It also looks like you will need to provide the ranking values for year and colour prior to building out your code. From there you would just provide bands, for example if year > 1985, etc., within your for loop to specify the names.
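A rough sketch of the overall loop, assuming a pandas DataFrame with the columns from the question; the tiny stand-in table, the decision tree used to get conditional probabilities, and all parameter values are illustrative assumptions, not part of the original suggestion:

import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({  # stand-in for the real ~4000-row table
    "YEAR":  [1900, 1901, 1902, 1902, 2020],
    "COLOR": ["Green", "Yellow", "Green", "Red", "Purple"],
    "NAME":  ["David", "Sarah", None, "Sarah", "John"],
})

rng = np.random.default_rng(42)
enc = OrdinalEncoder().fit(df[["YEAR", "COLOR"]])  # categorical -> numeric

def impute_once(df):
    # Fill each missing NAME by sampling known names with probabilities
    # conditioned on YEAR and COLOR (taken from the tree's leaf frequencies).
    known, unknown = df[df["NAME"].notna()], df[df["NAME"].isna()]
    # min_samples_leaf keeps leaves impure so predict_proba yields a real
    # distribution; tune this on the real data.
    clf = DecisionTreeClassifier(min_samples_leaf=5).fit(
        enc.transform(known[["YEAR", "COLOR"]]), known["NAME"])
    probs = clf.predict_proba(enc.transform(unknown[["YEAR", "COLOR"]]))
    filled = df["NAME"].copy()
    filled.loc[unknown.index] = [rng.choice(clf.classes_, p=p) for p in probs]
    return filled

# Repeat 30 times, count occurrences of each name per run, then boxplot
# the counts (requires matplotlib).
counts = pd.DataFrame([impute_once(df).value_counts() for _ in range(30)]).fillna(0)
counts.boxplot()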
I am working on a regression problem where I have to predict the weekly sales of 45 outlets of a department store. I have a temperature variable (values in Celsius), and as practice in feature engineering I came up with the idea of creating another temperature column, derived from the Celsius column but expressed in Fahrenheit. However, that will lead to multicollinearity between these two variables. Is there any way to treat the multicollinearity between the two variables, or should I go with another approach?
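For reference, the derived column is an exact linear function of the original, so the two are perfectly correlated; a quick illustration in Python (column names hypothetical):

import pandas as pd

temps = pd.DataFrame({"temp_C": [12.0, 18.5, 25.0, 31.2]})  # made-up readings
temps["temp_F"] = temps["temp_C"] * 9 / 5 + 32  # exact linear transformation
print(temps["temp_C"].corr(temps["temp_F"]))    # 1.0: perfect collinearity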
Data and description of variables
Picture 1 and sample unbalanced panel data
Picture 1 shows a balanced panel data set that I created from the unbalanced one provided as a sample in the same image, where I had multiple products (ID) for different years (YEAR). For each product, there was a different number of shops offering the given product (ID). So, as stated, this is a balanced set created by keeping the same years, same products (ID), and same shops (marked by the orange area in the sample unbalanced panel data). This is an important assumption that might affect the perception of the issue stated below. The following is a description of the table shown in Picture 1:
Years indicates the number of periods a given product (ID) lasts
Shop 1, Shop 2, Shop 3 indicate different prices for a given product (ID) from different firms
The minimum and second minimum values indicate which shops have the lowest and second lowest price for a given year and product (ID). These are needed to calculate the Price difference, which is (Second minimum value - Minimum value) / (Minimum value)
An example is given in row 5 (Year 01.01.1995 - ID 101), where the Price difference would be (3999 - 3790) / 3790 = 5.51% (in Picture 1)
Issue
In my balanced panel data (Picture 1), I want to run a fixed effects regression in Stata using the xtreg command, where the dependent variable is the Price difference and the number of shops selling a product gives the independent variables. This is so I can say how the Price difference as a dependent variable is affected when there is one shop selling, when there are two shops selling, and when there are three shops selling.
Another problem: is my assumption of creating a balanced panel valid at all? Is it correct to create a balanced panel from the unbalanced panel data, or must I use the unbalanced panel to create such a variable?
So my main issue is how to create such independent variables, which measure the number of shops offering a product. To clarify what I mean, I have included an example of a sample fixed effects regression that may explain the structure I am attempting to achieve, in Picture 2 below:
NOTE: In Picture 2, the expected cell mean on the right is the same as the Price difference in Picture 1, and is used as the dependent variable. These are regressed on the number of firms/shops as independent variables, and those independent variables are what I have an issue creating.
Picture 2
What I have tried
I have tried using dummy variables on shops, but they ended up getting dropped. The dataset provided in Picture 1 is a balanced data set, as mentioned, which is needed (I assume) to run a fixed effects regression on panel data.
End remark
I stated this question earlier in a much more imprecise manner, for which I apologise for any inconvenience. The problem, I think, might be that I have set it up wrong in Excel, hence the dummies are dropped, or something of that nature. It might also be that I have to use the unbalanced set in order to create this independent variable, so the problem might be that I am attempting to use a balanced set instead of the unbalanced one.
In your unbalanced sample (as we discussed in the comments, the balanced sample will not make sense), we first need to create a variable for the number of shops offering each ID. Let us say we have the same data as in the top portion of your Picture 1:
egen number_of_firms = rownonmiss(Shop*) // count the shops with a nonmissing price in each row
xtset ID year // to use xtreg, we must tell Stata the data are panel
xtreg Price_difference i.number_of_firms, fe // fe requests the fixed effects estimator asked for in the question
This xtreg command is the regression shown in your Picture 2.
If you want the number of firms variable to be formatted a bit more like Picture 2, you can do something like this:
qui levelsof number_of_firms, local(num) // collect the distinct shop counts into `num'
foreach n in `num' {
    local lab_def `lab_def' `n' "`n' Firms" // build label pairs like 2 "2 Firms"
}
label def num_firms `lab_def'
label values number_of_firms num_firms
label var number_of_firms "Number of Firms"
And then run the regression and the output will be formatted with the number of firms labels.
I have a data set containing three columns: the first column represents the trial number, the second column represents experimental values, and the third column represents the corresponding standard deviations.
With each experiment there is an increment in my experimental values. To get the incremental values, I take my first value as the reference, subtract this reference value from each subsequent value, and use the results to create a fourth column of these incremental values.
My problem begins right here: how do I create a new set of incremental standard deviations for the incremental experimental values I got? My apologies if the problem is not well defined, but hopefully someone will eventually be able to help me out. Many thanks!
Below is my data set:
Trial   Mean     SD      Incr Mean   Incr SD
1       45.311   4.668    0
2       56.682   2.234   11.371
3       62.197   2.266   16.886
4       70.550   4.751   25.239
5       80.528   4.412   35.217
6       87.453   4.542   42.142
7       89.979   2.185   44.668
8       96.859   3.476   51.548
To be clear, for other readers: your incremental mean is actually the difference between each later trial and trial 1.
Variances add directly when you subtract (or add) independent normal distributions. So you first want to convert each standard deviation to a variance by squaring it; then you can add the two variances, and take the square root of the sum to turn it back into a standard deviation. Note that when using this kind of Pythagorean combination, you are assuming that trial 1 is independent of the other trials, so, for example, you cannot have the same sample appear in both trials.
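In symbols, the incremental SD for trial i is sqrt(SD1^2 + SDi^2). A minimal check in Python, using the SD values from the table above:

import math

sd = [4.668, 2.234, 2.266, 4.751, 4.412, 4.542, 2.185, 3.476]  # SDs for trials 1..8
sd_ref = sd[0]  # trial 1 is the reference being subtracted
for trial, s in enumerate(sd[1:], start=2):
    incr_sd = math.sqrt(sd_ref**2 + s**2)  # Var(X_i - X_1) = Var(X_i) + Var(X_1)
    print(f"Trial {trial}: incremental SD = {incr_sd:.3f}")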
Logically this makes sense: the so-called "incremental SD" will always be greater than the individual SDs, since the uncertainty of both distributions contributes to the uncertainty of the difference.