Multi-label regression summing to 1 - scikit-learn

I have data on renovation jobs. Each job requires at least one of 3 skills: carpentry, painting, and ceramics. For each row, my labels are the share of time each skill is required for the job (summing to 1).
Sample:
Job Description (free text)     | Location | Estimated Cost | Main material | Carpenter | Painter | Ceramics
Paint Smiths' House and Parquet | Chicago  | 4000           | Parquet       | 0.1       | 0.15    | 0.75
Total renovation and pool       | New York | 15700          | Metal         | 0.6       | 0.2     | 0.2
Pink decorations                | New York | 12000          | Wallpaper     | 0.7       | 0.05    | 0.25
I want to train the model to predict the shares of the skills.
I was thinking about scikit-learn's MultiOutputRegressor, but my main issue is constraining the predictions to be >= 0 and to sum to 1.
Is there an off-the-shelf solution?
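As far as I know, scikit-learn has no off-the-shelf estimator that enforces this constraint directly, but one common workaround is to post-process the predictions: clip negatives to zero and renormalize each row. A minimal sketch with synthetic stand-in data (the features, base estimator, and Dirichlet targets below are placeholders, not part of the original question):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic stand-in data: 100 jobs, 5 numeric features, 3 skill shares per row.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=100)  # rows are >= 0 and sum to 1

model = MultiOutputRegressor(RandomForestRegressor(random_state=0))
model.fit(X, y)

raw = model.predict(X)                    # unconstrained: rows need not sum to 1
clipped = np.clip(raw, 0, None)           # enforce non-negativity
shares = clipped / clipped.sum(axis=1, keepdims=True)  # renormalize each row to 1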

Related

Is there a reason why scikit-learn's classification report doesn't show you the number of predictions?

I use sklearn.metrics.classification_report often at work. One feature I had to implement myself was showing the number of predictions for each class in addition to the support.
For example (I omitted some details for brevity):
<Original>
         precision  recall  f1   support
class 0  0.5        0.5     0.5  100
class 1  0.5        0.5     0.5  200
class 2  0.5        0.5     0.5  300
<Mine>
         precision  recall  f1   support  preds
class 0  0.5        0.5     0.5  100      100
class 1  0.5        0.5     0.5  200      300
class 2  0.5        0.5     0.5  300      200
When performing error analysis I find it useful to compare the true label distribution to the predicted label distribution. However, since scikit-learn's function doesn't implement this, I made a simple change to it so that it does.
I'm curious why this isn't a built-in feature. Is there a reason the number of predictions is considered insignificant compared to the support?
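For reference, the same "preds" column can be produced without patching scikit-learn by counting the predictions separately; a minimal sketch with made-up labels:

import numpy as np
from sklearn.metrics import classification_report

y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

print(classification_report(y_true, y_pred))

# The extra "preds" column: how often each class was predicted, next to its support.
for c in np.unique(np.concatenate([y_true, y_pred])):
    print(f"class {c}: support={np.sum(y_true == c)}, preds={np.sum(y_pred == c)}")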

Random numbers with different probability of occurring

I'm simulating a problem in Excel to verify my theoretical result.
I have a total number of customers, say n = 80. 40% of this group is female and 70% is aged 40 to 60. On paper, assuming sex and age are independent, the expected number of females aged 40 to 60 is just 0.4 * 0.7 * 80.
However, I'm running a Monte Carlo simulation in Excel, so the sex and the age have to be random here. I can't figure out how to "simulate" the 40% and 70%. For example, if I use RAND() and take 1 for male and 0 for female, that would give 50% female, right?
Can I get help with this, please?
Convert RAND() to 1 or 0 in the appropriate ratio, i.e.:
=IF(RAND()>0.7,1,0)
This will give a value of 0 70% of the time and 1 the other 30% of the time.
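If it helps to sanity-check the ratios outside Excel, the same idea is just a Bernoulli draw per customer; a minimal Python sketch, assuming sex and age are independent:

import numpy as np

rng = np.random.default_rng(0)
n = 80
female = rng.random(n) < 0.4      # True with probability 0.4
age_40_60 = rng.random(n) < 0.7   # True with probability 0.7, drawn independently

# One simulated draw; the expectation is 0.4 * 0.7 * 80 = 22.4.
print(np.sum(female & age_40_60))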

Aggregation of predictions from ALS Model - SPARK Collaborative filtering

I'm using the ALS algorithm (implicitPrefs = True) in Spark 2.1.0 for collaborative filtering.
I am wondering whether it is possible to aggregate the prediction scores. Let's say for User1 there are the following predictions:
Item a: 0.4
Item b: 0.2
Item c: 0.1
Item d: 0.5
In my case, items belong to several groups. Let's say Items a and b belong to Group1 and Items c and d to Group2. Can I aggregate the predictions, for example by summing them up, in order to get:
Group1: 0.4 + 0.2 = 0.6
Group2: 0.5 + 0.1 = 0.6
P.S. Fitting the model on groups is not wanted, because the correlation between groups and items is not constant, and I don't want to refit the model every time that correlation changes. I can't figure out whether this aggregation of predictions is mathematical nonsense or not, and I'd be happy for any help.
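Mechanically, the aggregation itself is a simple join-and-sum, whatever its statistical merits; a hypothetical PySpark sketch using the numbers above and a made-up item-to-group mapping:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# The four predictions from the question, plus an invented item-to-group mapping.
preds = spark.createDataFrame(
    [("User1", "a", 0.4), ("User1", "b", 0.2), ("User1", "c", 0.1), ("User1", "d", 0.5)],
    ["user", "item", "prediction"])
groups = spark.createDataFrame(
    [("a", "Group1"), ("b", "Group1"), ("c", "Group2"), ("d", "Group2")],
    ["item", "group"])

# Sum the item-level scores per user and group.
(preds.join(groups, "item")
      .groupBy("user", "group")
      .agg(F.sum("prediction").alias("group_score"))
      .show())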

Is this the right way of using pd.get_dummies?

I have a dataframe that has both categorical and numerical variables, and I would like to use both in my regression model.
df_w_dummies = pd.get_dummies(
    df,
    columns=['Publisher', 'Platform', 'Genre',
             'Publisher_Country', 'Publisher_Continent'],
    drop_first=True)
features_dummies = df_w_dummies.loc[:, df_w_dummies.columns != 'NA_Sales']
target_dummies = df_w_dummies.loc[:, 'NA_Sales'].dropna()
I am also trying to avoid multicollinearity by setting drop_first=True.
Any advice/input would be appreciated!
This is not very pretty... but here is an example of what some of the data would look like.
Name | Platform | Publisher | Chartz_Score | User_Score | Critic_Score | Global_Sales | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Year_of_Release | Genre | Year | Total_Tweets | Publisher_Country | Publisher_Continent | Publisher_Lat | Publisher_Long
Super Mario Bros. | Nintendo | Nintendo EAD | NaN | 10.0 | NaN | 60.312336 | 89.184016 | 16.740672 | 53.505894 | 0.77 | 1985-10-18 | Platform | 1985.0 | NaN | MX | North America | 14.88102 | -92.27582
Wii Sports Resort | Nintendo | Nintendo EAD | 8.8 | 8.0 | 8.8 | 49.311030 | 47.873538 | 51.344296 | 25.849397 | 3.02 | 2009-07-26 | Sports | 2009.0 | 296.0 | GB | Europe | 14.88102 | -92.27582
It looks good, except that when you call .dropna() on the target variable it may no longer be the same length as the features. So if you want to drop NaN values from the data, you should do it at the beginning:
df = df.dropna(subset=['NA_Sales'])
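Putting that together, a sketch of the reordered pipeline on a toy stand-in DataFrame (the sample values below are invented for illustration):

import numpy as np
import pandas as pd

# Toy stand-in for the question's DataFrame.
df = pd.DataFrame({
    'Publisher': ['Nintendo EAD', 'Nintendo EAD', 'Sega'],
    'Platform': ['Nintendo', 'Nintendo', 'Genesis'],
    'Genre': ['Platform', 'Sports', 'Action'],
    'Publisher_Country': ['MX', 'GB', 'JP'],
    'Publisher_Continent': ['North America', 'Europe', 'Asia'],
    'NA_Sales': [0.77, 3.02, np.nan],
})

df = df.dropna(subset=['NA_Sales'])  # drop rows with a missing target first
df_w_dummies = pd.get_dummies(
    df,
    columns=['Publisher', 'Platform', 'Genre',
             'Publisher_Country', 'Publisher_Continent'],
    drop_first=True)
features_dummies = df_w_dummies.loc[:, df_w_dummies.columns != 'NA_Sales']
target_dummies = df_w_dummies['NA_Sales']  # now guaranteed to align with the features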

How to make a "trending" or "averaging" curve

I have a spreadsheet on which I've been tracking my weight for the last year.
I weigh myself nearly every day, and I can be off by as much as 5 pounds from day to day.
I would like to make a graph that shows the overall pattern of my weight loss/gain, but without all of the noise.
What are some formulas that I can use to calculate the overall trend?
Place the raw daily measurements in A1 through A365. In B2 enter:
=(A1+A2+A3)/3
and copy down. Column B will give you a smoother dataset for plotting and trending.
Once you have enough data points a "moving average" will help reduce the daily noise. Let's say you have 10 data points starting in A1:
120.0 119.0 114.1 116.7 112.0 108.7 107.9 104.6 108.9 111.7
In cell C2 you could use the formula =AVERAGE(A1:C1) and copy it to the end of your data set. The relative references will always average the last 3 measurements.
Now your data looks like:
120.0 119.0 114.1 116.7 112.0 108.7 107.9 104.6 108.9 111.7
117.7 116.6 114.3 112.5 109.5 107.1 107.1 108.4
So your second row has far less variation than the raw data.
You can also get fancy and make the number of measurements variable. If that number were stored in A5 (below your data) then the formula would be something like
=AVERAGE(OFFSET(C1,0,0,1,-MIN(COLUMN(),$A$5)))
The MIN ensures that you don't go past the beginning of the data set (with a 5-day moving average, you can't go back 5 days from the 4th day, etc.).
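For comparison, the same trailing 3-point average is a one-liner in pandas; a sketch using the ten sample values above:

import pandas as pd

weights = pd.Series([120.0, 119.0, 114.1, 116.7, 112.0,
                     108.7, 107.9, 104.6, 108.9, 111.7])

# Trailing 3-point average, matching the AVERAGE formula above;
# min_periods=1 keeps the first two days rather than dropping them.
smoothed = weights.rolling(window=3, min_periods=1).mean()
print(smoothed.round(1).tolist())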
