I am currently working with self-created processes in Brightway2 and I get NaN scores when I try to run an LCA. The processes are composed of ecoinvent 3.6 activities. Each ecoinvent activity has a correct score on its own, but when I run the global process the right score is not calculated. After looking into the LCA code, it seems that the linear solver for the technosphere matrix and the demand array returns an array full of NaN. Any ideas on how to fix this? Thank you!
If you are seeing NaN values, you aren't presenting a linear system that can be solved. My guess is that you are missing a production exchange for your self-created process. Without more details we can't see how the process was created, but if it is missing a production exchange (the value commonly given on the diagonal), the technosphere matrix will be singular.
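To make the singular-matrix point concrete, here is a minimal NumPy sketch. It does not use the Brightway2 API; the 2x2 matrix, its numbers, and the helper names `is_singular` and `activities_without_production` are hypothetical and only illustrate what a missing diagonal entry does to the system.

```python
import numpy as np

# Toy technosphere matrix (hypothetical numbers): rows are products,
# columns are activities. Diagonal = production, negative = consumption.
complete = np.array([
    [1.0, -0.5],   # product 0: produced by activity 0, consumed by activity 1
    [0.0,  1.0],   # product 1: produced by activity 1
])

# Same system, but activity 1 has lost its production exchange.
missing_production = complete.copy()
missing_production[1, 1] = 0.0

demand = np.array([0.0, 1.0])  # one unit of product 1


def is_singular(matrix):
    """True when the matrix is rank deficient, so no unique solution exists."""
    return np.linalg.matrix_rank(matrix) < matrix.shape[0]


def activities_without_production(matrix):
    """Column indices whose diagonal entry is zero."""
    return [int(j) for j in np.flatnonzero(np.diag(matrix) == 0)]


# The complete system solves normally.
supply = np.linalg.solve(complete, demand)

print(is_singular(complete), is_singular(missing_production))
print(activities_without_production(missing_production))
```

A zero on the diagonal is only a rough check, since a production exchange is commonly, but not necessarily, on the diagonal. The rank test is the more general one.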
I have the following task:
I have a database of output measurements of a solar system and I am supposed to detect errors, or rather abnormalities. The following link shows example data. What seemingly happened is that the sensors didn't provide any measurements at 11:00, and the same happened from 16:00 to 18:00. Then at 12:00 and at 19:00 all the prior missing measurements were summed up.
I am supposed to create a system that automatically detects these types of abnormalities and my first thought is to use classification (maybe decision trees or naive bayes classifier) to predict if a row is an error or not. My question is if this is a reasonable method or is it completely wrong to use classification here (and what are other methods to solve this problem)?
Thanks in advance
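Because the pattern described above is so regular (a run of missing readings followed by one reading that carries the accumulated total), one alternative to a classifier is a plain rule. The sketch below is only an illustration under assumptions: the hourly values are made up, missing readings are assumed to be stored as `None`, and `flag_gap_and_catchup` is a hypothetical helper name.

```python
def flag_gap_and_catchup(readings):
    """Return (missing, catchup) index lists.

    missing: positions with no measurement.
    catchup: the first real reading after each gap, which is suspected
             to contain the summed-up missing values.
    """
    missing, catchup = [], []
    in_gap = False
    for i, value in enumerate(readings):
        if value is None:
            missing.append(i)
            in_gap = True
        elif in_gap:
            catchup.append(i)
            in_gap = False
    return missing, catchup


# Hypothetical hourly output from 09:00 (index 0) to 19:00 (index 10).
# 11:00 is missing, and so are 16:00 to 18:00.
hourly = [5.0, 6.0, None, 13.0, 6.5, 6.0, 5.0, None, None, None, 12.0]

missing, catchup = flag_gap_and_catchup(hourly)
print(missing)   # positions of the gaps
print(catchup)   # positions of the suspected summed-up readings
```

A rule like this only covers the gap-then-sum pattern. Other kinds of abnormality would need a different rule or a statistical method.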
I am relatively new to statistics and am struggling with the normality assumption.
I understand that parametric tests are underpinned by the assumption that the data is normally distributed, but there seems to be lots of papers and articles providing conflicting information.
Some articles say that independent variables need to be normally distributed and that this may require a transformation (log, SQRT etc.). Others say that in linear modelling there are no assumptions about the distribution of the independent variables.
I am trying to create a multiple regression model to predict highest pain scores on hospital admissions:
DV: numeric pain scores (0 = no pain to 5 = intense pain; discrete dependent variable).
IVs: age (continuous), weight (continuous), sex (nominal), deprivation status (ordinal), race (nominal).
Can someone help clear up the following for me?
Before fitting a model, do I need to check whether my independent variables are normally distributed? If so, why? Does this only apply to continuous variables (e.g. age and weight in my model)?
If age is positively skewed, would a transformation (e.g. log, SQRT) be appropriate and why? Is it best to do this before or after fitting a model? I assume I am trying to get close to a linear relationship between my DV and IV.
As part of its output, SPSS provides plots of the standardised residuals against predicted values and also normal P-P plots of the standardised residuals. Are these plots all that is needed to check the normality assumption after fitting a model?
Many Thanks in advance!
I have used the following code to run and evaluate a RandomForestRegressor model for my data:
My dataset is 36 features, 1 label with around 31 million rows. The features are continuous and the labels are binary.
I have the following questions:
When I use np.unique(Y_Pred) it tells me array([0. , 0.5, 1. ]). Why am I getting 0.5 as an output? Is there a parameter I can change in the model to fix it? I don't know whether to include it as a 1 or 0. For now I've included it as a 1 (hence Y_Pred > 0.45 in my code).
The documentation says the most important parameters to adjust are n_estimators and max_features. For n_estimators, what is a reasonable number? I've started at 2 because of how long it took to run on my TPU Google Colab session (43 minutes for each tree, or 86 minutes total). Should I bother increasing the number of trees to improve accuracy? Are there any other parameters I can change to improve speed? All of my features are reasonably important, so I don't want to start dropping them.
Is there anything I am doing wrong that is making it slow, or anything I can do to make it faster?
Any help would be greatly appreciated.
When your labels are binary, you should use the RandomForestClassifier so that you can get the 1 or 0 as the output directly from the model.
You could play around with the max_samples parameter to reduce the number of data points used for each tree in the random forest. Since you have 31 million records, it makes sense to subsample them for each tree.
Limiting max_depth can also greatly reduce the training time. You need to find the sweet spot that balances computation time and model performance.
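The three suggestions above can be sketched together as follows. This is only an illustration under assumptions: the data is a small synthetic stand-in (not the 31 million row dataset), and the values chosen for `n_estimators`, `max_samples` and `max_depth` are placeholders to tune, not recommendations. The first part also shows where the 0.5 comes from: a regressor with two fully grown trees averages two outputs that are each 0 or 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic stand-in data (hypothetical): 36 continuous features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 36))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# A 2-tree regressor averages two per-tree outputs, so with fully grown
# trees the only possible predictions are 0, 0.5 and 1.
reg = RandomForestRegressor(n_estimators=2, random_state=0).fit(X, y)
reg_values = np.unique(reg.predict(X))

# Classifier with per-tree subsampling and a depth cap to cut training time.
clf = RandomForestClassifier(
    n_estimators=50,   # placeholder; more trees usually give steadier results
    max_samples=0.2,   # each tree sees a bootstrap sample of 20% of the rows
    max_depth=10,      # shallower trees train faster
    n_jobs=-1,         # build trees in parallel on all available cores
    random_state=0,
)
clf.fit(X, y)
class_values = np.unique(clf.predict(X))

print(reg_values)    # averaged regression outputs
print(class_values)  # class labels only
```

If a probability is needed instead of a hard label, `clf.predict_proba(X)` returns one, and the threshold can then be chosen explicitly.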
I'm trying to implement collaborative filtering using sklearn's TruncatedSVD. However, I get a very high RMSE because I receive very low predicted ratings for every recommendation.
I perform TruncatedSVD on a sparse matrix, and I was wondering whether these low predictions are because TruncatedSVD treats unrated movies as movies rated 0. If not, do you know what might cause the low predictions? Thanks!
So, it turned out that if your data set's numeric values don't meaningfully start at zero, you cannot apply TruncatedSVD without some adjustments. In the case of movie ratings, which range from 1 to 5, you need to mean-center the data so that the zeros are given a meaning. Mean-centering the data worked for me, and I started to get reasonable RMSE values.
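A minimal sketch of that mean-centering step is below. The post does not say how the centering was done, so centering per user over rated movies only is an assumption here, and the 5x5 ratings matrix is made-up toy data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy ratings (users x movies); 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 0, 5, 1, 2],
    [0, 5, 4, 0, 1],
    [1, 2, 0, 5, 4],
    [2, 0, 1, 4, 5],
], dtype=float)

rated = ratings > 0

# Per-user mean over rated movies only (assumed centering scheme).
user_means = ratings.sum(axis=1) / rated.sum(axis=1)

# Subtract each user's mean from the rated entries. Unrated entries stay 0,
# which now stands for "this user's average rating" instead of "rated 0".
centered = np.where(rated, ratings - user_means[:, None], 0.0)

svd = TruncatedSVD(n_components=2, random_state=0)
user_factors = svd.fit_transform(csr_matrix(centered))

# Reconstruct and add the means back to return to the 1 to 5 scale.
predicted = svd.inverse_transform(user_factors) + user_means[:, None]
print(np.round(predicted, 2))
```

With this scheme an unrated movie starts from the user's average rather than from zero, so predictions are no longer dragged towards 0.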
I am new to machine learning. I ran a test but do not know how to explain and evaluate the results.
Case 1:
I first randomly divide the data (data A, about 8000 words) into 10 groups (a1..a10). Within each group, I use 90% of the data to build an ngram model. This ngram model is then tested on the other 10% of the data in the same group. The result is below 10% accuracy. The other 9 groups are handled the same way (a model is built for each and tested on the remaining 10% of that group's data). All results are about 10% accuracy. (Is this 10 fold cross-validation?)
Case 2:
I first build an ngram model based on the entire data set (data A) of about 8000 words. Then I divide A into 10 groups (a1, a2, a3..a10), randomly of course. I then test this ngram model on a1, a2..a10 in turn. I found that the model reaches almost 96% accuracy on all groups.
How can these results be explained?
Thanks in advance.
Yes, 10-fold cross validation.
The testing method in Case 2 has the common flaw of testing on the training set. That is why the accuracy is inflated. It is unrealistic because, in real life, your test instances are novel and previously unseen by the system.
N-fold cross validation is a valid evaluation method used in many works.
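The difference between the two evaluation styles can be sketched in plain Python. Everything here is a hypothetical stand-in: the word list is random synthetic data, the model is a very simple bigram next-word predictor, and the folds are contiguous chunks rather than a random split.

```python
import random
from collections import Counter, defaultdict


def train_bigram(segments):
    """Count, for each word, which words follow it in the training segments."""
    following = defaultdict(Counter)
    for words in segments:
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following


def accuracy(model, words):
    """Share of positions where the most frequent follower is the real next word."""
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    hits = 0
    for prev, nxt in pairs:
        candidates = model.get(prev)
        if candidates and candidates.most_common(1)[0][0] == nxt:
            hits += 1
    return hits / len(pairs)


def k_fold_accuracies(words, k=10):
    """Train on k-1 folds and test on the held-out fold, once per fold."""
    fold_size = len(words) // k
    folds = [words[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        training = [fold for j, fold in enumerate(folds) if j != i]
        scores.append(accuracy(train_bigram(training), held_out))
    return scores


# Synthetic corpus (hypothetical): 2000 tokens from a 200-word vocabulary.
rng = random.Random(0)
corpus = [f"w{rng.randrange(200)}" for _ in range(2000)]

# Testing on the training data: the model has already seen every pair.
in_sample = accuracy(train_bigram([corpus]), corpus)

# Cross-validation: every test fold is unseen during training.
cv_scores = k_fold_accuracies(corpus, k=10)
cv_mean = sum(cv_scores) / len(cv_scores)

print(round(in_sample, 3), round(cv_mean, 3))
```

The gap between the in-sample score and the cross-validated mean is the memorization effect. Only the cross-validated figure says anything about unseen text.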
You need to read up on the topic of overfitting.
The situation you describe gives the impression that your ngram model is heavily overfitted: it can "memorize" 96% of the training data. But when trained on a proper subset, it only achieves 10% accuracy on the unseen data.
This is called 10 fold cross-validation