find a product that causes a non-square technosphere matrix with Brightway2 - brightway

As a follow-up, I have a question related to a previous question about "cleaning" the database. How can I identify why my technosphere matrix is no longer square?
I have done something to my database such that, when I try to do an LCIA of a random activity:
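    from brightway2 import Database, LCA, methods

    def testactivity(activity):
        # Score one activity against a randomly chosen LCIA method
        method_key = methods.random()
        fu = {activity: 1}
        lca = LCA(fu, method_key)
        lca.lci()
        lca.lcia()
        print(lca.score)

    testactivity(Database('ei_33consequential').random())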
I get this warning message: NonsquareTechnosphere: Technosphere matrix is not square: 12384 activities (columns) and 12385 products (rows). Use LeastSquaresLCA to solve this system, or fix the input data.
I tried to find out whether I have a dataset with two reference products. To check that, I looped through the database to see whether any "production amount" was not a float, but I didn't find anything "wrong":

    for ds in Database('ei_33consequential'):
        if not isinstance(ds['production amount'], float):
            print(ds['name'])

Is this approach correct to find an activity with more than one reference flow? Otherwise, how can I find the product which is making my matrix non-invertible?

You can check to see which activities have more than one production exchange with something like this:
    for a in Database("ecoinvent 3.3 cutoff"):
        assert len(a.production()) == 1
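
The assert above stops at the first offending activity without saying which one it is. A minimal sketch that collects every suspect instead is given below. It assumes only that each activity exposes a `production()` method, as used in the answer above; the helper name `activities_without_single_production` is made up for illustration.

```python
def activities_without_single_production(activities):
    """Return (activity, count) pairs whose number of production exchanges is not 1.

    `activities` is any iterable of objects with a `production()` method,
    for example Database("ei_33consequential") in Brightway2.
    """
    suspects = []
    for act in activities:
        # list() makes this work for any iterable of exchanges
        n_production = len(list(act.production()))
        if n_production != 1:
            suspects.append((act, n_production))
    return suspects
```

With Brightway2 loaded, something like `for act, n in activities_without_single_production(Database("ei_33consequential")): print(act, n)` would list the candidates. This only covers the "wrong number of production exchanges" case; other causes of an extra product row would need a separate check.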

Related

Handling optional data in Logistic regression

I am working with data which contains marks and other features of students, and I am trying to predict whether they will get a high salary or not using scikit-learn in Python. I ran into a problem:
since a student does not take all the subjects, his/her score in a subject is -1 if he/she has not taken that subject (a student can take multiple subjects).
Below is a snapshot taken from the data file:
[Image: snapshot of the data file]
I am trying to find a way to interpret the -1 in a way that doesn't alter the data much.
My Approach:
Take the percentile marks for each student and then take the average of all percentiles for each student, giving a single number per student, which is a lot easier to work with. However, this method may lose some information about the distribution of marks.
Fill the -1 values with the average of the marks of all students in that subject, but this will not work if the data is biased towards one subject.
Is there any better way to deal with this kind of data?
Your "-1"'s amount to missing data, so you are asking how to approach a classification task with missing data. See here and here and here, among many others, for discussions on this topic.
A couple important considerations that come to mind:
One option is to "impute" the missing values, which is what you're describing with using "average marks." This approach often requires the assumption that the data is "missing at random" which in your case is unlikely to be true: for example, a bad student is more likely to not take a difficult subject, so missing values tell you something.
Regression models (like logistic regression) will in general require some type of imputation. But there are other models, like decision trees or random forests, that can, in some implementations, handle missing data without imputation.
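
Since the missingness itself carries information here, one common compromise is to impute and also keep a "was missing" flag per subject, so that a model such as logistic regression can still use the fact that a subject was not taken. A minimal NumPy sketch follows. The function name and the toy marks are made up for illustration, and it assumes -1 never occurs as a real mark and that every subject has at least one real mark.

```python
import numpy as np

def impute_with_indicators(marks, missing=-1):
    """Mean-impute missing marks and append one 0/1 'missing' column per subject."""
    marks = np.asarray(marks, dtype=float)
    is_missing = marks == missing
    # Column means computed over the observed marks only
    col_means = np.nanmean(np.where(is_missing, np.nan, marks), axis=0)
    filled = np.where(is_missing, col_means, marks)
    return np.hstack([filled, is_missing.astype(float)])

# Hypothetical example: 3 students, 2 subjects
features = impute_with_indicators([[80, -1], [60, 70], [-1, 90]])
```

The resulting array has twice as many columns as the input and can be passed to a classifier in place of the raw marks.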

Time Series Forecasting in Tensorflow 2.0 - How to predict using the last of the Validation Dataset?

I'm (desperately) trying to figure out Tensorflow 2.0 without much luck so far, but I think I'm close with what I need right now.
I've followed the doc here to make a simple network to forecast stock data (not weather data), and what I'd like to do now is, forecast the future using the latest/most recent section of the validation dataset. I'm hoping someone else has read through it already and can help me here.
The code to predict the future using the validation dataset looks like this:
    for x, y in val_data_multi.take(3):
        multi_step_plot(x[0], y[0], multi_step_model.predict(x)[0])
...where to the best of my knowledge, it takes a random chunk (3 separate times), and in my case is a 20 row x 9 column section, from the val_data_multi "Repeat dataset" type, and then uses the model's multi_step_plot function to spit out a plot that has the predicted values based on that random section of the validation dataset. But what if I don't want to just take a random validation section, I want to use the bottom of my actual dataset? So that if I have recent stock data at the bottom of my validation dataset, and I want to forecast for the future that hasn't happened yet, how can I take a 20x9 section from the bottom of that set, and not just have it "take" a random section to predict with?
As a pseudo code attempt to explain what I'm trying to do, I was trying something like:
for x, y in val_data_multi[-20:].take(1): #.take(3)
...to try and make it take one section 20 rows up from the bottom, and all columns. But of course this didn't work as TypeError: 'RepeatDataset' object is not subscriptable.
I hope that makes sense, and if it'll help for me to post my code, I can do that, but I'm just using what's already shown in that page, just made some modifications to use a stock dataset, that's all.
I was able to find a much better guide from this Github repo:
https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/23_Time-Series-Prediction.ipynb
...which goes into more detail on what I'm looking to do and made it very easy to understand. Thanks!
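
For the original question (predicting from the most recent rows rather than from a batch drawn out of the dataset object), one possible sketch is to slice the last window directly from the underlying array and add a batch dimension. It assumes the data is available as a 2-D array of shape (rows, features) that has already been normalised the same way as the training data. The helper name `last_window` is made up; the 20-row by 9-column shape and `multi_step_model` come from the question.

```python
import numpy as np

def last_window(dataset, past_history=20):
    """Return the final `past_history` rows as one batch of shape (1, past_history, n_features)."""
    dataset = np.asarray(dataset)
    if len(dataset) < past_history:
        raise ValueError("dataset has fewer rows than past_history")
    return dataset[-past_history:][np.newaxis, ...]

# Hypothetical stand-in for the stock data: 100 rows, 9 columns
dataset = np.arange(100 * 9, dtype=float).reshape(100, 9)
window = last_window(dataset, past_history=20)
# With the model from the question this would then be, for example:
# forecast = multi_step_model.predict(window)[0]
```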

How to apply a Pearson Correlation Analysis over all pairs of pixels of a DataArray as a Correlation Matrix?

I am facing serious difficulties in generating a correlation matrix (pixel by pixel) of a single Netcdf with dimensions ('lon', 'lat', 'time'). My final intent is to generate what one calls a Teleconnectivity Map.
This Map is composed of correlation coefficients. Each pixel has a value that represents the highest correlation value (in module) found in the correlation matrix over all pairs of pixels in the DataArray.
Therefore, in order to create my Teleconnectivity Map, instead of looping over every longitude ('lon') and every latitude ('lat') and then checking all possible pairs for the correlation with the highest magnitude, I was thinking of applying xr.apply_ufunc with a wrapped correlation function inside.
Despite my efforts, I still don't understand what is truly happening behind the scenes in xr.apply_ufunc. All I managed to produce was a single resulting matrix with all pixels equal to 1 (perfect correlation).
See code below:
    import numpy as np
    import xarray as xr

    def correlation(x, y):
        return np.corrcoef(x, y)[0,0] # to return a single correlation index, instead of a matrix

    def wrapped_correlation(da, x, coord='time'):
        """Finds the correlation along a given dimension of a dataarray."""
        from functools import partial
        fpartial = partial(correlation, x.values)
        return xr.apply_ufunc(fpartial,
                              da,
                              input_core_dims=[[coord]],
                              output_core_dims=[[]],
                              vectorize=True,
                              output_dtypes=[float]
                              )

    # testing the wrapped correlation for a sample data:
    ds = xr.tutorial.open_dataset('air_temperature').load()
    # testing for a single point in space.
    x = ds['air'].sel(dict(lon=1, lat=92), method='nearest')
    # over all points in the DataArray
    Corr_over_x = wrapped_correlation(ds['air'], x)
    Corr_over_x
    # Notice that the resulting DataArray is composed solely of ones (perfect correlation). This is impossible.
    # I would expect a different correlation value for each pixel here.
    # If one plotted the data, it would be composed of a variety of correlation values (see example below):
    Corr_over_x.plot()
This is an important asset for meteorologists and Remote Sensing researchers. It allows the evaluation of potential geophysical patterns over a given area of study.
I thank you for your time, and I hope hearing from you soon.
Sincerely yours,
Firstly, you need to use np.corrcoef(x, y)[0,1]. The [0,0] entry is the correlation of x with itself, which is always 1. Also, you don't need to use partial at all, see below:

    def correlation(x1, x2):
        return np.corrcoef(x1, x2)[0,1] # to return a single correlation index, instead of a matrix

    def wrapped_correlation(da, x, coord='time'):
        """Finds the correlation along a given dimension of a dataarray."""
        return xr.apply_ufunc(correlation,
                              da,
                              x,
                              input_core_dims=[[coord],[coord]],
                              output_core_dims=[[]],
                              vectorize=True,
                              output_dtypes=[float]
                              )
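
The function above gives the correlation map for one reference pixel. For the map described in the question (for each pixel, the highest correlation magnitude over all other pixels), a plain NumPy sketch is shown below. It assumes the data fits in memory as an array of shape (time, lat, lon) with no constant or missing series; the function name `teleconnectivity` is made up for illustration.

```python
import numpy as np

def teleconnectivity(data):
    """For each pixel, return the largest |correlation| with any other pixel.

    data: array of shape (time, lat, lon). Returns an array of shape (lat, lon).
    """
    n_time, n_lat, n_lon = data.shape
    flat = data.reshape(n_time, n_lat * n_lon)
    # Full pixel-by-pixel correlation matrix: one column per pixel
    corr = np.corrcoef(flat, rowvar=False)
    # Ignore the trivial correlation of each pixel with itself
    np.fill_diagonal(corr, 0.0)
    return np.abs(corr).max(axis=1).reshape(n_lat, n_lon)
```

For an xarray DataArray `da` with a 'time' dimension this could be called as `teleconnectivity(da.transpose('time', 'lat', 'lon').values)`. The correlation matrix has (lat × lon)² entries, so this only scales to moderately sized grids.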
I have managed to solve my question. The script has become a bit long. Nevertheless, it does what was originally intended.
The code is adapted from this reference.
Since it is too long to show as a snippet here, I am posting a link to my Github account, in which the algorithm (organized in a package named Teleconnection_using_xarray_data) can be checked here.
The package has two modules with similar results.
The first module (teleconnection_with_connecting_pathways) is slower than the second (teleconnection_via_numpy), but it allows evaluating the connecting pathways between the partial teleconnection maps.
The second only returns the resulting teleconnection map, without the connecting lines (geopandas LineStrings), though it is much faster.
Feel free to collaborate. If possible, I would like to combine both modules, ensuring both speed and pathway analysis in the Teleconnection algorithm.
Sincerely yours,
Philipe Leal

STL decomposition getting rid of NaN values

The following links were investigated but didn't provide the answer I was looking for or fix my problem: First, Second.
Due to confidentiality issues I cannot post the actual decomposition, but I can show my current code and give the length of the data set. If this isn't enough, I will remove the question.
    import numpy as np
    from statsmodels.tsa import seasonal

    def stl_decomposition(data):
        data = np.array(data)
        # Flatten the list of yearly rows into one monthly series
        data = [item for sublist in data for item in sublist]
        decomposed = seasonal.seasonal_decompose(x=data, freq=12)
        seas = decomposed.seasonal
        trend = decomposed.trend
        res = decomposed.resid
        return seas, trend, res
In a plot it shows it decomposes correctly according to an additive model. However the trend and residual lists have NaN values for the first and last 6 months. The current data set is of size 10*12. Ideally this should work for something as small as only 2 years.
Is this still too small as said in the first link? I.e. I need to extrapolate the extra points myself?
EDIT: Seems that always half of the frequency is NaN on both ends of trend and residual. Same still holds for decreasing size of data set.
According to this Github link another user had a similar question. They 'fixed' this issue. To avoid NaNs an extra parameter can be passed.
    decomposed = seasonal.seasonal_decompose(x=data, freq=12, extrapolate_trend='freq')

It will then use linear least squares to best approximate the values. (Source)
Obviously the information was literally on their documentation and clearly explained but I completely missed/misinterpreted it. Hence I am answering my own question for someone who has the same issue, to save them the adventure I had.
According to the parameter definition below, setting extrapolate_trend other than 0 makes the trend estimation revert to a different estimation method. I faced this issue when I had a few observations for estimation.
    extrapolate_trend : int or 'freq', optional
        If set to > 0, the trend resulting from the convolution is
        linear least-squares extrapolated on both ends (or the single one
        if two_sided is False) considering this many (+1) closest points.
        If set to 'freq', use `freq` closest points. Setting this parameter
        results in no NaN values in trend or resid components.
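
To see why exactly half of the frequency is NaN at each end: the trend comes from a centred moving average, which needs half a period of data on both sides of every point. The sketch below is a stand-alone NumPy illustration of that idea and not the statsmodels source. The function name is made up, and the argument the question calls `freq` is named `period` in newer statsmodels releases.

```python
import numpy as np

def centered_moving_average(x, period):
    """Centred moving average; ends that lack a full window are left as NaN."""
    x = np.asarray(x, dtype=float)
    if period % 2 == 0:
        # Even period: half weight on the two outermost points (a 2 x period average)
        weights = np.array([0.5] + [1.0] * (period - 1) + [0.5]) / period
    else:
        weights = np.repeat(1.0 / period, period)
    half = len(weights) // 2
    trend = np.full(len(x), np.nan)
    trend[half:len(x) - half] = np.convolve(x, weights, mode="valid")
    return trend

# 10 years of monthly data, as in the question
trend = centered_moving_average(np.arange(120.0), period=12)
n_nan_start = int(np.isnan(trend[:12]).sum())
n_nan_end = int(np.isnan(trend[-12:]).sum())
```

With a period of 12, the first 6 and last 6 entries have no full window, regardless of how long the series is. This matches the observation in the EDIT above, and `extrapolate_trend` fills exactly those positions.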

PredictionIO for Content Recommendation e.g. Tweets

I recently installed PredictionIO.
What I'd like to achieve is: I'd like to categorize content on the words included in the text. But how can I import data like raw Tweets to PredictionIO? Is it possible to let PredictionIO run over the content and find strong words and to sort them in categories?
What I'd like to get is something like this: Query for Boston Red Sox --> keywords that should appear would be: baseball, Boston, sports, ...
So I'll add on a little to what Thomas said. He's right, it all depends whether or not you have labels associated to your tweets. If your data is labeled then this will be a Text Classification problem. Look at this for more detailed info:
If you're instead looking to cluster, or group, a set of unlabeled observations then, as Thomas said, your best bet is to incorporate LDA into the works. Look at the latter documentation for more information, but basically once you run the LDA model you'll obtain an object of type DistributedLDAModel which has a method topicDistributions which gives you, for each tweet, a vector where each component is associated to a topic, and the component entry gives you the probability that the tweet belongs to that topic. You can cluster by assigning each tweet the topic with highest probability.
You also have access to a matrix of size MxN, where M is the number of words in your vocabulary, and N is the number of topics, or clusters, you wish to discover in your data. You can roughly interpret the ij th entry of this Topics Matrix as the probability that the word i appears in a document given that the document belongs to topic j. Another rule you could use for clustering is to treat each word vector associated to your tweets as a vector of counts. Then, you can interpret the ij entry of the product of your word matrix (tweets as rows, words as columns) and the Topics Matrix returned by LDA as the probability that tweet i belongs to topic j (this follows under certain assumptions, feel free to ask if you want more details). Again now you assign tweet i to the topic associated to the largest numerical value in row i of the resulting matrix. You can even use this clustering rule for assigning topics to incoming observations once you have used your original set of tweets for topic discovery!
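
The second clustering rule (multiply the tweet-by-word count matrix with the word-by-topic Topics Matrix, then take the largest entry in each row) can be sketched in a few lines of NumPy. The function name and the numbers below are made up for illustration; in practice the Topics Matrix would come from the fitted LDA model.

```python
import numpy as np

def assign_topics(word_counts, topics_matrix):
    """Assign each tweet to the topic with the largest score.

    word_counts:   (n_tweets, n_words) matrix of word counts
    topics_matrix: (n_words, n_topics) matrix from LDA
    """
    scores = np.asarray(word_counts) @ np.asarray(topics_matrix)
    return scores.argmax(axis=1)

# Hypothetical toy data: 2 tweets, 3 words, 2 topics
word_counts = np.array([[2, 1, 0],
                        [0, 0, 3]])
topics_matrix = np.array([[0.6, 0.1],
                          [0.3, 0.1],
                          [0.1, 0.8]])
labels = assign_topics(word_counts, topics_matrix)
```

The same function can be applied to the count vectors of new tweets once the Topics Matrix has been learned.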
Now, for data processing, you can still use the Text Classification reference for transforming your Tweets to word count vectors via the DataSource and Preparator components. As for importing your data, if you have the tweets saved locally on a file, you can use PredictionIO's Python SDK to import your data. An example is also given in the classification reference.
Feel free to ask any questions if anything isn't clear, and good luck!
So, it really depends on whether you have labelled data.
For example:
Baseball :: "I love Boston Red Sox #GoRedSox"
Sports :: "Woohoo! I love sports #winning"
Boston :: "Baseball time at Fenway Park. Red Sox FTW!"
...
Then you would be able to train a model to classify Tweets against these keywords. You might be interested in the templates for MLlib Naive Bayes and Decision Trees.
If you don't have labelled data (really, who wants to manually label Tweets) you might be able to use approaches such as Topic Modeling (e.g., LDA).
I don't think there is a template for LDA but being an active open source project it wouldn't surprise me if someone has already implemented this so might be a good idea to ask on PredictionIO user or dev forums.
