How to do an exact search in a Database - brightway

Normally, in Whoosh, exact-match phrase queries are obtained using double quotes. This seems to work most, but not all, of the time in bw2 (e.g. see here).
db.search('"{}"'.format("Carbon dioxide, from soil or biomass stock"))
['Carbon dioxide, from soil or biomass stock' (kilogram, None, ('air', 'non-urban air or from high stacks')),
'Carbon dioxide, to soil or biomass stock' (kilogram, None, ('soil', 'agricultural')),
'Carbon dioxide, from soil or biomass stock' (kilogram, None, ('air', 'urban air close to ground')),
'Carbon dioxide, from soil or biomass stock' (kilogram, None, ('air',)),
'Carbon dioxide, to soil or biomass stock' (kilogram, None, ('soil', 'forestry')),
'Carbon dioxide, from soil or biomass stock' (kilogram, None, ('air', 'indoor')),
'Carbon dioxide, from soil or biomass stock' (kilogram, None, ('air', 'lower stratosphere + upper troposphere')),
'Carbon dioxide, to soil or biomass stock' (kilogram, None, ('soil', 'industrial')),
'Carbon dioxide, from soil or biomass stock' (kilogram, None, ('air', 'low population density, long-term')),
'Carbon dioxide, to soil or biomass stock' (kilogram, None, ('soil',))]
Any idea on how to get an exact-match search?

The easiest way to find things that are difficult to express in the Whoosh search index is simply to skip it and filter the raw datasets, e.g.
[ds for ds in db if ds['name'].startswith('Carbon dioxide, from soil or biomass stock')]
It is quite easy to add arbitrary complexity because you are just adding Python functions.
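If you need a strict exact match rather than a prefix match, comparing the name directly is a small variation on the same idea. A minimal sketch, assuming db is the already-opened Database object used above:
target = 'Carbon dioxide, from soil or biomass stock'
# Exact match on the dataset name, bypassing the Whoosh index entirely.
exact = [ds for ds in db if ds['name'] == target]
# Case-insensitive variant, in case capitalization differs between datasets.
exact_ci = [ds for ds in db if ds['name'].lower() == target.lower()]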

Related

Fitting a function with a variable number of parameters and multiple independent variables

I am using curve_fit to fit a function with a variable number of parameters and more than one independent variable (the independent variables are fixed in number).
The fitted function (fun_multiple) takes the independent variables as an array of arrays (alphas); the fitted parameters are G plus a number of Hs parameters, which changes depending on the data given. The flux array is 1-D.
The fit is performed as:
popt, pcov = curve_fit(lambda alphas, G, *Hs: fun_multiple(alphas, G, Hs), self.alphas, flux, p0=[0.1, *self.max_mags])
Below is example data for which 6 parameters are fitted (1 G plus 5 Hs).
I am getting the error below, which seems understandable, but I cannot pinpoint what should be corrected.
File "/home/starlink/PCFit/PCFit_1.17/fitter.py", line 64, in fit_HG
popt, pcov = curve_fit(lambda alphas, G, *Hs: fun_multiple(alphas, G, Hs), self.alphas, flux, p0=[0.1, *self.max_mags])
File "/opt/anaconda3/lib/python3.8/site-packages/scipy/optimize/minpack.py", line 742, in curve_fit
xdata = np.asarray_chkfinite(xdata, float)
File "/opt/anaconda3/lib/python3.8/site-packages/numpy/lib/function_base.py", line 486, in asarray_chkfinite
a = asarray(a, dtype=dtype, order=order)
File "/opt/anaconda3/lib/python3.8/site-packages/numpy/core/_asarray.py", line 102, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.
flux [0.00366438 0.00398107 0.00401791 0.00452898 0.00465586 0.00452898
0.00436516 0.00428549 0.00416869 0.00366438 0.00331131 0.00301995
0.00304789 0.00299226 0.00280543 0.00263027 0.00263027 0.00246604
0.00253513 0.00199526 0.00201372 0.0021677 0.00222844 0.00233346
0.00235505 0.00235505 0.00248886 0.00253513 0.00258226 0.00260615
0.00267917 0.00272898 0.00277971 0.00299226 0.00316228 0.00343558
0.00353183 0.00224905 0.00218776 0.00199526 0.00201372 0.0021677
0.00222844 0.00233346 0.00235505 0.00235505 0.00248886 0.00253513
0.00258226 0.00260615 0.00267917 0.00272898 0.00277971 0.00299226
0.00316228 0.00343558 0.00353183 0.00359749 0.00390841 0.00398107
0.00343558 0.0030761 0.00299226 0.00258226 0.00246604 0.00233346
0.00224905 0.00218776 0.00199526 0.00201372 0.0021677 0.00222844
0.00233346 0.00235505 0.00235505 0.00248886 0.00253513 0.00258226
0.00260615 0.00267917 0.00272898 0.00277971 0.00299226 0.00343558
0.00353183 0.00359749 0.00390841 0.00398107 0.00343558 0.0030761
0.00299226 0.00258226 0.00246604 0.00233346 0.00224905 0.00218776
0.00199526 0.00201372 0.0021677 0.00222844 0.00233346 0.00235505
0.00235505 0.00248886 0.00253513 0.00258226 0.00260615 0.00267917
0.00272898 0.00277971 0.00299226 0.00316228 0.00343558 0.00353183
0.00359749 0.00390841 0.00398107 0.00343558 0.0030761 0.00299226
0.00258226 0.00246604 0.00233346 0.00224905 0.00218776]
alphas [array([ 3.5, 3.1, 2.8, 1.5, 1.4, 1.4, 2. , 2.3, 2.6, 4.2, 7.2,
9.7, 10. , 10.2, 12.4, 15.9, 16.1, 17.3, 17.4]), array([17. , 16.2, 14.4, 13.5, 11.6, 11.3, 11. , 10.3, 8.8, 8.5, 8.1,
7.3, 6.9, 6.5, 5.7, 4.1, 2.4, 2. , 12.1, 13.8]), array([17. , 16.2, 14.4, 13.5, 11.6, 11.3, 11. , 10.3, 8.8, 8.5, 8.1,
7.3, 6.9, 6.5, 5.7, 4.1, 2.4, 2. , 1.6, 1.2, 1. , 2.5,
4.2, 4.6, 8.8, 10.3, 11.4, 12.1, 13.8]), array([17. , 16.2, 14.4, 13.5, 11.6, 11.3, 11. , 10.3, 8.8, 8.5, 8.1,
7.3, 6.9, 6.5, 5.7, 2.4, 2. , 1.6, 1.2, 1. , 2.5, 4.2,
4.6, 8.8, 10.3, 11.4, 12.1, 13.8]), array([17. , 16.2, 14.4, 13.5, 11.6, 11.3, 11. , 10.3, 8.8, 8.5, 8.1,
7.3, 6.9, 6.5, 5.7, 4.1, 2.4, 2. , 1.6, 1.2, 1. , 2.5,
4.2, 4.6, 8.8, 10.3, 11.4, 12.1, 13.8])]
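For reference, the traceback arises before any fitting happens: curve_fit converts xdata with np.asarray_chkfinite, and a list of arrays with different lengths (as in alphas above, whose five sub-arrays are not all the same size) cannot be turned into a rectangular float array. A minimal sketch with hypothetical arrays that reproduces the same ValueError outside curve_fit:
import numpy as np
# Two sub-arrays of different lengths -> no rectangular float array exists,
# so the conversion raises "setting an array element with a sequence".
ragged = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])]
np.asarray_chkfinite(ragged, float)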

PySpark RDD: unexpected change in standard deviation

I'm following Raju Kumar's PySpark Recipes, and in recipe 4-5 I found that rdd.stats() and rdd.stats().asDict() give different values for the standard deviation. This goes unnoticed in the book, by the way.
Here is the code to reproduce the finding:
import pyspark
sc = pyspark.SparkContext()
air_speed = [12,13,15,12,11,12,11]
air_rdd = sc.parallelize(air_speed)
print(air_rdd.stats())
print(air_rdd.stats().asDict())
And this is the output:
(count: 7, mean: 12.285714285714286, stdev: 1.2777531299998799, max: 15.0, min: 11.0)
{'count': 7, 'mean': 12.285714285714286, 'sum': 86.0, 'min': 11.0, 'max': 15.0, 'stdev': 1.3801311186847085, 'variance': 1.904761904761905}
Now, I know the stdev in the first case uses the "population" stdev formula, while the second is the unbiased estimator of the population standard deviation (a.k.a. the "sample standard deviation"); see an article for reference. But what I don't understand is why they change from one output to the other: it looks like .asDict() should simply change the format of the output, not its meaning.
So, does anybody understand the logic of this change?
I mean, it looks like .asDict() should simply change the format of the output, not its meaning.
It doesn't really change the meaning. pyspark.statcounter.StatCounter provides both the sample and the population variants:
>>> stats = air_rdd.stats()
>>> stats.stdev()
1.2777531299998799
>>> stats.sampleStdev()
1.3801311186847085
and you can choose which one should be used when converting to a dictionary:
>>> stats.asDict()
{'count': 7, 'mean': 12.285714285714286, 'sum': 86.0, 'min': 11.0, 'max': 15.0, 'stdev': 1.3801311186847085, 'variance': 1.904761904761905}
>>> stats.asDict(sample=True)
{'count': 7, 'mean': 12.285714285714286, 'sum': 86.0, 'min': 11.0, 'max': 15.0, 'stdev': 1.2777531299998799, 'variance': 1.63265306122449}
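To see that the two numbers are just the two textbook formulas, here is a quick check in plain Python (no Spark needed), dividing the sum of squared deviations by n versus n - 1:
import math
air_speed = [12, 13, 15, 12, 11, 12, 11]
n = len(air_speed)
mean = sum(air_speed) / n
ss = sum((x - mean) ** 2 for x in air_speed)
print(math.sqrt(ss / n))        # ~1.27775, population stdev: what stats() prints
print(math.sqrt(ss / (n - 1)))  # ~1.38013, sample stdev: what asDict() returns by default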

numpy arange function returns inconsistent array

While coding some array iteration stuff, I came across this strange behavior of the numpy arange() function:
>>> import numpy as np
>>> np.arange(0.13,0.16, step=0.01)
array([0.13, 0.14, 0.15])
>>> np.arange(0.12,0.16, step=0.01)
array([0.12, 0.13, 0.14, 0.15, 0.16])
>>> np.arange(0.11,0.16, step=0.01)
array([0.11, 0.12, 0.13, 0.14, 0.15])
As you can see, when asked to start at 0.13, the result stops one step short of the end value (as it should), but when asked to start at 0.12, the end value is included! Further down, starting at 0.11, the end value is gone again.
This causes obvious problems if you expect the array to grow by one element when the range is extended by exactly one step...
Any ideas on why the behavior is inconsistent?
System info: Python 3.6.5, numpy 1.14.0
np.arange documentation states:
When using a non-integer step, such as 0.1, the results will often not be consistent. It is better to use linspace for these cases.
So, you should consider using np.linspace instead.
You can implement your own arange method using linspace:
def my_arange(start, end, step):
    return np.linspace(start, end, num=round((end - start) / step), endpoint=False)
And it would work as expected:
In [27]: my_arange(0.13, 0.16, step=0.01)
Out[27]: array([ 0.13, 0.14, 0.15])
In [28]: my_arange(0.12, 0.16, step=0.01)
Out[28]: array([ 0.12, 0.13, 0.14, 0.15])
In [29]: my_arange(0.11, 0.16, step=0.01)
Out[29]: array([ 0.11, 0.12, 0.13, 0.14, 0.15])
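The underlying cause is floating-point rounding in how arange determines the output length: the number of elements is effectively ceil((stop - start) / step), and with binary floats that quotient sometimes lands just above the exact value. A short check (results shown for a typical IEEE-754 double setup) suggests why 0.12 is the odd one out:
import math
# With start=0.12 the quotient rounds slightly above 4, so ceil gives 5 elements
# and the nominal end value 0.16 sneaks into the output.
print((0.16 - 0.12) / 0.01)             # 4.000000000000001
print(math.ceil((0.16 - 0.12) / 0.01))  # 5
# With start=0.13 there is no overshoot, so ceil gives the expected 3 elements.
print(math.ceil((0.16 - 0.13) / 0.01))  # 3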

Scipy Interpolation CubicSpline Boundaries

I have an issue using the scipy.interpolate.CubicSpline function.
Here is my code:
CS1 = CubicSpline(T,A,bc_type='not-a-knot',extrapolate=bool, axis=1)
Result:
CS1 =
[-8.34442117e+03 -6.94866126e+03 -5.71682333e+03 -4.63872647e+03
-3.70418976e+03 -2.90303229e+03 -2.22507315e+03 -1.66013142e+03
-1.19802617e+03 -8.28576513e+02 -5.41601516e+02 -3.26920268e+02
-1.74351855e+02 -7.37153621e+01 -1.48298738e+01 1.24855245e+01
1.84117475e+01 1.31297102e+01 6.82032749e+00 9.66451413e+00
3.15397607e+01 7.05279383e+01 1.09387991e+02 1.32530056e+02
1.36799756e+02 1.22858734e+02 9.60947464e+01 6.66210660e+01
4.28224903e+01 2.64229282e+01 1.75832317e+01 1.45176021e+01
1.39435432e+01 1.33609464e+01 1.23801442e+01 1.09650786e+01
9.27738095e+00 7.59606003e+00 6.29249366e+00 5.91452686e+00
6.79882387e+00 7.57144653e+00 6.13515774e+00 2.70590543e+00
9.34668162e-01 3.86336659e+00 9.73615276e+00 1.52487556e+01
1.90469811e+01 2.20000000e+01]
There are negative values, which I find odd because the original data is only positive:
[7.0, 12.0, 20.0, 111.0, 132.0, 68.0, 22.0, 14.0, 12.0, 8.0, 6.0, 7.0,
 1.0, 13.0, 22.0, 23.0, 5.0, 3.0, 5.0, 65.0, 236.0, 234.0, 105.0, 152.0,
 466.0, 401.0, 157.0, 51.0, 21.0, 13.0, 11.0, 19.0, 15.0, 11.0, 9.0, 15.0,
 86.0, 276.0, 423.0, 291.0, 108.0, 36.0, 22.0, 21.0, 16.0, 16.0, 13.0, 9.0]
And T is simply a list that goes from 1 to 48 in steps of one (48 is the length of both A and T).
I feel that this comes from a boundary issue, but the problem only appears at the beginning...
Any ideas?
Nothing odd here: a cubic spline on positive data can attain negative values, no matter what the boundary conditions are. If it's necessary to maintain positivity, piecewise linear interpolation (degree 1 spline) is an option. Other options are discussed in How can I find a non-negative interpolation function?
Here is an illustration of why this happens: spl = CubicSpline([-2, -1, 1, 2], [10, 1, 1, 10])
This spline fits a parabola to the given points. The parabola dips into negative territory in the middle, between the points.
That this happened near the boundary in your example is not really important; it can happen anywhere.
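A runnable version of that illustration, with piecewise linear interpolation (mentioned above) as a contrast that cannot overshoot the data:
import numpy as np
from scipy.interpolate import CubicSpline
x = [-2, -1, 1, 2]
y = [10, 1, 1, 10]           # all data values are positive
spl = CubicSpline(x, y)
print(spl(0))                # about -2: the spline dips below zero between the knots
print(np.interp(0, x, y))    # 1.0: linear interpolation stays within the data range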

NLP model training

I just started with NLP (Natural Language Processing) and am struggling to understand one important concept: how do I train a system for relation extraction so that it works on future inputs?
For example, I have a few lines like:
Tom is working for abc company
Jerry works at xyz
organization is the place where Person works.
In all these cases the relationship is between "Person" and "Organization", and the relationship type is "working".
Based on the above examples and some NLP reading, I think we need to train the system on part-of-speech tags rather than the actual "entity names" to make it generic for other input data in the field. This is the part I am really confused about.
Please don't simply point me to some algorithms (SVM, etc.), because I know it is possible with them; what I am missing is how the algorithms process these lines so that they can handle other inputs. All the examples I see provide ready-made models and say "use them", which leaves me unable to build the things I would like to.
Any example of how an algorithm (any algorithm is OK) would use the above sentences to build a training model would be really helpful.
Thank you for your time and help.
Note: Any of the programming languages specified in the tags section is OK for me.
You're correct. There are so many distinct words that using the words themselves won't actually allow you to develop a good model. You need to reduce the dimensionality. As you suggested, one way to do that is to take the part of speech. Of course, there are also other features that you could extract. For example, the following very small portion of one of my .arff files was used for determining whether a period marked the end of a sentence or not:
@relation period
@attribute minus_three {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute minus_three_length real
@attribute minus_three_case {'UC','LC','NA'}
@attribute minus_two {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute minus_two_length real
@attribute minus_two_case {'UC','LC','NA'}
@attribute minus_one {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute minus_one_length real
@attribute minus_one_case {'UC','LC','NA'}
@attribute plus_one {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute plus_one_length real
@attribute plus_one_case {'UC','LC','NA'}
@attribute plus_two {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute plus_two_length real
@attribute plus_two_case {'UC','LC','NA'}
@attribute plus_three {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute plus_three_length real
@attribute plus_three_case {'UC','LC','NA'}
@attribute left_before_reliable real
@attribute right_before_reliable real
@attribute spaces_follow_period real
@attribute class {'EOS','NEOS'}
@data
VBP, 2, LC,NP, 4, UC,NN, 1, UC,NP, 6, UC,NEND, 1, NA,NN, 7, LC,31,47,1,NEOS
NNS, 10, LC,RBR, 4, LC,VBN, 5, LC,?, 3, NA,NP, 6, UC,NP, 6, UC,93,0,0,EOS
VBD, 4, LC,RB, 2, LC,RP, 4, LC,CC, 3, UC,UH, 5, LC,VBP, 2, LC,19,17,2,EOS
EDIT (based on the question above):
So, this was a supervised learning experiment. The training data came from normal sentences in a paragraph-style format, but was transformed into the following vector model:
Column 1: Class: End-of-Sentence or Not-End-of-Sentence
Columns 2-8: The +/- 3 words surrounding the period in question
Columns 9,10: The number of words to the left/right, respectively, of the period before the next reliable sentence delimiter (e.g. ?, ! or a paragraph marker).
Column 11: The number of spaces following the period.
Of course, this is not a very complicated problem to solve, but it's a nice little introduction to Weka. Since we can't just use the words as features, I used their POS tags. I also extracted the length of each word, whether or not it was capitalized, etc.
So, you could feed anything as testing data, so long as you're able to transform it into the vector model above and extract the features used in the .arff.
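To make "use POS tags rather than the words themselves" concrete, here is a minimal sketch (my own illustration, not the pipeline behind the .arff above) that uses NLTK to turn the question's sentences into POS-tag sequences; sequences like these, plus features such as token length and capitalization, are what a classifier would actually be trained on:
import nltk
# One-time model downloads for the tokenizer and the POS tagger.
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
sentences = ['Tom is working for abc company', 'Jerry works at xyz']
for sent in sentences:
    tokens = nltk.word_tokenize(sent)
    tagged = nltk.pos_tag(tokens)
    # Keep only the tags: the literal strings 'Tom' and 'abc' are discarded, so an
    # unseen sentence with the same NNP ... VB* ... IN ... NN shape looks identical.
    print([tag for _, tag in tagged])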

Resources