How is term frequency calculated in TfidfVectorizer? - python-3.x

I searched a lot to understand this, but I am not able to. I understand that by default TfidfVectorizer applies l2 normalization to the term frequency. This article explains its equation. I am using TfidfVectorizer on text written in the Gujarati language. Following are the details of the output:
My two documents are:
ખુબ વખાણ કરે છે
ખુબ વધારે છે
The code I am using is:
vectorizer = TfidfVectorizer(tokenizer=tokenize_words, sublinear_tf=True, use_idf=True, smooth_idf=False)
Here, tokenize_words is my function for tokenizing words.
The list of TF-IDF of my data is:
[[ 0.6088451 0.35959372 0.35959372 0.6088451 0. ]
[ 0. 0.45329466 0.45329466 0. 0.76749457]]
The list of features:
['કરે', 'ખુબ', 'છે.', 'વખાણ', 'વધારે']
The value of idf:
{'વખાણ': 1.6931471805599454, 'છે.': 1.0, 'કરે': 1.6931471805599454, 'વધારે': 1.6931471805599454, 'ખુબ': 1.0}
Please explain, using this example, what the term frequency of each term should be in both of my documents.

OK, now let's go through the documentation I linked in the comments, step by step:
Documents:
`ખુબ વખાણ કરે છે
ખુબ વધારે છે`
Get all unique terms (features): ['કરે', 'ખુબ', 'છે.', 'વખાણ', 'વધારે']
Calculate the frequency of each term in the documents:
a. Each term present in document1 [ખુબ વખાણ કરે છે] appears once, and વધારે is not present.
b. So the term frequency vector (sorted according to features): [1 1 1 1 0]
c. Applying steps a and b on document2, we get [0 1 1 0 1]
d. So our final term-frequency vector is [[1 1 1 1 0], [0 1 1 0 1]]
Note: This is the term frequency you want
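If you want to see these raw counts directly, a quick sketch is to run a plain CountVectorizer over the same documents (tokenize_words below is a stand-in for the tokenizer from the question):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["ખુબ વખાણ કરે છે", "ખુબ વધારે છે"]
tokenize_words = str.split   # stand-in for the tokenize_words function from the question

# CountVectorizer gives the raw term counts, before any idf weighting
# or normalization is applied.
cv = CountVectorizer(tokenizer=tokenize_words)
counts = cv.fit_transform(docs)

print(sorted(cv.vocabulary_))   # the five features, in sorted order
print(counts.toarray())         # [[1 1 1 1 0]
                                #  [0 1 1 0 1]]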
Now find the IDF (this is computed per feature, not per document):
idf(term) = ln(number of documents / number of documents containing the term) + 1
The 1 is always added to the idf so that terms which occur in every document are not ignored entirely. Note that you passed smooth_idf=False; with smooth_idf=True (the library default) the formula would instead be ln((1 + number of documents) / (1 + number of documents containing the term)) + 1, which also prevents zero divisions for unseen terms.
idf('કરે') = log(2/1)+1 = 0.69314.. + 1 = 1.69314..
idf('ખુબ') = log(2/2)+1 = 0 + 1 = 1
idf('છે.') = log(2/2)+1 = 0 + 1 = 1
idf('વખાણ') = log(2/1)+1 = 0.69314.. + 1 = 1.69314..
idf('વધારે') = log(2/1)+1 = 0.69314.. + 1 = 1.69314..
Note: This corresponds to the data you showed in question.
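A quick sanity check of these numbers (scikit-learn uses the natural log here):

import math

n_docs = 2
doc_freq = {'કરે': 1, 'ખુબ': 2, 'છે.': 2, 'વખાણ': 1, 'વધારે': 1}

# With smooth_idf=False: idf(term) = ln(n_docs / df(term)) + 1
for term, df in doc_freq.items():
    print(term, math.log(n_docs / df) + 1)
# કરે 1.6931471805599454, ખુબ 1.0, છે. 1.0,
# વખાણ 1.6931471805599454, વધારે 1.6931471805599454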
Now calculate the TF-IDF (this again is computed document-wise, with terms in the sorted feature order):
a. For document1:
For 'કરે', tf-idf = tf(કરે) x idf(કરે) = 1 x 1.69314 = 1.69314
For 'ખુબ', tf-idf = tf(ખુબ) x idf(ખુબ) = 1 x 1 = 1
For 'છે.', tf-idf = tf(છે.) x idf(છે.) = 1 x 1 = 1
For 'વખાણ', tf-idf = tf(વખાણ) x idf(વખાણ) = 1 x 1.69314 = 1.69314
For 'વધારે', tf-idf = tf(વધારે) x idf(વધારે) = 0 x 1.69314 = 0
So for document1, the final tf-idf vector is [1.69314 1 1 1.69314 0]
b. Now l2 (Euclidean) normalization is done:
divisor = sqrt(sqr(1.69314)+sqr(1)+sqr(1)+sqr(1.69314)+sqr(0))
= sqrt(2.8667230596 + 1 + 1 + 2.8667230596 + 0)
= sqrt(7.7334461192)
= 2.7809074272977876...
Dividing each element of the tf-idf array by the divisor, we get:
[0.6088445 0.3595948 0.3595948548 0.6088445 0]
Note: This is the tf-idf of the first document you posted in the question.
c. Doing the same steps a and b for document2, we get:
[ 0. 0.453294 0.453294 0. 0.767494]
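Putting steps a-c together, a small numpy sketch that reproduces both rows from the question (raw counts, idf with smooth_idf=False, then l2 normalization):

import numpy as np

# Rows: document1, document2; columns: કરે, ખુબ, છે., વખાણ, વધારે
tf = np.array([[1, 1, 1, 1, 0],
               [0, 1, 1, 0, 1]], dtype=float)

df = (tf > 0).sum(axis=0)            # document frequency of each term
idf = np.log(tf.shape[0] / df) + 1   # smooth_idf=False

tfidf = tf * idf                     # raw tf-idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)   # l2 normalization per row

print(tfidf)
# [[0.6088451  0.35959372 0.35959372 0.6088451  0.        ]
#  [0.         0.45329466 0.45329466 0.         0.76749457]]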
Update: About sublinear_tf = True OR False
Your original term frequency vector is [[1 1 1 1 0], [0 1 1 0 1]], and you are correct that using sublinear_tf = True will change the term frequency vector.
new_tf = 1 + log(tf)
This transform is only applied to the non-zero elements of the term frequency, because log(0) is undefined.
All of your non-zero entries are 1, and log(1) is 0, so 1 + log(1) = 1 + 0 = 1.
So the values remain unchanged for elements with value 1, and your new_tf = [[1 1 1 1 0], [0 1 1 0 1]] = tf(original).
The sublinear transform is applied to your term frequency, but for these particular values it changes nothing.
Hence all the calculations below are the same, and the output is identical whether you use sublinear_tf=True or sublinear_tf=False.
If you change your documents so that the term frequency vector contains counts other than 1 and 0, you will see a difference with sublinear_tf.
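For example, a quick sketch with a made-up count vector (these counts are hypothetical, just to show when the transform matters):

import numpy as np

tf = np.array([3.0, 1.0, 0.0])   # a term occurring 3 times, once, and not at all

sublinear = tf.copy()
nonzero = sublinear != 0
sublinear[nonzero] = 1 + np.log(sublinear[nonzero])   # what sublinear_tf=True does

print(tf)         # [3. 1. 0.]
print(sublinear)  # [2.09861229 1.         0.        ]  -> only the count of 3 changed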
Hope your doubts are cleared now.

Related

Use Kalman Filter to estimate position

I am trying to use a Kalman filter to estimate position. The input to the system is the velocity, and this is also what I measure. The velocity is not stable; the system motion is roughly a cosine in general. So the equation is:
xnew = Ax + Bu + w, where:
x= [x y]'
A = [1 0; 0 1]
B= [dt 0; 0 dt]
u=[ux uy]
w noise
As I mentioned, what I measure is the velocity. My question is: what would the matrix C look like in the equation:
y= Cx + v
Should I include the velocity in the estimated states (matrix A)? Or should I change the equations to also involve the acceleration? I cannot measure the acceleration.
One way would be to drop the velocities as inputs and put them in your state. This way, your state is both the position and velocity and your filter uses as observation both the measured speed of your vehicle and a noisy estimate of your position.
With this system your problem becomes:
x = [x_e y_e vx_e vy_e]'
A = [1 0 dt 0; 0 1 0 dt; 0 0 1 0; 0 0 0 1]
w noise
with x_e, y_e, vx_e, and vy_e the estimated values of the state
B is removed because u is 0. And then you have
y = Cx + v
with C = [1 0 0 0 ; 0 1 0 0 ; 0 0 1 0 ; 0 0 0 1]
with y = [x + dt*vx ; y + dt*vy ; vx ; vy], where vx and vy are the measured velocities and x and y are the position computed from those measured velocities.
It is very similar to the example you will find here on Wikipedia
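A minimal numpy sketch of this setup (dt, the noise covariances Q and R, and the cosine-like velocity signal below are placeholders you would replace with your real values):

import numpy as np

dt = 0.1

# State: [x, y, vx, vy]
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
C = np.eye(4)                # observe dead-reckoned position and measured velocity
Q = 0.01 * np.eye(4)         # process noise covariance (placeholder)
R = 0.10 * np.eye(4)         # measurement noise covariance (placeholder)

x_est = np.zeros(4)          # state estimate [x, y, vx, vy]
P = np.eye(4)                # estimate covariance

def kalman_step(x_est, P, y):
    # Predict
    x_pred = A @ x_est
    P_pred = A @ P @ A.T + Q
    # Update
    S = C @ P_pred @ C.T + R
    K = P_pred @ C.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (y - C @ x_pred)
    P_new = (np.eye(4) - K @ C) @ P_pred
    return x_new, P_new

# Fake cosine-like velocity measurements, as described in the question
measured_v = [(np.cos(0.1 * k), np.sin(0.1 * k)) for k in range(100)]

pos_dr = np.zeros(2)         # position dead-reckoned from the measured velocities
for vx, vy in measured_v:
    pos_dr += dt * np.array([vx, vy])
    y = np.array([pos_dr[0], pos_dr[1], vx, vy])   # y = [x + dt*vx, y + dt*vy, vx, vy]
    x_est, P = kalman_step(x_est, P, y)

print(x_est)   # filtered [x, y, vx, vy]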

Create a dictionary of subcubes from larger cube in Python

I am examining every contiguous 8 x 8 x 8 cube within a 50 x 50 x 50 cube. I am trying to create a collection (in this case a dictionary) of the subcubes that contain the same sum and a count of how many subcubes share that same sum. So in essence, the result would look something like this:
{key = sum, value = number of cubes that have the same sum}
{256 : 3, 119 : 2, ...}
So in this example, there are 3 cubes that sum to 256 and 2 cubes that sum to 119, etc. Here is the code I have thus far, but it only sums (at least I think it does):
an_array = np.array([i for i in range(512)])  # 8*8*8 values so the reshape below works
cube = np.reshape(an_array, (8, 8, 8))
c_size = 8  # cube size
sum = 0
idx = None
for i in range(cube.shape[0] - c_size + 1):
    for j in range(cube.shape[1] - c_size + 1):
        for k in range(cube.shape[2] - c_size + 1):
            cube_sum = np.sum(cube[i:i + c_size, j:j + c_size, k:k + c_size])
            new_list = {cube_sum : ?}
What I am trying to do is iterate over the sub-cubes within the cube, sum each sub-cube, and then count the sub-cubes that share the same sum. Any ideas would be appreciated.
import numpy as np
from collections import defaultdict

an_array = np.array([i for i in range(512)])  # 8*8*8 values so the reshape works
cube = np.reshape(an_array, (8, 8, 8))
c_size = 8  # cube size

result = defaultdict(int)
for i in range(cube.shape[0] - c_size + 1):
    for j in range(cube.shape[1] - c_size + 1):
        for k in range(cube.shape[2] - c_size + 1):
            cube_sum = np.sum(cube[i:i + c_size, j:j + c_size, k:k + c_size])
            result[cube_sum] += 1
Explanation
The defaultdict(int) can be read like result.get(key, 0): if a key doesn't exist yet, it is initialized to 0. So the line result[cube_sum] += 1 either sets the count for a new cube_sum to 1 or adds 1 to its current count.
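An equivalent sketch using collections.Counter over a generator of window sums (same sliding-window logic, just a different container; the cube here is a made-up 50 x 50 x 50 example):

from collections import Counter

import numpy as np

cube = np.arange(50 ** 3).reshape((50, 50, 50))   # stand-in for the real data
c_size = 8

result = Counter(
    int(np.sum(cube[i:i + c_size, j:j + c_size, k:k + c_size]))
    for i in range(cube.shape[0] - c_size + 1)
    for j in range(cube.shape[1] - c_size + 1)
    for k in range(cube.shape[2] - c_size + 1)
)
# result maps each sub-cube sum to the number of 8x8x8 sub-cubes sharing that sum
print(len(result), max(result.values()))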

Spark: Running Backwards Elimination By P-Value With Linear Regressions

I presently have a Spark Dataframe with 2 columns:
1) a column where each row contains a vector of predictive features
2) a column containing the value to be predicted.
To discern the most predictive features for use in a later model, I am using backwards elimination by P-value, as outlined by this article. Below is my code:
num_vars = scoresDf.select("filtered_features").take(1)[0][0].__len__()
for i in range(0, num_vars):
    model = LinearRegression(featuresCol="filtered_features", labelCol="averageScore")
    model = model.fit(scoresDf)
    p_values = model.summary.pValues
    max_p = np.max(p_values)
    if max_p > 0.05:
        max_index = p_values.index(max_p)
        drop_max_index_udf = udf(lambda elem, drop_index, var_count:
                                 Vectors.dense([elem[j] for j in range(var_count) if j not in [drop_index]]),
                                 VectorUDT())
        scoresDfs = scoresDf.withColumn("filtered_features",
                                        drop_max_index_udf(scoresDf["filtered_features"],
                                                           lit(max_index), lit(num_vars)))
    num_vars = scoresDf.select("filtered_features").take(1)[0][0].__len__()
The code runs, but the only problem is that every iteration takes drastically longer than the last. Based on the answer to this question, it appears that the code is re-evaluating all prior iterations every time.
Ideally, I would like to feed the entire logic into some Pipeline structure that would store it all lazily and then execute sequentially with no repeats when called upon, but I am unsure as to whether that is even possible given that none of Spark's estimator / transformer functions seem to fit this use case.
Any guidance would be appreciated, thanks!
You are creating and fitting the model repeatedly inside a loop. That is a time-consuming process and only needs to be done once per training data set and set of parameters. Try the following:
num_vars = scoresDf.select("filtered_features").take(1)[0][0].__len__()
modelAlgo = LinearRegression(featuresCol="filtered_features", labelCol="averageScore")
model = modelAlgo.fit(scoresDf)

for i in range(0, num_vars):
    p_values = model.summary.pValues
    max_p = np.max(p_values)
    if max_p > 0.05:
        max_index = p_values.index(max_p)
        drop_max_index_udf = udf(lambda elem, drop_index, var_count:
                                 Vectors.dense([elem[j] for j in range(var_count) if j not in [drop_index]]),
                                 VectorUDT())
        scoresDfs = scoresDf.withColumn("filtered_features",
                                        drop_max_index_udf(scoresDf["filtered_features"],
                                                           lit(max_index), lit(num_vars)))
    num_vars = scoresDf.select("filtered_features").take(1)[0][0].__len__()
Once you are happy with the model, you save it. When you need to evaluate new data, just load this model back and predict with it.
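For the saving step, a sketch of how that could look (the path is just an example):

from pyspark.ml.regression import LinearRegressionModel

# Persist the fitted model once you are satisfied with it
model.write().overwrite().save("/tmp/lr_model")

# Later: load it again and predict on new data
reloaded = LinearRegressionModel.load("/tmp/lr_model")
predictions = reloaded.transform(scoresDf)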
Why are you doing
model = model.fit(scoresDf)
when scoresDfs contains your new DataFrame with one fewer independent variable?
If you change your code with the following:
independent_vars = ['x0', 'x1', 'x2', 'x3', 'x4']

def remove_element(array, index):
    return Vectors.dense(np.delete(array, index, 0))

remove_element_udf = udf(lambda a, i: remove_element(a, i), VectorUDT())

max_p = 1
i = 0
while (max_p > 0.05):
    model = LinearRegression(featuresCol="filtered_features",
                             labelCol="averageScore",
                             fitIntercept=False)
    model = model.fit(scoresDf)
    print('iteration: ', i)

    summary = model.summary
    summary_df = pd.DataFrame({
        'var': independent_vars,
        'coeff': model.coefficients,
        'se': summary.coefficientStandardErrors,
        'p_value': summary.pValues
    })
    print(summary_df)
    print("r2: %f" % summary.r2)

    p_values = summary.pValues
    max_p = np.max(p_values)
    if max_p > 0.05:
        max_index = p_values.index(max_p)
        max_var = independent_vars[max_index]
        print('-> max_index {max_index}, corresponding to var {var}'.format(max_index=max_index, var=max_var))
        scoresDf = scoresDf.withColumn("filtered_features",
                                       remove_element_udf(scoresDf["filtered_features"], lit(max_index)))
        independent_vars = np.delete(independent_vars, max_index, 0)

    print()
    i += 1
you will get
iteration: 0
var coeff se p_value
0 x0 0.174697 0.207794 0.402616
1 x1 -0.448982 0.203421 0.029712
2 x2 -0.452940 0.233972 0.055856
3 x3 -3.213578 0.209935 0.000000
4 x4 3.790730 0.212917 0.000000
r2: 0.870330
-> max_index 0, corresponding to var x0
iteration: 1
var coeff se p_value
0 x1 -0.431835 0.202087 0.035150
1 x2 -0.460711 0.233432 0.051297
2 x3 -3.218725 0.209525 0.000000
3 x4 3.768661 0.210970 0.000000
r2: 0.869365
-> max_index 1, corresponding to var x2
iteration: 2
var coeff se p_value
0 x1 -0.479803 0.203592 0.020449
1 x3 -3.344830 0.202501 0.000000
2 x4 3.669419 0.207925 0.000000
r2: 0.864065
In the first and second iterations, the two independent variables with a p-value greater than 0.05 (x0 and x2) are removed.

Recursion function in Python

I am trying to understand this recursion function. I would like to know, step by step, how the answer is produced.
def tri_recursion(k):
    if k > 0:
        result = k + tri_recursion(k - 1)
        print(result)
    else:
        result = 0
    return result

print("\n\nRecursion Example Results")
tri_recursion(6)
The results are below; I just want to know how they come about:
1
3
6
10
15
21
The function computes the sum of all numbers between 0 and k, and prints the intermediate results. The first 1 is 0+1, the 3 is 0+1+2, 6 is 0+1+2+3, 10 is 0+1+2+3+4, and so on.
To understand a recursive function, you need two things: how the recursive call is made, and when the recursion stops.
The recursive call is given by result = k + tri_recursion(k-1), and the recursion stops when k <= 0, returning 0. So if we assume only non-negative numbers, we can describe tri_recursion like this:
tri_recursion(k) = k + tri_recursion(k-1) if k > 0
tri_recursion(0) = 0
So tri_recursion(k) = k + tri_recursion(k-1) = k + (k-1) + tri_recursion(k-2) = k + (k-1) + (k-2) + tri_recursion(k-3) ... = k + (k-1) + (k-2) + ... + 0
So tri_recursion(k) is the sum of all numbers between 0 and k.
Note that the sum of all numbers between 0 and k equals k*(k+1)/2, so tri_recursion(6) = 6 * 7 / 2 = 21.
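If it helps, here is a small sketch of the same function with the call depth printed, which makes it visible that the recursion first descends all the way to tri_recursion(0) and the prints happen on the way back up (hence the ascending order):

def tri_recursion_traced(k, depth=0):
    indent = "  " * depth
    print(f"{indent}call tri_recursion({k})")
    if k > 0:
        result = k + tri_recursion_traced(k - 1, depth + 1)
    else:
        result = 0
    print(f"{indent}tri_recursion({k}) returns {result}")
    return result

tri_recursion_traced(6)
# The deepest call, tri_recursion(0), returns first (0); then 1, 3, 6, 10, 15, 21
# bubble back up one level at a time.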

How to filter a numpy array based on two conditions: one depending on the other?

I am using the cv2 library to detect key points in 2 stereo images and converted the resulting DMatch objects to a numpy array:
kp_left, des_left = sift.detectAndCompute(im_left, mask_left)
matches = bf.match(des_left, des_right) # according to assignment pdf
np_matches = dmatch2np(matches)
Then I want to filter the matches on the y-direction: the matched key points should not differ by more than 3 pixels:
ind = np.where(np.abs(kp_left[np_matches[:, 0], 1] - kp_right[np_matches[:, 1], 1]) < 4)
AND the x-difference of those key points should also not be smaller than 0, because that would mean the key point is behind the camera.
ind = np.where((kp_left[np_matches[ind[0], 0], 0] - kp_right[np_matches[ind[0], 1], 0]) >= 0)
How to combine those 2 conditions?
The general form is this:
condition1 = x < 4
condition2 = y >= 100
result = np.where(condition1 & condition2)
The even more general form:
conditions = [...] # list of bool arrays
result = np.where(np.logical_and.reduce(conditions))
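Applied to the arrays from the question, that would look roughly like this (a sketch assuming kp_left, kp_right and np_matches are the arrays you already built, with key point coordinates stored as (x, y)):

import numpy as np

# y-coordinates of matched key points must not differ by more than 3 pixels
cond_y = np.abs(kp_left[np_matches[:, 0], 1] - kp_right[np_matches[:, 1], 1]) < 4

# x-disparity must not be negative (a negative value would put the point behind the camera)
cond_x = (kp_left[np_matches[:, 0], 0] - kp_right[np_matches[:, 1], 0]) >= 0

ind = np.where(cond_y & cond_x)
filtered_matches = np_matches[ind]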
