I have a problem understanding sklearn's TfidfVectorizer results

I have a problem understanding sklearn's TfidfVectorizer results - python-3.x

Given a corpus of 3 documents, for example:
sentences = ["This car is fast",
"This car is pretty",
"Very fast truck"]
I am executing by hand the calculation of tf-idf.
For document 1, and the word "car", I can find that:
TF = 1/4
IDF = log(3/2)
TF-IDF = 1/4 * log(3/2)
Same result should apply to document 2, since it has 4 words, and one of them is "car".
I have tried to apply this in sklearn, with the code below:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
data = {'text': sentences}
df = pd.DataFrame(data)
tv = TfidfVectorizer()
tfvector = tv.fit_transform(df.text)
print(pd.DataFrame(tfvector.toarray(), columns=tv.get_feature_names()))
And the result I get is:
car fast is pretty this truck very
0 0.500000 0.50000 0.500000 0.000000 0.500000 0.000000 0.000000
1 0.459854 0.00000 0.459854 0.604652 0.459854 0.000000 0.000000
2 0.000000 0.47363 0.000000 0.000000 0.000000 0.622766 0.622766
I can understand that sklearn uses L2 normalization, but still, shouldn't the tf-idf score of "car" in the first two documents be the same? Can anyone help me understanding the results?

It is because of the normalization. If you add the parameter norm=None to the TfIdfVectorizer(norm=None), you will get the following result, which has the same value for car
car fast is pretty this truck very
0 1.287682 1.287682 1.287682 0.000000 1.287682 0.000000 0.000000
1 1.287682 0.000000 1.287682 1.693147 1.287682 0.000000 0.000000
2 0.000000 1.287682 0.000000 0.000000 0.000000 1.693147 1.693147

Related

Pandas: mask dataframe by a rolling window

I have a dataframe df_snow_or_ice which indicates whether there is snow or not in a certain day as following:
df_snow_or_ice
Out[63]:
SWE
datetime_doy
2007-01-01 0.000000
2007-01-02 0.000000
2007-01-03 0.000000
2007-01-04 0.000000
2007-01-05 0.000000
...
2019-12-27 0.000000
2019-12-28 0.000000
2019-12-29 0.000000
2019-12-30 0.000000
2019-12-31 0.000064
[4748 rows x 1 columns]
And I also have a dataframe gpi_data_tmp and want to mask it based on whether there is snow or not (whether df_snow_or_ice['SWE']>0) in a rolling window of 42 days. That is, if at day d, df_snow_or_ice.iloc[d-21:d+21]['SWE']>0 during the interval [d-21:d+21], then gpi_data_tmp.iloc[d] is masked as np.nan. If I wrote it in for-loop, it's like:
half_width = 21
for i in range(half_width,len(df_snow_or_ice)-half_width+1,1):
if df_snow_or_ice['SWE'].iloc[i] > 0 :
gpi_data_tmp.iloc[(i-half_width):(i+half_width)] = np.nan
for i in range(len(df_snow_or_ice)):
if df_snow_or_ice['SWE'].iloc[i] > 0 :
gpi_data_tmp.iloc[i] = np.nan
So how can I write it efficiently? by some functions of pandas? Thanks!

Assign different colors to polydata in paraview

Trying to avoid defining multiple individual polygons/quad, so I use polydata.
I need to define multiple polydata in a Matlab generated vtk file, but each one should be assigned a different color (defined in a lookup table).
The following code gives an error and accepts only the first color which it assigns to all polydata.
# vtk DataFile Version 5.1
vtk output
ASCII
DATASET POLYDATA
POINTS 12 float
0.500000 1.000000 0.000000
0.353553 1.000000 -0.353553
0.000000 1.000000 -0.500000
-0.353553 1.000000 -0.353553
-0.500000 1.000000 0.000000
-0.353553 1.000000 0.353553
0.000000 1.000000 0.500000
0.353553 1.000000 0.353553
0. 0. 0.
1. 1. 1.
2. 2. 2.
1. 2. 1.
POLYGONS 3 12
OFFSETS vtktypeint64
0 8 12
CONNECTIVITY vtktypeint64
0 1 2 3 4 5 6 7
9 10 11 12
CELL_DATA 2
SCALARS SMEARED float 1
LOOKUP_TABLE victor
0 1
LOOKUP_TABLE victor 1
1.000000 0.000000 0.000000 1.000000
0.000000 1.000000 0.000000 1.000000

LOOKUP_TABLE victor 1
This should be LOOKUP_TABLE victor 2, as you define 2 RGBA points in your table

How to add path to texture in OBJ or MTL file?

I have next problem:
My project consists of .obj file, .mtl file and texture(.jpg).
I need to divide texture into multiple files. But, when I do it, the UV coordinates (after mapping and reverse mapping) will be the same on several files, thus it cause error watching obj using meshlab.
How can I solve my problem ?

Meshlab does support files with several texture files, just by using a separate material for each texture. It is not clear if you are generating your obj files with meshlab or other program, so I'm not sure if this is a meshlab related question.
Here is a sample of a minimal multitexture .obj file (8 vertex, 4 triangles, 2 textures)
mtllib ./TextureDouble.obj.mtl
# 8 vertices, 8 vertices normals
vn 0.000000 0.000000 1.570796
v 0.000000 0.000000 0.000000
vn 0.000000 0.000000 1.570796
v 1.000000 0.000000 0.000000
vn 0.000000 0.000000 1.570796
v 1.000000 1.000000 0.000000
vn 0.000000 0.000000 1.570796
v 0.000000 1.000000 0.000000
vn 0.000000 0.000000 1.570796
v 2.000000 0.000000 0.000000
vn 0.000000 0.000000 1.570796
v 3.000000 0.000000 0.000000
vn 0.000000 0.000000 1.570796
v 3.000000 1.000000 0.000000
vn 0.000000 0.000000 1.570796
v 2.000000 1.000000 0.000000
# 4 coords texture
vt 0.000000 0.000000
vt 1.000000 0.000000
vt 1.000000 1.000000
vt 0.000000 1.000000
# 2 faces using material_0
usemtl material_0
f 1/1/1 2/2/2 3/3/3
f 1/1/1 3/3/3 4/4/4
# 4 coords texture
vt 0.000000 0.000000
vt 1.000000 0.000000
vt 1.000000 1.000000
vt 0.000000 1.000000
# 2 faces using material_1
usemtl material_1
f 5/5/5 6/6/6 7/7/7
f 5/5/5 7/7/7 8/8/8
And here is the TextureDouble.obj.mtl file. To test the files, you must provide 2 image files named TextureDouble_A.png and TextureDouble_B.png.
newmtl material_0
Ka 0.200000 0.200000 0.200000
Kd 1.000000 1.000000 1.000000
Ks 1.000000 1.000000 1.000000
Tr 1.000000
illum 2
Ns 0.000000
map_Kd TextureDouble_A.png
newmtl material_1
Ka 0.200000 0.200000 0.200000
Kd 1.000000 1.000000 1.000000
Ks 1.000000 1.000000 1.000000
Tr 1.000000
illum 2
Ns 0.000000
map_Kd TextureDouble_B.png

Compute the Euclidean distance using word counts

Consider the following two sentences.
Sentence 1: The quick brown fox jumps over the lazy dog.
Sentence 2: A quick brown dog outpaces a quick fox.
Compute the Euclidean distance using word counts.

You can use the package tm to find word counts and then compute the euclidean distance
> library(tm)
> s1 <- " The quick brown fox jumps over the lazy dog"
> s2 <- "A quick brown dog outpaces a quick fox"
>
> VS <- VectorSource(c(s1,s2))
> corp <- Corpus(VS)
> dtm <- DocumentTermMatrix(corp)
> d <- dist(t(dtm), method = 'euclidean')
> d
brown dog fox jumps lazy outpaces over quick
dog 0.000000
fox 0.000000 0.000000
jumps 1.000000 1.000000 1.000000
lazy 1.000000 1.000000 1.000000 0.000000
outpaces 1.000000 1.000000 1.000000 1.414214 1.414214
over 1.000000 1.000000 1.000000 0.000000 0.000000 1.414214
quick 1.000000 1.000000 1.000000 2.000000 2.000000 1.414214 2.000000
the 1.414214 1.414214 1.414214 1.000000 1.000000 2.236068 1.000000 2.236068

Plot 3 dimensional unequal length of data

I am new in gnuplot so excuse me if it looks simple. I have a data file like below and I want to draw a diagram like this:
t x = 0.00 0.20 0.40 0.60 0.80 1.00
0.00 0.000000 0.640000 0.960000 0.960000 0.640000 0.000000
0.02 0.000000 0.480000 0.800000 0.800000 0.480000 0.000000
0.04 0.000000 0.400000 0.640000 0.640000 0.400000 0.000000
0.06 0.000000 0.320000 0.520000 0.520000 0.320000 0.000000
0.08 0.000000 0.260000 0.420000 0.420000 0.260000 0.000000
0.10 0.000000 0.210000 0.340000 0.340000 0.210000 0.000000
0.12 0.000000 0.170000 0.275000 0.275000 0.170000 0.000000
0.14 0.000000 0.137500 0.222500 0.222500 0.137500 0.000000
0.16 0.000000 0.111250 0.180000 0.180000 0.111250 0.000000
0.18 0.000000 0.090000 0.145625 0.145625 0.090000 0.000000
0.20 0.000000 0.072813 0.117813 0.117813 0.072813 0.000000
GNU octave equivalent command is something like this:
mesh(tplot,xplot,ttplot);

Well as with many things, it is simple if you know how. This is straighforward to plot if you remove the x = and the t from the data file, e.g.:
0 0.00 0.20 0.40 0.60 0.80 1.00
0.00 0.000000 0.640000 0.960000 0.960000 0.640000 0.000000
0.02 0.000000 0.480000 0.800000 0.800000 0.480000 0.000000
0.04 0.000000 0.400000 0.640000 0.640000 0.400000 0.000000
0.06 0.000000 0.320000 0.520000 0.520000 0.320000 0.000000
0.08 0.000000 0.260000 0.420000 0.420000 0.260000 0.000000
0.10 0.000000 0.210000 0.340000 0.340000 0.210000 0.000000
0.12 0.000000 0.170000 0.275000 0.275000 0.170000 0.000000
0.14 0.000000 0.137500 0.222500 0.222500 0.137500 0.000000
0.16 0.000000 0.111250 0.180000 0.180000 0.111250 0.000000
0.18 0.000000 0.090000 0.145625 0.145625 0.090000 0.000000
0.20 0.000000 0.072813 0.117813 0.117813 0.072813 0.000000
Then the data can be interpreted as a "non-uniform" matrix, although it is uniform. This is useful as it reads the first row and first column correctly. See help matrix and help matrix nonuniform for more. For example:
echo 'splot "data" nonuniform matrix with lines' | gnuplot --persist
Gives me:
To make it similar to the output produced by the GNU Octave mesh command, do something like this:
set xlabel "x"
set ylabel "t"
set zlabel "u"
set view 20,210
set border 4095 lw 2
set hidden3d
set xyplane 0
set autoscale fix
set nokey
set notics
splot "data" nonuniform matrix lt -1 lw 2 with lines
Which results in:

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

I have a problem understanding sklearn's TfidfVectorizer results - python-3.x

Related

Pandas: mask dataframe by a rolling window

Assign different colors to polydata in paraview

How to add path to texture in OBJ or MTL file?

Compute the Euclidean distance using word counts

Plot 3 dimensional unequal length of data

Categories

Resources