How to use Word2Vec CBOW in a statistical algorithm? - nlp

I have seen a few examples of using CBOW in neural-network models (although I did not understand them).
I know that Word2Vec is not like BOW or TFIDF, as there is no single value per word with CBOW,
and all the examples I saw were using a neural network.
I have 2 questions:
1- Can we convert the vector to a single value and put it in a dataframe so we can use it in a Logistic Regression model?
2- Is there any simple code for using CBOW with Logistic Regression?
More explanation:
In my case, I have a corpus and I want to compare the top features from BOW and CBOW.
After converting to BOW,
I get this dataset:
RepID Label Cat Dog Snake Rabbit Apple Orange ...
1 1 5 3 8 2 0
2 0 1 0 0 6 9
3 1 4 1 5 1 7
After converting to TFIDF,
I get this dataset:
RepID Label Cat Dog Snake Rabbit Apple Orange ...
1 1 0.38 0.42 0.02 0.22 0.00 0.19
2 0 0.75 0.20 0.08 0.12 0.37 0.21
3 1 0.17 0.84 0.88 0.11 0.07 0.44
I am looking at the top 3 features in each model,
so my dataset becomes like this.
BOW (I put null here for the values that will be omitted):
RepID Label Cat Dog Snake Rabbit Apple Orange ...
1 1 5 null 8 null null 7
2 0 null null null 6 9 2
3 1 4 null 5 null 7 null
TFIDF (I put null here for the values that will be omitted)
RepID Label Cat Dog Snake Rabbit Apple Orange ...
1 1 0.38 0.42 null 0.22 null null
2 0 0.75 null null null 0.37 0.21
3 1 null 0.84 0.88 null null 0.44
Now I want to do the same with Word2Vec CBOW.
I want to take the highest values in the CBOW model
RepID Label Cat Dog Snake Rabbit Apple Orange ...
1 1 v11 v12 v13 v14 v15 v16
2 0 v21 v22 v23 v24 v25 v26
3 1 v31 v32 v33 v34 v35 v36
to be like this
RepID Label Cat Dog Snake Rabbit Apple Orange ...
1 1 v11 null v13 null v15 null
2 0 null null v23 null v25 v26
3 1 v31 null v33 v34 null null

No matter the internal training method, CBOW or skip-gram, a word-vector is always a multidimensional vector: it contains many floating-point numbers.
So at one level, that is one "value" - where the "value" is a vector. But it's never a single number.
Word-vectors, even with all their dimensions, can absolutely serve as inputs for a downstream logistic regression task. But the exact particulars depend on what data you're operating on and what you intend to achieve - so you may want to expand your question, or ask a more specific follow-up, with more info about the specific data/task you're considering.
Note also: this is done more often with the pipeline of a library like scikit-learn. Putting dense high-dimensional word-vectors themselves (or other features derived from word-vectors) directly into "dataframes" is often a mistake, adding overhead & indirection compared to working with such large feature-vectors in their more compact/raw format of (say) numpy arrays.
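For example, one common recipe (not the only one) is to average each document's word-vectors into a single fixed-length feature vector and feed those arrays straight to scikit-learn. This is only a minimal sketch, assuming gensim 4.x and tiny placeholder docs/labels rather than your real corpus:

import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# placeholder tokenised documents and labels - substitute your own corpus
docs = [["cat", "dog", "snake"], ["apple", "orange"], ["cat", "rabbit", "apple"]]
labels = [1, 0, 1]

# sg=0 selects CBOW training; vector_size is the dimensionality of each word-vector
w2v = Word2Vec(docs, vector_size=50, window=3, min_count=1, sg=0)

def doc_vector(tokens, model):
    # mean of the word-vectors for the tokens the model knows about
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([doc_vector(d, w2v) for d in docs])   # shape: (n_docs, 50)
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))

Note that each document still contributes 50 numbers, not one - the averaging collapses many word-vectors into one document-vector, never into a single scalar.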

Related

In Python: How to convert 1/8th of space to 1/6th of space?

I have a dataframe at store-product level, as shown in the sample below:
Store Product Space Min Max Total_table Carton_Size
11 Apple 0.25 0.0625 0.75 2 6
11 Orange 0.5 0.125 0.5 2 null
11 Tomato 0.75 0.0625 0.75 2 6
11 Potato 0.375 0.0625 0.75 2 6
11 Melon 0.125 0.0625 0.5 2 null
Scenario: All products here have space in terms of 1/8ths. But if a product has a Carton_Size other than null, then that product's space has to be converted into terms of 1/(Carton_Size)ths, considering the Min (Space shouldn't be less than Min) and Max (Space shouldn't be greater than Max) values. Space can be taken from non-carton products, but in the end the sum of the 'Space' column should be equal to or less than the 'Total_table' value. Also, these 1/8th and 1/6th values are in relation to 'Total_table'; the Total_table value is split as Space across the products.
Example: In the dataframe above, three products have a carton size, so we can take 1/8th of space from a non-carton product (selecting from the top) and split it into 1/24ths (1/24 + 1/24 + 1/24 = 1/8), which can be added to the three carton products to bring them towards 1/6ths; this forms the expected output shown below, considering the Min and Max values. If a product doesn't satisfy the Min or Max condition, leave that product as is (e.g. Tomato).
Roughly Expected Output:
Store Product Space Min Max Total_table Carton_Size
11 Apple 0.292 0.0625 0.75 2 6
11 Orange 0.375 0.125 0.5 2 null
11 Tomato 0.75 0.0625 0.75 2 6
11 Potato 0.417 0.0625 0.75 2 6
11 Melon 0.125 0.0625 0.5 2 null
Need solution in Python.
Thanks in Advance!
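One possible reading of the rule, sketched with pandas, hard-coding the 1/8 grid and the sample rows from the question. It simply mirrors the worked example - take 1/8 from the first non-carton product that can spare it (respecting Min), split it into equal shares, and give a share to every carton product whose Max allows it - so treat it as a starting point rather than a general solution:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Store': [11] * 5,
    'Product': ['Apple', 'Orange', 'Tomato', 'Potato', 'Melon'],
    'Space': [0.25, 0.5, 0.75, 0.375, 0.125],
    'Min': [0.0625, 0.125, 0.0625, 0.0625, 0.0625],
    'Max': [0.75, 0.5, 0.75, 0.75, 0.5],
    'Total_table': [2] * 5,
    'Carton_Size': [6, np.nan, 6, 6, np.nan],
})

carton = df['Carton_Size'].notna()
share = (1 / 8) / carton.sum()        # e.g. 1/24 when there are three carton products

# donor: first non-carton product (from the top) that stays >= Min after giving up 1/8
donors = df[~carton & (df['Space'] - 1 / 8 >= df['Min'])]
if not donors.empty:
    df.loc[donors.index[0], 'Space'] -= 1 / 8
    # receivers: carton products that stay <= Max after the extra share
    receivers = carton & (df['Space'] + share <= df['Max'])
    df.loc[receivers, 'Space'] += share

print(df[['Product', 'Space']].round(3))
# reproduces the rough expected output above: 0.292, 0.375, 0.75, 0.417, 0.125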

Which statsmodels ANOVA model for within- and between-subjects design?

I have a classic ANOVA design: two experimental conditions with two levels each; one participant answers on two of the four resulting conditions. A sample of my data looks like this:
participant_ID Condition_1 Condition_2 dependent_var
1 1 1 0.71
1 2 1 0.43
2 1 1 0.77
2 2 1 0.37
3 1 1 0.58
3 2 1 0.69
4 2 1 0.72
4 1 1 0.12
26 2 2 0.91
26 1 2 0.53
27 1 2 0.29
27 2 2 0.39
28 2 2 0.75
28 1 2 0.51
29 1 2 0.42
29 2 2 0.31
Using statsmodels, I wish to identify the effects of both conditions on the dependent variable, allowing for the fact that each participant answers twice and that there may be interactions. My expectation would be that I would use the repeated-measures ANOVA option as follows:
from statsmodels.stats.anova import AnovaRM
aovrm = AnovaRM(data, 'dependent_var', 'participant_ID',
                within=['Condition_1'], between=['Condition_2'],
                aggregate_func='mean').fit()
However, when I do this, I get the following error:
NotImplementedError: Between subject effect not yet supported!
Does anyone know of a workaround for this that doesn't involve learning R? My instinct would be to try a mixed linear model, but I don't know how to account for the fact that each participant answered twice.
Apologies if this turns out to really be a Cross Validated question!
You could try out the pingouin package: https://pingouin-stats.org/index.html
It seems to cover mixed ANOVAs, which are not yet fully implemented in statsmodels.
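For completeness, a minimal sketch with pingouin's mixed_anova, assuming the long-format dataframe shown in the question is loaded as data with those exact column names:

import pingouin as pg

# within = the repeated (within-subject) factor, between = the grouping factor
aov = pg.mixed_anova(data=data, dv='dependent_var',
                     within='Condition_1', subject='participant_ID',
                     between='Condition_2')
print(aov)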

Reading data with a varying-length header

I want to read, in Python, a file which contains a varying-length header and then extract into a dataframe/series the variables which come after the header.
The data looks like :
....................................................................
Data coverage and measurement duty cycle:
When the instrument duty cycle is not in measure mode (i.e. in-flight
calibrations) the data is not given here (error flag = 2).
The measurements have been found to exhibit a strong sensitivity to cabin
pressure.
Consequently the instrument requires calibrated at each new cabin
pressure/altitude.
Data taken at cabin pressures for which no calibration was performed is
not given here (error flag = 2).
Measurement sensivity to large roll angles was also observed.
Data corresponding to roll angles greater than 10 degrees is not given
here (error flag = 2)
......................................................................
High Std: TBD ppb
Target Std: TBD ppb
Zero Std: 0 ppb
Mole fraction error flag description :
0 : Valid data
2 : Missing data
31636 0.69 0
31637 0.66 0
31638 0.62 0
31639 0.64 0
31640 0.71 0
.....
.....
So what I want is to extract the data as :
Time C2H6 Flag
0 31636 0.69 0 NaN
1 31637 0.66 0 NaN
2 31638 0.62 0 NaN
3 31639 0.64 0 NaN
4 31640 0.71 0 NaN
5 31641 0.79 0 NaN
6 31642 0.85 0 NaN
7 31643 0.81 0 NaN
8 31644 0.79 0 NaN
9 31645 0.85 0 NaN
I can do that with
infile="/nfs/potts.jasmin-north/scratch/earic/AEOG/data/mantildas_faam_20180911_r1_c118.na"
flightdata = pd.read_fwf(infile, skiprows=53, header=None, names=['Time', 'C2H6', 'Flag'],)
but I'm skipping approximately 53 rows because I counted how many I should skip. I have a bunch of these files and some don't have exactly 53 header rows, so I was wondering what the best way to deal with this would be, and what criterion would let Python always read only the three columns of data once it finds them. For example, suppose I want Python to start reading the data from where it encounters
Mole fraction error flag description :
0 : Valid data
2 : Missing data
what should I do? Or is there another criterion that would work better?
You can split on the header delimiter, like so:
import io

import pandas as pd

with open(filename, 'r') as f:
    myfile = f.read()

# keep everything after the header delimiter
infile = myfile.split('Mole fraction error flag description :')[-1]
# drop the remaining flag-description lines; you may know a better
# indicator of a line with incorrect format for your data
infile = '\n'.join(line for line in infile.split('\n') if ' : ' not in line)
# create the dataframe from the in-memory text
flightdata = pd.read_fwf(io.StringIO(infile), header=None, names=['Time', 'C2H6', 'Flag'])
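Another option (only a sketch, and it assumes the data block is the first place where a line consists of exactly three numeric fields) is to detect where the numbers start and compute skiprows from that. Here infile_path stands for the raw file path (what the question assigned to infile):

import pandas as pd

def first_data_row(path):
    # index of the first line whose three whitespace-separated fields all parse as numbers
    with open(path) as f:
        for i, line in enumerate(f):
            parts = line.split()
            if len(parts) == 3 and all(p.lstrip('-').replace('.', '', 1).isdigit() for p in parts):
                return i
    raise ValueError('no data rows found in ' + path)

# infile_path is the path to the raw file, e.g. the .na file from the question
flightdata = pd.read_fwf(infile_path, skiprows=first_data_row(infile_path), header=None,
                         names=['Time', 'C2H6', 'Flag'])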

How can you identify the best companies for each variable and copy the cases?

I want to compare the means of subgroups. The cases of the subgroups with the lowest and the highest means should be copied and appended to the end of the dataset:
Input
df.head(10)
Outcome
Company Satisfaction Image Forecast Contact
0 Blue 2 3 3 1
1 Blue 2 1 3 2
2 Yellow 4 3 3 3
3 Yellow 3 4 3 2
4 Yellow 4 2 1 5
5 Blue 1 5 1 2
6 Blue 4 2 4 3
7 Yellow 5 4 1 5
8 Red 3 1 2 2
9 Red 1 1 1 2
I have around 100 cases in my sample. Now I look at the means for each company.
Input
df.groupby(['Company']).mean()
Outcome
Satisfaction Image Forecast Contact
Company
Blue 2.666667 2.583333 2.916667 2.750000
Green 3.095238 3.095238 3.476190 3.142857
Orange 3.125000 2.916667 3.416667 2.625000
Red 3.066667 2.800000 2.866667 3.066667
Yellow 3.857143 3.142857 3.000000 2.714286
So for Satisfaction, Yellow got the best and Blue the worst value. I want to copy the cases of Yellow and Blue and add them to the dataset, but now with the new labels "Best" and "Worst". I don't want to just rename them, and I want to iterate over the dataset and do this for other columns, too (for example Image). Is there a solution for this? After adding the cases I want an output like this:
Input
df.groupby(['Company']).mean()
Expected Outcome
Satisfaction Image Forecast Contact
Company
Blue 2.666667 2.583333 2.916667 2.750000
Green 3.095238 3.095238 3.476190 3.142857
Orange 3.125000 2.916667 3.416667 2.625000
Red 3.066667 2.800000 2.866667 3.066667
Yellow 3.857143 3.142857 3.000000 2.714286
Best 3.857143 3.142857 3.000000 3.142857
Worst 2.666667 2.583333 2.866667 2.625000
But as I said, it is really important that the companies with the best and worst values for each column are added again and not just renamed, because I want to do further data processing with other software.
************************UPDATE****************************
I found out how to copy the correct cases:
Input
df2 = df.loc[df['Company'] == 'Yellow']
df2 = df2.replace('Yellow','Best')
df2 = df2[['Company','Satisfaction']]
new = [df,df2]
result = pd.concat(new)
result
Output
Company Contact Forecast Image Satisfaction
0 Blue 1.0 3.0 3.0 2
1 Blue 2.0 3.0 1.0 2
2 Yellow 3.0 3.0 3.0 4
3 Yellow 2.0 3.0 4.0 3
..........................................
87 Best NaN NaN NaN 3
90 Best NaN NaN NaN 4
99 Best NaN NaN NaN 1
111 Best NaN NaN NaN 2
Now I want to copy the cases of the company with the best values for the other variables, too. But I have to identify manually which company is best for each category. Isn't there a more convenient solution?
I have a solution. First I create a list of the variables for which I want to create the dummy "Best" and "Worst" companies:
variables = ['Contact','Forecast','Satisfaction','Image']
Then I loop over these columns, adding the cases again with the new label "Best" or "Worst":
for col in variables:
    # mean of each company for the current column
    means = df.groupby(['Company'], as_index=False)[col].mean()
    best = means['Company'].loc[means[col].idxmax()]
    worst = means['Company'].loc[means[col].idxmin()]
    # copy the best/worst company's cases under the new labels
    df_best = df.loc[df['Company'] == best].replace(best, 'Best')
    df_worst = df.loc[df['Company'] == worst].replace(worst, 'Worst')
    # keep only the company label and the current column
    df_best = df_best[['Company', col]]
    df_worst = df_worst[['Company', col]]
    df = pd.concat([df, df_best, df_worst])
Thanks guys :)

Effective reasonable indexing for numeric vector search?

I have a long numeric table where 7 columns are a key and 4 columns are the value to find.
Actually I have rendered an object at different distances and perspective angles and have calculated Hu moments for its contour. But this is not important to the question, just an example to picture it.
So, when I have 7 values, I need to scan the table, find the closest values in those 7 columns and extract the corresponding 4 values.
So, the aspects of the task to consider are as follows:
1) the numbers have errors
2) the scale in the function domain is not the same as the scale in the function value; i.e. the "distance" from a point in 7-dimensional space should depend on how it affects those 4 values
3) the search should be fast
So the question is: is there some algorithm out there to solve this task efficiently, i.e. perform some indexing on those 7 columns, but not the way conventional databases do it, taking the points above into account?
If I understand the problem correctly, you might consider using scipy.cluster.vq (vector quantization):
Suppose your 7 numeric columns look like this (let's call the array code_book):
import scipy.cluster.vq as vq
import scipy.spatial as spatial
import numpy as np
np.random.seed(2013)
np.set_printoptions(precision=2)
code_book = np.random.random((3,7))
print(code_book)
# [[ 0.68 0.96 0.27 0.6 0.63 0.24 0.7 ]
# [ 0.84 0.6 0.59 0.87 0.7 0.08 0.33]
# [ 0.08 0.17 0.67 0.43 0.52 0.79 0.11]]
Suppose the associated 4 columns of values look like this:
values = np.arange(12).reshape(3,4)
print(values)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
And finally, suppose we have some "observations" of 7-column values like this:
observations = np.random.random((5,7))
print(observations)
# [[ 0.49 0.39 0.41 0.49 0.9 0.89 0.1 ]
# [ 0.27 0.96 0.16 0.17 0.72 0.43 0.64]
# [ 0.93 0.54 0.99 0.62 0.63 0.81 0.36]
# [ 0.17 0.45 0.84 0.02 0.95 0.51 0.26]
# [ 0.51 0.8 0.2 0.9 0.41 0.34 0.36]]
To find the 7-valued row in code_book which is closest to each observation, you could use vq.vq:
index, dist = vq.vq(observations, code_book)
print(index)
# [2 0 1 2 0]
The index values refer to rows in code_book. However, if the rows in values are ordered the same way as code_book, we can "lookup" the associated value with values[index]:
print(values[index])
# [[ 8 9 10 11]
# [ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]
# [ 0 1 2 3]]
The above assumes you have all your observations arranged in an array. Thus, to find all the indices you need only one call to vq.vq.
However, if you obtain the observations one at a time and need to find the closest row in code_book before going on to the next observation, then it would be inefficient to call vq.vq each time. Instead, generate a KDTree once, and then find the nearest neighbor(s) in the tree:
tree = spatial.KDTree(code_book)
for observation in observations:
    distances, indices = tree.query(observation)
    print(indices)
# 2
# 0
# 1
# 2
# 0
Note that the number of points in your code_book (N) must be large compared to the dimension of the data (e.g. N >> 2**7) for the KDTree to be fast compared to simple exhaustive search.
Using vq.vq or KDTree.query may or may not be faster than exhaustive search, depending on the size of your data (code_book and observations). To find out which is faster, be sure to benchmark these versus an exhaustive search using timeit.
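For example, a rough timing comparison using the small arrays defined above (with real-sized data the relative results could differ substantially):

import timeit

def brute_force():
    # exhaustive search: squared Euclidean distance from every observation to every code_book row
    d = ((observations[:, None, :] - code_book[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

print(timeit.timeit(brute_force, number=1000))
print(timeit.timeit(lambda: vq.vq(observations, code_book), number=1000))
print(timeit.timeit(lambda: tree.query(observations), number=1000))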
I don't know if I understood your question well, but I will try to give an answer.
For each row K in the table, compute the distance of your key from the key in that row:
( (X1-K1)^2 + (X2-K2)^2 + (X3-K3)^2 + (X4-K4)^2 + (X5-K5)^2 + (X6-K6)^2 + (X7-K7)^2 )^0.5
where {X1,X2,X3,X4,X5,X6,X7} is the key and {K1,K2,K3,K4,K5,K6,K7} is the key at row K.
You could make one factor of the key more or less relevant than the others by multiplying it while computing the distance; for example, you could replace (X1-K1)^2 in the formula above with 5*(X1-K1)^2 to make it more influential.
Store the distance in one variable and the row number in a second variable.
Do the same with the following rows, and if the new distance is lower than the one you stored, replace the stored distance and row number.
When you have checked all the rows in your table, the second variable will hold the row nearest to the key.
Here is a rough Python version of that linear scan:
import math

def distance(key, row_key):
    # plain Euclidean distance between the query key and a row's 7-value key
    return math.sqrt(sum((x - k) ** 2 for x, k in zip(key, row_key)))

key = [0.0] * 7              # suppose it is already filled with some values
closest_distance = float('inf')
closest_row = 0

# table holds rows of 7 key columns followed by 4 value columns
for row in range(len(table)):
    new_distance = distance(key, table[row][0:7])
    if new_distance < closest_distance:
        closest_distance = new_distance
        closest_row = row

value_found = table[closest_row][7:11]  # this should be the value you were looking for
I know it isn't fast, but it is the best I could do; hope it helped.
P.S. I haven't considered measurement errors, I know.
