P value from pairwise_cor using widyr - text

I am using the widyr package in R to perform pairwise correlations between clusters of words. I want to examine the correlation among clusters (restoration, recreation, ...), which indicates how often they appear together relative to how often they appear separately in my documents (social media text).
Everything worked fine for the correlations when following this tutorial (https://www.youtube.com/watch?v=mApnx5NJwQA) from 10:34 to 12:52:
# correlation of co-occurring words
correlatee <- data2 %>%
  pairwise_cor(word, line, sort = TRUE)
# A tibble: 72 x 3
item1 item2 correlation
<chr> <chr> <dbl>
1 physical recreation 0.321
2 recreation physical 0.321
3 restoration recreation 0.304
4 recreation restoration 0.304
5 physical restoration 0.283
6 restoration physical 0.283
7 affection aesthetics 0.240
8 aesthetics affection 0.240
9 restoration aesthetics 0.227
10 aesthetics restoration 0.227
# ... with 62 more rows
# ℹ Use `print(n = ...)` to see more rows
However, my question is: how can I get the p values of these correlations, either with pairwise_cor() or in some other way for this pairwise comparison?
There is an existing question on this, pairwise_cor() in R: p value?, but the code there is not working for me.
Thank you very much.

Related

How to merge back k-means clustering outputs to the corresponding units in a dataframe

I just wonder whether this strategy is the correct way to merge the k-means clustering outputs back into the corresponding units in the existing dataframe.
For example, I have a data set which includes user ID, age, income, and gender, and I want to run a k-means clustering algorithm to find a set of clusters where each cluster has similar users in terms of these characteristics (age, income, gender). Note that I disregard the differences in scale among the characteristics for brevity.
existing_dataframe
user_id age income gender
1 13 10 1 (female)
2 34 50 1
3 75 40 0 (male)
4 23 29 0
5 80 45 1
... ... ... ...
existing_dataframe_for_analysis
(Based on my understanding after referring to a number of online tutorials,
I should not include the user_id variable, so I use the dataframe below for the analysis;
please let me know if I am wrong)
age income gender
13 10 1 (female)
34 50 1
75 40 0 (male)
23 29 0
80 45 1
... ... ... ...
Assume that I found the optimal number of clusters for the dataset to be 3. So I set n_clusters to 3 and predict which cluster each user is assigned to using the code below.
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3,
               init='k-means++',
               max_iter=20,
               n_init=10)
model.fit(existing_dataframe_for_analysis)
predicted=model.predict(existing_dataframe_for_analysis)
print (predicted[:5])
The expected output is shown below:
[0 1 2 1 2]
If I run the code below, where I create a new column called 'cluster' that holds the analysis outputs and add this column to the existing dataframe, does it guarantee that the nth element of the output array corresponds to the nth observation (user_id) in the existing dataframe? Please advise.
existing_dataframe['cluster']=predicted
print (existing_dataframe)
output:
user_id age income gender cluster
1 13 10 1 (female) 0
2 34 50 1 1
3 75 40 0 (male) 2
4 23 29 0 1
5 80 45 1 2
... ... ... ... ...
Your approach to rejoining the predictions is correct. Your assumption not to include any IDs is also correct. However, I strongly advise you to scale your input variables before doing any clustering, as your variables have different units.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(existing_dataframe_for_analysis)
Then continue working with this new object as you did before.
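For completeness, here is a minimal sketch of the full flow under the same assumptions as above (the numbers are the made-up values from the question; fit_predict is used so that fitting and labelling happen in one step):
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# toy version of the question's dataframe
existing_dataframe = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'age': [13, 34, 75, 23, 80],
    'income': [10, 50, 40, 29, 45],
    'gender': [1, 1, 0, 0, 1]})
features = existing_dataframe.drop(columns='user_id')  # drop the ID column, as in the question
# scale first, then cluster on the scaled features
scaled_features = StandardScaler().fit_transform(features)
model = KMeans(n_clusters=3, init='k-means++', max_iter=20, n_init=10)
predicted = model.fit_predict(scaled_features)
# fit_predict returns one label per input row, in the same order,
# so assigning it as a new column lines up with the original observations
existing_dataframe['cluster'] = predicted
print(existing_dataframe)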

Pandas - number of occurances of IDs from a column in one dataframe in several columns of a second dataframe

I'm new to python and pandas, and trying to "learn by doing."
I'm currently working with two football/soccer (depending on where you're from!) dataframes:
player_table has several columns, among others 'player_name' and 'player_id'
player_id player_name
0 223 Lionel Messi
1 157 Cristiano Ronaldo
2 962 Neymar
match_table also has several columns, among others 'home_player_1', '..._2', '..._3' and so on, as well as the corresponding 'away_player_1', '..._2', '..._3' and so on. The content of these columns is a player_id, so you can tell from their unique IDs which 22 (2x11) players participated in a given match.
I'll just post a 2 vs. 2 example here, because that works just as well:
match_id home_player_1 home_player_2 away_player_1 away_player_2
0 321 223 852 729 853
1 322 223 858 157 159
2 323 680 742 223 412
What I would like to do now is add a new column, player_table['appearances'], which gives the number of appearances by counting the number of times each player_id is mentioned in the part of match_table bounded horizontally by (home_player_1, away_player_2) and vertically by (first match, last match).
Desired result:
player_id player_name appearances
0 223 Lionel Messi 3
1 157 Cristiano Ronaldo 1
2 962 Neymar 0
Coming from other programming languages, I think my standard solution would be a nested for loop, but I understand that is frowned upon in Python...
I have tried several solutions but none really work; this at least gives the number of appearances as 'home_player_1':
player_table['appearances'] = player_table['player_id'].map(match_table['home_player_1'].value_counts())
Is there a way to expand the map function to include several columns in a dataframe? Or do I have to stack the 22 columns on top of one another in a new dataframe, and then map? Or is map not the appropriate function?
Would really appreciate your support, thanks!
Philipp
Edit: added specific input and desired output as requested
What you could do is use .melt() on the match_table player columns (this turns your wide table into a tall/long table with a single column of player IDs). Then do a .value_counts() on that one column. Finally, join it to player_table on the 'player_id' column.
import pandas as pd
player_table = pd.DataFrame({'player_id': [223, 157, 962],
                             'player_name': ['Lionel Messi', 'Cristiano Ronaldo', 'Neymar']})
match_table = pd.DataFrame({
    'match_id': [321, 322, 323],
    'home_player_1': [223, 223, 680],
    'home_player_2': [852, 858, 742],
    'away_player_1': [729, 157, 223],
    'away_player_2': [853, 159, 412]})
player_cols = [c for c in match_table.columns if 'player_' in c]
# melt the wide player columns into one long column of IDs, count each ID, and name the columns for the join
counts = (match_table[player_cols].melt(value_name='player_id')['player_id']
          .value_counts().rename_axis('player_id').reset_index(name='appearances'))
# join the counts back onto player_table; players who never appear get NaN, filled with 0
appearances_df = player_table.merge(counts, how='left', on='player_id').fillna({'appearances': 0})
Output:
print(appearances_df)
player_id player_name appearances
0 223 Lionel Messi 3.0
1 157 Cristiano Ronaldo 1.0
2 962 Neymar 0.0
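As an aside (not part of the original answer), the map() idea from the question also works once the 22 player columns are stacked into a single Series; a sketch reusing player_cols and the dataframes from above:
# stack() collapses the wide player columns into one long Series of player IDs
counts = match_table[player_cols].stack().value_counts()
player_table['appearances'] = (player_table['player_id']
                               .map(counts)    # look up each player's count
                               .fillna(0)      # players who never appear
                               .astype(int))
print(player_table)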

How to use Word2Vec CBOW in statistical algorithm?

I have seen a few examples of using CBOW in neural network models (although I did not understand them).
I know that Word2Vec is not similar to BOW or TF-IDF, as there is no single value per word with CBOW,
and all the examples I saw used a neural network.
I have 2 questions:
1- Can we convert the vector to a single value and put it in a dataframe so we can use it in a logistic regression model?
2- Is there any simple code for using CBOW with logistic regression?
More explanation: in my case, I have a corpus for which I want to compare the top features in BOW and CBOW.
After converting to BOW, I get this dataset:
RepID Label Cat Dog Snake Rabbit Apple Orange ...
1 1 5 3 8 2 0
2 0 1 0 0 6 9
3 1 4 1 5 1 7
After converting to TF-IDF, I get this dataset:
RepID Label Cat Dog Snake Rabbit Apple Orange ...
1 1 0.38 0.42 0.02 0.22 0.00 0.19
2 0 0.75 0.20 0.08 0.12 0.37 0.21
3 1 0.17 0.84 0.88 0.11 0.07 0.44
I am keeping only the top 3 features in each model, so my dataset becomes like the following.
BOW (I put null here for the values that will be omitted):
RepID Label Cat Dog Snake Rabbit Apple Orange ...
1 1 5 null 8 null null 7
2 0 null null null 6 9 2
3 1 4 null 5 null 7 null
TFIDF (I put null here for the values that will be omitted)
RepID Label Cat Dog Snake Rabbit Apple Orange ...
1 1 0.38 0.42 null 0.22 null null
2 0 0.75 null null null 0.37 0.21
3 1 null 0.84 0.88 null null 0.44
I now want to do the same with Word2Vec CBOW.
I want to take the highest values in the CBOW model
RepID Label Cat Dog Snake Rabbit Apple Orange ...
1 1 v11 v12 v13 v14 v15 v16
2 0 v21 v22 v23 v24 v25 v26
3 1 v31 v32 v33 v34 v35 v36
to be like this
RepID Label Cat Dog Snake Rabbit Apple Orange ...
1 1 v11 null v13 null v15 null
2 0 null null v23 null v25 v26
3 1 v31 null v33 v34 null null
No matter the internal training method, CBOW or skip-gram, a word-vector is always a multidimensional vector: it contains many floating-point numbers.
So at one level, that is one "value" - where the "value" is a vector. But it's never a single number.
Word-vectors, even with all their dimensions, can absolutely serve as inputs for a downstream logistic regression task. But the exact particulars depend on exactly what data you're operating on, and what you intend to achieve - so you may want to expand your question, or ask a more specific followup, with more info about the specific data/task you're considering.
Note also: this is done more often with the pipeline of a library like scikit-learn. Putting dense high-dimensional word-vectors themselves (or other features derived from word-vectors) directly into "dataframes" is often a mistake, adding overhead & indirection compared to working with such large feature-vectors in their more compact/raw format of (say) numpy arrays.
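To make that concrete, here is a minimal sketch (not from the answer above) of one common pattern: train a CBOW Word2Vec model with gensim (sg=0 selects CBOW), average the word-vectors of each document into one fixed-length feature vector, and feed those vectors to scikit-learn's LogisticRegression. The tiny corpus, labels, and doc_vector helper are made up purely for illustration, and gensim 4.x parameter names are assumed.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
# toy tokenized corpus and labels, invented for illustration only
docs = [["cat", "dog", "rabbit"], ["apple", "orange", "apple"], ["snake", "cat", "dog"]]
labels = [1, 0, 1]
# sg=0 trains CBOW; vector_size is the dimensionality of each word-vector
w2v = Word2Vec(sentences=docs, vector_size=50, window=3, min_count=1, sg=0, epochs=50)
def doc_vector(tokens, model):
    # average the vectors of the document's words into a single feature vector
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)
X = np.vstack([doc_vector(d, w2v) for d in docs])   # shape: (n_docs, vector_size)
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))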

How to access a column of grouped data to perform linear regression in pandas?

I want to perform a linear regression on the groups of a grouped data frame in pandas. The function I am calling throws a KeyError that I cannot resolve.
I have an environmental data set called dat that includes concentration data of a chemical, measured in different tree species of various age classes at sites in different countries over the course of several time steps. I now want to do a regression of concentration over time steps within each group of (site, species, age).
This is my code:
```
import pandas as pd
import statsmodels.api as sm
dat = pd.read_csv('data.csv')
dat.head(15)
SampleName Concentration Site Species Age Time_steps
0 batch1 2.18 Germany pine 1 1
1 batch2 5.19 Germany pine 1 2
2 batch3 11.52 Germany pine 1 3
3 batch4 16.64 Norway spruce 0 1
4 batch5 25.30 Norway spruce 0 2
5 batch6 31.20 Norway spruce 0 3
6 batch7 12.63 Norway spruce 1 1
7 batch8 18.70 Norway spruce 1 2
8 batch9 43.91 Norway spruce 1 3
9 batch10 9.41 Sweden birch 0 1
10 batch11 11.10 Sweden birch 0 2
11 batch12 15.73 Sweden birch 0 3
12 batch13 16.87 Switzerland beech 0 1
13 batch14 22.64 Switzerland beech 0 2
14 batch15 29.75 Switzerland beech 0 3
def ols_res_grouped(group):
    xcols_const = sm.add_constant(group['Time_steps'])
    linmod = sm.OLS(group['Concentration'], xcols_const).fit()
    return linmod.params[1]
grouped = dat.groupby(['Site','Species','Age']).agg(ols_res_grouped)
```
I want to get the regression coefficient of concentration data over Time_steps but get a KeyError: 'Time_steps'. How can the sm method access group["Time_steps"]?
According to the pandas documentation, agg applies functions to each column independently.
It might be possible to use NamedAgg, but I am not sure.
I think it is a lot easier to just use a for loop for this:
for _, group in dat.groupby(['Site', 'Species', 'Age']):
    coeff = ols_res_grouped(group)
    # if you want to put the coeff inside the dataframe
    dat.loc[group.index, 'coeff'] = coeff
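Alternatively (not shown in the answer above), groupby().apply() passes each whole sub-DataFrame to the function, unlike agg(), which passes one column at a time, so ols_res_grouped can see both 'Time_steps' and 'Concentration' and the KeyError goes away. A sketch, assuming the same dat and function as in the question:
# apply() hands the full group DataFrame to ols_res_grouped
slopes = dat.groupby(['Site', 'Species', 'Age']).apply(ols_res_grouped)
print(slopes)  # one Time_steps coefficient per (Site, Species, Age) group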

Effective reasonable indexing for numeric vector search?

I have a long numeric table where 7 columns are a key and 4 columns are the value to find.
Actually, I have rendered an object at different distances and perspective angles and have calculated Hu moments for its contour. But this is not important to the question, just an example to help imagine the setting.
So, when I have 7 values, I need to scan the table, find the closest values in those 7 columns, and extract the corresponding 4 values.
So, the aspects of the task to consider are as follows:
1) the numbers have errors
2) the scale in the function domain is not the same as the scale in the function value; i.e. the "distance" from a point in 7-dimensional space should depend on how it affects those 4 values
3) the search should be fast
So the question is: is there some algorithm out there to solve this task efficiently, i.e. perform some indexing on those 7 columns, but not the way conventional databases do it, taking the points above into account?
If I understand the problem correctly, you might consider using scipy.cluster.vq (vector quantization):
Suppose your 7 numeric columns look like this (let's call the array code_book):
import scipy.cluster.vq as vq
import scipy.spatial as spatial
import numpy as np
np.random.seed(2013)
np.set_printoptions(precision=2)
code_book = np.random.random((3,7))
print(code_book)
# [[ 0.68 0.96 0.27 0.6 0.63 0.24 0.7 ]
# [ 0.84 0.6 0.59 0.87 0.7 0.08 0.33]
# [ 0.08 0.17 0.67 0.43 0.52 0.79 0.11]]
Suppose the associated 4 columns of values look like this:
values = np.arange(12).reshape(3,4)
print(values)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
And finally, suppose we have some "observations" of 7-column values like this:
observations = np.random.random((5,7))
print(observations)
# [[ 0.49 0.39 0.41 0.49 0.9 0.89 0.1 ]
# [ 0.27 0.96 0.16 0.17 0.72 0.43 0.64]
# [ 0.93 0.54 0.99 0.62 0.63 0.81 0.36]
# [ 0.17 0.45 0.84 0.02 0.95 0.51 0.26]
# [ 0.51 0.8 0.2 0.9 0.41 0.34 0.36]]
To find the 7-valued row in code_book which is closest to each observation, you could use vq.vq:
index, dist = vq.vq(observations, code_book)
print(index)
# [2 0 1 2 0]
The index values refer to rows in code_book. However, if the rows in values are ordered the same way as code_book, we can look up the associated value with values[index]:
print(values[index])
# [[ 8 9 10 11]
# [ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]
# [ 0 1 2 3]]
The above assumes you have all your observations arranged in an array. Thus, to find all the indices you need only one call to vq.vq.
However, if you obtain the observations one at a time and need to find the closest row in code_book before going on to the next observation, then it would be inefficient to call vq.vq each time. Instead, generate a KDTree once, and then find the nearest neighbor(s) in the tree:
tree = spatial.KDTree(code_book)
for observation in observations:
distances, indices = tree.query(observation)
print(indices)
# 2
# 0
# 1
# 2
# 0
Note that the number of points in your code_book (N) must be large compared to the dimension of the data (e.g. N >> 2**7) for the KDTree to be fast compared to simple exhaustive search.
Using vq.vq or KDTree.query may or may not be faster than exhaustive search, depending on the size of your data (code_book and observations). To find out which is faster, be sure to benchmark these versus an exhaustive search using timeit.
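For example, a quick comparison along those lines, reusing code_book and observations from the snippets above (a sketch; the cdist-based brute force is just one possible baseline for the exhaustive search):
import timeit
import scipy.cluster.vq as vq
import scipy.spatial as spatial
from scipy.spatial.distance import cdist
def exhaustive(obs, book):
    # brute force: all pairwise distances, then the argmin per observation
    return cdist(obs, book).argmin(axis=1)
tree = spatial.KDTree(code_book)
print(timeit.timeit(lambda: vq.vq(observations, code_book), number=1000))
print(timeit.timeit(lambda: tree.query(observations), number=1000))
print(timeit.timeit(lambda: exhaustive(observations, code_book), number=1000))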
I don't know if I understood your question well, but I will try to give an answer.
For each row K in the table, compute the distance of your key from the key in that row:
( (X1-K1)^2 + (X2-K2)^2 + (X3-K3)^2 + (X4-K4)^2 + (X5-K5)^2 + (X6-K6)^2 + (X7-K7)^2 )^0.5
where {X1,X2,X3,X4,X5,X6,X7} is your key and {K1,K2,K3,K4,K5,K6,K7} is the key at row K.
You could make one factor of the key more or less relevant than the others by multiplying it while computing the distance; for example, you could replace (X1-K1)^2 in the formula above with 5*(X1-K1)^2 to make that factor more influential.
Store the distance in one variable and the row number in a second variable.
Do the same with the following rows, and if the new distance is lower than the one you stored, replace the stored distance and row number.
When you have checked all the rows in your table, the second variable will tell you which row is nearest to the key.
Here is the idea in simple Python:
import math

def distance(key, row_key, weights=(1, 1, 1, 1, 1, 1, 1)):
    # weighted Euclidean distance; raise a weight to make that factor more influential
    return math.sqrt(sum(w * (x - k) ** 2 for w, x, k in zip(weights, key, row_key)))

key = [0.0] * 7                         # suppose it is already filled with some values
closest_distance = float('inf')
closest_row = 0
for row in range(len(table)):           # table: your rows, each with 7 key columns then 4 value columns
    new_distance = distance(key, table[row][0:7])
    if new_distance < closest_distance:
        closest_distance = new_distance
        closest_row = row
value_found = table[closest_row][7:11]  # this should be the value you were looking for
I know it isn't fast, but it is the best I could do; I hope it helped.
P.S. I know I haven't considered measurement errors.
