perform principal component analysis on moving window of data - python-3.x

This was partly answered by #WhoIsJack but not completely solved given the errors I get. Basically, I'm trying to perform principal component analysis on a rolling window of data. For example, I'd run PCA on the last 200 days in the df, move forward 1 day and do PCA again on the last 200 days. So as you move forward each day, you'd include the next day's measurement and exclude the last measurement.
You have a random df:
data = np.random.random(size=(1000,10))
df = pd.DataFrame(data)
Here's window size:
window = 200
Initialize an empty df of appropriate size for the output
df_pca = pd.DataFrame( np.zeros((data.shape[0] - window + 1, data.shape[1])) )
Define PCA fit-transform function. Instead of attempting to return the result, it is written into the previously created output array.
def rolling_pca(window_data):
pca = PCA()
transf = pca.fit_transform(df.iloc[window_data])
df_pca.iloc[int(window_data[0])] = transf[0,:]
return True
Create a df containing row indices for the workaround
df_idx = pd.DataFrame(np.arange(df.shape[0]))
Use rolling to apply the PCA function
_ = df_idx.rolling(window).apply(rolling_pca)
The results should be contained here:
print(df_pca)
However, when I generate the results only the first row of data looks to contain PCAs while the rest of the rows are zero.
I also tried the following function:
def rolling_pca(x, window):
r = x.rolling(window=window)
pca = PCA(3)
y = pca.fit(r)
z = pca.fit_transform(y)
return z
window = 200
Which I thought would generate a new df with rolling PCAs:
data = df.apply(rolling_pca, window=window)
But I got the following error: setting an array element with a sequence.
I've also tried manually calculating with below. I get: "unsupported operand type(s) for /: 'Rolling' and 'int'"
def rolling_pca(x, window):
# create rolling dataframe
r = x.rolling(window=window)
# demand data
X = np.matrix(r)
X_dm = X - np.mean(X, axis = 0)
#Eigenvalue decomposition (of covariance matrix)
Cov_X = np.cov(X_dm, rowvar = False)
eigen = np.linalg.eig(Cov_X)
eig_values_X = np.matrix(eigen[0])
eig_vectors_X = np.matrix(eigen[1])
#transformed data
Y_dm = X_dm * eig_vectors_X
#assign transformed yields
yields_trans = Y_dm.copy()
# get PCs
pc1_yields = x.copy()
pcas = yields_trans[:,0:3]
return pcas
#assign window length
window = 300
rolling_pca(data, window=window)
And tried below. Get error: "LinAlgError: 0-dimensional array given. Array must be at least two-dimensional"
def pca(x):
# demand data
X = np.matrix(x.values)
X_dm = X - np.mean(X, axis = 0)
#Eigenvalue decomposition (of covariance matrix)
Cov_X = np.cov(X_dm, rowvar = False)
eigen = np.linalg.eig(Cov_X)
eig_values_X = np.matrix(eigen[0])
eig_vectors_X = np.matrix(eigen[1])
#transformed data
Y_dm = X_dm * eig_vectors_X
#assign transformed yields
yields_trans = Y_dm.copy()
# get 3 PCs
pcas = yields_trans[:,0:3]
final_pcas = pd.DataFrame(pcas)
return final_pcas
data.rolling(200).apply(pca)
Any thoughts would be appreciated!

Related

How can I interpolate values from two lists (in Python)?

I am relatively new to coding in Python. I have mainly used MatLab in the past and am used to having vectors that can be referenced explicitly rather than appended lists. I have a script where I generate a list of x- and y- (z-, v-, etc) values. Later, I want to interpolate and then print a table of the values at specified points. Here is a MWE. The problem is at line 48:
yq = interp1d(x_list, y_list, xq(nn))#interp1(output1(:,1),output1(:,2),xq(nn))
I'm not sure I have the correct syntax for the last two lines either:
table[nn] = ('%.2f' %xq, '%.2f' %yq)
print(table)
Here is the full script for the MWE:
#This script was written to test how to interpolate after data was created in a loop and stored as a list. Can a list be accessed explicitly like a vector in matlab?
#
from scipy.interpolate import interp1d
from math import * #for ceil
from astropy.table import Table #for Table
import numpy as np
# define the initial conditions
x = 0 # initial x position
y = 0 # initial y position
Rmax = 10 # maxium range
""" initializing variables for plots"""
x_list = [x]
y_list = [y]
""" define functions"""
# not necessary for this MWE
"""create sample data for MWE"""
# x and y data are calculated using functions and appended to their respective lists
h = 1
t = 0
tf = 10
N=ceil(tf/h)
# Example of interpolation without a loop: https://docs.scipy.org/doc/scipy/tutorial/interpolate.html#d-interpolation-interp1d
#x = np.linspace(0, 10, num=11, endpoint=True)
#y = np.cos(-x**2/9.0)
#f = interp1d(x, y)
for i in range(N):
x = h*i
y = cos(-x**2/9.0)
""" appends selected data for ability to plot"""
x_list.append(x)
y_list.append(y)
## Interpolation after x- and y-lists are already created
intervals = 0.5
nfinal = ceil(Rmax/intervals)
NN = nfinal+1 # length of table
dtype = [('Range (units?)', 'f8'), ('Drop? (units)', 'f8')]
table = Table(data=np.zeros(N, dtype=dtype))
for nn in range(NN):#for nn = 1:NN
xq = 0.0 + (nn-1)*intervals #0.0 + (nn-1)*intervals
yq = interp1d(x_list, y_list, xq(nn))#interp1(output1(:,1),output1(:,2),xq(nn))
table[nn] = ('%.2f' %xq, '%.2f' %yq)
print(table)
Your help and patience will be greatly appreciated!
Best regards,
Alex
Your code has some glaring issues that made it really difficult to understand. Let's first take a look at some things I needed to fix:
for i in range(N):
x = h*1
y = cos(-x**2/9.0)
""" appends selected data for ability to plot"""
x_list.append(x)
y_list.append(y)
You are appending a single value without modifying it. What I presume you wanted is down below.
intervals = 0.5
nfinal = ceil(Rmax/intervals)
NN = nfinal+1 # length of table
dtype = [('Range (units?)', 'f8'), ('Drop? (units)', 'f8')]
table = Table(data=np.zeros(N, dtype=dtype))
for nn in range(NN):#for nn = 1:NN
xq = 0.0 + (nn-1)*intervals #0.0 + (nn-1)*intervals
yq = interp1d(x_list, y_list, xq(nn))#interp1(output1(:,1),output1(:,2),xq(nn))
table[nn] = ('%.2f' %xq, '%.2f' %yq)
This is where things get strange. First: use pandas tables, this is the more popular choice. Second: I have no idea what you are trying to loop over. What I presume you wanted was to vary the number of points for the interpolation, which I have done so below. Third: you are trying to interpolate a point, when you probably want to interpolate over a range of points (...interpolation). Lastly, you are using the interp1d function incorrectly. Please take a look at the code below or run it here; let me know what you exactly wanted (specifically: what should xq / xq(nn) be?), because the MRE you provided is quite confusing.
from scipy.interpolate import interp1d
from math import *
import numpy as np
Rmax = 10
h = 1
t = 0
tf = 10
N = ceil(tf/h)
x = np.arange(0,N+1)
y = np.cos(-x**2/9.0)
interval = 0.5
NN = ceil(Rmax/interval) + 1
ip_list = np.arange(1,interval*NN,interval)
xtable = []
ytable = []
for i,nn in enumerate(ip_list):
f = interp1d(x,y)
x_i = np.arange(0,nn+interval,interval)
xtable += [x_i]
ytable += [f(x_i)]
[print(i) for i in xtable]
[print(i) for i in ytable]

How to create a dataframe of a particular size containing both continuous and categorical values with a uniform random distribution

So, I'm trying to generate some fake random data of a given dimension size. Essentially, I want a dataframe in which the data has a uniform random distribution. The data consist of both continuous and categorical values. I've written the following code, but it doesn't work the way I want it to be.
import random
import pandas as pd
import time
from datetime import datetime
# declare global variables
adv_name = ['soft toys', 'kitchenware', 'electronics',
'mobile phones', 'laptops']
adv_loc = ['location_1', 'location_2', 'location_3',
'location_4', 'location_5']
adv_prod = ['baby product', 'kitchenware', 'electronics',
'mobile phones', 'laptops']
adv_size = [1, 2, 3, 4, 10]
adv_layout = ['static', 'dynamic'] # advertisment layout type on website
# adv_date, start_time, end_time = []
num = 10 # the given dimension
# define function to generate random advert locations
def rand_shuf_loc(str_lst, num):
lst = adv_loc
# using list comprehension
rand_shuf_str = [item for item in lst for i in range(num)]
return(rand_shuf_str)
# define function to generate random advert names
def rand_shuf_prod(loc_list, num):
rand_shuf_str = [item for item in loc_list for i in range(num)]
random.shuffle(rand_shuf_str)
return(rand_shuf_str)
# define function to generate random impression and click data
def rand_clic_impr(num):
rand_impr_lst = []
click_lst = []
for i in range(num):
rand_impr_lst.append(random.randint(0, 100))
click_lst.append(random.randint(0, 100))
return {'rand_impr_lst': rand_impr_lst, 'rand_click_lst': click_lst}
# define function to generate random product price and discount
def rand_prod_price_discount(num):
prod_price_lst = [] # advertised product price
prod_discnt_lst = [] # advertised product discount
for i in range(num):
prod_price_lst.append(random.randint(10, 100))
prod_discnt_lst.append(random.randint(10, 100))
return {'prod_price_lst': prod_price_lst, 'prod_discnt_lst': prod_discnt_lst}
def rand_prod_click_timestamp(stime, etime, num):
prod_clik_tmstmp = []
frmt = '%d-%m-%Y %H:%M:%S'
for i in range(num):
rtime = int(random.random()*86400)
hours = int(rtime/3600)
minutes = int((rtime - hours*3600)/60)
seconds = rtime - hours*3600 - minutes*60
time_string = '%02d:%02d:%02d' % (hours, minutes, seconds)
prod_clik_tmstmp.append(time_string)
time_stmp = [item for item in prod_clik_tmstmp for i in range(num)]
return {'prod_clik_tmstmp_lst':time_stmp}
def main():
print('generating data...')
# print('generating random geographic coordinates...')
# get the impressions and click data
impression = rand_clic_impr(num)
clicks = rand_clic_impr(num)
product_price = rand_prod_price_discount(num)
product_discount = rand_prod_price_discount(num)
prod_clik_tmstmp = rand_prod_click_timestamp("20-01-2018 13:30:00",
"23-01-2018 04:50:34",num)
lst_dict = {"ad_loc": rand_shuf_loc(adv_loc, num),
"prod": rand_shuf_prod(adv_prod, num),
"imprsn": impression['rand_impr_lst'],
"cliks": clicks['rand_click_lst'],
"prod_price": product_price['prod_price_lst'],
"prod_discnt": product_discount['prod_discnt_lst'],
"prod_clik_stmp": prod_clik_tmstmp['prod_clik_tmstmp_lst']}
fake_data = pd.DataFrame.from_dict(lst_dict, orient="index")
res = fake_data.apply(lambda x: x.fillna(0)
if x.dtype.kind in 'biufc'
# where 'biufc' means boolean, integer,
# unicode, float & complex data types
else x.fillna(random.randint(0, 100)
)
)
print(res.transpose())
res.to_csv("fake_data.csv", sep=",")
# invoke the main function
if __name__ == "__main__":
main()
Problem 1
when I execute the above code snippet, it prints fine but when written to csv format, its horizontally positioned; i.e., it looks like this... How do I position it vertically when writing to csv file? What I want is 7 columns (see lst_dict variable above) with n number of rows?
Problem 2
I dont understand why the random date is generated for the first 50 columns and remaining columns are filled with numerical values?
To answer your first question, replace
print(res.transpose())
with
res.transpose() print(res)
To answer your second question look at the length of the output of the method
rand_shuf_loc()
it as well as the other helper functions only produce a list of 50 items.
The creation of res using the method
fake_data.apply
replaces all nan with a random numeric, so it also applies a numeric to the columns without any predefined values.

Sort simmilarity matrix according to plot colors

I have this similarity matrix plot of some documents. I want to sort the values of the matrix, which is a numpynd array, to group colors, while maintaining their relative position (diagonal yellow line), and labels as well.
path = "C:\\Users\\user\\Desktop\\texts\\dataset"
text_files = os.listdir(path)
#print (text_files)
tfidf_vectorizer = TfidfVectorizer()
documents = [open(f, encoding="utf-8").read() for f in text_files if f.endswith('.txt')]
sparse_matrix = tfidf_vectorizer.fit_transform(documents)
labels = []
for f in text_files:
if f.endswith('.txt'):
labels.append(f)
pairwise_similarity = sparse_matrix * sparse_matrix.T
pairwise_similarity_array = pairwise_similarity.toarray()
fig, ax = plt.subplots(figsize=(20,20))
cax = ax.matshow(pairwise_similarity_array, interpolation='spline16')
ax.grid(True)
plt.title('News articles similarity matrix')
plt.xticks(range(23), labels, rotation=90);
plt.yticks(range(23), labels);
fig.colorbar(cax, ticks=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
plt.show()
Here is one possibility.
The idea is to use the information in the similarity matrix and put elements next to each other if they are similar. If two items are similar they should also be similar with respect to other elements ie have similar colors.
I start with the element which has the most in common with all other elements (this choice is a bit arbitrary) [a] and as next element I choose from the remaining elements the one which is closest to the current [b].
import numpy as np
import matplotlib.pyplot as plt
def create_dummy_sim_mat(n):
sm = np.random.random((n, n))
sm = (sm + sm.T) / 2
sm[range(n), range(n)] = 1
return sm
def argsort_sim_mat(sm):
idx = [np.argmax(np.sum(sm, axis=1))] # a
for i in range(1, len(sm)):
sm_i = sm[idx[-1]].copy()
sm_i[idx] = -1
idx.append(np.argmax(sm_i)) # b
return np.array(idx)
n = 10
sim_mat = create_dummy_sim_mat(n=n)
idx = argsort_sim_mat(sim_mat)
sim_mat2 = sim_mat[idx, :][:, idx] # apply reordering for rows and columns
# Plot results
fig, ax = plt.subplots(1, 2)
ax[0].imshow(sim_mat)
ax[1].imshow(sim_mat2)
def ticks(_ax, ti, la):
_ax.set_xticks(ti)
_ax.set_yticks(ti)
_ax.set_xticklabels(la)
_ax.set_yticklabels(la)
ticks(_ax=ax[0], ti=range(n), la=range(n))
ticks(_ax=ax[1], ti=range(n), la=idx)
After meTchaikovsky's answer I also tested my idea on a clustered similarity matrix (see first image) this method works but is not perfect (see second image).
Because I use the similarity between two elements as approximation to their similarity to all other elements, it is quite clear why this does not work perfectly.
So instead of using the initial similarity to sort the elements one could calculate a second order similarity matrix which measures how similar the similarities are (sorry).
This measure describes better what you are interested in. If two rows / columns have similar colors they should be close to each other. The algorithm to sort the matrix is the same as before
def add_cluster(sm, c=3):
idx_cluster = np.array_split(np.random.permutation(np.arange(len(sm))), c)
for ic in idx_cluster:
cluster_noise = np.random.uniform(0.9, 1.0, (len(ic),)*2)
sm[ic[np.newaxis, :], ic[:, np.newaxis]] = cluster_noise
def get_sim_mat2(sm):
return 1 / (np.linalg.norm(sm[:, np.newaxis] - sm[np.newaxis], axis=-1) + 1/n)
sim_mat = create_dummy_sim_mat(n=100)
add_cluster(sim_mat, c=4)
sim_mat2 = get_sim_mat2(sim_mat)
idx = argsort_sim_mat(sim_mat)
idx2 = argsort_sim_mat(sim_mat2)
sim_mat_sorted = sim_mat[idx, :][:, idx]
sim_mat_sorted2 = sim_mat[idx2, :][:, idx2]
# Plot results
fig, ax = plt.subplots(1, 3)
ax[0].imshow(sim_mat)
ax[1].imshow(sim_mat_sorted)
ax[2].imshow(sim_mat_sorted2)
The results with this second method are quite good (see third image)
but I guess there exist cases where this approach also fails, so I would be happy about feedback.
Edit
I tried to explain it and did also link the ideas to the code with [a] and [b], but obviously I did not do a good job, so here is a second more verbose explanation.
You have n elements and a n x n similarity matrix sm where each cell (i, j) describes how similar element i is to element j. The goal is to order the rows / columns in such a way that one can see existing patterns in the similarity matrix. My idea to achieve this is really simple.
You start with an empty list and add elements one by one. The criterion for the next element is the similarity to the current element. If element i was added in the last step, I chose the element argmax(sm[i, :]) as next, ignoring the elements already added to the list. I ignore the elements by setting the values of those elements to -1.
You can use the function ticks to reorder the labels:
labels = np.array(labels) # make labels an numpy array, to index it with a list
ticks(_ax=ax[0], ti=range(n), la=labels[idx])
#scleronomic's solution is very elegant, but it also has one shortage, which is we cannot set the number of clusters in the sorted correlation matrix. Assume we are working with a set of variables, in which some of them are weakly correlated
import string
import numpy as np
import pandas as pd
n_variables = 20
n_clusters = 10
n_samples = 100
np.random.seed(100)
names = list(string.ascii_lowercase)[:n_variables]
belongs_to_cluster = np.random.randint(0,n_clusters,n_variables)
latent = np.random.randn(n_clusters,n_samples)
variables = np.random.rand(n_variables,n_samples)
for ind in range(n_clusters):
mask = belongs_to_cluster == ind
# weakening the correlation
if ind % 2 == 0:variables[mask] += latent[ind]*0.1
variables[mask] += latent[ind]
df = pd.DataFrame({key:val for key,val in zip(names,variables)})
corr_mat = np.array(df.corr())
As you can see, there are 10 clusters of variables by construction, however, variables within clusters that has an even index are weakly correlated. If we only want to see roughly 5 clusters in the sorted correlation matrix, maybe we need to find another way.
Based on this post, which is the accepted answer to the question "Clustering a correlation matrix", to sort a correlation matrix into blocks, what we need to find are blocks, where correlations within blocks are high and correlations between blocks are low. However, the solution provided by this accepted answer works best when we know how many blocks are there in the first place, and more importantly, the sizes of the underlying blocks are the same, or at least similar. Therefore, I improved the solution with a new function sort_corr_mat
def sort_corr_mat(corr_mat,clusters_guess):
def _swap_rows(corr_mat, var1, var2):
rs = corr_mat.copy()
rs[var2, :],rs[var1, :]= corr_mat[var1, :],corr_mat[var2, :]
cs = rs.copy()
cs[:, var2],cs[:, var1] = rs[:, var1],rs[:, var2]
return cs
# analysis
max_iter = 500
best_score,current_score,best_count = -1e8,-1e8,0
num_minimua_to_visit = 20
best_corr = corr_mat
best_ordering = np.arange(n_variables)
for i in range(max_iter):
for row1 in range(n_variables):
for row2 in range(n_variables):
if row1 == row2: continue
option_ordering = best_ordering.copy()
option_ordering[row1],option_ordering[row2] = best_ordering[row2],best_ordering[row1]
option_corr = _swap_rows(best_corr,row1,row2)
option_score = score(option_corr,n_variables,clusters_guess)
if option_score > best_score:
best_corr = option_corr
best_ordering = option_ordering
best_score = option_score
if best_score > current_score:
best_count += 1
current_corr = best_corr
current_ordering = best_ordering
current_score = best_score
if best_count >= num_minimua_to_visit:
return best_corr#,best_ordering
return best_corr#,best_ordering
With this function and the corr_mat constructed in the first place, I compared the result obtained with my function (on the right) with that obtained with #scleronomic's solution (in the middle)
sim_mat_sorted = corr_mat[argsort_sim_mat(corr_mat), :][:, argsort_sim_mat(corr_mat)]
corr_mat_sorted = sort_corr_mat(corr_mat,clusters_guess=5)
# Plot results
fig, ax = plt.subplots(1,3,figsize=(18,6))
ax[0].imshow(corr_mat)
ax[1].imshow(sim_mat_sorted)
ax[2].imshow(corr_mat_sorted)
Clearly, #scleronomic's solution works much better and faster, but my solution offers more control to the pattern of the output.

Found array with 0 feature(s) (shape=(268215, 0)) while a minimum of 1 is required by StandardScaler

I am solving a problem where I am pulling data of all the ProductIDs and then I iterate through the dataframe to look at unique ProductIDs to perform a set of functions.
Here, item is the ProductID/Item number:
#looping through the big dataframe to get a dataframe pertaining to the unique ID
for item in df2['Item Nbr'].unique():
# fetch item data
df = df2.loc[df2['Item Nbr'] == item]
And then I have a set of custom made python functions:
So, when I get through the first loop (for one productID) it works all great, but when it iterates through the loop and goes to the next Product ID, I am certain that the data it is pulling out is right, but I get this error:
Found array with 0 feature(s) (shape=(268215, 0)) while a minimum of 1 is required by StandardScaler.
Although, the X_train and y_train shapes are : (268215, 6) (268215,)
Code Snippet : (Extra Information)
It is a huge file to show. But the initial big dataframe has
[362988 rows x 7 columns] - for first product and
[268215 rows x 7 columns] - for second product
Expansion of the code:
the big dataframe with two unique product IDS
biqQueryData = get_item_data(verbose=True)
iterate over each unique product ID for extracting a subset of dataframes that pertain to the product ID
for item in biqQueryData['Item Nbr'].unique():
df = biqQueryData.loc[biqQueryData['Item Nbr'] == item]
try:
df_model = model_all_stores(df, item, n_jobs=n_jobs,
train_model=train_model,
test_model=test_model,
tune_model=tune_model,
export_model=export_model,
output=export_demand)
the function model_all_stores
def model_all_stores(df_raw, item_nbr, n_jobs=1, train_model=False,
test_model=False, export_model=False, output=False,
tune_model=False):
"""Models demand for specified item.
Predict the demand of specified item for all stores. Does not
filter for predict hidden demand (the function get_hidden_demand
should be used for this.)
Output: data frame output
"""
# ML model hyperparameters
impute_with = 'median'
n_estimators = 100
min_samples_split = 3
min_samples_leaf = 3
max_depth = None
# load data and subset traited and valid
dfnew = subset_traited_valid(df_raw)
# get known demand
df_ma = get_demand(dfnew)
# impute missing sales data
median_sales = df_ma['Sales Qty'].median()
df_ma['Sales Qty'] = df_ma['Sales Qty'].fillna(median_sales)
# add moving average features
df_ma = df_ma.sort_values('Gregorian Days')
window_list = [7 * x for x in [1, 2, 4, 8, 16, 52]]
for w in window_list:
grouped = df_ma.groupby('Store Nbr')['Sales Qty'].shift(1)
rolling = grouped.rolling(window=w, min_periods=1).mean()
df_ma['MA' + str(w)] = rolling.reset_index(0, drop=True)
X_full = df_ma.loc[:, 'MA7':].values
#print(X_full.shape)
# use full data if not testing/tuning
rows_for_model = df_ma['Known Demand'].notnull()
X = df_ma.loc[rows_for_model, 'MA7':].values
y = df_ma.loc[rows_for_model, 'Known Demand'].values
X_train, y_train = X, y
print(X_train.shape, y_train.shape)
if train_model:
# instantiate model components
imputer = Imputer(missing_values='NaN', strategy=impute_with, axis=0)
scale = StandardScaler()
pca = PCA()
forest = RandomForestRegressor(n_estimators=n_estimators,
max_features='sqrt',
min_samples_split=min_samples_split,
min_samples_leaf=min_samples_leaf,
max_depth=max_depth,
criterion='mse',
random_state=42,
warm_start=True,
n_jobs=n_jobs)
# pipeline for model
pipeline_steps = [('imputer', imputer),
('scale', scale),
('pca', pca),
('forest', forest)]
regr = Pipeline(pipeline_steps)
regr.fit(X_train, y_train)
It fails here
Snippet Of data:
biqQueryData (the entire Dataframe)
364174,1084,2019-12-12,,,,0.0
.....
364174,1084,2019-12-13,,,,0.0
188880,397752,19421,2020-02-04,2.0,1.0,1.0,0.0
.....
188881,397752,19421,2020-02-05,2.0,1.0,1.0,0.0
Subset DF 1:
364174,1084,2019-12-12,,,,0.0
.....
364174,1084,2019-12-13,,,,0.0
Subset DF 2:
188880,397752,19421,2020-02-04,2.0,1.0,1.0,0.0
.....
188881,397752,19421,2020-02-05,2.0,1.0,1.0,0.0
Any help here would be great! Thank you

Finding conditional mutual information from 3 discrete variable

I am trying to find conditional mutual information between three discrete random variable using pyitlib package for python with the help of the formula:
I(X;Y|Z)=H(X|Z)+H(Y|Z)-H(X,Y|Z)
The expected Conditional Mutual information value is= 0.011
My 1st code:
import numpy as np
from pyitlib import discrete_random_variable as drv
X=[0,1,1,0,1,0,1,0,0,1,0,0]
Y=[0,1,1,0,0,0,1,0,0,1,1,0]
Z=[1,0,0,1,1,0,0,1,1,0,0,1]
a=drv.entropy_conditional(X,Z)
##print(a)
b=drv.entropy_conditional(Y,Z)
##print(b)
c=drv.entropy_conditional(X,Y,Z)
##print(c)
p=a+b-c
print(p)
The answer i am getting here is=0.4632245116328402
My 2nd code:
import numpy as np
from pyitlib import discrete_random_variable as drv
X=[0,1,1,0,1,0,1,0,0,1,0,0]
Y=[0,1,1,0,0,0,1,0,0,1,1,0]
Z=[1,0,0,1,1,0,0,1,1,0,0,1]
a=drv.information_mutual_conditional(X,Y,Z)
print(a)
The answer i am getting here is=0.1583445441575102
While the expected result is=0.011
Can anybody help? I am in big trouble right now. Any kind of help will be appreciable.
Thanks in advance.
I think that the library function entropy_conditional(x,y,z) has some errors. I also test my samples, the same problem happens.
however, the function entropy_conditional with two variables is ok.
So I code my entropy_conditional(x,y,z) as entropy(x,y,z), the results is correct.
the code may be not beautiful.
def gen_dict(x):
dict_z = {}
for key in x:
dict_z[key] = dict_z.get(key, 0) + 1
return dict_z
def entropy(x,y,z):
x = np.array([x,y,z]).T
x = x[x[:,-1].argsort()] # sorted by the last column
w = x[:,-3]
y = x[:,-2]
z = x[:,-1]
# dict_w = gen_dict(w)
# dict_y = gen_dict(y)
dict_z = gen_dict(z)
list_z = [dict_z[i] for i in set(z)]
p_z = np.array(list_z)/sum(list_z)
pos = 0
ent = 0
for i in range(len(list_z)):
w = x[pos:pos+list_z[i],-3]
y = x[pos:pos+list_z[i],-2]
z = x[pos:pos+list_z[i],-1]
pos += list_z[i]
list_wy = np.zeros((len(set(w)),len(set(y))), dtype = float , order ="C")
list_w = list(set(w))
list_y = list(set(y))
for j in range(len(w)):
pos_w = list_w.index(w[j])
pos_y = list_y.index(y[j])
list_wy[pos_w,pos_y] += 1
#print(pos_w)
#print(pos_y)
list_p = list_wy.flatten()
list_p = np.array([k for k in list_p if k>0]/sum(list_p))
ent_t = 0
for j in list_p:
ent_t += -j * math.log2(j)
#print(ent_t)
ent += p_z[i]* ent_t
return ent
X=[0,1,1,0,1,0,1,0,0,1,0,0]
Y=[0,1,1,0,0,0,1,0,0,1,1,0]
Z=[1,0,0,1,1,0,0,1,1,0,0,1]
a=drv.entropy_conditional(X,Z)
##print(a)
b=drv.entropy_conditional(Y,Z)
c = entropy(X, Y, Z)
p=a+b-c
print(p)
0.15834454415751043
Based on the definitions of conditional entropy, calculating in bits (i.e. base 2) I obtain H(X|Z)=0.784159, H(Y|Z)=0.325011, H(X,Y|Z) = 0.950826. Based on the definition of conditional mutual information you provide above, I obtain I(X;Y|Z)=H(X|Z)+H(Y|Z)-H(X,Y|Z)= 0.158344. Noting that pyitlib uses base 2 by default, drv.information_mutual_conditional(X,Y,Z) appears to be computing the correct result.
Note that your use of drv.entropy_conditional(X,Y,Z) in your first example to compute conditional entropy is incorrect, you can however use drv.entropy_conditional(XY,Z), where XY is a 1D array representing the joint observations about X and Y, for example XY = [2*xy[0] + xy[1] for xy in zip(X,Y)].

Resources