How to do large-scale matrix-matrix multiplication in Spark - apache-spark

I have a dataframe in Spark that is a list of (user, item, rating):
user item rating
ust001 ipx001 5
ust002 ipx04 2
ust001 itx001 4
ust002 iox04 5
Assume I have n users and m items, so I can construct a matrix A of size n x m.
My goal is to use this matrix to compute the item-item similarity B = A^T * A and save it as a scipy sparse matrix B.npz.
Here is what I do in Python:
import numpy as np
import pandas as pd
import pickle

df = pd.read_parquet('user_item.parquet')

# mapping string to index
user2num = {}
item2num = {}
UID = 0
IID = 0
# remapping index to string
num2user = {}
num2item = {}

# loop over all elements and map string to index
for i in range(len(df['user'])):
    if df['user'][i] not in user2num:
        user2num[df['user'][i]] = UID
        num2user[UID] = df['user'][i]
        UID += 1
    if df['item'][i] not in item2num:
        item2num[df['item'][i]] = IID
        num2item[IID] = df['item'][i]
        IID += 1

# save the string-index mappings
with open('num2item.pickle', 'wb') as handle:
    pickle.dump(num2item, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('item2num.pickle', 'wb') as handle:
    pickle.dump(item2num, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('num2user.pickle', 'wb') as handle:
    pickle.dump(num2user, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('user2num.pickle', 'wb') as handle:
    pickle.dump(user2num, handle, protocol=pickle.HIGHEST_PROTOCOL)

df["user"] = df["user"].map(user2num)
df["item"] = df["item"].map(item2num)
df.to_parquet('ID_user-item.parquet')
Then I have another script to compute the matrix:
# another file to compute item-item similarity
import numpy as np
import pandas as pd
import pickle
from scipy.sparse import csr_matrix
from scipy import sparse
import scipy
df = pd.read_parquet('ID_user-item.parquet')
with open('num2item.pickle', 'rb') as handle:
    item_id = pickle.load(handle)
with open('num2user.pickle', 'rb') as handle:
    user_id = pickle.load(handle)
row = df['user'].values
col = df['item'].values
data = df['rating'].values
A = csr_matrix((data,(row, col)), shape=(len(user_id), len(item_id)))
B = csr_matrix((data,(col, row)), shape=(len(item_id), len(user_id)))
C = sparse.csr_matrix.dot(B, A)
scipy.sparse.save_npz('item-item.npz', C)
# based on num2item, I can remap the index back to the string to retrieve the item-item similarity.
The above is fine for a small dataset. However, if I have 500 GB of user-item-rating data, Python always runs out of memory.
My question is:
How can I obtain this item-item.npz using Spark, with the same logic?
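One way to express the same pipeline in PySpark is sketched below. It is a minimal sketch, not a drop-in solution: StringIndexer stands in for the manual string-to-index dictionaries, the distributed CoordinateMatrix/BlockMatrix API computes A^T * A, and the user_idx/item_idx column names plus the 1024x1024 block size are assumptions of this sketch. The final collect only works if the item-item result itself fits in driver memory.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry
from scipy import sparse

spark = SparkSession.builder.appName('item-item').getOrCreate()
df = spark.read.parquet('user_item.parquet')

# StringIndexer replaces the user2num/item2num dict loops; the *_idx column names are arbitrary
df = StringIndexer(inputCol='user', outputCol='user_idx').fit(df).transform(df)
df = StringIndexer(inputCol='item', outputCol='item_idx').fit(df).transform(df)

# build the sparse user-item matrix A from (row, col, value) triplets, distributed over the cluster
entries = df.rdd.map(lambda r: MatrixEntry(int(r.user_idx), int(r.item_idx), float(r.rating)))
A = CoordinateMatrix(entries).toBlockMatrix(1024, 1024)

# B = A^T * A, computed block by block (multiply() produces dense blocks, so keep the block size modest)
B = A.transpose().multiply(A)

# pull the non-zero entries of B back to the driver; only do this if the item-item matrix fits in memory
triples = B.toCoordinateMatrix().entries.map(lambda e: (e.i, e.j, e.value)).collect()
rows, cols, vals = zip(*triples)
n_items = B.numCols()
C = sparse.csr_matrix((vals, (rows, cols)), shape=(n_items, n_items))
sparse.save_npz('item-item.npz', C)
If even the item-item matrix is too large for one machine, the entries of B can be written back to parquet instead of collected, or RowMatrix.columnSimilarities() can be considered when cosine similarity rather than the raw dot product A^T * A is acceptable.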

Related

How to scatter/send all possible column pairs to the child processes and find coherence between the columns using python mpi4py? Parallel computation

I have a big matrix/2D array for which I need to find the coherence of every possible column pair by parallel computation in Python (e.g. mpi4py). The coherence [a function] is computed in the various child processes, and each child process should send its coherence value to the parent process, which gathers the values as a list. To do this, I've created a small matrix and a list of all possible column pairs as follows:
import numpy as np
from scipy import signal
from itertools import combinations
from mpi4py import MPI
comm = MPI.COMM_WORLD
nproc = comm.Get_size()
rank = comm.Get_rank()
data=np.arange(20).reshape(5, 4)
#List of all possible column pairs
data_col = list(combinations(np.transpose(data), 2)) #list
# Function creation
def myFunc(X, Y):
    ..................
    ..................
    return Real_coh

if rank == 0:
    Data = comm.scatter(data_col, root=0)  # col_pair
Can anyone suggest how I should proceed further? You are welcome to ask for any clarifications. Thanks.
Check out the following script [with comm.Barrier for synchronized communication]. In the script, I've written and read the file as chunks of an h5py dataset, which is memory efficient.
import numpy as np
from scipy import signal
from mpi4py import MPI
import h5py as t
chunk_len = 5000 # No. of rows of a matrix
num_c = 34 # No. of columns of the matrix
# Actual Dataset
data_mat = np.random.random((10000, num_c))
shape = (chunk_len, data_mat.shape[1])
chunk_size = (chunk_len, 1)
no_of_chunks = data_mat.shape[1]
with t.File('file_name.h5', 'w') as hf:
    hf.create_dataset("chunked_arr", data=data_mat, chunks=chunk_size, compression='lzf')
del data_mat

def myFunc(dset_X, dset_Y):
    ..............
    ............
    return Real_coh

res = np.zeros((num_c, num_c))
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

for i in range(num_c):
    with t.File('file_name.h5', 'r', libver='latest') as hf:
        dset_X = hf['chunked_arr'][:, i]  # Chunk data reading
    if i % size == rank:
        for j in range(num_c):
            with t.File('file_name.h5', 'r', libver='latest') as hf:
                dset_Y = hf['chunked_arr'][:, j]  # Chunk data reading
            res[i][j] = myFunc(dset_X, dset_Y)
comm.Barrier()
print('Shape of final result :', res.shape)
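As an aside, the comm.scatter/comm.gather pattern the question originally asked about can also work. Below is a minimal sketch, assuming the pair list is split into exactly one chunk per rank (comm.scatter requires that) and with a placeholder standing in for the real coherence function:
import numpy as np
from itertools import combinations
from mpi4py import MPI

comm = MPI.COMM_WORLD
nproc = comm.Get_size()
rank = comm.Get_rank()

def my_coherence(x, y):
    # placeholder: stands in for the real coherence computation (Real_coh)
    return float(np.corrcoef(x, y)[0, 1])

if rank == 0:
    data = np.arange(20).reshape(5, 4)
    pairs = list(combinations(np.transpose(data), 2))
    # comm.scatter needs exactly one item per rank, so split the pair list
    chunks = [pairs[i::nproc] for i in range(nproc)]
else:
    chunks = None

my_pairs = comm.scatter(chunks, root=0)              # each rank gets its share of pairs
my_vals = [my_coherence(x, y) for x, y in my_pairs]  # compute locally
gathered = comm.gather(my_vals, root=0)              # parent collects the per-rank lists

if rank == 0:
    coherence = [v for part in gathered for v in part]  # flatten into one list
    print(len(coherence), 'coherence values')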

Reading in multiple files in Python and saving them one by one in a different directory

import glob
import pandas as pd
import seaborn as sns
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt
files = glob.glob("Angular_position_*_*.csv")
output = pd.DataFrame()
for f in files:
    df = pd.read_csv(f)
    time = df.iloc[:,0]
    time = time.to_numpy()
    ynew = df.iloc[:,1:]
    ynew = ynew.to_numpy()
    lowPassCutoffFreq = 6.0 # Cut off frequency
    Sample_freq = 150 # Target sample frequency
    N = 2 # Order of the filter; In this case 2nd order
    Wn = lowPassCutoffFreq/(Sample_freq/2) # Normalize frequency
    b, a = signal.butter(5, Wn, btype='low', analog=False, output='ba')
    # scipy.signal.butter(N, Wn, btype='low', analog=False, output='ba', fs=None)
    output = signal.filtfilt(b, a, ynew, axis=0)
    np.savetxt("enter directory path/Filtered_files/Filtered_Angular_position_*_*", output, delimiter=', ', newline="\n")
I am trying to read in all files in a directory; they are then low-pass filtered. After that, the results are saved one after the other, but not in one file. Each resulting file has 3 columns, and ideally I would like them to be named with headers, e.g. col1, col2, col3.
Without using glob, I can filter all my files individually but I have more than 100 such files.
Any help would be appreciated.
best wishes,
I have partially solved the issue apart from the header names:
import glob
import pandas as pd
from tnorma import tnorma
import seaborn as sns
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt
path = r'location_of_dir'
all_files = glob.glob(path + '/*.csv')
# yn = np.zeros(shape = (101,1))
# tn = np.zeros(shape = (101,1))
#ynew = []
yn = np.zeros(shape = (101,1))
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    print(filename)
    foo = filename.split("/")[-1]
    # df = pd.read_csv(f)
    time = df.iloc[:,0]
    time = time.to_numpy()
    ynew = df.iloc[:,1:]
    ynew = ynew.to_numpy()
    # print(ynew)
    lowPassCutoffFreq = 6.0 # Cut off frequency
    Sample_freq = 150 # Target sample frequency
    N = 2 # Order of the filter; In this case 2nd order
    Wn = lowPassCutoffFreq/(Sample_freq/2) # Normalize frequency
    b, a = signal.butter(5, Wn, btype='low', analog=False, output='ba')
    # scipy.signal.butter(N, Wn, btype='low', analog=False, output='ba', fs=None)
    output = signal.filtfilt(b, a, ynew, axis=0)
    # print(output)
    tn = np.linspace(0, 100, 101)  # new time vector for the new time-normalized data
    yn, tn, indie = tnorma(output, k=3, smooth=1, mask=None, show=False)
    np.savetxt("path_name/foldername/file" + foo, yn, delimiter=', ', newline="\n")
However, I am having difficulty in putting header names on the 3 columns per file.
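For the header names, two options could work. This is a sketch with a stand-in array and a hypothetical output path: np.savetxt accepts a header string, and pandas writes a header row by default.
import numpy as np
import pandas as pd

yn = np.random.random((101, 3))                      # stands in for the tnorma output above
out_file = "path_name/foldername/file_example.csv"   # hypothetical output path

# option 1: np.savetxt can prepend a header line; comments='' removes the default leading '#'
np.savetxt(out_file, yn, delimiter=', ', header='col1, col2, col3', comments='')

# option 2: wrap the array in a DataFrame so pandas writes the header row itself
pd.DataFrame(yn, columns=['col1', 'col2', 'col3']).to_csv(out_file, index=False)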

Extract pixels from a PGM file and convert them into a pandas data frame

I have a directory that has subdirectories, each with a bunch of PGM files. I would like to extract the pixels from each image and put them in a pandas data frame.
from PIL import Image
import os
import pandas as pd
import numpy as np
dirs = [r"D:\MSIT\Machine Learning\IMG"+"\\s"+str(i) for i in range(1,41)]
pixels = list()
df = pd.DataFrame(columns = ["f" + str(i) for i in range(1,10305)])
cols = list(df.columns)
for directory in dirs:
    for filename in os.listdir(directory):
        im = Image.open(directory + "\\" + filename)
        dims = (list(im.getdata()))
        df2 = pd.Series(dims)
        pixels.append(dims)
k = 1
for i in pixels:
    for j in i:
        df2 = pd.Series(j)
        df.append(df2, ignore_index = True)
    print(str(k) + "Done")
    k += 1
print(df.head())
df.to_csv('pixel_data.csv')
I'm assuming you want the pixel values of the PGM files to be your features. You can use df.loc to index into the DataFrame and add your data row by row. Also, using numpy would make the process a little bit faster.
import pandas as pd
from PIL import Image
import os
import numpy as np
columns = [i for i in range(10304)]
columns.append('Label')
df = pd.DataFrame(columns=columns)
rows = 0
for direc in os.listdir():
    if direc.startswith('s'):
        print('Adding ' + direc)
        print('--------------')
        for file in os.listdir('./' + direc):
            im = Image.open('./' + direc + '/' + file)
            x = np.array(im.getdata())
            x = x.tolist()
            x.append(int(direc.replace('s', '')))
            df.loc[rows] = x
            rows += 1
df.to_csv('Dataset.csv')

What is the math behind TfidfVectorizer?

I am trying to understand the math behind TfidfVectorizer. I used this tutorial, but my code is a little bit changed.
The tutorial also says at the end that the values differ slightly because sklearn uses a smoothed version of idf and various other little optimizations.
I want to be able to use TfidfVectorizer but also calculate the same simple sample by hand.
Here is my whole code:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
def main():
    documentA = 'the man went out for a walk'
    documentB = 'the children sat around the fire'
    corpus = [documentA, documentB]
    bagOfWordsA = documentA.split(' ')
    bagOfWordsB = documentB.split(' ')
    uniqueWords = set(bagOfWordsA).union(set(bagOfWordsB))

    print('----------- compare word count -------------------')
    numOfWordsA = dict.fromkeys(uniqueWords, 0)
    for word in bagOfWordsA:
        numOfWordsA[word] += 1
    numOfWordsB = dict.fromkeys(uniqueWords, 0)
    for word in bagOfWordsB:
        numOfWordsB[word] += 1

    tfA = computeTF(numOfWordsA, bagOfWordsA)
    tfB = computeTF(numOfWordsB, bagOfWordsB)
    print(pd.DataFrame([tfA, tfB]))

    CV = CountVectorizer(stop_words=None, token_pattern='(?u)\\b\\w\\w*\\b')
    cv_ft = CV.fit_transform(corpus)
    tt = TfidfTransformer(use_idf=False, norm='l1')
    t = tt.fit_transform(cv_ft)
    print(pd.DataFrame(t.todense().tolist(), columns=CV.get_feature_names()))

    print('----------- compare idf -------------------')
    idfs = computeIDF([numOfWordsA, numOfWordsB])
    print(pd.DataFrame([idfs]))

    tfidfA = computeTFIDF(tfA, idfs)
    tfidfB = computeTFIDF(tfB, idfs)
    print(pd.DataFrame([tfidfA, tfidfB]))

    ttf = TfidfTransformer(use_idf=True, smooth_idf=False, norm=None)
    f = ttf.fit_transform(cv_ft)
    print(pd.DataFrame(f.todense().tolist(), columns=CV.get_feature_names()))

    print('----------- TfidfVectorizer -------------------')
    vectorizer = TfidfVectorizer(smooth_idf=False, use_idf=True, stop_words=None, token_pattern='(?u)\\b\\w\\w*\\b', norm=None)
    vectors = vectorizer.fit_transform([documentA, documentB])
    feature_names = vectorizer.get_feature_names()
    print(pd.DataFrame(vectors.todense().tolist(), columns=feature_names))

def computeTF(wordDict, bagOfWords):
    tfDict = {}
    bagOfWordsCount = len(bagOfWords)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bagOfWordsCount)
    return tfDict

def computeIDF(documents):
    import math
    N = len(documents)
    idfDict = dict.fromkeys(documents[0].keys(), 0)
    for document in documents:
        for word, val in document.items():
            if val > 0:
                idfDict[word] += 1
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))
    return idfDict

def computeTFIDF(tfBagOfWords, idfs):
    tfidf = {}
    for word, val in tfBagOfWords.items():
        tfidf[word] = val * idfs[word]
    return tfidf

if __name__ == "__main__":
    main()
I can compare the calculation of term frequency; both results look the same. But when I calculate the IDF and then TF-IDF, there are differences between the code from the website and TfidfVectorizer (I also tried the combination of CountVectorizer and TfidfTransformer to make sure it returns the same results as TfidfVectorizer).
Code Tf-Idf results:
TfidfVectorizer Tf-Idf results:
Can anybody help me with code that would return the same results as TfidfVectorizer, or with a setting of TfidfVectorizer that would return the same results as the code above?
Here is my adaptation of your code to reproduce the TfidfVectorizer output for your data.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from IPython.display import display
documentA = 'the man went out for a walk'
documentB = 'the children sat around the fire'
corpus = [documentA, documentB]
bagOfWordsA = documentA.split(' ')
bagOfWordsB = documentB.split(' ')
uniqueWords = set(bagOfWordsA).union(set(bagOfWordsB))
print('----------- compare word count -------------------')
numOfWordsA = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsA:
    numOfWordsA[word] += 1
numOfWordsB = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsB:
    numOfWordsB[word] += 1
series_A = pd.Series(numOfWordsA)
series_B = pd.Series(numOfWordsB)
df = pd.concat([series_A, series_B], axis=1).T
df = df.reindex(sorted(df.columns), axis=1)
display(df)
tf_df = df.divide(df.sum(1),axis='index')
n_d = 1+ tf_df.shape[0]
df_d_t = 1 + (tf_df.values>0).sum(0)
idf = np.log(n_d/df_d_t) + 1
pd.DataFrame(df.values * idf, columns=df.columns)
tfidf = TfidfVectorizer(token_pattern='(?u)\\b\\w\\w*\\b', norm=None)
pd.DataFrame(tfidf.fit_transform(corpus).todense(), columns=tfidf.get_feature_names())
For more details on the implementation, refer to the documentation here.
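As a side note on the formulas involved: scikit-learn's idf differs from the plain log(N/df) used in computeIDF above. A small helper (sklearn_idf is just a name chosen for this sketch) makes the two variants, as described in the scikit-learn documentation, explicit:
import numpy as np

def sklearn_idf(n_docs, doc_freq, smooth=True):
    if smooth:
        # smooth_idf=True (the default): idf = ln((1 + n) / (1 + df)) + 1
        return np.log((1 + n_docs) / (1 + doc_freq)) + 1
    # smooth_idf=False: idf = ln(n / df) + 1
    return np.log(n_docs / doc_freq) + 1

# two documents; 'the' occurs in both, 'man' in only one
print(sklearn_idf(2, 2), sklearn_idf(2, 1))                              # 1.0, ~1.405
print(sklearn_idf(2, 2, smooth=False), sklearn_idf(2, 1, smooth=False))  # 1.0, ~1.693
The tf-idf value is then the raw term count times this idf (not the relative frequency of computeTF), L2-normalized unless norm=None, which is why the plain log(N/df) in the question's code cannot match TfidfVectorizer even with smooth_idf=False.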

Use of datetime timedelta with numpy 3d array

I have a 3D array with the count of the number of days past a benchmark date (e.g., 01.01.2000). I am interested in the actual day-of-year (DOY: 1-365/366) rather than the total number of days past a given date.
For a single value, the syntax below works. For example:
import numpy as np
import datetime
data = 1595
date = datetime.datetime(2000,1,1,0,0) + datetime.timedelta(data -1)
date.timetuple().tm_yday
134
However, I am having issues with using a 3D array.
import numpy as np
import datetime
data = np.random.randint(5, size = (2,2,2))
data = data + 1595
data
array([[[1596, 1595],
        [1599, 1599]],

       [[1596, 1599],
        [1595, 1595]]])

#Function
def Int_to_DOY(int_array):
    date_ = datetime.datetime(2000,1,1,0,0) + datetime.timedelta(int_array - 1)
    return date_.timetuple().tm_yday

doy_data = data * 0 #Empty array
for i in range(2):
    doy_data[:, :, i] = Int_to_DOY(data[:, :, i])
Here is the error message and I am not able to figure this out.
TypeError: unsupported type for timedelta days component: numpy.ndarray
Thanks for your help.
import numpy as np
import datetime
data = np.random.randint(5, size = (2,2,2))
data = data + 1595
#Function
def Int_to_DOY(int_array):
    date_ = datetime.datetime(2000,1,1,0,0) + datetime.timedelta(int(int_array) - 1)
    return date_.timetuple().tm_yday

doy_data = data.flatten()
for i in range(len(doy_data)):
    doy_data[i] = Int_to_DOY(doy_data[i])
doy_data = doy_data.reshape((2,2,2))
Since you tagged pandas:
data = np.array([[[1596, 1595],
                  [1599, 1599]],
                 [[1596, 1599],
                  [1595, 1595]]])
s = pd.to_datetime('2000-01-01') + pd.to_timedelta(data.ravel(), unit='D')
s.dayofyear.values.reshape(data.shape) - 1
Output:
array([[[135, 134],
        [138, 138]],

       [[135, 138],
        [134, 134]]], dtype=int64)
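A numpy-only variant along the same lines (a sketch using datetime64/timedelta64, so no Python-level loop is needed):
import numpy as np

data = np.array([[[1596, 1595],
                  [1599, 1599]],
                 [[1596, 1599],
                  [1595, 1595]]])

# day offsets -> calendar dates, then subtract each date's own January 1st
dates = np.datetime64('2000-01-01') + (data - 1).astype('timedelta64[D]')
doy = (dates - dates.astype('datetime64[Y]')).astype('timedelta64[D]').astype(int) + 1
print(doy)  # e.g. 1595 -> 134, matching the datetime/timetuple approach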
