Creating 2D matrix from RDD - apache-spark

I have the following RDD of type ((UserID, MovieID), 1):
val data_wo_header = dropheader(data).map(_.split(",")).map(x => ((x(0).toInt, x(1).toInt), 1))
I want to convert this data structure into a 2D array such that every (userID, movieID) pair present in the original RDD has a 1, and every other cell has a 0.
I think we have to map the user IDs to 0..N-1, where N is the number of distinct users, and the movie IDs to 0..M-1, where M is the number of distinct movies.
EDIT: example
Movie ID ->
UserID  1  2  3  4  5  6  7
1       0  1  1  0  0  1  0
2       0  1  0  1  0  0  0
3       0  1  1  0  0  0  1
4       1  1  0  0  1  0  0
5       0  1  1  0  0  0  1
6       1  1  1  1  1  0  0
7       0  1  1  0  0  0  0
8       0  1  1  1  0  0  1
9       0  1  1  0  0  1  0
The RDD will be of the form (userID, movID, rating):
101,1002,3.5
101,1003,2.5
101,1006,3
102,1002,3.5
102,1004,4.0
103,1002,1.0
103,1003,1.0
103,1007,5.0
….

Assuming the (userID, movID, rating) format you mentioned:
val baseRDD = sc.parallelize(Seq((101, 1002, 3.5), (101, 1003, 2.5), (101, 1006, 3.0), (102, 1002, 3.5), (102, 1004, 4.0), (103, 1002, 1.0), (103, 1003, 1.0), (103, 1007, 5.0)))
baseRDD.map(x => (x._1, x._2)).groupByKey().foreach(println)
Result:
(101,CompactBuffer(1002, 1003, 1006))
(102,CompactBuffer(1002, 1004))
(103,CompactBuffer(1002, 1003, 1007))

Hi, I managed to generate the 2D matrix using the following function. It takes in an RDD of the format ((userID, movID), rating):
101,1002,3.5
101,1003,2.5
101,1006,3
102,1002,3.5
102,1004,4.0
103,1002,1.0
103,1003,1.0
103,1007,5.0
and returns the characteristic matrix:
import org.apache.spark.rdd.RDD

def generate_characteristic_matrix(data_wo_header: RDD[((Int, Int), Int)]): Array[Array[Int]] = {
  val distinct_user_IDs = data_wo_header.map(x => x._1._1).distinct().collect().sorted
  val distinct_movie_IDs = data_wo_header.map(x => x._1._2).distinct().collect().sorted
  // map user IDs to indices 0..user_count-1 and movie IDs to indices 0..movie_count-1
  val map_user = distinct_user_IDs.zipWithIndex.toMap
  val map_movie = distinct_movie_IDs.zipWithIndex.toMap
  // the characteristic matrix has dimensions user_count x movie_count
  val char_matrix = Array.ofDim[Int](distinct_user_IDs.length, distinct_movie_IDs.length)
  // set a 1 for every (user, movie) pair present in the RDD
  data_wo_header.collect().foreach { case ((user, movie), _) =>
    char_matrix(map_user(user))(map_movie(movie)) = 1
  }
  char_matrix
}

Related

How to split string with the values in their specific columns indexed on their label?

I have the following data
Index Data
0 100CO
1 50CO-50PET
2 98CV-2EL
3 50CV-50CO
. .
. .
. .
I have to split the data into different columns, each with its own header and its values; the result should be as below:
Index Data CO PET CV EL
0 100CO 100 0 0 0
1 50CO-50PET 50 50 0 0
2 98CV-2EL 0 0 98 2
3 50CV-50CO 50 0 50 0
. .
. .
. .
The data is not limited to CO/PET/CV/EL; I will need as many columns as necessary, each displaying its corresponding value.
The .str.split('-', expand=True) function only delimits the data; it keeps all first values in the same column and does not rename the columns.
Is there a way to implement this in Python?
You could do:
df.Data.str.split('-').explode().str.split(r'(?<=\d)(?=\D)',expand = True). \
reset_index().pivot('index',1,0).fillna(0).reset_index()
1 Index CO CV EL PET
0 0 100 0 0 0
1 1 50 0 0 50
2 2 0 98 2 0
3 3 50 50 0 0
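Note that in newer pandas (2.0+), pivot takes keyword-only arguments, so the same chain would be written as below; a sketch, assuming the same df:
out = (df.Data.str.split('-').explode()
         .str.split(r'(?<=\d)(?=\D)', expand=True)
         .reset_index()
         .pivot(index='index', columns=1, values=0)
         .fillna(0)
         .reset_index())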
The idea is to first split the values by -, then extract the numeric and non-numeric parts into tuples, append them to a list, and convert that list to a dictionary. The dictionaries are passed in a list comprehension to the DataFrame constructor, missing values are replaced, and the result is converted to numeric:
import re
def f(x):
    L = []
    for val in x.split('-'):
        k, v = re.findall(r'(\d+)(\D+)', val)[0]
        L.append((v, k))
    return dict(L)
df = df.join(pd.DataFrame([f(x) for x in df['Data']], index=df.index).fillna(0).astype(int))
print(df)
Data CO PET CV EL
0 100CO 100 0 0 0
1 50CO-50PET 50 50 0 0
2 98CV-2EL 0 0 98 2
3 50CV-50CO 50 0 50 0
If the data contains some values without a number, or number-only values, the solution should be generalized, for example:
print(df)
Data
0 100CO
1 50CO-50PET
2 98CV-2EL
3 50CV-50CO
4 AAA
5 20
def f(x):
    L = []
    for val in x.split('-'):
        extracted = re.findall(r'(\d+)(\D+)', val)
        if len(extracted) > 0:
            k, v = extracted[0]
            L.append((v, k))
        else:
            if val.isdigit():
                L.append(('No match digit', val))
            else:
                L.append((val, 0))
    return dict(L)
df = df.join(pd.DataFrame([f(x) for x in df['Data']], index=df.index).fillna(0).astype(int))
print(df)
Data CO PET CV EL AAA No match digit
0 100CO 100 0 0 0 0 0
1 50CO-50PET 50 50 0 0 0 0
2 98CV-2EL 0 0 98 2 0 0
3 50CV-50CO 50 0 50 0 0 0
4 AAA 0 0 0 0 0 0
5 20 0 0 0 0 0 20
Try this:
import pandas as pd
import re
df = pd.DataFrame({'Data':['100CO', '50CO-50PET', '98CV-2EL', '50CV-50CO']})
split_df = pd.DataFrame(df.Data.apply(lambda x: {re.findall('[A-Z]+', el)[0] : re.findall('[0-9]+', el)[0] \
for el in x.split('-')}).tolist())
split_df = split_df.fillna(0)
df = pd.concat([df, split_df], axis = 1)
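Note that the extracted values here are strings; if numeric columns are wanted, a cast can follow, for example:
split_df = split_df.fillna(0).astype(int)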

I have a DataFrame's columns and data in lists; I want to put the relevant data in the relevant column

Suppose you are given a list of all the items you can have, and separately a list of data whose shape is not fixed: it may contain any number of items. You want to create a DataFrame from it and put each item in the right column.
For example:
columns = ['shirt','shoe','tie','hat']
data = [['hat', 'tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]
# and from this I want to create dummy variables like this
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
If you want indicator columns filled with 0 and 1 only, use MultiLabelBinarizer, together with DataFrame.reindex if you want to change the ordering of the columns to match the list and, should some value not exist, to add an all-0 column:
columns = ['shirt','shoe','tie','hat']
data = [['hat', 'tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = (pd.DataFrame(mlb.fit_transform(data),columns=mlb.classes_)
.reindex(columns, axis=1, fill_value=0))
print(df)
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
Or Series.str.get_dummies:
df = pd.Series(data).str.join('|').str.get_dummies().reindex(columns, axis=1, fill_value=0)
print(df)
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
This is one approach using collections.Counter.
Ex:
from collections import Counter
columns = ['shirt','shoe','tie','hat']
data = [['hat', 'tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]
data = map(Counter, data)
#df = pd.DataFrame(data, columns=columns)
df = pd.DataFrame(data, columns=columns).fillna(0).astype(int)
print(df)
Output:
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
You can try converting data to a dataframe:
data = [['hat', 'tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]
df = pd.DataFrame(data)
df
0 1 2
0 hat tie None
1 shoe tie shirt
2 tie shirt None
Then use:
pd.get_dummies(df.stack()).groupby(level=0).agg('sum')
hat shirt shoe tie
0 1 0 0 1
1 0 1 1 1
2 0 1 0 1
Explanation:
df.stack() returns a MultiIndex Series:
0  0      hat
   1      tie
1  0     shoe
   1      tie
   2    shirt
2  0      tie
   1    shirt
dtype: object
If we get the dummy values of this series we get:
     hat  shirt  shoe  tie
0 0    1      0     0    0
  1    0      0     0    1
1 0    0      0     1    0
  1    0      0     0    1
  2    0      1     0    0
2 0    0      0     0    1
  1    0      1     0    0
Then you just have to group by the first index level and aggregate with sum (because we know there will only be ones and zeros after get_dummies):
df = pd.get_dummies(df.stack()).groupby(level=0).agg('sum')

Faster way to count number of timestamps before another timestamp

I have two dataframes, "train" and "log". "log" has a datetime column "time1", while "train" has a datetime column "time2". For every row in "train", I want to count the rows of "log" with the same user_id whose "time1" is before "time2".
I already tried the apply method:
def log_count(row):
    return sum((log['user_id'] == row['user_id']) & (log['time1'] < row['time2']))
train.apply(log_count, axis=1)
It is taking very long with this approach.
Since you want to do this once for each (paired) user_id group, you could do the following:
Create a column called is_log which is 1 in log and 0 in train:
log['is_log'] = 1
train['is_log'] = 0
The is_log column will be used to keep track of whether or not a row comes from log or train.
Concatenate the log and train DataFrames:
combined = pd.concat(
    [log.rename(columns=dict(time1="time")), train.rename(columns=dict(time2="time"))],
    axis=0,
    ignore_index=True,
    sort=False,
)
Sort the combined DataFrame by user_id and time:
combined = combined.sort_values(by=["user_id", "time"])
So now combined looks something like this:
time user_id is_log
6 2000-01-17 0 0
0 2000-03-13 0 1
1 2000-06-08 0 1
7 2000-06-25 0 0
4 2000-07-09 0 1
8 2000-07-18 0 0
10 2000-03-13 1 0
5 2000-04-16 1 0
3 2000-08-04 1 1
9 2000-08-17 1 0
2 2000-10-20 1 1
Now the count that you are looking for can be expressed as a cumulative sum of the is_log column, grouped by user_id:
combined["count"] = combined.groupby("user_id")["is_log"].cumsum()
train = combined.loc[combined["is_log"] == 0]
This is the main idea: Counting the number of 1s in the is_log column is equivalent to counting the number of times in log which come before each time in train.
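One caveat worth adding: if a time1 can exactly equal a time2 for the same user and the comparison must stay strict, is_log can be included as a secondary sort key, so that at equal times train rows (is_log = 0) come first and the tied log rows are not counted:
# keep "strictly before" semantics when timestamps tie
combined = combined.sort_values(by=["user_id", "time", "is_log"])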
For example,
import numpy as np
import pandas as pd
np.random.seed(2019)
def random_dates(N):
    return np.datetime64("2000-01-01") + np.random.randint(
        365, size=N
    ) * np.timedelta64(1, "D")
N = 5
log = pd.DataFrame({"time1": random_dates(N), "user_id": np.random.randint(2, size=N)})
train = pd.DataFrame(
    {
        "time2": np.r_[random_dates(N), log.loc[0, "time1"]],
        "user_id": np.random.randint(2, size=N + 1),
    }
)
log["is_log"] = 1
train["is_log"] = 0
combined = pd.concat(
    [log.rename(columns=dict(time1="time")), train.rename(columns=dict(time2="time"))],
    axis=0,
    ignore_index=True,
    sort=False,
)
combined = combined.sort_values(by=["user_id", "time"])
combined["count"] = combined.groupby("user_id")["is_log"].cumsum()
train = combined.loc[combined["is_log"] == 0]
print(log)
# time1 user_id is_log
# 0 2000-03-13 0 1
# 1 2000-06-08 0 1
# 2 2000-10-20 1 1
# 3 2000-08-04 1 1
# 4 2000-07-09 0 1
print(train)
yields
time user_id is_log count
6 2000-01-17 0 0 0
7 2000-06-25 0 0 2
8 2000-07-18 0 0 3
10 2000-03-13 1 0 0
5 2000-04-16 1 0 0
9 2000-08-17 1 0 1

Create new rows out of columns with multiple items in Python

I have this code and I need to create a data frame similar to the picture attached. Thanks.
import pandas as pd
Product = [(100, 'Item1, Item2'),
           (101, 'Item1, Item3'),
           (102, 'Item4')]
labels = ['product', 'info']
ProductA = pd.DataFrame.from_records(Product, columns=labels)
Cust = [('A', 200),
        ('A', 202),
        ('B', 202),
        ('C', 200),
        ('C', 204),
        ('B', 202),
        ('A', 200),
        ('C', 204)]
labels = ['customer', 'product']
Cust1 = pd.DataFrame.from_records(Cust, columns=labels)
merge with get_dummies (dfA and dfB here refer to the product and customer tables from the attached picture, where the product table has a tags column):
dfA.merge(dfB).set_index('customer').tags.str.get_dummies(', ').sum(level=0,axis=0)
Out[549]:
chocolate filled glazed sprinkles
customer
A 3 1 0 2
C 1 0 2 1
B 2 2 0 0
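A side note: DataFrame.sum(level=...) was removed in newer pandas; an equivalent call, assuming the same dfA and dfB, groups by the index level instead:
dfA.merge(dfB).set_index('customer').tags.str.get_dummies(', ').groupby(level=0).sum()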
IIUC, this is possible with merge, split, melt and concat:
dfB = dfB.merge(dfA, on='product')
dfB = pd.concat([dfB.iloc[:,:-1], dfB.tags.str.split(',', expand=True)], axis=1)
dfB = dfB.melt(id_vars=['customer', 'product']).drop(columns = ['product', 'variable'])
dfB = pd.concat([dfB.customer, pd.get_dummies(dfB['value'])], axis=1)
dfB
Output:
customer filled sprinkles chocolate glazed
0 A 0 0 1 0
1 C 0 0 1 0
2 A 0 0 1 0
3 A 0 0 1 0
4 B 0 0 1 0
5 B 0 0 1 0
6 C 0 0 0 1
7 C 0 0 0 1
8 A 0 1 0 0
9 C 0 1 0 0
10 A 0 1 0 0
11 A 1 0 0 0
12 B 1 0 0 0
13 B 1 0 0 0
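If per-customer counts like the first answer's output are wanted, a final aggregation could follow; a sketch, assuming the dfB built above:
# sum the indicator columns per customer
dfB = dfB.groupby('customer', as_index=False).sum()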

Create a new large matrix by stacking K matrices along its diagonal

I have K (let K be 7 here) distinct matrices of dimension (50, 50).
I would like to create a new matrix L by filling its diagonal with the K matrices. Hence L is of dimension (50*K, 50*K).
What I have tried:
K1=np.random.random((50,50))
N,N=K1.shape
K=7
out=np.zeros((K,N,K,N),K1.dtype)
np.einsum('ijik->ijk', out)[...] = K1
L=out.reshape(K*N, K*N) # L is of dimension (50*7,50*7)=(350,350)
It indeed creates a new matrix L by stacking K1 seven times along its diagonal. However, I would like to stack K1, K2, ..., K7 respectively, rather than K1 seven times.
Inputs:
K1=np.random.random((50,50))
K2=np.random.random((50,50))
K3=np.random.random((50,50))
K4=np.random.random((50,50))
K5=np.random.random((50,50))
K6=np.random.random((50,50))
K7=np.random.random((50,50))
L=np.zeros((50*7,50*7))
Expected outputs:
L[:50,:50]=K1
L[50:100,50:100]=K2
L[100:150,100:150]=K3
L[150:200,150:200]=K4
L[200:250,200:250]=K5
L[250:300,250:300]=K6
L[300:350,300:350]=K7
You could try scipy.linalg.block_diag. If you look at the source, this function basically just loops over the given blocks, the way you have written in your expected output. It can be used like:
import numpy as np
import scipy as sp
import scipy.linalg
K1=np.random.random((50,50))
K2=np.random.random((50,50))
K3=np.random.random((50,50))
K4=np.random.random((50,50))
K5=np.random.random((50,50))
K6=np.random.random((50,50))
K7=np.random.random((50,50))
L=sp.linalg.block_diag(K1,K2,K3,K4,K5,K6,K7)
If you have your K matrices as an ndarray of shape (7, 50, 50), you can unpack it directly:
K=np.random.random((7,50,50))
L=sp.linalg.block_diag(*K)
If you don't want to import scipy, you can always just write a simple loop to do what you have written for the expected output.
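For instance, a minimal loop-based sketch (block_diag_loop is a hypothetical name, not a library function):
import numpy as np
def block_diag_loop(blocks):
    # output shape is the sum of the block shapes along each axis
    n = sum(b.shape[0] for b in blocks)
    m = sum(b.shape[1] for b in blocks)
    out = np.zeros((n, m), dtype=blocks[0].dtype)
    r = c = 0
    for b in blocks:
        # place each block at the current diagonal offset, then advance
        rows, cols = b.shape
        out[r:r + rows, c:c + cols] = b
        r += rows
        c += cols
    return out
# L = block_diag_loop([K1, K2, K3, K4, K5, K6, K7])  # shape (350, 350)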
Here is a way to do that with NumPy:
import numpy as np
def put_in_diagonals(a):
    n, rows, cols = a.shape
    b = np.zeros((n * rows, n * cols), dtype=a.dtype)
    a2 = a.reshape(-1, cols)
    ii, jj = np.indices(a2.shape)
    jj += (ii // rows) * cols
    b[ii, jj] = a2
    return b
# Test
a = np.arange(24).reshape(4, 2, 3)
print(put_in_diagonals(a))
Output:
[[ 0 1 2 0 0 0 0 0 0 0 0 0]
[ 3 4 5 0 0 0 0 0 0 0 0 0]
[ 0 0 0 6 7 8 0 0 0 0 0 0]
[ 0 0 0 9 10 11 0 0 0 0 0 0]
[ 0 0 0 0 0 0 12 13 14 0 0 0]
[ 0 0 0 0 0 0 15 16 17 0 0 0]
[ 0 0 0 0 0 0 0 0 0 18 19 20]
[ 0 0 0 0 0 0 0 0 0 21 22 23]]
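Applied to the original problem, a call could look like this (a sketch, assuming K1..K7 from above):
K = np.stack([K1, K2, K3, K4, K5, K6, K7])  # shape (7, 50, 50)
L = put_in_diagonals(K)                     # shape (350, 350)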
