Generally, when we write a function in C++ to iterate over a 2D array, it passes through the first row, then moves on to the second row:
for (int i = 0; i < ROW_SIZE; i++) {
    for (int j = 0; j < COL_SIZE; j++) {
        *(Mat + i * COL_SIZE + j) = value;   // row-major: each row is contiguous
    }
}
However, when I use Rcpp::NumericMatrix, it iterates through columns first.
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix TestMatrixParsing() {
    NumericMatrix xx(4, 5);
    int xsize = xx.nrow() * xx.ncol();
    for (int i = 0; i < xsize; i++) {
        xx[i] = i + 100;   // linear indexing follows R's column-major layout
    }
    return xx;
}
/*** R
TestMatrixParsing()
# [,1] [,2] [,3] [,4] [,5]
# [1,] 100 104 108 112 116
# [2,] 101 105 109 113 117
# [3,] 102 106 110 114 118
# [4,] 103 107 111 115 119
*/
Is there any way to force it to iterate through rows? My previous code was written assuming the matrix is stored as consecutive rows, so the pointer doesn't have to jump by COL_SIZE.
Rcpp::NumericMatrix just follows the way R lays out its memory, which is column-major order. A simple solution would be to transpose the matrix:
#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::NumericMatrix TestMatrixParsing() {
    Rcpp::NumericMatrix xx(4, 5);
    int xsize = xx.nrow() * xx.ncol();
    xx = Rcpp::transpose(xx);   // fill the transpose column-by-column ...
    for (int i = 0; i < xsize; ++i) {
        xx[i] = i + 100;
    }
    xx = Rcpp::transpose(xx);   // ... then transpose back: rows end up filled in order
    return xx;
}
/*** R
TestMatrixParsing()
# [,1] [,2] [,3] [,4] [,5]
# [1,] 100 101 102 103 104
# [2,] 105 106 107 108 109
# [3,] 110 111 112 113 114
# [4,] 115 116 117 118 119
*/
That might be too expensive for large matrices, though. In that case it is probably best to adapt your algorithm.
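Not an Rcpp-specific trick, but for intuition about the two layouts: NumPy exposes both orders through reshape's order flag, so both outputs above can be reproduced in a few lines (a sketch, using the same 4x5 values):
import numpy as np

# R (and Fortran) store matrices column-by-column; C stores them row-by-row
col_major = np.arange(100, 120).reshape(4, 5, order='F')  # matches the first output
row_major = np.arange(100, 120).reshape(4, 5, order='C')  # matches the second output
print(col_major)
print(row_major)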
Let's say that I have the following data:
In [1]: df
Out[1]:
Student_Name Maths Physics Chemistry Biology English
0 John Doe 90 87 81 65 70
1 Jane Doe 82 84 75 73 77
2 Mary Lim 40 65 55 60 70
3 Lisa Ray 55 52 77 62 90
I want to add a column to this dataframe which tells me the students' top 'n' subjects that are above a threshold, where the subject names are available in the column names. Let's assume n=3 and threshold=80.
The output would look like the following:
In [3]: df
Out[3]:
Student_Name Maths Physics Chemistry Biology English Top_3_above_80
0 John Doe 90 87 81 65 70 Maths, Physics, Chemistry
1 Jane Doe 82 84 75 73 77 Physics, Maths
2 Mary Lim 40 65 55 60 70 nan
3 Lisa Ray 55 52 77 62 90 English
I tried to use the solution written by @jezrael for this question, where they use numpy.argsort to get the positions of the sorted values for the top 'n' columns, but I am unable to set a threshold value below which nothing should be considered.
The idea is to first replace the non-matching values with missing values using DataFrame.where, then apply the numpy.argsort solution. Count the Trues per row to get the number of non-missing values, and use numpy.where to replace the positions beyond that count with empty strings.
Last, the values are joined in a list comprehension, and the rows with no matches are filtered out to missing values:
import numpy as np
import pandas as pd

df1 = df.iloc[:, 1:]
# boolean mask of marks above the threshold
m = df1 > 80
# number of qualifying subjects per student
count = m.sum(axis=1)
# column names sorted per row by descending mark; NaNs (non-matches) sort last
arr = df1.columns.values[np.argsort(-df1.where(m), axis=1)]
# keep only the first `count` names in each row, blank out the rest
m = np.arange(arr.shape[1]) < count[:, None]
a = np.where(m, arr, '')
L = [', '.join(x).strip(', ') for x in a]
# rows with no qualifying subject get NaN
df['Top_3_above_80'] = pd.Series(L, index=df.index)[count > 0]
print (df)
Student_Name Maths Physics Chemistry Biology English \
0 John Doe 90 87 81 65 70
1 Jane Doe 82 84 75 73 77
2 Mary Lim 40 65 55 60 70
3 Lisa Ray 55 52 77 62 90
Top_3_above_80
0 Maths, Physics, Chemistry
1 Physics, Maths
2 NaN
3 English
If performance is not important, use Series.nlargest per row, but it is really slow for a large DataFrame:
df1 = df.iloc[:, 1:]
m = df1 > 80
count = m.sum(axis=1)
df['Top_3_above_80'] = (df1.where(m)
                           .apply(lambda x: ', '.join(x.nlargest(3).index), axis=1)[count > 0])
print (df)
Student_Name Maths Physics Chemistry Biology English \
0 John Doe 90 87 81 65 70
1 Jane Doe 82 84 75 73 77
2 Mary Lim 40 65 55 60 70
3 Lisa Ray 55 52 77 62 90
Top_3_above_80
0 Maths, Physics, Chemistry
1 Physics, Maths
2 NaN
3 English
Performance:
#4k rows
df = pd.concat([df] * 1000, ignore_index=True)
#print (df)
def f1(df):
    df1 = df.iloc[:, 1:]
    m = df1 > 80
    count = m.sum(axis=1)
    arr = df1.columns.values[np.argsort(-df1.where(m), axis=1)]
    m = np.arange(arr.shape[1]) < count[:, None]
    a = np.where(m, arr, '')
    L = [', '.join(x).strip(', ') for x in a]
    df['Top_3_above_80'] = pd.Series(L, index=df.index)[count > 0]
    return df

def f2(df):
    df1 = df.iloc[:, 1:]
    m = df1 > 80
    count = m.sum(axis=1)
    df['Top_3_above_80'] = (df1.where(m).apply(lambda x: ', '.join(x.nlargest(3).index), axis=1)[count > 0])
    return df
In [210]: %timeit (f1(df.copy()))
19.3 ms ± 272 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [211]: %timeit (f2(df.copy()))
2.43 s ± 61.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
An alternative:
res = []
tmp = df.set_index('Student_Name').T
for col in list(tmp):
    top = tmp[col].nlargest(3)                # the three best marks for this student
    res.append(top[top > 80].index.tolist())  # keep only those above the threshold
res = [x if len(x) > 0 else np.nan for x in res]
df['Top_3_above_80'] = res
Output:
Student_Name Maths Physics Chemistry Biology English Top_3_above_80
0 John Doe 90 87 81 65 70 [Maths, Physics, Chemistry]
1 Jane Doe 82 84 75 73 77 [Physics, Maths]
2 Mary Lim 40 65 55 60 70 NaN
3 Lisa Ray 55 52 77 62 90 [English]
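If you prefer comma-separated strings over lists in the new column, to match the expected output, a small post-processing step on res works:
# join each list into a string; leave the NaN rows as they are
df['Top_3_above_80'] = [', '.join(x) if isinstance(x, list) else x for x in res]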
I have the following dataframe:
A B C Result
0 232 120 9 91
1 243 546 1 12
2 12 120 5 53
I want to perform an operation of the following kind:
A B C Result A-B/A+B A-C/A+C B-C/B+C
0 232 120 9 91 0.318182 0.925311 0.860465
1 243 546 1 12 -0.384030 0.991803 0.996344
2 12 120 5 53 -0.818182 0.411765 0.920000
which I am doing using
df['A-B/A+B']=(df['A']-df['B'])/(df['A']+df['B'])
df['A-C/A+C']=(df['A']-df['C'])/(df['A']+df['C'])
df['B-C/B+C']=(df['B']-df['C'])/(df['B']+df['C'])
which I believe is a very crude and ugly way to do it.
How can I do this in a cleaner way?
You can do the following:
# take all columns except the last one in a list
colnames = df.columns.tolist()[:-1]
# compute the ratio for every distinct pair of columns
for i, c in enumerate(colnames):
    for k in range(i + 1, len(colnames)):
        df[c + '_' + colnames[k]] = (df[c] - df[colnames[k]]) / (df[c] + df[colnames[k]])
# check result
print(df)
A B C Result A_B A_C B_C
0 232 120 9 91 0.318182 0.925311 0.860465
1 243 546 1 12 -0.384030 0.991803 0.996344
2 12 120 5 53 -0.818182 0.411765 0.920000
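The pair generation can also be delegated to itertools.combinations, which yields each unordered pair of columns exactly once; a minimal sketch, assuming the original df where the value columns are everything before Result, producing the column names the question asked for:
from itertools import combinations

# one new column per pair of value columns: (A, B), (A, C), (B, C)
for a, b in combinations(df.columns[:-1], 2):
    df[f'{a}-{b}/{a}+{b}'] = (df[a] - df[b]) / (df[a] + df[b])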
This is also a case where DataFrame.eval can help, with one caveat: eval applies normal operator precedence, so the string 'A-B/A+B' evaluates as A - (B/A) + B rather than (A-B)/(A+B). The expressions therefore need explicit parentheses, while the original strings can still serve as the column names:
cols = ['A-B/A+B', 'A-C/A+C', 'B-C/B+C']
exprs = ['(A-B)/(A+B)', '(A-C)/(A+C)', '(B-C)/(B+C)']
x = pd.DataFrame({col: df.eval(expr) for col, expr in zip(cols, exprs)})
df = df.assign(**x)
     A    B  C  Result   A-B/A+B   A-C/A+C   B-C/B+C
0  232  120  9      91  0.318182  0.925311  0.860465
1  243  546  1      12 -0.384030  0.991803  0.996344
2   12  120  5      53 -0.818182  0.411765  0.920000
As mentioned in the documentation, eval is used to:
Evaluate a string describing operations on DataFrame columns.
I am new to Python and trying to subset a data frame of user-movie ratings, first by row totals and then by column totals. The filter by column totals is taking hours to complete, so I was wondering if you could give me some pointers to optimize the code.
data_cols = ['user_id','movie_id','rating']
data = pd.read_csv('netflix_data/TrainingRatings.txt', sep=',', names=data_cols)
utrain = (data.sort_values('user_id'))
print(utrain.tail())
Movie_Ratings = utrain.pivot_table(index = ['user_id'],columns = ['movie_id'], values = ['rating'], aggfunc = lambda x:x)
Movie_Ratings.head()
Movie_Ratings = Movie_Ratings.fillna(0)
#Filter by column totals
Movie_Ratings.loc[len(Movie_Ratings)] = [Movie_Ratings[col].sum() for col in Movie_Ratings.columns]
##Following portion is taking the maximum amount of time
x = Movie_Ratings.loc[len(Movie_Ratings)-1]
for col in Movie_Ratings.columns:
    if x[col] <= 500:
        Movie_Ratings.drop(col, axis=1, inplace=True)
First, you can use DataFrame.sum instead of the per-column list comprehension:
Movie_Ratings.loc[len(Movie_Ratings)] = Movie_Ratings.sum()
And then filter without loops:
import numpy as np
import pandas as pd

np.random.seed(100)
Movie_Ratings = pd.DataFrame(np.random.randint(250, size=(5, 5)), columns=list('ABCDE'))
print (Movie_Ratings)
A B C D E
0 8 24 67 103 87
1 79 176 138 94 180
2 98 53 66 226 14
3 34 241 240 24 143
4 228 107 60 58 144
Movie_Ratings.loc[len(Movie_Ratings)] = Movie_Ratings.sum()
Movie_Ratings = Movie_Ratings.loc[:, ~(Movie_Ratings.iloc[-1] <= 500)]
# Or change the condition to > and drop the ~ to avoid the double negation:
# Movie_Ratings = Movie_Ratings.loc[:, Movie_Ratings.iloc[-1] > 500]
print (Movie_Ratings)
B C D E
0 24 67 103 87
1 176 138 94 180
2 53 66 226 14
3 241 240 24 143
4 107 60 58 144
5 601 571 505 568
Explanation:
print (Movie_Ratings.iloc[-1])
A 447
B 601
C 571
D 505
E 568
Name: 5, dtype: int64
print (Movie_Ratings.iloc[-1]<= 500)
A True
B False
C False
D False
E False
Name: 5, dtype: bool
print (~(Movie_Ratings.iloc[-1]<= 500))
A False
B True
C True
D True
E True
Name: 5, dtype: bool
I have a DataFrame, foo:
A B C D E
0 50 46 18 65 55
1 48 56 98 71 96
2 99 48 36 79 70
3 15 24 25 67 34
4 77 67 98 22 78
and another DataFrame, bar, which contains the greatest 2 values of each row of foo. All other values have been replaced with zeros, to create sparsity:
A B C D E
0 0 0 0 65 55
1 0 0 98 0 96
2 99 0 0 79 0
3 0 0 0 67 34
4 0 0 98 0 78
How can I test that every row in bar really contains the desired values?
One more thing: the solution should work with large DataFrames, i.e. 20000 x 20000.
Obviously you can do that with looping and efficient sorting, but maybe a better way would be:
import numpy as np

n = foo.shape[0]

# Test 1:
# bar keeps the original data except for the zeroed entries, so foo - bar
# is zero in exactly the 2 kept positions of each row
diff = foo - bar
test1 = ((diff == 0).sum(axis=1) == 2).sum() == n

# Test 2:
# bar has 3 zeros on each line (5 columns, top 2 kept)
test2 = ((bar == 0).sum(axis=1) == 3).sum() == n

# Test 3:
# the 2 values that bar keeps are the maxima; replace the zeros with NaN
# so that min() ignores them
bar2 = bar.replace(0, np.nan)
# the max of the removed values must not exceed the min of the kept ones
# (<= rather than < allows ties at the cutoff)
row_ok = diff.max(axis=1) <= bar2.min(axis=1)
test3 = row_ok.sum() == n
I think this covers all cases, but haven't tested it all...
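Another option that scales well to 20000 x 20000 is to reconstruct the expected sparse frame with np.partition and compare it to bar; a sketch assuming no ties at the per-row cutoff:
import numpy as np

vals = foo.to_numpy()
# second-largest value of each row, kept as a column vector for broadcasting
cutoff = np.partition(vals, -2, axis=1)[:, -2:-1]
# keep the top 2 values per row, zero out the rest
expected = np.where(vals >= cutoff, vals, 0)
print((expected == bar.to_numpy()).all())   # True if bar is correct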
I wish to access data within a table with dot notation that includes strings. I have a list of strings representing columns of interest within the table. How can I access data using those strings? I wish to create a loop going through the list of strings.
For example, for table T I have columns {a b c d e}, and a 1x3 cell cols = {'b' 'd' 'e'}.
Can I retrieve data using cols in a format like T.cols(1) (or equivalent) to give me the same result as T.b?
You can fetch the data directly using {curly braces} and strings as column indices, as you would for a cell array.
For example, let's create a dummy table (modified from the docs):
clc;
close all;
LastName = {'Smith';'Johnson';'Williams';'Jones';'Brown'};
a = [38;43;38;40;49];
b = [71;69;64;67;64];
c = [176;163;131;133;119];
d = [124 93; 109 77; 125 83; 117 75; 122 80];
T = table(a,b,c,d,...
'RowNames',LastName)
The table looks like this:
T =
a b c d
__ __ ___ __________
Smith 38 71 176 124 93
Johnson 43 69 163 109 77
Williams 38 64 131 125 83
Jones 40 67 133 117 75
Brown 49 64 119 122 80
Now select columns of interest and get the data:
%// Select columns of interest
cols = {'a' 'c'};
%// Fetch data
T{:,cols}
ans =
38 176
43 163
38 131
40 133
49 119
Yay!
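If you specifically want dot-style access inside a loop, MATLAB's dynamic field names also work on tables: T.(cols{i}) returns the same column as T.b when cols{i} is 'b'.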