Seeking efficient way to categorize values in Spark - apache-spark

Looking to improve the speed of my current approach for going through an RDD and categorizing the row values. For example, say I have four columns and five rows, and want to count the number of values in each category per column. My final result would look like:
Column Name   Category 1   Category 2   Category 3
Col 1         2            3            0
Col 2         0            4            1
Col 3         0            0            5
Col 4         2            2            1
I've been testing two approaches:
Approach 1
Map each row to a list of count tuples. Before, a row could look like ['val1', 2.0, '0001231', True]; afterwards it would look like [(1, 0, 0), (0, 1, 0), (0, 1, 0), (0, 0, 1)]
Reduce each row by adding the tuples together
my_rdd.map("function to categorize").reduce("add tuples")
Approach 2
Flat map each value to its own row as a key-value pair. Similar to the first approach, but the result would look like ("col1", (1, 0, 0)), ("col2", (0, 1, 0)), ("col3", (0, 1, 0)), ("col4", (0, 0, 1)), where each tuple becomes a new row.
Reduce by key
my_rdd.flatMap("function to categorize and create row for each column value").reduceByKey("add tuples for each column")
Is there a more efficient way to do this for a generic RDD? I'm also looking to count distinct values while doing the mapping, but I realize that would require a separate discussion.
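Neither approach is written out in full above, so here is a minimal pure-Python sketch of Approach 1's tuple arithmetic (the categorization rule is hypothetical; the two functions are the kind of thing that would plug into my_rdd.map(...).reduce(...)):

```python
from functools import reduce

def categorize(value):
    # Hypothetical rule: booleans -> category 3, strings -> category 1,
    # numbers -> category 2. Replace with your own category logic.
    if isinstance(value, bool):
        return (0, 0, 1)
    if isinstance(value, str):
        return (1, 0, 0)
    return (0, 1, 0)

def categorize_row(row):
    # The map step: one count tuple per column.
    return [categorize(v) for v in row]

def add_tuples(row_a, row_b):
    # The reduce step: element-wise sum of the per-column tuples.
    return [tuple(a + b for a, b in zip(ta, tb))
            for ta, tb in zip(row_a, row_b)]

rows = [['val1', 2.0, '0001231', True],
        ['val2', 3.0, 'x', False]]
counts = reduce(add_tuples, map(categorize_row, rows))
# counts[0] holds the category counts for column 1, and so on.
```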

Nuanced Excel Question; calculating proportions

Fellow overflowers, all help is appreciated.
I have the following rows of data in Excel (always 7 values per row; 3 examples below), where data is coded as 1 or 2. I am interested in the 1's.
2, 2, 1, 2, 2, 1, 1.
1, 2, 2, 2, 2, 1, 2.
2, 2, 2, 1, 1, 1, 2.
I use =MATCH(1,A1:G1,0) to tell me WHEN the first 1 appears, BUT now I want to calculate the proportion that 1's make up of the remaining values in the row.
For example;
2, 2, 1, 2, 2, 1, 1. (1 first appears at point 3, but then 1's make up 2 out of 4 remaining points; 50%).
1, 2, 2, 2, 2, 1, 2. (1 first appears at point 1, but then 1's make up 1 out of the 6 remaining points; 16%).
2, 2, 2, 1, 1, 1, 2. (1 first appears at point 4, but then 1's make up 2 out of the 3 remaining points; 66%).
Please help me calculate this proportion!
You could use this one
=(LEN(SUBSTITUTE(SUBSTITUTE(MID(A1,SEARCH(1,A1)+3,1000)," ",""),",",""))
-LEN(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(MID(A1,SEARCH(1,A1)+3,1000)," ",""),",",""),1,""))
)/LEN(SUBSTITUTE(SUBSTITUTE(MID(A1,SEARCH(1,A1)+3,1000)," ",""),",",""))
The
SUBSTITUTE(SUBSTITUTE(MID(A1,SEARCH(1,A1)+3,1000)," ",""),",","")
part gets the string after the first 1. The single 1 in the middle part is the one you want to calculate the percentage for. So if you want to adapt the formula to other characters, you have to change the single 1 in the middle part and the three 1s in the three searches.
EDIT: Thank you for the hint, #foxfire.
A solution for values in columns would be
=COUNTIF(INDEX(A1:G1,1,MATCH(1,A1:G1,0)+1):G1,1)/(COUNT(A1:G1)-MATCH(1,A1:G1,0))
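The string-based formula's logic can be mirrored in plain Python (assuming, as that formula does, that the whole row lives in one cell as comma-separated text):

```python
def proportion_after_first_one(cell):
    # Everything after the first "1", like MID(A1, SEARCH(1,A1)+3, 1000).
    rest = cell[cell.index("1") + 1:]
    # Drop spaces, commas and the trailing period, keeping only the digits.
    digits = [ch for ch in rest if ch.isdigit()]
    # Proportion of 1's among the remaining values.
    return digits.count("1") / len(digits)

proportion_after_first_one("2, 2, 1, 2, 2, 1, 1.")  # 2 of 4 remaining -> 0.5
```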
You can do it with SUMPRODUCT:
My formula in column H is a MATCH like yours:
=MATCH(1;A3:G3;0)
My formula for calculating the % of 1's over the remaining numbers after the first 1 found is:
=SUMPRODUCT((A3:G3=1)*(COLUMN(A3:G3)>H3))/(7-H3)
This is how it works:
(A3:G3=1) will return an array of 1 and 0 if cell value is 1 or not. So for row 3 it would be {0;0;1;0;0;1;1}.
COLUMN(A3:G3)>H3 will return an array of 1 and 0 depending on whether the column number of the cell is higher than the column number of the first 1 found (which matches its position inside the array). So for row 3 it would be {0;0;0;1;1;1;1}
We multiply both arrays. So for row 3 it would be {0;0;1;0;0;1;1} * {0;0;0;1;1;1;1} = {0;0;0;0;0;1;1}
With SUMPRODUCT we sum up the array of 1 and 0 from previous step. So for row 3 we would obtain 2. That means there are 2 cells with value 1 after first 1 found.
(7-H3) will just return how many cells are after the first 1 found; so for row 3, there are 4 cells after the first 1 found.
We divide the value from step 4 by the value from the previous step, and that's the % you want. So for row 3, it would be 2/4 = 0.50, i.e. 50%.
update: I used 2 columns just in case you need to show where the first 1 is. But if you want a single column with the %, the formula would be:
=SUMPRODUCT((A3:G3=1)*(COLUMN(A3:G3)>MATCH(1;A3:G3;0)))/(7-MATCH(1;A3:G3;0))
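The SUMPRODUCT logic translates almost word for word into Python (here the row is a list of numbers rather than worksheet cells):

```python
def proportion_after_first_one(row):
    # MATCH(1, row, 0): 1-based position of the first 1.
    first = row.index(1) + 1
    # (row = 1) * (column > first): count 1's strictly after that position.
    ones_after = sum(1 for pos, v in enumerate(row, start=1)
                     if v == 1 and pos > first)
    # Divide by the number of remaining cells, (7 - first).
    return ones_after / (len(row) - first)

proportion_after_first_one([2, 2, 1, 2, 2, 1, 1])  # -> 0.5
```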

Using xlswrite in MATLAB

I am working with three datasets in MATLAB, e.g.,
Dates:
There are D dates, each a char array, saved in a cell array.
{'01-May-2019','02-May-2019','03-May-2019'....}
Labels:
There are 100 labels, each a string, saved in a cell array.
{'A','B','C',...}
Values:
[0, 1, 2,...]
This is one row of the Values matrix of size D×100.
I would like the following output in Excel:
date labels Values
01-May-2019 A 0
01-May-2019 B 1
01-May-2019 C 2
until the same date has repeated 100 times. Then the next date is added (again repeated 100 times) in the subsequent rows, along with the 100 labels in the second column and the values from the 2nd row of the Values matrix, transposed, in the third column. This repeats until the number of dates D is reached.
For the first date, I used:
c_1 = {datestr(datenum(dates(1))*ones(100,1))}
c_2 = labels
c_3 = num2cell(Values(1,:)')
xlswrite('test.xls',[c_1, c_2, c_3])
but, unfortunately, this seems to have put everything in one column, i.e., the date, then the labels, then the 1st row of the values array. I need these to be in three columns.
Also, I think that the above needs to be in a for loop over each day that I am considering. I tried using the table function, but, didn't have much luck with it.
How to solve this efficiently?
You can use repmat and reshape to build your columns and (optionally) add them to a table for exporting.
For example:
dates = {'01-May-2019','02-May-2019'};
labels = {'A','B', 'C'};
values = [0, 1, 2];
n_dates = numel(dates);
n_labels = numel(labels);
dates_repeated = reshape(repmat(dates, n_labels, 1), [], 1);
labels_repeated = reshape(repmat(labels, n_dates, 1).', [], 1);
values_repeated = reshape(repmat(values, n_dates, 1).', [], 1);
full_table = table(dates_repeated, labels_repeated, values_repeated);
Gives us the following table:
>> full_table
full_table =
6×3 table
dates_repeated labels_repeated values_repeated
______________ _______________ _______________
'01-May-2019' 'A' 0
'01-May-2019' 'B' 1
'01-May-2019' 'C' 2
'02-May-2019' 'A' 0
'02-May-2019' 'B' 1
'02-May-2019' 'C' 2
Which should export to a spreadsheet with writetable as desired.
What we're doing with repmat and reshape is "stacking" the values and then converting them into a single column:
>> repmat(dates, n_labels, 1)
ans =
3×2 cell array
{'01-May-2019'} {'02-May-2019'}
{'01-May-2019'} {'02-May-2019'}
{'01-May-2019'} {'02-May-2019'}
We transpose the labels and values so they get woven together (e.g. [0, 1, 0, 1] rather than [0, 0, 1, 1]), since MATLAB's repmat and reshape are column-major.
If you don't want the intermediate table, you can use num2cell to create a cell array from values so you can concatenate all 3 cell arrays together for xlswrite (or writecell, added in R2019a, which MathWorks now recommends over xlswrite for cell arrays):
values_repeated = num2cell(reshape(repmat(values, n_dates, 1).', [], 1));
full_array = [dates_repeated, labels_repeated, values_repeated];
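If you want to sanity-check the expected row ordering, the repeat-and-weave pattern is easy to mirror outside MATLAB; a small Python sketch of the same result:

```python
dates = ['01-May-2019', '02-May-2019']
labels = ['A', 'B', 'C']
values = [0, 1, 2]

# Outer loop over dates, inner loop over label/value pairs:
# each date is held while the full label list plays through,
# matching the repmat/reshape output above.
rows = [(d, l, v) for d in dates for l, v in zip(labels, values)]
# rows[0] == ('01-May-2019', 'A', 0)
```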

Pyspark Columnsimilarities interpretation

I was learning how to use columnSimilarities. Can someone explain to me the matrix that is generated by the algorithm?
Let's say I have this code:
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
# Convert to RowMatrix
mat = RowMatrix(rows)
# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)
# Output
exact.entries.collect()
[MatrixEntry(0, 2, 0.991935352214),
MatrixEntry(1, 2, 0.998441152599),
MatrixEntry(0, 1, 0.997463284056)]
How can I know which rows are most similar given the matrix? Does (0, 2, 0.991935352214) mean that row 0 and row 2 have a similarity of 0.991935352214? I know that 0 and 2 are i and j, the row and column indices of the matrix respectively.
thank you
how can I know which row is most similar given in the matrix?
It is columnSimilarities, not rowSimilarities, so it is just not the thing you're looking for.
You could apply it to the transposed matrix, but you really don't want to. The algorithms used here are designed for thin and optimally sparse data; they just won't scale to wide data.
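For intuition: columnSimilarities computes the cosine similarity between each pair of columns. A plain-Python sketch of the same computation on the 4×3 matrix above reproduces the MatrixEntry values:

```python
from math import sqrt

rows = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)]
cols = list(zip(*rows))  # column i of the RowMatrix

def cosine(u, v):
    # Cosine similarity: dot product over the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# MatrixEntry(0, 2, 0.991935...) is the similarity of columns 0 and 2.
cosine(cols[0], cols[2])  # ~0.991935
```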

Sum values based on first occurrence of other column using excel formula

Let's say I have the following two columns in an Excel spreadsheet
A   B
1   10
1   10
1   10
2   20
3    5
3    5
and I would like to sum the values from B-column that represents the first occurrence of the value in A-column using a formula. So I expect to get the following result:
result = B1+B4+B5 = 35
i.e., sum column B over the rows where the value in column A occurs for the first time. In my case, if Ai = Aj then Bi = Bj, where i, j are row positions; that is, if two rows in column A have the same value, then their corresponding values in column B are also the same. I can have the rows sorted by the column A values, but I would prefer a formula that works regardless of sorting.
I found this post that refers to the same problem, but I am not able to understand the proposed solution.
Use SUMPRODUCT and COUNTIF:
=SUMPRODUCT(B1:B6/COUNTIF(A1:A6,A1:A6))
Here the step by step explanation:
COUNTIF(A1:A6, A1:A6) will produce an array with the frequency of the values: A1:A6. In our case it will be: {3, 3, 3, 1, 2, 2}
Then we have to do the following division: {10, 10, 10, 20, 5, 5}/{3, 3, 3, 1, 2, 2}. The result will be: {3.33, 3.33, 3.33, 20, 2.5, 2.5}. It replaces each value by the average of its group.
Summing the result we will get: (3.33+3.33+3.33) + 20 + (2.5+2.5) = 10 + 20 + 5 = 35.
Using the above trick we get the same result as if we had summed only the first element of each group in column B.
To make this dynamic, so it grows and shrinks with the data set use this:
=SUMPRODUCT($B$1:INDEX(B:B,MATCH(1E+99,B:B))/COUNTIF($A$1:INDEX(A:A,MATCH(1E+99,B:B)),$A$1:INDEX(A:A,MATCH(1E+99,B:B))))
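The COUNTIF trick is easy to verify in Python: each B value is divided by the size of its A group, then everything is summed.

```python
from collections import Counter

a = [1, 1, 1, 2, 3, 3]
b = [10, 10, 10, 20, 5, 5]

freq = Counter(a)  # plays the role of COUNTIF(A1:A6, A1:A6)
total = sum(bi / freq[ai] for ai, bi in zip(a, b))
# total is 35.0 (up to floating-point rounding)
```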
... or just SUMPRODUCT.
=SUMPRODUCT(B2:B7, --(A2:A7<>A1:A6))

How to return the file number from bag of words

I am working with CountVectorizer from sklearn, and I want to know how to access or extract the file number. Here is what I tried.
From the output, e.g. (1, 12) 1,
I want only the 1, which represents the file number.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer=CountVectorizer()
string1="these is my first statment in vectorizer"
string2="hello every one i like the place here"
string3="i am going to school every day day like the student in my school"
email_list=[string1,string2,string3]
bagofword=vectorizer.fit(email_list)
bagofword=vectorizer.transform(email_list)
print(bagofword)
output:
(0, 3) 1
(0, 7) 1
(0, 8) 1
(0, 10) 1
(0, 14) 1
(1, 12) 1
(1, 16) 1
(2, 0) 1
(2, 1) 2
You could iterate over the columns of the sparse array with,
features_map = [col.indices.tolist() for col in bagofword.T]
and to get a list of all documents that contain the feature k, simply take the element k of this list.
For instance, features_map[2] == [1, 2] means that feature number 2 is present in documents 1 and 2.
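If you would rather not touch the sparse internals, the same mapping can be built from the (document, feature) pairs shown in the printed output; a pure-Python sketch (the pairs below are copied from the output above):

```python
from collections import defaultdict

# (document, feature) pairs as printed by print(bagofword)
entries = [(0, 3), (0, 7), (0, 8), (0, 10), (0, 14),
           (1, 12), (1, 16), (2, 0), (2, 1)]

features_map = defaultdict(list)
for doc, feature in entries:
    features_map[feature].append(doc)

features_map[12]  # -> [1]: feature 12 appears only in document 1
```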
