PySpark columnSimilarities interpretation - apache-spark

I was learning how to use columnSimilarities. Can someone explain to me the matrix that is generated by the algorithm?
Let's say in this code:
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
# Convert to RowMatrix
mat = RowMatrix(rows)
# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)
# Output
exact.entries.collect()
[MatrixEntry(0, 2, 0.991935352214),
MatrixEntry(1, 2, 0.998441152599),
MatrixEntry(0, 1, 0.997463284056)]
How can I know which rows are most similar, given the matrix? For example, does (0, 2, 0.991935352214) mean that row 0 and row 2 have a similarity of 0.991935352214? I know that 0 and 2 are i and j, the row and column indices of the matrix.
thank you

How can I know which rows are most similar, given the matrix?
It is columnSimilarities, not rowSimilarities, so it is simply not the thing you're looking for.
You could apply it to the transposed matrix, but you really don't want to: the algorithms used here are designed for tall, thin, and ideally sparse data. They just won't scale to a wide one.
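For interpreting the entries themselves: each MatrixEntry(i, j, value) is the cosine similarity between column i and column j of the input matrix (only the upper triangle is stored). A minimal sketch, reusing exact from the question above, to pull out the most similar pair of columns:
# Each MatrixEntry(i, j, value) holds the cosine similarity between
# columns i and j of the input matrix (upper triangle only).
best = exact.entries.max(key=lambda entry: entry.value)
print(best.i, best.j, best.value)
# e.g. 1 2 0.998... -> columns 1 and 2 are the most similar pair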

Related

Averaging n elements along 1st axis of 4D array with numpy

I have a 4D array containing daily time-series of gridded data for different years with shape (year, day, x-coordinate, y-coordinate). The actual shape of my array is (19, 133, 288, 620), so I have 19 years of data with 133 days per year over a 288 x 620 grid. I want to take the weekly average of each grid cell over the period of record. The shape of the weekly averaged array should be (19, 19, 288, 620), or (year, week, x-coordinate, y-coordinate). I would like to use numpy to achieve this.
Here I construct some dummy data to work with and an array of what the solution should be:
import numpy as np
a1 = np.arange(1, 10).reshape(3, 3)
a1days = np.repeat(a1[np.newaxis, ...], 7, axis=0)
b1 = np.arange(10, 19).reshape(3, 3)
b1days = np.repeat(b1[np.newaxis, ...], 7, axis=0)
c1year = np.concatenate((a1days, b1days), axis=0)
a2 = np.arange(19, 28).reshape(3, 3)
a2days = np.repeat(a2[np.newaxis, ...], 7, axis=0)
b2 = np.arange(29, 38).reshape(3, 3)
b2days = np.repeat(b2[np.newaxis, ...], 7, axis=0)
c2year = np.concatenate((a2days, b2days), axis=0)
dummy_data = np.concatenate((c1year, c2year), axis=0).reshape(2, 14, 3, 3)
solution = np.concatenate((a1, b1, a2, b2), axis=0).reshape(2, 2, 3, 3)
The shape of the dummy_data is (2, 14, 3, 3). Per the dummy data, I have two years of data, 14 days per year, over a 3 X 3 grid. I want to return the weekly average of the grid for both years, resulting in a solution with shape (2, 2, 3, 3).
You can reshape and take the mean over the new week axis:
week_mean = dummy_data.reshape(2,-1,7,3,3).mean(axis=2)
# in your case .reshape(year, -1, 7, x_coord, y_coord)
# check:
(dummy_data.reshape(2,2,7,3,3).mean(axis=2) == solution).all()
# True
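The same reshape carries over to the real array in the question; a small sketch, assuming 133 days really is 19 weeks of 7 days (a downscaled 4 x 5 grid stands in for the 288 x 620 grid to keep the example light):
import numpy as np
# (year, day, x, y) daily data; -1 in the reshape resolves to 133 / 7 = 19 weeks
daily = np.random.rand(19, 133, 4, 5)
weekly = daily.reshape(19, -1, 7, 4, 5).mean(axis=2)
print(weekly.shape)
# (19, 19, 4, 5) -- (19, 19, 288, 620) with the full-size grid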

np.percentile not equal to quartiles

I'm trying to calculate the quartiles for an array of values in python using numpy.
X = [1, 1, 1, 3, 4, 5, 5, 7, 8, 9, 10, 1000]
I would do the following:
quartiles = np.percentile(X, range(0, 100, 25))
quartiles
# array([1. , 2.5 , 5. , 8.25])
But this is incorrect, as the 1st and 3rd quartiles should be 2 and 8.5, respectively.
This can be shown as the following:
Q1 = np.median(X[:len(X)//2])
Q3 = np.median(X[len(X)//2:])
Q1, Q3
# (2.0, 8.5)
I can't get my head round what np.percentile is doing to give a different answer. Any light shed on this would be greatly appreciated.
There is no right or wrong here, just different ways of calculating percentiles. The percentile is a well-defined concept in the continuous case, but less so for discrete samples: different methods make no difference for a very large number of observations (compared to the number of duplicates), but they can matter for small samples, and you need to figure out what makes more sense case by case.
To obtain your desired output, you should specify interpolation='midpoint' in the percentile function:
quartiles = np.percentile(X, range(0, 100, 25), interpolation='midpoint')
quartiles
# array([1. , 2. , 5. , 8.5])
I'd suggest you have a look at the docs: http://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html
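As a short sketch for the same X, here is how the different interpolation schemes disagree on such a small sample (note that newer NumPy releases rename the keyword from interpolation= to method=):
import numpy as np
X = [1, 1, 1, 3, 4, 5, 5, 7, 8, 9, 10, 1000]
# Only 'midpoint' reproduces the split-halves quartiles (2.0 and 8.5) computed by hand above.
for scheme in ("linear", "lower", "higher", "nearest", "midpoint"):
    print(scheme, np.percentile(X, [25, 50, 75], interpolation=scheme))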

How to return the file number from bag of words

I am working with CountVectorizer from sklearn, and I want to know how to access or extract the file (document) number. For example, from the output line (1, 12) 1 I want only the 1, which represents the file number. Here is what I tried:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer=CountVectorizer()
string1="these is my first statment in vectorizer"
string2="hello every one i like the place here"
string3="i am going to school every day day like the student in my school"
email_list=[string1,string2,string3]
bagofword=vectorizer.fit(email_list)
bagofword=vectorizer.transform(email_list)
print(bagofword)
output:
(0, 3) 1
(0, 7) 1
(0, 8) 1
(0, 10) 1
(0, 14) 1
(1, 12) 1
(1, 16) 1
(2, 0) 1
(2, 1) 2
You could iterate over the columns of the sparse matrix with:
features_map = [col.indices.tolist() for col in bagofword.T.tocsr()]
To get a list of all documents that contain feature k, simply take element k of this list.
For instance, features_map[2] == [1, 2] means that feature number 2 is present in documents 1 and 2.
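An equivalent sketch working on the sparse matrix directly: the row indices of the non-zero entries in column k are exactly the documents containing feature k (the word "place" below is just an example; vectorizer.vocabulary_ maps each word to its feature index):
# Look up the feature index for an example word, then find the documents
# (rows) in which that column has a non-zero count.
k = vectorizer.vocabulary_["place"]
docs_with_feature = bagofword[:, k].nonzero()[0]
print(docs_with_feature)
# [1] -> only string2 contains "place"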

Removing bi-directional unique rows from text file

I have a text file as follows:
1 3
2 5
3 6
4 5
5 4
6 1
7 2
The above file represents the edges of an undirected graph. I want to remove the duplicate edges. In the example above I want to remove either 4,5 or 5,4, as they represent the same edge in the graph and hence cause duplication. I am trying to visualize the graph from the file with GraphStream, using the GraphX library in Apache Spark, but due to the presence of such duplicate edges it gives the following error:
org.graphstream.graph.EdgeRejectedException: Edge 4[5--4] was rejected by node 5
What would be the best way to remove such duplicates from the text file?
You can use the convertToCanonicalEdges method from GraphOps. It converts bi-directional edges into uni-directional ones: it rewrites the vertex ids of the edges so that srcIds are smaller than dstIds, and merges the duplicated edges.
In your case:
val graph = Graph.fromEdgeTuples(sc.parallelize(
Seq((1, 3), (2, 5), (3, 6), (4, 5), (5, 4), (6, 1), (7, 2))), -1)
graph.convertToCanonicalEdges().edges.collect.foreach(println)
with result:
Edge(3,6,1)
Edge(1,6,1)
Edge(1,3,1)
Edge(2,5,1)
Edge(2,7,1)
Edge(4,5,1)
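If you only need to clean the edge list itself before handing it to GraphX, the same canonicalization can be done directly on the text file; a minimal PySpark sketch, assuming the file is named edges.txt:
# Put the smaller vertex id first so both directions map to the same tuple,
# then drop exact duplicates.
edges = (sc.textFile("edges.txt")
           .map(lambda line: tuple(map(int, line.split())))
           .map(lambda e: (min(e), max(e)))
           .distinct())
edges.collect()
# e.g. [(1, 3), (2, 5), (3, 6), (4, 5), (1, 6), (2, 7)]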

Seeking efficient way to categorize values in Spark

I'm looking to improve the speed of my current approach of going through an RDD and categorizing the row values. For example, let's say I have four columns and five rows, and want to count the number of values in each category per column. My final result would look like:
Column Name   Category 1   Category 2   Category 3
Col 1         2            3            0
Col 2         0            4            1
Col 3         0            0            5
Col 4         2            2            1
I've been testing two approaches:
Approach 1
Map each row to a list of count tuples. Before, a row could look like ['val1', 2.0, '0001231', True]; afterwards it would look like [(1, 0, 0), (0, 1, 0), (0, 1, 0), (0, 0, 1)].
Reduce over all rows by adding the tuples together element-wise.
my_rdd.map("function to categorize").reduce("add tuples")
Approach 2
Flat map each value to its own row as a key-value pair. Similar to the first approach, but the result would look like ("col1", (1, 0, 0)), ("col2", (0, 1, 0)), ("col3", (0, 1, 0)), ("col4", (0, 0, 1)), where each tuple becomes a new row.
Reduce by key
my_rdd.flatMap("function to categorize and create row for each column value").reduceByKey("add tuples for each column")
Is there a more efficient way to do this for a generic RDD? I'm also looking to count distinct values while I'm doing the mapping, but realize that would require a separate discussion.
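To make the comparison concrete, here is a minimal sketch of Approach 2; the categorize helper is hypothetical and mirrors the example mapping above (booleans -> category 3, numbers and numeric-looking strings -> category 2, everything else -> category 1):
def categorize(value):
    # Check booleans first, since isinstance(True, int) is also True in Python.
    if isinstance(value, bool):
        return (0, 0, 1)
    if isinstance(value, (int, float)) or str(value).isdigit():
        return (0, 1, 0)
    return (1, 0, 0)

def add_tuples(a, b):
    # Element-wise sum of two count tuples.
    return tuple(x + y for x, y in zip(a, b))

rdd = sc.parallelize([["val1", 2.0, "0001231", True],
                      ["val2", 3.0, "0001232", False]])

counts = (rdd
          .flatMap(lambda row: [("col%d" % i, categorize(v))
                                for i, v in enumerate(row, start=1)])
          .reduceByKey(add_tuples))
counts.collect()
# e.g. [('col1', (2, 0, 0)), ('col2', (0, 2, 0)), ('col3', (0, 2, 0)), ('col4', (0, 0, 2))]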
