Removing bi-directional unique rows from text file - apache-spark

I have a text file as follows:
1 3
2 5
3 6
4 5
5 4
6 1
7 2
The above file represents the edges of an undirected graph. I want to remove the duplicate edges: in the example above, either 4 5 or 5 4 should be removed, since both represent the same edge and therefore cause duplication. I am trying to visualize the graph from the file with GraphStream, using the GraphX library in Apache Spark, but the duplicate edges described above cause the following error:
org.graphstream.graph.EdgeRejectedException: Edge 4[5--4] was rejected by node 5
What would be the best way to remove such duplicates from the text file?

You can use the convertToCanonicalEdges method from GraphOps. It converts bi-directional edges into uni-directional ones: it rewrites the vertex ids of edges so that srcIds are smaller than dstIds, and merges the duplicated edges.
In your case:
val graph = Graph.fromEdgeTuples(sc.parallelize(
Seq((1, 3), (2, 5), (3, 6), (4, 5), (5, 4), (6, 1), (7, 2))), -1)
graph.convertToCanonicalEdges().edges.collect.foreach(println)
with result:
Edge(3,6,1)
Edge(1,6,1)
Edge(1,3,1)
Edge(2,5,1)
Edge(2,7,1)
Edge(4,5,1)

Related

How can I fix statsmodels.AnovaRM claiming "Data is unbalanced" although it isn't?

I am trying to perform a three-way repeated measures ANOVA with statsmodels.AnovaRM, but I already run into a problem with a two-way ANOVA: when running
aov = AnovaRM(anova_df, depvar='Test', subject='Subject',
within=["Factor1", "Factor2"], aggregate_func='mean').fit()
print(aov)
it returns "Data is unbalanced.". Let's look at the factors I extracted from the DataFrame that I fed into it:
Factor1, level 0, shape: (68, 6)
Factor1, level 1, shape: (68, 6)
Factor1, level 2, shape: (68, 6)
Factor2, level a, shape: (68, 6)
Factor2, level b, shape: (68, 6)
Factor2, level c, shape: (68, 6)
Because this is a test, I even aligned the Factors with each other.
   Test  Factor1  Factor2
0  32.6        0        a
1  39.3        1        b
2  43.0        2        c
3  32.0        0        a
4  32.8        1        b
5  38.3        2        c
6  36.7        0        a
7  40.4        1        b
8  41.9        2        c
How is that not being balanced? What am I doing wrong, how can I fix this?
I ran into the same issue. A dataset that AnovaRM runs on without complaint is used in this tutorial: https://pythontutorials.eu/numerical/statistics-solution-1/
I also used your method of checking the shapes by iterating through all the levels of all the variables. The output likewise showed that everything has the same shape, and the dataset from the link above has this property too.
It turned out that having the same shape is not enough. For the variable you use as subject, if you run something like df[subject_name].value_counts() on your input df, every unique subject has to have the same count. If the counts differ, AnovaRM will give you the unbalanced-data error.
I used this check on my df and it showed that some subjects have fewer values than others, while in the example df from the link above every subject has the same number of values. Furthermore, when I manually subset my df to keep only the subjects that have the same number of values/measurements, AnovaRM worked for me. Have a try and let me know whether this helps you understand what "unbalanced" really means.
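A minimal sketch of that check (assuming the DataFrame and column names from the question, i.e. anova_df, 'Subject', 'Test', 'Factor1' and 'Factor2'):
from statsmodels.stats.anova import AnovaRM

# Count how many measurements each subject contributes.
counts = anova_df['Subject'].value_counts()
print(counts)

# Balanced data means every subject appears the same number of times.
# Keep only the subjects with a complete set of measurements and retry.
complete = counts[counts == counts.max()].index
balanced_df = anova_df[anova_df['Subject'].isin(complete)]

aov = AnovaRM(balanced_df, depvar='Test', subject='Subject',
              within=['Factor1', 'Factor2'], aggregate_func='mean').fit()
print(aov)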
Factor1 and Factor2 are perfectly confounded: factor1 = factor2 in every block. Try using a single factor like "Treatment" and drop Factor1 and Factor2:
Treatment  When
X          F1 = 0 and F2 = a
Y          F1 = 1 and F2 = b
Z          F1 = 2 and F2 = c
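One possible way to build such a combined column (just a sketch, assuming the anova_df and column names from the question; the X/Y/Z labels end up encoded in the concatenated string):
from statsmodels.stats.anova import AnovaRM

# Collapse the two perfectly confounded factors into a single within-subject factor.
anova_df['Treatment'] = anova_df['Factor1'].astype(str) + '_' + anova_df['Factor2'].astype(str)

aov = AnovaRM(anova_df, depvar='Test', subject='Subject',
              within=['Treatment'], aggregate_func='mean').fit()
print(aov)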

Pyspark Columnsimilarities interpretation

I was learning how to use columnSimilarities. Can someone explain the matrix that is generated by the algorithm?
Let's say in this code:
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
# Convert to RowMatrix
mat = RowMatrix(rows)
# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)
# Output
exact.entries.collect()
[MatrixEntry(0, 2, 0.991935352214),
MatrixEntry(1, 2, 0.998441152599),
MatrixEntry(0, 1, 0.997463284056)]
How can I know which row is most similar given the matrix? Does (0, 2, 0.991935352214) mean that row 0 and row 2 have a similarity of 0.991935352214? I know that 0 and 2 are i and j, the row and column of the matrix respectively.
Thank you
How can I know which row is most similar given the matrix?
It is columnSimilarities, not rowSimilarities, so it is simply not the thing you're looking for: each entry (i, j, value) is the cosine similarity between columns i and j of the input matrix, not between rows.
You could apply it to the transposed matrix, but you really don't want to. The algorithm used here is designed for tall and optimally sparse data; it just won't scale for wide data.
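As a sketch of how to read the output (continuing from the exact CoordinateMatrix in the question): the most similar pair of columns is simply the entry with the largest value.
# Each entry is (column i, column j, cosine similarity between those two columns).
entries = exact.entries.collect()

# Pick the pair of columns with the highest cosine similarity.
best = max(entries, key=lambda e: e.value)
print(best.i, best.j, best.value)   # here: columns 1 and 2, similarity ~0.998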

How to return the file number from bag of words

I am working with CountVectorizer from sklearn, and I want to know how I can access or extract the file number. For example, from the output line (1, 12) 1 I want only the 1, which represents the file number. This is what I tried:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
string1 = "these is my first statment in vectorizer"
string2 = "hello every one i like the place here"
string3 = "i am going to school every day day like the student in my school"
email_list = [string1, string2, string3]
bagofword = vectorizer.fit(email_list)
bagofword = vectorizer.transform(email_list)
print(bagofword)
output:
(0, 3) 1
(0, 7) 1
(0, 8) 1
(0, 10) 1
(0, 14) 1
(1, 12) 1
(1, 16) 1
(2, 0) 1
(2, 1) 2
You could iterate over the columns of the sparse array with
features_map = [col.indices.tolist() for col in bagofword.T]
and, to get a list of all documents that contain feature k, simply take element k of this list.
For instance, features_map[2] == [1, 2] means that feature number 2 is present in documents 1 and 2.
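A sketch of the same idea using the (document, feature) coordinates directly (continuing from the bagofword matrix built above):
# nonzero() returns two parallel arrays: document indices and feature indices.
doc_idx, feat_idx = bagofword.nonzero()

# Documents that contain feature 12 (the column from the "(1, 12) 1" line of the output).
print(doc_idx[feat_idx == 12])   # array([1]) -> document number 1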

gnuplot plot data from data sets

The following data set is generated by a program:
1 **1 0.11111**
1 **2 0.22222**
1 **3 0.33333**
1 **4 0.44444**
2 1 0.00185
2 2 0.00005
2 3 0.12355
2 4 0.68124
3 1 0.54875
3 2 0.62155
3 3 0.35895
3 4 0.41588
My question: how do I plot the first 4 rows (bold) in a 2-dimensional figure? I.e. the following points should be plotted:
(1, 0.11111)
(2, 0.22222)
(3, 0.33333)
(4, 0.44444)
I know I can use "index" directive to plot multiple data sets, if so, double blank lines must come up from the file (in order to distinguish data sets). But I don't want any blank lines to come up. thanks
You can use every ::0::3 to plot only up to the 4th row (see Gnuplot plotting data from a file up to some row), and using 2:3 to plot the 2nd and 3rd columns.
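Putting that together, a minimal gnuplot sketch (assuming the data is in a file called data.dat):
plot "data.dat" every ::0::3 using 2:3 with points
# every ::0::3 keeps only data lines 0 through 3 (the first four rows);
# using 2:3 takes the 2nd column as x and the 3rd column as y.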

Seeking efficient way to categorize values in Spark

I'm looking to improve (speed up) my current approach for going through an RDD and quickly categorizing the row values. For example, let's say I have four columns and five rows, and want to count the number of values in each category per column. My final result would look like:
Column Name  Category 1  Category 2  Category 3
Col 1                 2           3           0
Col 2                 0           4           1
Col 3                 0           0           5
Col 4                 2           2           1
I've been testing two approaches:
Approach 1
Map each row to a list of per-column count tuples. Before, a row could look like ['val1', 2.0, '0001231', True]; afterwards it would look like [(1, 0, 0), (0, 1, 0), (0, 1, 0), (0, 0, 1)] (a rough sketch follows below).
Reduce each row by adding the tuples together
my_rdd.map("function to categorize").reduce("add tuples")
Approach 2
Flat map each value to its own row as a key-value pair. Similar to the first approach, but the result would look like ("col1", (1, 0, 0)), ("col2", (0, 1, 0)), ("col3", (0, 1, 0)), ("col4", (0, 0, 1)), where each tuple becomes a new row (again, a sketch follows below).
Reduce by key
my_rdd.flatMap("function to categorize and create row for each column value").reduceByKey("add tuples for each column")
Is there a more efficient way to do this for a generic RDD? I'm also looking to count distinct values while I'm doing the mapping, but realize that would require a separate discussion.
