Python newbie question:
I have a list like this:
list1 = [[1.0, 0.0, 0.0],[0.0, 0.0, 1.0],[0.0, 0.0, 1.0],[0.0, 1.0, 0.0],[0.0, 0.0, 1.0]]
I want to convert it to another list like this (each integer is the index where 1.0 is located):
list2 = [0,2,2,1,2]
Tried doing this:
d = {[1.0, 0.0, 0.0]:0,
[0.0, 1.0, 0.0]:1,
[0.0, 0.0, 1.0]:2}
list2 = map(d.get, list1)
But it did not work.
Keys of Python dicts must be hashable, and lists are not because they are mutable. The quickest way to fix your existing code is to use tuples as keys instead:
list1 = [[1.0, 0.0, 0.0],[0.0, 0.0, 1.0],[0.0, 0.0, 1.0],[0.0, 1.0, 0.0],[0.0, 0.0, 1.0]]
d = {(1.0, 0.0, 0.0):0,
(0.0, 1.0, 0.0):1,
(0.0, 0.0, 1.0):2}
list2 = list(map(lambda x: d.get(tuple(x)), list1))
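To make the reason for the failure concrete, here is a minimal sketch on a shortened copy of the data. The variable name list_key_error is only illustrative, and the exact wording of the exception message can differ between Python versions, so it is printed rather than quoted.

```python
list1 = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

# A list is unhashable, so building a dict with a list key raises TypeError.
try:
    {[1.0, 0.0, 0.0]: 0}
    list_key_error = None
except TypeError as exc:
    list_key_error = str(exc)
print(list_key_error)

# Tuples are hashable, so they work as keys; convert each row before the lookup.
d = {(1.0, 0.0, 0.0): 0, (0.0, 1.0, 0.0): 1, (0.0, 0.0, 1.0): 2}
list2 = [d.get(tuple(row)) for row in list1]
print(list2)  # [0, 2, 1]
```

Note that d.get returns None for any row that is not one of the three keys, so unexpected rows show up as None instead of raising.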
Here is code that produces the list you want:
list1 = [[1.0, 0.0, 0.0],[0.0, 0.0, 1.0],[0.0, 0.0, 1.0],[0.0, 1.0, 0.0],[0.0, 0.0, 1.0]]
value_you_want = 1.0
list2 = []
for temp1 in range(len(list1)):
    # index() returns the position of the first match in each inner list
    list2.append(list1[temp1].index(value_you_want))
print(list2)
list1 = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0],
         [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# rows that contain no 1.0 are skipped
list2 = [i.index(1.0) for i in list1 if 1.0 in i]
print(list2)
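One thing to be aware of with the filtered comprehension: a row without any 1.0 is dropped, so the result can end up shorter than the input. The sketch below contrasts that with a variant that keeps one entry per row. The helper name index_of and the all-zero row are made up for illustration.

```python
def index_of(row, value=1.0):
    """Return the first index of value in row, or None if it is absent."""
    return row.index(value) if value in row else None

rows = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

filtered = [r.index(1.0) for r in rows if 1.0 in r]  # drops the all-zero row
aligned = [index_of(r) for r in rows]                # keeps one entry per row

print(filtered)  # [0, 2]
print(aligned)   # [0, None, 2]
```

Which behavior is right depends on whether positions in list2 must line up with positions in list1.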
I have a unique situation where I need to go from a DenseVector to a Sparse Vector Column.
I am trying to implement the SMOTE technique I found here: https://github.com/Angkirat/Smote-for-Spark/blob/master/PythonCode.py
But on line 44 I had to change it from min_Array[neigh][0] - min_Array[i][0] to DenseVector(min_Array[neigh][0]) - DenseVector(min_Array[i][0]) due to an error.
Once I have the DenseVector column, I need to convert it back to a SparseVector column to union my data.
I have tried the following:
df = sc.parallelize([
(1, DenseVector([0.0, 1.0, 1.0, 2.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0])),
(2, DenseVector([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0])),
(3, DenseVector([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])),
]).toDF(["row_num", "features"])
list_to_vector_udf = udf(lambda l: Vectors.sparse(l), VectorUDT())
df = df.withColumn('features', list_to_vector_udf(df["features"]))
"int() argument must be a string, a bytes-like object or a number, not 'DenseVector'"
assembler = VectorAssembler(inputCols=['features'],outputCol='features')
df = assembler.transform(df)
"Data type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> of column features is not supported."
It usually doesn't make much sense to convert a dense vector to a sparse vector, since the dense vector has already taken up the memory. If you really need to do this, look at the sparse vector API: it accepts either a list of (index, value) pairs, or the nonzero indices and values passed directly to the constructor. Something like the following:
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.ml.linalg import DenseVector
from pyspark.sql.functions import udf
df = sc.parallelize([
(1, DenseVector([0.0, 1.0, 1.0, 2.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0])),
(2, DenseVector([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0])),
(3, DenseVector([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])),
]).toDF(["row_num", "features"])
def to_sparse(dense_vector):
    size = len(dense_vector)
    # keep only the nonzero entries as (index, value) pairs
    pairs = [(i, v) for i, v in enumerate(dense_vector.values.tolist()) if v != 0]
    return Vectors.sparse(size, pairs)
dense_to_sparse_udf = udf(to_sparse, VectorUDT())
df = df.withColumn('features', dense_to_sparse_udf(df["features"]))
df.show()
+-------+--------------------+
|row_num|            features|
+-------+--------------------+
|      1|(10,[1,2,3,4,5],[...|
|      2|    (10,[9],[100.0])|
|      3|      (10,[1],[1.0])|
+-------+--------------------+
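The index/value extraction inside to_sparse can be checked without a Spark session. This is a plain-Python sketch of the same logic on an ordinary list. The function name dense_to_sparse_parts is hypothetical, and the returned triple simply mirrors the (size, [indices], [values]) form that df.show() prints.

```python
def dense_to_sparse_parts(values):
    """Split a dense list into (size, nonzero indices, nonzero values)."""
    size = len(values)
    pairs = [(i, v) for i, v in enumerate(values) if v != 0]
    indices = [i for i, _ in pairs]
    nonzero = [v for _, v in pairs]
    return size, indices, nonzero

row1 = [0.0, 1.0, 1.0, 2.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0]
row2 = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0]

parts1 = dense_to_sparse_parts(row1)
parts2 = dense_to_sparse_parts(row2)
print(parts1)  # (10, [1, 2, 3, 4, 5], [1.0, 1.0, 2.0, 1.0, 3.0])
print(parts2)  # (10, [9], [100.0])
```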
When I try to fit it with KMeans clustering, it throws the error "ValueError: setting an array element with a sequence."
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5)
kmeans.fit(df)
Array description:
Name: Vector, Length: 179, dtype: object
0 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
1 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
10 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
100 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
101 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
Your column has a list in each cell. It needs to be expanded into multiple columns (one per vector element) before passing it to KMeans.
import pandas as pd
df = pd.read_json('/Users/roshansk/Downloads/NewsArticles.json')
# Expand each list in the Vector column into its own set of columns
vectors = df.Vector.apply(pd.Series)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5)
kmeans.fit(vectors)
I ran into a situation where collect_list in Spark is inefficient when each item is already a list.
Basically, I am trying to calculate the element-wise mean of a nested list (the size of each inner list is guaranteed to be the same). When the data set grows to, for example, 10 M rows, it may produce out-of-memory errors. Originally, I thought it had something to do with the udf (which calculates the mean). But I found that the aggregation part (collect_list of lists) is the real problem.
What I am doing now is to divide the 10 M rows into multiple blocks (by 'user'), aggregate each block individually, and then union them at the end. Any better suggestion on efficiently dealing with nested data?
For example, the toy example is like this:
data = [('user1','place1', ['place1', 'place2', 'place3'], [0.0, 0.5, 0.4], [0.0, 0.4, 0.3]),
('user1','place2', ['place1', 'place2', 'place3'], [0.7, 0.0, 0.4], [0.6, 0.0, 0.3]),
('user2','place1', ['place1', 'place2', 'place3'], [0.0, 0.4, 0.3], [0.0, 0.3, 0.4]),
('user2','place3', ['place1', 'place2', 'place3'], [0.1, 0.2, 0.0], [0.3, 0.1, 0.0]),
('user3','place2', ['place1', 'place2', 'place3'], [0.3, 0.0, 0.4], [0.2, 0.0, 0.4]),
]
data_df = sparkApp.sparkSession.createDataFrame(data, ['user', 'place', 'places', 'data1', 'data2'])
from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, DoubleType
data_agg = data_df.groupBy('user') \
    .agg(f.collect_list('place').alias('place_list'),
         f.first('places').alias('places'),
         f.collect_list('data1').alias('data1'),
         f.collect_list('data2').alias('data2'),
         )
import numpy as np
def average_values(sim_vectors):
    # a single vector is its own mean
    if len(sim_vectors) == 1:
        return sim_vectors[0]
    mat = np.array(sim_vectors)
    # element-wise mean across the collected vectors
    mean_vector = np.mean(mat, axis=0)
    return np.round(mean_vector, 3).tolist()
avg_vectors_udf = f.udf(average_values, ArrayType(DoubleType()))
data_agg_ave = data_agg.withColumn('data1', avg_vectors_udf('data1')) \
    .withColumn('data2', avg_vectors_udf('data2'))
The result would be:
+-----+----------------+--------------------+-----------------+----------------+
| user|      place_list|              places|            data1|           data2|
+-----+----------------+--------------------+-----------------+----------------+
|user1|[place1, place2]|[place1, place2, ...|[0.35, 0.25, 0.4]| [0.3, 0.2, 0.3]|
|user3|        [place2]|[place1, place2, ...|  [0.3, 0.0, 0.4]| [0.2, 0.0, 0.4]|
|user2|[place1, place3]|[place1, place2, ...|[0.05, 0.3, 0.15]|[0.15, 0.2, 0.2]|
+-----+----------------+--------------------+-----------------+----------------+
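One idea that might avoid materializing every list per user is to keep only a running element-wise sum and a count per key, then divide at the end. The sketch below shows that idea in plain Python on the data1 column of the toy example. It is not from the original post, the function name mean_vectors_by_key is hypothetical, and it has not been measured on Spark. In Spark, a similar effect could perhaps be obtained with f.posexplode followed by a groupBy on user and position with f.avg, but that mapping is an untested assumption.

```python
from collections import defaultdict

def mean_vectors_by_key(rows):
    """rows: iterable of (key, vector). Returns {key: element-wise mean}."""
    sums = {}
    counts = defaultdict(int)
    for key, vec in rows:
        if key not in sums:
            sums[key] = list(vec)  # first vector seen for this key
        else:
            # add element-wise; only one vector per key is ever kept in memory
            sums[key] = [s + v for s, v in zip(sums[key], vec)]
        counts[key] += 1
    return {k: [round(s / counts[k], 3) for s in total]
            for k, total in sums.items()}

rows = [
    ("user1", [0.0, 0.5, 0.4]),
    ("user1", [0.7, 0.0, 0.4]),
    ("user2", [0.0, 0.4, 0.3]),
    ("user2", [0.1, 0.2, 0.0]),
    ("user3", [0.3, 0.0, 0.4]),
]
means = mean_vectors_by_key(rows)
print(means["user1"])  # [0.35, 0.25, 0.4]
print(means["user2"])  # [0.05, 0.3, 0.15]
```

The memory per key is then proportional to the vector length rather than to the number of rows for that key.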
I cannot find out the mistake I made, could anyone help me? Thanks very much!
import math
def GASSEM():
    a0 = [12,-2,1,0,0,0,0,0,0,0,13.97]
    a1 = [-2,12,-2,1,0,0,0,0,0,0,5.93]
    a2 = [1,-2,12,-2,1,0,0,0,0,0,-6.02]
    a3 = [0,1,-2,12,-2,1,0,0,0,0,8.32]
    a4 = [0,0,1,-2,12,-2,1,0,0,0,-23.75]
    a5 = [0,0,0,1,-2,12,-2,1,0,0,28.45]
    a6 = [0,0,0,0,1,-2,12,-2,1,0,-8.9]
    a7 = [0,0,0,0,0,1,-2,12,-2,1,-10.5]
    a8 = [0,0,0,0,0,0,1,-2,12,-2,10.34]
    a9 = [0,0,0,0,0,0,0,1,-2,12,-38.74]
    A = [a0,a1,a2,a3,a4,a5,a6,a7,a8,a9] # 10x11 matrix
    interchange=[0,0,0,0,0,0,0,0,0,0,0]
    for i in range (1,10):
        median = abs(A[i-1][i-1])
        for m in range (i,10): #pivoting
            if abs(A[m][i-1]) > median:
                median = abs(A[m][i-1])
                interchange = A[i-1]
                A[i-1] = A[m]
                A[m] = interchange
        for j in range(i,10): #creating upper triangle matrix
            A[j] = [A[j][k]-(A[j][i-1]/A[i-1][i-1])*A[i-1][k] for k in range(0,11)]
    for t in range (0,10): #print the upper triangle matrix
        print(A[t])
The output is not an upper triangle matrix, I'm getting lost in the for loops...
When I run this code, the output is
[12, -2, 1, 0, 0, 0, 0, 0, 0, 0, 13.97]
[0.0, 11.666666666666666, -1.8333333333333333, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.258333333333333]
[0.0, 0.0, 11.628571428571428, -1.842857142857143, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, -5.886428571428571]
[0.0, 0.0, -2.220446049250313e-16, 11.622235872235873, -1.8415233415233416, 1.0, 0.0, 0.0, 0.0, 0.0, 6.679281326781327]
[0.0, 0.0, -3.518258683818212e-17, 0.0, 11.622218698800275, -1.8415517150256329, 1.0, 0.0, 0.0, 0.0, -22.185475397706252]
[0.0, 0.0, 1.3530439218911067e-17, 0.0, 0.0, 11.62216239813737, -1.841549039580908, 1.0, 0.0, 0.0, 24.359991632712457]
[0.0, 0.0, 5.171101701700419e-18, 0.0, 0.0, 0.0, 11.622161705324444, -1.84154850220678, 1.0, 0.0, -3.131238144426707]
[0.0, 0.0, -3.448243038110395e-19, 0.0, 0.0, 0.0, 0.0, 11.62216144141611, -1.8415485389982904, 1.0, -13.0921440313208]
[0.0, 0.0, -4.995725026226573e-19, 0.0, 0.0, 0.0, 0.0, 0.0, 11.622161418001749, -1.8415485322346454, 8.534950160892514]
[0.0, 0.0, -4.9488445836100553e-20, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 11.622161417603511, -36.26114362292296]
This is effectively upper triangular. The absolute values of the 'non-zero' entries in the third column of the lower triangle are all less than 10e-15. Given that the other values are 1 or greater, these small numbers look like floating-point subtraction errors in A[j][k] - (A[j][i-1]/A[i-1][i-1])*A[i-1][k] and can be considered to be 0. Without more investigation, I don't know why the non-zero values are limited to this column.
For this data, the condition abs(A[m][i-1]) > median is never true, so the code in the if block is never exercised.
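To exercise the pivoting branch, it helps to try a small system whose first pivot is zero. The following is a minimal sketch, not the original GASSEM function: the name forward_eliminate and the 3x4 test matrix are made up for illustration. It also stores an exact 0.0 in each eliminated position, which is one common way to keep rounding residue out of the lower triangle.

```python
def forward_eliminate(rows):
    """Return an upper-triangular copy of an n x (n+1) augmented matrix."""
    A = [[float(x) for x in r] for r in rows]  # work on a copy
    n = len(A)
    for col in range(n - 1):
        # Partial pivoting: pick the row with the largest |entry| in this column.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if A[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            A[r] = [x - factor * p for x, p in zip(A[r], A[col])]
            A[r][col] = 0.0  # store an exact zero instead of rounding residue
    return A

# The first pivot is 0, so a row swap is required here.
U = forward_eliminate([[0, 2, 1, 4],
                       [1, 1, 1, 3],
                       [2, 1, 3, 7]])
for row in U:
    print(row)
```

With this input both elimination steps swap rows, so the pivoting code path is actually run.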