Dense Vector Column to Sparse Vector Column - apache-spark

I have a situation where I need to go from a DenseVector column to a SparseVector column.
I am trying to implement the SMOTE technique I found here: https://github.com/Angkirat/Smote-for-Spark/blob/master/PythonCode.py
But on line 44 I had to change it from min_Array[neigh][0] - min_Array[i][0] to DenseVector(min_Array[neigh][0]) - DenseVector(min_Array[i][0]) due to an error.
Once I have the DenseVector column, I need to convert it back to a SparseVector column to union my data.
I have tried the following:
df = sc.parallelize([
    (1, DenseVector([0.0, 1.0, 1.0, 2.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0])),
    (2, DenseVector([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0])),
    (3, DenseVector([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])),
]).toDF(["row_num", "features"])

list_to_vector_udf = udf(lambda l: Vectors.sparse(l), VectorUDT())
df = df.withColumn('features', list_to_vector_udf(df["features"]))
which fails with
"int() argument must be a string, a bytes-like object or a number, not 'DenseVector'"
and
assembler = VectorAssembler(inputCols=['features'], outputCol='features')
df = assembler.transform(df)
which fails with
"Data type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> of column features is not supported."

It usually doesn't make much sense to convert a dense vector to a sparse vector, since the dense vector has already allocated the memory. If you really need to do this, look at the SparseVector API: the constructor accepts either a list of (index, value) pairs or the nonzero indices and values passed directly as two separate lists. Something like the following:
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT, DenseVector

df = sc.parallelize([
    (1, DenseVector([0.0, 1.0, 1.0, 2.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0])),
    (2, DenseVector([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0])),
    (3, DenseVector([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])),
]).toDF(["row_num", "features"])

def to_sparse(dense_vector):
    # keep only the nonzero entries as (index, value) pairs
    size = len(dense_vector)
    pairs = [(i, v) for i, v in enumerate(dense_vector.values.tolist()) if v != 0]
    return Vectors.sparse(size, pairs)

dense_to_sparse_udf = udf(to_sparse, VectorUDT())
df = df.withColumn('features', dense_to_sparse_udf(df["features"]))
df.show()
+-------+--------------------+
|row_num| features|
+-------+--------------------+
| 1|(10,[1,2,3,4,5],[...|
| 2| (10,[9],[100.0])|
| 3| (10,[1],[1.0])|
+-------+--------------------+
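The other constructor form mentioned above, passing the nonzero indices and values as two separate lists, would look roughly like this; to_sparse_iv is an illustrative name, not part of the original answer:

def to_sparse_iv(dense_vector):
    # split the nonzero entries into parallel index/value lists
    nonzero = [(i, v) for i, v in enumerate(dense_vector.values.tolist()) if v != 0]
    indices = [i for i, _ in nonzero]
    values = [v for _, v in nonzero]
    return Vectors.sparse(len(dense_vector), indices, values)

dense_to_sparse_iv_udf = udf(to_sparse_iv, VectorUDT())
df = df.withColumn('features', dense_to_sparse_iv_udf(df["features"]))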

Related

Pivoting ArrayType columns in pyspark

I have a pyspark dataframe with the following schema
+----------+-------------------+-----------------------------------+------------------+
| date| numeric_id| feature_column| city|
+----------+-------------------+-----------------------------------+------------------+
|2017-08-01| 2343434545| [0.0, 0.0, 0.0, 0...| Berlin|
|2017-08-01| 2343434545| [0.0, 0.0, 0.0, 0...| Rome|
|2017-08-01| 2343434545| [0.0, 0.0, 0.0, 0...| NewYork|
|2017-08-01| 2343434545| [0.0, 0.0, 0.0, 0...| Beijing|
|2019-12-01| 6455534545| [0.0, 0.0, 0.0, 0...| Berlin|
|2019-12-01| 6455534545| [0.0, 0.0, 0.0, 0...| Rome|
|2019-12-01| 6455534545| [0.0, 0.0, 0.0, 0...| NewYork|
|2019-12-01| 6455534545| [0.0, 0.0, 0.0, 0...| Beijing|
+----------+-------------------+-----------------------------------+------------------+
I want to pivot the dataframe so that I can have each feature_column x city as a new column, grouped by date and numeric_id. The output dataframe should look like
+----------+-------------+----------------------+--------------------+-----------------------+----------------------+
| date| numeric_id| feature_column_Berlin| feature_column_Rome| feature_column_NewYork|feature_column_Beijing|
+----------+-------------+----------------------+--------------------+-----------------------+----------------------+
|2017-08-01| 2343434545| [0.0, 0.0, 0.0, 0...|[0.0, 0.0, 0.0, 0...|[0.0, 0.0, 0.0, 0... |[0.0, 0.0, 0.0, 0... |
|2019-12-01| 6455534545| [0.0, 0.0, 0.0, 0...|[0.0, 0.0, 0.0, 0...|[0.0, 0.0, 0.0, 0... |[0.0, 0.0, 0.0, 0... |
+----------+-------------+----------------------+--------------------+-----------------------+----------------------+
This is different from the question posted on pivoting strings (Pivot String column on Pyspark Dataframe), since I am dealing with ArrayType columns.
I'm thinking it would be easier to implement in Pandas (though handling ArrayType columns will be tricky), so I'm curious how to do it using Spark SQL. Any suggestions?
// First, create sample data and load it into a dataframe.
import org.apache.spark.sql.functions._

val df = Seq(
  ("2017-08-01", "2343434545", Array("0.0", "0.0", "0.0", "0.0"), "Berlin"),
  ("2017-08-01", "2343434545", Array("0.0", "0.0", "0.0", "0.0"), "Rome"),
  ("2017-08-01", "2343434545", Array("0.0", "0.0", "0.0", "0.0"), "NewYork"),
  ("2017-08-01", "2343434545", Array("0.0", "0.0", "0.0", "0.0"), "Beijing"),
  ("2019-12-01", "6455534545", Array("0.0", "0.0", "0.0", "0.0"), "Berlin"),
  ("2019-12-01", "6455534545", Array("0.0", "0.0", "0.0", "0.0"), "Rome"),
  ("2019-12-01", "6455534545", Array("0.0", "0.0", "0.0", "0.0"), "NewYork"),
  ("2019-12-01", "6455534545", Array("0.0", "0.0", "0.0", "0.0"), "Beijing")
).toDF("date", "numeric_id", "feature_column", "city")

df.groupBy("date", "numeric_id").pivot("city")
  .agg(collect_list("feature_column"))
  .withColumnRenamed("Beijing", "feature_column_Beijing")
  .withColumnRenamed("Berlin", "feature_column_Berlin")
  .withColumnRenamed("NewYork", "feature_column_NewYork")
  .withColumnRenamed("Rome", "feature_column_Rome")
  .show()
The output has one feature_column_<city> column per city, grouped by date and numeric_id, as in the expected result above.
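Since the question is about PySpark, a rough Python equivalent (a sketch, assuming the question's dataframe df with columns date, numeric_id, feature_column, city; it uses first rather than collect_list because each (date, numeric_id, city) combination has a single row in the sample data) might look like:

from pyspark.sql import functions as F

pivoted = (
    df.groupBy("date", "numeric_id")
      .pivot("city")
      .agg(F.first("feature_column"))  # one array per (date, numeric_id, city)
)
# prefix the city columns, mirroring the renames in the Scala answer
for city in ["Beijing", "Berlin", "NewYork", "Rome"]:
    pivoted = pivoted.withColumnRenamed(city, "feature_column_" + city)
pivoted.show()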

How to write a custom loss function in LGBM?

I have a binary cross-entropy implementation in Keras and I would like to implement the same loss in LGBM as a custom objective. I understand that LGBM has a built-in 'binary' objective, but I want to implement it myself as a starting point for future enhancements.
Here is the code,
def custom_binary_loss(y_true, y_pred):
    """
    Keras version of binary cross-entropy (works like charm!)
    """
    # https://github.com/tensorflow/tensorflow/blob/v2.3.1/tensorflow/python/keras/backend.py#L4826
    y_pred = K.clip(y_pred, K.epsilon(), 1 - K.epsilon())
    term_0 = (1 - y_true) * K.log(1 - y_pred + K.epsilon())  # Cancels out when target is 1
    term_1 = y_true * K.log(y_pred + K.epsilon())  # Cancels out when target is 0
    return -K.mean(term_0 + term_1, axis=1)

# --------------------

def custom_binary_loss_lgbm(y_pred, train_data):
    """
    LGBM version of binary cross-entropy
    """
    y_pred = 1.0 / (1.0 + np.exp(-y_pred))
    y_true = train_data.get_label()
    y_true = np.expand_dims(y_true, axis=1)
    y_pred = np.expand_dims(y_pred, axis=1)
    epsilon_ = 1e-7
    y_pred = np.clip(y_pred, epsilon_, 1 - epsilon_)
    term_0 = (1 - y_true) * np.log(1 - y_pred + epsilon_)  # Cancels out when target is 1
    term_1 = y_true * np.log(y_pred + epsilon_)  # Cancels out when target is 0
    grad = -np.mean(term_0 + term_1, axis=1)
    hess = np.ones(grad.shape)
    return grad, hess
But using the above my LGBM model only predicts zeros. Now my dataset is balanced and everything looks cool so what's the error here?
params = {
    'objective': 'binary',
    'num_iterations': 100,
    'seed': 21
}
ds_train = lgb.Dataset(df_train[predictors], y, free_raw_data=False)
reg_lgbm = lgb.train(params=params, train_set=ds_train, fobj=custom_binary_loss_lgbm)
I also tried a different hessian, hess = (y_pred * (1. - y_pred)).flatten(). Although I don't know what the hessian really means, it didn't work either!
list(map(lambda x: 1.0 / (1.0 + np.exp(-x)), reg_lgbm.predict(df_train[predictors])))
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, .............]
Try setting the metric parameter to the string "None" in params, like this:
params = {
    'objective': 'binary',
    'metric': 'None',
    'num_iterations': 100,
    'seed': 21
}
Otherwise, according to the documentation, the algorithm will choose a default evaluation metric for the 'binary' objective.
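With 'metric' set to 'None', you would typically also supply your own evaluation function via feval so that something is still reported during training. A minimal sketch (custom_binary_eval is a hypothetical name, not from the original post):

def custom_binary_eval(y_pred, train_data):
    # evaluation counterpart of the custom objective: returns (name, value, is_higher_better)
    y_true = train_data.get_label()
    p = 1.0 / (1.0 + np.exp(-y_pred))  # raw scores -> probabilities
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return 'custom_logloss', loss, False

reg_lgbm = lgb.train(params=params, train_set=ds_train,
                     fobj=custom_binary_loss_lgbm, feval=custom_binary_eval)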

Can someone help me fit this array in KMeans clustering?

When I try to fit it with KMeans clustering it throws the error "ValueError: setting an array element with a sequence."
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5)
kmeans.fit(df)
Array description:
0      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
1      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
10     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
100    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
101    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
Name: Vector, Length: 179, dtype: object
Your column holds a list in each row. It needs to be expanded into multiple columns before being passed to KMeans.
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_json('/Users/roshansk/Downloads/NewsArticles.json')

# Expand the list column into one column per vector element
vectors = df.Vector.apply(pd.Series)

kmeans = KMeans(n_clusters=5)
kmeans.fit(vectors)
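An equivalent option (my suggestion, not part of the original answer) is to stack the lists into a 2-D NumPy array before fitting:

import numpy as np
X = np.vstack(df['Vector'].to_numpy())  # shape: (n_rows, vector_length)
kmeans = KMeans(n_clusters=5).fit(X)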

Optimization for faster numpy 'where' with boolean condition

I generate a bunch of 5-element vectors with:
def beam(n):
    # For performance considerations, see
    # https://software.intel.com/en-us/blogs/2016/06/15/faster-random-number-generation-in-intel-distribution-for-python
    try:
        import numpy.random_intel
        generator = numpy.random_intel.multivariate_normal
    except ModuleNotFoundError:
        import numpy.random
        generator = numpy.random.multivariate_normal
    return generator(
        [0.0, 0.0, 0.0, 0.0, 0.0],
        numpy.array([
            [1.0, 0.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 0.0, 0.2]
        ]),
        int(n)
    )
These vectors will be multiplied by 5x5 matrices (element-wise) and checked against boundaries. I use this:
b = beam(1e5)
bound = 1000
s = (b[:, 0]**2 + b[:, 3]**2) < bound**2
#b[np.where(s)] (equivalent performances)
b[s] # <= returned value from a function
It seems that this operation with 100k elements is quite time-consuming (3 ms on my machine). Is there an obvious (or less obvious) way to speed it up (the where part; the random generation is only there to give an example)?
As your components are uncorrelated, one obvious speedup would be to use the univariate normal instead of the multivariate:
>>> from timeit import repeat
>>> import numpy as np
>>>
>>> kwds = dict(globals=globals(), number=100)
>>>
>>> repeat('np.random.multivariate_normal(np.zeros((5,)), np.diag((1,1,1,1,0.2)), (100,))', **kwds)
[0.01475344318896532, 0.01471381587907672, 0.013099645031616092]
>>> repeat('np.random.normal((0,0,0,0,0), (1,1,1,1,np.sqrt(0.2)), (100, 5))', **kwds)
[0.003930734936147928, 0.004097769036889076, 0.004246715921908617]
Further, as it stands, your condition is extremely unlikely to fail, so you can just check s.all() and, if it is True, skip the indexing altogether.
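A small sketch of that fast path (my own illustration, reusing b, bound, and s from the question):

s = (b[:, 0]**2 + b[:, 3]**2) < bound**2
# Boolean fancy indexing copies the array; skip it when no row is filtered out.
filtered = b if s.all() else b[s]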

Gaussian elimination with partial pivoting (column)

I cannot find the mistake I made; could anyone help me? Thanks very much!
import math

def GASSEM():
    a0 = [12,-2,1,0,0,0,0,0,0,0,13.97]
    a1 = [-2,12,-2,1,0,0,0,0,0,0,5.93]
    a2 = [1,-2,12,-2,1,0,0,0,0,0,-6.02]
    a3 = [0,1,-2,12,-2,1,0,0,0,0,8.32]
    a4 = [0,0,1,-2,12,-2,1,0,0,0,-23.75]
    a5 = [0,0,0,1,-2,12,-2,1,0,0,28.45]
    a6 = [0,0,0,0,1,-2,12,-2,1,0,-8.9]
    a7 = [0,0,0,0,0,1,-2,12,-2,1,-10.5]
    a8 = [0,0,0,0,0,0,1,-2,12,-2,10.34]
    a9 = [0,0,0,0,0,0,0,1,-2,12,-38.74]
    A = [a0,a1,a2,a3,a4,a5,a6,a7,a8,a9]  # 10x11 augmented matrix
    interchange = [0,0,0,0,0,0,0,0,0,0,0]
    for i in range(1, 10):
        median = abs(A[i-1][i-1])
        for m in range(i, 10):  # pivoting
            if abs(A[m][i-1]) > median:
                median = abs(A[m][i-1])
                interchange = A[i-1]
                A[i-1] = A[m]
                A[m] = interchange
        for j in range(i, 10):  # creating the upper triangular matrix
            A[j] = [A[j][k] - (A[j][i-1]/A[i-1][i-1])*A[i-1][k] for k in range(0, 11)]
    for t in range(0, 10):  # print the upper triangular matrix
        print(A[t])
The output is not an upper triangular matrix, and I'm getting lost in the for loops...
When I run this code, the output is
[12, -2, 1, 0, 0, 0, 0, 0, 0, 0, 13.97]
[0.0, 11.666666666666666, -1.8333333333333333, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.258333333333333]
[0.0, 0.0, 11.628571428571428, -1.842857142857143, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, -5.886428571428571]
[0.0, 0.0, -2.220446049250313e-16, 11.622235872235873, -1.8415233415233416, 1.0, 0.0, 0.0, 0.0, 0.0, 6.679281326781327]
[0.0, 0.0, -3.518258683818212e-17, 0.0, 11.622218698800275, -1.8415517150256329, 1.0, 0.0, 0.0, 0.0, -22.185475397706252]
[0.0, 0.0, 1.3530439218911067e-17, 0.0, 0.0, 11.62216239813737, -1.841549039580908, 1.0, 0.0, 0.0, 24.359991632712457]
[0.0, 0.0, 5.171101701700419e-18, 0.0, 0.0, 0.0, 11.622161705324444, -1.84154850220678, 1.0, 0.0, -3.131238144426707]
[0.0, 0.0, -3.448243038110395e-19, 0.0, 0.0, 0.0, 0.0, 11.62216144141611, -1.8415485389982904, 1.0, -13.0921440313208]
[0.0, 0.0, -4.995725026226573e-19, 0.0, 0.0, 0.0, 0.0, 0.0, 11.622161418001749, -1.8415485322346454, 8.534950160892514]
[0.0, 0.0, -4.9488445836100553e-20, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 11.622161417603511, -36.26114362292296]
This is effectively upper triangular. The absolute values of the 'non-zero' entries in the third column of the lower triangle are all less than 1e-15. Given that the other values are 1 or greater, these small numbers look like floating-point subtraction errors in A[j][k] - (A[j][i-1]/A[i-1][i-1])*A[i-1][k] and can be treated as 0. Without more investigation, I don't know why the non-zero values are limited to this column.
For this data, the condition abs(A[m][i-1]) > median is never true, so the code inside the if block is never executed.
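If you want the printed matrix to show exact zeros, one option (my suggestion, not part of the original answer) is to clamp the tiny residuals inside GASSEM, just before the print loop:

    tol = 1e-12  # anything this small is round-off noise from the elimination step
    A = [[0.0 if abs(x) < tol else x for x in row] for row in A]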
