Calculating tf-idf for name/surname in pyspark

Calculating tf-idf for name/surname in pyspark - apache-spark

I am having the following RDD(sample):
names_rdd.take(3)
[u'Daryll Dickenson', u'Dat Naijaboi', u'Duc Dung Lam']
And I am trying to calculate the tf_idf:
from pyspark.mllib.feature import HashingTF,IDF
hashingTF = HashingTF()
tf_names = hashingTF.transform(names_rdd)
tf_names.cache()
idf_names =IDF().fit(tf_names)
tfidf_names = idf_names.transform(tf_names)
I dont understand why tf_names.take(3) gives these results:
[SparseVector(1048576, {60275: 1.0, 134386: 1.0, 145380: 1.0, 274465: 1.0, 441832: 1.0, 579064: 1.0, 590058: 1.0, 664173: 2.0, 812399: 2.0, 845381: 2.0, 886510: 1.0, 897504: 1.0, 1045730: 1.0}),
SparseVector(1048576, {208501: 1.0, 274465: 1.0, 441832: 2.0, 515947: 1.0, 537935: 1.0, 845381: 1.0, 886510: 1.0, 897504: 3.0, 971619: 1.0}),
SparseVector(1048576, {274465: 2.0, 282612: 2.0, 293606: 1.0, 389709: 1.0, 738284: 1.0, 812399: 1.0, 845381: 2.0, 897504: 1.0, 1045730: 1.0})]
Shouldn't be each line have 2 values such as something like this:
[SparseVector(1048576, {60275: 1.0, 134386: 1.0}),
SparseVector(1048576, {208501: 1.0, 274465: 1.0}),
SparseVector(1048576, {274365: 2.0, 282612: 2.0})]
?

What I was doing wrong is that I had every line to split the words and make a list of it. Something like this:
def split_name(name):
list_name = name.split(' ')
list_name = [word.strip() for word in list_name]
return list_name
names = names_rdd.map(lambda name:split_name(name))
hashingTF = HashingTF()
tf_names = hashingTF.transform(names_rdd)
.
.
.

Related

Panda how to overwrite new value on previous value?

I have the following code :
# Append columns to an empty DataFrame.
self.df = pd.DataFrame(columns = ["ID", "Features"],index=['index1'])
tracking_id = output[4]
print(tracking_id in set(self.df['ID']))
if tracking_id in self.df['ID'] :
df2 = pd.DataFrame([[tracking_id], features])
self.df.update(df2)
print(self.df)
else :
self.df = self.df.append({'ID' : tracking_id, 'Features' : features}, ignore_index = True)
In this code, first I check whether an element with the same İD is available or not. Is it is available, previous feature value should be updated with the new one. Actually it works but not correctly.
My output is :
ID Features
0 NaN NaN
1 1.0 [[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 50.0...
2 4.0 [[0.0, 0.0, 0.0, 1.0, 89.0, 15.0, 0.0, 0.0, 10...
3 4.0 [[70.0, 41.0, 17.0, 41.0, 4.0, 0.0, 0.0, 0.0, ...
4 4.0 [[42.0, 18.0, 16.0, 14.0, 2.0, 0.0, 0.0, 0.0, ...
5 6.0 [[3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 8.0, 59.0...
6 6.0 [[0.0, 6.0, 7.0, 9.0, 12.0, 3.0, 0.0, 0.0, 51....
As you can see, the same value is appended like absent İD on the list. However sometimes it is not added to the list just is overwrited to previous ones. How can I solve this problem ?

Python GEKKO: Value of parameter changes while solving the model

I face the following problem with GEKKO: some parameters (.Param) are changing (others not) when solving a model and I cannot determine why.
Background: I am currently trying to translate code from EViews (see gennaro.zezza.it) to python. I use GEKKO to simulate a system consisting out of 11 equations (for now). I do want to use parameters (instead of constants which seem to work perfectly fine) as I need to ('exogenously') change their value over time (and thus need an array).
Example: In the following example, an 'economic system' reacts to new government expenditures. Here, I particularly face problems with "m.alpha1" and "m.alpha2" - if they are introduced as ".Param" their value will change to 1.0 (instead of 0.6 and 0.4) when solving the model. How can I stop GEKKO from doing this? (Again, I want to be able to change, e.g., alpha1 to 0.7 after time x. E.g., lower and upper bounds won't help here.)
Thanks for your help!!
Code:
from gekko import GEKKO
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
# Initialize model
m = GEKKO(remote=False)
tstart = 1945
tend = 2000
tdur = tend-tstart+1
m.time = np.linspace(0, tend-tstart, tdur)
# Model parameters
m.t = m.Param(value=m.time)
# Exogenous parameters
alpha1_ex = 0.6
alpha2_ex = 0.4
theta_ex = 0.2
w_ex = 1
# -as .Const
m.alpha1 = m.Const(value=alpha1_ex, name='Propensity to consume out of income')
m.alpha2 = m.Const(value=alpha2_ex, name='Propensity to consume out of wealth')
#m.theta = m.Const(value=theta_ex, name='Tax rate')
#m.w = m.Const(value=w_ex, name='Wage rate')
# -as .Param: issues with alpha1 & alpha2
#m.alpha1 = m.Param(value=np.full(tdur,alpha1_ex), name='Propensity to consume out of income')
#m.alpha2 = m.Param(value=np.full(tdur,alpha2_ex), name='Propensity to consume out of wealth')
m.theta = m.Param(value=np.full(tdur,theta_ex), name='Tax rate')
m.w = m.Param(value=np.ones(tdur), name='Wage rate')
# no issues with g_d
m.g_d = m.Param(value=np.zeros(tdur), name='Government goods, demand')
m.g_d[1:] = 20
# Endogenous variables
m.c_d = m.Var(value=0, name='Consumption goods demand by households')
m.c_s = m.Var(value=0, name='Consumption goods supply')
m.g_s = m.Var(value=0, name='Government goods, supply')
m.h_h = m.Var(value=0, name='Cash money held by households')
m.h_s = m.Var(value=0, name='Cash money supplied by government')
m.n_d = m.Var(value=0, name='Demand for labor')
m.n_s = m.Var(value=0, name='Supply for labor')
m.t_d = m.Var(value=0, name='Taxes, "demand"')
m.t_s = m.Var(value=0, name='Taxes, "supply"')
m.y = m.Var(value=0, name='Income (=GDP)')
m.yd = m.Var(value=0, name='Disposable income of households')
# Lag variables
m.h_h_lag = m.Var(value=0, name='Cash money held by households (t-1)')
m.delay(m.h_h,m.h_h_lag,1) # m.h_h_lag = m.h_h(t-1)
m.h_s_lag = m.Var(value=0, name='Cash money supplied by government (t-1)')
m.delay(m.h_s,m.h_s_lag,1)
# Equations
m.Equation(m.c_s == m.c_d)
m.Equation(m.g_s == m.g_d)
m.Equation(m.t_s == m.t_d)
m.Equation(m.n_s == m.n_d)
m.Equation(m.yd == m.w*m.n_s - m.t_s)
m.Equation(m.t_d == m.theta*m.w*m.n_s)
m.Equation(m.c_d == m.alpha1*m.yd + m.alpha2*m.h_h_lag)
m.Equation(m.h_s == m.h_s_lag + m.g_d - m.t_d)
m.Equation(m.h_h == m.h_h_lag + m.yd - m.c_d)
m.Equation(m.y == m.c_s + m.g_s)
m.Equation(m.n_d == m.y/m.w)
# Solve
m.options.IMODE = 4
m.solve(disp=False)
print("Alpha1 = ", m.alpha1.value)
print("Alpha2 = ", m.alpha2.value)
print("Theta = ", m.theta.value)
print("w = ", m.w.value)
# Plot results
fig, axes = plt.subplots(2, 2, sharex=True, figsize=(8, 7))
fig.canvas.manager.set_window_title('Figures Chapter 3')
fig.suptitle('SIM Model - basic')
x_major_ticks = np.arange(0,tdur,5)
axes[0,0].plot(m.time, m.g_d.value, '-', color='black', linewidth=1)
axes[0,0].legend([m.g_d.name],loc=4,fontsize=7)
axes[0,0].grid()
axes[0,0].set_xticks(x_major_ticks)
axes[1,0].plot(m.time, m.y.value, '-', color='red', linewidth=1)
axes[1,0].legend([m.y.name],loc=4,fontsize=7)
axes[1,0].grid()
axes[1,0].set_xlabel('Time (years)')
axes[1,0].set_xticks(x_major_ticks)
axes[0,1].plot(m.time, m.c_d.value, '-', color='blue', linewidth=0.75)
axes[0,1].plot(m.time, m.yd.value, '-', color='green', linewidth=0.75)
axes[0,1].legend([m.c_d.name,m.yd.name],loc=4,fontsize=7)
axes[0,1].grid()
axes[0,1].set_xticks(x_major_ticks)
ln1 = axes[1,1].plot(m.time, m.h_h.value, '-', color='purple', linewidth=0.75)
axes[1,1].tick_params(axis='y', labelcolor='purple')
ax2 = axes[1,1].twinx()
ln2 = ax2.plot(m.time, [a_i - b_i for a_i, b_i in zip(m.h_h, m.h_h_lag)], '-', color='orange', linewidth=0.75)
ax2.tick_params(axis='y', labelcolor='orange')
lns = ln1+ln2
axes[1,1].legend(lns,[m.h_h.name,'Household savings'],loc=4,fontsize=7)
axes[1,1].grid()
axes[1,1].set_xticks(x_major_ticks)
axes[1,1].set_xlabel('Time (years)')
plt.show()
Output #1: with m.alpha1 and m.alpha2 as .const
Alpha1 = 0.6
Alpha2 = 0.4
Theta = [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]
w = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Output #2: with m.alpha1 as .param
Alpha1 = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Alpha2 = 0.4
Theta = [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]
w = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

The problem is that the name of the variable name='Propensity to consume out of income' is over 25 characters long.
m.alpha1 = m.Param(value=np.full(tdur,alpha1_ex), name='Propensity to consume out of income')
m.alpha2 = m.Param(value=np.full(tdur,alpha2_ex), name='Propensity to consume out of wealth')
The model file is produced correctly (gk_model0.apm) but the data file (gk_model0.csv) header is truncated to 25 characters. The files are accessible with m.open_folder(). The bug is in this line of gk_write_files.py where numbers are output as strings of length 25.
np.savetxt(os.path.join(self._path,file_name), csv_data.T, delimiter=",", fmt='%1.25s')
I've added this as a bug report with tracking on GitHub. One work-around is to use shorter variable names or leave off the variable names.
m.alpha1 = m.Param(value=np.full(tdur,alpha1_ex)) # Propensity to consume out of income

np.nansum ignores zeros in Counters

I have two counters in python: counter1 and counter2. When I try to perform np.nansum on them, one of the fields is ignored because it contains zeros (if i change values to non-zeros the code works fine). Is there any workaround to get all the input keys in the output dict?
counter1 = Counter({'sensitivity': 1.0, 'dice': 1.0, 'specificity': 1.0, 'precision': 1.0, 'c-factor': 0.0})
counter2 = Counter({'sensitivity': 1.0, 'dice': 1.0, 'specificity': 1.0, 'precision': 1.0, 'c-factor': 0.0})
c = np.nansum([counter1, counter2])
the result i get is:
c= Counter({'sensitivity': 2.0, 'specificity': 2.0, 'dice': 2.0, 'precision': 2.0})
to compare, when i do:
counter1 = Counter({'sensitivity': 1.0, 'dice': 1.0, 'specificity': 1.0, 'precision': 1.0, 'c-factor': 0.1})
counter2 = Counter({'sensitivity': 1.0, 'dice': 1.0, 'specificity': 1.0, 'precision': 1.0, 'c-factor': 0.1})
c = np.nansum([counter1, counter2])
i get:
c=Counter({'sensitivity': 2.0, 'specificity': 2.0, 'dice': 2.0, 'precision': 2.0, 'c-factor': 0.2})

See this post. If you need to update if you want to keep zeros. Try doing:
c=np.nansum(counter1).copy() #I don't know why you use np.nansum, but you can pass it like this
c.update(np.nansum(counter2))
c
>>Counter({'c-factor': 0.0,
'dice': 2.0,
'precision': 2.0,
'sensitivity': 2.0,
'specificity': 2.0})

Finding frequency distribution of a list of numbers in python

I have a Long list of numbers like the following. I would like to find frequency distribution of each number, but I could not use Counter function to get frequency of each item, as they are integers and I get the error that it is not iterable , and therefore I could not convert the list to strings. I checked the similar questions but they did not work for me.
data=[1.0, 2.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 15.0, 0.0, 0.0, 0.0, 0.0, 3.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 7.0, 1.0, 0.0, 0.0, 4.0, 3.0, 3.0, 1.0, 1.0, 2.0, 4.0, 0.0, 1.0, 7.0, 2.0, 1.0, 1.0, 4.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 3.0, 2.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 10.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 3.0, 2.0, 11.0, 0.0, 5.0, 2.0, 0.0, 1.0, 2.0, 1.0, 8.0, 1.0, 0.0, 6.0, 2.0, 4.0, 0.0, 17.0, 0.0, 27.0, 2.0, 2.0, 1.0, 1.0, 3.0, 2.0, 0.0, 0.0, 6.0, 0.0, 0.0, 1.0, 1.0, 2.0, 0.0, 10.0, 0.0, 0.0, 5.0, 7.0, 1.0, 0.0, 1.0, 2.0, 1.0, 5.0, 2.0, 1.0, 9.0, 1.0, 0.0, 2.0, 0.0, 1.0, 3.0, 1.0, 1.0, 0.0, 0.0, 3.0, 5.0, 2.0, 0.0, 1.0, 9.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 2.0, 0.0, 1.0, 1.0, 3.0, 1.0, 2.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 5.0, 2.0, 3.0, 2.0, 8.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 4.0, 1.0, 0.0, 2.0, 1.0, 1.0, 19.0, 0.0, 1.0, 0.0, 1.0, 2.0, 1.0, 2.0, 1.0, 1.0, 5.0, 4.0, 2.0, 0.0, 1.0, 2.0, 0.0, 5.0, 0.0, 0.0, 3.0, 1.0, 0.0, 1.0, 1.0, 0.0, 3.0, 2.0, 4.0, 10.0, 2.0, 1.0, 3.0, 1.0, 0.0, 2.0, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 17.0, 0.0, 2.0, 3.0, 2.0, 1.0, 0.0, 2.0, 2.0, 1.0, 2.0, 5.0, 2.0, 1.0, 1.0, 1.0, 3.0, 0.0, 1.0, 1.0, 0.0, 4.0, 5.0, 2.0, 2.0, 1.0, 3.0, 0.0, 1.0, 3.0, 1.0, 1.0, 1.0, 0.0, 3.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 3.0, 5.0, 0.0, 1.0, 4.0, 0.0, 0.0, 1.0, 6.09]

You could use something simple like:
def freq(lst):
d = {}
for i in lst:
if d.get(i):
d[i] += 1
else:
d[i] = 1
return d
results:
>>> freq(data)
{0.0: 72, 1.0: 106, 2.0: 40, 3.0: 21, 4.0: 9, 5.0: 10, 6.0: 2, 7.0: 3, 8.0: 2, 9.0: 2, 10.0: 3, 11.0: 1, 15.0: 1, 17.0: 2, 19.0: 1, 6.09: 1, 27.0: 1}
Though Counter worked fine for me (I copy-pasted the data that you posted):
...
>>> from collections import Counter
>>> Counter(data)
Counter({1.0: 106, 0.0: 72, 2.0: 40, 3.0: 21, 5.0: 10, 4.0: 9, 7.0: 3, 10.0: 3, 6.0: 2, 8.0: 2, 9.0: 2, 17.0: 2, 11.0: 1, 15.0: 1, 19.0: 1, 6.09: 1, 27.0: 1})

distribution ={i:data.count(i)/len(data) for i in set(data)}

Issues with Logistic Regression for multiclass classification using PySpark

I am trying to use Logistic Regression to classify the datasets which has Sparse Vector in feature vector:
For full code base and error log, please check my github repo
Case 1: I tried using the pipeline of ML as follow:
# imported library from ML
from pyspark.ml.feature import HashingTF
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
print(type(trainingData)) # for checking only
print(trainingData.take(2)) # for of data type
lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=maximumIteration, regParam=re
gParamValue)
pipeline = Pipeline(stages=[lr])
# Train model
model = pipeline.fit(trainingData)
Got the following error:
<class 'pyspark.sql.dataframe.DataFrame'>
[Row(label=2.0, features=SparseVector(2000, {51: 1.0, 160: 1.0, 341: 1.0, 417: 1.0, 561: 1.0, 656: 1.0, 863: 1.0, 939: 1.0, 1021: 1.0, 1324: 1.0, 1433: 1.0, 1573: 1.0, 1604: 1.0, 1720: 1.0})), Row(label=3.0, features=SparseVector(2000, {24: 1.0, 51: 2.0, 119: 1.0, 167: 1.0, 182: 1.0, 190: 1.0, 195: 1.0, 285: 1.0, 432: 1.0, 539: 1.0, 571: 1.0, 630: 1.0, 638: 1.0, 656: 1.0, 660: 2.0, 751: 1.0, 785: 1.0, 794: 1.0, 801: 1.0, 823: 1.0, 893: 1.0, 900: 1.0, 915: 1.0, 956: 1.0, 966: 1.0, 1025: 1.0, 1029: 1.0, 1035: 1.0, 1038: 1.0, 1093: 1.0, 1115: 2.0, 1147: 1.0, 1206: 1.0, 1252: 1.0, 1261: 1.0, 1262: 1.0, 1268: 1.0, 1304: 1.0, 1351: 1.0, 1378: 1.0, 1423: 1.0, 1437: 1.0, 1441: 1.0, 1530: 1.0, 1534: 1.0, 1556: 1.0, 1562: 1.0, 1604: 1.0, 1711: 1.0, 1737: 1.0, 1750: 1.0, 1776: 1.0, 1858: 1.0, 1865: 1.0, 1923: 1.0, 1926: 1.0, 1959: 1.0, 1999: 1.0}))]
16/08/25 19:14:07 ERROR org.apache.spark.ml.classification.LogisticRegression: Currently, LogisticRegression with E
lasticNet in ML package only supports binary classification. Found 5 in the input dataset.
Traceback (most recent call last):
File "/home/LR/test.py", line 260, in <module>
accuracy = TrainLRCModel(trainData, testData)
File "/home/LR/test.py", line 211, in TrainLRCModel
model = pipeline.fit(trainingData)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 69, in fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 213, in _fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 69, in fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 133, in _fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 130, in _fit_java
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o207.fit.
: org.apache.spark.SparkException: Currently, LogisticRegression with ElasticNet in ML package only supports binary
classification. Found 5 in the input dataset.
at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:290)
at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Case 2: I search the possible alternate solution of above one and got that LogisticRegressionWithLBFGS will work on multi-class classificaton, I tried as follow:
#imported library
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel, LogisticRegressionWithSGD
print(type(trainingData)) # to check the dataset type
print(trainingData.take(2)) # To see the data
model = LogisticRegressionWithLBFGS.train(trainingData, numClasses=5)
print(type(model))
Got the following error:
<class 'pyspark.sql.dataframe.DataFrame'>
[Row(label=3.0, features=SparseVector(2000, {24: 1.0, 51: 2.0, 119: 1.0, 167: 1.0, 182: 1.0, 190: 1.0, 195: 1.0, 28
5: 1.0, 432: 1.0, 539: 1.0, 571: 1.0, 630: 1.0, 638: 1.0, 656: 1.0, 660: 2.0, 751: 1.0, 785: 1.0, 794: 1.0, 801: 1.
0, 823: 1.0, 893: 1.0, 900: 1.0, 915: 1.0, 956: 1.0, 966: 1.0, 1025: 1.0, 1029: 1.0, 1035: 1.0, 1038: 1.0, 1093: 1.
0, 1115: 2.0, 1147: 1.0, 1206: 1.0, 1252: 1.0, 1261: 1.0, 1262: 1.0, 1268: 1.0, 1304: 1.0, 1351: 1.0, 1378: 1.0, 14
23: 1.0, 1437: 1.0, 1441: 1.0, 1530: 1.0, 1534: 1.0, 1556: 1.0, 1562: 1.0, 1604: 1.0, 1711: 1.0, 1737: 1.0, 1750: 1
.0, 1776: 1.0, 1858: 1.0, 1865: 1.0, 1923: 1.0, 1926: 1.0, 1959: 1.0, 1999: 1.0})), Row(label=5.0, features=SparseV
ector(2000, {103: 1.0, 310: 1.0, 601: 1.0, 817: 1.0, 866: 1.0, 940: 1.0, 1023: 1.0, 1118: 1.0, 1339: 1.0, 1447: 1.0
, 1634: 1.0, 1776: 1.0}))]
Traceback (most recent call last):
File "/home/LR/test.py", line 260, in <module>
accuracy = TrainLRCModel(trainData, testData)
File "/home/LR/test.py", line 230, in TrainLRCModel
model = LogisticRegressionWithLBFGS.train(trainingData, numClasses=5)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/mllib/classification.py", line 382, in train
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/mllib/regression.py", line 206, in _regression_train_wrapper
TypeError: data should be an RDD of LabeledPoint, but got <class 'pyspark.sql.types.Row'>
Again I tried to convert the dataset into RDD of Labeled Point as follow i.e case 3:
Case 3: Converted the dataset into RDD of Labeled Point so that I can use LogisticRegressionWithLBFGS as follow:
#imported libraries
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel, LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
print(type(trainingData))
print(trainingData.take(2))
trainingData = trainingData.map(lambda row:[LabeledPoint(row.label,row.features)])
print('type of trainingData')
print(type(trainingData))
print(trainingData.take(2))
model = LogisticRegressionWithLBFGS.train(trainingData, numClasses=5)
print(type(model))
Got the following error:
<class 'pyspark.sql.dataframe.DataFrame'>
[Row(label=2.0, features=SparseVector(2000, {51: 1.0, 160: 1.0, 341: 1.0, 417: 1.0, 561: 1.0, 656: 1.0, 863: 1.0, 9
39: 1.0, 1021: 1.0, 1324: 1.0, 1433: 1.0, 1573: 1.0, 1604: 1.0, 1720: 1.0})), Row(label=3.0, features=SparseVector(
2000, {24: 1.0, 51: 2.0, 119: 1.0, 167: 1.0, 182: 1.0, 190: 1.0, 195: 1.0, 285: 1.0, 432: 1.0, 539: 1.0, 571: 1.0,
630: 1.0, 638: 1.0, 656: 1.0, 660: 2.0, 751: 1.0, 785: 1.0, 794: 1.0, 801: 1.0, 823: 1.0, 893: 1.0, 900: 1.0, 915:
1.0, 956: 1.0, 966: 1.0, 1025: 1.0, 1029: 1.0, 1035: 1.0, 1038: 1.0, 1093: 1.0, 1115: 2.0, 1147: 1.0, 1206: 1.0, 12
52: 1.0, 1261: 1.0, 1262: 1.0, 1268: 1.0, 1304: 1.0, 1351: 1.0, 1378: 1.0, 1423: 1.0, 1437: 1.0, 1441: 1.0, 1530: 1
.0, 1534: 1.0, 1556: 1.0, 1562: 1.0, 1604: 1.0, 1711: 1.0, 1737: 1.0, 1750: 1.0, 1776: 1.0, 1858: 1.0, 1865: 1.0, 1
923: 1.0, 1926: 1.0, 1959: 1.0, 1999: 1.0}))]
type of trainingData
<class 'pyspark.rdd.PipelinedRDD'>
[[LabeledPoint(2.0, (2000,[51,160,341,417,561,656,863,939,1021,1324,1433,1573,1604,1720],[1.0,1.0,1.0,1.0,1.0,1.0,1
.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))], [LabeledPoint(3.0, (2000,[24,51,119,167,182,190,195,285,432,539,571,630,638,656
,660,751,785,794,801,823,893,900,915,956,966,1025,1029,1035,1038,1093,1115,1147,1206,1252,1261,1262,1268,1304,1351,
1378,1423,1437,1441,1530,1534,1556,1562,1604,1711,1737,1750,1776,1858,1865,1923,1926,1959,1999],[1.0,2.0,1.0,1.0,1.
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1
.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))]]
Traceback (most recent call last):
File "/home/LR/test.py", line 260, in <module>
accuracy = TrainLRCModel(trainData, testData)
File "/home/LR/test.py", line 230, in TrainLRCModel
model = LogisticRegressionWithLBFGS.train(trainingData, numClasses=5)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/mllib/classification.py", line 381, in train
AttributeError: 'list' object has no attribute 'features'
Can someone please suggest where I am missing something, I wanted to use the Logistic Regression in PySpark and classify the multi-class classification.
Currently I am using spark version version 1.6.2 and python version Python 2.7.9 on google cloud.
Thanking you in advance for you kind help.

Case 1: There is nothing strange here, simply (as the error message says) LogisticRegression does not support multi-class classification, as clearly stated in the documentation.
Case 2: Here you have switched from ML to MLlib, which however does not work with dataframes but needs the input as RDD of LabeledPoint (documentation), hence again the error message is expected.
Case 3: Here is where things get interesting. First, you should remove the brackets from your map function, i.e. it should be
trainingData = trainingData.map(lambda row: LabeledPoint(row.label, row.features)) # no brackets after "row:"
Nevertheless, guessing from the code snippets you have provided, most probably you are going to get a different error now:
model = LogisticRegressionWithLBFGS.train(trainingData, numClasses=5)
[...]
: org.apache.spark.SparkException: Input validation failed.
Here is what happening (it took me some time to figure it out), using some dummy data (it's always a good idea to provide some sample data with your question):
# 3-class classification
data = sc.parallelize([
LabeledPoint(3.0, SparseVector(100,[10, 98],[1.0, 1.0])),
LabeledPoint(1.0, SparseVector(100,[1, 22],[1.0, 1.0])),
LabeledPoint(2.0, SparseVector(100,[36, 54],[1.0, 1.0]))
])
lrm = LogisticRegressionWithLBFGS.train(data, iterations=10, numClasses=3) # throws exception
[...]
: org.apache.spark.SparkException: Input validation failed.
The problem is that your labels must start from 0 (and this is nowhere documented - you have to dig in the Scala source code to see that this is the case!); so, mapping the labels in my dummy data above from (1.0, 2.0, 3.0) to (0.0, 1.0, 2.0), we finally get:
# 3-class classification
data = sc.parallelize([
LabeledPoint(2.0, SparseVector(100,[10, 98],[1.0, 1.0])),
LabeledPoint(0.0, SparseVector(100,[1, 22],[1.0, 1.0])),
LabeledPoint(1.0, SparseVector(100,[36, 54],[1.0, 1.0]))
])
lrm = LogisticRegressionWithLBFGS.train(data, iterations=10, numClasses=3) # no error now
Judging from your numClasses=5 argument, as well as from the label=5.0 in one of your printed records, I guess that most probably your code suffers from the same issue. Change your labels to [0.0, 4.0] and you should be fine.
(I suggest that you delete the other identical question you have opened here, for reducing clutter...)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Calculating tf-idf for name/surname in pyspark - apache-spark

Related

Panda how to overwrite new value on previous value?

Python GEKKO: Value of parameter changes while solving the model

np.nansum ignores zeros in Counters

Finding frequency distribution of a list of numbers in python

Issues with Logistic Regression for multiclass classification using PySpark

Categories

Resources