Sparse vector to dataframe in pyspark

I have a SparseVector in PySpark which looks like this:
SparseVector(5,{1:5,2:3,3:5,4:3,5:2})
How can I convert it to a pandas DataFrame with two columns that looks like this?
ID VALUE
1 5
2 3
3 5
4 3
5 2
I tried sparsevector.zipWithIndex(), but it did not work.

Your example vector is malformed: you've specified a size of 5, so the valid indices are 0 through 4 and there cannot be an index 5. After you fix that issue, you can simply call toArray(), which returns a numpy.ndarray. Just pass that into the constructor for a pandas.DataFrame.
from pyspark.mllib.linalg import SparseVector # code works the same
#from pyspark.ml.linalg import SparseVector # code works the same
import pandas as pd
a = SparseVector(5,{0:5,1:3,2:5,3:3,4:2}) # note the index starts at 0
df = pd.DataFrame(a.toArray())
print(df)
# 0
#0 5.0
#1 3.0
#2 5.0
#3 3.0
#4 2.0
The code works the same whether you're working with pyspark.mllib.linalg.SparseVector or pyspark.ml.linalg.SparseVector.
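To get the two columns from the question (ID and VALUE), one option is to pair the dense array with its positions. The sketch below is only an illustration: it uses a plain NumPy array as a stand-in for the result of a.toArray() so that it runs without Spark, and the 1-based IDs are an assumption made to mirror the question's table.

```python
import numpy as np
import pandas as pd

# Stand-in for a.toArray() from the answer above (assumption: same values).
dense = np.array([5.0, 3.0, 5.0, 3.0, 2.0])

# 1-based IDs to mirror the question's table; use np.arange(len(dense))
# instead to keep the vector's own 0-based indices.
df = pd.DataFrame({"ID": np.arange(1, len(dense) + 1), "VALUE": dense})
print(df)
```

Note that toArray() is dense, so any position that is not stored in the sparse vector would show up as a row with the value 0.0.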

Related

convert multi-categorical column into two category in pandas

I have a dataframe as shown below.
df:
ID tag
1 pandas
2 numpy
3 matplotlib
4 pandas
5 pandas
6 sns
7 sklearn
8 sklearn
9 pandas
10 pandas
To the above df, I would like to add a column named tag_binary, which indicates whether the tag is pandas or not.
Expected output:
ID tag tag_binary
1 pandas pandas
2 numpy non_pandas
3 matplotlib non_pandas
4 pandas pandas
5 pandas pandas
6 sns non_pandas
7 sklearn non_pandas
8 sklearn non_pandas
9 pandas pandas
10 pandas pandas
I tried the code below using a dictionary and the map function. It worked fine, but I am wondering whether there is an easier way that does not require writing out the complete dictionary.
d = {'pandas':'pandas', 'numpy':'non_pandas', 'matplotlib':'non_pandas',
     'sns':'non_pandas', 'sklearn':'non_pandas'}
df["tag_binary"] = df['tag'].map(d)
You can use where with an equality check to keep 'pandas' and fill everything else with 'non_pandas'.
df['tag_binary'] = df['tag'].where(df['tag'].eq('pandas'), 'non_pandas')
ID tag tag_binary
0 1 pandas pandas
1 2 numpy non_pandas
2 3 matplotlib non_pandas
3 4 pandas pandas
4 5 pandas pandas
5 6 sns non_pandas
6 7 sklearn non_pandas
7 8 sklearn non_pandas
8 9 pandas pandas
9 10 pandas pandas
If you want something a little more flexible, so you can also map specific values to some label, then you can leverage the fact that map returns NaN for keys not in your dict. Specify only the mappings you care about, and then use fillna to deal with every other case.
# Could be more general like {'pandas': 'pandas', 'geopandas': 'pandas'}
d = {'pandas': 'pandas'}
df['tag_binary'] = df['tag'].map(d).fillna('non_pandas')
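As a small illustration of the more general mapping mentioned in the comment above, here is a sketch on a made-up four-row frame. The 'geopandas' tag and the sample rows are hypothetical and not part of the original data.

```python
import pandas as pd

# Hypothetical sample data, including a tag that is not in the question.
df = pd.DataFrame({"tag": ["pandas", "numpy", "geopandas", "sns"]})

# Hypothetical mapping: both 'pandas' and 'geopandas' collapse to 'pandas'.
d = {"pandas": "pandas", "geopandas": "pandas"}

# map leaves NaN for tags missing from d; fillna labels those rows.
df["tag_binary"] = df["tag"].map(d).fillna("non_pandas")
print(df["tag_binary"].tolist())
```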
You can also use apply:
def is_pandas(name):
    if name == 'pandas':
        return 'pandas'  # or True
    return 'non_pandas'  # or False
df['tag_binary'] = df['tag'].apply(is_pandas)
If you specifically need "Categorical Data" (to assign some ordering hierarchy, to ensure that only these values are permitted in the column, or simply to reduce the amount of space used), you can create a CategoricalDtype and make the conversion with astype. Then use fillna to fill the NaN values introduced when converting values that are not contained within the Categorical:
cat_dtype = pd.CategoricalDtype(['pandas', 'non_pandas'])
df['tag_binary'] = df['tag'].astype(cat_dtype).fillna('non_pandas')
df:
ID tag tag_binary
0 1 pandas pandas
1 2 numpy non_pandas
2 3 matplotlib non_pandas
3 4 pandas pandas
4 5 pandas pandas
5 6 sns non_pandas
6 7 sklearn non_pandas
7 8 sklearn non_pandas
8 9 pandas pandas
9 10 pandas pandas
Setup Used:
import pandas as pd
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'tag': ['pandas', 'numpy', 'matplotlib', 'pandas', 'pandas', 'sns',
            'sklearn', 'sklearn', 'pandas', 'pandas']
})

Pandas Apply Function returns numpy.nan instead of None

My DataFrame has null values, and I would like to replace them with None before sending them to a database. If I use the apply function, None gets written back as numpy.nan in pandas.
import pandas as pd
import numpy as np
df = pd.DataFrame([1,2, 4, 5, np.nan], columns = ['a'])
df.a.apply(lambda x: x if x==x else None)
Output:
0 1.0
1 2.0
2 4.0
3 5.0
4 NaN
Name: a, dtype: float64
If I run the function below, it does write None into the DataFrame.
df.a.apply(lambda x: None)
0 None
1 None
2 None
3 None
4 None
Name: a, dtype: object
This might be because the column's dtype is float64 rather than object. Is there any workaround for that? Thank you.
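One possible workaround, in line with the dtype guess above, is to build the result as an object-dtype Series so that None is stored as-is rather than coerced back to NaN. This is only a sketch; the name a_obj is made up for illustration, and whether the database driver accepts the resulting values is not checked here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, 4, 5, np.nan], columns=["a"])

# Build an object-dtype Series explicitly so that None is kept
# instead of being converted back to NaN by the float64 dtype.
a_obj = pd.Series(
    [None if pd.isna(x) else x for x in df["a"]],
    index=df.index,
    dtype=object,
)
print(a_obj)
```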

Sort pandas dataframe by a column

I have a pandas dataframe as below:
import pandas as pd
import numpy as np
import datetime
# initialise data of lists.
data = {'A' :[1,1,1,1,2,2,2,2],
        'B' :[2,3,1,5,7,7,1,6]}
# Create DataFrame
df = pd.DataFrame(data)
df
I want to sort 'B' within each group of 'A'.
Expected Output:
A B
0 1 1
1 1 2
2 1 3
3 1 5
4 2 1
5 2 6
6 2 7
7 2 7
You can sort a dataframe using the sort_values method. Passing by=['A', 'B'] sorts on A first and then on B within each value of A, as requested.
df.sort_values(by=['A', 'B'])
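sort_values returns a new DataFrame and keeps the original row labels. The expected output in the question shows the index renumbered from 0, so a sketch that reproduces it exactly could add reset_index(drop=True). The variable name result is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 2, 2],
                   "B": [2, 3, 1, 5, 7, 7, 1, 6]})

# Sort by A, then by B within each A; renumber the rows afterwards.
result = df.sort_values(by=["A", "B"]).reset_index(drop=True)
print(result)
```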

One Hot Encoding a single column

I am trying to use one hot encoder on the target column('Species') in the Iris dataset.
But I am getting the following errors:
ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
I did google the issue, and I found that most of the scikit-learn estimators need a 2D array rather than a 1D array.
At the same time, I also found that we can try passing the dataframe with its index to encode single columns, but it didn't work:
onehotencoder = OneHotEncoder(categorical_features=[df.columns.tolist().index('pattern_id')])
X = dataset.iloc[:,1:5].values
y = dataset.iloc[:, 5].values
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder= LabelEncoder()
y = labelencoder.fit_transform(y)
onehotencoder = OneHotEncoder(categorical_features=[0])
y = onehotencoder.fit_transform(y)
I am trying to encode a single categorical column and split into multiple columns (the way the encoding usually works)
ValueError: Expected 2D array, got 1D array instead: Reshape your
data either using array.reshape(-1, 1) if your data has a single
feature or array.reshape(1, -1) if it contains a single sample.
This says that the encoder expects a 2D array, so you need to reshape your 1D array into a single column.
You can do that with reshape(-1, 1):
>>> from sklearn import datasets
>>> from sklearn.preprocessing import OneHotEncoder
>>> import pandas as pd
>>> import numpy as np
>>> # load the iris dataset
>>> iris = datasets.load_iris()
>>> iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])
>>> y = iris.target.values
>>> onehotencoder = OneHotEncoder(categories='auto')
>>> y = onehotencoder.fit_transform(y.reshape(-1, 1))
>>> # y is now a SciPy sparse matrix with dtype numpy.float64;
>>> # call toarray() if you want a dense array
>>> print(y.toarray())
[[1. 0. 0.]
[1. 0. 0.]
. . . .
[0. 0. 1.]
[0. 0. 1.]]
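To make the reshape step concrete, here is a minimal NumPy-only sketch showing what reshape(-1, 1) does to a 1D label array. The sample labels are made up for illustration.

```python
import numpy as np

y = np.array([0, 1, 2, 1])   # 1D labels (made-up values)
y_col = y.reshape(-1, 1)     # one column; -1 lets NumPy infer the row count
print(y.shape, y_col.shape)
```

The 2D form, with one row per sample and a single feature column, is the layout that the error message asks for.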
You can also use the pandas get_dummies function:
>>> pd.get_dummies(iris.target).head()
0.0 1.0 2.0
0 1 0 0
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
Hope that helps!
For your case, since it looks like you are using the kaggle dataset, I would just use
import pandas as pd
pd.get_dummies(df.Species).head()
Out[158]:
Iris-setosa Iris-versicolor Iris-virginica
0 1 0 0
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
Note that the default here encodes all the classes (3 species). It is common to encode just two of them and compare differences in the means against the remaining baseline group; this is the default in R and is typical when doing regression or ANOVA. You can get that behavior with the drop_first argument.
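A short sketch of the drop_first argument on a small made-up Series (the four sample values are illustrative and are not the Kaggle data):

```python
import pandas as pd

# Made-up sample of species labels.
species = pd.Series(
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]
)

# drop_first=True drops the first category, which becomes the baseline.
dummies = pd.get_dummies(species, drop_first=True)
print(dummies.columns.tolist())
```

Here Iris-setosa is the baseline: a row with no indicator set in the two remaining columns is a setosa row.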
I came across a similar situation and found the method below to work:
Use double square brackets around the column name in the fit or fit_transform call, so that a 2D DataFrame is passed instead of a 1D Series.
one_hot_enc = OneHotEncoder()
arr = one_hot_enc.fit_transform(data[['column']])
df = pd.DataFrame(arr.toarray())  # fit_transform returns a sparse matrix by default
By default fit_transform gives you a sparse matrix; toarray() turns it into a dense array that you can convert to a pandas dataframe. You may append this to the original dataframe or directly assign to an existing column.
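The reason the double brackets help can be seen from the shapes alone. The sketch below uses a hypothetical three-row frame and only pandas:

```python
import pandas as pd

data = pd.DataFrame({"column": ["a", "b", "a"]})  # hypothetical data

print(data["column"].shape)    # single brackets: a 1D Series
print(data[["column"]].shape)  # double brackets: a 2D DataFrame
```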

Logarithm calculation in python

I am trying to take the logarithm of a Series, but I get the following error.
TypeError: cannot convert the series to class 'float'
I have a dataframe with two columns, A and B:
A B
------
1 5
2 6
3 7
I am trying the following:
O/p = 10*math.log(10,df['A']+df['B'])
Required output:
row1 = 10*math.log(10,6)
row2 = 10*math.log(10,8)
row3 = 10*math.log(10,10)
But getting TypeError: cannot convert the series to class 'float'
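math.log is meant to work with a scalar of type float. To compute log10 of a dataframe column, which is of type Series, use numpy.log10, which works element-wise. Note also that math.log takes the value first and the base second, so if log base 10 of x is what you want on a scalar, that is math.log(x, 10) or math.log10(x).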
Example:
import numpy
10*numpy.log10(df['A']+df['B'])
Here's a reproducible example:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame([[1,5],[2,6],[3,7]], columns=["A","B"])
>>> df
A B
0 1 5
1 2 6
2 3 7
>>> np.log10(df["A"]+df["B"])
0 0.778151
1 0.903090
2 1.000000
dtype: float64
>>>
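Putting the pieces together, here is a sketch that applies the factor of 10 from the question and compares the vectorized result against a row-by-row computation with the math module. It assumes that log base 10 of A + B is what is wanted; the names result and expected are chosen here for illustration.

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 5], [2, 6], [3, 7]], columns=["A", "B"])

# Vectorized: element-wise log10 over the whole Series.
result = 10 * np.log10(df["A"] + df["B"])

# Row-by-row equivalent with the math module, for comparison only.
expected = [10 * math.log10(a + b) for a, b in zip(df["A"], df["B"])]

print(np.allclose(result, expected))
```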
