Convert a multi-categorical column into two categories in pandas - python-3.x

I have a dataframe as shown below.
df:
ID tag
1 pandas
2 numpy
3 matplotlib
4 pandas
5 pandas
6 sns
7 sklearn
8 sklearn
9 pandas
10 pandas
To the above df, I would like to add a column named tag_binary, which will indicate whether the tag is pandas or not.
Expected output:
ID tag tag_binary
1 pandas pandas
2 numpy non_pandas
3 matplotlib non_pandas
4 pandas pandas
5 pandas pandas
6 sns non_pandas
7 sklearn non_pandas
8 sklearn non_pandas
9 pandas pandas
10 pandas pandas
I tried the code below using a dictionary and the map function. It worked fine, but I am wondering whether there is an easier way that avoids writing out the complete dictionary.
d = {'pandas': 'pandas', 'numpy': 'non_pandas', 'matplotlib': 'non_pandas',
     'sns': 'non_pandas', 'sklearn': 'non_pandas'}
df["tag_binary"] = df['tag'].map(d)

You can use where with an equality check to keep 'pandas' and fill everything else with 'non_pandas'.
df['tag_binary'] = df['tag'].where(df['tag'].eq('pandas'), 'non_pandas')
ID tag tag_binary
0 1 pandas pandas
1 2 numpy non_pandas
2 3 matplotlib non_pandas
3 4 pandas pandas
4 5 pandas pandas
5 6 sns non_pandas
6 7 sklearn non_pandas
7 8 sklearn non_pandas
8 9 pandas pandas
9 10 pandas pandas
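An equivalent one-liner, as a sketch using numpy.where (assuming numpy is imported as np), which picks the first value where the condition holds and the second everywhere else:
import numpy as np
df['tag_binary'] = np.where(df['tag'].eq('pandas'), 'pandas', 'non_pandas')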
If you want something a little more flexible, so that you can also map specific values to some label, you can leverage the fact that map returns NaN for keys that are not in your dict. Specify only the mappings you care about, then use fillna to deal with every other case.
# Could be more general like {'pandas': 'pandas', 'geopandas': 'pandas'}
d = {'pandas': 'pandas'}
df['pandas_binary'] = df['tag'].map(d).fillna('non_pandas')
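Note that map also accepts a callable, so the same idea works without any dictionary at all (a sketch):
df['tag_binary'] = df['tag'].map(lambda t: 'pandas' if t == 'pandas' else 'non_pandas')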

You can use apply:
def is_pandas(name):
    if name == 'pandas':
        return 'pandas'  # or True
    return 'non_pandas'  # or False

df['tag_binary'] = df['tag'].apply(is_pandas)  # apply takes the function directly; no lambda needed

If you specifically need Categorical Data, whether to assign an ordering hierarchy, to ensure that only these values are permitted in the column, or simply to reduce memory usage, you can create a CategoricalDtype, make the conversion with astype, then use fillna to fill the NaN values introduced when converting values that are not contained within the Categorical:
cat_dtype = pd.CategoricalDtype(['pandas', 'non_pandas'])
df['tag_binary'] = df['tag'].astype(cat_dtype).fillna('non_pandas')
df:
ID tag tag_binary
0 1 pandas pandas
1 2 numpy non_pandas
2 3 matplotlib non_pandas
3 4 pandas pandas
4 5 pandas pandas
5 6 sns non_pandas
6 7 sklearn non_pandas
7 8 sklearn non_pandas
8 9 pandas pandas
9 10 pandas pandas
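As a quick sanity check (assuming the code above has run), the column now carries the category dtype with exactly the two declared categories:
print(df['tag_binary'].dtype)           # category
print(df['tag_binary'].cat.categories)  # Index(['pandas', 'non_pandas'], dtype='object')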
Setup Used:
import pandas as pd
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'tag': ['pandas', 'numpy', 'matplotlib', 'pandas', 'pandas', 'sns',
            'sklearn', 'sklearn', 'pandas', 'pandas']
})

Related

How to access list of list values in columns in dataset

In my DataFrame I have a column containing list values. For example, I have columns A, B, C, and my output column. Column A holds a value like 12, column B holds 30, and column C holds a list of values like [0.01, 1.234, 2.31]. When I try to find the mean of the list values, it raises AttributeError: 'list' object has no attribute 'mean'. How can I convert each list value to its mean in the dataframe?
You can turn the column that contains the lists into its own DataFrame and calculate the row-wise mean.
import pandas as pd
df = ... # Original df
pd.DataFrame(df['column_with_lists'].values.tolist()).mean(axis=1)
This results in a pandas Series which looks like the following:
0 mean_of_list_row_0
1 mean_of_list_row_1
. .
. .
. .
n mean_of_list_row_n
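A minimal, self-contained demo of this approach (the column name C is an assumption here):
import pandas as pd
df = pd.DataFrame({'C': [[0.01, 1.234, 2.31], [1.0, 2.0]]})
# Expanding the lists into their own DataFrame pads short rows with NaN,
# and mean(axis=1) skips the NaN values
print(pd.DataFrame(df['C'].tolist()).mean(axis=1))
# 0    1.184667
# 1    1.500000
# dtype: float64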
You can use apply(np.mean) on the column with the lists in it to get the mean. For example:
Build a dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame([[2,4],[4,6]])
df[3] = [[5,7],[8,9,10]]
print(df)
0 1 3
0 2 4 [5, 7]
1 4 6 [8, 9, 10]
Use apply(np.mean)
print(df[3].apply(np.mean))
0 6.0
1 9.0
Name: 3, dtype: float64
If you want to convert that column into the mean of the lists:
df[3] = df[3].apply(np.mean)
print(df)
0 1 3
0 2 4 6.0
1 4 6 9.0

Sort pandas dataframe by a column

I have a pandas dataframe as below:
import pandas as pd
import numpy as np
import datetime
# initialise data of lists.
data = {'A': [1, 1, 1, 1, 2, 2, 2, 2],
        'B': [2, 3, 1, 5, 7, 7, 1, 6]}
# Create DataFrame
df = pd.DataFrame(data)
df
I want to sort 'B' within each group of 'A'.
Expected Output:
A B
0 1 1
1 1 2
2 1 3
3 1 5
4 2 1
5 2 6
6 2 7
7 2 7
You can sort a dataframe using the sort_values method. It will sort your dataframe with priority on A and then B, as requested.
df.sort_values(by=['A', 'B'])
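If you also want the clean 0 to 7 index shown in the expected output, the ignore_index option renumbers the result (a sketch; ignore_index requires pandas >= 1.0, on older versions chain .reset_index(drop=True) instead):
df.sort_values(by=['A', 'B'], ignore_index=True)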

Change the bar item name in Pandas

I have a test excel file like:
df = pd.DataFrame({'name': list('abcdefg'),
                   'age': [10, 20, 5, 23, 58, 4, 6]})
print (df)
name age
0 a 10
1 b 20
2 c 5
3 d 23
4 e 58
5 f 4
6 g 6
I use Pandas and matplotlib to read and plot it:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
excel_file = 'test.xlsx'
df = pd.read_excel(excel_file, sheet_name=0)
df.plot(kind="bar")
plt.show()
The resulting plot uses the index number as the item name. How can I change it to the name stored in the name column?
You can specify columns for x and y values in plot.bar:
df.plot(x='name', y='age', kind="bar")
Or create Series first by DataFrame.set_index and select age column:
df.set_index('name')['age'].plot(kind="bar")
#if multiple columns
#df.set_index('name').plot(kind="bar")
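If the tick labels come out rotated vertically, the rot parameter of plot controls the rotation (a sketch, assuming the setup above):
df.plot(x='name', y='age', kind="bar", rot=0)
plt.show()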

Sparse vector to dataframe in pyspark

I have a sparse vector in pyspark which looks like this:
SparseVector(5,{1:5,2:3,3:5,4:3,5:2})
How can I convert it to a pandas dataframe with two columns which looks like this:
ID VALUE
1 5
2 3
3 5
4 3
5 2
I tried sparsevector.zipWithIndex(), but it did not work.
Your example vector is malformed: you've specified a size of 5, so there cannot be an index 5 (valid indices run 0 through 4). After you fix that issue, you can simply call toArray(), which returns a numpy.ndarray. Just pass that into the constructor for a pandas.DataFrame.
from pyspark.mllib.linalg import SparseVector  # code works the same
# from pyspark.ml.linalg import SparseVector  # code works the same
import pandas as pd
a = SparseVector(5, {0: 5, 1: 3, 2: 5, 3: 3, 4: 2})  # note the index starts at 0
df = pd.DataFrame(a.toArray())
print(df)
# 0
#0 5.0
#1 3.0
#2 5.0
#3 3.0
#4 2.0
The code works the same whether you're working with pyspark.mllib.linalg.SparseVector or pyspark.ml.linalg.SparseVector.
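If you also want the two-column ID/VALUE layout from the question, a minimal sketch that builds it explicitly (the 1-based IDs are an assumption taken from the expected output):
df = pd.DataFrame({'ID': range(1, a.size + 1), 'VALUE': a.toArray()})
print(df)
#   ID  VALUE
#0   1    5.0
#1   2    3.0
#2   3    5.0
#3   4    3.0
#4   5    2.0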

Logarithm calculation in python

I am trying to take the logarithm of a series, but I get the following error.
TypeError: cannot convert the series to <class 'float'>
I have a dataframe with two columns, A and B:
A B
------
1 5
2 6
3 7
I am trying the following:
O/p = 10*math.log(10,df['A']+df['B'])
Required output:
row1 = 10*math.log(10,6)
row2 = 10*math.log(10,8)
row3 = 10*math.log(10,10)
But I am getting TypeError: cannot convert the series to <class 'float'>.
math.log is meant to work with a scalar float. To compute log10 of a dataframe column, which is a Series, use numpy.log10 instead.
Example:
import numpy
10*numpy.log10(df['A']+df['B'])
Here's a reproducible example:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame([[1,5],[2,6],[3,7]], columns=["A","B"])
>>> df
A B
0 1 5
1 2 6
2 3 7
>>> np.log10(df["A"]+df["B"])
0 0.778151
1 0.903090
2 1.000000
dtype: float64
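One caveat worth noting: in math.log(x, base) the second argument is the base, so math.log(10, 6) is the base-6 logarithm of 10, which is not the same number as log10(6). If the base-(A+B) logarithm of 10 is what was literally intended, a vectorized sketch using the change-of-base identity (values come out to roughly 12.85, 11.07, and 10.0):
>>> 10 * np.log(10) / np.log(df["A"] + df["B"])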
