pandas: convert multiple columns to string

I have some columns ['a', 'b', 'c', etc.] (a and c are float64 while b is object).
I would like to convert all columns to string and preserve NaNs.
I tried df[['a', 'b', 'c']] = df[['a', 'b', 'c']].astype(str), but that left blanks for the float64 columns.
Currently I am going through the columns one by one with the following:
df['a'] = df['a'].apply(str)
df['a'] = df['a'].replace('nan', np.nan)
Is the best way to use .astype(str) and then replace '' with np.nan? Side question: is there a difference between .astype(str) and .apply(str)?
Sample Input: (dtypes: a=float64, b=object, c=float64)
a, b, c, etc.
23, 'a42', 142, etc.
51, '3', 12, etc.
NaN, NaN, NaN, etc.
24, 'a1', NaN, etc.
Desired output: (dtypes: a=object, b=object, c=object)
a, b, c, etc.
'23', 'a42', '142', etc.
'51', '3', '12', etc.
NaN, NaN, NaN, etc.
'24', 'a1', NaN, etc.

This gives you the list of column names
lst = list(df)
This converts all the columns to string type
df[lst] = df[lst].astype(str)
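Note that .astype(str) also turns the missing values themselves into the literal string 'nan'. If you want to keep real NaNs, as the question asks, here is a minimal sketch (my addition) that could be used instead of the plain assignment above: remember where the nulls were, convert, then mask them back.
null_mask = df[lst].isna()                      # remember the original null positions
df[lst] = df[lst].astype(str).mask(null_mask)   # convert to str, then restore real NaN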

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [23.0, 51.0, np.nan, 24.0],
    'b': ["a42", "3", np.nan, "a1"],
    'c': [142.0, 12.0, np.nan, np.nan]})

# Keep NaN as NaN, keep strings as they are, and render floats as
# integer-looking strings ('23' instead of '23.0').
for col in df:
    df[col] = [np.nan if (not isinstance(val, str) and np.isnan(val)) else
               (val if isinstance(val, str) else str(int(val)))
               for val in df[col].tolist()]
>>> df
a b c
0 23 a42 142
1 51 3 12
2 NaN NaN NaN
3 24 a1 NaN
>>> df.values
array([['23', 'a42', '142'],
       ['51', '3', '12'],
       [nan, nan, nan],
       ['24', 'a1', nan]], dtype=object)
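One caveat to be aware of (my note, not part of the answer above): str(int(val)) drops any fractional part, so a value like 23.5 would come out as '23'. A hedged variant of the same loop that keeps decimals when they exist could use the 'g' format spec instead:
# same loop, but 23.0 -> '23' while 23.5 -> '23.5'
for col in df:
    df[col] = [np.nan if (not isinstance(val, str) and np.isnan(val)) else
               (val if isinstance(val, str) else format(val, 'g'))
               for val in df[col].tolist()]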

You could apply the .astype() function to every element of the DataFrame, or select only the columns of interest and convert them to string, as shown in the following ways.
In [41]: df1 = pd.DataFrame({
...: 'a': [23.0, 51.0, np.nan, 24.0],
...: 'b': ["a42", "3", np.nan, "a1"],
...: 'c': [142.0, 12.0, np.nan, np.nan]})
...:
In [42]: df1
Out[42]:
a b c
0 23.0 a42 142.0
1 51.0 3 12.0
2 NaN NaN NaN
3 24.0 a1 NaN
### Shows current data type of the columns:
In [43]: df1.dtypes
Out[43]:
a float64
b object
c float64
dtype: object
### Applying .astype() on each element of the dataframe converts the datatype to string
In [45]: df1.astype(str).dtypes
Out[45]:
a object
b object
c object
dtype: object
### Or, you could select the column of interest to convert it to strings
In [48]: df1[["a", "b", "c"]] = df1[["a","b", "c"]].astype(str)
In [49]: df1.dtypes ### Datatype update
Out[49]:
a object
b object
c object
dtype: object
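One caveat the session above does not show: .astype(str) converts the NaNs themselves into the string 'nan'. A minimal sketch (my addition, run after the conversion in In [48]) that restores real NaNs, using the replace approach the question itself mentions:
df1 = df1.replace('nan', np.nan)   # turn the literal 'nan' strings back into real NaN
df1.dtypes                         # the columns remain object dtype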

I did it this way.
Get all the values from a specific column, e.g. 'text':
k = df['text'].values
Then append each value to a newly declared string, e.g. thestring:
thestring = ""
for i in range(0, len(k)):
    thestring += k[i]
print(thestring)
Hence, all the strings in the pandas column 'text' have been put into one string variable.
cheers,
fairuz
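A hedged side note (my addition, not part of the answer above): the loop will raise a TypeError if 'text' contains NaN, since a float cannot be added to a str. Dropping nulls first, or using Series.str.cat, avoids that:
# concatenate all non-null values of the 'text' column into one string
thestring = df['text'].dropna().astype(str).str.cat(sep='')
print(thestring)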

Related

Pandas select preferred value from one of two columns to make a new column

I have a Pandas DataFrame with two columns of "complementary" data. For any given row, there are 3 possibilities:
1) Column A has a non-null value, and column B has a null value, NaN, that I want to replace with the non-null value from column A.
2) Column A has a null value, NaN, that I want to replace with the non-null value from column B.
3) Both columns A and B have null values, NaN, which means I'll keep NaN as the value for that row.
Here's a simplified version of my DataFrame:
df1 = pd.DataFrame({'A': ['keep1', np.nan, np.nan, 'keep4', np.nan],
                    'B': [np.nan, 'keep2', np.nan, np.nan, np.nan]})
I was thinking that as an intermediate step, I'd create a new column C with the entries I need:
df2 = pd.DataFrame({'A': ['keep1', np.nan, np.nan, 'keep4', np.nan],
                    'B': [np.nan, 'keep2', np.nan, np.nan, np.nan],
                    'C': ['keep1', 'keep2', np.nan, 'keep4', np.nan]})
Then I'd drop the two original columns A and B:
df_final = df2.drop(['A', 'B'], axis=1)
My actual DataFrame has hundreds of rows, and I've tried several approaches (boolean filters, looping through the DataFrame using iterrows, using DataFrame.where()) without success. I'd think this would be a simple problem, but I'm not seeing it. Any help is appreciated.
Thanks
You can use combine_first() to fill the gaps in A from B:
df1['C'] = df1['A'].combine_first(df1['B'])
#0 keep1
#1 keep2
#2 NaN
#3 keep4
#4 NaN
Use Series.fillna to replace the missing values in A with the values from B:
df1['C'] = df1.A.fillna(df1.B)
print (df1)
A B C
0 keep1 NaN keep1
1 NaN keep2 keep2
2 NaN NaN NaN
3 keep4 NaN keep4
4 NaN NaN NaN
To avoid the drop, it is possible to use DataFrame.pop to extract the columns:
df1['C'] = df1.pop('A').fillna(df1.pop('B'))
print (df1)
C
0 keep1
1 keep2
2 NaN
3 keep4
4 NaN

Pandas - Conditional drop duplicates based on number of NaN

I have a Pandas 0.24.2 DataFrame on Python 3.7.x, as below. I want to drop_duplicates() rows with the same Name based on conditional logic. A similar question can be found here: Pandas - Conditional drop duplicates, but my case is more complicated.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6],
    'Name': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Value1': [1, np.NaN, 0, np.NaN, 1, np.NaN],
    'Value2': [np.NaN, 0, np.NaN, 1, np.NaN, 0],
    'Value3': [np.NaN, 0, np.NaN, 1, np.NaN, np.NaN]
})
How is it possible to:
Drop duplicates for records with the same 'Name', keeping the one that has fewer NaNs?
If they have the same number of NaNs, keep the one that does NOT have a NaN in 'Value1'?
The desired output would be:
Id Name Value1 Value2 Value3
2 2 B NaN 0 0
3 3 C 0 NaN NaN
4 4 A NaN 1 1
The idea is to create helper columns for both conditions, sort, and remove duplicates:
# count = number of NaNs per row; count_val1 = 1 if Value1 is NaN, else 0
df1 = df.assign(count=df.isna().sum(axis=1),
                count_val1=df['Value1'].isna().view('i1'))
df2 = (df1.sort_values(['count', 'count_val1'])[df.columns]
          .drop_duplicates('Name')
          .sort_index())
print (df2)
Id Name Value1 Value2 Value3
1 2 B NaN 0.0 0.0
2 3 C 0.0 NaN NaN
3 4 A NaN 1.0 1.0
Here is a different solution. The goal is to create two columns that help sort the duplicate rows that will be deleted.
First, we create the helper columns.
df['count_nan'] = df.isnull().sum(axis=1)

Value1_nan = []
for row in df['Value1']:
    if row >= 0:          # any comparison with NaN is False, so NaN ends up in the else branch
        Value1_nan.append(0)
    else:
        Value1_nan.append(1)
df['Value1_nan'] = Value1_nan
We then sort the rows so that, within each Name, the row with the fewest NaNs comes first (ties broken by whether Value1 is NaN).
df.sort_values(by=['Name', 'count_nan', 'Value1_nan'], inplace=True, ascending=[True, True, True])
Finally, we keep only the first row of each Name group and drop the remaining duplicates; that is, we keep the row with the smallest number of NaNs, preferring a non-NaN Value1 on ties.
df = df.drop_duplicates(subset=['Name'], keep='first')

Splitting dictionary/list into Separate Columns

I have a movie dataset saved for revenue prediction. The genres column of this dataset contains a dictionary, or a list of two or more dictionaries, in each row. The DataFrame looks like this (it is not the actual DataFrame, but it is similar):
df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, [{'c':4},{'d':3}], [{'c':5, 'd':6},{'c':7, 'd':8}]]})
this is output
a b
0 1 {'c': 1}
1 2 [{'c': 4}, {'d': 3}]
2 3 [{'c': 5, 'd': 6}, {'c': 7, 'd': 8}]
I need to split this column into separate columns.
How can I do that? I used the apply(pd.Series) method; this is the output I'm getting:
0 1 c
0 NaN NaN 1.0
1 {'c': 4} {'d': 3} NaN
2 {'c': 5, 'd': 6} {'c': 5, 'd': 6} NaN
but I would like it like this, if possible:
a c d
0 1 1 NaN
1 2 4 3
2 3 5,7 6,8
I do not know if it is possible to achieve what you want by using apply(pd.Series) because you have mixed types in your 'b' column: you have dictionaries and list of dictionaries. Maybe it is, not sure.
However this is how I would do.
First, loop over your column to build a set with all the new column names: that is, the keys of the dictionaries.
Then you can use apply with a custom function to extract the value for each column.
Notice that the values in the new columns are strings; this is needed because you want to comma-concatenate cases like your row #2.
newcols = set()
for el in df['b']:
    if isinstance(el, dict):
        newcols.update(el.keys())
    elif isinstance(el, list):
        for i in el:
            newcols.update(i.keys())

def extractvalues(x, col):
    if isinstance(x['b'], dict):
        return x['b'].get(col, np.nan)
    elif isinstance(x['b'], list):
        return ','.join(str(i.get(col, '')) for i in x['b']).strip(',')

for nc in newcols:
    df[nc] = df.apply(lambda r: extractvalues(r, nc), axis=1)

df.drop('b', axis=1, inplace=True)
Your dataframe is now:
a c d
0 1 1 NaN
1 2 4 3
2 3 5,7 6,8

Manipulate values in pandas DataFrame columns based on matching IDs from another DataFrame

I have two dataframes like the following examples:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, 1],
                   'c': [np.nan, 1, 1]})
df_id = pd.DataFrame({'b': ['50', '4954', '93920', '20'],
                      'c': ['123', '100', '6', np.nan]})
print(df)
a b c
0 20 1.0 NaN
1 50 NaN 1.0
2 100 1.0 1.0
print(df_id)
b c
0 50 123
1 4954 100
2 93920 6
3 20 NaN
For each identifier in df['a'], I want to null the value in df['b'] if there is no matching identifier in any row in df_id['b']. I want to do the same for column df['c'].
My desired result is as follows:
result = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, np.nan],
                       'c': [np.nan, np.nan, 1]})
print(result)
a b c
0 20 1.0 NaN
1 50 NaN NaN # df_id['c'] did not contain '50'
2 100 NaN 1.0 # df_id['b'] did not contain '100'
My attempt to do this is here:
for i, letter in enumerate(['b', 'c']):
    df[letter] = (df.apply(lambda x: x[letter] if x['a']
                           .isin(df_id[letter].tolist()) else np.nan, axis=1))
The error I get:
AttributeError: ("'str' object has no attribute 'isin'", 'occurred at index 0')
This is in Python 3.5.2, pandas version 0.20.1.
You can solve your problem using this instead:
for letter in ['b', 'c']:  # took off enumerate because I didn't need it here; maybe you do for the rest of your code
    df[letter] = df.apply(lambda row: row[letter] if row['a'] in df_id[letter].tolist() else np.nan, axis=1)
just replace isin with in.
The problem is that when you use apply on df with axis=1, x represents a row of df, so x['a'] is a single element.
However, isin only works on Series or other list-like structures, which is what raises the error, so instead we just use in to check whether that element is in the list.
Hope that was helpful. If you have any questions please ask.
Adapting a hard-to-find answer from Pandas New Column Calculation Based on Existing Columns Values:
for i, letter in enumerate(['b', 'c']):
    mask = df['a'].isin(df_id[letter])
    name = letter + '_new'
    # for some reason, df[letter] = df.loc[mask, letter] does not work
    df.loc[mask, name] = df.loc[mask, letter]
    df[letter] = df[name]
    del df[name]
This isn't pretty, but seems to work.
If you have a bigger DataFrame and performance is important to you, you can first build a mask DataFrame and then apply it to your DataFrame.
First create the mask:
mask = df_id.apply(lambda x: df['a'].isin(x))
b c
0 True False
1 True False
2 False True
This can be applied to the original dataframe:
df.iloc[:,1:] = df.iloc[:,1:].mask(~mask, np.nan)
a b c
0 20 1.0 NaN
1 50 NaN NaN
2 100 NaN 1.0

mean of all the columns of a pandas DataFrame?

I'm trying to calculate the mean of all the columns of a DataFrame, but it looks like having a value in the B column of row 6 prevents the mean of the C column from being calculated. Why?
import pandas as pd
from decimal import Decimal
d = [
    {'A': 2, 'B': None, 'C': Decimal('628.00')},
    {'A': 1, 'B': None, 'C': Decimal('383.00')},
    {'A': 3, 'B': None, 'C': Decimal('651.00')},
    {'A': 2, 'B': None, 'C': Decimal('575.00')},
    {'A': 4, 'B': None, 'C': Decimal('1114.00')},
    {'A': 1, 'B': 'TEST', 'C': Decimal('241.00')},
    {'A': 2, 'B': None, 'C': Decimal('572.00')},
    {'A': 4, 'B': None, 'C': Decimal('609.00')},
    {'A': 3, 'B': None, 'C': Decimal('820.00')},
    {'A': 5, 'B': None, 'C': Decimal('1223.00')}
]
df = pd.DataFrame(d)
In : df
Out:
A B C
0 2 None 628.00
1 1 None 383.00
2 3 None 651.00
3 2 None 575.00
4 4 None 1114.00
5 1 TEST 241.00
6 2 None 572.00
7 4 None 609.00
8 3 None 820.00
9 5 None 1223.00
Tests:
# no mean for C column
In : df.mean()
Out:
A 2.7
dtype: float64
# mean for C column when row 6 is left out of the DF
In : df.head(5).mean()
Out:
A 2.4
B NaN
C 670.2
dtype: float64
# no mean for C column when row 6 is part of the DF
In : df.head(6).mean()
Out:
A 2.166667
dtype: float64
dtypes:
In : df.dtypes
Out:
A int64
B object
C object
dtype: object
In : df.head(5).dtypes
Out:
A int64
B object
C object
dtype: object
You could select particular columns if you only need the columns that contain numbers:
In [90]: df[['A','C']].mean()
Out[90]:
A 2.7
C 681.6
dtype: float64
or change the type, as @jezrael advises in a comment:
df['C'] = df['C'].astype(float)
df.mean probably tries to convert every object column to numeric; when that fails, it rolls back and calculates the mean only for the columns that actually hold numbers.
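A minimal sketch of that workaround (my addition, assuming the df built from d above; the expected numbers come from the outputs shown earlier in this thread):
df['C'] = pd.to_numeric(df['C'], errors='coerce')  # Decimal values become float64; anything unconvertible becomes NaN
print(df.dtypes)   # C is now float64
print(df.mean())   # A 2.7, C 681.6; B is skipped because it is a non-numeric object column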
