I have a pandas DataFrame which contains an incorrect value:
import pandas as pd

data = {'Model': ['A', 'B', 'A', 'B', 'A'], 'Value': [20, 40, 20, 40, -1]}
df = pd.DataFrame(data)
df
Out[46]:
  Model  Value
0     A     20
1     B     40
2     A     20
3     B     40
4     A     -1
I would like to replace the -1 with the unique value of model A, which in this case is 20.
How do I go about it? I have tried the following.
Note that in my case it's a large DataFrame with 2 million rows.
df2 = df[df.Value != -1]
pd.merge(df, df2, on='Model', how='left')
Out:
MemoryError: Unable to allocate 5.74 TiB for an array with shape (788568381621,) and data type int64
You don't need to merge, which creates all possible pairs of rows with the same Model. The following will do:
df['Value'] = df['Value'].mask(df['Value'].eq(-1)).groupby(df['Model']).transform('first')
Or you can also use map:
s = (df[df['Value'] != -1].drop_duplicates('Model')
.set_index('Model')['Value'])
df['Value'] = df['Model'].map(s)
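A runnable sketch of both approaches on the sample frame. Note that the mask route comes back as floats, because masking introduces an intermediate NaN:

```python
import pandas as pd

df = pd.DataFrame({'Model': ['A', 'B', 'A', 'B', 'A'],
                   'Value': [20, 40, 20, 40, -1]})

# Approach 1: hide the sentinel, then broadcast the first valid value per group
fixed1 = df['Value'].mask(df['Value'].eq(-1)).groupby(df['Model']).transform('first')

# Approach 2: build a Model -> Value lookup from the valid rows and map it
s = (df[df['Value'] != -1].drop_duplicates('Model')
       .set_index('Model')['Value'])
fixed2 = df['Model'].map(s)

print(fixed1.tolist())  # [20.0, 40.0, 20.0, 40.0, 20.0]
print(fixed2.tolist())  # [20, 40, 20, 40, 20]
```

If the float upcast matters, the map route keeps the original integer dtype.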
Here's a quick solution:
df['Value'] = df.groupby('Model')['Value'].transform('max')
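This only works because the sentinel -1 sorts below every valid Value; a runnable sketch with the column selected explicitly:

```python
import pandas as pd

df = pd.DataFrame({'Model': ['A', 'B', 'A', 'B', 'A'],
                   'Value': [20, 40, 20, 40, -1]})

# max() works here only because -1 is smaller than every valid value
df['Value'] = df.groupby('Model')['Value'].transform('max')
print(df['Value'].tolist())  # [20, 40, 20, 40, 20]
```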
I have a simple Pandas DataFrame with 3 columns. I am trying to transpose it and then rename the columns of the new dataframe, and I am having a bit of trouble.
df = pd.DataFrame({'TotalInvoicedPrice': [123],
'TotalProductCost': [18],
'ShippingCost': [5]})
I tried using
df = df.T
which transposes the DataFrame into:
TotalInvoicedPrice,123
TotalProductCost,18
ShippingCost,5
So now I have to add the column names "Metrics" and "Values" to this data frame.
I tried using
df.columns["Metrics","Values"]
but I'm getting errors.
What I need to get is DataFrame that looks like:
              Metrics  Values
0  TotalInvoicedPrice     123
1    TotalProductCost      18
2        ShippingCost       5
Let's reset the index, then set the column labels:
df.T.reset_index().set_axis(['Metrics', 'Values'], axis=1)
              Metrics  Values
0  TotalInvoicedPrice     123
1    TotalProductCost      18
2        ShippingCost       5
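As an alternative sketch, melt reshapes the one-row frame directly, with no transpose at all:

```python
import pandas as pd

df = pd.DataFrame({'TotalInvoicedPrice': [123],
                   'TotalProductCost': [18],
                   'ShippingCost': [5]})

# melt turns columns into rows directly, naming both output columns as it goes
out = df.melt(var_name='Metrics', value_name='Values')
print(out)
```

The rows come out in column order, which matches the desired result here.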
Maybe you can avoid the transpose operation (it adds a little performance overhead):
# your dataframe
df = pd.DataFrame({'TotalInvoicedPrice': [123],
                   'TotalProductCost': [18],
                   'ShippingCost': [5]})

# form lists from the column names and the first row's values
l1 = df.columns.values.tolist()
l2 = df.iloc[0].tolist()

# create the new dataframe
df2 = pd.DataFrame(list(zip(l1, l2)), columns=['Metrics', 'Values'])
print(df2)
This is what my dataframe looks like:
import pandas as pd

df = pd.DataFrame({"a": [[1, 2, 3], [4, 5, 6]],
                   "b": [[11, 22, 33], [44, 55, 66]],
                   "c": [[111, 222, 333], [444, 555, 666]]})
For each cell, I want to keep only the item at an index I choose.
I've tried the following to keep the first item of each list
df = df.apply(lambda x: x[0])
but it doesn't work.
Can anyone enlighten me on this? Thanks.
If you need the first value in every column, you need to apply the lambda function elementwise with DataFrame.applymap:
df = df.applymap(lambda x:x[0])
print(df)
   a   b    c
0  1  11  111
1  4  44  444
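Note that in pandas 2.1+ applymap was deprecated and renamed to DataFrame.map; a version-tolerant sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [[1, 2, 3], [4, 5, 6]],
                   "b": [[11, 22, 33], [44, 55, 66]],
                   "c": [[111, 222, 333], [444, 555, 666]]})

# DataFrame.map replaced applymap in pandas 2.1; fall back for older versions
elementwise = df.map if hasattr(df, 'map') else df.applymap
df = elementwise(lambda x: x[0])
print(df)
```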
I have a sample dataframe as given below.
import numpy as np
import pandas as pd

data = {'ID': ['A', 'B', 'C', 'D'],
        'Age': [[20], [21], [19], [24]],
        'Sex': [['Male'], ['Male'], ['Female'], np.nan],
        'Interest': [['Dance', 'Music'], ['Dance', 'Sports'], ['Hiking', 'Surfing'], np.nan]}
df = pd.DataFrame(data)
df
Each of the columns holds list values. I want to remove the lists and preserve the datatypes of the items inside them, for all columns.
The final output should look something shown below.
Any help is greatly appreciated. Thank you.
Option 1. You can use the .str column accessor to index the lists stored in the DataFrame values (or strings, or any other iterable):
# Replace columns containing length-1 lists with the only item in each list
df['Age'] = df['Age'].str[0]
df['Sex'] = df['Sex'].str[0]
# Pass the variable-length list into the join() string method
df['Interest'] = df['Interest'].apply(', '.join)
Option 2. explode Age and Sex, then apply ', '.join to Interest (passing a list of columns to explode requires pandas 1.3+):
df = df.explode(['Age', 'Sex'])
df['Interest'] = df['Interest'].apply(', '.join)
Both options return:
df
  ID Age     Sex         Interest
0  A  20    Male     Dance, Music
1  B  21    Male    Dance, Sports
2  C  19  Female  Hiking, Surfing
EDIT
Option 3. If you have many columns which contain lists with possible missing values as np.nan, you can get the list-column names and then loop over them as follows:
# Get columns which contain at least one python list
list_cols = [c for c in df
             if df[c].apply(lambda x: isinstance(x, list)).any()]
list_cols
['Age', 'Sex', 'Interest']
# Process each column
for c in list_cols:
    # If all lists in column c contain a single item:
    if (df[c].str.len().dropna() == 1).all():
        df[c] = df[c].str[0]
    else:
        # Guard the join so missing values (np.nan) pass through unchanged
        df[c] = df[c].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
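Putting option 3 together as a runnable sketch on the sample data, with the join guarded so missing values (np.nan) pass through unchanged:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C', 'D'],
                   'Age': [[20], [21], [19], [24]],
                   'Sex': [['Male'], ['Male'], ['Female'], np.nan],
                   'Interest': [['Dance', 'Music'], ['Dance', 'Sports'],
                                ['Hiking', 'Surfing'], np.nan]})

# Columns holding at least one python list
list_cols = [c for c in df
             if df[c].apply(lambda x: isinstance(x, list)).any()]

for c in list_cols:
    if (df[c].str.len().dropna() == 1).all():
        # Every list has a single item: unwrap it (NaN stays NaN)
        df[c] = df[c].str[0]
    else:
        # Variable-length lists: join, leaving missing values untouched
        df[c] = df[c].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)

print(df)
```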
I have the following dataframe df with 3 rows, where the 3rd row consists entirely of empty strings. I am trying to drop every row in which all of the columns are empty, but somehow the rows are not getting dropped. Below is my snippet.
import pandas as pd
d = {'col1': [1, 2, ''], 'col2': [3, 4, '']}
df = pd.DataFrame(data=d)
df = df.dropna(how='all')
Please suggest where I am going wrong.
You don't have NaN values; you have '', which is not NaN. So keep the rows in which any cell differs from the empty string:
df[df.ne('').any(axis=1)]
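A runnable sketch; an equivalent route is to convert the empty strings to real missing values first, so dropna behaves as expected:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, ''], 'col2': [3, 4, '']})

# Keep rows where at least one cell is not the empty string
out = df[df.ne('').any(axis=1)]
print(out)

# Equivalent: turn '' into real missing values, then dropna works as intended
out2 = df.replace('', pd.NA).dropna(how='all')
```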
Say I have a pandas dataframe. I can access the columns either by their name or by their index.
Is there a simple way in which I can retrieve the column index given its name?
Use get_loc on the columns Index object to return the ordinal index value:
In [283]:
df = pd.DataFrame(columns=list('abcd'))
df
Out[283]:
Empty DataFrame
Columns: [a, b, c, d]
Index: []
In [288]:
df.columns.get_loc('b')
Out[288]:
1
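For several names at once, Index.get_indexer is the vectorized counterpart (a small sketch):

```python
import pandas as pd

df = pd.DataFrame(columns=list('abcd'))

# get_loc returns one ordinal position; get_indexer handles several names at once
print(df.columns.get_loc('b'))             # 1
print(df.columns.get_indexer(['b', 'd']))  # [1 3]
```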
What do you mean by index exactly?
I bet you are referring to index as a list index, right?
Because Pandas has another kind of index too.
From my first understanding, you can do the following:
my_df = pd.DataFrame(columns=['A', 'B', 'C'])
my_columns = my_df.columns.tolist()
print(my_columns)  # yields ['A', 'B', 'C'], so you can recover the index by doing:
my_columns.index('C')  # yields 2