pandas df are being read as dict - excel

I'm having some trouble with pandas. I opened a .xlsx file with pandas, but when I try to filter any information, I get this error:
AttributeError: 'dict' object has no attribute 'head'  # (or iloc, or loc, or any other DataFrame method)
So I did some research and realized that my table was loaded as a dictionary (why?).
I'm trying to convert it into a proper dictionary so that I can then convert it into a proper DataFrame, because right now it shows characteristics of both. I just need a DataFrame.
Here is the code:
import pandas as pd
df = pd.read_excel('report.xlsx', sheet_name = ["May"])
print(df)
Result: it prints the table followed by "[60 rows x 24 columns]".
But when I try to filter or iterate, I get every possible dict attribute error.
Some things I tried: .from_dict, xls.parse, df.to_dict.
When I try to convert df to dict properly, it shows
ValueError: If using all scalar values, you must pass an index
I tried this link: [https://stackoverflow.com/questions/17839973/constructing-pandas-dataframe-from-values-in-variables-gives-valueerror-if-usi)][1], but it didn't work. One of the errors said that I should provide 2-d parameters, which is why I tried to create a new dict and do a sort of 'append', but that didn't work either.
Then I tried everything I could to set an index, but I can't rename columns because it says .iloc is not an attribute of dict.
I'm new to Python, but I have never seen pd.read_excel return a dict instead of a DataFrame. What should I do?
Thanks!
[1]: Constructing pandas DataFrame from values in variables gives "ValueError: If using all scalar values, you must pass an index"

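Passing a list to sheet_name (here ["May"]) makes pd.read_excel return a dict of DataFrames keyed by sheet name. If you have such a dict of DataFrames, you can combine them with pd.concat: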
>>> dict_df = {"a":pd.DataFrame([{1:2,3:4},{1:4,4:6}]), "b":pd.DataFrame([{7:9},{1:4}])}
>>> dict_df
{'a':    1    3    4
0  2  4.0  NaN
1  4  NaN  6.0, 'b':      7    1
0  9.0  NaN
1  NaN  4.0}
>>> pd.concat(dict_df.values(),keys=dict_df.keys(), axis=1)
   a              b
   1    3    4    7    1
0  2  4.0  NaN  9.0  NaN
1  4  NaN  6.0  NaN  4.0
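For the original question specifically, the dict most likely comes from the list passed to sheet_name: pd.read_excel returns a dict of DataFrames when sheet_name is a list (or None), and a single DataFrame when it is a string or an integer. The following is a minimal sketch that uses a small in-memory dict as a stand-in for the real read_excel result; the column names and values are invented for illustration.

```python
import pandas as pd

# Stand-in for: sheets = pd.read_excel('report.xlsx', sheet_name=["May"])
# A list for sheet_name yields a dict like this one (the data here is invented).
sheets = {"May": pd.DataFrame({"item": ["a", "b", "c"], "qty": [1, 2, 3]})}

print(type(sheets).__name__)  # dict

# Pull the DataFrame out of the dict by its sheet name.
df = sheets["May"]
print(type(df).__name__)      # DataFrame
print(df.head())

# Reading with a plain string avoids the dict entirely (not run here):
# df = pd.read_excel('report.xlsx', sheet_name="May")
```

Once df is a real DataFrame, head, loc, and iloc work as expected.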

Related

pandas split all list column and get first value

I am trying to get the first element of the list in every row and column of a dataframe. Every cell holds a list-formatted value with 2 elements. Here is what I tried. What syntax should I use to apply a function to the entire dataframe in pandas?
import pandas as pd
import numpy as np
def my_function(x):
    return x.replace('\[','').replace('\]','').split(',')[0]
t = pd.DataFrame(data={'col1': ['[blah,blah]','[test,bing]',np.nan], 'col2': ['[math,sci]',np.nan,['number','4']]})
print(t)
not working:
t.apply(my_function) # AttributeError: 'Series' object has no attribute 'split'
t.apply(lambda x: str(x).replace('\[','').replace('\]','').split(',')[0]) # does not work
t.apply(lambda x: list(x)[0]) # gives first column and doesn't split
trying to get this:
col1 col2
blah math
test NaN
NaN number
Use replace:
>>> t.replace(r'\[([^,]*).*', r'\1', regex=True)
   col1    col2
0  blah    math
1  test     NaN
2   NaN  number
But I think you have an error in how you create your sample dataframe (the last cell is a real list, not a string). I changed it to:
t = pd.DataFrame(data={'col1': ['[blah,blah]','[test,bing]',np.nan],
                       'col2': ['[math,sci]',np.nan,'[number,4]']})
# The problem is here ------------------------------^^^^^^^^^^^^
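If some cells really are Python lists, as in the question's original sample, a regex replace will not touch them. A possible alternative, sketched here under the assumption that each cell is a bracketed string, a real list, or NaN, is to apply an element-wise function to every column. The helper name first_item is hypothetical.

```python
import numpy as np
import pandas as pd

def first_item(cell):
    """Return the first element of a list or of a '[a,b]'-style string."""
    if isinstance(cell, list):
        return cell[0] if cell else np.nan
    if isinstance(cell, str):
        return cell.strip("[]").split(",")[0]
    return cell  # NaN and anything else pass through unchanged

t = pd.DataFrame({
    "col1": ["[blah,blah]", "[test,bing]", np.nan],
    "col2": ["[math,sci]", np.nan, ["number", "4"]],
})

# apply works column by column; Series.map then visits every cell.
result = t.apply(lambda col: col.map(first_item))
print(result)
```

This is slower than the vectorized replace above, but it handles mixed cell types.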

Trying to divide two columns of a dataframe but get Nan

Background:
I deal with a dataframe and want to divide the two columns of this dataframe to get a new column. The code is shown below:
import pandas as pd
df = {'drive_mile': [15.1, 2.1, 7.12], 'price': [40, 9, 31]}
df = pd.DataFrame(df)
df['price/km'] = df[['drive_mile', 'price']].apply(lambda x: x[1]/x[0])
print(df)
And I get the below result:
drive_mile price price/km
0 15.10 40 NaN
1 2.10 9 NaN
2 7.12 31 NaN
Why would this happen? And how can I fix it?
As pointed out in the comments, you missed the axis=1 argument. Without it, apply passes each column (not each row) to the lambda, so the result is a Series indexed by the column names 'drive_mile' and 'price'. Those labels do not match the row index 0, 1, 2 when the result is assigned back to the DataFrame, which is why every value is NaN.
More importantly, do not use apply to perform a division! apply is often much less efficient than vectorized operations.
Use div:
df['price/km'] = df['price'].div(df['drive_mile'])
Or /:
df['price/km'] = df['price'] / df['drive_mile']
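To see where the NaN values come from, here is a small sketch built on the question's data. It first reproduces the column-wise apply, then compares a row-wise apply with the vectorized division. The variable names per_column, broken, and row_wise are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'drive_mile': [15.1, 2.1, 7.12], 'price': [40, 9, 31]})

# Without axis=1 the lambda receives each column, so the result has one
# value per column and is indexed by the column names.
per_column = df[['drive_mile', 'price']].apply(lambda col: col.iloc[1] / col.iloc[0])
print(per_column.index.tolist())  # ['drive_mile', 'price']

# Assigning it back aligns on the index; no label matches 0, 1, 2, so all NaN.
broken = df.copy()
broken['price/km'] = per_column
print(broken['price/km'].isna().all())  # True

# Vectorized division: price divided by distance, row by row.
df['price/km'] = df['price'] / df['drive_mile']

# Row-wise apply gives the same numbers, only slower.
row_wise = df.apply(lambda row: row['price'] / row['drive_mile'], axis=1)
print(df)
```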

How to find complete empty row in pandas

I am working on a dataset in which I need to find the completely empty rows.
example:
A B C D
nan nan nan nan
1 ss nan 3.0
2 bb w2 4.0
nan nan nan nan
Currently, I am using
import pandas as pd
nan_col=[]
for col in df.columns:
    if df.loc[df[col].isnull()].empty !=True:
        nan_col.append(col)
But this captures the columns that contain null values, whereas I need to capture the rows that are entirely null.
Expected answer: rows [0, 3]
Can anyone suggest a way to identify the completely null rows in the dataframe?
You can test whether every value in a row is missing with DataFrame.isna and DataFrame.all(axis=1), then get the matching index values by boolean indexing:
L = df.index[df.isna().all(axis=1)].tolist()
# alternative, slower on a huge DataFrame
#L = df[df.isna().all(axis=1)].index.tolist()
print (L)
[0, 3]
Or you could use dropna with set and sorted: take the index that remains after dropping the all-NaN rows and the index of the whole DataFrame, use ^ (symmetric difference) to get the labels that are not in both, and sort the result into a list:
print(sorted(set(df.index) ^ set(df.dropna(how='all').index)))
If the index might contain duplicates, labels cannot identify a row uniquely, so report row positions instead by enumerating the boolean mask:
mask = df.isna().all(axis=1)
print([pos for pos, is_empty in enumerate(mask) if is_empty])
Both codes output:
[0, 3]
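For reference, a self-contained sketch that rebuilds the question's example table and applies the isna/all approach. The DataFrame construction is an assumption about how the data was loaded.

```python
import numpy as np
import pandas as pd

# The example table from the question, rebuilt by hand.
df = pd.DataFrame({
    "A": [np.nan, 1, 2, np.nan],
    "B": [np.nan, "ss", "bb", np.nan],
    "C": [np.nan, np.nan, "w2", np.nan],
    "D": [np.nan, 3.0, 4.0, np.nan],
})

# True for rows in which every column is missing.
empty_mask = df.isna().all(axis=1)
empty_rows = df.index[empty_mask].tolist()
print(empty_rows)  # [0, 3]

# The complementary view: keep only rows with at least one value.
non_empty = df.dropna(how="all")
print(non_empty.index.tolist())  # [1, 2]
```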

Python dataframe with value 'NA' not fetching

I am trying to read an Excel file with the data below:
But when I debug the dataframe, it shows only:
Could you explain why NA is not showing in the dataframe?
Also, is there any way to fetch NA?
Python version : 3.7
In pd.read_excel there's an argument for this called na_values.
Quoted from the documentation:
Additional strings to recognize as NA/NaN.
Furthermore, you have to override the default NaN markers, which include the empty cell '', with the parameter keep_default_na=False.
Again quoting from the documentation:
If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to.
So the following should help your problem:
df = pd.read_excel('Filename.xlsx', na_values='NA', keep_default_na=False)
Output
     Item     Status
0    Soap        NaN
1  butter
2    Rice        NaN
3     pen  Available
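The effect of the two parameters can be tried without an Excel file. The sketch below uses pd.read_csv on an in-memory string, on the assumption that na_values and keep_default_na behave the same way there as in pd.read_excel. The sample data mirrors the output above.

```python
import io
import pandas as pd

csv_text = "Item,Status\nSoap,NA\nbutter,\nRice,NA\npen,Available\n"

# Defaults: both "NA" and the empty cell are parsed as NaN.
default = pd.read_csv(io.StringIO(csv_text))

# No default markers and no na_values: "NA" survives as a plain string.
keep_text = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# Only "NA" counts as missing; the empty cell stays an empty string.
only_na = pd.read_csv(io.StringIO(csv_text), na_values="NA", keep_default_na=False)

print(default)
print(keep_text)
print(only_na)
```

If the goal is to keep the literal text NA in the DataFrame, the second variant (keep_default_na=False without na_values) is the one to reach for.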

Merging sheets of excel using python

I am trying to take data from two sheets and compare them; where a name matches, I want to append a column. Let me explain by showing what I am doing and what output I am trying to get using Python.
This is my sheet1 from excel.xlsx:
It contains four columns: name, class, age and group.
This is my sheet2 from excel.xlsx:
It contains a default column and a name column, with extra names in it.
So now I am trying to match the names in sheet2 against sheet1. If a name in sheet1 also appears in sheet2, I want to add the default value for that name from sheet2.
This is the output I need:
As you can see, only Ravi and Neha have a default in sheet2 and a matching name in sheet1. Suhas and Aish don't have any default value, so nothing appears for them.
This code i tried:
import pandas as pd
import xlrd
df1 = pd.read_excel('stack.xlsx', sheet_name='Sheet1')
df2 = pd.read_excel('stack.xlsx', sheet_name='Sheet2')
df1['DEFAULT'] = df1.NAME.map(df2.set_index('NAME')['DEFAULT'].to_dict())
df1.to_excel('play.xlsx',index=False)
and I get an output Excel file like this:
The default for Ravi is missing.
Please help me get the expected output using Python.
Assuming you read each sheet into a dataframe (df = sheet1, df2 = sheet2),
it's quite easy and there are a few options (roughly in order of speed, from fastest to slowest):
# .merge
df = df.merge(df2, how='left', on='Name')
# pd.concat
df = pd.concat([df.set_index('Name'), df2.set_index('Name').Default], axis=1, sort=False, join='inner')
# .join
df = df.set_index('Name').join(df2.set_index('Name'))
# .map
df['Default'] = df.Name.map(df2.set_index('Name')['Default'].to_dict())
With merge and map the result looks like this (the set_index variants keep Name as the index, and join='inner' keeps only the names present in both sheets):
Name Default Class Age Group
0 NaN NaN 4 2 tig
1 Ravi 2.0 5 5 rose
2 NaN NaN 3 3 lily
3 Suhas NaN 5 5 rose
4 NaN NaN 2 2 sun
5 Neha 3.0 5 5 rose
6 NaN NaN 5 2 sun
7 Aish NaN 5 5 rose
Then you overwrite the original sheet by using df.to_excel
EDIT
So the code you shared has 3 problems. First, you only need one of the options I gave you, not all of them. Second, there's a missing ' when reading the first sheet into df. And last, you're inconsistent with the dataframe names: you defined df1 and df2 but then used just df in the code, which doesn't work.
So the correct code would be as follows:
import pandas as pd
import xlrd
df1 = pd.read_excel('stack.xlsx', sheet_name='Sheet1') #Here the ' was missing
df2 = pd.read_excel('stack.xlsx', sheet_name='Sheet2')
## Now choose one of the options. I used map here, but you can pick any of them.
## Use bracket assignment: df1.DEFAULT = ... would not create a new column.
df1['DEFAULT'] = df1.NAME.map(df2.set_index('NAME')['DEFAULT'].to_dict())
df1.to_excel('play.xlsx',index=False)
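If a name such as Ravi still gets no default even though it appears in both sheets, one possible cause (an assumption, since the sheets themselves are not shown) is stray whitespace in the NAME cells. The sketch below uses invented in-memory data in place of the two sheets and strips the names before a left merge.

```python
import pandas as pd

# Invented stand-ins for Sheet1 and Sheet2; note the trailing space in "Ravi ".
df1 = pd.DataFrame({
    "NAME": ["Ravi ", "Suhas", "Neha", "Aish"],
    "CLASS": [5, 5, 5, 5],
    "AGE": [5, 5, 5, 5],
    "GROUP": ["rose", "rose", "rose", "rose"],
})
df2 = pd.DataFrame({"DEFAULT": [2, 3, 1], "NAME": ["Ravi", "Neha", "Extra"]})

# Normalize the key column on both sides so equal names really compare equal.
df1["NAME"] = df1["NAME"].str.strip()
df2["NAME"] = df2["NAME"].str.strip()

# A left merge keeps every row of df1 and fills DEFAULT where a name matches.
merged = df1.merge(df2, how="left", on="NAME")
print(merged)
```

Names that exist only in df2 (here "Extra") are dropped by the left merge, and names without a match get NaN in DEFAULT.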
