pandas split all list column and get first value - python-3.x

I am trying to get the first element of each list, for every row and column, into a single DataFrame. Every cell is in list format and holds 2 elements. Here is what I tried. What syntax should I use to apply this to the entire DataFrame in pandas?
import pandas as pd
import numpy as np
def my_function(x):
    return x.replace('\[','').replace('\]','').split(',')[0]
t = pd.DataFrame(data={'col1': ['[blah,blah]','[test,bing]',np.nan], 'col2': ['[math,sci]',np.nan,['number','4']]})
print(t)
Not working:
t.apply(my_function) # AttributeError: 'Series' object has no attribute 'split'
t.apply(lambda x: str(x).replace('\[','').replace('\]','').split(',')[0]) # does not work
t.apply(lambda x: list(x)[0]) # gives first column and doesn't split
Trying to get this:
col1    col2
blah    math
test     NaN
 NaN  number

Use replace:
>>> t.replace(r'\[([^,]*).*', r'\1', regex=True)
   col1    col2
0  blah    math
1  test     NaN
2   NaN  number
But I think there is an error in how you create your sample dataframe. I changed it to:
t = pd.DataFrame(data={'col1': ['[blah,blah]','[test,bing]',np.nan],
                       'col2': ['[math,sci]',np.nan,'[number,4]']})
# The problem is here ------------------------------^^^^^^^^^^^^
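For the original sample, where one cell holds a real Python list and the others hold strings, an element-wise helper is another option. The attempts in the question fail because DataFrame.apply passes a whole column (a Series) to the function, not a single cell. The sketch below is only an illustration: first_item is a hypothetical helper name, and it assumes each cell is a list, a '[a,b]'-style string, or NaN.

```python
import numpy as np
import pandas as pd


def first_item(x):
    """Return the first element of a list or of a '[a,b]'-style string."""
    if isinstance(x, list):
        return x[0]
    if isinstance(x, str):
        # Drop the surrounding brackets, then keep the text before the first comma.
        return x.strip("[]").split(",")[0]
    return x  # NaN and any other value pass through unchanged


t = pd.DataFrame({
    "col1": ["[blah,blah]", "[test,bing]", np.nan],
    "col2": ["[math,sci]", np.nan, ["number", "4"]],
})

# apply() hands each column to the lambda; Series.map then visits every cell.
result = t.apply(lambda col: col.map(first_item))
print(result)
```

Going column by column with Series.map keeps the sketch independent of the pandas version, since the element-wise DataFrame method has been renamed across releases.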

Related

pandas df are being read as dict

I'm having some trouble with pandas. I opened a .xlsx file with pandas, but when I try to filter any information, it shows me the error
AttributeError: 'dict' object has no attribute 'head' #(or iloc, or loc, or anything else from DF/pandas)#
So, I did some research and realized that my table had turned into a dictionary (why?).
I'm trying to convert this mess into a proper dictionary so that I can then convert it into a proper df, because right now it shows characteristics of both. All I need is a df.
Here is the code:
import pandas as pd
df = pd.read_excel('report.xlsx', sheet_name = ["May"])
print(df)
Result: it shows the table plus "[60 rows x 24 columns]"
But when I try to filter or iterate, it raises every possible dict attribute error.
Some things I tried: .from_dict, xls.parse/(df.to_dict).
When I try to convert df to a dict properly, it shows
ValueError: If using all scalar values, you must pass an index
I tried this link: [https://stackoverflow.com/questions/17839973/constructing-pandas-dataframe-from-values-in-variables-gives-valueerror-if-usi)][1], but it didn't work. For some reason, one of the errors said that I should provide 2-d parameters, which is why I tried to create a new dict and do a sort of 'append', but that didn't work either...
Then I tried everything to set an index, but it doesn't let me rename columns because it says .iloc is not an attribute of dict.
I'm new to Python, but I have never seen 'pd.read_excel' open a DataFrame as a 'dict'. What should I do?
Thanks!
[1]: Constructing pandas DataFrame from values in variables gives "ValueError: If using all scalar values, you must pass an index"
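If it is a dict of DataFrames (which is what pd.read_excel returns when sheet_name is a list, as in sheet_name = ["May"]), try concatenating them: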
>>> dict_df = {"a":pd.DataFrame([{1:2,3:4},{1:4,4:6}]), "b":pd.DataFrame([{7:9},{1:4}])}
>>> dict_df
{'a':    1    3    4
0  2  4.0  NaN
1  4  NaN  6.0, 'b':      7    1
0  9.0  NaN
1  NaN  4.0}
>>> pd.concat(dict_df.values(),keys=dict_df.keys(), axis=1)
   a              b
   1    3    4    7    1
0  2  4.0  NaN  9.0  NaN
1  4  NaN  6.0  NaN  4.0
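If only one sheet is needed, it may be simpler to pull that DataFrame out of the dict by its sheet name, or to pass the sheet name as a plain string so that no dict is created. The sketch below uses a hand-built dict as a stand-in for the read_excel result so that it runs without the Excel file; the column names and values are made up for illustration.

```python
import pandas as pd

# Stand-in for pd.read_excel('report.xlsx', sheet_name=["May"]), which returns
# a dict keyed by sheet name. The data below is invented for this sketch.
sheets = {"May": pd.DataFrame({"item": ["a", "b"], "qty": [1, 2]})}

print(type(sheets))   # a dict, so .head(), .iloc and .loc are not available

df = sheets["May"]    # select the single DataFrame by its sheet name
print(df.head())

# Passing a string instead of a list should avoid the dict altogether:
# df = pd.read_excel('report.xlsx', sheet_name="May")
```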

pandas: removing duplicate values in rows with same index in two columns

I have a dataframe as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame({'text':['she is good', 'she is bad'], 'label':['she is good', 'she is good']})
I would like to compare row wise and if two same-indexed rows have the same values, replace the duplicate in the 'label' column with the word 'same'.
Desired output:
          text        label
0  she is good         same
1   she is bad  she is good
So far, I have tried the following:
df['label'] =np.where(df.query("text == label"), df['label']== ' ',df['label']==df['label'] )
but it returns an error:
ValueError: Length of values (1) does not match length of index (2)
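Your syntax is not correct; have a look at the documentation of numpy.where, which takes a boolean condition followed by the value to use where it is True and the value to use where it is False.
Check for equality between your two columns, and replace the matching values in your label column: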
import numpy as np
df['label'] = np.where(df['text'].eq(df['label']),'same',df['label'])
This prints:
          text        label
0  she is good         same
1   she is bad  she is good
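A pandas-only variant of the same idea is Series.mask, which replaces values wherever a boolean condition is True. This is just an alternative sketch, not the approach from the answer above; it rebuilds the sample frame from the question so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["she is good", "she is bad"],
    "label": ["she is good", "she is good"],
})

# Where text and label agree row-wise, overwrite label with 'same'.
df["label"] = df["label"].mask(df["text"].eq(df["label"]), "same")
print(df)
```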

How to find complete empty row in pandas

I am working on a dataset in which I need to find the completely empty rows.
Example:
  A    B    C    D
nan  nan  nan  nan
  1   ss  nan  3.0
  2   bb   w2  4.0
nan  nan  nan  nan
Currently, I am using
import pandas as pd
nan_col=[]
for col in df.columns:
    if df.loc[df[col].isnull()].empty !=True:
        nan_col.append(col)
This captures the columns that contain null values, but I need to capture the rows that are entirely null.
Expected answer: rows [0, 3]
Can anyone suggest a way to identify completely null rows in the DataFrame?
You can test whether every value in a row is missing with DataFrame.isna and DataFrame.all(axis=1), then get the matching index values by boolean indexing:
L = df.index[df.isna().all(axis=1)].tolist()
# alternative, slower on a huge DataFrame
# L = df[df.isna().all(axis=1)].index.tolist()
print(L)
[0, 3]
Alternatively, use dropna with set and sorted: take the index that remains after dropping the all-NaN rows, take the index of the whole DataFrame, and use ^ (symmetric difference) to keep the labels that are not in both. sorted then returns them as an ordered list:
print(sorted(set(df.index) ^ set(df.dropna(how='all').index)))
If the index might contain duplicates, labels cannot tell the rows apart, so work with row positions instead: build the all-NaN mask and use enumerate in a list comprehension to collect the positions where it is True. This still works when every index label is the same:
mask = df.isna().all(axis=1)
print([pos for pos, is_empty in enumerate(mask) if is_empty])
Both snippets output:
[0, 3]
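The snippets above assume df already exists. A self-contained sketch of the isna/all approach, with the sample table from the question rebuilt by hand (the exact dtypes of the original data are an assumption), could look like this:

```python
import numpy as np
import pandas as pd

# Sample data from the question; rows 0 and 3 are entirely NaN.
df = pd.DataFrame({
    "A": [np.nan, 1, 2, np.nan],
    "B": [np.nan, "ss", "bb", np.nan],
    "C": [np.nan, np.nan, "w2", np.nan],
    "D": [np.nan, 3.0, 4.0, np.nan],
})

# True for rows in which every column is missing.
all_missing = df.isna().all(axis=1)

empty_rows = df.index[all_missing].tolist()
print(empty_rows)  # [0, 3]
```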

convert datetime to date python --> error: unhashable type: 'numpy.ndarray'

By default pandas represents dates as datetime64[ns], so my columns look like [2016-02-05 00:00:00], but I just want the date 2016-02-05. I applied this code to a few columns:
df3a['MA'] = pd.to_datetime(df3a['MA'])
df3a['BA'] = pd.to_datetime(df3a['BA'])
df3a['FF'] = pd.to_datetime(df3a['FF'])
df3a['JJ'] = pd.to_datetime(df3a['JJ'])
.....
but it gives me this error: TypeError: unhashable type: 'numpy.ndarray'
My question is: why do I get this error, and how do I convert datetime to date for multiple columns (around 50)?
I would be grateful for your help.
One way to achieve what you'd like is with a DatetimeIndex. I've first created an Example DataFrame with 'date' and 'values' columns and tried from there on to reproduce the error you've got.
import pandas as pd
import numpy as np
# Example DataFrame with a DatetimeIndex (dti)
dti = pd.date_range('2020-12-01','2020-12-17') # dates from December 1 to December 17, 2020
values = np.random.choice(range(1, 101), len(dti)) # random values between 1 and 100
df = pd.DataFrame({'date':dti,'values':values}, index=range(len(dti)))
print(df.head())
        date  values
0 2020-12-01      85
1 2020-12-02     100
2 2020-12-03      96
3 2020-12-04      40
4 2020-12-05      27
In the example, the 'date' column already shows only the dates, without a time. pandas omits the time part from the display when every value falls on midnight; the dtype is still datetime64[ns].
Something I haven't tested, but which might work for you:
# Your dataframe
df3a['MA'] = pd.DatetimeIndex(df3a['MA'])
...
# automated transform for all columns (if all columns are datetimes!)
for label in df3a.columns:
    df3a[label] = pd.DatetimeIndex(df3a[label])
Use DataFrame.apply:
cols = ['MA', 'BA', 'FF', 'JJ']
df3a[cols] = df3a[cols].apply(pd.to_datetime)
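Neither snippet above removes the time component itself. One possible way to do that for several columns is the .dt accessor: .dt.normalize() keeps the datetime dtype and resets the time to midnight, while .dt.date returns plain Python date objects (object dtype). The sketch below uses made-up sample values and only two of the column names from the question.

```python
import datetime

import pandas as pd

# Invented sample values; the real df3a has around 50 such columns.
df3a = pd.DataFrame({
    "MA": ["2016-02-05 00:00:00", "2016-03-01 12:30:00"],
    "BA": ["2016-02-06 00:00:00", "2016-03-02 08:00:00"],
})
cols = ["MA", "BA"]

# Parse every listed column to datetime in one go.
parsed = df3a[cols].apply(pd.to_datetime)

# Option 1: keep the datetime dtype, but set every time to midnight.
normalized = parsed.apply(lambda s: s.dt.normalize())

# Option 2: convert to datetime.date objects (the columns become object dtype).
dates = parsed.apply(lambda s: s.dt.date)

print(normalized)
print(dates)
```

Option 1 is usually the better fit if the columns are still used for date arithmetic or resampling; Option 2 is mainly useful for display or export.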

How to replace all occurrences of a string in a pandas data frame with NaN values?

I am just trying to figure out if there is a quick way to replace all occurrences of a string in a pandas data frame with NaN values. Like something that will check each value in the data frame and replace it with a NaN value if it's a str datatype.
I know we can do this for a certain string using replace method as:
df.replace('Sample String', np.nan)
Thanks
Edit:
You can use this simple example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':['one', 'two', 'three', 'four']})
df['col1'] = df['col1'].map(lambda x: np.nan if x in ['two', 'four'] else x)
df['col1']:
0      one
1      NaN
2    three
3      NaN
Name: col1, dtype: object
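The example above replaces specific strings. To replace every value that is a str, whatever its content, one possible sketch is to test each cell with isinstance. The sample frame below is made up for illustration and mixes strings with numbers.

```python
import numpy as np
import pandas as pd

# Invented sample with strings and numbers mixed in the same columns.
df = pd.DataFrame({
    "col1": ["one", 2, "three", 4.5],
    "col2": [1, "x", 3, np.nan],
})

# Visit every cell column by column and swap any str for NaN.
cleaned = df.apply(
    lambda col: col.map(lambda v: np.nan if isinstance(v, str) else v)
)
print(cleaned)
```

Because this calls a Python function on each cell, it may be slow on very large frames.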
