How to check if a Dataframe contains a list or dictionary - python-3.x

I have a dataframe:
col1 col2 col3 col4
A 11 [{'id':2}] {"price": 0.0}
B 21 [{'id':3}] {"price": 2.0}
C 31 [{'id':4}] {"price": 3.0}
I want to find out which columns are of datatype 'list' or 'dictionary' and, ideally, store the result in another list.
How do I go about it?
When I use this:
data.applymap(type).apply(pd.value_counts)
the output is:
col1 col2 col3 col4
0 a 11 [{'id':2}] {"price": 0.0}
1 b 21 [{'id':3}] {"price": 2.0}
2 c 31 [{'id':4}] {"price": 3.0}

IIUC,
we can use apply and literal_eval from the ast module in the standard library to build up a dictionary.
For performance reasons, let's work with only the first row of the data frame, as apply is computationally quite heavy:
from ast import literal_eval

data_dict = {}
for col in df.columns:
    try:
        col_type = df[col].iloc[:1].apply(literal_eval).apply(type)[0]
        data_dict[col] = col_type
    except (ValueError, SyntaxError):
        data_dict[col] = 'unable to evaluate'
print(data_dict)
{'col1': 'unable to evaluate',
'col2': 'unable to evaluate',
'col3': list,
'col4': dict}
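If the list/dict cells are genuine Python objects rather than strings (an assumption; the question's output does not make this clear), a minimal sketch that skips literal_eval and just inspects the type of the first value in each column could look like this:
import pandas as pd

# hypothetical reconstruction of the sample data with real list/dict cells
df = pd.DataFrame({
    'col1': ['A', 'B', 'C'],
    'col2': [11, 21, 31],
    'col3': [[{'id': 2}], [{'id': 3}], [{'id': 4}]],
    'col4': [{'price': 0.0}, {'price': 2.0}, {'price': 3.0}],
})

# map each column to the Python type of its first value
col_types = {col: type(df[col].iloc[0]) for col in df.columns}
# e.g. {'col1': str, 'col2': numpy.int64, 'col3': list, 'col4': dict}

# keep only the columns holding lists or dicts
list_or_dict_cols = [col for col, t in col_types.items() if t in (list, dict)]
print(list_or_dict_cols)  # ['col3', 'col4']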

Related

pd dataframe from lists and dictionary using series

I have a few lists and a dictionary and would like to create a pd dataframe.
Could someone help me out? I seem to be missing something.
One simple example below:
dict = {"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
Using Series I would do it like this:
df = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
and would have the lists within the df as expected.
For the dict I would do:
df = pd.DataFrame(list(dict.items()), columns=['col3', 'col4'])
And I would expect this result:
col1 col2 col3 col4
1 x a 1
2 y b 3
3 c text1
4
The problem is that, this way, the first df gets overwritten by the second call to pd.DataFrame.
How would I do this to end up with only one df with 4 columns?
I know one way would be to split the dict into 2 separate lists and just use Series over 4 lists, but I would think there is a better way to go from 2 lists and 1 dict, as above, directly to one df with 4 columns.
Thanks for the help.
You can also use pd.concat to concatenate the two dataframes:
df1 = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
df2 = pd.DataFrame(list(dict.items()), columns=['col3', 'col4'])
df = pd.concat([df1, df2], axis=1)
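With the sample lists and dict above, this should give one frame with four columns, NaN padding the shorter ones:
   col1 col2 col3   col4
0     1    x    a      1
1     2    y    b      3
2     3  NaN    c  text1
3     4  NaN  NaN    NaN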
Why not build each column separately via dict.keys() and dict.values() instead of using dict.items()?
df = pd.DataFrame({
    'col1': pd.Series(l1),
    'col2': pd.Series(l3),
    'col3': pd.Series(dict.keys()),
    'col4': pd.Series(dict.values())
})
print(df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
Alternatively:
column_values = [l1, l3, dict.keys(), dict.values()]
data = {f"col{i}": pd.Series(values) for i, values in enumerate(column_values)}
df = pd.DataFrame(data)
print(df)
col0 col1 col2 col3
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
You can unpack the zipped values of the list generated from d.items() and pass them to itertools.zip_longest, which fills in missing values so every column matches the maximum list length:
from itertools import zip_longest

import numpy as np
import pandas as pd

# dict is a Python builtin name, so use d for the variable instead
d = {"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]

df = pd.DataFrame(zip_longest(l1, l3, *zip(*d.items()), fillvalue=np.nan),
                  columns=['col1', 'col2', 'col3', 'col4'])
print (df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
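For clarity, zip(*d.items()) transposes the dict items into a tuple of keys and a tuple of values, which become col3 and col4; a quick illustration:
d = {"a": 1, "b": 3, "c": "text1"}
print(list(zip(*d.items())))
# [('a', 'b', 'c'), (1, 3, 'text1')]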

Using groupby and filters on a dataframe

I have a dataframe with both string and integer values.
Attaching a sample data dictionary to understand the dataframe that I have:
data = {
    'col1': ['A','A','A','B','B','B','C','C','C','D','D','D'],
    'col2': [10,20,30,10,20,30,10,20,30,10,20,30],
    'col3': ['X','X','X','X','Y','X','X','X','Y','Y','X','X'],
    'col4': [45,23,78,56,12,34,87,54,43,89,43,12],
    'col5': [3,4,6,4,3,2,4,3,5,3,4,6]
}
I need to extract data as under:
Max value from col4
Grouped by col1
Filtered out col3 from the result if value is Y
Filter col5 from the result to show only values not more than 5.
So I tried something and faced the following problems.
1 - I used the following method to find the max value in the data, but I am not able to find the max value within each group:
print(dataframe['col4'].max())  # this worked to get one max value
print(dataframe.groupby('col1').max())  # this doesn't give what I need
The second one doesn't work for me, as it returns the maximum of col2 as well. I need the result to keep the col2 value of the max row within each group.
2 - I am not able to apply a filter on both col3 (str) and col5 (int) in one command. Any way to do that?
print(dataframe[dataframe['col3'] != 'Y' & dataframe['col5'] < 6])  # generates an error
The output that I am expecting through this is:
col1 col2 col3 col4 col5
0 A 10 X 45 3
3 B 10 X 56 4
6 C 10 X 87 4
10 D 20 X 43 4
#
# 78 is max in group A, but ignored as col5 is 6 (we need < 6)
# Similarly, 89 is max in group D, but ignored as col3 is Y.
I apologize if I am doing something wrong. I am quite new to this.
Thank you.
I'm not a Python developer, but in my opinion you are going about it the wrong way.
You should have a list of structures instead of a structure of lists.
Then you can start working on such a list.
This is an example solution, so it could probably be done in a much smoother way:
data = {
    'col1': ['A','A','A','B','B','B','C','C','C','D','D','D'],
    'col2': [10,20,30,10,20,30,10,20,30,10,20,30],
    'col3': ['X','X','X','X','Y','X','X','X','Y','Y','X','X'],
    'col4': [45,23,78,56,12,34,87,54,43,89,43,12],
    'col5': [3,4,6,4,3,2,4,3,5,3,4,6]
}
newData = []
for i in range(len(data['col1'])):
    newData.append({'col1': data['col1'][i], 'col2': data['col2'][i], 'col3': data['col3'][i],
                    'col4': data['col4'][i], 'col5': data['col5'][i]})
withoutY = list(filter(lambda d: d['col3'] != 'Y', newData))
lessThan6 = list(filter(lambda d: d['col5'] < 6, withoutY))  # "not more than 5" means col5 < 6
values = set(map(lambda d: d['col1'], lessThan6))
groupped = [[d1 for d1 in lessThan6 if d1['col1'] == d2] for d2 in values]
result = []
for i in range(len(groupped)):
    result.append(max(groupped[i], key=lambda g: g['col4']))
sortedResult = sorted(result, key=lambda r: r['col1'])
print(sortedResult)
result:
[
{'col1': 'A', 'col2': 10, 'col3': 'X', 'col4': 45, 'col5': 3},
{'col1': 'B', 'col2': 10, 'col3': 'X', 'col4': 56, 'col5': 4},
{'col1': 'C', 'col2': 10, 'col3': 'X', 'col4': 87, 'col5': 4},
{'col1': 'D', 'col2': 20, 'col3': 'X', 'col4': 43, 'col5': 4}
]
OK, I didn't actually notice that.
So I would try something like this:
# fd is the filtered data
fd = data.query('col3 != "Y"').query('col5 < 6')
# or: fd = data[(data.col3 != 'Y') & (data.col5 < 6)]
# m is the max of col4 grouped by col1
m = fd.groupby('col1')['col4'].max()
This will group by col1 and get the max of col4, but the result only has two columns (col1 and col4).
I don't know exactly what you want to achieve.
If you want to keep the whole rows, here is the code:
result = fd[fd['col4'] == fd['col1'].map(m)]
You need to be careful, because you will not always get exactly one row per col1 value.
E.g. for this data:
data = pd.DataFrame({
'col1': ['A','A','A','A','B','B','B','B','C','C','C','D','D','D'],
'col2': [20,10,20,30,10,20,20,30,10,20,30,10,20,30],
'col3': ['X','X','X','X','X','X','Y','X','X','X','Y','Y','X','X'],
'col4': [45,45,23,78,45,56,12,34,87,54,43,89,43,12],
'col5': [1,3,4,6,1,4,3,2,4,3,5,3,4,6]})
Result will be:
col1 col2 col3 col4 col5
0 A 20 X 45 1
1 A 10 X 45 3
5 B 20 X 56 4
8 C 10 X 87 4
12 D 20 X 43 4
Additionally, if you want a normal index instead of the original labels (..., 8, 12), you could use "where" instead of "query".
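A more compact, pure-pandas sketch of the same idea (filter first, then keep each group's max-col4 row via idxmax), assuming the sample data dict from the question:
import pandas as pd

df = pd.DataFrame(data)  # data is the sample dict from the question

# keep rows where col3 is not 'Y' and col5 is below 6
fd = df[(df['col3'] != 'Y') & (df['col5'] < 6)]

# idxmax gives the index label of the max-col4 row within each col1 group
result = fd.loc[fd.groupby('col1')['col4'].idxmax()]
print(result)
   col1  col2 col3  col4  col5
0     A    10    X    45     3
3     B    10    X    56     4
6     C    10    X    87     4
10    D    20    X    43     4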

Combine text in dataframe python

Suppose I have this DataFrame:
df = pd.DataFrame({'col1': ['AC1', 'AC2', 'AC3', 'AC4', 'AC5'],
'col2': ['A', 'B', 'B', 'A', 'C'],
'col3': ['ABC', 'DEF', 'FGH', 'IJK', 'LMN']})
I want to combine the text of 'col3' if the values in 'col2' are duplicated. The result should be like this:
col1 col2 col3
0 AC1 A ABC, IJK
1 AC2 B DEF, FGH
2 AC3 B DEF, FGH
3 AC4 A ABC, IJK
4 AC5 C LMN
I started this exercise by finding the duplicated values in this dataframe:
col2 = df['col2']
df1 = df[col2.isin(col2[col2.duplicated()])]
Any suggestions on what I should do next?
You can use
a = df.groupby('col2').apply(lambda group: ','.join(group['col3']))
df['col3'] = df['col2'].map(a)
Output
print(df)
col1 col2 col3
0 AC1 A ABC,IJK
1 AC2 B DEF,FGH
2 AC3 B DEF,FGH
3 AC4 A ABC,IJK
4 AC5 C LMN
You might want to leverage the groupby and apply functions in Pandas
df.groupby('col2').apply(lambda group: ','.join(group['col3']))
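If you also want the ', ' separator from the expected output in a single step, a transform-based sketch (same groupby idea, using transform so the result aligns with the original index of the df defined in the question) would be:
df['col3'] = df.groupby('col2')['col3'].transform(', '.join)
print(df)
  col1 col2      col3
0  AC1    A  ABC, IJK
1  AC2    B  DEF, FGH
2  AC3    B  DEF, FGH
3  AC4    A  ABC, IJK
4  AC5    C       LMN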

How to add column name to cell in pandas dataframe?

How do I take a normal data frame, like the following:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 3
1 2 4
and produce a dataframe where the column name is added to the cell in the frame, like the following:
d = {'col1': ['col1=1', 'col1=2'], 'col2': ['col2=3', 'col2=4']}
df = pd.DataFrame(data=d)
df
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4
Any help is appreciated.
Make a new DataFrame containing the col*= strings, then add it to the original df with its values converted to strings. You get the desired result because addition concatenates strings:
>>> pd.DataFrame({col:str(col)+'=' for col in df}, index=df.index) + df.astype(str)
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4
You can use apply to put the column name into each cell and then join it with '=' and the values:
df.apply(lambda x: x.index+'=', axis=1)+df.astype(str)
Out[168]:
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4
You can try this
df.ne(0).mul(df.columns)+'='+df.astype(str)
Out[1118]:
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4
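Another sketch along the same lines is a column-wise apply, relying on each column Series carrying its label in s.name:
df.apply(lambda s: s.name + '=' + s.astype(str))
     col1    col2
0  col1=1  col2=3
1  col1=2  col2=4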

Independent Indexing in pandas.DataFrame

I have data with two independent indexes, say a date and an integer. Both determine unique rows. Now, I want to access rows by either the date or the integer. This does not seem to work if I create the data frame via
import pandas as pd
df = pd.DataFrame(data=[['a', 'b'], ['c', 'd'], ['e', 'f']], columns=['col1', 'col2'],
index=[[pd.to_datetime('2017-10-13'), pd.to_datetime('2017-10-14'), pd.to_datetime('2017-10-15')],
[123, 124, 125]])
since the indexes will be hierarchical. The data frame will be
col1 col2
2017-10-13 123 a b
2017-10-14 124 c d
2017-10-15 125 e f
With .loc I can access rows via the date, e.g. df.loc['2017-10-13'] works nicely and as expected (actually even better, since the string seems to be converted to datetime format automatically). Unfortunately, if I want to access a row via the integer index (for example with df.loc[123]) I get
KeyError: 'the label [123] is not in the [index]'
Does anyone know how to access lines via the integer index now?
You need tuples to select values in a MultiIndex:
print (df.loc[('2017-10-13', 123)])
col1 a
col2 b
Name: (2017-10-13 00:00:00, 123), dtype: object
print (df.loc[('2017-10-13', 123),:])
col1 col2
2017-10-13 123 a b
For more complicated selections, use slicers:
idx = pd.IndexSlice
print (df.loc[idx['2017-10-13', 123]])
col1 a
col2 b
Name: (2017-10-13 00:00:00, 123), dtype: object
idx = pd.IndexSlice
print (df.loc[idx['2017-10-13', 123],:])
col1 col2
2017-10-13 123 a b
idx = pd.IndexSlice
print (df.loc[idx['2017-10-13', 123], 'col1'])
2017-10-13 123 a
Name: col1, dtype: object
EDIT:
You need the DataFrame.xs function:
print (df.xs(123, level=1))
col1 col2
2017-10-13 a b
print (df.xs(123, level=1, drop_level=False))
col1 col2
2017-10-13 123 a b
You can also use query after setting names on the index levels, i.e.:
df.index.names=('a','b')
df.query('b==123')
Output :
col1 col2
a b
2017-10-13 123 a b
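Selecting by the integer level alone also works with a slice over the date level; a minimal sketch on the original MultiIndex frame from the question (which happens to be sorted, so this kind of slicing is allowed):
# slice(None) means "all dates" for the first level
print(df.loc[(slice(None), 123), :])
# or, equivalently, with pd.IndexSlice
idx = pd.IndexSlice
print(df.loc[idx[:, 123], :])
Both return the same single row as df.xs(123, level=1, drop_level=False) above.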
