Pandas add empty rows as filler - python-3.x

Given the following data frame:
import pandas as pd
DF = pd.DataFrame({'COL1': ['A', 'A','B'],
'COL2' : [1,2,1],
'COL3' : ['X','Y','X']})
DF
COL1 COL2 COL3
0 A 1 X
1 A 2 Y
2 B 1 X
I would like to have an additional row for COL1 = 'B' so that both values (COL1 A and B) are represented by the COL3 values X and Y, with a 0 for COL2 in the generated row.
The desired result is as follows:
COL1 COL2 COL3
0 A 1 X
1 A 2 Y
2 B 1 X
3 B 0 Y
This is just a simplified example, but I need a calculation that can handle many such cases, rather than inserting the row of interest manually.
Thanks in advance!
UPDATE:
For a generalized scenario where there are many different combinations of values under 'COL1' and 'COL3', this works but is probably not nearly as efficient as it can be:
#Get unique set of COL3
COL3SET = set(DF['COL3'])
#Get unique set of COL1
COL1SET = set(DF['COL1'])
#Get all possible combinations of unique sets
import itertools
COMB = []
for combination in itertools.product(COL1SET, COL3SET):
    COMB.append(combination)
#Create dataframe from new set:
UNQ = pd.DataFrame({'COMB':COMB})
#Split tuples into columns
new_col_list = ['COL1unq','COL3unq']
for n, col in enumerate(new_col_list):
    UNQ[col] = UNQ['COMB'].apply(lambda COMB: COMB[n])
UNQ = UNQ.drop('COMB',axis=1)
#Merge original data frame with unique set data frame
DF = pd.merge(DF,UNQ,left_on=['COL1','COL3'],right_on=['COL1unq','COL3unq'],how='outer')
#Fill in empty values of COL1 and COL3 where they did not have records
DF['COL1'] = DF['COL1unq']
DF['COL3'] = DF['COL3unq']
#Replace 'NaN's in column 2 with zeros
DF['COL2'] = DF['COL2'].fillna(0)
#Get rid of COL1unq and COL3unq
DF.drop(['COL1unq','COL3unq'],axis=1, inplace=True)
DF

Something like this?
col1_b_vals = set(DF.loc[DF.COL1 == 'B', 'COL3'])
col1_not_b_col3_vals = set(DF.loc[DF.COL1 != 'B', 'COL3'])
missing_vals = col1_not_b_col3_vals.difference(col1_b_vals)
# .copy() avoids a SettingWithCopyWarning when the columns are overwritten below
missing_rows = DF.loc[(DF.COL1 != 'B') & (DF.COL3.isin(missing_vals)), :].copy()
missing_rows['COL1'] = 'B'
missing_rows['COL2'] = 0
>>> pd.concat([DF, missing_rows], ignore_index=True)
COL1 COL2 COL3
0 A 1 X
1 A 2 Y
2 B 1 X
3 B 0 Y
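For the generalized case in the update, a shorter route may be to build every (COL1, COL3) combination as a MultiIndex and reindex onto it. This is only a sketch: it assumes each (COL1, COL3) pair occurs at most once in DF (reindex fails on a duplicated index), and the names full_index and filled are made up for illustration.

```python
import pandas as pd

DF = pd.DataFrame({'COL1': ['A', 'A', 'B'],
                   'COL2': [1, 2, 1],
                   'COL3': ['X', 'Y', 'X']})

# Every combination of the unique COL1 and COL3 values
full_index = pd.MultiIndex.from_product(
    [DF['COL1'].unique(), DF['COL3'].unique()],
    names=['COL1', 'COL3'])

# Reindex onto the full set of pairs; pairs missing from DF get COL2 = 0
filled = (DF.set_index(['COL1', 'COL3'])
            .reindex(full_index, fill_value=0)
            .reset_index()[['COL1', 'COL2', 'COL3']])
print(filled)
#   COL1  COL2 COL3
# 0    A     1    X
# 1    A     2    Y
# 2    B     1    X
# 3    B     0    Y
```

If pairs can repeat in your data, aggregate them first (for example with groupby(...).sum()) before reindexing.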

Related

pd dataframe from lists and dictionary using series

I have a few lists and a dictionary and would like to create a pd DataFrame from them.
Could someone help me out? I seem to be missing something.
A simple example is below:
dict={"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
Using series I would do like this:
df = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
and the lists end up in the df as expected.
For the dict I would do:
df = pd.DataFrame(list(dict.items()), columns=['col3', 'col4'])
And would expect this result:
col1 col2 col3 col4
1 x a 1
2 y b 3
3 c text1
4
The problem is that, done this way, the first df is overwritten by the second call to pd.DataFrame.
How would I do this to have only one df with 4 columns?
I know one way would be to split the dict in 2 separate lists and just use Series over 4 lists, but I would think there is a better way to do this, out of 2 lists and 1 dict as above to have directly one df with 4 columns.
thanks for the help
You can also use pd.concat to concatenate the two dataframes side by side.
df1 = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
df2 = pd.DataFrame(list(dict.items()), columns=['col3', 'col4'])
df = pd.concat([df1, df2], axis=1)
Why not build each column separately via dict.keys() and dict.values() instead of using dict.items()?
df = pd.DataFrame({
'col1': pd.Series(l1),
'col2': pd.Series(l3),
'col3': pd.Series(dict.keys()),
'col4': pd.Series(dict.values())
})
print(df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
Alternatively:
column_values = [l1, l3, dict.keys(), dict.values()]
data = {f"col{i}": pd.Series(values) for i, values in enumerate(column_values)}
df = pd.DataFrame(data)
print(df)
col0 col1 col2 col3
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
You can split d.items() into keys and values with zip(*d.items()) and pass them, together with the lists, to itertools.zip_longest, which pads the shorter sequences with a fill value up to the length of the longest one:
from itertools import zip_longest
import numpy as np
# dict is a Python built-in name, so d is used for the variable
d = {"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
df = pd.DataFrame(zip_longest(l1, l3, *zip(*d.items()),
                              fillvalue=np.nan),
                  columns=['col1', 'col2', 'col3', 'col4'])
print (df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN

Duplicate rows in a dataframe according to a criterion from the table

I have a dataframe like this:
d = {'col1': ['a', 'b'], 'col2': [2, 4]}
df = pd.DataFrame(data=d)
df
>> col1 col2
0 a 2
1 b 4
and I want to repeat each row as many times as its col2 value, to get a table like this:
>> col1 col2
0 a 2
1 a 2
2 b 4
3 b 4
4 b 4
5 b 4
Thanks to everyone for the help!
Here's my solution using some numpy:
numRows = np.sum(df.col2)
blankSpace = np.zeros(numRows,).astype(str)
d2 = {'col1': blankSpace, 'col2': blankSpace}
df2 = pd.DataFrame(data=d2)
counter = 0
for i in range(df.shape[0]):
    letter = df.col1[i]
    numRowsForLetter = df.col2[i]
    for j in range(numRowsForLetter):
        df2.at[counter, 'col1'] = letter
        df2.at[counter, 'col2'] = numRowsForLetter
        counter += 1
df2 is your output dataframe!
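A loop-free alternative is probably to repeat the index labels and select with them. A minimal sketch, assuming col2 holds non-negative integers; the name repeated is just for illustration:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b'], 'col2': [2, 4]})

# Repeat each index label col2 times, then select those labels
repeated = df.loc[df.index.repeat(df['col2'])].reset_index(drop=True)
print(repeated)
#   col1  col2
# 0    a     2
# 1    a     2
# 2    b     4
# 3    b     4
# 4    b     4
# 5    b     4
```

Unlike the loop above, this keeps col2 as integers instead of converting the columns to strings.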

How to put unique values of 1 series as columns and count each occurrence of the unique values from the series per quarter?

I have a df that looks like this:
date col1
0 2020-01-09T19:25 a
1 2020-01-09T13:27 a
2 2020-01-04T13:44 b
3 2019-12-31T15:37 b
4 2019-12-23T21:47 c
I want to assign the unique values of col1 as columns headers and group the dates by quarter and count the unique values of col1 by quarter.
I can groupby quarters and count like so:
df['date'] = pd.to_datetime(df['date'])
df = df.groupby(df['date'].dt.to_period('Q'))['col1'].agg(['count'])
The df now looks like this:
count
dateresponded
2019Q4 2
2020Q1 3
I can't tell what the count of each unique value is, broken down by quarter.
I want the df to look like this:
a b c
dateresponded
2019Q4 1 1
2020Q1 2 1
IIUC, you want pd.crosstab
new_df = pd.crosstab(df['date'].dt.to_period('Q'),df['col1'],
rownames=['dateresponded'],
colnames=[None])
print(new_df)
We could also use groupby + DataFrame.unstack, and rename the axes using DataFrame.rename_axis.
new_df = (df.groupby([df['date'].dt.to_period('Q'), 'col1'])
            .size()
            .unstack(fill_value=0)
            .rename_axis(columns=None, index='dateresponded'))
print(new_df)
Or, with value_counts instead of size:
new_df = (df.groupby(df['date'].dt.to_period('Q'))
            .col1
            .value_counts()
            .unstack(fill_value=0)
            .rename_axis(columns=None, index='dateresponded'))
print(new_df)
Output
a b c
dateresponded
2019Q4 0 1 1
2020Q1 2 1 0

Filter a dataframe with NOT and AND condition

I know this question has been asked multiple times, but for some reason it is not working for my case.
So I want to filter the dataframe using the NOT and AND condition.
For example, my dataframe df looks like:
col1 col2
a 1
a 2
b 3
b 4
b 5
c 6
Now, I want to remove the rows where col1 is "a" AND col2 is 2.
My resulting dataframe should look like:
col1 col2
a 1
b 3
b 4
b 5
c 6
I tried the following, but even though I used &, it removes all the rows that have "a" in col1:
df = df[(df['col1'] != "a") & (df['col2'] != "2")]
To remove rows where col1 is "a" AND col2 is 2 means to keep rows where col1 isn't "a" OR col2 isn't 2 (by De Morgan's law, the negation of A AND B is NOT(A) OR NOT(B)):
df = df[(df['col1'] != "a") | (df['col2'] != 2)] # or "2", depending on whether the `2` is an int or a str
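If the OR form feels error-prone, you can also write the removal condition directly and invert it with ~. A small sketch on the question's data, assuming col2 holds integers; the names to_remove, kept and same are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'col2': [1, 2, 3, 4, 5, 6]})

# Rows to drop: col1 is "a" AND col2 is 2
to_remove = (df['col1'] == 'a') & (df['col2'] == 2)

# Keep everything else by negating the mask
kept = df[~to_remove]

# Equivalent filter written with OR
same = df[(df['col1'] != 'a') | (df['col2'] != 2)]

print(kept.equals(same))  # True
```

Both filters drop only the single row ("a", 2) and keep the other "a" row.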

How to add column name to cell in pandas dataframe?

How do I take a normal data frame, like the following:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 3
1 2 4
and produce a dataframe where the column name is added to the cell in the frame, like the following:
d = {'col1': ['col1=1', 'col1=2'], 'col2': ['col2=3', 'col2=4']}
df = pd.DataFrame(data=d)
df
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4
Any help is appreciated.
Make a new DataFrame containing the col*= strings, then add it to the original df with its values converted to strings. You get the desired result because addition concatenates strings:
>>> pd.DataFrame({col:str(col)+'=' for col in df}, index=df.index) + df.astype(str)
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4
You can use apply to set column name in cells and then join them with '=' and the values.
df.apply(lambda x: x.index+'=', axis=1)+df.astype(str)
Out[168]:
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4
You can try this
df.ne(0).mul(df.columns)+'='+df.astype(str)
Out[1118]:
col1 col2
0 col1=1 col2=3
1 col1=2 col2=4
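Another option that may be easier to read is a column-wise apply, where each column's name is available as s.name. This is a sketch that assumes the column names are strings; labelled is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# For each column, prefix its stringified values with "<column name>="
labelled = df.apply(lambda s: s.name + '=' + s.astype(str))
print(labelled)
#      col1    col2
# 0  col1=1  col2=3
# 1  col1=2  col2=4
```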
