What's the most Pythonic/efficient way to conditionally split lengthy rows? - python-3.x

I have a data frame with a column Results which can get pretty lengthy. Since the data frame ends up in an Excel report, this is problematic: Excel will only make a row so tall before it simply stops displaying all of the data. Instead, what I'd like to do is split rows whose results exceed a certain length into multiple rows.
I wrote some code on a small-scale data frame to split the results into chunks of 2. I haven't worked out how to put each chunk in a new row. Additionally, I'm not sure this will be the most efficient method when my data frame goes from 6 rows to 35k+. What is the most efficient/Pythonic way to achieve what I want?
Original data frame
Result Date
0 [SUCCESS] 10/10/2019
1 [SUCCESS] 10/09/2019
2 [FAILURE] 10/08/2019
3 [Pending, Pending, SUCCESS] 10/07/2019
4 [FAILURE] 10/06/2019
5 [Pending, SUCCESS] 10/05/2019
Goal output
Result Date
0 [SUCCESS] 10/10/2019
1 [SUCCESS] 10/09/2019
2 [FAILURE] 10/08/2019
3 [Pending, Pending] 10/07/2019
4 [SUCCESS] 10/07/2019
5 [FAILURE] 10/06/2019
6 [Pending, SUCCESS] 10/05/2019
Code
import pandas as pd
import numpy as np
data = {'Result': [['SUCCESS'], ['SUCCESS'], ['FAILURE'], ['Pending', 'Pending', 'SUCCESS'], ['FAILURE'], ['Pending', 'SUCCESS']], 'Date': ['10/10/2019', '10/09/2019', '10/08/2019', '10/07/2019', '10/06/2019', '10/05/2019']}
df = pd.DataFrame(data)
df['Length of Results'] = df['Result'].str.len()
def chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i + n]

for i in range(len(df)):
    if df['Length of Results'][i] > 2:
        df['Result'][i] = list(chunks(df['Result'][i], 2))

df['Chunks'] = 1
for i in range(len(df)):
    if df['Length of Results'][i] > 2:
        df['Chunks'][i] = len(df['Result'][i])
df = df.loc[np.repeat(df.index.values, df.Chunks)]
df = df.reset_index(drop=True)
Code currently produces
Result Date Length of Results Chunks
0 [SUCCESS] 10/10/2019 1 1
1 [SUCCESS] 10/09/2019 1 1
2 [FAILURE] 10/08/2019 1 1
3 [[Pending, Pending], [SUCCESS]] 10/07/2019 3 2
4 [[Pending, Pending], [SUCCESS]] 10/07/2019 3 2
5 [FAILURE] 10/06/2019 1 1
6 [Pending, SUCCESS] 10/05/2019 2 1
df.to_dict()
{'Result': {0: ['SUCCESS'], 1: ['SUCCESS'], 2: ['FAILURE'], 3: [['Pending', 'Pending'], ['SUCCESS']], 4: [['Pending', 'Pending'], ['SUCCESS']], 5: ['FAILURE'], 6: ['Pending', 'SUCCESS']}, 'Date': {0: '10/10/2019', 1: '10/09/2019', 2: '10/08/2019', 3: '10/07/2019', 4: '10/07/2019', 5: '10/06/2019', 6: '10/05/2019'}, 'Length of Results': {0: 1, 1: 1, 2: 1, 3: 3, 4: 3, 5: 1, 6: 2}, 'Chunks': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 1, 6: 1}}

I would recommend putting each Pending/SUCCESS/FAILURE on its own row, using df.explode if you are on pandas 0.25+:
df = df.explode('Result')
df.groupby(['Date','Result']).size().reset_index(name='Length')
Output:
Date Result Length
0 10/05/2019 Pending 1
1 10/05/2019 SUCCESS 1
2 10/06/2019 FAILURE 1
3 10/07/2019 Pending 2
4 10/07/2019 SUCCESS 1
5 10/08/2019 FAILURE 1
6 10/09/2019 SUCCESS 1
7 10/10/2019 SUCCESS 1
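If you do need the chunks-of-at-most-two layout from the question rather than one result per row, here is a minimal sketch building on explode (pandas 0.25+; the chunk helper column is my own addition):
import pandas as pd

df = pd.DataFrame({'Result': [['SUCCESS'], ['Pending', 'Pending', 'SUCCESS']],
                   'Date': ['10/10/2019', '10/07/2019']})

# one result per row; explode repeats the original row index
s = df.explode('Result')
# number the results within each original row, two per chunk
s['chunk'] = s.groupby(level=0).cumcount() // 2
out = (s.groupby([s.index, 'chunk'])
         .agg({'Result': list, 'Date': 'first'})
         .reset_index(drop=True))
print(out)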

You can do:
s=df[['Date','Result']].explode('Result')
t=s.groupby(['Date','Result'])['Result'].transform('size')>1
s.groupby([s.Date,t]).Result.agg(list).reset_index(level=0).reset_index(drop=True)
Out[65]:
Date Result
0 10/05/2019 [Pending, SUCCESS]
1 10/06/2019 [FAILURE]
2 10/07/2019 [SUCCESS]
3 10/07/2019 [Pending, Pending]
4 10/08/2019 [FAILURE]
5 10/09/2019 [SUCCESS]
6 10/10/2019 [SUCCESS]

Related

Count positive, negative, or zero values for multiple columns in Python

Given a dataset as follows:
[{'id': 1, 'ltp': 2, 'change': nan},
{'id': 2, 'ltp': 5, 'change': 1.5},
{'id': 3, 'ltp': 3, 'change': -0.4},
{'id': 4, 'ltp': 0, 'change': 2.0},
{'id': 5, 'ltp': 5, 'change': -0.444444},
{'id': 6, 'ltp': 16, 'change': 2.2}]
Or
id ltp change
0 1 2 NaN
1 2 5 1.500000
2 3 3 -0.400000
3 4 0 2.000000
4 5 5 -0.444444
5 6 16 2.200000
I would like to count the number of positive, negative and zero values for columns ltp and change; the result may look like this:
columns positive negative zero
0 ltp 5 0 1
1 change 3 2 0
How could I do that with Pandas or NumPy? Thanks.
Updated: what if I need to group by type and count following the logic above?
id ltp change type
0 1 2 NaN a
1 2 5 1.500000 a
2 3 3 -0.400000 a
3 4 0 2.000000 b
4 5 5 -0.444444 b
5 6 16 2.200000 b
The expected output:
type columns positive negative zero
0 a ltp 3 0 0
1 a change 1 1 0
2 b ltp 2 0 1
3 b change 2 1 0
Use np.sign on the selected columns first, then count the values with value_counts, transpose, replace missing values, convert to int, and finally rename the columns with the dictionary and convert the index to a column called columns:
d = {-1:'negative', 1:'positive', 0:'zero'}
df = (np.sign(df[['ltp','change']])
        .apply(pd.value_counts)
        .T
        .fillna(0)
        .astype(int)
        .rename(columns=d)
        .rename_axis('columns')
        .reset_index())
print (df)
columns negative zero positive
0 ltp 0 1 5
1 change 2 0 3
EDIT: Another solution for the type column uses DataFrame.melt: map the signed values with np.sign, then count them with crosstab:
d = {-1:'negative', 1:'positive', 0:'zero'}
df1 = df.melt(id_vars='type', value_vars=['ltp','change'], var_name='columns')
df1['value'] = np.sign(df1['value']).map(d)
df1 = (pd.crosstab([df1['type'], df1['columns']], df1['value'])
         .rename_axis(columns=None)
         .reset_index())
print (df1)
type columns negative positive zero
0 a change 1 1 0
1 a ltp 0 3 0
2 b change 1 2 0
3 b ltp 0 2 1
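Going back to the first table (without type), another sketch is to count the three sign buckets directly with boolean comparisons; NaN compares False everywhere, so it is not counted in any bucket:
import pandas as pd
import numpy as np

df = pd.DataFrame({'ltp': [2, 5, 3, 0, 5, 16],
                   'change': [np.nan, 1.5, -0.4, 2.0, -0.444444, 2.2]})

cols = ['ltp', 'change']
out = pd.DataFrame({'positive': df[cols].gt(0).sum(),
                    'negative': df[cols].lt(0).sum(),
                    'zero': df[cols].eq(0).sum()}).rename_axis('columns').reset_index()
print(out)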

creating new column values depending on other column values in a dataframe

I have a data frame and a snippet of it is given below.
data = {'ID': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
        'Date': ['03/25/2021', '03/25/2021', '03/27/2021', '03/29/2021', '03/10/2021', '03/11/2021', '03/15/2021', '03/16/2021', '03/21/2021', '03/25/2021']}
df = pd.DataFrame(data)
I am looking for a final result which should look like this.
The explanation: for each ID, the study dates run from that ID's first date to its last date, with the missing dates in between filled in. The missing_date column is 1 if the date was missing from the original data frame and 0 otherwise, and the study day column numbers the days from start to end in increasing order.
I tried some stuff but I have been stuck on this for a while now. Any help is greatly appreciated.
Thanks.
Try:
def fn(x):
    dr = pd.date_range(x["Date"].min(), x["Date"].max())
    out = pd.DataFrame({"Date": dr}, index=range(1, len(dr) + 1))
    out["Missing_Date"] = (~out["Date"].isin(x["Date"])).astype(int)
    return out

# if the "Date" column is not converted yet:
df["Date"] = pd.to_datetime(df["Date"])

x = (
    df.groupby("ID")
    .apply(fn)
    .reset_index()
    .rename(columns={"level_1": "StudyDay"})
)
print(x)
print(x)
Prints:
ID StudyDay Date Missing_Date
0 A 1 2021-03-25 0
1 A 2 2021-03-26 1
2 A 3 2021-03-27 0
3 A 4 2021-03-28 1
4 A 5 2021-03-29 0
5 B 1 2021-03-10 0
6 B 2 2021-03-11 0
7 B 3 2021-03-12 1
8 B 4 2021-03-13 1
9 B 5 2021-03-14 1
10 B 6 2021-03-15 0
11 B 7 2021-03-16 0
12 C 1 2021-03-21 0
13 C 2 2021-03-22 1
14 C 3 2021-03-23 1
15 C 4 2021-03-24 1
16 C 5 2021-03-25 0
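As an alternative sketch, the same fill can be done with groupby plus resample, assuming Date is already datetime (drop_duplicates guards against repeated ID/Date pairs such as A on 03/25/2021):
df["Date"] = pd.to_datetime(df["Date"])
df["Missing_Date"] = 0
out = (df.drop_duplicates(["ID", "Date"])
         .set_index("Date")
         .groupby("ID")["Missing_Date"]
         .resample("D")
         .max()        # NaN for the dates resample fills in
         .fillna(1)
         .astype(int)
         .reset_index())
out["StudyDay"] = out.groupby("ID").cumcount() + 1
print(out)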

Faster way to make dataframe from dictionary in Python

I'm trying to make a pd.DataFrame from a dictionary like this:
x = {6: 8416, 2: 8361, 5: 8343, 4: 8326, 1: 8292, 3: 8262}
I want both numbers as rows in separate columns, with column names 'Y' and 'Z'.
I've done it manually, but I'm looking for a faster way for datasets where doing it by hand is no longer possible.
You can do it this way:
x = {6: 8416, 2: 8361, 5: 8343, 4: 8326, 1: 8292, 3: 8262}
pd.DataFrame([(k,v) for k,v in x.items()], columns=["Y","Z"])
Output:
Y Z
0 6 8416
1 2 8361
2 5 8343
3 4 8326
4 1 8292
5 3 8262
Here is one way:
pd.DataFrame.from_dict(x, 'index', columns=['y']).rename_axis('x').reset_index()
Output:
x y
0 6 8416
1 2 8361
2 5 8343
3 4 8326
4 1 8292
5 3 8262
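A third option, as a sketch, is to go through a Series, which uses the dictionary keys as the index and skips the explicit loop over items:
import pandas as pd

x = {6: 8416, 2: 8361, 5: 8343, 4: 8326, 1: 8292, 3: 8262}
out = pd.Series(x).rename_axis('Y').reset_index(name='Z')
print(out)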

Replace data type in DataFrame

I wonder how to replace values by type in a data frame. In this sample I want to replace all strings with 0 or NaN. Here is my simple df, and I have tried:
df.replace(str, 0, inplace=True)
or
df.replace({str: 0}, inplace=True)
but the solutions above do not work.
0 1 2
0 NaN 1 'b'
1 2 3 'c'
2 4 'd' 5
3 10 20 30
Check this code: it will visit every cell in the data frame, and if the cell is NaN or a string it will replace it with 0.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [0, 1, 2, 3, np.nan],
                   'B': [np.nan, 6, 7, 8, 9],
                   'C': ['a', 10, 500, 'd', 'e']})
print("before >>> \n", df)

def replace_nan_and_strings(cell_value):
    # replace NaN and string cells with 0, keep everything else
    if pd.isnull(cell_value) or isinstance(cell_value, str):
        return 0
    else:
        return cell_value

new_df = df.applymap(replace_nan_and_strings)
print("after >>> \n", new_df)
Try this:
df = df.replace('[a-zA-Z]', 0, regex=True)
This is how I tested it:
'''
0 1 2
0 NaN 1 'b'
1 2 3 'c'
2 4 'd' 5
3 10 20 30
'''
import pandas as pd
df = pd.read_clipboard()
df = df.replace('[a-zA-Z]', 0, regex=True)
print(df)
Output:
0 1 2
0 NaN 1 0
1 2.0 3 0
2 4.0 0 5
3 10.0 20 30
New scenario as requested in the comments below:
Input:
'''
0 '1' 2
0 NaN 1 'b'
1 2 3 'c'
2 '4' 'd' 5
3 10 20 30
'''
Output:
0 '1' 2
0 NaN 1 0
1 2 3 0
2 '4' 0 5
3 10 20 30
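If the goal is really "any non-numeric cell becomes NaN (or 0)", a minimal sketch with pd.to_numeric avoids per-cell Python calls; note that fillna(0) will also fill the NaNs that were already in the frame:
import pandas as pd
import numpy as np

df = pd.DataFrame({0: [np.nan, 2, 4, 10],
                   1: [1, 3, 'd', 20],
                   2: ['b', 'c', 5, 30]})

# coerce every non-numeric cell to NaN
out = df.apply(pd.to_numeric, errors='coerce')
print(out.fillna(0))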

Get value from another dataframe column based on condition

I have a dataframe like below:
>>> df1
a b
0 [1, 2, 3] 10
1 [4, 5, 6] 20
2 [7, 8] 30
and another like:
>>> df2
a
0 1
1 2
2 3
3 4
4 5
I need to create column 'c' in df2 from column 'b' of df1 when the column 'a' value of df2 is in column 'a' of df1. In df1, each value of column 'a' is a list.
I have tried to implement the approach from the following URL, but have gotten nothing so far:
https://medium.com/@Imaadmkhan1/using-pandas-to-create-a-conditional-column-by-selecting-multiple-columns-in-two-different-b50886fabb7d
The expected result is
>>> df2
a c
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20
Use Series.map after flattening the values from df1 into a dictionary:
d = {c: b for a, b in zip(df1['a'], df1['b']) for c in a}
print (d)
{1: 10, 2: 10, 3: 10, 4: 20, 5: 20, 6: 20, 7: 30, 8: 30}
df2['new'] = df2['a'].map(d)
print (df2)
a new
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20
EDIT: I think the problem is mixed scalars and lists in column a; the solution is to use if/else to test for it when building the dictionary:
d = {}
for a, b in zip(df1['a'], df1['b']):
    if isinstance(a, list):
        for c in a:
            d[c] = b
    else:
        d[a] = b

df2['new'] = df2['a'].map(d)
Use:
m=pd.DataFrame({'a':np.concatenate(df.a.values),'b':df.b.repeat(df.a.str.len())})
df2.merge(m,on='a')
a b
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20
First we unnest the lists in df1 into rows, then we merge on column a:
df1 = df1.set_index('b').a.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'a'})
print(df1, '\n')
df_final = df2.merge(df1, on='a')
print(df_final)
b a
0 10 1.0
1 10 2.0
2 10 3.0
0 20 4.0
1 20 5.0
2 20 6.0
0 30 7.0
1 30 8.0
a b
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20
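On pandas 0.25+, a compact sketch of the same unnest-and-look-up idea is to explode the list column into a lookup Series and map it:
import pandas as pd

df1 = pd.DataFrame({'a': [[1, 2, 3], [4, 5, 6], [7, 8]], 'b': [10, 20, 30]})
df2 = pd.DataFrame({'a': [1, 2, 3, 4, 5]})

# one list element per row, then index b by the element value
lookup = df1.explode('a').set_index('a')['b']
df2['c'] = df2['a'].map(lookup)
print(df2)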
