I'm working with thousands of pd.Series, each with a MultiIndex that has two static levels, a dynamic one, and then timestamps:
import numpy as np
import pandas as pd

start = np.concatenate((np.random.rand(3), [np.nan]*3))
end = np.concatenate(([np.nan]*3, np.random.rand(3)))
# "codes" is the current name of the argument that used to be called "labels"
index1 = pd.MultiIndex(levels = [["X"], ["Y"], ["A"], ["d1","d2","d3","d4","d5","d6"]],
                       codes = [[0,0,0,0,0,0], [0,0,0,0,0,0], [0,0,0,0,0,0], [0,1,2,3,4,5]],
                       names = ["static1", "static2", "dynamo", "timestamps"])
i1_start = pd.Series(start, index=index1, name="col1")
i1_end = pd.Series(end, index=index1, name="col2")
# same index, except that the dynamic level is "B" instead of "A"
index2 = pd.MultiIndex(levels = [["X"], ["Y"], ["B"], ["d1","d2","d3","d4","d5","d6"]],
                       codes = [[0,0,0,0,0,0], [0,0,0,0,0,0], [0,0,0,0,0,0], [0,1,2,3,4,5]],
                       names = ["static1", "static2", "dynamo", "timestamps"])
i2_start = pd.Series(start, index=index2, name="col1")
i2_end = pd.Series(end, index=index2, name="col2")
data = [i1_start, i1_end, i2_start, i2_end]
df = pd.DataFrame(data).T
df
Here are the results of turning it into a dataframe:
                                       col1      col2      col1      col2
static1 static2 dynamo timestamps
X       Y       A      d1          0.248504       NaN       NaN       NaN
                       d2          0.424774       NaN       NaN       NaN
                       d3          0.333638       NaN       NaN       NaN
                       d4               NaN  0.987744       NaN       NaN
                       d5               NaN  0.093231       NaN       NaN
                       d6               NaN  0.918666       NaN       NaN
                B      d1               NaN       NaN  0.248504       NaN
                       d2               NaN       NaN  0.424774       NaN
                       d3               NaN       NaN  0.333638       NaN
                       d4               NaN       NaN       NaN  0.987744
                       d5               NaN       NaN       NaN  0.093231
                       d6               NaN       NaN       NaN  0.918666
I'm looking for advice on how to group the Series that share the same Series.name and concat/merge/join them so that the columns line up, instead of leaving an entire triangle of null values.
I think you need concat along axis=1, followed by sum or max with axis=1 and level=0, so that columns with the same name are collapsed into one:
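data = [i1_start, i1_end, i2_start, i2_end]
# note: the level= argument of sum/max was removed in pandas 2.0
df = pd.concat(data, axis=1).sum(axis=1, level=0)
# same as (groupby with axis=1 is deprecated since pandas 2.1)
# df = pd.concat(data, axis=1).groupby(axis=1, level=0).sum()
# alternative
df = pd.concat(data, axis=1).max(axis=1, level=0)
print(df)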
                                       col1      col2
static1 static2 dynamo timestamps
X       Y       A      d1          0.771148       NaN
                       d2          0.074757       NaN
                       d3          0.526310       NaN
                       d4               NaN  0.975088
                       d5               NaN  0.992226
                       d6               NaN  0.465135
                B      d1          0.771148       NaN
                       d2          0.074757       NaN
                       d3          0.526310       NaN
                       d4               NaN  0.975088
                       d5               NaN  0.992226
                       d6               NaN  0.465135
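On pandas 2.0 and later the level= argument of sum and max is no longer available. A minimal sketch of the same idea for newer versions, assuming it is acceptable to transpose, group the rows by name, and transpose back, could look like this. The helper make_series and the fixed sample values are made up for illustration and are not part of the original question.

```python
import numpy as np
import pandas as pd


def make_series(dynamo, values, name):
    """Hypothetical helper: build one Series with the four-level index."""
    index = pd.MultiIndex.from_product(
        [["X"], ["Y"], [dynamo], ["d1", "d2", "d3", "d4", "d5", "d6"]],
        names=["static1", "static2", "dynamo", "timestamps"],
    )
    return pd.Series(values, index=index, name=name)


# fixed sample values instead of random ones, so the result is reproducible
start = [0.1, 0.2, 0.3, np.nan, np.nan, np.nan]
end = [np.nan, np.nan, np.nan, 0.4, 0.5, 0.6]
data = [
    make_series("A", start, "col1"),
    make_series("A", end, "col2"),
    make_series("B", start, "col1"),
    make_series("B", end, "col2"),
]

# four columns named col1, col2, col1, col2
wide = pd.concat(data, axis=1)

# transpose so the duplicated names become row labels, group by them,
# take the max of each group (NaN is skipped), then transpose back
combined = wide.T.groupby(level=0).max().T
print(combined)
```

max keeps NaN where a group has no value at all. With sum, passing min_count=1 would be needed to get the same behaviour, because an all-NaN group otherwise sums to 0.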
How about this?
df.fillna(0).sum(axis=1)
That is, replace NaN with zero and sum all the columns for each row. Note that this returns a single Series rather than separate col1 and col2 columns.
I have an Excel file with multiple sheets in the format below. I need to create a single dataframe by concatenating all the sheets, unmerging the merged cells, and then transposing them into a column based on the sheet.
Sheet 1:
Sheet 2:
Final Dataframe should look like below
Expected result: I need the format below, with an extra column as shown
Code So far:
Reading File:
df = pd.concat(pd.read_excel('/Users/john/Desktop/Internal/Raw Files/Med/Dig/File_2017_2022.xlsx', sheet_name=None, skiprows=1))
Creating Column :
df_1 = pd.concat([df.assign(name=n) for n,df in dfs.items()])
Use read_excel with header=[0,1] to build a column MultiIndex from the first 2 header rows and index_col=[0,1] to build a row MultiIndex from the first 2 columns. Then, in a loop (here a list comprehension), reshape each sheet with DataFrame.stack, add the sheet name as a new column, and join everything with concat. Finally, set the index names with DataFrame.rename_axis and convert the index to columns with DataFrame.reset_index:
dfs = pd.read_excel('Input_V1.xlsx',sheet_name=None, header=[0,1], index_col=[0,1])
df_1 = (pd.concat([df.stack(0).assign(name=n) for n,df in dfs.items()])
.rename_axis(index=['Date','WK','Brand'], columns=None)
.reset_index())
df_1.insert(len(df_1.columns) - 2, 'Campaign', df_1.pop('Campaign'))
print (df_1)
Date WK Brand A B C D E F G \
0 2017-10-02 Week 40 ABC NaN NaN NaN NaN 56892.800000 83431.664000 NaN
1 2017-10-09 Week 41 ABC NaN NaN NaN NaN 0.713716 0.474025 NaN
2 2017-10-16 Week 42 ABC NaN NaN NaN NaN 0.025936 0.072500 NaN
3 2017-10-23 Week 43 ABC NaN NaN NaN NaN 0.182677 0.926731 NaN
4 2017-10-30 Week 44 ABC NaN NaN NaN NaN 0.755607 0.686115 NaN
.. ... ... ... .. .. .. .. ... ... ..
99 2018-03-26 Week 13 PQR NaN NaN NaN NaN 47702.000000 12246.000000 NaN
100 2018-04-02 Week 14 PQR NaN NaN NaN NaN 38768.000000 46498.000000 NaN
101 2018-04-09 Week 15 PQR NaN NaN NaN NaN 35917.000000 45329.000000 NaN
102 2018-04-16 Week 16 PQR NaN NaN NaN NaN 39639.000000 51343.000000 NaN
103 2018-04-23 Week 17 PQR NaN NaN NaN NaN 50867.000000 30119.000000 NaN
H I J K L Campaign name
0 NaN NaN NaN 0.017888 0.697324 NaN ABC
1 NaN NaN NaN 0.457963 0.810985 NaN ABC
2 NaN NaN NaN 0.743030 0.253668 NaN ABC
3 NaN NaN NaN 0.038683 0.050028 NaN ABC
4 NaN NaN NaN 0.885567 0.712333 NaN ABC
.. .. .. .. ... ... ... ...
99 NaN NaN NaN 9433.000000 17108.000000 WX PQR
100 NaN NaN NaN 12529.000000 23557.000000 WX PQR
101 NaN NaN NaN 20395.000000 44228.000000 WX PQR
102 NaN NaN NaN 55077.000000 45149.000000 WX PQR
103 NaN NaN NaN 45815.000000 35761.000000 WX PQR
[104 rows x 17 columns]
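To see what the stack(0) step does without an Excel file, here is a small sketch on an in-memory frame. The sheet contents, the brand name and the column letters are made up for illustration; only the chain of calls mirrors the answer above.

```python
import pandas as pd

# hypothetical stand-in for one sheet: two header rows (brand, metric)
# and two index columns (date, week), as read_excel would return them
# with header=[0, 1] and index_col=[0, 1]
columns = pd.MultiIndex.from_product([["ABC"], ["E", "F"]])
index = pd.MultiIndex.from_tuples(
    [("2017-10-02", "Week 40"), ("2017-10-09", "Week 41")]
)
sheet = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=index, columns=columns)

# read_excel(..., sheet_name=None) returns a dict of sheet name -> DataFrame
dfs = {"ABC": sheet}

long_df = (
    pd.concat([df.stack(0).assign(name=n) for n, df in dfs.items()])
    .rename_axis(index=["Date", "WK", "Brand"], columns=None)
    .reset_index()
)
print(long_df)
```

stack(0) moves the first header level (the brand) from the columns into the index, which is why the index ends up with three levels that can be named Date, WK and Brand.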
I created my own version of your Excel file, which looks like this
The code below is far from perfect but it should do fine as long as you do not have millions of sheets
import pandas as pd

# First, obtain all sheet names
full_df = pd.read_excel(r'C:\Users\.\Downloads\test.xlsx',
                        sheet_name=None, skiprows=0)
# Store them into a list
sheet_names = list(full_df.keys())
# Create an empty Dataframe to store the contents from each sheet
final_df = pd.DataFrame()
for sheet in sheet_names:
    df = pd.read_excel(r'C:\Users\.\Downloads\test.xlsx', sheet_name=sheet, skiprows=0)
    # Get the brand name
    brand = df.columns[1]
    # Remove the header columns and keep the numerical values only
    df.columns = df.iloc[0]
    df = df[1:]
    # copy() so that adding a column below does not touch a slice of the original
    df = df.iloc[:, 1:].copy()
    # Set the brand name into a new column
    df['Brand'] = brand
    # Append into the final dataframe
    final_df = pd.concat([final_df, df])
Your final_df should look like this once exported back to excel
EDIT: The first column shown in the image right above is the dataframe's index. You can renumber it with final_df.reset_index(drop=True), or leave it out of the exported file by passing index=False to to_excel.
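As a variation on the loop above, the per-sheet frames can be collected in a list and concatenated once at the end, which also gives a clean 0..n-1 index. This is only a sketch: the two small frames stand in for the cleaned sheets, and their contents and brand names are invented.

```python
import pandas as pd

# hypothetical stand-ins for the cleaned per-sheet frames
sheets = {
    "ABC": pd.DataFrame({"E": [1.0, 2.0], "F": [3.0, 4.0]}),
    "PQR": pd.DataFrame({"E": [5.0, 6.0], "F": [7.0, 8.0]}),
}

frames = []
for brand, sheet_df in sheets.items():
    # assign returns a new frame, so the source frame is left untouched
    frames.append(sheet_df.assign(Brand=brand))

# one concat at the end; ignore_index=True renumbers the rows 0..n-1
final_df = pd.concat(frames, ignore_index=True)
print(final_df)
```

Concatenating once avoids rebuilding final_df on every iteration, which matters more as the number of sheets grows.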
I want to add multiple empty rows at the start of my dataframe. I have tried using a list, but it doesn't give the result I want:
Example df:
Col1  col2  col3   col4
One   Two   Three  four
2     4     5      8
Desired df:
Col1  col2  col3   col4
One   Two   Three  four
2     4     5      8
The column names should also start from the nth row; that is, I want to add n empty rows at the beginning of my dataframe.
I'm not sure why you would want to do this but I did it by splitting up the original dataframe into a dataframe with a row of the column names and a separate dataframe of the data. I then created a dataframe of nans to be the blank rows and joined the 3 together. You will need to import numpy for this.
I created a variable no_cols to be the number of columns in the dataframe and no_empty_rows to be how many empty rows to simplify code:
no_cols = len(df.columns)
no_empty_rows = 6
Then I turned the columns into their own dataframe, with 1 row which is the column names, and headers as np.nan:
cols = pd.DataFrame([df.columns], columns = [np.nan]*no_cols)
NaN NaN NaN NaN
0 Col1 col2 col3 col4
Next I renamed the columns in the original dataframe to nan:
df.columns = [np.nan]*no_cols
NaN NaN NaN NaN
0 One Two Three four
1 2 4 5 8
Then I created a new dataframe of nans, with 6 blank rows (this can be changed):
df_empty_rows = (pd.DataFrame(data=[[np.nan]*no_cols]*no_empty_rows,
columns=[np.nan]*no_cols,
index=[np.nan]*no_empty_rows))
NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
You can then join all 3 together. First I put the column names and the data of df back together and reset their index, then concatenate that onto df_empty_rows (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
df_out = pd.concat([df_empty_rows, pd.concat([cols, df]).reset_index(drop=True)])
NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
0.0 Col1 col2 col3 col4
1.0 One Two Three four
2.0 2 4 5 8
Full code:
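import numpy as np
import pandas as pd

no_cols = len(df.columns)
no_empty_rows = 6
# one-row dataframe holding the column names as data
cols = pd.DataFrame([df.columns], columns=[np.nan]*no_cols)
df.columns = [np.nan]*no_cols
df_empty_rows = (pd.DataFrame(data=[[np.nan]*no_cols]*no_empty_rows,
                              columns=[np.nan]*no_cols,
                              index=[np.nan]*no_empty_rows))
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_out = pd.concat([df_empty_rows, pd.concat([cols, df]).reset_index(drop=True)])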
I have a list of values [0.1, 0.43, 0.58] and a dataframe df with several columns. I added three new columns filled with NaN to my dataframe, and I want to fill them, for one row, with the values from the list: one list value per new column, in that exact order.
The dataframe has 4 columns (no index shown), plus the 3 new columns.
Name A B C New1 New2 New3
Elem1 NaN NaN NaN NaN NaN NaN
Elem2 NaN NaN NaN NaN NaN NaN
Elem3 NaN NaN NaN NaN NaN NaN
Expected result:
Name A B C New1 New2 New3
Elem1 NaN NaN NaN NaN NaN NaN
Elem2 NaN NaN NaN 0.1 0.43 0.58
Elem3 NaN NaN NaN NaN NaN NaN
If l is your list, then:
df.loc[df.Name=='Elem2', 'New1':'New3'] = l
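A self-contained sketch of that assignment, rebuilding the frame from the question. The list is called values here instead of l, purely for readability.

```python
import numpy as np
import pandas as pd

# rebuild the frame from the question: one name column, six NaN columns
nan_cols = ["A", "B", "C", "New1", "New2", "New3"]
df = pd.DataFrame(
    {"Name": ["Elem1", "Elem2", "Elem3"], **{c: np.nan for c in nan_cols}}
)
values = [0.1, 0.43, 0.58]

# the boolean mask picks the row, the label slice 'New1':'New3' picks the
# three target columns (label slices include both end points)
df.loc[df["Name"] == "Elem2", "New1":"New3"] = values
print(df)
```

The label slice relies on New1, New2 and New3 being adjacent and in that order; otherwise an explicit list such as ["New1", "New2", "New3"] would be the safer choice.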
I am trying to clean NA values in an open-source dataset.
I am using python3, Jupyter and pandas.
import shutil
import urllib.request
import zipfile

import pandas as pd

url = 'https://resources.lendingclub.com/LoanStats3c.csv.zip'
file_name = 'LoanStats3c.csv.zip'
# download the archive to disk
with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
# unpack it into the current directory
with zipfile.ZipFile(file_name) as zf:
    zf.extractall()
loan = pd.read_csv('LoanStats3c.csv', skiprows=1, parse_dates=True, index_col='id')
loan.describe()
# remove all columns with all NAs
loan = loan.dropna(axis=1, how='all')
loan.describe()
# remove all rows with any NAs
loan = loan.dropna(axis=0)
loan.describe()
But after that, describe() shows NaN in every column:
loan_amnt funded_amnt funded_amnt_inv installment annual_inc dti \
count 0.0 0.0 0.0 0.0 0.0 0.0
mean NaN NaN NaN NaN NaN NaN
std NaN NaN NaN NaN NaN NaN
min NaN NaN NaN NaN NaN NaN
25% NaN NaN NaN NaN NaN NaN
50% NaN NaN NaN NaN NaN NaN
75% NaN NaN NaN NaN NaN NaN
max NaN NaN NaN NaN NaN NaN
Why are all the rows with valid values gone, with only the NA columns left?
thanks
When you use .dropna() like that, every row or column that matches the rule is deleted from the dataframe.
loan.dropna(axis=1, how = 'all')
deletes the columns in which all values are NaN.
loan.dropna(axis = 0)
deletes the rows that have at least one NaN value.
I looked at the file and I'm pretty sure that every row has at least one NaN column, so no rows are left.
Finally, when you call .describe() the dataframe is already empty, so the values shown are the descriptive statistics of an empty dataframe. If you want to see the real dataframe, use print(df), or in Jupyter just put the variable on the last line of the cell:
some code
some code
some code
variable = pd.DataFrame([])
#print(variable)
variable
That would show you the value of the variable
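The effect can be reproduced on a small made-up frame. The column names below only loosely imitate the loan data, and the subset and thresh calls are possible alternatives rather than something the original answer proposed.

```python
import numpy as np
import pandas as pd

# hypothetical miniature of the loan data: every row has at least one NaN
df = pd.DataFrame({
    "loan_amnt": [1000.0, 2000.0, 3000.0],
    "desc": [np.nan, "x", np.nan],
    "empty": [np.nan, np.nan, np.nan],
    "note": ["a", np.nan, np.nan],
})

# drops only the column in which every value is NaN ("empty")
no_empty_cols = df.dropna(axis=1, how="all")

# drops every row with at least one NaN: nothing is left here
strict = no_empty_cols.dropna(axis=0)

# alternative 1: only require the columns that matter to be present
by_subset = no_empty_cols.dropna(subset=["loan_amnt"])

# alternative 2: keep rows that have at least 2 non-NaN values
by_thresh = no_empty_cols.dropna(thresh=2)

print(len(strict), len(by_subset), len(by_thresh))
```

Which alternative fits depends on which columns the analysis actually needs.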
I have a pandas dataframe as below:
data = {'A' :[1,2,3],
'B':[2,17,17],
'C1' :["C1",np.nan,np.nan],
'C2' :[np.nan,"C2",np.nan]}
# Create DataFrame
df = pd.DataFrame(data)
df
A B C1 C2
0 1 2 C1 NaN
1 2 17 NaN C2
2 3 17 NaN NaN
I want to create a column "C" based on "C1" and "C2" (there could also be "C4", "C5", and so on). If any of the C columns has a value, "C" should take that value. My output in this case should look like below:
A B C1 C2 C
0 1 2 C1 NaN C1
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
Try this
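# raw string so that \d is not treated as a string escape
df1 = df.filter(regex=r'^C\d+')
# keep only values that match a column name, then take the first non-null per row
df['C'] = df1[df1.isin(df1.columns)].bfill(axis=1).iloc[:,0]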
Out[117]:
A B C1 C2 C
0 1 2 C1 NaN C1
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
If you want to strictly match each value against the name of its own column, use eq instead of isin as follows
df['C'] = df1[df1.eq(df1.columns, axis=1)].bfill(axis=1).iloc[:,0]
IIUC
df['C']=df.filter(like='C').bfill(axis=1).iloc[:,0]
df
A B C1 C2 C
0 1 2 C1 NaN C1
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
IIUC,
we can filter your columns with a regex that matches C followed by digits, then aggregate the values per row with an agg call:
df['C'] = df.filter(regex=r'C\d+').stack().groupby(level=0).agg(','.join)
print(df)
A B C1 C2 C
0 1 2 C1 NaN C1
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
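The bfill and join approaches above give the same result on the sample data, but they differ when a row has more than one value. A small sketch, where the fourth row is a hypothetical case added for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "C1": ["C1", np.nan, np.nan, "C1"],
    # the last row is a made-up case with values in both columns
    "C2": [np.nan, "C2", np.nan, "C2"],
})
c_cols = df.filter(regex=r"^C\d+$")

# first non-null value per row, scanning the columns left to right
first_value = c_cols.bfill(axis=1).iloc[:, 0]

# all non-null values per row, joined with a comma; the explicit dropna
# keeps the result the same whether or not stack() drops NaN by itself
joined = c_cols.stack().dropna().groupby(level=0).agg(",".join)

print(first_value)
print(joined)
```

Rows without any value are missing from joined, so assigning it to df['C'] leaves NaN there through index alignment.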