I have an unformatted Excel sheet as an input file. I need to rearrange the data and write it to another Excel file, calculating the working hours of each employee across different projects and clients.
Here Ram works on the 1st project alone, but Mohan works on both the 1st and 2nd projects, so we have to calculate his working hours for each of them.
>>> df # input dataframe
Employee Name Project Client hours
0 Ram NaN NaN NaN
1 NaN 1st Project NaN NaN
2 NaN NaN ABC 5.0
3 NaN NaN NaN 5.0
4 Mohan NaN NaN NaN
5 NaN 1st Project NaN NaN
6 NaN NaN DEF 10.0
7 NaN NaN DEF 10.0
8 NaN 2nd Project NaN NaN
9 NaN NaN GEH 1.0
10 NaN NaN NaN 1.0
11 NaN NaN NaN 11.0
For each column except the last one, replace NaN with the previous value and move the column into the index, then drop all rows that are empty. Finally, drop the initial RangeIndex.
for col in df.columns[:-1]:
    # fill NaN with the previous value, then move the column into the index
    df[col] = df[col].ffill()
    df = df.set_index(col, append=True)
df = df.dropna(how="all")   # drop rows that are now entirely empty
df = df.droplevel(0)        # drop the initial RangeIndex
>>> df # output dataframe
hours
Employee Name Project Client
Ram 1st Project ABC 5.0
ABC 5.0
Mohan 1st Project DEF 10.0
DEF 10.0
2nd Project GEH 1.0
GEH 1.0
GEH 11.0
Edit: to write the output to a correctly formatted Excel file:
df.set_index("hours", append=True).to_excel("output.xlsx")
I have an Excel file with multiple sheets in the format below. I need to create a single dataframe by concatenating all the sheets, unmerging the cells, and then transposing them into a column based on the sheet.
Sheet 1:
Sheet 2:
Result expected: the final dataframe should look like the format below, with an extra column as shown.
Code so far:
Reading the file:
df = pd.concat(pd.read_excel('/Users/john/Desktop/Internal/Raw Files/Med/Dig/File_2017_2022.xlsx', sheet_name=None, skiprows=1))
Creating the column:
df_1 = pd.concat([df.assign(name=n) for n,df in dfs.items()])
Use read_excel with header=[0,1] to create a MultiIndex from the first two header rows and index_col=[0,1] to create a MultiIndex from the first two columns. Then each sheet can be reshaped in a loop with DataFrame.stack, the new column added, everything joined with concat, and finally the index names set with DataFrame.rename_axis and converted to columns with DataFrame.reset_index:
dfs = pd.read_excel('Input_V1.xlsx', sheet_name=None, header=[0,1], index_col=[0,1])

df_1 = (pd.concat([df.stack(0).assign(name=n) for n, df in dfs.items()])
          .rename_axis(index=['Date','WK','Brand'], columns=None)
          .reset_index())

df_1.insert(len(df_1.columns) - 2, 'Campaign', df_1.pop('Campaign'))
print(df_1)
Date WK Brand A B C D E F G \
0 2017-10-02 Week 40 ABC NaN NaN NaN NaN 56892.800000 83431.664000 NaN
1 2017-10-09 Week 41 ABC NaN NaN NaN NaN 0.713716 0.474025 NaN
2 2017-10-16 Week 42 ABC NaN NaN NaN NaN 0.025936 0.072500 NaN
3 2017-10-23 Week 43 ABC NaN NaN NaN NaN 0.182677 0.926731 NaN
4 2017-10-30 Week 44 ABC NaN NaN NaN NaN 0.755607 0.686115 NaN
.. ... ... ... .. .. .. .. ... ... ..
99 2018-03-26 Week 13 PQR NaN NaN NaN NaN 47702.000000 12246.000000 NaN
100 2018-04-02 Week 14 PQR NaN NaN NaN NaN 38768.000000 46498.000000 NaN
101 2018-04-09 Week 15 PQR NaN NaN NaN NaN 35917.000000 45329.000000 NaN
102 2018-04-16 Week 16 PQR NaN NaN NaN NaN 39639.000000 51343.000000 NaN
103 2018-04-23 Week 17 PQR NaN NaN NaN NaN 50867.000000 30119.000000 NaN
H I J K L Campaign name
0 NaN NaN NaN 0.017888 0.697324 NaN ABC
1 NaN NaN NaN 0.457963 0.810985 NaN ABC
2 NaN NaN NaN 0.743030 0.253668 NaN ABC
3 NaN NaN NaN 0.038683 0.050028 NaN ABC
4 NaN NaN NaN 0.885567 0.712333 NaN ABC
.. .. .. .. ... ... ... ...
99 NaN NaN NaN 9433.000000 17108.000000 WX PQR
100 NaN NaN NaN 12529.000000 23557.000000 WX PQR
101 NaN NaN NaN 20395.000000 44228.000000 WX PQR
102 NaN NaN NaN 55077.000000 45149.000000 WX PQR
103 NaN NaN NaN 45815.000000 35761.000000 WX PQR
[104 rows x 17 columns]
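The pop/insert line above moves an existing column to a new position; on a small made-up frame (not the real data) it behaves like this:
import pandas as pd

toy = pd.DataFrame({'Campaign': ['x', 'y'], 'A': [1, 2], 'B': [3, 4], 'name': ['ABC', 'ABC']})

# pop removes the column and returns it; insert puts it back at the chosen position
toy.insert(len(toy.columns) - 2, 'Campaign', toy.pop('Campaign'))

print(toy.columns.tolist())   # ['A', 'B', 'Campaign', 'name']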
I created my own version of your Excel file, which looks like this:
The code below is far from perfect, but it should do fine as long as you do not have millions of sheets.
# First, obtain all sheet names
full_df = pd.read_excel(r'C:\Users\.\Downloads\test.xlsx',
                        sheet_name=None, skiprows=0)
# Store them into a list
sheet_names = list(full_df.keys())
# Create an empty Dataframe to store the contents from each sheet
final_df = pd.DataFrame()

for sheet in sheet_names:
    df = pd.read_excel(r'C:\Users\.\Downloads\test.xlsx', sheet_name=sheet, skiprows=0)
    # Get the brand name
    brand = df.columns[1]
    # Remove the header columns and keep the numerical values only
    df.columns = df.iloc[0]
    df = df[1:]
    df = df.iloc[:, 1:]
    # Set the brand name into a new column
    df['Brand'] = brand
    # Append into the final dataframe
    final_df = pd.concat([final_df, df])
Your final_df should look like this once exported back to Excel.
EDIT: You might need to drop the dataframe's index when saving it, using df.reset_index(drop=True), to remove the first column shown in the image right above.
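A minimal sketch of that export step (the output file name and the index=False flag are my additions, not from the original answer):
final_df = final_df.reset_index(drop=True)         # renumber the concatenated rows 0..n-1
final_df.to_excel('combined.xlsx', index=False)     # don't write the index as a first column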
I am trying to understand why I am getting NaN for all rows when I extract the non-NA values in a specific column. This happens only when I read the Excel file; it works fine with the CSV.
df = pd.read_excel('q.xlsx', sheet_name=None)
cols = ['Name', 'Age', 'City']
for k, v in df.items():
    if k == "Sheet1":
        mod_cols = v.columns.to_list()
        # Filter on the column that is extra, apart from the ones defined in cols.
        # I am doing this because the excel file has multiple sheets, and when I iterate
        # over the entire file I want to filter on that additional column in each sheet.
        # For this example, I will focus on the first sheet.
        diff = set(mod_cols) - set(cols)
        # diff is State in this case
        d = v[~v[diff].isna()]
d
Name Age City State
0 NaN NaN NaN NaN
1 NaN NaN NaN NJ
2 NaN NaN NaN NaN
3 NaN NaN NaN NY
4 NaN NaN NaN NaN
5 NaN NaN NaN NC
6 NaN NaN NaN NaN
However, with the CSV it works perfectly:
df=pd.read_csv('q.csv')
d=df[~df['State'].isna()]
d
Name Age City State
1 Joe 31 Newark NJ
3 Mike 32 NYC NY
5 Moe 33 Durham NC
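This is not from the original post, but one thing worth noting: in the Excel branch the mask is built from v[diff], where diff is a set, so ~v[diff].isna() is a one-column DataFrame; masking a frame with a DataFrame aligns on columns and turns every cell outside the mask into NaN, which matches the output above. The CSV branch masks with a single Series instead. A small made-up sketch of the difference:
import pandas as pd

# toy data standing in for Sheet1 (not the real file)
v = pd.DataFrame({'Name': ['Ann', 'Joe'], 'Age': [30, 31],
                  'City': ['Troy', 'Newark'], 'State': [None, 'NJ']})
diff = {'State'}

# DataFrame mask: columns not in the mask are aligned to NaN everywhere
print(v[~v[list(diff)].isna()])

# Series mask: keeps the whole rows where State is present
print(v[v[list(diff)[0]].notna()])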
I am trying to clean NaN values in an open-source dataset.
I am using Python 3, Jupyter, and pandas.
import shutil
import urllib.request
import zipfile

import pandas as pd

url = 'https://resources.lendingclub.com/LoanStats3c.csv.zip'
file_name = 'LoanStats3c.csv.zip'

# download and extract the zipped csv
with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
with zipfile.ZipFile(file_name) as zf:
    zf.extractall()

loan = pd.read_csv('LoanStats3c.csv', skiprows=1, parse_dates=True, index_col='id')
loan.describe()
# remove all columns with all NAs
loan = loan.dropna(axis=1, how = 'all')
loan.describe()
# remove all rows with any NAs
loan = loan.dropna(axis = 0)
loan.describe()
But the result shows all columns with only NAs:
loan_amnt funded_amnt funded_amnt_inv installment annual_inc dti \
count 0.0 0.0 0.0 0.0 0.0 0.0
mean NaN NaN NaN NaN NaN NaN
std NaN NaN NaN NaN NaN NaN
min NaN NaN NaN NaN NaN NaN
25% NaN NaN NaN NaN NaN NaN
50% NaN NaN NaN NaN NaN NaN
75% NaN NaN NaN NaN NaN NaN
max NaN NaN NaN NaN NaN NaN
Why are all rows with valid values gone, leaving only the NA columns? Thanks.
When you use .dropna() like that, all occurrences of NaN values are deleted from the dataframe.
loan.dropna(axis=1, how = 'all')
This will delete the columns where all the values are NaN.
loan.dropna(axis = 0)
This will delete the rows with at least one NaN value.
I looked at the file and I'm pretty sure that every row has at least one column with NaN.
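To see the two calls side by side on a tiny made-up frame (toy data, not the loan file):
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                    'b': [np.nan, np.nan, np.nan],   # a column that is entirely NaN
                    'c': [4.0, np.nan, 6.0]})

print(toy.dropna(axis=1, how='all'))
# drops only column 'b', the only column where *all* values are NaN

print(toy.dropna(axis=1, how='all').dropna(axis=0))
# after removing 'b', row 1 still has a NaN in 'c', so that whole row is dropped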
Finally, when you use .describe(), the dataframe is already empty, so the values shown are descriptive statistics of that empty frame. If you want to see the real DataFrame, use print(df), or in Jupyter just leave the variable at the end of the cell:
# some code
# some code
# some code
variable = pd.DataFrame([])
# print(variable)
variable
That would show you the value of the variable
I am trying to remove the rows in the data frame with more than 7 null values. Please suggest something that is efficient to achieve this.
If I understand correctly, you need to remove a row only if the total number of NaNs in that row is more than 7:
df = df[df.isnull().sum(axis=1) <= 7]
This keeps only the rows that have 7 or fewer NaNs and removes every row with more than 7.
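A quick check on a made-up frame shows how the per-row NaN count drives the filter:
import numpy as np
import pandas as pd

toy = pd.DataFrame([[np.nan] * 8,                 # 8 NaNs -> removed
                    [np.nan] * 7 + [5.0],         # 7 NaNs -> kept
                    [1.0, 2.0] + [np.nan] * 6])   # 6 NaNs -> kept

print(toy.isnull().sum(axis=1))            # 8, 7, 6
print(toy[toy.isnull().sum(axis=1) <= 7])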
dropna has a thresh argument. Subtract your desired number from the number of columns.
thresh : int, optional Require that many non-NA values.
df.dropna(thresh=df.shape[1]-7, axis=0)
Sample Data:
print(df)
0 1 2 3 4 5 6 7
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN 5.0
2 6.0 7.0 8.0 9.0 NaN NaN NaN NaN
3 NaN NaN 11.0 12.0 13.0 14.0 15.0 16.0
df.dropna(thresh=df.shape[1]-7, axis=0)
0 1 2 3 4 5 6 7
1 NaN NaN NaN NaN NaN NaN NaN 5.0
2 6.0 7.0 8.0 9.0 NaN NaN NaN NaN
3 NaN NaN 11.0 12.0 13.0 14.0 15.0 16.0
The last 2 real numbers in each row of my data were measured with error, and I want to replace them with np.nan. The number of real numbers differs by row (i.e., each row already has a differing number of NaNs). Column headers indicate the measurement number, and the index is an experimental trial. The value in a cell is a measurement reading. Some trials had more measurement readings than others; thus, some rows have more NaNs than others. The code below creates a data frame similar to mine.
import pandas as pd
import numpy as np

data = np.array([[1, 2, 3, 4, 5, 2, np.nan],
                 [2, 2, 3, 2, 3, np.nan, np.nan],
                 [4, 4, 5, 1, np.nan, np.nan, np.nan]])
df1 = pd.DataFrame(data, columns=['0', '1', '2', '3', '4', '5', '6'])
The code yields a data frame that looks similar to mine:
0 1 2 3 4 5 6
0 1.0 2.0 3.0 4.0 5.0 2.0 NAN
1 2.0 2.0 3.0 2.0 3.0 NAN NAN
2 4.0 4.0 5.0 1.0 NAN NAN NAN
This is what I want the new data frame to look like:
0 1 2 3 4 5 6
0 1.0 2.0 3.0 4.0 NAN NAN NAN
1 2.0 2.0 3.0 NAN NAN NAN NAN
2 4.0 4.0 NAN NAN NAN NAN NAN
I have tried counting the NaNs and using that to locate the positions of the last and second-to-last numeric values, but it gets me nowhere.
Ultimately, what I want to do is ignore the NaNs in the original data frame, take the last two real values (i.e., the numbers) in each row, and replace them with np.nan. One of the main issues is that the position of the last 2 real numbers can differ by row. The result should make the original data frame look like the new data frame in the examples above.
Method #1 would be simply to shift everything over by 2 and keep the values which remain non-null:
In [61]: df.where(df.shift(-2, axis=1).notnull())
Out[61]:
0 1 2 3 4 5 6
0 1.0 2.0 3.0 4.0 NaN NaN NaN
1 2.0 2.0 3.0 NaN NaN NaN NaN
2 4.0 4.0 NaN NaN NaN NaN NaN
Method #2 would be to count the number of non-null values from the right, and only keep non-null values after the second:
In [62]: df.where((df.notnull().iloc[:, ::-1].cumsum(axis=1) > 2))
Out[62]:
0 1 2 3 4 5 6
0 1.0 2.0 3.0 4.0 NaN NaN NaN
1 2.0 2.0 3.0 NaN NaN NaN NaN
2 4.0 4.0 NaN NaN NaN NaN NaN
This isn't as pretty, but would allow for finer levels of customization if we needed to shift differently for each row, for example if it weren't true that we had a row of non-null values followed by null values.
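As an illustration of that finer customization (the per-row counts below are made up, not part of the original question), the same right-to-left count can be compared against a different cutoff for each row, reusing the df from above:
import pandas as pd

# hypothetical number of trailing real values to blank out in each row
k = pd.Series([2, 1, 3], index=df.index)

# for every cell, count the non-null values from that cell to the end of its row
from_right = df.notnull().iloc[:, ::-1].cumsum(axis=1).iloc[:, ::-1]

# keep a cell only if more than k[row] non-null values remain from it onwards
df.where(from_right.gt(k, axis=0))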