I am trying to understand why I am getting NaN for all rows when I extract non-NA values in a specific column. This happens only when I read in the excel file; it works fine with the csv.
df = pd.read_excel('q.xlsx', sheet_name=None)
cols = ['Name', 'Age', 'City']
for k, v in df.items():
    if k == "Sheet1":
        mod_cols = v.columns.to_list()
        # The below filters on the column that is extra, apart from the
        # ones defined in cols. I am doing this because I have multiple
        # sheets in the excel file and, when I iterate over the entire
        # file, I want to filter on that additional column in each of
        # those sheets. For this example, I will focus on the first sheet.
        diff = set(mod_cols) - set(cols)
        # diff is State in this case
        d = v[~v[diff].isna()]
d
Name Age City State
0 NaN NaN NaN NaN
1 NaN NaN NaN NJ
2 NaN NaN NaN NaN
3 NaN NaN NaN NY
4 NaN NaN NaN NaN
5 NaN NaN NaN NC
6 NaN NaN NaN NaN
However, with the csv it returns the expected rows:
df=pd.read_csv('q.csv')
d=df[~df['State'].isna()]
d
Name Age City State
1 Joe 31 Newark NJ
3 Mike 32 NYC NY
5 Moe 33 Durham NC
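A likely cause, sketched here with made-up data standing in for Sheet1 (not the asker's file): selecting with the extra columns returns a one-column DataFrame, and indexing with a boolean DataFrame NaN-masks every column outside it:

```python
import pandas as pd

# Hypothetical stand-in for Sheet1 of q.xlsx.
v = pd.DataFrame({
    "Name": ["Ann", "Joe", "Sue"],
    "Age": [30, 31, 32],
    "City": ["Boston", "Newark", "NYC"],
    "State": [None, "NJ", None],
})
cols = ["Name", "Age", "City"]
diff = set(v.columns) - set(cols)          # {'State'}

# v[list(diff)] selects a one-column DataFrame (a list is used because
# newer pandas rejects set indexers), so ~v[...].isna() is a boolean
# DataFrame; indexing v with it aligns on columns and NaN-masks every
# column not present in the mask -- hence the all-NaN rows.
bad = v[~v[list(diff)].isna()]

# Selecting a Series instead (or using dropna with subset=) behaves
# like the csv version.
good = v.dropna(subset=list(diff))
```

The csv path works because there `df['State']` is a Series, and a boolean Series mask filters rows instead of masking columns.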
Related
I have an excel file with multiple sheets in the below format. I need to create a single dataframe by concatenating all the sheets, unmerging the cells, and then transposing them into a column based on the sheet.
Sheet 1:
Sheet 2:
Final Dataframe should look like below
Result expected - I need the below format with an extra column, as below
Code So far:
Reading File:
dfs = pd.read_excel('/Users/john/Desktop/Internal/Raw Files/Med/Dig/File_2017_2022.xlsx',
                    sheet_name=None, skiprows=1)
Creating Column:
df_1 = pd.concat([df.assign(name=n) for n, df in dfs.items()])
Use read_excel with header=[0,1] for a MultiIndex built from the first two header rows, and index_col=[0,1] for a MultiIndex built from the first two columns. It is then possible, in a loop, to reshape each sheet with DataFrame.stack and add the new column; use concat to join them, and finally set the index names with DataFrame.rename_axis and convert them to columns with DataFrame.reset_index:
dfs = pd.read_excel('Input_V1.xlsx', sheet_name=None, header=[0, 1], index_col=[0, 1])

df_1 = (pd.concat([df.stack(0).assign(name=n) for n, df in dfs.items()])
          .rename_axis(index=['Date', 'WK', 'Brand'], columns=None)
          .reset_index())
df_1.insert(len(df_1.columns) - 2, 'Campaign', df_1.pop('Campaign'))
print(df_1)
Date WK Brand A B C D E F G \
0 2017-10-02 Week 40 ABC NaN NaN NaN NaN 56892.800000 83431.664000 NaN
1 2017-10-09 Week 41 ABC NaN NaN NaN NaN 0.713716 0.474025 NaN
2 2017-10-16 Week 42 ABC NaN NaN NaN NaN 0.025936 0.072500 NaN
3 2017-10-23 Week 43 ABC NaN NaN NaN NaN 0.182677 0.926731 NaN
4 2017-10-30 Week 44 ABC NaN NaN NaN NaN 0.755607 0.686115 NaN
.. ... ... ... .. .. .. .. ... ... ..
99 2018-03-26 Week 13 PQR NaN NaN NaN NaN 47702.000000 12246.000000 NaN
100 2018-04-02 Week 14 PQR NaN NaN NaN NaN 38768.000000 46498.000000 NaN
101 2018-04-09 Week 15 PQR NaN NaN NaN NaN 35917.000000 45329.000000 NaN
102 2018-04-16 Week 16 PQR NaN NaN NaN NaN 39639.000000 51343.000000 NaN
103 2018-04-23 Week 17 PQR NaN NaN NaN NaN 50867.000000 30119.000000 NaN
H I J K L Campaign name
0 NaN NaN NaN 0.017888 0.697324 NaN ABC
1 NaN NaN NaN 0.457963 0.810985 NaN ABC
2 NaN NaN NaN 0.743030 0.253668 NaN ABC
3 NaN NaN NaN 0.038683 0.050028 NaN ABC
4 NaN NaN NaN 0.885567 0.712333 NaN ABC
.. .. .. .. ... ... ... ...
99 NaN NaN NaN 9433.000000 17108.000000 WX PQR
100 NaN NaN NaN 12529.000000 23557.000000 WX PQR
101 NaN NaN NaN 20395.000000 44228.000000 WX PQR
102 NaN NaN NaN 55077.000000 45149.000000 WX PQR
103 NaN NaN NaN 45815.000000 35761.000000 WX PQR
[104 rows x 17 columns]
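The stack/assign/concat step can be exercised without the spreadsheet; here is a minimal sketch with two synthetic "sheets" (the sheet names, brands, and numbers are made up):

```python
import pandas as pd

# Two hypothetical "sheets", shaped like read_excel(..., header=[0,1],
# index_col=[0,1]) output: MultiIndex columns (Brand, metric) and a
# two-level row index (Date, WK).
idx = pd.MultiIndex.from_tuples([("2017-10-02", "Week 40"),
                                 ("2017-10-09", "Week 41")])
cols = pd.MultiIndex.from_product([["ABC", "XYZ"], ["E", "F"]])
dfs = {
    "Sheet1": pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], index=idx, columns=cols),
    "Sheet2": pd.DataFrame([[9, 8, 7, 6], [5, 4, 3, 2]], index=idx, columns=cols),
}

# stack(0) moves the brand level of the columns into the row index;
# assign tags each sheet's rows with its name before concatenation.
df_1 = (pd.concat([df.stack(0).assign(name=n) for n, df in dfs.items()])
          .rename_axis(index=["Date", "WK", "Brand"], columns=None)
          .reset_index())
```

Each input sheet contributes rows (dates) × brands rows to the output, so the two 2×(2 brands) sheets above yield eight rows.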
I created my own version of your excel, which looks like this.
The code below is far from perfect, but it should do fine as long as you do not have millions of sheets.
# First, obtain all sheet names
full_df = pd.read_excel(r'C:\Users\.\Downloads\test.xlsx',
                        sheet_name=None, skiprows=0)
# Store them into a list
sheet_names = list(full_df.keys())

# Create an empty DataFrame to store the contents from each sheet
final_df = pd.DataFrame()
for sheet in sheet_names:
    df = pd.read_excel(r'C:\Users\.\Downloads\test.xlsx', sheet_name=sheet, skiprows=0)
    # Get the brand name
    brand = df.columns[1]
    # Remove the header columns and keep the numerical values only
    df.columns = df.iloc[0]
    df = df[1:]
    df = df.iloc[:, 1:]
    # Set the brand name into a new column
    df['Brand'] = brand
    # Append into the final dataframe
    final_df = pd.concat([final_df, df])
Your final_df should look like this once exported back to excel.
EDIT: You might need to drop the dataframe's index when saving, using df.reset_index(drop=True), to remove the first column shown in the image right above.
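The per-sheet transform can be sketched on a synthetic frame (the "Unnamed:" column labels below are made up to mimic how pandas names blank Excel headers):

```python
import pandas as pd

# Hypothetical raw sheet as pandas reads it: the brand name sits in the
# header row, while the real column names sit in the first data row.
df = pd.DataFrame(
    [["Date", "A", "B"],
     ["2017-10-02", 1, 2],
     ["2017-10-09", 3, 4]],
    columns=["Unnamed: 0", "ABC", "Unnamed: 2"])

brand = df.columns[1]       # 'ABC' -- the merged-cell brand label
df.columns = df.iloc[0]     # promote the first data row to the header
df = df[1:]                 # drop that row
df = df.iloc[:, 1:]         # drop the leading label column
df["Brand"] = brand         # tag every row with the brand
```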
I am writing a script to scrape a series of tables in a pdf into python using tabula-py.
This is fine. I do get the data. But the data is multi-line, and useless in reality.
I would like to merge each run of continuation rows into the preceding row where the first column (Tag) is not NaN.
I was about to put the whole thing in an iterator and do it manually, but pandas is a powerful tool and I just don't have the pandas vocabulary to search for the right one. Any help is much appreciated.
My Code
import tabula

filename = 'tags.pdf'
tagTableStart = 2   # 784
tagTableEnd = 39    # 822
tableHeadings = ['Tag', 'Item', 'Length', 'Description', 'Value']
pageRange = "%d-%d" % (tagTableStart, tagTableEnd)
print("Scanning pages %s" % pageRange)

# extract all the tables in that page range
tables = tabula.read_pdf(filename, pages=pageRange)
How the data is stored in the DataFrame (empty fields are NaN):

Tag   Item   Length   Description   Value
AA    Some   2        Very Very
      Text            Very long
                      Value
AB    More   4        Other Very    aaaa
      Text            Very long     bbbb
                      Value         cccc
How I want the data (this is almost as it is displayed in the pdf; \n marks the line breaks I couldn't render in the SO editor):

Tag   Item         Length   Description                    Value
AA    Some\nText   2        Very Very\nVery long\nValue
AB    More\nText   4        Other Very\nVery long\nValue   aaaa\nbbbb\ncccc
Actual sample output (obfuscated)
Tag Item Length Description Value
0 AA PYTHROM-PARTY-I 20 Some Current defined values are :
1 NaN NaN NaN texst Byte1:
2 NaN NaN NaN NaN C
3 NaN NaN NaN NaN DD
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN DD
6 NaN NaN NaN NaN DD
7 NaN NaN NaN NaN DD
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN B :
10 NaN NaN NaN NaN JLSAFISFLIHAJSLIhdsflhdliugdyg89o7fgyfd
11 NaN NaN NaN NaN ISFLIHAJSLIhdsflhdliugdyg89o7fgyfd
12 NaN NaN NaN NaN upon ISFLIHAJSLIhdsflhdliugdyg89o7fgy
13 NaN NaN NaN NaN asdsadct on the dasdsaf the
14 NaN NaN NaN NaN actsdfion.
15 NaN NaN NaN NaN NaN
16 NaN NaN NaN NaN SLKJDBFDLFKJBDSFLIUFy7dfsdfiuojewv
17 NaN NaN NaN NaN csdfgfdgfd.
18 NaN NaN NaN NaN NaN
19 NaN NaN NaN NaN fgfdgdfgsdfgfdsgdfsgfdgfdsgsdfgfdg
20 BB PRESENT-AMOUNT-BOX 11 Lorem Ipsum NaN
21 CC SOME-OTHER-VALUE 1 sdlkfgsdsfsdf 1
22 NaN NaN NaN device NaN
23 NaN NaN NaN ueghkjfgdsfdskjfhgsdfsdfkjdshfgsfliuaew8979vfhsdf NaN
24 NaN NaN NaN dshf87hsdfe4ir8hod9 NaN
Create groups keyed on the forward-filled Tag column, then join the rows of each group:
agg_func = dict(zip(df.columns, [lambda s: '\n'.join(s).strip()] * len(df.columns)))
out = df.fillna('').groupby(df['Tag'].ffill(), as_index=False).agg(agg_func)
Output:
>>> out
Tag Item Length Description Value
0 AA Some\nText 2 Very Very\nVery long\nValue
1 AB More\nText 4 Other Very\nVery long\nValue aaaa\nbbbb\ncccc
agg_func is equivalent to writing:
{'Tag': lambda s: '\n'.join(s).strip(),
'Item': lambda s: '\n'.join(s).strip(),
'Length': lambda s: '\n'.join(s).strip(),
'Description': lambda s: '\n'.join(s).strip(),
'Value': lambda s: '\n'.join(s).strip()}
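Rebuilding the question's small sample table by hand shows the whole pipeline end to end:

```python
import pandas as pd
import numpy as np

# The question's small sample table, rebuilt by hand.
df = pd.DataFrame({
    "Tag":         ["AA", np.nan, np.nan, "AB", np.nan, np.nan],
    "Item":        ["Some", "Text", np.nan, "More", "Text", np.nan],
    "Length":      ["2", np.nan, np.nan, "4", np.nan, np.nan],
    "Description": ["Very Very", "Very long", "Value",
                    "Other Very", "Very long", "Value"],
    "Value":       [np.nan, np.nan, np.nan, "aaaa", "bbbb", "cccc"],
})

# Join every column with '\n' inside groups keyed on the forward-filled Tag.
agg_func = dict(zip(df.columns, [lambda s: '\n'.join(s).strip()] * len(df.columns)))
out = df.fillna('').groupby(df['Tag'].ffill(), as_index=False).agg(agg_func)
```

Note that fillna('') is what lets '\n'.join work: joining a group that still contains NaN would raise a TypeError.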
I have an unformatted excel sheet as an input file. I need to rearrange it and write it to another excel file, calculating the working hours of each employee across different projects and clients.
Here Ram is working on the 1st project alone, but Mohan is working on the 1st and 2nd projects, and we have to calculate his working hours for both.
Input
Output
>>> df # input dataframe
Employee Name Project Client hours
0 Ram NaN NaN NaN
1 NaN 1st Project NaN NaN
2 NaN NaN ABC 5.0
3 NaN NaN NaN 5.0
4 Mohan NaN NaN NaN
5 NaN 1st Project NaN NaN
6 NaN NaN DEF 10.0
7 NaN NaN DEF 10.0
8 NaN 2nd Project NaN NaN
9 NaN NaN GEH 1.0
10 NaN NaN NaN 1.0
11 NaN NaN NaN 11.0
For each column except the last one, replace NaN with the previous value and move the column into the index, then drop all rows that are entirely empty. Finally, drop the initial RangeIndex.
for col in df.columns[:-1]:
    df[col] = df[col].ffill()
    df = df.set_index(col, append=True)
df = df.dropna(how="all")
df = df.droplevel(0)
>>> df # output dataframe
hours
Employee Name Project Client
Ram 1st Project ABC 5.0
ABC 5.0
Mohan 1st Project DEF 10.0
DEF 10.0
2nd Project GEH 1.0
GEH 1.0
GEH 11.0
Edit: to write out a correctly shaped excel file:
df.set_index("hours", append=True).to_excel("output.xlsx")
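A runnable sketch of the same steps, with the question's input frame rebuilt by hand (only the Ram/Mohan "1st Project" rows, for brevity):

```python
import pandas as pd
import numpy as np

# The question's input frame rebuilt by hand (1st Project rows only).
df = pd.DataFrame({
    "Employee Name": ["Ram", np.nan, np.nan, np.nan, "Mohan", np.nan, np.nan, np.nan],
    "Project": [np.nan, "1st Project", np.nan, np.nan, np.nan, "1st Project", np.nan, np.nan],
    "Client": [np.nan, np.nan, "ABC", np.nan, np.nan, np.nan, "DEF", np.nan],
    "hours": [np.nan, np.nan, 5.0, 5.0, np.nan, np.nan, 10.0, 10.0],
})

# ffill each grouping column and move it into the index, then drop the
# rows that carry no hours and the leftover RangeIndex level.
for col in df.columns[:-1]:
    df[col] = df[col].ffill()
    df = df.set_index(col, append=True)
df = df.dropna(how="all").droplevel(0)
```

After the loop, only the hours column remains, so dropna(how="all") removes exactly the header-only rows.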
I'm working with Python 3 on Mac OS 10.11.6 (El Capitan).
I have a .csv dataset consisting of about 3,700 time series sets (of unequal lengths). The data are currently formatted as follows:
Current Format
trade_date price_usd ticker
0 2016-01-01 434.33000 BTC
1 2016-01-02 433.44000 BTC
2 2016-01-03 430.01000 BTC
3 2016-01-04 433.09000 BTC
4 2016-01-05 431.96000 BTC
... ... ... ...
2347227 2020-10-19 74.13000 BRAIN
2347228 2020-10-20 71.97000 BRAIN
2347229 2020-10-21 76.64000 BRAIN
2347230 2020-10-22 80.90000 BRAIN
2347231 2020-10-19 0.15004 DAOFI
Ignoring the default numerical index for the moment, notice that the datetime column, trade_date, is such that the sequence of values repeats with each new ticker group. My goal is to transform the data such that each ticker name becomes a column header under which its corresponding daily prices are listed in correct order with the datetime value on which it was recorded (i.e. the datetime index does not repeat and the daily price values for the ticker symbols are the rows):
Target Format
trade_date ticker1 ticker2 ... tickerN
day1 t1p1 t2p1 ... tNp1
day2 t1p2 t2p2 ... etc...
.
.
.
dayK
Thus far I've tried various approaches, including experiments with methods such as stack()/unstack() and groupby(), as well as custom functions that iterate through the values and assign them into a pre-built target DataFrame, but to no avail (see the failed attempt below).
New, empty target data frame with ticker symbol as col and trade_date range as index:
BTC ETH XRP MKR LTC USDT BCH XLM EOS BNB ... MTLX INDEX WOA HAUT THRM YFED NMT DOKI BRAIN DAOFI
2016-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-02 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-05 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Failed attempt to populate the above ...
for element in crypto_df['ticker']:
    if element == new_df.column and crypto['trade_date'] == new_df.index:
        df['ticker'] = element
new_df.head()
My ultimate goal is to produce a multi-series time series forecast using FBProphet because of its ability to handle multiple time series forecasts in a "single" model.
One last thought I've just had: one could create separate data frames for each ticker, then rejoin them along the datetime index, creating the separate columns in the new DF along the way, but that seems a bit roundabout (I've literally just done this for a couple thousand .csv files with equities data, for example). I'd still like to find a more direct solution, if there is one. Surely this scenario will arise again in the future!
Thanks for any thoughts ...
You can set_index and unstack:
print(df.set_index(["trade_date", "ticker"]).unstack("ticker"))
price_usd
ticker BRAIN BTC DAOFI
trade_date
2016-01-01 NaN 434.33 NaN
2016-01-02 NaN 433.44 NaN
2016-01-03 NaN 430.01 NaN
2016-01-04 NaN 433.09 NaN
2016-01-05 NaN 431.96 NaN
2020-10-19 74.13 NaN 0.15004
2020-10-20 71.97 NaN NaN
2020-10-21 76.64 NaN NaN
2020-10-22 80.90 NaN NaN
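The same idea as a self-contained snippet, using a handful of made-up rows in the question's long format:

```python
import pandas as pd

# A handful of made-up rows in the question's long format.
df = pd.DataFrame({
    "trade_date": ["2016-01-01", "2016-01-02", "2020-10-19", "2020-10-19"],
    "price_usd": [434.33, 433.44, 74.13, 0.15004],
    "ticker": ["BTC", "BRAIN", "BRAIN", "DAOFI"][:0] or ["BTC", "BTC", "BRAIN", "DAOFI"],
})

# Pivot to one column per ticker; unstack leaves a ('price_usd', ticker)
# MultiIndex on the columns, so drop the top level afterwards.
wide = df.set_index(["trade_date", "ticker"]).unstack("ticker")
wide.columns = wide.columns.droplevel(0)
```

Dates a ticker never traded on simply come out as NaN, which is exactly the sparse wide shape FBProphet-style per-column modeling needs.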
First use .groupby(), then use .unstack():
import pandas as pd
from io import StringIO
text = """
trade_date price_usd ticker
2016-01-01 434.33000 BTC
2016-01-02 433.44000 BTC
2016-01-02 430.01000 Google
2016-01-03 433.09000 BTC
2016-01-03 431.96000 Google
"""
df = pd.read_csv(StringIO(text), sep=r'\s+', header=0)
df.groupby(['trade_date', 'ticker'])['price_usd'].mean().unstack()
Resulting dataframe:
trade_date ticker BTC Google
2016-01-01 434.33 NaN
2016-01-02 433.44 430.01
2016-01-03 433.09 431.96
I am having some issues sorting columns based on the last row; I am getting a KeyError: -1.
I have tried df.sort_values(by=[-1], axis=1, na_position='last') and
df.sort_values(by=df[-1], axis=1, na_position='last')
timestamp AKS AGI AA ATI ... TK TNP USDP ZTO
2019-09-10 NaN NaN NaN NaN ... 0.063570 0.057432 -0.121778 0.098429
2019-09-11 NaN NaN NaN NaN ... 0.083130 0.043919 -0.128889 0.104712
2019-09-12 NaN NaN NaN NaN ... 0.080685 0.047297 -0.130667 0.135079
2019-09-13 NaN NaN NaN NaN ... 0.090465 0.020270 -0.123556 0.112565
2019-09-16 NaN NaN NaN NaN ... NaN NaN NaN NaN
some code
sorted_df = df.sort_values(by=[-1], axis=1,na_position='last')
I expected the columns to be sorted by the last row.
Well, if you want to know what your last column is, you can get it with
df.columns.tolist()[-1]
So if you want to sort the df by the last column, it turns out as
df.sort_values(by=df.columns.tolist()[-1], ascending=False)
and then you can set na_position:
na_position : {‘first’, ‘last’}, default ‘last’
first puts NaNs at the beginning, last puts NaNs at the end
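For the question as literally asked (sorting columns by the last row, rather than rows by the last column), the row's index label has to be passed together with axis=1; a sketch with a tiny made-up frame:

```python
import pandas as pd
import numpy as np

# A tiny made-up frame shaped like the question's data.
df = pd.DataFrame(
    {"AKS": [np.nan, np.nan], "TK": [0.0636, np.nan],
     "TNP": [0.0574, np.nan], "ZTO": [0.0984, np.nan]},
    index=["2019-09-13", "2019-09-16"])

# Sorting *columns* means axis=1 with a row *label* -- the positional
# -1 is not a label, which is where the KeyError comes from.
by_last_row = df.sort_values(by=df.index[-1], axis=1, na_position="last")

# If the last row is all NaN (as above), the last row that has data is
# usually the more useful sort key:
last_valid = df.dropna(how="all").index[-1]
sorted_df = df.sort_values(by=last_valid, axis=1,
                           ascending=False, na_position="last")
```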