dividing columns in pandas - python-3.x

I've been trying for a while to figure out what is wrong with my code, but I think I need help. Here is my problem: I have two dataframes that I want to divide by each other, but the result is always NaN.
In: merged.head()
Out:
15 25 35 45 55 \
Wavelength
460 1400.008821 770.372081 585.551187 547.150493 570.912408
470 1525.470457 807.362906 573.020215 500.984110 489.206952
480 1848.337785 873.034042 524.651886 410.963453 351.929723
490 2214.989164 992.325996 566.273806 413.544512 340.781474
500 2901.445401 1353.389196 807.889110 648.437549 615.930982
In: white.head()
Out:
White
Wavelength
460 289.209506
470 366.083846
480 473.510106
490 524.090532
500 553.715322
When I try to do the division, the result is:
In: ref = merged.div(white.White,axis = 0)
In: ref.head()
Out:
15 25 35 45 55 65 75 85
Wavelength
460 NaN NaN NaN NaN NaN NaN NaN NaN
470 NaN NaN NaN NaN NaN NaN NaN NaN
480 NaN NaN NaN NaN NaN NaN NaN NaN
490 NaN NaN NaN NaN NaN NaN NaN NaN
500 NaN NaN NaN NaN NaN NaN NaN NaN
In this case, what is wrong with my code?
I also tried
ref = merged[["15","25","35","45","55","65","75","85"]].div(white.White,axis = 0)
with the same result.

I think the problem is that the indexes have different dtypes, so the data are not aligned and you get NaNs; they need to be the same.
#sample dtypes (in your data they may be the other way around)
print (merged.index.dtype)
object
print (white.index.dtype)
int64
So the solution is to convert both indexes to the same dtype with astype:
merged.index = merged.index.astype(int)
white.index = white.index.astype(int)
Or:
merged.index = merged.index.astype(str)
white.index = white.index.astype(str)
#white.index is already int, so no cast is necessary
merged.index = merged.index.astype(int)
ref = merged.div(white.White,axis = 0)
print (ref)
15 25 35 45 55
Wavelength
460 4.840812 2.663716 2.024661 1.891883 1.974044
470 4.166997 2.205404 1.565270 1.368496 1.336325
480 3.903481 1.843750 1.108006 0.867909 0.743236
490 4.226348 1.893425 1.080489 0.789071 0.650234
500 5.239959 2.444197 1.459033 1.171067 1.112360
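The failure mode and the fix can be reproduced with a small self-contained sketch (the numbers are taken from the head() outputs above; the column subset is trimmed for brevity):

```python
import pandas as pd

# String index on one side, integer index on the other -- as in the question
merged = pd.DataFrame({"15": [1400.008821, 1525.470457],
                       "25": [770.372081, 807.362906]},
                      index=pd.Index(["460", "470"], name="Wavelength"))
white = pd.DataFrame({"White": [289.209506, 366.083846]},
                     index=pd.Index([460, 470], name="Wavelength"))

# Division aligns on index labels; "460" (str) never matches 460 (int) -> all NaN
bad = merged.div(white.White, axis=0)
print(bad.isna().all().all())   # True

# Casting both indexes to the same dtype restores the alignment
merged.index = merged.index.astype(int)
ref = merged.div(white.White, axis=0)
print(ref.loc[460, "15"])       # ~4.840812
```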

Related

concat a transposed dataframe to another dataframe and write it to a spreadsheet

I have a data frame that looks like this:
Date Demand Forecast Error [Error] Error% Error^2
1 47
2 70
3 95
4 122 61 51 51 54 23
5 142 86 34 51 54 23
6 155 110 34 51 45 36
7 189 130 54 51 45 86
8 208 152
9 160 174
10 142 176
11 160
12 160
13 160
and the df2 that looks like:
Bias MAPE MAE RMSE_rel
0 0.143709 0.273529 42.285714 43.198692
1 22.952381 0.273529 0.264758 0.270475
df2 is transposed, with new columns Absulote and Scaled, to look like this:
df2 = df2.set_index('Bias').T
df2.columns = ["Absulote", "Scaled"]
Absulote Scaled
MAPE 0.273529 0.273529
MAE 42.285714 0.264758
RMSE_rel 43.198692 0.270475
and there is no Bias column anymore.
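The transpose step can be reproduced on the df2 shown above. Note that the result of set_index('Bias').T has to be assigned back, and that the Bias values become the (immediately renamed) column labels, which is why Bias disappears:

```python
import pandas as pd

# df2 as shown in the question
df2 = pd.DataFrame({"Bias": [0.143709, 22.952381],
                    "MAPE": [0.273529, 0.273529],
                    "MAE": [42.285714, 0.264758],
                    "RMSE_rel": [43.198692, 0.270475]})

# The transposed frame must be assigned back, otherwise df2 is unchanged
df2 = df2.set_index("Bias").T
df2.columns = ["Absulote", "Scaled"]
print(df2)
#            Absulote    Scaled
# MAPE       0.273529  0.273529
# MAE       42.285714  0.264758
# RMSE_rel  43.198692  0.270475
```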
To concatenate both I do this:
complete_df = pd.concat([df1, df2],axis=0, ignore_index=True)
I get this result:
Demand Forecast Error Absulote Scaled
0 47.0 NaN NaN NaN NaN
1 70.0 NaN NaN NaN NaN
2 95.0 NaN NaN NaN NaN
3 122.0 70.666667 51.333333 NaN NaN
4 142.0 95.666667 46.333333 NaN NaN
5 155.0 119.666667 35.333333 NaN NaN
6 189.0 139.666667 49.333333 NaN NaN
7 208.0 162.000000 46.000000 NaN NaN
8 160.0 184.000000 -24.000000 NaN NaN
9 142.0 185.666667 -43.666667 NaN NaN
10 NaN 170.000000 NaN NaN NaN
11 NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN
13 NaN NaN NaN NaN NaN
14 NaN NaN NaN 0.273529 0.273529
15 NaN NaN NaN 42.285714 0.264758
16 NaN NaN NaN 43.198692 0.270475
There are no Bias, MAPE, MAE, RMSE_rel labels.
I want the result to imitate the following:
Is there any way to achieve that?

Trying to add content to an existing dataFrame

I've been trying for days to find a solution to my problem. I am trying to add content to a pre-existing column in a data frame. However, when I print it, my program shows that only the first 100 lines are modified; nothing is ever shown beyond line 100, and the items are not appended but overwritten. I've tried several approaches and they always give an error; the closest to what I want is the output I put in this post. Could someone help me? I would be very grateful.
import functions
import pandas as pd
from time import sleep
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
pd.set_option('max_rows', 10)
# Module of which the page link returns
site = functions.pagina()
#Browser Options
options = Options()
options.headless = True # Prevent browser from opening
driver = webdriver.Firefox(options=options) #Include options in browser
#Run browser with URL
driver.get(site)
# Wait in seconds for the page to load
print('Esperando 5s')
sleep(5)
#Create Data Frame
df = pd.DataFrame(columns = ['rank','nome','classe','item_lvl','servidor','mortes_temporada'], index=None)
while True:
    #Get the 'next page' variable
    next_page = driver.find_element_by_css_selector('#pagination-hook > nav > ul > li:nth-child(2) > a')
    #As long as the 'next page' element exists
    if next_page is not None:
        #Url of current page
        driver.get(driver.current_url)
        print(driver.current_url[-6:].upper())
        #Attempting to insert data into existing rank column
        ranks = pd.Series([rank_page.text for rank_page in driver.find_elements_by_xpath('//div[4]//div[1]/table/tbody/tr/td[1]')])
        #MY PROBLEM ADDING CONTENT TO A COLUMN
        df['rank'] =+ ranks
        #Click on Next
        driver.execute_script("arguments[0].click();", next_page)
        sleep(4)
    else:
        break
print('*' * 80)
print(df)
print('*' * 80)
Returned output:
Esperando 5s
PAGE=1
********************************************************************************
rank nome classe item_lvl servidor mortes_temporada
0 1 NaN NaN NaN NaN NaN
1 2 NaN NaN NaN NaN NaN
2 3 NaN NaN NaN NaN NaN
3 4 NaN NaN NaN NaN NaN
4 5 NaN NaN NaN NaN NaN
.. ... ... ... ... ... ...
95 96 NaN NaN NaN NaN NaN
96 97 NaN NaN NaN NaN NaN
97 98 NaN NaN NaN NaN NaN
98 99 NaN NaN NaN NaN NaN
99 100 NaN NaN NaN NaN NaN
[100 rows x 6 columns]
********************************************************************************
PAGE=2
********************************************************************************
rank nome classe item_lvl servidor mortes_temporada
0 101 NaN NaN NaN NaN NaN
1 102 NaN NaN NaN NaN NaN
2 103 NaN NaN NaN NaN NaN
3 104 NaN NaN NaN NaN NaN
4 105 NaN NaN NaN NaN NaN
.. ... ... ... ... ... ...
95 196 NaN NaN NaN NaN NaN
96 197 NaN NaN NaN NaN NaN
97 198 NaN NaN NaN NaN NaN
98 199 NaN NaN NaN NaN NaN
99 200 NaN NaN NaN NaN NaN
[100 rows x 6 columns]
********************************************************************************
When you add rows to a pandas DataFrame, you have to include data (even if only NaN) for the rest of the columns in every row.
In your case, the fix is:
import numpy as np

df2 = pd.DataFrame({
    "rank": [rank_page.text for rank_page in driver.find_elements_by_xpath('//div[4]//div[1]/table/tbody/tr/td[1]')],
    "nome": np.nan,
    "classe": np.nan,
    "item_lvl": np.nan,
    "servidor": np.nan,
    "mortes_temporada": np.nan
})
df = df.append(df2)
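A note on the snippet above: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same accumulate-per-page pattern is usually written by collecting the page frames in a list and concatenating once. A sketch with made-up rank values standing in for the scraped text:

```python
import pandas as pd

cols = ['rank', 'nome', 'classe', 'item_lvl', 'servidor', 'mortes_temporada']

# Stand-ins for the per-page scrape results
pages = [['1', '2', '3'], ['4', '5', '6']]

frames = []
for ranks in pages:
    # Columns absent from the dict are filled with NaN
    frames.append(pd.DataFrame({'rank': ranks}, columns=cols))

# One concat at the end replaces the repeated df.append calls
df = pd.concat(frames, ignore_index=True)
print(df['rank'].tolist())   # ['1', '2', '3', '4', '5', '6']
```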

Merging rows in pandas DataFrame

I am writing a script to scrape a series of tables in a pdf into python using tabula-py.
This is fine. I do get the data. But the data is multi-line, and useless in reality.
I would like to merge the rows into groups, where each group starts at a row whose first column (Tag) is not NaN.
I was about to put the whole thing in an iterator, and do it manually, but I realize that pandas is a powerful tool, but I don't have the pandas vocabulary to search for the right tool. Any help is much appreciated.
My Code
filename='tags.pdf'
tagTableStart=2 #784
tagTableEnd=39 #822
tableHeadings = ['Tag','Item','Length','Description','Value']
pageRange = "%d-%d" % (tagTableStart, tagTableEnd)
print ("Scanning pages %s" % pageRange)
# extract all the tables in that page range
tables = tabula.read_pdf(filename, pages=pageRange)
How The data is stored in the DataFrame:
(Empty fields are NaN)
Tag  Item  Length  Description  Value
AA   Some  2       Very Very
     Text          Very long
                   Value
AB   More  4       Other Very   aaaa
     Text          Very long    bbbb
                   Value        cccc
How I want the data:
This is almost as it is displayed in the pdf (I couldn't figure out how to make text multi line in SO editor)
Tag  Item        Length  Description                   Value
AA   Some\nText  2       Very Very\nVery long\nValue
AB   More\nText  4       Other Very\nVery long\nValue  aaaa\nbbbb\ncccc
Actual sample output (obfuscated)
Tag Item Length Description Value
0 AA PYTHROM-PARTY-I 20 Some Current defined values are :
1 NaN NaN NaN texst Byte1:
2 NaN NaN NaN NaN C
3 NaN NaN NaN NaN DD
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN DD
6 NaN NaN NaN NaN DD
7 NaN NaN NaN NaN DD
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN B :
10 NaN NaN NaN NaN JLSAFISFLIHAJSLIhdsflhdliugdyg89o7fgyfd
11 NaN NaN NaN NaN ISFLIHAJSLIhdsflhdliugdyg89o7fgyfd
12 NaN NaN NaN NaN upon ISFLIHAJSLIhdsflhdliugdyg89o7fgy
13 NaN NaN NaN NaN asdsadct on the dasdsaf the
14 NaN NaN NaN NaN actsdfion.
15 NaN NaN NaN NaN NaN
16 NaN NaN NaN NaN SLKJDBFDLFKJBDSFLIUFy7dfsdfiuojewv
17 NaN NaN NaN NaN csdfgfdgfd.
18 NaN NaN NaN NaN NaN
19 NaN NaN NaN NaN fgfdgdfgsdfgfdsgdfsgfdgfdsgsdfgfdg
20 BB PRESENT-AMOUNT-BOX 11 Lorem Ipsum NaN
21 CC SOME-OTHER-VALUE 1 sdlkfgsdsfsdf 1
22 NaN NaN NaN device NaN
23 NaN NaN NaN ueghkjfgdsfdskjfhgsdfsdfkjdshfgsfliuaew8979vfhsdf NaN
24 NaN NaN NaN dshf87hsdfe4ir8hod9 NaN
Create groups from the forward-filled Tag column, then join each group's rows:
agg_func = dict(zip(df.columns, [lambda s: '\n'.join(s).strip()] * len(df.columns)))
out = df.fillna('').groupby(df['Tag'].ffill(), as_index=False).agg(agg_func)
Output:
>>> out
Tag Item Length Description Value
0 AA Some\nText 2 Very Very\nVery long\nValue
1 AB More\nText 4 Other Very\nVery long\nValue aaaa\nbbbb\ncccc
agg_func is equivalent to writing:
{'Tag': lambda s: '\n'.join(s).strip(),
'Item': lambda s: '\n'.join(s).strip(),
'Length': lambda s: '\n'.join(s).strip(),
'Description': lambda s: '\n'.join(s).strip(),
'Value': lambda s: '\n'.join(s).strip()}
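Under the assumption that the flattened data looks like the small frame below, the two lines above reproduce the desired output end to end:

```python
import numpy as np
import pandas as pd

# Small reconstruction of the question's data
df = pd.DataFrame({
    'Tag':         ['AA',        np.nan,      np.nan,  'AB',         np.nan,      np.nan],
    'Item':        ['Some',      'Text',      np.nan,  'More',       'Text',      np.nan],
    'Length':      ['2',         np.nan,      np.nan,  '4',          np.nan,      np.nan],
    'Description': ['Very Very', 'Very long', 'Value', 'Other Very', 'Very long', 'Value'],
    'Value':       [np.nan,      np.nan,      np.nan,  'aaaa',       'bbbb',      'cccc'],
})

# Join every column's values with '\n' within each forward-filled Tag group
agg_func = {c: lambda s: '\n'.join(s).strip() for c in df.columns}
out = df.fillna('').groupby(df['Tag'].ffill(), as_index=False).agg(agg_func)
print(out)
```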

Get next non nan values in pandas dataframe

I have a data frame like below:
A B
10 NaN
NaN 20
NaN NaN
NaN NaN
NaN NaN
NaN 50
20 NaN
NaN 30
NaN NaN
30 30
40 NaN
NaN 10
Here I need to return the previous and next non-NaN value of column B for each row where column A is not NaN.
The code which I'm using is:
df['prev_b'] = NP.where(df['A'].notna(), df['B'].shift(-1),NP.nan)
df['next_b'] = NP.where(df['A'].notna(), df['B'].shift(1),NP.nan)
The required output is:
A B prev_b next_b
10 NaN NaN 20
NaN 20 NaN NaN
NaN NaN NaN NaN
NaN NaN NaN NaN
NaN NaN NaN NaN
NaN 50 NaN NaN
20 NaN 50 30
NaN 30 NaN NaN
NaN NaN NaN NaN
30 30 30 30
40 NaN 30 10
NaN 10 NaN NaN
Could someone help me correct my logic?
Use a forward or backward fill instead in your numpy where; it should correctly align to get your next/previous non-nan value:
df.assign(
prev_b=np.where(df.A.notna(), df.B.ffill(), np.nan),
next_b=np.where(df.A.notna(), df.B.bfill(), np.nan),
)
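A runnable sketch on a shortened version of the question's data (rows trimmed for brevity):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [10, np.nan, np.nan, 20, np.nan, 30],
    'B': [np.nan, 20, 50, np.nan, 30, 10],
})

# ffill/bfill propagate the last/next non-NaN B value to every row;
# np.where then keeps that value only where A is present
res = df.assign(
    prev_b=np.where(df.A.notna(), df.B.ffill(), np.nan),
    next_b=np.where(df.A.notna(), df.B.bfill(), np.nan),
)
print(res)
```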

Element-wise division by rows between dataframe and series

I started using pandas a few weeks ago and now I am trying to perform an element-wise division of rows, but I couldn't figure out the proper way to achieve it. Here are my case and data:
date type id ... 1096 1097 1098
0 2014-06-13 cal 1 ... 17.949524 16.247619 15.465079
1 2014-06-13 cow 32 ... 0.523429 -0.854286 -1.520952
2 2014-06-13 cow 47 ... 7.676000 6.521714 5.892381
3 2014-06-13 cow 107 ... 4.161714 3.048571 2.419048
4 2014-06-13 cow 137 ... 3.781143 2.557143 1.931429
5 2014-06-13 cow 255 ... 3.847273 2.509091 1.804329
6 2014-06-13 cow 609 ... 6.097714 4.837714 4.249524
7 2014-06-13 cow 721 ... 3.653143 2.358286 1.633333
8 2014-06-13 cow 817 ... 6.044571 4.934286 4.373333
9 2014-06-13 cow 837 ... 9.649714 8.511429 7.884762
10 2014-06-13 cow 980 ... 1.817143 0.536571 -0.102857
11 2014-06-13 cow 1730 ... 8.512571 7.114286 6.319048
12 2014-06-13 dark 1 ... 168.725714 167.885715 167.600001
my_data.columns
Index(['date', 'type', 'id', '188', '189', '190', '191', '192', '193', '194',
...
'1089', '1090', '1091', '1092', '1093', '1094', '1095', '1096', '1097',
'1098'],
dtype='object', length=914)
My goal is to divide all the rows by the row with "type" == "cal", but from the column '188' to the column '1098' (911 columns)
These are the approaches I have tried:
Extracting the row of interest and using it with apply(), divide() and
operator '/':
>>> cal_r = my_data[my_data["type"]=="cal"].iloc[:,3:]
my_data.apply(lambda x: x.iloc[3:]/cal_r, axis=1)
0 188 189 190 191 192 193 194 195 ... 1091 10...
1 188 189 190 ... 10...
2 188 189 190 ... 109...
3 188 189 190 ... 1096...
4 188 189 190 191 ... ...
5 188 189 190 ... 10...
6 188 189 190 ... 109...
7 188 189 190 ... 1096...
8 188 189 190 ... 1096...
9 188 189 190 ... 1096 ...
10 188 189 190 ... 1...
11 188 189 190 ... 109...
12 188 189 190 191 ... ...
dtype: object
>>> mydata.apply(lambda x: x.iloc[3:].divide(cal_r,axis=1), axis=1)
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 6014, in apply
return op.get_result()
File "/usr/local/lib/python3.5/dist-packages/pandas/core/apply.py", line 142, in get_result
return self.apply_standard()
File "/usr/local/lib/python3.5/dist-packages/pandas/core/apply.py", line 248, in apply_standard
self.apply_series_generator()
File "/usr/local/lib/python3.5/dist-packages/pandas/core/apply.py", line 277, in apply_series_generator
results[i] = self.f(v)
File "<input>", line 1, in <lambda>
File "/usr/local/lib/python3.5/dist-packages/pandas/core/ops.py", line 1375, in flex_wrapper
self._get_axis_number(axis)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 375, in _get_axis_number
.format(axis, type(self)))
ValueError: ("No axis named 1 for object type <class 'pandas.core.series.Series'>", 'occurred at index 0')
Without using apply:
>>> my_data.iloc[:,3:].divide(cal_r)
188 189 190 191 192 193 ... 1093 1094 1095 1096 1097 1098
0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0
1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
The commands my_data.iloc[:,3:].divide(cal_r, axis=1) and my_data.iloc[:,3:]/cal_r give the same result: only the first row is divided.
If I select just one row, it works fine:
my_data.iloc[5,3:]/cal_r
188 189 190 ... 1096 1097 1098
0 48.8182 48.8274 22.4476 ... 0.214338 0.154428 0.116671
[1 rows x 911 columns]
Is there something basic I am missing? I suspect that I will need to replicate the cal_r row to match the number of rows of the whole data frame.
Any hint or guidance is really appreciated.
Related: divide pandas dataframe elements by its line max
I believe you need to convert the values to a numpy array, so the division broadcasts positionally instead of aligning on the index:
cal_r = my_data.iloc[(my_data["type"]=="cal").values, 3:]
print (cal_r)
1096 1097 1098
0 17.949524 16.247619 15.465079
my_data.iloc[:, 3:] /= cal_r.values
print (my_data)
date type id 1096 1097 1098
0 2014-06-13 cal 1 1.000000 1.000000 1.000000
1 2014-06-13 cow 32 0.029161 -0.052579 -0.098348
2 2014-06-13 cow 47 0.427644 0.401395 0.381012
3 2014-06-13 cow 107 0.231857 0.187632 0.156420
4 2014-06-13 cow 137 0.210654 0.157386 0.124890
5 2014-06-13 cow 255 0.214338 0.154428 0.116671
6 2014-06-13 cow 609 0.339715 0.297749 0.274782
7 2014-06-13 cow 721 0.203523 0.145147 0.105614
8 2014-06-14 cow 817 0.336754 0.303693 0.282788
9 2014-06-14 cow 837 0.537603 0.523857 0.509843
10 2014-06-14 cow 980 0.101236 0.033025 -0.006651
11 2014-06-14 cow 1730 0.474251 0.437866 0.408601
12 2014-06-14 dark 1 9.400010 10.332943 10.837319
Or convert the one-row DataFrame to a Series with DataFrame.squeeze, or select the first row by position as a Series:
my_data.iloc[:, 3:] = my_data.iloc[:, 3:].div(cal_r.squeeze())
#alternative
#my_data.iloc[:, 3:] = my_data.iloc[:, 3:].div(cal_r.iloc[0])
print (my_data)
date type id 1096 1097 1098
0 2014-06-13 cal 1 1.000000 1.000000 1.000000
1 2014-06-13 cow 32 0.029161 -0.052579 -0.098348
2 2014-06-13 cow 47 0.427644 0.401395 0.381012
3 2014-06-13 cow 107 0.231857 0.187632 0.156420
4 2014-06-13 cow 137 0.210654 0.157386 0.124890
5 2014-06-13 cow 255 0.214338 0.154428 0.116671
6 2014-06-13 cow 609 0.339715 0.297749 0.274782
7 2014-06-13 cow 721 0.203523 0.145147 0.105614
8 2014-06-14 cow 817 0.336754 0.303693 0.282788
9 2014-06-14 cow 837 0.537603 0.523857 0.509843
10 2014-06-14 cow 980 0.101236 0.033025 -0.006651
11 2014-06-14 cow 1730 0.474251 0.437866 0.408601
12 2014-06-14 dark 1 9.400010 10.332943 10.837319
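Both variants can be checked on a trimmed version of the data (only the 'cal' row and two 'cow' rows, two value columns). A 1-D numpy array broadcasts positionally across every row, while cal_r.squeeze() turns the one-row frame into a Series indexed by the column labels, so .div(..., axis=1) aligns on columns:

```python
import numpy as np
import pandas as pd

my_data = pd.DataFrame({
    'type': ['cal', 'cow', 'cow'],
    '1096': [17.949524, 0.523429, 7.676000],
    '1097': [16.247619, -0.854286, 6.521714],
})
num_cols = my_data.columns[1:]

cal_r = my_data.loc[my_data['type'] == 'cal', num_cols]

# 1-D numpy array: broadcast positionally over every row
out1 = my_data[num_cols] / cal_r.to_numpy().ravel()

# Series (via squeeze): aligned with the frame's columns by label
out2 = my_data[num_cols].div(cal_r.squeeze(), axis=1)

print(out1.iloc[0].round(6).tolist())   # [1.0, 1.0] -- the cal row divided by itself
```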
