I parsed a table from a website using Selenium (by xpath), then used pd.read_html on the table element, and now I'm left with what looks like a list that makes up the table. It looks like this:
[Empty DataFrame
Columns: [Symbol, Expiration, Strike, Last, Open, High, Low, Change, Volume]
Index: [], Symbol Expiration Strike Last Open High Low Change Volume
0 XPEV Dec20 12/18/2020 46.5 3.40 3.00 5.05 2.49 1.08 696.0
1 XPEV Dec20 12/18/2020 47.0 3.15 3.10 4.80 2.00 1.02 2359.0
2 XPEV Dec20 12/18/2020 47.5 2.80 2.67 4.50 1.89 0.91 2231.0
3 XPEV Dec20 12/18/2020 48.0 2.51 2.50 4.29 1.66 0.85 3887.0
4 XPEV Dec20 12/18/2020 48.5 2.22 2.34 3.80 1.51 0.72 2862.0
5 XPEV Dec20 12/18/2020 49.0 1.84 2.00 3.55 1.34 0.49 4382.0
6 XPEV Dec20 12/18/2020 50.0 1.36 1.76 3.10 1.02 0.30 14578.0
7 XPEV Dec20 12/18/2020 51.0 1.14 1.26 2.62 0.78 0.31 4429.0
8 XPEV Dec20 12/18/2020 52.0 0.85 0.95 2.20 0.62 0.19 2775.0
9 XPEV Dec20 12/18/2020 53.0 0.63 0.79 1.85 0.50 0.13 1542.0]
How do I turn this into an actual dataframe, with the "Symbol, Expiration, etc..." as the header, and the far left column as the index?
I've been trying several different things, but to no avail. Where I left off was trying:
# From reading the html of the table step
dfs = pd.read_html(table.get_attribute('outerHTML'))
dfs = pd.DataFrame(dfs)
... and when I print the new dfs, I get this:
0 Empty DataFrame
Columns: [Symbol, Expiration, ...
1 Symbol Expiration Strike Last Open ...
Per pandas.read_html docs,
This function will always return a list of DataFrame or it will fail, e.g., it will not return an empty list.
According to your list output, the non-empty DataFrame is the second element of that list, so retrieve it by indexing (remember that Python sequences are zero-indexed, so the second element is dfs[1]). Note that you can also work with DataFrames directly while they are stored in a list or dict:
dfs[1].head()
dfs[1].tail()
dfs[1].describe()
...
single_df = dfs[1].copy()
del dfs
Or index in the same call:
single_df = pd.read_html(...)[1]
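If you would rather not hard-code the position, a minimal sketch is to pick the first non-empty frame from the list. The list below is a hand-built stand-in for what pd.read_html returned in the question, and first_non_empty is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Stand-in for the list returned by pd.read_html: one empty frame, one with data.
columns = ["Symbol", "Expiration", "Strike", "Last"]
dfs = [
    pd.DataFrame(columns=columns),
    pd.DataFrame(
        [["XPEV Dec20", "12/18/2020", 46.5, 3.40],
         ["XPEV Dec20", "12/18/2020", 47.0, 3.15]],
        columns=columns,
    ),
]


def first_non_empty(frames):
    """Return the first DataFrame in `frames` that has at least one row."""
    for frame in frames:
        if not frame.empty:
            return frame
    raise ValueError("every parsed table was empty")


single_df = first_non_empty(dfs)
print(single_df)
```

The result already has the column names as the header and a default integer index on the left, which is what the question asks for.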
Floating-point numbers have finite precision, so values computed in exactly the same way can print with a different number of digits.
This was observed on Python 3.x under both Linux and Windows, and the small errors carry over into subsequent calculations.
for i in range(100):
    k = 1 + i / 100
    print(k)
1.0
1.01
1.02
1.03
1.04
1.05
1.06
1.07
1.08
1.09
1.1
1.11
1.12
1.13
1.1400000000000001
1.15
1.16
1.17
1.18
1.19
1.2
1.21
1.22
1.23
1.24
1.25
1.26
1.27
1.28
1.29
1.3
1.31
1.32
1.33
1.34
1.35
1.3599999999999999
1.37
1.38
1.3900000000000001
1.4
1.41
1.42
1.43
1.44
1.45
1.46
1.47
1.48
1.49
1.5
1.51
1.52
1.53
1.54
1.55
1.56
1.5699999999999998
1.58
1.5899999999999999
1.6
1.6099999999999999
1.62
1.63
1.6400000000000001
1.65
1.6600000000000001
1.67
1.6800000000000002
1.69
1.7
1.71
1.72
1.73
1.74
1.75
1.76
1.77
1.78
1.79
1.8
1.81
1.8199999999999998
1.83
1.8399999999999999
1.85
1.8599999999999999
1.87
1.88
1.8900000000000001
1.9
1.9100000000000001
1.92
1.9300000000000002
1.94
1.95
1.96
1.97
1.98
1.99
It is possible to set the precision in the following way:
for i in range(100):
    k = 1 + i / 100
    print("%.Nf" % k)
where N is the number of decimal places to print (for example, "%.2f" for two).
Keep in mind that you usually don't need many decimal places, even though the number of digits available can be very large.
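As a small sketch of the same idea, the snippet below formats each value to a fixed number of places and, as an alternative, computes the sequence with the standard-library decimal module so no binary rounding error appears at all. The function names are made up for illustration.

```python
from decimal import Decimal


def formatted_steps(places=2, count=100):
    """Format 1 + i/100 for i in range(count) with a fixed number of decimals."""
    return [f"{1 + i / 100:.{places}f}" for i in range(count)]


def decimal_steps(count=100):
    """Compute the same sequence exactly using decimal arithmetic."""
    return [Decimal(1) + Decimal(i) / Decimal(100) for i in range(count)]


print(1 + 14 / 100)          # 1.1400000000000001 (binary float)
print(formatted_steps()[14])  # 1.14 (rounded only for display)
print(decimal_steps()[14])    # 1.14 (exact decimal value)
```

Formatting only changes what is printed; the stored float is unchanged. If the error must not propagate into later calculations, decimal (or rounding at each step) is the safer choice.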
I have the following data.
x y
0.00 0.00
0.03 1.74
0.05 2.60
0.08 3.04
0.11 3.47
0.13 3.90
0.16 4.33
0.19 4.59
0.21 4.76
0.20 3.90
0.18 3.12
0.18 2.60
0.16 2.17
0.15 1.73
0.13 1.47
0.12 1.21
0.14 2.60
0.17 3.47
0.21 3.90
0.23 4.33
0.26 4.76
0.28 5.19
0.31 5.45
0.33 5.62
0.37 5.79
0.38 5.97
0.42 6.14
0.44 6.22
0.47 6.31
0.49 6.39
0.51 6.48
I used =MAX()/2 to obtain the 50% level (half of the maximum), which in this case is 3.24.
The value 3.24 does not appear among the y values; it falls between 3.04 and 3.47.
How can I find the address of these 2 cells?
Note: The 50% level is also crossed again later in the data, but I only require the first instance.
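Assuming your data is in columns A and B, with the header in row 1 (first numbers in row 2), and assuming your =MAX()/2 formula is in D2:
Use AGGREGATE to determine the first row where the Y value exceeds your half-maximum value, then do it again and subtract 1 from the row. AGGREGATE(15,6,...) behaves like SMALL with errors ignored: rows where the condition is FALSE produce a divide-by-zero error and are skipped.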
=AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1)
That returns row number 6, the first row whose Y value exceeds the value in D2.
=AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1)-1
That gives you row number 5.
Use the row numbers in conjunction with INDEX to pull the X values.
=INDEX(A:A,AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1)-1)
=INDEX(A:A,AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1))
That will give you the X values. If you want the corresponding Y values, simply change the INDEX lookup range from A:A to B:B.
=INDEX(B:B,AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1)-1)
=INDEX(B:B,AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1))
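For anyone checking the logic outside Excel, here is a rough Python sketch of the same search using the data from the question. The function name first_bracket is made up, and the +2 offset assumes the sheet layout described above (header in row 1, data starting in row 2).

```python
xs = [0.00, 0.03, 0.05, 0.08, 0.11, 0.13, 0.16, 0.19, 0.21, 0.20, 0.18,
      0.18, 0.16, 0.15, 0.13, 0.12, 0.14, 0.17, 0.21, 0.23, 0.26, 0.28,
      0.31, 0.33, 0.37, 0.38, 0.42, 0.44, 0.47, 0.49, 0.51]
ys = [0.00, 1.74, 2.60, 3.04, 3.47, 3.90, 4.33, 4.59, 4.76, 3.90, 3.12,
      2.60, 2.17, 1.73, 1.47, 1.21, 2.60, 3.47, 3.90, 4.33, 4.76, 5.19,
      5.45, 5.62, 5.79, 5.97, 6.14, 6.22, 6.31, 6.39, 6.48]


def first_bracket(values, threshold):
    """Return the 0-based indices (before, at) of the first value above threshold."""
    for i, value in enumerate(values):
        if value > threshold:
            if i == 0:
                raise ValueError("the first value already exceeds the threshold")
            return i - 1, i
    raise ValueError("no value exceeds the threshold")


threshold = max(ys) / 2                  # same as =MAX()/2 in the sheet
lo, hi = first_bracket(ys, threshold)
sheet_rows = (lo + 2, hi + 2)            # header in row 1, data from row 2
print(sheet_rows, (xs[lo], ys[lo]), (xs[hi], ys[hi]))
```

Like the AGGREGATE formula, this stops at the first crossing and ignores the later one.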
So I have two dataframes: one, stocks['OPK'], is a single entry in a dictionary of dataframes, and the other is just a simple pandas dataframe, df.
Here is the slice of df, df.loc['2010-01-04':, 'Open'], that I'm interested in comparing with the other dataframe.
Date Open
2010-01-04 1.80
2010-01-05 1.64
2010-01-06 1.90
2010-01-07 1.79
2010-01-08 1.92
2010-01-11 1.90
2010-01-12 1.89
2010-01-13 1.82
2010-01-14 1.84
2010-01-15 1.85
2010-01-19 1.77
This is the other dataframe stocks['OPK'].Open
2010-01-04 1.80
2010-01-05 1.64
2010-01-06 NaN
2010-01-07 1.79
2010-01-08 NaN
2010-01-11 1.90
2010-01-12 1.89
2010-01-13 1.82
2010-01-14 NaN
2010-01-15 1.85
2010-01-19 NaN
As you can see, the second dataframe has missing values.
Since both indexes are in datetime format, I want to compare stocks['OPK'].Open to df.loc['2010-01-04':, 'Open'] and fill in the missing values with the values from the first dataframe, df.
I can do a boolean filter with this code, but I don't know how to proceed from there:
stocks['OPK'].Open == df.loc['2010-01-04':, 'Open']
The problem with pd.merge and its respective options is that it seems to add extra columns. I just want to fill in the missing values (if there are any) by comparing against another dataframe that might have those values.
Thank you.
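You can use fillna(). Here df1 is the complete data and df2 is the one with missing values; the fill values are matched on the index:
df2 = df2.fillna(df1)
Another way, which may be faster, is combine_first:
df2 = df2.combine_first(df1)
Both will return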
Date Open
0 2010-01-04 1.80
1 2010-01-05 1.64
2 2010-01-06 1.90
3 2010-01-07 1.79
4 2010-01-08 1.92
5 2010-01-11 1.90
6 2010-01-12 1.89
7 2010-01-13 1.82
8 2010-01-14 1.84
9 2010-01-15 1.85
10 2010-01-19 1.77
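A small self-contained sketch of both calls, using the first five dates from the question. The variable names complete and gappy are stand-ins for df.loc['2010-01-04':, 'Open'] and stocks['OPK'].Open.

```python
import pandas as pd

dates = pd.to_datetime(
    ["2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07", "2010-01-08"]
)
nan = float("nan")

# Stand-in for df.loc['2010-01-04':, 'Open'] (no gaps).
complete = pd.Series([1.80, 1.64, 1.90, 1.79, 1.92], index=dates, name="Open")
# Stand-in for stocks['OPK'].Open (two missing values).
gappy = pd.Series([1.80, 1.64, nan, 1.79, nan], index=dates, name="Open")

# Both calls align on the datetime index and only touch the NaN entries.
filled = gappy.fillna(complete)
combined = gappy.combine_first(complete)

print(filled)
```

One difference to be aware of: combine_first returns the union of the two indexes, so dates present only in the complete series would also appear in its result, whereas fillna keeps the index of the series being filled.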
I want to select the rows of pandas Dataframe_1 whose value in the fourth column also appears anywhere in the first column of Dataframe_2, and copy those rows to a new table.
EDIT
Here I also include the dataframes:
Dataframe_1:
1 2 3 4
0
chr1 128611 128681 cuffs_1_128645 .
chr1 186868 186933 cuffs_2_186901 .
chr1 186978 187035 cuffs_3_187015 .
chr1 187054 187122 cuffs_4_187082 .
chr1 262712 262773 cuffs_5_262742 .
Dataframe_2:
1 2 3 4 5 6 7 8
0
cuffs_100001_101338862 1.24 3.11 1.86 11.19 5.59 8.08 0.62 0
cuffs_100004_101354225 2.49 0.62 1.86 1.86 2.49 1.24 0.00 0
cuffs_100045_101386584 14.92 14.92 3.11 10.57 5.59 15.54 0.62 0
cuffs_100089_101719129 2.49 0.62 1.86 5.59 1.86 1.86 0.00 0
cuffs_100111_101726996 6.84 0.00 3.73 3.11 6.84 2.49 0.62 0
Both dataframes are imported from .csv and are huge, so here I've put only a few rows and columns.
This is what I tried:
import pandas as pd
df1 = pd.DataFrame.from_csv(Dataframe_1, sep = '\t', index_col=list(range(0,1,2)), header = None)
df2 = pd.DataFrame.from_csv(Dataframe_2, sep = '\t', index_col=list(range(0,1,2)), header = None)
df1 = df1[df1[3] == df2[0]]
df1.to_csv(fileout, sep = '\t', header = False)
When I run this, I get eight (or so) lines of traceback referring to the pandas package files index.pyx and hashtable.pyx, which I don't understand.
Got it!
Apparently, neither of the filtering commands I tested, df1 = df1[df1[3].isin(df2[0])] or df1 = df1[df1[3] == df2[0]], recognises the "0" column, because it is used as the row index rather than as a regular column. The way out is to import Dataframe_2 using (1,2,3) instead of (0,1,2) in index_col, so that column 1 becomes the index instead of column 0. This leads to the following formatting of df2:
0 2 3 4 5 6 7 8
1
1.24 cuffs_100001_101338862 3.11 1.86 11.19 5.59 8.08 0.62 0
2.49 cuffs_100004_101354225 0.62 1.86 1.86 2.49 1.24 0.00 0
14.92 cuffs_100045_101386584 14.92 3.11 10.57 5.59 15.54 0.62 0
2.49 cuffs_100089_101719129 0.62 1.86 5.59 1.86 1.86 0.00 0
6.84 cuffs_100111_101726996 0.00 3.73 3.11 6.84 2.49 0.62 0
Here the "0" column is no longer the row index, so we can apply df1 = df1[df1[3].isin(df2[0])]. NOTE: df1 = df1[df1[3] == df2[0]] still fails, with the error message "Series lengths must match to compare", because == compares the two Series element by element.
Thanks!
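An alternative sketch, assuming the identifiers are left in df2's row index as in the original import: isin also accepts an Index, so the membership test can be made against df2.index directly. The two small frames below are made-up miniatures (the real sample rows do not overlap), built only to show the call.

```python
import pandas as pd

# Hypothetical miniature of Dataframe_1: column 0 is the index, column 3 holds the IDs.
df1 = pd.DataFrame(
    {
        1: [128611, 186868, 186978],
        2: [128681, 186933, 187035],
        3: ["cuffs_1_128645", "cuffs_2_186901", "cuffs_3_187015"],
        4: [".", ".", "."],
    },
    index=pd.Index(["chr1", "chr1", "chr1"], name=0),
)

# Hypothetical miniature of Dataframe_2: the IDs are the row index.
df2 = pd.DataFrame(
    {1: [1.24, 2.49], 2: [3.11, 0.62]},
    index=pd.Index(["cuffs_1_128645", "cuffs_3_187015"], name=0),
)

# Keep the rows of df1 whose ID appears anywhere in df2's index.
matched = df1[df1[3].isin(df2.index)]
print(matched)
```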
I have 3 Tables stored as named ranges.
The user picks which range to search through using a drop-down box. The named ranges are Table1, Table2 and Table3.
Table1
0.7 0.8 0.9
50 1.08 1.06 1.04
70 1.08 1.06 1.05
95 1.08 1.07 1.05
120 1.09 1.07 1.05
Table2
0.7 0.8 0.9
16 1.06 1.04 1.03
25 1.06 1.05 1.03
35 1.06 1.05 1.03
Table3
0.7 0.8 0.9
50 1.21 1.16 1.11
70 1.22 1.16 1.12
95 1.22 1.16 1.12
120 1.22 1.16 1.12
Then they pick a value from the header row, and a value from the first column.
For example, if the user picks Table3, 0.8 and 95, my formula should return 1.16.
I am halfway there using INDIRECT (e.g. on Table1); however, I need to extract the header row and the first column so I can use something like
=INDEX(INDIRECT(pickedtable),MATCH(picked colref,INDIRECT(pickedtable:1)), MATCH(picked rowref,INDIRECT(1:pickedtable)))
Any idea how to achieve this?
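INDIRECT(pickedtable) should work fine to get the table. To get the first column or first row of that table, wrap it in INDEX: INDEX(range,0,1) returns the whole first column and INDEX(range,1,0) the whole first row. Following your original approach (where pickedcolref is matched against the first column and pickedrowref against the header row), this formula should work: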
=INDEX(INDIRECT(pickedtable),MATCH(pickedcolref,INDEX(INDIRECT(pickedtable),0,1),0),MATCH(pickedrowref,INDEX(INDIRECT(pickedtable),1,0),0))
Or you can use HLOOKUP or VLOOKUP to shorten it, as per chris neilsen's approach, e.g. with VLOOKUP (the final FALSE forces an exact match):
=VLOOKUP(pickedcolref,INDIRECT(pickedtable),MATCH(pickedrowref,INDEX(INDIRECT(pickedtable),1,0),0),FALSE)
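Try this (here pickedtable holds 1, 2 or 3, pickedcolref is the header value and pickedrowref is the first-column value):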
=HLOOKUP(pickedcolref,
IF(pickedtable=1,Table1,IF(pickedtable=2,Table2,IF(pickedtable=3,Table3,""))),
MATCH(pickedrowref,
OFFSET(
IF(pickedtable=1,Table1,IF(pickedtable=2,Table2,IF(pickedtable=3,Table3,""))),
0,0,,1)
,0)
,FALSE)
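To make the lookup logic in the formulas above easier to follow, here is a rough Python sketch of the same two-way lookup: choose a table by name, find the column from the header row, and find the row from the first column. The dictionary layout and the lookup function are illustrative assumptions, not anything Excel provides.

```python
# Each table: a header row plus rows keyed by their first-column value.
TABLES = {
    "Table1": {
        "header": [0.7, 0.8, 0.9],
        "rows": {50: [1.08, 1.06, 1.04], 70: [1.08, 1.06, 1.05],
                 95: [1.08, 1.07, 1.05], 120: [1.09, 1.07, 1.05]},
    },
    "Table2": {
        "header": [0.7, 0.8, 0.9],
        "rows": {16: [1.06, 1.04, 1.03], 25: [1.06, 1.05, 1.03],
                 35: [1.06, 1.05, 1.03]},
    },
    "Table3": {
        "header": [0.7, 0.8, 0.9],
        "rows": {50: [1.21, 1.16, 1.11], 70: [1.22, 1.16, 1.12],
                 95: [1.22, 1.16, 1.12], 120: [1.22, 1.16, 1.12]},
    },
}


def lookup(table_name, header_value, first_column_value):
    """Return the cell at the given header value and first-column value."""
    table = TABLES[table_name]                       # like INDIRECT(pickedtable)
    column = table["header"].index(header_value)     # like MATCH on the header row
    return table["rows"][first_column_value][column]  # like MATCH on the first column


print(lookup("Table3", 0.8, 95))  # 1.16, the example from the question
```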