I have a DataFrame in Spark.
(I am working on the full dataset, which is too large to paste here, so here is the link.)
df = https://www.kaggle.com/schirmerchad/bostonhoustingmlnd?select=housing.csv
I found the outliers as follows (22 rows in total):
def IQR(df, column):
    # relativeError=0 asks approxQuantile for exact quantiles
    quantiles = df.approxQuantile(column, [0.25, 0.75], 0)
    q1 = quantiles[0]
    q3 = quantiles[1]
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return (lower, upper)
lower, upper = IQR(df, 'RM')
# lower, upper = (4.8374999999999995, 7.617500000000001)
outliers = df.filter((df['RM'] > upper) | (df['RM'] < lower))
These are the detected outliers:
RM LSTAT PTRATIO MEDV
8.069 4.21 18 812700
7.82 3.57 18 919800
7.765 7.56 17.8 835800
7.853 3.81 14.7 1018500
8.266 4.14 17.4 940800
8.04 3.13 17.4 789600
7.686 3.92 17.4 980700
8.337 2.47 17.4 875700
8.247 3.95 17.4 1014300
8.259 3.54 19.1 898800
8.398 5.91 13 1024800
7.691 6.58 18.6 739200
7.82 3.76 14.9 953400
7.645 3.01 14.9 966000
3.561 7.12 20.2 577500
3.863 13.33 20.2 485100
4.138 37.97 20.2 289800
4.368 30.63 20.2 184800
4.652 28.28 20.2 220500
4.138 23.34 20.2 249900
4.628 34.37 20.2 375900
4.519 36.98 20.2 147000
Now I want to replace the outliers with ML-predicted values. After the ML step I got the following predictions:
RM LSTAT PTRATIO MEDV column_assem column prediction
8.069 4.21 18 812700 {"vectorType":"dense","length":3,"values":[4.21,18,812700]} {"vectorType":"dense","length":3,"values":[812699.9991344779,32.9872628621034,25.697942748362507]} 7.138307692307692
7.82 3.57 18 919800 {"vectorType":"dense","length":3,"values":[3.57,18,919800]} {"vectorType":"dense","length":3,"values":[919799.999082192,36.25675952004636,26.656936598060938]} 7.138307692307692
7.765 7.56 17.8 835800 {"vectorType":"dense","length":3,"values":[7.56,17.8,835800]} {"vectorType":"dense","length":3,"values":[835799.9989959698,37.18609141885786,25.87518521779868]} 7.138307692307692
7.853 3.81 14.7 1018500 {"vectorType":"dense","length":3,"values":[3.81,14.7,1018500]} {"vectorType":"dense","length":3,"values":[1018499.9990279829,40.25963007114179,24.285126110831364]} 7.138307692307692
8.266 4.14 17.4 940800 {"vectorType":"dense","length":3,"values":[4.14,17.4,940800]} {"vectorType":"dense","length":3,"values":[940799.9990507461,37.621770135316275,26.279618209844216]} 7.138307692307692
8.04 3.13 17.4 789600 {"vectorType":"dense","length":3,"values":[3.13,17.4,789600]} {"vectorType":"dense","length":3,"values":[789599.999195178,31.094759131505864,24.832393813608636]} 7.138307692307692
7.686 3.92 17.4 980700 {"vectorType":"dense","length":3,"values":[3.92,17.4,980700]} {"vectorType":"dense","length":3,"values":[980699.9990305867,38.858227336579965,26.637789595102927]} 7.138307692307692
8.337 2.47 17.4 875700 {"vectorType":"dense","length":3,"values":[2.47,17.4,875700]} {"vectorType":"dense","length":3,"values":[875699.9991585133,33.577861049146954,25.59625197564997]} 7.138307692307692
8.247 3.95 17.4 1014300 {"vectorType":"dense","length":3,"values":[3.95,17.4,1014300]} {"vectorType":"dense","length":3,"values":[1014299.9990056665,40.11446130241714,26.949909126197]} 7.138307692307692
8.259 3.54 19.1 898800 {"vectorType":"dense","length":3,"values":[3.54,19.1,898800]} {"vectorType":"dense","length":3,"values":[898799.9990899825,35.406713649671325,27.56000332051734]} 7.138307692307692
8.398 5.91 13 1024800 {"vectorType":"dense","length":3,"values":[5.91,13,1024800]} {"vectorType":"dense","length":3,"values":[1024799.9989586923,42.669988999612016,22.74784587477886]} 7.138307692307692
7.691 6.58 18.6 739200 {"vectorType":"dense","length":3,"values":[6.58,18.6,739200]} {"vectorType":"dense","length":3,"values":[739199.9990946348,32.64270527156902,25.73328780757773]} 7.138307692307692
7.82 3.76 14.9 953400 {"vectorType":"dense","length":3,"values":[3.76,14.9,953400]} {"vectorType":"dense","length":3,"values":[953399.9990744753,37.82403517229104,23.880552758747136]} 7.138307692307692
7.645 3.01 14.9 966000 {"vectorType":"dense","length":3,"values":[3.01,14.9,966000]} {"vectorType":"dense","length":3,"values":[965999.9990932231,37.53477931241747,23.960460322415766]} 7.138307692307692
3.561 7.12 20.2 577500 {"vectorType":"dense","length":3,"values":[7.12,20.2,577500]} {"vectorType":"dense","length":3,"values":[577499.9991773808,27.20258411502299,25.862694427868608]} 6.376732394366198
3.863 13.33 20.2 485100 {"vectorType":"dense","length":3,"values":[13.33,20.2,485100]} {"vectorType":"dense","length":3,"values":[485099.999013695,30.032948373359417,25.311342678468208]} 6.043858108108108
4.138 37.97 20.2 289800 {"vectorType":"dense","length":3,"values":[37.97,20.2,289800]} {"vectorType":"dense","length":3,"values":[289799.99824280146,47.51591753902686,24.707706732637366]} 5.2370714285714275
4.368 30.63 20.2 184800 {"vectorType":"dense","length":3,"values":[30.63,20.2,184800]} {"vectorType":"dense","length":3,"values":[184799.99858809082,36.35256433967503,23.378827944979733]} 5.2370714285714275
4.652 28.28 20.2 220500 {"vectorType":"dense","length":3,"values":[28.28,20.2,220500]} {"vectorType":"dense","length":3,"values":[220499.9986495131,35.3082739723793,23.59425617851294]} 5.2370714285714275
4.138 23.34 20.2 249900 {"vectorType":"dense","length":3,"values":[23.34,20.2,249900]} {"vectorType":"dense","length":3,"values":[249899.99881098093,31.44714189260281,23.625084354536643]} 6.043858108108108
4.628 34.37 20.2 375900 {"vectorType":"dense","length":3,"values":[34.37,20.2,375900]} {"vectorType":"dense","length":3,"values":[375899.9983146336,47.06252004732307,25.328138233469573]} 5.2370714285714275
4.519 36.98 20.2 147000 {"vectorType":"dense","length":3,"values":[36.98,20.2,147000]} {"vectorType":"dense","length":3,"values":[146999.99838054206,41.31545014321207,23.33912202640834]} 5.2370714285714275
For a single value I know I can use lit() to replace it, but when there are multiple values, how do I replace each original value with its prediction?
Assuming that the original dataframe is called df and the machine-learning transformed dataframe is called ml, you can do a left join and replace the RM column with the prediction value wherever the row satisfies the outlier condition:
import pyspark.sql.functions as F

df2 = df.join(ml, df.columns, 'left').withColumn(
    'RM',
    F.when(
        (F.col('RM') > upper) | (F.col('RM') < lower),
        F.col('prediction')
    ).otherwise(F.col('RM'))
).select(df.columns)
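To make the when/otherwise logic easier to follow without a Spark session, here is a minimal plain-Python sketch of the same rule applied to a list of dicts. The function name `replace_outliers` and the third row are illustrative assumptions, not part of the original data; the sketch also assumes ml contains only the outlier rows, so that inliers come out of the left join with no prediction.

```python
# Fences reported in the question for the RM column.
LOWER, UPPER = 4.8374999999999995, 7.617500000000001


def replace_outliers(rows, lower=LOWER, upper=UPPER):
    """Return copies of the rows with out-of-fence RM values replaced by 'prediction'."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        if row["RM"] > upper or row["RM"] < lower:
            new_row["RM"] = row["prediction"]
        # Mirror .select(df.columns): drop the helper column again.
        new_row.pop("prediction", None)
        cleaned.append(new_row)
    return cleaned


rows = [
    # Two outliers from the question, with their predicted RM values.
    {"RM": 8.069, "LSTAT": 4.21, "PTRATIO": 18.0, "MEDV": 812700,
     "prediction": 7.138307692307692},
    {"RM": 3.561, "LSTAT": 7.12, "PTRATIO": 20.2, "MEDV": 577500,
     "prediction": 6.376732394366198},
    # Made-up inlier (hypothetical values): unmatched in the left join,
    # so its prediction is None and RM must stay untouched.
    {"RM": 6.5, "LSTAT": 5.0, "PTRATIO": 15.0, "MEDV": 500000,
     "prediction": None},
]

cleaned = replace_outliers(rows)
for row in cleaned:
    print(row)
```

The inlier row shows why the otherwise() branch matters: rows inside the fences have no prediction after the join, so they must keep their original RM.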
B x(cm)
24.5 4.2
25.5 4.5
26.5 5.0
27.5 5.4
28.5 5.9
29.5 6.6
30.5 7.2
31.5 7.9
32.5 8.6
33.5 9.3
34.5 10.0
35.5 10.5
36.5 10.9
37.5 11.1
38.5 11.1
39.5 10.8
40.5 10.3
41.5 9.8
42.5 9.2
43.5 8.4
44.5 7.7
45.5 7.1
46.5 6.4
47.5 5.9
48.5 5.4
49.5 5.0
50.5 4.6
51.5 4.2
This is my data.
I want to fit it to the equation y(x) = a/(b**2 + x**2)**3/2, but the fitted value of b comes out negative. How can I shift the origin of the graph to get the right result?
A few things:
Are you sure the function is f(x) = a/(b**2 + x**2)**3/2 and not f(x) = a/(b**2 + x**2)**(3/2)? Mind the parentheses around (3/2).
gnuplot uses integer division (a common pitfall that leads to unexpected results), so (3/2) evaluates to 1 instead of the expected 1.5. Write 3./2 instead.
Why not let gnuplot find the offset? Just introduce a variable c that accounts for the x-offset and include it in the fit.
Depending on your model, i.e. if the exponent is variable, you could also add a variable d for the exponent and let the gnuplot fitting algorithm find it.
Sometimes it helps the fit if you provide good starting values.
Then you have to judge whether the fitted values make sense, e.g. b<0 or d=0.794 ...
Code:
### fitting with finding x-offset automatically
reset session
$Data <<EOD
B x(cm)
24.5 4.2
25.5 4.5
26.5 5.0
27.5 5.4
28.5 5.9
29.5 6.6
30.5 7.2
31.5 7.9
32.5 8.6
33.5 9.3
34.5 10.0
35.5 10.5
36.5 10.9
37.5 11.1
38.5 11.1
39.5 10.8
40.5 10.3
41.5 9.8
42.5 9.2
43.5 8.4
44.5 7.7
45.5 7.1
46.5 6.4
47.5 5.9
48.5 5.4
49.5 5.0
50.5 4.6
51.5 4.2
EOD
f1(x) = a1/(b1**2 + (x-c1)**2)**(3/2)
f2(x) = a2/(b2**2 + (x-c2)**2)**(3./2)
f3(x) = a3/(b3**2 + (x-c3)**2)**d3
set fit quiet nolog
fit f1(x) $Data u 1:2 via a1,b1,c1
fit f2(x) $Data u 1:2 via a2,b2,c2
a3=11; b3=1; c3=40; d3=1.5 # sometimes it's better to help the fitting with some good starting values
fit f3(x) $Data u 1:2 via a3,b3,c3,d3
print sprintf("% 9s% 9s% 9s% 9s","a","b","c","d")
print sprintf("%9.3g %9.3g %9.3g",a1,b1,c1)
print sprintf("%9.3g %9.3g %9.3g",a2,b2,c2)
print sprintf("%9.3g %9.3g %9.3g %9.3g",a3,b3,c3,d3)
plot $Data u 1:2 w p pt 7,\
f1(x) w l lc "red",\
f2(x) w l lc "web-green", \
f3(x) w l lc "web-blue"
### end of code
Result:
a b c d
1.17e+03 10.3 37.9
2.73e+04 -13.6 37.9
343 8.66 37.9 0.794
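Two of the points above can be checked outside gnuplot. The following Python sketch is only an illustration: the function name `model` is an assumption, and Python's `//` operator stands in for gnuplot's integer division. It shows that 3/2 with integer division gives 1, and that b enters the model only as b**2, so the b = -13.6 from the second fit describes the same curve as b = +13.6.

```python
def model(x, a, b, c=0.0, exponent=1.5):
    """a / (b**2 + (x - c)**2)**exponent, with c as the x-offset."""
    return a / (b**2 + (x - c) ** 2) ** exponent


# gnuplot evaluates 3/2 with integer arithmetic; Python's // mimics that.
integer_exponent = 3 // 2
float_exponent = 3 / 2
print("integer division:", integer_exponent, "float division:", float_exponent)

# Rounded parameters printed by the second fit (f2) above.
a2, b2, c2 = 2.73e4, -13.6, 37.9

# The sign of b cannot matter because only b**2 appears in the model.
y_negative_b = model(38.5, a2, b2, c2)
y_positive_b = model(38.5, a2, abs(b2), c2)
print(y_negative_b, y_positive_b)
```

So a negative b is not necessarily a sign of a wrong origin here; what does change the curve is the exponent (1 versus 1.5) and the offset c.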
I'm not sure, but I think the equation that you're trying to fit to may be inappropriate for the data. Perhaps you could rewrite your equation such that it's clearer.
Here's an example using the quadratic equation y(x) = a*x**2 + b*x + c to fit:
test.dat
24.5 4.2
25.5 4.5
26.5 5.0
27.5 5.4
28.5 5.9
29.5 6.6
30.5 7.2
31.5 7.9
32.5 8.6
33.5 9.3
34.5 10.0
35.5 10.5
36.5 10.9
37.5 11.1
38.5 11.1
39.5 10.8
40.5 10.3
41.5 9.8
42.5 9.2
43.5 8.4
44.5 7.7
45.5 7.1
46.5 6.4
47.5 5.9
48.5 5.4
49.5 5.0
50.5 4.6
51.5 4.2
quad_fit.gp
set term pos col
set out 'xy_fit.ps'
set title 'Quadratic Regression Example Scatterplot'
set ylabel 'Y'
set xlabel 'X'
set style line 1 ps 1.5 pt 7 lc 'red'
set style line 2 lw 1.5 lc 'blue'
set grid
f(x) = a*(x**2) + b*x + c
fit f(x) 'test.dat' using 1:2 via a, b, c
p 'test.dat' ls 1 t 'Datapoints', f(x) ls 2 t 'Quadratic Regression'
set out
Running gnuplot quad_fit.gp writes the fitted plot to xy_fit.ps.
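As a cross-check of the quadratic fit that does not depend on gnuplot, here is a hedged pure-Python sketch that solves the least-squares normal equations for y = a*x**2 + b*x + c on the same data. The helper names `solve_linear` and `fit_quadratic` are assumptions made for this example, and centring x on its mean is a design choice to keep the small linear system well conditioned, not something gnuplot's fit is known to do.

```python
def solve_linear(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for i in reversed(range(n)):
        partial = sum(aug[i][k] * solution[k] for k in range(i + 1, n))
        solution[i] = (aug[i][n] - partial) / aug[i][i]
    return solution


def fit_quadratic(xs, ys):
    """Least-squares coefficients (a, b, c) of y = a*x**2 + b*x + c."""
    mean_x = sum(xs) / len(xs)
    us = [x - mean_x for x in xs]  # centred x values
    s = [sum(u**k for u in us) for k in range(5)]  # s[k] = sum of u**k
    t = [sum(y * u**k for u, y in zip(us, ys)) for k in range(3)]
    big_a, big_b, big_c = solve_linear(
        [[s[4], s[3], s[2]], [s[3], s[2], s[1]], [s[2], s[1], s[0]]],
        [t[2], t[1], t[0]],
    )
    # Expand A*(x - m)**2 + B*(x - m) + C back into powers of x.
    a = big_a
    b = big_b - 2 * big_a * mean_x
    c = big_c - big_b * mean_x + big_a * mean_x**2
    return a, b, c


# Same data as test.dat: x runs from 24.5 to 51.5 in steps of 1.
xs = [24.5 + i for i in range(28)]
ys = [4.2, 4.5, 5.0, 5.4, 5.9, 6.6, 7.2, 7.9, 8.6, 9.3, 10.0, 10.5, 10.9, 11.1,
      11.1, 10.8, 10.3, 9.8, 9.2, 8.4, 7.7, 7.1, 6.4, 5.9, 5.4, 5.0, 4.6, 4.2]

a, b, c = fit_quadratic(xs, ys)
vertex = -b / (2 * a)
print("a, b, c =", a, b, c)
print("vertex of the parabola at x =", vertex)
```

The vertex of the fitted parabola gives a rough estimate of where the peak sits, which can serve as a starting value for an offset parameter in a more physical model.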
This code is used to display a logo in a ReactJS project, but its text does not make any sense to me. What are all the numbers?
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 841.9 595.3"><g fill="#61DAFB"><path d="M666.3 296.5c0-32.5-40.7-63.3-103.1-82.4 14.4-63.6 8-114.2-20.2-130.4-6.5-3.8-14.1-5.6-22.4-5.6v22.3c4.6 0 8.3.9 11.4 2.6 13.6 7.8 19.5 37.5 14.9 75.7-1.1 9.4-2.9 19.3-5.1 29.4-19.6-4.8-41-8.5-63.5-10.9-13.5-18.5-27.5-35.3-41.6-50 32.6-30.3 63.2-46.9 84-46.9V78c-27.5 0-63.5 19.6-99.9 53.6-36.4-33.8-72.4-53.2-99.9-53.2v22.3c20.7 0 51.4 16.5 84 46.6-14 14.7-28 31.4-41.3 49.9-22.6 2.4-44 6.1-63.6 11-2.3-10-4-19.7-5.2-29-4.7-38.2 1.1-67.9 14.6-75.8 3-1.8 6.9-2.6 11.5-2.6V78.5c-8.4 0-16 1.8-22.6 5.6-28.1 16.2-34.4 66.7-19.9 130.1-62.2 19.2-102.7 49.9-102.7 82.3 0 32.5 40.7 63.3 103.1 82.4-14.4 63.6-8 114.2 20.2 130.4 6.5 3.8 14.1 5.6 22.5 5.6 27.5 0 63.5-19.6 99.9-53.6 36.4 33.8 72.4 53.2 99.9 53.2 8.4 0 16-1.8 22.6-5.6 28.1-16.2 34.4-66.7 19.9-130.1 62-19.1 102.5-49.9 102.5-82.3zm-130.2-66.7c-3.7 12.9-8.3 26.2-13.5 39.5-4.1-8-8.4-16-13.1-24-4.6-8-9.5-15.8-14.4-23.4 14.2 2.1 27.9 4.7 41 7.9zm-45.8 106.5c-7.8 13.5-15.8 26.3-24.1 38.2-14.9 1.3-30 2-45.2 2-15.1 0-30.2-.7-45-1.9-8.3-11.9-16.4-24.6-24.2-38-7.6-13.1-14.5-26.4-20.8-39.8 6.2-13.4 13.2-26.8 20.7-39.9 7.8-13.5 15.8-26.3 24.1-38.2 14.9-1.3 30-2 45.2-2 15.1 0 30.2.7 45 1.9 8.3 11.9 16.4 24.6 24.2 38 7.6 13.1 14.5 26.4 20.8 39.8-6.3 13.4-13.2 26.8-20.7 39.9zm32.3-13c5.4 13.4 10 26.8 13.8 39.8-13.1 3.2-26.9 5.9-41.2 8 4.9-7.7 9.8-15.6 14.4-23.7 4.6-8 8.9-16.1 13-24.1zM421.2 430c-9.3-9.6-18.6-20.3-27.8-32 9 .4 18.2.7 27.5.7 9.4 0 18.7-.2 27.8-.7-9 11.7-18.3 22.4-27.5 32zm-74.4-58.9c-14.2-2.1-27.9-4.7-41-7.9 3.7-12.9 8.3-26.2 13.5-39.5 4.1 8 8.4 16 13.1 24 4.7 8 9.5 15.8 14.4 23.4zM420.7 163c9.3 9.6 18.6 20.3 27.8 32-9-.4-18.2-.7-27.5-.7-9.4 0-18.7.2-27.8.7 9-11.7 18.3-22.4 27.5-32zm-74 58.9c-4.9 7.7-9.8 15.6-14.4 23.7-4.6 8-8.9 16-13 24-5.4-13.4-10-26.8-13.8-39.8 13.1-3.1 26.9-5.8 41.2-7.9zm-90.5 125.2c-35.4-15.1-58.3-34.9-58.3-50.6 0-15.7 22.9-35.6 58.3-50.6 8.6-3.7 18-7 27.7-10.1 5.7 19.6 13.2 40 22.5 60.9-9.2 20.8-16.6 
41.1-22.2 60.6-9.9-3.1-19.3-6.5-28-10.2zM310 490c-13.6-7.8-19.5-37.5-14.9-75.7 1.1-9.4 2.9-19.3 5.1-29.4 19.6 4.8 41 8.5 63.5 10.9 13.5 18.5 27.5 35.3 41.6 50-32.6 30.3-63.2 46.9-84 46.9-4.5-.1-8.3-1-11.3-2.7zm237.2-76.2c4.7 38.2-1.1 67.9-14.6 75.8-3 1.8-6.9 2.6-11.5 2.6-20.7 0-51.4-16.5-84-46.6 14-14.7 28-31.4 41.3-49.9 22.6-2.4 44-6.1 63.6-11 2.3 10.1 4.1 19.8 5.2 29.1zm38.5-66.7c-8.6 3.7-18 7-27.7 10.1-5.7-19.6-13.2-40-22.5-60.9 9.2-20.8 16.6-41.1 22.2-60.6 9.9 3.1 19.3 6.5 28.1 10.2 35.4 15.1 58.3 34.9 58.3 50.6-.1 15.7-23 35.6-58.4 50.6zM320.8 78.4z"/><circle cx="420.9" cy="296.5" r="45.7"/><path d="M520.5 78.1z"/></g></svg>
See the SVG documentation on paths:
A path is described using the concept of a current point. In an
analogy with drawing on paper, the current point can be thought of as
the location of the pen. The position of the pen can be changed, and
the outline of a shape (open or closed) can be traced by dragging the
pen in either straight lines or curves.
Paths represent the geometry of the outline of an object, defined in
terms of moveto (set a new current point), lineto (draw a straight
line), curveto (draw a curve using a cubic Bézier), arc (elliptical or
circular arc) and closepath (close the current shape by connecting to
the last moveto) commands. Compound paths (i.e., a path with multiple
subpaths) are possible to allow effects such as "donut holes" in
objects.
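To see how the quoted description maps onto the raw text, here is a small Python sketch that splits path data into command letters and their numeric arguments. The regular expression and the name `parse_path` are assumptions written for this illustration, and the sample string is a spliced, shortened excerpt of the logo's path rather than the full data. In SVG path syntax an uppercase letter means absolute coordinates and a lowercase letter means coordinates relative to the current point: M/m is moveto, c is a cubic Bézier curveto, v is a vertical lineto, and z closes the shape.

```python
import re

# A command letter, or a number (numbers may be packed together, e.g. "8.3.9").
TOKEN = re.compile(
    r"[MmZzLlHhVvCcSsQqTtAa]|[-+]?(?:\d*\.\d+|\d+\.?)(?:[eE][-+]?\d+)?"
)


def parse_path(d):
    """Split SVG path data into (command, [numbers]) pairs."""
    commands = []
    for token in TOKEN.findall(d):
        if token.isalpha():
            commands.append((token, []))
        else:
            commands[-1][1].append(float(token))
    return commands


# Spliced excerpt from the start of the logo's first <path>, closed with z.
sample = "M666.3 296.5c0-32.5-40.7-63.3-103.1-82.4v22.3c4.6 0 8.3.9 11.4 2.6z"

parsed = parse_path(sample)
for command, numbers in parsed:
    print(command, numbers)
```

Note how `8.3.9` is read as the two numbers 8.3 and .9, and how a minus sign doubles as a separator; this compact notation is why the path looks like an unreadable run of digits.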
That's an SVG file. Those numbers define the path of the drawing (like directions for a pencil on paper).
If you render it, you'll be able to see the icon:
render: function() {
    return (
        <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 841.9 595.3">
            {/* (...) your svg contents here */}
        </svg>
    );
}
Reading this answer, you could also render it like:
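// The component name must start with a capital letter,
// otherwise JSX treats <logosvg/> as a plain HTML tag.
import { ReactComponent as LogoSvg } from './logo.svg';
const App = () => (
    <div>
        <LogoSvg />
    </div>
);
This would also be a lot cleaner, since you don't need to include the entire contents of the SVG file in your code.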
Regardless of how you render it, the result is this:
So all those numbers inside the path are the ones defining the symbol (guiding the pencil).
Besides the numbers inside <path>, the other elements are also setting its properties. For example, <circle cx="420.9" cy="296.5" r="45.7"/> is the one responsible for drawing the inner circle.
Remove that element and you lose the nucleus (so the logo is not as cool as before):
I'm trying to practice BeautifulSoup and urlopen using Basketball-Reference datasets. When I try to get an individual player's stats, everything works fine, but when I use the same code for team stats, urlopen apparently isn't finding the right table.
The following code is to get the "headers" from the page.
from urllib.request import urlopen
from bs4 import BeautifulSoup

def fetch_years():
    # Determine the URL
    url = "https://www.basketball-reference.com/leagues/NBA_2000.html?sr&utm_source=direct&utm_medium=Share&utm_campaign=ShareTool#team-stats-per_game::none"
    html = urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    # Header cells of the first table row found on the page
    headers = [th.get_text() for th in soup.find_all('tr')[0].find_all('th')]
    headers = headers[1:]
    print(headers)
I'm trying to get the Team's stats per game data, in a format like:
['Tm', 'G', 'MP', 'FG', ...]
Instead, the header data I'm getting is:
['W', 'L', 'W/L%', ...]
which is the very first table in the 1999-2000 season information about the teams (under the name 'Division Standings').
If you use that same code for a player's data such as this one, you get the result I'm looking for:
Age Tm Lg Pos G GS MP FG ... DRB TRB AST STL BLK TOV PF PTS
0 20 OKC NBA PG 82 65 32.5 5.3 ... 2.7 4.9 5.3 1.3 0.2 3.3 2.3 15.3
1 21 OKC NBA PG 82 82 34.3 5.9 ... 3.1 4.9 8.0 1.3 0.4 3.3 2.5 16.1
2 22 OKC NBA PG 82 82 34.7 7.5 ... 3.1 4.6 8.2 1.9 0.4 3.9 2.5 21.9
3 23 OKC NBA PG 66 66 35.3 8.8 ... 3.1 4.6 5.5 1.7 0.3 3.6 2.2 23.6
4 24 OKC NBA PG 82 82 34.9 8.2 ... 3.9 5.2 7.4 1.8 0.3 3.3 2.3 23.2
The code to webscrape came originally from here.
The sports-reference.com sites are trickier than your standard ones. Most of the tables are rendered after the page loads (with the exception of a few tables on each page), so you'd need to use Selenium to let the page render first, then pull the HTML source.
However, if you look at the HTML source, you'll see those tables sit inside comments. So the other option is to use BeautifulSoup to pull out the comment nodes, then search through those for the table tags.
This will return a list of dataframes, and the Team Per Game stats are the table at index position 1:
from io import StringIO

import requests
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd

def fetch_years():
    # Determine the URL
    url = "https://www.basketball-reference.com/leagues/NBA_2000.html?sr&utm_source=direct&utm_medium=Share&utm_campaign=ShareTool#team-stats-per_game::none"
    html = requests.get(url)
    soup = BeautifulSoup(html.text, "html.parser")
    # The stats tables are embedded in HTML comments
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    tables = []
    for each in comments:
        if 'table' in each:
            try:
                # StringIO avoids the literal-HTML deprecation warning in recent pandas
                tables.append(pd.read_html(StringIO(each))[0])
            except ValueError:
                # No parsable table in this comment
                continue
    return tables

tables = fetch_years()
Output:
print (tables[1].to_string())
Rk Team G MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 1.0 Sacramento Kings* 82 241.5 40.0 88.9 0.450 6.5 20.2 0.322 33.4 68.7 0.487 18.5 24.6 0.754 12.9 32.1 45.0 23.8 9.6 4.6 16.2 21.1 105.0
1 2.0 Detroit Pistons* 82 241.8 37.1 80.9 0.459 5.4 14.9 0.359 31.8 66.0 0.481 23.9 30.6 0.781 11.2 30.0 41.2 20.8 8.1 3.3 15.7 24.5 103.5
2 3.0 Dallas Mavericks 82 240.6 39.0 85.9 0.453 6.3 16.2 0.391 32.6 69.8 0.468 17.2 21.4 0.804 11.4 29.8 41.2 22.1 7.2 5.1 13.7 21.6 101.4
3 4.0 Indiana Pacers* 82 240.6 37.2 81.0 0.459 7.1 18.1 0.392 30.0 62.8 0.478 19.9 24.5 0.811 10.3 31.9 42.1 22.6 6.8 5.1 14.1 21.8 101.3
4 5.0 Milwaukee Bucks* 82 242.1 38.7 83.3 0.465 4.8 13.0 0.369 33.9 70.2 0.483 19.0 24.2 0.786 12.4 28.9 41.3 22.6 8.2 4.6 15.0 24.6 101.2
5 6.0 Los Angeles Lakers* 82 241.5 38.3 83.4 0.459 4.2 12.8 0.329 34.1 70.6 0.482 20.1 28.9 0.696 13.6 33.4 47.0 23.4 7.5 6.5 13.9 22.5 100.8
6 7.0 Orlando Magic 82 240.9 38.6 85.5 0.452 3.6 10.6 0.338 35.1 74.9 0.468 19.2 26.1 0.735 14.0 31.0 44.9 20.8 9.1 5.7 17.6 24.0 100.1
7 8.0 Houston Rockets 82 241.8 36.6 81.3 0.450 7.1 19.8 0.358 29.5 61.5 0.480 19.2 26.2 0.733 12.3 31.5 43.8 21.6 7.5 5.3 17.4 20.3 99.5
8 9.0 Boston Celtics 82 240.6 37.2 83.9 0.444 5.1 15.4 0.331 32.2 68.5 0.469 19.8 26.5 0.745 13.5 29.5 43.0 21.2 9.7 3.5 15.4 27.1 99.3
9 10.0 Seattle SuperSonics* 82 241.2 37.9 84.7 0.447 6.7 19.6 0.339 31.2 65.1 0.480 16.6 23.9 0.695 12.7 30.3 43.0 22.9 8.0 4.2 14.0 21.7 99.1
10 11.0 Denver Nuggets 82 242.1 37.3 84.3 0.442 5.7 17.0 0.336 31.5 67.2 0.469 18.7 25.8 0.724 13.1 31.6 44.7 23.3 6.8 7.5 15.6 23.9 99.0
11 12.0 Phoenix Suns* 82 241.5 37.7 82.6 0.457 5.6 15.2 0.368 32.1 67.4 0.477 17.9 23.6 0.759 12.5 31.2 43.7 25.6 9.1 5.3 16.7 24.1 98.9
12 13.0 Minnesota Timberwolves* 82 242.7 39.3 84.3 0.467 3.0 8.7 0.346 36.3 75.5 0.481 16.8 21.6 0.780 12.4 30.1 42.5 26.9 7.6 5.4 13.9 23.3 98.5
13 14.0 Charlotte Hornets* 82 241.2 35.8 79.7 0.449 4.1 12.2 0.339 31.7 67.5 0.469 22.7 30.0 0.758 10.8 32.1 42.9 24.7 8.9 5.9 14.7 20.4 98.4
14 15.0 New Jersey Nets 82 241.8 36.3 83.9 0.433 5.8 16.8 0.347 30.5 67.2 0.454 19.5 24.9 0.784 12.7 28.2 40.9 20.6 8.8 4.8 13.6 23.3 98.0
15 16.0 Portland Trail Blazers* 82 241.2 36.8 78.4 0.470 5.0 13.8 0.361 31.9 64.7 0.493 18.8 24.7 0.760 11.8 31.2 43.0 23.5 7.7 4.8 15.2 22.7 97.5
16 17.0 Toronto Raptors* 82 240.9 36.3 83.9 0.433 5.2 14.3 0.363 31.2 69.6 0.447 19.3 25.2 0.765 13.4 29.9 43.3 23.7 8.1 6.6 13.9 24.3 97.2
17 18.0 Cleveland Cavaliers 82 242.1 36.3 82.1 0.442 4.2 11.2 0.373 32.1 70.9 0.453 20.2 26.9 0.750 12.3 30.5 42.8 23.7 8.7 4.4 17.4 27.1 97.0
18 19.0 Washington Wizards 82 241.5 36.7 81.5 0.451 4.1 10.9 0.376 32.6 70.6 0.462 19.1 25.7 0.743 13.0 29.7 42.7 21.6 7.2 4.7 16.1 26.2 96.6
19 20.0 Utah Jazz* 82 240.9 36.1 77.8 0.464 4.0 10.4 0.385 32.1 67.4 0.476 20.3 26.2 0.773 11.4 29.6 41.0 24.9 7.7 5.4 14.9 24.5 96.5
20 21.0 San Antonio Spurs* 82 242.1 36.0 78.0 0.462 4.0 10.8 0.374 32.0 67.2 0.476 20.1 27.0 0.746 11.3 32.5 43.8 22.2 7.5 6.7 15.0 20.9 96.2
21 22.0 Golden State Warriors 82 240.9 36.5 87.1 0.420 4.2 13.0 0.323 32.3 74.0 0.437 18.3 26.2 0.697 15.9 29.7 45.6 22.6 8.9 4.3 15.9 24.9 95.5
22 23.0 Philadelphia 76ers* 82 241.8 36.5 82.6 0.442 2.5 7.8 0.323 34.0 74.8 0.454 19.2 27.1 0.708 14.0 30.1 44.1 22.2 9.6 4.7 15.7 23.6 94.8
23 24.0 Miami Heat* 82 241.8 36.3 78.8 0.460 5.4 14.7 0.371 30.8 64.1 0.481 16.4 22.3 0.736 11.2 31.9 43.2 23.5 7.1 6.4 15.0 23.7 94.4
24 25.0 Atlanta Hawks 82 241.8 36.6 83.0 0.441 3.1 9.9 0.317 33.4 73.1 0.458 18.0 24.2 0.743 14.0 31.3 45.3 18.9 6.1 5.6 15.4 21.0 94.3
25 26.0 Vancouver Grizzlies 82 242.1 35.3 78.5 0.449 4.0 11.0 0.361 31.3 67.6 0.463 19.4 25.1 0.774 12.3 28.3 40.6 20.7 7.4 4.2 16.8 22.9 93.9
26 27.0 New York Knicks* 82 241.8 35.3 77.7 0.455 4.3 11.4 0.375 31.0 66.3 0.468 17.2 22.0 0.781 9.8 30.7 40.5 19.4 6.3 4.3 14.6 24.2 92.1
27 28.0 Los Angeles Clippers 82 240.3 35.1 82.4 0.426 5.2 15.5 0.339 29.9 67.0 0.446 16.6 22.3 0.746 11.6 29.0 40.6 18.0 7.0 6.0 16.2 22.2 92.0
28 29.0 Chicago Bulls 82 241.5 31.3 75.4 0.415 4.1 12.6 0.329 27.1 62.8 0.432 18.1 25.5 0.709 12.6 28.3 40.9 20.1 7.9 4.7 19.0 23.3 84.8
29 NaN League Average 82 241.5 36.8 82.1 0.449 4.8 13.7 0.353 32.0 68.4 0.468 19.0 25.3 0.750 12.4 30.5 42.9 22.3 7.9 5.2 15.5 23.3 97.5
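The comment trick can be demonstrated offline with the standard library alone. The sketch below is a simplified illustration, not a replacement for the BeautifulSoup and pandas code above: the class names, the function `hidden_table_headers`, and the sample HTML are all made up for this example. It shows why a plain tag search only sees the first (visible) table, while the per-game table becomes reachable once the comment text is parsed as HTML in its own right.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the raw text of every HTML comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


class HeaderCollector(HTMLParser):
    """Collect the text of every <th> cell that is real markup."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self._in_th = False

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._in_th = True
            self.headers.append("")

    def handle_endtag(self, tag):
        if tag == "th":
            self._in_th = False

    def handle_data(self, data):
        if self._in_th:
            self.headers[-1] += data


# Hypothetical page: one visible table and one table hidden in a comment.
SAMPLE_HTML = """
<table id="standings"><tr><th>Team</th><th>W</th><th>L</th></tr></table>
<!--
<table id="per_game"><tr><th>Rk</th><th>Team</th><th>G</th><th>MP</th></tr></table>
-->
"""


def hidden_table_headers(html):
    """Return the header row of each table found inside an HTML comment."""
    collector = CommentCollector()
    collector.feed(html)
    result = []
    for comment in collector.comments:
        if "<table" in comment:
            parser = HeaderCollector()
            parser.feed(comment)  # parse the comment body as HTML
            result.append(parser.headers)
    return result


visible = HeaderCollector()
visible.feed(SAMPLE_HTML)
print("visible headers:", visible.headers)
print("hidden headers:", hidden_table_headers(SAMPLE_HTML))
```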
I am trying to colour a splot surface using pm3d and wanted to colour using values from another column instead of the z-axis.
The input file (test.file, tab separated) is :
atom_num residue_name X Y Z
288 1 45.3 36.6 79.3
301 1 38.9 197.4 72.5
314 1 118.2 53.8 76.5
327 1 58.2 139.1 78.5
353 1 1.9 14.4 71.9
366 1 156.9 180.0 72.1
379 1 183.2 5.4 69.5
392 1 71.7 155.4 75.8
457 1 83.4 11.8 74.8
613 1 97.1 180.7 77.5
626 1 145.2 160.3 71.7
678 2 73.1 76.3 81.0
704 3 30.3 46.5 79.3
717 2 216.0 130.7 85.5
743 2 55.0 137.2 74.4
756 2 23.4 67.3 78.3
769 2 46.9 156.1 77.3
821 2 145.4 143.9 80.7
990 2 7.8 119.3 79.8
1016 3 44.3 67.3 76.7
1042 3 12.8 44.4 74.3
1055 3 149.1 79.9 78.2
1068 3 100.8 35.8 76.1
1081 3 57.6 196.8 76.8
1094 3 214.7 122.8 79.5
1107 3 82.0 190.0 74.4
1120 3 150.9 39.4 71.3
1133 3 50.4 143.7 75.3
1146 1 42.9 104.7 74.3
1159 1 139.0 48.8 73.4
1172 1 66.8 165.3 71.5
1198 1 190.7 150.1 84.2
1211 1 92.1 5.1 75.8
1224 1 211.8 177.7 74.1
1237 1 131.6 0.2 73.6
1250 2 103.8 104.2 76.6
1276 2 132.4 5.0 70.0
1289 2 94.4 9.4 73.0
1302 2 72.6 33.7 74.3
1315 2 14.4 162.6 74.7
1406 2 171.4 143.6 86.1
1419 2 209.5 52.9 77.4
1445 2 11.6 14.7 72.3
1458 1 115.5 165.0 73.0
1549 1 147.1 45.5 76.1
1575 1 115.8 36.6 74.5
1588 1 35.8 37.3 76.2
1601 1 65.4 28.2 76.9
1614 1 13.4 199.9 76.5
The commands I am using are:
set dgrid3d 30,30
set hidden3d
set palette rgbformulae 33,13,10
splot "test.file" u 3:4:5 w pm3d
The image is appearing like this:
By default the plot is coloured according to the Z-axis value (column 5). I am stuck trying to colour the plot using the values of Residue Name (column 2), which range from 1 to 3. Is there an option to define which column to use for colouring? Ideally I would like to have the same plot but coloured according to column 2, so that I can see which "Residue types" lie in which contours.
Any help would be hugely appreciated.
As your residue is an integer, it is unclear whether you want it interpolated onto the grid.
However, if that's what you want, you can use the solution in Plotting 3D surface from scatter points and a png on the same 3D graph but don't use with pm3d when writing tables. Here's a solution with a quick and somewhat dirty unix trick to merge the tables:
set terminal push       # save current terminal settings
set terminal unknown    # dummy terminal
set table "surface.dat"
set dgrid3d
splot 'test.file' using 3:4:5    # grid the Z values
set table "residue.dat"
splot 'test.file' using 3:4:2    # grid the residue values
unset dgrid3d
unset table
set term pop            # restore terminal settings
!paste surface.dat residue.dat > test_grid.dat
splot "test_grid.dat" u 1:2:3:7 w pm3d    # the 4th using entry (column 7) sets the colour
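On systems without the Unix paste command, the merge step could be approximated in Python. This is a rough sketch under stated assumptions: the function name `paste_columns` is made up, the sample rows are invented values rather than real gnuplot output, and it assumes each tabulated data row has four fields (x, y, z and a point-type flag), which is my reading of why the residue value ends up in column 7.

```python
def paste_columns(left_lines, right_lines):
    """Join corresponding lines with a tab, keeping blank separator lines blank."""
    merged = []
    for left, right in zip(left_lines, right_lines):
        left, right = left.rstrip("\n"), right.rstrip("\n")
        if not left.strip() and not right.strip():
            # Blank lines separate grid scan lines; preserve them as they are.
            merged.append("")
        else:
            merged.append(left + "\t" + right)
    return merged


# Invented sample rows standing in for surface.dat and residue.dat.
surface = ["1.9 0.2 74.1 i\n", "9.3 0.2 75.0 i\n", "\n", "1.9 7.1 73.8 i\n"]
residue = ["1.9 0.2 1.2 i\n", "9.3 0.2 1.4 i\n", "\n", "1.9 7.1 2.1 i\n"]

merged = paste_columns(surface, residue)
for line in merged:
    print(line)

# Column 7 (index 6) of a merged data row is the gridded residue value.
first_row_fields = merged[0].split()
print("colour value of first row:", first_row_fields[6])
```

To use it on the real files, the two lists would come from reading surface.dat and residue.dat line by line, and the merged lines would be written to test_grid.dat.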