I am having a problem while importing data with pandas.
Here is the situation: I have a file for which I have to skip the initial 40 lines.
Afterwards, the data needs to be separated with ';', using ',' as the decimal separator.
Then, the first and second columns should be assigned to the variables x and y, respectively.
Here is the code I am using:
data = pd.read_csv(path, sep=';', skiprows=40, header=None, engine='python', decimal=",")
# Separates the data into x vector and y vector
x=data.loc[:,0].values
y=data.loc[:,1].values
The imported data comes out as:
0 29.49486 0.10915 -0.30708
1 30.45667 0.17562 -0.30724
2 31.41848 0.23216 -0.30735
3 32.38029 0.27814 -0.30750
4 33.34211 0.31412 -0.30764
5 34.30390 0.34117 -0.30794
.
.
.
166 189.15537 0.41301 -0.16899
167 190.11718 0.41302 -7,7716e-002
168 191.07899 2,7883e-002
Everything except the last 2 lines is imported as expected.
When assigning the vectors to variables with:
x=data.loc[:,0].values
y=data.loc[:,1].values
I get the x vector as float64 and the y as an object.
What am I doing wrong while importing?
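For reference, the workaround I am currently considering (a rough sketch, assuming the comma-style exponents such as -7,7716e-002 in the last rows are what keeps that column as object) is to coerce the second column to numeric after the import:
import pandas as pd

# 'data' is the frame produced by the read_csv call above; swap ',' for '.'
# in the second column and force it to numeric (unparseable cells become NaN)
y = pd.to_numeric(data[1].astype(str).str.replace(',', '.', regex=False),
                  errors='coerce').values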
Kind Regards!
I learnt how to differentiate in Octave through this, but now I want to plot the graph of the same function in Octave and I am not able to do so. I want to know what the right command is for plotting the 'ffd' expression in the code.
f = @(x) x.^2 + 3*x - 1 + 5*x.*sin(x);
pkg load symbolic
syms x;
ff = f(x);
ffd = diff(ff, x)
What I have tried so far: I have tried adding these lines of code to the end and it didn't work. The graph was empty and I got an error.
real
ffd = (sym) 5⋅x⋅cos(x) + 2⋅x + 5⋅sin(x) + 3
error: __go_line__: invalid value for array property "ydata", unable to create graphics handle
error: called from
    __plt__>__plt2vs__ at line 466 column 15
    __plt__>__plt2__ at line 245 column 14
    __plt__ at line 112 column 18
    plot at line 229 column 10
    real at line 10 column 1
I was expecting it to plot the graph of (sym) 5⋅x⋅cos(x) + 2⋅x + 5⋅sin(x) + 3, i.e. ffd, but it didn't work.
In order to draw graphs of the differentiated functions, we need code similar to this; in this case I am using the symbolic package (pkg symbolic):
pkg load symbolic
syms x;
x = sym('x');
y = @(x) x.^3;
yy = y(x);
ffd = diff(diff(yy,x));
ffe = diff(yy,x);
ez1=ezplot(y,[-10,10])
hold on
ez2=ezplot(ffe,[-10,10])
hold on
ez3=ezplot(ffd,[-10,10])
Note: the hold on function is used to draw more than one graph on the same screen. However, if you modify the program and run it again, it doesn't clear the previous graphs from the screen; if anyone knows how to do this, please leave a comment.
I have a nested loop that has to loop through a huge amount of data.
Assume a data frame of random values with 1,000,000 rows, where each row has an X,Y location in 2D space. A window of length 10 goes through all the 1M data rows one by one until all the calculations are done.
Explaining what the code is supposed to do:
Each row represents a coordinate in the X-Y plane.
r_test contains the diameters of the different circles of investigation in our 2D (X-Y) plane.
For each window of 10 points/rows and for every single diameter in r_test, we compare the distance between every point and the remaining 9 points, and if the value is less than R we add 2 to H. Then we calculate H/(N**5) and store it in c_10 at the index corresponding to that diameter of investigation.
Finally, once the loop has gone through all the diameters in r_test for these first 10 points, we read the slope of the fitted line and save it to S_wind[ii]. The first 9 data points will have no value calculated for them, so they are given np.inf to be distinguished later.
Then the window moves one point down the rows and this process repeats until S_wind is completed.
What's a potentially better algorithm to solve this than the one I'm using, in Python 3.x?
Many thanks in advance!
import numpy as np
import pandas as pd
####generating input data frame
df = pd.DataFrame(data = np.random.randint(2000, 6000, (1000000, 2)))
df.columns= ['X','Y']
####====creating upper and lower bound for the diameter of the investigation circles
x_range =max(df['X']) - min(df['X'])
y_range = max(df['Y']) - min(df['Y'])
R = max(x_range,y_range)/20
d = 2
N = 10 #### Number of points in each window
#r1 = 2*R*(1/N)**(1/d)
#r2 = (R)/(1+d)
#r_test = np.arange(r1, r2, 0.05)
##===avoiding generation of empty r_test
r1 = 80
r2= 800
r_test = np.arange(r1, r2, 5)
S_wind = np.zeros(len(df['X'])) + np.inf
for ii in range(10, len(df['X'])):    #### maybe the code run slower because of using len() function instead of a number
    c_10 = np.zeros(len(r_test)) + np.inf
    H = 0
    C = 0
    N = 10    ##### maybe I should also remove this
    for ind in range(len(r_test)):
        for i in range(ii-10, ii):
            for j in range(ii-10, ii):
                dd = r_test[ind] - np.sqrt((df['X'][i] - df['X'][j])**2 + (df['Y'][i] - df['Y'][j])**2)
                if dd > 0:
                    H += 1
        c_10[ind] = (H/(N**2))
    S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]
You can use numpy broadcasting to eliminate all of the inner loops. I'm not sure if there's an easy way to get rid of the outermost loop, but the others are not too hard to avoid.
The inner loops are comparing ten 2D points against each other in pairs. That's just crying out for a 10x10x2 numpy array:
# replacing the `for ind` loop and its contents:
points = np.hstack((np.asarray(df['X'])[ii-10:ii, None], np.asarray(df['Y'])[ii-10:ii, None]))
differences = np.subtract(points[None, :, :], points[:, None, :]) # broadcast to 10x10x2
squared_distances = (differences * differences).sum(axis=2)
within_range = squared_distances[None,:,:] < (r_test*r_test)[:, None, None] # compare squares
c_10 = within_range.sum(axis=(1,2)).cumsum() * 2 / (N**2)
S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0] # this is unchanged...
I'm not very pandas savvy, so there's probably a better way to get the X and Y values into a single 2-dimensional numpy array. You generated the random data in the format that I'd find most useful, then converted it into something less immediately useful for numeric operations!
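One candidate, offered here only as an unverified sketch, is to pull both columns out in one go with to_numpy():
# possible alternative to the hstack line above (sketch, not benchmarked):
points = df[['X', 'Y']].to_numpy()[ii-10:ii]   # shape (10, 2)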
Note that this code matches the output of your loop code. I'm not sure that's actually doing what you want it to do, as there are several slightly strange things in your current code. For example, you may not want the cumsum in my code, which corresponds to only re-initializing H to zero in the outermost loop. If you don't want the matches for smaller values of r_test to be counted again for the larger values, you can skip that sum (or equivalently, move the H = 0 line to in between the for ind and the for i loops in your original code).
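For what it's worth, the non-cumulative variant of the broadcast version (corresponding to the H-reset adjustment described above) would simply drop the cumsum — a sketch, not checked against your data:
# per-radius pair counts, not accumulated across radii (sketch):
c_10 = within_range.sum(axis=(1, 2)) * 2 / (N**2)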
I want to create a matrix with random numbers in J programming language when the required shape is derived from other variables.
I could create such a matrix with ? 3 5 $ 0 if I specify its shape using literal integers. But I am struggling to find a way to create such a matrix when the shape is # y and # x instead of the 3 and 5 shown in the above example.
I have tried ? 0 $~ # y, # x and it has not worked.
I think I need some way to apply # over a list of variables and return a list of numbers which should be placed after $~, somewhat like the map functionality of other languages. Is there a way to do this?
I think that ?@:$ is what you are looking for:
3 5 ?#:$ 0
0.031974 0.272734 0.792653 0.439747 0.136448
0.332198 0.00904103 0.7896 0.78304 0.682833
0.27289 0.855249 0.0922516 0.185466 0.257876
The general structure for this is x u@:v y <-> (u (x v y)) where u and v are the verbs and the arguments are x and y.
Hope this helps.
Rereading your question it looks as if you want the shape to be based on the number of items in the arguments. Here I would use # to count the items in each argument, then use , to create the left argument for $&0 and apply ? to the result.
3 4 5 (?@:($&0 @:,))&# 5 3 3 4 5
0.179395 0.456545 0.805514 0.471521 0.0967092
0.942029 0.30713 0.228288 0.693909 0.338689
0.632752 0.618275 0.100224 0.959804 0.517927
Is this closer to what you had in mind?
And as is often the case, I thought of another approach overnight:
3 4 5 ?@0:"0/ 1 2 3 4 5
0.271366 0.291846 0.0493541 0.72488 0.47988
0.50287 0.980205 0.58541 0.778901 0.0755205
0.0114588 0.523955 0.535905 0.5333 0.984908
I am working on a new routine inside some OOP-based code, and I encountered a problem while modifying the data array (a short example of the code is below).
Basically, this routine takes the array R, transposes it, sorts it, and then filters out the data below the pre-determined value of thres. Then I re-transpose this array back into its original dimensions and plot each of its rows against the first element of T.
import numpy as np
import matplotlib.pyplot as plt
R = np.random.rand(3,8)
R = R.transpose() # transpose the random matrix
R = R[R[:,0].argsort()] # sort this matrix
print(R)
T = ([i for i in np.arange(1,9,1.0)],"temps (min)")
thres = float(input("Define the threshold of coherence: "))
if thres >= 0.0 and thres <= 1.0 :
    R = R[R[:, 0] >= thres] # how to filter unwanted values? changing to NaN / zeros ?
else :
    print("The coherence value is absurd or you're not giving a number!")
print("The final results are ")
print(R)
print(R.transpose())
R.transpose() # re-transpose this matrix
ax = plt.subplot2grid( (4,1),(0,0) )
ax.plot(T[0],R[0])
ax.set_ylabel('Coherence')
ax = plt.subplot2grid( (4,1),(1,0) )
ax.plot(T[0],R[1],'.')
ax.set_ylabel('Back-azimuth')
ax = plt.subplot2grid( (4,1),(2,0) )
ax.plot(T[0],R[2],'.')
ax.set_ylabel('Velocity\nkm/s')
ax.set_xlabel('Time (min)')
However, I encounter an error
ValueError: x and y must have same first dimension, but have shapes (8,) and (3,)
I commented the part where I think the problem might reside (how to filter unwanted values?), but the question remains.
How can I plot these two arrays (R and T) while still being able to filter out unwanted values below thres? Can I transform these unwanted values to zero or NaN and then successfully plot them? If yes, how can I do that?
Your help would be much appreciated.
With the help of a techie friend, the problem was simply resolved by keeping this part
R = R[R[:, 0] >= thres]
because removing unwanted elements is preferable to changing them to NaN or zero. The plotting problem is then fixed by a slight modification in this part
ax.plot(T[0][:len(R[0])],R[0])
and likewise for the subsequent plotting calls. This slices T to the same length as R.
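For completeness, the NaN route asked about in the question also seems workable. A sketch, assuming it replaces the R = R[R[:, 0] >= thres] line while R is still the sorted (8, 3) array, and that the re-transpose is assigned back:
import numpy as np

R_masked = R.copy()
R_masked[R_masked[:, 0] < thres] = np.nan   # rows below the threshold become NaN
R_masked = R_masked.T                       # back to shape (3, 8), same length as T[0]
ax.plot(T[0], R_masked[0], '.')             # matplotlib simply skips NaN points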
I need to find the quadratic equation term of a graph I have plotted in R.
When I do this in Excel, the term appears in a text box on the chart, but I'm unsure how to move this to a cell for subsequent use (to apply to values requiring calibration), or indeed how to ask for it in R. If it is obtainable in R, is it saveable as an object to do future calculations with?
This seems like it should be a straightforward request in R, but I can't find any similar questions. Many thanks in advance for any help anyone can provide on this.
All the answers provide aspects of what you appear to want to do, but none thus far brings it all together. Let's consider Tom Liptrot's answer example:
fit <- lm(speed ~ dist + I(dist^2), cars)
This gives us a fitted linear model with a quadratic in the variable dist. We extract the model coefficients using the coef() extractor function:
> coef(fit)
(Intercept) dist I(dist^2)
5.143960960 0.327454437 -0.001528367
So your fitted equation (subject to rounding because of printing) is:
\hat{speed} = 5.143960960 + (0.327454437 * dist) + (-0.001528367 * dist^2)
(where \hat{speed} is the fitted values of the response, speed).
If you want to apply this fitted equation to some data, then we can write our own function to do it:
myfun <- function(newdist, model) {
coefs <- coef(model)
res <- coefs[1] + (coefs[2] * newdist) + (coefs[3] * newdist^2)
return(res)
}
We can apply this function like this:
> myfun(c(21,3,4,5,78,34,23,54), fit)
[1] 11.346494 6.112569 6.429325 6.743024 21.386822 14.510619 11.866907
[8] 18.369782
for some new values of distance (dist), which is what you appear to want to do from the question. However, in R we don't normally do things like this, because why should the user have to know how to form fitted or predicted values for all the different types of model that can be fitted in R?
In R, we use standard methods and extractor functions. In this case, if you want to apply the "equation", that Excel displays, to all your data to get the fitted values of this regression, in R we would use the fitted() function:
> fitted(fit)
1 2 3 4 5 6 7 8
5.792756 8.265669 6.429325 11.608229 9.991970 8.265669 10.542950 12.624600
9 10 11 12 13 14 15 16
14.510619 10.268988 13.114445 9.428763 11.081703 12.122528 13.114445 12.624600
17 18 19 20 21 22 23 24
14.510619 14.510619 16.972840 12.624600 14.951557 19.289106 21.558767 11.081703
25 26 27 28 29 30 31 32
12.624600 18.369782 14.057455 15.796751 14.057455 15.796751 17.695765 16.201008
33 34 35 36 37 38 39 40
18.688450 21.202650 21.865976 14.951557 16.972840 20.343693 14.057455 17.340416
41 42 43 44 45 46 47 48
18.038887 18.688450 19.840853 20.098387 18.369782 20.576773 22.333670 22.378377
49 50
22.430008 21.93513
If you want to apply your model equation to some new data values not used to fit the model, then we need to get predictions from the model. This is done using the predict() function. Using the distances I plugged into myfun above, this is how we'd do it in a more R-centric fashion:
> newDists <- data.frame(dist = c(21,3,4,5,78,34,23,54))
> newDists
dist
1 21
2 3
3 4
4 5
5 78
6 34
7 23
8 54
> predict(fit, newdata = newDists)
1 2 3 4 5 6 7 8
11.346494 6.112569 6.429325 6.743024 21.386822 14.510619 11.866907 18.369782
First up we create a new data frame with a component named "dist", containing the new distances we want to get predictions for from our model. It is important to note that we include in this data frame a variable that has the same name as the variable used when we created our fitted model. This new data frame must contain all the variables used to fit the model, but in this case we only have one variable, dist. Note also that we don't need to include anything about dist^2. R will handle that for us.
Then we use the predict() function, giving it our fitted model and providing the new data frame just created as argument 'newdata', giving us our new predicted values, which match the ones we did by hand earlier.
Something I glossed over is that predict() and fitted() are really a whole group of functions. There are versions for lm() models, for glm() models etc. They are known as generic functions, with methods (versions if you like) for several different types of object. You the user generally only need to remember to use fitted() or predict() etc whilst R takes care of using the correct method for the type of fitted model you provide it. Here are some of the methods available in base R for the fitted() generic function:
> methods(fitted)
[1] fitted.default* fitted.isoreg* fitted.nls*
[4] fitted.smooth.spline*
Non-visible functions are asterisked
You will possibly get more than this depending on what other packages you have loaded. The * just means you can't refer to those functions directly, you have to use fitted() and R works out which of those to use. Note there isn't a method for lm() objects. This type of object doesn't need a special method and thus the default method will get used and is suitable.
You can add a quadratic term to the formula in lm to get the fit you are after. You need to use I() around the term you want to square, as in the example below:
plot(speed ~ dist, cars)
fit1 = lm(speed ~ dist, cars) #fits a linear model
abline(fit1) #puts line on plot
fit2 = lm(speed ~ I(dist^2) + dist, cars) #fits a model with a quadratic term
fit2line = predict(fit2, data.frame(dist = -10:130))
lines(-10:130 ,fit2line, col=2) #puts line on plot
To get the coefficients from this use:
coef(fit2)
I don't think it is possible in Excel, as it only provides functions to get coefficients for a linear regression (SLOPE, INTERCEPT, LINEST) or an exponential one (GROWTH, LOGEST), though you may have more luck using Visual Basic.
As for R you can extract model coefficients using the coef function:
mdl <- lm(y ~ poly(x,2,raw=T))
coef(mdl) # all coefficients
coef(mdl)[3] # only the 2nd order coefficient
I guess you mean that you plot X vs Y values in Excel or R, and in Excel use the "Add trendline" functionality. In R, you can use the lm function to fit a linear function to your data, and this also gives you the "r squared" term (see examples in the linked page).