Histogram with ggplot2 requires a continuous x variable

I have a dataset in a table format that looks like this:
test frequency
1 test40 3
2 test33 5
3 test19 2
4 test4521 1
5 test34 1
6 test27 3
7 test42 3
8 test35 1
....
If I use this command:
library(ggplot2)
ggplot(t, aes("frequency")) +
geom_histogram()
("t" is the name of my table)
Then RStudio says: "StatBin requires a continuous x variable: the x variable is discrete. Perhaps you want stat="count"?"
I just want to see how many times a 3 or a 5 etc. occurs.
Thanks for your help.

It looks like your data is already aggregated, in which case ggplot2::geom_histogram() may not be the appropriate function to use. Have you tried the geom_col() function? It simply takes the numbers declared in the input data frame and displays them as a column plot.
Using the below code
# Declare data frame
t <- data.frame(test = c("test40", "test33", "test19", "test4521",
                         "test34", "test27", "test42", "test35"),
                frequency = c(3, 5, 2, 1,
                              1, 3, 3, 1))
returns the data frame like this
# View data
print(t)
test frequency
1 test40 3
2 test33 5
3 test19 2
4 test4521 1
5 test34 1
6 test27 3
7 test42 3
8 test35 1
and therefore you can plot it like this
# Load package
library(ggplot2)
# Generate column plot
ggplot(t, aes(test, frequency)) +
geom_col()
If you simply want a count of how many times the number 2 or the number 3 occurs in your data frame, then geom_histogram() can do that. It counts how often each value occurs in the data frame and plots the result. It also checks the type of the variable you are plotting along the x-axis: if that variable is discrete, you need to pass the parameter stat="count" to the function. Note that aes("frequency"), with quotes, maps x to the constant string "frequency", which is discrete, and that is what triggers the error; use the bare column name, aes(frequency), instead. Without stat="count", ggplot will try to bin your data to create the histogram, which is not what you want when all you need is a count.
Check out this link for a description of the difference between continuous and discrete data: What is the difference between discrete data and continuous data?
With this in mind, you can plot the histogram like this
# Generate histogram plot
ggplot(t, aes(frequency)) +
geom_histogram(stat="count")
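To make the difference between the two plots concrete, here is a minimal Python sketch of the tally that a count of the frequency column amounts to. It only illustrates the counting step and is not how ggplot2 is implemented; the list simply repeats the frequency column from the question.

```python
from collections import Counter

# The frequency column from the question's table.
frequency = [3, 5, 2, 1, 1, 3, 3, 1]

# Tally how many times each distinct value occurs. This is the kind of
# result a count-based bar chart displays, one bar per distinct value.
counts = Counter(frequency)

for value in sorted(counts):
    print(value, counts[value])
```

The column plot, by contrast, would draw the eight frequency values themselves, one bar per test.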
I hope that helps mate.

Related

Merge regression results back to original dataframe

I am working on a simple time series linear regression using statsmodels.api.OLS, and am running these regressions on groups of data based on an identifier variable. I have been able to get the grouped regressions working, but am now looking to merge the results of the regressions back into the original dataframe and am getting index errors.
A simplified version of my original dataframe, which we'll call "df" looks like this:
id value time
a 1 1
a 1.5 2
a 2 3
a 2.5 4
b 1 1
b 1.5 2
b 2 3
b 2.5 4
My function to conduct the regressions is as follows:
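import pandas as pd
import statsmodels.api as sm

def ols_reg(df, xcol, ycol):
    x = df[xcol]
    y = df[ycol]
    x = sm.add_constant(x)  # add an intercept column
    model = sm.OLS(y, x, missing='drop').fit()
    predictions = model.predict()  # in-sample fitted values
    return pd.Series(predictions)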
I then define a variable that stores the results of conducting this function on my dataset, grouping by the id column. This code is as follows:
var = df.groupby('id').apply(ols_reg,
xcol='time',ycol='value')
This returns a Series of the predicted linear values that has the same length as the original dataset, and looks like the following:
id
a 0 0.5
1 1
2 2.5
3 3
b 0 0.5
1 1
2 2.5
3 3
The column starting with 0.5 (ignore the values; not the actual output) is the column with predicted values from the regression. As the return on the function shows, this is a pandas Series.
I now want to merge these results back into the original dataframe, to look like the following:
id value time results
a 1 1 0.5
a 1.5 2 1
a 2 3 2.5
a 2.5 4 3
b 1 1 0.5
b 1.5 2 1
b 2 3 2.5
b 2.5 4 3
I've tried a number of methods, such as setting a new column in the original dataset equal to the series, but get the following error:
TypeError: incompatible index of inserted column with frame index
Any help on getting these results back into the original dataframe would be greatly appreciated. There are a number of other posts that correspond to this topic, but none of the solutions worked for me in this instance.
UPDATE:
I've solved this with a relatively simple method, in which I converted the series to a list, and just set a new column in the dataframe equal to the list. However, I would be really curious to hear if others have better/different/unique solutions to this problem. Thanks!
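One alternative to the list conversion is to keep each group's original row index on the fitted values, so that pandas can align them with the frame on assignment. The following is a minimal sketch under stated assumptions: np.polyfit stands in for sm.OLS purely to keep the example light on dependencies, and fitted_values is a hypothetical helper name, not part of the original code.

```python
import numpy as np
import pandas as pd

# The simplified frame from the question.
df = pd.DataFrame({
    "id": ["a"] * 4 + ["b"] * 4,
    "value": [1, 1.5, 2, 2.5] * 2,
    "time": [1, 2, 3, 4] * 2,
})

def fitted_values(group, xcol, ycol):
    # Assumption: a straight-line fit via np.polyfit replaces sm.OLS here.
    slope, intercept = np.polyfit(group[xcol].to_numpy(),
                                  group[ycol].to_numpy(), deg=1)
    # The result keeps the group's own row labels as its index.
    return pd.Series(slope * group[xcol] + intercept, index=group.index)

# Fit each id separately, then stitch the pieces back together.
parts = [fitted_values(g, "time", "value") for _, g in df.groupby("id")]

# Assignment aligns on the index, so row order does not matter.
df["results"] = pd.concat(parts)
print(df)
```

Because the assignment aligns on index labels rather than position, this should also hold up when the frame is not sorted by id, which is where a plain list can silently put values on the wrong rows.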
To avoid losing the row positions when inserting predictions into the missing values, you can use the following approach. For example:
X_train: the training data, a pandas DataFrame corresponding to the known real results (in y_train).
X_test: the test data, a pandas DataFrame without corresponding known real results. These need to be predicted.
y_train: a pandas Series with the known real results for the training data.
prediction: a pandas Series with the predicted values.
To get the complete data merged in one pandas dataframe first get the known part together:
# merge train part of the data into a dataframe
X_train = X_train.sort_index()
y_train = y_train.sort_index()
result = pd.concat([X_train,X_test])
# if need to convert numpy array to pandas series:
# prediction = pd.Series(prediction)
# here is the magic
result.loc[result['specie'].isnull(), 'specie'] = prediction.values
This does the job as long as 'specie' has no other missing values, so that the number of nulls equals the length of prediction.

gnuplot plot data from data sets

The following data sets are generated by a program:
1 **1 0.11111**
1 **2 0.22222**
1 **3 0.33333**
1 **4 0.44444**
2 1 0.00185
2 2 0.00005
2 3 0.12355
2 4 0.68124
3 1 0.54875
3 2 0.62155
3 3 0.35895
3 4 0.41588
My question: how do I plot the first 4 rows (bold) in a 2-dimensional figure? That is, the following points should be plotted:
(1, 0.11111)
(2, 0.22222)
(3, 0.33333)
(4, 0.44444)
I know I can use the "index" directive to plot multiple data sets, but then double blank lines must appear in the file (in order to separate the data sets), and I don't want any blank lines in it. Thanks.
You can use every ::0::3 to plot only up to the 4th row (see "Gnuplot plotting data from a file up to some row"), together with using 2:3 to plot the 2nd column as x against the 3rd column as y.
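To check which points that selection picks out, the same slicing can be sketched in Python. This only mirrors the row and column selection; it is not gnuplot code, and the data string simply repeats the rows from the question.

```python
# The rows from the question, with no blank lines between data sets.
data = """\
1 1 0.11111
1 2 0.22222
1 3 0.33333
1 4 0.44444
2 1 0.00185
2 2 0.00005
2 3 0.12355
2 4 0.68124
"""

rows = [line.split() for line in data.splitlines() if line.strip()]

# every ::0::3 keeps points 0 through 3 (0-based, inclusive).
selected = rows[0:4]

# using 2:3 takes the 2nd column as x and the 3rd as y (1-based columns).
points = [(int(r[1]), float(r[2])) for r in selected]
print(points)
```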

Gnuplot with Linear Regression

I am trying to apply linear regression with fit f(x). Instead of having two columns in the data file, e.g. x and y values, this file has, for example, 5 columns. I want to take the average value of each column and feed it to the f(x) function.
Data:
A B C D E
2 2 5 10 20
4 5 6 11 1
6 8 7 12 4
8 9 12 13 8
10 11 10 14 17
Could I?
Thanks for help
Assume that your data is as specified and you wish to fit a function f(x)=m*x+b to the data where the 0-based column index (0,1,2,3 or 4) should be the x value and the column average should be the y value. We need to construct a new data file that contains the averages.
In gnuplot 5, we can use something called inline data. This is a special variable that behaves like a file. We will find the average of each column of the data and construct an inline data variable containing these. We do this by looping over the column indices and applying the stats function. The print command can be instructed to print to an inline data variable.
set print $l append
do for [i=1:5] {
stats datafile u i nooutput
print STATS_mean
}
set print # restores ordinary print behavior
With your data, we can see what is contained in $l by printing it with print $l:
6.0
7.0
8.0
12.0
10.0
We now can apply the fit command with this data
f(x) = m*x + b
fit f(x) $l u 0:1 via m,b
This will fit the data so that f(x) = average of column x (or as close to it as can be obtained with the fit).
In gnuplot 4.6, inline data is not available, but we can use a temporary file. Replacing all occurrences of $l with "tempfile" will work the same (except for the print $l command), but will add the data to a temporary file named tempfile.
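As a cross-check on the numbers, the same two steps (column averages, then a straight-line least-squares fit against the 0-based column index) can be sketched in plain Python. This is an independent check, not a translation of gnuplot's fit algorithm; the closed-form slope and intercept formulas are used instead of gnuplot's iterative fit.

```python
from statistics import mean

# Columns A..E from the question, one list per data row.
rows = [
    [2, 2, 5, 10, 20],
    [4, 5, 6, 11, 1],
    [6, 8, 7, 12, 4],
    [8, 9, 12, 13, 8],
    [10, 11, 10, 14, 17],
]

# Column averages, the values collected from STATS_mean in the loop.
ys = [mean(col) for col in zip(*rows)]

# 0-based index of each average, the role of pseudo-column 0 in "u 0:1".
xs = list(range(len(ys)))

# Ordinary least squares for f(x) = m*x + b in closed form.
x_bar, y_bar = mean(xs), mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
m = sxy / sxx
b = y_bar - m * x_bar

print(ys, m, b)
```

For this data the averages are 6, 7, 8, 12 and 10, and the fit should come out at a slope of about 1.3 with an intercept of about 6.0, which gives a reference point for the values gnuplot reports for m and b.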

How to make PivotChart with line breaks

I have data like the following:
x y f
1 1 1.2
1 2 1.4
1 3 1.6
3 1 3.2
3 2 3.4
3 3 3.6
5 1 5.2
5 2 5.4
5 3 5.6
If you insert a pivot chart, you can plot f vs x and y using a line chart. The plot has two stacked x-axes: the lower x-axis has the values 1 3 5 corresponding to x, and the upper x-axis has the values 1 2 3 for each value on the lower axis, representing x = 1 and y = 1 2 3, then x = 3 and y = 1 2 3, and x = 5 and y = 1 2 3. The plot shows a single continuous line from left to right. What I would like is for the line to break when x changes value, so there are three short lines showing the influence of y for constant values of x.
This link makes a chart similar to what I'm describing in the answer. In terms of that figure, what I want is for the line to break every time the year changes. But the answer and discussion there don't get what I'm looking for. The only approach that I can think of is to modify the PivotTable data by hand and add a row at the location where the data breaks. I tried to do something like that at work, but before modifying the table, I copied the table as values to a separate location. With the new data table, I was not able to create the plot with two x-axes. If I could create the plot, I could add a second row where y = 3 with NA() for f, which should create the break in the proper location.
For something that looks like:
Select each of the second and subsequent y 1 values (individually):
and Format Data Point..., Line, No line.
(BTW IMO better suited to Super User.)

Ignore #N/As in Excel LINEST function with multiple independent variables (known_x's)

I am trying to find the equation of a plane of best fit to a set of x,y,z data using the LINEST function. Some of the z data is missing, meaning that there are #N/As in the z column. For example:
     A    B    C
    (x)  (y)  (z)
1    1    1   5.1
2    2    1   5.4
3    3    1   5.7
4    1    2   #N/A
5    2    2   5.2
6    3    2   5.5
7    1    3   4.7
8    2    3   5
9    3    3   5.3
I would like to do =LINEST(C1:C9,A1:B9), but the #N/A causes this to return a value error.
I found a solution for a single independent variable (one column of known_x's, i.e. fitting a line to x,y data), but I have not been able to extend it for two independent variables (two known_x's columns, i.e. fitting a plane to x,y,z data). The solution I found is here: http://www.excelforum.com/excel-general/647448-linest-question.html, and the formula (slightly modified for my application) is:
=LINEST(
N(OFFSET(C1:C9,SMALL(IF(ISNUMBER(C1:C9),ROW(C1:C9)-ROW(C1)),
ROW(INDIRECT("1:"&COUNT(C1:C9)))),0,1)),
N(OFFSET(A1:A9,SMALL(IF(ISNUMBER(C1:C9),ROW(C1:C9)-ROW(C1)),
ROW(INDIRECT("1:"&COUNT(C1:C9)))),0,1)),
)
which is equivalent to =LINEST(C1:C9,A1:A9), ignoring the row containing the #N/A.
The formula from the posted link could probably be adapted but it is unwieldy. Least squares with missing data can be viewed as a regression with weight 1 for numeric values and weight 0 for non-numeric values. Based on this observation you could try this (with Ctrl+Shift+Enter in a 1x3 range):
=LINEST(IF(ISNUMBER(C1:C9),C1:C9,),IF(ISNUMBER(C1:C9),CHOOSE({1,2,3},1,A1:A9,B1:B9),),)
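LINEST returns its coefficients in reverse order of the known_x's columns, followed by the constant, so the result {-0.2, 0.3, 5} reads as the coefficient of y, the coefficient of x, and the intercept. This gives the equation of the plane as z=0.3x-0.2y+5, which can be checked against the results of using LINEST(C1:C8,A1:B8) with the error row removed.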
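The weight-0/weight-1 idea can also be checked outside Excel. The following is a minimal sketch using NumPy's least-squares solver rather than LINEST; the arrays repeat the data from the question, with np.nan standing in for #N/A. Unlike LINEST, the coefficients here come back in the same order as the design-matrix columns (x, y, constant).

```python
import numpy as np

# Data from the question; np.nan marks the #N/A cell.
x = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3], dtype=float)
y = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=float)
z = np.array([5.1, 5.4, 5.7, np.nan, 5.2, 5.5, 4.7, 5.0, 5.3])

# Weight 1 for numeric z values, weight 0 for missing ones.
w = np.isfinite(z).astype(float)

# Columns are x, y and a constant; missing rows are zeroed out entirely,
# mirroring the IF(ISNUMBER(...)) wrappers in the formula.
design = np.column_stack([x, y, np.ones_like(x)]) * w[:, None]
target = np.where(np.isfinite(z), z, 0.0)

coef_x, coef_y, intercept = np.linalg.lstsq(design, target, rcond=None)[0]
print(coef_x, coef_y, intercept)
```

A zeroed row contributes nothing to the sum of squared residuals, which is why it has the same effect as deleting the row. For this data the fit should land on roughly 0.3 for x, -0.2 for y and 5 for the constant.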
