I have a frequency table and would like to calculate its mean and standard deviation. The first column is the frequency and the second is the data value. The mean I need is the frequency-weighted one, (446*0 + 864*1 + 277*2 + ... + 1*12)/(446 + 864 + 277 + ... + 1) = ~1.35, yet when I use gnuplot stats, it gives me statistics of the separate columns. How can I change my code so that it gives me the output that I want?
Data table:
446 0
864 1
277 2
111 3
62 4
32 5
19 6
9 7
8 8
3 10
3 11
1 12
Gnuplot code:
stats "$input" using 2:1
Output:
* FILE:
Records: 12
Out of range: 0
Invalid: 0
Column headers: 0
Blank: 0
Data Blocks: 1
* COLUMNS:
Mean: 5.7500 152.9167
Std Dev: 3.7887 251.5374
Sample StdDev: 3.9572 262.7223
Skewness: 0.1569 1.9131
Kurtosis: 1.8227 5.5436
Avg Dev: 3.2500 188.0417
Sum: 69.0000 1835.0000
Sum Sq.: 569.0000 1.03986e+06
Mean Err.: 1.0937 72.6126
Std Dev Err.: 0.7734 51.3449
Skewness Err.: 0.7071 0.7071
Kurtosis Err.: 1.4142 1.4142
Minimum: 0.0000 [ 0] 1.0000 [11]
Maximum: 12.0000 [11] 864.0000 [ 1]
Quartile: 2.5000 5.5000
Median: 5.5000 25.5000
Quartile: 9.0000 194.0000
Linear Model: y = -46.89 x + 422.5
Slope: -46.89 +- 14.86
Intercept: 422.5 +- 102.4
Correlation: r = -0.7062
Sum xy: 2475
Try this:
Code:
### special mean
reset session
$Data <<EOD
446 0
864 1
277 2
111 3
62 4
32 5
19 6
9 7
8 8
3 10
3 11
1 12
EOD
stats $Data u ($1*$2):1
print STATS_sum_x, STATS_sum_y
print STATS_sum_x/STATS_sum_y
### end of code
Result:
* FILE:
Records: 12
Out of range: 0
Invalid: 0
Column headers: 0
Blank: 0
Data Blocks: 1
* COLUMNS:
Mean: 206.2500 152.9167
Std Dev: 252.3441 251.5374
Sample StdDev: 263.5648 262.7223
Skewness: 1.5312 1.9131
Kurtosis: 4.2761 5.5436
Avg Dev: 195.6667 188.0417
Sum: 2475.0000 1835.0000
Sum Sq.: 1.27460e+06 1.03986e+06
Mean Err.: 72.8455 72.6126
Std Dev Err.: 51.5095 51.3449
Skewness Err.: 0.7071 0.7071
Kurtosis Err.: 1.4142 1.4142
Minimum: 0.0000 [ 0] 1.0000 [11]
Maximum: 864.0000 [ 1] 864.0000 [ 1]
Quartile: 31.5000 5.5000
Median: 89.0000 25.5000
Quartile: 290.5000 194.0000
Linear Model: y = 0.7622 x - 4.279
Slope: 0.7622 +- 0.2032
Intercept: -4.279 +- 66.21
Correlation: r = 0.7646
Sum xy: 9.609e+05
Your values:
2475.0 1835.0
1.34877384196185
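For cross-checking outside gnuplot, the frequency-weighted mean (and the standard deviation the question also asks for) can be computed directly; a sketch in Python over the same table:

```python
# (frequency, value) pairs from the table above
data = [(446, 0), (864, 1), (277, 2), (111, 3), (62, 4), (32, 5),
        (19, 6), (9, 7), (8, 8), (3, 10), (3, 11), (1, 12)]

n = sum(f for f, v in data)                # total number of observations
mean = sum(f * v for f, v in data) / n     # frequency-weighted mean
var = sum(f * (v - mean) ** 2 for f, v in data) / n
std = var ** 0.5                           # population standard deviation

print(mean)  # ~1.3488, i.e. 2475/1835
print(std)
```

The same weighting gives the standard deviation, which the gnuplot answer above does not compute.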
I want to write data with headers into a file. The first three lines are unique and can be considered a 'block', which is then repeated with increments in x and y (0.12 and 1, respectively). The data in the file should look like:
#X #Y Xmin Ymin Z
1 1 0.0000 0.000 0.0062
1 2 0.0000 0.350 0.0156
1 3 0.0000 0.750 0.0191
1 4 0.0000 1.000 0.0062
1 5 0.0000 1.350 0.0156
1 6 0.0000 1.750 0.0191
1 7 0.0000 2.000 0.0062
1 8 0.0000 2.350 0.0156
1 9 0.0000 2.750 0.0191
2 1 0.1200 0.000 0.0062
2 2 0.1200 0.350 0.0156
2 3 0.1200 0.750 0.0191
2 4 0.1200 1.000 0.0062
2 5 0.1200 1.350 0.0156
2 6 0.1200 1.750 0.0191
2 7 0.1200 2.000 0.0062
2 8 0.1200 2.350 0.0156
2 9 0.1200 2.750 0.0191
3 1 0.2400 0.000 0.0062
3 2 0.2400 0.350 0.0156
3 3 0.2400 0.750 0.0191
3 4 0.2400 1.000 0.0062
3 5 0.2400 1.350 0.0156
3 6 0.2400 1.750 0.0191
3 7 0.2400 2.000 0.0062
3 8 0.2400 2.350 0.0156
3 9 0.2400 2.750 0.0191
I tried making the first three lines three lists and writing the first two columns and the headers with two nested for loops, but failed to write the repeating 3-line block.
l1 = [0.0000, 0.000, 0.0062]
l2 = [0.0000, 0.350, 0.0156]
l3 = [0.0000, 0.750, 0.0191]
pitch_x = 0.12
pitch_y = 1
with open('dataprep_test.txt', 'w') as f:
    f.write('#x #y Xmin Ymin Z \n')
    for i in range(1, 4):
        k = 1
        for j in range(1, 4):
            d_x = pitch_x * (i - 1)
            d_y = pitch_y * (j - 1)
            f.write('%d %d %f %f %f \n' % (i, k, l1[0] + d_x, l1[1] + d_y, l1[2]))
            f.write('%d %d %f %f %f \n' % (i, k + 1, l2[0] + d_x, l2[1] + d_y, l2[2]))
            f.write('%d %d %f %f %f \n' % (i, k + 2, l3[0] + d_x, l3[1] + d_y, l3[2]))
            k = k + 3
Is there a smarter way to do it using Python's built-in functions, structures and methods (lists, dictionaries etc.)?
I'd just refactor the data generation into a generator function. You can also easily accept an arbitrary number of vectors.
def generate_data(initial_vectors, pitch_x, pitch_y, i_count=3, j_count=3):
    for i in range(i_count):
        for j in range(j_count):
            d_x = pitch_x * i
            d_y = pitch_y * j
            for k, (x, y, z) in enumerate(initial_vectors, 1):
                # number rows continuously within each x-block (1..9 here)
                yield (i + 1, j * len(initial_vectors) + k, x + d_x, y + d_y, z)

def main():
    l1 = [0.0000, 0.000, 0.0062]
    l2 = [0.0000, 0.350, 0.0156]
    l3 = [0.0000, 0.750, 0.0191]
    with open('dataprep_test.txt', 'w') as f:
        f.write('#x #y Xmin Ymin Z \n')
        for i, k, x, y, z in generate_data([l1, l2, l3], pitch_x=0.12, pitch_y=1):
            f.write(f'{i:d} {k:d} {x:f} {y:f} {z:f}\n')

if __name__ == '__main__':
    main()
Furthermore, if a future version of your project wants to use JSON files instead, you could simply json.dumps(list(generate_data(...))), etc.
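To make the JSON idea concrete, a small self-contained sketch (the generator is repeated here in condensed form so the snippet runs on its own, with a single hypothetical vector):

```python
import json

def generate_data(initial_vectors, pitch_x, pitch_y, i_count=3, j_count=3):
    # condensed copy of the generator above so this sketch is self-contained
    for i in range(i_count):
        for j in range(j_count):
            for k, (x, y, z) in enumerate(initial_vectors, 1):
                yield (i + 1, j * len(initial_vectors) + k,
                       x + pitch_x * i, y + pitch_y * j, z)

rows = list(generate_data([(0.0, 0.0, 0.0062)], pitch_x=0.12, pitch_y=1))
print(json.dumps(rows))  # tuples become JSON arrays
```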
You could do this, which gives every part:
file = r'F:\code\some_file.csv'  # raw string so the backslashes are kept literally
some_headers = ['x#', 'y#', 'Xmin', 'Ymin', 'Z']
# new lists
list_x = [1, 1, 1]
list_y = [1, 2, 3]
list_xmin = [0, 0, 0]
list_ymin = [0, 0.35, 0.75]
list_z = [0.0062, 0.0156, 0.0191]
# build new lists with whatever rules you need
for i in range(10):
    list_x.append(i)
    list_y.append(i * 2)
    list_xmin.append(i)
    list_ymin.append(i * 3)
    list_z.append(i)
# write to file
with open(file, 'w') as csvfile:
    # write headers
    for i in some_headers:
        csvfile.write(i + ',')
    csvfile.write('\n')
    # write data
    for i in range(len(list_x)):
        line_to_write = str(list_x[i]) + ',' + str(list_y[i]) + ',' + str(list_xmin[i])
        line_to_write = line_to_write + ',' + str(list_ymin[i]) + ',' + str(list_z[i])
        line_to_write = line_to_write + '\n'
        csvfile.writelines(line_to_write)
# finished
print('done')
The result would be a csv file like this:
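The same kind of file can also be produced with the standard-library csv module, which handles the commas and newlines for you; a sketch using an io.StringIO stand-in for the file (swap in open('some_file.csv', 'w', newline='') for a real file), with the rows taken from the first block in the question:

```python
import csv
import io

headers = ['x#', 'y#', 'Xmin', 'Ymin', 'Z']
rows = [
    (1, 1, 0.0, 0.00, 0.0062),
    (1, 2, 0.0, 0.35, 0.0156),
    (1, 3, 0.0, 0.75, 0.0191),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(headers)   # header row
writer.writerows(rows)     # all data rows at once
print(buf.getvalue())
```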
I aim to compare the restricted mean survival time between the two treatment groups in the Anderson dataset.
Here is the structure of my data frame:
'data.frame': 42 obs. of 5 variables:
$ survt : num 19 17 13 11 10 10 9 7 6 6 ...
$ status: num 0 0 1 0 0 1 0 1 0 1 ...
$ sex : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
$ logwbc: 'labelled' num 2.05 2.16 2.88 2.6 2.7 2.96 2.8 4.43 3.2 2.31 ...
..- attr(*, "label")= Named chr "log WBC"
.. ..- attr(*, "names")= chr "logwbc"
$ rx : Factor w/ 2 levels "New treatment",..: 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "label")= Named chr "Treatment"
.. ..- attr(*, "names")= chr "rx"
- attr(*, "codepage")= int 65001
I used the following code to compare the restricted mean survival time between the two treatment groups ("New treatment" vs. "Standard treatment"):
time <- anderson$survt
status <- anderson$status
arm <- anderson$rx
rmst2(time, status, arm )
I get the following error:
Error in rmst2(time, status, arm) : object 'NOTE' not found
In addition: Warning messages:
1: In max(tt) : no non-missing arguments to max; returning -Inf
2: In min(ss[tt == tt0max]) :
no non-missing arguments to min; returning Inf
3: In max(tt) : no non-missing arguments to max; returning -Inf
4: In min(ss[tt == tt1max]) :
no non-missing arguments to min; returning Inf
Thanks
I converted the sex and rx variables from factor to numeric and the function worked.
Having a dataset like this:
y x size type total_neighbours res
113040 29 1204 15 3 2 0
66281 52 402 9 3 3 0
32296 21 1377 35 0 3 0
48367 3 379 139 0 4 0
33501 1 66 17 0 3 0
... ... ... ... ... ... ...
131230 39 1002 439 3 4 6
131237 40 1301 70 1 2 1
131673 26 1124 365 1 2 1
131678 27 1002 629 3 3 6
131684 28 1301 67 1 2 1
I would like to use a random forest algorithm to predict the value of the res column (res can only take integer values in [0, 6]).
I'm doing it like this:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

labels = np.array(features['res'])
features = features.drop('res', axis=1)
features = np.array(features)
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.25, random_state=42)
rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_features, train_labels)
predictions = rf.predict(test_features)
The prediction I get are the following:
array([1.045e+00, 4.824e+00, 4.608e+00, 1.200e-01, 5.982e+00, 3.660e-01,
4.659e+00, 5.239e+00, 5.982e+00, 1.524e+00])
I have no experience in this field, so I don't quite understand the predictions.
How do I interpret them?
Is there any way to limit the predictions to the res column's values (integers in [0, 6])?
Thanks
As @MaxNoe said, I had a misconception about the model: I was using a regression to predict a discrete variable.
RandomForestClassifier gives the expected output.
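The classifier version only differs in the estimator class; a minimal sketch on randomly generated stand-in data (the real features frame from the question is assumed to be unavailable here):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data shaped like the question's frame:
# five feature columns and an integer target in [0, 6].
rng = np.random.default_rng(42)
X = rng.integers(0, 1500, size=(200, 5))
y = rng.integers(0, 7, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

# every prediction is one of the classes seen in training, an int in [0, 6]
assert set(predictions) <= set(range(7))
```

Unlike the regressor, the classifier can never predict a value outside the set of labels it was trained on.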
I have the following df,
group_id code amount date
1 100 20 2017-10-01
1 100 25 2017-10-02
1 100 40 2017-10-03
1 100 25 2017-10-03
2 101 5 2017-11-01
2 102 15 2017-10-15
2 103 20 2017-11-05
I'd like to group by group_id and then compute a score for each group based on the following features:
if the code values are all the same in a group, score 0, and 10 otherwise;
if the amount sum is > 100, score 20, and 0 otherwise;
sort by date in descending order and sum the differences between consecutive dates; if the total is < 5 days, score 30, otherwise 0.
so the result df looks like,
group_id code amount date score
1 100 20 2017-10-01 50
1 100 25 2017-10-02 50
1 100 40 2017-10-03 50
1 100 25 2017-10-03 50
2 101 5 2017-11-01 10
2 102 15 2017-10-15 10
2 103 20 2017-11-05 10
here are the functions that correspond to each feature above:
import numpy as np

def amount_score(df, amount_col, thold=100):
    if df[amount_col].sum() > thold:
        return 20
    else:
        return 0

def col_uniq_score(df, col_name):
    if df[col_name].nunique() == 1:
        return 0
    else:
        return 10

def date_diff_score(df, col_name):
    df = df.sort_values(by=[col_name], ascending=False)
    if df[col_name].diff().dropna().sum() / np.timedelta64(1, 'D') < 5:
        return 30
    else:
        return 0
I am wondering how to apply these functions to each group and calculate the sum of all the functions to give a score.
You can use groupby.transform, which returns a Series the same size as the original DataFrame, together with numpy.where as a vectorised if-else:
df_sorted = df.sort_values('date', ascending=False)
grouped = df_sorted.groupby('group_id', sort=False)

a = np.where(grouped['code'].transform('nunique') == 1, 0, 10)
print (a)
[10 10 10  0  0  0  0]

b = np.where(grouped['amount'].transform('sum') > 100, 20, 0)
print (b)
[ 0  0  0 20 20 20 20]

c = np.where(grouped['date'].transform(lambda x: x.diff().abs().dropna().sum()).dt.days < 5, 30, 0)
print (c)
[ 0  0  0 30 30 30 30]

df_sorted['score'] = a + b + c
df['score'] = df_sorted['score']   # Series assignment aligns on the index
print (df)
   group_id  code  amount       date  score
0         1   100      20 2017-10-01     50
1         1   100      25 2017-10-02     50
2         1   100      40 2017-10-03     50
3         1   100      25 2017-10-03     50
4         2   101       5 2017-11-01     10
5         2   102      15 2017-10-15     10
6         2   103      20 2017-11-05     10
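If you prefer to keep the three rules together as one readable function, they can also be applied in a single pass per group; a self-contained sketch reproducing the question's frame and its desired scores:

```python
import pandas as pd

# the sample frame from the question
df = pd.DataFrame({
    'group_id': [1, 1, 1, 1, 2, 2, 2],
    'code': [100, 100, 100, 100, 101, 102, 103],
    'amount': [20, 25, 40, 25, 5, 15, 20],
    'date': pd.to_datetime(['2017-10-01', '2017-10-02', '2017-10-03',
                            '2017-10-03', '2017-11-01', '2017-10-15',
                            '2017-11-05']),
})

def group_score(g):
    score = 0 if g['code'].nunique() == 1 else 10    # rule 1
    score += 20 if g['amount'].sum() > 100 else 0    # rule 2
    # rule 3: total day span between consecutive dates, ignoring sign
    span = g['date'].sort_values(ascending=False).diff().abs().dropna().sum()
    score += 30 if span.days < 5 else 0
    return score

scores = df.groupby('group_id')[['code', 'amount', 'date']].apply(group_score)
df['score'] = df['group_id'].map(scores)  # broadcast one score per group
print(df)
```

This trades the vectorised speed of the transform approach for one plain function that mirrors the three stated rules line by line.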
I have data in the following format:
| | Measurement 1 | | Measurement 2 | |
|------|---------------|------|---------------|------|
| | Mean | Std | Mean | Std |
| Time | | | | |
| 0 | 17 | 1.10 | 21 | 1.33 |
| 1 | 16 | 1.08 | 21 | 1.34 |
| 2 | 14 | 0.87 | 21 | 1.35 |
| 3 | 11 | 0.86 | 21 | 1.33 |
I am using the following code to generate a matplotlib line graph from this data, which shows the standard deviation as a filled in area, see below:
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

def seconds_to_minutes(x, pos):
    minutes = f'{round(x/60, 0)}'
    return minutes

fig, ax = plt.subplots()
mean_temperature_over_time['Measurement 1']['mean'].plot(kind='line', yerr=mean_temperature_over_time['Measurement 1']['std'], alpha=0.15, ax=ax)
mean_temperature_over_time['Measurement 2']['mean'].plot(kind='line', yerr=mean_temperature_over_time['Measurement 2']['std'], alpha=0.15, ax=ax)
ax.set(title="A Line Graph with Shaded Error Regions", xlabel="x", ylabel="y")
formatter = FuncFormatter(seconds_to_minutes)
ax.xaxis.set_major_formatter(formatter)
ax.grid()
ax.legend(['Mean 1', 'Mean 2'])
Output:
This seems like a very messy solution, and it only actually produces shaded output because I have so much data. What is the correct way to produce a line graph with shaded error regions from the dataframe I have? I've looked at Plot yerr/xerr as shaded region rather than error bars, but am unable to adapt it to my case.
What's wrong with the linked solution? It seems pretty straightforward.
Allow me to rearrange your dataset so it's easier to load into a Pandas DataFrame:
Time Measurement Mean Std
0 0 1 17 1.10
1 1 1 16 1.08
2 2 1 14 0.87
3 3 1 11 0.86
4 0 2 21 1.33
5 1 2 21 1.34
6 2 2 21 1.35
7 3 2 21 1.33
fig, ax = plt.subplots()
for i, m in df.groupby("Measurement"):
    ax.plot(m.Time, m.Mean)
    ax.fill_between(m.Time, m.Mean - m.Std, m.Mean + m.Std, alpha=0.35)
And here's the result with some randomly generated data:
EDIT
Since the issue is apparently iterating over your particular dataframe format, let me show how you could do it (I'm new to pandas, so there may be better ways). If I understood your screenshot correctly, you should have something like:
Measurement 1 2
Mean Std Mean Std
Time
0 17 1.10 21 1.33
1 16 1.08 21 1.34
2 14 0.87 21 1.35
3 11 0.86 21 1.33
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 4 columns):
(1, Mean) 4 non-null int64
(1, Std) 4 non-null float64
(2, Mean) 4 non-null int64
(2, Std) 4 non-null float64
dtypes: float64(2), int64(2)
memory usage: 160.0 bytes
df.columns
MultiIndex(levels=[[1, 2], [u'Mean', u'Std']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
names=[u'Measurement', None])
And you should be able to iterate over it and obtain the same plot:
for i, m in df.groupby(level="Measurement", axis=1):
    m = m[i]  # select the outer column level, leaving just Mean/Std
    ax.plot(m.index, m['Mean'])
    ax.fill_between(m.index,
                    m['Mean'] - m['Std'],
                    m['Mean'] + m['Std'], alpha=0.35)
Or you could restack it to the format above with
(df.stack("Measurement") # stack "Measurement" columns row by row
.reset_index() # make "Time" a normal column, add a new index
.sort_values("Measurement") # group values from the same Measurement
.reset_index(drop=True)) # drop sorted index and make a new one
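Putting the restack and the shaded plot together, a self-contained sketch (the values are taken from the table in the question; the Agg backend is only used so the script runs headlessly):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# rebuild the wide-format frame from the question
cols = pd.MultiIndex.from_product([[1, 2], ["Mean", "Std"]],
                                  names=["Measurement", None])
df = pd.DataFrame([[17, 1.10, 21, 1.33],
                   [16, 1.08, 21, 1.34],
                   [14, 0.87, 21, 1.35],
                   [11, 0.86, 21, 1.33]],
                  index=pd.Index(range(4), name="Time"), columns=cols)

# restack to long format, one row per (Time, Measurement)
tidy = (df.stack("Measurement")
          .reset_index()
          .sort_values("Measurement")
          .reset_index(drop=True))

# one line plus one shaded band per measurement
fig, ax = plt.subplots()
for name, m in tidy.groupby("Measurement"):
    ax.plot(m["Time"], m["Mean"], label=f"Measurement {name}")
    ax.fill_between(m["Time"], m["Mean"] - m["Std"], m["Mean"] + m["Std"],
                    alpha=0.35)
ax.legend()
```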