I have a number of data files which I would like to process with a gretl script. Each file contains a set of annual figures,
e.g.
year, total
1956, 1639
1957, 790
1958, 1150
1959, 909
1960, 1241
When the above sample data is loaded, gretl reports:
parsing sample.txt...
using delimiter ','
longest line: 12 characters
first field: 'year'
seems to be observation label
number of columns = 2
number of variables: 1
number of non-blank lines: 6
scanning for variable names...
line: year,total
scanning for row labels and data...
first row label "1956", last label "1960"
trying to parse row labels as dates...
1956: probably a year... and just a year
taking date information from row labels
How can I access the 'first row label' and 'last label' within a script?
Given your data set, gretl already recognizes it as a time-series data set with 5 annual observations. Printing the first and last observation labels is quite simple.
The obslabel() function converts the observation number into a date string (depending on the underlying observation frequency).
Furthermore, $t1 and $t2 are accessors, returning the observation period (integer value) of the initial observation and the last observation, respectively, of the currently selected sample. The accessor $tmax returns the observation number of the last observation of the overall data set.
Accessing the 'first row label' and 'last label' can be done by:
string first_obs = obslabel($t1)
string last_obs = obslabel($t2)
string last_obs_2 = obslabel($tmax)
print first_obs last_obs last_obs_2
Returns:
1956
1960
1960
I have a table that shows participation in Hong Kong demonstrations by gender for different dates in 2019 (obtained from this source). The three first rows originally showed the percentage for males, females and unknown/unanswered. The final row shows the sample size. All data was initially of type string (the percentages included the % sign).
My DataFrame is titled gender_table
To be able to do some analysis, I first removed the percentage sign and changed the data to float type.
gender_table = gender_table.astype("float64")
This gives me the following:
To change the percentage values into ratios, I thought I'd just divide all the data (except the final row with sample size) by 100.
gender_table[:-1] = gender_table[:-1]/100
gender_table now looks like this:
My question is this: Why has this operation added additional trailing zeroes to the sample size row?
Pastebin with data (after removal of % signs) available here (can be saved as .csv and read into a Pandas df ("index_col=0")).
All rows of a column are formatted uniformly. The default format for a float variable x is f'{x:.6g}' (for details about format specifiers see here).
So when you divide the first rows, which had 1 decimal place, by 100, they need 3 decimal places. Because all rows in a column share the same format, 285.0 is then displayed as 285.000.
This of course only changes the string representation of the values in the last row; the float values themselves remain unchanged.
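To check this yourself, here is a minimal sketch. The column name and the numbers are made up for illustration (they are not the data from the question); only the shape of the table follows the question.

```python
import pandas as pd

# Hypothetical stand-in for the question's table: three percentage rows
# plus a final sample-size row (values invented for illustration).
gender_table = pd.DataFrame(
    {"survey_a": [50.5, 49.0, 0.5, 285.0]},
    index=["Male", "Female", "Unknown", "Sample size"],
)

# Convert the percentage rows (everything except the last row) to ratios.
gender_table.iloc[:-1] = gender_table.iloc[:-1] / 100

# The printed column uses one shared float format for all of its rows.
print(gender_table)

# The stored sample size is still exactly the float 285.0.
sample_size = gender_table.loc["Sample size", "survey_a"]
print(sample_size == 285.0)
```

If the padded display bothers you, one option is to keep the sample sizes in a separate Series so they do not share a column format with the ratios.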
I have 2 series of data. For the sake of simplicity, let's say the data looks like this:
set 1:
1 3
2 3.5
3 4
4 4.5
5 5
6 5.5
7 6
8 6.5
9 7
10 7.5
set 2:
1.5 2
2.8 4.5
3.5 8
4.5 6
5.5 4.8
6.5 4
7.5 6.5
8.5 9
9.5 3
10.5 4
After charting these 2 sets, I want to get the line that follows whichever data is higher, i.e. the black line in the attached picture. How do I get that? My actual data has thousands of data points, so doing this manually isn't possible.
Added later: Another thing I forgot to mention: in my actual data, one set has about 500 x,y values and the other set has about 50 values, though the end points have the same or similar x values.
Thanks for your help.
Given your information about the chart and the tables, I would do something like this:
The new series will be based on two formulas:
In Column H, I have the formula for the max value (between your two series):
=MAX(B2,E2)
In Column G, I have the formula that decides, based on the max value (formula above), which X value to use (the X value from Series 1 or from Series 2):
=IF(H2=B2,A2,D2)
Then I can plot my graph:
Series 1, Column B
Series 2, Column E
Series 3, Column H.
All series use the X values of Column G.
Introduction
A few assumptions/comments/pitfalls/constraints regarding my solution:
Set 1 and Set 2 are in columns A till D.
The combined data set will combine the x-values of both Sets, and will have additional data points where the lines cross.
It involves several helper columns, in particular to allow you to copy/paste this across multiple worksheets with data.
I did not try to condense too much, to improve readability, and probably some helper columns could be combined.
It was tested with the data set from the question, but difficult to guarantee all "boundary" conditions, e.g. identical data points between Set 1 and Set 2, zero overlap between the two data sets, empty data sets, etc. (I did test some of these, see my comments at the end).
Set 1 and Set 2 must be sorted (on x-values). If this is not the case, a few additional helper columns are needed to sort the data dynamically.
To better understand the solution described below, see herewith the resulting graph, based on the data set in the question (although I added one data point [2.5;3.75] to avoid having the data points of Set 1 and Set 2 perfectly alternating):
General solution outline / methodology
Combine both datasets in a single (sorted) column;
For all x-values, determine highest y-value, between the y-value in the Set, and the calculated y-value on the line segment from the neighboring values in the other Set (looks simple, in particular with the given example data set, but this is quite tricky to do when data sets have no alternating x-values);
Find the points (x & y values) where the lines of the graph are crossing (intersecting), let's call this Set 3
Combine and sort (on x-values) the three data sets in two columns (for x & y values).
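Outside Excel, the four steps above can be sketched in a few lines of Python with NumPy. This is only an illustration of the outline, not a translation of the spreadsheet formulas: the function name upper_envelope is made up, and it assumes both sets are sorted with strictly increasing x-values.

```python
import numpy as np

def upper_envelope(x1, y1, x2, y2):
    """Upper envelope of two piecewise-linear series (x-values must be increasing)."""
    x1, y1, x2, y2 = (np.asarray(a, dtype=float) for a in (x1, y1, x2, y2))
    # Step 1: combine the x-values of both sets into one sorted grid.
    grid = np.union1d(x1, x2)
    # Step 2: evaluate both lines on the grid (NaN outside a set's own x-range)
    # and keep the higher value; fmax ignores NaN.
    a = np.interp(grid, x1, y1, left=np.nan, right=np.nan)
    b = np.interp(grid, x2, y2, left=np.nan, right=np.nan)
    top = np.fmax(a, b)
    # Step 3: the lines cross wherever (a - b) changes sign between two
    # neighbouring grid points; both lines are straight on such an interval.
    d = a - b
    d0, d1 = d[:-1], d[1:]
    flips = np.flatnonzero(d0 * d1 < 0)  # comparisons involving NaN are False
    t = d0[flips] / (d0[flips] - d1[flips])
    cross_x = grid[flips] + t * (grid[flips + 1] - grid[flips])
    cross_y = a[flips] + t * (a[flips + 1] - a[flips])
    # Step 4: merge the grid points and the crossing points, sorted on x.
    xs = np.concatenate([grid, cross_x])
    ys = np.concatenate([top, cross_y])
    order = np.argsort(xs)
    return xs[order], ys[order]

# The two sets from the question.
set1_x = np.arange(1, 11)
set1_y = 3 + 0.5 * (set1_x - 1)
set2_x = [1.5, 2.8, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5]
set2_y = [2, 4.5, 8, 6, 4.8, 4, 6.5, 9, 3, 4]
env_x, env_y = upper_envelope(set1_x, set1_y, set2_x, set2_y)
```

Because the grid contains every x-value from both sets, each line is straight between neighbouring grid points, which is what makes the simple sign-change test in step 3 sufficient.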
The details and formulas
For the formulas, I assume row 1 contains headings, and the data start on row 2. All formulas should be entered in row 2, except for a few, where I mention to put them in row 3 (because they need data from the preceding row). The result is in columns E (x-values) and F (y-values), and G till AG are helper columns.
Column E : =INDEX(AE$2:AE$30;MATCH(ROWS(AE$2:AE2);$AG$2:$AG$30;0)) This is the actual result. It takes all x-values in AE and sorts them using the index column AG; logically this is the last column in the flow, but for presentation purposes it is cleaner to have it next to the input data sets;
F : =INDEX(AF$2:AF$30;MATCH(ROWS(AF$2:AF2);$AG$2:$AG$30;0)) Same for y-values;
G : =IF(ISNA(H2);NA();COUNTIF($H$2:$H$30;"<="&H2)) Creates an index to sort the combined x-values of both data sets. You can also sort dynamically without such a helper column, but then you need a VLOOKUP or INDEX/MATCH, and I have had some bad experiences with these on long decimal numbers;
H : =IF(ROW()-1<=COUNT($A$2:$A$30);A2;IF((ROW()-1)<=(COUNT($A$2:$A$30)+COUNT($C$2:$C$30));INDEX($C$2:$C$30;ROW()-COUNT($A$2:$A$30)-1;1);NA())) Combines x-values of both data sets, i.e. in columns A & C;
I : =IF(ROW()-1<=COUNT($B$2:$B$30);B2;IF((ROW()-1)<=(COUNT($B$2:$B$30)+COUNT($D$2:$D$30));INDEX($D$2:$D$30;ROW()-COUNT($B$2:$B$30)-1;1);NA())) Same for the y-values;
J : =IF(ROW()-1<=COUNT($A$2:$A$30);"S1";IF((ROW()-1)<=(COUNT($A$2:$A$30)+COUNT($C$2:$C$30));"S2";NA())) Assign "S1", or "S2" to each data point, as indication from which data set they come;
K : =IF(J2=J3;INTERCEPT(I2:I3;H2:H3);NA()) Determines the intercept of the line segment starting at that data point;
L : =IF(J2=J3;SLOPE(I2:I3;H2:H3);NA()) Same for slope;
M : =INDEX(H$2:H$30;MATCH(ROWS(H$2:H2);$G$2:$G$30;0)) Sorts all x-values;
N : =INDEX(I$2:I$30;MATCH(ROWS(I$2:I2);$G$2:$G$30;0)) Same for y-values
O : =INDEX(J$2:J$30;MATCH(ROWS(J$2:J2);$G$2:$G$30;0)) Same for corresponding "S1/S2" value to indicate from which data set they come;
P : =INDEX(K$2:K$30;MATCH(ROWS(K$2:K2);$G$2:$G$30;0)) Same for intercept;
Q : =INDEX(L$2:L$30;MATCH(ROWS(L$2:L2);$G$2:$G$30;0)) Same for slope;
R : =IF(O2="S1";"S2";"S1") Inversion between S1 & S2.
S : {=IFERROR(INDEX($O$2:$Q2;MAX(IF($O$2:$O2=$R3;ROW($O$2:$O2)-ROW(INDEX($O$2:$O2;1;1))+1));2);NA())} Array formula to be put in cell S3 (hence ctrl+shift+enter) that will search for the intercept of the preceding data point of the other data set.
T : {=IFERROR(INDEX($O$2:$Q2;MAX(IF($O$2:$O2=$R3;ROW($O$2:$O2)-ROW(INDEX($O$2:$O2;1;1))+1));3);NA())} Same for slope;
U : =IF(OR(ISNA(N2);NOT(ISNUMBER(S2)));NA();M2*T2+S2) Calculates the y-value on the line segment of the other data set;
V : =MAX(IFNA(U2;N2);N2) Maximum value between the original y-value and the calculated y-value on the corresponding line segment of the other data set;
W : =(V2=N2) Checks whether the y-value comes from the original data set or not;
X : =IF(O2="S1";IF(W2;"S1";"S2");IF(W2;"S2";"S1")) Determines on which data set (line) the y-value sits (S1 or S2);
Y : =IFERROR(AND((X2<>X3);COUNTIF(X3:$X$30;X2)>0);FALSE) Determines when the data sets cross (i.e. the lines on the graph intersect);
Z : =IF(Y2;(S2-P2)/(Q2-T2);NA()) Calculates x-value of intersection;
AA : =IF(Y2;Z2*Q2+P2;NA()) Calculates y-value of intersection;
AB : =COUNTIF($Z$2:$Z$30;"<="&Z2) Index to sort the newly calculated intersection points (I sort them because combining them with the other data sets is then straightforward, re-using the formula of column H);
AC : =INDEX(Z$2:Z$30;MATCH(ROWS(Z$2:Z2);$AB$2:$AB$30;0)) Sorted x-values of intersection points;
AD : =INDEX(AA$2:AA$30;MATCH(ROWS(AA$2:AA2);$AB$2:$AB$30;0)) Same for y-values;
AE : =IF(ROW()-1<=COUNT(M$2:M$30);M2;IF((ROW()-1)<=(COUNT(M$2:M$30)+COUNT(AC$2:AC$30));INDEX(AC$2:AC$30;ROW()-COUNT(M$2:M$30)-1;1);NA())) Combine x-values of Set 1, Set 2, and the intersection points;
AF : =IF(ROW()-1<=COUNT(V$2:V$30);V2;IF((ROW()-1)<=(COUNT(V$2:V$30)+COUNT(AD$2:AD$30));INDEX(AD$2:AD$30;ROW()-COUNT(V$2:V$30)-1;1);NA())) Same for y-values;
AG : =IF(ISNA(AE2);NA();COUNTIF($AE$2:$AE$30;"<="&AE2)) Creates the index to sort the resulting data set (this is used to calculate the final results in columns E & F);
All formulas go down to row 30, but this of course needs to be changed based on the actual data sets. The idea is to add these formulas to one worksheet, and then columns E to AG can be copied to all other worksheets. There are obviously quite a few #N/A values, but these are on purpose and are not errors or mistakes. On request, I can share the actual spreadsheet, so you do not have to retype all formulas.
Some additional comments
You have to modify some formulas (the sort indices) if there are identical x-values, either within Set 1 (which I will not cover here, as it seems unlikely or would be a data input error), or between Set 1 and Set 2. The dynamic sorting does not work in that case. A workaround is to create a "synthetic" sort column, e.g. with =TEXT(H2;"0000.00000000000")&J2. This formats all x-values the same way as text, and appends S1 or S2. So this should give unique sort values, which would sort the same way as the corresponding numbers.
Empty data sets or data sets with only 1 value are not treated correctly either (the intercept formulas and finding values for the "previous" data point are meaningless in these cases).
I want to represent the count of IDs with respect to the Date Range column. I tried to split it into bins, but the only option was to split it into a number of parts, not into 3 different date ranges. I want to split the Date Range column into 3 parts so that I can represent the data in a bar chart as Current, Past and Future data. The 3 bins are defined as:
1. Current - Count of ID for the current month [Dec 2016]. The current month should be calculated dynamically, so that when the next month arrives the bin points to that month.
2. Past - Count of ID for data before the current month [Data < Dec 2016]. I need to be able to change the number of months dynamically using a custom expression, so that the user can change how many months it goes back. Ideally the number is set through a custom expression; if that is not possible, it can be changed in the expression itself.
3. Future - Count of ID for data after the current month [Data > Dec 2016]. I need to be able to change the number of months dynamically using a custom expression, so that the user can change how many months into the future it looks. Future dates will be available, since the data covers maintenance scheduled for the future.
These 3 bins need to be a custom/binned column so that the data is represented as shown in the attached picture.
You just need to create a calculated column...
case
when DatePart('month',[DateColumn]) = DatePart('month',DateTimeNow()) and DatePart('year',[DateColumn]) = DatePart('year',DateTimeNow()) then "Current"
when [DateColumn] < DateTimeNow() and [DateColumn] >= DateAdd('month',${NumOfMonthsBack} * (-1),DateTimeNow()) then "Past"
when [DateColumn] > DateTimeNow() and [DateColumn] <= DateAdd('month',${NumOfMonthsAhead},DateTimeNow()) then "Future"
end as [MonthRange]
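To make the binning logic easier to test outside Spotfire, here is a small Python sketch of the same idea. It is not Spotfire syntax, and the names month_range, months_back and months_ahead are made up for illustration. Note one deliberate simplification: it compares whole calendar months, whereas the expression above compares against the current date and time.

```python
from datetime import date

def month_index(d):
    """Number of whole months since year 0, so months can be subtracted."""
    return d.year * 12 + d.month - 1

def month_range(d, today, months_back, months_ahead):
    """Classify a date as Current, Past or Future relative to today's month."""
    diff = month_index(d) - month_index(today)
    if diff == 0:
        return "Current"
    if -months_back <= diff < 0:
        return "Past"
    if 0 < diff <= months_ahead:
        return "Future"
    return None  # outside both windows, like a case with no matching branch

today = date(2016, 12, 15)
labels = [
    month_range(d, today, months_back=3, months_ahead=3)
    for d in (date(2016, 12, 1), date(2016, 10, 20), date(2017, 2, 1), date(2015, 1, 1))
]
print(labels)
```

Passing today in as a parameter, instead of reading the clock inside the function, keeps the result reproducible when you check the bins.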
I have a 100 x 101 data matrix. I want to convert it into a series, i.e. all column values for the first row, then all column values for the 2nd row, and so on. The result will have only three columns: the first with the row number, the second with the column number, and the third with the value at that row and column.
Could you please help me do this conversion in MATLAB?
Available data are in ASCII format and it is possible to open in both MATLAB and Excel.
This can be done with find (note that find returns only the nonzero elements, so any exact zeros in A would be dropped):
A = rand(100,101);
[data(:,1), data(:,2), data(:,3)] = find(A);
data = sortrows(data,[1 2]);
Note that this is highly inefficient, as you are storing 3 values where you only need to store 1 (the element's actual value). For accessing a specific element, say row 31, column 43, you simply do A(31,43), where you index the matrix.
The size of data in memory is indeed three times that of A:
whos
Name Size Bytes Class Attributes
A 100x101 80800 double
data 10100x3 242400 double
You can use the ind2sub function, which is faster and makes more sense in this situation:
tic
A = rand(100,101);
[data(:,1), data(:,2), data(:,3)] = find(A);
data = sortrows(data,[1 2]);
toc
tic
B = A' ;
[data_B(:,2), data_B(:,1)] = ind2sub(size(B), 1:numel(B)); % B = A', so ind2sub returns (column of A, row of A)
data_B(:,3) = B(:);
toc
The output for the timing is as follows:
Elapsed time is 0.002130 seconds (first method)
Elapsed time is 0.000525 seconds (second method).
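For comparison, the same row-by-row flattening can be sketched in Python with NumPy. This is only an illustration, not part of the MATLAB answers; the small 2 x 3 matrix is made up so the result is easy to verify by eye, and the indices are shifted by one to match MATLAB's 1-based numbering.

```python
import numpy as np

# Small made-up matrix instead of the 100 x 101 data, for readability.
A = np.arange(6, dtype=float).reshape(2, 3)

# Row and column index of every element, walking the matrix row by row.
rows, cols = np.unravel_index(np.arange(A.size), A.shape)

# Three columns: row number, column number, value (1-based like MATLAB).
triples = np.column_stack([rows + 1, cols + 1, A.ravel()])
print(triples)
```

Both ravel and unravel_index default to row-major order, so the values and their indices stay aligned without any sorting step.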
I have 3 columns of data in a text file. The first column is time; the second and third columns are my variable parameters. I want to plot, against the current value of the 1st column, the difference between the current value in the 3rd column and the previous value in the 2nd column.
That is a bit tricky since gnuplot usually has access only to the values of the current row. However, you can keep a limited number of previous values in memory with some trickery inside the using statement.
Unfortunately you don't give any example data, so here is my test data file test.dat (without times, you'll need to adapt that by yourself):
0 1 5
1 12 7
2 1 6
3 5 5
The following script plots the difference between the current value in column 3 and the previous value in column 2:
back2 = back1 = 0
shift(x) = (back2 = back1, back1 = x)
plot 'test.dat' using 0:(shift($2), $0 < 1 ? 1/0 : $3 - back2) w lp pt 7 ps 2
What happens here is basically the following:
Inside the using statement you can separate several statements like assignments with commas. Only the last expression in one column is used as actual data value.
The shift function first assigns the value stored from the previous row to the variable back2 and then stores the current value in back1. So calling shift($2) leaves the previous row's column 2 value in back2. Then you can do the actual computation as $3 - back2. The condition $0 < 1 discards the first row, for which there is no previous value.
The result is:
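If you want to cross-check the numbers that end up in the plot, the same lagged difference can be computed in a few lines of Python. This is a sketch for verification only, using the test.dat rows from above; the function name lagged_difference is made up.

```python
# Rows of test.dat from above: (column 1, column 2, column 3).
rows = [(0, 1, 5), (1, 12, 7), (2, 1, 6), (3, 5, 5)]

def lagged_difference(rows):
    """Current column 3 minus the previous row's column 2, skipping row one."""
    result = []
    previous_second = None  # plays the role of back2 in the gnuplot script
    for first, second, third in rows:
        if previous_second is not None:
            result.append((first, third - previous_second))
        previous_second = second
    return result

differences = lagged_difference(rows)
print(differences)
```

The first row is skipped for the same reason as in the gnuplot script: it has no previous row to take column 2 from.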