Pivot_tables' representation of DatetimeIndex is faulty

I am encountering this weird 'visual' effect that takes place on my pivot table.
This code outputs the representation of the pivot_table's index as shown in Annex 1:
pt = df_sorted.pivot_table(index = [date_col],values=[measure],columns=[apporteur_col],aggfunc=[len],fill_value=0)
pt=pt.loc[:,'len'].loc[:,measure][['A','B','C']]
While this code outputs the representation of the pivot_table's index as shown in Annex 2:
pt = df_sorted.pivot_table(index = [date_col],values=[measure],columns=[apporteur_col],aggfunc=[len],fill_value=0,margins=True)
pt=pt.loc[:,'len'].loc[:,measure][['A','B','C',"All"]]
As you can see, the only difference is the use of margins=True and the column All.
My goal
To have the representation of the index as date only. (Like Annex 1)
My questions
Why does this happen?
How do I fix it?
EDIT
Even though Annex 1 shows the index as dd/mm/yyyy only, once I plot the pivot_table over a 7-day period, the x axis is shown as dd/mm/yyyy HH:MM:SS (see Annex 3).
P.S : I removed the keys so the sensitive info doesn't show.
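A likely explanation, offered as an assumption rather than a confirmed diagnosis: with margins=True the row label "All" is appended to the index, so it can no longer be a DatetimeIndex and falls back to a generic index whose date entries print as full timestamps. The sketch below uses a made-up three-row frame (the column names date, who and value are placeholders, not the original data) and formats the date labels as dd/mm/yyyy strings after pivoting.

```python
import pandas as pd

# Hypothetical stand-in for df_sorted; the column names are placeholders.
df = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-01-01", "2019-01-02"]),
    "who": ["A", "B", "A"],
    "value": [1.0, 2.0, 3.0],
})

pt = df.pivot_table(index="date", columns="who", values="value",
                    aggfunc=len, fill_value=0, margins=True)

# The "All" row forces a mixed index, so it is no longer a DatetimeIndex.
was_datetime_index = isinstance(pt.index, pd.DatetimeIndex)
print(type(pt.index).__name__, pt.index.dtype)

# Format the date labels explicitly and leave the "All" label untouched.
pt.index = [label.strftime("%d/%m/%Y") if hasattr(label, "strftime") else label
            for label in pt.index]
print(pt)
```

Because the labels are now plain strings, the same formatting carries over to anything that reads the index, at the cost of losing datetime behaviour on it.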
Annex

Related

extraneous digits formatting within dataframe

I am running into a formatting/precision issue which I'm hoping to control.
I obtain a list of numbers such as:
x = [0.009947, 0.009447, 0.008947]
The finished product I'm after is a DataFrame with a column whose value is this list but multiplied by 100 with 3 decimal places, e.g.
[0.995, 0.945, 0.895]
I proceed as follows:
x = 100*np.around([0.009947, 0.009447, 0.008947],5)
this displays as
array([0.995, 0.945, 0.895])
When I build the DataFrame:
pd.DataFrame({'test':[x]})
I get for the value in the 'test' column:
[0.9950000000000001, 0.9450000000000001, 0.895]
This does not happen in other examples and I'm not sure how to control the behavior. I'd appreciate any suggestions.

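This is a general issue with floating-point arithmetic on computers: most decimal fractions have no exact binary representation, so a result such as 100 * 0.00995 is stored as the nearest representable value, and the full repr shows it. See the linked explanation from the docs.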
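A minimal sketch of one way to control this, assuming the goal is simply three decimals in the stored list: round after the multiplication instead of before it. The variable names raw and fixed are mine, not from the question.

```python
import numpy as np
import pandas as pd

raw = 100 * np.around([0.009947, 0.009447, 0.008947], 5)
print(raw)           # NumPy prints a shortened form of each value
print(raw.tolist())  # plain Python floats show their full repr

# Round after multiplying, so each value is the float nearest the 3-decimal target.
fixed = np.round(raw, 3)

df = pd.DataFrame({"test": [fixed.tolist()]})
print(df.loc[0, "test"])
```

The values are still binary floats; rounding last just means the nearest float to 0.995 is stored, which is the one whose repr is 0.995.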

Pandas, why does division done to other rows lead to additional trailing zeroes on final row?

I have a table that shows participation in Hong Kong demonstrations by gender for different dates in 2019 (obtained from this source). The first three rows originally showed the percentage for males, females and unknown/unanswered. The final row shows the sample size. All data was initially of type string (the percentages included the % sign).
My DataFrame is titled gender_table
To be able to do some analysis, I first removed the percentage sign and changed the data to float type.
gender_table = gender_table.astype("float64")
This gives me the following:
To change the percentage values into ratios, I thought I'd just divide all the data (except the final row with sample size) by 100.
gender_table[:-1] = gender_table[:-1]/100
gender_table now looks like this:
My question is this: Why has this operation added additional trailing zeroes to the sample size row?
Pastebin with data (after removal of % signs) available here (can be saved as .csv and read into a Pandas df ("index_col=0")).

All rows of a column are formatted uniformly. The default format for a float variable x is f'{x:.6g}' (for details about format specifiers see here).
So when you divide the first rows that had 1 decimal place by 100 they get 3 decimal places and as all rows in a column share the same format 285.0 becomes 285.000.
This of course only changes the string representation of the values in the last row; the float values themselves remain unchanged.
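The effect can be reproduced on a tiny frame. This is a sketch with invented numbers and an invented column label, not the data from the question; it only shows that the last row's value is untouched while its printed form changes.

```python
import pandas as pd

# Hypothetical miniature of gender_table: two percentage rows and a sample-size row.
table = pd.DataFrame({"2019-06-09": [45.3, 54.1, 285.0]},
                     index=["Male", "Female", "Sample size"])
before = table.to_string()

# Same operation as in the question: divide every row except the last by 100.
table.iloc[:-1] = table.iloc[:-1] / 100
after = table.to_string()

print(before)  # every value in the column is shown with one decimal
print(after)   # every value in the column is shown with three decimals
print(table.iloc[-1, 0] == 285.0)  # the stored sample size is unchanged
```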

Excel - max between 2 series

I have 2 series of data. For the sake of simplicity, let's say the data looks like below,
set 1:
1 3
2 3.5
3 4
4 4.5
5 5
6 5.5
7 6
8 6.5
9 7
10 7.5
set 2:
1.5 2
2.8 4.5
3.5 8
4.5 6
5.5 4.8
6.5 4
7.5 6.5
8.5 9
9.5 3
10.5 4
After charting these 2 sets, I want to get the line that follows whichever series is higher, i.e. the black line in the attached picture. How do I get that? My actual data has thousands of data points, so doing this manually isn't possible.
Added later: Another thing I forgot to mention, in my actual data 1 set has about 500 x,y values, and the other set has about 50 values. Though the end points have same/similar x values.
Thanks for your help.

Given your information about the chart and the tables, I would do something like this:
The new series will be based on two formulas:
In Column H, I have the formula for the max value (between your two series):
=MAX(B2,E2)
In Column G, I have a formula that picks which X value to use (the X value from Series 1 or Series 2), depending on which series supplied the max value above:
=IF(H2=B2,A2,D2)
Then I can plot my graph:
Series 1, Column B
Series 2, Column E
Series 3, Column H.
All series use the X values of Column G.

Introduction
A few assumptions/comments/pitfalls/constraints regarding my solution:
Set 1 and Set 2 are in columns A till D.
The combined data set will combine the x-values of both Sets, and will have additional data points where the lines cross.
It involves several helper columns, in particular to allow you to copy/paste this across multiple worksheets with data.
I did not try to condense too much, to improve readability, and probably some helper columns could be combined.
It was tested with the data set from the question, but difficult to guarantee all "boundary" conditions, e.g. identical data points between Set 1 and Set 2, zero overlap between the two data sets, empty data sets, etc. (I did test some of these, see my comments at the end).
Set 1 and Set 2 must be sorted (on x-values). If this is not the case, a few additional helper columns are needed to sort the data dynamically.
To better understand the solution described below, see herewith the resulting graph, based on the data set in the question (although I added one data point [2.5;3.75] to avoid having the data points of Set 1 and Set 2 perfectly alternating):
General solution outline / methodology
Combine both datasets in a single (sorted) column;
For all x-values, determine highest y-value, between the y-value in the Set, and the calculated y-value on the line segment from the neighboring values in the other Set (looks simple, in particular with the given example data set, but this is quite tricky to do when data sets have no alternating x-values);
Find the points (x & y values) where the lines of the graph are crossing (intersecting), let's call this Set 3;
Combine and sort (on x-values) the three data sets into two columns (for x & y values).
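Outside Excel, the same outline can be sketched in a few lines of Python with NumPy. This is an illustration of the idea, not a translation of the formulas that follow: the function name upper_envelope is made up, both sets are assumed to be sorted on x, and for simplicity the result is restricted to the x-range where the two sets overlap.

```python
import numpy as np

def upper_envelope(x1, y1, x2, y2):
    """Return (x, y) of the higher of two piecewise-linear series (hypothetical helper)."""
    x1, y1, x2, y2 = (np.asarray(a, dtype=float) for a in (x1, y1, x2, y2))

    # Step 1: combine the x-values of both sets (sorted, duplicates removed).
    # Assumption: only the overlapping x-range is kept, so nothing is extrapolated.
    lo, hi = max(x1[0], x2[0]), min(x1[-1], x2[-1])
    xs = np.union1d(x1, x2)
    xs = xs[(xs >= lo) & (xs <= hi)]

    # Step 2: evaluate both lines at every combined x-value.
    a = np.interp(xs, x1, y1)
    b = np.interp(xs, x2, y2)

    # Step 3: between neighbouring x-values both lines are straight, so a sign
    # change of (a - b) marks exactly one crossing, found by linear interpolation.
    d = a - b
    idx = np.where(d[:-1] * d[1:] < 0)[0]
    t = d[idx] / (d[idx] - d[idx + 1])
    x_cross = xs[idx] + t * (xs[idx + 1] - xs[idx])

    # Step 4: merge the crossings in and take the higher line everywhere.
    x_all = np.union1d(xs, x_cross)
    y_all = np.maximum(np.interp(x_all, x1, y1), np.interp(x_all, x2, y2))
    return x_all, y_all

# Two lines forming an X: the envelope is a V with its low point at the crossing.
x, y = upper_envelope([0, 2], [0, 2], [0, 2], [2, 0])
print(x, y)
```

Unequal set sizes (for example 500 points against 50) are not a problem here, because both lines are evaluated on the combined x-values.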
The details and formulas
For the formulas, I assume row 1 contains headings, and the data start on row 2. All formulas should be entered in row 2, except for a few, where I mention to put them in row 3 (because they need data from the preceding row). The result is in columns E (x-values) and F (y-values), and G till AG are helper columns.
Column E : =INDEX(AE$2:AE$30;MATCH(ROWS(AE$2:AE2);$AG$2:$AG$30;0)) This is the actual result. Gets all x-values in AE and sorts these based on the index column AG; this should actually be the last column in the logical flow, but for presentation purposes it is cleaner to have this next to the input data sets;
F : =INDEX(AF$2:AF$30;MATCH(ROWS(AF$2:AF2);$AG$2:$AG$30;0)) Same for y-values;
G : =IF(ISNA(H2);NA();COUNTIF($H$2:$H$30;"<="&H2)) Creates index to sort combined x-values of both data sets. You also can dynamically sort without such helper column, but then you need a VLOOKUP or INDEX/MATCH and with long decimal numbers I have some bad experiences with these;
H : =IF(ROW()-1<=COUNT($A$2:$A$30);A2;IF((ROW()-1)<=(COUNT($A$2:$A$30)+COUNT($C$2:$C$30));INDEX($C$2:$C$30;ROW()-COUNT($A$2:$A$30)-1;1);NA())) Combines x-values of both data sets, i.e. in columns A & C;
I : =IF(ROW()-1<=COUNT($B$2:$B$30);B2;IF((ROW()-1)<=(COUNT($B$2:$B$30)+COUNT($D$2:$D$30));INDEX($D$2:$D$30;ROW()-COUNT($B$2:$B$30)-1;1);NA())) Same for the y-values;
J : =IF(ROW()-1<=COUNT($A$2:$A$30);"S1";IF((ROW()-1)<=(COUNT($A$2:$A$30)+COUNT($C$2:$C$30));"S2";NA())) Assign "S1", or "S2" to each data point, as indication from which data set they come;
K : =IF(J2=J3;INTERCEPT(I2:I3;H2:H3);NA()) Determines the intercept of the line segment starting at that data point;
L : =IF(J2=J3;SLOPE(I2:I3;H2:H3);NA()) Same for slope;
M : =INDEX(H$2:H$30;MATCH(ROWS(H$2:H2);$G$2:$G$30;0)) Sorts all x-values;
N : =INDEX(I$2:I$30;MATCH(ROWS(I$2:I2);$G$2:$G$30;0)) Same for y-values
O : =INDEX(J$2:J$30;MATCH(ROWS(J$2:J2);$G$2:$G$30;0)) Same for corresponding "S1/S2" value to indicate from which data set they come;
P : =INDEX(K$2:K$30;MATCH(ROWS(K$2:K2);$G$2:$G$30;0)) Same for intercept;
Q : =INDEX(L$2:L$30;MATCH(ROWS(L$2:L2);$G$2:$G$30;0)) Same for slope;
R : =IF(O2="S1";"S2";"S1") Inversion between S1 & S2.
S : {=IFERROR(INDEX($O$2:$Q2;MAX(IF($O$2:$O2=$R3;ROW($O$2:$O2)-ROW(INDEX($O$2:$O2;1;1))+1));2);NA())} Array formula to be put in cell S3 (hence ctrl+shift+enter) that will search for the intercept of the preceding data point of the other data set.
T : {=IFERROR(INDEX($O$2:$Q2;MAX(IF($O$2:$O2=$R3;ROW($O$2:$O2)-ROW(INDEX($O$2:$O2;1;1))+1));3);NA())} Same for slope;
U : =IF(OR(ISNA(N2);NOT(ISNUMBER(S2)));NA();M2*T2+S2) Calculates the y-value on the line segment of the other data set;
V : =MAX(IFNA(U2;N2);N2) Maximum value between the original y-value and the calculated y-value on the corresponding line segment of the other data set;
W : =(V2=N2) Checks whether the y-value comes from the original data set or not;
X : =IF(O2="S1";IF(W2;"S1";"S2");IF(W2;"S2";"S1")) Determines on which data set (line) the y-value sits (S1 or S2);
Y : =IFERROR(AND((X2<>X3);COUNTIF(X3:$X$30;X2)>0);FALSE) Determines when the data sets cross (i.e. the lines on the graph intersect);
Z : =IF(Y2;(S2-P2)/(Q2-T2);NA()) Calculates x-value of intersection;
AA : =IF(Y2;Z2*Q2+P2;NA()) Calculates y-value of intersection;
AB : =COUNTIF($Z$2:$Z$30;"<="&Z2) Index to sort the newly calculated intersection points (I sort them because combining them with the other data sets is then straightforward, re-using the formula of column H);
AC : =INDEX(Z$2:Z$30;MATCH(ROWS(Z$2:Z2);$AB$2:$AB$30;0)) Sorted x-values of intersection points;
AD : =INDEX(AA$2:AA$30;MATCH(ROWS(AA$2:AA2);$AB$2:$AB$30;0)) Same for y-values;
AE : =IF(ROW()-1<=COUNT(M$2:M$30);M2;IF((ROW()-1)<=(COUNT(M$2:M$30)+COUNT(AC$2:AC$30));INDEX(AC$2:AC$30;ROW()-COUNT(M$2:M$30)-1;1);NA())) Combine x-values of Set 1, Set 2, and the intersection points;
AF : =IF(ROW()-1<=COUNT(V$2:V$30);V2;IF((ROW()-1)<=(COUNT(V$2:V$30)+COUNT(AD$2:AD$30));INDEX(AD$2:AD$30;ROW()-COUNT(V$2:V$30)-1;1);NA())) Same for y-values;
AG : =IF(ISNA(AE2);NA();COUNTIF($AE$2:$AE$30;"<="&AE2)) Creates index to sort the resulting data set (this is used to calculate the final results in columns E & F);
All formulas go until row 30, but this needs to be changed of course based on the actual data sets. The idea is to add these formulas to one worksheet, and then columns E > AG can be copied to all other worksheets. There are obviously quite a few #N/A values, but these are on purpose and are not errors or mistakes. On request, I can share the actual spreadsheet, so you do not have to retype all formulas.
Some additional comments
You have to modify some formulas (the sort indices) if there are identical x-values, either within Set 1 (which I will not cover here, as it seems this would be unlikely, or be data input errors), or between Set 1 and Set 2. The dynamic sorting does not work in that case. A workaround is to create a "synthetic" sort column, e.g. with =TEXT(H2;"0000.00000000000")&J2. This formats all numbers the same way as text, and appends S1 or S2. So this should give unique sort values, which would sort the same way as the corresponding numbers.
Empty data sets or data sets with only 1 value are not treated correctly either (the intercept formulas and finding values for the "previous" data point are meaningless in these cases).

Convert Fixed format Rpg code to free format

Hi, I need to convert fixed-format code to free format as per my company's coding standard instructions. I don't even understand what the code below means, except that some date movement and conversions are happening.
C *ISO MOVE BCEXDT MDYDATE
C MOVE MDYDATE PEXPDATT
C *ISO MOVE BCSTDT MDYDATE
C MOVE MDYDATE PSTRDAT
Field definitions below:
BCEXDT 8S 0
BCSTDT 8S 0
D MDYDATE S D DATFMT(*MDY)

The two move operations for which you have shown the definitions can be converted like this:
C *ISO MOVE BCEXDT MDYDATE
C *ISO MOVE BCSTDT MDYDATE
to
MDYDate = %date(bcexdt: *ISO);
MDYDate = %date(bcstdt: *ISO);
Note that MDYDate is a date field, and has the same internal format as every other date field. The DATFMT(*MDY) keyword only defines how the field is represented externally by default when loaded into (or from) a character or numeric field. It also sets limits on allowable values. In the case of *MDY, those limits are 01-01-1940 to 12-31-2039. Notice that the values in BCEXDT and BCSTDT are in *ISO format, and the value limits are different for those fields, specifically 0001-01-01 to 9999-12-31. So you could get errors when attempting to assign the *ISO date to a *MDY date field.
The two moves I haven't converted for you would be done in one of the following ways.
If the targets are date fields:
pexpdatt = MDYDate;
pstrdat = MDYDate;
If the targets are numeric fields:
pexpdatt = %dec(MDYDate);
pstrdat = %dec(MDYDate);
If the targets are character fields:
pexpdatt = %char(MDYDate);
pstrdat = %char(MDYDate);
%dec() and %char() will assign the date fields in *MDY format since it was not explicitly specified in the assignment, and the date fields have DATFMT(*MDY).
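The *MDY range limit mentioned above is easy to overlook, so here is a rough Python analogy (not RPG, and not how the compiler implements it): an 8-digit yyyymmdd number is parsed, checked against the 1940 to 2039 window, and rendered as mm/dd/yy. The function name and the error message are made up for illustration.

```python
from datetime import date, datetime

# Allowed range for a two-digit-year *MDY date, as quoted in the answer above.
MDY_MIN = date(1940, 1, 1)
MDY_MAX = date(2039, 12, 31)

def iso_numeric_to_mdy(value: int) -> str:
    """Hypothetical helper: 8-digit yyyymmdd number -> 'mm/dd/yy' text."""
    parsed = datetime.strptime(f"{value:08d}", "%Y%m%d").date()
    if not (MDY_MIN <= parsed <= MDY_MAX):
        # Loosely mirrors the error an out-of-range assignment could raise.
        raise ValueError(f"{parsed.isoformat()} is outside the *MDY range")
    return parsed.strftime("%m/%d/%y")

print(iso_numeric_to_mdy(20240131))  # a date inside the window
```

A value such as 20400101 is a valid *ISO date but raises here, which is the situation to watch for when the source fields can hold dates outside the window.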

Nobody else is going to be able to tell you what is going on either without the definitions of the source and target fields.
That's why IBM dropped the MOVE op-code from free-form.

Adding annotated values in xmgrace using batch mode

I need some help adding a series of annotated string values to my xy plot in Grace-5.1.22 when using a batch script to plot the graph.
I have two columns of data: a series of x values and a series of y values, and then a third column containing a number for each data point. If I want to plot all numbers in the z column as annotated values then I am able to import the data into grace using the command:
BLOCK XYZ "1:2:3"
S0 AVALUE ON
S0 AVALUE TYPE "Z"
However, since where data points are bunched together the annotations are not clear, I don't want to display the annotation on EVERY point and therefore have replaced some values in the z column with "" to leave the corresponding annotation blank. This means that the 3rd column is now a string and the above commands will not work.
This website http://plasma-gate.weizmann.ac.il/Xmgr/doc/commands.html suggests that I should be able to import the data as:
BLOCK XYSTRING "1:2:3"
However, unfortunately this type was removed from versions 5.0.3 onwards http://plasma-gate.weizmann.ac.il/Grace/doc/CHANGES.html
Instead the documentation now says that "any set type now may have an additional column of text strings associated" however the command:
BLOCK XY "1:2:3"
does not load any data giving the error: "Too many columns scanned in column string".
I am able to import string type annotated values into xy data using the GUI by:
Data>>Import>>ASCII>>"filename.dat">>"load as BLOCK DATA">>"strings from column 3"
however since I would like to plot the same graph many times for different data sets I would like a way of automating this process and not using the GUI.
Sorry for the long question, but I would be very grateful if anyone could help me.

I know it's old but I had the same question and after 3 hours of experimenting, I found a solution. Therefore I post it for future reference (since no answer currently exists elsewhere online).
If your "filename.dat" has the content format of (x y "string") or (x y z) then you can add to your batch file the following to read it and add the annotations:
READ XY "filename.dat"
s0 line type 1
s0 line linewidth 1
s0 line color 1
# Enable annotations
s0 avalue on
# Select type: 0=None,4=String,5=Z
s0 avalue type 5
s0 avalue color 1
s0 avalue char size 1.0000
s0 avalue offset 0.000000 , 0.000000
If you want to see more options for batch files, you can make a plot interactively with the gui and then save it as an .agr file. If you open this with a text editor, you have all the available commands there.
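Since the goal was to repeat the same plot for many data sets, the batch commands above can be generated per file. The sketch below only builds the batch text; the commands are copied from this answer and have not been checked against every Grace version, and the helper name and the data file names are placeholders.

```python
# Batch commands copied from the answer above; only the data file name varies.
TEMPLATE = """READ XY "{data_file}"
s0 line type 1
s0 line linewidth 1
s0 line color 1
s0 avalue on
s0 avalue type 5
s0 avalue color 1
s0 avalue char size 1.0000
s0 avalue offset 0.000000 , 0.000000
"""

def build_batch(data_file: str) -> str:
    """Return the batch text for one data file (hypothetical helper)."""
    return TEMPLATE.format(data_file=data_file)

# Placeholder file names; substitute the real data sets.
for name in ["run1.dat", "run2.dat"]:
    print(build_batch(name))
```

Each generated text can be saved to its own file and handed to Grace's batch option, assuming your installation is invoked the same way as in the question.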

This is not a direct answer to your question but it may give you a hint.
I ran across the same error message with the following syntax:
gracebat -graph 0 -settype bardy -block mydatafile -bxy 0:5:6:2
(the last column intended to be the optional string column)
And this one proved to be the right one :
gracebat -graph 0 -settype bardy -block mydatafile -bxy 0:5:6:{2}
So I guess you need to add curly braces.
