Histogram bars with different bar colors

I would like to get a histogram with alternating gradients of a given color, according to decile breakpoints, as shown in the figure below:
Example data:
clear
input float dn3001 double hw0010
1219000 2823.89408574376
-16390 520.112200750285
121010 238.732322261911
953839 221.316063150235
465000 247.280750487467
-870 280.305382323347
96000 2946.16661611018
69500 355.33497718705
113000 1421.43087298696
30500 616.914514202173
20000 3389.34765405599
154000 305.674687642557
440500 525.694777777734
56870 1823.24691219821
330500 376.651172915574
101000 465.098273950744
401046.5 660.816203440777
31872 1693.02190101773
220345 603.326244510505
193360 677.527413164373
196300 568.436679602066
222640 427.051692314575
510500 318.557431587468
131450 1388.72862441839
122300 532.996690473983
305 2441.72289873923
313500 292.610321722557
184500 2699.67735757755
1615564.6 386.944439319246
126528 3018.77523617479
711110 511.604491869939
127440 256.968118266053
424900 1620.366555701
95491 3097.46262561529
287500 413.119620218929
70050 2119.47171174278
75460 299.232446656805
210500 290.391474820414
135800 292.141670444933
119924 303.953183619671
81075 1568.41438245214
152 289.175871985445
73000 2551.12752046544
246500 327.474430367518
159960 2350.26463245568
14522 456.56909870547
139000 319.451311193507
68661 2771.34087931684
214089.7 388.589383036063
927800 849.088069585408
7840 1512.71702946577
140140 852.940547469624
21646.566 2405.47949923772
end
The code below produces a graph with uneven bar spread:
xtile aux = dn3001 [aw=hw0010], nq(10)
_pctile dn3001 [aw=hw0010], nq(10)
sort dn3001
list dn3001 aux
return list
scalar p10=r(r1)
scalar p20=r(r2)
scalar p30=r(r3)
scalar p40=r(r4)
scalar p50=r(r5)
scalar p60=r(r6)
scalar p70=r(r7)
scalar p80=r(r8)
scalar p90=r(r9)
drop aux
sum dn3001 [aw=hw0010], d
scalar p1=r(p1)
scalar p95=r(p95)
twoway histogram dn3001 if dn3001>=scalar(p1) & dn3001<scalar(p10), bcolor(green%20) freq legend(off) ///
|| histogram dn3001 if dn3001>=scalar(p10) & dn3001<scalar(p20), bcolor(green) freq legend(off) ///
|| histogram dn3001 if dn3001>=scalar(p20) & dn3001<scalar(p30), bcolor(green%20) freq legend(off) ///
|| histogram dn3001 if dn3001>=scalar(p30) & dn3001<scalar(p40), bcolor(green) freq legend(off) ///
|| histogram dn3001 if dn3001>=scalar(p40) & dn3001<scalar(p50), bcolor(green%20) freq legend(off) ///
|| histogram dn3001 if dn3001>=scalar(p50) & dn3001<scalar(p60), bcolor(green) freq legend(off) ///
|| histogram dn3001 if dn3001>=scalar(p60) & dn3001<scalar(p70), bcolor(green%20) freq legend(off) ///
|| histogram dn3001 if dn3001>=scalar(p70) & dn3001<scalar(p80), bcolor(green) freq legend(off) ///
|| histogram dn3001 if dn3001>=scalar(p80) & dn3001<scalar(p90), bcolor(green%20) freq legend(off) ///
|| histogram dn3001 if dn3001>=scalar(p90) & dn3001<scalar(p95), bcolor(green) freq legend(off)
How can I get the same bar width?

Here is one potential approach:
twoway__histogram_gen dn3001, freq bin(50) generate(b a, replace)
_pctile dn3001 [aw=hw0010], nq(10)
return list
scalars:
r(r1) = 20000
r(r2) = 30500
r(r3) = 68661
r(r4) = 75460
r(r5) = 96000
r(r6) = 126528
r(r7) = 159960
r(r8) = 196300
r(r9) = 440500
generate group = .
// assign each bin centre (a) to its decile group, working from the top down
forvalues i = 9(-1)1 {
    replace group = `i' if a <= `r(r`i')'
}
// bins above the 9th decile; only the first 20 observations hold bin results,
// so guard with _n <= 20 to avoid assigning missing values of a to group 10
replace group = 10 if a > `r(r9)' & _n <= 20
list a b group in 1 / 20, sepby(group)
+-----------------------+
| a b group |
|-----------------------|
1. | -70.45375 6 1 |
|-----------------------|
2. | 32568.64 4 3 |
3. | 65207.73 7 3 |
|-----------------------|
4. | 97846.82 4 6 |
|-----------------------|
5. | 130485.9 9 7 |
|-----------------------|
6. | 163125 2 8 |
7. | 195764.1 4 8 |
|-----------------------|
8. | 228403.2 3 9 |
9. | 261042.3 1 9 |
10. | 293681.4 1 9 |
11. | 326320.5 2 9 |
12. | 391598.7 1 9 |
13. | 424237.8 2 9 |
|-----------------------|
14. | 456876.8 1 10 |
15. | 522155 1 10 |
16. | 717989.6 1 10 |
17. | 913824.1 1 10 |
18. | 946463.3 1 10 |
19. | 1207576 1 10 |
20. | 1599245 1 10 |
+-----------------------+
Result:
twoway (bar b a, barwidth(25000) legend(off)) ///
(bar b a if group == 3, barwidth(25000) color(green)) ///
(bar b a if group == 9, barwidth(25000) color(red))
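For readers who want the same bin-then-colour idea outside Stata, here is a minimal Python/matplotlib sketch: generate uniform bins first, assign each bin centre to a decile group, then colour bars by group. The data and colours are synthetic and unweighted; every name here is illustrative, not the poster's.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.lognormal(mean=12, sigma=1, size=500)   # skewed toy data

counts, edges = np.histogram(x, bins=50)        # uniform-width bins
centers = (edges[:-1] + edges[1:]) / 2
deciles = np.percentile(x, np.arange(10, 100, 10))
group = np.digitize(centers, deciles)           # decile group of each bin centre
colors = np.where(group % 2 == 0, 'lightgreen', 'green')

plt.bar(centers, counts, width=np.diff(edges), color=colors)
plt.show()
Because every bar comes from the same set of bin edges, the widths are identical by construction, which is exactly what overlaying ten separate histograms cannot guarantee.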

More a comment (or a series of comments) than an answer you seek, but the graph won't fit in a comment.
Your approach looks doomed -- if not to failure, then to extreme difficulty.
There is no guarantee whatsoever that any of your quantile bin limits will match any of the histogram bin limits.
Similarly, there is no guarantee that the difference between adjacent quantiles is a simple multiple of any histogram bin width you might choose. You might be tempted to fudge this by colouring a bar according to whichever quantile bin was more frequent, but that would be ignoring details. So suppose your histogram bar was for [100, 200) but some values in that interval belong to one quantile bin and some to another: what would you do? And what would you do if 3 or more quantile bins fell within a histogram bar?
By specifying multiple histograms without specifying starts or bin widths you are unleashing anarchy. Stata will make separate decisions for each histogram based partly on sample sizes. That's what your code is telling it to do, but not what you want.
Your histograms don't know anything about the analytic weights you used.
Beyond that, your question raises all sorts of unnecessary puzzles.
Why produce aux and do nothing with it? It's a point of standard art on SO to show the minimum code necessary to explain your problem.
You say you are interested in deciles but inconsistently are also working with the 1st and 95th percentiles.
Why you have such irregular values with very different weights is unclear, but that is inessential for your immediate question. All of this inclines me to think that you cannot get a histogram like your example graph easily or effectively from your data: you have just 53 data points, so the weights make no difference to the fact that you cannot have more than 53 non-empty bins.
How the bin limits fall relative to the data can be shown directly without a histogram.
With your example data (thanks!) I do this
xtile aux = dn3001 [aw=hw0010], nq(10)
quantile dn3001, ms(none) mla(aux) mlabpos(0) scheme(s1color) rlopts(lc(none))
I would use a logarithmic scale ordinarily but negative values rule that out.
Here I go beyond strict programming issues, but the question inevitably raises the issues I address.

Related

IndexError: index 20 is out of bounds for axis 0 with size 20

I am working on radio frequency interference (RFI). I am trying to simulate RFIs and add them to the frequency range according to their classifications, i.e., [continuous, intermittent, malfunction]. The continuous RFIs have to be present everywhere. I did the calculations and got the values, but I keep getting an error, so I can't plot the values I get.
This is what I have done:
def calculate_RFI(classifications, descriptions, amplitude, freq_samples=64,
                  min_freq=1, max_freq=2, min_HA=-1.5, max_HA=1.5,
                  sampling_H=60*3*2):
    '''
    This function calculates the RFI dataframe

    Parameters
    ----------
    classifications: list
        The list of classifications
    freq_samples: int
        The number of frequency samples
    min_freq: float
        The minimum frequency in GHz
    max_freq: float
        The maximum frequency in GHz
    min_HA: float
        The minimum hour angle in hours
    max_HA: float
        The maximum hour angle in hours
    sampling_H: float
        The sampling interval of the hour angle in hours
    '''
    # create a multidimensional meshgrid of the frequency and hour angle
    meshgrid = np.mgrid[min_freq:max_freq:freq_samples*1j,
                        min_HA:max_HA:sampling_H*1j]
    freq = meshgrid[0]
    HA = meshgrid[1]
    # create frequency and hour angle point sizes
    freq_point_size = (max_freq - min_freq) / freq_samples
    HA_point_size = (max_HA - min_HA) / sampling_H
    # create an empty RFI array which takes the shape of the freq meshgrid
    RFI = np.zeros(freq.shape)
    point_size_label = np.zeros((freq.shape[0], freq.shape[1], 64))
    classification = classifications
    description = descriptions
    amp = amplitude
    # loop through the classifications
    for i in range(len(classification)):
        # loop through the amplitudes
        for j in range(len(amp)):
            # check if the classification is continuous
            if classification[i] == 'continuous':
                # get an array of the amplitude values with the same shape as the frequency array
                amp_array = amp[j] * np.ones(freq.shape)
                # get a 2D array of the start and end frequencies of the current classification
                freq_range = rfi_data[rfi_data['Classification'] == classification[i]][['start_freq', 'end_freq']].values * 10**-3  # convert to GHz otherwise it will be THz
                # get the RFI of the continuous classification by checking if the frequency is in the range of the current classification
                RFI += np.where(np.logical_and(freq >= freq_range[j][0], freq <= freq_range[j][1], RFI), amp_array, RFI)
    return RFI
To get a glimpse of the data, this is what it looks like:
| Frequency | Classification | start_freq | end_freq | amplitude |
| --------- | -------------- | ---------- | -------- | --------- |
| 1000 | intermittent | 1000 | 1000 | 0.299792 |
| 1030 | intermittent | 1030 | 1030 | 0.291061 |
| 1025-1150 | intermittent | 1025 | 1150 | 0.260689 |
| 1090 | intermittent | 1090.0 | 1090.0 | 0.275039 |
| 1166-1186 | continuous | 1166 | 1186 | 0.252776 |
What confuses me is that I get the data if I try to print them, but I keep getting an error when I call the function, which reads:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/var/folders/r8/hbbztfpn4ns1pgl0687hmwp40000gp/T/ipykernel_66490/1561735211.py in <module>
----> 1 calculate_RFI(classifications, descriptions, amplitude)
/var/folders/r8/hbbztfpn4ns1pgl0687hmwp40000gp/T/ipykernel_66490/1341244216.py in calculate_RFI(classifications, descriptions, amplitude, freq_samples, min_freq, max_freq, min_HA, max_HA, sampling_H)
59 freq_range = rfi_data[rfi_data['Classification'] == classification[i]][['start_freq','end_freq']].values *10**-3 # convert to GHz otherwise it will be THz
60 # get the RFI of the continuous classification by checking if the frequency is in the range of the current classification
---> 61 RFI += np.where(np.logical_and(freq >= freq_range[j][0], freq <= freq_range[j][1], RFI), amp_array, RFI)
62
63
IndexError: index 20 is out of bounds for axis 0 with size 20
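The likely cause: `j` indexes `amp`, which has one entry per row of `rfi_data`, while `freq_range` only holds the rows of the matched classification, so `freq_range[j]` eventually runs past its 20 rows. Note also that a third positional argument to `np.logical_and` is interpreted as the `out` buffer, not as another condition. Below is a minimal sketch of one way to keep the indices in range; the helper name `add_continuous_rfi` is illustrative, and `rfi_data` is assumed to be the DataFrame shown above.
import numpy as np

def add_continuous_rfi(RFI, freq, rfi_data):
    # select only the rows of the classification being processed,
    # and iterate over those rows directly instead of indexing with j
    cont = rfi_data[rfi_data['Classification'] == 'continuous']
    for _, row in cont.iterrows():
        lo = row['start_freq'] * 1e-3   # MHz -> GHz
        hi = row['end_freq'] * 1e-3
        # add the row's amplitude wherever freq falls inside [lo, hi];
        # np.logical_and takes exactly two array arguments here, so a
        # plain & comparison avoids the accidental out= argument
        RFI = RFI + np.where((freq >= lo) & (freq <= hi), row['amplitude'], 0.0)
    return RFI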

Python 3- Group by for binned rows

I have a df called exp with the following columns:
| category | p_value | score |
| -------- | ------- | ----- |
| tennis | 0.45 | 432 |
| soccer | 0.88 | 46 |
My goal is to bin all the scores per p-value and create a cumulative plot that is grouped by category.
I managed to create the bin using the following:
# find the p_value bin each row belongs to
# 0 is the underflow bin, len(edges) is the overflow bin
exp['bin'] = np.digitize(exp['p_value'], bins=bin_edges)
# get the total score per p_value bin
score_per_bin = exp.groupby('bin')['score'].sum()
And then managed to plot it:
# not every bin might be filled, so we will use the pandas index
binned = pd.DataFrame({
    'center': bin_center,
    'width': bin_width,
    'score': np.zeros(len(bin_center))
}, index=np.arange(1, len(bin_edges)))
binned['score'] = score_per_bin
plt.step(
    binned['center'],
    binned['score'].cumsum(),
    where='mid',
)
plt.xlabel('p-value')
plt.ylabel('score')
plt.show()
But I get a plot with 1 line while I need a line per category.
My question is - how do I keep the category and plot it?
Thank you
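A minimal sketch of one way to keep the category, assuming the `exp` DataFrame and the `bin_edges`/`bin_center` arrays from the question: group by both category and bin, pivot the categories into columns, and draw one cumulative line per column.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# bin as before, then sum scores per (category, bin)
exp['bin'] = np.digitize(exp['p_value'], bins=bin_edges)
per_cat = (exp.groupby(['category', 'bin'])['score']
              .sum()
              .unstack('category')                    # one column per category
              .reindex(np.arange(1, len(bin_edges)))  # drop under/overflow, keep empty bins
              .fillna(0))
for cat in per_cat.columns:
    plt.step(bin_center, per_cat[cat].cumsum(), where='mid', label=cat)
plt.xlabel('p-value')
plt.ylabel('score')
plt.legend()
plt.show()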

spark dataframe sum of column based on condition

I want to calculate each row's portion of the value within only two partitions (where type == red and where type != red).
ID | type | value
-----------------------------
1 | red | 10
2 | blue | 20
3 | yellow | 30
The result should be:
ID | type | value | portion
-----------------------------
1 | red | 10 | 1
2 | blue | 20 | 0.4
3 | yellow | 30 | 0.6
The normal window function in Spark only supports partitioning by a whole column, but I need "blue" and "yellow" to be recognized together as the "non-red" type.
Any idea?
First add a column is_red to easier differentiate between the two groups. Then you can groupBy this new column and get the sums for each of the two groups respectively.
To get the fraction (portion), simply divide each row's value by the correct sum, taking into account if the type is red or not. This part can be done using when and otherwise in Spark.
Below is the Scala code to do this. There is a sortBy because the order of results is not guaranteed when using groupBy. With the sort, sum1 below will contain the total sum for all non-red types while sum2 is the sum for red types.
val dfWithFlag = df.withColumn("is_red", $"type" === lit("red"))

val sum1 :: sum2 :: _ = dfWithFlag
  .groupBy($"is_red")
  .agg(sum($"value"))
  .collect()
  .map(row => (row.getAs[Boolean](0), row.getAs[Long](1)))
  .toList
  .sortBy(_._1)
  .map(_._2)

val df2 = dfWithFlag.withColumn("portion",
  when($"is_red", $"value" / lit(sum2)).otherwise($"value" / lit(sum1)))
The extra is_red column can be removed using drop.
Inspired by Shaido's answer, I used an extra column is_red and the Spark window function, but I'm not sure which of the two performs better.
df.withColumn("is_red", when(col("type").equalTo("Red"), "Red")
.otherwise("not Red")
.withColumn("portion", col("value")/sum("value)
.over(Window.partitionBy(col"is_Red")))
.drop(is_Red)
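For PySpark users, here is a minimal sketch of the same window-function approach; the SparkSession setup and sample rows are illustrative.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "red", 10), (2, "blue", 20), (3, "yellow", 30)],
    ["ID", "type", "value"])

# flag rows as red / non-red, divide each value by its partition's sum
result = (df.withColumn("is_red", F.col("type") == F.lit("red"))
            .withColumn("portion",
                        F.col("value") / F.sum("value").over(Window.partitionBy("is_red")))
            .drop("is_red"))
result.show()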

Scaling values with a known upper limit

I have a column of values in Excel that I need to modify by a scale factor. Original column example:
| Value |
|:-----:|
| 75 |
| 25 |
| 25 |
| 50 |
| 0 |
| 0 |
| 100 |
Scale factor: 1.5. The scaled column becomes:
| Value |
|:-----:|
| 112.5 |
| 37.5 |
| 37.5 |
| 75 |
| 0 |
| 0 |
| 150 |
The problem is I need them to stay within a range of 0-100. My first thought was to take them as percentages of 100, but I quickly realized that this would be going in circles.
Is there some mathematical method or Excel formula I could use so that I actually make meaningful changes to the values, such that when these numbers are modified, 150 becomes 100 but 37.5 might not be 25, and I'm not just canceling out my scale factor?
Assuming your data begin in cell A1, you can use this formula:
=MIN(100,A1*1.5)
Copy downward as needed.
You could do something like:
ScaledValue = (v - MIN(AllValues)) / (MAX(AllValues) - MIN(AllValues)) * (SCALE_MAX - SCALE_MIN) + SCALE_MIN
Say your raw data (a.k.a. AllValues) ranges from a MIN of 15 to a MAX of 83, and you want to scale it to a range of 0 to 100. To do that you would set SCALE_MIN = 0 and SCALE_MAX = 100. In the above equation, v is any single value in the data.
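As a concrete instance of that formula, assuming the values sit in A1:A7 and a target range of 0 to 100 (SCALE_MIN = 0, SCALE_MAX = 100), the formula for the value in A1 could be written as:
=(A1-MIN($A$1:$A$7))/(MAX($A$1:$A$7)-MIN($A$1:$A$7))*100
Copy downward as with the earlier formula.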
Hope that helps
Another option is:
ScaledValue = PERCENTRANK.INC(AllValues, v)
In contrast to my earlier suggestion (which is linear and preserves the relative spacing of the data points), this preserves the order of the data but not the spacing. Using PERCENTRANK.INC will have the effect that sparse data get compressed closer together, and bunched data get spread out.
You could also do a weighted combination of the two methods --- give the linear method a weight of say 0.5 so that relative spacing is partially preserved.
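A sketch of that weighted combination with equal weights of 0.5, again assuming data in A1:A7 and a 0-100 target range (PERCENTRANK.INC returns a fraction, so it is scaled by 100):
=0.5*(A1-MIN($A$1:$A$7))/(MAX($A$1:$A$7)-MIN($A$1:$A$7))*100+0.5*PERCENTRANK.INC($A$1:$A$7,A1)*100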

Transfer coordinates from one triangle to another triangle

I have two triangles, which can be of any size. The problem is: how can I transfer coordinates from one triangle to another? I know the positions of both triangles in the coordinate system, and yes, they are both in the same system.
Basically, I have a point in triangle1 and I need to transfer it to triangle2.
Reading some posts, I found out that this could be calculated using an affine transformation matrix, but I didn't understand how to solve it with one.
Thank you for any help.
Suppose you have an unknown affine transformation matrix

    | a c e |
M = | b d f |
    | 0 0 1 |
The first triangle vertices are (xa1, ya1), (xa2, ya2), (xa3, ya3), and the second triangle vertices have coordinates (xb1, yb1), (xb2, yb2), (xb3, yb3).
Then affine transformation M that transforms the first triangle vertices to the second one vertices is:
M * A = B
where
    | xa1 xa2 xa3 |
A = | ya1 ya2 ya3 |
    |  1   1   1  |

    | xb1 xb2 xb3 |
B = | yb1 yb2 yb3 |
    |  1   1   1  |
To find the unknown M, we can multiply both sides of the expression by the inverse of the A matrix:
M * A * Inv(A) = B * Inv(A)
M = B * Inv(A)
Inversion of A is rather simple (calculated by Maple, may contain errors due to my typos):
         | (ya2-ya3)    -(xa2-xa3)    (xa2*ya3-xa3*ya2) |
Inv(A) = | -(-ya3+ya1)  (-xa3+xa1)   -(xa1*ya3-ya1*xa3) | * 1/Det
         | (-ya2+ya1)   -(-xa2+xa1)   (xa1*ya2-ya1*xa2) |
where the determinant value is
Det = xa2*ya3-xa3*ya2-ya1*xa2+ya1*xa3+xa1*ya2-xa1*ya3
So you can find the affine matrix for the needed transformation and apply it to coordinates (multiply M by the (x, y, 1) column vector).
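A minimal numpy sketch of the recipe above, with illustrative triangle coordinates: build A and B from the vertices, compute M = B * Inv(A), then map a point from triangle 1 into triangle 2.
import numpy as np

# vertices of triangle 1 as columns, in homogeneous form
A = np.array([[0.0, 4.0, 0.0],
              [0.0, 0.0, 3.0],
              [1.0, 1.0, 1.0]])
# vertices of triangle 2, in the same order
B = np.array([[1.0, 7.0, 2.0],
              [1.0, 2.0, 6.0],
              [1.0, 1.0, 1.0]])
# affine matrix that maps triangle 1 onto triangle 2
M = B @ np.linalg.inv(A)

# transfer a point from triangle 1 into triangle 2
x, y = 1.0, 1.0
xt, yt, _ = M @ np.array([x, y, 1.0])
print(xt, yt)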
