Why are the outputs of ax.hist and physt.hist not identical? - python-3.x

I have an array that I would like to place into 7 bins, and then calculate the mean and standard deviation (the standard error) for each bin so that I can plot both the histogram and its error bars. While the numpy histogram readily outputs the mean values of the bins, it is not meant to produce error bars (unless I am wrong). This is why I want to use the physt Python package to directly extract the mean and error for each bin. However, I just noticed that the two methods do not agree with each other in the first place; they don't even produce the same mean values (heights). Now I am confused, and I would truly appreciate your help.
import numpy as np
from physt import h1
import matplotlib.pyplot as plt
x_arr = np.array([
0, 32, 28, 15, 19, 22, 18, 16, 13, 35, 21, 32, 23, 11, 17, 3, 17, 3, 21, 43, 32, 15, 16, 18,
28, 9, 33, 16, 20, 19, 35, 37, 32, 26, 30, 30, 28, 30, 22, 25, 21, 26, 41, 41, 12, 3, 5, 6, 5,
17, 16, 16, 16, 7, 2, 15, 16, 15, 15, 15, 7, 5
])
bins = np.array([0, 2, 3, 5, 9, 17, 33, 65])
ax = plt.axes()
heights, bins, patches = ax.hist(x_arr, bins, density=True)
print('numpy: \n', heights)
hist = h1(x_arr, bins, density=True)
print('physt: \n', hist.frequencies / sum(hist.frequencies))
And here are the outputs which are interestingly different:
numpy:
[0.00806452 0.01612903 0.02419355 0.02419355 0.03427419 0.02721774
0.00352823]
physt:
[0.01612903 0.01612903 0.0483871 0.09677419 0.27419355 0.43548387
0.11290323]
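The two printed arrays appear to differ only by the bin widths. With density=True, ax.hist (which relies on np.histogram) divides each count by the total count times the bin width, whereas hist.frequencies / sum(hist.frequencies) is the plain fraction of samples per bin. A minimal NumPy-only sketch using the data from the question illustrates the relation (physt itself is not used here):

```python
import numpy as np

x_arr = np.array([
    0, 32, 28, 15, 19, 22, 18, 16, 13, 35, 21, 32, 23, 11, 17, 3, 17, 3, 21, 43, 32, 15, 16, 18,
    28, 9, 33, 16, 20, 19, 35, 37, 32, 26, 30, 30, 28, 30, 22, 25, 21, 26, 41, 41, 12, 3, 5, 6, 5,
    17, 16, 16, 16, 7, 2, 15, 16, 15, 15, 15, 7, 5
])
bins = np.array([0, 2, 3, 5, 9, 17, 33, 65])

# Raw counts per bin and the (unequal) bin widths
counts, edges = np.histogram(x_arr, bins)
widths = np.diff(edges)

# Fraction of samples per bin: the quantity printed under "physt"
fractions = counts / counts.sum()

# Probability density: the quantity printed under "numpy"
density = counts / (counts.sum() * widths)

# np.histogram(..., density=True) returns the same density
density_np, _ = np.histogram(x_arr, bins, density=True)
print(np.allclose(density, density_np))          # True
print(np.allclose(fractions, density * widths))  # True
```

Multiplying the density by the bin widths recovers the fractions, so the two results describe the same histogram in different normalizations.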

Related

Randomly sample from a high-dimensional array along a specific dimension

I have a 3-dimensional array x of shape (2000, 60, 5). If we think of it as a video, the 2000 represents 2000 frames. I would like to randomly sample it along the first dimension, i.e., get a set of frame samples. For instance, how can I get an array of shape (500, 60, 5) that is randomly sampled from x along the first dimension?
You can pass x as the first argument of the choice method. If you don't want repeated frames in your sample, use replace=False.
For example,
In [10]: x = np.arange(72).reshape(9, 2, 4) # Small array for the demo.
In [11]: x
Out[11]:
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]],

       [[16, 17, 18, 19],
        [20, 21, 22, 23]],

       [[24, 25, 26, 27],
        [28, 29, 30, 31]],

       [[32, 33, 34, 35],
        [36, 37, 38, 39]],

       [[40, 41, 42, 43],
        [44, 45, 46, 47]],

       [[48, 49, 50, 51],
        [52, 53, 54, 55]],

       [[56, 57, 58, 59],
        [60, 61, 62, 63]],

       [[64, 65, 66, 67],
        [68, 69, 70, 71]]])
Sample "frames" from x with the choice method of a NumPy random generator instance.
In [12]: rng = np.random.default_rng()
In [13]: rng.choice(x, size=3)
Out[13]:
array([[[40, 41, 42, 43],
        [44, 45, 46, 47]],

       [[40, 41, 42, 43],
        [44, 45, 46, 47]],

       [[16, 17, 18, 19],
        [20, 21, 22, 23]]])
In [14]: rng.choice(x, size=3, replace=False)
Out[14]:
array([[[ 8,  9, 10, 11],
        [12, 13, 14, 15]],

       [[32, 33, 34, 35],
        [36, 37, 38, 39]],

       [[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]]])
Note that the frames will be in random order; if you want to preserve the order, you could use choice to generate an array of indices, then use the sorted indices to pull the frames out of x.
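The order-preserving variant described above could look roughly like this. It is a sketch with a stand-in array of the shape from the question; the seed and the names idx and sample are only illustrative:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only to make the demo repeatable

# Stand-in for the "video": 2000 frames of shape (60, 5)
x = np.arange(2000 * 60 * 5).reshape(2000, 60, 5)

# Draw 500 distinct frame indices, then sort them to keep the original frame order
idx = np.sort(rng.choice(x.shape[0], size=500, replace=False))

# Fancy indexing along the first axis pulls out the selected frames
sample = x[idx]
print(sample.shape)  # (500, 60, 5)
```

Sorting the indices costs little and keeps the sampled frames in chronological order, which matters if the sample is later treated as a shorter video.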

Plot Network statistics using matplotlib

I am trying to use matplotlib to plot network statistics. I want it to look like the line graphs created with Excel.
Excel:
Matplotlib
My very simple code:
import matplotlib.pyplot as plt
import numpy as np
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59])
y = np.array(['0.00', '0.00', '0.00', '0.12', '0.00', '0.00', '0.00', '14.75', '108.56', '78.91', '508.15', '79.66', '147.84', '199.87', '14.02', '10.05', '3411.12', '19735.23', '19929.51', '18428.82', '21727.14', '19716.41', '20295.20', '20283.08', '20088.10', '20155.81', '20108.67', '19954.45', '20316.46', '20045.77', '20233.71', '19981.40', '20230.02', '20099.69', '20000.23', '20234.06', '19763.92', '20458.40', '19626.22', '20542.25', '19821.72', '20443.78', '20109.41', '19918.96', '20223.37', '19933.64', '20023.73', '19655.67', '19890.94', '20590.04', '20158.37', '20001.59', '20011.48', '19785.95', '20550.63', '19687.02', '20025.00', '20478.25', '20124.66', '20148.08'])
plt.plot(x, y)
plt.xticks(x)
plt.show()
Your y array holds strings, so matplotlib treats its values as categories rather than numbers. Convert it with y = y.astype(float) before plotting and you get the expected line graph.
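As a quick check of the conversion, here is a small sketch using only a few of the values from the question (the name y_num is illustrative):

```python
import numpy as np

# A few of the string values from the question
y = np.array(['0.00', '14.75', '108.56', '19735.23'])
print(y.dtype.kind)  # U  (a unicode string array)

# Convert to floats so that the values are plotted on a numeric axis
y_num = y.astype(float)
print(y_num.dtype)   # float64
```

Passing y_num instead of y to plt.plot should then give a continuous numeric y-axis.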

Plotting time series in Matplotlib with month names (ex. January) and showing years beneath

I am currently making a temporal scatter plot using the following data (you can use these data to reproduce my plot). The data plotted on the x-axis are times, specifically datetime.datetime objects (tp_pass), while the data plotted on the y-axis are angles between -180 and 180 (azip_pass). Both are numpy arrays.
tp_pass=np.array([datetime.datetime(2019, 10, 29, 1, 4, 43),
datetime.datetime(2019, 10, 31, 1, 11, 19),
datetime.datetime(2019, 11, 20, 8, 26, 7),
datetime.datetime(2019, 11, 20, 23, 50, 43),
datetime.datetime(2019, 12, 10, 17, 5, 2),
datetime.datetime(2020, 1, 2, 18, 23, 53),
datetime.datetime(2020, 2, 13, 10, 33, 44),
datetime.datetime(2020, 2, 20, 18, 57, 36),
datetime.datetime(2020, 3, 25, 2, 49, 20),
datetime.datetime(2020, 4, 10, 16, 44, 56),
datetime.datetime(2020, 4, 18, 8, 25, 37),
datetime.datetime(2020, 4, 19, 20, 39, 5),
datetime.datetime(2020, 5, 3, 11, 54, 24),
datetime.datetime(2020, 5, 4, 13, 7, 48),
datetime.datetime(2020, 5, 30, 18, 13, 47),
datetime.datetime(2020, 6, 13, 15, 51, 24),
datetime.datetime(2020, 6, 24, 19, 47, 44),
datetime.datetime(2020, 7, 30, 0, 35, 56),
datetime.datetime(2020, 8, 1, 17, 9, 1),
datetime.datetime(2020, 8, 3, 8, 31, 10),
datetime.datetime(2020, 8, 18, 0, 3, 48),
datetime.datetime(2020, 9, 15, 3, 41, 28),
datetime.datetime(2020, 9, 20, 22, 13, 15),
datetime.datetime(2020, 10, 3, 9, 31, 31),
datetime.datetime(2020, 11, 6, 8, 56, 38),
datetime.datetime(2020, 11, 15, 22, 37, 43),
datetime.datetime(2020, 12, 10, 13, 19, 58),
datetime.datetime(2020, 12, 20, 17, 23, 22),
datetime.datetime(2020, 12, 24, 23, 43, 41),
datetime.datetime(2021, 1, 12, 2, 39, 43),
datetime.datetime(2021, 2, 13, 14, 7, 50),
datetime.datetime(2021, 3, 2, 21, 22, 46)], dtype=object)
azip_pass=np.array([168.3472527 , 160.09844756, 175.44976695, 159.46139347,
168.4780719 , 165.17699028, 158.22654417, 151.02735996,
159.39235045, 164.8792118 , 168.84217025, 166.09269395,
-179.97929963, 163.3389004 , 167.24285926, 167.08062597,
163.71540408, 171.13687447, 163.61945117, 172.68473083,
159.89871931, 166.72228462, 162.2774924 , 166.13812415,
14.7128006 , 12.43499853, 11.86328998, 10.56097159,
16.16589956, 12.81530251, 10.0220719 , 4.21173499])
Using the following Python script, I generated the plot.
import matplotlib.pyplot as plt
import numpy as np
import datetime
from matplotlib import dates
from matplotlib import rc
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
rc('axes', edgecolor='k', linewidth="5.0")
fig, ax=plt.subplots(1, 1, figsize=(30, 10))
ax.xaxis.set_major_locator(dates.YearLocator())
ax.set_ylim(-185, 185)
ax.scatter(tp_pass, azip_pass, color="b", s=200, alpha=1.0, ec="k")
plt.xticks(fontsize=35)
plt.yticks([-180, -120, -60, 0, 60, 120, 180], ["${}^\circ$".format(x) for x in [-180, -120, -60, 0, 60, 120, 180]], fontsize=35)
plt.tight_layout()
plt.show()
The x-axis of the plot automatically marks the years since I used matplotlib.dates.YearLocator(). However, I am not really satisfied with it and also want to mark the months between the years. I want the months to be shown by their names, not numbers (e.g. Jan, Feb, Mar). The x-axis of the figure below shows what I want to implement. Is this possible using matplotlib?
Added (2021-05-18)
Using matplotlib.dates.MonthLocator(), I was able to make months show. However, the year number disappeared. Is there a way to show both year and months together (ex. year beneath month) using matplotlib?
fig, ax=plt.subplots(1, 1, figsize=(30, 10))
ax.xaxis.set_major_locator(dates.YearLocator()) # This line does not work
ax.xaxis.set_major_locator(dates.MonthLocator(bymonthday=15))
ax.xaxis.set_major_formatter(dates.DateFormatter('%b'))
ax.set_ylim(-185, 185)
ax.scatter(tp_pass, azip_pass, color="b", s=200, alpha=1.0, ec="k")
plt.xticks(fontsize=35)
plt.yticks([-180, -120, -60, 0, 60, 120, 180], ["${}^\circ$".format(x) for x in [-180, -120, -60, 0, 60, 120, 180]], fontsize=35)
plt.tight_layout()
plt.show()
Added (2021-05-19)
I found the answer by Patrick FitzGerald to the question "How to change the datetime tick label frequency for matplotlib plots?" very helpful. That answer does not require a secondary x-axis and does what I wanted to do.
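Labelling of this kind can also be sketched without a secondary axis, for example by putting the year on a second line of the major tick label. The following is only an illustration with made-up sample dates (times and angles are hypothetical), and it is not necessarily the approach of the linked answer:

```python
import datetime

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib import dates as mdates

# Hypothetical sample data spanning two year boundaries
times = [datetime.datetime(2019, 10, 29), datetime.datetime(2020, 2, 13),
         datetime.datetime(2020, 8, 1), datetime.datetime(2021, 3, 2)]
angles = [168.3, 158.2, 163.6, 4.2]

fig, ax = plt.subplots(figsize=(10, 4))
ax.scatter(times, angles)

# Major ticks on 1 January: month name with the year on the line below
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b\n%Y'))

# Minor ticks on the first day of every month: month name only
ax.xaxis.set_minor_locator(mdates.MonthLocator())
ax.xaxis.set_minor_formatter(mdates.DateFormatter('%b'))

fig.canvas.draw()
```

By default matplotlib drops minor ticks that coincide with major ones, so each January tick should carry only the two-line major label.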
You can create a second x-axis, use that to show only the year while using your original x-axis to show the month as a word. Here's this approach using your example. It will look like this.
import matplotlib.pyplot as plt
import numpy as np
import datetime
from matplotlib import dates as mdates
# Using Data from OP: tp_pass and azip_pass
# Creating your plot
fig, ax=plt.subplots(1, 1, figsize=(30, 10))
ax.set_ylim(-185, 185)
ax.scatter(tp_pass, azip_pass, color="b", s=200, alpha=1.0, ec="k")
# Minor ticks every month.
fmt_month = mdates.MonthLocator()
# Major ticks every year.
fmt_year = mdates.YearLocator()
ax.xaxis.set_minor_locator(fmt_month)
# '%b' to get the names of the month
ax.xaxis.set_minor_formatter(mdates.DateFormatter('%b'))
ax.xaxis.set_major_locator(fmt_year)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
# fontsize for month labels
ax.tick_params(labelsize=20, which='both')
# create a second x-axis beneath the first x-axis to show the year in YYYY format
sec_xaxis = ax.secondary_xaxis(-0.1)
# Use a separate locator instance: a locator should not be shared between axes
sec_xaxis.xaxis.set_major_locator(mdates.YearLocator())
sec_xaxis.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
# Hide the second x-axis spines and ticks
sec_xaxis.spines['bottom'].set_visible(False)
sec_xaxis.tick_params(length=0, labelsize=35)
plt.yticks([-180, -120, -60, 0, 60, 120, 180], ["${}^\circ$".format(x) for x in [-180, -120, -60, 0, 60, 120, 180]], fontsize=35)
plt.tight_layout()
plt.show()
I'd suggest using ConciseDateFormatter https://matplotlib.org/stable/gallery/ticks_and_spines/date_concise_formatter.html
and using the auto locator for more ticks if you really want every month located:
fig, ax=plt.subplots(1, 1, figsize=(8, 4), constrained_layout=True)
plt.rcParams['date.converter'] = 'concise'
ax.xaxis.set_major_locator(mdates.AutoDateLocator(minticks=12, maxticks=20))
ax.set_ylim(-185, 185)
ax.scatter(tp_pass, azip_pass, color="b", s=200, alpha=1.0, ec="k")
# plt.xticks(fontsize=35)
plt.yticks([-180, -120, -60, 0, 60, 120, 180], ["${}^\circ$".format(x) for x in [-180, -120, -60, 0, 60, 120, 180]])
plt.show()
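The formatter can also be attached explicitly instead of through rcParams. A minimal self-contained sketch, assuming made-up sample dates in place of tp_pass and azip_pass (times and angles are hypothetical):

```python
import datetime

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib import dates as mdates

# Hypothetical sample data standing in for tp_pass and azip_pass
times = [datetime.datetime(2019, 10, 29), datetime.datetime(2020, 2, 13),
         datetime.datetime(2020, 8, 1), datetime.datetime(2021, 3, 2)]
angles = [168.3, 158.2, 163.6, 4.2]

fig, ax = plt.subplots(figsize=(8, 4))
ax.scatter(times, angles)

# Ask the automatic locator for roughly monthly ticks
locator = mdates.AutoDateLocator(minticks=12, maxticks=20)
ax.xaxis.set_major_locator(locator)

# ConciseDateFormatter shortens repeated date parts in the tick labels
ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))

fig.canvas.draw()
```

Passing the same locator to both calls lets the formatter pick its label format from the tick spacing that the locator actually chose.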

Image Output not properly displayed in Seaborn Bar graph

The code snippet below displays the plot perfectly in the PyCharm window, but the same plot does not appear properly when it is saved to an image file.
How can I save the image properly?
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
sns.set_context('paper')
report_id = ['Report_1', 'Report_2', 'Report_3', 'Report_4', 'Report_5', 'Report_6', 'Report_7', 'Report_8', 'Report_9',
'Report_10', 'Report_11', 'Report_12', 'Report_13', 'Report_14', 'Report_15', 'Report_16', 'Report_17',
'Report_18', 'Report_19', 'Report_20', 'Report_21', 'Report_22', 'Report_23', 'Report_24', 'Report_25',
'Report_26', 'Report_27', 'Report_28', 'Report_29', 'Report_30', 'Report_31', 'Report_32', 'Report_33',
'Report_34', 'Report_35', 'Report_36', 'Report_37', 'Report_38', 'Report_39', 'Report_40', 'Report_41',
'Report_42', 'Report_43', 'Report_44', 'Report_45', 'Report_46', 'Report_47', 'Report_48', 'Report_49',
'Report_50', 'Report_51', 'Report_52', 'Report_53', 'Report_54', 'Report_55', 'Report_56', 'Report_57',
'Report_58', 'Report_59', 'Report_60']
report_value = [1300, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54,
55, 56, 57, 58, 59, 60]
df = pd.DataFrame({'report_id': report_id, 'report_value': report_value})
sns.set(rc={'figure.figsize': (15, 100)})
ax = sns.barplot(y="report_id", x="report_value", data=df, palette="GnBu_d")
ax.tick_params(labelsize=3)
initialx = 0
for p in ax.patches:
    ax.text(p.get_width(), initialx + p.get_height() / 10, "{:1.0f}".format(p.get_width()), fontsize=5)
    initialx += 1
plt.savefig(r"C:\Program\Anaconda3\venvs\PlotGraph\Bar_Graph.png")
plt.show()
Pycharm Image:
Saved Image of Same plot:

Load data from file and normalize

How can I normalize data loaded from a file? Here is what I have. The data looks like this:
65535, 3670, 65535, 3885, -0.73, 1
65535, 3962, 65535, 3556, -0.72, 1
Last value in each line is a target. I want to have the same structure of the data but with normalized values.
import numpy as np
from sklearn import preprocessing
dataset = np.loadtxt('infrared_data.txt', delimiter=',')
# select first 5 columns as the data
X = dataset[:, 0:5]
# is that correct? Should I normalize along 0 axis?
normalized_X = preprocessing.normalize(X, axis=0)
y = dataset[:, 5]
Now the question is how to correctly pack normalized_X and y back together, so that the result has the structure:
dataset = [[normalized_X[0], y[0]],[normalized_X[1], y[1]],...]
It sounds like you're asking for np.column_stack. For example, let's set up some dummy data:
import numpy as np
x = np.arange(25).reshape(5, 5)
y = np.arange(5) + 1000
Which gives us:
X:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]])
Y:
array([1000, 1001, 1002, 1003, 1004])
And we want:
new = np.column_stack([x, y])
Which gives us:
New:
array([[ 0, 1, 2, 3, 4, 1000],
[ 5, 6, 7, 8, 9, 1001],
[ 10, 11, 12, 13, 14, 1002],
[ 15, 16, 17, 18, 19, 1003],
[ 20, 21, 22, 23, 24, 1004]])
If you'd prefer less typing, you can also use:
In [4]: np.c_[x, y]
Out[4]:
array([[ 0, 1, 2, 3, 4, 1000],
[ 5, 6, 7, 8, 9, 1001],
[ 10, 11, 12, 13, 14, 1002],
[ 15, 16, 17, 18, 19, 1003],
[ 20, 21, 22, 23, 24, 1004]])
However, I'd discourage using np.c_ for anything other than interactive use, simply due to readability concerns.
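Applied to the data from the question, the whole round trip might look like the sketch below. It reads the two sample rows from an in-memory string instead of infrared_data.txt, and it replaces preprocessing.normalize(X, axis=0) with a plain NumPy division by the column-wise L2 norms, which is assumed here to be the equivalent operation:

```python
import io

import numpy as np

# The two sample rows from the question, read from memory instead of a file
raw = io.StringIO(
    "65535, 3670, 65535, 3885, -0.73, 1\n"
    "65535, 3962, 65535, 3556, -0.72, 1\n"
)
dataset = np.loadtxt(raw, delimiter=',')

X = dataset[:, 0:5]  # first 5 columns are the features
y = dataset[:, 5]    # last column is the target

# Scale each column to unit L2 norm (assumed equivalent of normalize(X, axis=0))
normalized_X = X / np.linalg.norm(X, axis=0)

# Put the untouched target back as the last column
normalized_dataset = np.column_stack([normalized_X, y])
print(normalized_dataset.shape)  # (2, 6)
```

The result keeps the original layout: five normalized feature columns followed by the target column.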
