I have two lists and I want to create a pandas DataFrame with three columns: the two lists themselves, plus a third column generated by zipping them. I tried the following:
import pandas as pd
import numpy as np
S_x = [80, 90, 100, 200, 300, 600, 800, 900, 1000, 1200]
S_y = [800, 1000, 1200, 450, 80, 100, 60, 300, 700, 900]
S_z = list(zip(S_x, S_y))
frame4 = pd.DataFrame(np.column_stack([S_x, S_y, S_z]), columns=["Recovered Data", "Percentage Error", "Zipped"])
In the third column I want the elements to be tuples, exactly as they appear in the list S_z, while the first two columns should stay the way they are. When I run my code I get the error
ValueError: Shape of passed values is (4, 10), indices imply (3, 10)
I don't know what I am doing wrong. I am using Python 3.x.
When you use np.column_stack, it automatically unpacks your S_z tuples into two extra columns, so np.column_stack([S_x, S_y, S_z]) has shape (10, 4) instead of (10, 3), which doesn't match the three column names. Do this instead:
frame4 = pd.DataFrame({"Recovered Data": S_x, "Percentage Error": S_y,"Zipped": S_z})
IIUC
frame=pd.DataFrame(zip(S_x, S_y, S_z), columns=["Recovered Data", "Percentage Error","Zipped"])
Recovered Data Percentage Error Zipped
0 80 800 (80, 800)
1 90 1000 (90, 1000)
2 100 1200 (100, 1200)
3 200 450 (200, 450)
4 300 80 (300, 80)
5 600 100 (600, 100)
6 800 60 (800, 60)
7 900 300 (900, 300)
8 1000 700 (1000, 700)
9 1200 900 (1200, 900)
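Another option (just a sketch with the same lists) is to build the frame from the two flat lists first and attach the tuple column afterwards:
import pandas as pd

S_x = [80, 90, 100, 200, 300, 600, 800, 900, 1000, 1200]
S_y = [800, 1000, 1200, 450, 80, 100, 60, 300, 700, 900]

frame4 = pd.DataFrame({"Recovered Data": S_x, "Percentage Error": S_y})
frame4["Zipped"] = list(zip(S_x, S_y))  # the tuples are stored as a single object column
print(frame4.head())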
I am using matplotlib.pyplot to make a histogram. Due to the distribution of the data, I want to set up the bins manually. The details are as follows:
Any value = 0 in one bin;
Any value > 60 in the last bin;
Any value > 0 and <= 60 goes into the bins in between, with a bin size of 5.
Could you please give me some help? Thank you.
I'm not sure what you mean by "the bin size is 5". You can either plot a histogram by specifying the bins with a sequence:
import matplotlib.pyplot as plt
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -5] # your data here
plt.hist(data, bins=[0, 0.5, 60, max(data)])
plt.show()
But the drawn bar widths will match the corresponding intervals, meaning, in this example, that the "0" bin will be barely visible:
(Note that 60 ends up in the last bin when specifying bins as a sequence; changing the middle edge to a value slightly above 60, e.g. [0, 0.5, 60.5, max(data)], keeps 60 in the middle bin.)
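A quick way to see that boundary behaviour without plotting (a small check with numpy.histogram on the same data list; the counts shown are for this particular data):
import numpy as np

data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -5]

# with an edge exactly at 60, the value 60 lands in the last (closed) bin
print(np.histogram(data, bins=[0, 0.5, 60, max(data)])[0])    # [2 7 3]
# moving the edge slightly above 60 keeps it in the middle bin
print(np.histogram(data, bins=[0, 0.5, 60.5, max(data)])[0])  # [2 8 2]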
What you (probably) need is first to categorize your data and then plot a bar chart of the categories:
import matplotlib.pyplot as plt
import pandas as pd
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -5] # your data here
df = pd.DataFrame()
df['data'] = data
def find_cat(x):
    # values < 0 fall through and return None, so they are dropped by the groupby below
    if x == 0:
        return "0"
    elif x > 60:
        return "> 60"
    elif x > 0:
        return "> 0 and <= 60"

df['category'] = df['data'].apply(find_cat)
df.groupby('category', as_index=False).count().plot.bar(x='category', y='data', rot=0, width=0.8)
plt.show()
Output:
Building off Tranbi's answer, you could specify the bin edges as detailed in the link they shared:
import matplotlib.pyplot as plt
import pandas as pd
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -6] # your data here
df = pd.DataFrame()
df['data'] = data
bin_edges = [-5, 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
bin_edges_offset = [x+0.000001 for x in bin_edges]
plt.figure()
plt.hist(df['data'], bins=bin_edges_offset)
plt.show()
Output:
IIUC you want a classic histogram for values between 0 (not included) and 60 (included), plus two extra bins for 0 and >60 on the side.
In that case I would recommend plotting the 3 regions separately:
import matplotlib.pyplot as plt
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -3] # your data here
fig, axes = plt.subplots(1,3, sharey=True, width_ratios=[1, 12, 1])
fig.subplots_adjust(wspace=0)
# counting 0 values and drawing a bar between -5 and 0
axes[0].bar(-5, data.count(0), width=5, align='edge')
axes[0].xaxis.set_visible(False)
axes[0].spines['right'].set_visible(False)
axes[0].set_xlim((-5, 0))
# histogram between (0, 60]
axes[1].hist(data, bins=12, range=(0.0001, 60.0001))
axes[1].yaxis.set_visible(False)
axes[1].spines['left'].set_visible(False)
axes[1].spines['right'].set_visible(False)
axes[1].set_xlim((0, 60))
# counting values > 60 and drawing a bar between 60 and 65
axes[2].bar(60, len([x for x in data if x > 60]), width=5, align='edge')
axes[2].xaxis.set_visible(False)
axes[2].yaxis.set_visible(False)
axes[2].spines['left'].set_visible(False)
axes[2].set_xlim((60, 65))
plt.show()
Output:
Edit: If you want to plot a probability density, I would edit the data and simply use hist:
import matplotlib.pyplot as plt
data = [0, 0, 1, 2, 3, 4, 5, 6, 35, 60, 61, 82, -3] # your data here
data2 = []
for el in data:
    if el < 0:
        pass                  # drop negative values
    elif el > 60:
        data2.append(61)      # collapse everything above 60 into one slot
    else:
        data2.append(el)
plt.hist(data2, bins=14, density=True, range=(-4.99,65.01))
plt.show()
Using an example from another post, I'm adding a color bar to a scatter plot. The idea is that both the dot hue and the colorbar hue should conform to the maximum and minimum possible values, so that the colorbar can reflect the range of values in the hue:
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sb

x= [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200]
y= [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200]
z= [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 255]
df = pd.DataFrame(list(zip(x, y, z)), columns =['x', 'y', 'z'])
colormap=matplotlib.cm.viridis
#A continuous color bar needs to be added independently
norm = plt.Normalize(df.z.min(), df.z.max())
sm = plt.cm.ScalarMappable(cmap=colormap, norm=norm)
sm.set_array([])
fig = plt.figure(figsize = (10,8), dpi=300)
ax = fig.add_subplot(1,1,1)
sb.scatterplot(x="x", y="y",
               hue="z",
               hue_norm=(0, 255),
               data=df,
               palette=colormap,
               ax=ax)
ax.legend(bbox_to_anchor=(0, 1), loc=2, borderaxespad=0., title='hue from sb.scatterplot')
ax.figure.colorbar(sm).set_label('hue from sm')
plt.xlim(0,255)
plt.ylim(0,255)
plt.show()
Note how the hue from the scatterplot, even with hue_norm, ranges up to 300. In turn, the hue from the colorbar ranges from 0 to 255. From experimenting with values in hue_norm, it seems that matplotlib always rounds it off so that you have a "good" (even?) number of intervals.
My questions are:
Which one is showing an incorrect range: the scatterplot, the scatterplot legend, or the colorbar? And how can it be corrected?
How could you retrieve min and max hue from the scatterplot (in this case 0 and 300, respectively), in order to set them as maximum and minimum of the colorbar?
Do you really need to use seaborn's scatterplot()? Using a numerical hue is always quite messy.
The following code is much simpler and yields an unambiguous output:
fig, ax = plt.subplots()
g = ax.scatter(df['x'],df['y'], c=df['z'], cmap=colormap)
fig.colorbar(g)
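If you do want to keep seaborn, one option (a sketch, assuming a seaborn version where hue_norm accepts a Normalize object) is to share a single norm between the dots and the colorbar, so both cover exactly the data range:
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sb

x = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200]
y = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200]
z = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 255]
df = pd.DataFrame(list(zip(x, y, z)), columns=['x', 'y', 'z'])

colormap = matplotlib.cm.viridis
norm = plt.Normalize(df.z.min(), df.z.max())        # one norm shared by dots and colorbar
sm = plt.cm.ScalarMappable(cmap=colormap, norm=norm)
sm.set_array([])

fig, ax = plt.subplots()
sb.scatterplot(x="x", y="y", hue="z", hue_norm=norm, palette=colormap,
               data=df, ax=ax, legend=False)        # colorbar replaces the legend
fig.colorbar(sm, ax=ax).set_label('z')
plt.show()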
I am trying to create a horizontal bar chart with matplotlib. My data points are the following two arrays
distance = [100, 200, 300, 400, 500, 3000]
value = [10, 15, 50, 74, 95, 98]
My code to generate the horizontal bar chart is as follows:
plt.barh(distance, value, height=75)
plt.savefig(fig_name, dpi=300)
plt.close()
The problem is that my image comes out like this:
https://imgur.com/a/Q8dvHKR
Is there a way to ensure all bars are the same width and to skip the empty space between 500 and 3000?
You can do this by making sure Matplotlib treats your labels as labels, not as numbers. You can do that by converting them to strings:
import matplotlib.pyplot as plt
distance = [100, 200, 300, 400, 500, 3000]
value = [10, 15, 50, 74, 95, 98]
distance = [str(number) for number in distance]
plt.barh(distance, value, height=0.75)
Note that you have to change the height: with categorical labels the bars are one unit apart, so height=0.75 instead of 75.
Alternatively, you can use a range of numbers as the y-values (via the range() function) to position the horizontal bars, and then set the tick labels as desired with plt.yticks(), whose first argument is the tick positions and whose second argument is the tick labels.
import matplotlib.pyplot as plt
distance = [100, 200, 300, 400, 500, 3000]
value = [10, 15, 50, 74, 95, 98]
plt.barh(range(len(distance)), value, height=0.6)
plt.yticks(range(len(distance)), distance)
plt.show()
Suppose one has an array of observation times ts, each of which corresponds to some observed value in vs. The observation times are taken to be the number of elapsed hours (starting from zero) and can contain duplicates. I would like to find the indices that correspond to the maximum observed value per unique observation time. I am asking for the indices as opposed to the values, unlike a similar question I asked several months ago. This way, I can apply the same indices on various arrays. Below is a sample dataset, which I would like to use to adapt a code for a much larger dataset.
import numpy as np
ts = np.array([0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10])
vs = np.array([500, 600, 550, 700, 500, 500, 450, 800, 900, 700, 600, 850, 850, 900, 900, 900])
My current approach is to split the array of values at every point where the observation time changes.
condition = np.where(np.diff(ts) != 0)[0]+1
ts_spl = np.split(ts, condition)
vs_spl = np.split(vs, condition)
print(ts_spl)
>> [array([0, 0]), array([1]), array([2]), array([3, 3, 3]), array([4, 4]), array([5]), array([6]), array([7]), array([8, 8]), array([9]), array([10])]
print(vs_spl)
>> [array([500, 600]), array([550]), array([700]), array([500, 500, 450]), array([800, 900]), array([700]), array([600]), array([850]), array([850, 900]), array([900]), array([900])]
In this case, duplicate max values at any duplicate times should be counted. Given this example, the returned indices would be:
[1, 2, 3, 4, 5, 8, 9, 10, 11, 13, 14, 15]
# indices = 4,5,6 correspond to values = 500, 500, 450 ==> count indices 4,5
# I might modify this part of the algorithm to return either 4 or 5 instead of 4,5 at some future time
Though I have not yet been able to adapt this algorithm for my purpose, I think it must be possible to exploit the size of each previously-split array in vs_spl to keep an index counter. Is this approach feasible for a large dataset (10,000 elements per array before padding; 70,000 elements per array after padding)? If so, how can I adapt it? If not, what are some other approaches that may be useful here?
70,000 isn't that insanely large, so yes it should be feasible. It is, however, faster to avoid the splitting and use the .reduceat method of relevant ufuncs. reduceat is like reduce applied to chunks, but you don't have to provide the chunks, just tell reduceat where you would have cut to get them. For example, like so
import numpy as np
N = 10**6
ts = np.cumsum(np.random.rand(N) < 0.1)
vs = 50*np.random.randint(10, 20, (N,))
#ts = np.array([0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10])
#vs = np.array([500, 600, 550, 700, 500, 500, 450, 800, 900, 700, 600, 850, 850, 900, 900, 900])
# flatnonzero is a bit faster than where
condition = np.r_[0, np.flatnonzero(np.diff(ts)) + 1, len(ts)]
sizes = np.diff(condition)
maxima = np.repeat(np.maximum.reduceat(vs, condition[:-1]), sizes)
maxat = maxima == vs
indices = np.flatnonzero(maxat)
# if you want to know how many maxima at each hour
nmax = np.add.reduceat(maxat, condition[:-1])
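For reference, running the same steps on the small arrays from the question reproduces the indices listed there:
import numpy as np

ts = np.array([0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10])
vs = np.array([500, 600, 550, 700, 500, 500, 450, 800, 900, 700, 600, 850, 850, 900, 900, 900])

condition = np.r_[0, np.flatnonzero(np.diff(ts)) + 1, len(ts)]
sizes = np.diff(condition)
# broadcast each hour's maximum back onto its elements, then keep matching positions
maxima = np.repeat(np.maximum.reduceat(vs, condition[:-1]), sizes)
indices = np.flatnonzero(maxima == vs)
print(indices)  # [ 1  2  3  4  5  8  9 10 11 13 14 15]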
I am trying to label a scatter/bubble chart I create from matplotlib with entries from a column in a pandas data frame. I have seen plenty of examples and questions related (see e.g. here and here). Hence I tried to annotate the plot accordingly. Here is what I do:
import matplotlib.pyplot as plt
import pandas as pd
#example data frame
x = [5, 10, 20, 30, 5, 10, 20, 30, 5, 10, 20, 30]
y = [100, 100, 200, 200, 300, 300, 400, 400, 500, 500, 600, 600]
s = [5, 10, 20, 30, 5, 10, 20, 30, 5, 10, 20, 30]
users =['mark', 'mark', 'mark', 'rachel', 'rachel', 'rachel', 'jeff', 'jeff', 'jeff', 'lauren', 'lauren', 'lauren']
df = pd.DataFrame(dict(x=x, y=y, users=users))
#my attempt to plot things
plt.scatter(x_axis, y_axis, s=area, alpha=0.5)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.annotate(df.users, xy=(x,y))
plt.show()
I use a pandas DataFrame and I somehow get a KeyError, so I guess a dict() object is expected? Is there any other way to label the data using entries from a pandas data frame?
You can use DataFrame.plot.scatter and then annotate in a loop, selecting the values with DataFrame.iat:
ax = df.plot.scatter(x='x', y='y', alpha=0.5)
for i, txt in enumerate(df.users):
    ax.annotate(txt, (df.x.iat[i], df.y.iat[i]))
plt.show()
Jezreal's answer is fine, but I will post this just to show what I meant with df.iterrows in the other thread.
I'm afraid you have to put the scatter (or plot) command in the loop as well if you want to have a dynamic size.
df = pd.DataFrame(dict(x=x, y=y, s=s, users=users))
fig, ax = plt.subplots(facecolor='w')
for key, row in df.iterrows():
    ax.scatter(row['x'], row['y'], s=row['s']*5, alpha=.5)
    ax.annotate(row['users'], xy=(row['x'], row['y']))
plt.show()