Open and plot several data files on the same plot in Python - python-3.x

Newbie here, first question.
I have several data files that I want to open, extract the relevant data (x and y) from, and plot on the same plot.
I know how to do it if I type out a plot statement for each of them, but what I want is a single function or script that takes the filenames as input, extracts the data (this part depends on the type of file, but I think I know how to do it) and then creates one single plot with the different datasets. It should be pretty basic, but all my attempts return one plot per file.
I think my problem is that I have not understood how the whole ax, fig, gca, plot loop works, as I have been learning mostly by adapting things and doing.
So far I have created a for loop that opens each file, gets the data and stores it in a dataframe (one dataframe per file), then uses plt.plot to plot. Then, outside the loop, I have a plt.gca() that I intended to pull things together so that I can modify the plot, add things to it and save it. I have also tried moving the gca call and using ax and fig, playing around with a few tutorials, but never with satisfying results.
I get different kinds of errors depending on the iteration of the script; here is one of my attempts. If there's an electrochemist among you, they might recognize the data type :) but the data type should not be important.
EDIT: I modified the script, as it had a couple of errors. The current version returns an empty plot; the dataframe is created properly, from what I can see.
import os
import pandas as pd
import matplotlib.pyplot as plt
files = ['file1.i2b', 'file2.i2b']
colors = []
fig_name = ''
file_type = 'i2b'
norm = []
if len(colors) != len(files):
    l = len(files)
    col_list = ['b', 'g', 'r', 'c', 'm', 'y', 'k']
    color_list = col_list[0:l]
if len(norm) != len(files):
    norm = [1]*len(files)
if file_type == 'i2b':
    for filename, norm_factor, col in zip(files, norm, color_list):
        flnm1 = os.path.splitext(filename)[0]
        data_xrd = pd.read_csv(filename, sep=' ', decimal='.', skiprows=10,
                               header=None, names=['Freq', 'Real_part', 'Imm_part'])
        data_xrd['norm_Imm_part'] = (0-data_xrd['Imm_part'])*norm_factor
        data_xrd['norm_Re_part'] = data_xrd['Real_part']*norm_factor
        plt.plot(x=data_xrd['norm_Re_part'], y=data_xrd['norm_Imm_part'],
                 legend=flnm1, style='-', color=col)
#plt.show
plt.gca()
#plt.axhline(y=0, color='k', linestyle='--')
#plt.set_xlabel('Z_real [Ohm]')
#plt.set_ylabel('Z_imm [Ohm]')
#plt.set_aspect('equal')
plt.savefig(fig_name + '.png')
Now, it might be better to split the data extraction into a separate function, so that the plotting function is more flexible and can be paired with different kinds of data input. At the moment, though, I'd just like to understand how to plot multiple files on a single plot simply by passing a list of their names as input, to make it easier to group and plot a lot of data files.
Thanks for the help, and please let me know how to improve my question!
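One way to get every file onto the same plot is to create a single figure and Axes before the loop and call ax.plot on that same Axes for each file, passing x and y positionally. The sketch below assumes the layout from the question (three space-separated columns after 10 header lines). The function name plot_files and its parameters are hypothetical, and this is only one possible arrangement.

```python
import os

import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for an interactive backend
import matplotlib.pyplot as plt
import pandas as pd


def plot_files(files, norm=None, skiprows=10):
    """Plot -Im(Z) against Re(Z) for every file on one shared Axes (hypothetical helper)."""
    if norm is None or len(norm) != len(files):
        norm = [1] * len(files)
    fig, ax = plt.subplots()  # created once, before the loop
    for filename, factor in zip(files, norm):
        label = os.path.splitext(os.path.basename(filename))[0]
        data = pd.read_csv(filename, sep=" ", skiprows=skiprows, header=None,
                           names=["Freq", "Real_part", "Imm_part"])
        # x and y are positional; the label feeds the legend and the
        # colours come from matplotlib's default colour cycle.
        ax.plot(data["Real_part"] * factor, -data["Imm_part"] * factor,
                "-", label=label)
    # Decorations are applied to the Axes object, after the loop.
    ax.axhline(y=0, color="k", linestyle="--")
    ax.set_xlabel("Z_real [Ohm]")
    ax.set_ylabel("Z_imm [Ohm]")
    ax.set_aspect("equal")
    ax.legend()
    return fig, ax
```

It could then be called as fig, ax = plot_files(['file1.i2b', 'file2.i2b']) followed by fig.savefig('combined.png'), where the output name is a placeholder.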

Related

Sorting algorithm visualizer: how to highlight the current element being accessed and compared in the algorithm?

So I'm trying to write a sorting algorithm visualizer; code below. I am basically using matplotlib to plot the figure. My problem is that I also want to highlight the current element in the array being accessed, compared, and swapped. All of my attempts at this have failed. Please also let me know if there is a better way of writing a visualizer in Python. I have seen some tutorials using pygame but wanted to stick to basics. Also, when the program runs to the end and everything is sorted, the plot goes blank. Is this because of the plt.clf() command, and is there a way to keep the sorted plot from closing? Thanks!!!
from matplotlib import pyplot as plt
import numpy as np
# generate pseudo-random list of numbers
lst = np.random.randint(0, 100, 20)
# x values for the bar plot
x = range(0, len(lst))
def insertion_sort(lst):
    # loop through the list
    # incrementally check which index to the left should i be placed in
    for i in range(1, len(lst)):
        while lst[i-1] > lst[i] and i>0:
            lst[i], lst[i-1] = lst[i-1], lst[i]
            i = i-1
            # plot
            plt.bar(x,lst)
            plt.pause(0.1)
            plt.clf()
    plt.show()
    return lst
print(lst)
print(insertion_sort(lst))
So the solution I came up with was to create a second list containing the current i and i-1 indexes and plot a second bar chart, in a different color, over the main one. It was a bad solution and it failed. Another idea I tried was to pass a conditional expression for the color parameter of plt.bar():
colors = ['red' if lst[i-1]>lst[i] else for element in lst 'blue']
plt.bar(x, lst, color=colors)
This did not work either. I don't know if I am on the right track and just need to keep at it, or if this whole setup is futile to begin with. Thank you for your time!!
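A possible way to approach the highlighting is to build the color list from bar positions rather than from values, and to clear the axes before each frame instead of after it, so the last (sorted) frame stays on screen. The sketch below is only one arrangement: the names insertion_sort_steps and animate are hypothetical, and the sort is written as a generator so the drawing code stays separate.

```python
import matplotlib.pyplot as plt
import numpy as np


def insertion_sort_steps(values):
    """Yield (snapshot, active_indices) after every swap, then the final state."""
    values = list(values)
    for start in range(1, len(values)):
        i = start
        # Check i > 0 first so values[i-1] never wraps around to the last element.
        while i > 0 and values[i - 1] > values[i]:
            values[i], values[i - 1] = values[i - 1], values[i]
            yield list(values), (i - 1, i)
            i -= 1
    yield list(values), ()  # final frame, nothing highlighted


def animate(values, pause=0.1):
    """Draw one bar chart per step, colouring the two swapped bars red."""
    x = range(len(values))
    for snapshot, active in insertion_sort_steps(values):
        colors = ["red" if k in active else "blue" for k in x]
        plt.cla()                      # clear BEFORE drawing, not after
        plt.bar(x, snapshot, color=colors)
        plt.pause(pause)
    plt.show()                         # the sorted chart is still on the axes


# Example call (left commented out so importing this sketch opens no window):
# animate(np.random.randint(0, 100, 20))
```

Because the final frame is drawn and never cleared, plt.show() has something to display once the sort finishes.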

Pandas dropped row showing in plot

I am trying to make a heatmap.
I get my data out of a pipeline that classifies some rows as noisy. I decided to make one plot that includes them and one plot without them.
The problem I have: in the plot without the noisy rows, blank lines appear (as many blank lines as rows removed).
Roughly, the code looks like this (I can expand parts if required; I am trying to keep it short).
If needed, I can provide a link to similar, publicly available data.
data_frame = load_df_fromh5(file) # load a data frame from the hdf5 output
noisy = [..] # a list of 0/1 flags indicating which rows are noisy
# I believe the problem is here:
noisy = [i for (i, v) in enumerate(noisy) if v == 1] # make a list of the indexes to remove
# drop the corresponding indexes
df_cells_noisy = df_cells[~df_cells.index.isin(noisy)].dropna(how="any")
# I tried an alternative method:
not_noisy = [0 if e==1 else 1 for e in noisy]
df = df[np.array(not_noisy, dtype=bool)]
# then I made a clustering using scipy
Z = hierarchy.linkage(df, method="average", metric="canberra", optimal_ordering=True)
df = df.reindex(hierarchy.leaves_list(Z))
# then I plot using the df variable
# quite a long function; I believe the problem is upstream.
plot(df)
The plotting function is quite long, but I believe it works well because the problem only shows up with the data frame that excludes the noisy rows.
I believe pandas somehow keeps information about the deleted rows and that they are plotted as blank lines. Any help is welcome.
Context:
These are single-cell copy number anomaly data (abnormalities in the number of copies of a genomic segment).
Rows represent individuals (here, individual cells); columns represent, for each genomic interval, the number of copies (2 is the normal value, except on sex chromosomes).
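A likely cause, assuming the data frame has the default integer index: hierarchy.leaves_list(Z) returns positions 0..n-1, whereas DataFrame.reindex looks rows up by label. After rows are dropped the labels have gaps, so reindex asks for labels that no longer exist and fills those rows with NaN. The toy sketch below reproduces this with a hard-coded order standing in for the leaves_list output; the data and variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0]})
kept = df.drop(index=[1])        # remaining labels: 0, 2, 3 (label 1 is gone)

# Stand-in for hierarchy.leaves_list(Z): positions in the 3-row frame.
order = [2, 0, 1]

by_label = kept.reindex(order)   # looks up LABELS 2, 0, 1 -> label 1 becomes a NaN row
by_position = kept.iloc[order]   # looks up POSITIONS 2, 0, 1 -> no NaN rows

print(by_label)
print(by_position)
```

Selecting with iloc, or calling reset_index(drop=True) right after dropping the rows, keeps positions and labels in agreement.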

Need help in creating a function to plot a Matplotlib GridSpec

I have a dataset with 80 variables. I am interested in creating a function that will automate the creation of a 20 X 4 GridSpec in Matplotlib. Each subplot would either contain a histogram or a barplot for each of the 80 variables in the data. As a first step, I successfully created two functions (I call them 'counts' and 'histogram') that contain the layout of the plot that I want. Both of them work when tested on individual variables. As a next step, I attempted to create a function that would take the column names, loop through a conditional to test whether the data type is an object or otherwise and call the right function based on the datatype as a new subplot. Here is the code that I have so far:
Creates list of coordinates we will need for subplot specification:
import numpy as np
import matplotlib.pyplot as plt
A = np.arange(21)
B = np.arange(4)
coords = []
for i in A:
    for j in B:
        coords.append([A[i], B[j]])
#Create the gridspec and layout the figure
import matplotlib.gridspec as gridspec
fig = plt.figure(figsize=(12,6))
gs = gridspec.GridSpec(2,4)
#Function that relies on what we've done above:
def grid(cols=['MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley']):
    for i in cols:
        for vals in coords:
            if str(train[i].dtype) == 'object':
                plt.subplot('gs'+str(vals))
                counts(cols)
            else:
                plt.subplot('gs'+str(vals))
                histogram(cols)
When attempted, this code returns an error:
ValueError: Single argument to subplot must be a 3-digit integer
To help you visualize what I am hoping to achieve, I attach the screenshot below, which was produced by the line-by-line coding (with my helper functions) that I am trying to avoid:
Can anyone help me figure out where I am going wrong? I would appreciate any advice. Thank you!
The line plt.subplot('gs'+str(vals)) cannot work, which is also what the error tells you: it passes a string such as 'gs[0, 0]' to subplot instead of indexing the GridSpec object.
As can be seen from the matplotlib GridSpec tutorial, it needs to be
ax = plt.subplot(gs[0, 0])
So in your case you may use the values from the list as
ax = plt.subplot(gs[vals[0], vals[1]])
Mind that the coords list must then have exactly n*m elements if the gridspec is defined as gs = gridspec.GridSpec(n,m); in the question, coords has 21*4 entries while the gridspec is only 2x4.
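Putting this together, one way to fill the grid is to pair each column with exactly one grid cell, rather than nesting the two loops. The sketch below is an illustration under assumptions: the small train frame is made-up data, the grid function signature is hypothetical, and plain bar and hist calls stand in for the counts and histogram helpers from the question.

```python
import itertools

import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop for interactive use
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import pandas as pd

# Made-up stand-in for the question's `train` data frame.
train = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Grvl"],
})


def grid(data, cols, n_rows=2, n_cols=4):
    """Draw one subplot per column: histogram if numeric, bar chart of counts otherwise."""
    fig = plt.figure(figsize=(12, 6))
    gs = gridspec.GridSpec(n_rows, n_cols, figure=fig)
    cells = itertools.product(range(n_rows), range(n_cols))
    # zip pairs each column with one cell and stops when either runs out.
    for col, (r, c) in zip(cols, cells):
        ax = fig.add_subplot(gs[r, c])
        if pd.api.types.is_numeric_dtype(data[col]):
            ax.hist(data[col])
        else:
            counts = data[col].value_counts()
            ax.bar(list(counts.index.astype(str)), counts.values)
        ax.set_title(col)
    return fig


fig = grid(train, list(train.columns))
```

For the 80 variables in the question, n_rows=20 and n_cols=4 would give one cell per variable.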

Matplotlib legend in increasing order

I have text files named 5.txt, 10.txt, 15.txt, 20.txt, but when I read the files with the glob module and use the fname variable in the legend, the legend entries come out in the wrong order.
for fname in glob("*.txt"):
    potential, current_density = np.genfromtxt(fname, unpack=True)
    current_density = current_density*1e6
    ax = plt.gca()
    ax.get_yaxis().get_major_formatter().set_useOffset(False)
    plt.plot(potential,current_density, label=fname[0:-4])
plt.legend(loc=4,prop={'size':12},
           ncol=1, shadow=True, fancybox=True,
           title = "Scan rate (mV/s)")
How can I plot the data and show the corresponding labels in increasing order?
Just to provide yet another method, which does not require to change anything in the plotting part of the script:
handles, labels = plt.gca().get_legend_handles_labels()
handles, labels = zip(*[ (handles[i], labels[i]) for i in sorted(range(len(handles)), key=lambda k: list(map(int,labels))[k])] )
plt.legend(handles, labels, loc=4, ...)
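The same idea can be written a little more readably by sorting (handle, label) pairs directly. This is a self-contained sketch with made-up straight-line data standing in for the measured curves; only the labels matter here.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop for interactive use
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Lines are plotted in an arbitrary order, as glob might return the files.
for rate in ["20", "5", "15", "10"]:
    ax.plot([0, 1], [0, int(rate)], label=rate)

handles, labels = ax.get_legend_handles_labels()
# Sort the pairs by the numeric value of the label, then split them again.
pairs = sorted(zip(handles, labels), key=lambda pair: int(pair[1]))
handles, labels = zip(*pairs)
legend = ax.legend(handles, labels, loc=4, title="Scan rate (mV/s)")
```

Sorting by int(label) avoids the text ordering in which "10" would come before "5".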
Method 1 (Recommended)
You will need to sort and display the legend yourself. plt.legend takes a list of lines and a list of strings as the first two optional positional arguments. You can maintain a list of the items you need, sort it into the order you want, and pass the portions you want over to legend.
ax = plt.gca()
legend_items = []
for fname in glob("*.txt"):
    potential, current_density = np.genfromtxt(fname, unpack=True)
    current_density *= 1e6
    line, = ax.plot(potential, current_density)
    name = fname[0:-4]
    legend_items.append((int(name), line, name))
legend_items.sort()
ax.get_yaxis().get_major_formatter().set_useOffset(False)
ax.legend([x[1] for x in legend_items], [x[2] for x in legend_items],
          loc=4, prop={'size':12}, ncol=1, shadow=True,
          fancybox=True, title = "Scan rate (mV/s)")
Major additions are marked in bold, while minor style changes that can probably be ignored are marked in italics.
Major additions include the accumulation of the items for the legend. I use tuples for each item because a list of tuples is sorted by the first element first. The comma in line, = ax.plot... is necessary because it unpacks the one-element list that plot returns. An alternative would be line = ax.plot(...)[0]. The file name is no longer added as an explicit label to the data.
Among the minor changes, I switched to using ax.plot and ax.legend instead of plt.plot and plt.legend. This is the object-oriented part of Matplotlib's API and it makes things a little clearer. It also means you don't have to keep calling gca() to get the reference over and over. Finally, set_useOffset needs to be called only once, not inside the loop.
Method 2
Another way to approach the problem would be to pre-sort the file names before processing them, so that they appear in the correct order in your legend:
import os
file_list = os.listdir('.')
file_list = [x for x in file_list if x.endswith('.txt')]
file_list.sort(key=lambda x: int(x[0:-4]))
for fname in file_list:
    ...
You will have to do the name filtering yourself, but it is not especially difficult. The sorting key is just the number. Also, you will note that I got tired of doing the custom fancy formatting for this update :)
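The numeric sort key can be isolated in a small helper and checked without touching the file system. The helper names below are hypothetical, and the sketch assumes every file name stem is an integer, as in the question.

```python
import os


def numeric_stem(path):
    """Return the file name without directory or extension, as an int."""
    return int(os.path.splitext(os.path.basename(path))[0])


def sort_by_number(paths):
    """Sort file names by the number in their stem rather than as text."""
    return sorted(paths, key=numeric_stem)


names = ["5.txt", "20.txt", "10.txt", "15.txt"]
print(sorted(names))          # text order puts '5.txt' last
print(sort_by_number(names))  # numeric order
```

The same key works on the list returned by glob("*.txt"), so the plotting loop itself needs no changes.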
Don't know if this is relevant, but I ended up here anyway. I found I didn't need the middle line. If you want 2 columns, this worked for me:
handles, labels = plt.gca().get_legend_handles_labels()
plt.legend(handles, labels, loc=4,
ncol=2, shadow=True, title="Legend", fancybox=True)

Matplotlib - Stacked Bar Chart with ~1000 Bars

Background:
I'm working on a program to show a 2d cross section of 3d data. The data is stored in a simple text csv file in the format x, y, z1, z2, z3, etc. I take a start and end point and flick through the dataset (~110,000 lines) to create a line of points between these two locations, and dump them into an array. This works fine, and fairly quickly (takes about 0.3 seconds). To then display this line, I've been creating a matplotlib stacked bar chart. However, the total run time of the program is about 5.5 seconds. I've narrowed the bulk of it (3 seconds worth) down to the code below.
'values' is an array with the x, y and z values plus a leading identifier, which isn't used in this part of the code. The first plt.bar is plotting the bar sections, and the second is used to create an arbitrary floor of -2000. In order to generate a continuous looking section, I'm using an interval between each bar of zero.
import matplotlib.pyplot as plt
for values in crossSection:
    prevNum = None
    layerColour = None
    if values is not None:
        for i in range(3, len(values)):
            if values[i] != 'n':
                num = float(values[i].strip())
                if prevNum is not None:
                    plt.bar(spacing, prevNum-num, width=interval, \
                            bottom=num, color=layerColour, \
                            edgecolor=None, linewidth=0)
                prevNum = num
                layerColour = layerParams[i].strip()
        if prevNum is not None:
            plt.bar(spacing, prevNum+2000, width=interval, bottom=-2000, \
                    color=layerColour, linewidth=0)
    spacing += interval
I'm sure there's a more efficient way to do this, but I'm new to Matplotlib and still unfamiliar with its capabilities. The other main use of time in the code is:
plt.savefig('output.png')
which takes about a second, but I figure this is to be expected to save the file and I can't do anything about it.
Question:
Is there a faster way of generating the same output (a stacked bar chart or something that looks like one) by using plt.bar() better, or a different Matplotlib function?
EDIT:
I forgot to mention in the original post that I'm using Python 3.2.3 and Matplotlib 1.2.0
Leaving this here in case someone runs into the same problem...
While not exactly the same as using bar(), with a sufficiently large dataset (large enough that using bar() takes a few seconds) the results are indistinguishable from stackplot(). If I sort the data into layers using the method given by tcaswell and feed it into stackplot() the chart is created in 0.2 seconds, rather than 3 seconds.
EDIT
Code provided by tcaswell to turn the data into layers:
accum_values = []
for values in crossSection:
    accum_values.append([float(v.strip()) for v in values[3:]])
accum_values = np.vstack(accum_values).T
layer_params = [l.strip() for l in layerParams]
bottom = np.zeros(accum_values[0].shape)
It looks like you are drawing each bar individually; you can pass sequences to bar (see this example)
I think something like:
import numpy as np
accum_values = []
for values in crossSection:
    accum_values.append([float(v.strip()) for v in values[3:]])
accum_values = np.vstack(accum_values).T
layer_params = [l.strip() for l in layerParams]
bottom = np.zeros(accum_values[0].shape)
ax = plt.gca()
spacing = interval*np.arange(len(accum_values[0]))
for data,color in zip(accum_values,layer_params):
    ax.bar(spacing,data,bottom=bottom,color=color,linewidth=0,width=interval)
    bottom += data
will be faster (because each call to bar creates one BarContainer and I suspect the source of your issues is you were creating one for each bar, instead of one for each layer).
I don't really understand what you are doing with the bars that have tops below their bottoms, so I didn't try to implement that, so you will have to adapt this a bit.
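To make the two suggestions in this thread concrete, here is a self-contained sketch that draws the same layered section once with one bar call per layer and once with stackplot. The data are random layer thicknesses invented for illustration, and the colours and sizes are arbitrary; it does not reproduce the -2000 floor from the question.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
n_points, interval = 200, 1.0
# One row per layer, one column per position along the section (made-up data).
layers = rng.uniform(1.0, 5.0, size=(3, n_points))
colours = ["tan", "peru", "sienna"]
x = interval * np.arange(n_points)

fig, (ax_bar, ax_stack) = plt.subplots(1, 2, figsize=(10, 4))

# One bar() call per layer: three calls in total instead of 3 * n_points.
bottom = np.zeros(n_points)
for thickness, colour in zip(layers, colours):
    ax_bar.bar(x, thickness, bottom=bottom, width=interval,
               color=colour, linewidth=0)
    bottom += thickness

# stackplot() takes all layers at once and draws one filled polygon per layer.
polys = ax_stack.stackplot(x, layers, colors=colours)
```

The bar version still creates one rectangle per data point, while stackplot creates only one artist per layer, which is consistent with the speed difference reported above.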

Resources