Graphing three datasets in one graph with Python - python-3.x
How can I plot the graph
Getting the data from those 3 sources
Using only first letter and last digits of the first column to put it in the X-axis as in the Excel graph above
How can I show the first-column labels only at intervals of 20 (aa010, aa030, aa050, ... etc.)?
I have three different datasets from a source. Each of them has 2 columns. Some names in the first column appear in more than one of the 3 sources, but each source has different data for that name in the second column.
I need to use Python to plot those 3 datasets on one graph.
The X-axis should be the combination of the first columns of the three datasets. The data is in the format: aa001 - (up to sometimes aa400); ab001 - (up to sometimes ab400).
So the X-axis should start with aa001 and end with ab400. Since showing every name would overfill the X-axis and make it unreadable at a normal size, I want to show only aa020, aa040, ... (using the number in the string, show a label only at every +20 within aa and within ab).
The Y-axis should just be numbers from 0 to 10000 (I may want to change this if at least one of the datasets has a maximum above 10000).
I will add the sample graph I created using Excel.
My sample data would be (Note: the data is not sorted by any column, and I would prefer to sort it as stated above: aa001 ... ab400):
Data1
Name Number
aa001 123
aa032 4211
ab400 1241
ab331 33
Data2
Name Number
aa002 1213
aa032 41
ab378 4231
ab331 63
aa163 999
Data3
Name Number
aa209 9876
ab132 5432
ab378 4124
aa031 754
aa378 44
ab344 1346
aa222 73
aa163 414
ab331 61
I looked into Matplotlib and found an example that plots the way I want (with dots for each x-y point), but it does not apply directly to my question.
This is the similar code I found:
import matplotlib.pyplot as plt

x = range(100)
y = range(100, 200)
fig = plt.figure()
ax1 = fig.add_subplot(111)
# first four points as blue squares
ax1.scatter(x[:4], y[:4], s=10, c='b', marker="s", label='first')
# points from index 40 onward as red circles
ax1.scatter(x[40:], y[40:], s=10, c='r', marker="o", label='second')
plt.legend(loc='upper left')
plt.show()
Sample graph (on its X-axis, bc stands in for aa and mc for ab)
I expect to see a graph like the following, but with only every 20th label on the X-axis. (I want the first graph drawn with symbols like the second graph, and the second graph to use the X-axis of the first graph, but showing only every 20th name.)
First graph: I want to use an X-axis like this, but without a label for every data point (only every 20th).
Second graph: I want to use symbols instead of lines, as in this one.
Please, let me know if I need to provide any other information or clarify/correct myself. Any help is appreciated!
The answer is as follows, although the code below may still contain errors. The final answer will be posted, once a complete answer is received, under the following question:
Using sorted file to plot X-axis with corresponding Y-values from the original file
from matplotlib import pyplot as plt
import numpy as np
import csv

# read the file that holds the names in the desired (sorted) order
csv_file = []
with open('hostnum.csv', 'r') as f:
    csvreader = csv.reader(f)
    for line in csvreader:
        csv_file.append(line)

# read the unsorted data
us_csv_file = []
with open('unsorted.csv', 'r') as f:
    csvreader = csv.reader(f)
    for line in csvreader:
        us_csv_file.append(line)

# ASSUMPTION: both files hold the name in column 0 and unsorted.csv holds the number in column 1
csv_list = [row[0] for row in csv_file]
# order the unsorted rows by the position of their name in the sorted file
us_csv_file.sort(key=lambda x: csv_list.index(x[0]))
plt.plot([int(item[1]) for item in us_csv_file], 'o-')
plt.xticks(np.arange(len(us_csv_file)), [item[0] for item in us_csv_file])
plt.show()
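For the question as asked (three sources, symbols instead of lines, a label only at every 20th name), one possible approach is sketched below. It is a minimal sketch under stated assumptions rather than the final answer referred to above: the three datasets are hard-coded from the sample data, the prefixes are assumed to be exactly aa and ab with numbers up to 400, and the helper names name_to_x, tick_positions_and_labels and plot_datasets are made up for illustration. Each name is mapped to a numeric x position so that ticks can be placed at aa020, aa040, ... even when those names do not occur in the data.

```python
# Sample data from the question, one dict per source (name -> number).
data1 = {"aa001": 123, "aa032": 4211, "ab400": 1241, "ab331": 33}
data2 = {"aa002": 1213, "aa032": 41, "ab378": 4231, "ab331": 63, "aa163": 999}
data3 = {"aa209": 9876, "ab132": 5432, "ab378": 4124, "aa031": 754, "aa378": 44,
         "ab344": 1346, "aa222": 73, "aa163": 414, "ab331": 61}

PREFIXES = ["aa", "ab"]  # assumed order of the name prefixes on the x-axis
BLOCK = 400              # assumed highest number per prefix (aa001..aa400)


def name_to_x(name):
    """Map a name such as 'ab020' to a numeric x position (here 420)."""
    prefix, number = name[:2], int(name[2:])
    return PREFIXES.index(prefix) * BLOCK + number


def tick_positions_and_labels(step=20):
    """Return x positions and labels for every `step`-th name of each prefix."""
    positions, labels = [], []
    for prefix in PREFIXES:
        for number in range(step, BLOCK + 1, step):
            name = f"{prefix}{number:03d}"
            positions.append(name_to_x(name))
            labels.append(name)
    return positions, labels


def plot_datasets(datasets, step=20):
    """Scatter-plot each dataset with its own marker on a shared x-axis."""
    import matplotlib.pyplot as plt  # imported here so the helpers work without it

    fig, ax = plt.subplots(figsize=(14, 5))
    markers = ["s", "o", "^"]
    for (label, data), marker in zip(datasets.items(), markers):
        names = sorted(data)  # fixed-width names sort correctly as strings
        ax.scatter([name_to_x(n) for n in names], [data[n] for n in names],
                   s=15, marker=marker, label=label)
    positions, labels = tick_positions_and_labels(step)
    ax.set_xticks(positions)
    ax.set_xticklabels(labels, rotation=90, fontsize=7)
    # 0-10000 by default, extended if any dataset has a larger maximum
    ax.set_ylim(0, max(10000, max(max(d.values()) for d in datasets.values())))
    ax.legend(loc="upper left")
    return fig, ax
```

Calling plot_datasets({"Data1": data1, "Data2": data2, "Data3": data3}) and then plt.show() should draw the three sources with different symbols. Reading the data from files instead of dicts is left out here because the file layout is not given in the question.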
Related
Creating subplots through a loop from a dataframe
Case: I receive a dataframe with (say 50) columns. I extract the necessary columns from that dataframe using a condition, so we now have a list of selected columns (say this variable is sel_cols). I need a bar chart of value_counts() for each of these columns, and I need to arrange all these bar charts in 3 columns, with the number of rows varying based on the number of columns selected in sel_cols. So if, say, 8 columns were selected, I want the figure to have 3 columns and 3 rows, with the last subplot empty, or just 8 subplots in a 3x3 matrix if that is possible.
I could generate each chart separately using the following code:
for col in sel_cols:
    df[col].value_counts().plot(kind='bar')
    plt.show()
plt.show() is inside the loop so that each chart is shown and not just the last one.
I also tried appending these charts to a list this way:
charts = []
for col in sel_cols:
    charts.append(df[col].value_counts().plot(kind='bar'))
I could convert this list into a numpy array through reshape(), but then it would have to be perfectly divisible into that shape. So 8 chart objects cannot be reshaped into a 3x3 array.
Then I tried creating the subplots first in this way:
row = len(sel_cols)//3
fig, axes = plt.subplots(nrows=row,ncols=3)
This way I would get the subplots, but I get two problems:
I end up with extra subplots in the 3 columns which will go unplotted (8 columns example).
I do not know how to plot under each subplot through a loop.
I tried this:
for row in axes:
    for chart, col in zip(row,sel_cols):
        chart = data[col].value_counts().plot(kind='bar')
But this only plots the last subplot with the last column. All other subplots stay blank.
How can I do this with minimal lines of code, possibly without any need for human verification of the final subplot placements?
You may use this sample dataframe:
pd.DataFrame({'A':['Y','N','N','Y','Y','N','N','Y','N'],
              'B':['E','E','E','E','F','F','F','F','E'],
              'C':[1,1,0,0,1,1,0,0,1],
              'D':['P','Q','R','S','P','Q','R','P','Q'],
              'E':['E','E','E','E','F','F','G','G','G'],
              'F':[1,1,0,0,1,1,0,0,1],
              'G':['N','N','N','N','Y','N','N','Y','N'],
              'H':['G','G','G','E','F','F','G','F','E'],
              'I':[1,1,0,0,1,1,0,0,1],
              'J':['Y','N','N','Y','Y','N','N','Y','N'],
              'K':['E','E','E','E','F','F','F','F','E'],
              'L':[1,1,0,0,1,1,0,0,1],
              })
Selected columns are:
sel_cols = ['A','B','D','E','G','H','J','K']
Total 8 columns.
Expected output is bar charts for value_counts() of each of these columns, arranged in subplots in a figure with 3 columns. The number of rows is decided by the number of columns selected; here 8, so 3 rows.
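A side note on the row count in the attempt above: len(sel_cols)//3 rounds down, so 8 selected columns would give only 2 rows (6 subplots). A minimal sketch of rounding up instead follows; the helper name grid_shape is hypothetical and not part of pandas or Matplotlib.

```python
import math


def grid_shape(n_plots, ncols=3):
    """Return (nrows, ncols) large enough to hold n_plots subplots."""
    nrows = math.ceil(n_plots / ncols)  # round up so a partial last row still fits
    return nrows, ncols


sel_cols = ['A', 'B', 'D', 'E', 'G', 'H', 'J', 'K']
nrows, ncols = grid_shape(len(sel_cols))
unused = nrows * ncols - len(sel_cols)  # trailing axes that would stay empty
print(nrows, ncols, unused)  # prints: 3 3 1
```

The resulting nrows and ncols can be passed to plt.subplots, and the unused count tells how many trailing axes need to be switched off.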
Given OP's sample data:
df = pd.DataFrame({'A':['Y','N','N','Y','Y','N','N','Y','N'],
                   'B':['E','E','E','E','F','F','F','F','E'],
                   'C':[1,1,0,0,1,1,0,0,1],
                   'D':['P','Q','R','S','P','Q','R','P','Q'],
                   'E':['E','E','E','E','F','F','G','G','G'],
                   'F':[1,1,0,0,1,1,0,0,1],
                   'G':['N','N','N','N','Y','N','N','Y','N'],
                   'H':['G','G','G','E','F','F','G','F','E'],
                   'I':[1,1,0,0,1,1,0,0,1],
                   'J':['Y','N','N','Y','Y','N','N','Y','N'],
                   'K':['E','E','E','E','F','F','F','F','E'],
                   'L':[1,1,0,0,1,1,0,0,1]})
sel_cols = list('ABDEGHJK')
# pd.Series.value_counts is used here because the top-level pd.value_counts is deprecated in recent pandas
data = df[sel_cols].apply(pd.Series.value_counts)
We can plot the columns of data in several ways (in order of simplicity):
1. DataFrame.plot with subplots param
2. seaborn.catplot
3. Loop through plt.subplots

1. DataFrame.plot with subplots param
Set subplots=True with the desired layout dimensions. Unused subplots will be auto-disabled:
data.plot.bar(subplots=True, layout=(3, 3), figsize=(8, 6), sharex=False, sharey=True, legend=False)
plt.tight_layout()

2. seaborn.catplot
melt the data into long-form (i.e., 1 variable per column, 1 observation per row) and pass it to seaborn.catplot:
import seaborn as sns
melted = data.melt(var_name='var', value_name='count', ignore_index=False).reset_index()
sns.catplot(data=melted, kind='bar', x='index', y='count', col='var', col_wrap=3, sharex=False)

3. Loop through plt.subplots
zip the columns and axes to iterate in pairs. Use the ax param to place each column onto its corresponding subplot. If the grid size is larger than the number of columns (e.g., 3*3 > 8), disable the leftover axes with set_axis_off:
fig, axes = plt.subplots(3, 3, figsize=(8, 8), constrained_layout=True, sharey=True)
# plot each col onto one ax
for col, ax in zip(data.columns, axes.flat):
    data[col].plot.bar(ax=ax, rot=0)
    ax.set_title(col)
# disable leftover axes
for ax in axes.flat[data.columns.size:]:
    ax.set_axis_off()
As an alternative to the answer by tdy, I tried to do it without seaborn, using Matplotlib and a for loop. This may suit those who want specific control over subplot formatting and other parameters:
fig = plt.figure(1, figsize=(16, 12))
for i, col in enumerate(sel_cols, 1):
    fig.add_subplot(3, 4, i)
    # here data is the original dataframe, not the value_counts table from the answer above
    data[col].value_counts().plot(kind='bar', ax=plt.gca())
    plt.title(col)
plt.tight_layout()
plt.show()
fig.add_subplot creates a subplot and makes it the active one, while plt.gca() points to the active subplot.
How to create a scatter plot where values are across multiple columns?
I have a dataframe in Pandas in which the rows are observations at different times and each column is a size bin where the values represent the number of particles observed for that size bin. So it looks like the following:
       bin1  bin2  bin3  bin4  bin5
Time1    50   200    30    40     5
Time2    60    60    40   420   700
Time3    34   200    30    67    43
I would like to use plotly/cufflinks to create a scatterplot in which the x axis will be each size bin, and the y axis will be the values in each size bin. There will be three colors, one for each observation. As I'm more experienced in Matlab, I tried indexing the values using iloc (note the example below is just trying to plot one observation):
df.iplot(kind="scatter",theme="white",x=df.columns, y=df.iloc[1,:])
But I just get a key error: 0 message. Is it possible to use indexing when choosing x and y values in Pandas?
Rather than indexing, I think you need to better understand how pandas and matplotlib interact with each other. Let's go step by step for your case:
As the pandas.DataFrame.plot documentation says, the plotted series is a column. You have the series in the rows, so you need to transpose your dataframe.
To create a scatterplot, you need both x and y coordinates in different columns, but you are missing the x column, so you also need to create a column with the x values in the transposed dataframe.
Apparently pandas does not change color by default with consecutive calls to plot (matplotlib does), so you need to pick a color map and pass a color argument, otherwise all points will have the same color.
Here is a working example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Here I copied your data into a data.txt text file and imported it into pandas as a csv.
# You may have a different way to get your data.
df = pd.read_csv('data.txt', sep=r'\s+', engine='python')
# I assume there is a column named 'time', which is set as the index, as you show in your post.
# set_index returns a new dataframe, so the result must be assigned.
df = df.set_index('time')
tdf = df.transpose()  # transpose the dataframe
# Creating x values, I go from 1 to 5 but they can be different.
tdf['xval'] = np.arange(1, len(tdf) + 1)
# Choose a colormap and make a list of colors to be used.
colormap = plt.cm.rainbow
colors = [colormap(i) for i in np.linspace(0, 1, len(tdf))]
# Make an empty plot; the columns will be added to the axes in the loop.
fig, axes = plt.subplots(1, 1)
for i, cl in enumerate([datacol for datacol in tdf.columns if datacol != 'xval']):
    tdf.plot(x='xval', y=cl, kind="scatter", ax=axes, color=colors[i])
plt.show()
This plots the following image:
Here is a tutorial on picking colors in matplotlib.
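As a possible alternative to transposing, the same picture can be sketched with Matplotlib directly, one scatter call per row. This is only a sketch under assumptions: the dataframe is rebuilt by hand from the sample in the question, with the time labels as the index, and it uses Matplotlib rather than the plotly/cufflinks interface the question asked about.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Sample data from the question; the row labels are used as the index.
df = pd.DataFrame(
    {"bin1": [50, 60, 34], "bin2": [200, 60, 200], "bin3": [30, 40, 30],
     "bin4": [40, 420, 67], "bin5": [5, 700, 43]},
    index=["Time1", "Time2", "Time3"],
)

fig, ax = plt.subplots()
# One scatter call per observation; matplotlib cycles the colors automatically.
for time_label, row in df.iterrows():
    ax.scatter(list(df.columns), row.values, label=time_label)
ax.set_xlabel("size bin")
ax.set_ylabel("particle count")
ax.legend()
```

Adding plt.show() at the end displays the figure in an interactive session. Because the bin names are passed as strings, the x axis is categorical, which fits bins that have no numeric position.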
labeling data points with dataframe including empty cells
I have an Excel sheet like this:
A  B  C  D
3  1  2  8
4  2  2  8
5  3  2  9
      2  9
6  4  2  7
Now I am trying to plot 'B' over 'C' and label the data points with the entries of 'A'. It should show me the points 1/2, 2/2, 3/2 and 4/2 with the corresponding labels.
import matplotlib.pyplot as plt
import pandas as pd
import os
df = pd.read_excel(os.path.join(os.path.dirname(__file__), "./Datenbank/Test.xlsx"))
fig, ax = plt.subplots()
df.plot('B', 'C', kind='scatter', ax=ax)
df[['B','C','A']].apply(lambda x: ax.text(*x), axis=1)
plt.show()
Unfortunately the labeling is incomplete and I get this error:
ValueError: posx and posy should be finite values
As you can see, it did not label the last data point. I know it is because of the empty cells in the sheet, but I cannot avoid them. There is just no measurement data at these positions. I already searched for a solution here: Annotate data points while plotting from Pandas DataFrame, but it did not solve my problem.
So, is there a way to still label the last data point?
P.S.: The Excel sheet is just an example. So keep in mind that in reality there are many empty cells at different positions.
You can simply trash the invalid data rows from df before plotting them:
df = df[df['B'].notnull()]
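Since the question mentions empty cells at many different positions, a slightly more general variant is sketched here: it skips any row in which one of the three columns used for labeling is empty. The helper name label_points is hypothetical, and ax is assumed to be a Matplotlib Axes (anything with a text(x, y, s) method will do).

```python
import pandas as pd


def label_points(ax, df, x="B", y="C", label="A"):
    """Annotate each (x, y) point with its label, skipping incomplete rows."""
    # Rows with an empty cell in any of the three columns cannot be placed.
    valid = df.dropna(subset=[x, y, label])
    for px, py, text in zip(valid[x], valid[y], valid[label]):
        ax.text(px, py, str(text))
    return len(valid)  # number of labels actually drawn
```

In the code from the question, this would replace the df[['B','C','A']].apply(...) line with label_points(ax, df). Columns that contain empty cells are read as floats, so a label such as 3 may be shown as 3.0 unless it is formatted first.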
Specifying the number of ticks in between a range
I have written code for customizing my x ticks; a snippet is below:
arr_label = ['sum_msg_len','log_count','info_hit','debug_hit','error_hit']
for label in arr_label:
    fig = plt.figure(figsize=(15,6))
    axes = fig.add_axes([1,1,1,1])
    axes.xaxis.set_major_locator(plt.LinearLocator(30))
    axes.tick_params(axis='x', labelsize=6)
    axes.plot(df.index, df[label], 'g', label=label)
    axes.legend()
    fig.autofmt_xdate()
    fig.savefig('images_indv/'+app_index+"_"+label+".png", bbox_inches='tight')
    #fig.close()
    fig.clf()
My requirement is that I have timestamps spaced by minute and I want to plot timestamp vs ('sum_msg_len'/'log_count'/'info_hit'/'debug_hit'/'error_hit') one by one. The problem is the X ticks: I want a specified number of ticks to appear within the range of the data I am plotting.
Earlier, when I was not specifying any Locator, all the timestamps overlapped and could not be read properly. When I try to use a locator, it labels the x-axis without any relation to the plotted values. For example, if I use LinearLocator(30) it just plots the first 00 to 29 mins in the graph, and if I use LinearLocator(50) it just plots the first 00 to 49 mins in the graph, with no change to the y-axis values. Plots of both are below.
I also tried different locators such as MultipleLocator and MaxNLocator, but the issue persists.
In short, I just want the graph plotted for 21 July 00:00:00 to 22 July 00:00:00, which will be 1440 entries, but I want to see around 30-40 intermediate entries labeled on the plot.
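One way to get roughly 30-40 labels without a locator is to pick every n-th timestamp and set those as the ticks explicitly. The sketch below only shows the selection step; the helper name pick_ticks is hypothetical, and the year in the example date is an arbitrary assumption because the question gives only the day and month.

```python
import math
from datetime import datetime, timedelta


def pick_ticks(values, max_ticks=40):
    """Return an evenly spaced subset of `values` with at most `max_ticks` items."""
    step = max(1, math.ceil(len(values) / max_ticks))
    return list(values[::step])


# One timestamp per minute for a full day, as described in the question.
start = datetime(2020, 7, 21)  # the year is an assumption; only 21 July is given
timestamps = [start + timedelta(minutes=i) for i in range(1440)]
ticks = pick_ticks(timestamps, max_ticks=40)
print(len(ticks))  # prints: 40
```

In the loop from the question, axes.set_xticks(pick_ticks(df.index)) could then stand in for the LinearLocator line, assuming df.index holds real datetime values (for example after pd.to_datetime) rather than strings.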
Plot number of occurrences in Pandas dataframe (2)
This is a follow-up to the previous question: Plot number of occurrences from Pandas DataFrame
I'm trying to produce a bar chart in descending order from the results of a pandas dataframe that is grouped by "Issuing Office." The data comes from a csv file which has 3 columns: System (string), Issuing Office (string), Error Type (string). The first four commands work fine: read, fix the column headers, strip out the offices I don't need, and reset the index. However, I've never displayed a chart before.
CSV looks like:
System  Issuing Office  Error Type
East    N1              Error1
East    N1              Error1
East    N2              Error1
West    N1              Error3
Looking for a simple horizontal bar chart that would show N1 had a count of 3, N2 had a count of 2.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('mydatafile.csv',index_col=None, header=0) #ok
df.columns = [c.replace(' ','_') for c in df.columns] #ok
df = df[df['Issuing_Office'].str.contains("^(?:N|M|V|R)")] #ok
df = df.reset_index(drop=True) #ok
# produce chart that shows how many times an office came up (Descending)
df.groupby([df.index, 'Issuing_Office']).count().plot(kind='bar')
plt.show()
# produce chart that shows how many error types per Issuing Office (Descending).
There are no date fields in this, which makes it different from the original question. Any help is greatly appreciated :)
JohnE's solution worked. Used the code:
# produce chart that shows how many times an office came up (Descending)
df['Issuing_Office'].value_counts().plot(kind='barh') #--JohnE
plt.gca().invert_yaxis()
plt.show()
# produce chart that shows how many error types per Issuing Office N1 (Descending).
dfN1 = df[df['Issuing_Office'].str.contains('N1')]
dfN1['Error_Type'].value_counts().plot(kind='barh')
plt.gca().invert_yaxis()
plt.show()