How to split a dataframe and plot some columns - python-3.x

I have a dataframe with 990 rows and 7 columns. I want to make an X vs. Y line graph, breaking the line every 22 rows.
I think that splitting the dataframe and then plotting the pieces would be a good way to do it, but I don't get good results.
max_rows = 22
dataframes = []
while len(Co1new) > max_rows:
    top = Co1new[:max_rows]
    dataframes.append(top)
    Co1new = Co1new[max_rows:]
else:
    dataframes.append(Co1new)
for grafico in dataframes:
    AC = plt.plot(grafico)
    AC = plt.xlabel('Frequency (Hz)')
    AC = plt.ylabel("Temperature (K)")
    plt.show()
The code runs, but it does not plot the right columns.
Here is some reduced data; in this case it should be split every four rows:
df = pd.DataFrame({
'col1':[2.17073,2.14109,2.16052,2.81882,2.29713,2.26273,2.26479,2.7643,2.5444,2.5027,2.52532,2.6778],
'col2':[10,100,1000,10000,10,100,1000,10000,10,100,1000,10000],
'col3':[2.17169E-4,2.15889E-4,2.10526E-4,1.53785E-4,2.09867E-4,2.07583E-4,2.01699E-4,1.56658E-4,1.94864E-4,1.92924E-4,1.87634E-4,1.58252E-4]})

One way I can think of is to add a new column with labels for every 22 records. See below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
seaborn.set(style='ticks')
"""
Assuming the index is numeric and is from [0-990)
this will return an integer for every 22 records
"""
Co1new['subset'] = 'S' + np.floor_divide(Co1new.index, 22).astype(str)
Output for the reduced data, using 4 instead of 22:
col1 col2 col3 subset
0 2.17073 10 0.000217 S0
1 2.14109 100 0.000216 S0
2 2.16052 1000 0.000211 S0
3 2.81882 10000 0.000154 S0
4 2.29713 10 0.000210 S1
5 2.26273 100 0.000208 S1
6 2.26479 1000 0.000202 S1
7 2.76434 10000 0.000157 S1
8 2.54445 10 0.000195 S2
9 2.50270 100 0.000193 S2
10 2.52532 1000 0.000188 S2
11 2.67780 10000 0.000158 S2
You can then use seaborn.pairplot to plot your data pairwise and use Co1new['subset'] as legend.
seaborn.pairplot(Co1new, hue='subset')
Or, if you absolutely need line charts, you can plot one pair of columns at a time. Here is col1 vs. col3 (x and y are passed as keywords, since recent seaborn versions no longer accept them positionally):
seaborn.lineplot(x='col1', y='col3', hue='subset', data=Co1new)

Using @SIA's answer
# 4 rows per group for the reduced data (22 for the full data)
df['groups'] = np.floor_divide(df.index, 4).astype(str)
import plotly.express as px
fig = px.line(df, x="col1", y="col2", color='groups')
fig.show()
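
For a plain matplotlib version of the same idea, here is a minimal sketch rather than a definitive solution. It assumes col2 belongs on the x axis and col1 on the y axis (the question does not say which columns hold frequency and temperature), it uses the reduced data with 4 rows per line, and the names rows_per_line and group_ids are only illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [2.17073, 2.14109, 2.16052, 2.81882, 2.29713, 2.26273,
             2.26479, 2.7643, 2.5444, 2.5027, 2.52532, 2.6778],
    'col2': [10, 100, 1000, 10000, 10, 100, 1000, 10000, 10, 100, 1000, 10000],
})

rows_per_line = 4  # assumption: 4 for the reduced data, 22 for the full data
# Position-based group id: 0 for the first chunk, 1 for the second, and so on.
group_ids = np.arange(len(df)) // rows_per_line

fig, ax = plt.subplots()
for group_id, chunk in df.groupby(group_ids):
    # One separate line per chunk, so the line "breaks" between chunks.
    ax.plot(chunk['col2'], chunk['col1'], marker='o', label=f'chunk {group_id}')

ax.set_xscale('log')  # assumption: col2 spans several decades
ax.set_xlabel('col2')
ax.set_ylabel('col1')
ax.legend()
```

Using positions (np.arange) instead of df.index keeps the chunks correct even if the index is not a clean 0..N-1 range. With an interactive backend, plt.show() would display the figure.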

Related

Operations on selective rows on pandas dataframe

I have a dataframe with phone calls, some of which have zero duration. I want to replace those with integer values ranging from 0 to 7, but every attempt I make leads to errors or data loss.
I wrote function:
def calls_new(dur):
    dur = random.randint(0,7)
    return dur
and I tried to use it like this (one of these lines):
df_calls['duration'] = df_calls['duration'].apply(lambda row: x = random.randint(0,7) if x == 0 )
df_calls['duration'] = df_calls['duration'].where(df_calls['duration'] == 0, df_calls.apply(calls_new))
df_calls['duration'] = df_calls[df_calls['duration']==0].apply(calls_new)
Use .loc to set the values only where duration is 0. You can generate all of the random numbers and assign them in one step. If 7 should be a possible value, the upper bound passed to np.random.randint needs to be 8: as the docs indicate, high is one above the largest integer to be drawn.
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({'duration': [0,10,20,0,15,0,0,211]})
m = df['duration'].eq(0)
df.loc[m, 'duration'] = np.random.randint(0, 8, m.sum())
# m.sum() is the number of zero-duration rows, i.e. how many random numbers are needed
print(df)
duration
0 4
1 10
2 20
3 7
4 15
5 6
6 2
7 211
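
The same masking approach can also be written with NumPy's newer Generator API. This is a hedged sketch, not part of the original answer: the name is_zero and the seed value are illustrative, and the drawn values will differ from the output shown above because a different random generator is used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded only so that repeated runs match

df = pd.DataFrame({'duration': [0, 10, 20, 0, 15, 0, 0, 211]})

# Boolean mask marking the rows to overwrite.
is_zero = df['duration'].eq(0)

# integers(low, high) excludes high, so 8 is needed for 7 to be possible.
df.loc[is_zero, 'duration'] = rng.integers(0, 8, size=is_zero.sum())
```

Because 0 is inside the requested range, a replaced row can still end up with duration 0; raise the lower bound to 1 if that is not wanted.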

Change the bar item name in Pandas

I have a test excel file like:
df = pd.DataFrame({'name':list('abcdefg'),
'age':[10,20,5,23,58,4,6]})
print (df)
name age
0 a 10
1 b 20
2 c 5
3 d 23
4 e 58
5 f 4
6 g 6
I use Pandas and matplotlib to read and plot it:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
excel_file = 'test.xlsx'
df = pd.read_excel(excel_file, sheet_name=0)
df.plot(kind="bar")
plt.show()
In the resulting plot, the index number is used as the label for each bar. How can I change it to the names stored in the name column?
You can specify columns for x and y values in plot.bar:
df.plot(x='name', y='age', kind="bar")
Or create Series first by DataFrame.set_index and select age column:
df.set_index('name')['age'].plot(kind="bar")
#if multiple columns
#df.set_index('name').plot(kind="bar")
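
As a quick check that the labels really come from the name column, here is a small sketch using the sample data from the question instead of test.xlsx. The Agg backend and the tick_labels variable are assumptions made so the example runs without a display; they are not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

df = pd.DataFrame({'name': list('abcdefg'),
                   'age': [10, 20, 5, 23, 58, 4, 6]})

# x selects the column used for the bar labels, y the bar heights.
ax = df.plot(x='name', y='age', kind='bar', rot=0, legend=False)
ax.set_ylabel('age')

# Read the tick labels back from the axes to confirm what was drawn.
tick_labels = [tick.get_text() for tick in ax.get_xticklabels()]
```

Here rot=0 keeps the short labels horizontal; drop it for longer names.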

Plot values for multiple months and years in Plotly/Dash

I have a Dash dashboard and I need to plot months from 0-12 on the x axis, with multiple lines on the same figure, one for each selected year (e.g. 1991-2040). The plotted value is a column, say 'total', in a dataframe. The line labels should be the years, and the total value goes on the y axis. My data looks like this:
Month Year Total
0 0 1991 31.4
1 0 1992 31.4
2 0 1993 31.4
3 0 1994 20
4 0 1995 300
.. ... ... ...
33 0 2024 31.4
34 1 2035 567
35 1 2035 10
36 1 2035 3
....
Do I need to group it and how to achieve that in Dash/Plotly?
It seems to me that you should have a look at pd.pivot_table.
%matplotlib inline
import pandas as pd
import numpy as np
import plotly.offline as py
import plotly.graph_objs as go
# create a df
N = 100
df = pd.DataFrame({"Date": pd.date_range(start='1991-01-01',
                                         periods=N,
                                         freq='M'),
                   "Total": np.random.randn(N)})
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year
# use pivot_table to have years as columns
pv = pd.pivot_table(df,
                    index=["Month"],
                    columns=["Year"],
                    values=["Total"])
# remove multiindex in columns
pv.columns = [col[1] for col in pv.columns]
data = [go.Scatter(x=pv.index,
                   y=pv[col],
                   name=col)
        for col in pv.columns]
py.iplot(data)
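
To see what the pivot produces on data shaped like the question's, here is a small pandas-only sketch. The rows are a made-up subset in the same Month/Year/Total layout, and summing duplicate (Month, Year) rows is an assumption; pivot_table averages them by default.

```python
import pandas as pd

# Hypothetical subset in the same layout as the question's data.
df = pd.DataFrame({
    "Month": [0, 0, 0, 1, 1, 1],
    "Year": [1991, 1992, 1993, 2035, 2035, 2035],
    "Total": [31.4, 31.4, 31.4, 567.0, 10.0, 3.0],
})

# Passing values as a plain string (not a list) yields flat year columns,
# so no MultiIndex clean-up is needed afterwards.
pv = df.pivot_table(index="Month", columns="Year", values="Total", aggfunc="sum")

# One series per year, ready to become one line each (x = month, y = total).
lines = {str(year): pv[year].dropna() for year in pv.columns}
```

Each entry in lines maps a year label to its monthly totals, which corresponds to one go.Scatter trace in the answer above.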

Matplotlib - bar chart does not start with 0

I have some data that I want to present as a bar chart.
data:
col1 = ['2018 01 01', '2018 01 02', '2018 12 27'] #dates
col2 = ['4554', '14120', '1422'] #usage of the user in seconds for that data in col1
my code:
I have imported all of the modules
import openpyxl as ol
import numpy as np
import matplotlib.pyplot as plt
plt.bar(col1, col2, label="Usage of the user")
plt.xlabel("Date")
plt.ylabel("Usage in seconds")
plt.title('Usage report of ' + str(args.user))
plt.legend()
plt.savefig("data.png")
When I open data.png
I get this:
The graph looks all over the place; I want the y-axis to start at zero.
I am new to matplotlib and openpyxl.
Any help is appreciated.
It appears that the issue is that the values in col2 that are being plotted on the y-axis are strings rather than integers. Updating these values to integers will allow the y-axis to start at 0 and be in sequential order.
col1 = ['2018 01 01', '2018 01 02', '2018 12 27'] #dates
col2 = ['4554', '14120', '1422']
plt.bar(col1, [int(x) for x in col2], label="Usage of the user")
plt.xlabel("Date")
plt.ylabel("Usage in seconds")
plt.title('Usage report')
plt.legend()

Populating pandas column based on moving date range (efficiently)

I have 2 pandas dataframes, one of them contains dates with measurements, and the other contains dates with an event ID.
df1
from datetime import datetime as dt
from datetime import timedelta
import pandas as pd
import numpy as np
today = dt.now()
ndays = 10
df1 = pd.DataFrame({'Date': [today + timedelta(days = x) for x in range(ndays)], 'measurement': pd.Series(np.random.randint(1, high = 10, size = ndays))})
df1.Date = df1.Date.dt.date
Date measurement
2018-01-10 8
2018-01-11 2
2018-01-12 7
2018-01-13 3
2018-01-14 1
2018-01-15 1
2018-01-16 6
2018-01-17 9
2018-01-18 8
2018-01-19 4
df2
df2 = pd.DataFrame({'Date': ['2018-01-11', '2018-01-14', '2018-01-16', '2018-01-19'], 'event_id': ['event_a', 'event_b', 'event_c', 'event_d']})
df2.Date = pd.to_datetime(df2.Date, format = '%Y-%m-%d')
df2.Date = df2.Date.dt.date
Date event_id
2018-01-11 event_a
2018-01-14 event_b
2018-01-16 event_c
2018-01-19 event_d
I want to give each date in df1 an event_id from df2 based on which two event dates it falls between. The resulting dataframe would look something like:
df3
today = dt.now()
ndays = 10
df3 = pd.DataFrame({'Date': [today + timedelta(days = x) for x in range(ndays)], 'measurement': pd.Series(np.random.randint(1, high = 10, size = ndays)), 'event_id': ['event_a', 'event_a', 'event_b', 'event_b', 'event_b', 'event_c', 'event_c', 'event_d', 'event_d', 'event_d']})
df3.Date = df3.Date.dt.date
Date event_id measurement
2018-01-10 event_a 4
2018-01-11 event_a 2
2018-01-12 event_b 1
2018-01-13 event_b 5
2018-01-14 event_b 5
2018-01-15 event_c 4
2018-01-16 event_c 6
2018-01-17 event_d 6
2018-01-18 event_d 9
2018-01-19 event_d 6
The code I use to achieve this is:
n = 1
while n <= len(list(df2.Date)) - 1 :
    for date in list(df1.Date):
        if date <= df2.iloc[n].Date and (date > df2.iloc[n-1].Date):
            df1.loc[df1.Date == date, 'event_id'] = df2.iloc[n].event_id
    n += 1
The dataset that I am working with is significantly larger than this (a few million rows) and this method runs far too long. Is there a more efficient way to accomplish this?
So there are quite a few things to improve performance.
The first question I have is: does it have to be a pandas frame to begin with? Meaning can't df1 and df2 just be lists of tuples or list of lists?
The thing is that pandas adds a significant overhead when accessing items but especially when setting values individually.
Pandas excels when it comes to vectorized operations but I don't see an efficient alternative right now (maybe someone comes up with such an answer, that would be ideal).
Now what I'd do is:
Convert your df1 and df2 to records, e.g. d1 = df1.to_records(index=False). What you get is a record array of tuples with basically the same structure as the dataframe (index=False leaves the index out, so the first field is the date).
Now run your algorithm but instead of operating on pandas dataframes you operate on the arrays of tuples d1 and d2
Use a third list of tuples d3 where you store the newly created data (each tuple is a row)
Now if you want you can convert d3 back to a pandas dataframe:
df3 = pd.DataFrame.from_records(d3, **myKwArgs)
This should speed up your code significantly; I'd assume by more than 100-1000%. It does increase memory usage, though, so if you are low on memory, try to avoid the pandas dataframes altogether, or dereference the unused frames df1 and df2 once you have used them to create the records (and call gc manually if you run into problems).
EDIT: Here a version of your code using the procedure above:
d3 = []
n = 1
while n < len(d2):
    for i in range(len(d1)):
        date = d1[i][0]
        if date <= d2[n][0] and date > d2[n-1][0]:
            d3.append( (date, d2[n][1], d1[i][1]) )
    n += 1
You can try the df.apply() method to achieve this; see pandas.DataFrame.apply. I think my code will run faster than yours.
My approach:
Merge two dataframes df1 and df2 and create new one df3 by
df3 = pd.merge(df1, df2, on='Date', how='outer')
Sort df3 by date to make it easy to traverse.
df3['Date'] = pd.to_datetime(df3.Date)
df3 = df3.sort_values(by='Date')
Create a set_event_date() function to apply to each row in df3.
new_event_id = np.nan
def set_event_date(row):
    global new_event_id
    # remember the most recent non-missing event_id and carry it forward
    if pd.notna(row.event_id):
        new_event_id = row.event_id
    return new_event_id
Apply set_event_date() to each row in df3.
df3['new_event_id'] = df3.apply(set_event_date, axis=1)
Final Output will be:
Date Measurement New_event_id
0 2018-01-11 2 event_a
1 2018-01-12 1 event_a
2 2018-01-13 3 event_a
3 2018-01-14 6 event_b
4 2018-01-15 3 event_b
5 2018-01-16 5 event_c
6 2018-01-17 7 event_c
7 2018-01-18 9 event_c
8 2018-01-19 7 event_d
9 2018-01-20 4 event_d
Let me know once you have tried my solution and whether it runs faster than yours.
Thanks.
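
Neither answer above is vectorized. One possible alternative, offered here only as a sketch and not taken from the thread, is pd.merge_asof with direction="forward", which pairs each measurement date with the nearest event date on or after it. It assumes both Date columns are datetime64 (not datetime.date objects) and that both frames are sorted by Date; the fixed measurement values are copied from the question's df1 table.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": pd.date_range("2018-01-10", periods=10, freq="D"),
    "measurement": [8, 2, 7, 3, 1, 1, 6, 9, 8, 4],
})
df2 = pd.DataFrame({
    "Date": pd.to_datetime(["2018-01-11", "2018-01-14", "2018-01-16", "2018-01-19"]),
    "event_id": ["event_a", "event_b", "event_c", "event_d"],
})

# For every row of df1, take the first df2 row whose Date is >= the df1 Date.
df3 = pd.merge_asof(
    df1.sort_values("Date"),
    df2.sort_values("Date"),
    on="Date",
    direction="forward",
)
```

Rows in df1 that fall after the last event date would get NaN as event_id. Since the join is a single sorted pass rather than a nested loop, it is likely to scale much better to a few million rows, though that has not been measured here.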
