Graphing a log file with Python - python-3.x

To start with, I don't usually use any scripting language other than bash. However, I have a requirement to graph the data from a large number of environmental monitoring log files from our server room monitor, and I feel that Python will work best. I am using Python 3.7 and have installed it, along with the libraries required so far, via MacPorts and pip.
I want to end up with at least 7 graphs, each with multiple lines. Four of the graphs are the temperature and humidity data for each physical measurement point. Two of them are for air flow, hot and cold, and the last is for line voltage.
I have attempted to start this on my own and have gotten decently far. I open the log files and extract the required data. However, getting the data into a graph seems to be beyond me. The data to be graphed is a date and time stamp, as X, and a decimal number, which should always be positive, as Y.
When extracting the date I am using time.strptime and time.mktime to turn it into a Unix epoch, which works just fine. When extracting the data I am using re.findall to remove the non-numerical portions. I plan to move the date from an epoch to date and time but that can come later.
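As a quick sanity check of that conversion, here is a standalone snippet (using the log's date pattern and one sample timestamp from the data below):

```python
import time

datepattern = '%m-%d-%Y %H:%M:%S'
colddate = '05-03-2020 14:42:59'

# strptime parses the string into a struct_time; mktime turns it into a
# Unix epoch (interpreted in the local timezone)
coldepoch = int(time.mktime(time.strptime(colddate, datepattern)))

# Round-tripping through localtime recovers the original wall-clock time
print(time.localtime(coldepoch)[:6])
```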
The graphing portion is where I run into the issue.
I first tried graphing the data directly which gave me the error: TypeError: unhashable type: 'numpy.ndarray'
I have also tried using a Pandas dataframe. This gave me the error: TypeError: unhashable type: 'list'
I have even tried converting the lists to tuples, both with and without the dataframe; the same errors were given.
Based on the output of my lists I think the issue is with using append for the values that will be the Y axis. However I cannot seem to Google well enough to find a solution.
The code, outputs seen, and input data are below. The comments are left over from the last run; I use them for testing various portions.
Code so far:
# Import needed libraries
import re
import time
import matplotlib.pyplot as plt
import pandas as pd
#import matplotlib.dates as mpd
# Need to initialize these or append doesn't work
hvacepoch = []
hvacnum = []
endepoch = []
endnum = []
# Known static variables
datepattern = '%m-%d-%Y %H:%M:%S'
# Open the files
coldairfile = open("air-cold.log","r")
# Grab the data and do some initial conversions
for coldairline in coldairfile:
    fields = coldairline.split()
    colddate = fields[0] + " " + fields[1]
    # coldepoch = mpd.epoch2num(int(time.mktime(time.strptime(colddate, datepattern))))
    coldepoch = int(time.mktime(time.strptime(colddate, datepattern)))
    coldnum = re.findall(r'\d*\.?\d+', fields[4])
    coldloc = fields[9]
    if coldloc == "HVAC":
        hvacepoch.append(coldepoch)
        hvacnum.append(coldnum)
    if coldloc == "Cold":
        endepoch.append(coldepoch)
        endnum.append(coldnum)
# Convert the lists to a tuple. Do I need this?
hvacepocht = tuple(hvacepoch)
hvacnumt = tuple(hvacnum)
endepocht = tuple(endepoch)
endnumt = tuple(endnum)
# Testing output
print(f'HVAC air flow date and time: {hvacepoch}')
print(f'HVAC air flow date and time tuple: {hvacepocht}')
print(f'HVAC air flow numbers: {hvacnum}')
print(f'HVAC air flow numbers tuple: {hvacnumt}')
print(f'Cold end air flow date and time: {endepoch}')
print(f'Cold end air flow date and time tuple: {endepocht}')
print(f'Cold end air flow numbers: {endnum}')
print(f'Cold end air flow numbers tuple: {endnumt}')
# Graph it. How to do for multiple graphs?
# With a Pandas dataframe as a list.
#colddata=pd.DataFrame({'x': endepoch, 'y1': endnum, 'y2': hvacnum })
#plt.plot( 'x', 'y1', data=colddata, marker='', color='blue', linewidth=2, label="Cold Aisle End")
#plt.plot( 'x', 'y2', data=colddata, marker='', color='skyblue', linewidth=2, label="HVAC")
# With a Pandas dataframe as a tuple.
#colddata=pd.DataFrame({'x': endepocht, 'y1': endnumt, 'y2': hvacnumt })
#plt.plot( 'x', 'y1', data=colddata, marker='', color='blue', linewidth=2, label="Cold Aisle End")
#plt.plot( 'x', 'y2', data=colddata, marker='', color='skyblue', linewidth=2, label="HVAC")
# Without a Pandas dataframe as a list.
#plt.plot(hvacepoch, hvacnum, label = "HVAC")
#plt.plot(endepoch, endnum, label = "Cold End")
# Without a Pandas dataframe as a tuple.
#plt.plot(hvacepocht, hvacnumt, label = "HVAC")
#plt.plot(endepocht, endnumt, label = "Cold End")
# Needed regardless
#plt.title('Airflow\nUnder Floor')
#plt.legend()
#plt.show()
# Close the files
coldairfile.close()
The output from the print lines (truncated):
HVAC air flow date and time: [1588531379, 1588531389, 1588531399]
HVAC air flow date and time tuple: (1588531379, 1588531389, 1588531399)
HVAC air flow numbers: [['0.14'], ['0.15'], ['0.15']]
HVAC air flow numbers tuple: (['0.14'], ['0.15'], ['0.15'])
Cold end air flow date and time: [1588531379, 1588531389, 1588531399]
Cold end air flow date and time tuple: (1588531379, 1588531389, 1588531399)
Cold end air flow numbers: [['0.10'], ['0.09'], ['0.07']]
Cold end air flow numbers tuple: (['0.10'], ['0.09'], ['0.07'])
The input (truncated):
05-03-2020 14:42:59 Air Velocit 0.14m/ Under Floor Air Flow HVAC
05-03-2020 14:42:59 Air Velocit 0.10m/ Under Floor Air Flow Cold End
05-03-2020 14:43:09 Air Velocit 0.15m/ Under Floor Air Flow HVAC
05-03-2020 14:43:09 Air Velocit 0.09m/ Under Floor Air Flow Cold End
05-03-2020 14:43:19 Air Velocit 0.15m/ Under Floor Air Flow HVAC
05-03-2020 14:43:19 Air Velocit 0.07m/ Under Floor Air Flow Cold End

I just checked your data and it looks like the issue is that endnum and hvacnum are not lists of values. They're lists of lists, as you can see below:
In [1]: colddata.head()
Out[1]:
x y1 y2
0 1588531379 [0.10] [0.14]
1 1588531389 [0.09] [0.15]
2 1588531399 [0.07] [0.15]
So, when you go to plot the data, matplotlib doesn't know how to plot those rows. What you can do is use a list comprehension to unpack the lists.
In [2]:
print(endnum)
print(hvacnum)
Out[2]:
[['0.10'], ['0.09'], ['0.07']]
[['0.14'], ['0.15'], ['0.15']]
In [3]:
endnum = [i[0] for i in endnum]
hvacnum = [i[0] for i in hvacnum]
print(endnum)
print(hvacnum)
Out[3]:
['0.10', '0.09', '0.07']
['0.14', '0.15', '0.15']
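One further wrinkle (an observation beyond the original answer): even after unpacking, the values are still strings, so it is safer to convert them to floats in the same comprehension. Using the sample values from above:

```python
# Sample values as extracted (lists of single-element string lists)
endnum = [['0.10'], ['0.09'], ['0.07']]
hvacnum = [['0.14'], ['0.15'], ['0.15']]

# Unpack the inner lists and convert the strings to floats in one pass
endnum = [float(i[0]) for i in endnum]
hvacnum = [float(i[0]) for i in hvacnum]

print(endnum)   # [0.1, 0.09, 0.07]
print(hvacnum)  # [0.14, 0.15, 0.15]
```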

Given your log file, you can use pd.read_fwf with specific colspecs:
df = pd.read_fwf('/home/quang/projects/untitled.txt', header=None,
                 colspecs=[[0,20], [22,34], [35,39], [42, 54], [54,-1]],  # modify this to fit your needs
                 parse_dates=[0],
                 names=['time', 'veloc', 'value', 'location', 'type'])  # also modify this
which gives you a dataframe like this:
time veloc value location type
0 2020-05-03 14:42:59 Air Velocit 0.14 Under Floor Air Flow HVAC
1 2020-05-03 14:42:59 Air Velocit 0.10 Under Floor Air Flow Cold End
2 2020-05-03 14:43:09 Air Velocit 0.15 Under Floor Air Flow HVAC
3 2020-05-03 14:43:09 Air Velocit 0.09 Under Floor Air Flow Cold End
4 2020-05-03 14:43:19 Air Velocit 0.15 Under Floor Air Flow HVAC
5 2020-05-03 14:43:19 Air Velocit 0.07 Under Floor Air Flow Cold End
And you can plot with seaborn:
import seaborn as sns

sns.lineplot(data=df, x='time', y='value', hue='type')
Output: (a line plot of value over time, one line per type)
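Since the exact column widths depend on your file, here is a self-contained sketch of the same idea, using io.StringIO in place of the file path and colspecs measured against the sample lines above (adjust both for your real log):

```python
import io
import pandas as pd

# Two sample lines from the log, held in memory; pass the real file path instead
log = (
    "05-03-2020 14:42:59 Air Velocit 0.14m/ Under Floor Air Flow HVAC\n"
    "05-03-2020 14:42:59 Air Velocit 0.10m/ Under Floor Air Flow Cold End\n"
)

# Column boundaries measured by character position in the sample lines
df = pd.read_fwf(
    io.StringIO(log), header=None,
    colspecs=[(0, 19), (20, 31), (32, 36), (39, 59), (60, 100)],
    parse_dates=[0],
    names=['time', 'veloc', 'value', 'location', 'type'],
)
print(df)
```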

Related

KeyError: "None of [Int64Index([1960, 1961, 1962, 1963, 1964], dtype='int64')] are in the [columns]"

I am trying to scatter plot a dataframe, and for this I have provided it with x and y components. The error occurs on the x component, the 'Year' column. I have checked manually that the Year column exists in the dataframe, but it still shows the error. Note that the Year column contains years from 1960 to 1964.
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)
df_urb_pop = next(urb_pop_reader)
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']
pops = zip(df_pop_ceb['Total Population'],
           df_pop_ceb['Urban population (% of total)'])
pops_list = list(pops)
# Use list comprehension to create new DataFrame column 'Total Urban Population'
df_pop_ceb['Total Urban Population'] = [int(a[0]*(a[1]*0.01)) for a in pops_list]
# Plot urban population data
df_pop_ceb.plot(kind='scatter', x=df_pop_ceb['Year'], y=df_pop_ceb['Total Urban Population'])
plt.show()
If you want to use pandas' plotting, you should pass the labels as x and y, not the data:
df_pop_ceb.plot(kind='scatter', x='Year', y='Total Urban Population')
Also looking at the docs I think you should rather do
df_pop_ceb.plot.scatter(x='Year', y='Total Urban Population')
The error is raised because you are trying to apply the plt method to the dataframe directly. Try:
import matplotlib.pyplot as plt

plt.scatter(x=df_pop_ceb['Year'], y=df_pop_ceb['Total Urban Population'])
plt.title('Title')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
Also, there's no need to zip to calculate the total urban population. You could just multiply both columns directly:
df_pop_ceb['Total Urban Population'] = (df_pop_ceb['Total Population']*df_pop_ceb['Urban population (% of total)']*0.01)
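For instance, with a toy frame (made-up numbers) standing in for df_pop_ceb:

```python
import pandas as pd

# Toy stand-in for df_pop_ceb with hypothetical values
df_pop_ceb = pd.DataFrame({
    'Total Population': [1000, 2000],
    'Urban population (% of total)': [50.0, 25.0],
})

# Element-wise column arithmetic replaces the zip/list-comprehension step
df_pop_ceb['Total Urban Population'] = (
    df_pop_ceb['Total Population']
    * df_pop_ceb['Urban population (% of total)'] * 0.01
)
print(df_pop_ceb['Total Urban Population'].tolist())
```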
Hope that helps!

Gantt Chart for USGS Hydrology Data with Python?

I have compiled a dataframe that contains USGS streamflow data at several different streamgages. Now I want to create a Gantt chart similar to this. Currently, my data has columns as site names and a date index as rows.
Here is a sample of my data.
The problem with the Gantt chart example I linked is that my data has gaps between the start and end dates that would normally define the horizontal time-lines. Many of the examples I found only account for the start and end date, but not missing values that may be in between. How do I account for the gaps where there is no data (blanks or nan in those slots for values) for some of the sites?
First, I have a plot that shows where the missing data is.
import missingno as msno
msno.bar(dfp)
Now, I want time on the x-axis and a horizontal line on the y-axis that tracks when the sites contain data at those times. I know how to do this the brute force way, which would mean manually picking out the start and end dates where there is valid data (which I made up below).
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as dt
df = [('RIO GRANDE AT EMBUDO, NM', '2015-7-22', '2015-12-7'),
      ('RIO GRANDE AT EMBUDO, NM', '2016-1-22', '2016-8-5'),
      ('RIO GRANDE DEL RANCHO NEAR TALPA, NM', '2014-12-10', '2015-12-14'),
      ('RIO GRANDE DEL RANCHO NEAR TALPA, NM', '2017-1-10', '2017-11-25'),
      ('RIO GRANDE AT OTOWI BRIDGE, NM', '2015-8-17', '2017-8-21'),
      ('RIO GRANDE BLW TAOS JUNCTION BRIDGE NEAR TAOS, NM', '2015-9-1', '2016-6-1'),
      ('RIO GRANDE NEAR CERRO, NM', '2016-1-2', '2016-3-15'),
      ]
df = pd.DataFrame(data=df)
df.columns = ['A', 'Beg', 'End']
df['Beg'] = pd.to_datetime(df['Beg'])
df['End'] = pd.to_datetime(df['End'])
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
ax = ax.xaxis_date()
ax = plt.hlines(df['A'], dt.date2num(df['Beg']), dt.date2num(df['End']))
How do I make a figure (like the one shown above) with the dataframe I provided as an example? Ideally I want to avoid the brute force method.
Please note: values of zero are considered valid data points.
Thank you in advance for your feedback!
Find date ranges of non-null data
2020-02-12 Edit to clarify logic in loop
df = pd.read_excel('Downloads/output.xlsx', index_col='date')
Make sure the dates are in order:
df.sort_index(inplace=True)
Loop through the data and find the edges of the good data ranges. Get the corresponding index values and the name of the gauge and collect them all in a list:
# Looping feels like defeat. However, I'm not clever enough to avoid it
good_ranges = []
for i in df:
    col = df[i]
    gauge_name = col.name
    # Start of a good data block: a number preceded by a NaN
    start_mark = (col.notnull() & col.shift().isnull())
    start = col[start_mark].index
    # End of a good data block: a number followed by a NaN
    end_mark = (col.notnull() & col.shift(-1).isnull())
    end = col[end_mark].index
    for s, e in zip(start, end):
        good_ranges.append((gauge_name, s, e))
good_ranges = pd.DataFrame(good_ranges, columns=['gauge', 'start', 'end'])
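To see how the shift-based edge detection behaves, here is a tiny self-contained example (a made-up series, not the USGS data):

```python
import numpy as np
import pandas as pd

# Toy gauge column: two blocks of valid data separated by NaNs
col = pd.Series([np.nan, 1.0, 2.0, np.nan, np.nan, 3.0, 4.0, 5.0, np.nan],
                index=pd.date_range('2020-01-01', periods=9))

# A block starts where a value is preceded by a NaN...
start_mark = col.notnull() & col.shift().isnull()
# ...and ends where a value is followed by a NaN
end_mark = col.notnull() & col.shift(-1).isnull()

starts = list(col[start_mark].index)
ends = list(col[end_mark].index)
print(list(zip(starts, ends)))
```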
Plotting
Nothing new here. Copied pretty much straight from your question:
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
ax = ax.xaxis_date()
ax = plt.hlines(good_ranges['gauge'],
                dt.date2num(good_ranges['start']),
                dt.date2num(good_ranges['end']))
fig.tight_layout()
Here's an approach you could use. It's a bit hacky, so perhaps someone else will produce a better solution, but it should produce your desired output. First, use pd.where to replace non-NaN values with an integer that will determine the position of the lines on the y-axis. I do this row by row so that all data which belongs together will be at the same height. If you want to increase the spacing between the lines of the Gantt chart, you can add a number to i; I've provided an example in the comments in the code block below.
The y-labels and their positions are produced in the data munging steps, so this method will work regardless of the number of columns and will position the labels correctly when you change the spacing described above.
This approach returns matplotlib Axes and Figure objects, so you can adjust the aesthetics of the chart to suit your purposes (i.e. change the thickness of the lines, colours, etc.). Link to docs.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_excel('output.xlsx')
dates = pd.to_datetime(df.date)
df.index = dates
df = df.drop('date', axis=1)
new_rows = [df[s].where(df[s].isna(), i) for i, s in enumerate(df, 1)]
# To increase spacing between lines add a number to i, eg. below:
# [df[s].where(df[s].isna(), i+3) for i, s in enumerate(df, 1)]
new_df = pd.DataFrame(new_rows)
### Plotting ###
fig, ax = plt.subplots() # Create axes object to pass to pandas df.plot()
ax = new_df.transpose().plot(figsize=(40,10), ax=ax, legend=False, fontsize=20)
list_of_sites = new_df.transpose().columns.to_list() # For y tick labels
y_tick_location = new_df.iloc[:, 0].values # For y tick positions
ax.set_yticks(y_tick_location) # Place ticks in correct positions
ax.set_yticklabels(list_of_sites) # Update labels to site names

Pandas dataframe float index not self-consistent

I need/want to work with float indices in pandas but I get a keyerror when running something like this:
inds = [1.1, 2.2]
cols = [5.4, 6.7]
df = pd.DataFrame(np.random.randn(2, 2), index=inds, columns=cols)
df[df.index[0]]
I have seen some errors regarding precision, but shouldn't this work?
You get the KeyError because df[df.index[0]] would try to access a column with label 1.1 in this case - which does not exist here.
What you can do is use loc or iloc to access rows based on indices:
import numpy as np
import pandas as pd
inds = [1.1, 2.2]
cols = [5.4, 6.7]
df = pd.DataFrame(np.random.randn(2, 2), index=inds, columns=cols)
# to access e.g. the first row use
df.loc[df.index[0]]
# or more general
df.iloc[0]
# 5.4 1.531411
# 6.7 -0.341232
# Name: 1.1, dtype: float64
In principle, if you can, avoid equality comparisons for floating point numbers, for the reason you already came across: precision. The 1.1 displayed to you might be != 1.1 for the computer, simply because exact representation would theoretically require infinite precision. Most of the time it will still work, because tolerance checks kick in, for example when the difference between the compared numbers is < 10^-6.
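To illustrate with the example above: positional access avoids the float comparison, and if you really must look a row up by value, a tolerance-based match is safer than == (the np.argmin lookup below is my suggestion, not part of the original answer):

```python
import numpy as np
import pandas as pd

inds = [1.1, 2.2]
cols = [5.4, 6.7]
df = pd.DataFrame(np.random.randn(2, 2), index=inds, columns=cols)

# Positional access sidesteps float comparison entirely
row = df.iloc[0]

# If you must look up a row by value, match with a tolerance instead of ==
target = 1.1
pos = int(np.argmin(np.abs(df.index.to_numpy() - target)))
row_by_value = df.iloc[pos]
```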

Vectorizing a for loop with a pandas dataframe

I am trying to do a project for my physics class where we are supposed to simulate motion of charged particles. We are supposed to randomly generate their positions and charges but we have to have positively charged particles in one region and negatively charged ones anywhere else. Right now, as a proof of concept, I am trying to do only 10 particles but the final project will have at least 1000.
My thought process is to create a dataframe whose first column contains the randomly generated charges, then run a loop that checks each charge and places the generated positions in the next three columns of the same dataframe.
I have tried a simple for loop going over the rows and inputting the data as I go, but I run into an IndexingError: too many indexers. I also want this to run as efficiently as possible, so that it doesn't slow down as much when I scale up the number of particles.
I also want to vectorize the operations of calculating the motion of each particle since it is based on position of every other particle which, through normal loops would take a lot of computational time.
Any vectorization optimization or offloading to GPU would be very helpful, thanks.
# In[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
# In[2]:
num_points=10
df_position = pd.DataFrame(pd,np.empty((num_points,4)),columns=['Charge','X','Y','Z'])
# In[3]:
charge = np.array([np.random.choice(2,num_points)])
df_position.iloc[:,0]=np.where(df_position["Charge"]==0,-1,1)
# In[4]:
def positive():
    return np.random.uniform(low=0, high=5)
def negative():
    return np.random.uniform(low=5, high=10)
# In[5]:
for row in df_position.itertuples(index=True,name='Charge'):
    if(getattr(row,"Charge")==-1):
        df_position.iloc[row,1]=positive()
        df_position.iloc[row,2]=positive()
        df_position.iloc[row,3]=positive()
    else:
        df_position.iloc[row,1]=negative()
        #this is where I would get the IndexingError and would like to optimize this portion
        df_position.iloc[row,2]=negative()
        df_position.iloc[row,3]=negative()
df_position.iloc[:,0]=np.where(df_position["Charge"]==0,-1,1)
# In[6]:
ax=plt.axes(projection='3d')
ax.set_xlim(0, 10); ax.set_ylim(0, 10); ax.set_zlim(0,10);
xdata=df_position.iloc[:,1]
ydata=df_position.iloc[:,2]
zdata=df_position.iloc[:,3]
chargedata=df_position.iloc[:11,0]
colors = np.where(df_position["Charge"]==1,'r','b')
ax.scatter3D(xdata,ydata,zdata,c=colors,alpha=1)
EDIT:
The dataframe that I want the results in would be something like this
Charge X Y Z
-1
1
-1
-1
1
With the initial coordinates of each charge listed after it in their respective columns. It will be a 3D dataframe, as I will need to keep track of all the new positions after each time step so that I can animate the motion. Each layer will have exactly the same format.
Some code for creating your dataframe:
import numpy as np
import pandas as pd
num_points = 1_000
# uniform distribution of int, not sure it is the best one for your problem
# positive_point = np.random.randint(0, num_points)
positive_point = int(num_points / 100 * np.random.randn() + num_points / 2)
negative_point = num_points - positive_point
positive_df = pd.DataFrame(
    np.random.uniform(0.0, 5.0, size=[positive_point, 3]),
    index=[1] * positive_point, columns=['X', 'Y', 'Z']
)
negative_df = pd.DataFrame(
    np.random.uniform(5.0, 10.0, size=[negative_point, 3]),
    index=[-1] * negative_point, columns=['X', 'Y', 'Z']
)
df = pd.concat([positive_df, negative_df])
It is quite fast for 1,000 or 1,000,000.
Edit: with my first answer, I totally miss a big part of the question. This new one should fit better.
Second edit: I use a better distribution for the number of positive points than a uniform distribution of ints.
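Beyond building the frame, the question also asks about vectorizing the interaction of each particle with every other one. Here is a NumPy broadcasting sketch of pairwise Coulomb-style forces (all physical constants set to 1; a hypothetical stand-in, not the asker's final physics):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
pos = rng.uniform(0.0, 10.0, size=(n, 3))   # particle positions
q = rng.choice([-1.0, 1.0], size=n)         # particle charges

# Pairwise separation vectors via broadcasting: diff[i, j] = pos[i] - pos[j]
diff = pos[:, None, :] - pos[None, :, :]
dist = np.linalg.norm(diff, axis=-1)
np.fill_diagonal(dist, np.inf)              # no self-interaction

# Coulomb-style force on i from j: q_i*q_j*(r_i - r_j)/|r_i - r_j|^3 (k = 1)
force = (q[:, None] * q[None, :] / dist**3)[:, :, None] * diff
total_force = force.sum(axis=1)             # net force per particle, shape (n, 3)
```

The same broadcasting pattern scales to 1000 particles without any Python-level loop.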

Find SARIMAX AIC and pdq values, statsmodels

I'm trying to find the values of p,d,q and the seasonal values of P,D,Q using statsmodels as "sm" in python.
The data set I'm using is a csv file that contains time series data recording energy consumption over three years. The file was split into a smaller data frame in order to work with it. Here is what df_test.head() looks like.
time total_consumption
122400 2015-05-01 00:01:00 106.391
122401 2015-05-01 00:11:00 120.371
122402 2015-05-01 00:21:00 109.292
122403 2015-05-01 00:31:00 99.838
122404 2015-05-01 00:41:00 97.387
Here is my code so far.
#Importing the time series data set from a local file
df = pd.read_csv(r"C:\Users\path\Name of the file.csv")
#Rename the columns, put time as index and assign datetime to the column time
df.columns = ["time","total_consumption"]
df['time'] = pd.to_datetime(df.time)
df.set_index('time')
#Select test df (there is data from the 2015-05-01 2015-06-01)
df_test = df.loc[(df['time'] >= '2015-05-01') & (df['time'] <= '2015-05-14')]
#Find minimal AIC value for the ARIMA model integers
p = range(0,2)
d = range(0,2)
q = range(0,2)
pdq = list(itertools.product(p,d,q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p,d,q))]
warnings.filterwarnings("ignore")
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(df_test,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit()
            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except:
            continue
When I try to run the code as it is, the program doesn't even acknowledge the "for" loop. But when I take out the
try:
except:
continue
the program gives me this error message
ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
How could I remedy that, and is there a way to automate the process to directly output the parameters with the lowest AIC value (without having to look for it through all the possibilities)?
Thanks !
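On automating the lowest-AIC pick: one common pattern is to track a running minimum while looping. Below is a hedged sketch; `fit_aic` is a hypothetical stand-in so the search loop itself is runnable. In real code it would wrap sm.tsa.statespace.SARIMAX(...).fit() and return results.aic, with the model given only the numeric column (e.g. df_test['total_consumption']) rather than the whole frame, which is one likely cause of the dtype-cast ValueError.

```python
import itertools

# Hypothetical stand-in for fitting SARIMAX and reading results.aic;
# swap in the real model fit here.
def fit_aic(order, seasonal_order):
    return sum(order) + sum(seasonal_order)  # toy score, NOT a real AIC

p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(a, b, c, 12) for a, b, c in itertools.product(p, d, q)]

# Track the minimum score and its parameters across the whole grid
best_aic, best_order, best_seasonal = min(
    (fit_aic(o, s), o, s) for o in pdq for s in seasonal_pdq
)
print(best_order, best_seasonal, best_aic)
```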
