Reducing time and memory used in loop calculation when locating earthquakes - python-3.x

I'm trying to locate tremor, which is a type of earthquake with smaller amplitude. I use grid search, which is a method that finds the coordinate where 'the difference between theoretical value and observed value of differential time in seismic wave arrival' becomes minimum.
The code I made is as follows. First I defined two functions that calculate distance between earthquake source and each point on grid, and that calculate travel time of seismic waves using obspy.
def distance(a,i):
return math.sqrt(((ste[a].stats.sac.stla-la[i])**2)+((ste[a].stats.sac.stlo-lo[i])**2))
def traveltime(a):
return model.get_travel_times(source_depth_in_km=35, distance_in_degree=a, phase_list=["S"], receiver_depth_in_km=0)[0].time
Then I conducted grid search using following codes.
di=[(la[i],lo[i],distance(a,i), distance(b,i)) for i in range(len(lo))
for a in range(len(ste))
for b in range(len(ste)) if a<b]
didf=pd.DataFrame(di)
latot=didf[0]
lotot=didf[1]
dia=didf[2]
dib=didf[3]
tt=[]
for i in range(len(di)):
try:
tt.append((latot[i],lotot[i],traveltime(dia[i])-traveltime(dib[i])))
except IndexError:
continue
ttdf=pd.DataFrame(tt)
final=[(win[j],ttdf[0][i],ttdf[1][i],(ttdf[2][i]-shift[j])**2) for i in range(len(ttdf))
for j in range(len(ccdf))]
where la and lo are the list of latitude and longitude coordinates with 0.01 degree interval, and ste is the list of the east components seismogram of each station. I have to get the list 'final' to proceed to the next step.
However, the problem is that it takes too much time to calculate three segments of codes written above. Moreover, the result I get after tens of hours of calculation is 'out of memory' error message. Is there any solution that can reduce both time and memory?

Without access to your dataset, it's a little difficult to debug, but here are a few suggestions for you.
for i in range(len(di)):
try:
tt.append((latot[i],lotot[i],traveltime(dia[i])-traveltime(dib[i])))
except IndexError:
continue
• Given the size of these lists, I think that the Garbage Collector might be slowing down this for loop; you might consider turning it off for the duration of the loop (gc.disable()).
• In theory, the Append statement shouldn't be the source of your performance problems, since it over-allocates:
/* This over-allocates proportional to the list size, making room
* for additional growth. The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
*/
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);
but you already know the size of the array, so you might consider using numpy.zeroes() to fill the list before the for-loop, and use the index to directly address each element. Alternatively, you could just use list comprehensions, as you did earlier, and avoid the problem altogether.
• I see that you've tagged the question with python-3.x, so range() shouldn't be an issue like it was in 2.x (otherwise you would want to consider using xrange()).
If you update your question with more details, I could probably provide a more detailed answer...hope this helps.

Related

MDAnalysis comparing position of ion in trajectory to previous ts

So I have a system where I need to be able to determine the exact position of my ions and run an equation on the average position of that ion. I found my ion positions were inconsistent due to some ions wrapping across the periodic boundary and severely changing the position for that one window. Leading me to have an average of say +20 when the ion just shuffled between +40 and -40.
I was wanting to correct that by implementing a way to unwrap my wrapped coordinates for ions on the edge of my box.
Essentially I was thinking that for each frame in my trajectory, MDAnalysis would check the position of ION 1 in frame 1. Then in frame 2 it would check the same ion once more and compare it to the previous position. If it for example goes from + coordinates to - coordinates then I would have a count that adds +1 meaning that it wrapped once. If it goes from - to + I would have it subtract 1. Then by the end of all of the frames I would have a number that could help me identify how I could perform my analysis.
However my coding skills are less than lackluster and I wanted to know how I would go about implementing this? I have essentially gotten the count down, but the comparison between frames is where I am confused. How would I do this comparison?
Thanks in advance
There are a few ways to answer this question. Firstly,
Essentially I was thinking that for each frame in my trajectory, MDAnalysis would check the position of ION 1 in frame 1. Then in frame 2 it would check the same ion once more and compare it to the previous position. If it for example goes from + coordinates to - coordinates then I would have a count that adds +1 meaning that it wrapped once. If it goes from - to + I would have it subtract 1. Then by the end of all of the frames I would have a number that could help me identify how I could perform my analysis.
You could write your own analysis class.
One untested way to do it is prototyped below -- the tutorial goes more into what each method (_prepare, _conclude, etc) does.
from MDAnalysis.analysis.base import AnalysisBase
import numpy as np
class CountWrappings(AnalysisBase):
def __init__(self, universe, select="name NA"):
super().__init__(universe.universe.trajectory)
# these are your selected ions
self.atomgroup = universe.select_atoms(select)
self.n_atoms = len(self.atomgroup)
def _prepare(self):
# self.results is a dictionary of results
self.results.wrapping_per_frame = np.zeros((self.n_frames, self.n_atoms), dtype=bool)
self._last_positions = self.atomgroup.positions
def _single_frame(self):
# does sign change for any element in 2D array?
compare_signs = np.sign(self.atomgroup.positions) == np.sign(self._last_positions)
sign_changes_any_axis = np.any(compare_signs, axis=1)
# _frame_index is the relative index of the frame being currently analyzed
self.results.wrapping_per_frame[self._frame_index] = sign_changes_any_axis
self._last_positions = self.atomgroup.positions
def _conclude(self):
self.results.n_wraps = self.results.wrapping_per_frame.sum(axis=0)
n_wraps = CountWrappings(my_universe, select="name NA CL MG")
n_wraps.run()
print(n_wraps.results.wrapping_per_frame)
print(n_wraps.results.n_wraps)
However, I'm not sure that addresses your actual aim:
I was wanting to correct that by implementing a way to unwrap my wrapped coordinates for ions on the edge of my box.
Are you computing the ion positions relative to anything? Potentially you could add bonds between each ion and the center so that you can use the AtomGroup.unwrap() function. Alternatively, is your data compatible with GROMACS? GROMACS has an unwrapping utility called "nojump" that unwraps atoms jumping across box edges, e.g.
gmx trjconv -f my_trajectory.xtc -s my_topology.gro -pbc nojump -o my_unwrapped_trajectory.xtc
As Lily mentioned, you could write your own analysis to do this or use GROMACS. However, both Lily's example and the GROMACS implementation of 'nojump' fail to account for box size fluctuations under the NPT ensemble (assuming you've used NPT). von Bulow et al. wrote about this widespread problem a couple of years ago. As far as I'm aware, the only implementation of nojump unwrapping that accounts for box size fluctuations is in LiPyphilic (disclaimer: I am the author of LiPyphilic).
Using LiPyphilic, you can unwrap your trajectory like so:
import MDAnalysis as mda
from lipyphilic.transformations import nojump
u = mda.Universe(pdb, xtc)
ions = u.select_atoms('name NA CLA')
u.trajectory.add_transformations(
nojump(
ag=ions,
nojump_x=True,
nojump_y=True,
nojump_z=True)
)
Then, when you do further analysis with your MDAnalysis Universe, the atoms will automatically be unwrapped at each frame.

How to process the data returned from a function (Python 3.7)

Background:
My question should be relatively easy, however I am not able to figure it out.
I have written a function regarding queueing theory and it will be used for ambulance service planning. For example, how many calls for service can I expect in a given time frame.
The function takes two parameters; a starting value of the number of ambulances in my system starting at 0 and ending at 100 ambulances. This will show the probability of zero calls for service, one call for service, three calls for service….up to 100 calls for service. Second parameter is an arrival rate number which is the past historical arrival rate in my system.
The function runs and prints out the result to my screen. I have checked the math and it appears to be correct.
This is Python 3.7 with the Anaconda distribution.
My question is this:
I would like to process this data even further but I don’t know how to capture it and do more math. For example, I would like to take this list and accumulate the probability values. With an arrival rate of five, there is a cumulative probability of 61.56% of at least five calls for service, etc.
A second example of how I would like to process this data is to format it as percentages and write out a text file
A third example would be to process the cumulative probabilities and exclude any values higher than the 99% cumulative value (because these vanish into extremely small numbers).
A fourth example would be to create a bar chart showing the probability of n calls for service.
These are some of the things I want to do with the queueing theory calculations. And there are a lot more. I am planning on writing a larger application. But I am stuck at this point. The function writes an output into my Python 3.7 console. How do I “capture” that output as an object or something and perform other processing on the data?
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import math
import csv
def probability_x(start_value = 0, arrival_rate = 0):
probability_arrivals = []
while start_value <= 100:
probability_arrivals = [start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)]
print(probability_arrivals)
start_value = start_value + 1
return probability_arrivals
#probability_x(arrival_rate = 5, x = 5)
#The code written above prints to the console, but my goal is to take the returned values and make other calculations.
#How do I 'capture' this data for further processing is where I need help (for example, bar plots, cumulative frequency, etc )
#failure. TypeError: writerows() argument must be iterable.
with open('ExpectedProbability.csv', 'w') as writeFile:
writer = csv.writer(writeFile)
for value in probability_x(arrival_rate = 5):
writer.writerows(value)
writeFile.close()
#Failure. Why does it return 2. Yes there are two columns but I was expecting 101 as the length because that is the end of my loop.
print(len(probability_x(arrival_rate = 5)))
The problem is, when you write
probability_arrivals = [start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)]
You're overwriting the previous contents of probability_arrivals. Everything that it held previously is lost.
Instead of using = to reassign probability_arrivals, you want to append another entry to the list:
probability_arrivals.append([start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)])
I'll also note, your while loop can be improved. You're basically just looping over start_value until it reaches a certain value. A for loop would be more appropriate here:
for s in range(start_value, 101): # The end value is exclusive, so it's 101 not 100
probability_arrivals = [s, math.pow(arrival_rate, s) * math.pow(math.e, -arrival_rate) / math.factorial(s)]
print(probability_arrivals)
Now you don't need to manually worry about incrementing the counter.

Why is my notebook crashing when I run this for loop and what is the fix?

I have taken code in relation to the Kalman Filter and am attempting to iterate through each column of data. What I would like to have happen is:
The column data is fed into the filter
The filtered column data (xhat) is placed into another DataFrame (filtered)
The filtered column data (xhat) is used to produce a visual.
I have created a for loop to iterate through the column data, but when I run the cell, I crash the notebook. When it doesn't crash, I get this warning:
C:\Users\perso\Anaconda3\envs\learn-env\lib\site-packages\ipykernel_launcher.py:45: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
Thanks in advance for any help. I hope this question is detailed enough. I bombed on the last one.
'''A Python implementation of the example given in pages 11-15 of "An
Introduction to the Kalman Filter" by Greg Welch and Gary Bishop,
University of North Carolina at Chapel Hill, Department of Computer
Science, TR 95-041,
https://www.cs.unc.edu/~welch/media/pdf/kalman_intro.pdf'''
# by Andrew D. Straw
import numpy as np
import matplotlib.pyplot as plt
# dataframe created to hold filtered data
filtered = pd.DataFrame()
# intial parameters
for column in data:
n_iter = len(data.index) #number of iterations equal to sample numbers
sz = (n_iter,) # size of array
z = data[column] # observations
Q = 1e-5 # process variance
# allocate space for arrays
xhat=np.zeros(sz) # a posteri estimate of x
P=np.zeros(sz) # a posteri error estimate
xhatminus=np.zeros(sz) # a priori estimate of x
Pminus=np.zeros(sz) # a priori error estimate
K=np.zeros(sz) # gain or blending factor
R = 1.0**2 # estimate of measurement variance, change to see effect
# intial guesses
xhat[0] = z[0]
P[0] = 1.0
for k in range(1,n_iter):
# time update
xhatminus[k] = xhat[k-1]
Pminus[k] = P[k-1]+Q
# measurement update
K[k] = Pminus[k]/( Pminus[k]+R )
xhat[k] = xhatminus[k]+K[k]*(z[k]-xhatminus[k])
P[k] = (1-K[k])*Pminus[k]
# add new data to created dataframe
filtered.assign(a = [xhat])
#create visualization of noise reduction
plt.rcParams['figure.figsize'] = (10, 8)
plt.figure()
plt.plot(z,'k+',label='noisy measurements')
plt.plot(xhat,'b-',label='a posteri estimate')
plt.legend()
plt.title('Estimate vs. iteration step', fontweight='bold')
plt.xlabel('column data')
plt.ylabel('Measurement')
This seems like a pretty straightforward error. The warning indicates that you have attempted to plot more figures than the current limit before a warning is created (a parameter you can change but which by default is set to 20). This is because in each iteration of your for loop, you create a new figure. Depending on the size of n_iter, you are opening potentially hundreds or thousands of figures. Each of these figures takes resources to generate and show, so you are creating a very large resource load on your system. Either it is processing very slowly due or is crashing altogether. In any case, the solution is to plot fewer figures.
I don't know exactly what you're plotting in your loop but it seems like each iteration of your loop corresponds to one time step and at each time step you'd like to plot the estimated and actual values. In this case, you need to define a figure and figure options once, outside of the loop, rather than at each iteration. But a better way to do this is probably to generate all of the data you want to plot ahead of time and store it in an easy-to-plot datatype like lists, then plot it once at the end.

Predicting a users next action based on current day and time

I'm using Microsoft Azure Machine Learning Studio to try an experiment where I use previous analytics captured about a user (at a time, on a day) to try and predict their next action (based on day and time) so that I can adjust the UI accordingly. So if a user normally visits a certain page every Thursday at 1pm, then I would like to predict that behaviour.
Warning - I am a complete novice with ML, but have watched quite a few videos and worked through tutorials like the movie recommendations example.
I have a csv dataset with userid,action,datetime and would like to train a matchbox recommendation model, which, from my research appears to be the best model to use. I can't see a way to use date/time in the training. The idea being that if I could pass in a userid and the date, then the recommendation model should be able to give me a probably result of what that user is most likely to do.
I get results from the predictive endpoint, but the training endpoint gives the following error:
{
"error": {
"code": "ModuleExecutionError",
"message": "Module execution encountered an error.",
"details": [
{
"code": "18",
"target": "Train Matchbox Recommender",
"message": "Error 0018: Training dataset of user-item-rating triples contains invalid data."
}
]
}
}
Here is a link to a public version of the experiment
Any help would be appreciated.
Thanks.
Maybe this answer could be helpful, you may also take a look on this where you can read:
The problem is probably with the range of rating data. There's an upper limit for rating range, because the training gets expensive if the range between smallest and largest rating is too large.
[...]
One option would be to scale the ratings to a narrower range.
According to this MSDN, please note that you cannot have a gap between the min and max note higher than 100.
So you have to make a pre-processing on your csv file column data (userid, action, datetime etc...) in order to keep all column data in the [0-99] range.
Please see bellow a Python implementation (to share the logic):
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
big_gap_arr = [-250,-2350,850,-120,-1235,3212,1,5,65,48,265,1204,65,23,45,895,5000,3,325,3244,5482] #data with big gap
abs_min = abs(min(big_gap_arr)) #get the absolute minimal value
max_diff= ( max(big_gap_arr) + abs_min ) #get the maximal diff
specific_range_arr=[]
for each_value in big_gap_arr:
new_value = ( 99/1. * float( abs_min + each_value) / max_diff ) #get a corresponding value in the [0,99] range
specific_range_arr.append(new_value)
print specific_range_arr #post computed data => all in range [0,99]
Which give you:
[26.54494382022472, 0.0, 40.449438202247194, 28.18820224719101, 14.094101123595506, 70.3061797752809, 29.71769662921348, 29.76825842696629, 30.526685393258425, 30.31179775280899, 33.05477528089887, 44.924157303370784, 30.526685393258425, 29.995786516853933, 30.27387640449438, 41.01825842696629, 92.90730337078652, 29.742977528089888, 33.813202247191015, 70.71067415730337, 99.0]
Note that all data are now in the [0,99] range
Following this process:
User id could be float instead an integer
Action is an integer (if you got less than 100 actions) or float (if more than 100 actions)
Datetime will be splited in two integer (or one integer and one float), please see bellow:
Concerning:
(A) way to use date/time in the training
You may split your datetime in two column, something like:
one column for the weekday:
0: Sunday
1: Monday
2: Tuesday
[...]
6: Saturday
one column for the time in the day:
0: Between 00:00 & 00:15
1: Between 00:15 & 00:30
2: Between 00:30 & 00:40
[...]
95 : Between 23:45 & 00:00
If you need a better granularity (here it is a 15 min window) you may also use float number for the time column.
So from messing with this for a while, I think I may see where the issue may lie. I think that the first three inputs of the Train Matchbox Recommender would need to be filled in for an accurate prediction. I'll include screenshots of the sample for recommending restaurants, as well.
The first input would be the dataset consisting of the user, item, and rating.
The second input would be the features of each user.
And the third input would be the features of each feature (restaurant in this case).
So to help with the date/time issue, I'm wondering if the data would need to be munged to match something similar to the restaurant and user data.
I know it's not much, but I hope it helps lead you down the right track.

how to predict with gaussianhmm sklearn

I'm trying to predict stock prices using sklearn. I'm new to prediction. I tried the example from sklearn for stock prediction with gaussian hmm. But predict gives states sequence which overlay on the price and it also takes points from given input close price. My question is how to generate next 10 prices?
You will always use the last state to predict the next state, so let's add 10 days worth of inputs by changing the end date to the 23rd:
date2 = datetime.date(2012, 1, 23)
You can double check the rest of the code to make sure I am not actually using future data for the prediction. The rest of these lines can be added to the bottom of the file. First we want to find out what the expected return is for a given state. The model.means_ array has returns, both those were the returns that got us to this state, not the future returns which is what you want. To get the future returns, we consider the probability of going to any one of the 5 states, and what the return of those states is. We get the probability of going to any particular state from the model.transmat_ matrix, the for the return of each state we use the model.means_ values. We take the dot product to get the expected return for a particular state. Then we remove the volume data (you can leave it in if you want, but you seemed to be most interested in future prices).
expected_returns_and_volumes = np.dot(model.transmat_, model.means_)
returns_and_volumes_columnwise = zip(*expected_returns_and_volumes)
returns = returns_and_volumes_columnwise[0]
If you print the value for returns[0], you'll see the expected return in dollars for state 0, returns[1] for state 1 etc. Now, given a day and a state, we want to predict the price for tomorrow. You said 10 days so let's use that for lastN.
predicted_prices = []
lastN = 10
for idx in xrange(lastN):
state = hidden_states[-lastN+idx]
current_price = quotes[-lastN+idx][2]
current_date = datetime.date.fromordinal(dates[-lastN+idx])
predicted_date = current_date + datetime.timedelta(days=1)
predicted_prices.append((predicted_date, current_price + returns[state]))
print(predicted_prices)
If you were running this in "production" you would set date2 to the last date you have and then lastN would be 1. Note that I don't take into account weekends for the predicted_date.
This is a fun exercise but you probably wouldn't run this in production, hence the quotes. First, the time series is the raw price; this should really be percentage returns or log returns. Plus there is no justification for picking 5 states for the HMM, or that a HMM is even good for this kinda problem, which I doubt. They probably just picked it as an example. I think the other sklearn example using PCA is much more interesting.

Resources