Graphing two cumulative distributions in Stata - statistics

I'm trying this code (just below), Stata seems to read it -- it does not show any errors --, but it does not generate any variables. Here it is:
cumul price if dummy==1, gen(cprice1)
cumul price if dummy==0, gen (cprice2)
line cprice1 cprice2 price
Could you guys help me? I could graph two kernel density distributions with a condition of "if" for the dummy, with a similar code, in which I stored the results for latter graphing them -- following the help files in Stata. But I could not do this with the cumulative distributions.

If you don't need to store the variables, cdfplot will do the trick. If not, cumul seems to work just fine:
sysuse auto, clear
/* Without Storing Variables */
ssc install cdfplot
cdfplot price, by(foreign) saving(cdfplot, replace)
/* With Variable Creation */
cumul price if foreign == 0, gen(cprice0)
cumul price if foreign == 1, gen(cprice1)
tw conn cprice* price, sort connect(J J) ms(none none) saving(cumulplot, replace)
/* Compare the two methods */
graph combine cdfplot.gph cumulplot.gph

Related

Contribution analysis versus lca.score

I am interested in which processes/activities contribute most to the Life Cycle Impact Assessment (LCIA) that I am conducting. For this, I run a contribution analysis (see code below). To crosscheck the results of my contribution analysis and to ensure that I get everything right, I wanted to compare the returned contributions with the impact assessment result (lca.score).
The documentation of ca.annotated_top_processes(lca) says: "Returns a list of tuples: (lca score, supply, activity)."
In my understanding, lca.score should be the same value as the sum of all the first values in the tuples that are returned by ca.annotated_top_processes(lca) (the printed values). However, this is not the case. What am I missing? Is there some sort of cut-off applied or did I misunderstand something?
import bw2analyzer as bwa
random_act = db_ei381.random()
lca = bw2data.LCA(
{random_act: 1},
('ReCiPe Midpoint (H) V1.13', 'water depletion', 'WDP')
)
lca.lci()
lca.lcia()
print(lca.score)
# %% Contribution analysis
ca = bwa.ContributionAnalysis()
contributions = ca.annotated_top_processes(lca)
print(sum([i[0] for i in contributions]))
It is not well documented, but you can introduce an argument limit that specifies the number of activities that are considered in the contribution analysis. The default value I think is 25. It is sorted so the most important activities come first. If you write something like this you should see how the result converges to the total score as the number of activities increase:
import matplotlib.pyplot as plt
cutoff = [25,50,100,500,1000,1200]
scores = []
for n in cutoff:
contributions = ca.annotated_top_processes(lca,limit=n)
contr_sum = sum([i[0] for i in contributions])
scores.append(contr_sum)
plt.plot(cutoff,scores)
plt.axhline(lca.score,ls='--',color='r');

Reducing time and memory used in loop calculation when locating earthquakes

I'm trying to locate tremor, which is a type of earthquake with smaller amplitude. I use grid search, which is a method that finds the coordinate where 'the difference between theoretical value and observed value of differential time in seismic wave arrival' becomes minimum.
The code I made is as follows. First I defined two functions that calculate distance between earthquake source and each point on grid, and that calculate travel time of seismic waves using obspy.
def distance(a,i):
return math.sqrt(((ste[a].stats.sac.stla-la[i])**2)+((ste[a].stats.sac.stlo-lo[i])**2))
def traveltime(a):
return model.get_travel_times(source_depth_in_km=35, distance_in_degree=a, phase_list=["S"], receiver_depth_in_km=0)[0].time
Then I conducted grid search using following codes.
di=[(la[i],lo[i],distance(a,i), distance(b,i)) for i in range(len(lo))
for a in range(len(ste))
for b in range(len(ste)) if a<b]
didf=pd.DataFrame(di)
latot=didf[0]
lotot=didf[1]
dia=didf[2]
dib=didf[3]
tt=[]
for i in range(len(di)):
try:
tt.append((latot[i],lotot[i],traveltime(dia[i])-traveltime(dib[i])))
except IndexError:
continue
ttdf=pd.DataFrame(tt)
final=[(win[j],ttdf[0][i],ttdf[1][i],(ttdf[2][i]-shift[j])**2) for i in range(len(ttdf))
for j in range(len(ccdf))]
where la and lo are the list of latitude and longitude coordinates with 0.01 degree interval, and ste is the list of the east components seismogram of each station. I have to get the list 'final' to proceed to the next step.
However, the problem is that it takes too much time to calculate three segments of codes written above. Moreover, the result I get after tens of hours of calculation is 'out of memory' error message. Is there any solution that can reduce both time and memory?
Without access to your dataset, it's a little difficult to debug, but here are a few suggestions for you.
for i in range(len(di)):
try:
tt.append((latot[i],lotot[i],traveltime(dia[i])-traveltime(dib[i])))
except IndexError:
continue
• Given the size of these lists, I think that the Garbage Collector might be slowing down this for loop; you might consider turning it off for the duration of the loop (gc.disable()).
• In theory, the Append statement shouldn't be the source of your performance problems, since it over-allocates:
/* This over-allocates proportional to the list size, making room
* for additional growth. The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
*/
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);
but you already know the size of the array, so you might consider using numpy.zeroes() to fill the list before the for-loop, and use the index to directly address each element. Alternatively, you could just use list comprehensions, as you did earlier, and avoid the problem altogether.
• I see that you've tagged the question with python-3.x, so range() shouldn't be an issue like it was in 2.x (otherwise you would want to consider using xrange()).
If you update your question with more details, I could probably provide a more detailed answer...hope this helps.

Geospatial fixed radius cluster hunting in python

I want to take an input of millions of lat long points (with a numerical attribute) and then find all fixed radius geospatial clusters where the sum of the attribute within the circle is above a defined threshold.
I started by using sklearn BallTree to sum the attribute within any defined circle, with the intention of then expanding this out to run across a grid or lattice of circles. The run time for one circle is around 0.01s, so this is fine for small lattices, but won't scale if I want to run 200m radius circles across the whole of the UK.
#example data (use 2m rows from postcode centroid file)
df = pandas.read_csv('National_Statistics_Postcode_Lookup_Latest_Centroids.csv', usecols=[0,1], nrows=2000000)
#this will be our grid of points (or lattice) use points from same file for example
df2 = pandas.read_csv('National_Statistics_Postcode_Lookup_Latest_Centroids.csv', usecols=[0,1], nrows=2000)
#reorder lat long columns for balltree input
columnTitles=["Y","X"]
df = df.reindex(columns=columnTitles)
df2 = df2.reindex(columns=columnTitles)
# assign new columns to existing dataframe. attribute will hold the data we want to sum over (set to 1 for now)
df['attribute'] = 1
df2['aggregation'] = 0
RADIANT_TO_KM_CONSTANT = 6367
class BallTreeIndex:
def __init__(self, lat_longs):
self.lat_longs = np.radians(lat_longs)
self.ball_tree_index =BallTree(self.lat_longs, metric='haversine')
def query_radius(self,query,radius):
radius_km = radius/1000
radius_radiant = radius_km / RADIANT_TO_KM_CONSTANT
query = np.radians(np.array([query]))
indices = self.ball_tree_index.query_radius(query,r=radius_radiant)
return indices[0]
#index the base data
a=BallTreeIndex(df.iloc[:,0:2])
#begin to loop over the lattice to test performance
for i in range(0,100):
b = df2.iloc[i,0:2]
output = a.query_radius(b, 200)
accumulation = sum(df.iloc[output, 2])
df2.iloc[i,2] = accumulation
It feels as if the above code is really inefficient as I don't need to run the calculation across all circles on my lattice (as most will be well below my threshold - or will have no data points in at all).
Instead of this for loop, is there a better way of scaling this algorithm to give me the most dense circles?
I'm new to python, so any help would be massively appreciated!!
First don't try to do this on a sphere! GB is small and we have a well defined geographic projection that will work. So use the oseast1m and osnorth1m columns as X and Y. They are in metres so no need to convert (roughly) to degrees and use Haversine. That should help.
Next add a spatial index to speed up lookups.
If you need more speed there are various tricks like loading a 2R strip across the country into memory and then running your circles across that strip, then moving down a grid step and updating that strip (checking Y values against a fixed value is quick, especially if you store the data sorted on Y then X value). If you need more speed then look at any of the papers the Stan Openshaw (and sometimes I) wrote about parallelising the GAM. There are examples of implementing GAM in python (e.g. this paper, this paper) that may also point to better ways.

Min/Max Values Across Multiple Sublists within a Dictionary

I'm working on a program that will read a .txt file of NBA team names with fives statistics for each team into a dictionary. I was able to read the file into a dictionary with the correct key-value pairs, but I can't figure out how to return a table of minimum, maximum, and average values for each statistic across all teams. I've looked at other questions on the site, but I can't find anything pertaining to a key with multiple values per entry in the dictionary. Here's what the dictionary looks like for a few of the entries:
{'Golden State Warriors':[113.5, 107.5, 43.5, 0.503, 8.0], 'Houston Rockets':[112.4, 103.9, 43.5, 0.46, 8.5], ... : ...}
I need to make a table that displays the min, max, and average of each statistic across all entries:
PPG PPAG RPG FG SPG
MIN
MAX
AVG
I can do this kind of stuff with the statistics in the form of a list, but whenever I try to write lists into a dictionary, I get a TypeError. Would greatly appreciate suggestions, I've been stuck on this for hours.
Also, IMPORTANT NOTE: I cannot use lambda for this task, I am working on this for a project and my options are for loops, while loops, and the basic list functions and dictionary methods.
The following code should do the computation of the stas:
from statistics import mean
lists = list(dic.values())
output = {'MIN':[], 'MAX':[], 'AVG':[]}
for i in range(0, len(lists[0])):
output['MIN'].append(min(col[i] for col in lists))
output['MAX'].append(max(col[i] for col in lists))
output['AVG'].append(mean(col[i] for col in lists))
To print the results as a table, you may do the following:
print("\tPPG\tPPAG\tRPG\tFG\tSPG")
for stats in output:
print(stats, end='\t')
for i in range(0, cols):
print(round(output[stats][i], 3), end='\t')
print()
Hope it helps.
As long as all the data for each team is in the same order in the dictionary this will work well for your purpose. I started by forming a list containing the dictionary values which are also lists. Then you can zip the sub lists together, this has the handy feature of grouping all the same indices in the lists together. Now we can call max, min, and mean on each zip'd stat! If you can not import the 'statistics' module a list comp would also be an easy one liner.
from statistics import mean
stats = {'Golden State Warriors':[113.5, 107.5, 43.5, 0.503, 8.0], 'Houston Rockets':[112.4, 103.9, 43.5, 0.46, 8.5]}
values = [*zip(*stats.values())]
for value in values:
print("max: {}\nmin: {}\nmean: {:0.3f} \n".format(max(value), min(value), mean(value)))
EDIT: take note that if you try to zip lists together that are not the same length it will zip by the shortest list. This behavior can be corrected by using the itertools izip_longest and providing a default value.

Pandas .rolling.corr using date/time offset

I am having a bit of an issue with pandas's rolling function and I'm not quite sure where I'm going wrong. If I mock up two test series of numbers:
df_index = pd.date_range(start='1990-01-01', end ='2010-01-01', freq='D')
test_df = pd.DataFrame(index=df_index)
test_df['Series1'] = np.random.randn(len(df_index))
test_df['Series2'] = np.random.randn(len(df_index))
Then it's easy to have a look at their rolling annual correlation:
test_df['Series1'].rolling(365).corr(test_df['Series2']).plot()
which produces:
All good so far. If I then try to do the same thing using a datetime offset:
test_df['Series1'].rolling('365D').corr(test_df['Series2']).plot()
I get a wildly different (and obviously wrong) result:
Is there something wrong with pandas or is there something wrong with me?
Thanks in advance for any light you can shed on this troubling conundrum.
It's very tricky, I think the behavior of window as int and offset is different:
New in version 0.19.0 are the ability to pass an offset (or
convertible) to a .rolling() method and have it produce variable sized
windows based on the passed time window. For each time point, this
includes all preceding values occurring within the indicated time
delta.
This can be particularly useful for a non-regular time frequency index.
You should checkout the doc of Time-aware Rolling.
r1 = test_df['Series1'].rolling(window=365) # has default `min_periods=365`
r2 = test_df['Series1'].rolling(window='365D') # has default `min_periods=1`
r3 = test_df['Series1'].rolling(window=365, min_periods=1)
r1.corr(test_df['Series2']).plot()
r2.corr(test_df['Series2']).plot()
r3.corr(test_df['Series2']).plot()
This code would produce similar shape of plots for r2.corr().plot() and r3.corr().plot(), but note that the calculation results still different: r2.corr(test_df['Series2']) == r3.corr(test_df['Series2']).
I think for regular time frequency index, you should just stick to r1.
This mainly because the result of two rolling 365 and 365D are different.
For example
sub = test_df.head()
sub['Series2'].rolling(2).sum()
Out[15]:
1990-01-01 NaN
1990-01-02 -0.355230
1990-01-03 0.844281
1990-01-04 2.515529
1990-01-05 1.508412
sub['Series2'].rolling('2D').sum()
Out[16]:
1990-01-01 -0.043692
1990-01-02 -0.355230
1990-01-03 0.844281
1990-01-04 2.515529
1990-01-05 1.508412
Since there are a lot NaN in rolling 365, so the corr of two series in two way are quit different.

Resources