Speeding up my geopy function - python-3.x

I have a set of lat-long coordinates in a column of a df. The format is lat, long. The idea is to determine the zipcode (postal code) and add a column with that number to the df.
Here is the column
10 40.7145,-73.9425
10000 40.7947,-73.9667
100004 40.7388,-74.0018
100007 40.7539,-73.9677
100013 40.8241,-73.9493
I created this function using geopy to get the information I want:
from geopy.geocoders import Nominatim

def zipcode(coords):
    geolocator = Nominatim()
    location = geolocator.reverse(coords)
    zip_code = location.raw['address']['postcode']
    return zip_code
It works fine on a small subset of the data frame, but when I try it on the larger dataset, it stalls.
I was wondering if someone could suggest a more efficient way to do this.
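One likely reason this stalls is that the function builds a new Nominatim client for every row and hammers the public endpoint, which allows roughly one request per second. A common mitigation is to create the geocoder once and throttle the calls with geopy's RateLimiter. A minimal sketch, assuming the coordinates live in a column named coords (newer geopy versions also require a user_agent):

import pandas as pd
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# Create the geocoder once (newer geopy requires a user_agent string).
geolocator = Nominatim(user_agent="zipcode-lookup")
# Throttle to ~1 request/second to respect Nominatim's usage policy.
reverse = RateLimiter(geolocator.reverse, min_delay_seconds=1)

def zipcode(coords):
    location = reverse(coords)
    # Some results carry no postcode, so fall back to None instead of raising KeyError.
    return location.raw.get('address', {}).get('postcode') if location else None

# 'coords' is an assumed column name holding the "lat,long" strings shown above.
df['zip'] = df['coords'].apply(zipcode)

Even with this, a very large frame is still bound by the one-request-per-second limit, so deduplicating coordinates before geocoding, or an offline reverse lookup, may be needed for millions of rows.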

Related

Optimize Past stock calculation using cumulative sum in python and pandas without iterating through dataframe (Performance warning)

I'm calculating the cumulative sum of one specific clothes store's stock over time (grouped by family, group, year and month), so that I can re-establish past stock levels from three values: the number of purchased items, the number of sold items, and the current stock I have today.
I have already solved the calculation part by merging the stock table with the movement table and computing mov_itens['novo_estoque'] with the formula below:
mov_itens['novo_estoque'] = mov_itens['vendas'] - mov_itens['compras'] + mov_itens['estoque']
Then I transformed it into a multi-index dataframe, where the index levels are, respectively, codfamilia, codgrupo, ano and mes (family, group, year and month):
gruposemindice = mov_itens.groupby(['codfamilia','codgrupo','ano','mes']).sum()
Then I calculated cumsum() on the 'estoque' column. I could not do this after a map or anything similar, because I wasn't able to return (to my new dataframe) the other columns that shouldn't receive the cumulative sum.
gruposemindice_ord = gruposemindice.sort_index(ascending=False)
for i in gruposemindice_ord.index:
    if f == i[0]:  # codfamilia
        if g == i[1]:  # codgrupo
            gruposemindice_ord.loc[i[:-2]]['estoque'] = (gruposemindice_ord.loc[i[:-2]]['novo_estoque']).cumsum()
            # calculate stock turnover on this line
            print(gruposemindice_ord.loc[i[:-2]])
        else:
            g = i[1]
    else:
        f = i[0]
The problem is that I'm doing this iteratively, and because the dataframe is sorted DESCENDING by index, each lookup costs on the order of O(n). It should be O(1) (direct access) to be fast enough, avoid the bottleneck, and stop the warnings that keep appearing:
gruposemindice_ord.loc[i[:-2]]['estoque'] = (gruposemindice_ord.loc[i[:-2]]['novo_estoque']).cumsum()
C:\Users\Diego\AppData\Local\Temp/ipykernel_20264/1416332248.py:10: PerformanceWarning: indexing past lexsort depth may impact performance.
print(gruposemindice_ord.loc[i[:-2]])
C:\Users\Diego\AppData\Local\Temp/ipykernel_20264/1416332248.py:9: PerformanceWarning: indexing past lexsort depth may impact performance.
gruposemindice_ord.loc[i[:-2]]['estoque'] = (gruposemindice_ord.loc[i[:-2]]['novo_estoque']).cumsum()
C:\Users\Diego\AppData\Local\Temp/ipykernel_20264/1416332248.py:9: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
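Following that hint, writing through a single .loc call with both the row key and the column name assigns into the original frame rather than a temporary copy. A sketch of just the assignment line, using the same variables as above:

# One .loc with row and column indexers writes back to the original frame,
# instead of assigning into the copy returned by chained indexing.
gruposemindice_ord.loc[i[:-2], 'estoque'] = gruposemindice_ord.loc[i[:-2], 'novo_estoque'].cumsum()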
According to the warning, I could fix this just by sorting the dataframe ASCENDING before doing the manipulation, but I need a DESCENDING sort, because stock information is only available for the current month, which is the last month in the table, and I calculate the previous months starting from it. If I sort the other way the calculation will not work.
Some Observations
The dataframe's multi-index levels are the numbers of the families, groups and brands of the store, so I can't re-index and lose these numbers.
I also cannot do the calculation in ascending order, as my first known stock is in the last month.
I am already checking that the sort is correct (as pointed out in another Stack Overflow answer):
gruposemindice_ord = gruposemindice.sort_index(ascending=False)
gruposemindice_ord.index.is_lexsorted()
Hope someone can help me!
Best Regards, Diego Mello
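One way to avoid the Python-level loop and the lexsort warning at the same time is to keep the frame sorted ascending (so lookups stay fast) and compute the reverse-chronological cumulative sum inside each (codfamilia, codgrupo) group, by reversing each group's series before and after cumsum. A sketch, assuming the grouped frame and column names from the question:

# Keep the MultiIndex lexsorted ascending so .loc lookups stay fast.
gruposemindice = gruposemindice.sort_index()

# For each (codfamilia, codgrupo), accumulate 'novo_estoque' from the latest
# month backwards: reverse the group, cumsum, then reverse back.
gruposemindice['estoque'] = (
    gruposemindice['novo_estoque']
    .groupby(level=['codfamilia', 'codgrupo'], group_keys=False)
    .apply(lambda s: s[::-1].cumsum()[::-1])
)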

How to alter output data format for isolation forest

I have built an isolation forest to detect anomalies for a csv file that I have, and I wanted to see how I can change the format of the data. Right now, the anomaly data is being outputted as a pandas dataframe, but I would like to alter it to be a json file, in the following format:
{seconds: #seconds for that row, size2: size2, pages: #pages for that row}
I have attached some of the code and a sample of the data. Thank you so much!
from sklearn.ensemble import IsolationForest

model = IsolationForest()
model.fit(df[['label']])
df['anomaly'] = model.fit_predict(df[['size2', 'size3', 'size4']])
# df['anomaly'] = model.predict(df[['pages']])
print(model.predict(X_test))
anomaly = df.loc[df['anomaly'] == -1]
anomaly_index = list(anomaly.index)
print(anomaly)
The output data looks something like this:
Unnamed  seconds  size2  ...  size4  pages  anomaly
      1       40     32  ...    654      1       -1
I have figured out a way to do this; I made multiple dictionaries, one mapping the index of the row to that timestamp, and one mapping the index of the row to the label. I was then able to keep track of which indexes were in the output data, and access all the information from those dictionaries.
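For reference, pandas can emit the record-per-row JSON shape directly from the anomaly frame, which may be simpler than maintaining separate index dictionaries. A sketch; the column names seconds, size2 and pages follow the sample output above and may need adjusting:

# Keep only the fields wanted in the JSON and write one object per anomalous row.
anomaly[['seconds', 'size2', 'pages']].to_json('anomalies.json', orient='records')

# Or build plain Python dicts first if further processing is needed:
records = anomaly[['seconds', 'size2', 'pages']].to_dict(orient='records')

With orient='records', each row becomes one {"seconds": ..., "size2": ..., "pages": ...} object, which matches the requested format.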

How to resample multiple files of time series data to be all the same length (same # of observations)

I have many csv files containing time series data (i.e. the data are sequential, but there is no time column).
I need to make all the files the same length in order to feed them into TensorFlow. I could make them all the length of the longest file, or just use the average length of all files; it doesn't really matter.
Since the files don't have a time column, I converted the index column to datetime with unit 's' and used this column in the resampling.
To give you a sample of the shape of my data, this is the result of running df.head(3):
0 1 2 3 4 5 6 7
0 0.30467 0.45957 -0.95414 1.74687 1.42338 -0.03860 2.20401 1.44406
1 0.27331 0.59293 -1.00874 1.74135 1.32004 -0.00701 2.20917 1.34164
2 0.30348 0.88129 -1.05517 1.75090 1.65138 -0.03112 2.21598 1.68487
This is what I have tried so far to no avail:
for file in files:
    df = pd.read_csv(file, header=None)
    resampled = df.set_index(pd.to_datetime(df.index, unit='s')).resample('250ms')
    resamp = pd.DataFrame(resampled)
I also tried df.set_index(pd.to_datetime(df.index,unit='s')).resample('250ms').asfreq() and df.set_index(pd.to_datetime(df.index,unit='s')).resample('250ms').asfreq().interpolate().
None of the above gave dataframes of the same length; they all returned dfs of different lengths.
I expect the output to be resampled data such that all the files are of the same length (i.e. same number of observations) and are correctly resampled (either upsampled or downsampled).
After having resampled the files, I need to concatenate all of them to have one big file which I can then reshape to input to tensorflow.
I am new to Python, so I would really appreciate support here.
I have managed to make the files the same size using the following steps:
1- Add a date range as the index column to all files, with the same start and end dates.
2- Resample all files with the same resampling rate (I used trial and error to reach a resampled size that seemed good enough for my data).
3- Add a condition: if the length is greater than desired, take the mean; otherwise, interpolate.
Code:
for i, file in enumerate(directory):
    count += 1
    df = pd.read_csv(file, header=None)
    df.set_index(pd.date_range(start='1/1/1991', periods=len(df), end='1/1/2000'), inplace=True)
    frames_length.append(df.shape[0])
    # Resampling using 33D (33D was chosen arbitrarily with trial and error until 100 was obtained),
    # returns resampled files/movements of 100 frames each
    if df.shape[0] < 100:
        resampled = df.resample('33D').interpolate()
    elif df.shape[0] > 100:
        resampled = df.resample('33D').mean()
    else:
        break
    # Check if resampled files have any nulls or NaNs
    print(resampled.isnull().any().any())
The above code returned files of size 100 for me.
Now what I need to know is whether the resampling was correct or not.
Any help would be greatly appreciated.
Thanks!
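As an alternative to picking a resampling rate by trial and error, each file can be interpolated onto a fixed number of positions directly, which guarantees the target length regardless of how long the file is. A sketch; TARGET_LEN and the helper name are illustrative, not part of the original code:

import numpy as np
import pandas as pd

TARGET_LEN = 100  # desired number of rows per file (any fixed length works)

def resample_to_length(df, n=TARGET_LEN):
    """Linearly interpolate every column onto exactly n evenly spaced positions."""
    old_positions = np.linspace(0.0, 1.0, num=len(df))
    new_positions = np.linspace(0.0, 1.0, num=n)
    return pd.DataFrame({col: np.interp(new_positions, old_positions, df[col].to_numpy())
                         for col in df.columns})

# Example usage inside the loop over files:
# resampled = resample_to_length(pd.read_csv(file, header=None))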

Finding nearest Zip-Code Given Lat Long and List of Zip Codes w/ Lat Long

I have a left data frame of over 1 million lat/long observations. I have another data frame (the right) of 43,191 zip codes, each with a central lat/long.
My goal is to run each row of the 1 million lat/long observations against the entire zip code data frame, compute the distance to each zip code, and return the zip code with the minimum distance. I want to take a loop approach since there is too much data to do a cartesian join.
I understand this will probably be a lengthy operation but I only need to do it once. I am just trying to do it in a way that doesn't take days and won't give me a memory error.
The database with the lat/long zip codes lives here:
https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/export/
I have tried to join the left table with the right in a cartesian setting but that creates over 50 billion rows so that isn't going to work.
Some dummy data:
import geopy.distance as gd
import pandas as pd
import os
import numpy as np

df = pd.DataFrame(np.array([[42.801104, -76.827879],
                            [38.187102, -83.433917],
                            [35.973115, -83.955932]]), columns=['Lat', 'Long'])

for index, row in df.iterrows():
    gd.vincenty((row['Lat'], row['Long']))
My goal is to write the loop so that a single row on the left frame iterates over the 43,000 rows in the right frame, calculates each distance, takes the minimum of that result set (probably a list of some sort), and then returns the corresponding zip code in a new column.
I am a bit lost as I typically would just do this with a cartesian join and calculate everything in one go but I have too much data volume to do that.
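Rather than looping row by row, a spatial index keeps this to one vectorized query per point: scikit-learn's BallTree supports the haversine metric, so the ~43k zip centroids can be indexed once and each observation queried for its single nearest neighbour. A sketch, assuming the downloaded zip file has Zip, Lat and Long columns and a semicolon-separated export (both assumptions to adjust to the actual dataset):

import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

# Left frame: the lat/long observations (dummy values from the question).
df = pd.DataFrame(np.array([[42.801104, -76.827879],
                            [38.187102, -83.433917],
                            [35.973115, -83.955932]]), columns=['Lat', 'Long'])

# Right frame: zip centroids; file name, separator and column names are assumptions.
zips = pd.read_csv('us-zip-code-latitude-and-longitude.csv', sep=';')

# Build the index once over the zip centroids (the haversine metric expects radians).
tree = BallTree(np.radians(zips[['Lat', 'Long']].to_numpy()), metric='haversine')

# One nearest neighbour per observation; distances come back in radians.
dist, idx = tree.query(np.radians(df[['Lat', 'Long']].to_numpy()), k=1)

df['nearest_zip'] = zips['Zip'].to_numpy()[idx[:, 0]]
df['dist_km'] = dist[:, 0] * 6371.0  # multiply by Earth's radius to get kilometres

Building the tree is cheap, and querying a million points is a single vectorized call, so this avoids both the 50-billion-row cartesian join and a days-long Python loop.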

pandas - convert Panel into DataFrame using lookup table for column headings

Is there a neat way to do this, or would I be best off making a loop that creates a new dataframe, looking into the Panel when constructing each column?
I have a 3D array of data that I have put into a Panel, and I want to reorganise it, based on a 2D lookup table over two of the axes, into a DataFrame whose labels are taken from my lookup table using the nearest value, in a kind of double-vlookup way.
The main thing I am trying to achieve is to be able to quickly locate a time series of data based on the label. If there is a better way, please let me know!
My data is in a Panel that looks like this, with latitude on the items axis and longitude on the minor axis:
data
Out[920]:
<class 'pandas.core.panel.Panel'>
Dimensions: 53 (items) x 29224 (major_axis) x 119 (minor_axis)
Items axis: 42.0 to 68.0
Major_axis axis: 2000-01-01 00:00:00 to 2009-12-31 21:00:00
Minor_axis axis: -28.0 to 31.0
and my lookup table is like this:
label_coords
Out[921]:
lat lon
label
2449 63.250122 -5.250000
2368 62.750122 -5.750000
2369 62.750122 -5.250000
2370 62.750122 -4.750000
I'm kind of at a loss; I'm quite new to Python in general and only really started using pandas yesterday.
Many thanks in advance! Sorry if this is a duplicate, I couldn't find anything that was about the same type of question.
Andy
I figured out a loop-based solution and thought I may as well post it in case someone else has this type of problem.
I changed the way my label coordinates dataframe was being read so that the labels were a column, then used the pivot function:
label_coord = label_coord.pivot('lat','lon','label')
This produces a dataframe where the labels are the values and lat/lon are the index/columns.
Then I used this loop, where data is the Panel from the question:
data_labelled = pd.DataFrame()
for i in label_coord.columns:    # longitude
    for j in label_coord.index:  # latitude
        lbl = label_coord[i][j]
        data_labelled['%s' % lbl] = data[j][i]
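The same frame can be built in one pass with a dict comprehension, which avoids growing the DataFrame column by column. A sketch, assuming the data Panel and the pivoted label_coord above, and skipping any lat/lon cell whose label came out NaN from the pivot (an assumption about the lookup table's coverage):

import pandas as pd

# Collect every labelled time series at once: column name = label,
# values = the Panel's series at that (latitude, longitude).
data_labelled = pd.DataFrame({
    str(label_coord[lon][lat]): data[lat][lon]
    for lon in label_coord.columns   # longitude
    for lat in label_coord.index     # latitude
    if pd.notna(label_coord[lon][lat])   # pivot leaves NaN where no label exists
})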
