How to alter output data format for isolation forest - scikit-learn

I have built an isolation forest to detect anomalies in a CSV file, and I want to change the format of the output. Right now the anomaly data is output as a pandas DataFrame, but I would like it to be a JSON file in the following format:
{seconds: #seconds for that row, size2: size2, pages: #pages for that row}
I have attached some of the code and a sample of the data, thank you so much!
model.fit(df[['label']])
df['anomaly'] = model.fit_predict(df[['size2', 'size3', 'size4']])
# df['anomaly'] = model.predict(df[['pages']])
print(model.predict(X_test))
anomaly = df.loc[df['anomaly'] == -1]
anomaly_index = list(anomaly.index)
print(anomaly)
The output data looks something like this:
Unnamed:  seconds:  size2:  ...  size4:  pages:  anomaly:
       1        40      32  ...     654       1        -1

I have figured out a way to do this: I made multiple dictionaries, one mapping the row index to its timestamp and one mapping the row index to its label. I could then keep track of which indexes appeared in the output data and pull the corresponding information from those dictionaries; a more direct pandas-only sketch is shown below.
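As a rough sketch of that more direct route (the column names 'seconds', 'size2', and 'pages' are assumed from the sample output above, not confirmed in the question), the anomalous rows can be written straight to JSON:
import json

# Keep only the anomalous rows and the columns wanted in the JSON.
# Column names are assumptions based on the sample output above.
records = df.loc[df['anomaly'] == -1, ['seconds', 'size2', 'pages']].to_dict(orient='records')

# One dict per row, e.g. {"seconds": 40, "size2": 32, "pages": 1}
with open('anomalies.json', 'w') as f:
    json.dump(records, f, indent=2)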

Related

How can I define a parameter from specific columns and rows from excel?

I want to obtain a list of certain values from an excel file.
I tried this:
import pandas as pd
df = pd.read_excel('Data.xlsx')
orders = df[['Order']].loc[[4,129]]
print(orders)
I obtained this output:
        Order
4    18292839
129  83938292
But what I want is the orders from row 4 to row 129 in a list of values, like this:
['18292839', .............. (other orders), '83938292']
If someone can help me I will be very grateful!
You can use orders.values.tolist() to convert a pd.Series into a list. More about converting DataFrames and Series to lists can be found in the pandas documentation.
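Since the goal is the whole range of rows 4 through 129, a minimal sketch (assuming the default integer index, and that the order numbers should be strings as in the desired output) would be:
import pandas as pd

df = pd.read_excel('Data.xlsx')

# .loc slicing is inclusive on both ends, so this covers rows 4 through 129.
# Selecting the column with single brackets yields a Series, so tolist()
# returns a flat list rather than a list of one-element lists.
orders = df['Order'].loc[4:129].astype(str).tolist()
print(orders)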

How to resample multiple files of time series data to be all the same length (same # of observations)

I have many csv files which are of time series data (i.e. the data are sequential, however no time column exists).
I need to make all the files the same length in order to feed them into TensorFlow. I could make them all the length of the longest file, or just use the average length across files; it doesn't really matter which.
Since the files don't have a time column, I converted the index column to datetime with unit 's' and used that column for resampling.
To give you a sample of the shape of my data, this is the result of df.head(3):
0 1 2 3 4 5 6 7
0 0.30467 0.45957 -0.95414 1.74687 1.42338 -0.03860 2.20401 1.44406
1 0.27331 0.59293 -1.00874 1.74135 1.32004 -0.00701 2.20917 1.34164
2 0.30348 0.88129 -1.05517 1.75090 1.65138 -0.03112 2.21598 1.68487
This is what I have tried so far to no avail:
for file in files:
    df = pd.read_csv(file, header=None)
    resampled = df.set_index(pd.to_datetime(df.index, unit='s')).resample('250ms')
    resamp = pd.DataFrame(resampled)
I also tried: df.set_index(pd.to_datetime(df.index,unit='s')).resample('250ms').asfreq() and df.set_index(pd.to_datetime(df.index,unit='s')).resample('250ms').asfreq().interpolate()
None of the above worked; they all returned DataFrames of different lengths.
I expect the output to be resampled data such that all the files are of the same length (i.e. same number of observations) and are correctly resampled (either upsampled or downsampled).
After having resampled the files, I need to concatenate all of them to have one big file which I can then reshape to input to tensorflow.
I am new to Python so I will really appreciate support here.
I managed to make the files the same size using the following steps:
1- Add a date range column as the index to all files, with the same start and end dates.
2- Resample all files with the same resampling rate (I used trial and error to reach a resampled size that seemed good enough for my data).
3- If the resampled length was greater than the desired length, take the mean; otherwise, interpolate.
Code:
for i, file in enumerate(directory):
    count += 1
    df = pd.read_csv(file, header=None)
    df.set_index(pd.date_range(start='1/1/1991', periods=len(df), end='1/1/2000'), inplace=True)
    frames_length.append(df.shape[0])
    # Resampling using 33D (chosen arbitrarily by trial and error until 100 rows were obtained);
    # returns resampled files/movements of 100 frames each
    if df.shape[0] < 100:
        resampled = df.resample('33D').interpolate()
    elif df.shape[0] > 100:
        resampled = df.resample('33D').mean()
    else:
        break
    # Check if resampled files have any nulls or NaNs
    print(resampled.isnull().any().any())
The above code returned files of size 100 for me.
Now I need to know whether the resampling was correct or not.
Any help would be greatly appreciated.
Thanks!
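As an alternative sketch that avoids the trial-and-error resampling rate entirely (this is not from the original post; it assumes every column is numeric and that plain linear interpolation onto a fixed number of rows is acceptable):
import numpy as np
import pandas as pd

TARGET_LEN = 100  # desired number of observations per file (assumption)

def resample_to_length(df, target_len=TARGET_LEN):
    # Interpolate every column onto target_len evenly spaced positions,
    # upsampling short files and downsampling long ones.
    old_x = np.linspace(0.0, 1.0, num=len(df))
    new_x = np.linspace(0.0, 1.0, num=target_len)
    return pd.DataFrame({col: np.interp(new_x, old_x, df[col].to_numpy())
                         for col in df.columns})

# 'files' is the list of CSV paths from the question.
resampled_frames = [resample_to_length(pd.read_csv(f, header=None)) for f in files]
big = pd.concat(resampled_frames, ignore_index=True)  # one big frame to reshape for TensorFlow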

Losing records when converting DataFrame to dictionary

I parse a CSV file into a Dataframe. 10,000 records go in, no problems.
Two columns: one 'ID', one 'Reviews'.
I try to convert the DF into a dictionary where keys = 'ID', and values = 'Reviews'.
For some reason the new dictionary only contains 680 records.
# read csv data file
data = pd.read_csv("Movie_reviews.csv",
                   delimiter='\t',
                   header=None, names=['ID', 'Reviews'])
reviews = data.set_index('ID').to_dict().get('Reviews')
len(reviews)
output is 680
If I don't append '.get('Reviews')' everything is one big record.
The DataFrame 'data' looks like this:
ID Reviews
1 076780192X it always amazes me how people can rate the DV...
2 0767821599 This movie is okay, but, its not worth what th...
3 0782008380 If you love the Highlander 1 movie and the ser...
4 0767726227 This is a great classic collection, if you lik...
5 0780621832 This is the second of John Ford and John Wayne...
6 0310263662 I am an evangelical Christian who believes in ...
7 0767809270 Federal law, in one of its numerous unfunded m...
In case it helps anyone else:
The IDs for the movie reviews were not all unique. The .nunique() function revealed that, as suggested by #YOLO.
Assigning only the values (Reviews) to the dictionary automatically added unique keys, as suggested by #JackHoman, which resolved my issue.
I think you can do:
Method 1:
reviews = data.set_index('ID')['Reviews'].to_dict()
Method 2: Here we convert reviews to a list for each ID so that we don't lose any information.
reviews = data.groupby('ID')['Reviews'].apply(list).to_dict()
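A small self-contained example (with made-up IDs, not the poster's data) showing why duplicate IDs lose entries with the plain to_dict and how the groupby variant keeps them:
import pandas as pd

data = pd.DataFrame({'ID': ['a', 'a', 'b'],
                     'Reviews': ['great', 'bad', 'okay']})

# Method 1: later rows overwrite earlier ones when IDs repeat.
print(data.set_index('ID')['Reviews'].to_dict())
# {'a': 'bad', 'b': 'okay'}

# Method 2: duplicates are preserved as lists.
print(data.groupby('ID')['Reviews'].apply(list).to_dict())
# {'a': ['great', 'bad'], 'b': ['okay']}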

Speeding up my geopy function

I have a set of lat-long coordinates in a column of a df. The format is lat, long. The idea is to determine the zipcode (postal code) and add a column with that number to the df.
Here is the column
10 40.7145,-73.9425
10000 40.7947,-73.9667
100004 40.7388,-74.0018
100007 40.7539,-73.9677
100013 40.8241,-73.9493
I created this function using geopy to get the information I want:
def zipcode(coords):
    geolocator = Nominatim()
    location = geolocator.reverse(coords)
    zip = location.raw['address']['postcode']
    return zip
It works fine on a small subset of the DataFrame, but when I try it on the larger dataset, it stalls.
Was wondering if someone can give me a more efficient way to do this.
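A hedged sketch of the usual speed-ups (not from the original post): build the geolocator once instead of on every call, throttle requests with geopy's RateLimiter so Nominatim does not block you, and reverse-geocode each distinct coordinate only once. The column names 'coords' and 'zipcode' here are illustrative assumptions.
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# Build the geolocator a single time; current Nominatim usage requires a user_agent.
geolocator = Nominatim(user_agent="zipcode-lookup-example")
reverse = RateLimiter(geolocator.reverse, min_delay_seconds=1)

def zipcode(coords):
    location = reverse(coords)
    return location.raw['address'].get('postcode')

# Look up each distinct coordinate only once, then map the results back onto the df.
unique_coords = df['coords'].drop_duplicates()
zip_by_coord = {c: zipcode(c) for c in unique_coords}
df['zipcode'] = df['coords'].map(zip_by_coord)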

pandas - convert Panel into DataFrame using lookup table for column headings

Is there a neat way to do this, or would I be best off making a loop that creates a new DataFrame, looking into the Panel when constructing each column?
I have a 3D array of data that I have put into a Panel, and I want to reorganise it based on a 2D lookup table over two of the axes, so that it becomes a DataFrame with labels taken from my lookup table using the nearest value, in a kind of double-VLOOKUP way.
The main thing I am trying to achieve is to be able to quickly locate a time series of data based on the label. If there is a better way, please let me know!
My data is in a Panel that looks like this, with the items axis being latitude and the minor axis longitude.
data
Out[920]:
<class 'pandas.core.panel.Panel'>
Dimensions: 53 (items) x 29224 (major_axis) x 119 (minor_axis)
Items axis: 42.0 to 68.0
Major_axis axis: 2000-01-01 00:00:00 to 2009-12-31 21:00:00
Minor_axis axis: -28.0 to 31.0
and my lookup table is like this:
label_coords
Out[921]:
lat lon
label
2449 63.250122 -5.250000
2368 62.750122 -5.750000
2369 62.750122 -5.250000
2370 62.750122 -4.750000
I'm kind of at a loss. Quite new to python in general and only really started using pandas yesterday.
Many thanks in advance! Sorry if this is a duplicate, I couldn't find anything that was about the same type of question.
Andy
I figured out a loop-based solution and thought I may as well post it in case someone else has this type of problem.
I changed the way my label coordinates dataframe was being read so that the labels were a column, then used the pivot function:
label_coord = label_coord.pivot('lat','lon','label')
This produces a DataFrame where the labels are the values and lat/lon are the index/columns.
Then I used this loop, where data is the Panel from the question:
data_labelled = pd.DataFrame()
for i in label_coord.columns:      # longitude
    for j in label_coord.index:    # latitude
        lbl = label_coord[i][j]
        data_labelled['%s' % lbl] = data[j][i]
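The nested loops can also be collapsed by iterating the original label_coords table directly. This sketch assumes the lat/lon values in the lookup table exactly match the Panel's item and minor-axis labels (nearest-value matching would need an extra rounding step):
import pandas as pd

# One column per label, each holding the time series data[lat][lon] from the Panel.
data_labelled = pd.DataFrame({
    str(label): data[row['lat']][row['lon']]
    for label, row in label_coords.iterrows()
})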
