sklearn MinMaxScaler mis-scaling? - scikit-learn

I'm having trouble understanding one of the scaled columns in the array returned by MinMaxScaler.
The code snippet is as follows:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
A = np.random.randint(5, size=(8, 4))
FrameA = pd.DataFrame(A)  # wrap the array; assigning A directly would only rebind the name
scaled_array = MinMaxScaler().fit_transform(FrameA)
[Screenshot in original post: scaled values (left) and original values (right)]
Column 2 is suspect. The formula applied to it seems to be x[i] / max(x) - 1, which differs from the other columns.
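For reference, MinMaxScaler's documented per-column formula is (x - min) / (max - min). A quick sanity check against a manual computation (a minimal sketch, assuming no constant columns, which would divide by zero in the manual version):
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

A = np.random.randint(5, size=(8, 4))
FrameA = pd.DataFrame(A)

scaled_array = MinMaxScaler().fit_transform(FrameA)
# manual per-column min-max scaling for comparison
manual = (A - A.min(axis=0)) / (A.max(axis=0) - A.min(axis=0))
print(np.allclose(scaled_array, manual))  # expected: True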

Related

Annotating clustering from DBSCAN to original Pandas DataFrame

I have working code that uses DBSCAN to find tight groups of sparse spatial data imported with pd.read_csv.
I am maintaining the original spatial data locations and would like to attach the labels returned by DBSCAN for each data point to the original dataframe and then write a CSV with the same information.
The code below is doing exactly what I would expect at this point; I would just like to extend it to record the label for each row in the original dataframe (a minimal sketch of that step follows the code).
import argparse
import string
import os, subprocess
import pathlib
import glob
import gzip
import re
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from sklearn.cluster import DBSCAN
X = pd.read_csv(tmp_csv_name)
X = X.drop('Name', axis = 1)
X = X.drop('Type', axis = 1)
X = X.drop('SomeValue', axis = 1)
# only columns 'x' and 'y' now remain
db = DBSCAN(eps=EPS, min_samples=minSamples, metric='euclidean', algorithm='auto', leaf_size=30).fit(X)
labels = db.labels_  # the fitted estimator is `db`, so read labels_ from it
unique_labels = set(labels)
# maxX, maxY are manual inputs temporarily
sizeX, sizeY = maxX, maxY  # assumed initialization: shrink from the data extent below
while sizeX > 16 or sizeY > 16:
    sizeX = sizeX * 0.8
    sizeY = sizeY * 0.8
fig, ax = plt.subplots(figsize=(sizeX, sizeY))
plt.xlim(0,maxX)
plt.ylim(0,maxY)
colors = labels  # assumed: color each point by its cluster label
plt.scatter(X['x'], X['y'], c=colors, marker="o", picker=True)
# hackX , hackY are manual inputs temporarily
# which represent the boundaries defined in the original dataset
poly = patches.Polygon(xy=list(zip(hackX,hackY)), fill=False)
ax.add_patch(poly)
plt.show()
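A minimal sketch of the missing annotation step, assuming X still has one row per input point: DBSCAN keeps labels_ aligned with the rows of the fitted data, so they can be attached directly and written out (the output filename is hypothetical).
# attach each point's cluster label to the dataframe; -1 marks noise
X['cluster'] = db.labels_
# write the annotated data back out (hypothetical filename)
X.to_csv('clustered_points.csv', index=False)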

Different clustering results on Azure databricks (cloud) vs Jupyter Notebook (local) with same seed

I ran sklearn KMeans clustering on a dataset with the same code and seed on both setups, so why do I still get different clustering results?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import sklearn.cluster as cluster
import sklearn.metrics as metrics
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
df = pd.read_csv("sample_dataset.csv")
#return only unique records
df2 = df.drop_duplicates(subset=["ID"], keep = "first")
#remove outliers
df2[["AMT_INCOME_TOTAL", "DAYS_BIRTH"]] = df2[["AMT_INCOME_TOTAL", "DAYS_BIRTH"]].astype("float")
iqr = df2["AMT_INCOME_TOTAL"].quantile(q=0.75) - df2["AMT_INCOME_TOTAL"].quantile(q=0.25)
upperbound = df2["AMT_INCOME_TOTAL"].quantile(q=0.75) + 1.5 * iqr
#keep records at or below the upper bound
df2 = df2.where(df2["AMT_INCOME_TOTAL"] <= upperbound)
df2 = df2[df2["AMT_INCOME_TOTAL"].notna()]
df2["AGE"] = round(abs(df2["DAYS_BIRTH"]/365))
df3 = df2[["ID","AMT_INCOME_TOTAL","AGE"]]
#for simplicity, do 2 cols only: amt income and age
#min max normalisation
scaler = MinMaxScaler()
scale = scaler.fit_transform(df3[["AMT_INCOME_TOTAL","AGE"]])
df_scale = pd.DataFrame(scale, columns=["AMT_INCOME_TOTAL","AGE"])
X= df_scale.values
#best K is 3
k_means_best = KMeans(n_clusters=3, init="k-means++", random_state=101)
y= k_means_best.fit_predict(X)
I tried two local machines and both produce the same results, but when tested on Azure Databricks, the results are different.
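One plausible cause, offered as a guess rather than a diagnosis: with a fixed random_state the k-means++ initialization is deterministic, but floating-point results can still differ across scikit-learn versions and BLAS builds, so comparing the two stacks is a reasonable first check.
import sys
import numpy as np
import sklearn

# print the stack in both environments and diff the output
print(sys.version)
print("numpy:", np.__version__)
print("scikit-learn:", sklearn.__version__)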

Pandas dataframe from numpy with last dimension as object

I want to convert a 3d numpy array to a pandas dataframe with numpy arrays as elements.
Are there any other solutions? What about speed?
import numpy as np
import pandas as pd
ones = np.ones((2,3,5))
temp = [[np.array(column_elem, dtype=object) for column_elem in row] for row in ones]  # np.object is removed in modern NumPy; use the builtin object
df = pd.DataFrame(temp)
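An alternative sketch: iterating a 3d array yields its 2d rows, and listing a 2d row yields its 1d slices, so the inner np.array() calls can be skipped entirely; this tends to be faster because no per-element conversion happens.
# each cell becomes a length-5 array; note the cells are views into `ones`
df2 = pd.DataFrame([list(row) for row in ones])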

Python 3: Import files of floats and give them unique identifiers for plotting

I have two files named "Posterior_C.txt" and "Posterior_l.txt", each containing 5000 float entries, that I would like to import and concatenate into a dataframe (for plotting in seaborn). Each entry belonging to Posterior_C should be given a label C and each entry belonging to Posterior_l should be called l.
How can I import the data and concatenate them, while creating a unique identifier for each? E.g.
0.012 Posterior_C
0.0021 Posterior_C
0.2 Posterior_l
0.52 Posterior_l
This is what I've got so far:
import pandas as pd
import numpy as np
C=np.loadtxt("Posterior_C.txt")
l=np.loadtxt("Posterior_l.txt")
df = {C, l}  # note: this is a set literal, and numpy arrays are unhashable, so this line fails
df = pd.DataFrame(df)
xc = np.array(["C"])
c = np.repeat(xc, 5000, axis=0)
xl = np.array(["l"])
l = np.repeat(xl, 5000, axis=0)
But a bit stuck now.
In R I would do:
C <- read.table("Posterior_C.txt", header=FALSE)
l <- read.table("Posterior_l.txt", header=FALSE)
df <- rbind(C, l)
df <- as.data.frame(df)
dfID <- rbind(rep("C", NROW(C)), rep("l", NROW(l)))
df$ID <- cbind(df, dfID[,1])
or something similar.
Something like this:
c = pd.read_table("Posterior_C.txt", header=None)
l = pd.read_table("Posterior_l.txt", header=None)
c['ID'] = 'C'
l['ID'] = 'l'
df = pd.concat([c, l], ignore_index=True)
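With header=None, pandas names the value column 0; an optional rename makes later plotting calls easier to read.
df = df.rename(columns={0: 'value'})  # optional, for readability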

How do I map df column values to hex color in one go?

I have a pandas dataframe with two columns. One column's values need to be mapped to colors in hex. Another graphing process takes over from there.
This is what I have tried so far. Part of the toy code is taken from here.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
# Create dataframe
df = pd.DataFrame(np.random.randint(0,21,size=(7, 2)), columns=['some_value', 'another_value'])
# Add a NaN to mimic real-world data
df.iloc[-1] = np.nan
# Try to map values to colors in hex
# Taken from here
norm = matplotlib.colors.Normalize(vmin=0, vmax=21, clip=True)
mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
df['some_value_color'] = df['some_value'].apply(lambda x: mapper.to_rgba(x))
df
Which outputs RGBA tuples in 'some_value_color' rather than hex strings.
How do I convert the 'some_value' df column values to hex in one go?
Ideally using sns.cubehelix_palette(light=1).
I am not opposed to using something other than matplotlib.
Thanks in advance.
You may use matplotlib.colors.to_hex() to convert a color to hexadecimal representation.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
# Create dataframe
df = pd.DataFrame(np.random.randint(0,21,size=(7, 2)), columns=['some_value', 'another_value'])
# Add a NaN to mimic real-world data
df.iloc[-1] = np.nan
# Try to map values to colors in hex
# Taken from here
norm = matplotlib.colors.Normalize(vmin=0, vmax=21, clip=True)
mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
df['some_value_color'] = df['some_value'].apply(lambda x: mcolors.to_hex(mapper.to_rgba(x)))
df
Efficiency
The above method is easy to use, but may not be very efficient. In the following, let's compare some alternatives.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
def create_df(n=10):
    # Create dataframe
    df = pd.DataFrame(np.random.randint(0, 21, size=(n, 2)),
                      columns=['some_value', 'another_value'])
    # Add a NaN to mimic real-world data
    df.iloc[-1] = np.nan
    return df
The following is the solution from above. It applies the conversion to the dataframe row by row, which is quite inefficient.
def apply1(df):
    # map values to colors in hex via
    # matplotlib to_hex by pandas apply
    norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values),
                             vmax=np.nanmax(df['some_value'].values), clip=True)
    mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
    df['some_value_color'] = df['some_value'].apply(lambda x: mcolors.to_hex(mapper.to_rgba(x)))
    return df
That's why we might instead compute the values into a numpy array first and assign that array as the new column.
def apply2(df):
    # map values to colors in hex via
    # matplotlib to_hex by assigning numpy array as column
    norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values),
                             vmax=np.nanmax(df['some_value'].values), clip=True)
    mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
    a = mapper.to_rgba(df['some_value'])
    df['some_value_color'] = np.apply_along_axis(mcolors.to_hex, 1, a)
    return df
Finally, we may use a look-up table (LUT) created from the matplotlib colormap and index the LUT by the normalized data. Because this solution needs to create the LUT first, it is rather inefficient for dataframes with fewer entries than the LUT has colors, but it pays off for large dataframes.
def apply3(df):
    # map values to colors in hex via
    # a hex look-up table indexed by the normalized data
    norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values),
                             vmax=np.nanmax(df['some_value'].values), clip=True)
    lut = plt.cm.viridis(np.linspace(0, 1, 256))
    lut = np.apply_along_axis(mcolors.to_hex, 1, lut)
    a = (norm(df['some_value'].values) * 255).astype(np.int16)
    df['some_value_color'] = lut[a]
    return df
Compare the timings
Let's take a dataframe with 10000 rows.
df = create_df(10000)
Original solution (apply1)
%timeit apply1(df)
2.66 s per loop
Array solution (apply2)
%timeit apply2(df)
240 ms per loop
LUT solution (apply3)
%timeit apply3(df)
7.64 ms per loop
In this case the LUT solution gives an improvement of almost a factor of 400.
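The question asked for sns.cubehelix_palette(light=1); the same LUT trick works with it, since a seaborn palette is just a list of RGB tuples (a sketch under that assumption):
import numpy as np
import seaborn as sns
import matplotlib.colors as mcolors

# build a 256-entry hex LUT from the requested seaborn palette
palette = sns.cubehelix_palette(light=1, n_colors=256)
lut = np.array([mcolors.to_hex(c) for c in palette])
# then index it with normalized data exactly as in apply3:
# df['some_value_color'] = lut[(norm(df['some_value'].values) * 255).astype(np.int16)]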
