Changing data types after interpolating - python-3.x

from tnorma import tnorma
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tnorma import tnorma
df6 = pd.read_csv('Static06_new.csv')
Z =tnorma(df6)
df = pd.DataFrame(Z)
print(df)
This is a simple enough code. The interpolation code "tnorma" is given here https://pypi.org/project/tnorma/
The Static06_new.csv files contains positional data with time column. The time column is not continuous, hence I am interpolating them.
The interpolation is successful, however, I am unable to convert the result back into a data-frame for further analysis.
The error received when running my code is as follows:
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
values = np.array([convert(v) for v in values])
0
0 [[109.00000000000001, 0.009174311926606001, 35...
1 [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, ...
2 [0, 3694]
Not sure how to proceed. Ideal case I would like it back in a data-frame format and save it as .csv file.
Kind regards,

Related

(Python 3) Classification of dataset, using a user input Elo, to suggest opening move based on Chess dataset?

I'm just looking for a little bit of help. I'm struggling to work out if what I'm doing is right or not, and even if Naive Bayes is even the right way to do this.
I am wanting the user to be able to input their elo, and the 'app' to suggest them a opening move set, based on win rate at that ELO. For this I am using the following dataset: https://www.kaggle.com/datasnaek/chess
The important data out of this, are the opening name (what I'm trying to suggest), the average rating (what the user can input), and winner (we need to see if white wins).
This is my code so far:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from matplotlib.colors import ListedColormap
from sklearn import preprocessing
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
#Read in dataset
data = pd.read_csv(f"games.csv")
# set new column that is true/false depending on if white wins
data['white_wins'] = (data['winner'] == "white")
# Create new columns, average rating (based on white rating and black rating) and category (categorization of rating for Naive Bayes)
data['average_rating'] = data.apply(lambda row: (row['white_rating'] + row['black_rating']) / 2, axis=1)
data['category'] = data['average_rating'] // 100 + 1
# Drop unneccessary columns
data = data.drop(['turns', 'moves', 'victory_status', 'id', 'winner', 'rated', 'created_at', 'last_move_at', 'opening_ply', 'white_id', 'black_id', 'increment_code', 'opening_eco', 'white_rating', 'black_rating'], axis=1)
#Label Encoder Initialisation
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
opening_name_encoded=le.fit_transform(data.opening_name)
category_encoded=le.fit_transform(data.category)
label=le.fit_transform(data.white_wins)
#Package features together
features=zip(opening_name_encoded, category_encoded)
#Create a Gaussian Classifier
model = GaussianNB()
# Train the model using the training sets
model.fit(features,label)
And i currently get the error:
error
Also, i'm not even convinced this is correct, as if i continue down this stream, I am only going to be predicting if white wins based on the opening moveset, and elo. I'm really unsure on where to take this to get it to the point i need.
Thanks for any help!
zip returns in iterator, so your code is not doing what you expect. My guess is that you intended features to be a list of 2-tuples. If that is the case, then adjust your code to features = list(zip(opening_name_encoded, category_encoded))
In [31]: zip([1, 2, 3], ['a', 'b', 'c'])
Out[31]: <zip at 0x25d61abfd80>
In [32]: list(zip([1, 2, 3], ['a', 'b', 'c']))
Out[32]: [(1, 'a'), (2, 'b'), (3, 'c')]

How do I convert a Python DataFrame into a NumPy array

Below is a snippet that converts data into a NumPy array. It is then converted to a Pandas DataFrame where I intend to process it. I'm attempting to convert it back to a NumPy array. I'm failing at this. Badly.
import pandas as pd
import numpy as np
from pprint import pprint
data = [
('2020-11-01 00:00:00', 1.0),
('2020-11-02 00:00:00', 2.0)
]
coordinatesType = [('timestamp', 'datetime64[s]'), ('value', '<f8')]
npArray = np.asarray(data, coordinatesType)
df = pd.DataFrame(data = npArray)
# do some pandas processing, then convert back to a numpy array
mutatedNpArray = df.to_numpy(coordinatesType)
pprint(mutatedNpArray)
# don't suply dtype for kicks
pprint(df.to_numpy())
This yields crazytown:
array([[('2020-11-01T00:00:00', 1.6041888e+18),
('1970-01-01T00:00:01', 1.0000000e+00)],
[('2020-11-02T00:00:00', 1.6042752e+18),
('1970-01-01T00:00:02', 2.0000000e+00)]],
dtype=[('timestamp', '<M8[s]'), ('value', '<f8')])
array([[Timestamp('2020-11-01 00:00:00'), 1.0],
[Timestamp('2020-11-02 00:00:00'), 2.0]], dtype=object)
I realize a DataFrame is really a fancy NumPy array under the hood, but I'm passing back to a function that accepts a simple NumPy array. Clearly I'm not handling dtypes correctly and/or I don't understand the data structure inside my DataFrame. Below is what the function I'm calling expects:
[('2020-11-01T00:00:00', 1.000 ),
('2020-11-02T00:00:00', 2.000 )],
dtype=[('timestamp', '<M8[s]'), ('value', '<f8')])
I'm really lost on how to do this. Or what I should be doing instead.
Help!
As #hpaul suggested, I tried the following:
# ...
df = df.set_index('timestamp')
# do some pandas processing, then convert back to a numpy array
mutatedNpArray = df.to_records(coordinatesType)
# ...
All good!
Besides the to_records approach mentioned in comments, you can do:
df.apply(tuple, axis=1).to_numpy(coordinatesType)
Output:
array([('2020-11-01T00:00:00', 1.), ('2020-11-02T00:00:00', 2.)],
dtype=[('timestamp', '<M8[s]'), ('value', '<f8')])
Considerations:
I believe the issue here is related to the difference between the original array and the dataframe.
The shape your original numpy array is (2,), where each value is a tuple. When creating the dataframe, both df.shape and df.to_numpy() shapes are (2, 2) so that the dtype constructor does not work as expected. When converting rows to tuples into a pd.Series, you get the original shape of (2,).

Matplotlib warning using pandas.DataFrame.plot.scatter()

One windows 10, with versions:
Python 3.5.2, pandas 0.23.4, matplotlib 3.0.0, numpy 1.15.2,
the following code give me the following warning that i would like to sort out
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
# a 5x4 random pandas DataFrame
pf = pd.DataFrame(np.random.random((5,4)), columns=['a', 'b', 'c', 'd'])
# colors:
colors = cm.rainbow(np.linspace(0, 1, 4))
fig1 = pf.plot.scatter('a', 'b', color='k')
for i, j in enumerate(['b', 'c', 'd']):
pf.plot.scatter('a', j, color=colors[i+1], ax = fig1)
And I get a warning:
'c' argument looks like a single numeric RGB or RGBA sequence, which
should be avoided as value-mapping will have precedence in case its
length matches with 'x' & 'y'. Please use a 2-D array with a single
row if you really want to specify the same RGB or RGBA value for all
points.
Could you point me on how to address that warning?
I can't reproduce the warning with matplotlib 3.0 and pandas 0.23.4, but what it says is essentially that you should not use a single RGB tuple to specify a color.
So instead of color=colors[i+1] use
color=[colors[i+1]]

How do I map df column values to hex color in one go?

I have a pandas dataframe with two columns. One of the columns values needs to be mapped to colors in hex. Another graphing process takes over from there.
This is what I have tried so far. Part of the toy code is taken from here.
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
# Create dataframe
df = pd.DataFrame(np.random.randint(0,21,size=(7, 2)), columns=['some_value', 'another_value'])
# Add a nan to handle realworld
df.iloc[-1] = np.nan
# Try to map values to colors in hex
# # Taken from here
norm = matplotlib.colors.Normalize(vmin=0, vmax=21, clip=True)
mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
df['some_value_color'] = df['some_value'].apply(lambda x: mapper.to_rgba(x))
df
Which outputs:
How do I convert 'some_value' df column values to hex in one go?
Ideally using the sns.cubehelix_palette(light=1)
I am not opposed to using something other than matplotlib
Thanks in advance.
You may use matplotlib.colors.to_hex() to convert a color to hexadecimal representation.
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
# Create dataframe
df = pd.DataFrame(np.random.randint(0,21,size=(7, 2)), columns=['some_value', 'another_value'])
# Add a nan to handle realworld
df.iloc[-1] = np.nan
# Try to map values to colors in hex
# # Taken from here
norm = matplotlib.colors.Normalize(vmin=0, vmax=21, clip=True)
mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
df['some_value_color'] = df['some_value'].apply(lambda x: mcolors.to_hex(mapper.to_rgba(x)))
df
Efficiency
The above method it easy to use, but may not be very efficient. In the folling let's compare some alternatives.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
def create_df(n=10):
# Create dataframe
df = pd.DataFrame(np.random.randint(0,21,size=(n, 2)),
columns=['some_value', 'another_value'])
# Add a nan to handle realworld
df.iloc[-1] = np.nan
return df
The following is the solution from above. It applies the conversion to the dataframe row by row. This quite inefficient.
def apply1(df):
# map values to colors in hex via
# matplotlib to_hex by pandas apply
norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values),
vmax=np.nanmax(df['some_value'].values), clip=True)
mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
df['some_value_color'] = df['some_value'].apply(lambda x: mcolors.to_hex(mapper.to_rgba(x)))
return df
That's why we might choose to calculate the values into a numpy array first and just assign this array as the newly created column.
def apply2(df):
# map values to colors in hex via
# matplotlib to_hex by assigning numpy array as column
norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values),
vmax=np.nanmax(df['some_value'].values), clip=True)
mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
a = mapper.to_rgba(df['some_value'])
df['some_value_color'] = np.apply_along_axis(mcolors.to_hex, 1, a)
return df
Finally we may use a look up table (LUT) which is created from the matplotlib colormap, and index the LUT by the normalized data. Because this solution needs to create the LUT first, it is rather ineffienct for dataframes with less entries than the LUT has colors, but will pay off for large dataframes.
def apply3(df):
# map values to colors in hex via
# creating a hex Look up table table and apply the normalized data to it
norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values),
vmax=np.nanmax(df['some_value'].values), clip=True)
lut = plt.cm.viridis(np.linspace(0,1,256))
lut = np.apply_along_axis(mcolors.to_hex, 1, lut)
a = (norm(df['some_value'].values)*255).astype(np.int16)
df['some_value_color'] = lut[a]
return df
Compare the timings
Let's take a dataframe with 10000 rows.
df = create_df(10000)
Original solution (apply1)
%timeit apply1(df)
2.66 s per loop
Array solution (apply2)
%timeit apply2(df)
240 ms per loop
LUT solution (apply3)
%timeit apply1(df)
7.64 ms per loop
In this case the LUT solution gives almost a factor 400 of improvement.

plt.errorbar for X string value

I have a dataframe as below
import pandas as pd
import matplotlib.pylab as plt
df = pd.DataFrame({'name':['one', 'two', 'three'], 'assess':[100,200,300]})
I want to build errorbar like this
c = 30
plt.errorbar(df['name'], df['assess'], yerr=c, fmt='o')
and of course i get
ValueError: could not convert string to float
I can convert string to float, but I'm losing value signatures and maybe there's a more elegant way?
Matplotlib can indeed only work with numerical data. There is an example in the matplotlib collection showing how to handle cases where you have categorical data. The solution is to plot a range of values and set the labels afterwards using plt.xticks(ticks, labels) or a combination of ax.set_xticks(ticks) and ax.set_xticklabels(labels).
In your case the former works fine:
import pandas as pd
import matplotlib.pylab as plt
df = pd.DataFrame({'name':['one', 'two', 'three'], 'assess':[100,200,300]})
c = 30
plt.errorbar(range(len(df['name'])), df['assess'], yerr=c, fmt='o')
plt.xticks(range(len(df['name'])), df['name'])
plt.show()

Resources