I'm learning about Series and I'm using VS Code to do exercises to learn its usage, but when I typed this
current_series_add = pd.Series()
in the terminal it shows me a message telling me that "The default dtype for empty Series will be 'object' instead of 'float64' in a future version".
How can I specify a dtype?
As the docs say:
class pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)[source]¶
...
dtype : str, numpy.dtype, or ExtensionDtype, optional
Data type for the output Series. If not specified, this will be
inferred from data. See the user guide for more usages.
Example:
In [1]: import pandas as pd
In [2]: pd.Series(dtype=int)
Out[2]: Series([], dtype: int64)
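If you want to match the future default and silence the warning, you can pass the dtype explicitly. A minimal sketch (whether object or a numeric dtype is right depends on what you plan to store in the Series):
import pandas as pd

# an empty Series with an explicit dtype; no FutureWarning is raised
current_series_add = pd.Series(dtype=object)
print(current_series_add)   # Series([], dtype: object)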
I want to write code that reads several .txt and .ASC files. All of them have to be run through some functions, so I thought it would be great to have a script that does this automatically.
The .txt files contain more data (product, number, color, size) than the .ASC files (product, number, size), so I have to adjust the header of each.
So, this is the first part of what I thought my script could look like.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import new_methods as nem
import sys
sys.path.append("../../src/")

path_data = "C:///Users///"
fids = [file for file in os.listdir(path_data)]
d = dict()
for i in fids:
    if i[-1] == 't':
        d.update({i: nem.df(path_data + i, header_lines=1)})
    elif i[-1] == 'C':
        d.update({i: nem.df(path_data + i, header_lines=0)})

for val in d.values():
    txt_fid = d[val]
    dh_txt = nem.sort(txt_fid)
But it gives a TypeError:
TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
It does work if I change the last part to
txt_fid=d['specific.txt']
dh_txt=nem.sort(txt_fid)
But like this I have to change the key manually for every txt file.
As the error says, a dictionary key cannot be mutable (such as a DataFrame), and that is what you are using when you write d[val], because val is already a DataFrame: iterating over d.values() gives you the values, not the keys.
Did you mean to use the value of the dictionary or did you want the keys? Or some element of the DataFrame perhaps?
If you want the keys and not the values, you can simply do for val in d: instead.
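If you actually need both the filename and the DataFrame, iterating over d.items() gives you each key together with its value, so no lookup is needed at all. A minimal sketch, assuming d and nem are set up as in the question:
# iterate over (key, value) pairs instead of looking the value up again
for fname, txt_fid in d.items():
    dh_txt = nem.sort(txt_fid)   # same call as in the question
    print(fname)                 # hypothetical: do something with dh_txt here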
The numpy.genfromtxt page on the SciPy site shows the following code. I cannot make sense of it, especially the dtype argument and the part that reads the string.
from io import StringIO
import numpy as np
s=StringIO(u"1,1.3,abced")
data=np.genfromtxt(s, dtype=[('myint', 'i8'),('myfloat','f8'), ('mystring','S5')], delimiter=",")
OK, I get that 1, 1.3 and abced are being read from s = StringIO(u"1,1.3,abced"). But what does the u do?
Also, I get that i8 is an 8-byte integer. But what do 'myint', 'myfloat' and 'mystring' do?
The u prefix is for 'unicode', the default string type in Python 3, so it isn't needed here. The StringIO isn't needed either; I just give genfromtxt a list of strings:
In [221]: txt = ["1,1.3,abced"]
In [223]: np.genfromtxt(txt,
              dtype=[('myint', 'i8'), ('myfloat', 'f8'), ('mystring', 'S5')],
              delimiter=",")
Out[223]:
array((1, 1.3, b'abced'),
dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', 'S5')])
The dtype defines a compound dtype, one with 3 fields, one for each column. You access fields by name:
data['myint']
data['myfloat']
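Put together as a small self-contained sketch (the commented values are the three fields of the single parsed record):
import numpy as np

txt = ["1,1.3,abced"]
data = np.genfromtxt(txt,
                     dtype=[('myint', 'i8'), ('myfloat', 'f8'), ('mystring', 'S5')],
                     delimiter=",")
# each field name selects one column of the structured array
print(data['myint'])     # 1
print(data['myfloat'])   # 1.3
print(data['mystring'])  # b'abced'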
I was playing with the Titanic dataset on Kaggle (https://www.kaggle.com/c/titanic/data), and I want to use LabelEncoder from sklearn.preprocessing to transform Sex, originally labeled as 'male' or 'female', into 0 or 1. I had the following four lines of code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv('titanic.csv')
df['Sex'] = LabelEncoder.fit_transform(df['Sex'])
But when I ran it I received the following error message:
TypeError: fit_transform() missing 1 required positional argument: 'y'
The error comes from line 4, i.e.,
df['Sex'] = LabelEncoder.fit_transform(df['Sex'])
I wonder what went wrong here. I know I could also do the transformation using map, which might be even simpler, but I still want to know what's wrong with my usage of LabelEncoder.
Have a look at the sklearn documentation for LabelEncoder: it is a utility class, and you need to create an instance with LabelEncoder() before calling fit_transform:
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])
Testing with example:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# create a test series
gender = pd.Series(['male', 'female', 'male'])
le = LabelEncoder()
transformed_val = le.fit_transform(gender)

# check the result after using the label encoder
print(transformed_val)
Results:
[1 0 1]
You're just missing the () which initializes a LabelEncoder instance.
This will work: LabelEncoder().fit_transform(df['Sex'])
Having said that, 0p3n5ourcE's example is more conventional, and also a cleaner way to handle objects.
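As a side note, keeping the fitted encoder around also lets you map the integer codes back to the original labels later. A small sketch, assuming the same toy gender Series as above:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

gender = pd.Series(['male', 'female', 'male'])
le = LabelEncoder()
codes = le.fit_transform(gender)      # array([1, 0, 1])
print(le.classes_)                    # ['female' 'male']
print(le.inverse_transform(codes))    # ['male' 'female' 'male']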
I have a stock price line graph which works; however, I wanted to use the fill_between function. I have tried passing in the values directly from the series and also creating lists etc., and nothing works. Is this possible?
myDF = pd.read_csv('C:/Workarea/OneDrive/PyProjects/Learning/stocks_sentdex/GOOG-LON_TSCO.csv')
print(myDF)
myDF = myDF.set_index('Date')
myDF['Close'].plot()
plt.fill_between(?, 0, ?, alpha=0.3)
plt.xlabel('Date')
plt.ylabel('Price')
plt.title('Check it out')
plt.legend()
plt.subplots_adjust(left=0.09,bottom=0.16, right=0.94,top=0.90, wspace=0.2, hspace=0)
plt.show()
All the examples I have seen use their own data or read from a urllib. All help greatly appreciated.
import pandas as pd
import pandas_datareader.data as pdata
import matplotlib.pyplot as plt
# myDF = pd.read_csv('C:/Workarea/OneDrive/PyProjects/Learning/stocks_sentdex/GOOG-LON_TSCO.csv')
# myDF = myDF.set_index('Date')
myDF = pdata.get_data_google('LON:TSCO', start='2009-01-02', end='2009-12-31')
fig, ax = plt.subplots()
ax.fill_between(myDF.index, 0, myDF['Close'], alpha=0.3, label='LON:TSCO')
ax.set_xlabel('Date')
ax.set_ylabel('Price')
ax.set_title('Check it out')
ax.legend()
fig.subplots_adjust(left=0.09,bottom=0.16, right=0.94,top=0.90, wspace=0.2, hspace=0)
fig.autofmt_xdate()
plt.show()
The error message
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'
could occur if either myDF.index or myDF['Close'] is an object array. As a simple example,
In [110]: plt.fill_between(np.array([1,2], dtype='O'), 0, np.array([1,2], dtype='O'))
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Chances are the Date values are mere strings rather than datetime-like objects. To fix this, use pd.to_datetime(myDF['Date']) to convert the date strings into datetime-like objects:
myDF = pd.read_csv('C:/Workarea/OneDrive/PyProjects/Learning/stocks_sentdex/GOOG-LON_TSCO.csv')
myDF['Date'] = pd.to_datetime(myDF['Date'])
myDF = myDF.set_index('Date')
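Alternatively, pd.read_csv can parse the dates and set the index in one step (a minimal sketch, assuming the CSV has a Date column as in the question):
myDF = pd.read_csv('C:/Workarea/OneDrive/PyProjects/Learning/stocks_sentdex/GOOG-LON_TSCO.csv',
                   parse_dates=['Date'],   # convert the Date column to datetimes while reading
                   index_col='Date')       # and use it as the index directly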
I have a custom distance metric that I need to use for KNN, K Nearest Neighbors.
I tried following this, but I cannot get it to work for some reason.
I would assume that the distance metric is supposed to take two vectors/arrays of the same length, as I have written below:
import sklearn
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

def d(a, b, L):
    # Inputs: a and b are rows from a data matrix
    return a + b + 2 + L

knn = NearestNeighbors(n_neighbors=1,
                       algorithm='auto',
                       metric='pyfunc',
                       func=lambda a, b: d(a, b, L))

X = pd.DataFrame({'b': [0, 3, 2], 'c': [1.0, 4.3, 2.2]})
knn.fit(X)
However, when I call knn.kneighbors(), it doesn't seem to like the custom function. Here is the bottom of the error stack:
ValueError: Unknown metric pyfunc. Valid metrics are ['euclidean', 'l2', 'l1', 'manhattan', 'cityblock', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'cosine', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule', 'wminkowski'], or 'precomputed', or a callable
However, this is exactly what is done in the question I cited. Any ideas on how to make this work on sklearn version 0.14? I'm not aware of any differences between the versions.
Thanks.
The documentation is actually pretty clear on the use of the metric argument:
metric : string or callable, default ‘minkowski’
metric to use for distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used.
If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should
take two arrays as input and return one value indicating the distance
between them. This works for Scipy’s metrics, but is less efficient
than passing the metric name as a string.
Thus (as the error message also says), metric should be a callable, not a string, and it should accept two arrays as arguments and return one value, which is exactly what your lambda function does.
Your code can therefore be simplified to:
import sklearn
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

def d(a, b, L):
    return a + b + 2 + L

# note: L must be defined in the enclosing scope for the lambda to work
knn = NearestNeighbors(n_neighbors=1,
                       algorithm='auto',
                       metric=lambda a, b: d(a, b, L))

X = pd.DataFrame({'b': [0, 3, 2], 'c': [1.0, 4.3, 2.2]})
knn.fit(X)
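For reference, here is a minimal self-contained sketch of a callable metric that returns a single scalar per pair of rows (a plain Manhattan distance; the extra L parameter from the question is dropped for simplicity):
import numpy as np
from sklearn.neighbors import NearestNeighbors

def manhattan(a, b):
    # a and b are 1-D rows; a callable metric must return one scalar
    return np.abs(a - b).sum()

X = np.array([[0, 1.0], [3, 4.3], [2, 2.2]])
nn = NearestNeighbors(n_neighbors=2, metric=manhattan).fit(X)
distances, indices = nn.kneighbors(X)
print(distances)   # Manhattan distance to each point's 2 nearest rows
print(indices)     # each point's nearest neighbour is itself (distance 0)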