How to classify frequency domain data in Python 3? (could not convert string to float) - python-3.x

I've converted a DataFrame from the time domain to the frequency domain using:
df = np.fft.fft(df)
Now I need to classify the data using several machine learning algorithms, such as Random Forest and Gaussian Naive Bayes. The problem is that I keep getting this error:
could not convert string to float: '(2.9510193818016135-0.47803712350473193j)'
I tried to convert the strings to floats in DataFrame but it is still giving me the same error.
How can I solve this problem in order to get my classification results?

Assuming your results are of the following form, you first need to cast the strings to a complex type:
In[84]:
# data setup
df = pd.DataFrame({'fft':['(2.9510193818016135-0.47803712350473193j)']})
df
Out[84]:
fft
0 (2.9510193818016135-0.47803712350473193j)
Now cast to complex type:
In[85]:
df['complex'] = df['fft'].apply(complex)
df
Out[85]:
fft complex
0 (2.9510193818016135-0.47803712350473193j) (2.9510193818-0.478037123505j)
Now you can extract the polar coordinates using apply with cmath.polar:
In[86]:
import cmath
df['polar_x'] = df['complex'].apply(lambda x: cmath.polar(x)[0])
df['polar_y'] = df['complex'].apply(lambda x: cmath.polar(x)[1])
df
Out[86]:
fft complex \
0 (2.9510193818016135-0.47803712350473193j) (2.9510193818-0.478037123505j)
polar_x polar_y
0 2.989487 -0.160595
Now the dtypes are compatible, so you can pass the float columns to your classifier:
In[87]:
df.dtypes
Out[87]:
fft object
complex complex128
polar_x float64
polar_y float64
dtype: object
You can also use cmath.rect if desired.
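Putting the pieces together for the original classification problem, here is a minimal end-to-end sketch (the random data and labels below are placeholders for your own) that turns the complex FFT output into real-valued magnitude and phase features and fits a scikit-learn Random Forest:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: 100 signals of length 8 with made-up binary labels
df = pd.DataFrame(np.fft.fft(np.random.rand(100, 8)))
labels = np.random.randint(0, 2, size=100)

# Real-valued features: magnitude and phase of each complex coefficient
magnitude = np.abs(df.to_numpy())
phase = np.angle(df.to_numpy())
features = np.hstack([magnitude, phase])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(features, labels)
print(clf.score(features, labels))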

Related

Convert pandas Series of dtype 'datetime64' into dtype 'np.int' without iterating

Consider the below sample pandas DataFrame:
df = pd.DataFrame({'date':[pd.to_datetime('2016-08-11 14:09:57.00'),pd.to_datetime('2016-08-11 15:09:57.00'),pd.to_datetime('2016-08-11 16:09:57.8700')]})
I can convert a single instance into np.int64 type with
print(df.date[0].value)
1470924597000000000
or convert the entire column iteratively with
df.date.apply(lambda x: x.value)
How can I achieve this without using iteration? Something like:
df.date.value
I would also want to convert the np.int64 object back to a pd.Timestamp object without using iteration. I got some insights from solutions posted here and here, but they don't solve my problem.
As per @anky's comment above, the solution was straightforward:
df.date.astype('int64')    # to int64 dtype
pd.to_datetime(df.date)    # from int dtype back to datetime64 dtype
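A quick sketch of the full round trip, reusing the sample frame from the question (the printed dtypes are what you should see):
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2016-08-11 14:09:57',
                                           '2016-08-11 15:09:57',
                                           '2016-08-11 16:09:57.87'])})

ints = df.date.astype('int64')   # vectorised, no apply/iteration
back = pd.to_datetime(ints)      # interprets the ints as nanoseconds since the epoch

print(ints.dtype)   # int64
print(back.dtype)   # datetime64[ns]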

Converting an object (containing strings and integers) Pandas dataframe to a scipy sparse matrix

I have a dataframe with two columns: one is the medicine name, of dtype object, containing medicine names, a few of which are followed by their mg (e.g. Avil25 in one row and Avil50 in another); the other is Price, of dtype int. I'm trying to convert the medicine name column into a scipy csr_matrix using the following lines of code:
from scipy.sparse import csr_matrix
sparse_matrix = csr_matrix(medName)
I am getting the following error message:
TypeError: no supported conversion for types: (dtype('O'),)
As an alternative, I tried to remove the integers from the column using medName.str.replace('\d+', '') and then tried sparse_matrix = csr_matrix(medName.astype(str)). Still, I am getting the same error.
What's going on wrong here?
What is another way to convert this dataframe to csr matrix?
You will have to encode the strings as numeric data types for them to be made sparse. One solution (probably not the most memory efficient) is to make a networkx graph where the string words are the nodes; using the node list of the graph, you can keep track of the word-to-numeric mapping.
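If a graph feels heavyweight, a common alternative (an assumption on my part, not part of the original answer) is to let scikit-learn's CountVectorizer do the word-to-column mapping and hand back a CSR matrix directly; medName below is a hypothetical stand-in for your column:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

medName = pd.Series(['Avil25', 'Avil50', 'Paracetamol'])   # hypothetical sample column

vec = CountVectorizer()
sparse_matrix = vec.fit_transform(medName)   # scipy.sparse CSR matrix

print(type(sparse_matrix))
print(vec.vocabulary_)                       # word -> column index mapping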

Calculate cosine similarity between two columns of Spark DataFrame in PySpark [duplicate]

I have a dataframe in Spark in which one of the columns contains an array. Now, I have written a separate UDF which converts the array to another array with only distinct values in it. See the example below:
Ex: [24,23,27,23] should get converted to [24, 23, 27]
Code:
def uniq_array(col_array):
    x = np.unique(col_array)
    return x

uniq_array_udf = udf(uniq_array, ArrayType(IntegerType()))
Df3 = Df2.withColumn("age_array_unique", uniq_array_udf(Df2.age_array))
In the above code, Df2.age_array is the array on which I am applying the UDF to get a different column "age_array_unique" which should contain only unique values in the array.
However, as soon as I run the command Df3.show(), I get the error:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
Can anyone please let me know why this is happening?
Thanks!
The source of the problem is that the object returned from the UDF doesn't conform to the declared type. np.unique not only returns a numpy.ndarray but also converts numerics to the corresponding NumPy types, which are not compatible with the DataFrame API. You can try something like this:
udf(lambda x: list(set(x)), ArrayType(IntegerType()))
or this (to keep order):
from collections import OrderedDict

udf(lambda xs: list(OrderedDict((x, None) for x in xs)),
    ArrayType(IntegerType()))
instead.
If you really want np.unique you have to convert the output:
udf(lambda x: np.unique(x).tolist(), ArrayType(IntegerType()))
You need to convert the final value to a Python list. Implement the function as follows:
def uniq_array(col_array):
    x = np.unique(col_array)
    return list(x)
This is because Spark doesn't understand the NumPy array format. In order to feed Spark a Python object that DataFrames understand as an ArrayType, you need to convert the output to a Python list before returning it.
I also got this error when my UDF returned a float but I forgot to cast it as a float. I needed to do this:
retval = 0.5
return float(retval)
As of PySpark version 2.4, you can use the array_distinct transformation:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_distinct
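A minimal usage sketch (assuming Spark >= 2.4; the tiny frame below stands in for Df2 from the question):
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_distinct

spark = SparkSession.builder.getOrCreate()
Df2 = spark.createDataFrame([([24, 23, 27, 23],)], ["age_array"])

# No UDF or Python round trip needed: array_distinct runs natively on the JVM side
Df3 = Df2.withColumn("age_array_unique", array_distinct(Df2.age_array))
Df3.show(truncate=False)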
The below works fine for me:
udf(lambda x: np.unique(x).tolist(), ArrayType(IntegerType()))
[x.item() for x in <any numpy array>]
converts it to plain Python.

Changing column datatype from Timestamp to datetime64

I have a database I'm reading from Excel as a pandas dataframe, and the dates come in as Timestamp dtype, but I need them to be np.datetime64 so that I can make calculations.
I am aware that the function pd.to_datetime() and the astype(np.datetime64[ns]) method do work. However, I am unable to update my dataframe to yield this datatype, for whatever reason, using the code mentioned above.
I have also tried creating an accessory dataframe from the original one, with just the dates whose typing I wish to update, converting it to np.datetime64, and plugging it back into the original dataframe:
dfi = df['dates']
dfi = pd.to_datetime(dfi)
df['dates'] = dfi
But still it doesn't work. I have also tried updating values one by one:
arr_i = df.index
for i in range(len(arr_i)):
    df.at[arr_i[i], 'dates'].to_datetime64()
Edit
The root problem seems to be that the dtype of the column gets updated to np.datetime64, but somehow, when getting single values from within it, they still have dtype Timestamp.
Does anyone have a suggestion of a workaround that is fairly fast?
Pandas tries to standardize all forms of datetimes by storing them as NumPy datetime64[ns] values when you assign them to a DataFrame. But when you try to access individual datetime64 values, they are returned as Timestamps.
There is, however, a way to prevent this automatic conversion from happening: wrap the list of values in a Series of dtype object:
import numpy as np
import pandas as pd
# create some dates, merely for example
dates = pd.date_range('2000-1-1', periods=10)
# convert the dates to a *list* of datetime64s
arr = list(dates.to_numpy())
# wrap the values you wish to protect in a Series of dtype object.
ser = pd.Series(arr, dtype='object')
# assignment with `df['datetime64s'] = ser` would also work
df = pd.DataFrame({'timestamps': dates,
                   'datetime64s': ser})
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 10 entries, 0 to 9
# Data columns (total 2 columns):
# timestamps 10 non-null datetime64[ns]
# datetime64s 10 non-null object
# dtypes: datetime64[ns](1), object(1)
# memory usage: 240.0+ bytes
print(type(df['timestamps'][0]))
# <class 'pandas._libs.tslibs.timestamps.Timestamp'>
print(type(df['datetime64s'][0]))
# <class 'numpy.datetime64'>
But beware! Although with a little work you can circumvent Pandas' automatic conversion mechanism, it may not be wise to do this. First, converting a NumPy array to a list is usually a sign you are doing something wrong, since it is bad for performance. Using object arrays is a bad sign, since operations on object arrays are generally much slower than equivalent operations on arrays of native NumPy dtypes.
You may be looking at an XY problem -- it may be more fruitful to find a way to (1) work with Pandas Timestamps instead of trying to force Pandas to return NumPy datetime64s, or (2) work with datetime64 array-likes (e.g. Series or NumPy arrays) instead of handling values individually (which causes the coercion to Timestamps).
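For option (2), a small sketch of what column-level work looks like: operate on the whole Series (or its underlying datetime64[ns] array) rather than pulling out single values, which is what triggers the Timestamp wrapping. The example data here is made up:
import numpy as np
import pandas as pd

df = pd.DataFrame({'dates': pd.date_range('2000-01-01', periods=5)})

arr = df['dates'].to_numpy()                 # datetime64[ns] array, no Timestamps involved
deltas = arr - np.datetime64('2000-01-01')   # vectorised timedelta64 arithmetic

print(arr.dtype)    # datetime64[ns]
print(deltas[:3])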

Does Spark.ml LogisticRegression assume numerical features only?

I was looking at the Spark 1.5 DataFrame/Row API and the implementation of logistic regression. As I understand it, the train method therein first converts the dataframe to RDD[LabeledPoint] as:
override protected def train(dataset: DataFrame): LogisticRegressionModel = {
  // Extract columns from data. If dataset is persisted, do not persist oldDataset.
  val instances = extractLabeledPoints(dataset).map {
    case LabeledPoint(label: Double, features: Vector) => (label, features)
  }
...
And then it proceeds to feature standardization, etc.
What I am confused about is that the DataFrame is of type RDD[Row], and a Row is allowed to have any value types; for example, (1, true, "a string", null) seems a valid row of a dataframe. If that is so, what does the extractLabeledPoints above mean? It seems it is selecting only Array[Double] as the feature values in the Vector. What happens if a column in the dataframe is strings? Also, what happens to integer categorical values?
Thanks in advance,
Nikhil
Let's ignore Spark for a moment. Generally speaking, linear models, including logistic regression, expect numeric independent variables. This is not in any way specific to Spark / MLlib. If the input contains categorical or ordinal variables, these have to be encoded first. Some languages, like R, handle this in a transparent manner:
> df <- data.frame(x1 = c("a", "b", "c", "d"), y=c("aa", "aa", "bb", "bb"))
> glm(y ~ x1, df, family="binomial")
Call: glm(formula = y ~ x1, family = "binomial", data = df)
Coefficients:
(Intercept) x1b x1c x1d
-2.357e+01 -4.974e-15 4.713e+01 4.713e+01
...
but what is really used behind the scenes is a so-called design matrix:
> model.matrix( ~ x1, df)
(Intercept) x1b x1c x1d
1 1 0 0 0
2 1 1 0 0
3 1 0 1 0
4 1 0 0 1
...
Skipping over the details, it is the same type of transformation as the one performed by the OneHotEncoder in Spark:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val df = sqlContext.createDataFrame(Seq(
  Tuple1("a"), Tuple1("b"), Tuple1("c"), Tuple1("d")
)).toDF("x").repartition(1)

val indexer = new StringIndexer()
  .setInputCol("x")
  .setOutputCol("xIdx")
  .fit(df)

val indexed = indexer.transform(df)

val encoder = new OneHotEncoder()
  .setInputCol("xIdx")
  .setOutputCol("xVec")

val encoded = encoder.transform(indexed)

encoded
  .select($"xVec")
  .map(_.getAs[Vector]("xVec").toDense)
  .foreach(println)
Spark goes one step further: all features, even if the algorithm allows nominal/ordinal independent variables, have to be stored as Double using a spark.mllib.linalg.Vector. In the case of spark.ml it is a DataFrame column; in spark.mllib, a field of spark.mllib.regression.LabeledPoint.
Depending on the model, the interpretation of the feature vector can be different, though. As mentioned above, for linear models these will be interpreted as numerical variables. For Naive Bayes these are considered nominal. If a model accepts both numerical and nominal variables and treats each group in a different way, like decision / regression trees do, you can provide the categoricalFeaturesInfo parameter.
It is worth pointing out that dependent variables should be encoded as Double as well but, unlike independent variables, may require additional metadata to be handled properly. If you take a look at the indexed DataFrame you'll see that StringIndexer not only transforms x, but also adds attributes:
scala> org.apache.spark.ml.attribute.Attribute.fromStructField(indexed.schema(1))
res12: org.apache.spark.ml.attribute.Attribute = {"vals":["d","a","b","c"],"type":"nominal","name":"xIdx"}
Finally some Transformers from ML, like VectorIndexer, can automatically detect and encode categorical variables based on the number of distinct values.
Copying clarification from zero323 in the comments:
Categorical values, before being passed to MLlib / ML estimators, have to be encoded as Double. There are quite a few built-in transformers, like StringIndexer or OneHotEncoder, which can be helpful here. If an algorithm treats categorical features in a different manner than numerical ones, like for example DecisionTree, you identify which variables are categorical using categoricalFeaturesInfo.
Finally some transformers use special attributes on columns to distinguish between different types of attributes.
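For completeness, roughly the same StringIndexer / OneHotEncoder pipeline in PySpark, to match the Python focus of this page (a sketch assuming Spark 3.x, where OneHotEncoder is an Estimator with inputCols/outputCols):
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, OneHotEncoder

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",), ("d",)], ["x"])

# Map strings to numeric indices, then one-hot encode the indices
indexer = StringIndexer(inputCol="x", outputCol="xIdx").fit(df)
indexed = indexer.transform(df)

encoder = OneHotEncoder(inputCols=["xIdx"], outputCols=["xVec"]).fit(indexed)
encoded = encoder.transform(indexed)

encoded.select("x", "xIdx", "xVec").show()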
