How do I convert a Python DataFrame into a NumPy array - python-3.x

Below is a snippet that converts data into a NumPy array. It is then converted to a Pandas DataFrame where I intend to process it. I'm attempting to convert it back to a NumPy array. I'm failing at this. Badly.
import pandas as pd
import numpy as np
from pprint import pprint
data = [
    ('2020-11-01 00:00:00', 1.0),
    ('2020-11-02 00:00:00', 2.0)
]
coordinatesType = [('timestamp', 'datetime64[s]'), ('value', '<f8')]
npArray = np.asarray(data, coordinatesType)
df = pd.DataFrame(data = npArray)
# do some pandas processing, then convert back to a numpy array
mutatedNpArray = df.to_numpy(coordinatesType)
pprint(mutatedNpArray)
# don't supply a dtype, for kicks
pprint(df.to_numpy())
This yields crazytown:
array([[('2020-11-01T00:00:00', 1.6041888e+18),
('1970-01-01T00:00:01', 1.0000000e+00)],
[('2020-11-02T00:00:00', 1.6042752e+18),
('1970-01-01T00:00:02', 2.0000000e+00)]],
dtype=[('timestamp', '<M8[s]'), ('value', '<f8')])
array([[Timestamp('2020-11-01 00:00:00'), 1.0],
[Timestamp('2020-11-02 00:00:00'), 2.0]], dtype=object)
I realize a DataFrame is really a fancy NumPy array under the hood, but I'm passing it back to a function that accepts a simple NumPy array. Clearly I'm not handling dtypes correctly and/or I don't understand the data structure inside my DataFrame. Below is what the function I'm calling expects:
array([('2020-11-01T00:00:00', 1.000),
       ('2020-11-02T00:00:00', 2.000)],
      dtype=[('timestamp', '<M8[s]'), ('value', '<f8')])
I'm really lost on how to do this. Or what I should be doing instead.
Help!
As @hpaul suggested, I tried the following:
# ...
df = df.set_index('timestamp')
# do some pandas processing, then convert back to a numpy array
mutatedNpArray = df.to_records()  # the 'timestamp' index becomes the first field of the record array
# ...
All good!

Besides the to_records approach mentioned in comments, you can do:
df.apply(tuple, axis=1).to_numpy(coordinatesType)
Output:
array([('2020-11-01T00:00:00', 1.), ('2020-11-02T00:00:00', 2.)],
dtype=[('timestamp', '<M8[s]'), ('value', '<f8')])
Considerations:
I believe the issue here is related to the difference in shape between the original array and the DataFrame.
The shape of your original NumPy array is (2,), where each value is a tuple. Once the DataFrame is created, both df.shape and df.to_numpy().shape are (2, 2), so the structured dtype no longer lines up as expected. Converting each row to a tuple (giving a pd.Series of tuples) recovers the original shape of (2,).
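A quick way to see this shape mismatch (a minimal sketch reusing npArray, df, and coordinatesType from the code above):
print(npArray.shape)                                            # (2,): structured array, one tuple per row
print(df.to_numpy().shape)                                      # (2, 2): plain 2-D array, one cell per value
print(df.apply(tuple, axis=1).to_numpy(coordinatesType).shape)  # (2,) again, so the structured dtype applies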

Related

Pandas Series of dates to vlines kwarg in mplfinance plot

import numpy as np
import pandas as pd
df = pd.DataFrame({'dt': ['2021-2-13', '2022-2-15'],
                   'w': [5, 7],
                   'n': [11, 8]})
df.reset_index()
print(list(df.loc[:,'dt'].values))
gives: ['2021-2-13', '2022-2-15']
NEEDED: [('2021-2-13'), ('2022-2-15')]
Important (in response to a comment's question): "NEEDED" is the format in which "mplfinance" accepts the vlines argument for plot (checked). I need to draw vertical lines for the specified dates on the chart's x-axis.
import mplfinance as mpf
RES['Date'] = RES['Date'].dt.strftime('%Y-%m-%d')
my_vlines = RES.loc[:,'Date'].values  # does NOT work
fig, axlist = mpf.plot( ohlc_df, type="candle", vlines= my_vlines, xrotation=30, returnfig=True, figsize=(6,4))
will only work if I explicitly pass my_vlines = [('2022-01-18'), ('2022-02-25')]
SOLVED: Oh, it really appears to be so simple after all
my_vlines=list(RES.loc[:,'Date'].values)
Your question asks for a list of NumPy arrays, but your desired output looks like tuples. If you need tuples, note that it's the comma that makes a tuple, not the parentheses, so you'd do something like this:
desired_format = [(x,) for x in list(df.loc[:,'dt'].values)]
If you want numpy arrays, you could do this
desired_format = [np.array(x) for x in list(df.loc[:,'dt'].values)]
I think I understand your problem. Please see the example code below and let me know if this resolves your problem. I expanded on your dataframe to meet mplfinance plot criteria.
import pandas as pd
import numpy as np
import mplfinance as mpf
df = pd.DataFrame({'dt': ['2021-2-13', '2022-2-15'], 'Open': [5, 7],
                   'Close': [11, 8], 'High': [21, 30], 'Low': [7, 3]})
df['dt']=pd.to_datetime(df['dt'])
df.set_index('dt', inplace = True)
mpf.plot(df, vlines = dict(vlines = df.index.tolist()))

Iterating over columns from two dataframes to estimate correlation and p-value

I am trying to estimate Pearson's correlation coefficient and p-value from the corresponding columns of two dataframes. I managed to write the code below, but it only gives me the result for the last column. I need some help with this code. I also want to save the outputs in a new dataframe.
import os
import pandas as pd
import scipy as sp
import scipy.stats as stats
df_1 = pd.DataFrame(pd.read_excel('15_Oct_Yield_A.xlsx'))
df_2= pd.DataFrame(pd.read_excel('Oct_Z_index.xlsx'))
for column in df_1.columns[1:]:
    for column in df_2.columns[1:]:
        x = (df_1[column])
        y = (df_2[column])
    correl = stats.pearsonr(x, y)
Your looping setup is incorrect in a couple of ways... You are using the same variable name in both for-loops, which is going to cause problems. Also, you are computing correl outside of your inner loop... etc.
What you want to do is loop over the columns with 1 loop, assuming that both data frames have the same column names. If they do not, you will need to take extra steps to find the common column names and then iterate over them.
Something like this should work:
import os
import pandas as pd
import scipy as sp
import scipy.stats as stats
df_1 = pd.DataFrame({'A': ['dog', 'pig', 'cat'],
                     'B': [0.25, 0.50, 0.75],
                     'C': [0.30, 0.40, 0.90]})
df_2 = pd.DataFrame({'A': ['bird', 'monkey', 'rat'],
                     'B': [0.20, 0.60, 0.90],
                     'C': [0.80, 0.50, 0.10]})
results = dict()
for column in df_1.columns[1:]:
    correl = stats.pearsonr(df_1[column], df_2[column])
    results[column] = correl
print(results)
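To also save the outputs in a new DataFrame, as the question asks, here is one possible sketch (assuming each pearsonr result unpacks into an (r, p-value) pair, which holds for both the tuple and the newer PearsonRResult return types):
results_df = pd.DataFrame([(col, *res) for col, res in results.items()],
                          columns=['column', 'r', 'p_value'])
results_df.to_csv('correlations.csv', index=False)  # 'correlations.csv' is just an example filename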

Changing data types after interpolating

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tnorma import tnorma
df6 = pd.read_csv('Static06_new.csv')
Z = tnorma(df6)
df = pd.DataFrame(Z)
print(df)
This is simple enough code. The interpolation package "tnorma" is described here: https://pypi.org/project/tnorma/
The Static06_new.csv file contains positional data with a time column. The time column is not continuous, hence I am interpolating.
The interpolation is successful; however, I am unable to convert the result back into a DataFrame for further analysis.
The error received when running my code is as follows:
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
values = np.array([convert(v) for v in values])
0
0 [[109.00000000000001, 0.009174311926606001, 35...
1 [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, ...
2 [0, 3694]
I am not sure how to proceed. Ideally I would like the result back in DataFrame format so I can save it as a .csv file.
Kind regards,
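One way forward, assuming tnorma returns a 3-tuple (interpolated data, new time/percentage vector, indices used), which is what the three rows in the printed output suggest, is to unpack that tuple and build the DataFrame from the interpolated data only:
import pandas as pd
from tnorma import tnorma
df6 = pd.read_csv('Static06_new.csv')
# unpack the presumed (data, time vector, indices) tuple instead of wrapping the whole tuple
yn, tn, inds = tnorma(df6)
# keep the original column names, assuming tnorma preserves the column count
df_interp = pd.DataFrame(yn, columns=df6.columns)
df_interp.to_csv('Static06_interp.csv', index=False)  # 'Static06_interp.csv' is an assumed output name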

plt.errorbar for X string value

I have a dataframe as below
import pandas as pd
import matplotlib.pylab as plt
df = pd.DataFrame({'name':['one', 'two', 'three'], 'assess':[100,200,300]})
I want to build errorbar like this
c = 30
plt.errorbar(df['name'], df['assess'], yerr=c, fmt='o')
and of course I get
ValueError: could not convert string to float
I can convert the strings to floats, but then I lose the category labels; maybe there's a more elegant way?
Matplotlib can indeed only work with numerical data here. There is an example in the matplotlib gallery showing how to handle cases where you have categorical data. The solution is to plot over a range of values and set the labels afterwards, using plt.xticks(ticks, labels) or a combination of ax.set_xticks(ticks) and ax.set_xticklabels(labels).
In your case the former works fine:
import pandas as pd
import matplotlib.pylab as plt
df = pd.DataFrame({'name':['one', 'two', 'three'], 'assess':[100,200,300]})
c = 30
plt.errorbar(range(len(df['name'])), df['assess'], yerr=c, fmt='o')
plt.xticks(range(len(df['name'])), df['name'])
plt.show()
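For completeness, the ax.set_xticks / ax.set_xticklabels combination mentioned above looks like this (a small sketch using the same df and c as above):
fig, ax = plt.subplots()
ax.errorbar(range(len(df)), df['assess'], yerr=c, fmt='o')
ax.set_xticks(range(len(df)))   # numeric tick positions
ax.set_xticklabels(df['name'])  # original string labels
plt.show()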

How to access element of a VectorUDT column in a Spark DataFrame?

I have a dataframe df with a VectorUDT column named features. How do I get an element of that column, say the first element?
I've tried doing the following
from pyspark.sql.functions import udf
first_elem_udf = udf(lambda row: row.values[0])
df.select(first_elem_udf(df.features)).show()
but I get a net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype) error. I get the same error if I use first_elem_udf = udf(lambda row: row.toArray()[0]) instead.
I also tried explode() but I get an error because it requires an array or map type.
This should be a common operation, I think.
Convert output to float:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit, udf
def ith_(v, i):
    try:
        return float(v[i])
    except ValueError:
        return None

ith = udf(ith_, DoubleType())
Example usage:
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
    (1, Vectors.dense([1, 2, 3])),
    (2, Vectors.sparse(3, [1], [9]))
]).toDF(["id", "features"])
df.select(ith("features", lit(1))).show()
## +-----------------+
## |ith_(features, 1)|
## +-----------------+
## | 2.0|
## | 9.0|
## +-----------------+
Explanation:
Output values have to be reserialized to equivalent Java objects. If you want to access the values (beware of SparseVectors), you should use the item method:
v.values.item(0)
which returns standard Python scalars. Similarly, if you want to access all values as a dense structure:
v.toArray().tolist()
If you prefer using spark.sql, you can use the following custom function 'to_array' to convert the vector to an array. Then you can manipulate it as an array.
from pyspark.sql.types import ArrayType, DoubleType
def to_array_(v):
    return v.toArray().tolist()
from pyspark.sql import SQLContext
sqlContext=SQLContext(spark.sparkContext, sparkSession=spark, jsqlContext=None)
sqlContext.udf.register("to_array",to_array_, ArrayType(DoubleType()))
Example:
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
    (1, Vectors.dense([1, 2, 3])),
    (2, Vectors.sparse(3, [1], [9]))
]).toDF(["id", "features"])
df.createOrReplaceTempView("tb")
spark.sql("""select * , to_array(features)[1] Second from tb """).toPandas()
Output:
id features Second
0 1 [1.0, 2.0, 3.0] 2.0
1 2 (0.0, 9.0, 0.0) 9.0
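If you want every element as its own column, the same registered function can simply be indexed repeatedly; a small sketch continuing the example above:
spark.sql("""
    select id,
           to_array(features)[0] as f0,
           to_array(features)[1] as f1,
           to_array(features)[2] as f2
    from tb
""").show()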
I ran into the same problem with not being able to use explode(). One thing you can do is use VectorSlicer from the pyspark.ml.feature module. Like so:
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import Row
slicer = VectorSlicer(inputCol="features", outputCol="features_one", indices=[0])
output = slicer.transform(df)
output.select("features", "features_one").show()
For anyone trying to split the probability column generated after training a PySpark ML model into usable columns: this does not use a UDF or numpy, and it only works for binary classification. Here lr_pred is the DataFrame holding the predictions from the logistic regression model.
from pyspark.sql.functions import split, regexp_replace
from pyspark.sql.types import DoubleType

prob_df1 = lr_pred.withColumn("probability", lr_pred["probability"].cast("String"))
prob_df = prob_df1.withColumn('probabilityre',
                              split(regexp_replace("probability", r"^\[|\]", ""), ",")[1].cast(DoubleType()))
Since Spark 3.0.0 this can be done without using UDF.
from pyspark.ml.functions import vector_to_array
https://discuss.dizzycoding.com/how-to-split-vector-into-columns-using-pyspark/
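A brief usage sketch, reusing the df with id and features columns from earlier in this thread:
from pyspark.ml.functions import vector_to_array
df.withColumn("features_arr", vector_to_array("features")) \
  .selectExpr("id", "features_arr[0] as first_element") \
  .show()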
Why is Vector[Double] used in the results? That's not a very nice data type.
