featuretools: got an error "AttributeError: 'DataFrame' object has no attribute 'ww'" - featuretools

When I try to use featuretools[spark] on a pyspark DataFrame, I get an error.
My code is below:
import featuretools as ft
import pyspark.pandas as ps
from woodwork.logical_types import Double, Integer
ps.set_option("compute.default_index_type", "distributed")
id = [0, 1, 2, 3, 4]
values = [12, -35, 14, 103, -51]
spark_df = ps.DataFrame({"id": id, "values": values})
es = ft.EntitySet(id="spark_es")
es = es.add_dataframe(
    dataframe_name="spark_input_df",
    dataframe=spark_df,
    index="id",
    logical_types={"id": Integer, "values": Double},
)
es
but got an error:
"AttributeError: 'DataFrame' object has no attribute 'ww'"
Can anyone help me?
I just ran the official code posted at "https://featuretools.alteryx.com/en/stable/guides/using_spark_entitysets.html"

The code you provided works for me.
Library versions:
Featuretools: 1.21.0
Pyspark: 3.3.0.dev0

Import pyspark.pandas as ps first.
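A minimal sketch of that ordering, assuming the error comes from woodwork's ww accessor not being registered because pyspark.pandas was not importable when featuretools was loaded:
import pyspark.pandas as ps  # import before featuretools so Spark support is detected
import featuretools as ft
from woodwork.logical_types import Double, Integer

ps.set_option("compute.default_index_type", "distributed")
spark_df = ps.DataFrame({"id": [0, 1, 2, 3, 4], "values": [12, -35, 14, 103, -51]})

es = ft.EntitySet(id="spark_es")
es = es.add_dataframe(
    dataframe_name="spark_input_df",
    dataframe=spark_df,
    index="id",
    logical_types={"id": Integer, "values": Double},
)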

Related

AttributeError: type object 'PartialDependenceDisplay' has no attribute 'from_estimator'

I'm trying to implement a partial dependence plot using the following example:
from sklearn.inspection import PartialDependenceDisplay
from sklearn.datasets import make_friedman1
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
X, y = make_friedman1()
est1 = LinearRegression().fit(X, y)
est2 = RandomForestRegressor().fit(X, y)
disp1 = PartialDependenceDisplay.from_estimator(est1, X, [1, 2])
disp2 = PartialDependenceDisplay.from_estimator(est2, X, [1, 2], ax=disp1.axes_)
But I get the following error:
AttributeError Traceback (most recent call last)
<ipython-input-7-e98f23dac323> in <module>
6 est1 = LinearRegression().fit(X, y)
7 est2 = RandomForestRegressor().fit(X, y)
----> 8 disp1 = PartialDependenceDisplay.from_estimator(est1, X,
9 [1, 2])
10 disp2 = PartialDependenceDisplay.from_estimator(est2, X, [1, 2],
AttributeError: type object 'PartialDependenceDisplay' has no attribute 'from_estimator'
Is there any solution for this?
Thanks
I had the same issue (Windows 10, Python 3.9.7 on Anaconda). It was resolved by upgrading the scikit-learn package:
!pip install --upgrade scikit-learn
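As a quick check before or after upgrading: PartialDependenceDisplay.from_estimator was added in scikit-learn 1.0, so a sketch for verifying the installed version (with the pre-1.0 equivalent shown as a comment) could look like:
import sklearn
print(sklearn.__version__)  # from_estimator requires scikit-learn >= 1.0
# on older releases the equivalent helper was:
# from sklearn.inspection import plot_partial_dependence
# plot_partial_dependence(est1, X, [1, 2])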

gspread worksheet.update error - Worksheet has no attribute 'update'

I am trying to write a dataframe to an open Google Sheet in Google Colab, but am getting the error:
AttributeError: 'Worksheet' object has no attribute 'update'
I documented and tested the parts up to the error.
# General Imports
# Example at https://colab.research.google.com/notebooks/io.ipynb
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
#Import the library, authenticate, and create the interface to Sheets.
from google.colab import auth
auth.authenticate_user()
import gspread
from oauth2client.client import GoogleCredentials
import numpy as np
import pandas as pd
# Load the DataFrame
dataframe = [['A', 'B', 'C'], ['1', '2', '3'], ['Mary', 'Mai', 'Kat']]
print(dataframe)
# Open the Google Sheet
# This assumes that you have a workbook called "RS Output" with a sheet "Data" on your Google Drive.
gc = gspread.authorize(GoogleCredentials.get_application_default())
my_workbook = 'RS Output'
my_sheet = "Data"
worksheet = gc.open(my_workbook).worksheet(my_sheet)
list_of_lists = worksheet.get_all_values()
print(list_of_lists)
# update the Google Sheet with the values from the Dataframe
# per gspread documentation at
# https://gspread.readthedocs.io/en/latest/user-guide.html
worksheet.update([dataframe.columns.values.tolist()] + worksheet.values.tolist())
This is the output:
[['A', 'B', 'C'], ['1', '2', '3'], ['Mary', 'Mai', 'Kat']]
[['Testing'], ['This']]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-132-e085af26b2ed> in <module>()
21 # https://gspread.readthedocs.io/en/latest/user-guide.html
22
---> 23 worksheet.update([dataframe.columns.values.tolist()] + worksheet.values.tolist())
AttributeError: 'Worksheet' object has no attribute 'update'
I can't seem to find a clear example of how to write the dataframe to a Google Sheet.
Thanks
I had the same issue; it was my first time using a colab.research.google.com notebook. It turned out the preinstalled gspread module was version 3.0.
!pip install --upgrade gspread
Upgrading to version 3.7 fixed the missing .update() problem.
Found existing installation: gspread 3.0.1
Uninstalling gspread-3.0.1:
Successfully uninstalled gspread-3.0.1
Successfully installed gspread-3.7.0
Big thanks to: Shashank Rautela
AttributeError: 'Worksheet' object has no attribute 'update' means that the worksheet variable has no update attribute; you can verify that by adding print(dir(worksheet)) to your code. If the attribute is present, 'update' will show up in the printed list.
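A quick way to run that check, as a sketch (assuming worksheet is the gspread Worksheet object from the code above):
# print only the update-related attributes instead of the full dir() output
print([name for name in dir(worksheet) if 'update' in name])
# on a recent gspread this includes 'update', 'update_cell' and 'update_cells'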
Also, I tried to replicate your code and found some issues:
dataframe = [['A', 'B', 'C'], ['1', '2', '3'], ['Mary', 'Mai', 'Kat']] is already a list. dataframe.columns.values.tolist() will give you an error, since the variable dataframe is a list and has no columns attribute. Passing the variable dataframe on its own to the update method is enough, since it is already a list of lists. (An actual pandas DataFrame would be built from something like {'col1': [1, 2], 'col2': [3, 4]}.)
Incorrect worksheet.update() usage. According to this document, the parameters of update() are a range and values (a list of lists if the range contains multiple cells). Your call should look like worksheet.update("<range>", <data as a list of lists>).
Here is an example on how to use the update() method:
Using a list:
Code:
data = [["It", "works!"]]
worksheet.update("A1:B1", data)
Using a pandas DataFrame:
Code:
df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Age': [20, 19, 23]})
values = df.columns.values.tolist()
worksheet.update("A1:B1", [values])
Based on how you used the update() method, it looks like you want to insert the column names above the current data of the worksheet. Instead of update(), you can use insert_rows().
Code:
df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Age': [20, 19, 23]})
values = df.columns.values.tolist()
worksheet.insert_rows([values], row=1, value_input_option='RAW')
References:
insert_rows
update
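Building on the examples above, a minimal sketch that writes a whole DataFrame (header row plus all values) in a single update() call; it assumes a gspread version that has Worksheet.update (3.3 or later):
import pandas as pd
df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Age': [20, 19, 23]})
# one list of lists: the column names first, then every row of values;
# with no explicit range, update() writes starting at A1
worksheet.update([df.columns.values.tolist()] + df.values.tolist())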
I ran into the same issue on a Jupyter notebook running on a server (Ubuntu 18.04), while the same code works fine in PyCharm on my local machine (Ubuntu 20.04). Meanwhile, here's how I push my pandas DataFrame to a Google spreadsheet:
import string
import numpy as np

# create a spreadsheet range that matches the size of the df (including 1 row
# for the column names); it looks like 'A1:AA3'
letters = list(string.ascii_uppercase)
col_names_spreadsheet = letters + list(np.array([[X + x for x in letters] for X in letters]).flat)
range_for_df = col_names_spreadsheet[0] + "1" + ":" + col_names_spreadsheet[df.shape[1] - 1] + str(df.shape[0] + 1)
# retrieve the matching cells
cell_list = worksheet.range(range_for_df)
# flatten the df, add the column names at the beginning
cell_values = list(df.columns) + list(df.values.flat)
# set the value of each cell
for i, val in enumerate(cell_values):  # gives us a tuple of an index and value
    cell_list[i].value = val  # use the index on cell_list and the val from cell_values
# update the cells in bulk
worksheet.update_cells(cell_list)
If the df has dates, it may return this error:
Object of type date is not JSON serializable
In that case I use this:
# turn all datetime columns into strings
import datetime
dt_cols = list(df.columns[[type(df[col].iloc[0]) is datetime.date for col in df.columns]])
for c in dt_cols:
    df[c] = df[c].apply(lambda x: x.isoformat())
Credit to this answer for the trick: Python/gspread - how can I update multiple cells with DIFFERENT VALUES at once?
Upgrade the gspread lib with the command below; you will then be able to call the update method.
!pip install --upgrade gspread
If you are getting any kind of AttributeError (assuming that you have used the correct syntax and correct attributes for gspread), it is because you are using gspread's old version 3.0.1. If you haven't used gspread before in Google Colab, that is the version that comes preinstalled. Just do
!pip install --upgrade gspread
At the time of writing, the above command upgrades gspread to version 3.7.1.
Happy coding!

Azure Databricks & pyspark - substring errors

Getting two errors in my Databricks Spark script on the following line:
df = spark.createDataFrame(pdDf).withColumn('month', substring(col('dt'), 0, 7))
The first one:
AttributeError: 'Series' object has no attribute 'substr'
and
NameError: name 'substr' is not defined
I wonder what I am doing wrong...
It turned out I had not imported pyspark.sql.functions:
from pyspark.sql.functions import *
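A sketch of the same fix with explicit imports instead of the wildcard (spark and pdDf are assumed from the question; note that substring positions are 1-based in Spark SQL):
from pyspark.sql.functions import col, substring
# explicit imports avoid shadowing Python built-ins, unlike 'import *'
df = spark.createDataFrame(pdDf).withColumn('month', substring(col('dt'), 1, 7))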

module 'tensorflow' has no attribute 'variable_scope' with TFANN in Colab

I started using Colab to create an ANN, tried out the TFANN package, and received an error with the following simple sample code from https://github.com/nicholastoddsmith/pythonml#tfann:
pip install TFANN
import numpy as np
from TFANN import ANNR
A = np.random.rand(32, 4)
Y = np.random.rand(32, 1)
a = ANNR([4], [('F', 4), ('AF', 'tanh'), ('F', 1)], maxIter = 16, name = 'mlpr1')
a.fit(A, Y)
S = a.score(A, Y)
YH = a.predict(A)
The error I received is: AttributeError: module 'tensorflow' has no attribute 'variable_scope'
Can someone tell me why I received this error and how I can fix it?
Many thanks!
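A likely cause, as a hedged note: tf.variable_scope is a TensorFlow 1.x API that was removed from the top-level namespace in TensorFlow 2.x, which is what current Colab runtimes ship. A minimal sketch of the usual workaround, assuming pinning an older TensorFlow is acceptable:
# TFANN calls tf.variable_scope, which only exists in TensorFlow 1.x.
# Pin a 1.x release before importing TFANN, for example:
# pip install "tensorflow<2" TFANN
import tensorflow as tf
print(tf.__version__)  # should report 1.x for the sample code above to run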

Syntax error for pyspark.sql.functions.from_utc_timestamp(timestamp, tz)

I tried importing pyspark.sql.functions.from_utc_timestamp(timestamp, tz), but it always shows an invalid syntax error. How do we use this function to convert a set of values in epoch time to UTC in Spark?
Some example code (taken from the docs, and modified) to do what you want:
>>> from pyspark.sql.functions import from_utc_timestamp
>>> df = sqlContext.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
>>> df.select(from_utc_timestamp(df.t, "PST").alias('t')).collect()
[Row(t=datetime.datetime(1997, 2, 28, 2, 30))]
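Since the question mentions epoch values, here is a short sketch that first converts epoch seconds to timestamps with from_unixtime (another pyspark.sql.functions helper); it assumes the epoch values are in seconds, and note that from_unixtime renders them in the session time zone:
>>> from pyspark.sql.functions import from_unixtime, from_utc_timestamp
>>> df = sqlContext.createDataFrame([(1519330080,)], ['epoch'])
>>> df.select(from_utc_timestamp(from_unixtime(df.epoch), "PST").alias('t')).collect()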
