AttributeError: 'ElephasEstimator' object has no attribute 'setFeaturesCol' - apache-spark

I am trying to run a Keras model for binary text classification using Elephas in Apache Spark. Below is my code:
#my initial spark statements
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setAppName('Elephas_App').setMaster('local[4]')
sc = SparkContext(conf=conf)
#SQLContext is created using sc for relational functionality
sql_context = SQLContext(sc)
#elephas estimator parameters (imports assumed; the Keras `model` and `tar_class` are defined earlier in the notebook)
from tensorflow.keras import optimizers
from elephas.ml_model import ElephasEstimator
optimizer_conf = optimizers.Adam(lr=0.01)
opt_conf = optimizers.serialize(optimizer_conf)
estimator = ElephasEstimator()
estimator.set_keras_model_config(model.to_yaml())
estimator.set_categorical_labels(True)
estimator.set_nb_classes(tar_class)
estimator.set_num_workers(1)
estimator.set_epochs(5)
estimator.set_batch_size(64)
estimator.setFeaturesCol("features")
estimator.setLabelCol("label")
estimator.set_verbosity(1)
estimator.set_validation_split(0.10)
estimator.set_optimizer_config(opt_conf)
estimator.set_mode("synchronous")
estimator.set_loss("binary_crossentropy")
estimator.set_metrics(["acc"])
I am facing the following issue:
AttributeError Traceback (most recent call last)
<ipython-input-92-74397f47b924> in <module>()
7 estimator.set_epochs(5)
8 estimator.set_batch_size(64)
----> 9 estimator.setFeaturesCol("features")
10 estimator.setLabelCol("label")
11 estimator.set_verbosity(1)
AttributeError: 'ElephasEstimator' object has no attribute 'setFeaturesCol'
This issue exists for both "setFeaturesCol" and "setLabelCol".
Can anyone please help me as I am new to this?
Thanks in advance!

The HasLabelCol and HasFeaturesCol mixins were changed in Spark 3.0.x+ to remove the setter methods, hence the issue. The featuresCol and labelCol can instead be supplied in the ElephasEstimator constructor:
ElephasEstimator(featuresCol='features', labelCol='label')
However, in your application, this shouldn't be necessary, as the default feature column is 'features' and the default label column is 'label' - you should be able to omit those lines and run as normal.
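For reference, a minimal sketch of the estimator setup without the Spark-style setters, assuming the model, tar_class, and opt_conf defined in the question:
from elephas.ml_model import ElephasEstimator
# column names go into the constructor instead of setFeaturesCol/setLabelCol
# (these are also the defaults, so the arguments could be dropped entirely)
estimator = ElephasEstimator(featuresCol="features", labelCol="label")
estimator.set_keras_model_config(model.to_yaml())
estimator.set_categorical_labels(True)
estimator.set_nb_classes(tar_class)
estimator.set_num_workers(1)
estimator.set_epochs(5)
estimator.set_batch_size(64)
estimator.set_verbosity(1)
estimator.set_validation_split(0.10)
estimator.set_optimizer_config(opt_conf)
estimator.set_mode("synchronous")
estimator.set_loss("binary_crossentropy")
estimator.set_metrics(["acc"])
# the estimator then fits like any Spark ML estimator, e.g. on a hypothetical train_df:
# fitted_model = estimator.fit(train_df)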

Related

Error saving a linear regression model with MLLib

When trying to save my linear regression model to disk, I receive this error: "TypeError: save() takes 2 positional arguments but 3 were given"
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.regression import LinearRegression
sc= SparkContext()
lr = LinearRegression(featuresCol = 'features', labelCol='NextOrderInDays', maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(train_df)
lr_model.save(sc, "lr_model.model")
Searching the web turns up code similar to what I wrote. What am I missing as the third argument?
Thanks
You are using the ml package, not mllib: from pyspark.ml.regression import LinearRegression.
So the save function takes only one argument: the path (cf. the documentation).
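A minimal sketch of the corrected call, assuming the same lr_model and a local path:
# pyspark.ml models take only the path; no SparkContext argument
lr_model.save("lr_model.model")
# the model can be reloaded later with the matching model class
from pyspark.ml.regression import LinearRegressionModel
loaded_model = LinearRegressionModel.load("lr_model.model")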

Error while using dataframe show method in pyspark

I am trying to read data from BigQuery using pandas and pyspark. I am able to get the data, but I get the error below while converting it into a Spark DataFrame.
py4j.protocol.Py4JJavaError: An error occurred while calling o28.showString.
: java.lang.IllegalStateException: Could not find TLS ALPN provider; no working netty-tcnative, Conscrypt, or Jetty NPN/ALPN available
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.defaultSslProvider(GrpcSslContexts.java:258)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.configure(GrpcSslContexts.java:171)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.forClient(GrpcSslContexts.java:120)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder.buildTransportFactory(NettyChannelBuilder.java:401)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.AbstractManagedChannelImplBuilder.build(AbstractManagedChannelImplBuilder.java:444)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createSingleChannel(InstantiatingGrpcChannelProvider.java:223)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createChannel(InstantiatingGrpcChannelProvider.java:169)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.getTransportChannel(InstantiatingGrpcChannelProvider.java:156)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ClientContext.create(ClientContext.java:157)
Following are the environment details:
Python version : 3.7
Spark version : 2.4.3
Java version : 1.8
The code is as follows:
import google.auth
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession , SQLContext
from google.cloud import bigquery
# Currently this only supports queries which have at least 10 MB of results
QUERY = """ SELECT * FROM test limit 1 """
#spark = SparkSession.builder.appName('Query Results').getOrCreate()
sc = pyspark.SparkContext()
bq = bigquery.Client()
print('Querying BigQuery')
project_id = ''
query_job = bq.query(QUERY,project=project_id)
# Wait for query execution
query_job.result()
df = SQLContext(sc).read.format('bigquery') \
    .option('dataset', query_job.destination.dataset_id) \
    .option('table', query_job.destination.table_id) \
    .option("type", "direct") \
    .load()
df.show()
I am looking for some help to solve this issue.
I managed to find a better solution by referencing this link; below is my working code.
Install the pandas_gbq package before running the code below.
import pandas_gbq
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
project_id = "<your-project-id>"
query = """ SELECT * from testSchema.testTable"""
athletes = pandas_gbq.read_gbq(query=query, project_id=project_id,dialect = 'standard')
# Get a reference to the Spark Session
sc = SparkContext()
spark = SparkSession(sc)
# convert from Pandas to Spark
sparkDF = spark.createDataFrame(athletes)
# perform an operation on the DataFrame
print(sparkDF.count())
sparkDF.show()
Hope it helps someone! Keep pysparking :)

error TypeError: unorderable types: int() < str()

I am getting this error
Using Python version 3.5.2+ (default, Sep 22 2016 12:18:14)
SparkSession available as 'spark'.
Traceback (most recent call last):
File "/home/saria/PycharmProjects/TfidfLDA/main.py", line 30, in <module>
corpus = indexed_data.select(col("KeyIndex",str).cast("long"), "features").map(list)
File "/home/saria/tf27/lib/python3.5/site-packages/pyparsing.py", line 956, in col
return 1 if 0<loc<len(s) and s[loc-1] == '\n' else loc - s.rfind("\n", 0, loc)
TypeError: unorderable types: int() < str()
Process finished with exit code 1
when I run the following code. I should explain that the error happens on this line:
corpus = indexed_data.select(col("KeyIndex",str).cast("long"), "features").map(list)
I reviewed these cases:
enter link description here
enter link description here
but they are about converting between int and string, especially when reading input.
Here, however, I don't have any input.
Explanation of the code:
this code performs TF-IDF + LDA using DataFrames
# I used alias to avoid confusion with the mllib library
from pyparsing import col
from pyspark.ml.clustering import LDA
from pyspark.ml.feature import HashingTF as MLHashingTF, Tokenizer, HashingTF, IDF, StringIndexer
from pyspark.ml.feature import IDF as MLIDF
from pyspark.python.pyspark.shell import sqlContext, sc
from pyspark.sql.types import DoubleType, StructField, StringType, StructType
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
dbURL = "hdfs://en.wikipedia.org/wiki/Music"
file = sc.textFile("1.txt")
#Define data frame schema
fields = [StructField('key',StringType(),False),StructField('content',StringType(),False)]
schema = StructType(fields)
#Data in format <key>,<listofwords>
file_temp = file.map(lambda l : l.split(","))
file_df = sqlContext.createDataFrame(file_temp, schema)
#Extract TF-IDF From https://spark.apache.org/docs/1.5.2/ml-features.html
tokenizer = Tokenizer(inputCol='content', outputCol='words')
wordsData = tokenizer.transform(file_df)
hashingTF = HashingTF(inputCol='words',outputCol='rawFeatures',numFeatures=1000)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol='rawFeatures',outputCol='features')
idfModel = idf.fit(featurizedData)
rescaled_data = idfModel.transform(featurizedData)
indexer = StringIndexer(inputCol='key',outputCol='KeyIndex')
indexed_data = indexer.fit(rescaled_data).transform(rescaled_data).drop('key').drop('content').drop('words').drop('rawFeatures')
corpus = indexed_data.select(col("KeyIndex",str).cast("long"), "features").map(list)
model = LDA.train(corpus, k=2)
Could you please share your ideas?
When I delete the str in the error-prone line:
corpus = indexed_data.select(col("KeyIndex",str).cast("long"), "features").map(list)
it throws a new error:
TypeError: col() missing 1 required positional argument: 'strg'
Update
My main goal is to run this code:
tfidf then lda
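The traceback shows that col here is pyparsing's col(loc, strg), imported at the top of the script, rather than pyspark's col, which takes a single column-name string. A minimal sketch of the likely intended select, assuming the rest of the pipeline stays the same:
# use the Spark SQL col function instead of pyparsing's
from pyspark.sql.functions import col
corpus = indexed_data.select(col("KeyIndex").cast("long"), "features")
# on Spark 2.x a DataFrame has no .map; go through the underlying RDD if needed
corpus_rdd = corpus.rdd.map(list)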

TypeError: 'Builder' object is not callable Spark structured streaming

On running the example for Python Spark Structured Streaming given in the programming guide:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
I get the error below:
TypeError: 'Builder' object is not callable
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
spark = SparkSession.builder()\
    .appName("StructuredNetworkWordCount")\
    .getOrCreate()
# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark\
    .readStream\
    .format('socket')\
    .option('host', 'localhost')\
    .option('port', 9999)\
    .load()
# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, ' ')
    ).alias('word')
)
# Generate running word count
wordCounts = words.groupBy('word').count()
# Start running the query that prints the running counts to the console
query = wordCounts\
    .writeStream\
    .outputMode('complete')\
    .format('console')\
    .start()
query.awaitTermination()
Error :
omkar#rudra:~/thesis/backUp$ spark-submit structured.py
Traceback (most recent call last):
File "/home/omkar/thesis/backUp/structured.py", line 8, in <module>
spark = SparkSession.builder()\
TypeError: 'Builder' object is not callable
For
spark = SparkSession.builder()\
    .appName("StructuredNetworkWordCount")\
    .getOrCreate()
modify .builder() to .builder, as:
spark = SparkSession.builder\
    .appName("StructuredNetworkWordCount")\
    .getOrCreate()
Source: https://issues.apache.org/jira/browse/SPARK-18426
When running the Python example in the Structured Streaming guide, you get the error:
spark = SparkSession.builder().master("local[1]").appName("Example").getOrCreate()
TypeError: 'Builder' object is not callable
This is fixed by changing .builder() to .builder:
spark = SparkSession.builder.master("local[1]").appName("Demo").getOrCreate()
After removing the () from builder while creating the SparkSession, the code will run.

pyspark : NameError: name 'spark' is not defined

I am copying the pyspark.ml example from the official documentation website:
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.Transformer
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),(Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])
kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(df)
However, the example above wouldn't run and gave me the following errors:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-28-aaffcd1239c9> in <module>()
1 from pyspark import *
2 data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),(Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
----> 3 df = spark.createDataFrame(data, ["features"])
4 kmeans = KMeans(k=2, seed=1)
5 model = kmeans.fit(df)
NameError: name 'spark' is not defined
What additional configuration/variable needs to be set to get the example running?
You can add
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)
to the beginning of your code to define a SparkSession; then spark.createDataFrame() should work.
The answer by ηŽ‡ζ€€δΈ€ is good and will work the first time.
But the second time you try it, it will throw the following exception:
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=local) created by __init__ at <ipython-input-3-786525f7559f>:10
There are two ways to avoid it.
1) Using SparkContext.getOrCreate() instead of SparkContext():
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
2) Calling sc.stop() at the end, or before you start another SparkContext (see the sketch below).
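A minimal sketch of the second option, reusing SparkContext and SparkSession from the snippet above:
sc.stop()  # stop the existing context before creating a new one
sc = SparkContext('local')
spark = SparkSession(sc)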
Since you are calling createDataFrame(), you need to do this:
df = sqlContext.createDataFrame(data, ["features"])
instead of this:
df = spark.createDataFrame(data, ["features"])
spark stands in here for the sqlContext.
Some people name that object sc instead, so if that didn't work, you could try:
df = sc.createDataFrame(data, ["features"])
If you are using Python, you can import spark as follows; it will create a Spark session. Note that this is an old method, though it still works.
from pyspark.shell import spark
If it complains about another open session, do this:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate();
spark = SparkSession(sc)
scraped_data=spark.read.json("/Users/reihaneh/Desktop/nov3_final_tst1/")
