Why does PCA in pyspark run out of memory?

When I run PCA in pyspark, I run out of memory. This is pyspark 1.6.3, and the execution environment is a Zeppelin notebook. Here is an example. Let df be a pyspark DataFrame where 'vectors' is the desired input column (containing a SparseVector of data).
from pyspark.ml.feature import PCA
pca = PCA(k = 100, inputCol="vectors", outputCol = "pca").fit(df)
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-2419389767585347468.py", line 360, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 2, in <module>
File "/usr/hdp/current/spark-client/python/pyspark/ml/pipeline.py", line 69, in fit
return self._fit(dataset)
File "/usr/hdp/current/spark-client/python/pyspark/ml/wrapper.py", line 133, in _fit
java_model = self._fit_java(dataset)
File "/usr/hdp/current/spark-client/python/pyspark/ml/wrapper.py", line 130, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o222.fit.
: java.lang.OutOfMemoryError: Java heap space
But check this out:
import pandas as pd
import numpy as np
pandf = df.toPandas()
densevectors = [np.array(sparse.toArray()) for sparse in pandf['vectors']]
xtrain = np.vstack(densevectors)
from sklearn.decomposition import PCA as skPCA
skpca = skPCA(n_components=100).fit(xtrain)
skpca.components_.shape
(100, 41277)
Execution time is 14 seconds. There are no memory problems, of course, because the input dataset only has ~9000 rows of sparse vectors. In spark-defaults.conf, driver and executor memory are both set to 12g, and this is an 8-node cluster that should have 32g available per node. There is no way the entire input dataset even takes up 1 MB, not even in .csv format.
Why is pyspark's PCA implementation running out of memory?
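For reference, the memory settings described above can be double-checked from inside the notebook session itself. This is a minimal sketch, assuming the SparkContext Zeppelin injects is available as sc; sc._conf is an internal attribute, and the keys are the standard Spark configuration names:
# Sketch: confirm what the running Zeppelin session actually picked up
# from spark-defaults.conf; the fallback string is printed if a key was never set.
print(sc._conf.get("spark.driver.memory", "<driver memory not set>"))
print(sc._conf.get("spark.executor.memory", "<executor memory not set>"))
print(sc.version)  # should report 1.6.3 for the session described above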

Related

Python3 Pandas.DataFrame.info() Error Key: 30

So I was digging around some datasets and trying to use pandas to analyze them when I stumbled across the following error, and my brain froze :(
Here is the snippet where the exception is raised:
import pandas as pd
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
data = pd.DataFrame(X)
data['class'] = y
data.head()
data.tail()
data.columns
print('length of data is', len(data))
data.shape
data.info()
Here's the error traceback:
C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\Scripts\python.exe C:/Users/97150/PycharmProjects/EmbeddedLinux/AI/project.py
length of data is 569
Traceback (most recent call last):
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\core\indexes\base.py", line 2889, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 30
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:/Users/97150/PycharmProjects/EmbeddedLinux/AI/project.py", line 42, in <module>
data.info()
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\core\frame.py", line 2587, in info
self, verbose, buf, max_cols, memory_usage, null_counts
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\io\formats\info.py", line 250, in info
self._verbose_repr(lines, ids, dtypes, show_counts)
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\io\formats\info.py", line 335, in _verbose_repr
dtype = dtypes[i]
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\core\series.py", line 882, in __getitem__
return self._get_value(key)
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\core\series.py", line 991, in _get_value
loc = self.index.get_loc(label)
File "C:\Users\97150\PycharmProjects\EmbeddedLinux\venv\lib\site-packages\pandas\core\indexes\base.py", line 2891, in get_loc
raise KeyError(key) from err
KeyError: 30
Process finished with exit code 1
Note: I'm using PyCharm Community 2020.2; I checked for updates and such, and nothing changed.
So, it turned out pandas was just acting up.
Removing the () from data.info() fixed the issue :)
You can alternatively try passing the verbose=True and null_counts=True arguments to the .info() method to display the result (you can use just the verbose argument if you don't want to count null values).
data.info(verbose=True, null_counts=True)
Let me know if things work out for you.
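For completeness, here is a tiny sketch of the suggested call on a throwaway frame; note that in newer pandas releases the null_counts argument was renamed to show_counts:
import pandas as pd
# Minimal sketch of the explicit .info() call suggested above.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, None, 6.0]})
df.info(verbose=True, null_counts=True)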

Py4JError: An error occurred while calling o90.fit

I want to apply the random forest algorithm to a dataframe consisting of three columns, namely JournalID, IndexedJournalID (obtained using Spark's StringIndexer), and a feature vector. I used the code below to read the dataframe from a parquet file and apply StringIndexer to the JournalID column to convert it to a categorical type.
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors
from pyspark.ml.linalg import VectorUDT
df=spark.read.parquet('JouID-UBTFIDFVectors-server22.parquet')
labelIndexer = StringIndexer(inputCol="journalid", outputCol="IndexedJournalID")
labelsDF=labelIndexer.fit(df)
df1=labelsDF.transform(df)
# This UDF converts sparse vectors to dense vectors. I applied it to the raw 'features' column to convert it to VectorUDT type.
parse_ = udf(lambda l: Vectors.dense(l), VectorUDT())
df2 = df1.withColumn("featuresNew", parse_(df1["features"])).drop('features')
The new dataframe (df2) schema is as follows:
root
|-- journalid: string (nullable = true)
|-- indexedLabel: double (nullable = false)
|-- featuresNew: vector (nullable = true)
Then I split df2 into training and test sets and create a random forest classifier object as below:
(trainingData, testData) = df2.randomSplit([0.8, 0.2])
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="featuresNew", numTrees=2 )
Finally, I apply the fit() method to the trainingData obtained above.
rfModel=rf.fit(trainingData)
With this I am able to train the model on 100 instances of the input dataframe. However, over the whole training data, this line gives the following error.
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 53652)
Traceback (most recent call last):
File "/data/sntps/code/conda3/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock
self.process_request(request, client_address)
File "/data/sntps/code/conda3/lib/python3.6/socketserver.py", line 348, in process_request
self.finish_request(request, client_address)
File "/data/sntps/code/conda3/lib/python3.6/socketserver.py", line 361, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/data/sntps/code/conda3/lib/python3.6/socketserver.py", line 696, in __init__
self.handle()
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/accumulators.py", line 235, in handle
num_updates = read_int(self.rfile)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/serializers.py", line 685, in read_int
raise EOFError
EOFError
----------------------------------------
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
response = connection.send_command(command)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:41060)
Traceback (most recent call last):
File "/data/sntps/code/conda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-10-46d7488961c7>", line 1, in <module>
rfModel=rf.fit(trainingData)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/ml/base.py", line 132, in fit
return self._fit(dataset)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 288, in _fit
java_model = self._fit_java(dataset)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 285, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o90.fit
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/sntps/code/conda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 1828, in showtraceback
stb = value._render_traceback_()
AttributeError: 'Py4JError' object has no attribute '_render_traceback_'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 929, in _get_connection
connection = self.deque.pop()
IndexError: pop from an empty deque
During handling of the above exception, another exception occurred:
.(traceback...not writing due to space issue)
.
.
Py4JError: An error occurred while calling o90.fit
This error is not very descriptive, and hence it has become difficult for me to identify where I am going wrong. Any help would be much appreciated.
Input description:
The input dataframe contains 2696512 rows, and each row's feature vector has length 262144.
After going through a lot of related questions on Stack Overflow, I thought this might be happening because I was running it in a Jupyter notebook. So I later ran it on the command line using the spark-submit script, and I am not getting this error anymore. I still don't know why this error pops up when I run it in a Jupyter notebook.
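For anyone trying the same route, here is a rough sketch of how such a job might be packaged as a standalone script for spark-submit; the file name (train_rf.py), app name, and output path are made up, and the column names follow the question:
# Hypothetical standalone script (train_rf.py) for launching the same job via
# spark-submit instead of jupyter-notebook; unlike in the notebook, the
# SparkSession has to be created explicitly here.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.ml.feature import StringIndexer
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.ml.classification import RandomForestClassifier

if __name__ == "__main__":
    spark = SparkSession.builder.appName("rf-journal-classifier").getOrCreate()
    df = spark.read.parquet("JouID-UBTFIDFVectors-server22.parquet")
    # index the label column, as in the question
    df1 = StringIndexer(inputCol="journalid", outputCol="indexedLabel").fit(df).transform(df)
    # densify the sparse feature vectors, as in the question
    to_dense = udf(lambda v: Vectors.dense(v.toArray()), VectorUDT())
    df2 = df1.withColumn("featuresNew", to_dense(df1["features"])).drop("features")
    trainingData, testData = df2.randomSplit([0.8, 0.2])
    rfModel = RandomForestClassifier(labelCol="indexedLabel", featuresCol="featuresNew",
                                     numTrees=2).fit(trainingData)
    rfModel.write().overwrite().save("rf-journal-model")  # made-up output path
    spark.stop()
Such a script would then be launched with something like spark-submit --driver-memory 8g train_rf.py, where the memory value is just an example and would need tuning for this data size.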

spark java.lang.stackoverflow logistic regression fit with large dataset

I am trying to fit a logistic regression model for a data set with 470 features and 10 million training instances. Here is a snippet of my code.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import RFormula
formula = RFormula(formula = "label ~ .-classWeight")
bestregLambdaVal = 0.005
bestregAlphaVal = 0.01
lr = LogisticRegression(maxIter=1000, regParam=bestregLambdaVal, elasticNetParam=bestregAlphaVal,weightCol="classWeight")
pipeLineLr = Pipeline(stages = [formula, lr])
pipeLineFit = pipeLineLr.fit(mySparkDataFrame[featureColumnNameList + ['classWeight','label']])
I have also created a checkpoint directory,
sc.setCheckpointDir('checkpoint/')
as suggested here:
Spark gives a StackOverflowError when training using ALS
However, I get an error; here is a partial trace:
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 64, in fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 108, in _fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 64, in fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 265, in _fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 262, in _fit_java
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o383361.fit.
: java.lang.StackOverflowError
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
I would also like to note that the 470 feature columns were iteratively added to the Spark dataframe using withColumn().
So the mistake I was making was that, when checkpointing the dataframe, I would only do:
mySparkDataFrame.checkpoint(eager=True)
The right way to do it is:
mySparkDataFrame = mySparkDataFrame.checkpoint(eager=True)
This is based on another question I had asked (and got an answer for) here:
pyspark rdd isCheckPointed() is false
Also, it is recommended to persist() the dataframe before the checkpoint and to count() it after the checkpoint.
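Putting the pieces together, a minimal sketch of the corrected checkpoint pattern, assuming sc and mySparkDataFrame from the question are in scope:
sc.setCheckpointDir('checkpoint/')
# persist first so the lineage is not recomputed while checkpointing
mySparkDataFrame = mySparkDataFrame.persist()
# checkpoint() returns a new DataFrame; the reassignment is the actual fix
mySparkDataFrame = mySparkDataFrame.checkpoint(eager=True)
# an action after the checkpoint forces materialisation
mySparkDataFrame.count()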

Zeppelin/Spark: org.apache.spark.SparkException: Cannot run program "/usr/bin/": error=13, no permission

I am trying to get a basic regression to run with Zeppelin 0.7.2 and Spark 2.1.1 on Debian 9. Both Zeppelin and Spark are "installed" in /usr/local/, i.e. /usr/local/zeppelin/ and /usr/local/spark. Zeppelin also knows the correct SPARK_HOME. First I load the data:
%spark.pyspark
from sqlalchemy import create_engine #sql query
import pandas as pd #sql query
from pyspark import SparkContext #Spark DataFrame
from pyspark.sql import SQLContext #Spark DataFrame
# database connection and sql query
pdf = pd.read_sql("select col1, col2, col3 from table", create_engine('mysql+mysqldb://user:pass@host:3306/db').connect())
print(pdf.size) # size of pandas dataFrame
# convert pandas dataFrame into spark dataFrame
sdf = SQLContext(SparkContext.getOrCreate()).createDataFrame(pdf)
sdf.printSchema()# what does the spark dataFrame look like?
Fine, it works, and I get the output with 46977 rows and three columns:
46977
root
|-- col1: double (nullable = true)
|-- col2: double (nullable = true)
|-- col3: date (nullable = true)
Ok, now I want to do the regression:
%spark.pyspark
# do a linear regression with sparks ml libs
# https://community.intersystems.com/post/machine-learning-spark-and-cach%C3%A9
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
# choose several inputCols and transform the "Features" column(s) into the correct vector format
vectorAssembler = VectorAssembler(inputCols=["col1"], outputCol="features")
data=vectorAssembler.transform(sdf)
print(data)
# Split the data into 70% training and 30% test sets.
trainingData,testData = data.randomSplit([0.7, 0.3], 0.0)
print(trainingData)
# Configure the model.
lr = LinearRegression().setFeaturesCol("features").setLabelCol("col2").setMaxIter(10)
## Train the model using the training data.
lrm = lr.fit(trainingData)
## Run the test data through the model and display its predictions for PetalLength.
#predictions = lrm.transform(testData)
#predictions.show()
But while doing lr.fit(trainingData), I get errors in the console (and in Zeppelin's log files). The error seems to occur while starting Spark: Cannot run program "/usr/bin/": error=13, Keine Berechtigung (permission denied). I wonder what is supposed to be started in /usr/bin/, since I only use the path /usr/local/.
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-4001144784380663394.py", line 367, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-4001144784380663394.py", line 360, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 9, in <module>
File "/usr/local/spark/python/pyspark/ml/base.py", line 64, in fit
return self._fit(dataset)
File "/usr/local/spark/python/pyspark/ml/wrapper.py", line 236, in _fit
java_model = self._fit_java(dataset)
File "/usr/local/spark/python/pyspark/ml/wrapper.py", line 233, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o70.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.io.IOException: Cannot run program "/usr/bin/": error=13, Keine Berechtigung
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
It was a configuration error in Zeppelin's conf/zeppelin-env.sh. There, I had the following line uncommented, which caused the error; I have now commented the line out and it works:
#export PYSPARK_PYTHON=/usr/bin/ # path to the python command. must be the same path on the driver(Zeppelin) and all workers.
So the problem was that the path in PYSPARK_PYTHON was not set correctly; now the default python binary is used. I found the solution by searching for the string /usr/bin/ with grep -R "/usr/bin/" in the Zeppelin base directory and checking the files it turned up.
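A quick sanity check after editing zeppelin-env.sh, as a sketch, is to print which interpreter the driver uses and what PYSPARK_PYTHON resolves to inside a %spark.pyspark paragraph:
%spark.pyspark
import os, sys
# which Python runs the driver side of the interpreter
print("driver python :", sys.executable)
# what Spark will try to launch on the workers (falls back to 'python' when unset)
print("PYSPARK_PYTHON:", os.environ.get("PYSPARK_PYTHON", "<not set>"))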

How could I write the right entry point in Spark 2.0 program (Actually pyspark 2.0)?

Today I want to try some new features of Spark 2.0. Here is my program:
#coding:utf-8
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName('test 2.0').config(conf=SparkConf()).getOrCreate()
df = spark.read.json("/Users/lyj/Programs/Apache/Spark2/examples/src/main/resources/people.json")
df.show()
but it errors as follows:
Traceback (most recent call last):
File "/Users/lyj/Programs/kiseliugit/MyPysparkCodes/test/spark2.0.py", line 5, in <module>
spark = SparkSession.builder.master("local").appName('test 2.0').config(conf=SparkConf()).getOrCreate()
File "/Users/lyj/Programs/Apache/Spark2/python/pyspark/conf.py", line 104, in __init__
SparkContext._ensure_initialized()
File "/Users/lyj/Programs/Apache/Spark2/python/pyspark/context.py", line 243, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "/Users/lyj/Programs/Apache/Spark2/python/pyspark/java_gateway.py", line 116, in launch_gateway
java_import(gateway.jvm, "org.apache.spark.SparkConf")
File "/Library/Python/2.7/site-packages/py4j/java_gateway.py", line 90, in java_import
return_value = get_return_value(answer, gateway_client, None, None)
File "/Library/Python/2.7/site-packages/py4j/protocol.py", line 306, in get_return_value
value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
KeyError: u'y'
What's wrong with these few lines of code? Is there a problem with the Java environment? Plus, I use the PyCharm IDE for development.
Try upgrading py4j: pip install py4j --upgrade
It worked for me.
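If the upgrade does not seem to take effect, a small sketch to check which py4j installation the interpreter actually imports, and its version:
import py4j
from py4j.version import __version__ as py4j_version
print(py4j.__file__)   # which installation gets imported
print(py4j_version)    # should be compatible with the py4j bundled under SPARK_HOME/python/lib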
