PySpark: ModuleNotFoundError: No module named 'app' - apache-spark

I am saving a dataframe to a CSV file in PySpark using the statement below:
df_all.repartition(1).write.csv("xyz.csv", header=True, mode='overwrite')
But I am getting the error below:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 218, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 138, in read_udfs
arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 118, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 58, in read_command
command = serializer._read_with_length(file)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 170, in _read_with_length
return self.loads(obj)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 559, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'app'
I am using PySpark version 2.3.0.
I am getting this error while trying to write to a file.
import json, jsonschema
from pyspark.sql import functions
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType, FloatType
from datetime import datetime
import os
feb = self.filter_data(self.SRC_DIR + "tl_feb19.csv", 13)
apr = self.filter_data(self.SRC_DIR + "tl_apr19.csv", 15)
df_all = feb.union(apr)
df_all = df_all.dropDuplicates(subset=["PRIMARY_ID"])
create_emi_amount_udf = udf(create_emi_amount, FloatType())
df_all = df_all.withColumn("EMI_Amount", create_emi_amount_udf('Sanction_Amount', 'Loan_Type'))
df_all.write.csv(self.DST_DIR + "merged_amounts.csv", header=True, mode='overwrite')

The error is very clear: there is no module 'app'. Your Python code runs on the driver, but your UDF runs on the executors' Python VMs. When you call the UDF, Spark serializes create_emi_amount to send it to the executors.
So, somewhere in your function create_emi_amount you use or import the app module.
A solution to your problem is to use the same environment on both the driver and the executors. In spark-env.sh, point PYSPARK_DRIVER_PYTHON=... and PYSPARK_PYTHON=... at the same Python virtualenv.
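For example, the relevant lines in spark-env.sh might look like this (the virtualenv path is hypothetical; substitute your own):
export PYSPARK_DRIVER_PYTHON=/path/to/your/venv/bin/python
export PYSPARK_PYTHON=/path/to/your/venv/bin/python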

Thanks to ggeop, who helped me out! ggeop has explained the problem, but that solution may not be right if 'app' is your own package.
My solution is to add the file to the SparkContext:
sc = SparkContext()
sc.addPyFile("app.zip")
But you have to zip the app package first, and you have to make sure the zip contains the app directory itself.
That is, if your app is at /home/workplace/app,
then you have to create the zip from within workplace, which will zip all directories under workplace, including app.
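For example, a minimal sketch of building that archive with the standard library, so that the app directory sits at the root of the zip (the /home/workplace path is taken from the example above):
import shutil
# Creates /home/workplace/app.zip containing the 'app' directory at the archive root.
shutil.make_archive("/home/workplace/app", "zip", root_dir="/home/workplace", base_dir="app")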
The other way is to pass the file to spark-submit, as below:
--py-files app.zip
--py-files myapp.py
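A full invocation might look like the following (main.py is a hypothetical entry-point script, not taken from the question):
spark-submit --py-files app.zip main.py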

Related

Pyspark general import problems

I successfully installed Spark and PySpark on my machine, added the path variables, etc., but I keep facing import problems.
This is the code:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.config("spark.hadoop.hive.exec.dynamic.partition", "true") \
.config("spark.hadoop.hive.exec.dynamic.partition.mode", "nonstrict") \
.enableHiveSupport() \
.getOrCreate()
And this is the error message:
"C:\...\Desktop\Clube\venv\Scripts\python.exe" "C:.../Desktop/Clube/services/ce_modelo_analise.py"
Traceback (most recent call last):
File "C:\...\Desktop\Clube\services\ce_modelo_analise.py", line 1, in <module>
from pyspark.sql import SparkSession
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\__init__.py", line 51, in <module>
from pyspark.context import SparkContext
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\context.py", line 31, in <module>
from pyspark import accumulators
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\accumulators.py", line 97, in <module>
from pyspark.serializers import read_int, PickleSerializer
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\serializers.py", line 71, in <module>
from pyspark import cloudpickle
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\cloudpickle.py", line 145, in <module>
_cell_set_template_code = _make_cell_set_template_code()
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\pyspark\cloudpickle.py", line 126, in _make_cell_set_template_code
return types.CodeType(
TypeError: 'bytes' object cannot be interpreted as an integer
If I remove the import line, the problems disappear. As I said before, my path variables are set, and Spark runs correctly in cmd.
Digging deeper, I found the problem: I was using Spark 2.4, which works with Python 3.7 at most.
As I was running Python 3.10, the error occurred.
So if you're experiencing the same kind of issue, try changing your versions: run the script under Python 3.7 or lower, or move to a Spark/PySpark release that supports your Python version.
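If it helps, here is a small, purely illustrative guard for the top of the script that fails early with a clearer message; the version cut-off reflects the answer above (Spark 2.4 supports Python 3.7 at most):
import sys
# Spark 2.4's bundled cloudpickle breaks on Python 3.8+, so check before importing pyspark.
if sys.version_info >= (3, 8):
    raise RuntimeError("Spark 2.4 requires Python 3.7 or lower; this interpreter is " + sys.version.split()[0])
from pyspark.sql import SparkSession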

Using Faker with PySpark Dataframe to Anonymise Data

I am trying to change a few columns in my Spark DataFrame. I have columns like:
First Name
Last Name
Email
I want to anonymise these and generate meaningful values, for which I am using Faker.
But if I use
df.withColumn('FirstName', lit(fake.first_name()))
it adds the same name for all rows. As you can see, every row gets the same first name; ideally I would like a different Faker value for each row, not a constant. How would I achieve this?
Update 1:
I looked at Steven's suggestion; here is my updated code:
import logging
import pyspark.sql.functions as sf
from faker import Faker
from pyspark.sql import functions as F
MSG_FORMAT = '%(asctime)s %(levelname)s %(name)s: %(message)s'
DATETIME_FORMAT = '%Y-%m-%d %H:%M:%S'
logging.basicConfig(format=MSG_FORMAT, datefmt=DATETIME_FORMAT)
logger = logging.getLogger("[SFDC-GLUE-LOG]")
fake = Faker()
source_df = spark.read.format("jdbc").option("url", connection_url) \
    .option("query", query).option("driver", driver_name) \
    .option("user", user_name).option("password", password) \
    .option("StmtCallLimit", 0).load()
fake_firstname = F.udf(fake.first_name)
masked_df=source_df.withColumn("FirstName", fake_firstname())
Now I get:
Traceback (most recent call last):
File "script_2020-08-05-17-15-26.py", line 52, in <module>
masked_df=source_df.withColumn("FirstName", fake_firstname())
File "/mnt/yarn/usercache/root/appcache/application_1596647211940_0002/container_1596647211940_0002_01_000001/pyspark.zip/pyspark/sql/udf.py", line 189, in wrapper
return self(*args)
File "/mnt/yarn/usercache/root/appcache/application_1596647211940_0002/container_1596647211940_0002_01_000001/pyspark.zip/pyspark/sql/udf.py", line 167, in __call__
judf = self._judf
File "/mnt/yarn/usercache/root/appcache/application_1596647211940_0002/container_1596647211940_0002_01_000001/pyspark.zip/pyspark/sql/udf.py", line 151, in _judf
self._judf_placeholder = self._create_judf()
File "/mnt/yarn/usercache/root/appcache/application_1596647211940_0002/container_1596647211940_0002_01_000001/pyspark.zip/pyspark/sql/udf.py", line 160, in _create_judf
wrapped_func = _wrap_function(sc, self.func, self.returnType)
File "/mnt/yarn/usercache/root/appcache/application_1596647211940_0002/container_1596647211940_0002_01_000001/pyspark.zip/pyspark/sql/udf.py", line 35, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/mnt/yarn/usercache/root/appcache/application_1596647211940_0002/container_1596647211940_0002_01_000001/pyspark.zip/pyspark/rdd.py", line 2420, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "/mnt/yarn/usercache/root/appcache/application_1596647211940_0002/container_1596647211940_0002_01_000001/pyspark.zip/pyspark/serializers.py", line 600, in dumps
raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: TypeError: can't pickle weakref objects
You need to use a UDF for that:
from pyspark.sql import functions as F
fake_firstname = F.udf(fake.first_name)
df.withColumn("FirstName", fake_firstname())
I had the same problem; here is the solution that worked for me. Creating the Faker instance inside the function means the UDF does not have to serialize the Faker object itself.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from faker import Factory
def fake_name():
    faker = Factory.create()
    return faker.name()
fake_name_udf = udf(fake_name, StringType())
df = df.withColumn('name', fake_name_udf())
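To cover all three columns from the original question, the same pattern can be repeated per column. A minimal sketch (Faker's first_name, last_name and email providers are assumed; creating the Faker instance inside each callable keeps the UDFs easy to serialize):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from faker import Faker

# Each call creates a Faker instance and draws a fresh value, so every row differs.
fake_first_udf = udf(lambda: Faker().first_name(), StringType())
fake_last_udf = udf(lambda: Faker().last_name(), StringType())
fake_email_udf = udf(lambda: Faker().email(), StringType())

df = (df.withColumn("FirstName", fake_first_udf())
        .withColumn("LastName", fake_last_udf())
        .withColumn("Email", fake_email_udf()))
Instantiating Faker on every call is slower than reusing one instance, but it sidesteps the pickling problems shown in the update above.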

Getting a Key Error 'gs' when trying to write a dask dataframe to csv on google cloud storage

I have the code below, where I'm 1) importing a CSV file from a GCS bucket, 2) doing some ETL on it, and 3) converting it to a Dask dataframe before writing it out with to_csv. All goes to plan until the very end, when I get a KeyError: 'gs' while writing the CSV back to a GCS bucket.
Here is my code. Can anyone help me understand where the KeyError comes from?
def stage1_1ph_prod_master(data, context):
    from google.cloud import storage
    import pandas as pd
    import dask.dataframe as dd
    import io
    import numpy as np
    import datetime as dt

    source_bucket = 'sourcebucket'
    destination_path = 'gs://destination_bucket/ddf-*ph_master_static.csv'
    storage_client = storage.Client()
    source_bucket = storage_client.bucket(source_bucket)

    # load in the col names
    col_names = ["PPG_Code","PPG_Code_Name","SAP_Product_Name","CP_Sku_Code","UPC_Unit","UPC_Case","Category","Product_Category","Sub_Category","Brand","Sub_Brand","Variant","Size","Gender","Last_Updated_By","Last_Updated_On","Created_By","Created_On","Gross_Weight_Case_kg","Case_Height_mm",]
    df = pd.DataFrame(columns=col_names)

    for file in list(source_bucket.list_blobs()):
        file_path = "gs://{}/{}".format(file.bucket.name, file.name)
        df = df.append(pd.read_csv(file_path, header=None, skiprows=28, names=col_names, encoding='Latin_1'))

    ddf0 = dd.from_pandas(df, npartitions=1, sort=True)
    ddf0.to_csv(destination_path)  # KeyError happens here
Traceback (most recent call last):
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
_function_handler.invoke_user_function(event_object)
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
return call_user_function(request_or_event)
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
event_context.Context(**request_or_event.context))
File "/user_code/main.py", line 43, in stage1_1ph_prod_master
ddf0.to_csv(destination_path)
File "/env/local/lib/python3.7/site-packages/dask/dataframe/core.py", line 1299, in to_csv
return to_csv(self, filename, **kwargs)
File "/env/local/lib/python3.7/site-packages/dask/dataframe/io/csv.py", line 741, in to_csv
**(storage_options or {})
File "/env/local/lib/python3.7/site-packages/dask/bytes/core.py", line 302, in open_files
urlpath, mode, num=num, name_function=name_function, storage_options=kwargs
File "/env/local/lib/python3.7/site-packages/dask/bytes/core.py", line 425, in get_fs_token_paths
fs, fs_token = get_fs(protocol, options)
File "/env/local/lib/python3.7/site-packages/dask/bytes/core.py", line 571, in get_fs
cls = _filesystems[protocol]
KeyError: 'gs'
gcsfs and dask have recently changed to use the fsspec package. The former has been released, but the latter's change is in master only. So gcsfs no longer registers itself with dask's filesystems, because fsspec already knows about it, but the version of dask you are using does not yet know about fsspec.
In short, please downgrade gcsfs until we have a chance to release dask, or use dask from master.

How to fix "No FileSystem for scheme: gs" in pyspark?

I am trying to read a JSON file from a Google bucket into a PySpark dataframe on a local Spark machine. Here's the code:
import pandas as pd
import numpy as np
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, SQLContext
conf = SparkConf().setAll([('spark.executor.memory', '16g'),
('spark.executor.cores','4'),
('spark.cores.max','4')]).setMaster('local[*]')
spark = (SparkSession.
builder.
config(conf=conf).
getOrCreate())
sc = spark.sparkContext
import glob
import bz2
import json
import pickle
from google.cloud import storage

bucket_path = "gs://<SOME_PATH>/"
client = storage.Client(project='<SOME_PROJECT>')
bucket = client.get_bucket('<SOME_PATH>')
blobs = bucket.list_blobs()
theframes = []
for blob in blobs:
    print(blob.name)
    testspark = spark.read.json(bucket_path + blob.name).cache()
    theframes.append(testspark)
It's reading files from the bucket fine (I can see the print out from blob.name), but then crashes like this:
Traceback (most recent call last):
File "test_code.py", line 66, in <module>
testspark = spark.read.json(bucket_path + blob.name).cache()
File "/home/anaconda3/envs/py37base/lib/python3.6/site-packages/pyspark/sql/readwriter.py", line 274, in json
return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/home/anaconda3/envs/py37base/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/anaconda3/envs/py37base/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/anaconda3/envs/py37base/lib/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o51.json.
: java.io.IOException: No FileSystem for scheme: gs
I've seen this type of error discussed on stackoverflow, but most solutions seem to be in Scala while I have pyspark, and/or involve messing with core-site.xml, which I've done to no effect.
I am using spark 2.4.1 and python 3.6.7.
Help would be much appreciated!
Some config params are required for Spark to recognize "gs" as a distributed filesystem.
Use this setting for the Google Cloud Storage connector jar, gcs-connector-hadoop2-latest.jar:
spark = SparkSession \
.builder \
.config("spark.jars", "/path/to/gcs-connector-hadoop2-latest.jar") \
.getOrCreate()
Other configs that can be set from PySpark:
spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
# This is required if you are using a service account; set it to true.
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "/path/to/keyfile")
# The following are required if you are using OAuth.
spark._jsc.hadoopConfiguration().set('fs.gs.auth.client.id', 'YOUR_OAUTH_CLIENT_ID')
spark._jsc.hadoopConfiguration().set('fs.gs.auth.client.secret', 'OAUTH_SECRET')
Alternatively you can set up these configs in core-site.xml or spark-defaults.conf.
Hadoop Configuration on Command Line
You can also use spark.hadoop-prefixed configuration properties to set things up when launching pyspark (or with spark-submit in general), e.g.
--conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
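Putting it together for the original PySpark snippet, a minimal sketch might look like this (the jar and keyfile paths are hypothetical, and the service-account route is assumed):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.jars", "/path/to/gcs-connector-hadoop2-latest.jar")
         .getOrCreate())

# Register the connector and point it at a service-account key file.
hconf = spark._jsc.hadoopConfiguration()
hconf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hconf.set("fs.gs.auth.service.account.enable", "true")
hconf.set("google.cloud.auth.service.account.json.keyfile", "/path/to/keyfile")

df = spark.read.json("gs://<SOME_PATH>/example.json")  # bucket placeholder from the question; file name is illustrative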

How could I write the right entry point in Spark 2.0 program (Actually pyspark 2.0)?

Today I want to try some new features of Spark 2.0. Here is my program:
#coding:utf-8
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName('test 2.0').config(conf=SparkConf()).getOrCreate()
df = spark.read.json("/Users/lyj/Programs/Apache/Spark2/examples/src/main/resources/people.json")
df.show()
but it errors as follows:
Traceback (most recent call last):
File "/Users/lyj/Programs/kiseliugit/MyPysparkCodes/test/spark2.0.py", line 5, in <module>
spark = SparkSession.builder.master("local").appName('test 2.0').config(conf=SparkConf()).getOrCreate()
File "/Users/lyj/Programs/Apache/Spark2/python/pyspark/conf.py", line 104, in __init__
SparkContext._ensure_initialized()
File "/Users/lyj/Programs/Apache/Spark2/python/pyspark/context.py", line 243, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "/Users/lyj/Programs/Apache/Spark2/python/pyspark/java_gateway.py", line 116, in launch_gateway
java_import(gateway.jvm, "org.apache.spark.SparkConf")
File "/Library/Python/2.7/site-packages/py4j/java_gateway.py", line 90, in java_import
return_value = get_return_value(answer, gateway_client, None, None)
File "/Library/Python/2.7/site-packages/py4j/protocol.py", line 306, in get_return_value
value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
KeyError: u'y'
What's wrong with these few lines of code? Is it a problem with the Java environment? Also, I use the PyCharm IDE for development.
Try upgrading py4j: pip install py4j --upgrade
It worked for me.
