Using Faker with PySpark Dataframe to Anonymise Data - apache-spark

I am trying to anonymise a few columns in my Spark DataFrame, such as:
First Name
Last Name
Email
I want to replace these with meaningful fake values, for which I am using Faker.
But if I use
df.withColumn('FirstName', lit(fake.first_name()))
it adds the same name for all rows, something like:
As you can see, every row gets the same first name. Ideally I would like a different Faker value per row rather than a constant. How would I achieve this?
Update 1:
I looked at Steven's suggestion and here is my updated code:
import logging

import pyspark.sql.functions as sf
from faker import Faker
from pyspark.sql import functions as F

MSG_FORMAT = '%(asctime)s %(levelname)s %(name)s: %(message)s'
DATETIME_FORMAT = '%Y-%m-%d %H:%M:%S'
logging.basicConfig(format=MSG_FORMAT, datefmt=DATETIME_FORMAT)
logger = logging.getLogger("[SFDC-GLUE-LOG]")

fake = Faker()

source_df = spark.read.format("jdbc") \
    .option("url", connection_url) \
    .option("query", query) \
    .option("driver", driver_name) \
    .option("user", user_name) \
    .option("password", password) \
    .option("StmtCallLimit", 0) \
    .load()

fake_firstname = F.udf(fake.first_name)

masked_df = source_df.withColumn("FirstName", fake_firstname())
Now I get:
Traceback (most recent call last):
File "script_2020-08-05-17-15-26.py", line 52, in <module>
masked_df=source_df.withColumn("FirstName", fake_firstname())
File "/mnt/yarn/usercache/root/appcache/application_1596647211940_0002/container_1596647211940_0002_01_000001/pyspark.zip/pyspark/sql/udf.py", line 189, in wrapper
return self(*args)
File "/mnt/yarn/usercache/root/appcache/application_1596647211940_0002/container_1596647211940_0002_01_000001/pyspark.zip/pyspark/sql/udf.py", line 167, in __call__
judf = self._judf
File "/mnt/yarn/usercache/root/appcache/application_1596647211940_0002/container_1596647211940_0002_01_000001/pyspark.zip/pyspark/sql/udf.py", line 151, in _judf
self._judf_placeholder = self._create_judf()
File "/mnt/yarn/usercache/root/appcache/application_1596647211940_0002/container_1596647211940_0002_01_000001/pyspark.zip/pyspark/sql/udf.py", line 160, in _create_judf
wrapped_func = _wrap_function(sc, self.func, self.returnType)
File "/mnt/yarn/usercache/root/appcache/application_1596647211940_0002/container_1596647211940_0002_01_000001/pyspark.zip/pyspark/sql/udf.py", line 35, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/mnt/yarn/usercache/root/appcache/application_1596647211940_0002/container_1596647211940_0002_01_000001/pyspark.zip/pyspark/rdd.py", line 2420, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "/mnt/yarn/usercache/root/appcache/application_1596647211940_0002/container_1596647211940_0002_01_000001/pyspark.zip/pyspark/serializers.py", line 600, in dumps
raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: TypeError: can't pickle weakref objects

You need to use a UDF for that:
from pyspark.sql import functions as F
fake_firstname = F.udf(fake.first_name)
df.withColumn("FirstName", fake_firstname())

I had the same problem; here is the solution that worked for me.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from faker import Factory

def fake_name():
    faker = Factory.create()
    return faker.name()

fake_name_udf = udf(fake_name, StringType())
df = df.withColumn('name', fake_name_udf())
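This works where the earlier attempt failed because the Faker/Factory instance is created inside the function, so only a plain function gets pickled and shipped to the executors. The same pattern extends to the other columns from the question; a minimal sketch on my part, assuming the LastName and Email columns are simply overwritten with fake values:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from faker import Faker

def fake_last_name():
    # A fresh Faker per call keeps the UDF picklable; simple, but not the fastest option.
    return Faker().last_name()

def fake_email():
    return Faker().email()

df = (df
      .withColumn('LastName', udf(fake_last_name, StringType())())
      .withColumn('Email', udf(fake_email, StringType())()))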

Related

how to filter a particular column with python pandas?

I have an Excel file with 2 columns: 'Name' and 'size'. The 'Name' column has multiple file types, namely .apk, .dat, .vdex, .ttc, etc. But I only want to keep the files whose extension ends with .apk. I do not want any other file type in the new Excel file.
I have written the below code:
import pandas as pd
import json

def json_to_excel():
    with open('installed-files.json') as jf:
        data = json.load(jf)
    df = pd.DataFrame(data)
    new_df = df[df.columns.difference(['SHA256'])]
    new_xl = new_df.to_excel('abc.xlsx')
    return new_xl

def filter_apk():  # MODIFIED CODE
    old_xl = json_to_excel()
    data = pd.read_excel(old_xl)
    a = data[data["Name"].str.contains("\.apk")]
    a.to_excel('zybg.xlsx')
The above program does the following:
json_to_excel() takes a JSON file, converts it to .xlsx format and saves it.
filter_apk() is supposed to create separate Excel files based on the file extensions present in the "Name" column.
The first function does what I intend.
The second function does nothing, and it does not throw any error either. I have followed this weblink.
Below are a few samples of the "Name" column:
/system/product/<Path_to>/abc.apk
/system/fonts/wwwr.ttc
/system/framework/framework.jar
/system/<Path_to>/icu.dat
/system/<Path_to>/Normal.apk
/system/<Path_to>/Tv.apk
How do I get that working? Or is there a better way to achieve the objective? Please suggest.
ERROR
raise ValueError(msg)
ValueError: Invalid file path or buffer object type: <class 'NoneType'>
Note:
I have all the files at the same location.
modified code:
import pandas as pd
import json

def json_to_excel():
    with open('installed-files.json') as jf:
        data = json.load(jf)
    df = pd.DataFrame(data)
    new_df = df[df.columns.difference(['SHA256'])]
    new_df.to_excel('abc.xlsx')

def filter_apk():
    json_to_excel()
    old_xl = pd.read_excel('abc.xlsx')
    data = pd.read_excel(old_xl)
    a = data[data["Name"].str.contains("\.apk")]
    a.to_excel('zybg.xlsx')

t = filter_apk()
print(t)
New error:
Traceback (most recent call last):
File "C:/Users/amitesh.sahay/PycharmProjects/work_allocation/TASKS/Jenkins.py", line 89, in <module>
t = filter_apk()
File "C:/Users/amitesh.sahay/PycharmProjects/work_allocation/TASKS/Jenkins.py", line 84, in filter_apk
data = pd.read_excel(old_xl)
File "C:\Users\amitesh.sahay\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\util\_decorators.py", line 296, in wrapper
return func(*args, **kwargs)
File "C:\Users\amitesh.sahay\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel\_base.py", line 304, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Users\amitesh.sahay\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel\_base.py", line 867, in __init__
self._reader = self._engines[engine](self._io)
File "C:\Users\amitesh.sahay\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel\_xlrd.py", line 22, in __init__
super().__init__(filepath_or_buffer)
File "C:\Users\amitesh.sahay\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel\_base.py", line 344, in __init__
filepath_or_buffer, _, _, _ = get_filepath_or_buffer(filepath_or_buffer)
File "C:\Users\amitesh.sahay\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\common.py", line 243, in get_filepath_or_buffer
raise ValueError(msg)
ValueError: Invalid file path or buffer object type: <class 'pandas.core.frame.DataFrame'>
There is a difference between your use-case and the use-case shown in the weblink. You want to apply a single filter (apk files), whereas the example you saw had multiple filters to be applied one after another (multiple species).
This will do the trick.
def filter_apk():
    old_xl = json_to_excel()
    data = pd.read_excel(old_xl)
    a = data[data["Name"].str.contains(r"\.apk")]
    a.to_excel("<path_to_new_excel>\\new_excel_name.xlsx")
Regarding your updated question: I think your first function is not working the way you believe it is.
new_xl = new_df.to_excel('abc.xlsx')
This writes an Excel file, as you expect, and that part works.
However, assigning the result to new_xl does nothing, because to_excel returns None. So when you return new_xl from json_to_excel, you actually return None, and in your second function old_xl = json_to_excel() leaves old_xl set to None.
So, your functions should be something like this:
def json_to_excel():
    with open('installed-files.json') as jf:
        data = json.load(jf)
    df = pd.DataFrame(data)
    new_df = df[df.columns.difference(['SHA256'])]
    new_df.to_excel('abc.xlsx')

def filter_apk():
    json_to_excel()
    data = pd.read_excel('abc.xlsx')
    a = data[data["Name"].str.contains(r"\.apk")]
    a.to_excel('zybg.xlsx')
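As an aside, the round trip through abc.xlsx is not strictly necessary; here is a sketch that filters the DataFrame in memory before writing, assuming the same installed-files.json layout as above:
import json
import pandas as pd

def filter_apk_direct():
    # Hypothetical variant: build the DataFrame once, drop SHA256, and keep
    # only the .apk rows, skipping the intermediate abc.xlsx file.
    with open('installed-files.json') as jf:
        df = pd.DataFrame(json.load(jf))
    df = df[df.columns.difference(['SHA256'])]
    df[df['Name'].str.contains(r'\.apk')].to_excel('zybg.xlsx')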

Getting a Key Error 'gs' when trying to write a dask dataframe to csv on google cloud storage

I have the code below, where I'm 1) importing a CSV file from a GCS bucket, 2) doing some ETL on it, and 3) converting it to a Dask DataFrame before writing it back out with to_csv. All goes to plan until the very end, when I get a KeyError: 'gs' upon writing the CSV back to a GCS bucket.
Here is my code. Can anyone help me understand where the KeyError comes from?
def stage1_1ph_prod_master(data, context):
    from google.cloud import storage
    import pandas as pd
    import dask.dataframe as dd
    import io
    import numpy as np
    import datetime as dt

    source_bucket = 'sourcebucket'
    destination_path = 'gs://destination_bucket/ddf-*ph_master_static.csv'
    storage_client = storage.Client()
    source_bucket = storage_client.bucket(source_bucket)

    # load in the col names
    col_names = ["PPG_Code", "PPG_Code_Name", "SAP_Product_Name", "CP_Sku_Code", "UPC_Unit", "UPC_Case",
                 "Category", "Product_Category", "Sub_Category", "Brand", "Sub_Brand", "Variant", "Size",
                 "Gender", "Last_Updated_By", "Last_Updated_On", "Created_By", "Created_On",
                 "Gross_Weight_Case_kg", "Case_Height_mm"]
    df = pd.DataFrame(columns=col_names)

    for file in list(source_bucket.list_blobs()):
        file_path = "gs://{}/{}".format(file.bucket.name, file.name)
        df = df.append(pd.read_csv(file_path, header=None, skiprows=28, names=col_names, encoding='Latin_1'))

    ddf0 = dd.from_pandas(df, npartitions=1, sort=True)
    ddf0.to_csv(destination_path)  # KeyError happens here
Traceback (most recent call last):
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
_function_handler.invoke_user_function(event_object)
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
return call_user_function(request_or_event)
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
event_context.Context(**request_or_event.context))
File "/user_code/main.py", line 43, in stage1_1ph_prod_master
ddf0.to_csv(destination_path)
File "/env/local/lib/python3.7/site-packages/dask/dataframe/core.py", line 1299, in to_csv
return to_csv(self, filename, **kwargs)
File "/env/local/lib/python3.7/site-packages/dask/dataframe/io/csv.py", line 741, in to_csv
**(storage_options or {})
File "/env/local/lib/python3.7/site-packages/dask/bytes/core.py", line 302, in open_files
urlpath, mode, num=num, name_function=name_function, storage_options=kwargs
File "/env/local/lib/python3.7/site-packages/dask/bytes/core.py", line 425, in get_fs_token_paths
fs, fs_token = get_fs(protocol, options)
File "/env/local/lib/python3.7/site-packages/dask/bytes/core.py", line 571, in get_fs
cls = _filesystems[protocol]
KeyError: 'gs'
gcsfs and Dask have recently changed to use the fsspec package. The former has been released, but the latter is in master only. So gcsfs no longer registers itself with Dask's filesystems, because fsspec already knows about it, but the version of Dask you are using does not yet know about fsspec.
In short, please downgrade gcsfs until we have a chance to release Dask, or use Dask from master.
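A quick way to check which side of that split you are on (a diagnostic sketch, not part of the original answer) is to print the installed versions of both packages:
import dask
import gcsfs

# Compare these against the release notes: the error above indicates a dask
# release that predates fsspec paired with a gcsfs release that already relies on it.
print("dask:", dask.__version__)
print("gcsfs:", gcsfs.__version__)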

PySpark: ModuleNotFoundError: No module named 'app'

I am saving a DataFrame to a CSV file in PySpark using the statement below:
df_all.repartition(1).write.csv("xyz.csv", header=True, mode='overwrite')
But I am getting the error below:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 218, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 138, in read_udfs
arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 118, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 58, in read_command
command = serializer._read_with_length(file)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 170, in _read_with_length
return self.loads(obj)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 559, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'app'
I am using PySpark version 2.3.0, and I am getting this error while trying to write to a file.
import json, jsonschema
from pyspark.sql import functions
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType, FloatType
from datetime import datetime
import os
feb = self.filter_data(self.SRC_DIR + "tl_feb19.csv", 13)
apr = self.filter_data(self.SRC_DIR + "tl_apr19.csv", 15)
df_all = feb.union(apr)
df_all = df_all.dropDuplicates(subset=["PRIMARY_ID"])
create_emi_amount_udf = udf(create_emi_amount, FloatType())
df_all = df_all.withColumn("EMI_Amount", create_emi_amount_udf('Sanction_Amount', 'Loan_Type'))
df_all.write.csv(self.DST_DIR + "merged_amounts.csv", header=True, mode='overwrite')
The error is very clear: there is no module named 'app'. Your Python code runs on the driver, but your UDF runs on the executors' Python VMs. When you call the UDF, Spark serializes create_emi_amount to send it to the executors.
So, somewhere in your function create_emi_amount you use or import the app module.
A solution to your problem is to use the same environment on both the driver and the executors. In spark-env.sh, set the same Python virtualenv in PYSPARK_DRIVER_PYTHON=... and PYSPARK_PYTHON=....
Thanks to ggeop, who helped me out and explained the problem. But the solution may not be enough if 'app' is your own package.
My solution is to add the file to the SparkContext:
sc = SparkContext()
sc.addPyFile("app.zip")
But you have to zip the app package first, and you have to make sure the zipped archive contains the app directory itself.
i.e. if your app is at /home/workplace/app,
then you have to create the zip from inside workplace, which will zip all directories under workplace including app.
The other way is to pass the files with spark-submit, as below:
--py-files app.zip
--py-files myapp.py
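To build such an archive from Python rather than from the shell, here is a sketch assuming the /home/workplace/app layout mentioned above:
import shutil
from pyspark import SparkContext

# Create app.zip so the archive contains the top-level "app" package directory,
# then ship it to the executors.
shutil.make_archive("app", "zip", root_dir="/home/workplace", base_dir="app")

sc = SparkContext()
sc.addPyFile("app.zip")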

"NameError: name 'datetime' is not defined" with datetime imported

I know there are a lot of "datetime not defined" posts, but they all seem to forget the obvious import of datetime. I can't figure out why I'm getting this error. When I do each step in IPython it works well, but the method doesn't:
import requests
import datetime

def daily_price_historical(symbol, comparison_symbol, limit=1, aggregate=1, exchange='', allData='true'):
    url = 'https://min-api.cryptocompare.com/data/histoday?fsym={}&tsym={}&limit={}&aggregate={}&allData={}'\
        .format(symbol.upper(), comparison_symbol.upper(), limit, aggregate, allData)
    if exchange:
        url += '&e={}'.format(exchange)
    page = requests.get(url)
    data = page.json()['Data']
    df = pd.DataFrame(data)
    df['timestamp'] = [datetime.datetime.fromtimestamp(d) for d in df.time]
    datetime.datetime.fromtimestamp()
    return df
This code produces this error:
Traceback (most recent call last):
File "C:\Users\20115619\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2963, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-29-4f015e05113f>", line 1, in <module>
rv.get_prices(30, 'ETH')
File "C:\Users\20115619\Desktop\projects\testDash\Revas.py", line 161, in get_prices
for symbol in symbols:
File "C:\Users\20115619\Desktop\projects\testDash\Revas.py", line 50, in daily_price_historical
df = pd.DataFrame(data)
File "C:\Users\20115619\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 4372, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'time'
df['timestamp'] = [datetime.datetime.fromtimestamp(d) for d in df.time]
I think that line is the problem.
Your DataFrame df at the end of the line doesn't have the .time attribute.
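A quick way to confirm that (a diagnostic sketch, not part of the original answer, reusing the url, requests and pd names from the function above) is to inspect what the API actually returned before building the DataFrame:
# Diagnostic sketch: if 'time' is missing, the 'Data' payload is probably not
# the list of price rows you expect; look at the raw response to see why.
page = requests.get(url)
payload = page.json()
data = payload.get('Data', [])
print(list(payload.keys()))
print(pd.DataFrame(data).columns.tolist())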
For what it's worth I'm on Python 3.6.0 and this runs perfectly for me:
import requests
import datetime
import pandas as pd

def daily_price_historical(symbol, comparison_symbol, limit=1, aggregate=1, exchange='', allData='true'):
    url = 'https://min-api.cryptocompare.com/data/histoday?fsym={}&tsym={}&limit={}&aggregate={}&allData={}'\
        .format(symbol.upper(), comparison_symbol.upper(), limit, aggregate, allData)
    if exchange:
        url += '&e={}'.format(exchange)
    page = requests.get(url)
    data = page.json()['Data']
    df = pd.DataFrame(data)
    df['timestamp'] = [datetime.datetime.fromtimestamp(d) for d in df.time]
    # I don't have the following function, but it's not needed to run this
    # datetime.datetime.fromtimestamp()
    return df

df = daily_price_historical('BTC', 'ETH')
print(df)
Note, I commented out the line that calls an external function that I do not have. Perhaps you have a global variable causing a problem?
Update as per the comments:
I'd use join instead to make the URL:
url = "".join(["https://min-api.cryptocompare.com/data/histoday?fsym=", str(symbol.upper()), "&tsym=", str(comparison_symbol.upper()), "&limit=", str(limit), "&aggregate=", str(aggregate), "&allData=", str(allData)])

Pyspark: Using lambda function and .withColumn produces a none-type error I'm having trouble understanding

I have the following code below. Essentially what I'm trying to do is to generate some new columns from the values in existing ones. After I do that, I save the dataframe with the new columns as a table in the cluster. Sorry I'm new to pyspark still.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
from pyspark.sql.functions import udf, array
from pyspark.sql.types import DecimalType
import numpy as np
import math
df = sqlContext.sql('select * from db.mytable')
angle_av = udf(lambda (x, y): -10 if x == 0 else math.atan2(y/x)*180/np.pi, DecimalType(20,10))
df = df.withColumn('a_v_angle', angle_av(array('a_v_real', 'a_v_imag')))
df.createOrReplaceTempView('temp')
sqlContext.sql('create table new_table as select * from temp')
These operations don't actually produce any errors. I then attempt to store the df as a table and get the following error (I'm guessing because this is when the operations are actually executed):
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 171, in main
process()
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 166, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 103, in <lambda>
func = lambda _, it: map(mapper, it)
File "<string>", line 1, in <lambda>
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 70, in <lambda>
return lambda *a: f(*a)
File "<stdin>", line 14, in <lambda>
TypeError: unsupported operand type(s) for /: 'NoneType' and 'NoneType'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
This happens because the input values are null / None. The function should check its input and proceed accordingly:
if x == 0 or x is None
or just
if not x
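A sketch of what that could look like as a regular function instead of the lambda (two assumptions on my part: Python 3 lambdas cannot unpack a tuple argument, and math.atan2 takes two arguments, so atan2(y, x) is taken to be the intent of atan2(y/x)):
import math
from decimal import Decimal

import numpy as np
from pyspark.sql.functions import udf, array
from pyspark.sql.types import DecimalType

def angle_av_fn(arr):
    x, y = arr[0], arr[1]
    # Guard against null inputs before doing any arithmetic.
    if x is None or y is None or x == 0:
        return Decimal(-10)
    # Round to the scale of DecimalType(20, 10) so the value fits the column.
    return Decimal(str(round(math.atan2(y, x) * 180 / np.pi, 10)))

angle_av = udf(angle_av_fn, DecimalType(20, 10))
df = df.withColumn('a_v_angle', angle_av(array('a_v_real', 'a_v_imag')))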
