Writing a dictionary of Spark data frames to S3 bucket - python-3.x

Suppose we have a dictionary of PySpark dataframes. Is there a way to write this dictionary to an S3 bucket? The purpose of this is to read these PySpark data frames and then convert them into pandas data frames. Below is some code and the errors I get:
columns = ["language","users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
#spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)
df1 = rdd.toDF(columns)
df1.printSchema()
columns = ["language","users_count"]
data = [("C", "2000"), ("Java", "10000"), ("Lisp", "300")]
#spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)
df2 = rdd.toDF(columns)
df2.printSchema()
spark_dict = {df1: '1', df2: '2'}
import boto3
import pickle
s3_resource = boto3.resource('s3')
bucket='test'
key='pickle_list.pkl'
pickle_byte_obj = pickle.dumps(spark_dict)
try:
    s3_resource.Object(bucket, key).put(Body=pickle_byte_obj)
except:
    print("Error in writing to S3 bucket")
with this error:
An error was encountered:
can't pickle _thread.RLock objects
Traceback (most recent call last):
TypeError: can't pickle _thread.RLock objects
I also tried dumping the dictionary of PySpark data frames to a JSON file:
import json
flatten_dfs_json = json.dumps(spark_dict)
and got this error:
An error was encountered:
Object of type DataFrame is not JSON serializable
Traceback (most recent call last):
File "/usr/lib64/python3.7/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/usr/lib64/python3.7/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib64/python3.7/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/usr/lib64/python3.7/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type DataFrame is not JSON serializable

Suppose we have a dictionary of PySpark dataframes. Is there a way to write this dictionary to an S3 bucket?
Yes (you might need to configure an access key and secret key):
df.write.format('json').save('s3a://bucket-name/path')
The purpose of this is to read these PySpark data frames and then convert them into pandas data frames.
My 2 cents: this sounds wrong to me; you don't have to convert the data to pandas, as that defeats the purpose of using Spark in the first place.
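That said, if you still need to land each DataFrame from the dictionary in S3 and read it back into pandas later, a minimal sketch could look like the one below. It assumes the dictionary maps DataFrames to name suffixes as in the question, the bucket and prefix are placeholders, and pyarrow plus s3fs are installed for the pandas read-back.
for df, name in spark_dict.items():
    # write each Spark DataFrame under its own prefix (placeholder bucket/path)
    df.write.mode("overwrite").parquet(f"s3a://test/spark_dict/{name}")

import pandas as pd
# read one of them back as a pandas DataFrame (requires s3fs + pyarrow)
pandas_df_1 = pd.read_parquet("s3://test/spark_dict/1")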

Related

How to take a subset of parquet files to create a deltatable using deltalake_rs python library

I am using the deltalake 0.4.5 Python library to read .parquet files into a DeltaTable and then convert it into a pandas dataframe, following the instructions here: https://pypi.org/project/deltalake/.
Here's the Python code to do this:
from deltalake import DeltaTable
table_path = "s3://example_bucket/data/poc"
dt = DeltaTable(table_path)
files = dt.files() # OK, returns the list of parquet files with full s3 path
# ['s3://example_bucket/data/poc/y=2021/m=4/d=13/h=16/part-00001-8765abc67.parquet',
# 's3://example_bucket/data/poc/y=2021/m=4/d=13/h=16/part-00002-7643adc87.parquet',
# ........]
total_file_count = len(files) # OK, returns 115530
pt = dt.to_pyarrow_table() # hangs
df = dt.to_pyarrow_table().to_pandas() # hangs
I believe it hangs because the number of files is high (115K+).
So for my PoC, I wanted to read the files for only a single day or hour. I tried to set the table_path variable down to the hour level, but it gives a Not a Delta table error, as shown below:
table_path = "s3://example_bucket/data/poc/y=2021/m=4/d=13/h=16"
dt = DeltaTable(table_path)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib64/python3.7/site-packages/deltalake/table.py", line 19, in __init__
self._table = RawDeltaTable(table_path, version=version)
deltalake.PyDeltaTableError: Not a Delta table
How can I achieve this?
If the deltalake Python library can't be used to achieve this, what other tools/libraries should I try?
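This is not from the original thread, but since dt.files() already returns full s3 paths (as shown above), one hedged workaround for the PoC is to filter that list down to a single partition prefix and read just those parquet files with pandas. The hour prefix below is an assumption based on the partition layout shown earlier, and reading s3:// paths with pandas requires s3fs and pyarrow.
import pandas as pd
from deltalake import DeltaTable

dt = DeltaTable("s3://example_bucket/data/poc")

# keep only the files for one hour (prefix assumed from the y=/m=/d=/h= layout)
prefix = "s3://example_bucket/data/poc/y=2021/m=4/d=13/h=16/"
hour_files = [f for f in dt.files() if f.startswith(prefix)]

# concatenate the hour's files into one pandas dataframe
df = pd.concat((pd.read_parquet(f) for f in hour_files), ignore_index=True)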

Python Dask module error .. AttributeError: '_io.TextIOWrapper' object has no attribute 'startswith'

I am trying to learn how to use the dask module in order to overcome a memory problem in a script. While reading a csv and creating a dask dataframe from it, I got the following error:
File "C:\Users\username\AppData\Local\Programs\Python\Python39\lib\site-packages\fsspec\implementations\local.py", line 147, in _strip_protocol
if path.startswith("file://"):
AttributeError: '_io.TextIOWrapper' object has no attribute 'startswith'
here's my code:
import dask.array as da
import dask.dataframe as ddf
'''Read .csv straight to list'''
with open(wd + inputfilename + extension_csv, 'r') as f:
    reader = csv.reader(f)
    df = ddf.read_csv(f)  # data here is a dask dataframe
Any help on this? Thanks.
So the dask.dataframe read_csv function takes a filename, not a file handle.
Your code should be:
import dask.array as da
import dask.dataframe as ddf
df = ddf.read_csv(wd+inputfilename+extension_csv)
I don't believe you need the reader = csv.reader(f) line (and the csv module isn't imported anyway).
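Since the original motivation was a memory problem, it may also be worth noting that dask.dataframe.read_csv accepts a blocksize argument that controls how the file is split into partitions. A small hedged example (the path comes from the question, the block size is an arbitrary placeholder):
import dask.dataframe as ddf

# split the csv into ~64 MB partitions instead of loading it in one piece
df = ddf.read_csv(wd + inputfilename + extension_csv, blocksize="64MB")
print(df.npartitions)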

Getting a Key Error 'gs' when trying to write a dask dataframe to csv on google cloud storage

I have the code below, where I'm 1) importing a csv file from a GCS bucket, 2) doing some ETL on it, and 3) converting it to a dask dataframe before writing it back out with to_csv. All goes to plan until the very end, when I get a KeyError: 'gs' upon writing the csv back to a GCS bucket.
Here is my code; can anyone help me understand where the key error comes from?
def stage1_1ph_prod_master(data, context):
    from google.cloud import storage
    import pandas as pd
    import dask.dataframe as dd
    import io
    import numpy as np
    import datetime as dt

    source_bucket = 'sourcebucket'
    destination_path = 'gs://destination_bucket/ddf-*ph_master_static.csv'
    storage_client = storage.Client()
    source_bucket = storage_client.bucket(source_bucket)

    # load in the col names
    col_names = ["PPG_Code","PPG_Code_Name","SAP_Product_Name","CP_Sku_Code","UPC_Unit","UPC_Case","Category","Product_Category","Sub_Category","Brand","Sub_Brand","Variant","Size","Gender","Last_Updated_By","Last_Updated_On","Created_By","Created_On","Gross_Weight_Case_kg","Case_Height_mm",]
    df = pd.DataFrame(columns=col_names)

    for file in list(source_bucket.list_blobs()):
        file_path = "gs://{}/{}".format(file.bucket.name, file.name)
        df = df.append(pd.read_csv(file_path, header=None, skiprows=28, names=col_names, encoding='Latin_1'))

    ddf0 = dd.from_pandas(df, npartitions=1, sort=True)
    ddf0.to_csv(destination_path)  # Key Error happens here
Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
    _function_handler.invoke_user_function(event_object)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
    return call_user_function(request_or_event)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
    event_context.Context(**request_or_event.context))
  File "/user_code/main.py", line 43, in stage1_1ph_prod_master
    ddf0.to_csv(destination_path)
  File "/env/local/lib/python3.7/site-packages/dask/dataframe/core.py", line 1299, in to_csv
    return to_csv(self, filename, **kwargs)
  File "/env/local/lib/python3.7/site-packages/dask/dataframe/io/csv.py", line 741, in to_csv
    **(storage_options or {})
  File "/env/local/lib/python3.7/site-packages/dask/bytes/core.py", line 302, in open_files
    urlpath, mode, num=num, name_function=name_function, storage_options=kwargs
  File "/env/local/lib/python3.7/site-packages/dask/bytes/core.py", line 425, in get_fs_token_paths
    fs, fs_token = get_fs(protocol, options)
  File "/env/local/lib/python3.7/site-packages/dask/bytes/core.py", line 571, in get_fs
    cls = _filesystems[protocol]
KeyError: 'gs'
gcsfs and dask have recently changed to use the fsspec package. The former has been released, but the latter is in master only. So gcsfs no longer registers itself with the filesystems in dask, because fsspec already knows about it, but the version of dask you are using does not yet know about fsspec.
In short, please downgrade gcsfs until we have a chance to release dask, or use dask from master.
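Not part of the original answer, but a quick hedged sanity check before deciding whether to pin gcsfs or move to a newer dask is simply to print the versions you have installed:
# the fix above is purely about matching gcsfs and dask versions,
# so check what is actually installed first
import dask
import gcsfs

print("dask:", dask.__version__)
print("gcsfs:", gcsfs.__version__)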

Creating a DataFrame from Row results in 'infer schema issue'

When I began learning PySpark, I used a list to create a dataframe. Now that inferring the schema from a list has been deprecated, I got a warning suggesting I use pyspark.sql.Row instead. However, when I try to create one using Row, I get an infer schema issue. This is my code:
>>> row = Row(name='Severin', age=33)
>>> df = spark.createDataFrame(row)
This results in the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/spark2-client/python/pyspark/sql/session.py", line 526, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "/spark2-client/python/pyspark/sql/session.py", line 390, in _createFromLocal
struct = self._inferSchemaFromList(data)
File "/spark2-client/python/pyspark/sql/session.py", line 322, in _inferSchemaFromList
schema = reduce(_merge_type, map(_infer_schema, data))
File "/spark2-client/python/pyspark/sql/types.py", line 992, in _infer_schema
raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <type 'int'>
So I created a schema
>>> schema = StructType([StructField('name', StringType()),
... StructField('age',IntegerType())])
>>> df = spark.createDataFrame(row, schema)
but then this error gets thrown:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/spark2-client/python/pyspark/sql/session.py", line 526, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "/spark2-client/python/pyspark/sql/session.py", line 387, in _createFromLocal
data = list(data)
File "/spark2-client/python/pyspark/sql/session.py", line 509, in prepare
verify_func(obj, schema)
File "/spark2-client/python/pyspark/sql/types.py", line 1366, in _verify_type
raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 33 in type <type 'int'>
The createDataFrame function takes a list of Rows (among other options) plus the schema, so the correct code would be something like:
from pyspark.sql.types import *
from pyspark.sql import Row
schema = StructType([StructField('name', StringType()), StructField('age',IntegerType())])
rows = [Row(name='Severin', age=33), Row(name='John', age=48)]
df = spark.createDataFrame(rows, schema)
df.printSchema()
df.show()
Out:
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
+-------+---+
| name|age|
+-------+---+
|Severin| 33|
| John| 48|
+-------+---+
In the pyspark docs (link) you can find more details about the createDataFrame function.
You need to create a list of Row objects and pass that list, along with the schema, to your createDataFrame() method. Sample example:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
department1 = Row(id='AAAAAAAAAAAAAA', type='XXXXX',cost='2')
department2 = Row(id='AAAAAAAAAAAAAA', type='YYYYY',cost='32')
department3 = Row(id='BBBBBBBBBBBBBB', type='XXXXX',cost='42')
department4 = Row(id='BBBBBBBBBBBBBB', type='YYYYY',cost='142')
department5 = Row(id='BBBBBBBBBBBBBB', type='ZZZZZ',cost='149')
department6 = Row(id='CCCCCCCCCCCCCC', type='XXXXX',cost='15')
department7 = Row(id='CCCCCCCCCCCCCC', type='YYYYY',cost='23')
department8 = Row(id='CCCCCCCCCCCCCC', type='ZZZZZ',cost='10')
schema = StructType([StructField('id', StringType()), StructField('type',StringType()),StructField('cost', StringType())])
rows = [department1,department2,department3,department4,department5,department6,department7,department8 ]
df = spark.createDataFrame(rows, schema)
If you're just making a pandas dataframe, you can convert each Row to a dict and then rely on pandas' type inference, if that's good enough for your needs. This worked for me:
import pandas as pd
sample = output.head(5) #this returns a list of Row objects
df = pd.DataFrame([x.asDict() for x in sample])
I have had a similar problem recently and the answers here helped me understand the problem better.
My code:
row = Row(name="Alice", age=11)
spark.createDataFrame(row).show()
resulted in a very similar error:
An error was encountered:
Can not infer schema for type: <class 'int'>
Traceback ...
The cause of the problem:
createDataFrame expects a list of rows. So if you only have one row and don't want to invent more, simply wrap it in a list: [row]
row = Row(name="Alice", age=11)
spark.createDataFrame([row]).show()

module 'pyspark_csv' has no attribute 'csvToDataframe'

I am new to Spark and facing an error while converting a .csv file to a dataframe. I am using the pyspark_csv module for the conversion, but it gives an error saying "module 'pyspark_csv' has no attribute 'csvToDataframe'".
Here is my code:
import findspark
findspark.init()
findspark.find()
import pyspark
sc=pyspark.SparkContext(appName="myAppName")
sqlCtx = pyspark.SQLContext
#csv to dataframe
sc.addPyFile('/usr/spark-1.5.0/python/pyspark_csv.py')
sc.addPyFile('https://raw.githubusercontent.com/seahboonsiew/pyspark-csv/master/pyspark_csv.py')
import pyspark_csv as pycsv
#skipping the header
def skip_header(idx, iterator):
    if(idx == 0):
        next(iterator)
    return iterator
#loading the dataset
data=sc.textFile('gdeltdata/20160427.CSV')
data_header = data.first()
data_body = data.mapPartitionsWithIndex(skip_header)
data_df = pycsv.csvToDataframe(sqlctx, data_body, sep=",", columns=data_header.split('\t'))
AttributeError Traceback (most recent call last)
<ipython-input-10-8e47cd9759e6> in <module>()
----> 1 data_df = pycsv.csvToDataframe(sqlctx, data_body, sep=",", columns=data_header.split('\t'))
AttributeError: module 'pyspark_csv' has no attribute 'csvToDataframe'
As mentioned in https://github.com/seahboonsiew/pyspark-csv, please try calling csvToDataFrame, i.e. with a capital Frame instead of frame.
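A hedged sketch of the corrected call, reusing the names from the question; it also assumes sqlCtx is an actual SQLContext instance (the question assigns the class itself, pyspark.SQLContext, rather than pyspark.SQLContext(sc)):
sqlCtx = pyspark.SQLContext(sc)  # an instance, not the class

# note the capital "F" in csvToDataFrame
data_df = pycsv.csvToDataFrame(sqlCtx, data_body, sep=",",
                               columns=data_header.split('\t'))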
