How to read sample records parquet file in S3? - python-3.x

I have hundreds of parquet files in S3 and I want to check whether they were all created properly, i.e. whether the downstream system will be able to read them without any issue. Before the downstream system reads these files, I want my Python script to read a sample of 10 records from each parquet file.
I am using the below code to read the parquet file:
import io

import boto3
import pandas as pd

s3 = boto3.client('s3')
result = s3.get_object(Bucket="my bucket", Key="my file location")
text = result["Body"].read().decode()
I need your input on how to read sample records, not all the records, from a parquet file. Thank you.

Related

Streaming parquet files from S3 (Python)

I should begin by saying that this is not running in Spark.
What I am attempting to do is:
- stream n records from a parquet file in S3
- process
- stream back to a different file in S3
...but am only inquiring about the first step.
Have tried various things like:
from pyarrow import fs
from pyarrow.parquet import ParquetFile

s3 = fs.S3FileSystem(access_key=aws_key, secret_key=aws_secret)

with s3.open_input_stream(filepath) as f:
    print(type(f))  # pyarrow.lib.NativeFile
    parquet_file = ParquetFile(f)
    for i in parquet_file.iter_batches():  # .read_row_groups() would be better
        ...  # process
...but am getting OSError: only valid on seekable files, and not sure how to get around it.
Apologies if this is a duplicate. I searched but didn't find quite the fit I was looking for.
Try using open_input_file, which "opens an input file for random access reading", instead of open_input_stream, which "opens an input stream for sequential reading".
For context: in a parquet file the metadata is at the end, so you need to be able to seek back and forth in the file.

How to read multiple .zip archives as a stream, unzip them, and write each contained file back out as a stream with Spark?

I have an archive directory of zip files that I would like to open "through" Spark in streaming, and write the unzipped files in streaming to another directory, keeping the name of each zip file (one by one).
import zipfile
import io

def zip_extract(x):
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = [i for i in file_obj.namelist()]
    return dict(zip(files, [file_obj.open(file).read() for file in files]))
Is there an easy way to read and write the above code in streaming ? Thank you for your help.
As far as I know, Spark can't read archives out of the box.
A ZIP file both archives and compresses data. If you can, use a program like gzip to compress the data but keep each file separate, i.e. don't archive multiple files into a single one.
If the archive is a given and can't be changed, you can consider reading it with sparkContext.binaryFiles (https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html). This gives you each zipped file as a byte array in Spark, so you can write a mapper function that unzips it and returns the contents of the files. You can then flatten that result to get an RDD of the files' contents.
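A sketch of that approach (the input path and the SparkContext are assumptions): binaryFiles yields one (path, bytes) record per archive, and flattening the dictionary produced by a zip_extract-style mapper gives one (filename, contents) pair per contained file.

```python
import io
import zipfile

def zip_extract(kv):
    """Mapper for sc.binaryFiles: (path, bytes) -> {filename: contents}."""
    path, content = kv
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}

def extract_all(sc, input_path):
    # one (path, bytes) record per zip archive in input_path
    zips = sc.binaryFiles(input_path)
    # flatten each archive into (filename, contents) pairs
    return zips.flatMap(lambda kv: zip_extract(kv).items())
```

From there the resulting RDD can be written out per file; note this is batch-style reading rather than true Structured Streaming, which has no zip source.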

How to name a csv file after overwriting in Azure Blob Storage

I am using a Databricks notebook to read a file and write it back to the same location, but when I perform the write I get a lot of files with different names, and I am not sure why these files are created in the location I specified.
Also, another file with the name "new_location" was created after I performed the write operation.
What I want is that, after reading the file from Azure Blob Storage, I write it back to the same location with the same name as the original. But I am unable to do so; please help me out, as I am new to PySpark.
I have already mounted the container and now I am reading the CSV file stored in an Azure Blob Storage container.
The overwritten file is created with the name "part-00000-tid-84371752119947096-333f1e37-6fdc-40d0-97f5-78cee0b108cf-31-1-c000.csv"
Code:
df = spark.read.csv("/mnt/ndemo/nsalman/addresses.csv", inferSchema=True)
df = df.toDF("firstName", "lastName", "street", "town", "city", "code")
df.show()
file_location_new = "/mnt/ndemo/nsalman/new_location"
# write the dataframe as a single file to blob storage
df.write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
Spark will save a partial CSV file for each partition of your dataset. To generate a single CSV file, you can convert the dataset to a pandas dataframe and then write it out.
Try to change these lines:
df.write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
to this line
df.toPandas().to_csv(file_location_new, header=True)
You might need to prepend "/dbfs/" to file_location_new for this to work.
Here is a minimal self-contained example that demonstrates how to write a csv file with pandas:
df = spark.createDataFrame([(1,3),(2,2),(3,1)], ["Testing", "123"])
df.show()
df.toPandas().to_csv("/dbfs/" + "/mnt/ndemo/nsalman/" + "testfile.csv", header=True)

Lambda Function to convert csv to parquet from s3

I have a requirement:
1. To convert a parquet file present in S3 to CSV format and place it back in S3. The process should exclude the use of EMR.
2. The parquet file has more than 100 columns; I need to extract just 4 columns from it and create the CSV in S3.
Does anyone have a solution to this?
Note - Cannot use EMR or AWS Glue
Assuming you want to keep things easy within the AWS environment, and not using Spark (Glue / EMR), you could use AWS Athena in the following way:
Let's say your parquet files are located in S3://bucket/parquet/.
You can create a table in the Data Catalog (i.e. using Athena or a Glue Crawler), pointing to that parquet location. For example, running something like this in the Athena SQL console:
CREATE EXTERNAL TABLE parquet_table (
    col_1 string,
    ...
    col_100 string)
PARTITIONED BY (date string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION 's3://bucket/parquet/';
Once you can query your parquet_table, which reads the parquet files, you should be able to create the CSV files in the following way, again using Athena and choosing only the 4 columns you're interested in:
CREATE TABLE csv_table
WITH (
    format = 'TEXTFILE',
    field_delimiter = ',',
    external_location = 's3://bucket/csv/'
)
AS SELECT col_1, col_2, col_3, col_4
FROM parquet_table;
After this, you can actually drop the temporary csv_table and just use the CSV files under s3://bucket/csv/, for example by having an S3-triggered Lambda function do something further with them.
Remember that all of this can be achieved from Lambda, interacting with Athena (example here). Also bear in mind that Athena has an ODBC connector and PyAthena for use from Python, among other options, so driving it through Lambda or the AWS Console is not the only choice you have, in case you want to automate this in a different way.
I hope this helps.
Additional edit, on Sept 25th, 2019:
Answering your question about doing this in pandas: I think the best way would be using Glue Python Shell, but you mentioned you didn't want to use it. So, if you decide to, here is a basic example of how to:
import sys

import boto3
import pandas as pd
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
                          ['region',
                           's3_bucket',
                           's3_input_folder',
                           's3_output_folder'])

## Variables used for now. Job input parameters to be used.
s3Bucket = args['s3_bucket']
s3InputFolderKey = args['s3_input_folder']
s3OutputFolderKey = args['s3_output_folder']

## AWS job settings
s3_resource = boto3.resource('s3')
s3_client = boto3.client('s3')
s3_bucket = s3_resource.Bucket(s3Bucket)

for s3_object in s3_bucket.objects.filter(Prefix=s3InputFolderKey):
    s3_key = s3_object.key
    s3_file = s3_client.get_object(Bucket=s3Bucket, Key=s3_key)
    df = pd.read_csv(s3_file['Body'], sep=';')
    # partKey_variable, year_variable, month_variable, day_variable and
    # s3_file_name are assumed to be defined elsewhere in the job
    partitioned_path = 'partKey={}/year={}/month={}/day={}'.format(
        partKey_variable, year_variable, month_variable, day_variable)
    s3_output_file = '{}/{}/{}'.format(s3OutputFolderKey, partitioned_path, s3_file_name)
    # Writing the new dataset to S3 (put expects bytes, not a DataFrame)
    put_response = s3_resource.Object(s3Bucket, s3_output_file).put(
        Body=df.to_csv(index=False).encode())
Carlos.
It all depends on your business requirement and what sort of action you want to take, i.e. an asynchronous or a synchronous call.
You can trigger a Lambda (github example) asynchronously when a parquet file arrives in the specified S3 bucket (aws s3 doc).
You can also configure the S3 service to send a notification to SNS or SQS when an object is added to/removed from the bucket, which in turn can invoke a Lambda to process the file (Triggering a Notification).
You can run a Lambda asynchronously every 5 minutes by scheduling it with AWS CloudWatch Events; the finest resolution using a cron expression is one minute.
You can invoke a Lambda synchronously over HTTPS (REST API endpoint) using API Gateway.
It is also worth checking how big your parquet file is, as a Lambda can run for at most 15 minutes, i.e. 900 seconds.
This page is worth checking as well: Using AWS Lambda with Other Services
It is also worth taking a look at CTAS queries in Athena: https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
Using CTAS, we can store the query results in a different format.

How to list S3 objects in parallel in PySpark using flatMap()?

I have a dataframe where each row contains a prefix that points to a location in S3. I want to use flatMap() to iterate over each row, list the S3 objects in each prefix and return a new dataframe that contains a row per file that was listed in S3.
I've got this code:
import boto3
from pyspark.sql import Row

s3 = boto3.resource('s3')

def flatmap_list_s3_files(row):
    bucket = s3.Bucket(row.bucket)
    s3_files = []
    for obj in bucket.objects.filter(Prefix=row.prefix):
        s3_files.append(obj.key)
    rows = []
    for f in s3_files:
        row_dict = row.asDict()
        row_dict['s3_obj'] = f
        rows.append(Row(**row_dict))
    return rows

df = <code that loads the dataframe>
df.rdd.flatMap(lambda x: flatmap_list_s3_files(x)).toDF()
The only problem is that the s3 object isn't pickleable, I guess? So I'm getting this error and I'm not sure what to try next:
PicklingError: Cannot pickle files that are not opened for reading
I'm a spark noob so I'm hoping there's some other API or some way to parallelize the listing of files in S3 and join that together with the original dataframe. To be clear, I'm not trying to READ any of the data in the S3 files themselves, I'm building a table that is essentially a metadata catalogue of all the files in S3. Any tips would be greatly appreciated.
You can't send an S3 client around your Spark cluster; you need to share the information needed to create one and instantiate it at the far end. I don't know about the Python API, but in the Java APIs you would just pass the path around as a string, convert it to a Path object, call Path.getFileSystem(), and work on that. The Spark workers will cache the FileSystem instances for fast reuse.
