I should begin by saying that this is not running in Spark.
What I am attempting to do is
stream n records from a parquet file in S3
process
stream back to a different file in S3
...but I am only asking about the first step here.
I have tried various things like:
from pyarrow import fs
from pyarrow.parquet import ParquetFile
s3 = fs.S3FileSystem(access_key=aws_key, secret_key=aws_secret)
with s3.open_input_stream(filepath) as f:
    print(type(f))  # pyarrow.lib.NativeFile
    parquet_file = ParquetFile(f)
    for i in parquet_file.iter_batches():  # .read_row_groups() would be better
        # process
...but I am getting OSError: only valid on seekable files, and I'm not sure how to get around it.
Apologies if this is a duplicate. I searched but didn't find quite the fit I was looking for.
Try using open_input_file, which will 'Open an input file for random access reading', instead of open_input_stream, which will 'Open an input stream for sequential reading'.
For context, in a Parquet file the metadata is stored at the end, so you need to be able to seek back and forth within the file.
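A minimal sketch of that change, reusing the aws_key, aws_secret and filepath variables from the question:

from pyarrow import fs
from pyarrow.parquet import ParquetFile

s3 = fs.S3FileSystem(access_key=aws_key, secret_key=aws_secret)

# open_input_file returns a seekable file, so ParquetFile can jump to the
# footer metadata and then back to the row groups.
with s3.open_input_file(filepath) as f:
    parquet_file = ParquetFile(f)
    for batch in parquet_file.iter_batches(batch_size=1000):
        pass  # process each pyarrow.RecordBatch here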
I have an archive of zip files that I would like to open 'through' Spark in a streaming fashion, and write the unzipped files, also in streaming, to another directory that keeps the name of each zip file (one by one).
import zipfile
import io

def zip_extract(x):
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = [i for i in file_obj.namelist()]
    return dict(zip(files, [file_obj.open(file).read() for file in files]))
Is there an easy way to make the above code read and write in a streaming fashion? Thank you for your help.
As far as I know, Spark can't read archives out of the box.
A ZIP file is both archiving and compressing data. If you can, use a program like gzip to compress the data but keep each file separate, so don't archive multiple files into a single one.
If the archive is a given and can't be changed, you can consider reading it with sparkContext.binaryFiles (https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html). This would give you the zipped file as a byte array in Spark, so you can write a mapper function which unzips it and returns the content of the file. You can then flatten that result to get an RDD of the files' contents.
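A rough sketch of that approach in PySpark, reusing the zip_extract function from the question (the input path and the SparkContext variable sc are assumptions):

# binaryFiles yields (path, bytes) pairs, which matches the (x[0], x[1]) shape
# that zip_extract expects.
zips = sc.binaryFiles("/path/to/zip/archives/*.zip")  # placeholder path

# Each zip becomes a dict of {inner_file_name: content}; flatten it so there is
# one (name, content) record per unzipped file.
extracted = zips.flatMap(lambda x: zip_extract(x).items())

From there you can map over the (name, content) pairs and write each file out under its original name with whatever client your target storage uses.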
I hope this question is as straightforward as I think it is.
Here is some background:
I'm helping out on a Python backend that is getting messy data as a CSV. In its current state it just reroutes the URL given by the API and triggers a download on the client computer. I wrote a utility using Pandas and XlsxWriter that cleans up this data, separates it into multiple tabs, makes some graphs, and then writes them to a .xlsx file. Basically like this:
import pandas as pd

writer = pd.ExcelWriter(output_file_name, engine='xlsxwriter')
# Do a bunch of stuff and save each tab to writer
writer.save()  # Writes the file
This .xlsx file would be created locally and there would need to be additional backend stuff that uploads it and cleans up the local file.
Seeing as the file is created all at once by the .save() method at the end, I was thinking it's probably possible to trigger an upload directly without creating the local file at all, but I'm not seeing anything in the XlsxWriter documentation about it. Is there any way to avoid saving a local file, within or outside of XlsxWriter?
Assuming df is your dataframe variable:
import io
import pandas as pd

buffer = io.BytesIO()
writer = pd.ExcelWriter(buffer, engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')
writer.save()  # in newer pandas versions, use writer.close() instead
data = buffer.getvalue()
data now contains the binary contents of the Excel file. For instance, you can use the requests module to upload the file somewhere.
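For example, a minimal sketch of pushing the in-memory workbook to an HTTP endpoint with requests (the URL is a placeholder):

import requests

# data is the bytes returned by buffer.getvalue() above
response = requests.post(
    "https://example.com/upload",  # hypothetical upload endpoint
    files={"file": ("report.xlsx", data)},
)
response.raise_for_status()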
I have a variety of very large (~4GB each) csv files that contain different formats. These come from data recorders from over 10 different manufacturers. I am attempting to consolidate all of these into BigQuery. In order to load these up on a daily basis I want to first load these files into Cloud Storage, determine the schema, and then load into BigQuery. Due to the fact that some of the files have additional header information (from 2 - ~30 lines) I have produced my own functions to determine the most likely header row and the schema from a sample of each file (~100 lines), which I can then use in the job_config when loading the files to BQ.
This works fine when I am working with files from local storage directly to BQ, as I can use a context manager and then Python's csv module, specifically the Sniffer and reader objects. However, there does not seem to be an equivalent way to use a context manager directly on a file in Cloud Storage. I do not want to bypass Cloud Storage in case any of these files are interrupted when loading into BQ.
What I can get to work:
import csv

# initialise variables
with open(csv_file, newline='', encoding=encoding) as datafile:
    dialect = csv.Sniffer().sniff(datafile.read(chunk_size))
    datafile.seek(0)  # rewind so the reader starts from the first row again
    reader = csv.reader(datafile, dialect)
    sample_rows = []
    row_num = 0
    for row in reader:
        sample_rows.append(row)
        row_num += 1
        if row_num > 100:
            break

sample_rows
# Carry out schema and header investigation...
With Google Cloud Storage I have attempted to use download_as_string and download_to_file, which provide binary representations of the data, but then I cannot get the csv module to work with any of it. I have attempted to use .decode('utf-8'), and it returns one long string full of \r\n's. I then used splitlines() to get a list of the data, but the csv functions still produce a dialect and reader that split the data into single characters as each entry.
Has anyone managed to get a work around to use the csv module with files stored in Cloud Storage without downloading the whole file?
After having a look at the csv source code on GitHub, I managed to use the io and csv modules in Python to solve this problem. io.BytesIO and TextIOWrapper were the two key pieces. Probably not a common use case, but I thought I would post the answer here to save some time for anyone that needs it.
import csv
import io

# Set up a storage client and create a blob object for the csv file that you are trying to read from GCS.
content = blob.download_as_string(start=0, end=10240)  # Read a chunk of bytes that will include all the header data and the start of the recorded data itself.
bytes_buffer = io.BytesIO(content)
wrapped_text = io.TextIOWrapper(bytes_buffer, encoding=encoding, newline=newline)

dialect = csv.Sniffer().sniff(wrapped_text.read())
wrapped_text.seek(0)
reader = csv.reader(wrapped_text, dialect)
# Do what you will with the reader object
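For reference, a self-contained sketch of the elided client/blob setup plus pulling ~100 sample rows out of the reader; the bucket name, object name, encoding and byte range here are placeholders, not values from the original post:

import csv
import io
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket("my-bucket")    # placeholder bucket name
blob = bucket.blob("path/to/data.csv")         # placeholder object name

# Download only the first ~10 KB instead of the whole multi-GB file.
content = blob.download_as_string(start=0, end=10240)
bytes_buffer = io.BytesIO(content)
wrapped_text = io.TextIOWrapper(bytes_buffer, encoding="utf-8", newline="")

dialect = csv.Sniffer().sniff(wrapped_text.read())
wrapped_text.seek(0)
reader = csv.reader(wrapped_text, dialect)

# Keep roughly the first 100 rows for schema/header inspection.
sample_rows = [row for _, row in zip(range(100), reader)]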
I have a dataframe where each row contains a prefix that points to a location in S3. I want to use flatMap() to iterate over each row, list the S3 objects in each prefix and return a new dataframe that contains a row per file that was listed in S3.
I've got this code:
import boto3
from pyspark.sql import Row

s3 = boto3.resource('s3')

def flatmap_list_s3_files(row):
    bucket = s3.Bucket(row.bucket)
    s3_files = []
    for obj in bucket.objects.filter(Prefix=row.prefix):
        s3_files.append(obj.key)
    rows = []
    for f in s3_files:
        row_dict = row.asDict()
        row_dict['s3_obj'] = f
        rows.append(Row(**row_dict))
    return rows
df = <code that loads the dataframe>
df.rdd.flatMap(lambda x: flatmap_list_s3_files(x)).toDF()
The only problem is that the s3 object isn't pickleable, I guess? So I'm getting this error and I'm not sure what to try next:
PicklingError: Cannot pickle files that are not opened for reading
I'm a spark noob so I'm hoping there's some other API or some way to parallelize the listing of files in S3 and join that together with the original dataframe. To be clear, I'm not trying to READ any of the data in the S3 files themselves, I'm building a table that is essentially a metadata catalogue of all the files in S3. Any tips would be greatly appreciated.
You can't send an S3 client around your Spark cluster; you need to share all the information needed to create one and instantiate it at the far end. I don't know about the Python API, but in the Java APIs you'd just pass the path around as a string, convert that to a Path object, call Path.getFileSystem(), and work with that. The Spark workers will cache the FileSystem instances for fast reuse.
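In PySpark, one way to apply that advice (a sketch, not the only option) is to construct the boto3 resource inside the function that runs on the executors, so only plain row data gets pickled:

from pyspark.sql import Row

def flatmap_list_s3_files(row):
    # Create the S3 resource on the worker; nothing unpicklable crosses the wire.
    import boto3
    s3 = boto3.resource("s3")

    rows = []
    for obj in s3.Bucket(row.bucket).objects.filter(Prefix=row.prefix):
        row_dict = row.asDict()
        row_dict["s3_obj"] = obj.key
        rows.append(Row(**row_dict))
    return rows

listed_df = df.rdd.flatMap(flatmap_list_s3_files).toDF()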
I have hundreds of parquet files in S3 and I want to check whether all of them were created properly. Basically, the downstream system should be able to read these parquet files without any issue. Before the downstream system reads these files, I want my Python script to read a sample of 10 records from each parquet file.
I am using the below syntax to read the parquet file:
import io
import boto3
import pandas as pd

s3 = boto3.client('s3')
buffer = io.BytesIO()
result = s3.get_object(Bucket="my bucket", Key="my file location")
text = result["Body"].read().decode()
I need your input on how to read sample records, not all the records, from each parquet file. Thank you.
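One possible sketch, mirroring the pyarrow approach from the first question above, is to open each file for random access and read only the first batch of 10 rows; the bucket/key path is a placeholder and the credentials are assumed to come from the environment:

from pyarrow import fs
from pyarrow.parquet import ParquetFile

s3 = fs.S3FileSystem()  # assumption: credentials picked up from the environment

with s3.open_input_file("my-bucket/path/to/file.parquet") as f:  # placeholder path
    pf = ParquetFile(f)
    # Pull only the first batch of 10 rows instead of reading the whole file.
    first_batch = next(pf.iter_batches(batch_size=10))
    sample_df = first_batch.to_pandas()
    print(sample_df)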