I tried to read parquet from s3 like this:
import dask.dataframe as dd

s3_path = "s3://my_bucket/my_table"
times = dd.read_parquet(
    s3_path,
    storage_options={
        "client_kwargs": {
            "endpoint_url": bucket_endpoint_url,
        },
        "profile_name": bucket_profile,
    },
    engine='pyarrow',
)
It takes a very long time just to create the Dask dataframe, even though no computation has been performed on it yet. Tracing the code, it looks like the time is spent in pyarrow.parquet.validate_schema().
My Parquet table has a lot of files in it (~2000 files), and it takes 543 seconds on my laptop just to create the dataframe, because the schema of every Parquet file is being checked. Is there a way to disable schema validation?
Thanks,
Currently, if there is no metadata file and you're using the PyArrow backend, then Dask is probably sending a request to read metadata from each of the individual partitions on S3. This is quite slow.
Dask's dataframe parquet reader is being rewritten now to help address this. You might consider using fastparquet until then, together with the ignore_divisions keyword (or something like that), or checking back in a month or two.
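As a rough sketch of that workaround (assuming fastparquet is installed; the keyword for skipping per-file statistics/divisions checks varies between Dask versions, so check the docs for yours):
import dask.dataframe as dd

times = dd.read_parquet(
    "s3://my_bucket/my_table",
    engine='fastparquet',   # avoid the per-file PyArrow schema validation
    storage_options={
        "client_kwargs": {"endpoint_url": bucket_endpoint_url},
        "profile_name": bucket_profile,
    },
)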
Related
I'm trying to work on a dataset with 510,000 rows and 636 columns. I loaded it into a dataframe using the Dask dataframe method, but the entries can't be displayed. When I try to get the shape, it results in delays. Is there a way for me to analyze the whole dataset without using big data technologies like PySpark?
from dask import dataframe
import requests
import zipfile
import os
import pandas as pd

if os.path.exists('pisa2012.zip') == False:
    r = requests.get('https://s3.amazonaws.com/udacity-hosted-downloads/ud507/pisa2012.csv.zip', allow_redirects=True)
    open('pisa2012.zip', 'wb').write(r.content)

if os.path.exists('pisa2012.csv') == False:
    with zipfile.ZipFile('pisa2012.zip', 'r') as zip_ref:
        zip_ref.extractall('./')

df_pisa = dataframe.read_csv('pisa2012.csv')
df_pisa.shape  # Output: (Delayed('int-e9d8366d-1b9e-4f8e-a83a-1d4cac510621'), 636)
Firstly, spark, dask and vaex are all "big data" technologies.
it results in delays
If you read the documentation, you will see that Dask is lazy and only performs operations on demand; you have to ask for them explicitly. The reason is that just getting the shape requires reading all the data, but the data will not be held in memory - that is the whole point, and the feature that lets you work with bigger-than-memory data (otherwise, just use pandas).
This works:
df_pisa.shape.compute()
But, better, figure out what you actually want to do with the data; I assume you are not just after the shape. You can put multiple operations/delayed objects into a single dask.compute() to do them at once and not have to repeat expensive tasks like reading/parsing the file.
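For example, a minimal sketch (the null-count is just a stand-in for whatever analysis you actually want to run):
import dask
from dask import dataframe

df_pisa = dataframe.read_csv('pisa2012.csv')

# one pass over the file computes both results
n_rows, null_counts = dask.compute(df_pisa.shape[0], df_pisa.isnull().sum())
print(n_rows, df_pisa.shape[1])   # the column count is known without reading the data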
I have a function that reads large CSV files using a Dask dataframe and then converts them to a pandas dataframe, which takes quite a lot of time. The code is:
def t_createdd(Path):
    dataframe = dd.read_csv(Path, sep=chr(1), encoding="utf-16")
    return dataframe

#Get the latest file
Array_EXT = "Export_GTT_Tea2Array_*.csv"
array_csv_files = sorted([file
                          for path, subdir, files in os.walk(PATH)
                          for file in glob(os.path.join(path, Array_EXT))])
latest_Tea2Array = array_csv_files[(len(array_csv_files)-(58+25)):
                                   (len(array_csv_files)-58)]
Tea2Array_latest = t_createdd(latest_Tea2Array)

#keep only the required columns
Tea2Array = Tea2Array_latest[['Parameter_Id','Reading_Id','X','Value']]

P1MI3 = Tea2Array.loc[Tea2Array['parameter_id']==168566]
P1MI3 = P1MI3.compute()

P1MJC_main = Tea2Array.loc[Tea2Array['parameter_id']==168577]
P1MJC_old = P1MJC_main.compute()
P1MI3 = P1MI3.compute() and P1MJC_old = P1MJC_main.compute() take around 10 and 11 minutes respectively to execute. Is there any way to reduce the time?
I would encourage you to consider, with reference to the Dask documentation, why you would expect the process to be any faster than using Pandas alone.
Consider:
- file access may happen from several threads, but you only have one disc interface as a bottleneck, and it likely performs much better reading sequentially than trying to read several files in parallel
- reading CSVs is CPU-heavy and needs the Python GIL, so the multiple threads will not actually be running in parallel
- when you compute, you materialise the whole dataframe. It is true that you appear to be selecting a single row in each case, but Dask has no way to know in which file/part it is.
- you call compute twice, but could have combined the two calls: Dask works hard to evict data from memory that is not currently needed by any computation, so running them separately does double the work. By calling compute once on both outputs, you would halve the time.
Further remarks:
- obviously you would do much better if you knew which partition contained what
- you can get around the GIL by using processes, e.g., Dask's distributed scheduler
- if you only need certain columns, do not bother to load everything and then subselect; include those columns right in the read_csv call, saving a lot of time and memory (true for pandas or Dask) - see the sketch below
To compute both lazy things at once:
import dask

dask.compute(P1MI3, P1MJC_main)
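Putting the column selection and the single compute call together, a rough sketch (column names are taken from the question, which mixes Parameter_Id and parameter_id; I assume the former):
import dask
import dask.dataframe as dd

# read only the columns that are actually needed (usecols is passed through to pandas)
Tea2Array = dd.read_csv(latest_Tea2Array, sep=chr(1), encoding="utf-16",
                        usecols=['Parameter_Id', 'Reading_Id', 'X', 'Value'])

p1mi3_lazy = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168566]
p1mjc_lazy = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168577]

# a single compute call reads and parses the files once for both selections
P1MI3, P1MJC_old = dask.compute(p1mi3_lazy, p1mjc_lazy)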
I'm trying to load more than 20 million records into my DynamoDB table using the code below from a 5-node EMR cluster, but it is taking many hours to load completely. I have much more data to load and I want to load it in the span of a few minutes. How can I achieve this?
Below is my code. I just changed the original column names, and I have 20 columns to insert. The problem here is the slow loading.
import boto3
import json
import decimal

dynamodb = boto3.resource('dynamodb', 'us-west')
table = dynamodb.Table('EMP')

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='mybucket', Key='emp-rec.json')
records = json.loads(obj['Body'].read().decode('utf-8'), parse_float=decimal.Decimal)

with table.batch_writer() as batch:
    for rec in records:
        batch.put_item(Item=rec)
First, you should use Amazon CloudWatch to check whether you are hitting limits for your configured Write Capacity Units on the table. If so, you can increase the capacity, at least for the duration of the load.
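For example, a minimal sketch of raising the provisioned write capacity before the bulk load (the table name comes from the question; the region and the capacity numbers are placeholders):
import boto3

client = boto3.client('dynamodb', region_name='us-west-2')   # placeholder region
client.update_table(
    TableName='EMP',
    ProvisionedThroughput={
        'ReadCapacityUnits': 5,        # placeholder, keep reads as they are
        'WriteCapacityUnits': 1000,    # temporarily raise writes for the load
    },
)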
Second, the code is creating batches of one record, which wouldn't be very efficient. The batch_writer() can be used to process multiple records, such as in this sample code from the batch_writer() documentation:
with table.batch_writer() as batch:
    for _ in range(1000000):
        batch.put_item(Item={'HashKey': '...',
                             'Otherstuff': '...'})
Notice how the for loop is inside the batch_writer()? That way, multiple records are stored within one batch. Your code sample, however, has the for outside of the batch_writer(), which results in a batch size of one.
I am trying to figure out the fastest way to write a LARGE pandas DataFrame to the S3 filesystem. I am currently trying two ways:
1) Through gzip compression (BytesIO) and boto3
import gzip
import boto3
from io import BytesIO, TextIOWrapper

gz_buffer = BytesIO()
with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
    df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)

s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object(bucket, s3_path + name_zip)
s3_object.put(Body=gz_buffer.getvalue())
which, for a dataframe of 7M rows, takes around 420 seconds to write to S3.
2) Through writing to csv file without compression (StringIO buffer)
csv_buffer = StringIO()
data.to_csv(csv_buffer)
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket, s3_path + name_csv).put(Body=csv_buffer.getvalue())
which takes around 371 seconds...
The question is:
Is there any other faster way to write a pandas dataframe to S3?
Use multi-part uploads to make the transfer to S3 faster. Compression makes the file smaller, so that will help too.
import boto3
from io import BytesIO

s3 = boto3.client('s3')

csv_buffer = BytesIO()
# recent pandas versions can write gzip-compressed bytes into a binary buffer
df.to_csv(csv_buffer, compression='gzip')
csv_buffer.seek(0)   # rewind the buffer before uploading

# multipart upload
# use boto3.s3.transfer.TransferConfig if you need to tune part size or other settings
s3.upload_fileobj(csv_buffer, bucket, key)
The docs for s3.upload_fileobj are here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_fileobj
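Reusing the client and buffer from the snippet above, a small sketch of tuning the multipart behaviour (the numbers are placeholders, close to boto3's defaults):
from boto3.s3.transfer import TransferConfig

config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,   # start multipart uploads above 8 MB
    multipart_chunksize=8 * 1024 * 1024,   # 8 MB parts
    max_concurrency=10,                    # parts uploaded in parallel
)
s3.upload_fileobj(csv_buffer, bucket, key, Config=config)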
You can try using s3fs with pandas compression to upload to S3. StringIO and BytesIO are memory-hogging.
import s3fs
import pandas as pd

s3 = s3fs.S3FileSystem(anon=False)
df = pd.read_csv("some_large_file")

# open in binary mode so that pandas can write gzip-compressed bytes
with s3.open('s3://bucket/file.csv.gzip', 'wb') as f:
    df.to_csv(f, compression='gzip')
It really depends on the content, but that is not related to boto3. Try first to dump your DataFrame locally and see what's fastest and what size you get.
Here are some suggestions that we have found to be fast, for cases from a few MB to over 2 GB (although, for more than 2 GB, you really want Parquet, and possibly to split it into a Parquet dataset):
- Lots of mixed text/numerical data (SQL-oriented content): use df.to_parquet(file).
- Mostly numerical data (e.g. if your columns' df.dtypes indicate a happy numpy array of a single type, not Object): you can try df.to_hdf(file, 'key').
One bit of advice: try to split your df into shards that are meaningful to you (e.g. by time for time series). Especially if you have a lot of updates to a single shard (e.g. the last one in a time series), it will make your download/upload much faster.
What we have found is that HDF5 files are bulkier (uncompressed), but they save/load fantastically fast from/into memory. Parquet files are snappy-compressed by default, so they tend to be smaller (depending on the entropy of your data, of course; penalty for you if you save totally random numbers).
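A small sketch of "dump locally and compare", assuming df is your DataFrame and that pyarrow (for Parquet) and PyTables (for HDF5) are installed:
import os
import time

def dump_and_report(path, write):
    # time a write function and report elapsed seconds and file size
    t0 = time.time()
    write(path)
    print(f"{path}: {time.time() - t0:.1f}s, {os.path.getsize(path) / 1e6:.1f} MB")

dump_and_report("test.parquet", lambda p: df.to_parquet(p))          # snappy-compressed by default
dump_and_report("test.h5", lambda p: df.to_hdf(p, key="data"))       # uncompressed HDF5
dump_and_report("test.csv.gz", lambda p: df.to_csv(p, index=False))  # gzip inferred from the suffix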
For the boto3 client, both multipart_chunksize and multipart_threshold default to 8 MB, which is often a fine choice. You can check via:
import boto3.s3.transfer

tc = boto3.s3.transfer.TransferConfig()
print(f'chunksize: {tc.multipart_chunksize}, threshold: {tc.multipart_threshold}')
Also, the default is to use 10 threads for each upload (which does nothing unless the size of your object is larger than the threshold above).
Another question is how to upload many files efficiently. That is not handled by anything in TransferConfig. But I digress; the original question is about a single object.
First, check that you are writing to a bucket that is in the same region as your notebook.
Second, you can try the multipart upload option, which takes files larger than a few GB and uploads them in parts in parallel:
import os
import boto3
from boto3.s3.transfer import TransferConfig

def s3_upload_file(args):
    s3 = boto3.resource('s3')
    GB = 1024 ** 3
    config = TransferConfig(multipart_threshold=5 * GB)
    s3.meta.client.upload_file(args.path, args.bucket, os.path.basename(args.path), Config=config)
I have a large chunk of data (about 10M rows) in Amazon Redshift that I want to pull into a pandas dataframe and store in a pickle file. However, it raises an "Out of Memory" exception, for obvious reasons, because of the size of the data. I tried a lot of other things like sqlalchemy, but was not able to crack the problem. Can anyone suggest a better way or code to get through it?
My current (simple) code snippet goes as below:
import psycopg2
import pandas as pd
import numpy as np
cnxn = psycopg2.connect(dbname=<mydatabase>, host='my_redshift_Server_Name', port='5439', user=<username>, password=<pwd>)
sql = "Select * from mydatabase.mytable"
df = pd.read_sql(sql, cnxn, columns=1)
pd.to_pickle(df, 'Base_Data.pkl')
print(df.head(50))
cnxn.close()
1) Find the row count in the table and the maximum chunk of the table that you can pull, by adding order by [column] limit [number] offset 0 and increasing the limit number reasonably.
2) Add a loop that will produce the SQL with the limit that you found and an increasing offset, i.e. if you can pull 10k rows your statements would be:
... limit 10000 offset 0;
... limit 10000 offset 10000;
... limit 10000 offset 20000;
until you reach the table row count
3) In the same loop, append every newly obtained set of rows to your dataframe; a sketch of this loop follows below.
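A rough sketch of steps 2) and 3) (the connection details come from the question; the sort column is a placeholder, and collecting chunks in a list before one concat avoids repeatedly copying a growing dataframe):
import pandas as pd
import psycopg2

cnxn = psycopg2.connect(dbname='mydatabase', host='my_redshift_Server_Name',
                        port='5439', user='username', password='pwd')

chunk_size = 10000       # the largest chunk you found you can pull comfortably
offset = 0
chunks = []
while True:
    sql = (f"select * from mydatabase.mytable "
           f"order by my_sort_column "          # placeholder column, for a stable ordering
           f"limit {chunk_size} offset {offset};")
    part = pd.read_sql(sql, cnxn)
    if part.empty:
        break
    chunks.append(part)
    offset += chunk_size

df = pd.concat(chunks, ignore_index=True)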
p.s. This will work assuming you don't run into any memory/disk issues on the client end, which I can't guarantee, since you already hit such an issue on a cluster that is likely higher-grade hardware. To avoid the problem, you could just write a new file on every iteration instead of appending.
Also, the whole approach is probably not right. You'd better unload the table to S3, which is pretty quick because the data is copied from every node independently, and then do whatever is needed against the flat files on S3 to transform them into the final format you need.
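A minimal sketch of that UNLOAD route (connection details, bucket path, and the IAM role ARN are all placeholders):
import psycopg2

cnxn = psycopg2.connect(dbname='mydatabase', host='my_redshift_Server_Name',
                        port='5439', user='username', password='pwd')
cnxn.autocommit = True

# Redshift writes gzipped file slices to S3 in parallel, straight from the cluster nodes
unload_sql = """
    UNLOAD ('select * from mydatabase.mytable')
    TO 's3://my-bucket/mytable/part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
    GZIP DELIMITER ',' ALLOWOVERWRITE;
"""
with cnxn.cursor() as cur:
    cur.execute(unload_sql)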
If you're using pickle to just transfer the data somewhere else, I'd repeat the suggestion from AlexYes's answer - just use S3.
But if you want to be able to work with the data locally, you have to limit yourself to algorithms that do not require all of the data at once.
In this case, I would suggest something like HDF5 or Parquet for data storage and Dask for data processing since it doesn't require all the data to reside in memory - it can work in chunks and in parallel. You can migrate your data from Redshift using this code:
from dask import dataframe as dd
d = dd.read_sql_table(my_table, my_db_url, index_col=my_table_index_col)
d.to_hdf('Base_Data.hd5', key='data')
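Afterwards you can keep working on the file lazily with Dask rather than loading everything into pandas; a short sketch (the column name is a placeholder):
from dask import dataframe as dd

d = dd.read_hdf('Base_Data.hd5', key='data')
print(d['some_numeric_column'].mean().compute())   # processed in chunks, in parallel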