I'm trying to load more than 20 million records to my Dynamodb table using below code from EMR 5 node cluster. But it is taking more hours and hours time to load completely. I have much more huge data to load, but i want to load it in span of few minutes. How to achieve this?
Below is my code. I just changed original column names and I have 20 columns to insert. The problem here is with slow loading.
import boto3
import json
import decimal
dynamodb = boto3.resource('dynamodb','us-west')
table = dynamodb.Table('EMP')
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='mybucket', Key='emp-rec.json')
records = json.loads(obj['Body'].read().decode('utf-8'), parse_float = decimal.Decimal)
with table.batch_writer() as batch:
for rec in records:
batch.put_item(Item=rec)
First, you should use Amazon CloudWatch to check whether you are hitting limits for your configure Write Capacity Units on the table. If so, you can increase the capacity, at least for the duration of the load.
Second, the code is creating batches of one record, which wouldn't be very efficient. The batch_writer() can be used to process multiple records, such as in this sample code from the batch_writer() documentation:
with table.batch_writer() as batch:
for _ in xrange(1000000):
batch.put_item(Item={'HashKey': '...',
'Otherstuff': '...'})
Notice how the for loop is inside the batch_writer()? That way, multiple records are stored within one batch. Your code sample, however, has the for outside of the batch_writer(), which results in a batch size of one.
Related
Hi everyone I am struggling with this problem: I am trying to insert in an azure db a python list made of approx 100k rows, using this code:
list_of_rows = [...]
self.azure_cursor.fast_executemany = True
self.azure_cursor.executemany('''INSERT INTO table_name VALUES(?,?,?,?,?,?,?,?,?,?,?)''',list_of_rows)
The problem is that it takes ages to do that (about 43 seconds for 100k rows, for an amount of less than 30 MB of data) and I don't know how to improve it, because I am already using fast_executemany and as seen from azure dashboard I don't reach max DTU granted by my subscription plan (S1-20 DTU).
I've also tried to see if an index would help, but there are no advantages (and trying to run the query in SSMS no index is recommended).
Finally, the problem is not about connection, since I am using 1Gb/s download/upload
Does someone know how to improve these performances?
UPDATE
Tried to use the code below as suggested in the page linked by by Shiraz Bhaiji:
Firstly, I create a pandas dataframe from my list of rows, then set up the engine and create the event listener and then I use df.to_sql
self.df = pd.DataFrame(data = list_of_rows , columns=['A','B','C'])
params='DRIVER=driver;SERVER=server;PORT=1433;DATABASE=databas;UID=username;PWD=password'
db_params = urllib.parse.quote_plus(params)
self.engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect={}".format(db_params))
#event.listens_for(self.engine, "before_cursor_execute")
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
if executemany:
cursor.fast_executemany = True
df.to_sql('table_name', self.engine, index=False, if_exists="append", schema="dbo")
The code below takes the same time as pure executemany. I tried to remove PK (there are no other indexes on the table) and it made insert faster, now it takes 22 seconds, but is too much for 100k rows for a total amount of 30 MB of data
If you use the to_sql function instead, you can speed up the inserts.
See: https://medium.com/analytics-vidhya/speed-up-bulk-inserts-to-sql-db-using-pandas-and-python-61707ae41990
I need to know how many write units a certain object will consume before inserting it to DynamoDB.
There are some docs from AWS explaining how to estimate write units, however, I would like to check if there is some built in function in boto3 before writing my own.
import boto3
some_object = { ... }
dynamo = boto3.resource('dynamodb')
# get kb size my object will consume
size = dynamo.get_size_of(some_object)
# or even how many write units
writers = dynamo.get_writers(some_object)
There is nothing inside boto3 for calculating the size of items you're going to write. We sometimes use the code from this blog post to check python data structures before writing them.
I have to do 100000 sequential HTTP requests with Spark. I have to store responses into S3. I said sequential, because each request returns around 50KB of data, and I have to keep 1 second in order to not exceed API rate limits.
Where to make HTTP calls: from Spark Job's code (executed on driver/master node) or from dataset transformation (executed on worker node)?
Workarrounds
Make HTTP request from my Spark job (on Driver/Master node), create dataset of each HTTP response (each contains 5000 json items) and save each dataset to S3 with help of spark. You do not need to keep dataset after you saved it
Create dataset from all 100000 URLs (move all further computations to workers), make HTTP requests inside map or mapPartition, save single dataset to S3.
The first option
It's simpler and it represents a nature of my compurations - they're sequential, because of 1 second delay. But:
Is it bad to make 100_000 HTTP calls from Driver/Master node?
*Is it more efficient to create/save one 100_000 * 5_000 dataset than creating/saving 100_000 small datasets of size 5_000*
Each time I creating dataset from HTTP response - I'll move response to worker and then save it to S3, right? Double shuffling than...
Second option
Actually it won't benefit from parallel processing, since you have to keep interval of 1 second because request. The only bonus is to moving computations (even if they aren't too hard) from driver. But:
Is it worth of moving computations to workers?
Is it a good idea to make API call inside transformation?
Saving a file <32MB (or whatever fs.s3a.block.size is) to S3 is ~2xGET, 1xLIST and a PUT; you get billed a bit by AWS for each of these calls, plus storage costs.
For larger files, a POST to initiate multipart upload after that first block, one POST per 32 MB (of 32MB, obviously) and a final POST of a JSON file to complete. So: slightly more efficient
Where small S3 sizes matter is in the bills from AWS and followup spark queries: anything you use in spark, pyspark, SQL etc. many small files are slower: Theres a high cost in listing files in S3, and every task pushed out to a spark worker has some setup/commit/complete costs.
regarding doing HTTP API calls inside a worker, well, you can do fun things there. If the result isn't replicable then task failures & retries can give bad answers, but for a GET it should be OK. What is hard is throttling the work; I'll leave you to come up with a strategy there.
Here's an example of uploading files to S3 or other object store in workers; first the RDD of the copy src/dest operations is built up, then they are pushed out to workers. The result of the worker code includes upload duration length info, if someone ever wanted to try and aggregate the stats (though there you'd probably need timestamp for some time series view)
Given you have to serialize the work to one request/second, 100K requests is going to take over a day. if each request takes <1 second, you may as well run it on a single machine. What's important is to save the work incrementally so that if your job fails partway through you can restart from the last checkpoint. I'd personally focus on that problem: how could do this operation such that every 15-20 minutes of work was saved, and on a restart you can carry on from there.
Spark does not handle recovery of a failed job, only task failures. Lose the driver and you get to restart your last query. Break things up.
Something which comes to mind could be
* first RDD takes list of queries and some summary info about any existing checkpointed data, calculates the next 15 minutes of work,
* building up a list of GET calls to delegate to 1+ worker. Either 1 URL/row, or have multiple URLs in a single row
* run that job, save the results
* test recovery works with a smaller window and killing things.
* once happy: do the full run
Maybe also: recognise & react to any throttle events coming off the far end by
1. Sleeping in the worker
1. returning a count of throttle events in the results, so that the driver can initially collect aggregate stats and maybe later tune sleep window for subsequent tasks.
I have a large data chunk(about 10M rows) in Amazon-Redishift, that I was to obtain in a Pandas data-frame and store the data in a pickle file. However, it shows "Out of Memory" exception for obvious reasons, because of the size of data. I tried a lot other things like sqlalchemy, however, not able to crack the Problem. Can anyone suggest a better way or code to get through it.
My current (simple) code snippet goes as below:
import psycopg2
import pandas as pd
import numpy as np
cnxn = psycopg2.connect(dbname=<mydatabase>, host='my_redshift_Server_Name', port='5439', user=<username>, password=<pwd>)
sql = "Select * from mydatabase.mytable"
df = pd.read_sql(sql, cnxn, columns=1)
pd.to_pickle(df, 'Base_Data.pkl')
print(df.head(50))
cnxn.close()
print(df.head(50))
1) find the row count in the table and the maximum chunk of the table that you can pull by adding order by [column] limit [number] offset 0 and increasing the limit number reasonably
2) add a loop that will produce the sql with the limit that you found and increasing offset, i.e. if you can pull 10k rows your statements would be:
... limit 10000 offset 0;
... limit 10000 offset 10000;
... limit 10000 offset 20000;
until you reach the table row count
3) in the same loop, append every new obtained set of rows to your dataframe.
p.s. this will work assuming you won't run into any issues with memory/disk on client end which I can't guarantee since you have such issue on a cluster which is likely higher grade hardware. To avoid the problem you would just write a new file on every iteration instead of appending.
Also, the whole approach is probably not right. You'd better unload the table to S3 which is pretty quick because the data is copied from every node independently, and then do whatever needed against the flat file on S3 to transform it to the final format you need.
If you're using pickle to just transfer the data somewhere else, I'd repeat the suggestion from AlexYes's answer - just use S3.
But if you want to be able to work with the data locally, you have to limit yourself to the algorithms that do not require all data to work.
In this case, I would suggest something like HDF5 or Parquet for data storage and Dask for data processing since it doesn't require all the data to reside in memory - it can work in chunks and in parallel. You can migrate your data from Redshift using this code:
from dask import dataframe as dd
d = dd.read_sql_table(my_table, my_db_url, index_col=my_table_index_col)
d.to_hdf('Base_Data.hd5', key='data')
I have an ML dataframe which I read from csv files. It contains three types of columns:
ID Timestamp Feature1 Feature2...Feature_n
where n is ~ 500 (500 features in ML parlance). The total number of rows in the dataset is ~ 160 millions.
As this is the result of a previous full join, there are many features which do not have values set.
My aim is to run a "fill" function(fillna style form python pandas), where each empty feature value gets set with the previously available value for that column, per Id and Date.
I am trying to achieve this with the following spark 2.2.1 code:
val rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)
val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(-50000, -1)
val columns = Array(...) //first 30 columns initially, just to see it working
val rawDataSetFilled = columns.foldLeft(rawDataset) { (originalDF, columnToFill) =>
originalDF.withColumn(columnToFill, coalesce(col(columnToFill), last(col(columnToFill), ignoreNulls = true).over(window)))
}
I am running this job on a 4 m4.large instances on Amazon EMR, with spark 2.2.1. and dynamic allocation enabled.
The job runs for over 2h without completing.
Am I doing something wrong, at the code level? Given the size of the data, and the instances, I would assume it should finish in a reasonable amount of time? And I haven't even tried with the full 500 columns, just with about 30!
Looking in the container logs, all I see are many logs like this:
INFO codegen.CodeGenerator: Code generated in 166.677493 ms
INFO execution.ExternalAppendOnlyUnsafeRowArray: Reached spill
threshold of
4096 rows, switching to
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
I have tried setting parameter spark.sql.windowExec.buffer.spill.threshold to something larger, without any impact. Is theresome other setting I should know about? Those 2 lines are the only ones I see in any container log.
In Ganglia, I see most of the CPU cores peaking around full usage, but the memory usage is lower than the maximum available. All executors are allocated and are doing work.
I have managed to rewrite the fold left logic without using withColumn calls. Apparently they can be very slow for large number of columns, and I was also getting stackoverflow errors because of that.
I would be curious to know why this massive difference - and what exactly happens behind the scenes with the query plan execution, which makes repeated withColumns calls so slow.
Links which proved very helpful: Spark Jira issue and this stackoverflow question
var rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)
val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(Window.unboundedPreceding, Window.currentRow)
rawDataset = rawDataset.select(rawDataset.columns.map(column => coalesce(col(column), last(col(column), ignoreNulls = true).over(window)).alias(column)): _*)
rawDataset.write.option("header", "true").csv(outputLocation)