Is there a way to write every row of my Spark dataframe as a new item in a DynamoDB table, in PySpark?
I used this code with the boto3 library, but I wonder if there's another way that avoids the pandas conversion and the for loop:
sparkDF_dict = sparkDF.toPandas().to_dict('records')
for item in sparkDF_dict:
    table.put_item(Item=item)
DynamoDB offers a BatchWriteItem API. It is available in boto3, so you could call it after creating slices of the sparkDF_dict 25 elements long. Note, the BatchWriteItem API only supports writing 25 items at a time, and not all writes may succeed at first (as they may get throttled on the service side and come back to you in the UnprocessedItems part of the response). Your application will need to look at UnprocessedItems in the response and retry as needed.
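For illustration, here is a minimal sketch of that slicing-plus-retry approach. The table name 'my-table' is a placeholder, sparkDF_dict is the list from the question, and the retry loop is deliberately simple (a real application would add backoff between retries):
import boto3

dynamodb = boto3.resource('dynamodb')
TABLE_NAME = 'my-table'  # placeholder: replace with your table name

def write_in_batches(items, table_name=TABLE_NAME):
    for start in range(0, len(items), 25):  # BatchWriteItem accepts at most 25 items per call
        chunk = items[start:start + 25]
        request = {table_name: [{'PutRequest': {'Item': item}} for item in chunk]}
        while request:  # retry whatever the service reports back as unprocessed
            response = dynamodb.batch_write_item(RequestItems=request)
            request = response.get('UnprocessedItems') or {}

write_in_batches(sparkDF_dict)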
Related
Right now I'm copying files from Google Cloud Storage to BigQuery using the following line in Node.js:
const bigquery = new BigQuery();
bigquery.dataset(xx).table(xx).load(storage.bucket(bucketName).file(fileName));
But now I'd like to add a new timestamp column to this file. So how can I do this?
There are two questions I can think of:
First, read this file into some data structure like an array:
array = FunctionToReadFileNameToArray(FileName);
Do we have such a function? Supposing we do, it's then quite easy to manipulate the array to add a timestamp column.
Second, load the new array data into BigQuery. But the only way I can find to insert data is the streaming API:
bigquery.dataset(xx).table(xx).insert(rows);
And here rows is a different data structure, like a dictionary/map rather than an array. So how can we load an array into BigQuery?
Thanks
I'm going to assume you have a file (object) of structured records (JSON, XML, CSV). The first task would appear to be opening that GCS object for reading. You would then read one record at a time, augment each record with your desired extra column (the timestamp), and invoke the insert() API. This API can take a single object to be inserted or an array of objects.
However, if this is a one-time event or can be performed in batch, you may find it cheaper to read the GCS object, write a new GCS object containing your desired data, and THEN load that data into BigQuery as a unit. Looking at the pricing for BigQuery, streaming inserts are charged at $0.01 per 200 MB in addition to the storage costs, a charge that is bypassed when loading a GCS object as a unit. My own thinking is that doing extra work to save pennies is a poor use of time/money, but if you are processing TB of data over months, it may add up.
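For what it's worth, a rough sketch of that batch approach could look like the following. It uses the Python GCS and BigQuery client libraries rather than the Node.js ones, the bucket, file, and table names are placeholders, and the source object is assumed to be a headerless CSV:
from datetime import datetime, timezone
from google.cloud import bigquery, storage

BUCKET = 'my-bucket'          # placeholder names
SOURCE = 'data.csv'
TARGET = 'data_with_ts.csv'
TABLE = 'my_project.my_dataset.my_table'

# 1. Read the original object and append a timestamp column to every record
#    (assumes a headerless CSV).
bucket = storage.Client().bucket(BUCKET)
lines = bucket.blob(SOURCE).download_as_text().splitlines()
ts = datetime.now(timezone.utc).isoformat()
rewritten = '\n'.join(f'{line},{ts}' for line in lines)

# 2. Write the augmented data back to GCS, then load it into BigQuery as a unit
#    (a load job, so no streaming-insert charges).
bucket.blob(TARGET).upload_from_string(rewritten)
bq = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.CSV, autodetect=True)
bq.load_table_from_uri(f'gs://{BUCKET}/{TARGET}', TABLE, job_config=job_config).result()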
I'm trying to write a Dataflow job that needs to process logs located on storage and write them to different BigQuery tables. Which output tables are used depends on the records in the logs. So I do some processing on the logs and yield them with a key based on a value in the log, after which I group the logs by key. I need to write all the logs grouped under the same key to a table.
I'm trying to use the beam.io.gcp.bigquery.WriteToBigQuery module with a callable as the table argument as described in the documentation here
I would like to use a date-partitioned table as this will easily allow me to write_truncate on the different partitions.
Now I encounter 2 main problems:
CREATE_IF_NEEDED gives an error because it has to create a partitioned table. I can circumvent this by making sure the tables exist in a previous step and creating them if they don't.
If I load older data, I get the following error:
"The destination table's partition table_name_x$20190322 is outside the allowed bounds. You can only stream to partitions within 31 days in the past and 16 days in the future relative to the current date."
This seems like a limitation of streaming inserts; is there any way to do batch inserts?
Maybe I'm approaching this wrong and should use another method.
Any guidance on how to tackle these issues is appreciated.
I'm using Python 3.5 and apache-beam 2.13.0.
That error message can be logged when one mixes the use of an ingestion-time partitioned table and a column-partitioned table (see this similar issue). Summarizing from the link, it is not possible to use column-based partitioning (as opposed to ingestion-time partitioning) and also write to tables with partition suffixes.
In your case, since you want to write to different tables based on a value in the log and have partitions within each table, forgo the use of the partition decorator when selecting which table (use "[prefix]_YYYYMMDD") and then have each individual table be column-based partitioned.
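A sketch of what that can look like with the Python SDK is below; the project, dataset, and the 'log_type' and 'event_date' fields are made-up names, and the destination tables are assumed to already exist as column-partitioned tables (created in a previous step, per point 1):
import apache_beam as beam

def destination_table(element):
    # Build a '[prefix]_YYYYMMDD' table name (an underscore suffix, not the '$'
    # partition decorator); 'log_type' and 'event_date' are illustrative field names.
    prefix = element['log_type']
    suffix = element['event_date'].replace('-', '')  # e.g. '2019-03-22' -> '20190322'
    return 'my_project:my_dataset.{}_{}'.format(prefix, suffix)

(parsed_logs  # the PCollection of dicts produced by the earlier processing steps
 | 'WriteToBQ' >> beam.io.WriteToBigQuery(
       table=destination_table,
       write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
       create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))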
Using Boto3's batch insert, what is the maximum number of records we can insert into a DynamoDB table? Suppose I'm reading my input JSON from an S3 bucket and it is 6 GB in size.
Does it cause any performance issues when inserting as a batch? Any sample is helpful. I just started looking into this; based on my findings, I'll update here.
Thanks in advance.
You can use the Boto3 batch_writer() function to do this. The batch writer handles chunking up the items into batches, retrying, etc. You create the batch writer as a context manager, add all of your items within the context, and the batch writer sends your batch requests when it exits the context.
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('table-name')

# The batch writer buffers the items and sends them in batches of up to 25,
# retrying any unprocessed items for you.
with table.batch_writer() as writer:
    for item in table_data:
        writer.put_item(Item=item)
There's a complete working code example on GitHub here: https://github.com/awsdocs/aws-doc-sdk-examples/blob/master/python/example_code/dynamodb/batching/dynamo_batching.py.
You can find information like this in the service documentation for BatchWriteItem:
A single call to BatchWriteItem can write up to 16 MB of data, which can comprise as many as 25 put or delete requests. Individual items to be written can be as large as 400 KB.
There are no performance issues, aside from consuming the write capacity units.
Right now the way I am doing my workflow is like this:
get a list of rows from a Postgres database (let's say 10,000)
for each row I need to call an API endpoint and get a value, so 10,000 values returned from the API
for each row that has a returned value, I need to update a field in the database: 10,000 rows updated
Right now I am doing an update after each API fetch, but as you can imagine this isn't the most optimized way.
What other option do I have?
The bottleneck in that code is probably fetching the data from the API. This trick only lets you send many small queries to the DB faster, without having to wait a round-trip time between each update.
To do multiple updates in a single query, you could use common table expressions and pack multiple small queries into a single CTE query:
https://runkit.com/embed/uyx5f6vumxfy
knex
.with('firstUpdate', knex.raw('?', [knex('table').update({ colName: 'foo' }).where('id', 1)]))
.with('secondUpdate', knex.raw('?', [knex('table').update({ colName: 'bar' }).where('id', 2)]))
.select(1)
The knex.raw trick there is a workaround, since the .with(string, function) implementation has a bug.
Let me first inform all of you that I am very new to Spark.
I need to process a huge number of records in a table, and when it is grouped by email it is around 1 million. I need to perform multiple logical calculations on the data set for each individual email and update the database based on those calculations.
Roughly, my code structure is like this:
//Initial data load ...
import sparkSession.implicits._
var tableData = sparkSession.read.jdbc(<JDBC_URL>, <TABLE NAME>, connectionProperties).select("email").where(<CUSTOM CONDITION>)
//Data frame of records where the count grouped by email is greater than one
var recordsGroupedBy = tableData.groupBy("email").count().withColumnRenamed("count", "recordcount").filter("recordcount > 1").toDF()
//Now comes the processing after grouping by email, using the processDataAgainstEmail() method
recordsGroupedBy.collect().foreach(x => processDataAgainstEmail(x.getAs("email"), sparkSession))
Here I see that foreach is not executed in parallel. I need to invoke the method processDataAgainstEmail(,) in parallel.
But if I try to parallelize instead, I can get a list by invoking
val emailList = dataFrameWithGroupedByMultipleRecords.select("email").rdd.map(r => r(0).asInstanceOf[String]).collect().toList
var rdd = sc.parallelize(emailList)
rdd.foreach(x => processDataAgainstEmail(x, sparkSession))
This is not supported, as I cannot pass sparkSession when using parallelize.
Can anybody help me with this? Inside processDataAgainstEmail(,), multiple operations related to database inserts and updates would be performed, and Spark dataframe and Spark SQL operations also need to be performed.
To summarize, I need to invoke processDataAgainstEmail(,) in parallel with a sparkSession.
If it is not at all possible to pass Spark sessions, the method won't be able to perform anything on the database. I am not sure what the alternative would be, as parallelism on email is a must for my scenario.
foreach is a method on the list that operates on each element of the list sequentially, so you are acting on one element at a time and passing it to the processDataAgainstEmail method.
Once you have the resultant list, you invoke sc.parallelize on it to parallelize the creation of the dataframe from the list of records you created/manipulated in the previous step. The parallelization, as far as I can see in PySpark, is a property of how the dataframe is created, not of acting on the result of any operation.
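To make the distinction concrete, here is a minimal PySpark sketch; records_grouped_by, process_data_against_email, and process_email_with_own_db_connection are placeholders standing in for the question's code, and whatever runs inside the RDD foreach executes on the executors, where the SparkSession is not available:
# Driver-side loop: equivalent to collect().foreach in the question -- each email
# is processed one at a time on the driver.
for row in records_grouped_by.collect():
    process_data_against_email(row['email'], spark)

# Executor-side: parallelize distributes the emails, but the function now runs on
# the executors, so it cannot use the SparkSession and would have to talk to the
# database through its own connection (e.g. plain JDBC/psycopg2) instead of Spark SQL.
emails = [row['email'] for row in records_grouped_by.select('email').collect()]
spark.sparkContext.parallelize(emails).foreach(process_email_with_own_db_connection)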