Insert items in DynamoDB using lambdas without losing data - python-3.x

I am new to AWS and I am trying to load data into a DynamoDB table using Lambda functions and Python. The problem I have is the following: when I try to load a record into a table, the items that have the same partition key as the element I'm trying to insert are removed from the table. This is the code that I'm using (I got it from the AWS documentation):
import boto3
from pprint import pprint


def put_car(car_id, car_type, message, dynamodb=None):
    if not dynamodb:
        dynamodb = boto3.resource('dynamodb', region_name='eu-west-1')
    table = dynamodb.Table('Cars')
    response = table.put_item(
        Item={
            'car_type': car_type,
            'car_id': car_id,
            'message': message,
        }
    )
    return response


def lambda_handler(event, context):
    car_resp = put_car("1", "Cartype1", "Car 1")
    print("Put car succeeded:")
    pprint(car_resp)
A possible solution would be to read all the records first and load them all again, including the record I wanted to insert, but this solution seems quite inefficient and I think there may be some easier way to do it.
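For context, put_item always replaces any existing item that has the same primary key, so if the table is keyed only on its partition key, every write with a repeated key value overwrites the previous item. One way to avoid the read-everything-and-rewrite workaround is a conditional write; the sketch below is only an illustration and assumes car_type is the partition key of the Cars table:

import boto3
from botocore.exceptions import ClientError


def put_car_if_new(car_id, car_type, message, dynamodb=None):
    # Assumption: 'car_type' is the partition key of the Cars table.
    if not dynamodb:
        dynamodb = boto3.resource('dynamodb', region_name='eu-west-1')
    table = dynamodb.Table('Cars')
    try:
        return table.put_item(
            Item={'car_type': car_type, 'car_id': car_id, 'message': message},
            # Reject the write instead of silently replacing an existing item.
            ConditionExpression='attribute_not_exists(car_type)',
        )
    except ClientError as err:
        if err.response['Error']['Code'] == 'ConditionalCheckFailedException':
            print('An item with this key already exists; nothing was overwritten.')
            return None
        raise

If the intent is instead to store several cars under the same car_type, the table needs a sort key (for example car_id) so that each item has a distinct primary key.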

Related

how to avoid duplication in BigQuery by streaming insert

I made a function that inserts .CSV data into BigQuery every 5~6 seconds. I've been looking for a way to avoid duplicating the data in BigQuery after inserting. I want to remove rows that have the same luid, but I have no idea how to remove them, so: is it possible to check whether each row of the .CSV already exists in the BigQuery table before inserting?
I passed the row_ids parameter to avoid duplicate luid values, but it doesn't seem to work well.
Could you give me any ideas? Thanks.
import csv
import time

import schedule
from google.cloud import bigquery


def stream_upload():
    # BigQuery
    client = bigquery.Client()
    project_id = 'test'
    dataset_name = 'test'
    table_name = "test"
    full_table_name = dataset_name + '.' + table_name

    json_rows = []
    with open('./test.csv', 'r') as f:
        for line in csv.DictReader(f):
            del line[None]
            line_json = dict(line)
            json_rows.append(line_json)

    errors = client.insert_rows_json(
        full_table_name, json_rows, row_ids=[row['luid'] for row in json_rows]
    )
    if errors == []:
        print("New rows have been added.")
    else:
        print("Encountered errors while inserting rows: {}".format(errors))
    print("end")


schedule.every(0.5).seconds.do(stream_upload)
while True:
    schedule.run_pending()
    time.sleep(0.1)
BigQuery doesn't have a native way to deal with this. You could either create a view over this table that performs the deduping, or keep an external cache of luids, look up whether a luid has already been written to BigQuery before writing, and update the cache after writing new data. This could be as simple as a file cache, or you could use an additional database.
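As a rough illustration of the cache idea, here is a minimal sketch that remembers already-inserted luid values in a local text file; the file name and helper functions are made up for the example:

import csv
import os

CACHE_FILE = './seen_luids.txt'   # hypothetical local cache of inserted luids


def load_seen_luids():
    if not os.path.exists(CACHE_FILE):
        return set()
    with open(CACHE_FILE) as f:
        return {line.strip() for line in f if line.strip()}


def filter_new_rows(csv_path):
    # Keep only rows whose luid has not been inserted before.
    seen = load_seen_luids()
    with open(csv_path, 'r') as f:
        return [dict(row) for row in csv.DictReader(f) if row['luid'] not in seen]


def remember_luids(rows):
    # Append the luids that were just inserted successfully.
    with open(CACHE_FILE, 'a') as f:
        for row in rows:
            f.write(row['luid'] + '\n')

In stream_upload you would then build json_rows with filter_new_rows('./test.csv') and call remember_luids(json_rows) only when insert_rows_json returns no errors.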

Need better approach to load oracle blob data into Mongodb collection using Gridfs

Recently, I started working on a new project where I need to transfer Oracle table data into MongoDB collections.
The Oracle table has one BLOB datatype column.
I wanted to transfer the Oracle BLOB data into MongoDB using GridFS, and I even succeeded, but I am unable to scale it up.
If I use the same script for 10k or 50k records, it takes a very long time.
Please suggest where I can improve, or whether there is a better way to achieve my goal.
Thank you in advance.
Below is the sample code I am using to load a small amount of data:
from pymongo import MongoClient
import cx_Oracle
from gridfs import GridFS
import pickle
import sys

client = MongoClient('localhost:27017/sample')
dbm = client.sample
db = <--oracle connection----->
cursor = db.cursor()


def get_notes_file_sys():
    return GridFS(dbm, 'notes')


def save_data_in_file(fs, note, file_name):
    gridin = None
    file_ids = {}
    data_blob = pickle.dumps(note['file_content_blob'])
    del note['file_content_blob']
    gridin = fs.open_upload_stream(file_name, chunk_size_bytes=261120, metadata=note)
    gridin.write(data_blob)
    gridin.close()
    file_ids['note_id'] = gridin._id
    return file_ids


# ---------------------------Uploading files start---------------------------------------
fs = get_notes_file_sys()
query = ("""SELECT id, file_name, file_content_blob, author, created_at FROM notes fetch next 10 rows only""")
cursor.execute(query)
rows = cursor.fetchall()
col = [co[0] for co in cursor.description]
final_arr = []
for row in rows:
    data = dict(zip(col, row))
    file_name = data['file_name']
    if data["file_content_blob"] is None:
        data["file_content_blob"] = None
    else:
        # This below line is taking more time
        data["file_content_blob"] = data["file_content_blob"].read()
    note_id = save_data_in_file(fs, data, file_name)
    data['note_id'] = note_id
    final_arr.append(data)
dbm['notes'].bulk_insert(final_arr)
Two things come to mind:
Don't move to Mongo. Just use Oracle's SODA document storage model: https://cx-oracle.readthedocs.io/en/latest/user_guide/soda.html Also take a look at Oracle's JSON DB service: https://blogs.oracle.com/jsondb/autonomous-json-database
Fetch BLOBs as bytes, which is much faster than the method you are using: https://cx-oracle.readthedocs.io/en/latest/user_guide/lob_data.html#fetching-lobs-as-strings-and-bytes There is an example at https://github.com/oracle/python-cx_Oracle/blob/master/samples/ReturnLobsAsStrings.py
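For the second suggestion, here is a minimal sketch of fetching BLOBs directly as bytes with a cx_Oracle output type handler, modelled on the linked sample; db stands for the existing Oracle connection from the script above:

import cx_Oracle


def output_type_handler(cursor, name, default_type, size, precision, scale):
    # Return BLOB/CLOB columns as bytes/str instead of LOB locators, avoiding
    # a separate round trip per row to call .read() on each LOB.
    if default_type == cx_Oracle.BLOB:
        return cursor.var(cx_Oracle.LONG_BINARY, arraysize=cursor.arraysize)
    if default_type == cx_Oracle.CLOB:
        return cursor.var(cx_Oracle.LONG_STRING, arraysize=cursor.arraysize)


db.outputtypehandler = output_type_handler  # db: the Oracle connection above
cursor = db.cursor()
cursor.execute("SELECT id, file_name, file_content_blob, author, created_at FROM notes")
for row in cursor:
    pass  # file_content_blob now arrives as bytes; no .read() call required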

Python dict datatype error while after reading message from AWS SQS and Put it into AWS DynamoDB

My use case is to take a JSON message from the SQS body and insert the data into DynamoDB, using a lambda function in Python.
The issue is that I am able to read and print the JSON message from the SQS queue into the CloudWatch log, but when I try to insert the same JSON into DynamoDB it gives the error below:
Invalid type for parameter Item, value: {'name': 2}, type: <class 'str'>, valid types: <class 'dict'>
Below is the lambda code I am using; the error occurs at line number 12, where I am trying to insert using put_item.
import json
import boto3

dynamodb = boto3.resource('dynamodb')
dynamoTable = dynamodb.Table('message')


def lambda_handler(event, context):
    for record in event['Records']:
        data1 = record["body"]
        jsondata1 = json.loads(data1)
        print(jsondata1)
        dynamoTable.put_item(Item=jsondata1)
However, the lambda is able to print the SQS JSON to the CloudWatch log.
After a lot of R&D, I found that the solution is to split the string by commas and rebuild the JSON, which produces a dict rather than a string.
Below is the code for that solution:
import json
import boto3
import ast

dynamodb = boto3.resource('dynamodb')
dynamoTable = dynamodb.Table('message')


def lambda_handler(event, context):
    for record in event['Records']:
        data1 = record["body"]
        jsondata1 = json.loads(data1)
        mess1 = jsondata1["Message"]
        id = jsondata1["MessageId"]
        jsonmess = json.loads(mess1)
        s = jsonmess.replace("{", "")
        finalstring = s.replace("}", "")
        split = finalstring.split(",")
        dict = {'messageID': id}
        for x in split:
            keyvalue = x.split(":")
            print(keyvalue)
            dict[keyvalue[0]] = keyvalue[1]
        dynamoTable.put_item(Item=dict)
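For what it's worth, the import ast above suggests a simpler alternative: if the Message value is a Python-style dict string such as {'name': 2}, ast.literal_eval can parse it into a dict directly, without splitting on commas. A minimal sketch under that assumption:

import ast
import json
import boto3

dynamodb = boto3.resource('dynamodb')
dynamoTable = dynamodb.Table('message')


def lambda_handler(event, context):
    for record in event['Records']:
        jsondata1 = json.loads(record["body"])
        jsonmess = json.loads(jsondata1["Message"])  # as above, this is still a string
        # Assumption: jsonmess looks like "{'name': 2}" (a Python dict literal),
        # which json.loads rejects but ast.literal_eval can parse into a dict.
        item = ast.literal_eval(jsonmess)
        item['messageID'] = jsondata1["MessageId"]
        dynamoTable.put_item(Item=item)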

DynamoDB scan not returning desired output

I have a simple python script that is scanning a DynamoDB table. The table holds ARNs for all the accounts I own. There is one primary key "ARNs" of data type string. When I scan the table, I would like to only get the ARN string returned. I am having trouble finding anything in the boto3 documentation that can accomplish this. Below is my code, the returned output, and the desired output.
CODE:
import boto3

dynamo = boto3.client('dynamodb')


# Scans Dynamo for all account role ARNs
def get_arns():
    response = dynamo.scan(TableName='AllAccountARNs')
    print(response)


get_arns()
OUTPUT:
{'ARNs': {'S': 'arn:aws:iam::xxxxxxx:role/custom_role'}},
{'ARNs': {'S': 'arn:aws:iam::yyyyyyy:role/custom_role'}},
{'ARNs': {'S': 'arn:aws:iam::zzzzzzz:role/custom_role'}}
DESIRED OUTPUT:
arn:aws:iam::xxxxxxx:role/custom_role
arn:aws:iam::yyyyyyy:role/custom_role
arn:aws:iam::zzzzzzz:role/custom_role
Here's an example of how to do this with a boto3 DynamoDB Client:
import boto3

ddb = boto3.client('dynamodb')

rsp = ddb.scan(TableName='AllAccountARNs')
for item in rsp['Items']:
    print(item['ARNs']['S'])
Here's the same thing, but using a boto3 DynamoDB Table Resource:
import boto3

dynamodb = boto3.resource('dynamodb')
tbl = dynamodb.Table('AllAccountARNs')

rsp = tbl.scan()
for item in rsp['Items']:
    print(item['ARNs'])
Note that these examples do not handle large result sets. If LastEvaluatedKey is present in the response, you will need to paginate the result set. See the boto3 documentation.
For more information on Client vs. Resource, see here.
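As a sketch of the pagination note above, the client API also exposes a paginator that follows LastEvaluatedKey for you (table and attribute names taken from the question):

import boto3

ddb = boto3.client('dynamodb')

# The paginator re-issues the scan with ExclusiveStartKey until
# LastEvaluatedKey is no longer present in the response.
paginator = ddb.get_paginator('scan')
for page in paginator.paginate(TableName='AllAccountARNs'):
    for item in page['Items']:
        print(item['ARNs']['S'])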

Querying with cqlengine

I am trying to hook the cqlengine CQL 3 object mapper up to my web application running on CherryPy. Although the documentation is very clear about querying, I am still not sure how to make queries on an existing table (and an existing keyspace) in my Cassandra database. For instance, I already have a table Movies containing the fields Title, rating, Year. I want to make the CQL query
SELECT * FROM Movies
How do I go ahead with the query after establishing the connection with
from cqlengine import connection
connection.setup(['127.0.0.1:9160'])
The KEYSPACE is called "TEST1".
Abhiroop Sarkar,
I highly suggest that you read through all of the documentation at:
Current Object Mapper Documentation
Legacy CQLEngine Documentation
Installation: pip install cassandra-driver
And take a look at this example project by the creator of CQLEngine, rustyrazorblade:
Example Project - Meat bot
Keep in mind, CQLEngine has been merged into the DataStax Cassandra-driver:
Official Python Cassandra Driver Documentation
You'll want to do something like this:
CQLEngine <= 0.21.0:
from cqlengine.connection import setup
setup(['127.0.0.1'], 'keyspace_name', retry_connect=True)
If you need to create the keyspace still:
from cqlengine.management import create_keyspace
create_keyspace(
    'keyspace_name',
    replication_factor=1,
    strategy_class='SimpleStrategy'
)
Setup your Cassandra Data Model
You can do this in the same .py or in your models.py:
import datetime
import uuid

from cqlengine import columns, Model


class YourModel(Model):
    __key_space__ = 'keyspace_name'       # Not Required
    __table_name__ = 'columnfamily_name'  # Not Required

    some_int = columns.Integer(
        primary_key=True,
        partition_key=True
    )
    time = columns.TimeUUID(
        primary_key=True,
        clustering_order='DESC',
        default=uuid.uuid1,
    )
    some_uuid = columns.UUID(primary_key=True, default=uuid.uuid4)
    created = columns.DateTime(default=datetime.datetime.utcnow)
    some_text = columns.Text(required=True)

    def __str__(self):
        return self.some_text

    def to_dict(self):
        data = {
            'text': self.some_text,
            'created': self.created,
            'some_int': self.some_int,
        }
        return data
Sync your Cassandra ColumnFamilies
from cqlengine.management import sync_table
from .models import YourModel
sync_table(YourModel)
Considering everything above, you can put the connection and syncing together, as many examples have outlined. Say this is connection.py in our project:
from cqlengine.connection import setup
from cqlengine.management import sync_table

from .models import YourTable


def cass_connect():
    setup(['127.0.0.1'], 'keyspace_name', retry_connect=True)
    sync_table(YourTable)
Actually Using the Model and Data
from __future__ import print_function

from .connection import cass_connect
from .models import YourTable


def add_data():
    cass_connect()
    YourTable.create(
        some_int=5,
        some_text='Test0'
    )
    YourTable.create(
        some_int=6,
        some_text='Test1'
    )
    YourTable.create(
        some_int=5,
        some_text='Test2'
    )


def query_data():
    cass_connect()
    query = YourTable.objects.filter(some_int=5)
    # This will output each YourTable entry where some_int = 5
    for item in query:
        print(item)
Feel free to ask for further clarification, if necessary.
The most straightforward way to achieve this is to make model classes which mirror the schema of your existing CQL tables, then run queries on them.
cqlengine is primarily an Object Mapper for Cassandra. It does not interrogate an existing database in order to create objects for existing tables. Rather it is usually intended to be used in the opposite direction (i.e. create tables from python classes). If you want to query an existing table using cqlengine you will need to create python models that exactly correspond to your existing tables.
For example, if your current Movies table had 3 columns, id, title, and release_date, you would need to create a cqlengine model with those three columns. Additionally, you would need to ensure that the __table_name__ attribute on the class is exactly the same as the table name in the database.
from cqlengine import columns, Model


class Movie(Model):
    __table_name__ = "movies"

    id = columns.UUID(primary_key=True)
    title = columns.Text()
    release_date = columns.Date()
The key thing is to make sure that model exactly mirrors the existing table. If there are small differences you may be able to use sync_table(MyModel) to update the table to match your model.
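With a model like that in place, the SELECT * FROM Movies from the question maps onto iterating over the model's queryset. A minimal sketch, assuming the legacy cqlengine API and the TEST1 keyspace mentioned in the question:

from cqlengine import connection
from cqlengine.management import sync_table

# Movie is the model class defined above; 'test1' is the keyspace from the question.
connection.setup(['127.0.0.1'], 'test1')
sync_table(Movie)  # optional; only needed to reconcile the model with the table

# Equivalent of SELECT * FROM movies
for movie in Movie.objects.all():
    print(movie.title, movie.release_date)

# Filtering on the primary key, e.g. SELECT * FROM movies WHERE id = <uuid>
# movie = Movie.objects.get(id=some_uuid)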
