DynamoDB range_key gets rounded - rounding

I am using boto (on Windows) to create a new item in a DynamoDB table. I set the range key to 129271300103.3986, but in the database it is stored as 129271300103.
Is this a boto issue or a DynamoDB issue?
EDIT: it is not just floats; large integers get truncated as well. I submit 129271300103398600 but get back 129271300103000000.

I'm not sure about boto, but make sure the marshaling of your object into the DB string representation doesn't round it off.
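For what it's worth, DynamoDB's number type keeps up to 38 digits of precision, so truncation like this usually happens client-side when the value passes through a Python float. A minimal sketch of avoiding that with boto3 and decimal.Decimal (boto3 and the table/key names here are my assumption, not the poster's original boto setup):
# Sketch (boto3, not the original boto): build the number from a string via Decimal
# so it never goes through a Python float. Table and key names are hypothetical.
import boto3
from decimal import Decimal

table = boto3.resource("dynamodb").Table("MyTable")
table.put_item(Item={
    "my_hash_key": "example",
    "my_range_key": Decimal("129271300103.3986"),  # all digits preserved
})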

Related

Pymongo returning wrong int value for get query

I am using pymongo 3.11.3 in my notebook project. pymongo returns a constant value of 100 for an int32 field, although the same query returns the correct values in the MongoDB IDE. Here is the code:
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client['mongodb_vs_mysql']
mongo_result = db.collection['covid19'].find().sort("Cases_person", -1).limit(30)
for i in list(mongo_result):
    print(i)
The database has different values, but when querying with pymongo it shows 100 for that column.
Need help.
I'd be fairly certain that you're looking at different databases; for a start, you have different (albeit similar) ids and different field names (Daily_Cumulative vs Daily / cumulative).
I was making a mistake in this code: db.collection['covid19'] should be db['covid19'].
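For reference, a minimal corrected sketch (assuming the database and collection names from the question):
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client['mongodb_vs_mysql']

# Access the collection directly on the database object, not via db.collection[...]
mongo_result = db['covid19'].find().sort("Cases_person", -1).limit(30)
for doc in mongo_result:
    print(doc)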

How can I order a dictionary by its values?

I am working on a data analysis project and I am trying to order my results in descending order. The first time I had a similar issue, I used sorted(dictt.items(), key=lambda x: x[1]) and it worked fine. Now I am getting the error: "AttributeError: 'set' object has no attribute 'items'".
What am I doing wrong?
Here is the actual output, but I want it sorted.
There is a special collection for what you are trying to do called OrderedDict:
https://docs.python.org/3/library/collections.html#collections.OrderedDict
The error message suggests that dictt has ended up as a set rather than a dict; .items() only exists on dictionaries. Once you are sorting an actual dict, note that sorted(dictt.items(), ...) returns a list of (key, value) tuples rather than a dict, so you have to turn it back into a dict. In Python 3.6 and earlier the standard dict doesn't preserve entry order, so if you need the order kept there, plug the sorted items into an OrderedDict like so:
myOrderedDict = OrderedDict(sorted(dictt.items(), key=lambda x: x[1], reverse=True))
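On Python 3.7+ a plain dict preserves insertion order as well, so an OrderedDict isn't strictly required; a minimal sketch sorting by value in descending order (the example data is mine, not from the question):
# Sort a dict by its values in descending order; dict keeps insertion order on Python 3.7+.
dictt = {"a": 3, "b": 10, "c": 7}  # hypothetical example data
sorted_dict = dict(sorted(dictt.items(), key=lambda item: item[1], reverse=True))
print(sorted_dict)  # {'b': 10, 'c': 7, 'a': 3}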

boto3 - Getting files only uploaded in the past month in S3

I am writing a python3 Lambda function which needs to return all of the files that were uploaded to an S3 bucket in the past 30 days from the time that the function is run.
How should I approach this? Ideally, I want to only iterate through the files from the past 30 days and nothing else - there are thousands upon thousands of files in the S3 bucket that I am iterating through, and maybe 100 max will be updated/uploaded per month. It would be very inefficient to iterate through every file and compare dates like that. There is also a 29-second time limit for AWS API Gateway.
Any help would be greatly appreciated. Thanks!
You will need to iterate through the list of objects (sample code: List s3 buckets with its size in csv format) and compare the date within the Python code (sample code: Get day old filepaths from s3 bucket).
There is no filter when listing objects (aside from Prefix).
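A rough sketch of that iterate-and-compare approach (the bucket name my-bucket is a placeholder, and "uploaded in the past 30 days" is approximated by LastModified):
# Sketch: list every object and keep the keys modified in the last 30 days.
import boto3
from datetime import datetime, timedelta, timezone

s3 = boto3.client("s3")
cutoff = datetime.now(timezone.utc) - timedelta(days=30)

recent_keys = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="my-bucket"):
    for obj in page.get("Contents", []):
        # LastModified is a timezone-aware datetime, so it compares directly against cutoff
        if obj["LastModified"] >= cutoff:
            recent_keys.append(obj["Key"])
print(recent_keys)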
An alternative is to use Amazon S3 Inventory, which can provide a daily CSV file listing the contents of a bucket. You could parse that CSV instead of listing the objects.
A more extreme option is to keep a separate database of objects, which would need to be updated whenever objects are added/deleted. This could be done via Amazon S3 Events that trigger an AWS Lambda function. Lots of work, though.
I can't give you a 100% answer, since you asked for the upload date, but if you can live with the 'last modified' value, this code snippet should do the job:
import boto3
import datetime

paginator = boto3.resource('s3').meta.client.get_paginator('list_objects')
# aware UTC datetime, so its string form compares cleanly with LastModified (which is UTC)
date = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(30)
# generator of keys for objects whose LastModified falls within the last 30 days
filtered_files = (obj['Key'] for obj in paginator.paginate(Bucket="bucketname").search(f"Contents[?to_string(LastModified)>='\"{date}\"']"))
For filtering I used JMESPath.
From an architecture perspective
The bottleneck is whether you can iterate over all objects within the 30-second limit. If there are simply too many files, there are a few more options you can use:
Create an AWS Lambda function triggered by the S3:PutObject event, and store the S3 key and last_modified_at information in DynamoDB (an AWS key-value NoSQL database). Then you can easily query DynamoDB to filter the S3 keys and retrieve those S3 objects accordingly (see the sketch after this list).
Create an AWS Lambda function triggered by the S3:PutObject event, and move the file to a partitioned S3 key schema such as s3://bucket/datalake/year=${year}/month=${month}/day=${day}/your-file.csv. Then you can easily use the partition information to locate the subset of your objects, which fits within the 30-second hard limit.
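A rough sketch of the first option's Lambda handler (the DynamoDB table name s3-object-index and its attributes are my assumptions; the event shape is the standard S3 notification payload):
# Sketch: index newly created S3 objects in DynamoDB for fast time-based lookups.
import boto3

table = boto3.resource("dynamodb").Table("s3-object-index")  # hypothetical table

def handler(event, context):
    # Standard S3 event notification: one record per created object.
    for record in event["Records"]:
        table.put_item(Item={
            "s3_key": record["s3"]["object"]["key"],
            "bucket": record["s3"]["bucket"]["name"],
            "last_modified_at": record["eventTime"],  # ISO-8601 timestamp of the event
        })
Listing recent uploads then becomes a DynamoDB query instead of a full S3 listing.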
From a programming perspective
Here's a code snippet that solves your problem using the s3pathlib library:
from datetime import datetime, timedelta
from s3pathlib import S3Path

# define a folder
p_dir = S3Path("bucket/my-folder/")

# find the datetime one month ago
now = datetime.utcnow()
one_month_ago = now - timedelta(days=30)

# filter by last modified
for p in p_dir.iter_objects().filter(
    # any Filterable Attribute can be used for filtering
    S3Path.last_modified_at >= one_month_ago
):
    # do whatever you like
    print(p.console_url)  # click the link to open it in the AWS console and inspect
If you want to use other S3Path attributes for filtering, use other comparators, or even define your own custom filter, you can follow the s3pathlib documentation.

AWS Lambda Nodejs: Get all objects created in the last 24hours from a S3 bucket

I have a requirement whereby I need to convert all the JSON files in my bucket into one newline-delimited JSON file for a 3rd party to consume. However, I need to make sure that each newly created newline-delimited JSON file only includes objects that were received in the last 24 hours, to avoid picking up the same files over and over again. Can this be done inside the s3.getObject(getParams, function(err, data)) callback? Any advice regarding a different approach is appreciated.
Thank you
You could try the S3 ListObjects operation and filter the result by the LastModified metadata field. For new objects, LastModified reflects when the file was created; for changed files, when it was last modified.
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#listObjectsV2-property
There is a more complicated approach, using Amazon Athena with AWS Glue, but this requires modifying your S3 object keys so they are split into partitions, where the partition key is the date.
For example:
s3://bucket/reports/date=2019-08-28/report1.json
s3://bucket/reports/date=2019-08-28/report2.json
s3://bucket/reports/date=2019-08-28/report3.json
s3://bucket/reports/date=2019-08-29/report1.json
This approach can be implemented in two ways, depending on your file schema. If all your JSON files have the same format/properties/schema, you can create a Glue table, add the root reports path as a source for this table, add the date partition value (2019-08-28), and query the data with Amazon Athena using a regular SELECT * FROM reports WHERE date='2019-08-28'. If not, create a Glue crawler with a JSON classifier, which will populate your tables, and then query that data with Athena into a combined JSON file.
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-legislators.html

Unable to read column types from amazon redshift using psycopg2

I'm trying to access the types of columns in a table in redshift using psycopg2.
I'm doing this by running a simple query on pg_table_def, as follows:
SELECT * FROM pg_table_def;
This returns the traceback:
psycopg2.NotSupportedError: Column "schemaname" has unsupported type "name"
So it seems like the types of the columns that store schema (and other similar information on further queries) are not supported by psycopg2.
Has anyone run into this issue or a similar one and is aware of a workaround? My primary goal in this is to be able to return the types of columns in the table. For the purposes of what I'm doing, I can't use another postgresql adapter.
Using:
python- 3.6.2
psycopg2- 2.7.4
pandas- 0.17.1
You could do something like the following, and return the result back to the calling service.
cur.execute("select * from pg_table_def where tablename='sales'")
results = cur.fetchall()
for row in results:
print ("ColumnNanme=>"+row[2] +",DataType=>"+row[3]+",encoding=>"+row[4])
Not sure about the exception; if all the permissions are fine, it should work and print something like the following:
ColumnName=>salesid,DataType=>integer,encoding=>lzo
ColumnName=>commission,DataType=>numeric(8,2),encoding=>lzo
ColumnName=>saledate,DataType=>date,encoding=>lzo
ColumnName=>description,DataType=>character varying(255),encoding=>lzo
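If SELECT * still raises NotSupportedError because of the name-typed system columns, one possible workaround (my assumption, not part of the original answer) is to select only the columns you need and cast them to varchar, so psycopg2 only sees supported types; a minimal sketch with hypothetical connection details:
# Sketch: cast the "name"-typed columns to varchar before fetching (connection values are placeholders).
import psycopg2

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="admin", password="...")
cur = conn.cursor()
cur.execute("""
    select "column"::varchar, type::varchar
    from pg_table_def
    where tablename = 'sales'
""")
for column_name, data_type in cur.fetchall():
    print(f"{column_name} => {data_type}")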
