Amazon DynamoDB [DDB] parallel scan - boto 3 - python - python-3.x

I am trying to do a parallel scan, but it looks like it is processing sequentially and not fetching all of the records.
My DDB table has 400k records and I am trying to pull all of them using DynamoDB parallel scans with 40 threads.
import boto3

dynamodb = boto3.resource('dynamodb',
                          region_name='us-east-1',
                          endpoint_url="https://dynamodb.us-east-1.amazonaws.com",
                          aws_access_key_id="123",
                          aws_secret_access_key="456")
table = dynamodb.Table('solved_history')

last_evaluated_key = None
a = 0
while True:
    if last_evaluated_key:
        response = table.scan(ExclusiveStartKey=last_evaluated_key)
        allResults = response['Items']
        for each in allResults:
            storeRespectively(each["decodedText"], each["base64StringOfImage"])
            a += 1
    else:
        response = table.scan(TotalSegments=40, Segment=39)
        allResults = response['Items']
        for each in allResults:
            storeRespectively(each["decodedText"], each["base64StringOfImage"])
            a += 1
    last_evaluated_key = response.get('LastEvaluatedKey')
    if not last_evaluated_key:
        break
print(a)
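For what it's worth, the loop above only ever requests Segment=39 on the first call and then drops TotalSegments/Segment on the paginated calls, so it never touches the other 39 segments and effectively falls back to a plain sequential scan. A true parallel scan gives every segment its own paginated scan loop and runs the segments concurrently. Here is a minimal sketch of that pattern, assuming storeRespectively (the helper from the question) is safe to call from multiple threads:

import boto3
from concurrent.futures import ThreadPoolExecutor

TOTAL_SEGMENTS = 40

def scan_segment(segment):
    # boto3 resources are not thread-safe, so each worker builds its own session.
    session = boto3.session.Session(region_name='us-east-1')
    table = session.resource('dynamodb').Table('solved_history')
    count = 0
    last_evaluated_key = None
    while True:
        # Keep TotalSegments/Segment on every page, not just the first request.
        kwargs = {'TotalSegments': TOTAL_SEGMENTS, 'Segment': segment}
        if last_evaluated_key:
            kwargs['ExclusiveStartKey'] = last_evaluated_key
        response = table.scan(**kwargs)
        for item in response['Items']:
            storeRespectively(item["decodedText"], item["base64StringOfImage"])
            count += 1
        last_evaluated_key = response.get('LastEvaluatedKey')
        if not last_evaluated_key:
            break
    return count

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as executor:
    total = sum(executor.map(scan_segment, range(TOTAL_SEGMENTS)))
print(total)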

Related

sqlAlchemy Row Count from Results

Having recently upgraded SQLAlchemy and Python to 3.8, this code no longer works to get a row count from search results via the SQLAlchemy ORM. It seems the use of _saved_cursor._result.rows has been deprecated. (Error: AttributeError: 'LegacyCursorResult' object has no attribute '_saved_cursor')
def get_clients(db, status):
    clients = Table("clients", db.metadata, autoload=True)
    qry = clients.select().where(clients.c.status == status)
    res = qry.execute()
    rowcount = len(res._saved_cursor._result.rows)
    return rowcount
We have this very ugly code that works, but it has to loop through all of the results just to get the count.
def get_clients(db, status):
    clients = Table("clients", db.metadata, autoload=True)
    qry = clients.select().where(clients.c.status == status)
    res = qry.execute()
    rowcount = 0
    for row in res:
        rowcount += 1
    return rowcount
Without using raw SQL, what is the most efficient way to get the row count using the SQLAlchemy ORM?
The solution is to use func from SQLAlchemy and read the result back as a scalar.
from sqlalchemy import Table, func, select

def get_clients(db, status):
    clients = Table("clients", db.metadata, autoload=True)
    qry = select([func.count()]).select_from(clients).where(clients.c.status == status)
    row_count = qry.execute().scalar()
    return row_count
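If the upgrade went all the way to SQLAlchemy 1.4/2.0, the implicit qry.execute() shown above is also on the way out. A hedged sketch of the same count in the 2.0-style API, assuming db exposes a bound Engine as db.engine (that attribute is not shown in the question), would look like:

from sqlalchemy import Table, func, select

def get_clients(db, status):
    # autoload_with replaces autoload=True for reflection in SQLAlchemy 1.4+.
    clients = Table("clients", db.metadata, autoload_with=db.engine)
    stmt = select(func.count()).select_from(clients).where(clients.c.status == status)
    with db.engine.connect() as conn:
        # scalar() returns the single COUNT(*) value.
        return conn.execute(stmt).scalar()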

Not able to get more than 1000 rows from azure storage table

I am a beginner in Python and am trying to retrieve all rows (more than 1000) from an Azure Storage table. Below is the sample code.
The code gives me 1000 records, but the table (testtable) has more than 50000 rows. I read in a blog that using a continuation token can pull all records. Let me know how I can implement this.
table = 'testtable'
now2 = '14042018'
count = 0
try:
    table_service = TableService(account_name=myaccount, account_key=mykey)
    logging.info('connected successfully')
except Exception as e:
    logging.error(e)

tasks = table_service.query_entities(table, filter='PartitionKey eq \'' + now2 + '\'')
for task in tasks:
    count = count + 1
    a = task.desc
    # print(a)
print(count)
Update:
Even if I just use this line of code:
entities = table_service.query_entities(table,filter='PartitionKey eq \'' + now2 + '\'')
all of the rows in my table can be fetched (more than 10,000 rows).
Use the code below:
from azure.cosmosdb.table.tableservice import TableService
from azure.cosmosdb.table.models import Entity

table_service = TableService(account_name='your account', account_key='your key')

table = 'tasktable'
now2 = '03042018'
count = 0
next_pk = None
next_rk = None
while True:
    entities = table_service.query_entities(table, filter='PartitionKey eq \'' + now2 + '\'')
    for entity in entities:
        count = count + 1
    if hasattr(entities, 'x_ms_continuation'):
        x_ms_continuation = getattr(entities, 'x_ms_continuation')
        next_pk = x_ms_continuation['nextpartitionkey']
        next_rk = x_ms_continuation['nextrowkey']
    else:
        break
print(count)
(Test result screenshot omitted.)
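Note that, as shown, the while loop reads the continuation token but never passes next_pk/next_rk back into query_entities, so it does not actually advance to the next page. The update above suggests the generator returned by query_entities already follows continuation tokens on its own while it is being iterated, so a minimal sketch that simply counts while iterating (account name, key, table and partition are placeholders from the question and answer, and the auto-paging behaviour is an assumption based on that update) would be:

from azure.cosmosdb.table.tableservice import TableService

table_service = TableService(account_name='your account', account_key='your key')

count = 0
# Assumption: the generator transparently requests the next page whenever the
# service returns a continuation token, so no manual next_pk/next_rk handling is needed.
for entity in table_service.query_entities('testtable', filter="PartitionKey eq '14042018'"):
    count += 1
print(count)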

Spark doesn't read the file properly

I run Flume to ingest Twitter data into HDFS (in JSON format) and run Spark to read that file.
But somehow, it doesn't return the correct result: it seems the content of the file is not updated.
Here's my Flume configuration:
TwitterAgent01.sources = Twitter
TwitterAgent01.channels = MemoryChannel01
TwitterAgent01.sinks = HDFS
TwitterAgent01.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent01.sources.Twitter.channels = MemoryChannel01
TwitterAgent01.sources.Twitter.consumerKey = xxx
TwitterAgent01.sources.Twitter.consumerSecret = xxx
TwitterAgent01.sources.Twitter.accessToken = xxx
TwitterAgent01.sources.Twitter.accessTokenSecret = xxx
TwitterAgent01.sources.Twitter.keywords = some_keywords
TwitterAgent01.sinks.HDFS.channel = MemoryChannel01
TwitterAgent01.sinks.HDFS.type = hdfs
TwitterAgent01.sinks.HDFS.hdfs.path = hdfs://hadoop01:8020/warehouse/raw/twitter/provider/m=%Y%m/
TwitterAgent01.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent01.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent01.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent01.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent01.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent01.sinks.HDFS.hdfs.rollInterval = 86400
TwitterAgent01.channels.MemoryChannel01.type = memory
TwitterAgent01.channels.MemoryChannel01.capacity = 10000
TwitterAgent01.channels.MemoryChannel01.transactionCapacity = 10000
After that I check the output with hdfs dfs -cat and it returns more than 1000 rows, meaning that the data was successfully inserted.
But in Spark that's not the case
spark.read.json("/warehouse/raw/twitter/provider").filter("m=201802").show()
only has 6 rows.
Did I miss something here?
I'm not entirely sure why you specified the latter part of the path as the condition expression of the filter.
I believe that to correctly read your files you can just write:
spark.read.json("/warehouse/raw/twitter/provider/m=201802").show()
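If reading the parent directory is still wanted, a hedged sketch relying on Spark's partition discovery (the m=... subdirectory should surface as a column named m when the parent path is read) can be used to cross-check the two row counts:

# Read the whole dataset; partition discovery should expose `m` as a column.
df = spark.read.json("/warehouse/raw/twitter/provider")
df.printSchema()  # confirm that `m` appears in the schema
# If `m` is discovered as a string rather than a number, compare against '201802' instead.
print(df.filter(df["m"] == 201802).count())

# Compare with reading the partition directory directly, as suggested above.
print(spark.read.json("/warehouse/raw/twitter/provider/m=201802").count())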

Saving data from SDSS query in Python

I got the data from SDSS by the following query
query = """SELECT TOP 20000
p.psfMag_u, p.psfMag_g, p.psfMag_r, p.psfMag_i, p.psfMag_z, s.class
FROM PhotoObjAll AS p JOIN specObjAll s ON s.bestobjid = p.objid
WHERE p.mode = 1 AND s.sciencePrimary = 1 AND p.clean = 1
ORDER BY s.specobjid ASC
"""
data = SDSS.query_sql(query).to_pandas()
psfMag_g psfMag_u psfMag_r psfMag_i psfMag_z class
19.08541 17.41787 16.59118 16.03507 15.68560 GALAXY
There are 20000 rows of data. I want to save them in CSV format. How do I save the data?
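Since SDSS.query_sql(...).to_pandas() already gives a pandas DataFrame, pandas' own CSV writer is enough; a minimal sketch (the filename is just an example):

# `data` is the DataFrame built above; index=False drops the pandas row index column.
data.to_csv("sdss_results.csv", index=False)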

how do i fetch sqlite column based on its len and limiting 10 using python

Trying to list sqlite3 values from highest to lowest by length, limited to 10, using Python.
Here's my current code:
connection = sqlite3.connect('mydb.db')
database = connection.cursor()
all_user = str(database.execute("SELECT logtext from logs order by logtext limit 10"))
I can't figure out the logic of using len(logtext), or how to actually list the values from highest to lowest while limiting them to 10.
Try
connection = sqlite3.connect('mydb.db')
cursor = connection.cursor()
cursor.execute("SELECT logtext from logs order by length(logtext) desc limit 10")
results = cursor.fetchall()
See also https://stackoverflow.com/a/3606923/960709
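If the limit needs to vary, sqlite3's parameter placeholders also work in the LIMIT clause; a small sketch along the same lines (table and column names as in the question):

import sqlite3

connection = sqlite3.connect('mydb.db')
cursor = connection.cursor()
# Order by string length, longest first, and parameterise the limit instead of hard-coding it.
cursor.execute("SELECT logtext FROM logs ORDER BY length(logtext) DESC LIMIT ?", (10,))
for (logtext,) in cursor.fetchall():
    print(len(logtext), logtext)
connection.close()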
