I'm using the Drive files.list API (https://developers.google.com/drive/api/v3/reference/files/list) to list file permissions on a truly jumbo account, and it's taking forever (currently 28M files after running for nearly a week). Does anyone have any tips or tricks for doing this faster?
I know how to use multiprocessing/multithreading to process a bunch of accounts in parallel, but I'm stumped in terms of speeding up any one single account.
I'm using Python 3, if it makes a difference, though it seems like the slowness is largely due to the fact that it takes a couple of seconds for the Google APIs to return each batch of 100 files.
Editing to add what I'm currently doing that takes a lot of time (this is a snippet from a larger script, but captures the essential parts):
items_count=0
nextPageToken=None
while True:
    if items_count>0:
        log(logging.DEBUG,curr_user+": Getting next set of items. Items so far: "+str(items_count))
    results=gw_list_files_page(service,curr_user,nextPageToken)
    if results is None:
        log(logging.ERROR,curr_user+": error when listing files")
        return
    items_count+=len(results.get('files',[]))
    nextPageToken=results.get('nextPageToken')
    if not nextPageToken:
        break
def gw_list_files_page(service,curr_user,nextPageToken):
    # Per https://stackoverflow.com/questions/58425646/the-pagesize-parameter-does-not-work-in-google-drive-api-v3,
    # even though we're setting pageSize to be 1000, we can only get 100 at a time:
    # "For instance, including permissions in the files fields will limit the set to 100 each,
    #  while including parents will limit it to 360 items each."
    kwargs=dict(pageSize=1000,
                q="'"+curr_user+"' in owners",
                fields="nextPageToken,incompleteSearch,files(mimeType,parents,shared,permissions(type,emailAddress,role,deleted),id,name)")
    if nextPageToken:
        kwargs['pageToken']=str(nextPageToken)
    return service.files().list(**kwargs).execute()
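If the permissions sub-field really is what caps pages at 100 (as the quoted answer suggests), one thing worth testing is to drop permissions (and parents) from fields so files.list can return up to 1,000 ids per page, and then fetch permissions separately with batched permissions().list() calls (a Drive batch accepts up to 100 sub-requests). This is only a sketch, not the script's actual code: the helper names are made up, and each batched sub-request still counts against quota, so it trades fewer round trips for more total API calls.

def gw_list_file_ids_page(service, curr_user, nextPageToken):
    # Hypothetical variant: without permissions (or parents) in fields, pageSize=1000 can be honored.
    kwargs = dict(pageSize=1000,
                  q="'" + curr_user + "' in owners",
                  fields="nextPageToken,incompleteSearch,files(id,name,mimeType,shared)")
    if nextPageToken:
        kwargs['pageToken'] = str(nextPageToken)
    return service.files().list(**kwargs).execute()

def gw_fetch_permissions(service, file_ids):
    # Fetch permissions for up to 100 files per batched HTTP request.
    perms = {}
    def callback(request_id, response, exception):
        if exception is None:
            perms[request_id] = response.get('permissions', [])
    for start in range(0, len(file_ids), 100):
        batch = service.new_batch_http_request(callback=callback)
        for file_id in file_ids[start:start + 100]:
            batch.add(service.permissions().list(
                          fileId=file_id,
                          fields="permissions(type,emailAddress,role,deleted)"),
                      request_id=file_id)
        batch.execute()
    return perms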
I have a directory in S3 containing millions of small files. They are small (<10 MB) gzipped files, and I know this layout is inefficient for Spark. I am running a simple batch job to convert these files to Parquet format. I've tried two different ways:
spark.read.csv("s3://input_bucket_name/data/")
as well as
spark.read.csv("file1", "file2"..."file8million")
where each file given in the list is located in the same bucket and subfolder.
I notice that when I feed in the whole directory, there isn't as much delay at the beginning for the driver indexing the files (it looks like around 20 minutes before the batch starts). In the UI for the single-directory case, there is 1 task after this 20 minutes, which looks like the conversion itself.
However, with individual filenames, this indexing time increases to 2+ hours, and my conversion job doesn't show up in the UI until then. For the list of files, there are 2 tasks: (1) the first one is listing the leaf files for the 8 million paths, and then (2) a job that looks like the conversion itself.
I'm trying to understand why this is the case. Is there anything different about the underlying read API that would lead to this behaviour?
Spark assumes every path passed in is a directory, so when given a list of paths it has to do a list call on each one. For S3 that means 8M LIST calls against the S3 servers, which are rate limited to about 3K/second, ignoring details like thread count on the client, HTTP connections, etc. With LIST billed at $0.005 per 1,000 calls, 8M requests comes to about $40. Oh, and as the LIST returns nothing, the client falls back to a HEAD, which adds another S3 API call per path, roughly doubling execution time and adding a few more dollars to the query cost.
In contrast, listing a directory with 8M entries kicks off a single LIST request for the first 1,000 entries, followed by 7,999 follow-ups. Recent s3a releases do async prefetch of the next page of results (faster, especially if the incremental list iterators are used): one thread to fetch, one to process, and the whole listing will cost you about 4 cents.
The big directory listing is the more efficient and cost-effective strategy, even before you count EC2 server costs.
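For reference, a minimal PySpark sketch of the directory-based conversion (the output bucket and the partition count are placeholders, not from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gz-to-parquet").getOrCreate()

# One recursive listing of the prefix, rather than one LIST (plus HEAD fallback) per file.
df = spark.read.csv("s3://input_bucket_name/data/")

# Repartition so the output is a manageable number of Parquet files instead of millions.
# 800 is an arbitrary placeholder; size it to your cluster and target file size.
df.repartition(800).write.parquet("s3://output_bucket_name/data_parquet/")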
I have a question regarding the Python API of Interactive Brokers.
Can multiple asset and stock contracts be passed into the reqMktData() function to obtain the last prices? (I can set snapshot=True in reqMktData to get the last price. You can assume that I have subscribed to the appropriate data services.)
To put things in perspective, this is what I am trying to do:
1) Call reqMktData, get last prices for multiple assets.
2) Feed the data into my prediction engine, and do something
3) Go to step 1.
When I contacted Interactive Brokers, they said:
"Only one contract can be passed to reqMktData() at one time, so there is no bulk request feature in requesting real time data."
Obviously one way to get around this is to do a loop, but that is too slow. Another way is multithreading, but that is a lot of work, plus I can't afford the extra expense of a new computer. I am not interested in either one.
Any suggestions?
You can only specify 1 contract in each reqMktData call. There is no choice but to use a loop of some type. The speed shouldn't be an issue as you can make up to 50 requests per second, maybe even more for snapshots.
The speed issue could be that you want too much data (more than 50 requests/second) or that you're using an old version of the IB Python API; check connection.py for lock.acquire (I've deleted all of them). Also, if there has been no trade for more than 10 seconds, IB will wait for a trade before sending a snapshot, so test with active symbols.
However, what you should do is request live streaming data by setting snapshot to False and just keep track of the last price in the stream. You can stream up to 100 tickers with the default minimums. You keep them separate by using unique ticker IDs.
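A minimal sketch of that streaming setup with the official ibapi package (the symbols, port, and client id below are placeholders, and error handling is omitted):

import threading
from ibapi.client import EClient
from ibapi.wrapper import EWrapper
from ibapi.contract import Contract

class LastPriceApp(EWrapper, EClient):
    def __init__(self):
        EClient.__init__(self, self)
        self.last_price = {}          # reqId -> last traded price

    def tickPrice(self, reqId, tickType, price, attrib):
        if tickType == 4:             # tick type 4 = LAST
            self.last_price[reqId] = price

def stock(symbol):
    c = Contract()
    c.symbol = symbol
    c.secType = "STK"
    c.exchange = "SMART"
    c.currency = "USD"
    return c

app = LastPriceApp()
app.connect("127.0.0.1", 7497, clientId=1)   # paper-trading TWS port; adjust as needed
threading.Thread(target=app.run, daemon=True).start()

# One streaming subscription per contract, each with its own ticker id.
for req_id, symbol in enumerate(["AAPL", "MSFT", "IBM"], start=1):
    app.reqMktData(req_id, stock(symbol), "", False, False, [])

# app.last_price fills in as ticks arrive; read it from the prediction loop.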
My program runs fine with limited data, but when I put in all four databases, activewidth won't work.
Database 1 has 29990 entries.
Database 2 has around 27000 entries.
Database 3 has roughly 17000 entries.
Database 4 has 430 entries.
Each database is grouped by business kind and includes business type, name, address, city, state, phone number, longitude, latitude, sales tax info, and daily hours of operation.
In total, 12.1 MB of data.
With only database 1 in the program, it works fine: I can mouse over a point on the map, activewidth will increase the size of the dot, and the program will bring up the underlying data on the left-hand side of the screen, just like it is supposed to do.
Now that I have added all four maps and can toggle them on and off separately, activewidth won't work even with only #1 turned on, and the underlying data won't show on the left. The points on the map are there, and I can click through all four checkbuttons and turn the points on and off. I currently don't have the code in yet for the underlying data for databases 2-4, just the ability to turn them on and off. Only activewidth has stopped working now that I can view the points for all 4 databases.
I decided to try commenting out all the code for databases 2-4 to see what would happen, and it went back to working fine again. Then I added database 2 back into the mix and it went back to not working. Then I tried database 2 only, and activewidth worked fine as long as database 1 was commented out. With database 1 active, activewidth was very slow to respond or didn't work at all.
Is there a practical maximum number of entries I can use? Hopefully not, because I still have several more databases to finish and add to the program, which will take the total number of entries over 100K before all is said and done.
Nothing else makes sense, since I'm just changing self.alocation to self.blocation (and so on) when I add in databases 2-4. I'm just changing the identifier to show which database is being worked with and copying the rest of the code over between routines, since everything is the same... just different businesses separated into appropriately grouped databases. It seems the problem is in the amount of data being used, not in the way the program is written.
I figured that splitting up the files, not only for my benefit but also to keep each file smaller, would help alleviate the problem, but so far it hasn't. Is there any other way to work around data overload?
self.alocation = []
for x in range(0, len(self.atype)):
    pix1x = round((float(self.along[x])+(-self.longitudecenter+(self.p/2)))/(self.p/714),0)
    pix1y = round((((self.p/2) + self.latitudecenter-(float(self.alat[x])))/(self.p/714)),0)
    z = self.canvas.create_line(pix1x, pix1y, pix1x+4, pix1y+4, activewidth="10.0", fill = '', width = 5)
    self.alocation.append((z,x))
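As a point of comparison (not the asker's code), here is a minimal self-contained sketch of the same pattern: keep the (canvas item, record index) pairs and bind a per-item hover handler to drive the detail panel, rather than relying only on activewidth. All names and data below are placeholders.

import tkinter as tk

# Placeholder records: (business name, x pixel, y pixel).
records = [("Coffee Shop", 120, 140), ("Hardware Store", 300, 220), ("Diner", 480, 90)]

root = tk.Tk()
canvas = tk.Canvas(root, width=714, height=714, bg="white")
canvas.pack(side="right")
detail = tk.Label(root, text="", justify="left", anchor="nw", width=30)
detail.pack(side="left", fill="y")

def show(index):
    # Called on <Enter>: look up the record for the hovered canvas item.
    detail.config(text=records[index][0])

for index, (name, x, y) in enumerate(records):
    item = canvas.create_line(x, y, x + 4, y + 4,
                              activewidth="10.0", fill="black", width=5)
    # Bind the hover event to this specific item; the default argument
    # freezes the current index for the callback.
    canvas.tag_bind(item, "<Enter>", lambda event, i=index: show(i))

root.mainloop()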
Say I have about 150 requests per second coming in to an API (node.js), which are then logged in Redis. At that rate, the moderately priced RedisToGo instance will fill up every hour or so.
The logs are only necessary to generate daily/monthly/annual statistics: the top requested keyword, the top requested URL, the total number of requests per day, etc. No super heavy calculations, but a somewhat time-consuming run through arrays to see which is the most frequent element in each.
If I analyze and then dump this data (with a setInterval function in node maybe?), say, every 30 minutes, it doesn't seem like such a big deal. But what if all of sudden I have to deal with, say, 2500 requests per second?
All of a sudden I'm dealing with roughly 4.5 GB of data per hour, about 2.25 GB every 30 minutes. Even with how fast Redis/Node are, it'd still take a minute to calculate the most frequent requests.
Questions:
What will happen to the Redis instance while 2.25 GB worth of data is being processed (from a list, I imagine)?
Is there a better way to deal with potentially large amounts of log data than moving it to redis and then flushing it out periodically?
IMO, you should not use Redis as a buffer to store your log lines and process them in batch afterwards. It does not really make sense to consume memory for this. You will be better served by collecting your logs on a single server and writing them to a filesystem.
Now what you can do with Redis is trying to calculate your statistics in real-time. This is where Redis really shines. Instead of keeping the raw data in Redis (to be processed in batch later), you can directly store and aggregate the statistics you need to calculate.
For instance, for each log line, you could pipeline the following commands to Redis:
zincrby day:top:keyword 1 my_keyword
zincrby day:top:url 1 my_url
incr day:nb_req
This will calculate the top keywords, top urls and number of requests for the current day. At the end of the day:
# Save data and reset counters (atomically)
multi
rename day:top:keyword tmp:top:keyword
rename day:top:url tmp:top:url
rename day:nb_req tmp:nb_req
exec
# Keep only the 100 top keyword and url of the day
zremrangebyrank tmp:top:keyword 0 -101
zremrangebyrank tmp:top:url 0 -101
# Aggregate monthly statistics for keyword
multi
rename month:top:keyword tmp
zunionstore month:top:keyword 2 tmp tmp:top:keyword
del tmp tmp:top:keyword
exec
# Aggregate monthly statistics for url
multi
rename month:top:url tmp
zunionstore month:top:url 2 tmp tmp:top:url
del tmp tmp:top:url
exec
# Aggregate number of requests of the month
get tmp:nb_req
incrby month:nb_req <result of the previous command>
del tmp:nb_req
At the end of the month, the process is completely similar (using zunionstore or get/incrby on the monthly data to aggregate the yearly data).
The main benefit of this approach is that the number of operations done for each log line is limited, while the monthly and yearly aggregations can still be calculated easily.
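To make the per-request pipelining concrete, here is a minimal sketch with the redis-py client (the question's logger is node.js, so treat this purely as an illustration of the commands; keyword and url are whatever you extract from each request):

import redis

r = redis.Redis()

def log_request(keyword, url):
    # One round trip per request: three commands pipelined together.
    pipe = r.pipeline(transaction=False)
    pipe.zincrby("day:top:keyword", 1, keyword)
    pipe.zincrby("day:top:url", 1, url)
    pipe.incr("day:nb_req")
    pipe.execute()

log_request("my_keyword", "my_url")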
How about using Flume or Chukwa (or perhaps even Scribe) to move the log data to a different server (if available)? You could store the log data using Hadoop/HBase or any other disk-based store.
https://cwiki.apache.org/FLUME/
http://incubator.apache.org/chukwa/
https://github.com/facebook/scribe/
This is our situation:
We store user messages in Table Storage. The PartitionKey is the UserId and the RowKey is used as a message id.
When a user opens his message panel, we want to just .Take(x) messages; we don't care about the sort order. But what we have noticed is that the time it takes to get the messages varies greatly with the number of messages we take.
We did some small tests:
We did 50 * .Take(X) and compared the differences:
So we did .Take(1) 50 times and .Take(100) 50 times etc.
To make an extra check we did the same test 5 times.
Here are the results (chart omitted; X axis: number of Takes, Y axis: test number):
As you can see, there are some HUGE differences. The difference between 1 and 2 is very strange, and the same goes for 199-200.
Does anybody have any clue why this is happening? The Table Storage is on a live server, by the way, not development storage.
Many thanks.
Update
The problem only seems to occur when I'm using a wireless network. When I'm using the cable, the times are normal.
Possibly the data is collected in batches of a certain number x. When you request x+1 rows, it would have to take two batches and then drop a certain number.
Try running your test with increments of 1 as the Take() parameter, to confirm or dismiss this assumption.
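One quick way to run that sweep, sketched here in Python with the azure-data-tables SDK rather than the .NET client from the question (the connection string, table name, and user id are placeholders, and this only approximates .Take(n) by requesting a single page of at most n entities):

import time
from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection string>", table_name="Messages")

def take(user_id, n):
    # Roughly emulate .Take(n): request one page of at most n entities for the partition.
    pager = table.query_entities("PartitionKey eq '" + user_id + "'", results_per_page=n)
    return list(next(pager.by_page()))

# Time Take(1) through Take(200) to see where the jumps are.
for n in range(1, 201):
    start = time.perf_counter()
    take("<user id>", n)
    print(n, round(time.perf_counter() - start, 3))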