Tweepy cursor .pages() with api.search_users returning same page again and again - python-3.x

auth = tweepy.OAuthHandler(consumer_token, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)
user_objs = []
name = "phungsuk wangdu"
id_strs = {}
page_no = 0
try:
    for page in tweepy.Cursor(api.search_users, name).pages(3):
        dup_count = 0
        print("******* Page", str(page_no))
        print("Length of page", len(page))
        user_objs.extend(page)
        for user_obj in page:
            id_str = user_obj._json['id_str']
            if id_str in id_strs:
                # print("Duplicate for:", id_str, "from page number:", id_strs[id_str])
                dup_count += 1
            else:
                # print(id_str)
                id_strs[id_str] = page_no
        time.sleep(1)
        print("Duplicates in page", str(page_no), str(dup_count))
        page_no += 1
except Exception as ex:
    print(ex)
With the above code, I am trying to get user search results using the tweepy (Python 3.5.2, tweepy 3.5.0) Cursor. The results are duplicated across the pages that come back. Is this the right way to query search_users through the tweepy Cursor? I am getting results with the following pattern:
1. For few search results (name = "phungsuk wangdu"; a manual search on the Twitter website actually returns 9 results):
******* Page 0
Length of page 2
Duplicates in page 0 0
******* Page 1
Length of page 2
Duplicates in page 1 2
******* Page 2
Length of page 2
Duplicates in page 2 2
******* Page 3
Length of page 2
Duplicates in page 3 2
2. for high search results (name = "jon snow")
******* Page 0
Length of page 20
Duplicates in page 0 0
******* Page 1
Length of page 20
Duplicates in page 1 20
******* Page 2
Length of page 20
Duplicates in page 2 0
******* Page 3
Length of page 20
Duplicates in page 3 0

Try adding this filter to the query you pass to the Cursor; it should reduce the duplicates.
q = <your query> + " -filter:retweets"
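Note that -filter:retweets is a tweet-search operator, so it applies to searches made through api.search rather than users/search. A minimal sketch of how it would be passed (process is just a placeholder for whatever you do with each page):
for page in tweepy.Cursor(api.search, q=name + " -filter:retweets").pages(3):
    process(page)  # 'process' is a hypothetical helper, not part of the question's code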

There are two issues here.
Tweepy's PageIterator for the Cursor starts the page number from 0, while the Twitter users/search API's page numbering starts from 1.
Twitter returns results from the last available page for page numbers greater than the number of pages actually available.
I made a pull request to tweepy with both fixes.
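Until that fix is available, a minimal workaround sketch (not the Tweepy fix itself) is to request the pages directly, starting from page 1 as the users/search endpoint expects, and to de-duplicate by id_str. This assumes Tweepy 3.x, where API.search_users accepts q, per_page and page parameters:
seen_ids = set()
users = []
for page_no in range(1, 4):  # pages 1..3
    page = api.search_users(q=name, per_page=20, page=page_no)
    new_users = [u for u in page if u.id_str not in seen_ids]
    if not new_users:  # past the last real page, Twitter starts repeating results
        break
    seen_ids.update(u.id_str for u in new_users)
    users.extend(new_users)
    time.sleep(1)  # stay clear of the users/search rate limit
print("Unique users:", len(users))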

Related

How to download a lot of pages, twenty at a time

I have an app that downloads books, where each book can contain an arbitrary number of pages. So I wrote the following logic using threading:
while pagenum < (int(Book["total_pages"]) + 1):
    tlist = []
    for a in range(20):
        if pagenum < (int(Book["total_pages"]) + 1):
            x = threading.Thread(target=ThreadDownloads, args=(queue, pagenum, destination))
            tlist.append(x)
            pagenum += 1
    for x in tlist:
        x.start()
    for x in tlist:
        x.join()
The ThreadDownloads function downloads the specified page and works fine on its own, so the problem is this: the loop runs ThreadDownloads 20 times per batch, but only part of the pages get downloaded:
pages between 1 and 7 but not those between 8 and 20
pages between 21 and 29 but not those between 30 and 40
I don't think the problem is in my ThreadDownloads function itself, because before applying threading it was running fine.
Any help is appreciated.
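For comparison, a minimal sketch of the same twenty-at-a-time batching using concurrent.futures; ThreadDownloads, queue, destination and Book are the names from the question, and future.result() also re-raises any exception thrown inside a worker, which the plain join() pattern silently drops:
from concurrent.futures import ThreadPoolExecutor

total_pages = int(Book["total_pages"])
with ThreadPoolExecutor(max_workers=20) as pool:
    # Submit every page; the pool keeps at most 20 downloads running at once.
    futures = [pool.submit(ThreadDownloads, queue, pagenum, destination)
               for pagenum in range(1, total_pages + 1)]
    for future in futures:
        future.result()  # re-raises any exception raised inside ThreadDownloads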

(Neo4j / py2neo) Update relationship after it's been created

I am having trouble updating a relationship property.
My goal is to map a dataset like the following into a Neo4j graph:
   PersonName  IllnessType
0           A            1
1           A            2
2           A            3
3           B            1
4           B            2
5           B            1
I basically cycle over the lines of this dataset, creating a Node for each Person and each Illness found on the line, and merging to avoid duplicates:
from py2neo import *

graph = Graph()
person_node = Node("Person", **kwargs)
graph.merge(person_node, "Person", "Name")
illness_node = Node("Illness", **kwargs)
graph.merge(illness_node, "Illness", "IllnessType")
edge = Relationship.type("SUFFERS_FROM")
rel = edge(person_node, illness_node)
graph.merge(rel)
What I would like to add now is a weight on the "SUFFERS_FROM" edge that counts how many times a person has suffered from a certain illness. What I tried was:
rm = RelationshipMatcher(graph)
edge_to_increment = rm.match(nodes=(None, patNode), r_type=None).first()
if edge_to_increment is None:
    edge_to_increment = edge(person_node, illness_node)
    edge_to_increment["COUNT"] = 1
    graph.merge(edge_to_increment)
else:
    edge_to_increment["COUNT"] += 1
    c = edge_to_increment["COUNT"]
But then when I visualize the result, all edges have weight 1 even though the edge B-->1 should have weight 2.
Thanks in advance
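A minimal sketch of one way to keep such a counter, assuming py2neo 4 or later, where a property changed on an already-matched relationship has to be written back with graph.push():
from py2neo import Graph, Node, Relationship, RelationshipMatcher

graph = Graph()

person_node = Node("Person", Name="B")
graph.merge(person_node, "Person", "Name")
illness_node = Node("Illness", IllnessType=1)
graph.merge(illness_node, "Illness", "IllnessType")

SUFFERS_FROM = Relationship.type("SUFFERS_FROM")
matcher = RelationshipMatcher(graph)
rel = matcher.match((person_node, illness_node), r_type="SUFFERS_FROM").first()
if rel is None:
    rel = SUFFERS_FROM(person_node, illness_node, COUNT=1)
    graph.merge(rel)  # both nodes are already bound, so merge just creates the edge
else:
    rel["COUNT"] += 1
    graph.push(rel)   # persist the incremented property back to Neo4j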

Azure FaceAPI limits iteration to 20 items

I have a list of image urls from which I use MS Azure faceAPI to extract some features from the photos. The problem is that whenever I iterate more than 20 urls, it seems not to work on any url after the 20th one. There is no error shown. However, when I manually changed the range to iterate the next 20 urls, it worked.
Side note: on the free tier, MS Azure Face allows only 20 requests/minute; however, even when I sleep for 60 s after every 10 requests, the problem still persists.
FYI, I have 360,000 urls in total, and so far I have made only about 1000 requests.
Can anyone help tell me why this happens and how to solve this? Thank you so much!
# My codes
i = 0
for post in list_post[800:1000]:
    i += 1
    try:
        image_url = post['photo_url']
        headers = {'Ocp-Apim-Subscription-Key': KEY}
        params = {
            'returnFaceId': 'true',
            'returnFaceLandmarks': 'false',
            'returnFaceAttributes': 'age,gender,headPose,smile,facialHair,glasses,emotion,hair,makeup,occlusion,accessories,blur,exposure,noise',
        }
        response = requests.post(face_api_url, params=params, headers=headers, json={"url": image_url})
        post['face_feature'] = response.json()[0]
    except (KeyError, IndexError):
        continue
    if i % 10 == 0:
        time.sleep(60)
The free tier has a maximum of 30,000 requests per month, so your roughly 360,000 faces would take about a year to process.
The standard tier costs USD 1 per 1,000 requests, giving a total cost of about USD 360. That option supports 10 transactions per second.
https://azure.microsoft.com/en-au/pricing/details/cognitive-services/face-api/
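A minimal throttling sketch for the free tier's 20 requests/minute limit, reusing face_api_url, KEY and list_post from the question (the exact quota depends on your subscription, so treat the numbers as assumptions):
import time
import requests

REQUESTS_PER_MINUTE = 20            # assumed free-tier limit
DELAY = 60.0 / REQUESTS_PER_MINUTE  # seconds to wait between calls

headers = {'Ocp-Apim-Subscription-Key': KEY}
params = {'returnFaceId': 'true', 'returnFaceAttributes': 'age,gender,emotion'}

for post in list_post:
    try:
        response = requests.post(face_api_url, params=params, headers=headers,
                                 json={"url": post['photo_url']})
        response.raise_for_status()  # surfaces HTTP errors such as 429 (rate limited)
        faces = response.json()
        if faces:                    # an empty list means no face was detected
            post['face_feature'] = faces[0]
    except requests.RequestException as err:
        print("request failed:", err)
    time.sleep(DELAY)                # stay under the per-minute quota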

How to display CSV data as a table view on html page using flask?

I'm using Python 3.7.7 and working on the following:
I have a directory where CSV files are created daily:
daily-2020-08-19.csv
daily-2020-08-18.csv
daily-2020-08-17.csv
... and so on...
The structure of my data files (rows sorted by customers):
daily-2020-08-17.csv
userId, calls, customers, emails
2       0      11         30
1       1      5          11
3       11     0          0

daily-2020-08-18.csv
userId, calls, customers, emails
1       11     15         5
2       0      2          5
3       1      1          1
5       0      1          1
4       0      0          0

daily-2020-08-19.csv
userId, calls, customers, emails
2       13     30         55
1       11     15         5
5       3      3          5
3       2      2          1
4       1      1          3
7       1      1          1
6       10     0          5
userId = numbers of active users on a given day. With time there is expected a higher number of Ids as new users are registering everyday.
I've managed to sum data for the last 3 days using pandas and now I have 3days.csv:
userId, customers
2       43
1       35
5       4
3       3
4       1
7       1
6       0
How can I display the above data as a table view on an HTML page using Flask? I'm completely new to Flask, and I'm aiming to have each userId shown in the table as a URL that leads to a separate HTML page specific to that userId.
My code so far:
from flask import Flask
import os

app = Flask(__name__)

#fun var
filepath = os.path.join(os.path.dirname(__file__), 'path/to/my/3days.csv')
open_read = open(filepath, 'r')
page = ''
while True:
    read_data = open_read.readline()
    page += '<p>%s</p>' % read_data
    if open_read.readline() == '':
        break

@app.route("/")
def index():
    return page

if __name__ == "__main__":
    app.run()
With the above code I managed to get only some of the values from my 3days.csv. However, I need a table view with each userId as a URL that leads to a web page dedicated to that particular userId. Could someone help with this?
Thank you in advance!
In table_page.html:
<table>
  {% for a_user in page %}
  <tr>
    <td>User# {{ a_user.userId }}</td>
    <td>{{ a_user.customers }}</td>
  </tr>
  {% endfor %}
</table>
In views.py:
#app.route("/")
def index():
# page need to be a list, like [{'userId': '2', 'customers': '43'},{'userId': '1', 'customers': '35'}, ...]
return render_template("table_page.html", page=page)
Then make a user page:
@app.route('/user_page/<id>')
def user_page(id=None):
    ...
To convert 3days.csv into the page list (assuming it is held in a string variable rather than a CSV file; if it is a file, you'll need to read() it first):
page = []
lines = three_days_text.split('\n')[1:]  # skip the header row; '3days_text' is renamed because it is not a valid Python identifier
for line in lines:
    line_as_list = line.split(",")
    page.append({'userId': line_as_list[0].strip(), 'customers': line_as_list[1].strip()})
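Putting the pieces together, a minimal end-to-end sketch under the same assumptions (3days.csv is a comma-separated file with a "userId, customers" header sitting next to the app, and table_page.html is the template above; the per-user page is only stubbed):
import csv
import os
from flask import Flask, render_template

app = Flask(__name__)
CSV_PATH = os.path.join(os.path.dirname(__file__), "3days.csv")

def load_rows():
    # Returns a list like [{'userId': '2', 'customers': '43'}, ...]
    with open(CSV_PATH, newline="") as f:
        reader = csv.DictReader(f, skipinitialspace=True)
        return [{"userId": row["userId"], "customers": row["customers"]} for row in reader]

@app.route("/")
def index():
    return render_template("table_page.html", page=load_rows())

@app.route("/user_page/<user_id>")
def user_page(user_id):
    # Placeholder: look up and render whatever per-user data you keep for user_id
    return "Details for user {}".format(user_id)

if __name__ == "__main__":
    app.run()
To make each userId a link to its page, the first cell of the template can become <td><a href="{{ url_for('user_page', user_id=a_user.userId) }}">{{ a_user.userId }}</a></td>.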

How do I extract a particular section from a text file in Python 3

This is my python file
path = '/my/file/list.txt'
with open(path, 'rt') as file:
    print("step 1")
    collected_lines = []
    started = False
    for line in file:
        for n in range(1, 10):
            if line.startswith('PLAY NO.{}'.format(n)):
                started = True
                print("started at line {}".format(line[0]))
                continue
            if started:
                collected_lines.append(line)
            if started and line == 'PLAY NO.{}'.format(n+1):
                print("end at line {}".format(line[0]))
                break
        print(collected_lines.append(line))
This is my code..
OUTPUT:
None
None
None
None
None
None
Now I want the lines starting from PLAY NO. 2 up to PLAY NO. 3, but I am getting None. Any suggestions? I am using Python 3.5.
Sorry, this is the first time I'm asking a question on this site.
My file looks like this..
textfile.txt
Hello and Welcome This is the list of plays being performed here
PLAY NO. 1
1. adknjkd
2. skdi
3. ljdij
PLAY NO. 2
1. hsnfhkjdnckj
2. sjndkjhnd and so on
path = 'list.txt'
collected_lines = []
with open(path, 'rt') as file:
    print("step 1")
    started = False
    lineNo = 0
    for line in file:
        lineNo += 1
        for n in range(1, 10):
            # print('PLAY NO. {}'.format(n))
            if started and line.lstrip().startswith('PLAY NO. {}'.format(n)):
                print("### end at line {}".format(lineNo))
                started = False
                break
            if line.lstrip().startswith('PLAY NO. {}'.format(n)):
                started = True
                print("### started at line {}".format(lineNo))
                break
        if started:
            collected_lines.append(line)
print("collected_lines: \n\n", *[item for item in collected_lines])
gives:
step 1
### started at line 2
### end at line 7
collected_lines:
PLAY NO. 1
1. adknjkd
2. skdi
3. ljdij
NOTES about fixed issues:
used .lstrip() in order to make .startswith() work as expected
added a space between NO. and {} in startswith('PLAY NO. {}'.format(n)) so that the if condition can find the line
rearranged the order of the ifs to avoid the end line being treated as found at the start line
added started = False to the loop to stop collecting the lines
The problem with the leading spaces was already enough to prevent the code from finding the line. Fixing this alone wouldn't fix the problem because of the missing space in the format string, so both issues had to be fixed to make the code work as expected. And so on ... see NOTES above.
If you want a dict keyed by play number, where each value is a list of the lines belonging to that play, you can use a defaultdict.
Defining the text
text = """Hello and Welcome This is the list of plays being performed here
PLAY NO. 1
1. adknjkd
2. skdi
3. ljdij
PLAY NO. 2
1. hsnfhkjdnckj
2. sjndkjhnd and so on"""
Defining the regular expression (re, StringIO and defaultdict need to be imported for these snippets)
import re
from collections import defaultdict
from io import StringIO

regex = re.compile(r'^\s*PLAY NO. (\d+)$')
Parsing the lines
label = None  # no play to start with
recorded_lines = defaultdict(list)
for line_no, line in enumerate(StringIO(text)):
    # In the real code replace the 'StringIO(text)' with 'file'
    try:
        play_no = int(regex.findall(line)[0])
        # If this regex does not match, it will throw an IndexError.
        # The code underneath is only executed when a new play starts.
        if label:  # if there is no play underway, there can be no ending
            print('PLAY NO. %i ended at line number %i' % (label, line_no - 1))
        label = play_no
        print('PLAY NO. %i started at line number %i' % (play_no, line_no))
    except IndexError:
        # no new play started
        if label and line.strip():
            recorded_lines[play_no].append((line_no, line.strip()))
    print(line_no, line)
print(recorded_lines)
yields
defaultdict(list,
{1: [(2, '1. adknjkd'), (3, '2. skdi'), (4, '3. ljdij')],
2: [(7, '1. hsnfhkjdnckj'), (8, '2. sjndkjhnd and so on')]})
and this output on stdout:
0 Hello and Welcome This is the list of plays being performed here
PLAY NO. 1 started at line number 1
1 PLAY NO. 1
2 1. adknjkd
3 2. skdi
4 3. ljdij
5
PLAY NO. 1 ended at line number 5
PLAY NO. 2 started at line number 6
6 PLAY NO. 2
7 1. hsnfhkjdnckj
8 2. sjndkjhnd and so on
