When we download data from Instagram with the following command, Instaloader scans every post of the account even when a time window is provided (it skips posts older than the cutoff, but it still scans the whole history):
instaloader --login=username \
    --password=password \
    --post-metadata-txt="{likes} likes, {comments} comments." \
    --post-filter="date_utc >= datetime(2019, 12, 31) and not is_video"
This is very inefficient. I am wondering whether there is a more efficient way to download the data.
This is not directly supported by Instaloader's command-line interface, meaning you have to write a little Python script to achieve it. There is an example in the Instaloader documentation for downloading posts in a specific period. It differs from what you want to achieve in only a few points:
Use a custom post_metadata_txt_pattern. To do so, instantiate Instaloader with
L = instaloader.Instaloader(post_metadata_txt_pattern="{likes} likes, {comments} comments.")
Log in:
L.load_session_from_file('username')
Download from the latest post until the given date is reached (there is no specific UNTIL date, only a SINCE date), which allows for an even simpler loop. Also filter with not is_video:
for post in takewhile(lambda p: p.date_utc > datetime(2019, 12, 31), posts):
    if not post.is_video:
        L.download_post(post, 'target_directory')
The key is the takewhile function, which ends the download loop when a post is encountered that does not match the given condition. Considering that posts come newest-first, the download loop terminates as soon as all new-enough posts have been downloaded.
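To illustrate how takewhile cuts the iteration short, here is a standalone sketch that is not part of the Instaloader example:

from itertools import takewhile

# takewhile stops at the first element that fails the predicate, so with
# newest-first data nothing older than the cutoff is even looked at.
numbers = [9, 8, 7, 3, 6, 8]
print(list(takewhile(lambda n: n > 5, numbers)))  # prints [9, 8, 7]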
Putting it all together, we get:
from datetime import datetime
from itertools import takewhile
import instaloader
L = instaloader.Instaloader(post_metadata_txt_pattern="{likes} likes, {comments} comments.")
L.load_session_from_file('username')
posts = L.get_hashtag_posts('hashtag')
# or
# posts = instaloader.Profile.from_username(L.context, 'profile').get_posts()
for post in takewhile(lambda p: p.date_utc > datetime(2019, 12, 31), posts):
    if not post.is_video:
        L.download_post(post, 'target_directory')
Write that to a file, e.g. downloader.py, and execute it with python downloader.py. The call to load_session_from_file assumes there is already a saved Instaloader session; to create one, simply run instaloader -l username before executing the code snippet.
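If you would rather not rely on a session file, you can also log in from the script itself; a minimal sketch, assuming the login and save_session_to_file methods of instaloader.Instaloader in your installed version:

import instaloader

L = instaloader.Instaloader(post_metadata_txt_pattern="{likes} likes, {comments} comments.")
L.login('username', 'password')   # log in directly instead of loading a saved session
L.save_session_to_file()          # so later runs can use load_session_from_file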
from bs4 import BeautifulSoup
import requests
import smtplib
import time

def live_news():
    source = requests.get(
        "https://economictimes.indiatimes.com/news/politics-and-nation/coronavirus-cases-in-india-live-news-latest-updates-april6/liveblog/75000925.cms"
    ).text
    soup = BeautifulSoup(source, "lxml")
    livepage = soup.find("div", class_="pageliveblog")
    each_story = livepage.find("div", class_="eachStory")
    news_time = each_story.span.text
    new_news = each_story.div.text[8:]
    print(f"{news_time}\n{new_news}")

while True:
    live_news()
    time.sleep(300)
So basically what I'm trying to do is scrape the latest news updates from a news website. What I'm looking for is to print only the latest news item along with its time, not the entire list of headlines.
With the above code I can get the latest news update, and the program sends a request to the server every 5 minutes (that's the delay I've given). But the problem is that it prints the same previously printed news again after 5 minutes if no new update has appeared on the page. I don't want the program to print the same news again; instead I would like to add a condition so that every 5 minutes it checks whether there is a new update or it is the same previous news. If there is a new update it should print it, otherwise it should not.
The solution that I can think of is an if-statement. On the first run of your code, the variable check_last_time is empty, and when you call live_news() it gets assigned the value of news_time.
After that, each time live_news() is called, it first checks whether the current news_time is the same as check_last_time, and if it is not, it prints the new story:
# Initialise the variable outside of the function
check_last_time = None

def live_news():
    global check_last_time  # needed so the assignment below updates the module-level variable
    ...
    ...
    # Check whether the times don't match
    if news_time != check_last_time:
        print(f"{news_time}\n{new_news}")
        # Update the variable with the new time
        check_last_time = news_time
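Putting the check into the code from the question, a possible full version might look like this; it reuses the URL and selectors from the question, and only the check_last_time logic is new:

import time
import requests
from bs4 import BeautifulSoup

check_last_time = None  # time stamp of the last story we printed

def live_news():
    global check_last_time
    source = requests.get(
        "https://economictimes.indiatimes.com/news/politics-and-nation/coronavirus-cases-in-india-live-news-latest-updates-april6/liveblog/75000925.cms"
    ).text
    soup = BeautifulSoup(source, "lxml")
    livepage = soup.find("div", class_="pageliveblog")
    each_story = livepage.find("div", class_="eachStory")
    news_time = each_story.span.text
    new_news = each_story.div.text[8:]
    # Only print when the story's time stamp has changed since the last call
    if news_time != check_last_time:
        print(f"{news_time}\n{new_news}")
        check_last_time = news_time

while True:
    live_news()
    time.sleep(300)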
I have found the answer myself. I feel kind of stupid; it's very simple: all you need is an additional file to store the value. Since the variable values get reset between executions, you need an additional file to read and write the data whenever you need it.
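A minimal sketch of that file-based approach; the filename last_news_time.txt and the helper functions are illustrative assumptions, not taken from the original post:

import os

STATE_FILE = "last_news_time.txt"  # example filename for the persisted value

def read_last_time():
    # Return the previously stored time stamp, or None on the very first run
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return f.read().strip()
    return None

def write_last_time(news_time):
    with open(STATE_FILE, "w") as f:
        f.write(news_time)

# Inside live_news(), the check then becomes:
#     if news_time != read_last_time():
#         print(f"{news_time}\n{new_news}")
#         write_last_time(news_time)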
Is it possible to obtain the URL from the Google search results page, given a keyword? I have a CSV file that contains a lot of company names, and I want the website that shows up at the top of the Google search results for each of them: when I upload the CSV file, each company name/keyword should be fetched and put into the search field.
For example: "stack overflow" is one of the entries in my CSV file; it should be fetched, put into the search field, and the best match/first URL from the search results should be returned, e.g. www.stackoverflow.com.
The returned result should be stored in the same file I uploaded, next to the keyword it was searched for.
I am not very familiar with these concepts, so any help will be much appreciated.
Thanks!
The google package has a dependency on beautifulsoup, which needs to be installed first.
Then install:
pip install google
search(query, tld='com', lang='en', num=10, start=0, stop=None, pause=2.0)
query : query string that we want to search for.
tld : tld stands for top-level domain, i.e. whether we want to search on google.com, google.in, or some other domain.
lang : lang stands for language.
num : Number of results we want.
start : First result to retrieve.
stop : Last result to retrieve. Use None to keep searching forever.
pause : Lapse to wait between HTTP requests. Too short a lapse may cause Google to block your IP; a longer lapse makes the program slower but is the safer option.
Return : Generator (iterator) that yields found URLs. If the stop parameter is None the iterator will loop forever.
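For example, a single query with these parameters looks like this (a minimal sketch; stop=1 keeps only the first result):

from googlesearch import search

# Print the first Google result for one keyword
for url in search("stack overflow", tld="com", num=10, stop=1, pause=2):
    print(url)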
The code below is a solution to your question.
import pandas
from googlesearch import search
df = pandas.read_csv('test.csv')
result = []
for i in range(len(df['keys'])):
    # keep only the first search result for each keyword
    for j in search(df['keys'][i], tld="com", num=10, stop=1, pause=2):
        result.append(j)
dict1 = {'keys': df['keys'], 'url': result}
df = pandas.DataFrame(dict1)
df.to_csv('test.csv')
I have successfully written code to web-scrape an HTTPS text page:
https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt
This page is automatically updated every 60 seconds. I used BeautifulSoup4 to do so. Here are my two questions: 1) How do I set up a loop to re-scrape the page every 60 seconds? 2) Since there are no HTML tags on the page, how can I scrape only a specific line of data?
I was thinking that I might have to save the scraped page as a CSV file and then use the saved page to extract the data I need. However, I'm hoping this can all be done without saving the page to my local machine, ideally with a Python package that handles it for me.
import bs4 as bs
import urllib
sauce = urllib.urlopen("https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt").read()
soup = bs.BeautifulSoup(sauce, 'lxml')
print (soup)
I would like to automatically scrape the first line of data every 60 seconds. Here is an example first line of data:
2019 03 30 1233 58572 45180 9.94e-09 1.00e-09
The header that goes with this data is
YR MO DA HHMM Day Day Short Long
Ultimately I would like to use PyAutoGUI to trigger a CCD imaging application to start a sequence of images when the 'Short' and/or 'Long' X-ray flux reaches e-04 or greater.
Every tool has its place.
BeautifulSoup is a wonderful tool, but the .txt suffix on that URL is a big hint that this isn't quite the HTML input which bs4 was designed for.
I recommend using a simpler approach for this fairly simple input.
from itertools import filterfalse

def is_comment(line):
    return (line.startswith(':')
            or line.startswith('#'))

lines = list(filterfalse(is_comment, sauce.split('\n')))
Now you can split each line on whitespace to convert the data to CSV or a pandas DataFrame.
Or you can just use lines[0] to access the first line.
For example, you might parse it out in this way:
yr, mo, da, hhmm, jday, sec, short, long = map(float, lines[0].split())
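Tying this back to the two questions, a sketch of the 60-second polling loop might look like the following. The 1e-4 threshold check is where a PyAutoGUI trigger would go; the variable and function names, and the use of urllib.request (Python 3), are assumptions rather than part of the original answer:

import time
import urllib.request
from itertools import filterfalse

URL = "https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt"

def is_comment(line):
    return line.startswith(':') or line.startswith('#')

def first_data_line():
    # Fetch the page, drop comment lines, and return the first data row
    raw = urllib.request.urlopen(URL).read().decode('ascii', errors='replace')
    lines = list(filterfalse(is_comment, raw.split('\n')))
    return lines[0]

while True:
    yr, mo, da, hhmm, jday, sec, short_flux, long_flux = map(float, first_data_line().split())
    if short_flux >= 1e-4 or long_flux >= 1e-4:
        print("Flux threshold reached:", short_flux, long_flux)
        # here you would kick off the PyAutoGUI imaging sequence
    time.sleep(60)  # re-scrape once a minute, matching the page's update rate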
I want a script to collect random tweets from Chicago, without any keyword, that runs automatically every 30 minutes and collects tweets for 20 milliseconds (for example).
All the available code examples need keywords, and in most of them I can't define a geographic location.
Thanks for your help.
See these pages: An Introduction to Text Mining using Twitter Streaming API and Python, and also this page: run a python script every hour.
This is very doable. With Twitter's REST API a keyword is required; however, Twitter also provides a Streaming API which can filter tweets by either a keyword or a location. In your case, you would need to define the bounding box of Chicago in longitudes and latitudes, and then supply it to Twitter's statuses/filter endpoint, documented here: https://developer.twitter.com/en/docs/tweets/filter-realtime/api-reference/post-statuses-filter.html. This endpoint has a locations parameter that you would use. It returns tweets as they are posted, so no timer is required.
You can use tweepy for this. Or, with TwitterAPI you would simply do something like this:
from TwitterAPI import TwitterAPI
api = TwitterAPI(CONSUMERKEY,CONSUMERSECRET,ACCESSTOKENKEY,ACCESSTOKENSECRET)
r = api.request('statuses/filter', {'locations':'-87.9,41.6,-87.5,42.0'})
for item in r:
    print(item)
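If you only want the tweet text rather than the full JSON, a small variation on that loop could work; note this is a hedged sketch, since the stream may also deliver non-tweet messages that lack a 'text' key:

for item in r:
    # Skip stream messages that are not tweets
    if 'text' in item:
        print(item['created_at'], item['text'])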
This script is giving me a 500 error, any ideas?
I took the script from a page of Python samples, and I am also using the path given to me by my hosting company (and I know it works because I have another script that does work).
The file has 755 permissions, as does its directory:
#!/home3/master/bin/python
import sys
sys.path.insert(1,'/home3/master/lib/python2.6/site-packages')
from twython import Twython
twitter = Twython()
trends = twitter.getCurrentTrends()
print trends
There are two problems with this code. The first is that you have not included any OAuth data, so the Twitter API will reject whatever you send. The second is that there is no getCurrentTrends() attribute in Twython. Did you mean get_available_trends or get_closest_trends?
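A hedged sketch of what a corrected call might look like, using get_available_trends as suggested above; the four credential values are placeholders you would fill in from your Twitter app settings:

from twython import Twython

# Placeholder OAuth credentials from your Twitter developer app
APP_KEY = '...'
APP_SECRET = '...'
OAUTH_TOKEN = '...'
OAUTH_TOKEN_SECRET = '...'

twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
trends = twitter.get_available_trends()  # locations for which Twitter has trend data
print(trends)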