Hello, when I use like_by_feed it doesn't work and just returns:
---->> Total of links fetched for analysis: 0
---->> Total of links fetched for analysis: 0
I think the problem is the XPath //article/div[2]/div[2]/a — this XPath is used when the bot wants to get the link of a post.
The XPath seems to be broken. You can use the following to fix Like By Feed.
Find the file "xpath_compile.py" (on Windows it is under \Python\Lib\site-packages\instapy\xpath_compile.py).
Find and replace the text
xpath["get_links_from_feed"] = {"get_links":
"//article/div[3]/div[2]/a"}
with
xpath["get_links_from_feed"] = {"get_links":
"//article/div/div[3]/div[2]/a"}
Save the file and run your Bot again.
I want to automatically download a video from CNN's website using Python 3.
The address is https://edition.cnn.com/cnn10 .
Every weekday CNN puts a new video on that website.
I know how to find the video's real URL manually using the Chrome browser:
Open the address in Chrome, then press F12 to open the developer tools.
Choose the Network tab, then click the PLAY button to start playing the video.
Then I can see the request with the real URL.
The video's real URL is https://pmd.cdn.turner.com/cnn/big/cnn10/2018/09/13/caption/ten-0914.cnn_2253601_768x432_1300k.mp4
The following Python 3 code can download that video:
import requests

print("Download started!")
url = 'https://pmd.cdn.turner.com/cnn/big/cnn10/2018/09/13/caption/ten-0914.cnn_2253601_768x432_1300k.mp4'
r = requests.get(url, stream=True)
with open('20180914.mp4', "wb") as mp4:
    for chunk in r.iter_content(chunk_size=768 * 432):
        if chunk:
            mp4.write(chunk)
print("Download over!")
My problem is how to get that URL using Python or some other automatic way, because I want to download these different videos automatically every weekday.
What I have already tried:
I looked for a solution on the Internet but failed.
Then I collected some videos' URLs and tried to find a regular pattern in them, like this:
https://pmd.cdn.turner.com/cnn/big/cnn10/2018/09/11/caption/ten-0912.cnn_2250840_768x432_1300k.mp4
https://pmd.cdn.turner.com/cnn/big/cnn10/2018/09/10/caption/ten-0911.cnn_2249562_768x432_1300k.mp4
Obviously the date part corresponds to each weekday, but there are 7 "random" digits in the URL. I still cannot figure out what these numbers mean!
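For what it's worth, once the page (or one of its XHR/JSON responses) has been fetched, CDN URLs of this shape can be pulled out with a regular expression, which sidesteps guessing the 7 "random" digits. A minimal sketch, where sample_text is a made-up stand-in for whatever requests.get(...).text would return:

```python
import re

# Stand-in for fetched page/XHR text; the URL is a real example from above.
sample_text = ('... "url": "https://pmd.cdn.turner.com/cnn/big/cnn10/2018/09/13/'
               'caption/ten-0914.cnn_2253601_768x432_1300k.mp4" ...')

# Match the CDN pattern seen in the example URLs, up to the .mp4 extension.
mp4_urls = re.findall(r'https://pmd\.cdn\.turner\.com/[^"\s]+\.mp4', sample_text)
print(mp4_urls)
```

The same findall applied to the real page text (or to the JSON responses visible in the Network tab) should surface the day's URL without needing to decode the numeric ID.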
Any help will be appreciated!
Thank you!
I am new to Python and I'm using this script to get tweets. But the problem is that it is not giving the full text; instead it gives me the URL of the tweet.
Output:
"text": "#Damien85901071 #Loic_23 #EdwinZeTwiter #Christo33332 #lequipedusoir #Cristiano #RealMadrid_FR #realfrance_fr\u2026 ShortenURL",
What changes do I need to make in this script to get the full text?
Look into Twitter's tweet_mode=extended option and the places in the Python code where you might need to add that to the script.
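As a hedged sketch of what that change amounts to: with tweet_mode=extended the API returns the untruncated text in a full_text field instead of text, so the extraction step should prefer that field when present. The payloads below are minimal made-up examples, not real API responses:

```python
def tweet_text(status):
    # With tweet_mode=extended the API fills `full_text`; older-style
    # responses only carry the (possibly truncated) `text` field.
    return status.get("full_text") or status.get("text")

truncated = {"text": "#RealMadrid_FR #realfrance_fr\u2026 (shortened link)"}
extended = {"full_text": "#RealMadrid_FR #realfrance_fr and the rest of the tweet"}

print(tweet_text(truncated))
print(tweet_text(extended))
```

In a script built on tweepy, the equivalent is passing tweet_mode="extended" to the API call and reading status.full_text instead of status.text.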
I ran into a very weird problem when trying to parse data from a JavaScript-driven website. Maybe it's because I am not an expert in web development.
Here is what happened:
I am trying to get all the comment data from The Globe and Mail. If you check its source code, there is no way to use Python to parse the comment data from it; everything is rendered by JavaScript.
However, there is a handy tool, the "Gigya" API, which can return all the comments from a JS-rendered website: Gigya getComments method.
When I was using these lines of code in a Python Scrapy spider, it could return all the comments.
data = {
    "categoryID": self.categoryID,
    "streamID": streamId,
    "APIKey": self.apikey,
    "callback": "foo",
    "threadLimit": 1000  # assume no article has more than 1000 comments
}
r = urlopen("http://comments.us1.gigya.com/comments.getComments", data=urlencode(data).encode("utf-8"))
comments_lst = loads(r.read().decode("utf-8"))["comments"]
However, The Globe and Mail is updating its website, and all the comments posted before Nov. 28 are hidden from the web for now. That's why on the sample URL I am showing here you can only see 2 comments: they were posted after Nov. 28, and these 2 new comments have the new feature, the "React" button.
The weird thing is, right now when I run my code I can get all those hundreds of hidden comments published before Nov. 28, but I cannot get the new comments we can see on the website now.
I have tried all the Gigya comment-related methods and none of them worked; the other Gigya methods do not look helpful either...
Is there any way to solve this problem?
Or at least, do you know why I can get all the hidden comments but not the visible new comments that have the new features?
Finally, I solved the problem with the Python selenium library; it's free and it's super cool.
So it seems that although we cannot see the content in the source code of a JS-rendered website, it does in fact have a rendered HTML page from which we can parse the content.
First of all, I installed Firebug on Firefox. With this add-on I can see the rendered HTML of the URL, and it's very easy to locate the content: just search for keywords in Firebug.
Then I wrote the code like this:
from selenium import webdriver
import time

def main():
    comment_urls = [
        "http://www.theglobeandmail.com/opinion/a-fascists-win-americas-moral-loss/article32753320/comments/"
    ]
    for comment_url in comment_urls:
        driver = webdriver.Firefox()
        driver.get(comment_url)
        time.sleep(5)
        htmlSource = driver.page_source
        clk = driver.find_element_by_css_selector('div.c3qHyJD')
        clk.click()
        reaction_counts = driver.find_elements_by_class_name('c2oytXt')
        for rc in reaction_counts:
            print(rc.text)

if __name__ == "__main__":
    main()
The data I am parsing here is content that cannot be found in the HTML until you click the reaction image on the website. What makes selenium super cool is the click() method: after you find a clickable element, just call this method, and the generated elements will appear in the HTML and become parsable. Super cool!
I know that my authentication is working and I can make tweets just fine. I have no reason to believe that the tweepy library is causing this problem, though I suppose I have no reason to rule it out. My code looks something like this, attempting to tweet an emoji flag. No other emoji is working either.
auth = tweepy.OAuthHandler(keys.twitter['consumer_key'], keys.twitter['consumer_secret'])
auth.set_access_token(keys.twitter['access_token'], keys.twitter['access_token_secret'])
api = tweepy.API(auth)
print('Connected to the Twitter API...')
api.update_status(r'testing \U0001F1EB\U0001F1F7')
I get an error code 400 with seemingly no additional info about the reason. I'm trying to determine whether the problem is that the encoding of the string is somehow wrong, or whether it is simply some problem with sending it to Twitter's API.
You have Unicode in your string. Because you are using Python 2, use:
api.update_status(u"testing \U0001F1EB\U0001F1F7")
Notice that I changed r"x" to u"x".
If you are using Python 3, you can just use:
api.update_status("testing \U0001F1EB\U0001F1F7")
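A quick way to see why the raw string fails (Python 3 shown; in Python 2 compare r'' against u''): the raw string keeps the backslash escapes as literal characters that Twitter rejects, while the Unicode string contains the two regional-indicator code points that render as the flag:

```python
raw = r'\U0001F1EB\U0001F1F7'  # backslash escapes kept as literal text
uni = '\U0001F1EB\U0001F1F7'   # two code points forming the FR flag emoji

print(len(raw))  # 20: ten characters of escape text per "emoji"
print(len(uni))  # 2: two regional-indicator code points
```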
Update: This is the tweet that I sent using this code:
https://twitter.com/twistedhardware/status/686527722634383360
I have a PDF file and I want to verify whether the links in it are proper. Proper in the sense that all specified URLs link to web pages and nothing is broken. I am looking for a simple utility or script which can do this easily.
Example:
$ testlinks my.pdf
There are 2348 links in this pdf.
2322 links are proper.
Remaining broken links and page numbers in which it appears are logged in brokenlinks.txt
I have no idea whether something like that exists, so I googled and searched on Stack Overflow as well, but did not find anything useful yet. So I would like to know if anyone has any idea about it!
Updated: to make the question clear.
You can use pdf-link-checker
pdf-link-checker is a simple tool that parses a PDF document and checks for broken hyperlinks. It does this by sending simple HTTP requests to each link found in a given document.
To install it with pip:
pip install pdf-link-checker
Unfortunately, one dependency (pdfminer) is broken. To fix it:
pip uninstall pdfminer
pip install pdfminer==20110515
I suggest first using the Linux command-line utility 'pdftotext' - you can find the man page:
pdftotext man page
The utility is part of the Xpdf collection of PDF processing tools, available on most linux distributions. See http://foolabs.com/xpdf/download.html.
Once installed, you could process the PDF file through pdftotext:
pdftotext file.pdf file.txt
Once converted, you can write a simple Perl script that searches the resulting text file for http URLs and retrieves them using LWP::Simple. LWP::Simple->get('http://...') lets you validate the URLs with a code snippet such as:
use LWP::Simple;
$content = get("http://www.sn.no/");
die "Couldn't get it!" unless defined $content;
That would accomplish what you want to do, I think. There are plenty of resources on how to write regular expressions to match http URLs, but a very simple one would look like this:
m/http[^\s]+/i
"http followed by one or more non-space characters" - assuming the URLs are properly URL-encoded.
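The same case-insensitive match translates directly to Python's re module if you'd rather stay in one language; a small sketch on made-up input text:

```python
import re

# Stand-in for the text produced by pdftotext.
text = "Intro http://www.sn.no/ and HTTPS://example.com/page end."

# Equivalent of the Perl m/http[^\s]+/i match.
urls = re.findall(r'http[^\s]+', text, flags=re.IGNORECASE)
print(urls)
```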
There are two lines of enquiry with your question.
Are you looking for regex verification that the link contains key information such as http:// and valid TLD codes? If so I'm sure a regex expert will drop by, or have a look at regexlib.com which contains lots of existing regex for dealing with URLs.
Or, if you want to verify that a website exists, then I would recommend Python + Requests, as you could script checks to see whether websites exist and do not return error codes.
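A minimal sketch of that approach (the function name is mine, and it assumes the requests package is installed):

```python
import requests

def link_ok(url, timeout=10):
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        # HEAD is cheap; some servers reject it, so fall back to GET.
        r = requests.head(url, allow_redirects=True, timeout=timeout)
        if r.status_code >= 400:
            r = requests.get(url, stream=True, timeout=timeout)
        return r.status_code < 400
    except requests.RequestException:
        # Covers malformed URLs, DNS failures, timeouts, refused connections.
        return False
```

Feeding this the URLs extracted from the PDF and logging the False results would produce exactly the brokenlinks.txt report the question asks for.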
It's a task which I'm currently undertaking for pretty much the same purpose at work. We have about 54k links to get processed automatically.
Collect links by:
enumerating links using an API, or dumping the PDF as text and linkifying the result, or saving it as HTML with PDFMiner.
Make requests to check them:
there are a plethora of options depending on your needs.
https://stackoverflow.com/a/42178474/1587329's advice was inspiration to write this simple tool (see gist):
'''loads pdf file in sys.argv[1], extracts URLs, tries to load each URL'''
import urllib
import sys

import PyPDF2

# credits to stackoverflow.com/questions/27744210

def extract_urls(filename):
    '''extracts all urls from filename'''
    PDFFile = open(filename, 'rb')
    PDF = PyPDF2.PdfFileReader(PDFFile)
    pages = PDF.getNumPages()

    key = '/Annots'
    uri = '/URI'
    ank = '/A'

    for page in range(pages):
        pageSliced = PDF.getPage(page)
        pageObject = pageSliced.getObject()
        if key in pageObject:
            ann = pageObject[key]
            for a in ann:
                u = a.getObject()
                if uri in u[ank]:
                    yield u[ank][uri]

def check_http_url(url):
    urllib.urlopen(url)

if __name__ == "__main__":
    for url in extract_urls(sys.argv[1]):
        check_http_url(url)
Save it to filename.py and run it as python filename.py pdfname.pdf.