Searching Youtube videos using youtube-dl - python-3.x

I am trying to build a Discord music bot and I need to search YouTube using keywords given by the user. Currently I only know how to play from a URL.
loop = loop or asyncio.get_event_loop()
data = await loop.run_in_executor(None, lambda: ytdl.extract_info(url, download=not stream))
if "entries" in data:
    data = data["entries"][0]
filename = data["url"] if stream else ytdl.prepare_filename(data)
return cls(discord.FFmpegPCMAudio(filename, **ffmpeg_options), data=data)

youtube_dl has an extract_info method that you can use. Instead of giving it a link, you just have to pass it "ytsearch:" followed by your search terms, like so:
from requests import get
from youtube_dl import YoutubeDL

YDL_OPTIONS = {'format': 'bestaudio', 'noplaylist': True}

def search(arg):
    with YoutubeDL(YDL_OPTIONS) as ydl:
        try:
            get(arg)  # raises if arg is not a valid URL
        except Exception:
            video = ydl.extract_info(f"ytsearch:{arg}", download=False)['entries'][0]
        else:
            video = ydl.extract_info(arg, download=False)
    return video
A few important things about this function:
It works with both search words and URLs.
If you make a YouTube search, the output will be a list of dictionaries; in that case the function returns the first result.
It will return a dictionary containing the following information:
video = search("30 sec video")

# Doesn't contain all the data; some keys are not very important
cleared_data = {
    'channel': video['uploader'],
    'channel_url': video['uploader_url'],
    'title': video['title'],
    'description': video['description'],
    'video_url': video['webpage_url'],
    'duration': video['duration'],        # in seconds
    'upload_date': video['upload_date'],  # YYYYMMDD
    'thumbnail': video['thumbnail'],
    'audio_source': video['formats'][0]['url'],
    'view_count': video['view_count'],
    'like_count': video['like_count'],
    'dislike_count': video['dislike_count'],
}
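To tie this back to the bot, here is a rough sketch, not a drop-in solution: it assumes the snippet at the top of the question lives in the usual YTDLSource.from_url classmethod from the discord.py music-bot example, that a commands.Bot instance named bot exists, and that the bot is already connected to a voice channel. search() is the helper defined above.

@bot.command(name="play")
async def play(ctx, *, query):
    video = search(query)  # works for both keywords and URLs
    async with ctx.typing():
        # YTDLSource.from_url is assumed to wrap the extract_info snippet shown in the question
        player = await YTDLSource.from_url(video["webpage_url"], loop=bot.loop, stream=True)
        ctx.voice_client.play(player)
    await ctx.send(f"Now playing: {video['title']}")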

I'm not sure youtube-dl is well suited to searching for YouTube URLs by keyword. You should probably take a look at youtube-search for this; a quick sketch follows.
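A hedged sketch using the youtube-search package (pip install youtube-search); the key names below follow its README at the time of writing and may differ in newer releases.

from youtube_search import YoutubeSearch

results = YoutubeSearch("30 sec video", max_results=5).to_dict()
for r in results:
    # each result dict carries the title and a relative URL suffix
    print(r["title"], "https://www.youtube.com" + r["url_suffix"])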

Related

Unable to perform login for scraping with Scrapy

This is my first time asking a question here, so please bear with me if I'm not providing everything that is needed.
I'm trying to build a spider that goes to this website (https://newslink.sg/user/Login.action), logs in (I have a valid set of username and password) and then scrape some pages.
I'm unable to get past the login stage.
I suspect it has to do with the formdata and what I enter inside, as there are "login.x" and "login.y" fields when I check the form data. The login.x and login.y fields seem to change whenever I log in again.
This question and answer seem to provide a hint at how I can fix things, but I don't know how to go about extracting the correct values:
Python scrapy - Login Authenication Issue
Below is my code with some modification.
import scrapy
from scrapy.selector import Selector
from scrapy.http import Request

class BtscrapeSpider(scrapy.Spider):
    name = "btscrape"
    # allowed_domains = [""]
    start_urls = [
        "https://newslink.sg/user/Login.action"
    ]

    def start_requests(self):
        return [scrapy.FormRequest("https://newslink.sg/user/Login.action",
                                   formdata={'IDToken1': 'myusername',
                                             'IDToken2': 'mypassword',
                                             'login.x': 'what do I do here?',
                                             'login.y': 'what do I do here?'
                                             },
                                   callback=self.after_login)]

    def after_login(self, response):
        return Request(
            url="webpage I want to scrape after login",
            callback=self.parse_bt
        )

    def parse_bt(self, response):  # Define parse() function.
        items = []  # Element for storing scraped information.
        hxs = Selector(response)  # Selector allows us to grab HTML from the response (target website).
        item = BtscrapeItem()  # item class defined in the project's items.py
        item['headline'] = hxs.xpath("/html/body/h2").extract()  # headline
        item['section'] = hxs.xpath("/html/body/table/tbody/tr[1]/td[2]").extract()  # section of the newspaper the story appeared in
        item['date'] = hxs.xpath("/html/body/table/tbody/tr[2]/td[2]/text()").extract()  # date of publication
        item['page'] = hxs.xpath("/html/body/table/tbody/tr[3]/td[2]/text()").extract()  # page the story appeared on
        item['word_num'] = hxs.xpath("/html/body/table/tbody/tr[4]/td[2]").extract()  # number of words in the story
        item['text'] = hxs.xpath("/html/body/div[@id='bodytext']/text()").extract()  # text of the story
        items.append(item)
        return items
If I run the code without the login.x and login.y lines, I get blank scrapes.
Thanks for your help!
Two possible reasons:
You don't send the goto: https://newslink.sg/secure/redirect2.jsp?dest=https://newslink.sg/user/Login.action?login= form parameter.
You need cookies for the auth part.
So I recommend that you rewrite it this way:
start_urls = [
    "https://newslink.sg/user/Login.action"
]

def parse(self, response):
    yield scrapy.FormRequest.from_response(
        response,  # from_response needs the login page response as its first argument
        formnumber=1,
        formdata={
            'IDToken1': 'myusername',
            'IDToken2': 'mypassword',
            'login.x': '2',
            'login.y': '6',
        },
        callback=self.after_login,
    )
Scrapy will send goto automatically for you. login.x and login.y are just the cursor coordinates of your click on the Login button (browsers send these coordinates for image-type submit inputs), so the exact values don't matter.
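It also helps to verify in after_login that the login actually worked before crawling further. A minimal sketch of such a check (the "Logout" marker is a guess; use whatever text or element only appears once you are signed in):

def after_login(self, response):
    if b"Logout" not in response.body:
        self.logger.error("Login failed")
        return
    yield Request(
        url="webpage I want to scrape after login",
        callback=self.parse_bt,
    )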

crawl only tweets metadata without the tweet text using an ID list

CONTEXT: I have a list of tweet ids and their textual content, and I need to crawl their metadata. However, my code crawls the tweet metadata and the text as well. Since I have about 100K tweet ids, I do not wish to waste time crawling the tweet text again.
Question: How can I adapt the following code so that I download only the tweet metadata? I'm using tweepy and Python 3.6.
def get_tweets_single(twapi, idfilepath):
    # tweet_id = '522778758168580098'
    tw_list = []
    with open(idfilepath, 'r') as f1:  # a file that contains tweet IDs
        lines = f1.readlines()
        for line in lines:
            tweet_id = line.rstrip('\n')
            try:
                print(tweet_id)
                tweet = twapi.get_status(tweet_id)  # tweepy call that fetches the tweet
                tw_list.append(tweet)
                # tweet = twapi.statuses_lookup(id_=tweet_id, include_entities=True, trim_user=True)
                with open(idjsonFile, 'a', encoding='utf-8') as f2:
                    json.dump(tweet._json, f2)
            except tweepy.TweepError as te:
                print('Failed to get tweet ID %s: %s' % (tweet_id, te))

def main(args):
    print('hello')
    # connect to twitter
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
    api = tweepy.API(auth)
    get_tweets_single(api, idfilepath)
You cannot download only the metadata of a tweet.
Looking at the documentation, you can choose to exclude information about the user with trim_user=True, but that's the only thing you can strip out.
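What you can do is cut the number of API calls. A hedged sketch (tweepy 3.x, the same version style as the question's code): statuses_lookup, which the question already mentions in a comment, accepts up to 100 IDs per call, so 100K tweets need roughly 1,000 requests instead of 100,000. The tweet text still comes back, but you can simply drop it before writing to disk; the output path below is an arbitrary choice.

import json
import tweepy

def get_tweets_bulk(twapi, idfilepath, outpath="tweets_metadata.jsonl"):
    with open(idfilepath) as f:
        ids = [line.strip() for line in f if line.strip()]
    with open(outpath, "a", encoding="utf-8") as out:
        for i in range(0, len(ids), 100):
            batch = ids[i:i + 100]  # statuses_lookup takes at most 100 IDs
            try:
                tweets = twapi.statuses_lookup(batch, trim_user=True)
            except tweepy.TweepError as te:
                print("Failed on batch starting at %d: %s" % (i, te))
                continue
            for tweet in tweets:
                # drop the text field (may be "full_text" if you use tweet_mode='extended')
                meta = {k: v for k, v in tweet._json.items() if k != "text"}
                out.write(json.dumps(meta) + "\n")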

How to download bulk amount of images from google or any website

Actually, I need to do a project on machine learning, and for that I want a lot of images for training. I searched for a solution to this problem, but failed.
Can anyone help me solve this? Thanks in advance.
I used Google Images to download images using Selenium. It is just a basic approach.
from selenium import webdriver
import time
import urllib.request
import os
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome("path\\to\\the\\webdriverFile")
browser.get("https://www.google.com")
search = browser.find_element_by_name('q')
search.send_keys(key_words, Keys.ENTER)  # use the required key_words to download images
elem = browser.find_element_by_link_text('Images')
elem.get_attribute('href')
elem.click()

value = 0
for i in range(20):
    browser.execute_script("scrollBy(" + str(value) + ",+1000);")
    value += 1000
    time.sleep(3)

elem1 = browser.find_element_by_id('islmp')
sub = elem1.find_elements_by_tag_name("img")

try:
    os.mkdir('downloads')
except FileExistsError:
    pass

count = 0
for i in sub:
    src = i.get_attribute('src')
    try:
        if src is not None:
            src = str(src)
            print(src)
            count += 1
            urllib.request.urlretrieve(src,
                os.path.join('downloads', 'image' + str(count) + '.jpg'))
        else:
            raise TypeError
    except TypeError:
        print('fail')
    if count == required_images_number:  # use the number of images required
        break
Check this for a detailed explanation.
Download the driver here.
My tip to you is: use an image search API. This is my favourite: the Bing Image Search API.
The following text is from "Send search queries using the REST API and Python".
Running the quickstart
To get started, set subscription_key to a valid subscription key for the Bing API service.
Python
subscription_key = None
assert subscription_key
Next, verify that the search_url endpoint is correct. At this writing, only one endpoint is used for Bing search APIs. If you encounter authorization errors, double-check this value against the Bing search endpoint in your Azure dashboard.
Python
search_url = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"
Set search_term to look for images of puppies.
Python
search_term = "puppies"
The following block uses the requests library in Python to call out to the Bing search APIs and return the results as a JSON object. Observe that we pass in the API key via the headers dictionary and the search term via the params dictionary. To see the full list of options that can be used to filter search results, refer to the REST API documentation.
Python
import requests
headers = {"Ocp-Apim-Subscription-Key" : subscription_key}
params = {"q": search_term, "license": "public", "imageType": "photo"}
response = requests.get(search_url, headers=headers, params=params)
response.raise_for_status()
search_results = response.json()
The search_results object contains the actual images along with rich metadata such as related items. For example, the following line of code can extract the thumbnail URLs for the first 16 results.
Python
thumbnail_urls = [img["thumbnailUrl"] for img in search_results["value"][:16]]
Then use the PIL library to download the thumbnail images and the matplotlib library to render them on a 4x4 grid.
Python
%matplotlib inline
import matplotlib.pyplot as plt
from PIL import Image
from io import BytesIO
f, axes = plt.subplots(4, 4)
for i in range(4):
    for j in range(4):
        image_data = requests.get(thumbnail_urls[i + 4 * j])
        image_data.raise_for_status()
        image = Image.open(BytesIO(image_data.content))
        axes[i][j].imshow(image)
        axes[i][j].axis("off")
plt.show()
Sample JSON response
Responses from the Bing Image Search API are returned as JSON. This sample response has been truncated to show a single result.
JSON
{
  "_type": "Images",
  "instrumentation": {
    "_type": "ResponseInstrumentation"
  },
  "readLink": "images\/search?q=tropical ocean",
  "webSearchUrl": "https:\/\/www.bing.com\/images\/search?q=tropical ocean&FORM=OIIARP",
  "totalEstimatedMatches": 842,
  "nextOffset": 47,
  "value": [
    {
      "webSearchUrl": "https:\/\/www.bing.com\/images\/search?view=detailv2&FORM=OIIRPO&q=tropical+ocean&id=8607ACDACB243BDEA7E1EF78127DA931E680E3A5&simid=608027248313960152",
      "name": "My Life in the Ocean | The greatest WordPress.com site in ...",
      "thumbnailUrl": "https:\/\/tse3.mm.bing.net\/th?id=OIP.fmwSKKmKpmZtJiBDps1kLAHaEo&pid=Api",
      "datePublished": "2017-11-03T08:51:00.0000000Z",
      "contentUrl": "https:\/\/mylifeintheocean.files.wordpress.com\/2012\/11\/tropical-ocean-wallpaper-1920x12003.jpg",
      "hostPageUrl": "https:\/\/mylifeintheocean.wordpress.com\/",
      "contentSize": "897388 B",
      "encodingFormat": "jpeg",
      "hostPageDisplayUrl": "https:\/\/mylifeintheocean.wordpress.com",
      "width": 1920,
      "height": 1200,
      "thumbnail": {
        "width": 474,
        "height": 296
      },
      "imageInsightsToken": "ccid_fmwSKKmK*mid_8607ACDACB243BDEA7E1EF78127DA931E680E3A5*simid_608027248313960152*thid_OIP.fmwSKKmKpmZtJiBDps1kLAHaEo",
      "insightsMetadata": {
        "recipeSourcesCount": 0,
        "bestRepresentativeQuery": {
          "text": "Tropical Beaches Desktop Wallpaper",
          "displayText": "Tropical Beaches Desktop Wallpaper",
          "webSearchUrl": "https:\/\/www.bing.com\/images\/search?q=Tropical+Beaches+Desktop+Wallpaper&id=8607ACDACB243BDEA7E1EF78127DA931E680E3A5&FORM=IDBQDM"
        },
        "pagesIncludingCount": 115,
        "availableSizesCount": 44
      },
      "imageId": "8607ACDACB243BDEA7E1EF78127DA931E680E3A5",
      "accentColor": "0050B2"
    }
  ]
}
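Since a training set usually needs the full-size images rather than thumbnails, a hedged sketch (not part of the quickstart) that downloads each result's contentUrl from the search_results object obtained above; the "images" folder and file names are arbitrary choices.

import os
import requests

os.makedirs("images", exist_ok=True)
for idx, img in enumerate(search_results["value"]):
    try:
        r = requests.get(img["contentUrl"], timeout=30)
        r.raise_for_status()
    except requests.RequestException as err:
        print("skipping %s: %s" % (img["contentUrl"], err))
        continue
    ext = img.get("encodingFormat", "jpg")  # e.g. "jpeg" in the sample response
    with open(os.path.join("images", "img_%04d.%s" % (idx, ext)), "wb") as fh:
        fh.write(r.content)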

How to subscribe with youtube api?

https://developers.google.com/youtube/v3/code_samples/apps-script#subscribe_to_channel
Hello,
I can't figure out how to subscribe to a YouTube channel with a POST request. I'm not looking to use YoutubeSubscriptions as shown above. I'm simply looking to pass an API key, but can't seem to figure it out. Any suggestions?
If you don't want to use YoutubeSubscriptions, you have to get the session_token after logging in to your YouTube account.
The session_token is stored in a hidden input tag:
document.querySelector('input[name=session_token]').value
or do a full-text search for the XSRF_TOKEN field; the corresponding value is the session_token. Reference regex:
const regex = /\'XSRF_TOKEN\':(.*?)\"(.*?)\"/g
Below is an implementation in Python:
def YouTubeSubscribe(url, SessionManager):
    while(1):
        try:
            html = SessionManager.get(url).text  # .text so the regex below works on str in Python 3
            session_token = (re.findall("XSRF_TOKEN\W*(.*)=", html, re.IGNORECASE)[0]).split('"')[0]
            id_yt = url.replace("https://www.youtube.com/channel/", "")
            params = (('name', 'subscribeEndpoint'),)
            data = [
                ('sej', '{"clickTrackingParams":"","commandMetadata":{"webCommandMetadata":{"url":"/service_ajax","sendPost":true}},"subscribeEndpoint":{"channelIds":["' + id_yt + '"],"params":"EgIIAg%3D%3D"}}'),
                ('session_token', session_token + "=="),
            ]
            response = SessionManager.post('https://www.youtube.com/service_ajax', params=params, data=data)
            check_state = json.loads(response.content)['code']
            if check_state == "SUCCESS":
                return 1
            else:
                return 0
        except Exception as e:
            print("[E] YouTubeSubscribe: " + str(e))
            pass
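A hedged usage sketch: SessionManager is assumed to be a requests.Session that already carries valid, logged-in YouTube cookies (however you obtain them); without authenticated cookies the XSRF_TOKEN lookup above will not find a token.

import requests

session = requests.Session()
session.cookies.update(my_youtube_cookies)  # hypothetical dict/jar of exported YouTube cookies
ok = YouTubeSubscribe("https://www.youtube.com/channel/CHANNEL_ID_HERE", session)
print("subscribed" if ok else "failed")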

Can't catch unsuccessful parsing events when using requests to parse a list of image urls python3

Hi, I am using the following function to parse a list of img reference urls. The reference urls are in the format "https://......strings/"; the parsed urls are in the format "http://..../img.jpg".
Input: a text file, each line is a reference url
Expected output: a list of parsed urls.
Parsing Rules:
For each input url, requests fetches it (possibly following redirects). html.url will return either 1) the original url or 2) a successfully resolved url. If the original url is returned, use BeautifulSoup to extract the img url inside the html. If somehow the extraction fails, fall back to the original url. Then try the whole process again until the maximum try_num is reached or it succeeds.
Actual behavior:
success: http://ww3.sinaimg.cn/large/005ZD4gwgw1fa6u080sdlj30ku112n4b.jpg
success: http://ww1.sinaimg.cn/large/005ZD4gwgw1fa6tzfpvkwj305t058t8o.jpg
success: https://weibo.cn/mblog/oripic?id=EjvpZdb4y&u=005ZD4gwgw1fa6tzvk95tg30n20cqqva&rl=1
As you can see, even with an unsuccessful parse (the 3rd), the function prints "success" without any of the error messages I put in. I have tried manually parsing those reference urls and testing html.url == imgref; the results are fine. So it somehow only happens when I feed it a list of urls. I am really not sure where it went wrong. It seems none of the try..except blocks caught any exception.
Apologies in advance if I have made some very stupid mistakes, and for the ambiguous title. I really don't know where the problem is, so I can't find a good title.
Thank you all.
import re
import time

import requests
from bs4 import BeautifulSoup

def decode_imgurl(imgref, cookie):
    connection_timeout = 90
    try_num = 1
    imgurl = imgref
    start_time = time.time()
    while imgurl == imgref:
        while True:
            try:
                html = requests.get(imgref, cookies=cookie, timeout=60)
                break
            except Exception:
                if time.time() > start_time + connection_timeout:
                    raise Exception('Unable to get connection %s after %s seconds of ConnectionErrors'
                                    % (imgref, connection_timeout))
                else:
                    time.sleep(1)
        if html.url == imgref:
            soup = BeautifulSoup(html.content, "lxml")
            try:
                imgurl = soup.findAll("a", href=re.compile(r'^http://', re.I))[0]['href']
            except IndexError:  # happens with continuous requests in a short time
                print("requests too fast")
                imgurl = imgref
                time.sleep(1)
        else:
            imgurl = html.url
        try_num += 1
        if try_num > 10:
            print("fail to parse. Please parse manually")
            return ""
    print("success: %s" % imgurl)
    return imgurl
from config import cookie  # import cookie file

with open(inputfile, 'r') as f:
    for line in f:
        if line.strip():
            parsed_url = decode_imgurl(line.strip(), cookie)
            print(parsed_url)
