Scrape Humble Bundle games using "Requests" module - python-3.x

I'm trying to get information about the games listed on this webpage: https://www.humblebundle.com/store/search?sort=discount&filter=onsale
The first thing I tried was to replicate what one user did to help me with a similar problem a few days ago: making a POST request directly to the endpoint that the page's data comes from. Here's the link to that question in case it's still unclear what I'm trying to achieve.
To do that, I first ran this code to get the page's HTML without the dynamically loaded elements:
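import requests
req = requests.get("https://www.humblebundle.com/store/search?sort=discount&filter=onsale")
# Save the raw HTML so it can be inspected offline; the file is closed automatically
with open("humble.txt", "w", encoding="utf-8") as a:
    a.write(req.text)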
It returned this code.
On line 1084 you can see a script called "storefront-constants-json-data". It caught my attention because it was the only one that had variables related to the page. Then I thought, "Hey, there must be more information about this script somewhere". So I clicked "Inspect element" on the page and went to the "Network" tab. I searched for that script name in every JS file and found just one reference, this one.
At this point I'm lost. In fact, I don't even know if I'm on the right track (because I don't know any JavaScript). Can someone show me how to get those Humble Bundle games?
PS: I wrote a similar question yesterday but it was very vague, so I decided to rewrite it with all the information I have and an explanation of what I've tried.
PS2: I'd prefer not to do it with Selenium or similar modules; they are too slow.

The data you see on the webpage is loaded through AJAX requests from a different URL. If you open the Network Inspector, you can see the URL of those requests, and the data is returned in JSON format:
import requests
from pprint import pprint
data = requests.get('https://www.humblebundle.com/store/api/search?sort=discount&filter=onsale&request=1').json()
pprint(data)
Prints:
{'num_pages': 245,
'num_results': 4894,
'page_index': 0,
'request': 1,
'results': [{'content_types': ['game'],
'cta_badge': None,
'current_price': [0.0, 'EUR'],
'delivery_methods': ['download'],
'empty_tpkds': {},
'featured_image_recommendation': 'https://hb.imgix.net/2e18a2a9316c0136abf25670bf67ed389c855e4f.jpeg?auto=compress,format&fit=crop&h=154&w=270&s=64e2f8ad8654541c0620d8e018fa2025',
'full_price': [0.01, 'EUR'],
'human_name': 'Crying Suns Demo',
'human_url': 'crying-suns-demo',
'icon': 'https://hb.imgix.net/2e18a2a9316c0136abf25670bf67ed389c855e4f.jpeg?auto=format&fit=crop&h=64&w=103&s=dcf803da86b9bcf4cd2c0d038ddf16fb',
'icon_dict': {'download': {'available': ['windows', 'mac'],
'unavailable': ['linux']}},
'large_capsule': 'https://hb.imgix.net/2e18a2a9316c0136abf25670bf67ed389c855e4f.jpeg?auto=compress,format&fit=crop&h=353&w=616&s=d50b680a5bfd2c6c6acdb4c745db8428',
'machine_name': 'cryingsuns_demo_storefront',
'non_rewards_charity_split': 0.0,
'platforms': ['windows', 'mac'],
'rating_for_current_region': 'pegi',
'rewards_split': 0.1,
'sale_end': 32503708740.0,
'sale_type': 'normal',
'standard_carousel_image': 'https://hb.imgix.net/2e18a2a9316c0136abf25670bf67ed389c855e4f.jpeg?auto=compress,format&fit=crop&h=206&w=360&s=015688fbe32c7e3e185bdcaddc72e02a',
'type': 'product',
'xray_traits_thumbnail': 'https://hb.imgix.net/2e18a2a9316c0136abf25670bf67ed389c855e4f.jpeg?auto=compress,format&fit=crop&h=84&w=135&s=eefaf495f9379b213672d82ddeae672a'},
...and so on.
The screenshot from network inspector:
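As a rough sketch of what to do with that JSON once you have it, the helper below pulls the name, prices and discount out of each entry in 'results'. The function name summarize_results is made up for this example, and the sample dict is a trimmed copy of the first result printed above, so the snippet runs without touching the network.

```python
def summarize_results(data):
    """Return (name, current price, full price, currency, discount %) per result."""
    rows = []
    for item in data.get("results", []):
        current, currency = item["current_price"]
        full, _ = item["full_price"]
        # Guard against a zero full price to avoid dividing by zero.
        discount = round(100 * (1 - current / full)) if full else 0
        rows.append((item["human_name"], current, full, currency, discount))
    return rows


# Trimmed copy of the first result from the output shown above.
sample = {
    "num_pages": 245,
    "results": [
        {
            "human_name": "Crying Suns Demo",
            "current_price": [0.0, "EUR"],
            "full_price": [0.01, "EUR"],
        },
    ],
}

rows = summarize_results(sample)
print(rows)
```

With a live response you would call summarize_results(data) on the dict returned by .json(). The response also reports 'num_pages' and 'page_index', so further pages are presumably selected with an extra query parameter; its name is not shown here, so check the requests in the Network tab before relying on one.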


Can't read the tag using Pycomm3 from a LogixController Simulator

I have simulated a ControlLogix controller using the CPPPO library.
Command -
enip_server -v SCADA=INT[1000] TEXT=SSTRING[100] FLOAT=REAL
Output -
I want to use the pycomm3 library to read and write the tags. As you can see above, three tags were created by cpppo when the simulation server started: SCADA, TEXT and FLOAT. I just want to read any one of them.
Here is the code I'm using -
from pycomm3 import LogixDriver
with LogixDriver('127.0.0.1') as plc:
    print(plc)
    # plc.write('TEXT', 'Hello World!')
    print(plc.read('TEXT'))
Output -
The logs in CPPPO Server are -
Instead of a "tag doesn't exist" error, I expect to receive the value of the TEXT tag.
So, there are a couple of things going on here. The main 'feature' of pycomm3 is that it handles everything automatically for you, but to do that it first needs to upload the tag list from the PLC. It looks like CPPPO doesn't implement those services; if you enable logging you will see that it errors out when trying to upload the tag list. (I think this error should have bubbled up and exited the with block before ever trying to read the tag. I will get that changed in the next release.) You can bypass this, though, by defining your own _initialize_driver method and setting the tag list manually:
from pycomm3 import LogixDriver, SHORT_STRING, REAL  # the CIP types are needed for 'type_class'
def _cpppo_initialize_driver(self, _, __):
    self._cfg['use_instance_ids'] = False  # force it to only use the tag name in requests
    self._info = self.get_plc_info()  # optional
    self._tags = {
        'TEXT': {
            'tag_name': 'TEXT',
            'tag_type': 'atomic',
            'data_type_name': 'SHORT_STRING',
            'data_type': 'SHORT_STRING',
            'dim': 1,
            'dimensions': [100, 0, 0],
            'type_class': SHORT_STRING,
        },
        'FLOAT': {
            'tag_name': 'FLOAT',
            'tag_type': 'atomic',
            'data_type_name': 'REAL',
            'data_type': 'REAL',
            'dim': 0,
            'dimensions': [0, 0, 0],
            'type_class': REAL,
        }
    }
# Replace the built-in initializer so the tag list upload is skipped
LogixDriver._initialize_driver = _cpppo_initialize_driver
The _tags attribute is a dict mapping each tag name to the definition for that tag; this section in the docs has a lot more detail about what each field is for. The examples I added are simple atomic tags; structs are a little more complicated.
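If you need more than a couple of tags, the entries get repetitive. Here is a small sketch of a helper that builds one entry with the same keys as the hand-written ones above. The name atomic_tag and its length parameter are made up for this example and are not part of pycomm3, and the dim/dimensions values simply copy the pattern from the two entries above rather than anything checked against the docs.

```python
def atomic_tag(name, data_type, type_class, length=0):
    """Build one `_tags` entry shaped like the hand-written ones above.

    A `length` greater than 0 marks the tag as one-dimensional with that size.
    """
    return {
        'tag_name': name,
        'tag_type': 'atomic',
        'data_type_name': data_type,
        'data_type': data_type,
        'dim': 1 if length else 0,
        'dimensions': [length, 0, 0],
        'type_class': type_class,
    }


# In real use, pass the pycomm3 type classes (REAL, SHORT_STRING) as type_class.
# None is used here only so the sketch runs without pycomm3 installed.
tags = {
    'TEXT': atomic_tag('TEXT', 'SHORT_STRING', None, length=100),
    'FLOAT': atomic_tag('FLOAT', 'REAL', None),
}
print(tags['TEXT']['dimensions'])
```

Inside the custom _initialize_driver you would then assign a dict built this way to self._tags.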
In addition to that, I did find a bug in the write method. Currently, it includes part of the request twice in the packet. Real PLCs seem to ignore this, but CPPPO doesn't handle it, which leads to an error. I have a fix already in my development branch and can confirm that both reads and writes will work. Unfortunately, I have a few other changes in progress that I need to finish before I release a new version. If you follow the repo on GitHub, it will notify you when it is released. If writes are critical and waiting for a fix is not possible, I can give you the fix since it's fairly small.

Integrating selenium with 2captcha to solve Recapv2

I'm trying to make a script that autofills a specific form. The autofill works fine, but when the script clicks the enter button a captcha pops up. I decided to take the 2captcha route rather than the audio captcha bypass, and I found a package that simplifies the 2captcha API (https://pypi.org/project/2captcha-python/#recaptcha-v2). I can send a request to 2captcha and they solve the captcha; I know this because my daily stats go up by 1 every time I run the script. But nothing happens in the Selenium browser. Any reason?
My current code is:
from twocaptcha import TwoCaptcha
solver = TwoCaptcha('MYAPIKEY')
config = {
    'apiKey': 'MYAPIKEY',
    'softId': 123,
    'callback': 'https://your.site/result-receiver',
    'defaultTimeout': 120,
    'recaptchaTimeout': 600,
    'pollingInterval': 10,
}
solver = TwoCaptcha(**config)  # replaces the solver created above
result = solver.recaptcha(sitekey='6Le-wvkSVVABCPBMRTvw0Q4Muexq1bi0DJwx_mJ-',
                          url="https://mailchi.mp/2d4364715d21/b66a2p7spo")
I run this inside a try/except that also contains the autofill code, right after the script clicks enter (when the captcha pops up). Does anyone have any ideas on how I can solve this? I've been trying for a couple of hours and I can't figure it out.
Note: I left softId and callback at their default values because I don't have a softId from 2captcha and I don't have a website either. If that's the issue, please advise on how I can solve it.
Thanks in advance!

discord webhook can not send empty message

I have written this small PoC for Discord webhooks and I am getting an error saying that it cannot send an empty string. I tried to Google it but couldn't find any documentation or an answer.
Here is my code:
import requests
discord_webhook_url = 'https://discordapp.com/api/webhooks/xxxxxxxxxxxxxxxxxx/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
data = {'status': 'success'}
headers = {'Content-Type': 'application/json'}
res = requests.post(discord_webhook_url, data=data, headers=headers)
print(res.content)
I'm late, but I came across this issue recently, and seeing as it has not been answered yet, I thought I'd document my solution to the problem.
For the most part, the error is due to the structure of the payload being wrong.
https://birdie0.github.io/discord-webhooks-guide/discord_webhook.html provides an example of a working structure. https://discordapp.com/developers/docs/resources/channel#create-message is the official documentation.
I was also able to get a minimal test case working using: {"content": "Test"}.
If it still fails after that with the same error, the likely causes are:
If using curl, check that there are no accidental escapes / backslashes (\).
If using embeds with fields, ensure there are no empty values.
When in doubt, ensure all values are populated and not "". Through trial and error / the process of elimination, you can figure out exactly which key-value pair is causing the issue, so I suggest playing with the webhook via curl before turning it into a full program.
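To speed up the elimination step, here is a small sketch that walks a payload and lists every key whose value is empty. The helper name find_empty_values and the sample embed are made up for this example; only the {"content": "Test"} shape comes from the test case above.

```python
def find_empty_values(payload, path=""):
    """Return the dotted paths of values that are "", None, [] or {}."""
    empty = []
    if isinstance(payload, dict):
        items = payload.items()
    elif isinstance(payload, list):
        items = enumerate(payload)
    else:
        return empty  # plain values have nothing nested to inspect
    for key, value in items:
        here = f"{path}.{key}" if path else str(key)
        if value in ("", None) or value == [] or value == {}:
            empty.append(here)
        else:
            empty.extend(find_empty_values(value, here))
    return empty


payload = {
    "content": "Test",
    "embeds": [
        {"title": "Status", "fields": [{"name": "result", "value": ""}]},
    ],
}
problems = find_empty_values(payload)
print(problems)
```

Once the list comes back empty, send the payload with requests.post(discord_webhook_url, json=payload). The json= argument serializes the dict and sets the Content-Type header for you, whereas data= with a dict sends a form-encoded body even when the header says application/json.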

Python3 - Error posting data to a stikked instance

I'm writing a Python 3 (3.5) script that will act as a simple command-line interface to a user's stikked install. The API is pretty straightforward, and its documentation is available.
My post function is:
import urllib.parse
import urllib.request
def submit_paste(paste):
    global settings
    data = {
        'title': settings['title'],
        'name': settings['author'],
        'text': paste,
        'lang': settings['language'],
        'private': settings['private']
    }
    # urlencode() returns a str; encode() turns it into the bytes urlopen expects
    data = urllib.parse.urlencode(data).encode()
    print(data)
    handler = urllib.request.urlopen(settings['url'], data)
    print(handler.read().decode('utf-8'))
When I run the script, I get the printed output of data, and the message returned from the API. The data encoding looks correct to me, and outputs:
b'private=0&text=hello%2C+world%0A&lang=text&title=Untitled&name=jacob'
As you can see, that contains the text= attribute, which is the only one actually required for the API call to successfully work. I've been able to successfully post to the API using curl as shown in that link.
The actual error produced by the API is:
Error: Missing paste text
Is the text attribute somehow being encoded incorrectly?
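It turns out the problem wasn't with the post function, but with the URL. My virtual host automatically redirects http traffic to https, and the POST variables get lost along the way. This is most likely because urllib follows a 301/302 redirect by reissuing the request as a GET without the body, rather than Apache itself dropping them. Using the https URL directly avoids the problem.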
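One way to guard against this is to normalise the configured URL before posting. The sketch below assumes the server is reachable over https on the same host and path; the helper name force_https and the example hostname are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def force_https(url):
    """Return `url` with its scheme switched from http to https."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


# Hypothetical stikked endpoint, used only for illustration.
print(force_https("http://paste.example.com/api/create"))
```

To check whether a redirect happened at all, compare handler.geturl() with the URL you requested after calling urlopen; if they differ, the request was redirected.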

Avoiding repetition with Flask - but is it too DRY?

Let us assume I serve data to colleagues in-office with a small Flask app, and let us also assume that it is a project I am not explicitly 'paid to do' so I don't have all the time in the world to write code.
It has occurred to me in my experimentation with pet projects at home that instead of decorating every last route with @app.route('/some/local/page'), I can do the following:
from flask import Flask, render_template, url_for, redirect, abort
from collections import OrderedDict
goodURLS = OrderedDict([('/index', 'Home'),   # can be passed to the template
                        ('/about', 'About'),  # to create the navigation bar
                        ('/foo', 'Foo'),
                        ('/bar', 'Bar'),      # hence the use of OrderedDict
                        ('/eggs', 'Eggs'),    # to have a set order for that navbar
                        ('/spam', 'Spam')])
app = Flask(__name__)
@app.route('/<destination>')
def goThere(destination):
    availableRoutes = goodURLS.keys()
    if "/" + destination in availableRoutes:
        return render_template('/%s.html' % destination, goodURLS=goodURLS)
    else:
        abort(404)
@app.errorhandler(404)
def notFound(e):
    return render_template('/notFound.html'), 404
Now all I need to do is update my one list, and both my navigation bar and route handling function stay in lock-step.
Alternatively, I've written a function that determines the viable file locations by using os.walk in conjunction with file.endswith('.aGivenFileExtension') to locate every file I mean to make accessible. The user's request can then be compared against the list this function returns (which obviously changes the serveTheUser() function).
from os import path, walk
def fileFinder(directory, extension=".html"):
    """Returns a list of files with a given file extension at a given path.
    By default .html files are returned.
    """
    foundFilesList = []
    if path.exists(directory):
        for p, d, files in walk(directory):
            for file in files:
                if file.endswith(extension):
                    foundFilesList.append(file)
    return foundFilesList
goodRoutes = fileFinder('./templates/someFolderWithGoodRoutes/')
The question is, Is This Bad?
There are many aspects of Flask I'm just not using (mainly because I haven't needed to know about them yet) - so maybe this is actually limiting, or redundant when compared against a built-in feature of Flask. Does my lack of explicitly decorating each route rob me of a great feature of Flask?
Additionally, is either of these methods more or less safe than the other? I really don't know much about web security - and like I said, right now this is all in-office stuff, the security of my data is assured by our IT professional and there are no incoming requests from outside the office - but in a real-world setting, would either of these be detrimental? In particular, if I am using the backend to os.walk a location on the server's local disk, I'm not asking to have it abused by some ne'er-do-well am I?
EDIT: I've offered this as a bounty, because if it is not a safe or constructive practice I'd like to avoid using it for things that I'd want to push to Heroku or otherwise serve publicly, for family, etc. It just seems like decorating every viable route with app.route is a waste of time.
There isn't anything really wrong with your solution, in my opinion. The problem is that with this kind of setup the things you can do are pretty limited.
I'm not sure if you simplified your code to show here, but if all you are doing in your view function is gathering some data and then selecting one of a few templates to render it, then you might as well render the whole thing in a single page and maybe use a JavaScript tab control to divide it up into sections on the client.
If each template requires different data, then the logic that obtains and processes the data for each template will have to be in your view function, and that is going to look pretty messy because you'll have a long chain of if statements to handle each template. Between that and separate view functions per template I think the latter will be quicker, even more so if you also consider the maintenance effort.
Update: based on the conversation in the comments I stand by my answer, with some minor reservations.
I think your solution works and has no major problems. I don't see a security risk because you are validating the input that comes from the client before you use it.
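To make the validation point concrete, here is a minimal sketch of the same whitelist check pulled out of the view function, so it can be tried without running Flask. The helper name template_for is made up for this example, and goodURLS is shortened to two entries.

```python
from collections import OrderedDict

# Shortened version of the goodURLS mapping from the question.
goodURLS = OrderedDict([('/index', 'Home'), ('/about', 'About')])


def template_for(destination, good_urls):
    """Return the template name for an allowed route, or None otherwise."""
    if "/" + destination in good_urls:
        return "%s.html" % destination
    return None


print(template_for("about", goodURLS))             # an allowed route
print(template_for("../../etc/passwd", goodURLS))  # rejected, not in the list
```

Because the template name is only ever built for a key that is already in goodURLS, user input never reaches the filesystem unchecked. Flask's default <destination> converter also does not match slashes, so a path like the second one would not even reach the view.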
You are just using Flask to serve files that can be considered static if you ignore the navigation bar at the top. You should consider compiling the Flask app into a set of static files using an extension like Frozen-Flask, then you just host the compiled files with a regular web server. And when you need to add/remove routes you can modify the Flask app and compile it again.
Another thought is that your Flask app structure will not scale well if you need to add server-side logic. Right now you don't have any logic in the server, everything is handled by jQuery in the browser, so having a single view function works just fine. If at some point you need to add server logic for these pages then you will find that this structure isn't convenient.
I hope this helps.
I assume based on your code that all the routes have a corresponding template file of the same name (destination to destination.html) and that the goodURL menu bar is changed manually. An easier method would be to try to render the template at request and return your 404 page if it doesn't exist.
from jinja2 import TemplateNotFound
from werkzeug.utils import secure_filename  # no longer importable from the top-level werkzeug package
# ....
@app.route('/<destination>')
def goThere(destination):
    destTemplate = secure_filename("%s.html" % destination)
    try:
        return render_template(destTemplate, goodURLS=goodURLS)
    except TemplateNotFound:
        abort(404)
@app.errorhandler(404)
def notFound(e):
    return render_template('/notFound.html'), 404
This is adapted from the answer to the Stack Overflow question "How do I create a 404 page?".
Edit: Updated to make use of Werkzeug's secure_filename to clean user input.
