Python - nested lists q

My code is pasted below. I'm trying to isolate the "name" value of one entry in the nested list the API replies with, but it is not detected as a nested list for some reason. The response (what the decode variable prints) is:
[{"id":"a981fdae11a949b2bb3c78cc2c7f820d","name":"Juice"},{"id":"ec07182556b444d38e9af48bf74bde82","name":"Dog"},{"id":"2aca11f3fa364adf83183e10e8794c46","name":"csheim","legacy":true}]
From the above output I need to isolate and print the name "csheim", because its sub-list contains "legacy": true. But as I said above, I get an error as soon as I try to access the nested lists inside my list, so I don't know how to go about this. Error:
print(decode[1][1])
IndexError: string index out of range
import requests
import time
import socket
import json
import string
host_ip = socket.gethostname()
print("")
print(socket.gethostbyname(host_ip))
print("")
time.sleep(2)
payload = ["juice", "csheim", "dog"]
r = requests.post('https://api.example.com', json=payload)
decode = r.text
print(decode)
print(decode[1][1])

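r.text is a plain string, so decode[1] is a single character and decode[1][1] is out of range. You should parse the response with the json library, then treat the result as a list of dictionaries.
Try this: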
import requests
import time
import socket
import json
import string
host_ip = socket.gethostname()
print("")
print(socket.gethostbyname(host_ip))
print("")
time.sleep(2)
payload = ["juice", "csheim", "dog"]
r = requests.post('https://api.example.com', json=payload)
decode = json.loads(r.text)
for item in decode:
    if "legacy" in item:
        print(item["name"])
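To see why decode[1][1] failed and what parsing changes, here is a small offline sketch. It assumes the response body is exactly the JSON shown in the question; the names raw, decoded, and legacy_names are made up for illustration, and no request is sent.

```python
import json

# Response body copied from the question; in the real script this would be r.text.
raw = (
    '[{"id":"a981fdae11a949b2bb3c78cc2c7f820d","name":"Juice"},'
    '{"id":"ec07182556b444d38e9af48bf74bde82","name":"Dog"},'
    '{"id":"2aca11f3fa364adf83183e10e8794c46","name":"csheim","legacy":true}]'
)

# Indexing the unparsed string yields single characters, not list items.
second_char = raw[1]  # a one-character string, so raw[1][1] raises IndexError

# After parsing, the data is a list of dictionaries.
decoded = json.loads(raw)

# Keep the names of entries whose "legacy" value is truthy.
legacy_names = [item["name"] for item in decoded if item.get("legacy")]
print(legacy_names)
```

With requests, r.json() should give the same parsed structure as json.loads(r.text).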

Related

How to use BeautifulSoup to retrieve the URL of a tarfile

The following webpage contains all the source code URLs for the LFS project:
https://linuxfromscratch.org/lfs/view/systemd/chapter03/packages.html
I've written some Python 3 code to retrieve all these URLs from that page:
#!/usr/bin/env python3
from requests import get
from bs4 import BeautifulSoup
import re
import sys, os
#url=sys.argv[1]
url="https://linuxfromscratch.org/lfs/view/systemd/chapter03/packages.html"
exts = (".xz", ".bz2", ".gz", ".lzma", ".tgz", ".zip")
response = get(url)
soup = BeautifulSoup(response.content, 'html.parser')
for link in soup.find_all('a', href=True):
    if link.get('href'):
        for anhref in link.get('href').split():
            if os.path.splitext(anhref)[-1] in exts:
                print((link.get('href')))
What I would like to do is input a pattern, say:
pattern = 'iproute2'
and then print the line that contains the iproute2 tarfile
which happens to be:
https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-5.12.0.tar.xz
I've tried using match = re.search(pattern, text) and it finds the correct line but if I print match I get:
<re.Match object; span=(43, 51), match='iproute2'>
How do I get it to print the actual URL?

You can use the .string attribute of the match object (it returns the string that was passed into the function).
Code Example
import re
txt="https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-5.12.0.tar.xz"
pattern = 'iproute2'
match = re.search(pattern, txt)
if match: # re.search returns None when nothing matches, so check first
    print(match.string)
else:
    print('No Match Found')
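Applied to a whole list of links, the same idea might look like the sketch below. The list urls stands in for the hrefs collected by the scraper; the second entry is a made-up placeholder, and re.escape is used on the assumption that the pattern is meant as literal text.

```python
import re

# Stand-in for the hrefs gathered from the page; the second URL is hypothetical.
urls = [
    "https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-5.12.0.tar.xz",
    "https://example.com/downloads/other-package-1.0.tar.gz",
]

pattern = "iproute2"

# re.search returns a Match object or None, so it can be used as a filter.
# re.escape treats the pattern as literal text rather than as a regex.
matches = [u for u in urls if re.search(re.escape(pattern), u)]

for u in matches:
    print(u)
```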

Trying to scrape multiple pages sequentially using loop to change url

First off, sorry. I am sure this is a common problem, but I did not find the solution anywhere even though I searched for a while.
I am trying to create a list by scraping data from classicdb. I have two problems:
The scraping as written in the try block does not work inside the for loop, but on its own it works. Currently it just returns 0 even though there should be values to return.
The output that I get from the try block generates new lists, but I want to just get the value and append it later.
I have tried the try block outside the for loop, and there it worked.
I also saw some solutions where a while True loop was used, but that did not work for me.
from lxml.html import fromstring
import requests
import traceback
import time
from bs4 import BeautifulSoup as bs
Item_name=[]
Sell_Copper=[]
items= [47, 48]
url = 'https://classic.wowhead.com/item='
fails=[]
for i in items:
    time.sleep(5)
    url1=(url+str(i))
    session = requests.session()
    response = session.get(url1)
    soup = bs(response.content, 'lxml')
    name=soup.select_one('h1').text
    print(name)
    #get the buy prices
    try:
        copper = soup.select_one('li:contains("Sells for") .moneycopper').text
    except Exception as e:
        copper=str(0)
The expected result would be that I get one value in copper and a list in Sell_Copper. In this case:
copper='1'
Sell_copper=['1','1']

You don't need the sleep. The selector needs to be div:contains, and the search text needs to change to "Sell Price".
import requests
from bs4 import BeautifulSoup as bs
Item_name=[]
Sell_Copper=[]
items= [47, 48]
url = 'https://classic.wowhead.com/item='
fails=[]
with requests.Session() as s:
    for i in items:
        response = s.get(url + str(i))
        soup = bs(response.content, 'lxml')
        name = soup.select_one('h1').text
        print(name)
        try:
            copper = soup.select_one('div:contains("Sell Price") .moneycopper').text
        except Exception as e:
            copper=str(0)
        print(copper)
        Sell_Copper.append(copper)  # collect each value in the list from the question
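The "get one value per page, then append it" part can be tried offline. The sketch below swaps the CSS selector for a plain regex over inline HTML strings so that it runs without the network or bs4; the pages dictionary, its markup, and the helper extract_copper are all hypothetical and are not the real site's HTML.

```python
import re

# Hypothetical stand-ins for downloaded pages, keyed by item id (not real site markup).
pages = {
    47: '<h1>Item A</h1><div>Sell Price: <span class="moneycopper">1</span></div>',
    48: '<h1>Item B</h1><div>Sell Price: <span class="moneycopper">1</span></div>',
    49: '<h1>Item C</h1><div>No price listed</div>',
}

COPPER_RE = re.compile(r'class="moneycopper">(\d+)<')

def extract_copper(html, default="0"):
    """Return the copper amount as a string, or default when the page has none."""
    match = COPPER_RE.search(html)
    return match.group(1) if match else default

sell_copper = []
for item_id in sorted(pages):
    copper = extract_copper(pages[item_id])  # one value per page
    sell_copper.append(copper)               # accumulate in a single list

print(sell_copper)
```

Keeping the fallback inside a helper means the loop always appends exactly one string per item, so the list stays aligned with the item ids.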

Python3 Pickle hitting recursion limit

I have the following block of code that when executed on my Ubuntu computer using Python3 hits the recursion error for pickling. I don't understand why since the object to be pickled is not particularly complex and doesn't involve any custom objects. In fact, it is only a list of some 500 elements (approximately); each element of the list is just a string. It seems to me that I should be able to serialize this object without issue. Why am I hitting a recursion limit error? I know I could up the recursion limit with import sys and sys.setrecursionlimit() but I am frankly surprised I have to do that for such a trivial object.
from urllib import request
from bs4 import BeautifulSoup
import pickle
def get_standard_and_poors_500_constituents():
    # URL request, URL opener, read content.
    req = request.Request(
        "http://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
    )
    opener = request.urlopen(req)
    # Convert bytes to UTF-8.
    content = opener.read().decode()
    soup = BeautifulSoup(content, "lxml")
    # HTML table we actually need is the first.
    tables = soup.find_all("table")
    external_class = tables[0].findAll("a", {"class":"external text"})
    c = [ext.string for ext in external_class if not "reports" in ext]
    return c
sp500_constituents = get_standard_and_poors_500_constituents()
spdr_etf = "SPY"
sp500_index = "^GSPC"
def main():
    import datetime as dt
    today = dt.datetime.today().date()
    fname = "sp500_constituents_" + str(today) + ".pkl"
    with open(fname, "wb") as f:
        pickle.dump(sp500_constituents, f)
if __name__ == "__main__":
    main()

How convert an mbox to a JSON structure?

I am trying to convert an mbox to a JSON structure suitable for import into MongoDB.
I am using the mailbox chapter of Mining the Social Web, second edition, but it's not working properly.
import sys
import mailbox
import email
import quopri
import json
import time
from BeautifulSoup import BeautifulSoup
from dateutil.parser import parse
MBOX = 'resources/ch06-mailboxes/data/enron.mbox'
OUT_FILE = MBOX + '.json'
def cleanContent(msg):
    # Decode message from "quoted printable" format, but first
    # re-encode, since decodestring will try to do a decode of its own
    msg = quopri.decodestring(msg.encode('utf-8'))
    # Strip out HTML tags, if any are present.
    # Bail on unknown encodings if errors happen in BeautifulSoup.
    try:
        soup = BeautifulSoup(msg)
    except:
        return ''
    return ''.join(soup.findAll(text=True))
# There's a lot of data to process, and the Pythonic way to do it is with a
# generator. See http://wiki.python.org/moin/Generators.
# Using a generator requires a trivial encoder to be passed to json for object
# serialization.
class Encoder(json.JSONEncoder):
    def default(self, o): return list(o)
# The generator itself...
def gen_json_msgs(mb):
    while 1:
        msg = mb.next()
        if msg is None:
            break
        yield jsonifyMessage(msg)
def jsonifyMessage(msg):
    json_msg = {'parts': []}
    for (k, v) in msg.items():
        json_msg[k] = v.decode('utf-8', 'ignore')
    # The To, Cc, and Bcc fields, if present, could have multiple items.
    # Note that not all of these fields are necessarily defined.
    for k in ['To', 'Cc', 'Bcc']:
        if not json_msg.get(k):
            continue
        json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '').replace('\r', '')\
                                 .replace(' ', '').decode('utf-8', 'ignore').split(',')
    for part in msg.walk():
        json_part = {}
        if part.get_content_maintype() != 'text':
            print >> sys.stderr, "Skipping MIME content in JSONification ({0})".format(part.get_content_maintype())
            continue
        json_part['contentType'] = part.get_content_type()
        content = part.get_payload(decode=False).decode('utf-8', 'ignore')
        json_part['content'] = cleanContent(content)
        json_msg['parts'].append(json_part)
    # Finally, convert date from asctime to milliseconds since epoch using the
    # $date descriptor so it imports "natively" as an ISODate object in MongoDB
    then = parse(json_msg['Date'])
    millis = int(time.mktime(then.timetuple())*1000 + then.microsecond/1000)
    json_msg['Date'] = {'$date' : millis}
    return json_msg
mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file)
# Write each message out as a JSON object on a separate line
# for easy import into MongoDB via mongoimport
f = open(OUT_FILE, 'w')
for msg in gen_json_msgs(mbox):
    if msg != None:
        f.write(json.dumps(msg, cls=Encoder) + '\n')
f.close()
print "All done"
I am getting this error:
80 # for easy import into MongoDB via mongoimport
81
---> 82 f = open(OUT_FILE, 'w')
83 for msg in gen_json_msgs(mbox):
84 if msg != None:
IOError: [Errno 13] Permission denied: 'resources/ch06-mailboxes/data/enron.mbox.json'

The code you mentioned became obsolete in the third edition of Mining the Social Web.
I wrote a working script that not only converts MBOX to JSON but also extracts the attachments into usable formats.
Link to the repo:
https://github.com/PS1607/mbox-to-json
Read the README file for usage instructions.
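For readers who only need the core conversion, here is a rough Python 3 sketch using the standard library's mailbox.mbox class (the script in the question relies on Python 2 APIs such as mailbox.UnixMailbox and the print statement). It is not the code from the book or from the repo above: the sample message, the file name sample.mbox, and the helper jsonify_message are invented for illustration, and the HTML stripping and Date conversion steps are left out.

```python
import json
import mailbox
import os
import tempfile

def jsonify_message(msg):
    """Turn one message into a JSON-serializable dict of headers and text parts."""
    json_msg = {"parts": []}
    for key, value in msg.items():
        json_msg[key] = str(value)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        payload = part.get_payload(decode=True) or b""
        json_msg["parts"].append({
            "contentType": part.get_content_type(),
            "content": payload.decode("utf-8", "ignore"),
        })
    return json_msg

# A tiny made-up mbox file so the sketch runs without the Enron data.
sample = (
    "From alice@example.com Mon Jan  1 00:00:00 2001\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: hello\n"
    "\n"
    "Just a test body.\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    mbox_path = os.path.join(tmp_dir, "sample.mbox")
    with open(mbox_path, "w") as f:
        f.write(sample)
    mbox = mailbox.mbox(mbox_path)
    # One JSON object per line, the layout the question aims for with mongoimport.
    lines = [json.dumps(jsonify_message(m)) for m in mbox]
    mbox.close()

print("\n".join(lines))
```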

It seems that your problem is related to user permissions instead of Python. Line 82 tries to open a file in the "data" folder, but permission was denied. You should try executing your script using the sudo command from a terminal:
sudo python3 <your script name>
This should take care of the error you pointed out.
PS: Python 3 uses print as a function; line 88 should read
print('All done')
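If running with sudo is not an option, another possibility is to check whether the target folder is writable and fall back to a location that is. The sketch below assumes falling back to the system temp directory is acceptable; the helper writable_output_path and the sample record are hypothetical.

```python
import json
import os
import tempfile

def writable_output_path(preferred_path):
    """Return preferred_path if its folder exists and is writable, else a temp-dir path."""
    directory = os.path.dirname(preferred_path) or "."
    if os.path.isdir(directory) and os.access(directory, os.W_OK):
        return preferred_path
    return os.path.join(tempfile.gettempdir(), os.path.basename(preferred_path))

# Path taken from the error message in the question.
out_file = writable_output_path("resources/ch06-mailboxes/data/enron.mbox.json")

# Write one JSON object per line, as the original script does.
with open(out_file, "w") as f:
    f.write(json.dumps({"Subject": "example"}) + "\n")

print("Wrote", out_file)
```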

How to display proper output when using re.findall() in python?

I made a python script to get the latest stock price from the Yahoo Finance.
import urllib.request
import re
htmlfile = urllib.request.urlopen("http://finance.yahoo.com/q?s=GOOG");
htmltext = htmlfile.read();
price = re.findall(b'<span id="yfs_l84_goog">(.+?)</span>',htmltext);
print(price);
It works smoothly, but when I output the price it comes out like this: [b'1,217.04']
This might be a petty issue to ask about, but I'm new to Python scripting, so please help me if you can.
I want to get rid of the 'b'. If I remove the 'b' from b'<span id="yfs_l84_goog">' then it shows this error:
File "C:\Python33\lib\re.py", line 201, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
I want the output to be just
1,217.04

b'' is a syntax for bytes literals in Python. It is how you could define byte sequences in Python source code.
What you see in the output is the representation of the single bytes object inside the price list returned by re.findall(). You could decode it into a string and print it:
>>> for item in price:
...     print(item.decode()) # assume utf-8
...
1,217.04
You could also write bytes directly to stdout e.g., sys.stdout.buffer.write(price[0]).
You could use an html parser instead of a regex to parse html:
#!/usr/bin/env python3
from html.parser import HTMLParser
from urllib.request import urlopen
url = 'http://finance.yahoo.com/q?s=GOOG'
def is_price_tag(tag, attrs):
    return tag == 'span' and dict(attrs).get('id') == 'yfs_l84_goog'
class Parser(HTMLParser):
    """Extract tag's text content from html."""
    def __init__(self, html, starttag_callback):
        HTMLParser.__init__(self)
        self.contents = []
        self.intag = None
        self.starttag_callback = starttag_callback
        self.feed(html)
    def handle_starttag(self, tag, attrs):
        self.intag = self.starttag_callback(tag, attrs)
    def handle_endtag(self, tag):
        self.intag = False
    def handle_data(self, data):
        if self.intag:
            self.contents.append(data)
# download and convert to Unicode
response = urlopen(url)
# the cgi module was removed in Python 3.13, so read the charset from the
# headers directly; falling back to utf-8 is an assumption
charset = response.headers.get_content_charset() or 'utf-8'
html = response.read().decode(charset)
# parse html (extract text from the price tag)
content = Parser(html, is_price_tag).contents[0]
print(content)
Check whether yahoo provides API that doesn't require web scraping.

Okay, after searching for a while I found the solution. It works fine for me.
import urllib.request
import re
htmlfile = urllib.request.urlopen("http://finance.yahoo.com/q?s=GOOG");
htmltext = htmlfile.read();
pattern = re.compile('<span id="yfs_l84_goog">(.+?)</span>');
price = pattern.findall(str(htmltext));
print(price);
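One caveat with str(htmltext): on a bytes object, str() returns its printable representation, including the b'...' wrapper and escape sequences, rather than the decoded text. The regex still matches here because the price sits in the middle of that representation. The sketch below compares both approaches on an inline stand-in for the downloaded page; the sample bytes are made up, and UTF-8 is assumed.

```python
import re

# Stand-in for htmlfile.read(); the real page content will differ.
htmltext = b'<span id="yfs_l84_goog">1,217.04</span>'

as_repr = str(htmltext)             # the repr of the bytes, starting with b'
as_text = htmltext.decode("utf-8")  # the actual text

pattern = re.compile('<span id="yfs_l84_goog">(.+?)</span>')
price_from_repr = pattern.findall(as_repr)
price_from_text = pattern.findall(as_text)

print(as_repr[:2], as_text[:5])
print(price_from_repr, price_from_text)
```

Decoding is usually the safer choice, since any non-ASCII characters in the page would show up as escape sequences in the str() version.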
