I need to pull a specific string from a URL path - string

I am pulling the following URL from a JSON on the internet.
Example of the string I am working with:
http://icons.wxug.com/i/c/k/nt_cloudy.gif
I need to get just the "nt_cloudy" from the above in order to write the img (already stored) to an epaper display for a weather app. I have tried re.split() but only ever get the full string back, no matter what I split on.
Everything else works: if I manually enter the filename, I can display the image. However, the weather conditions change, so I need to pull the name from the JSON. Again, the only part I am stuck on is locating the specific string within the full string.
imgurl = weatherinfo['current_observation']['icon_url'] # http://icons.wxug.com/i/c/k/nt_cloudy.gif
img_condition = re.split('\/ |\// |.', imgurl)
image_1 = "/home/pi/epaper/python2/icons/" + img_condition + ".bmp"

Please check this:
import re
imgurl = weatherinfo['current_observation']['icon_url'] # http://icons.wxug.com/i/c/k/nt_cloudy.gif
img_condition = re.split('/', imgurl)[-1] # last path component: nt_cloudy.gif
image_1 = "/home/pi/epaper/python2/icons/" + img_condition[:-4] + ".bmp" # [:-4] drops the ".gif" extension

If you are confident that the path will always end with the image filename, and won't have a query string after it (e.g., nt_cloudy.gif?foo=bar&x=y&...) you can just use the filesystem path functions from Python's os.path standard module.
https://docs.python.org/3/library/os.path.html
#!/usr/bin/env python
import os
URL = 'http://icons.wxug.com/i/c/k/nt_cloudy.gif'
FILENAME = os.path.basename(URL)
If you are trying to decode a URL that might include a query string, you may prefer to use the urllib.parse module.
https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse
I won't go into detail about why your regular expression isn't working the way you expect, because honestly, hand-crafting regular expressions is overkill for this use-case.
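To make the two suggestions above concrete, here is a minimal sketch that combines urllib.parse with os.path so that a query string, if present, is ignored and the extension is dropped. The helper name icon_name_from_url is hypothetical and not part of the original answer.

```python
import os
from urllib.parse import urlparse


def icon_name_from_url(url):
    """Return the file name at the end of the URL path, without its extension."""
    path = urlparse(url).path                # '/i/c/k/nt_cloudy.gif'; any query string is excluded
    filename = os.path.basename(path)        # 'nt_cloudy.gif'
    name, _ext = os.path.splitext(filename)  # ('nt_cloudy', '.gif')
    return name


print(icon_name_from_url("http://icons.wxug.com/i/c/k/nt_cloudy.gif"))
print(icon_name_from_url("http://icons.wxug.com/i/c/k/nt_cloudy.gif?foo=bar&x=y"))
```

Both calls print nt_cloudy, which can then be concatenated with the icon directory and ".bmp" as in the question.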

You could use the regular expression below (shown here in JavaScript):
let regex = /(\w+)\.gif/g.exec("http://icons.wxug.com/i/c/k/nt_cloudy.gif");
if (regex != null && regex.length == 2)
    console.log(regex[1]);
Find the reference here.
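Since the question is about Python, the same pattern can be tried with the re module. This is only a sketch of an equivalent, assuming the icon always has a .gif extension as in the example URL.

```python
import re

url = "http://icons.wxug.com/i/c/k/nt_cloudy.gif"

# (\w+) captures the run of word characters (letters, digits, underscore)
# that sits directly before the literal ".gif"
match = re.search(r"(\w+)\.gif", url)
if match is not None:
    img_condition = match.group(1)
    print(img_condition)
```

For the example URL this prints nt_cloudy.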

Related

How to switch the base of a path using pathlib?

I am trying to get a part of a path by removing the base, currently this is what I'm doing:
original = '/tmp/asd/asdqwe/file'
base = '/tmp/asd/'
wanted_part = original.strip(base)
Unfortunately, instead of getting 'asdqwe/file' I'm getting 'qwe/file'. For some reason strip behaves strangely and I don't understand why.
The best solution for my problem would use pathlib.Path, because my function receives its arguments as paths and should return a Path built by adding a new base path to the trimmed part.
But if no pathlib solution is available a string one would also be great, currently I'm dealing with a weird bug...
You are misinterpreting how str.strip works. The method will remove all characters specified in the argument from the "edges" of the target string, regardless of the order in which they are specified:
original = '/tmp/asd/asdqwe/file'
base = '/tmp/asd/'
wanted_part = original.strip(base)
print(wanted_part)
# qwe/file
What you would like to do is probably a slicing:
wanted_part = original[len(base):]
print(wanted_part)
# asdqwe/file
Or, using pathlib:
from pathlib import Path
original = Path('/tmp/asd/asdqwe/file')
base = Path('/tmp/asd/')
wanted_part = original.relative_to(base)
print(wanted_part)
# asdqwe/file
strip removes a set of characters, not a string prefix or suffix, so it keeps removing characters from both ends for as long as they appear in the argument you passed. Instead, you can test whether the original starts with your base and, if it does, take the remaining characters of the string, i.e. those after the length of the base.
original = '/tmp/asd/asdqwe/file'
base = '/tmp/asd/'
if original.startswith(base):
    wanted_part = original[len(base):]
    print(wanted_part)
OUTPUT
asdqwe/file
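The question title asks about switching the base, so here is a small sketch that combines relative_to with the / operator. The function name switch_base and the new base '/var/data' are hypothetical. PurePosixPath is used only so that the example behaves the same on every platform; with real files, Path works the same way. str.removeprefix is shown as a string-only alternative and assumes Python 3.9 or newer.

```python
from pathlib import PurePosixPath


def switch_base(original, old_base, new_base):
    """Re-root `original` from `old_base` onto `new_base`."""
    # relative_to raises ValueError if `original` is not located under `old_base`
    relative = PurePosixPath(original).relative_to(old_base)
    return PurePosixPath(new_base) / relative


print(switch_base('/tmp/asd/asdqwe/file', '/tmp/asd/', '/var/data'))

# String-only alternative (Python 3.9+): removes the prefix only if it is present
print('/tmp/asd/asdqwe/file'.removeprefix('/tmp/asd/'))
```

The first call prints /var/data/asdqwe/file and the second prints asdqwe/file.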

Extract data from embedded script tag in html

I'm trying to fetch data inside a (big) script tag within HTML. By using Beautifulsoup I can approach the necessary script, yet I cannot get the data I want.
What I'm looking for inside this tag resides within a list called "Beleidsdekkingsgraad" more specifically
["Beleidsdekkingsgraad","107,6","107,6","109,1","109,8","110,1","111,5","112,5","113,3","113,3","114,3","115,7","116,3","116,9","117,5","117,8","118,1","118,3","118,4","118,6","118,8","118,9","118,9","118,9","118,5","118,1","117,8","117,6","117,5","117,1","116,7","116,2"] even more specific; the last entry in the list (116,2)
I tried following two related answers (1 and 2), but neither got the job done.
What I've done so far
base='https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed'
url=requests.get(base)
soup=BeautifulSoup(url.text, 'html.parser')
all_scripts = soup.find_all('script')
all_scripts[3].get_text()[1907:2179]
This, however, is not satisfying since each time the indexing has to be changed if new numbers are added.
What I'm looking for is an easy way to, first, extract the list from the script tag and, second, get the last number of the extracted list (i.e. 116,2).
You could use a regex to pull out the JavaScript object holding that item, then parse it with the json library:
import requests,re,json
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'window\.infographicData=(.*);')
data = json.loads(p.findall(r.text)[0])
result = [i for i in data['elements'][1]['data'][0] if 'Beleidsdekkingsgraad' in i][0][-1]
print(result)
Or do whole thing with regex:
import requests,re
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'\["Beleidsdekkingsgraad".+?,"([0-9,]+)"\]')
print(p.findall(r.text)[0])
Another option:
import requests,re, json
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'(\["Beleidsdekkingsgraad".+?"\])')
print(json.loads(p.findall(r.text)[0])[-1])
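Because the live page may change or be unavailable, the last approach can be tried offline on a stand-in string. The sample_html value below is a hypothetical, heavily shortened imitation of the page source, not the real data. The final conversion of the decimal comma to a float is an extra step that the answers above do not include.

```python
import json
import re

# Hypothetical, shortened stand-in for the page source
sample_html = (
    '<script>window.infographicData={"elements":[{"data":[[["Maand","jan","feb"],'
    '["Beleidsdekkingsgraad","107,6","116,2"]]]}]};</script>'
)

# Capture the whole JSON array that starts with "Beleidsdekkingsgraad"
pattern = re.compile(r'(\["Beleidsdekkingsgraad".+?"\])')
row = json.loads(pattern.search(sample_html).group(1))

last_value = row[-1]                               # '116,2' (still a string)
as_float = float(last_value.replace(',', '.'))     # decimal comma -> decimal point
print(last_value, as_float)
```

Because the regex anchors on the label rather than on character offsets, it keeps working when new numbers are appended to the list.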

Problem with multivariables in string formatting

I have several files in a folder named t_000.png, t_001.png, t_002.png and so on.
I have made a for-loop to import them using string formatting. But when I use the for-loop I got the error
No such file or directory: '/file/t_0.png'
This is the code that I have used. I think I should use multiple %s, but I do not understand how.
for i in range(file.shape[0]):
    im = Image.open(dir + 't_%s.png' % str(i))
    file[i] = im
You need to pad the string with leading zeroes. With the type of formatting you're currently using, this should work:
im = Image.open(dir + 't_%03d.png' % i)
where the format specifier %03d means "format this integer with a width of at least 3 characters, padding with leading zeroes".
You can also use Python's other (more recent) string formatting syntax, which is somewhat more succinct:
im = Image.open(f"{dir}t_{i:03d}.png")
You are not padding the number with zeros, thus you get t_0.png instead of t_000.png.
The recommended way of doing this in Python 3 is via the str.format function:
for i in range(file.shape[0]):
    im = Image.open(dir + 't_{:03d}.png'.format(i))
    file[i] = im
You can see more examples in the documentation.
Formatted string literals are also an option if you are using Python 3.6 or a more recent version, see Green Cloak Guy's answer for that.
Try this:
import os
for i in range(file.shape[0]):
    im = Image.open(os.path.join(dir, f't_{i:03d}.png'))
    file[i] = im
(Change f't_{i:03d}.png' to 't_{:03d}.png'.format(i) or 't_%03d.png' % i for versions of Python prior to 3.6.)
The trick is to specify the number of leading zeros; take a look at the official docs for more info.
Also, you should replace dir + file with the more robust os.path.join(dir, file), which works regardless of whether dir ends with a directory separator (i.e. '/' on your platform).
Note also that dir is the name of a built-in function in Python (and file was a built-in in Python 2), so you may want to rename your variables to avoid shadowing them.
Also, if file is a NumPy array, check that the assignment file[i] = im actually does what you expect.
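As a quick check that does not need any image files, the three formatting styles mentioned in the answers can be compared directly. The folder name and the sample indices below are made up for illustration.

```python
import os

folder = "/file"  # hypothetical directory, standing in for `dir` in the question
names = []

for i in (0, 7, 42):
    percent_style = 't_%03d.png' % i          # printf-style formatting
    format_style = 't_{:03d}.png'.format(i)   # str.format
    fstring_style = f't_{i:03d}.png'          # f-string (Python 3.6+)

    # All three styles produce the same zero-padded name
    assert percent_style == format_style == fstring_style
    names.append(fstring_style)
    print(os.path.join(folder, fstring_style))

print(names)
```

The collected names are t_000.png, t_007.png and t_042.png, matching the file naming scheme in the question.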

Double backslashes for filepath_or_buffer with pd.read_csv

Python 3.6, OS Windows 7
I am trying to read a .txt file with pd.read_csv() using a relative filepath. From the pd.read_csv() API documentation, I found that the filepath argument can be any valid string path.
So, in order to define the relative path I use pathlib module. I have defined the relative path as:
df_rel_path = pathlib.Path.cwd() / ("folder1") / ("folder2") / ("file.txt")
a = str(df_rel_path)
Finally, I just want to use it to feed pd.read_csv() as:
df = pd.read_csv(a, engine = "python", sep = "\s+")
However, I am just getting an error stating "No such file or directory: ..." showing double backslashes on the folder path.
I have tried to manually write the path on pd.read_csv() using a raw string, that is, using r"relative/path". However, I am still getting the same result, double backslashes. Is there something I am overlooking?
You can get what you want by using the os module:
import os
df_rel_path = os.path.abspath(os.path.join(os.getcwd(), "folder1", "folder2", "file.txt"))
This way the os module joins the path parts with the proper separator for your platform. Because os.getcwd() already returns an absolute path, os.path.abspath is not strictly needed here, but I included it for the sake of completeness.
For more info, refer to this SO question: Find current directory and file's directory
You need a filename to call pd.read_csv. In the example, 'a' is only the directory path and does not point to a specific file. You could do something like this:
df_rel_path = pathlib.Path.cwd() / ("folder1") / ("folder2")
a = str(df_rel_path)
df = pd.read_csv(a+'/' +'filename.txt')
With the filename your code works for me (on Windows 10):
df_rel_path = pathlib.Path.cwd() / ("folder1") / ("folder2")/ ("file.txt")
a = str(df_rel_path)
df = pd.read_csv(a)
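On the double backslashes themselves: Python error messages usually display paths using repr(), which escapes each backslash, so the doubled form is how the string is shown and not what it contains. The sketch below illustrates this with PureWindowsPath so that it behaves the same on any platform; the C:/project prefix is a made-up example.

```python
from pathlib import PureWindowsPath

# Hypothetical Windows path, built the same way as in the question
path = PureWindowsPath("C:/project") / "folder1" / "folder2" / "file.txt"
as_text = str(path)

print(as_text)              # the actual characters: single backslashes
print(repr(as_text))        # the escaped display form: doubled backslashes
print(as_text.count("\\"))  # number of backslashes actually in the string
```

If the error persists, it may be worth checking Path.exists() on the path before calling pd.read_csv; the file is then most likely missing or the working directory is not the one expected. pd.read_csv also accepts a path object directly, so the str() conversion is optional.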

Unable to remove string from text I am extracting from html

I am trying to extract the main article from a web page. I can accomplish the main text extraction using Python's readability module. However, the text I get back often contains several &#13 strings (there is a ; at the end of this string, but this editor won't allow the full string to be entered (strange!)). I have tried Python's replace function, regular expression substitution, and the unicode encode and decode functions. None of these approaches has worked. With replace and regular expressions I just get back my original text with the &#13 strings still present, and with the encode/decode approach I get the error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 2099: ordinal not in range(128)
Here is the code I am using that takes the initial URL and using readability extracts the main article. I have left in all my commented out code that corresponds to the different approaches I have tried to remove the &#13 string. It appears as though &#13 is interpreted to be u'\xa9'.
# Python 2 code (print statements, urllib.urlopen)
import urllib
from readability.readability import Document
def find_main_article_text_2():
    #url = 'http://finance.yahoo.com/news/questcor-pharmaceuticals-closes-transaction-acquire-130000695.html'
    url = "http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/cbsm/SIG=11iiumket/*http://www.marketwatch.com/News/Story/Story.aspx?guid=4D9D3170-CE63-4570-B95B-9B16ABD0391C&siteid=yhoof2"
    html = urllib.urlopen(url).read()
    readable_article = Document(html).summary()
    readable_title = Document(html).short_title()
    #readable_article.replace("u'\xa9'"," ")
    #print re.sub("&#13;",'',readable_article)
    #unicodedata.normalize('NFKD', readable_article).encode('ascii','ignore')
    print readable_article
    #print readable_article.decode('latin9').encode('utf8'),
    print "There are " ,readable_article.count("&#13;"),"&#13;'s"
    #print readable_article.encode( sys.stdout.encoding , '' )
    #sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    #sents = sent_tokenizer.tokenize(readable_article)
    #new_sents = []
    #for sent in sents:
    #    unicode_sent = sent.decode('utf-8')
    #    s1 = unicode_sent.encode('ascii', 'ignore')
    #    #s2 = s1.replace("\n","")
    #    new_sents.append(s1)
    #print new_sents
    # u'\xa9'
I have a URL that I have been testing the code with included inside the def. If anybody has any ideas on how to remove this &#13 I would appreciate the help. Thanks, George
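One possible direction, offered only as a sketch: &#13; is the character reference for a carriage return (U+000D), whereas u'\xa9' is the copyright sign, so the UnicodeEncodeError probably comes from a different character in the article. Note also that str.replace returns a new string, so its result has to be assigned. The example below uses Python 3 (the question's code is Python 2) and a made-up raw string in place of the readability output.

```python
import html

# Hypothetical stand-in for the text returned by readability
raw = "First line&#13;\nSecond line &copy; 2013&#13;\n"

# Option 1: remove the literal entity text; replace() returns a new string
without_entity = raw.replace("&#13;", "")

# Option 2: decode all character references, then drop the carriage returns
decoded = html.unescape(raw)          # '&#13;' -> '\r', '&copy;' -> '\xa9'
cleaned = decoded.replace("\r", "")

print(repr(without_entity))
print(repr(cleaned))
```

Option 1 leaves other entities such as &copy; untouched, while option 2 turns them into the corresponding Unicode characters, which then need a Unicode-capable encoding such as UTF-8 when printed or saved.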

Resources