Extract text from html document tag

I am trying to extract text from these documents (i.e., doc1, doc2).
I only need the text under the Item 1 header.
What I have tried so far is shown below:
soup = BS(response.text,'html.parser')
startid = BS(response.css('tr:contains("Item\xa01"), tr:contains("Item 1."), *:contains("ITEM 1")')[0].css('a').get('')).find('a').attrs
endid = BS(response.css('tr:contains("Item\xa02"), tr:contains("Item 2."),*:contains("ITEM 2")')[0].css('a').get('')).find('a').attrs
html=''
for tag in soup.select('a',startid)[0].parent.next_siblings:
    if soup.select('a',endid)[0].parent == tag:
        break
    else:
        html += str(tag)
h = html2text.HTML2Text()
h.ignore_links = True
print(h.handle(html))
I just want the text under the Item 1 section.

If you run:
r = requests.get('https://www.sec.gov/Archives/edgar/data/0000001800/000104746915001377/a2222655z10-k.htm')
print(r.text[1532:(1532 + 571)])
The output is:
To allow for equitable access to all users, SEC reserves the right to limit requests originating from undeclared automated tools. Your request has been identified as part of a network of automated tools outside of the acceptable policy and will be managed until action is taken to declare your traffic.</p>\n\n<p>Please declare your traffic by updating your user agent to include company specific information.</p>\n\n\n<p>For best practices on efficiently downloading information from SEC.gov, including the latest EDGAR filings, visit <a href="https://www.sec.gov/developer" '
If you look at https://www.sec.gov/developer, it links to https://www.sec.gov/edgar/sec-api-documentation.
So for 0000001800 you should be trying https://data.sec.gov/submissions/CIK0000001800.json which contains...
{"cik":"1800","entityType":"operating","sic":"2834
","sicDescription":"Pharmaceutical Preparations","
insiderTransactionForOwnerExists":1,"insiderTransa
ctionForIssuerExists":1,"name":"ABBOTT LABORATORIE
S","tickers":["ABT"],"exchanges":["NYSE"],"ein":"3
60698440","description":"","website":"","investorW
ebsite":"","category":"Large accelerated filer","f
iscalYearEnd":"1231","stateOfIncorporation":"IL","
stateOfIncorporationDescription":"IL","addresses":
{"mailing":{"street1":"100 ABBOTT PARK ROAD","stre
et2":null,"city":"ABBOTT PARK","stateOrCountry":"I
L","zipCode":"60064-3500","stateOrCountryDescripti
on":"IL"},"business":{"street1":"100 ABBOTT PARK R
OAD","street2":null,"city":"ABBOTT PARK","stateOrC
ountry":"IL","zipCode":"60064-3500","stateOrCountr
yDescription":"IL"}},"phone":"2246676100","flags":
"","formerNames":[],"filings":{"recent":{"accessio
nNumber":["0001415889-21-004019","0001415889-21-00
4018","0001415889-21-003917","0001415889-21-003804
","0001104659-21-100055","0001415889-21-003773","0
001415889-21-003748","0001104659-21-094680","00014
15889-21-003516","0001415889-21-003514","000141588
9-21-003513","0001415889-21-003512","0001415889-21
-003509","0001415889-21-003503","0001415889-21-003
428","0001415889-21-003425","0001415889-21-003423"
,"0001415889-21-003418","0001104659-21-086325","00
01415889-21-002958","0001415889-21-002831","000141
5889-21-002830","0001104659-21-0763........
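The block message above asks for a declared user agent, so a request along the following lines may get through. This is a minimal sketch using only the standard library; the helper names and the contact string are placeholders I made up, not anything prescribed by SEC.

```python
import json
import urllib.request


def submissions_url(cik):
    """Build the data.sec.gov submissions URL, zero-padding the CIK to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"


def sec_headers(contact):
    # `contact` is a placeholder: put your own company name and email here.
    return {"User-Agent": contact}


def fetch_submissions(cik, contact):
    """Download and decode the submissions JSON for one CIK (needs network access)."""
    request = urllib.request.Request(submissions_url(cik), headers=sec_headers(contact))
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


print(submissions_url("0000001800"))
# Example call (not executed here):
# data = fetch_submissions(1800, "Example Corp admin@example.com")
```

The same headers dictionary can be passed to requests.get(url, headers=...) if you prefer to keep using requests.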

Related

gdata not able to find publicly shared and published google sheet [duplicate]

I have an app that opens the json version of a spreadsheet that I've published to the web. I used the instructions on this website: https://www.freecodecamp.org/news/cjn-google-sheets-as-json-endpoint/
It's been working fine for a couple months, but today I realized that the url of my json file is no longer working since yesterday. It gives the message, "Sorry, unable to open the file at this time. Please check the address and try again." The regular link to view the spreadsheet as a webpage still works though.
Did Google drop support for this feature? Is there another way to get the data of a spreadsheet in json format through a URL? I started looking into the Google Developer API, but it was really confusing.

You are using the JSON Alt Type variant of the Google Data protocol. This protocol is dated and appears to no longer work reliably. The GData API Directory states:
Google Spreadsheets Data API: GData version is still live. Replaced by the Google Sheets API v4.
Google Sheets API v4 is a modern RESTful interface that is typically used with a client library to handle authentication and batch processing of data requests. If you do not want to do a full-blown client implementation, David Kutcher offers the following v4 analog for the GData JSON Alt Type, using jQuery:
GData (old version, not recommended):
var url = 'https://spreadsheets.google.com/feeds/list/' +
    spreadsheet_id + '/' + tab_ordinal + '/public/values?alt=json';
($.getJSON(url, 'callback=?')).success(function(data) {
    // ...
});
V4 (new version, recommended):
var url = 'https://sheets.googleapis.com/v4/spreadsheets/' +
    spreadsheet_id + '/values/' + tab_name +
    '?alt=json&key=' + api_key;
($.getJSON(url, 'callback=?')).success(function(data) {
    // ...
});
...where:
spreadsheet_id is the long string of letters and numbers in the address of the spreadsheet — it is the bit between /d/ and /edit
tab_ordinal is the number of the sheet: the first sheet that appears in the tab bar is sheet number 1, the second one is 2, and so on
tab_name is the name of the sheet, i.e., the name you see in the tab bar at the bottom of the window when you have the spreadsheet open for editing
api_key is the API key you get from the Google Cloud Platform console
Note that the JSON output format differs between the two versions.
With the GData pattern, the spreadsheet needs to be shared as File > Share > Publish to the web.
With the V4 pattern, the spreadsheet needs to be shared as File > Share > Share with others > anyone with the link can view.
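For readers calling the V4 endpoint from Python rather than jQuery, here is a rough sketch of the same URL pattern. The function names are my own, and rows_to_dicts assumes the response carries a "values" list of rows whose first row is a header, which depends on how your sheet is laid out.

```python
from urllib.parse import quote, urlencode


def values_url(spreadsheet_id, tab_name, api_key):
    """Build the V4 values URL shown above, escaping the sheet name."""
    base = "https://sheets.googleapis.com/v4/spreadsheets/"
    path = f"{base}{quote(spreadsheet_id, safe='')}/values/{quote(tab_name, safe='')}"
    return f"{path}?{urlencode({'alt': 'json', 'key': api_key})}"


def rows_to_dicts(payload):
    """Turn {"values": [[header...], [row...], ...]} into a list of dicts."""
    values = payload.get("values", [])
    if not values:
        return []
    header, *rows = values
    # Pad short rows so every header column gets a value.
    return [dict(zip(header, row + [""] * (len(header) - len(row)))) for row in rows]


print(values_url("abc123", "Sheet 1", "KEY"))
print(rows_to_dicts({"values": [["name", "qty"], ["apple", "3"], ["pear"]]}))
```

Padding short rows is a defensive choice; it keeps every dict the same shape even when a row has fewer cells than the header.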

As of March 2022:
If you don't want to create a key, you can use this URL format:
https://docs.google.com/spreadsheets/d/{spreadsheetId}/gviz/tq
which downloads a json.txt file of the format
google.visualization.Query.setResponse({json});
From that you would have to slice out the json
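One way to slice out the JSON, sketched in Python: take everything from the first opening brace to the last closing brace, so the exact length of the wrapper does not matter. The function name and the sample string are illustrative only.

```python
import json


def extract_gviz_json(text):
    """Strip the google.visualization.Query.setResponse(...) wrapper and parse the JSON."""
    start = text.index("{")   # first brace opens the JSON object
    end = text.rindex("}")    # last brace closes it
    return json.loads(text[start:end + 1])


sample = 'google.visualization.Query.setResponse({"status":"ok","table":{"cols":[],"rows":[]}});'
print(extract_gviz_json(sample)["status"])
```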
-- OR --
Just configure a key as per the official docs:
Go to the Google Cloud Console and create a project (or use an existing one).
Go to the Credentials page and create an API key.
Enable the Sheets API from the API library.
And voilà!
You can now get json using URL Format:
https://sheets.googleapis.com/v4/spreadsheets/{spreadsheetId}/values/{sheetName}?alt=json&key={theKey}
Edit: The sheet must be public, shared as "Anyone with the link can view".

Without jQuery ...
var url = 'https://docs.google.com/spreadsheets/d/'+id+'/gviz/tq?tqx=out:json&tq&gid='+gid;
with id of the spreadsheet and gid of the sheet
https://codepen.io/mikesteelson/pen/wvevppe
example :
var id = '______your_spreadsheet_id________';
var gid = '0';
var url = 'https://docs.google.com/spreadsheets/d/'+id+'/gviz/tq?tqx=out:json&tq&gid='+gid;
fetch(url)
  .then(response => response.text())
  .then(data => document.getElementById("json").innerHTML=myItems(data.substring(47).slice(0, -2))
  );
function myItems(jsonString){
  var json = JSON.parse(jsonString);
  var table = '<table><tr>'
  json.table.cols.forEach(colonne => table += '<th>' + colonne.label + '</th>')
  table += '</tr>'
  json.table.rows.forEach(ligne => {
    table += '<tr>'
    ligne.c.forEach(cellule => {
        try{var valeur = cellule.f ? cellule.f : cellule.v}
        catch(e){var valeur = ''}
        table += '<td>' + valeur + '</td>'
      }
    )
    table += '</tr>'
  }
  )
  table += '</table>'
  return table
}
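If the data is needed in Python rather than in the browser, the same traversal of table.cols and table.rows can be sketched as below. It mirrors the f-then-v fallback and the empty-cell handling of myItems; the function names and the sample table are made up for illustration.

```python
def cell_value(cell):
    """Prefer the formatted value "f", fall back to the raw value "v"; empty cells become ''."""
    if cell is None:
        return ""
    value = cell.get("f") or cell.get("v")
    return "" if value is None else value


def table_to_rows(table):
    """Return (header_labels, rows) from the "table" part of a gviz response."""
    header = [col.get("label", "") for col in table["cols"]]
    rows = [[cell_value(cell) for cell in row["c"]] for row in table["rows"]]
    return header, rows


sample_table = {
    "cols": [{"label": "Item"}, {"label": "Price"}],
    "rows": [
        {"c": [{"v": "Apple"}, {"v": 1.5, "f": "$1.50"}]},
        {"c": [{"v": "Pear"}, None]},
    ],
}
print(table_to_rows(sample_table))
```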

GData is the older version of the Sheets API, and it has been shut down. See Google's announcement here: https://cloud.google.com/blog/products/g-suite/migrate-your-apps-use-latest-sheets-api

Kodi addons : how to correctly set an URL using xbmcplugin.addDirectoryItems and xbmcgui.ListItem?

I'm trying to update a plugin for Kodi 19 (and Python3).
But! Hell! Their documentation is a mess, and when you search the internet, a lot of code is outdated.
I cannot understand how to correctly create a virtual folder with items using xbmcplugin.addDirectoryItems.
Here's my (simplified) code.
This is my Kodi menu function:
def menu_live():
    # this is where I get my data (from the internet)
    datas = api.get_live_videos()
    listing = datas_to_list(datas)
    sortable_by = (xbmcplugin.SORT_METHOD_DATE,
                   xbmcplugin.SORT_METHOD_DURATION)
    xbmcplugin.addDirectoryItems(common.plugin.handle, listing, len(listing))
    xbmcplugin.addSortMethod(common.plugin.handle, xbmcplugin.SORT_METHOD_LABEL)
    xbmcplugin.endOfDirectory(common.plugin.handle)
This builds a list of items for the virtual folder:
def datas_to_list(datas):
    list_items = []
    if datas and len(datas):
        for data in datas:
            li = data_to_listitem(data)
            url = li.getPath()
            list_items.append((url, li, True))
    return list_items
This creates an xbmcgui.ListItem for our listing:
def data_to_listitem(data):
    # here I parse my data to build an xbmcgui.ListItem
    label = ...
    url = ...
    ...
    list_item = xbmcgui.ListItem(label)
    list_item.setPath(url)
    return list_item
I don't fully understand how to handle the media URL.
It seems that it can be defined within xbmcgui.ListItem using
list_item.setPath(url)
which seems OK to me (a URL is set on the item itself),
but then it seems that you also need to set the URL when adding the item to the list:
li = data_to_listitem(data)
list_items.append((url, li, True))
This looks weird since it means you have to know the URL outside the function that builds the item.
So currently, my workaround is
li = data_to_listitem(data)
url = li.getPath() #I retrieve the URL defined in the above function
list_items.append((url, li, True))
That code works. But the question is: if I can define an URL on the ListItem using setPath(), then why should I also fill that URL when appending the ListItem to my listing list_items.append((url, li, True)) ?
Thanks a lot !

I'm not exactly sure what your question is. But Video/audio add-on development is thoroughly explained in these guides: https://kodi.wiki/view/HOW-TO:Audio_addon, https://kodi.wiki/view/Audio-video_add-on_tutorial and https://kodi.wiki/view/HOW-TO:Video_addon. Have a look at them, especially the video-add-on guide (as pointed out by Roman), and try to adapt to your case.
Edit
But the question is: if I can define an URL on the ListItem using setPath(), then why should I also fill that URL when appending the ListItem to my listing?
I'm far from an expert, but from my understanding and in the context of https://kodi.wiki/view/HOW-TO:Video_addon tutorial, the url in
list_items.append((url, li, is_folder))
is used to route your plugin to your playback function, as well as passing arguments to it (e.g. video url and possibly other useful stuff needed for playback). That is, the list item passed here doesn't need to have its path set.
ListItem.setPath(video_url)
on the other hand, is for resolving the video url and start the playback after you have selected an item.
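To make the routing idea concrete, here is a small sketch of how a plugin URL can carry the video URL as a parameter and how the plugin can read it back. It runs outside Kodi because it only uses urllib; the add-on id and helper names are hypothetical, and the Kodi calls appear only in comments, following the pattern of the wiki tutorial as I understand it.

```python
from urllib.parse import parse_qsl, urlencode


def build_plugin_url(base_url, **params):
    """Encode routing parameters into a plugin:// URL."""
    return f"{base_url}?{urlencode(params)}"


def parse_plugin_params(paramstring):
    """Decode the query string Kodi hands back to the plugin (it starts with '?')."""
    return dict(parse_qsl(paramstring.lstrip("?")))


# Hypothetical add-on id; inside an add-on the base URL comes from sys.argv[0].
BASE_URL = "plugin://plugin.video.example/"
url = build_plugin_url(BASE_URL, action="play", video="http://example.com/stream.mp4")
print(url)
print(parse_plugin_params("?" + url.split("?", 1)[1]))

# Inside Kodi (not runnable here), the two URLs play different roles:
#   list_items.append((url, li, False))    # routing URL: calls the plugin again
#   ...then, in the handler for action == "play":
#   play_item = xbmcgui.ListItem(path=params["video"])   # the real media URL
#   xbmcplugin.setResolvedUrl(handle, True, listitem=play_item)
```

With this split, the function that builds the ListItem does not need to expose the media URL through getPath(); it only has to put it into the routing parameters.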

Scrape info from a span title

My html looks like this:
<h3>Current Guide Price <span title="92"> 92
</span></h3>
The info I am trying to get is the 92.
Here is another HTML page where I need to get the same data:
<h3>Current Guide Price <span title="4,161"> 4,161
</span></h3>
I would need to get the 4,161 from this page.
Here is the link to the page for reference:
http://services.runescape.com/m=itemdb_oldschool/viewitem?obj=1613
What I have tried:
/h3/span[@title="92"]@title
/h3/span[@title="92"]/text()
/div[@class="stats"]/h3/span[@title="4,161"]@title
Since the info I need is in the span tag itself, it is hard to grab the data in a dynamic way that I can use for many different pages.

from lxml import html
import requests
baseUrl = 'http://services.runescape.com/m=itemdb_oldschool/viewitem?obj=2355'
page = requests.get(baseUrl)
tree = html.fromstring(page.content)
price = tree.xpath('//h3/span')
price2 = tree.xpath('//h3/span/@title')
for p in price:
    print(p.text.strip())
for p2 in price2:
    print(p2)
The output is 92 in both cases.
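If lxml is not available, the same idea (read the title attribute of any span inside an h3, without matching on its value) can be sketched with the standard library's html.parser. This is a swapped-in technique, not the answer's XPath approach, and the conversion to int assumes the title holds only digits and thousands separators, as in the two snippets from the question.

```python
from html.parser import HTMLParser


class GuidePriceParser(HTMLParser):
    """Collect the title attribute of every <span> that sits inside an <h3>."""

    def __init__(self):
        super().__init__()
        self._in_h3 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
        elif tag == "span" and self._in_h3:
            title = dict(attrs).get("title")
            if title is not None:
                self.titles.append(title)

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False


def guide_prices(html_text):
    parser = GuidePriceParser()
    parser.feed(html_text)
    # Assumes titles like "92" or "4,161"; other formats would need extra handling.
    return [int(title.replace(",", "")) for title in parser.titles]


sample = '<h3>Current Guide Price <span title="4,161"> 4,161\n</span></h3>'
print(guide_prices(sample))
```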

Insert values into API request dynamically?

I have an API request I'm writing to query OpenWeatherMap's API to get weather data. I am using a city_id number to submit a request for a unique place in the world. A successful API query looks like this:
r = requests.get('http://api.openweathermap.org/data/2.5/group?APPID=333de4e909a5ffe9bfa46f0f89cad105&id=4456703&units=imperial')
The key part of this is 4456703, which is a unique city_ID
I want the user to choose a few cities; I'll then look up each city_ID in a JSON file and supply the city_IDs to the API request.
I can add multiple city_IDs by hard coding them. I can also add city_IDs as variables. But what I can't figure out is how to insert them into the API request when users choose an arbitrary number of cities (could be up to 20). I've tried adding lists and tuples via several iterations of something like...
#assume the user already chose 3 cities, city_ids are below
city_ids = [763942, 539671, 334596]
r = requests.get(f'http://api.openweathermap.org/data/2.5/group?APPID=333de4e909a5ffe9bfa46f0f89cad105&id={city_ids}&units=imperial')
Maybe a list is not the right data type to use?
Successful code would look something like...
r = requests.get(f'http://api.openweathermap.org/data/2.5/group?APPID=333de4e909a5ffe9bfa46f0f89cad105&id={city_id1},{city_id2},{city_id3}&units=imperial')
Except, as I stated previously, the user could choose 3 cities or 10 so that part would have to be updated dynamically.

You can use string methods and a list comprehension to join all the values of a list into a single string, then format that into the API URL as follows:
city_ids_list = [763942, 539671, 334596]
city_ids_string = ','.join([str(city) for city in city_ids_list]) # Would output "763942,539671,334596"
r = requests.get('http://api.openweathermap.org/data/2.5/group?APPID=333de4e909a5ffe9bfa46f0f89cad105&id={city_ids}&units=imperial'.format(city_ids=city_ids_string))
hope it helps,
good luck
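The same join can be wrapped in a small helper that builds the whole URL for any number of cities. This is a sketch only: the function name is made up, "YOUR_KEY" is a placeholder, and the limit of 20 is taken from the question rather than checked against the API documentation.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.openweathermap.org/data/2.5/group"


def build_group_url(city_ids, api_key, units="imperial", max_ids=20):
    """Build the group request URL for a variable number of city IDs."""
    if not city_ids:
        raise ValueError("at least one city id is required")
    if len(city_ids) > max_ids:
        raise ValueError(f"at most {max_ids} city ids are supported")
    params = {
        "APPID": api_key,
        "id": ",".join(str(city_id) for city_id in city_ids),
        "units": units,
    }
    # safe="," keeps the commas between ids readable instead of encoding them as %2C
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


print(build_group_url([763942, 539671, 334596], "YOUR_KEY"))
# Then fetch it as before: r = requests.get(build_group_url(city_ids, api_key))
```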

How to display search box in all the frontend page sidebar typo3

We have developed a site with TYPO3 v8.7.11. We want to display the search box in the sidebar section, and for this we installed the indexed_search extension.
How can we display a search box in the sidebar of every frontend page?

Edit:
The search and form action of the SearchController are both non-cacheable. This means that you would place a non-cacheable plugin on each of your pages, if you used my old answer. This harms performance and could have other side-effects.
Nowadays I usually simply include a search form on each of my pages by including this in my Fluid Template:
<form action="{f:uri.action(controller: 'Search', action: 'search', extensionName: 'indexedsearch', pluginName: 'Pi2', pageUid: searchPid)}" method="POST" role="search">
    <input type="text" name="tx_indexedsearch_pi2[search][sword]" spellcheck="false" autocomplete="off" />
    <button type="submit">Search</button>
</form>
I hand over the searchPid variable via TypoScript like this:
page.10.variables.searchPid = TEXT
page.10.variables.searchPid.value = <Pid where search results should be displayed>
Old answer:
My tip would be to create a TypoScript object that actually includes the plugin, like this:
lib.headerSearch = USER
lib.headerSearch {
    userFunc = TYPO3\CMS\Extbase\Core\Bootstrap->run
    extensionName = IndexedSearch
    pluginName = Pi2
    vendorName = TYPO3\CMS
    switchableControllerActions {
        Search {
            1 = form
            2 = search
        }
    }
    features {
        requireCHashArgumentForActionArguments = 0
    }
    view < plugin.tx_indexedsearch.view
    view.partialRootPaths.10 = Path/To/Partial/
    view.templateRootPaths.10 = Path/To/Template/
    settings =< plugin.tx_indexedsearch.settings
}
Then, in your template, include it like this
<f:cObject typoscriptObjectPath="lib.headerSearch" />
Note that you should create a new "Search.html" Template in Path/To/Template/Search/ for this TS-Plugin, so that it does not interfere with the regular plugin. Also, be careful if you include the search slot on the same page as the search Plugin itself.

You have multiple options:
1. Copy the HTML of the form from the search plugin in the normal content and insert it in your page (HTML) template.
2. Create a special BE column, insert the search plugin into this column, and render this column inherited in all pages.
3. Make a special page not visible in the FE, where you insert the search plugin, and include this special CE in the rendering of every page (use a CONTENT object in TypoScript to select that special CE).
4. Include and configure the plugin in TypoScript (see the answer of Thomas Löffler).
I prefer option 2, as it is the most flexible and does not need any special page or content IDs, which might change with time (option 3). It can also handle any kind of CE.
Option 1 needs manual fixing if the plugin rendering changes, for example after an update.
Option 4 is not possible for every plugin or CE. If you can configure the plugin with TypoScript, it is a fine option because you do not need any record (from tt_content).
for option 2:
temp.inheritedContent = CONTENT
temp.inheritedContent {
    table = tt_content
    select.orderBy = sorting
    // -- use your own column id: --
    select.where = colPos = 100
    select.languageField = sys_language_uid
    slide = -1
}

Use a TYPO3 extension, which can be a copy (fork) of the newly developed version of macina_searchbox
Template Module: Add "Macina Searchbox" under "include static from extensions" .
Use this or a similar TypoScript to include it, where '6' in this example is the search page. Use your own page id instead.
Constants:
lib.macina_searchbox {
pidSearchpage = 6
}
Setup:
10 = TEMPLATE
10.template = FILE
10.template.file = fileadmin/template/template.html
10.workOnSubpart = DOKUMENT
10.marks {
SUCHE < lib.macina_searchbox
LOGO = TEXT
LOGO.value = <img src="fileadmin/template/img/logo.png">
NAVI= HMENU
NAVI {
Then you can edit the Fluid template files in the folders below macina_searchbox/Resources/Private/ to modify the output of the searchbox. This method is necessary in order that the search result list will not be shown on the page. You must instead insert an Indexed Search plugin on your search page, which has id=6 in this example. SUCHE is the marker in the main template of the website. Use your own marker.

The easiest way is to copy the given plugin from indexed_search to a variable you use in your template.
When you e.g. use FLUIDTEMPLATE:
page.10 = FLUIDTEMPLATE
page.10.variables.searchBox < plugin.tx_indexedsearch
After that you can assign a separate template and make other modifications by changing page.10.variables.searchBox with the possible configuration here: https://docs.typo3.org/typo3cms/extensions/indexed_search/8.7/Configuration/Index.html
