I am looking at pages that are structured in the following way, though the exact elements may not be a table. In general, each page contains key-value pairs, with at most 3 keys per page (not necessarily in any particular order), and the keys vary from page to page (I have no way of knowing all of the possible keys without pre-scraping every page). A key should not repeat within the same page (e.g., A -> 1, B -> 2, A -> 3). I have no trouble isolating the keys and values from the page with XPath; my problem is storing and exporting the values from my Spider.
Approach 1
If I use the dictionary approach with something like this pseudocode:
for th, td in table:
    item[th.text()] = td.text()
Then the exported result only shows values for A, B, and C, because those keys exist in the first page processed and only the headers and values from the first request are maintained.
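For reference, a minimal sketch of what Approach 1 looks like inside a spider (the spider name, start URL, and XPaths are placeholders for whatever the real pages use):
import scrapy

class KeyValueSpider(scrapy.Spider):
    name = "keyvalue"                              # hypothetical name
    start_urls = ["https://example.com/page1"]     # placeholder URL

    def parse(self, response):
        item = {}
        # Assumes each row holds one key (th) and one value (td)
        for row in response.xpath("//table//tr"):
            key = row.xpath("./th/text()").get()
            value = row.xpath("./td/text()").get()
            if key:
                item[key] = value
        yield item
With the built-in feed exporters, the CSV columns are taken from the first item exported unless FEED_EXPORT_FIELDS is set, which would explain the behaviour described above.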
Approach 2
If I use the scrapy.item.Item() and scrapy.item.Field() approach with something like this:
class MyItem(Item):
    A = Field()
    B = Field()
    C = Field()
Then I have no way of declaring Fields for the unknown keys (shown as ...), and I'll receive a KeyError when trying to set a value for an undeclared key (either directly or using an ItemLoader.add_value()).
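As a quick illustration of that KeyError (with D standing in for one of the unknown keys):
from scrapy.item import Item, Field

class MyItem(Item):
    A = Field()
    B = Field()
    C = Field()

item = MyItem()
item["A"] = "1"   # fine, A is a declared Field
item["D"] = "4"   # KeyError: 'MyItem does not support field: D'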
I am using Python 3.8 and Scrapy 2.4.1.
driver_data_form = {
    'forc_day_off': [],
    'pref_day_off': [],
    'pref_shift': {"day"+str(i): None for i in range(1,15)},
    'route_data': []
}
So I am creating the dict driver_data (seen below) by using driver_data_form (seen above):
driver_data = {str(i):driver_data_form for i in range(1,12)}
and then populating it accordingly:
loop_list = [str(i) for i in range(1,13)]
1 for specific_driver in loop_list:
2     for driver in forced_day_off_data:
3         for day in driver:
4             if driver[day]=='1' and day != "driverid":
5                 driver_data[specific_driver]['forc_day_off'].append(day)
forced_day_off_data looks like:
But for some reason, after the above loop is executed once (lines 2-5), with a breakpoint placed at line 2, the 'forc_day_off' list is populated for all 11 keys of driver_data instead of only the first one. It appears that the values of the first key are copied to all the rest of the values.
I have debugged this piece of code many times and this behavior makes no sense to me. What could be causing it, and how can I fix it?
The problem with your code is that Python handles dicts and lists by reference. When you do this
driver_data = {str(i):driver_data_form for i in range(1,12)}
it sets the same dict reference as the value for every key, so when you change one value you actually update it for all the other keys, since they all refer to the same dict.
For your code to work you need to do this:
driver_data = {str(i): {
    'forc_day_off': [],
    'pref_day_off': [],
    'pref_shift': {"day"+str(j): None for j in range(1,15)},
    'route_data': []
} for i in range(1,12)}
This way you create a new dict for each element and you will update only the specific dict.
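A quick way to see the difference is to compare object identities; copy.deepcopy is another option if you want to keep reusing the driver_data_form template (a small sketch, using the dicts from the question):
import copy

driver_data_form = {
    'forc_day_off': [],
    'pref_day_off': [],
    'pref_shift': {"day"+str(i): None for i in range(1,15)},
    'route_data': []
}

# Shared reference: every key points to the *same* dict object
shared = {str(i): driver_data_form for i in range(1,12)}
print(shared['1'] is shared['2'])              # True

# Independent copies: deepcopy also copies the nested lists and dicts
independent = {str(i): copy.deepcopy(driver_data_form) for i in range(1,12)}
print(independent['1'] is independent['2'])    # False

independent['1']['forc_day_off'].append('day1')
print(independent['2']['forc_day_off'])        # [] -- only driver '1' was updated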
See this link to better understand the difference.
I have a DataFrame which has a bunch of ID/name pairs in it. I create it by doing the following:
market_df = pd.DataFrame(markets_info['markets'])
market_df.astype(dict(id=int, name=str))
I received ID numbers from a process and I need to grab the name associated with each ID. I have tried creating an index on the ID and then parsing it, but that doesn't seem to set the ID correctly.
I am now trying the following: exch_name = MARKET_IDS.loc[MARKET_IDS['id'] == exchange_id, 'name']
I have verified that exchange_id is also of type int.
What am I missing here?
I don't know if this is because you left out some crucial information, but from what it sounds like in your post, you're not really altering market_df at all, because your second line is not an assignment: astype returns a new DataFrame rather than modifying the existing one in place. It should read market_df = market_df.astype(dict(id=int, name=str))
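A minimal, self-contained version of the fix plus the lookup (the market data here is made up for illustration):
import pandas as pd

# Stand-in for markets_info['markets']
markets_info = {'markets': [{'id': '1', 'name': 'NYSE'}, {'id': '2', 'name': 'NASDAQ'}]}

market_df = pd.DataFrame(markets_info['markets'])
market_df = market_df.astype(dict(id=int, name=str))   # astype returns a new DataFrame

exchange_id = 2
# .loc with a boolean mask returns a Series; .iloc[0] pulls out the single matching name
exch_name = market_df.loc[market_df['id'] == exchange_id, 'name'].iloc[0]
print(exch_name)   # NASDAQ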
I have an API request I'm writing to query OpenWeatherMap's API to get weather data. I am using a city_id number to submit a request for a unique place in the world. A successful API query looks like this:
r = requests.get('http://api.openweathermap.org/data/2.5/group?APPID=333de4e909a5ffe9bfa46f0f89cad105&id=4456703&units=imperial')
The key part of this is 4456703, which is a unique city_ID
I want the user to choose a few cities, which then I'll look through a JSON file for the city_ID, then supply the city_ID to the API request.
I can add multiple city_IDs by hard coding. I can also add city_IDs as variables. But what I can't figure out is: if users choose an arbitrary number of cities (could be up to 20), how can I insert them into the API request? I've tried adding lists and tuples via several iterations of something like...
#assume the user already chose 3 cities, city_ids are below
city_ids = [763942, 539671, 334596]
r = requests.get(f'http://api.openweathermap.org/data/2.5/group?APPID=333de4e909a5ffe9bfa46f0f89cad105&id={city_ids}&units=imperial')
Maybe a list is not the right data type to use?
Successful code would look something like...
r = requests.get(f'http://api.openweathermap.org/data/2.5/group?APPID=333de4e909a5ffe9bfa46f0f89cad105&id={city_id1},{city_id2},{city_id3}&units=imperial')
Except, as I stated previously, the user could choose 3 cities or 10 so that part would have to be updated dynamically.
You can use str.join and a list comprehension to combine all the IDs in the list into a single string and format that into the API URL, as follows:
city_ids_list = [763942, 539671, 334596]
city_ids_string = ','.join([str(city) for city in city_ids_list]) # Would output "763942,539671,334596"
r = requests.get('http://api.openweathermap.org/data/2.5/group?APPID=333de4e909a5ffe9bfa46f0f89cad105&id={city_ids}&units=imperial'.format(city_ids=city_ids_string))
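As a side note, requests can also build the query string for you via its params argument, which avoids the manual URL formatting (the comma is percent-encoded as %2C, which the server should decode transparently):
import requests

city_ids_list = [763942, 539671, 334596]   # however many cities the user picked
params = {
    'APPID': '333de4e909a5ffe9bfa46f0f89cad105',
    'id': ','.join(str(city) for city in city_ids_list),
    'units': 'imperial',
}
r = requests.get('http://api.openweathermap.org/data/2.5/group', params=params)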
Hope it helps, good luck!
I have a treectrl structure which is populated from an external search of an open data set hosted by our municipal government. The data pertains to business licenses and is requested using Pandas and Sodapy. The tree is populated as follows:
for index, row in results_df.iterrows():
    tradename = row['tradename']
    address = row['address']
    licTypes = row['licencetypes']
    comm = row['comdistnm']
    jobSts = row['jobstatusdesc']
    jobCrt = row['jobcreated']
    lng = row['longitude']
    lng = str(lng)
    lat = row['latitude']
    lat = str(lat)
    # Populate Tree Controls with DataFrame values
    trdName = self.thrTree.AppendItem(root, tradename)
    self.thrTree.AppendItem(trdName, address)
    self.thrTree.AppendItem(trdName, licTypes)
    self.thrTree.AppendItem(trdName, comm)
    self.thrTree.AppendItem(trdName, jobSts)
    self.thrTree.AppendItem(trdName, jobCrt)
    self.thrTree.AppendItem(trdName, lng)
    self.thrTree.AppendItem(trdName, lat)
This results in a final structure of root, then node 1 with the business name, which, when expanded, contains all the information listed above; so I'm assuming that's root level, then child node 1, then children of node 1? I'm not even sure what the second-level indented nodes are called (I've heard the term leaf used for the third level before). But I digress; what I am interested in is grabbing the latitude and longitude of where the business is located, then allowing the user to map the location if they choose. I bind wx.EVT_TREE_ITEM_ACTIVATED so that when the user double-clicks a business name to get the details, I can grab the items displayed. This is how I am currently trying to iterate through the child nodes:
item = self.thrTree.GetSelection()
while self.thrTree.GetItemParent(item):
    piece = self.thrTree.GetItemText(item)
    tmpHldr.insert(0, piece)
    item = self.thrTree.GetItemParent(item)
Looking at item, it appears to be collecting all the business names under root, and ignoring the third level items of interest.
What do I need to do to go deeper within the tree to grab the details under the business clicked on, and not just the list of business names under the root item, which is called 'Search Results'?
Thanks!
@YYC_Code,
Did you look at the wx.TreeCtrl documentation?
It has the GetFirstChild()/GetNextChild() pair of functions that you can use to iterate. It also has an ItemHasChildren() function, which you can use to verify that an item has any children before using the pair mentioned above.
EDIT:
[quote]
For this enumeration function you must pass in a ‘cookie’ parameter which is opaque for the application but is necessary for the library to make these functions reentrant (i.e. allow more than one enumeration on one and the same object simultaneously). The cookie passed to GetFirstChild and GetNextChild should be the same variable.
[/quote]
You need to make sure that the cookie parameter is the same variable throughout the iteration.
You should also check for this:
[quote]
Returns an invalid tree item (i.e. wx.TreeItemId.IsOk returns False) if there are no further children.
[/quote]
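Putting that together, a rough sketch of what the activation handler could look like (the handler name is hypothetical, and it assumes the self.thrTree control and the child order from your AppendItem calls):
def on_business_activated(self, event):
    # Bound with: self.thrTree.Bind(wx.EVT_TREE_ITEM_ACTIVATED, self.on_business_activated)
    item = event.GetItem()        # the business-name node that was double-clicked
    details = []
    if self.thrTree.ItemHasChildren(item):
        child, cookie = self.thrTree.GetFirstChild(item)
        while child.IsOk():
            details.append(self.thrTree.GetItemText(child))
            # Reuse the same cookie for the whole enumeration
            child, cookie = self.thrTree.GetNextChild(item, cookie)
    # With the append order above, longitude and latitude are the last two entries
    print(details)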
With Django/Haystack/SOLR, I'd like to be able to restrict the result of a search to those records within a particular range of django_ids. Getting these IDs is not a problem, but trying to filter by them produces some unexpected effects. The code looks like this (extraneous code trimmed for clarity):
def view_results(request, arg):
    # django_ids list is first calculated using arg...
    sqs = SearchQuerySet().facet('example_facet')  # STEP_1
    sqs = sqs.filter(django_id__in=django_ids)  # STEP_2
    view = search_view_factory(
        view_class=SearchView,
        template='search/search-results.html',
        searchqueryset=sqs,
        form_class=FacetedSearchForm
    )
    return view(request)
At the point marked STEP_1 I get all the database records. At STEP_2 the records are successfully narrowed down to the number I'd expect for that list of django_ids. The problem comes when the search results are displayed in cases where the user has specified a search term in the form. Rather than returning all records from STEP_2 which match the term, I get all records from STEP_2 plus all from STEP_1 which match the term.
Presumably, therefore, I need to override one or more of the methods of SearchView in haystack/views.py, but which? Can anyone suggest a means of achieving what is required here?
After a bit more thought, I found a way around this. In the code above, the problem was occurring in the view = search_view_factory... line, so I needed to create my own SearchView class and override the get_results(self) method in order to apply the filtering after the search has been run with the user's search terms. The result is code along these lines:
class MySearchView(SearchView):
    def get_results(self):
        search = self.form.search()
        # The ID I need for the database search is at the end of the URL,
        # but this may have some search parameters on and need cleaning up.
        view_id = self.request.path.split("/")[-1]
        view_query = MyView.objects.filter(id=view_id.split("&")[0])
        # At this point the django_ids of the required objects can be found.
        if len(view_query) > 0:
            view_item = view_query[0]
            django_ids = []
            for thing in view_item.things.all():
                django_ids.append(thing.id)
            search = search.filter_and(django_id__in=django_ids)
        return search
Using search.filter_and rather than search.filter at the end was another thing that turned out to be essential, though it didn't do what I needed when the filtering was performed before reaching the SearchView.
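For completeness, hooking the custom class in just means pointing the factory call from the question at MySearchView (a sketch reusing the same template and form from the original view; the django_id filtering now happens in get_results, so the pre-filtering step is no longer needed here):
def view_results(request, arg):
    sqs = SearchQuerySet().facet('example_facet')
    view = search_view_factory(
        view_class=MySearchView,
        template='search/search-results.html',
        searchqueryset=sqs,
        form_class=FacetedSearchForm
    )
    return view(request)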