bing search api v5 "__next" replacement?


I'm migrating my website from the Bing Azure API (v2) to the new Bing v5 search API.
In the old API, each response object included a "__next" field that told you whether more results followed.
The JSON returned by the new API no longer includes it.
I'm upgrading my pagination and I don't know how to do it without this field.
Does anyone know what replaces it in the new API?
I can't find any information in the migration guide or in the new v5 API guide.
Thanks.

John is right. You use the count and offset params in conjunction with the totalEstimatedMatches value from the first JSON object returned.
Example: Imagine you love rubber-duckies so much that you want every single webpage in existence that contains the term 'rubber-ducky.' WELL, TOUGH LUCK, BECAUSE THAT'S NOT HOW THE INTERNET WORKS. Don't despair yet, however: Bing knows a lot about webpages containing 'rubber-ducky,' and all you'll need to do is paginate through the 'rubber-ducky'-related sites that Bing knows about and rejoice.
First, we need to tell the API that we want "some" results by passing 'rubber-ducky' to it (the value of "some" is defined by the count param; 50 is the max).
Next, we'll need to look in the first JSON object returned; its totalEstimatedMatches field tells us how many 'rubber-ducky' sites Bing knows about.
Since we have an insatiable hunger for rubber-ducky-related websites, we're going to set up a while-loop that alternates between querying and incrementing offset, and that does not stop until offset has reached totalEstimatedMatches.
Here's some Python code for clarification:
>>> import SomeMagicalSearcheInterfaceThatOnlyNeeds3Params as Searcher
>>>
>>> SearcherInstance = Searcher()
>>> SearcherInstance.q = 'rubber-ducky'
>>> SearcherInstance.count = 50
>>> SearcherInstance.offset = 0
>>> SearcherInstance.totalEstimatedMatches = 0
>>>
>>> print(SearcherInstance.preview_URL)
https://api.cognitive.microsoft.com/bing/v5.0/search?q=rubber%2Dducky&count=50&offset=0
>>>
>>> json_return_object = SearcherInstance.search_2_json()
>>>
>>> ## Python just treats JSON as nested dictionaries.
>>> tem = json_return_object['webPages']['totalEstimatedMatches']
>>> print(tem)
9500000
>>> num_links_returned = len(json_return_object['webPages']['value'])
>>> print(num_links_returned)
50
>>>
>>> ## We'll set some vals manually then make our while loop.
>>> SearcherInstance.offset += num_links_returned
>>> SearcherInstance.totalEstimatedMatches = tem
>>>
>>> a_dumb_way_to_store_this_much_data = []
>>>
>>> while SearcherInstance.offset < SearcherInstance.totalEstimatedMatches:
...     json_response = SearcherInstance.search_2_json()
...     a_dumb_way_to_store_this_much_data.append(json_response)
...     ## Count the links in the page we just fetched, not the first one.
...     actual_count = len(json_response['webPages']['value'])
...     if actual_count == 0:
...         break  ## totalEstimatedMatches is only an estimate
...     SearcherInstance.offset += min(SearcherInstance.count, actual_count)
Hope this helps a bit.

You should read the totalEstimatedMatches value the first time you call the API, then use the &count and &offset parameters to page through the results as described here: https://msdn.microsoft.com/en-us/library/dn760787.aspx.
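The count/offset loop described in these answers can be sketched without any network access. This is only a minimal sketch, not the Bing client API: paginate and fake_fetch_page are hypothetical names, and fake_fetch_page stands in for a real HTTP call by returning a dictionary shaped like the webPages object used above.

```python
def paginate(fetch_page, count=50):
    """Yield every result, advancing offset until the estimated total is reached."""
    offset = 0
    total = None
    while total is None or offset < total:
        response = fetch_page(count=count, offset=offset)
        web_pages = response['webPages']
        total = web_pages['totalEstimatedMatches']
        results = web_pages['value']
        if not results:
            # The total is only an estimate, so stop when a page comes back empty.
            break
        for result in results:
            yield result
        offset += len(results)


def fake_fetch_page(count, offset):
    """Hypothetical stand-in for the HTTP request: serves 120 made-up results."""
    data = ['site-%d' % i for i in range(120)]
    return {'webPages': {'totalEstimatedMatches': len(data),
                         'value': data[offset:offset + count]}}


all_results = list(paginate(fake_fetch_page))
print(len(all_results))
```

Advancing offset by the number of results actually returned, rather than by count, keeps the loop correct when the last page is short.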

Related

Trouble using regex patterns in Python to find content in a document

I have a list of regular expressions that I want to find in certain docs.
x = ['\bin\sapp\sdata\b','\bin\sapp\sdata\b','\benough\sdata\b']
The patterns repeat themselves, so I converted the list to a set (see the first and second values in the list):
y = set(x)
When I try to find them in a specific doc, nothing is found, since the patterns are not taken in their repr version:
import pandas as pd
import re
results = list()
doc = 'they wanted in app data and we did not provide it'
for value in y:
    results.append(re.findall(pattern = value,string=doc))
results = list(filter(None, results))
results
How do I overcome this?
Thanks

The problem was with Python 3.7. The error I got was "bad escape \l at position 0". Once I changed re to regex it worked perfectly fine, even with the "messed up" coding.
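One thing worth checking in the question's code, independent of the Python version: the patterns are written as ordinary string literals, so Python turns '\b' into a backspace character before the regex engine ever sees it. Below is a minimal sketch, assuming the intent was a word boundary, that uses raw strings with the standard re module.

```python
import re

# In an ordinary string literal, '\b' is the backspace character (0x08).
backspace = '\b'
print(backspace == '\x08')

# Raw strings keep the backslash, so the regex engine receives \b and \s intact.
patterns = {r'\bin\sapp\sdata\b', r'\benough\sdata\b'}
doc = 'they wanted in app data and we did not provide it'

results = [re.findall(pattern, doc) for pattern in sorted(patterns)]
results = [matches for matches in results if matches]  # drop empty match lists
print(results)
```

Using a set literal of raw strings also removes the duplicate pattern without a separate conversion step.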

python pop() for list with transfered value [duplicate]

This question already has answers here:
How do I clone a list so that it doesn't change unexpectedly after assignment?
I want to retain the original list while manipulating it, i.e. I'm using it in a loop and have to perform some operations each iteration, so I need to reset the value of the list. Initially, I thought it was a problem with my loops, but I have narrowed it down to this:
inlist=[1,2,3]
a=inlist
a.pop(0)
print(a)
print(inlist)
gives an output of
[2, 3]
[2, 3]
Why am I not getting
[2, 3]
[1, 2, 3]
It is applying pop to both a and inlist.

Let me explain with Interactive console:
>>> original_list = [1, 2, 3, 4] # source
>>> reference = original_list # 'aliasing' to another name.
>>> reference is original_list # check if two are referencing same object.
True
>>> id(reference) # ID of referencing object
1520121182528
>>> id(original_list) # Same ID
1520121182528
To create new list:
>>> copied = list(original_list)
>>> copied is original_list # now referencing different object.
False
>>> id(copied) # now has different ID with original_list
1520121567616
There are multiple ways of copying a list; a few examples:
>>> copied_slicing = original_list[::]
>>> id(copied_slicing)
1520121558016
>>> import copy
>>> copied_copy = copy.copy(original_list)
>>> id(copied_copy)
1520121545664
>>> copied_unpacking = [*original_list]
>>> id(copied_unpacking)
1520123822336
.. and so on.
An illustration from the book 'Fluent Python' by Luciano Ramalho might help you understand what's going on:
Rather than a 'name' being a box that contains the respective object, it's a Post-it stuck on the object in memory.

Try doing it this way
a = [1,2,3]
b=[]
b.extend(a)
b.pop(0)
What you are doing makes sense, but you are just assigning another name to the same list, which is why both are affected. If you instead define b (in my case) as an empty list and then extend it with a, you make a copy rather than another variable pointing to the same list.
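Applied to the loop scenario from the question, the idea could look like the sketch below. The loop body and the snapshots list are made up for illustration; the point is only that a fresh copy is taken at the start of every iteration. Note that list() makes a shallow copy, which is enough for a flat list of numbers.

```python
inlist = [1, 2, 3]
snapshots = []

for _ in range(2):
    a = list(inlist)   # fresh copy each iteration; inlist itself is never touched
    a.pop(0)
    snapshots.append(a)

print(snapshots)
print(inlist)
```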

Cumulatively add values to python dictionary

Suppose I have a dictionary:
key={'a':5}
Now I want to add values to it cumulatively, adding on to the current value without overwriting it. How do I do that?
Here is an example:
for post in doc['post']:
    if 'wow' in post:
        value=2
        for reactor in post['wow']['reactors']:
            dict_of_reactor_ids.update({reactor['img_id']:value})
    if 'sad' in post:
        value=2
        for reactor in post['sad']['reactors']:
            dict_of_reactor_ids.update({reactor['img_id']:value})
Suppose the dictionary looks like this after the first iteration:
dict_of_reactor_ids={101:2,102:1}
and now I want to increase the value of key 101 by 3. How do I do that, so that I end up with this?
dict_of_reactor_ids={101:5,102:1}
In the second iteration over the posts, I want to add values to the current values in the dictionary without overwriting them.
I have tried the update method, but I think it just replaces the whole value instead of adding on to it.

Sounds like a typical case for Counter:
>>> from collections import Counter
>>> c = Counter()
>>> c["a"] += 1 # works even though "a" is not yet present
>>> c.update({"a": 2, "b": 2}) # possible to do multiple updates
>>> c
Counter({'a': 3, 'b': 2})
In your case the benefit is that it works even when the key is not already in there (default value is 0), and it allows updates of multiple values at once, whereas update on a normal dict would overwrite the value as you've noticed.

You can also use defaultdict. It "defaults" when there is not yet an existing key-value pair, and you can still use the cumulative add +=:
from collections import defaultdict
dict_of_reactor_ids = defaultdict(int)
dict_of_reactor_ids[101] += 2
dict_of_reactor_ids[102] += 1
dict_of_reactor_ids[101] += 3
print(dict_of_reactor_ids[101])
5
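To tie this back to the loop in the question, here is a rough sketch using Counter. The sample doc is invented and only mimics the structure implied by the question (posts with optional 'wow' and 'sad' entries, each holding a list of reactors); the weights dictionary is likewise a hypothetical helper holding the values from the question.

```python
from collections import Counter

# Made-up sample data shaped like the `doc` in the question.
doc = {'post': [
    {'wow': {'reactors': [{'img_id': 101}, {'img_id': 102}]}},
    {'wow': {'reactors': [{'img_id': 101}]},
     'sad': {'reactors': [{'img_id': 101}]}},
]}

# Value added per reaction type (both are 2 in the question's code).
weights = {'wow': 2, 'sad': 2}

dict_of_reactor_ids = Counter()
for post in doc['post']:
    for reaction, value in weights.items():
        if reaction in post:
            for reactor in post[reaction]['reactors']:
                # += adds to the running total instead of overwriting it.
                dict_of_reactor_ids[reactor['img_id']] += value

print(dict(dict_of_reactor_ids))
```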

Can't pull out the information from object using Beautiful Soup 4

I am working (for the first time) with scraping a website. I am trying to pull the latitude (in decimal degrees) from a website. I have managed to pull out the correct parent node that contains the information, but I am stuck on how to pull out the actual number from this. All of the searching I have done has only told me how to pull it out if I know the string (which I don't) or if the string is in a child node, which it isn't. Any help would be great.
Here is my code:
a_string = soup.find(string="Latitude in decimal degrees")
a_string.find_parents("p")
Out[46]: [<p><b>Latitude in decimal degrees</b><font size="-2">
(<u>see definition</u>)
</font><b>:</b> 35.7584895</p>]
test = a_string.find_parents("p")
print(test)
[<p><b>Latitude in decimal degrees</b><font size="-2"> (<u>see definition</u>)</font>
<b>:</b> 35.7584895</p>]
I need to pull out the 35.7584895 and save it as an object so I can append it to a dataset.
I am using Beautiful Soup 4 and python 3

The first thing to notice is that, since you have used the find_parents method (plural), test is a list. You need only the first item of it.
I will simulate your situation by doing this.
>>> import bs4
>>> HTML = '<p><b>Latitude in decimal degrees</b><font size="-2"> (<u>see definition</u>)</font><b>:</b> 35.7584895</p>'
>>> item_soup = bs4.BeautifulSoup(HTML, 'lxml')
The simplest way of recovering the textual content of this is to do this:
>>> item_soup.text
'Latitude in decimal degrees (see definition): 35.7584895'
However, you want the number. You can get this in various ways, two of which come to mind. I assign the result of the previous statement to text so that I can manipulate it (avoiding the name str, which would shadow the built-in).
>>> text = item_soup.text
One way is to search for the colon.
>>> text[1+text.rfind(':'):].strip()
'35.7584895'
The other is to use a regex.
>>> import re
>>> re.search(r'(\d+\.\d+)', text).group(1)
'35.7584895'
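Since the goal is to append the value to a dataset, it may help to wrap the regex step in a small function that returns a float. This is just a sketch: extract_decimal is a hypothetical helper, and paragraph_text is assumed to be the text of the paragraph found earlier (for example, what test[0].get_text() would give you). The optional minus sign is an addition so that negative coordinates are not truncated.

```python
import re


def extract_decimal(text):
    """Return the last decimal number found in text as a float, or None."""
    matches = re.findall(r'-?\d+\.\d+', text)
    return float(matches[-1]) if matches else None


# Assumed to be the text content of the <p> element from the question.
paragraph_text = 'Latitude in decimal degrees (see definition): 35.7584895'
latitude = extract_decimal(paragraph_text)
print(latitude)
```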

List, tuples or dictionary, differences and usage, How can I store info in python

I'm very new to Python (I usually write in PHP). I want to understand how to store information in an associative array, and it would be wonderful if you could explain the difference between "tuples", "arrays", "dictionaries" and "lists" (I have tried reading different sources, but I'm still not getting it).
This is my code:
#!/usr/bin/python3.4
import csv
import string
nidless_keys = dict()
nidless_keys = ['test_string1','test_string2'] #this contains the strings to
                                               # be searched in linesreader
data = {'type':[],'id':[]} #here I want to store my information
with open('path/to/csv/file.csv',newline="") as csvfile:
    linesreader = csv.reader(csvfile,delimiter=',',quotechar="|")
    for row in linesreader: #every line in this csv has a url like
                            #www.test.com/?test_string1&id=123456
        current_row_string = str(row)
        for needle in nidless_keys:
            current_needle = str(needle)
            if current_needle in current_row_string:
                data[current_needle][current_row_string[-8:]] += 1 # also I
                #need to count, for every id, how many rows there are.
In conclusion:
my_data_stored = [current_needle][current_row_string[-8]]
current_row_string is a URL, and the last 8 digits of the URL are an ID.
So the array should look like this at the end of the script:
test_string1 = 123456 = 20
= 256468 = 15
test_string2 = 123155 = 10
Edit 1:
Which type do I need here to store the information?
Can you tell me how to fix this script?

It seems you want to count how many times an ID in combination with a test string occurs.
There can be multiple ID/count combinations associated with every test string.
This suggests that you should use a dictionary indexed by the test strings to store the results. In that dictionary I would suggest to store collections.Counter objects.
With a plain dictionary, you would have to add a special case that inserts an empty Counter whenever a key isn't found in the results dictionary. This is a common problem, so there is a specialized form of dictionary in the collections module called defaultdict.
import collections
import csv
# Using a tuple for the keys so it cannot be accidentally modified
keys = ('test_string1', 'test_string2')
result = collections.defaultdict(collections.Counter)
with open('path/to/csv/file.csv',newline="") as csvfile:
    linesreader = csv.reader(csvfile,delimiter=',',quotechar="|")
    for row in linesreader:
        # csv.reader yields a list of fields; join them back into one string
        # so that substring tests and slicing work on the whole line.
        line = ','.join(row)
        for key in keys:
            if key in line:
                id = line[-6:] # IDs are six digits in your example.
                # The first index is into the dict, the second into the Counter.
                result[key][id] += 1
There is an even easier way, by using regular expressions.
Since you seem to treat every row in a CSV file as a string, there is little need to use the CSV reader, so I'll just read the whole file as text.
import re
with open('path/to/csv/file.csv') as datafile:
    text = datafile.read()
pattern = r'\?(.*)&id=(\d+)'
The pattern is a regular expression. This is a large topic in and of itself, so I'll only cover briefly what it does. (You might also want to check out the relevant HOWTO.) At first glance it looks like complete gibberish, but it is actually a complete language.
It looks for two things in a line: anything between ? and &id=, and a sequence of digits after &id=.
I'll be using IPython to give an example.
(If you don't know it, check out IPython. It is great for trying things out and seeing if they work.)
In [1]: import re
In [2]: pattern = r'\?(.*)&id=(\d+)'
In [3]: text = """www.test.com/?test_string1&id=123456
....: www.test.com/?test_string1&id=123456
....: www.test.com/?test_string1&id=234567
....: www.test.com/?foo&id=234567
....: www.test.com/?foo&id=123456
....: www.test.com/?foo&id=1234
....: www.test.com/?foo&id=1234
....: www.test.com/?foo&id=1234"""
The text variable points to the string which is a mock-up for the contents of your CSV file.
I am assuming that:
every URL is on its own line
IDs are a sequence of digits.
If these assumptions are wrong, this won't work.
Using findall to extract every match of the pattern from the text:
In [4]: re.findall(pattern, text)
Out[4]:
[('test_string1', '123456'),
('test_string1', '123456'),
('test_string1', '234567'),
('foo', '234567'),
('foo', '123456'),
('foo', '1234'),
('foo', '1234'),
('foo', '1234')]
The findall function returns a list of 2-tuples (that is key, ID pairs). Now we just need to count those.
In [5]: import collections
In [6]: result = collections.defaultdict(collections.Counter)
In [7]: intermediate = re.findall(pattern, text)
Now we fill the result dict from the list of matches that is the intermediate result.
In [8]: for key, id in intermediate:
....:     result[key][id] += 1
....:
In [9]: print(result)
defaultdict(<class 'collections.Counter'>, {'foo': Counter({'1234': 3, '123456': 1, '234567': 1}), 'test_string1': Counter({'123456': 2, '234567': 1})})
So the complete code would be:
import collections
import re
with open('path/to/csv/file.csv') as datafile:
    text = datafile.read()
result = collections.defaultdict(collections.Counter)
pattern = r'\?(.*)&id=(\d+)'
intermediate = re.findall(pattern, text)
for key, id in intermediate:
    result[key][id] += 1
This approach has two advantages:
You don't have to know the keys in advance.
IDs are not limited to six digits.
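To get from the nested result to something resembling the layout asked for in the question, one could walk the outer dictionary and each Counter in turn. The sketch below uses a shortened in-memory sample instead of the CSV file; the "key = id = count" line format and the names id_ and lines are my own choices for illustration.

```python
import collections
import re

# Shortened mock-up of the file contents (one URL per line).
text = """www.test.com/?test_string1&id=123456
www.test.com/?test_string1&id=123456
www.test.com/?test_string1&id=234567
www.test.com/?foo&id=1234"""

result = collections.defaultdict(collections.Counter)
for key, id_ in re.findall(r'\?(.*)&id=(\d+)', text):
    result[key][id_] += 1

lines = []
for key in sorted(result):
    # most_common() lists the IDs of this key from most to least frequent.
    for id_, count in result[key].most_common():
        lines.append('{} = {} = {}'.format(key, id_, count))

print('\n'.join(lines))
```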

A brief summary of the python data types you mentioned:
A dictionary is an associative array, aka hashtable.
A list is a sequence of values.
An array is essentially the same as a list, but limited to basic data types. My impression is that arrays only exist for performance reasons; I don't think I've ever used one. If performance is that critical to you, you probably don't want to use Python in the first place.
A tuple is a fixed-length sequence of values (whereas lists and arrays can grow).
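A short sketch of the four types side by side may make the summary more concrete. The variable names and values are arbitrary; array here refers to the standard library's array module.

```python
from array import array

counts = {'potato': 3}             # dictionary: maps keys to values
counts['tomato'] = 5

items = [1, 'a', True]             # list: a sequence that can grow and mix types
items.append(2.5)

numbers = array('i', [1, 2, 3])    # array: like a list, but one basic type only
numbers.append(4)

point = (35.75, -78.6)             # tuple: fixed length, no append
try:
    point.append(1)
    tuple_is_fixed = False
except AttributeError:
    tuple_is_fixed = True

print(counts, items, list(numbers), point, tuple_is_fixed)
```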

Let's take them one by one.
Lists:
A list is a very basic data structure, written much like arrays in other languages:
['a','b','c']
This is a list in Python, and it looks very similar to an array.
However, there is a large difference between the way lists are used in Python and the usual arrays.
Lists are heterogeneous. This means that we can store any kind of data in the same list, like:
ls = [1,2,'a','g',True]
As you can see, this list holds various kinds of data and is still a valid list.
One important thing about lists is that we can access their items using zero-based indices. So we can write:
print(ls[0], ls[3])
output: 1 g
Dictionary:
This data structure is similar to a hash map. It contains (key, value) pairs. An empty dictionary looks like:
dc = {}
Now, to store key-value pairs, e.g. ('potato', 3) and ('tomato', 5), we can write:
dc['potato'] = 3
dc['tomato'] = 5
and the data is saved in the dictionary dc.
The important thing is that we can even store another data structure, such as a list, within a dictionary:
dc['list1'] = ls , where ls is the list defined above.
This shows the power of dictionaries.
In your case, you have defined a dictionary like this:
data = {'type':[],'id':[]}
This means that your dictionary consists of only two keys, each corresponding to a list that is empty for now.
Talking a bit about your script, the expression:
current_row_string[-8:]
doesn't make sense. The index should have been -6 instead of -8; that would give you the ID part of the current row.
This part is the ID and should have been stored in a variable, say:
id = current_row_string[-6:]
Further action can be performed as shown in the answer given by Roland.
