I am currently working on obtaining data from a nested JSON response field called "Result".
After reviewing the API documentation, it says they only return 100 records per request, which means that for 425 records I would have to call requests.get at least 5 times with:
/example
/example?$skip=100
/example?$skip=200
/example?$skip=300
/example?$skip=400
After that is done, it should write the response list to a csv file. I have parsed the response from the GET with json.loads, converted the dictionary to a list, and created a for loop that writes whatever is in the "Result" dictionary.
My question is: how can I also loop the requests.get call and increment the url's skip value to 100, 200, 300, 400? Hope this makes sense.
After a lot of searching, the approach that worked for me was the following (sketched in code below):
1. Create a for loop that runs once per page that needs to be fetched.
2. Inside the loop, compute toSkip = (i + 1) * 100.
3. Concatenate the URL: 'url string' + '?$skip=' + str(toSkip).
4. Create the request, passing the authorization header.
5. Parse the response with json.loads.
6. Write the result to a csv file or through the Google Sheets API.
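A minimal sketch of those steps, assuming the endpoint path from the question, a bearer-token authorization header, and that each record in "Result" is a flat dictionary:

import csv
import json
import requests

# Placeholder endpoint and header -- substitute your real values
base_url = 'https://api.example.com/example'
headers = {'Authorization': 'Bearer <token>'}

rows = []

# first request, no $skip
response = requests.get(base_url, headers=headers)
rows.extend(json.loads(response.text)['Result'])

# 425 records at 100 per request -> four more requests with $skip=100, 200, 300, 400
for i in range(4):
    toSkip = (i + 1) * 100
    url = base_url + '?$skip=' + str(toSkip)
    response = requests.get(url, headers=headers)
    rows.extend(json.loads(response.text)['Result'])

# write everything collected from "Result" to a csv file
with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)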
When I use GPT-3's playground, I often get results that are formatted with numbered lists and paragraphs, like below:
Here's what the above class is doing:
1. It creates a directory for the log file if it doesn't exist.
2. It checks that the log file is newline-terminated.
3. It writes a newline-terminated JSON object to the log file.
4. It reads the log file and returns a dictionary with the following
- list 1
- list 2
- list 3
- list 4
However, when I use their API directly and extract the response from the JSON result, I get a crammed text version that is very hard to read, something like this:
Here's what the above class is doing:1. It creates a directory for the log file if it doesn't exist.2. It checks that the log file is newline-terminated.3. It writes a newline-terminated JSON object to the log file.4. It reads the log file and returns a dictionary with the following-list 1-list 2-list 3- list4
My question is, how do people keep the formats from GPT results so they are displayed in a neater, more readable way?
Option 1: Edits endpoint
If you run test.py, the OpenAI API will return a completion with the items split back onto separate numbered lines.
test.py
import openai
openai.api_key = 'sk-xxxxxxxxxxxxxxxxxxxx'
response = openai.Edit.create(
    model = 'text-davinci-edit-001',
    input = 'I have three items:1. First item.2. Second item.3. Third item.',
    instruction = 'Make numbered list'
)
content = response['choices'][0]['text']
print(content)
Option 2: Processing
Process the completion you get from the Completions endpoint by yourself (i.e., write Python code).
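For example, a small sketch of that processing, using regular expressions to break the crammed example from the question back into lines (the patterns here are only illustrative and tied to that particular output):

import re

crammed = ("Here's what the above class is doing:"
           "1. It creates a directory for the log file if it doesn't exist."
           "2. It checks that the log file is newline-terminated."
           "3. It writes a newline-terminated JSON object to the log file."
           "4. It reads the log file and returns a dictionary with the following"
           "-list 1-list 2-list 3- list4")

# start a new line before every numbered item ("1.", "2.", ...)
readable = re.sub(r'(\d+\. )', r'\n\1', crammed)
# start a new line before every dash bullet ("-list 1", "- list4", ...)
readable = re.sub(r'\s*-\s*(list)', r'\n- \1', readable)

print(readable)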
I have a set of more than 2000 issues in JIRA, and plan to copy all of them to a JSON file via Azure Data Factory. However, as the Jira API only allows 100 issues per API call, I need to create multiple API links in the dataset
(e.g.: https://xxx.atlassian.net/rest/api/2/search?jql=order%20by%20created%20DESC&startAt=0&maxResults=100
https://xxx.atlassian.net/rest/api/2/search?jql=order%20by%20created%20DESC&startAt=101&maxResults=100
https://xxx.atlassian.net/rest/api/2/search?jql=order%20by%20created%20DESC&startAt=201&maxResults=100)
This is time-consuming and won't work in the future when my requests can go up to 4000 issues.
If you know whether there's any way to parameterize the copy activity (on "startAt"), so that I just need to use one link and can still copy the data for all 2000+ records, please help me.
Thank you
Is there a parameter for a start date in your API link?
If you have one, you can use Databricks and write code like the following. In my example the date field would be sys_created_on.
import json
import requests
from datetime import datetime
from datetime import timedelta

# starting date is late 2019
curr = datetime.now()
start = datetime.now() + timedelta(days=-500)  # whichever date you want to start from
base_url = "https://xxx.atlassian.net/rest/api/2/search?jql=order%20by%20created%20DESC&sys_created_on>="

# the loop iterates over a period of 5 days
while start <= curr:
    end = start + timedelta(days=5)
    url = base_url + str(start) + "^sys_created_on<=" + str(end)
    # print(url)
    a = requests.get(url, auth=('****', key))  # key holds the API token
    c = a.json()
    str_date = start.strftime('%Y%m%d%H%M%S')
    str_date = str(str_date)
    with open('/dbfs/mnt/****/bronze/****/Jirra_' + str_date[:8] + '.json', 'w') as outfile:
        json.dump(c, outfile)
    start = end
There are a couple of ways you can do this.
Your ADF can trigger an Azure Function which will loop through all the API calls, bundle the data together, and write it into storage. You can refer to #Piccinin1992's response for the code or use your own.
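A rough sketch of what that function's pagination loop could look like in Python; the endpoint is the Jira search URL from the question, while the credentials and the storage write are placeholders:

import requests

# Placeholder credentials -- substitute your own user and API token
auth = ('user@example.com', 'api_token')
base_url = 'https://xxx.atlassian.net/rest/api/2/search'
params = {'jql': 'order by created DESC', 'startAt': 0, 'maxResults': 100}

all_issues = []
while True:
    payload = requests.get(base_url, params=params, auth=auth).json()
    all_issues.extend(payload['issues'])
    # stop once startAt has walked past the total reported by the server
    if params['startAt'] + payload['maxResults'] >= payload['total']:
        break
    params['startAt'] += payload['maxResults']

# all_issues now holds every issue; bundle it up and write it to storage here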
You can also do this using a Web activity and refer to the totalissues value from the output of the Web activity. You will need to loop through this using an Until loop activity.
First, add 3 variables to your pipeline (startat, tempvariable, and totalissues).
Your Web activity URL will be a concat of the API url with the startat variable, as shown. The maxresults value can be hardcoded since it will always be 100.
#concat('https://anunetfunc.azurewebsites.net/api/PaginatedAPI?startat=',variables('startat'),'&maxresults=100')
Now you need to set tempvariable to #variables('startat'). This is because ADF dynamic content does not allow you to reference the variable that you are setting.
Next, set startat to tempvariable + 100. This is for reading the next page.
#string(add(int(variables('tempvariable')),100))
Lastly, set the total issues variable to the totalissues from the output of the 1st webactivity.
#string(activity('Web1').output.totalissues)
With this, the loop will execute only as many times as needed to fetch all the issues. This is just a skeleton to show you how to do it. You still need to add activities to either bundle up all the issues before writing to storage, or write a separate file for each API call.
Is it possible to obtain the URL from the Google search result page, given a keyword? I have a csv file that contains a lot of company names, and I want their websites that show up at the top of the Google search results. When I upload that csv file, it should fetch each company name/keyword and put it in the search field.
For example: "stack overflow" is one of the entries in my csv file; it should be fetched, put in the search field, and return the best match/first URL from the search results, e.g. www.stackoverflow.com.
The returned result should be stored in the same file that I uploaded, next to the keyword it was searched for.
I am not very familiar with these concepts, so any help will be much appreciated.
Thanks!
The google package has a dependency on beautifulsoup, which needs to be installed first.
Then install:
pip install google
search(query, tld='com', lang='en', num=10, start=0, stop=None, pause=2.0)
query : query string that we want to search for.
tld : tld stands for top level domain which means we want to search our result on google.com or google.in or some other domain.
lang : lang stands for language.
num : Number of results we want.
start : First result to retrieve.
stop : Last result to retrieve. Use None to keep searching forever.
pause : Lapse to wait between HTTP requests. Too short a lapse may cause Google to block your IP; keeping a significant lapse will make your program slow, but it's the safer and better option.
Return : Generator (iterator) that yields found URLs. If the stop parameter is None the iterator will loop forever.
The code below is a solution for your question.
import pandas
from googlesearch import search

df = pandas.read_csv('test.csv')
result = []
for i in range(len(df['keys'])):
    # take only the first search result for each keyword
    for j in search(df['keys'][i], tld="com", num=10, stop=1, pause=2):
        result.append(j)

# write the keywords and their matching urls back to the csv
dict1 = {'keys': df['keys'], 'url': result}
df = pandas.DataFrame(dict1)
df.to_csv('test.csv')
Sample input format file image:
Output File Image:
I store some data in an Excel file that I extract in JSON format. I also call some data with GET requests from an API I created. With all these data, I run some tests (does the data in the Excel file equal the data returned by the API?).
In my case, I may need to store in the Excel file the path for selecting the data from the API JSON returned by the GET.
For example, the API returns:
{"countries":
[{"code":"AF","name":"Afghanistan"},
{"code":"AX","name":"Ă…land Islands"} ...
And in my Excel file, I store:
excelData['countries'][0]['name']
I can retrieve excelData['countries'][0]['name'] in my code just fine, as a string.
Is there a way to convert excelData['countries'][0]['name'] from a string into code that actually points to and gets the data I need from the API JSON?
Here's how I want to use it:
self.assertEqual(str(valueExcel), path)
#path is the string from the excel that tells where to fetch the data from the
# JSON api
I thought the string would be interpreted, but no:
AssertionError: 'AF' != "excelData['countries'][0]['code']"
- AF
+ excelData['countries'][0]['code']
You are looking for the built-in eval function. Try this:
self.assertEqual(str(valueExcel), eval(path))
Important: Keep in mind that eval can be dangerous, since malicious code could be executed. More warnings here: What does Python's eval() do?
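A minimal illustration of that, using a hypothetical excelData built from the sample response in the question:

# hypothetical data shaped like the API response in the question
excelData = {
    "countries": [
        {"code": "AF", "name": "Afghanistan"},
        {"code": "AX", "name": "Åland Islands"},
    ]
}

path = "excelData['countries'][0]['code']"   # the string stored in the Excel file
value = eval(path)                           # evaluates the string as a Python expression
print(value)                                 # -> AF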
I'm trying to scrape the MTA website and need a little help scraping the "Train Lines Row". (Website for reference: https://advisory.mtanyct.info/EEoutage/EEOutageReport.aspx?StationID=All)
The train line information is stored as image files (1 line subway, A line subway, etc.) describing each line that's accessible through a particular station. I've had success scraping info out of rows in which only one train passes through, but I'm having difficulty figuring out how to iterate through the columns that have multiple trains passing through, using a conditional statement to test whether the row has one line or multiple lines.
tableElements = table.find_elements_by_tag_name('tr')
That's the table I'm iterating through.
tableElements[2].find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt')
This successfully gives me the value if only one value exists in the particular column.
tableElements[8].find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')
This successfully gives me a list of values that I can iterate through to extract the values I need.
Now I try to combine these lines of code in a for loop to extract all the information without stopping.
for info in tableElements[1:]:
    if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
        for images in info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img'):
            print(images.get_attribute('alt'))
    else:
        print(info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt'))
I'm getting the error message "list index out of range." I don't know why, as every iteration done in isolation seems to work. My hunch is that I haven't used the boolean operation properly here. My idea was that if find_elements_by_tag_name had an index of [1], that would mean there are multiple image alt texts for me to iterate through; hence why I want to use this boolean operation.
Hi all, thanks so much for your help. I've uploaded my full code to GitHub and attached a link for your reference: https://github.com/tsp2123/MTA-Scraping/blob/master/MTA.ElevatorData.ipynb
The end goal is to put this information into a dataframe, using some formulation of a for loop that extracts the image information I want. A rough outline follows, with a fuller sketch after it.
dataframe = []
for elements in tableElements:
    row = {}
    columnName1 = elements.find_element_by_class_name('td')
    ..
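For that end goal, a minimal sketch under the assumption that each row keeps its station name in the first td and its line images inside the h4 in the second td; the column names here are made up:

import pandas as pd

rows = []
for element in tableElements[1:]:
    cells = element.find_elements_by_tag_name('td')
    images = cells[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')
    rows.append({
        # hypothetical column names -- rename to whatever the table actually holds
        'station': cells[0].text,
        'train_lines': [image.get_attribute('alt') for image in images],
    })

dataframe = pd.DataFrame(rows)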
Your logic isn't off here.
"My hunch is I haven't correctly used the boolean operation properly here. My idea was that if find_elements_by_tag_name had an index of [1] that would mean multiple image text for me to iterate through."
The problem is that it can't check whether the statement is True when there's nothing in index position [1], hence the error at this point:
if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
What you want to do is use try/except. So something like:
for info in tableElements[1:]:
    try:
        if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
            for images in info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img'):
                print(images.get_attribute('alt'))
        else:
            print(info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt'))
    except:
        # do something else
        print('Nothing found in index position.')
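Alternatively, since find_elements_by_tag_name already returns a list (empty if nothing matches), you could branch on its length instead of relying on the exception. A sketch, still assuming every row has at least two td cells and an h4:

for info in tableElements[1:]:
    images = info.find_elements_by_tag_name('td')[1] \
                 .find_element_by_tag_name('h4') \
                 .find_elements_by_tag_name('img')
    if not images:
        print('Nothing found in this row.')
    else:
        # the same loop handles one image or many
        for image in images:
            print(image.get_attribute('alt'))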
Could you also go back to your question and add the full code? When I try this, I'm getting 11 table elements, so I want to test it against the specific table you're trying to scrape.