Inconsistent results from feeds search through the API

I want to visualize on a world map all feeds from the user 'airqualityegg'. To do this I wrote the following Python script:
import json
import urllib
import csv

list = []

for page in range(7):
    url = 'https://api.xively.com/v2/feeds?user=airqualityegg&per_page=100&page=' + str(page)
    rawData = urllib.urlopen(url)
    # Loads the data in json format
    dataJson = json.load(rawData)
    print dataJson['totalResults']
    print dataJson['itemsPerPage']
    for entry in dataJson['results']:
        try:
            list2 = []
            list2.append(entry['id'])
            list2.append(entry['creator'])
            list2.append(entry['status'])
            list2.append(entry['location']['lat'])
            list2.append(entry['location']['lon'])
            list2.append(entry['created'])
            list.append(list2)
        except:
            print 'failed to scrape a row'

def escribir():
    abrir = open('all_users2_andy.csv', 'w')
    wr = csv.writer(abrir, quoting=csv.QUOTE_ALL)
    headers = ['id', 'creator', 'status', 'lat', 'lon', 'created']
    wr.writerow(headers)
    for item in list:
        row = [item[0], item[1], item[2], item[3], item[4], item[5]]
        wr.writerow(row)
    abrir.close()

escribir()
I loop over 7 pages because the total number of feeds posted by this user is 684 (as you can see by opening 'https://api.xively.com/v2/feeds?user=airqualityegg' directly in the browser).
The CSV file that results from running this script contains duplicated rows. This might be explained by the fact that every time a call is made to a page the order of results varies, so the same row can be included in the results of different calls. For this reason I get fewer unique results than I should.
Do you know why the results included in different pages are not unique?
Thanks,
María

You can try passing order=created_at (see the docs).
The problem is that, by default, order=updated_at, so results are likely to appear in a different order on each request and the same feed can show up on more than one page.
You should also consider using the official Python library.
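For illustration, here is a minimal sketch of the same paginated loop with the ordering pinned; it keeps the question's endpoint and page range, uses the requests library, and only the order parameter is new:
import requests

rows = []
for page in range(7):
    # Pinning the order keeps pagination stable across requests
    url = ('https://api.xively.com/v2/feeds'
           '?user=airqualityegg&per_page=100&order=created_at&page=' + str(page))
    data = requests.get(url).json()
    for entry in data['results']:
        rows.append(entry['id'])
print(len(rows), len(set(rows)))  # check how many duplicates remain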

Related

How to print live data in a single location without printing new lines in Python?

Is there any way to print live data, which is being scraped from a website, in a single location?
Consider the example below to understand my question.
import requests
import time

print("Selling \t \tbuying")
for i in range(100):
    time.sleep(5)
    r = requests.get("https://api.binance.com/api/v3/depth", params=dict(symbol="DOTUSDT"))
    a = r.json()
    b = a['asks'][0][0]
    c = a['bids'][0][0]
    print(b, "\t \t", c)
This program prints output every 5 seconds, so new lines keep appearing like this:
Selling buying
18.58900000 18.58800000
18.59100000 18.58800000
18.59500000 18.59400000
18.59300000 18.59200000
18.60000000 18.59900000
18.61500000 18.61400000
18.61900000 18.61700000
What I need is for it not to print new values on new lines; it should keep updating the same first line with the new values.
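For illustration only, one common approach is to end each print with a carriage return instead of a newline, so the same line is overwritten on every iteration; a minimal sketch based on the loop above:
import requests
import time

print("Selling \t \tbuying")
for i in range(100):
    time.sleep(5)
    r = requests.get("https://api.binance.com/api/v3/depth", params=dict(symbol="DOTUSDT"))
    a = r.json()
    b = a['asks'][0][0]
    c = a['bids'][0][0]
    # '\r' returns the cursor to the start of the line; end='' suppresses the newline
    print(b, "\t \t", c, end='\r', flush=True)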

Stuck using pandas to build RPG item generator

I am trying to build a simple random item generator for a game I am working on.
So far I am stuck trying to figure out how to store and access all of the data. I went with pandas using .csv files to store the data sets.
I want to add weighted probabilities to what items are generated so I tried to read the csv files and compile each list into a new set.
I got the program to pick a random set but got stuck when trying to pull a random row from that set.
I am getting an error when I use .sample() to pull the item row which makes me think I don't understand how pandas works. I think I need to be creating new lists so I can later index and access the various statistics of the items once one is selected.
Once I pull the item, I intend to add effects that would change the displayed damage, armor, and so on. So I was thinking of having the new item be its own list and then using damage = item[2] + 3 or whatever I need.
The error is: AttributeError: 'list' object has no attribute 'sample'
Can anyone help with this problem? Maybe there is a better way to set up the data?
Here is my code so far:
import pandas as pd
import random

df = [pd.read_csv('weapons.csv'), pd.read_csv('armor.csv'), pd.read_csv('aether_infused.csv')]

def get_item():
    item_class = [random.choices(df, weights=(45,40,15), k=1)] # this part seemed to work. When I printed item_class it printed one of the entire lists at the correct odds
    item = item_class.sample()
    print(item) # to see if the program is working

get_item()
I think you are getting slightly confused between lists and list elements. This should work; I stubbed your DataFrames with simple ones:
import pandas as pd
import random

# Actual data. Comment it out if you do not have the csv files
df = [pd.read_csv('weapons.csv'), pd.read_csv('armor.csv'), pd.read_csv('aether_infused.csv')]

# My stubs -- uncomment and use this instead of the line above if you want to run this specific example
# df = [pd.DataFrame({'weapons' : ['w1','w2']}), pd.DataFrame({'armor' : ['a1','a2', 'a3']}), pd.DataFrame({'aether' : ['e1','e2', 'e3', 'e4']})]

def get_item():
    # I removed [] from the line below -- choices() already returns a list of length 1
    item_class = random.choices(df, weights=(45, 40, 15), k=1)
    # I added [0] to choose the first element of item_class, which is a list of length 1 from the line above
    item = item_class[0].sample()
    print(item)  # to see if the program is working

get_item()
This prints random rows from the random dataframes that I set up, such as:
  weapons
1      w2
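To build on this: since the question mentions wanting to tweak stats like damage after pulling an item, one option is to return the sampled row and work with it as a Series. A minimal, self-contained sketch using the stub data; the 'damage' column is an assumption for illustration, not from the original CSVs:
import pandas as pd
import random

# Stub data with an assumed 'damage' column, just for illustration
df = [pd.DataFrame({'weapons': ['w1', 'w2'], 'damage': [5, 7]}),
      pd.DataFrame({'armor': ['a1', 'a2', 'a3']}),
      pd.DataFrame({'aether': ['e1', 'e2', 'e3', 'e4']})]

def get_item():
    # Same weighted choice as in the answer; return the sampled row instead of printing it
    item_class = random.choices(df, weights=(45, 40, 15), k=1)
    return item_class[0].sample().iloc[0].copy()  # a single row as a pandas Series

item = get_item()
# Tweak a stat by column name rather than by positional index like item[2]
if 'damage' in item.index:
    item['damage'] = item['damage'] + 3
print(item)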

Import a Balance Sheet from SEC filings into a DataFrame in an automatic, organized manner

I am looking at getting the Balance Sheet data automatically and properly organized for any company using Beautiful Soup.
I am not planning on getting each variable individually but rather the whole Balance Sheet. Originally, I was writing a lot of code to extract the URL for a particular company of my choice.
For Example, suppose I want to get the Balance Sheet data from the following URL:
URL1:'https://www.sec.gov/Archives/edgar/data/1418121/000118518520000213/aple20191231_10k.htm'
or from
URL2:'https://www.sec.gov/Archives/edgar/data/1326801/000132680120000046/form8-k03312020earnings.htm'
I am trying to write a function (call it get_balancesheet(URL)) such that, regardless of the URL, you get a DataFrame that contains the balance sheet in an organized manner.
# Import libraries
import requests
import re
from bs4 import BeautifulSoup
I wrote the following function, which needs a lot of improvement:
def Get_Data_Balance_Sheet(url):
    page = requests.get(url)
    # Create a BeautifulSoup object
    soup = BeautifulSoup(page.content)
    futures1 = soup.find_all(text=re.compile('CONSOLIDATED BALANCE SHEETS'))
    Table = []
    for future in futures1:
        for row in future.find_next("table").find_all("tr"):
            t1 = [cell.get_text(strip=True) for cell in row.find_all("td")]
            Table.append(t1)
    # Remove list from list of lists if list is empty
    Table = [x for x in Table if x != []]
    return Table
Then I execute the following
url='https://www.sec.gov/Archives/edgar/data/1326801/000132680120000013/fb-12312019x10k.htm'
Tab=Get_Data_Balance_Sheet(url)
Tab
Note that this is not what I am aiming for. It is not simply about putting the result in a DataFrame; the function needs to work so that we can get the Balance Sheet regardless of which URL is used.
Well, this being EDGAR it's not going to be simple, but it's doable.
First things first: with the CIK you can extract specific filings of specific types made by the CIK filer during a specific period. So let's say you are interested in Forms 10-K and 10-Q, original or amended (as in "FORM 10-K/A", for example), filed by this CIK filer from 2019 through 2020.
start = 2019
end = 2020
cik = 220000320193
short_cik = str(cik)[-6:] #we will need it later to form urls
First we need to get a list of filings meeting these criteria and load it into beautifulsoup:
import requests
from bs4 import BeautifulSoup as bs
url = f"https://www.sec.gov/cgi-bin/srch-edgar?text=cik%3D%{cik}%22+AND+form-type%3D(10-q*+OR+10-k*)&first={start}&last={end}"
req = requests.get(url)
soup = bs(req.text,'lxml')
There are 8 filings meeting the criteria: two Forms 10-K and six Forms 10-Q. Each of these filings has an accession number. The accession number is hiding in the URL of each filing, and we need to extract it to get to the actual target: the Excel file containing the financial statements that is attached to each specific filing.
acc_nums = []
for link in soup.select('td>a[href]'):
    target = link['href'].split(short_cik, 1)
    if len(target) > 1:
        acc_num = target[1].split('/')[1]
        if not acc_num in acc_nums:  # we need this filter because each filing has two forms: text and html, with the same accession number
            acc_nums.append(acc_num)
At this point, acc_nums contains the accession number for each of these 8 filings. We can now download the target Excel file. Obviously, you can loop through acc_nums and download all 8, but let's say you are only looking for (randomly) the Excel file attached to the third filing:
fs_url = f"https://www.sec.gov/Archives/edgar/data/{short_cik}/{acc_nums[2]}/Financial_Report.xlsx"
fs = requests.get(fs_url)
with open('random_edgar.xlsx', 'wb') as output:
    output.write(fs.content)
And there you'll have more than you'll ever want to know about Apple's financials at that point in time...
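Since the question ultimately asks for a DataFrame, a possible follow-up (my sketch, not part of the answer above) is to load the downloaded workbook with pandas and look for the balance-sheet tab; the sheet names vary between filings, so the 'balance' filter below is only an assumption and you should inspect the names first:
import pandas as pd

# Load every sheet of the downloaded financial report into a dict of DataFrames
sheets = pd.read_excel('random_edgar.xlsx', sheet_name=None)
print(list(sheets.keys()))  # inspect the sheet names first

# Assumed heuristic: keep sheets whose name mentions "balance"
balance = {name: df for name, df in sheets.items() if 'balance' in name.lower()}
for name, df in balance.items():
    print(name)
    print(df.head())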

Scraping table data with BeautifulSoup or Pandas

I'm somewhat new to using Python, and I've been given a task that requires scraping data from a table. I do not know very much HTML either. I've never done this before and have spent a couple of days looking at various ways to scrape tables. Unfortunately, all of the examples seem to use a simpler webpage layout than what I'm dealing with. I've tried quite a few methods, but none of them allow me to select the table data that I need.
How would one scrape the table at the bottom of the following webpage under the "Daily Water Level" tab?
url = "https://apps.wrd.state.or.us/apps/gw/gw_info/gw_hydrograph/Hydrograph.aspx?gw_logid=HARN0052657"
I've tried using the methods in the following links and others not show here:
Beautiful Soup Scraping table
Scrape table with BeautifulSoup
Web scraping with BeautifulSoup
Some of the script I've tried:
from bs4 import BeautifulSoup
import requests
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
data = soup.find_all("table") # {"class": "xxxx"})
I've also tried using pandas, but I can't figure out how to select the table I need instead of the first table on the webpage that has the basic well information:
import pandas as pd
df_list = pd.read_html(url)
df_list
Unfortunately the data I need doesn't even show up when I run this script and the table I'm trying to select doesn't have a class that I can use to select only that table and not the table of basic well information. I've inspected the webpage, but can't seem to find a way to get to the correct table.
As far as the final result would look, I would need to export it as a csv or as a pandas data frame so that I can then graph it with modeled groundwater data for comparison purposes. Any suggestions would be greatly appreciated!
Try the approach below using Python requests: it is simple, straightforward, reliable, fast, and requires less code. I fetched the API URL from the website itself by inspecting the Network section of the Google Chrome browser's developer tools.
What exactly the script below is doing:
First it will take the API URL and do a GET request with the dynamic parameters (in CAPS); you can change the values of the well number and the start and end dates to get the desired result.
After getting the data, the script will parse the JSON data using json.loads.
It will iterate over the list of daily water level data and create a list of all the data points so that it can be used to create a CSV file, e.g. GW Log Id, GW Site Id, Land Surface Elevation, Record Date, etc.
Finally it will write all the headers and data into the CSV file. (Important: please make sure to set the file path in the file_path variable.)
import json
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
import csv

def scrap_daily_water_level():
    file_path = ''  # Input file path here
    file_name = 'daily_water_level_data.csv'  # File name

    # CSV headers
    csv_headers = ['Line #', 'GW Log Id', 'GW Site Id', 'Land Surface Elevation', 'Record Date', 'Restrict to OWRD only', 'Reviewed Status', 'Reviewed Status Description', 'Water level ft above mean sea level', 'Water level ft below land surface']
    list_of_water_readings = []

    # Dynamic params
    WELL_NO = 'HARN0052657'
    START_DATE = '1/1/1905'
    END_DATE = '12/30/2050'

    # API URL
    URL = 'https://apps.wrd.state.or.us/apps/gw/gw_data_rws/api/' + WELL_NO + '/gw_recorder_water_level_daily_mean_public/?start_date=' + START_DATE + '&end_date=' + END_DATE + '&reviewed_status=&restrict_to_owrd_only=n'

    response = requests.get(URL, verify=False)  # GET API call
    json_result = json.loads(response.text)  # Parse the JSON response
    print('Daily water level data count ', json_result['feature_count'])  # Prints the number of data points

    extracted_data = json_result['feature_list']  # Extracted data in JSON form
    for idx, item in enumerate(extracted_data):  # Iterate over the list of extracted data
        list_of_water_readings.append({  # Append and create a list of data with headers for further usage
            'Line #': idx + 1,
            'GW Log Id': item['gw_logid'],
            'GW Site Id': item['gw_site_id'],
            'Land Surface Elevation': item['land_surface_elevation'],
            'Record Date': item['record_date'],
            'Restrict to OWRD only': item['restrict_to_owrd_only'],
            'Reviewed Status': item['reviewed_status'],
            'Reviewed Status Description': item['reviewed_status_description'],
            'Water level ft above mean sea level': item['waterlevel_ft_above_mean_sea_level'],
            'Water level ft below land surface': item['waterlevel_ft_below_land_surface']
        })

    # Create the CSV and write data into it
    with open(file_path + file_name, 'a+') as daily_water_level_data_CSV:  # Open file in a+ mode
        csvwriter = csv.DictWriter(daily_water_level_data_CSV, delimiter=',', lineterminator='\n', fieldnames=csv_headers)
        print('Writing CSV header now...')
        csvwriter.writeheader()  # Write headers to the CSV file
        for item in list_of_water_readings:  # Iterate over the appended data and save it to the CSV file
            print('Writing data rows now..')
            print(item)
            csvwriter.writerow(item)

scrap_daily_water_level()
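Since the question also mentions wanting the data as a pandas DataFrame for graphing against modeled groundwater data, a possible variation (my sketch, not part of the answer above) is to build the frame directly from the parsed JSON instead of, or in addition to, writing the CSV; it assumes the list_of_water_readings assembled above is passed in or returned:
import pandas as pd

def to_dataframe(list_of_water_readings):
    # Build a DataFrame straight from the list of dicts assembled above
    df = pd.DataFrame(list_of_water_readings)
    # Parse dates and index by record date so the series is easy to plot
    df['Record Date'] = pd.to_datetime(df['Record Date'])
    return df.set_index('Record Date').sort_index()

# Example usage (assuming the readings list has been collected as in the script above):
# df = to_dataframe(list_of_water_readings)
# df['Water level ft below land surface'].plot()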

Navigating the html tree with BeautifulSoup and/or Selenium

I've just started using BeautifulSoup and came across an obstacle at the very beginning. I looked up similar posts but didn't find a solution to my specific problem, or there is something fundamental I’m not understanding. My goal is to extract Japanese words with their English translations and examples from this page.
https://iknow.jp/courses/566921
and save them in a dataFrame or a csv file.
I am able to see the parsed output and the content of some tags, but whenever I try requesting something with a class I'm interested in, I get no results. First I’d like to get a list of the Japanese words, and I thought I should be able to do it with:
import urllib.request
from bs4 import BeautifulSoup

url = ["https://iknow.jp/courses/566921"]
data = []
for pg in url:
    r = urllib.request.urlopen(pg)
    soup = BeautifulSoup(r, "html.parser")
    print(soup.find_all("a", {"class": "cue"}))
But I get nothing, and the same happens when I search for the response field:
responseList = soup.findAll('p', attrs={"class": "response"})
for word in responseList:
    print(word)
I tried moving down the tree by finding children but couldn't get to the text I want. I will be grateful for your help. The fields I'm trying to extract are shown in the screenshot referenced below.
After great help from jxpython, I've now stumbled upon a new challenge (perhaps this should be a new thread, but it's quite related, so maybe it's OK here). My goal is to create a dataframe or a csv file, each row containing a Japanese word, a translation, and examples with transliterations. With the lists created using:
driver.find_elements_by_class_name()
driver.find_elements_by_xpath()
I get lists with different numbers of elements, so it's not possible to easily create a dataframe:
# len(cues) 100
# len(responses) 100
# len(transliterations) 279  (a strange number, because some words don't have transliterations)
# len(texts) 200
# len(translations) 200
The transliterations list contains a mix of transliterations for single words and sentences. I think that to get the content to populate the first row of my dataframe I would need to loop through the
<li class="item">
content (XPath: /html/body/div[2]/div/div/section/div/section/div/div/ul/li[1]) and for each item extract the word with its translation, sentences and transliterations... I'm not sure if this would be the best approach though...
As an example, the information I would like to have in the first row of my dataframe (from the box highlighted in the screenshot) is:
行く, いく, go, 日曜日は図書館に行きます。, にちようび は としょかん に いきます。, I go to the library on Sundays.,私は夏休みにプールに行った。, わたし は なつやすみ に プール に いった。, I went to the pool during summer vacation.
The tags you are trying to scrape are not in the source code, probably because the page is JavaScript-rendered. Try this URL to see for yourself:
view-source:https://iknow.jp/courses/566921
The Python module Selenium solves this problem. If you would like I could write some code for you to start on.
Here is some code to start on:
from selenium import webdriver
url = 'https://iknow.jp/courses/566921'
driver = webdriver.Chrome()
driver.get(url)
driver.implicitly_wait(2)
cues = driver.find_elements_by_class_name('cue')
cues = [cue.text for cue in cues]
responses = driver.find_elements_by_class_name('response')
responses = [response.text for response in responses]
texts = driver.find_elements_by_xpath('//*[@class="sentence-text"]/p[1]')
texts = [text.text for text in texts]
transliterations = driver.find_elements_by_class_name('transliteration')
transliterations = [transliteration.text for transliteration in transliterations]
translations = driver.find_elements_by_class_name('translation')
translations = [translation.text for translation in translations]
driver.close()
Note: You first need to install a webdriver. I chose Chrome.
Here is a link: https://chromedriver.storage.googleapis.com/index.html?path=2.41/. Also add it to your PATH!
If you have any other questions let me know!
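For the later follow-up about the lists having different lengths, one way to keep each row aligned (an illustrative sketch only; the per-item structure of the page is my assumption, and only the class names come from the code above) is to loop over each li.item element and extract its fields together:
import pandas as pd
from selenium import webdriver

url = 'https://iknow.jp/courses/566921'
driver = webdriver.Chrome()
driver.get(url)
driver.implicitly_wait(2)

rows = []
# Assumption: each <li class="item"> wraps one word together with its sentences
for item in driver.find_elements_by_css_selector('li.item'):
    cue = item.find_element_by_class_name('cue').text
    response = item.find_element_by_class_name('response').text
    translation = item.find_element_by_class_name('translation').text
    sentences = [s.text for s in item.find_elements_by_class_name('sentence-text')]
    transliterations = [t.text for t in item.find_elements_by_class_name('transliteration')]
    rows.append({
        'word': cue,
        'reading': response,
        'translation': translation,
        # Join the variable-length lists so every row has the same columns
        'sentences': ' / '.join(sentences),
        'transliterations': ' / '.join(transliterations),
    })

driver.close()
df = pd.DataFrame(rows)
df.to_csv('iknow_course.csv', index=False)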
