Is there a method to collect data intelligently from a website? - python-3.x

I want to get data from this link: https://meshb.nlm.nih.gov/treeView
The problem is that to see the whole tree, you have to click the + on each line to expand that node's children,
but I want to display the whole tree with a single click and then copy all of its content.
Any ideas, please?

Well, it all depends on what you mean by "intelligently". Not sure if this meets the criteria, but you might want to try the following.
import json
import string

import requests

abc = string.ascii_uppercase
base_url = "https://meshb.nlm.nih.gov/api/tree/children/"
follow_url = "https://meshb.nlm.nih.gov/record/ui?ui="

tree = {}

for letter in abc[:1]:
    res = requests.get(f"{base_url}{letter}").json()
    tree[letter] = {
        "Records": [i["RecordName"] for i in res],
        "FollowURLS": [f"{follow_url}{i['RecordUI']}" for i in res],
    }

print(json.dumps(tree, indent=2))
This prints:
{
  "A": {
    "Records": [
      "Body Regions",
      "Musculoskeletal System",
      "Digestive System",
      "Respiratory System",
      "Urogenital System",
      "Endocrine System",
      "Cardiovascular System",
      "Nervous System",
      "Sense Organs",
      "Tissues",
      "Cells",
      "Fluids and Secretions",
      "Animal Structures",
      "Stomatognathic System",
      "Hemic and Immune Systems",
      "Embryonic Structures",
      "Integumentary System",
      "Plant Structures",
      "Fungal Structures",
      "Bacterial Structures",
      "Viral Structures"
    ],
    "FollowURLS": [
      "https://meshb.nlm.nih.gov/record/ui?ui=D001829",
      "https://meshb.nlm.nih.gov/record/ui?ui=D009141",
      "https://meshb.nlm.nih.gov/record/ui?ui=D004064",
      "https://meshb.nlm.nih.gov/record/ui?ui=D012137",
      "https://meshb.nlm.nih.gov/record/ui?ui=D014566",
      "https://meshb.nlm.nih.gov/record/ui?ui=D004703",
      "https://meshb.nlm.nih.gov/record/ui?ui=D002319",
      "https://meshb.nlm.nih.gov/record/ui?ui=D009420",
      "https://meshb.nlm.nih.gov/record/ui?ui=D012679",
      "https://meshb.nlm.nih.gov/record/ui?ui=D014024",
      "https://meshb.nlm.nih.gov/record/ui?ui=D002477",
      "https://meshb.nlm.nih.gov/record/ui?ui=D005441",
      "https://meshb.nlm.nih.gov/record/ui?ui=D000825",
      "https://meshb.nlm.nih.gov/record/ui?ui=D013284",
      "https://meshb.nlm.nih.gov/record/ui?ui=D006424",
      "https://meshb.nlm.nih.gov/record/ui?ui=D004628",
      "https://meshb.nlm.nih.gov/record/ui?ui=D034582",
      "https://meshb.nlm.nih.gov/record/ui?ui=D018514",
      "https://meshb.nlm.nih.gov/record/ui?ui=D056229",
      "https://meshb.nlm.nih.gov/record/ui?ui=D056226",
      "https://meshb.nlm.nih.gov/record/ui?ui=D056224"
    ]
  }
}
If you want all of it, just remove [:1] from the loop. If there's no entry for a given letter on the page, you'll get, well, an empty entry in the dictionary.
Obviously, you can also dump the entire response; the above is just a proof of concept.
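If you need the nested structure rather than just the top level, a recursive variant is a natural extension. This is only a sketch and assumes the children endpoint also accepts a TreeNumber such as A01 (the second answer below relies on the same endpoint); fetch_subtree is a hypothetical helper, not part of the site's API.
import requests

base_url = "https://meshb.nlm.nih.gov/api/tree/children/"

def fetch_subtree(tree_number):
    # Assumption: the endpoint accepts a TreeNumber (e.g. "A01") just like a top-level letter.
    children = requests.get(f"{base_url}{tree_number}").json()
    return [
        {
            "Name": child["RecordName"],
            "UI": child["RecordUI"],
            "Children": fetch_subtree(child["TreeNumber"]) if child.get("HasChildren") else [],
        }
        for child in children
    ]

# Expand everything under the "A" branch (this makes one request per non-leaf node, so it is slow).
anatomy = fetch_subtree("A")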

Try this; some parts are a bit tricky, but it manages to give you the whole tree:
import requests as r
import operator
import string

link = 'https://meshb.nlm.nih.gov/api/tree/children/{}'

all_data = []
for i in string.ascii_uppercase:
    all_data.append({'RecordName': i, 'RecordUI': '', 'TreeNumber': i, 'HasChildren': True})
    res = r.get(link.format(i))
    data_json = res.json()
    all_data += data_json

# This request gets all the remaining data (everything below the top-level A-Z nodes) at once.
# It takes a while to load depending on your network; the response is 3 million+ characters.
res = r.get(link.format('.*'))
data_json = res.json()
all_data += data_json

# Sort the data by TreeNumber so that parents come right before their children.
all_data.sort(key=operator.itemgetter('TreeNumber'))

# Print the tree, indenting each row with tabs according to the depth of its TreeNumber.
for row in all_data:
    l = len(row['TreeNumber'])
    if l == 3:
        print('\t', end='')
    elif l > 3:
        print('\t' * (len(row['TreeNumber'].split('.')) + 1), end='')
    print(row['RecordName'])

Related

Pagination not working in Python Session.put()

I am trying to upload a file to a website (that has an inbuilt API) using the following code. The code reads a list of medical codes/diagnoses codes etc. (1 column in a text file) and uploads it to the required page.
Issue:
After uploading the file, I noticed that the number of pages is not coming out properly. There can be up to 4000 codes (lines) in the file. The code list page on the website shows 20 lines per page, which means I would expect at least 200 pages after uploading. This is not happening, and I am not sure what mistake I am making.
Also, I am new to Python (my background is primarily SAS) and have been working on automating bits and pieces of code; this exercise is one such automation. Here, the goal is to upload multiple files to the said URL. Today the team uploads them one by one manually. With the knowledge I picked up from tutorials and other sources, I was able to come up with this.
import requests
import json
import os
import random
import pandas as pd
import time

token = os.environ.get("USER_TOKEN")
user_id = os.environ.get("USER_ID")
user_name = os.environ.get("USER_NAME")
headers = {"X-API-Key": token}
url = 'https://XXXXXXXXXXXX.com/api/code_lists'

session = requests.session()
cl = session.get(url, headers=headers).json()

def uploading_files(file, name, kind, coding_system, rand_id):
    df = pd.read_table(file, converters={0: str}, header=None)
    print("Came In")
    CODES = df[0].astype('str').tolist()
    codes = {"codes": CODES}
    new_cl = {"_id": rand_id, "name": name, "project_group": "TEST BETA", "kind": kind,
              "coding_system": coding_system, "user": user_id, "creator": user_name,
              "creation_method": "Upload", "is_category_mapping": False,
              "assoc_users": [], "global": True, "readonly": False, "description": "",
              "num_codes": len(CODES)}
    request_json = json.dumps(new_cl)
    print(request_json)
    codes_json = json.dumps(codes)
    print(codes_json)
    session.post(url, data=request_json)
    session.put(url + '/' + rand_id, data=codes_json)

text_Files = os.listdir(r'C://Users//XXXXXXXXXXXXX//data')
for i in text_Files:
    if ".txt" in i:
        x = i.split("_")
        file = 'C://Users//XXXXXXXXXXXXX//data//' + i
        name = ""
        for j in i[:-4]:
            if j != "_":
                name += j
            elif j == "_":
                name += " "
        kind = x[2]
        coding_system = x[3][:-4]
        rand_id = "".join(random.choice("0123456789abcdef") for i in range(24))
        print("-------------START-----------------")
        print("file : ", file)
        print("name : ", name)
        print("kind : ", kind)
        print("coding system : ", coding_system)
        print("Rand_Id : ", rand_id)
        uploading_files(file, name, kind, coding_system, rand_id)
        time.sleep(2)
        print("---------------END---------------")
        print("")
        break  # to upload only 1 file in the directory
Example data in the file (testfile.txt)
C8900
C8901
C8902
C8903
C8904
C8905
C8906
C8907
C8908
C8909
C8910
C8911
C8912
C8913
C8914
C8918
C8919
C8920
C8921
C8922
C8923
C8924
C8925
C8926
C8927
C8928
C8929
C8930
C8931
C8932
C8933
C8934
C8935
C8936
C9723
C9744
C9762
C9763
C9803
D0260
(Screenshots in the original post: a sample data snapshot, the wrong pagination shown after upload, and the expected result.)
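No answer is included here, but one detail stands out in the code above: the initial GET passes headers=headers, while the POST and PUT do not, and the JSON payloads are sent via data= without a Content-Type header. A minimal sketch of how those calls might look instead, assuming the API expects application/json and the same X-API-Key on every request (an assumption, not something confirmed in the question):
# Assumption: the API wants the X-API-Key header on every request, not just the GET.
session.headers.update(headers)

# Let requests serialize the payloads and set Content-Type: application/json.
session.post(url, json=new_cl)
session.put(url + '/' + rand_id, json=codes)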

How can I filter search results using Scrapy

I am new to scraping and I am trying to scrape data from this website: https://seffaflik.epias.com.tr/transparency/uretim/gerceklesen-uretim/gercek-zamanli-uretim.xhtml
When I try to get data without applying filters, everything works. But the data I need should be for a specific power plant and date, and I am having a hard time figuring out why I cannot apply the filters.
import scrapy
from scrapy.http import FormRequest
from ..items import EpiasscrapingItem

class EpiasSpider(scrapy.Spider):
    name = 'epias'
    start_urls = [
        'https://seffaflik.epias.com.tr/transparency/uretim/gerceklesen-uretim/gercek-zamanli-uretim.xhtml'
    ]

    def parse(self, response):
        return FormRequest.from_response(response, formdata={
            'j_idt205': 'j_idt205',
            'j_idt205:date1_input': '20.03.2021',
            'j_idt205:date2_input': '20.03.2021',
            'j_idt205:powerPlant_input': '2614',
        }, callback=self.start_scraping)

    def start_scraping(self, response):
        items = EpiasscrapingItem()
        table_epias = response.css('.ui-datatable-odd')
        for epias in table_epias:
            date = epias.css('.ui-widget-content .TexAlCenter:nth-child(1)').css('::text').extract()
            time = epias.css('.ui-widget-content .TexAlCenter:nth-child(2)').css('::text').extract()
            biogas = epias.css('.ui-widget-content .TexAlCenter:nth-child(15)').css('::text').extract()
            items['date'] = date
            items['time'] = time
            items['biogas'] = biogas
            yield items
You forgot to include javax.faces.ViewState and a few other fields in the parameters that are supposed to be sent with the POST request. You can change the values of date1_input, date2_input and powerPlant_input to fetch the relevant content. The following script should work:
import scrapy

class EpiasSpider(scrapy.Spider):
    name = 'epias'
    start_urls = [
        'https://seffaflik.epias.com.tr/transparency/uretim/gerceklesen-uretim/gercek-zamanli-uretim.xhtml'
    ]
    post_url = 'https://seffaflik.epias.com.tr/transparency/uretim/gerceklesen-uretim/gercek-zamanli-uretim.xhtml'

    def parse(self, response):
        payload = {
            'j_idt205': 'j_idt205',
            'j_idt205:date1_input': '11.02.2021',
            'j_idt205:date2_input': '20.03.2021',
            'j_idt205:powerPlant_focus': '',
            'j_idt205:powerPlant_input': '2336',
            'j_idt205:goster': '',
            'j_idt205:dt_rppDD': '24',
            'javax.faces.ViewState': response.css(".ContainerIndent input[name='javax.faces.ViewState']::attr(value)").get()
        }
        yield scrapy.FormRequest(self.post_url, formdata=payload, callback=self.parse_content)

    def parse_content(self, response):
        for epias in response.css('.ui-datatable-odd'):
            items = {}
            date = epias.css('tr.ui-widget-content > .TexAlCenter:nth-child(1)::text').get()
            time = epias.css('tr.ui-widget-content > .TexAlCenter:nth-child(2)::text').get()
            total = epias.css('tr.ui-widget-content > .TexAlCenter:nth-child(3)::text').get()
            items['date'] = date
            items['time'] = time
            items['total'] = total
            yield items

Pygal bar chart says “No data”

I am trying to create a bar graph in pygal that uses the Hacker News API and charts the most active news items based on comments. I posted my code below, but I cannot figure out why my graph keeps saying "No data". Any suggestions? Thanks!
import requests
import pygal
from pygal.style import LightColorizedStyle as LCS, LightenStyle as LS
from operator import itemgetter

# Make an API call, and store the response.
url = 'https://hacker-news.firebaseio.com/v0/topstories.json'
r = requests.get(url)
print("Status code:", r.status_code)

# Process information about each submission.
submission_ids = r.json()
submission_dicts = []
for submission_id in submission_ids[:30]:
    # Make a separate API call for each submission.
    url = ('https://hacker-news.firebaseio.com/v0/item/' +
           str(submission_id) + '.json')
    submission_r = requests.get(url)
    print(submission_r.status_code)
    response_dict = submission_r.json()
    submission_dict = {
        'comments': int(response_dict.get('descendants', 0)),
        'title': response_dict['title'],
        'link': 'http://news.ycombinator.com/item?id=' + str(submission_id),
    }
    submission_dicts.append(submission_dict)

# Visualization
my_style = LS('#336699', base_style=LCS)
my_config = pygal.Config()
my_config.show_legend = False
my_config.title_font_size = 24
my_config.label_font_size = 14
my_config.major_label_font_size = 18
my_config.show_y_guides = False
my_config.width = 1000

chart = pygal.Bar(my_config, style=my_style)
chart.title = 'Most Active News on Hacker News'
chart.add('', submission_dicts)
chart.render_to_file('hn_submissons_repos.svg')
The values in the array passed to the add function need to be either numbers or dicts that contain the key value (or a mixture of the two). The simplest solution would be to change the keys used when creating submission_dict:
submission_dict = {
    'value': int(response_dict.get('descendants', 0)),
    'label': response_dict['title'],
    'xlink': 'http://news.ycombinator.com/item?id=' + str(submission_id),
}
Notice that link has become xlink, this is one of the optional parameters that are defined in the Value Configuration section of the pygal docs.
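Alternatively, if you would rather not rename the keys at the point where submission_dict is built, you can translate them just before plotting. A small sketch (the sorting step is an addition, not something from the original question):
# Sort so the most commented submissions come first (optional, purely cosmetic).
submission_dicts.sort(key=itemgetter('comments'), reverse=True)

plot_dicts = [
    {
        'value': d['comments'],  # bar height
        'label': d['title'],     # tooltip text
        'xlink': d['link'],      # clickable link embedded in the SVG
    }
    for d in submission_dicts
]

chart.add('', plot_dicts)
chart.render_to_file('hn_submissons_repos.svg')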

How to create a list of dictionaries in this code?

I have some names and scores as follows
input = {
    'Maths': dict(Mohsen=19, Sadegh=18, Hafez=15),
    'Physics': dict(Sadegh=16, Hafez=17, Mohsen=17),
    'Chemistry': dict(Hafez=13),
    'Literature': dict(Sadegh=14),
    'Biology': dict(Mohsen=16, Sadegh=10),
}
If a person doesn't have a score for a lesson, that score should be considered zero. I also want to compute each person's average score and sort the final list by that average. I want to get an output like this:
answer = [
    dict(Name='Sadegh', Literature=14, Chemistry=0, Maths=18, Physics=16, Biology=10, Average=11.6),
    dict(Name='Mohsen', Maths=19, Physics=17, Chemistry=0, Biology=16, Literature=0, Average=10.4),
    dict(Name='Hafez', Chemistry=13, Biology=0, Physics=17, Literature=0, Maths=15, Average=9),
]
How can I do it?
Essentially, you have a dictionary in which the information is arranged by subject, and for each subject you have student marks. You want to collect all the information related to each student into a separate dictionary.
One approach you can try is as follows:
Convert the data you have into student-specific data, and then calculate the average of that student's marks across all subjects. There is sample code below.
Please note that this is just a sample and you should try out a solution by yourself. There are many alternative ways of doing it, and you should explore them on your own.
The code below works with Python 2.7.
from __future__ import division

def convert_subject_data_to_student_data(subject_dict):
    student_dict = {}
    for k, v in subject_dict.items():
        for k1, v1 in v.items():
            if k1 not in student_dict:
                student_dict[k1] = {k: v1}
            else:
                student_dict[k1][k] = v1

    student_list = []
    for k, v in student_dict.items():
        st_dict = {}
        st_dict['Name'] = k
        st_dict['Average'] = sum(v.itervalues()) / len(v.keys())
        st_dict.update(v)
        student_list.append(st_dict)
    print student_list

if __name__ == "__main__":
    subject_dict = {
        'Maths': dict(Mohsen=19, Sadegh=18, Hafez=15),
        'Physics': dict(Sadegh=16, Hafez=17, Mohsen=17),
        'Chemistry': dict(Hafez=13),
        'Literature': dict(Sadegh=14),
        'Biology': dict(Mohsen=16, Sadegh=10),
    }
    convert_subject_data_to_student_data(subject_dict)
sample_input = {
    'Maths': dict(Mohsen=19, Sadegh=18, Hafez=15),
    'Physics': dict(Sadegh=16, Hafez=17, Mohsen=17),
    'Chemistry': dict(Hafez=13),
    'Literature': dict(Sadegh=14),
    'Biology': dict(Mohsen=16, Sadegh=10),
}

def foo(lessons):
    result = {}
    for lesson in lessons:
        for user in lessons[lesson]:  # each inner dictionary maps student -> score
            if result.get(user):
                result.get(user).setdefault(lesson, lessons[lesson].get(user, 0))
            else:
                result.setdefault(user, dict(name=user))
                result.get(user).setdefault(lesson, lessons[lesson].get(user, 0))
    # return list(result.values())
    return result.values()

# if __name__ == '__main__':
print(foo(sample_input))
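Neither snippet above fills in zeros for missing subjects or sorts the result by average, which the question asks for. A minimal Python 3 sketch that does both, using the question's sample data (build_report is just an illustrative name):
def build_report(scores_by_subject):
    subjects = list(scores_by_subject)
    students = {name for marks in scores_by_subject.values() for name in marks}
    report = []
    for name in students:
        row = {'Name': name}
        for subject in subjects:
            # Missing subjects count as zero, as the question requires.
            row[subject] = scores_by_subject[subject].get(name, 0)
        row['Average'] = sum(row[s] for s in subjects) / len(subjects)
        report.append(row)
    # Sort by average score, highest first.
    report.sort(key=lambda row: row['Average'], reverse=True)
    return report

print(build_report(sample_input))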

AvroTypeException: When writing in python3

My avsc file is as follows:
{"type":"record",
"namespace":"testing.avro",
"name":"product",
"aliases":["items","services","plans","deliverables"],
"fields":
[
{"name":"id", "type":"string" ,"aliases":["productid","itemid","item","product"]},
{"name":"brand", "type":"string","doc":"The brand associated", "default":"-1"},
{"name":"category","type":{"type":"map","values":"string"},"doc":"the list of categoryId, categoryName associated, send Id as key, name as value" },
{"name":"keywords", "type":{"type":"array","items":"string"},"doc":"this helps in long run in long run analysis, send the search keywords used for product"},
{"name":"groupid", "type":["string","null"],"doc":"Use this to represent or flag value of group to which it belong, e.g. it may be variation of same product"},
{"name":"price", "type":"double","aliases":["cost","unitprice"]},
{"name":"unit", "type":"string", "default":"Each"},
{"name":"unittype", "type":"string","aliases":["UOM"], "default":"Each"},
{"name":"url", "type":["string","null"],"doc":"URL of the product to return for more details on product, this will be used for event analysis. Provide full url"},
{"name":"imageurl","type":["string","null"],"doc":"Image url to display for return values"},
{"name":"updatedtime", "type":"string"},
{"name":"currency","type":"string", "default":"INR"},
{"name":"image", "type":["bytes","null"] , "doc":"fallback in case we cant provide the image url, use this judiciously and limit size"},
{"name":"features","type":{"type":"map","values":"string"},"doc":"Pass your classification attributes as features in key-value pair"}
]}
I am able to parse this, but when I try to write a record against it as follows, I keep getting the exception. What am I missing? This is in Python 3. I verified that the schema is well-formatted JSON, too.
from avro import schema as sc
from avro import datafile as df
from avro import io as avio
import os

_prodschema = 'product.avsc'
_namespace = 'testing.avro'

dirname = os.path.dirname(__file__)
avroschemaname = os.path.join(os.path.dirname(__file__), _prodschema)

sch = {}
with open(avroschemaname, 'r') as f:
    sch = f.read().encode(encoding='utf-8')
    f.close()

proschema = sc.Parse(sch)
print("Schema processed")

writer = df.DataFileWriter(open(os.path.join(dirname, "products.json"), 'wb'),
                           avio.DatumWriter(), proschema)
print("Just about to append the json")
writer.append({
    "id": "23232",
    "brand": "Relaxo",
    "category": [{"123": "shoe", "122": "accessories"}],
    "keywords": ["relaxo", "shoe"],
    "groupid": "",
    "price": "799.99",
    "unit": "Each",
    "unittype": "Each",
    "url": "",
    "imageurl": "",
    "updatedtime": "03/23/2017",
    "currency": "INR",
    "image": "",
    "features": [{"color": "black", "size": "10", "style": "contemperory"}]
})
writer.close()
What am I missing here?
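No answer is included for this one, but comparing the datum with the schema suggests several type mismatches that would raise AvroTypeException: category and features are declared as Avro maps (so they should be plain dicts, not lists of dicts), price is a double (so a float, not a string), and image is bytes-or-null (so a bytes value, not a str). A sketch of a record matching the declared types, with the same illustrative values (this is an inference from the schema, not a confirmed fix):
writer.append({
    "id": "23232",
    "brand": "Relaxo",
    "category": {"123": "shoe", "122": "accessories"},  # map, not a list of dicts
    "keywords": ["relaxo", "shoe"],
    "groupid": "",
    "price": 799.99,                                     # double, not a string
    "unit": "Each",
    "unittype": "Each",
    "url": "",
    "imageurl": "",
    "updatedtime": "03/23/2017",
    "currency": "INR",
    "image": b"",                                        # bytes, not a str
    "features": {"color": "black", "size": "10", "style": "contemperory"},
})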
