I want to download all links/titles of papers from the web using rvest. I used the following script, but the resulting list is empty. Any suggestions?
library(rvest)
1. Download the HTML and turn it into an XML document with read_html()
Papers <- read_html("https://papers.ssrn.com/sol3/JELJOUR_Results.cfm?npage=1&form_name=journalBrowse&journal_id=1475407&Network=no&lim=false")
2. Extract specific nodes with html_nodes()
Titles <- html_nodes(Papers, "span.optClickTitle")
You are close, try .optClickTitle instead of span.optClickTitle:
library(magrittr)
library(tibble)
library(rvest)
#> Loading required package: xml2
url <- "https://papers.ssrn.com/sol3/JELJOUR_Results.cfm?npage=1&form_name=journalBrowse&journal_id=1475407&Network=no&lim=false"
raw <- read_html(url)
parse_link <- function(x) {
  tibble(title = html_text(x),
         link = html_attr(x = x, name = "href"))
}
raw %>%
  html_nodes(".optClickTitle") %>%
  parse_link()
#> # A tibble: 60 x 2
#> title link
#> <chr> <chr>
#> 1 The Nature of Man https://ssrn.com/abst…
#> 2 The Dynamics of Crowdfunding: An Exploratory St… https://ssrn.com/abst…
#> 3 Some Simple Economics of the Blockchain https://ssrn.com/abst…
#> 4 "Some Simple Economics of the Blockchain\r\n\t\… https://ssrn.com/abst…
#> 5 "Some Simple Economics of the Blockchain\r\n\t\… https://ssrn.com/abst…
#> 6 Bitcoin: An Innovative Alternative Digital Curr… https://ssrn.com/abst…
#> 7 Piracy and Box Office Movie Revenues: Evidence … https://ssrn.com/abst…
#> 8 The sharing economy: Why people participate in … https://ssrn.com/abst…
#> 9 Consumer Acceptance and Use of Information Tech… https://ssrn.com/abst…
#> 10 What Makes Online Content Viral? https://ssrn.com/abst…
#> # ... with 50 more rows
Created on 2018-09-28 by the reprex package (v0.2.1)
from cgitb import text
from bs4 import BeautifulSoup
import requests
website = 'https://www.marketplacehomes.com/rent-a-home/'
result = requests.get(website)
content = result.text
soup = BeautifulSoup(content, 'html.parser')
lists = soup.find_all('div', class_=('tt-rental-row'))
for list in lists:
    location = list.find('span', class_="renta;-adress")
    beds = list.find('span', class_="renta;-beds")
    baths = list.find('span', class_="renta;-beds")
    availability = list.find('span', class_="rental-date-available")
    info = [location, beds, baths, availability]
    print(info)
If I try to run the last line of code, I get:
"IndentationError: expected an indented block"
If I try to run each indentation separately I get:
">>> location = list.find('span', class_="renta;-adress")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: type object 'list' has no attribute 'find'"
I'm new to Python and I'm kinda stuck, can anyone please help me?
Note: Your code never runs the for-loop because your selection never matches the elements in the HTML. They are generated dynamically from data in another resource, and requests does not render websites like a browser does; it only gives you the static content of the response.
Also be aware not to shadow built-in names, as they will cause errors; in your case list.find() raises one because the type object 'list' has no attribute called find. You can simply check these things using type():
type(soup)
-> it's a bs4.BeautifulSoup
type(soup.find_all('div', class_='tt-rental-row'))
-> it's a bs4.element.ResultSet
type(list)
-> it's a type
So how do you reach your goal?
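One way is to request the JSON endpoint the page pulls its listings from directly (you can spot it in the browser's network tab). A minimal sketch, assuming the endpoint returns a list of objects with the field names shown in the DataFrame output below:
import requests

# JSON endpoint the page loads its rental data from; plain requests can
# fetch it because it is static JSON, no JavaScript rendering needed.
url = 'https://app.tenantturner.com/listings-json/2679'
listings = requests.get(url).json()

for item in listings:
    # field names as they appear in the DataFrame output below
    print(item['address'], item['beds'], item['baths'], item['dateAvailable'])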
You could also use pandas to directly create a DataFrame and slice it to your needs:
import pandas as pd
pd.read_json('https://app.tenantturner.com/listings-json/2679')
Output:
id dateActivated latitude longitude address city state zip photo title ... baths dateAvailable rentAmount acceptPets applyUrl btnUrl btnText virtualTour propertyType enableWaitlist
0 83600 8/22/2022 35.750499 -86.393972 4481 Jack Faulk St Murfreesboro TN 37127 https://ttimages.blob.core.windows.net/propert... 4481 Jack Faulk St ... 2.0 Now 2195 cats, small dogs, large dogs https://app.propertyware.com/pw/application/#/... https://app.tenantturner.com/qualify/4481-jack... Schedule Viewing None Single Family False
1 100422 8/31/2022 30.277607 -95.472842 213 Skybranch Court Conroe TX 77304 https://ttimages.blob.core.windows.net/propert... 213 Skybranch Court ... 2.5 Now 2100 cats, small dogs, large dogs https://app.propertyware.com/pw/application/#/... https://app.tenantturner.com/qualify/213-skybr... Schedule Viewing None Condo Unit False
2 106976 7/27/2022 28.274720 -82.298077 8127 Olive Brook Dr Wesley Chapel FL 33545 https://ttimages.blob.core.windows.net/propert... 8127 Olive Brook Dr ... 2.0 Now 2650 no pets https://app.propertyware.com/pw/application/#/... https://app.tenantturner.com/qualify/8127-oliv... Schedule Viewing None Single Family False
3 116188 8/15/2022 42.624023 -83.144614 735 Grace Ave Rochester Hills MI 48307 https://ttimages.blob.core.windows.net/propert... 735 Grace Ave ... 2.0 Now 1600 cats, small dogs, large dogs https://app.propertyware.com/pw/application/#/... https://app.tenantturner.com/qualify/735-grace... Schedule Viewing None Single Family False
4 126846 8/22/2022 32.046455 -81.071181 1810 E 41st St Savannah GA 31404 https://ttimages.blob.core.windows.net/propert... 1810 E 41st St ... 1.0 Now 1395 small dogs https://app.propertyware.com/pw/application/#/... https://app.tenantturner.com/qualify/1810-e-41... Schedule Viewing None Single Family True
...
91 rows × 22 columns
Example:
To show only specific columns, simply pass a list of their names.
import pandas as pd
pd.read_json('https://app.tenantturner.com/listings-json/2679')[['address', 'beds', 'baths', 'dateAvailable']]
Output
address beds baths dateAvailable
0 4481 Jack Faulk St 4 2.0 Now
1 213 Skybranch Court 3 2.5 Now
2 8127 Olive Brook Dr 3 2.0 Now
3 735 Grace Ave 3 2.0 Now
4 1810 E 41st St 3 1.0 Now
... ... ... ... ...
91 rows × 4 columns
Since list is the name of a built-in type in Python, you shouldn't use it as a variable name; try another name:
for myList in lists:
    location = myList.find('span', class_="renta;-adress")
    beds = myList.find('span', class_="renta;-beds")
    baths = myList.find('span', class_="renta;-beds")
    availability = myList.find('span', class_="rental-date-available")
    info = [location, beds, baths, availability]
    print(info)
I am trying to normalize weight units in a string.
Eg:
1. SUCO MARACUJA COM GENGIBRE PCS 300 Millilitre - SUCO MARACUJA COM GENGIBRE PCS 300 ML
2. OVOS CAIPIRAS ANA MARIA BRAGA 10UN - OVOS CAIPIRAS ANA MARIA BRAGA 10U
3. SUCO MARACUJA MAMAO PCS 300 Gram - SUCO MARACUJA MAMAO PCS 300 G
4. SUCO ABACAXI COM MACA PCS 300Milli litre - SUCO ABACAXI COM MACA PCS 300ML
The keyword table is:
unit = ['Kilo','Kilogram','Gram','Milligram','Millilitre','Milli litre',
        'Dozen','Litre','Un','Und','Unid','Unidad','Unidade','Unidades']
norm_unit = ['KG','KG','G','MG','ML','ML','DZ','L','U','U','U','U','U','U']
I tried to take up these lists as a table but am having difficulty in comparing two dataframes or tables in python.
I tried the below code.
import re

unit = ['Kilo','Kilogram','Gram','Milligram','Millilitre','Milli litre',
        'Dozen','Litre','Un','Und','Unid','Unidad','Unidade','Unidades']
norm_unit = ['KG','KG','G','MG','ML','ML','DZ','L','U','U','U','U','U','U']
z = 'SUCO MARACUJA COM GENGIBRE PCS 300 Millilitre'
#for row in mongo_docs:
#    z = row['clean_hntproductname']
for x in unit:
    for y in norm_unit:
        if (re.search(r'\s' + x + r'$', z, re.I)):
            pass
            # clean_hntproductname = t.lower().replace(x.lower(), y.lower())
            # myquery3 = { "_id" : row['_id']}
            # newvalues3 = { "$set": {"clean_hntproductname" : 'clean_hntproductname'} }
            # ds_hnt_prod_data.update_one(myquery3, newvalues3)
I'm using Python(Jupyter) with MongoDb(Compass). Fetching data from Mongo and writing back to it.
From my understanding you want to:
Update all the rows in a table which contain the words in the unit array, to the ones in norm_unit.
(Disclaimer: I'm not familiar with MongoDB or Python.)
What you want is to create a mapping (using a hash) of the words you want to change.
Here's a trivial solution (i.e. not the best solution, but it should point you in the right direction).
unit_conversions = {
    'Kilo': 'KG',
    'Kilogram': 'KG',
    'Gram': 'G'
}
# pseudo-code
for each row that you want to update:
    item_description = get the value of the string in the column
    for each key in unit_conversions (e.g. 'Kilo'):
        see if the item_description contains the key
        if it does, replace it with unit_conversions[key] (e.g. 'KG')
    update the row
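To make that pseudo-code concrete, here is a minimal Python sketch of the same idea, built from the unit/norm_unit lists in the question; it anchors the match to the end of the string, as the question's re.search attempt does:
import re

# The unit/norm_unit pairs from the question collapsed into one dictionary.
unit_map = {
    'Kilo': 'KG', 'Kilogram': 'KG', 'Gram': 'G', 'Milligram': 'MG',
    'Millilitre': 'ML', 'Milli litre': 'ML', 'Dozen': 'DZ', 'Litre': 'L',
    'Un': 'U', 'Und': 'U', 'Unid': 'U', 'Unidad': 'U',
    'Unidade': 'U', 'Unidades': 'U',
}

def normalize(text):
    # Try longer spellings first so 'Millilitre' wins over 'Litre'.
    for word in sorted(unit_map, key=len, reverse=True):
        # Replace the unit only at the end of the string, ignoring case.
        new, n = re.subn(re.escape(word) + r'$', unit_map[word], text, flags=re.I)
        if n:
            return new
    return text

print(normalize('SUCO MARACUJA COM GENGIBRE PCS 300 Millilitre'))
# -> SUCO MARACUJA COM GENGIBRE PCS 300 ML
The Mongo update would then set clean_hntproductname to normalize(row['clean_hntproductname']) for each document.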
Given a dataframe as follows:
firstname lastname email_address \
0 Doug Watson douglas.watson@dignityhealth.org
1 Nick Holekamp nick.holekamp@rankenjordan.org
2 Rob Schreiner rob.schriener@wellstar.org
3 Austin Phillips austin.phillips@precmed.com
4 Elise Geiger egeiger@puracap.com
5 Paul Urick purick@diplomatpharmacy.com
6 Michael Obringer michael.obringer@lashgroup.com
7 Craig Heneghan cheneghan@west-ward.com
8 Kathy Hirst kathleen.hirst@sunovion.com
9 Stefan Bluemmers stefan.bluemmers@grunenthal.com
companyname
0 Dignity Health
1 Ranken Jordan Pediatric Bridge Hospital
2 WellStar Health System
3 Precision Medical Products, Inc.
4 puracap.com
5 Diplomat Specialty Pharmacy
6 Lash Group
7 West-Ward Pharmaceuticals
8 Sunovion Pharmaceuticals
9 Grünenthal Group
How could I create possible email addresses using common email patterns, such as: firstlast@example.com, first.last@example.com, f.last@example.com, lastF@example.com, first_last@example.com, firstL@example.com, etc.?
df['email1'] = df.firstname.str.lower() + '.' + df.lastname.str.lower() + '@' + df.companyname.str.replace(r'\s+', '', regex=True).str.lower() + '.com'
print(df['email1'])
Out:
0 doug.watson@dignityhealth.com
1 nick.holekamp@rankenjordanpediatricbridgehospi... --->problematic
2 rob.schreiner@wellstarhealthsystem.com
3 austin.phillips@precisionmedicalproducts,inc..com --->problematic
4 elise.geiger@puracap.com.com --->problematic
...
9995 terry.hanley@kempersportsmanagement.com
9996 christine.marks@geocomp.com
9997 darryl.rickner@doe.com
9998 lalit.sharma@lovelylifestyle.com
9999 parul.dutt@infibeam.com
Some of them seems quite problematic, anyone could help to solve this issue? Thanks a lot.
EDITED:
print(df) after applying @Sajith Herath's solution:
Out:
firstname lastname companyname \
0 Nick Holekamp Ranken ...
email
0 nick. ...
You can use a method that creates permutations of the username with different separators, and define a max length beyond which the domain built from the company name is simplified, as follows:
import pandas as pd
import random
import re

data = {"firstname": ["Nick"], "lastname": ["Holekamp"],
        "companyname": ["Ranken Jordan Pediatric Bridge Hospital"]}
df = pd.DataFrame(data=data)
max_char = 5
emails = []

def simplify_domain(text):
    if len(text) > max_char:
        # abbreviate long company names to their capital letters
        text = ''.join([c for c in text if c.isupper()])
        return text.lower()
    return re.sub(r"\s+", "", text).lower()

def username_permutations(first_name, last_name):
    # define separators
    separators = [".", "_", "-"]
    # lower case
    combinations = list(map(lambda x: f"{first_name.lower()}{x}{last_name.lower()}",
                            separators))
    # append a random number to the tail
    n = random.randint(1, 100)
    combinations.extend(list(map(lambda x: f"{x}{n}", combinations)))
    return combinations

for index, row in df.iterrows():
    usernames = username_permutations(row["firstname"], row["lastname"])
    email_permutations = list(map(lambda x: f"{x}@{simplify_domain(row['companyname'])}.com",
                                  usernames))
    emails.append(','.join(email_permutations))

df["email"] = emails
Final result will be nick.holekamp@rjpbh.com,nick_holekamp@rjpbh.com,nick-holekamp@rjpbh.com,nick.holekamp66@rjpbh.com,nick_holekamp66@rjpbh.com,nick-holekamp66@rjpbh.com
You can modify the simplify_domain method to validate the given string, for example to strip "Inc" or ".com" from the company name.
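If you also want the other common patterns the question lists (firstlast, f.last, lastF, firstL, ...), here is a small sketch along the same lines; the domain argument is assumed to already be the simplified company domain:
# Hypothetical helper covering the patterns listed in the question.
def email_patterns(first, last, domain):
    first, last = first.lower(), last.lower()
    users = [
        f"{first}{last}",      # firstlast
        f"{first}.{last}",     # first.last
        f"{first[0]}.{last}",  # f.last
        f"{last}{first[0]}",   # lastF
        f"{first}_{last}",     # first_last
        f"{first}{last[0]}",   # firstL
    ]
    return [f"{u}@{domain}.com" for u in users]

print(email_patterns("Nick", "Holekamp", "rjpbh"))
# ['nickholekamp@rjpbh.com', 'nick.holekamp@rjpbh.com', ...]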
I have gotten some very strange data. I have a dictionary of keys and values, and I want to use this dictionary to search for the keywords ONLY at the start and/or end of the text, not in the middle of a sentence. I created the simple data frame below to show the problem, along with the Python code I have tried so far. How do I get it to search only the start or end of the sentence? The code below matches substrings anywhere in the text.
Code:
d = {'apple corp':'Company','app':'Application'} #dictionary
l1 = [1, 2, 3,4]
l2 = [
"The word Apple is commonly confused with Apple Corp which is a business",
"Apple Corp is a business they make computers",
"Apple Corp also writes App",
"The Apple Corp also writes App"
]
df = pd.DataFrame({'id':l1,'text':l2})
df['text'] = df['text'].str.lower()
df
Original Dataframe:
id text
1 The word Apple is commonly confused with Apple Corp which is a business
2 Apple Corp is a business they make computers
3 Apple Corp also writes App
4 The Apple Corp also writes App
Code Tried out:
def matcher(k):
    x = (i for i in d if i in k)
    # i.startswith(k) getting error
    return ';'.join(map(d.get, x))
df['text_value'] = df['text'].map(matcher)
df
Error:
TypeError: 'in <string>' requires string as left operand, not bool
when I use x = (i for i in d if i.startswith(k) in k)
Empty values when I try x = (i for i in d if i.startswith(k) == True in k)
TypeError: sequence item 0: expected str instance, NoneType found
when I use x = (i.startswith(k) for i in d if i in k)
Results from the code above (creating the new field 'text_value'):
id text text_value
1 The word Apple is commonly confused with Apple Corp which is a business Company;Application
2 Apple Corp is a business they make computers Company;Application
3 Apple Corp also writes App Company;Application
4 The Apple Corp also writes App Company;Application
Trying to get a FINAL output like this:
id text text_value
1 The word Apple is commonly confused with Apple Corp which is a business NaN
2 Apple Corp is a business they make computers Company
3 Apple Corp also writes App Company;Application
4 The Apple Corp also writes App Application
You need a matcher function that can accept a flag, and then call it twice to get the results for startswith and endswith.
def matcher(s, flag="start"):
    if flag == "start":
        for i in d:
            if s.startswith(i):
                return d[i]
    else:
        for i in d:
            if s.endswith(i):
                return d[i]
    return None

df['st'] = df['text'].apply(matcher)
df['ed'] = df['text'].apply(matcher, flag="end")
df['text_value'] = df[['st', 'ed']].apply(lambda x: ';'.join(x.dropna()), axis=1)
df = df[['id', 'text', 'text_value']]
The text_value column looks like:
0
1 Company
2 Company;Application
3 Application
Name: text_value, dtype: object
joined = "|".join(d.keys())
pat = '(?i)^(?:the\\s*)?(' + joined + ')\\b.*?|.*\\b(' + joined + ')$'+'|.*'
get = lambda x: d.get(x.group(1),"") + (';' +d.get(x.group(2),"") if x.group(2) else '')
df.text.str.replace(pat,get)
0
1 Company
2 Company;Application
3 Company;Application
Name: text, dtype: object
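An alternative sketch of the same start/end idea using str.extract, which leaves NaN where neither end of the text matches, as in the desired FINAL output above:
import numpy as np

keys = "|".join(d)  # 'apple corp|app'

# Capture a key at the very start or the very end; non-matches become NaN.
start = df['text'].str.extract(f'^({keys})', expand=False).map(d)
end = df['text'].str.extract(f'({keys})$', expand=False).map(d)

# Join the two labels, then turn empty results back into NaN.
df['text_value'] = (start.fillna('') + ';' + end.fillna('')) \
    .str.strip(';').replace('', np.nan)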
I am trying to use writeOGR to create a gpx file of points. writeOGR() will create a shp file with no error, but if I try to write a KML or GPX file I get this error. I'm using R 3.1.1 and rgdal 0.8-16 on Windows (I've tried it on 7 and 8, same issue).
writeOGR(points, driver="KML", layer="random_2014",dsn="C:/Users/amvander/Downloads")
Error in writeOGR(points, driver = "KML", layer = "random_2014", dsn = "C:/Users/amvander/Downloads") :
Creation of output file failed
It is in geographic coordinates; I already figured out that that was important.
summary(points)
Object of class SpatialPointsDataFrame
Coordinates:
min max
x -95.05012 -95.04392
y 40.08884 40.09588
Is projected: FALSE
proj4string :
[+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0]
Number of points: 20
Data attributes:
x y ID
Min. :-95.05 Min. :40.09 Length:20
1st Qu.:-95.05 1st Qu.:40.09 Class :character
Median :-95.05 Median :40.09 Mode :character
Mean :-95.05 Mean :40.09
3rd Qu.:-95.05 3rd Qu.:40.09
Max. :-95.04 Max. :40.10
str(points)
Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
..@ data :'data.frame': 20 obs. of 3 variables:
.. ..$ x : num [1:20] -95 -95 -95 -95 -95 ...
.. ..$ y : num [1:20] 40.1 40.1 40.1 40.1 40.1 ...
.. ..$ ID: chr [1:20] "nvsanc_1" "nvsanc_2" "nvsanc_3" "nvsanc_4" ...
..@ coords.nrs : num(0)
..@ coords : num [1:20, 1:2] -95 -95 -95 -95 -95 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:2] "x" "y"
..@ bbox : num [1:2, 1:2] -95.1 40.1 -95 40.1
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:2] "x" "y"
.. .. ..$ : chr [1:2] "min" "max"
..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slots
.. .. ..@ projargs: chr "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"
Can anyone provide any guidance on how to get around this error?
Here are the files I used and the code.
https://www.dropbox.com/sh/r7kz3p68j58c189/AACH0U_PLH7Y6cZW1wdFLQTOa/random_2014
You already figured out that these formats will only accept geographic coordinates (lat-long, not projected), and at least for GPX files there are very limited fields allowed, like "name" for the name, "ele" for elevation and "time" for date-time information. The @data fields in your file do not match those and thus cause an error.
It is possible to write those extra fields using
dataset_options="GPX_USE_EXTENSIONS=yes"
in that case they will be added as subclasses in an "extensions" field, but many simple GPS receivers will not read or use those fields. To create a very simple waypoint file with names, use the following procedure.
# I will use your dataset points
# if not already, re-project your points as lat-long
ll_points <- spTransform(points, CRS("+proj=longlat +ellps=WGS84"))
# use the ID field for the names
ll_points@data$name <- ll_points@data$ID
# now only write the "name" field to the file
writeOGR(ll_points["name"], driver="GPX", layer="waypoints",
         dsn="C:/whateverdir/gpxfile.gpx")
For me this executed and created a working gpx file that my GPS accepted, with the names displayed.
I had some problems implementing the code for input from a simple data.frame and wanted to provide the full code for someone working with that kind of data (instead of from a shapefile). This is a very slightly modified version of @Wiebe's answer, without having to go search out what @Auriel Fournier's original data looked like. Thank you @Wiebe - your answer helped me solve my problem too.
The data look like this:
dat
name Latitude Longitude
1 AP1402_C110617 -78.43262 45.45142
2 AP1402_C111121 -78.42433 45.47371
3 AP1402_C111617 -78.41053 45.45600
4 AP1402_C112200 -78.42115 45.53047
5 AP1402_C112219 -78.41262 45.50071
6 AP1402_C112515 -78.42140 45.43471
Code to get it into GPX for mapsource using writeOGR:
setwd("C:\\Users\\YourfileLocation")
dat <- structure(list(name = c("AP1402_C110617", "AP1402_C111121", "AP1402_C111617",
"AP1402_C112200", "AP1402_C112219", "AP1402_C112515"), Latitude = c(-78.4326169598409,
-78.4243276812641, -78.4105301310195, -78.4211498660601, -78.4126208020092,
-78.4214041610924), Longitude = c(45.4514150332163, 45.4737126348589,
45.4560042609868, 45.5304703938887, 45.5007103937952, 45.4347135938299
)), .Names = c("name", "Latitude", "Longitude"), row.names = c(NA,
6L), class = "data.frame")
library(rgdal)
library(sp)
dat <- SpatialPointsDataFrame(data=dat, coords=dat[,2:3], proj4string=CRS("+proj=longlat +datum=WGS84"))
writeOGR(dat,
dsn="TEST3234.gpx", layer="waypoints", driver="GPX",
dataset_options="GPX_USE_EXTENSIONS=yes")