companies = pd.read_csv("http://www.richard-muir.com/data/public/csv/CompaniesRevenueEmployees.csv", index_col = 0)
companies.head()
I'm getting the error below. What approaches should I try?
"'utf-8' codec can't decode byte 0xb7 in position 7"
Try encoding as 'latin1' on macOS.
companies = pd.read_csv("http://www.richard-muir.com/data/public/csv/CompaniesRevenueEmployees.csv",
                        index_col=0,
                        encoding='latin1')
Downloading the file and opening it in Notepad++ shows that it is ANSI-encoded. If you are on a Windows system, this should fix it:
import pandas as pd
url = "http://www.richard-muir.com/data/public/csv/CompaniesRevenueEmployees.csv"
companies = pd.read_csv(url, index_col = 0, encoding='ansi')
print(companies)
If you are not on Windows, the 'ansi' codec is not available, so you need to find an equivalent encoding that Python can read on your system.
See: https://docs.python.org/3/library/codecs.html#standard-encodings
Output:
Name Industry \
0 Walmart Retail
1 Sinopec Group Oil and gas
2 China National Petroleum Corporation Oil and gas
... ... ...
47 Hewlett Packard Enterprise Electronics
48 Tata Group Conglomerate
Revenue (USD billions) Employees
0 482 2200000
1 455 358571
2 428 1636532
... ... ...
47 111 302000
48 108 600000
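If you are unsure which encoding a file uses, one option is to try a few candidates in order and use the first one that decodes cleanly. The following is only a minimal sketch and not part of the answers above: the helper name detect_encoding, the candidate list, and the sample bytes are assumptions made for illustration.

```python
def detect_encoding(data, candidates=("utf-8", "cp1252", "latin1")):
    """Return the first candidate encoding that decodes `data` without error."""
    for encoding in candidates:
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Hypothetical sample: a lone 0xb7 byte is invalid in UTF-8 but is "·" in cp1252
sample = b"Name,Revenue\nA\xb7B Corp,10\n"
encoding = detect_encoding(sample)
print(encoding, repr(sample.decode(encoding)))
```

The detected name can then be passed as encoding=... to pd.read_csv. Note that latin1 accepts every byte value, so it never fails and only makes sense as the last candidate.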
from cgitb import text
from bs4 import BeautifulSoup
import requests
website = 'https://www.marketplacehomes.com/rent-a-home/'
result = requests.get(website)
content = result.text
soup = BeautifulSoup(content, 'html.parser')
lists = soup.find_all('div', class_=('tt-rental-row'))
for list in lists:
    location = list.find('span', class_="renta;-adress")
    beds = list.find('span', class_="renta;-beds")
    baths = list.find('span', class_="renta;-beds")
    availability = list.find('span', class_="rental-date-available")
    info = [location, beds, baths, availability]
    print(info)
If I try to run the last line of code, I get:
"IndentationError: expected an indented block"
If I try to run each indentation separately I get:
">>> location = list.find('span', class_="renta;-adress")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: type object 'list' has no attribute 'find'"
I'm new to Python and I'm kinda stuck, can anyone please help me?
Note: Your code never runs the body of the for loop because your selection never matches any elements in the HTML. They are generated dynamically from data in another resource, and requests does not render websites like a browser does; it only sees the static content of the response.
Be careful not to reuse built-in names as variable names, because they cause confusing errors. In your case list.find() raises one because the type object 'list' does not have an attribute called find. You can check such things with type():
type(soup)
-> it's a bs4.BeautifulSoup
type(soup.find_all('div', class_=('tt-rental-row')))
-> it's a bs4.element.ResultSet
type(list)
-> it's a type
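To see the effect of reusing the name list in isolation, here is a small sketch that needs neither requests nor BeautifulSoup. The rows data and the function name shadowing_demo are made up for illustration.

```python
rows = ["first row", "second row"]  # hypothetical stand-in for the ResultSet

# The built-in type has no .find attribute, which is what the AttributeError reports
builtin_has_find = hasattr(list, "find")


def shadowing_demo(items):
    for list in items:  # rebinding `list` hides the built-in inside this function
        pass
    try:
        list("abc")  # `list` is now the last str from the loop, not the built-in type
    except TypeError as exc:
        return str(exc)
    return None


message = shadowing_demo(rows)
print(builtin_has_find, message)
```

Picking a different loop variable name (for example row) avoids both problems.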
So how do you reach your goal?
You could use pandas to create a DataFrame directly from that resource and slice it to your needs:
import pandas as pd
pd.read_json('https://app.tenantturner.com/listings-json/2679')
Output:
id dateActivated latitude longitude address city state zip photo title ... baths dateAvailable rentAmount acceptPets applyUrl btnUrl btnText virtualTour propertyType enableWaitlist
0 83600 8/22/2022 35.750499 -86.393972 4481 Jack Faulk St Murfreesboro TN 37127 https://ttimages.blob.core.windows.net/propert... 4481 Jack Faulk St ... 2.0 Now 2195 cats, small dogs, large dogs https://app.propertyware.com/pw/application/#/... https://app.tenantturner.com/qualify/4481-jack... Schedule Viewing None Single Family False
1 100422 8/31/2022 30.277607 -95.472842 213 Skybranch Court Conroe TX 77304 https://ttimages.blob.core.windows.net/propert... 213 Skybranch Court ... 2.5 Now 2100 cats, small dogs, large dogs https://app.propertyware.com/pw/application/#/... https://app.tenantturner.com/qualify/213-skybr... Schedule Viewing None Condo Unit False
2 106976 7/27/2022 28.274720 -82.298077 8127 Olive Brook Dr Wesley Chapel FL 33545 https://ttimages.blob.core.windows.net/propert... 8127 Olive Brook Dr ... 2.0 Now 2650 no pets https://app.propertyware.com/pw/application/#/... https://app.tenantturner.com/qualify/8127-oliv... Schedule Viewing None Single Family False
3 116188 8/15/2022 42.624023 -83.144614 735 Grace Ave Rochester Hills MI 48307 https://ttimages.blob.core.windows.net/propert... 735 Grace Ave ... 2.0 Now 1600 cats, small dogs, large dogs https://app.propertyware.com/pw/application/#/... https://app.tenantturner.com/qualify/735-grace... Schedule Viewing None Single Family False
4 126846 8/22/2022 32.046455 -81.071181 1810 E 41st St Savannah GA 31404 https://ttimages.blob.core.windows.net/propert... 1810 E 41st St ... 1.0 Now 1395 small dogs https://app.propertyware.com/pw/application/#/... https://app.tenantturner.com/qualify/1810-e-41... Schedule Viewing None Single Family True
...
91 rows × 22 columns
Example:
To show only specific columns, simply pass a list of their names.
import pandas as pd
pd.read_json('https://app.tenantturner.com/listings-json/2679')[['address', 'beds', 'baths', 'dateAvailable']]
Output
address beds baths dateAvailable
0 4481 Jack Faulk St 4 2.0 Now
1 213 Skybranch Court 3 2.5 Now
2 8127 Olive Brook Dr 3 2.0 Now
3 735 Grace Ave 3 2.0 Now
4 1810 E 41st St 3 1.0 Now
... ... ... ... ...
91 rows × 4 columns
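If pandas is not available, the same field selection can be sketched with the standard json module. This is only an illustration: it assumes the endpoint returns a JSON array of objects, the two records are copied from the output above and trimmed to a few fields, and select_fields is a hypothetical helper name.

```python
import json

# Two records copied from the output above, trimmed to a few fields
payload = json.dumps([
    {"id": 83600, "address": "4481 Jack Faulk St", "beds": 4,
     "baths": 2.0, "dateAvailable": "Now"},
    {"id": 100422, "address": "213 Skybranch Court", "beds": 3,
     "baths": 2.5, "dateAvailable": "Now"},
])

WANTED = ["address", "beds", "baths", "dateAvailable"]


def select_fields(records, fields):
    """Keep only the named fields of each record (missing fields become None)."""
    return [{name: record.get(name) for name in fields} for record in records]


listings = select_fields(json.loads(payload), WANTED)
for listing in listings:
    print(listing)
```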
The name list refers to a built-in type in Python, so using it as a variable name shadows the built-in. Try another name:
for myList in lists:
    location = myList.find('span', class_="renta;-adress")
    beds = myList.find('span', class_="renta;-beds")
    baths = myList.find('span', class_="renta;-beds")
    availability = myList.find('span', class_="rental-date-available")
    info = [location, beds, baths, availability]
    print(info)
I'm trying to scrape http://www.moneycontrol.com/stocks/histstock.php?sc_id=BPC&mycomp=BPCL
to get price data.
So I did the following:
Opened that link and entered the dates (daily).
Chrome -> Inspect -> Network: obtained the form details and found the URL for the POST.
Filled in the form data and sent the POST.
I have multiple tickers for which I need the data.
Eg:
'AXISBANK': 'http://www.moneycontrol.com/stocks/hist_stock_result.php?ex=N&sc_id=API&mycomp=AXISBANK',
'BAJAJ-AUTO': 'http://www.moneycontrol.com/stocks/hist_stock_result.php?ex=N&sc_id=API&mycomp=BPCL',
But when I run the POST I get the same output even though the URLs I'm posting to are different.
What could I be missing?
Output:
running for http://www.moneycontrol.com/stocks/hist_stock_result.php?ex=N&sc_id=API&mycomp=AXISBANK
Date Open High Low Close Volume
244 05-01-2016 881.3 905.00 881.3 900.65 1372748
245 04-01-2016 876.2 892.45 871.7 880.80 709103
246 01-01-2016 882.0 885.60 876.9 878.75 294006
running for http://www.moneycontrol.com/stocks/hist_stock_result.php?ex=N&sc_id=API&mycomp=BPCL
Date Open High Low Close Volume
244 05-01-2016 881.3 905.00 881.3 900.65 1372748
245 04-01-2016 876.2 892.45 871.7 880.80 709103
246 01-01-2016 882.0 885.60 876.9 878.75 294006
This is the code i wrote to test it.
url='http://www.moneycontrol.com/stocks/hist_stock_result.php?ex=N&sc_id=API&mycomp=AXISBANK'
url2='http://www.moneycontrol.com/stocks/hist_stock_result.php?ex=N&sc_id=API&mycomp=BPCL'
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
data = {
'frm_dy':'01',
'frm_mth':'01',
'frm_yr':'2016',
'to_dy':'31',
'to_mth':'12',
'to_yr':'2016',
'hdn':'daily'
# 'x':'15',
# 'y':'14'
}
print('running for {}'.format(url))
test = requests.post(url,data=data) # Post the data
doc = bs(test.text,'html.parser')
tables = doc.find('table',{'class':'tblchart'})
tData = pd.read_html(str(tables),header=1) #You get a list
#Convert it to dataFrame
tData = tData[0].drop(columns=['(High-Low)','(Open-Close)'])
print(tData.tail(3))
import time
time.sleep(20) # Hopefully sleep works?
url = url2 # test only
print('running for {}'.format(url))
test = requests.post(url,data=data)
doc = bs(test.text,'html.parser')
tables = doc.find('table',{'class':'tblchart'})
tData = pd.read_html(str(tables),header=1) #You get a list
#Convert it to dataFrame
tData = tData[0].drop(columns=['(High-Low)','(Open-Close)'])
print(tData.tail(3))
I noticed that sc_id changed when I ran it directly from the URL versus when I looked at it in 'Inspect'.
I don't know what sc_id is (session ID?).
I'm totally new to web scraping, so I don't really know the gotchas or whether I've hit any.
What could I be missing?
You have to set the sc_id= parameter in the URL correctly.
For AXIS Bank it's UTI10
For Bajaj Auto it's BA06
For example:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

def get_sc_id(name, full_name):
    url = 'https://www.moneycontrol.com/stocks/autosuggest.php'
    params = {'str': name}
    return re.search(r'set_val\(\'{}\',\'(.*?)\'\)'.format(full_name), requests.get(url, params=params).text, flags=re.I)[1]

def get_table(sc_id, mycomp):
    url = 'https://www.moneycontrol.com/stocks/hist_stock_result.php'
    params = {
        'ex':'B',
        'sc_id': sc_id,
        'mycomp': mycomp
    }
    data = {
        'frm_dy':'01',
        'frm_mth':'01',
        'frm_yr':'2016',
        'to_dy':'31',
        'to_mth':'12',
        'to_yr':'2016',
        'hdn':'daily'
    }
    soup = BeautifulSoup(requests.post(url, data=data, params=params).content, 'html.parser')
    return pd.read_html( str(soup.select_one('.tblchart')) )[0].droplevel(0, axis=1)

code = get_sc_id('AXIS', 'Axis Bank')
print('Axis Bank code: ', code)
print(get_table(code, 'Axis Bank'))
code = get_sc_id('BAJAJ', 'Bajaj Auto')
print('Bajaj Auto code:', code )
print(get_table(code, 'Bajaj Auto'))
Prints:
Axis Bank code: UTI10
Date Open High Low Close Volume (High-Low) (Open-Close)
0 30-12-2016 446.00 451.80 443.45 450.00 234037 8.35 -4.00
1 29-12-2016 447.00 447.00 437.80 444.15 267677 9.20 2.85
2 28-12-2016 437.45 447.85 436.00 439.50 251149 11.85 -2.05
3 27-12-2016 430.00 438.55 430.00 437.45 210857 8.55 -7.45
4 26-12-2016 432.15 436.00 427.00 431.75 405044 9.00 0.40
.. ... ... ... ... ... ... ... ...
242 07-01-2016 424.25 425.00 407.30 409.35 1441934 17.70 14.90
243 06-01-2016 439.70 439.70 429.80 430.80 730512 9.90 8.90
244 05-01-2016 439.00 440.00 433.65 436.35 726947 6.35 2.65
245 04-01-2016 448.85 448.85 437.40 439.25 743518 11.45 9.60
246 01-01-2016 450.00 452.70 445.80 449.80 433052 6.90 0.20
[247 rows x 8 columns]
Bajaj Auto code: BA06
Date Open High Low Close Volume (High-Low) (Open-Close)
0 30-12-2016 2655.55 2667.00 2627.25 2633.85 10377 39.75 21.70
1 29-12-2016 2621.00 2665.65 2611.50 2655.45 8704 54.15 -34.45
2 28-12-2016 2629.35 2653.00 2624.55 2631.60 6475 28.45 -2.25
3 27-12-2016 2563.00 2642.00 2563.00 2633.60 15491 79.00 -70.60
4 26-12-2016 2618.00 2618.35 2578.00 2596.70 7205 40.35 21.30
.. ... ... ... ... ... ... ... ...
242 07-01-2016 2470.00 2481.80 2407.25 2419.25 15962 74.55 50.75
243 06-01-2016 2495.00 2513.70 2475.00 2485.50 11975 38.70 9.50
244 05-01-2016 2518.00 2520.00 2480.00 2497.05 11967 40.00 20.95
245 04-01-2016 2507.90 2545.85 2480.65 2488.15 23077 65.20 19.75
246 01-01-2016 2530.00 2530.00 2512.15 2520.05 9055 17.85 9.95
[247 rows x 8 columns]
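The regex step in get_sc_id can be tried offline. The sketch below runs it against a made-up response fragment whose shape is inferred only from the pattern above; the real autosuggest response may look different, and 'Example Co' / 'EX01' are invented values. Unlike the original, this variant escapes the company name and returns None instead of raising when nothing matches.

```python
import re

# Assumed fragment of an autosuggest response (shape guessed from the regex above)
sample = (
    "<li><a onclick=\"set_val('Axis Bank','UTI10')\">Axis Bank</a></li>"
    "<li><a onclick=\"set_val('Example Co','EX01')\">Example Co</a></li>"
)


def extract_sc_id(text, full_name):
    """Return the code that follows full_name in a set_val(...) call, or None."""
    pattern = r"set_val\('{}','(.*?)'\)".format(re.escape(full_name))
    match = re.search(pattern, text, flags=re.I)
    return match[1] if match else None


print(extract_sc_id(sample, "axis bank"))
print(extract_sc_id(sample, "Bajaj Auto"))
```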
Hello, I want to find the account name that follows # in the title column and save it to a new CSV. pandas should be able to do it; I tried, but it didn't work.
This is my CSV: http://www.sharecsv.com/s/c1ed9790f481a8d452049be439f4e3d8/Newnormal.csv
This is my code:
import pandas as pd
data = pd.read_csv("Newnormal.csv")
data.dropna(inplace = True)
sub ='#'
data["Indexes"]= data["title"].str.find(sub)
print(data)
I want results like this:
From, to, title
Xavier5501, KudiiThaufeeq, RT #KudiiThaufeeq: Royal Rape, Royal Harassment, Royal Cocktail Party, Royal Pedo, Royal Bidding, Royal Maalee Bayaan, Royal Slavery..et
Thank you.
Reduce the records to only those that have a "#" in title.
Define a new column, to, which is the text from "#" up to ":".
This leaves some records with NaN in the to column; I've just filtered these out.
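import pandas as pd
df = pd.read_csv("Newnormal.csv")
# keep only rows whose title contains "#" (na=False treats missing titles as no match)
df = df[df["title"].str.contains("#", na=False)]
# capture the last "#name:" token in each title
df["to"] = df["title"].str.extract(r".*(#[A-Za-z0-9_]+:)")
df = df[["from","to","title"]]
df[~df["to"].isna()].to_csv("ToNewNormal.csv", index=False)
df[~df["to"].isna()]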
output
from to title
1 Xavier5501 #KudiiThaufeeq: RT #KudiiThaufeeq: Royal Rape, Royal Harassmen...
2 Suzane24979006 #USAID_NISHTHA: RT #USAID_NISHTHA: Don't step outside your hou...
3 sandeep_sprabhu #USAID_NISHTHA: RT #USAID_NISHTHA: Don't step outside your hou...
4 oliLince #Timothy_Hughes: RT #Timothy_Hughes: How to Get a Salesforce Th...
7 rismadwip #danielepermana: RT #danielepermana: Pak kasus covid per hari s...
... ... ... ...
992 Reptoid_Hunter #sapiofoxy: RT #sapiofoxy: I literally can't believe we ha...
994 KPCResearch #sapiofoxy: RT #sapiofoxy: I literally can't believe we ha...
995 GreySparkUK #VoxSmartGlobal: RT #VoxSmartGlobal: The #newnormal will see mo...
997 Gabboa10 #HuShameem: RT #HuShameem: One of #PGO_MV admin staff test...
999 wanjirunjendu #ntvkenya: RT #ntvkenya: AAK's Mugure Njendu shares insig...
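To check what the pattern captures without loading the CSV, it can be run with the standard re module on a few titles. This is a sketch: the function name extract_to is hypothetical, the first title is taken from the output above, and the other two are invented.

```python
import re

# Same idea as the str.extract call: greedy ".*" means the last "#name:" wins
HANDLE = re.compile(r".*(#[A-Za-z0-9_]+:)")


def extract_to(title):
    """Return the last '#name:' token in title, or None if there is none."""
    match = HANDLE.search(title)
    return match.group(1) if match else None


titles = [
    "RT #KudiiThaufeeq: Royal Rape, Royal Harassment",
    "The #newnormal has no handle followed by a colon",
    "RT #someone: see also #newnormal",
]
for title in titles:
    print(extract_to(title))
```

Hashtags that are not followed by a colon are skipped, which is why the second title yields None here and ends up as NaN in the pandas version.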
Can anyone please suggest how to extract tabular data from a PDF with a Python or Java program, for the borderless table below that appears in a PDF file?
This table might be a difficult one for tabula. How about using guess=False, stream=True?
Update: As of tabula-py 1.0.3, guess and stream should work together. No need to set guess=False to use stream or lattice option.
I solved this problem via tabula-py
conda install tabula-py
and
>>> import tabula
>>> area = [70, 30, 750, 570]  # Seems to have to be done manually
>>> page2 = tabula.read_pdf("nar_2021_editorial-2.pdf", guess=False, lattice=False,
...                         stream=True, multiple_tables=False, area=area, pages="all",
...                         )  # `tabula` doc explains params very well
>>> page2
and I got this result
> 'pages' argument isn't specified.Will extract only from page 1 by default.
> [        ShortTitle                                              Text  \
> 0        Arena3Dweb         3D visualisation of multilayered networks
> 1           Aviator       Monitoring the availability of web services
> 2          b2bTools  Predictions for protein biophysical features and
> 3               NaN                                their conservation
> 4           BENZ WS          Four-level Enzyme Commission (EC) number
> ..              ...                                               ...
> 68   miRTargetLink2              miRNA target gene and target pathway
> 69              NaN                                          networks
> 70        mmCSM-PPI            Effects of multiple point mutations on
> 71              NaN                      protein-protein interactions
> 72         ModFOLD8           Quality estimates for 3D protein models
>
>                                                URL
> 0                    http://bib.fleming.gr/Arena3D
> 1          https://www.ccb.uni-saarland.de/aviator
> 2                    https://bio2byte.be/b2btools/
> 3                                              NaN
> 4                 https://benzdb.biocomp.unibo.it/
> ..                                             ...
> 68  https://www.ccb.uni-saarland.de/mirtargetlink2
> 69                                             NaN
> 70          http://biosig.unimelb.edu.au/mmcsm ppi
> 71                                             NaN
> 72       https://www.reading.ac.uk/bioinf/ModFOLD/
>
> [73 rows x 3 columns]]
The result is a list of DataFrames, so you can loop over it with for table in page2:
Hope this helps.
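In the output above, a wrapped cell shows up as an extra row with NaN in ShortTitle. One possible clean-up, sketched here on plain tuples rather than on the DataFrame itself, is to append such continuation rows to the row before them. The helper name merge_wrapped_rows is hypothetical, None stands in for NaN, and the sample rows are copied from the output.

```python
def merge_wrapped_rows(rows):
    """Fold rows with an empty first cell into the preceding row."""
    merged = []
    for title, text, url in rows:
        if title is None and merged:
            previous = merged[-1]
            if text:
                previous[1] = f"{previous[1]} {text}"  # continue the wrapped text
            if url and not previous[2]:
                previous[2] = url
        else:
            merged.append([title, text, url])
    return [tuple(row) for row in merged]


rows = [
    ("b2bTools", "Predictions for protein biophysical features and",
     "https://bio2byte.be/b2btools/"),
    (None, "their conservation", None),
    ("ModFOLD8", "Quality estimates for 3D protein models",
     "https://www.reading.ac.uk/bioinf/ModFOLD/"),
]
cleaned = merge_wrapped_rows(rows)
for row in cleaned:
    print(row)
```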
Tabula-py borderless table extraction:
Tabula-py has a stream option which, when set to True, detects tables based on the whitespace gaps between cells.
from tabula import convert_into
src_pdf = r"src_path"
des_csv = r"des_path"
convert_into(src_pdf, des_csv, guess=False, lattice=False, stream=True, pages="all")
I am trying to use SEC (U.S. Securities and Exchange Commission) data. The SEC provides useful data in txt format. I am using the
Financial Statement Data Sets for the second quarter of 2017. You can find the data I use here.
I tried to read the txt files into a pandas DataFrame in the following ways:
sub = pd.read_fwf('sub.txt')
sub_1 = pd.read_csv('sub.txt')
I get no error when using pandas' read_fwf function, but the output is utter rubbish. Here is the head of the DataFrame:
adsh cik name sic countryba stprba cityba zipba bas1 bas2 baph countryma stprma cityma zipma mas1 mas2 countryinc stprinc ein former changed afs wksi fye form period fy fp filed accepted prevrpt detail instance nciks aciks Unnamed: 1
0 0000002178-17-000038\t2178\tADAMS RESOURCES & ... NaN
1 0000002488-17-000107\t2488\tADVANCED MICRO DEV... NaN
I do get an error when using read_csv: Error tokenizing data. C error: Expected 2 fields in line 7, saw 3
Any ideas on how to read the data into a pandas DataFrame?
It looks like the files are tab separated - that's why you're seeing \t in the results. pandas read_csv defaults to comma separated values, so you have to change the separator. This is controlled by the sep parameter. In addition, you will need to provide the proper encoding (errors are thrown when trying to read the num, pre, and tag files). Generally ISO-8859-1 is a good choice.
#import pandas
import pandas as pd
#read in the .txt file and choose a separator and encoding standard
df = pd.read_csv('sub.txt', sep='\t', encoding='ISO-8859-1')
#output the results
print(df)
adsh cik name \
0 0000002178-17-000038 2178 ADAMS RESOURCES & ENERGY, INC.
1 0000002488-17-000107 2488 ADVANCED MICRO DEVICES INC
2 0000002969-17-000019 2969 AIR PRODUCTS & CHEMICALS INC /DE/
3 0000002969-17-000024 2969 AIR PRODUCTS & CHEMICALS INC /DE/
4 0000003499-17-000010 3499 ALEXANDERS INC
5 0000003545-17-000043 3545 ALICO INC
6 0000003570-17-000073 3570 CHENIERE ENERGY INC
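The effect of the separator can be seen on a tiny in-memory sample with the standard csv module. This is a sketch for illustration only: the sample keeps just three of the columns and one row from the output above, and parse is a hypothetical helper.

```python
import csv
import io

# One header line and one data row, tab separated as in sub.txt
sample = (
    "adsh\tcik\tname\n"
    "0000002178-17-000038\t2178\tADAMS RESOURCES & ENERGY, INC.\n"
)


def parse(text, delimiter):
    """Split text into rows of fields using the given delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


tab_rows = parse(sample, "\t")
comma_rows = parse(sample, ",")

print(tab_rows[1])  # three clean fields
print([len(row) for row in comma_rows])  # field counts differ from line to line
```

With a comma as the delimiter, the header is one field and the data row is split at the comma inside the company name, which is the kind of mismatch behind the "Expected 2 fields ... saw 3" error. With sep='\t' every line yields the same number of fields.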