CSV file to Excel - python-3.x

I'm trying to scrape a website and get the output in an Excel file. I managed to create the Excel file, but the columns are all messed up (the screenshots are not included here).
How should I go about transferring the data correctly from the CSV file to the Excel file?
The code I used:
import requests
import pandas as pd
from bs4 import BeautifulSoup
page = requests.get('https://forecast.weather.gov/MapClick.php?lat=34.05349000000007&lon=-118.24531999999999#.XsTs9RMzZTZ')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id = 'seven-day-forecast-body')
items = week.find_all(class_='tombstone-container')
period_names = [item.find(class_='period-name').get_text() for item in items]
short_descriptions = [item.find(class_='short-desc').get_text() for item in items]
temperatures = [item.find(class_='temp').get_text() for item in items]
weather_stuff = pd.DataFrame({
    'period': period_names,
    'short_descriptions': short_descriptions,
    'temperatures': temperatures,
})
print(weather_stuff)
weather_stuff.to_csv('weather.csv')

A minimalistic working example:
import pandas as pd

df1 = pd.DataFrame([['a', 'b'], ['c', 'd']],
                   index=['row 1', 'row 2'],
                   columns=['col 1', 'col 2'])
df1.to_excel("output.xlsx")
# To specify the sheet name:
df1.to_excel("output.xlsx", sheet_name='Sheet_name_1')
Source: Documentation
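To answer the original question directly: since weather.csv already exists, it can be read back with pandas and written out as Excel. A minimal sketch, assuming the CSV produced by the code above (to_excel needs openpyxl or xlsxwriter installed):

import pandas as pd

# index_col=0 skips the unnamed index column that to_csv('weather.csv') wrote by default
df = pd.read_csv('weather.csv', index_col=0)
df.to_excel('weather.xlsx', index=False)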

Related

Python code to loop through a list of postcodes and get the GP practices for those postcodes by scraping the Yellow Pages (Australia)

The code below gives me the following error:
ValueError: Length mismatch: Expected axis has 0 elements, new values have 1 elements
on the df.columns = ["GP Practice Name"] line.
I tried
import pandas as pd
import requests
from bs4 import BeautifulSoup
postal_codes = ["2000", "2010", "2020", "2030", "2040"]
places_by_postal_code = {}
def get_places(postal_code):
    url = f"https://www.yellowpages.com.au/search/listings?clue={postal_code}&locationClue=&latitude=&longitude=&selectedViewMode=list&refinements=category:General%20Practitioner&selectedSortType=distance"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    places = soup.find_all("div", {"class": "listing-content"})
    return [place.find("h2").text for place in places]

for postal_code in postal_codes:
    places = get_places(postal_code)
    places_by_postal_code[postal_code] = places

df = pd.DataFrame.from_dict(places_by_postal_code, orient='index')
df.columns = ["GP Practice Name"]
df = pd.DataFrame(places_by_postal_code.values(), index=places_by_postal_code.keys(), columns=["GP Practice Name"])
print(df)
and was expecting a list of GPs for the postcodes specified in the postal_codes variable.
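No answer is recorded here, but the error message itself narrows things down: "Expected axis has 0 elements" means the DataFrame was created with zero columns, which is what from_dict(orient='index') produces when every scraped list comes back empty (the site likely blocks the default requests user agent). Assuming the scrape does return names, a sketch that avoids the column-count problem builds one row per practice instead of one column per list position:

# One (postcode, practice) pair per row, so list lengths never have to match
rows = [
    (postal_code, name)
    for postal_code, names in places_by_postal_code.items()
    for name in names
]
df = pd.DataFrame(rows, columns=["Postcode", "GP Practice Name"])
print(df)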

How to save a list as a .csv file with python and pandas?

I have three lists:
title = ['title1', 'title2', 'title3']
emails = ['1#1', '1#2', '1#3']
links = ['http1', 'http2', 'http3']
I need to insert each of these lists as a column in the CSV file. On the next iteration, the lists are cleared and refilled with new data, which also needs to go into these columns without overwriting the previous rows.
Example of table:

Titles    Emails    Links
title1    1#1       http1
title2    1#2       http2
title3    1#3       http3
title4    1#4       http4
title5    1#5       http5
title6    1#6       http6
I tried the following:
df.to_csv('content.csv', index=False, columns=list['Titles', 'Emails', 'Links'])
df['Titles'] = titles
df['Emails'] = emails
df['Links'] = links
But it returns the error:
pandas.errors.EmptyDataError: No columns to parse from file
You need to create a DataFrame object first, then assign the data, and finally save it as a csv file. (The EmptyDataError above is what pandas raises when read_csv is pointed at an empty file, which suggests df was loaded from an empty content.csv.)
import pandas as pd
titles = ["Mr", "Ms", "Mrs"]
emails = ["anton#domain.com", "lina#domain.com", "katharina#domain.com"]
links = ["https://www.cbc.ca", "https://www.srf.ch", "https://www.ard.de"]
df = pd.DataFrame(columns=["Titles", "Emails", "Links"])
df["Titles"] = titles
df["Emails"] = emails
df["Links"] = links
df.to_csv("content.csv", index=False)
You can also assign the data to the DataFrame object right away when creating it:
import pandas as pd
titles = ["Mr", "Ms", "Mrs"]
emails = ["anton#domain.com", "lina#domain.com", "katharina#domain.com"]
links = ["https://www.cbc.ca", "https://www.srf.ch", "https://www.ard.de"]
df = pd.DataFrame({
    "Titles": titles,
    "Emails": emails,
    "Links": links
})
df.to_csv("content.csv", index=False)
And if you need to append to the csv file, use mode="a":
df = pd.DataFrame({
    "Titles": ["Ms"],
    "Emails": ["courtney#domain.com"],
    "Links": ["https://www.orf.at"]
})
# appends the new data to the csv file
df.to_csv("content.csv", mode="a", index=False, header=False)
You first need to create a DataFrame, then assign each column, and then export to csv:
titles = ['title1', 'title2', 'title3']
emails = ['1#1','1#2','1#3']
links = ['http1', 'http2', 'http3']
#Create Dataframe
df = pd.DataFrame()
#assign columns
df['Titles'] = titles
df['Emails'] = emails
df['Links'] = links
# export to csv
df.to_csv('content.csv', index=False, columns=['Titles', 'Emails', 'Links'])

How to Extract Data from a Graph on a Web Page?

I am trying to scrape graph data from the webpage: 'https://cawp.rutgers.edu/women-percentage-2020-candidates'
I tried the code below to extract data from the graph:
import requests
import pandas as pd
from bs4 import BeautifulSoup

Res = requests.get('https://cawp.rutgers.edu/women-percentage-2020-candidates').text
soup = BeautifulSoup(Res, "html.parser")
Values = [i.text for i in soup.findAll('g', {'class': 'igc-graph'}) if i]
Dates = [i.text for i in soup.findAll('g', {'class': 'igc-legend-entry'}) if i]
print(Values, Dates)  # both lists are empty
Data = pd.DataFrame({'Value': Values, 'Date': Dates})  # returns an empty DataFrame
I want to extract Date and Value from all four bar graphs. Can anyone suggest what I have to do here to extract the graph data, or is there another method I can try? Thanks.
This graph is loaded from this url: https://e.infogram.com/5bb50948-04b2-4113-82e6-5e5f06236538
You can find the infogram id (the path of the target url) directly on the original page by looking for the div with class infogram-embed, whose data-id attribute holds the value:
<div class="infogram-embed" data-id="5bb50948-04b2-4113-82e6-5e5f06236538" data-title="Candidate Tracker 2020_US House_Proportions" data-type="interactive"> </div>
From this url, the page loads a static JSON object inside its JavaScript. You can use a regex to extract it and parse the JSON structure to get the rows, the columns, and the different tables:
import requests
from bs4 import BeautifulSoup
import re
import json
original_url = "https://cawp.rutgers.edu/women-percentage-2020-candidates"
r = requests.get(original_url)
soup = BeautifulSoup(r.text, "html.parser")
infogram_url = f'https://e.infogram.com/{soup.find("div",{"class":"infogram-embed"})["data-id"]}'
r = requests.get(infogram_url)
soup = BeautifulSoup(r.text, "html.parser")
script = [
    t
    for t in soup.findAll("script")
    if "window.infographicData" in t.text
][0].text
extract = re.search(r".*window\.infographicData=(.*);$", script)
data = json.loads(extract.group(1))
entities = data["elements"]["content"]["content"]["entities"]
tables = [
    (entities[key]["props"]["chartData"]["sheetnames"],
     entities[key]["props"]["chartData"]["data"])
    for key in entities.keys()
    if ("props" in entities[key]) and ("chartData" in entities[key]["props"])
]
data = []
for t in tables:
    for i, sheet in enumerate(t[0]):
        data.append({
            "sheetName": sheet,
            "table": dict(
                (t[1][i][0][j], t[1][i][1][j])
                for j in range(len(t[1][i][0]))
            )
        })
print(data)
Output:
[{'sheetName': 'Sheet 1',
'table': {'': '2020', 'Districts Already Filed': '435'}},
{'sheetName': 'All',
'table': {'': 'Filed', '2016': '17.8%', '2018': '24.2%', '2020': '29.1%'}},
{'sheetName': 'Democrats Only',
'table': {'': 'Filed', '2016': '25.1%', '2018': '32.5%', '2020': '37.9%'}},
{'sheetName': 'Republicans Only',
'table': {'': 'Filed', '2016': '11.5%', '2018': '13.7%', '2020': '21.3%'}}]
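If a single DataFrame is wanted, as in the question, the parsed structure can be flattened into one row per table entry. A sketch, assuming pandas is imported as pd; the column names here are my own choice and can be renamed as needed:

rows = [
    {"sheet": d["sheetName"], "label": key, "value": value}
    for d in data
    for key, value in d["table"].items()
]
df = pd.DataFrame(rows)
print(df)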

BeautifulSoup, Requests, Dataframe, extracting from <SPAN> and Saving to Excel

Python novice here again! Two questions:
1) Instead of saving to multiple tabs (currently saving each year to a tab named after the year), how can I save all this data into one sheet in Excel called "summary"?
2) ('div', class_="sidearm-schedule-game-result") returns the format "W, 1-0". How can I split "W, 1-0" into two columns, one containing "W" and the next containing "1-0"?
Thanks so much.
import requests
import pandas as pd
from pandas import ExcelWriter
from bs4 import BeautifulSoup
import openpyxl
import csv
year_id = ['2003','2004','2005','2006','2007','2008','2009','2010','2011','2012','2013','2014','2015','2016','2017','2018','2019']
lehigh_url = 'https://lehighsports.com/sports/mens-soccer/schedule/'
results = []

with requests.Session() as req:
    for year in range(2003, 2020):
        print(f"Extracting Year# {year}")
        url = req.get(f"{lehigh_url}{year}")
        if url.status_code == 200:
            soup = BeautifulSoup(url.text, 'lxml')
            rows = soup.find_all('div', class_="sidearm-schedule-game-row flex flex-wrap flex-align-center row")
            sheet = pd.DataFrame()
            for row in rows:
                date = row.find('div', class_="sidearm-schedule-game-opponent-date").text.strip()
                name = row.find('div', class_="sidearm-schedule-game-opponent-name").text.strip()
                opp = row.find('div', class_="sidearm-schedule-game-opponent-text").text.strip()
                conf = row.find('div', class_="sidearm-schedule-game-conference-conference").text.strip()
                try:
                    result = row.find('div', class_="sidearm-schedule-game-result").text.strip()
                except:
                    result = ''
                df = pd.DataFrame([[year, date, name, opp, conf, result]], columns=['year','date','opponent','list','conference','result'])
                sheet = sheet.append(df, sort=True).reset_index(drop=True)
            results.append(sheet)

def save_xls(list_dfs, xls_path):
    with ExcelWriter(xls_path) as writer:
        for n, df in enumerate(list_dfs):
            df.to_excel(writer, '%s' % year_id[n], index=False)
        writer.save()

save_xls(results, 'lehigh.xlsx')
Instead of creating a list of dataframes, you can append each sheet into 1 dataframe and write that to file with pandas. Then to split into 2 columns, just use .str.split() and split on the comma.
import requests
import pandas as pd
from bs4 import BeautifulSoup
year_id = ['2019','2018','2017','2016','2015','2014','2013','2012','2011','2010','2009','2008','2007','2006','2005','2004','2003']
results = pd.DataFrame()
for year in year_id:
    url = 'https://lehighsports.com/sports/mens-soccer/schedule/' + year
    print(url)
    lehigh = requests.get(url).text
    soup = BeautifulSoup(lehigh, 'lxml')
    rows = soup.find_all('div', class_="sidearm-schedule-game-row flex flex-wrap flex-align-center row")
    sheet = pd.DataFrame()
    for row in rows:
        date = row.find('div', class_="sidearm-schedule-game-opponent-date").text.strip()
        name = row.find('div', class_="sidearm-schedule-game-opponent-name").text.strip()
        opp = row.find('div', class_="sidearm-schedule-game-opponent-text").text.strip()
        conf = row.find('div', class_="sidearm-schedule-game-conference-conference").text.strip()
        try:
            result = row.find('div', class_="sidearm-schedule-game-result").text.strip()
        except:
            result = ''
        df = pd.DataFrame([[year, date, name, opp, conf, result]], columns=['year','date','opponent','list','conference','result'])
        sheet = sheet.append(df, sort=True).reset_index(drop=True)
    results = results.append(sheet, sort=True).reset_index(drop=True)

results['result'], results['score'] = results['result'].str.split(',', 1).str
results.to_excel('lehigh.xlsx')
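Note that this answer predates pandas 2.0, which removed DataFrame.append, and the two-target unpacking of the .str accessor on the split result no longer works either. On current pandas, the accumulation and the split would look more like this sketch:

frames = []
for year in year_id:
    # ... build `sheet` for each year exactly as above ...
    frames.append(sheet)

# concat replaces the removed DataFrame.append
results = pd.concat(frames, ignore_index=True)
# expand=True splits "W, 1-0" into two columns at the first comma
results[['result', 'score']] = results['result'].str.split(',', n=1, expand=True)
results.to_excel('lehigh.xlsx', index=False)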

How to fix the indexing error and scrape the data from a webpage

I want to scrape data from a webpage archived on the Wayback Machine using pandas. I used a string split to clean up some strings where present.
The URL for the webpage is the one used in the code below.
Here is my code:
import pandas as pd
url = "https://web.archive.org/web/20140528015357/http://eciresults.nic.in/statewiseS26.htm"
dfs = pd.read_html(url)
df = dfs[0]
idx = df[df[0] == '\xa0Next >>'].index[0]
# The indexing error mentioned in the title occurs on the line above.
cols = list(df.iloc[idx-1,:])
df.columns = cols
df = df[df['Const. No.'].notnull()]
df = df.loc[df['Const. No.'].str.isdigit()].reset_index(drop=True)
df = df.dropna(axis=1,how='all')
df['Leading Candidate'] = df['Leading Candidate'].str.split('i',expand=True)[0]
df['Leading Party'] = df['Leading Party'].str.split('iCurrent',expand=True)[0]
df['Trailing Party'] = df['Trailing Party'].str.split('iCurrent',expand=True)[0]
df['Trailing Candidate'] = df['Trailing Candidate'].str.split('iAssembly',expand=True)[0]
df.to_csv('Chhattisgarh_cand.csv', index=False)
The expected output from that webpage should be in csv format (the example screenshot is omitted here).
You can use BeautifulSoup to extract the data. Pandas helps you process the data in an efficient way, but it's not meant for data extraction.
import pandas as pd
from bs4 import BeautifulSoup
import requests
response = requests.get('https://web.archive.org/web/20140528015357/http://eciresults.nic.in/statewiseS26.htm?st=S26')
soup = BeautifulSoup(response.text,'lxml')
table_data = []
required_table = [table for table in soup.find_all('table') if 'Indian National Congress' in str(table)]
if required_table:
    for tr_tags in required_table[0].find_all('tr', {'style': 'font-size:12px;'}):
        td_data = []
        for td_tags in tr_tags.find_all('td'):
            td_data.append(td_tags.text.strip())
        table_data.append(td_data)

df = pd.DataFrame(table_data[1:])
# print(df.head())
df.to_csv("DataExport.csv", index=False)
You can expect a result like this in the pandas DataFrame:
                0   1 ...       6                7
0        BILASPUR   5 ...  176436  Result Declared
1            DURG   7 ...   16848  Result Declared
2  JANJGIR-CHAMPA   3 ...  174961  Result Declared
3          KANKER  11 ...   35158  Result Declared
4           KORBA   4 ...    4265  Result Declared
The code below should get you the table at your url ("Chhattisgarh Result Status") using a combination of BS and pandas; you can then save it as csv:
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
url = "https://web.archive.org/web/20140528015357/http://eciresults.nic.in/statewiseS26.htm?st=S26"
response = urllib.request.urlopen(url)
elect = response.read()
soup = BeautifulSoup(elect,"lxml")
res = soup.find_all('table')
dfs = pd.read_html(str(res[7]))  # read_html returns a list of DataFrames
dfs[3]
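To then save that table as csv, as the answer suggests (the filename here is my own choice):

dfs[3].to_csv('chhattisgarh_result_status.csv', index=False)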