TypeError: while doing web scraping - python-3.x

I was scraping data and want to build two columns, title and date, but a TypeError occurs:
TypeError: from_dict() got an unexpected keyword argument 'columns'
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://timesofindia.indiatimes.com/topic/Hiv'
while True:
    response=requests.get(url)
    soup = BeautifulSoup(response.content,'html.parser')
    content = soup.find_all('div',{'class': 'content'})
    for contents in content:
        title_tag = contents.find('span',{'class':'title'})
        title= title_tag.text[1:-1] if title_tag else 'N/A'
        date_tag = contents.find('span',{'class':'meta'})
        date = date_tag.text if date_tag else 'N/A'
        hiv={title : date}
        print(' title : ', title ,' \n date : ' ,date )
    url_tag = soup.find('div',{'class':'pagination'})
    if url_tag.get('href'):
        url = 'https://timesofindia.indiatimes.com/' + url_tag.get('href')
        print(url)
    else:
        break
hiv1 = pd.DataFrame.from_dict(hiv , orient = 'index' , columns = ['title' ,'date'])
pandas is updated to version 0.23.4, but the error still occurs.

The first thing I noticed is that the dictionary is built incorrectly. I'm assuming you want one dictionary holding every title: date pair. As written, hiv is reassigned on each iteration, so it only keeps the last pair.
Once that is fixed, the keys become the index of the dataframe and the values become a single column, so technically there is only one column. To get two columns, reset the index; the old index moves into a column, which I name 'title'.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://timesofindia.indiatimes.com/topic/Hiv'
response=requests.get(url)
soup = BeautifulSoup(response.content,'html.parser')
content = soup.find_all('div',{'class': 'content'})
hiv = {}
for contents in content:
    title_tag = contents.find('span',{'class':'title'})
    title= title_tag.text[1:-1] if title_tag else 'N/A'
    date_tag = contents.find('span',{'class':'meta'})
    date = date_tag.text if date_tag else 'N/A'
    hiv.update({title : date})
    print(' title : ', title ,' \n date : ' ,date )
hiv1 = pd.DataFrame.from_dict(hiv , orient = 'index' , columns = ['date'])
hiv1 = hiv1.rename_axis('title').reset_index()
Output:
print (hiv1)
title date
0 I told my boyfriend I was HIV positive and thi... 01 Dec 2018
1 Pay attention to these 7 very common HIV sympt... 30 Nov 2018
2 Transfusion of HIV blood: Panel seeks time til... 2019-01-06T03:54:33Z
3 No. of pregnant women testing HIV+ dips; still... 01 Dec 2018
4 Busted:5 HIV AIDS myths 30 Nov 2018
5 Myths and taboos related to AIDS 01 Dec 2018
6 N/A N/A
7 Mumbai: Free HIV tests at six railway stations... 23 Nov 2018
8 HIV blood tranfusion: Tamil Nadu govt assures ... 2019-01-05T09:05:27Z
9 Autopsy performed on HIV+ve donor’s body at GRH 2019-01-03T07:45:03Z
10 Madras HC directs to videograph HIV+ve donor’s... 2019-01-01T01:23:34Z
11 HIV +ve Tamil Nadu teen who attempted suicide ... 2018-12-31T03:37:56Z
12 Another woman claims she got HIV-infected blood 2018-12-31T06:34:32Z
13 Another woman says she got HIV from donor blood 29 Dec 2018
14 HIV case: Five-member panel begins inquiry in ... 29 Dec 2018
15 Pregnant woman turns HIV positive after blood ... 26 Dec 2018
16 Pregnant woman contracts HIV after blood trans... 26 Dec 2018
17 Man attacks niece born with HIV for sleeping i... 16 Dec 2018
18 Health ministry implements HIV AIDS Act 2017: ... 11 Sep 2018
19 When meds don’t heal: HIV+ kids fight daily wa... 03 Sep 2018
I'm not quite sure why you're getting the error, though. It doesn't make sense if you are really running the updated pandas. Maybe uninstall pandas and then reinstall it with pip?
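One thing that may be worth checking, offered as a guess rather than a diagnosis: as far as I know, the columns argument of DataFrame.from_dict was added in pandas 0.23.0, so this TypeError usually means the interpreter running the script imports an older pandas than the one that was upgraded. A minimal sketch to see which installation is actually in use (pandas_environment is a hypothetical helper name, not part of pandas):

```python
import sys

import pandas as pd


def pandas_environment():
    """Return the interpreter path plus the version and location of the imported pandas."""
    return {
        "python": sys.executable,      # interpreter that is running this script
        "version": pd.__version__,     # version of the pandas that was imported
        "location": pd.__file__,       # where that pandas lives on disk
    }


info = pandas_environment()
for key, value in info.items():
    print(key, ":", value)
```

If the printed version is lower than 0.23, the upgrade probably went into a different Python environment than the one running the scraper.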
Otherwise I guess you could just do it in 2 lines and name the columns after converting to dataframe:
hiv1 = pd.DataFrame.from_dict(hiv, orient = 'index').reset_index()
hiv1.columns = ['title','date']
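Another option, sketched here under the assumption that duplicate titles should be kept: collect each (title, date) pair in a list instead of a dictionary. A dictionary keeps only one entry per key, so repeated titles such as 'N/A' collapse into one row. The rows below are hypothetical stand-ins for what the scraping loop would append; no request is made.

```python
import pandas as pd

# Hypothetical scraped pairs; in the real loop this would be rows.append((title, date)).
rows = [
    ("Busted:5 HIV AIDS myths", "30 Nov 2018"),
    ("Myths and taboos related to AIDS", "01 Dec 2018"),
    ("N/A", "N/A"),
    ("N/A", "N/A"),  # a duplicate title survives, unlike with a dict
]

# The plain DataFrame constructor accepts column names directly,
# so from_dict (and its columns argument) is not needed at all.
hiv1 = pd.DataFrame(rows, columns=["title", "date"])
print(hiv1)
```

This also sidesteps the from_dict error, since the plain constructor has long accepted columns.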

Related

How to remove space between numbers but leave space between names on the same column in a DataFrame Pandas

I would like to clean a Dataframe in such a way that only cells that contain numbers will not have empty spaces but cells with names remain the same.
Author
07 07 34
08 26 20
08 26 20
Tata Smith
Jhon Doe
08 26 22
3409243
Here is my approach, which is failing:
df.loc[df["Author"].str.isdigit(), "Author"] = df["Author"].strip()
How can I handle this?
You might want to use regex.
import pandas as pd
import re
# Create a sample dataframe
import io
df = pd.read_csv(io.StringIO('Author\n 07 07 34 \n 08 26 20 \n 08 26 20 \n Tata Smith\n Jhon Doe\n 08 26 22\n 3409243'))
# Use regex
mask = df['Author'].str.fullmatch(r'[\d ]*')
df.loc[mask, 'Author'] = df.loc[mask, 'Author'].str.replace(' ', '')
# You can also do the same treatment by the following line
# df['Author'] = df['Author'].apply(lambda s: s.replace(' ', '') if re.match(r'[\d ]*$', s) else s)
Author
070734
082620
082620
Tata Smith
Jhon Doe
082622
3409243
How about this?
import pandas as pd
df = pd.read_csv('two.csv')
# remove spaces on copy
df['Author_clean'] = df['Author'].str.replace(" ","")
# try conversion to numeric if possible (note: this drops leading zeros)
df['Author_clean'] = pd.to_numeric(df['Author_clean'], errors='coerce')
# fill missing vals with original strings (assign back instead of chained inplace=True)
df['Author_clean'] = df['Author_clean'].fillna(df['Author'])
print(df.head(10))
Output:
Author Author_clean
0 07 07 34 70734.0
1 08 26 20 82620.0
2 08 26 20 82620.0
3 Tata Smith Tata Smith
4 Jhon Doe Jhon Doe
5 08 26 22 82622.0
6 3409243 3409243.0

Python Web-scraped text, print vertically

This is my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
url = ("https://overwatchleague.com/en-us/schedule?stage=regular_season&week=12")
driver = webdriver.Chrome('C:\\Program Files (x86)\\chromedriver.exe')
driver.get(url)
MatchScores = driver.find_elements_by_xpath('//*[@id="__next"]/div/div/div[3]/div[3]/div[1]/div[2]/div[13]/div/section/div[3]')[0]
Results = MatchScores.text
print('-----------------------------')
print(Results)
When I run it, I get something like this:
FRI, JUL 02
FINAL
Paris Eternal
1
-
3
San Francisco Shock
MATCH DETAILS
FRI, JUL 02
FINAL
Washington Justice
0
-
3
Atlanta Reign
MATCH DETAILS
This continues for the other matches. Is there a way for me to print it so that it comes out like this?
FRI, JUL 02 FINAL Paris Eternal 1 - 3 San Francisco Shock
FRI, JUL 02 FINAL Washington Justice 0 - 3 Atlanta Reign
Any help would be appreciated. It would be a bonus if it could be printed without the "MATCH DETAILS" at the end.
You really just need to convert all the newlines to spaces. This can be done by first splitting on newlines and then joining with spaces.
You can remove MATCH DETAILS by dropping the last 13 characters (the length of "MATCH DETAILS").
So it looks like this:
print(' '.join(Results.split('\n'))[:-13])
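Note that the one-liner above joins every match into a single line and trims only the final "MATCH DETAILS". A minimal sketch that prints one line per match instead, assuming each match block ends with the literal text "MATCH DETAILS" as in the sample output (format_matches is a hypothetical helper name, and the sample string stands in for MatchScores.text):

```python
def format_matches(results, marker="MATCH DETAILS"):
    """Split the scraped text on the marker and flatten each block to one line."""
    lines = []
    for block in results.split(marker):
        # str.split() with no arguments splits on any whitespace, including newlines
        line = " ".join(block.split())
        if line:  # skip the empty remainder after the last marker
            lines.append(line)
    return lines


# Stand-in for MatchScores.text, copied from the sample output in the question.
sample = (
    "FRI, JUL 02\nFINAL\nParis Eternal\n1\n-\n3\nSan Francisco Shock\nMATCH DETAILS\n"
    "FRI, JUL 02\nFINAL\nWashington Justice\n0\n-\n3\nAtlanta Reign\nMATCH DETAILS"
)

matches = format_matches(sample)
for match in matches:
    print(match)
```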

Extract Datetime information from a string in a DataFrame column

I have an Edition column whose values follow an uneven pattern: some have ',' followed by the date and some have ',–' before it.
df.head()
17 Paperback,– 1 Nov 2016
18 Mass Market Paperback,– 1 Jan 1991
19 Paperback,– 2016
20 Hardcover,– 24 Nov 2018
21 Paperback,– Import, 4 Oct 2018
How can I extract the date into a separate column? I tried using str.split() but can't find a specific pattern to split on. Is there any method that would do it?
obj = df['Edition']
obj.str.split(r'((?:\d+\s+\w+\s+)?\d{4}$)', expand=True)
or
obj.str.split('[,–]+').str[0]
obj.str.split('[,–]+').str[-1] # date
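The regex idea can also be sketched with str.extract, which returns just the captured date instead of the split pieces. This assumes the date always sits at the end of the string, either as "D Mon YYYY" or as a bare year, as in the rows shown. The column names Date, Year and Format are hypothetical.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({"Edition": [
    "Paperback,– 1 Nov 2016",
    "Mass Market Paperback,– 1 Jan 1991",
    "Paperback,– 2016",
    "Hardcover,– 24 Nov 2018",
    "Paperback,– Import, 4 Oct 2018",
]})

# Optional "day month " prefix followed by a four-digit year at the end of the string.
date_pattern = r"((?:\d{1,2}\s+[A-Za-z]{3}\s+)?\d{4})$"
df["Date"] = df["Edition"].str.extract(date_pattern, expand=False)

# The year is always the last four characters of the extracted date.
df["Year"] = df["Date"].str[-4:].astype(int)

# Everything before the first comma or en dash is treated as the binding format.
df["Format"] = df["Edition"].str.extract(r"^([^,–]+)", expand=False).str.strip()

print(df[["Format", "Date", "Year"]])
```

The date is left as a string because some rows carry only a year; converting to a real datetime would require deciding what day and month those rows should get.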
Try using dateutil:
from dateutil.parser import parse
df['Dt']=[parse(i, fuzzy_with_tokens=True)[0] for i in df['Edition']]

Create New DataFrame Columns Based on Year

I have a pandas DataFrame that contains NFL Quarterback Data from the 2015-2016 to the 2019-2020 Seasons. The DataFrame looks like this
Player         Season End Year  YPG    TD
Tom Brady      2019             322.6  25
Tom Brady      2018             308.1  26
Tom Brady      2017             295.7  24
Tom Brady      2016             308.7  28
Aaron Rodgers  2019             360.4  30
Aaron Rodgers  2018             358.8  33
Aaron Rodgers  2017             357.9  35
Aaron Rodgers  2016             355.2  32
I want to be able to create new columns that contain the data for the year I select and for the three years before it. For example, if the year I select is 2019, the resulting DataFrame would be (SY stands for selected year):
Player         Season End Year  YPG SY  YPG SY-1  YPG SY-2  YPG SY-3  TD
Tom Brady      2019             322.6   308.1     295.7     308.7     25
Aaron Rodgers  2019             360.4   358.8     357.9     355.2     30
This is how I am attempting to do it:
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']), 'YPG SY'] = NFL_Data['YPG']
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']-1), 'YPG SY-1'] = NFL_Data['YPG']
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']-2), 'YPG SY-2'] = NFL_Data['YPG']
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']-3), 'YPG SY-3'] = NFL_Data['YPG']
However, when I run the code above, it doesn't fill out the columns appropriately. Most of the rows are 0. Am I approaching the problem the right way or is there a better way to attack it?
(Edited to include TD Column)
First step is to pivot your data frame.
pivoted = df.pivot_table(index='Player', columns='Season End Year', values='YPG')
Which yields
Season End Year 2016 2017 2018 2019
Player
Aaron Rodgers 355.2 357.9 358.8 360.4
Tom Brady 308.7 295.7 308.1 322.6
Then, you may select:
pivoted.loc[:, range(year, year-3, -1)]
2019 2018 2017
Player
Aaron Rodgers 360.4 358.8 357.9
Tom Brady 322.6 308.1 295.7
Or alternatively as suggested by Quang:
pivoted.loc[:, year:year-3:-1]
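Putting the pivot together with the column names from the question, a minimal sketch might look like this. The helper name last_seasons and its n_back parameter are hypothetical, and the sample frame is rebuilt from the table in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Player": ["Tom Brady"] * 4 + ["Aaron Rodgers"] * 4,
    "Season End Year": [2019, 2018, 2017, 2016] * 2,
    "YPG": [322.6, 308.1, 295.7, 308.7, 360.4, 358.8, 357.9, 355.2],
    "TD": [25, 26, 24, 28, 30, 33, 35, 32],
})


def last_seasons(frame, year, n_back=3):
    """One row per player: YPG for `year` and the n_back seasons before it."""
    pivoted = frame.pivot_table(index="Player", columns="Season End Year", values="YPG")
    years = list(range(year, year - n_back - 1, -1))  # e.g. 2019, 2018, 2017, 2016
    wide = pivoted.reindex(columns=years)              # missing seasons become NaN
    wide.columns = ["YPG SY" if i == 0 else f"YPG SY-{i}" for i in range(len(years))]
    # Keep the selected season's own row for the non-pivoted columns such as TD.
    selected = frame.loc[frame["Season End Year"] == year, ["Player", "Season End Year", "TD"]]
    return selected.merge(wide.reset_index(), on="Player")


result = last_seasons(df, 2019)
print(result)
```

Using reindex rather than loc means a player with a missing season gets NaN in that column instead of raising a KeyError.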

Parse CSV file with some dynamic columns

I have a CSV file that I receive once a week that is in the following format:
"Item","Supplier Item","Description","1","2","3","4","5","6","7","8" ...Linefeed
"","","","Past Due","Day 13-OCT-2014","Buffer 14-OCT-2014","Week 20-OCT-2014","Week 27-OCT-2014", ...LineFeed
"Part1","P1","Big Part","0","0","0","100","50", ...LineFeed
"Part4","P4","Red Part","0","0","0","35","40", ...LineFeed
"Part92","P92","White Part","0","0","0","10","20", ...LineFeed
...
An explanation of the data - Row 2 is dynamic data signifying the date parts are due. Row 3 begins the part numbers with description and number of parts due on a particular date. So looking at the above data: row 3 column7 shows that PartNo1 has 100 parts due on the Week of OCT 20 2014 and 50 due on the Week of OCT 27, 2014.
How can I parse this csv to show the data like this:
Item    Supplier Item  Description  Past Due  Due Date     Amount Due
Part1   P1             Big Part     0         20 OCT 2014  100
Part1   P1             Big Part     0         27 OCT 2014  50
Part4   P4             Red Part     0         20 OCT 2014  35
Part4   P4             Red Part     0         27 OCT 2014  40
....
Is there a way to manipulate the format in Excel to rearrange the data like I need or what is the best method to resolve this?
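Outside Excel, one possible approach is to reshape the file with pandas. This is a sketch under several assumptions: the second row supplies the real labels for the dynamic columns, "Past Due" stays as its own column, zero amounts are dropped (the desired output only shows non-zero rows), and the date is the last word of each bucket label. The sample text uses only the columns visible in the question; the elided ones are not guessed. The name reshape_schedule is hypothetical.

```python
import io

import pandas as pd

# Only the columns visible in the question; the real file has more.
raw = '''"Item","Supplier Item","Description","1","2","3","4","5"
"","","","Past Due","Day 13-OCT-2014","Buffer 14-OCT-2014","Week 20-OCT-2014","Week 27-OCT-2014"
"Part1","P1","Big Part","0","0","0","100","50"
"Part4","P4","Red Part","0","0","0","35","40"
"Part92","P92","White Part","0","0","0","10","20"
'''


def reshape_schedule(csv_text):
    """Turn the wide weekly file into one row per part and due date."""
    frame = pd.read_csv(io.StringIO(csv_text), header=None, dtype=str, keep_default_na=False)
    first, second = frame.iloc[0], frame.iloc[1]
    # Prefer the row-2 label when it is non-empty, otherwise keep the row-1 label.
    frame.columns = [s if s else f for f, s in zip(first, second)]
    data = frame.iloc[2:].reset_index(drop=True)

    id_cols = ["Item", "Supplier Item", "Description", "Past Due"]
    # ignore_index=False keeps the original row number so rows can be regrouped by part.
    long = data.melt(id_vars=id_cols, var_name="Bucket", value_name="Amount Due",
                     ignore_index=False)
    long["Amount Due"] = long["Amount Due"].astype(int)
    # "Week 20-OCT-2014" -> "20 OCT 2014"
    long["Due Date"] = long["Bucket"].str.split().str[-1].str.replace("-", " ", regex=False)
    long = long[long["Amount Due"] > 0]
    long = long.sort_index(kind="mergesort")  # stable: keeps bucket order within each part
    return long[id_cols + ["Due Date", "Amount Due"]].reset_index(drop=True)


result = reshape_schedule(raw)
print(result)
```

With a real file, pd.read_csv would take the file path instead of the io.StringIO wrapper.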
