Python Web-scraped text, print vertically - python-3.x

This is my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
url = ("https://overwatchleague.com/en-us/schedule?stage=regular_season&week=12")
driver = webdriver.Chrome('C:\\Program Files (x86)\\chromedriver.exe')
driver.get(url)
MatchScores = driver.find_elements_by_xpath('//*[@id="__next"]/div/div/div[3]/div[3]/div[1]/div[2]/div[13]/div/section/div[3]')[0]
Results = MatchScores.text
print('-----------------------------')
print(Results)
When I run it, I get something like this:
FRI, JUL 02
FINAL
Paris Eternal
1
-
3
San Francisco Shock
MATCH DETAILS
FRI, JUL 02
FINAL
Washington Justice
0
-
3
Atlanta Reign
MATCH DETAILS
This continues for the other matches. Is there a way for me to print it so that it comes out like this?
FRI, JUL 02 FINAL Paris Eternal 1 - 3 San Francisco Shock
FRI, JUL 02 FINAL Washington Justice 0 - 3 Atlanta Reign
Any help would be appreciated; it would be a bonus if it could be printed without the "MATCH DETAILS" at the end.

You really just need to convert all newlines to spaces. This can be done by first splitting on newlines and then joining with spaces.
You can strip the trailing MATCH DETAILS by removing the last 13 characters.
Hence it is like this:
print(' '.join(Results.split('\n'))[:-13])
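If the goal is one line per match with no MATCH DETAILS anywhere, here is a minimal sketch that splits the same Results string on that label instead of only trimming the end:
for block in Results.split('MATCH DETAILS'):
    line = ' '.join(block.split())  # collapse newlines and repeated spaces
    if line:  # skip the empty trailing piece
        print(line)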

Related

How to remove spaces between numbers but leave spaces between names in the same column of a Pandas DataFrame

I would like to clean a DataFrame in such a way that cells containing only numbers have their spaces removed, while cells with names remain the same.
Author
07 07 34
08 26 20
08 26 20
Tata Smith
Jhon Doe
08 26 22
3409243
Here is my approach, which is failing:
df.loc[df["Author"].str.isdigit(), "Author"] = df["Author"].strip()
How can I handle this?
You might want to use regex.
import pandas as pd
import re
# Create a sample dataframe
import io
df = pd.read_csv(io.StringIO('Author\n 07 07 34 \n 08 26 20 \n 08 26 20 \n Tata Smith\n Jhon Doe\n 08 26 22\n 3409243'))
# Use regex
mask = df['Author'].str.fullmatch(r'[\d ]*')
df.loc[mask, 'Author'] = df.loc[mask, 'Author'].str.replace(' ', '')
# You can also do the same treatment by the following line
# df['Author'] = df['Author'].apply(lambda s: s.replace(' ', '') if re.match(r'[\d ]*$', s) else s)
Output:
Author
070734
082620
082620
Tata Smith
Jhon Doe
082622
3409243
How about this?
import pandas as pd
df = pd.read_csv('two.csv')
# remove spaces on copy
df['Author_clean'] = df['Author'].str.replace(" ","")
# try conversion to numeric if possible
df['Author_clean'] = df['Author_clean'].apply(pd.to_numeric, errors='coerce')
# fill missing vals with original strings
df['Author_clean'].fillna(df['Author'], inplace=True)
print(df.head(10))
Output:
Author Author_clean
0 07 07 34 70734.0
1 08 26 20 82620.0
2 08 26 20 82620.0
3 Tata Smith Tata Smith
4 Jhon Doe Jhon Doe
5 08 26 22 82622.0
6 3409243 3409243.0
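Note that the numeric conversion turns "07 07 34" into 70734.0 and drops the leading zeros, while the regex approach above keeps the values as strings. If the leading zeros matter, a minimal variant without regex, assuming the same df (digits_only is just an illustrative mask name):
# mark cells that are purely digits once the spaces are removed
digits_only = df['Author'].str.replace(' ', '').str.isdigit()
# strip the spaces only on those rows, leaving the names and the string dtype untouched
df.loc[digits_only, 'Author'] = df.loc[digits_only, 'Author'].str.replace(' ', '')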

Extract Datetime information from a string in a DataFrame column

So I have the Edition column, which contains data in an uneven pattern: some values have ',' followed by the date and some have a ',–' pattern.
df.head()
17 Paperback,– 1 Nov 2016
18 Mass Market Paperback,– 1 Jan 1991
19 Paperback,– 2016
20 Hardcover,– 24 Nov 2018
21 Paperback,– Import, 4 Oct 2018
How can I extract the date to a separate column? I tried using str.split() but can't find a specific pattern to extract. Is there any method I could use?
obj = df['Edition']
obj.str.split(r'((?:\d+\s+\w+\s+)?\d{4}$)', expand=True)
or
obj.str.split('[,–]+').str[0]
obj.str.split('[,–]+').str[-1] # date
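To actually place the result in a separate column, a minimal usage sketch building on the split above (the 'Date' column name is an assumption; depending on the pandas version, oddly formatted values such as the bare '2016' may come back as NaT):
import pandas as pd

df['Date'] = df['Edition'].str.split('[,–]+').str[-1].str.strip()
# optional: convert to real datetimes; rows that don't parse cleanly become NaT
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')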
Try using dateutil
from dateutil.parser import parse
df['Dt'] = [parse(i, fuzzy_with_tokens=True)[0] for i in df['Edition']]

Count Occurrences for Objects in a Column of Lists for Really Large CSV File

I have a huge CSV file (8 GB) containing multiple columns. One of the columns is a column of lists that looks like this:
YEAR WIN_COUNTRY_ISO3
200 2017 ['BEL', 'FRA', 'ESP']
201 2017 ['BEL', 'LTU']
202 2017 ['POL', 'BEL']
203 2017 ['BEL']
204 2017 ['GRC', 'DEU', 'FRA', 'LVA']
205 2017 ['LUX']
206 2017 ['BEL', 'SWE', 'LUX']
207 2017 ['BEL']
208 2017 []
209 2017 []
210 2017 []
211 2017 ['BEL']
212 2017 ['SWE']
213 2017 ['LUX', 'LUX']
214 2018 ['DEU', 'LUX']
215 2018 ['ESP', 'PRT']
216 2018 ['AUT']
217 2018 ['DEU', 'BEL']
218 2009 ['ESP']
219 2009 ['BGR']
Each 3-letter code represents a country. I would like to create a frequency table for each country so I can count the occurrences of each country in the entire column. Since the file is really large and my PC can't load the whole CSV as a dataframe, I try to read the file lazily, iterate through the lines, get the last column, and add each entry of the WIN_COUNTRY_ISO3 column (which happens to be the last column) to a dictionary.
import sys
from itertools import islice
n=100
i = 0
col_dict={}
with open(r"filepath.csv") as file:
    for nline in iter(lambda: tuple(islice(file, n)), ()):
        row = nline.splitline
        WIN_COUNTRY_ISO3 = row[-1]
        for iso3 in WIN_COUNTRY_ISO3:
            if iso3 in col_dict.keys():
                col_dict[iso3]+=1
            else:
                col_dict[iso3]=1
        i+=1
        sys.stdout.write("\rDoing thing %i" % i)
        sys.stdout.flush()
print(col_dict)
However, this process takes a really long time. I tried to iterate through multiple lines at a time by using the code
for nline in iter(lambda: tuple(islice(file, n)), ())
Q1:
However, this doesn't seem to work and Python still processes the file line by line. Does anybody know the most efficient way for me to generate the count of each country for a really large file like mine?
The resulting table would look like this:
Country Freq
BEL 4543
FRA 4291
ESP 3992
LTU 3769
POL 3720
GRC 3213
DEU 3119
LVA 2992
LUX 2859
SWE 2802
PRT 2584
AUT 2374
BGR 1978
RUS 1770
TUR 1684
I would also like to create the frequency table for each year (in the YEAR column), if anybody can help me with this. Thank you.
Try this:
from collections import defaultdict
import csv
import re
result = defaultdict(int)
f = open(r"filepath.csv")
next(f)
for row in f:
    data = re.sub(r'[\s\d\'\[\]]', '', row)
    if data:
        for x in data.split(','):
            result[x] += 1
print(result)
If you can handle awk, here's one:
$ cat program.awk
{
    while(match($0,/'[A-Z]{3}'/)) {
        a[substr($0,RSTART+1,RLENGTH-2)]++
        $0=substr($0,RSTART+RLENGTH)
    }
}
END {
    for(i in a)
        print a[i],i
}
Execute it:
$ awk -f program.awk file
Output:
1 AUT
3 DEU
3 ESP
1 BGR
1 LTU
2 FRA
1 PRT
5 LUX
8 BEL
1 POL
1 GRC
1 LVA
2 SWE
$0 processes the whole record (row) of data, so it might include false hits from elsewhere in the record. You could enhance that with proper field separation, but since the raw file layout wasn't available I can't help any further. See GNU awk's FS and maybe FPAT.
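Since the question is in Python and also asks for counts per year, here is a minimal csv-module sketch that streams the file and looks only at the YEAR and WIN_COUNTRY_ISO3 fields, which also avoids false hits from other columns. It assumes the header names from the sample and that the list column is stored as a quoted string like "['BEL', 'FRA']":
import csv
from ast import literal_eval
from collections import Counter

overall = Counter()
by_year = {}  # year -> Counter of country codes

with open("filepath.csv", newline="") as f:
    for row in csv.DictReader(f):  # streams the file row by row
        codes = literal_eval(row["WIN_COUNTRY_ISO3"] or "[]")  # "['BEL', 'FRA']" -> list
        overall.update(codes)
        by_year.setdefault(row["YEAR"], Counter()).update(codes)

print(overall.most_common())
print(by_year.get("2017", Counter()).most_common())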

TypeError: while doing web scraping

I was just scraping data and want to make two columns of title and date, but a TypeError occurs:
TypeError: from_dict() got an unexpected keyword argument 'columns'
CODE :
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://timesofindia.indiatimes.com/topic/Hiv'
while True:
    response=requests.get(url)
    soup = BeautifulSoup(response.content,'html.parser')
    content = soup.find_all('div',{'class': 'content'})
    for contents in content:
        title_tag = contents.find('span',{'class':'title'})
        title= title_tag.text[1:-1] if title_tag else 'N/A'
        date_tag = contents.find('span',{'class':'meta'})
        date = date_tag.text if date_tag else 'N/A'
        hiv={title : date}
        print(' title : ', title ,' \n date : ' ,date )
    url_tag = soup.find('div',{'class':'pagination'})
    if url_tag.get('href'):
        url = 'https://timesofindia.indiatimes.com/' + url_tag.get('href')
        print(url)
    else:
        break
hiv1 = pd.DataFrame.from_dict(hiv , orient = 'index' , columns = ['title' ,'date'])
Pandas is updated to version 0.23.4, but the error still occurs.
The first thing I noticed is that the construction of the dictionary is off. I'm assuming you want a dictionary of every title: date pair; the way you have it now, it will only keep the last one.
When you do that, the index of the dataframe will be the keys and the values become the series/column, so technically there's only one column. I can create the two columns by resetting the index; that index is put into a column that I rename 'title'.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://timesofindia.indiatimes.com/topic/Hiv'
response=requests.get(url)
soup = BeautifulSoup(response.content,'html.parser')
content = soup.find_all('div',{'class': 'content'})
hiv = {}
for contents in content:
    title_tag = contents.find('span',{'class':'title'})
    title= title_tag.text[1:-1] if title_tag else 'N/A'
    date_tag = contents.find('span',{'class':'meta'})
    date = date_tag.text if date_tag else 'N/A'
    hiv.update({title : date})
    print(' title : ', title ,' \n date : ' ,date )
hiv1 = pd.DataFrame.from_dict(hiv , orient = 'index' , columns = ['date'])
hiv1 = hiv1.rename_axis('title').reset_index()
Output:
print (hiv1)
title date
0 I told my boyfriend I was HIV positive and thi... 01 Dec 2018
1 Pay attention to these 7 very common HIV sympt... 30 Nov 2018
2 Transfusion of HIV blood: Panel seeks time til... 2019-01-06T03:54:33Z
3 No. of pregnant women testing HIV+ dips; still... 01 Dec 2018
4 Busted:5 HIV AIDS myths 30 Nov 2018
5 Myths and taboos related to AIDS 01 Dec 2018
6 N/A N/A
7 Mumbai: Free HIV tests at six railway stations... 23 Nov 2018
8 HIV blood tranfusion: Tamil Nadu govt assures ... 2019-01-05T09:05:27Z
9 Autopsy performed on HIV+ve donor’s body at GRH 2019-01-03T07:45:03Z
10 Madras HC directs to videograph HIV+ve donor’s... 2019-01-01T01:23:34Z
11 HIV +ve Tamil Nadu teen who attempted suicide ... 2018-12-31T03:37:56Z
12 Another woman claims she got HIV-infected blood 2018-12-31T06:34:32Z
13 Another woman says she got HIV from donor blood 29 Dec 2018
14 HIV case: Five-member panel begins inquiry in ... 29 Dec 2018
15 Pregnant woman turns HIV positive after blood ... 26 Dec 2018
16 Pregnant woman contracts HIV after blood trans... 26 Dec 2018
17 Man attacks niece born with HIV for sleeping i... 16 Dec 2018
18 Health ministry implements HIV AIDS Act 2017: ... 11 Sep 2018
19 When meds don’t heal: HIV+ kids fight daily wa... 03 Sep 2018
I'm not quite sure why you're getting the error, though. It doesn't make sense since you are using an up-to-date Pandas. Maybe uninstall Pandas and then pip install it again?
Otherwise I guess you could just do it in 2 lines and name the columns after converting to dataframe:
hiv1 = pd.DataFrame.from_dict(hiv, orient = 'index').reset_index()
hiv1.columns = ['title','date']
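One more note: because a dict keeps only unique keys, repeated titles (the 'N/A' placeholders, for example) collapse into a single row. If every scraped entry should be kept, a minimal sketch that collects (title, date) tuples in a list instead, reusing the content and pd objects from the code above:
rows = []
for contents in content:
    title_tag = contents.find('span', {'class': 'title'})
    date_tag = contents.find('span', {'class': 'meta'})
    rows.append((title_tag.text[1:-1] if title_tag else 'N/A',
                 date_tag.text if date_tag else 'N/A'))
hiv1 = pd.DataFrame(rows, columns=['title', 'date'])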

Sum IFs of total count without recounting Multiple instances, only the closest date prior to the AS OF DATE

I need a formula that will SUM the count of, let's say, animal types as of a given date WITHOUT adding the previous counts for the same animal type, using only the closest date prior to or on the AS OF DATE. Different animal types may be added or taken away, so the list is not fixed.
I prefer not to do this in VBA or with a Pivot Table, but any help will be appreciated.
A B C
DATE ANIMAL TYPE COUNT
JAN 01 DOG 1
JAN 02 CAT 2
JAN 04 Fish 1
JAN 12 DOG 2
JAN 20 CAT 3
FEB 01 PIG 1
FEB 02 CAT 2
AS OF DATE TOTAL ANIMALS
JAN 03 3
JAN 13 5
JAN 21 6
FEB 01 7
FEB 02 6
So:
As of Jan 03, there were 3 animals total: 1 Dog and 2 Cats.
As of Jan 13, there were 5 animals total: 2 Dogs, 1 Fish and 2 Cats, NOT 6.
As of Jan 21, there were 6 animals total: 2 Dogs, 1 Fish and 3 Cats, NOT 9.
As of Feb 01, there were 7 animals total: 2 Dogs, 1 Fish, 1 Pig and 3 Cats, NOT 10.
So far, this is what I have. By using a helper column to filter the Animal Types, I get a list without duplicates. Then I put that in a cell with Data Validation to pick the Type, and the same for the Dates. However, I would like to drop the Type input, just choose the Date, and be able to get a total.
Here is what works, but it is not what I need:
=SUMIFS(TabData1[Count],TabData1[Date],MAX(IF(TabData1[Animal Type]=$G$2,IF(TabData1[Date]<=$F$2,TabData1[Date]))))
I want to do away with the single-cell reference ($G$2) to a single Animal Type and replace it with a range, so I get the latest count of every Animal Type as of a certain date. Like this, but this does not work:
=SUMIFS(TabData1[Count],TabData1[Date],MAX(IF(TabData1[Animal Type]=(OFFSET($J$2,0,0,COUNT(IF(ListAnimalType="","",1)),1)),IF(TabData1[Date]<=$F$2,TabData1[Date]))))
To simplify (OFFSET($J$2,0,0,COUNT(IF(ListAnimalType="","",1)),1)) you can use $J$2:$J$5
=SUMIFS(TabData1[Count],TabData1[Date],MAX(IF(TabData1[Animal Type]=$J$2:$J$5,IF(TabData1[Date]<=$F$2,TabData1[Date]))))
And it looks like this
=SUMIFS(TabData1[Count],TabData1[Date],MAX(IF({"Dog";"Cat";"Fish";"Dog";"Cat";"Pig";"Cat";0;0;0;0;0;0;0;0;0}={"Cat";"Dog";"Fish";"Pig"},IF(TabData1[Date]<=$F$2,TabData1[Date]))))
Like I said, I want one formula that will take each Animal Type, find its latest date on or before the date in a specified cell, return the count for that Animal Type, and then sum them all up.
