How to improve a Beautiful Soup webscrape loop? - python-3.x

I'm dipping my toes into web scraping with Beautiful Soup, and as a small project I'm scraping a Pokémon fansite to pull the Pokémon move names from a table. I want the move name and nothing else. Currently my code does that poorly: most of the output is wrong, and it only starts printing what I expect (beginning with Pound) near the very bottom.
Here is what the table looks like on the webpage.
What I've got:
import requests
from bs4 import BeautifulSoup as bs
# Load page
r = requests.get("https://bulbapedia.bulbagarden.net/wiki/List_of_moves")
# Convert to soup object
soup = bs(r.content)
# Get first table (aka the one we need)
first_table = soup.find('table')
# Loop and grab what we want
for td in first_table.find_all('td', style=False, align=False):
    download = td.find_all('a', href=True, title=True, style=False, align=False)
    for a in download:
        text = a.string
        print(text)
input()
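If you want to keep the pure BeautifulSoup approach, here is a hedged sketch that walks the table row by row instead of filtering on style/align attributes; it assumes the move name is the linked text in the second cell of each data row:
import requests
from bs4 import BeautifulSoup as bs

r = requests.get("https://bulbapedia.bulbagarden.net/wiki/List_of_moves")
soup = bs(r.content, "html.parser")

# Walk the first table row by row; the move name is assumed to be the link
# in the second cell of each data row.
first_table = soup.find("table")
for row in first_table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) < 2:
        continue  # header or spacer rows have no data cells
    link = cells[1].find("a")
    if link:
        print(link.get_text(strip=True))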

All of this is not even necessary. You can simply use pandas to scrape the entire table:
import requests
import pandas as pd
r = requests.get("https://bulbapedia.bulbagarden.net/wiki/List_of_moves")
df = pd.read_html(r.content)[1]
print(df)
Output:
# Name Type Category Contest PP Power Accuracy Gen
0 1 Pound Normal Physical Tough 35 40 100% I
1 2 Karate Chop* Fighting Physical Tough 25 50 100% I
2 3 Double Slap Normal Physical Cute 10 15 85% I
3 4 Comet Punch Normal Physical Tough 15 18 85% I
4 5 Mega Punch Normal Physical Tough 20 80 85% I
.. ... ... ... ... ... .. ... ... ...
821 822 Fiery Wrath Dark Special ??? 10 90 100% VIII
822 823 Thunderous Kick Fighting Physical ??? 10 90 100% VIII
823 824 Glacial Lance Ice Physical ??? 5 130 100% VIII
824 825 Astral Barrage Ghost Special ??? 5 120 100% VIII
825 826 Eerie Spell Psychic Special ??? 5 80 100% VIII
[826 rows x 9 columns]
You can also send these values to a neat csv file by adding this line to your code:
df.to_csv('Moves.csv', index = False)
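Since the original goal was just the move names, a short follow-up sketch (assuming the column is labelled Name, as in the output above):
import requests
import pandas as pd

r = requests.get("https://bulbapedia.bulbagarden.net/wiki/List_of_moves")
df = pd.read_html(r.content)[1]

# Keep only the move names; "Name" matches the column label shown in the output above
print(df["Name"].tolist())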

Related

How to print each line from multiline text in pandas dataframe in python?

I have a dataset in which there is a column named "Mobile_Features". Each row of this column contains multiline text, and printing a row prints all of the text at once.
How can I print each line from the multiline text separately?
df
   Mobile_Name  Mobile_Price  Mobile_Features
1. Realme       25000         54 MP Camera
                              12 GB RAM
                              750 Snapdragon Processor
                              Best in Class
                              4.5 rating / 5
2. Celkon       18000         45 MP Camera
                              8 GB RAM
                              750 Snapdragon Processor
                              Best in Class
                              4.7 rating / 5
for each_row in df['Mobile_Features']:
    for i in each_row:
        print(i)
But this prints each character instead of one line at a time.
How can I print one line at a time? It would be great if someone could help me. Thank you.
You need to split the data in each_row by newline (\n):
for each_row in df['Mobile_Features']:
    for i in each_row.split('\n'):
        print(i)
or
for each_row in df['Mobile_Features'].str.split('\n'):
    for i in each_row:
        print(i)
Output (for either code for your sample data):
54 MP Camera
12 GB RAM
750 Snapdragon Processor
Best in Class
4.5 rating / 5
45 MP Camera
8 GB RAM
750 Snapdragon Processor
Best in Class
4.7 rating / 5
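If you prefer to avoid the explicit loops, a pandas-only sketch using Series.explode gives the same lines (the sample data here is a hypothetical stand-in for the question's column):
import pandas as pd

# Hypothetical stand-in for the question's Mobile_Features column
df = pd.DataFrame({
    "Mobile_Features": [
        "54 MP Camera\n12 GB RAM\n750 Snapdragon Processor",
        "45 MP Camera\n8 GB RAM\n750 Snapdragon Processor",
    ]
})

# Split each cell on newlines, then flatten so every line becomes its own row
for line in df["Mobile_Features"].str.split("\n").explode():
    print(line)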

Excel diagram with time value or number on category ax

I need to make a diagram that shows the curves of different ceramic firing schedules. I want them plotted in one diagram against a time-proportional axis, so that the different durations are shown correctly. I don't seem to be able to achieve this.
What I have is the following:
First table:
Pendelen

Temp. per uur    Stooktemp.    Stooktijd 4    Stooktijd Cum.4
 95                120          1:15:47         1,26
205                537          2:02:03         3,30
 80                620          1:02:15         4,33
150               1075          3:02:00         7,37
 50               1196          2:25:12         9,79
 10               1196          0:10:00         9,95
Total                           9:57:17
Second table:
Pendelen

Temp. per uur    Stooktemp.    Stooktijd 5    Stooktijd Cum.5
140                540          3:51:26         3,86
 65                650          1:41:32         5,55
140               1095          3:10:43         8,73
 50               1222          2:32:24        11,27
Total                          11:16:05
The lines in the diagram should represent the 'Stooktijd Cum.' for both programs 4 and 5 (the cumulative time needed to fire the kiln up from its previous temperature in the schedule). One should be able to see in the diagram that program 5 takes more time to reach its end temperature.
What I achieved is nothing more than a diagram with two lines, but plotted only against the 'Stooktijd Cum.4' points from program 4. The image shows a screenshot of this diagram.
But as you can see, this doesn't show that program 5 takes more time to reach its end. I would like it to show something like this:
Create this table (cumulative firing time in hours against kiln temperature, one block per program):
p4
 0         0
 1.26    120
 3.3     537
 4.33    620
 7.37   1075
 9.79   1196
 9.95   1196

p5
 0        10
 3.86    540
 5.55    650
 8.73   1095
11.27   1222

Select all > F11 > Design > Change Chart Type > Scatter with Straight Lines and Markers
Here's my tryout:
Please share whether it works or not. (:
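The same idea outside Excel, as a matplotlib sketch in Python (the time/temperature pairs are taken from the tables above; the chart itself is not part of the original Excel answer):
import matplotlib.pyplot as plt

# Cumulative firing time (hours) vs. kiln temperature, from the tables above
p4_time = [0, 1.26, 3.30, 4.33, 7.37, 9.79, 9.95]
p4_temp = [0, 120, 537, 620, 1075, 1196, 1196]
p5_time = [0, 3.86, 5.55, 8.73, 11.27]
p5_temp = [10, 540, 650, 1095, 1222]

plt.plot(p4_time, p4_temp, marker="o", label="Program 4")
plt.plot(p5_time, p5_temp, marker="o", label="Program 5")
plt.xlabel("Cumulative firing time (hours)")
plt.ylabel("Kiln temperature")
plt.legend()
plt.show()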

Convert .txt file into multi index dataframe pandas

I have a very unorganized dataset located in a text file, say file.txt.
A sample looks something like this:
TYPE Invoice C AC DATE TIME Total Invoice Qty1 ITEMVG By Total 3,000.00
Piece Item
5696 01/03/2018 09:21 32,501.35 1 Golden Plate ÞÔÞæÇä ÈÞÑ 6,517.52
1 áÈä ÑæÇÈí ÊÚäÇíá 2 ßÛ 4,261.45
1 Magic chef pop corn 907g 3,509.43
1 áÈäÉ ÊÚäÇíá ÔÝÇÝÉ 1 ßíáæ 9,525.60
1 KHOURY UHT 1 L 2,506.74
1 ÎÈÒ ÔãÓíä ÕÛíÑ 1,002.69
2 Almera 200Tiss 2,506.74
1.55 VG Potato 1,550.17
0.41 VG Eggplant 619.67
1 Delivery Charge 501.35
5697 01/03/2018 09:31 15,751.35 0.5 Halloum 1K. 4,476.03
0.59 Cheese double Cream 3,253.75
3 ãæáÇä ÏæÑ ÎÈÒ æÓØ 32 3,760.11
3 ãæáÇä ÏæÑ ÎÈÒ æÓØ 32 3,760.11
1 Delivery Charge 501.35
I want to import it into a pandas DataFrame using a MultiIndex. Can someone help me with this?
So far I have only been able to read it in as plain text:
# Obtain the unorganized data from the txt file
file1 = open('file.txt', 'r')
UnOrgan = file1.read()
You should be able to just read it in using read_table.
import pandas as pd
df = pd.read_table(<your file>, sep="\t", header=<rows with column info>)
I'm guessing that the separator is a tab.
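For example, a minimal sketch assuming the columns really are tab-separated and the first line holds the headers (adjust sep and header to match the actual file):
import pandas as pd

# Assumes tab-separated columns with the header on the first line
df = pd.read_table("file.txt", sep="\t", header=0)
print(df.head())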

How to convert bytes data into a python pandas dataframe?

I would like to convert 'bytes' data into a Pandas dataframe.
The data looks like this (few first lines):
(b'#Settlement Date,Settlement Period,CCGT,OIL,COAL,NUCLEAR,WIND,PS,NPSHYD,OCGT'
b',OTHER,INTFR,INTIRL,INTNED,INTEW,BIOMASS\n2017-01-01,1,7727,0,3815,7404,3'
b'923,0,944,0,2123,948,296,856,238,\n2017-01-01,2,8338,0,3815,7403,3658,16,'
b'909,0,2124,998,298,874,288,\n2017-01-01,3,7927,0,3801,7408,3925,0,864,0,2'
b'122,998,298,816,286,\n2017-01-01,4,6996,0,3803,7407,4393,0,863,0,2122,998'
The column headers appear at the top; each subsequent line is a timestamp followed by numbers.
Is there a straightforward way to do this?
Thank you very much.
#Paula Livingstone:
This seems to work:
import pandas as pd

s = str(bytes_data, 'utf-8')
with open('data.txt', 'w') as file:
    file.write(s)
df = pd.read_csv('data.txt')
Maybe this can be done without using a file in between.
I had the same issue and found this library https://docs.python.org/2/library/stringio.html from the answer here: How to create a Pandas DataFrame from a string
Try something like:
from io import StringIO
import pandas as pd

s = str(bytes_data, 'utf-8')
data = StringIO(s)
df = pd.read_csv(data)
You can also use BytesIO directly:
from io import BytesIO
import pandas as pd

df = pd.read_csv(BytesIO(bytes_data))
This will save you the step of transforming bytes_data into a string.
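For example, with a small hypothetical stand-in for bytes_data that mirrors the question's format:
from io import BytesIO
import pandas as pd

# Hypothetical stand-in for bytes_data, mirroring the question's header and rows
bytes_data = (
    b"#Settlement Date,Settlement Period,CCGT,OIL,COAL\n"
    b"2017-01-01,1,7727,0,3815\n"
    b"2017-01-01,2,8338,0,3815\n"
)

df = pd.read_csv(BytesIO(bytes_data))
print(df.head())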
Ok cool, your input formatting is quite awkward but the following works:
import pandas as pd

with open('file.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')  # read the file in as one string
df = pd.Series(" ".join(data.strip(' b\'').strip('\'').split('\' b\'')).split('\\n')).str.split(',', expand=True)
print(df)
this produces the following:
0 1 2 3 4 5 6 7 \
0 #Settlement Date Settlement Period CCGT OIL COAL NUCLEAR WIND PS
1 2017-01-01 1 7727 0 3815 7404 3923 0
2 2017-01-01 2 8338 0 3815 7403 3658 16
3 2017-01-01 3 7927 0 3801 7408 3925 0
8 9 10 11 12 13 14 15
0 NPSHYD OCGT OTHER INTFR INTIRL INTNED INTEW BIOMASS
1 944 0 2123 948 296 856 238
2 909 0 2124 998 298 874 288
3 864 0 2122 998 298 816 286 None
In order for this to work you will need to ensure that your input file contains only a collection of complete rows. For this reason I removed the partial row for the purposes of the test.
Since you said the data source is an HTTP GET request, the initial read would take place using pandas.read_html.
More detail on this can be found here; note specifically the section on the io parameter (io : str or file-like).

Modifying A SAS Data set after it was created using PROC IMPORT

I have a dataset like this
Obs MinNo EurNo MinLav EurLav
1 103 15.9 92 21.9
2 68 18.5 126 18.5
3 79 15.9 114 22.3
My goal is to create a data set like this from the dataset above:
Obs Min Eur Lav
1 103 15.9 No
2 92 21.9 Yes
3 68 18.5 No
4 126 18.5 Yes
5 79 15.9 No
6 114 22.3 Yes
Basically I'm taking the 4 columns and stacking them into 2 columns, plus a categorical variable indicating which pair of columns each row came from.
Here's what I have so far:
PROC IMPORT DATAFILE='f:\data\order_effect.xls' DBMS=XLS OUT=orderEffect;
RUN;
DATA temp;
    INFILE orderEffect;
    INPUT minutes euros @@;
    IF MOD(_N_,2)^=0 THEN lav='Yes';
    ELSE lav='No';
RUN;
My question, though, is: how can I import the Excel sheet and then modify the SAS data set it creates, so that I can stack the second two columns below the first two and add a third column based on which columns each row came from?
I know how to do this by splitting the data set into two data sets and then appending one onto the other, but with the MOD function above it would be a lot faster.
You were very close, but you're misunderstanding what PROC IMPORT does.
When PROC IMPORT completes, it will have created a SAS data set named orderEffect containing SAS variables from the columns in your worksheet. You just need a little DATA step program to give the result you want. Try this:
data want;
    /* Define the SAS variables you want to keep */
    format Min 8. Eur 8.1;
    length Lav $3;
    keep Min Eur Lav;
    set orderEffect;

    Min = MinNo;
    Eur = EurNo;
    Lav = 'No';
    output;

    Min = MinLav;
    Eur = EurLav;
    Lav = 'Yes';
    output;
run;
This assumes that the PROC IMPORT step created a data set with those names. Run that step first to be sure and revise the program if necessary.
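For comparison only, the same wide-to-long reshape in pandas rather than SAS (not part of the original answer; the column names are copied from the question):
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "MinNo": [103, 68, 79],
    "EurNo": [15.9, 18.5, 15.9],
    "MinLav": [92, 126, 114],
    "EurLav": [21.9, 18.5, 22.3],
})

# Stack the No/Lav column pairs and record which pair each row came from
no = df[["MinNo", "EurNo"]].rename(columns={"MinNo": "Min", "EurNo": "Eur"}).assign(Lav="No")
lav = df[["MinLav", "EurLav"]].rename(columns={"MinLav": "Min", "EurLav": "Eur"}).assign(Lav="Yes")
long_df = pd.concat([no, lav]).sort_index(kind="mergesort").reset_index(drop=True)
print(long_df)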
