How do I split the web-scraped column name? - python-3.x

I'm doing a web-scraping analysis on Pokémon data, and I have the code below:
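import pandas as pd
import requests

pokemon = 'https://pokemondb.net//pokedex/bulbasaur'
# Fetch the page with a browser-like User-agent, then take the second-to-last table and transpose it
html = requests.get(pokemon, headers={'User-agent': 'Mozilla/5.0'}).text
tables = pd.read_html(html)[-2].T
tables.columns = tables.iloc[0]  # promote the first row to column names
tables = tables.drop(tables.index[0])
tables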
That gives me something like this:
[Screenshot: columns with Pokémon locations]
What I want is to split each combined column into one column per game. For example, RedBlue would become separate Red and Blue columns, each showing the same data that RedBlue shows now. I think I can do that while scraping the data, but I don't know how to go about it.
I'm attaching a screenshot of the HTML tags related to the data below. I think Red and Blue can be split because there's a '&' between them. How do I do this?
[Screenshot: HTML tags]

In this case, you should iterate over the columns:
(Pdb) tables.columns
Index(['RedBlue', 'Yellow', 'GoldSilverCrystal', 'RubySapphire',
'FireRedLeafGreen', 'Emerald', 'DiamondPearlPlatinum',
'HeartGoldSoulSilver', 'BlackWhiteBlack 2White 2', 'XY',
'Omega RubyAlpha Sapphire', 'SunMoon', 'Ultra SunUltra Moon',
'Let's Go PikachuLet's Go Eevee', 'SwordShield'],
And create special-case logic like:
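for column in tables.columns:
    if column == "RedBlue":
        # create new columns that carry the same data as the combined one
        tables["Red"] = tables[column]
        tables["Blue"] = tables[column]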
Because no special character separates the game names within each string, it is difficult to write one rule that works for all columns.
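One way to generalise the special-case logic is an explicit lookup table from each fused header to its individual game names. The following is only a minimal sketch: `SPLIT_MAP` and `split_columns` are hypothetical names, the mapping covers just two headers and would have to be extended by hand, and the small DataFrame stands in for the scraped table with placeholder values.

```python
import pandas as pd

# Hypothetical mapping from a fused header to individual game names;
# extend it by hand for the remaining columns.
SPLIT_MAP = {
    "RedBlue": ["Red", "Blue"],
    "GoldSilverCrystal": ["Gold", "Silver", "Crystal"],
}


def split_columns(df, split_map):
    """Return a new frame with one column per game, copying the fused column's data."""
    parts = {}
    for column in df.columns:
        # Columns missing from the mapping are kept under their own name.
        for name in split_map.get(column, [column]):
            parts[name] = df[column]
    return pd.DataFrame(parts, index=df.index)


# Stand-in for the scraped table (placeholder values, not real data).
tables = pd.DataFrame(
    {"RedBlue": ["loc A"], "Yellow": ["loc B"], "GoldSilverCrystal": ["loc C"]}
)
result = split_columns(tables, SPLIT_MAP)
print(list(result.columns))
# ['Red', 'Blue', 'Yellow', 'Gold', 'Silver', 'Crystal']
```

A hand-written mapping is used here because a rule based on capital letters would also split names such as FireRed and LeafGreen in the wrong place.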

Related

How to programmatically print a long text string in 3 columns over multiple lines in excel vba

I have a long string of text. I want to print this in 3 columns over 2 pages in Excel, using VBA, e.g.:
A|B|C
-----
D|E|F
This is easy to do in MS Word: I just split the page into 3 columns and print the text, and it automatically flows into the next column/page once the previous one is full. But I need to do something similar in Excel.
Ideas I've explored:
Having a textbox that stretches over 2 pages - splits into 3 columns, but the arrangement is:
a|c|e
b|d|f
I can't seem to find a way to put a page break within the text box.
Have 6 separate text boxes and split the text up - I can't find a way of determining when one box is full so that the next can be started. I can't even determine how many lines the text takes up, as characters have varying widths.
Have 6 separate large cells and split the text up - same issues as above.
Does anyone know of a way to overcome this? I just want to replicate the behaviour of MS Word.
Edit: here's the code for a textbox with columns that I can't get to page break:
Dim tb_1
Set tb_1 = jumbledwords_sheet.Shapes.AddTextbox(msoTextOrientationHorizontal, 10, 7470, 488, 1300)
tb_1.TextFrame2.Column.Spacing = 10
tb_1.TextFrame2.Column.Number = 3
tb_1.TextFrame2.TextRange.Font.Size = 9
tb_1.TextFrame2.TextRange.Characters.Text = long_string
The structure of long_string isn't important, and I can break it up etc. to fit the solution.
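The fill order being asked for (A, B, C on the first page, then D, E, F on the second) can be sketched independently of Excel. The Python below is only an illustration of that chunking logic, not a VBA solution: it assumes a monospaced font so that a fixed number of characters fits on each line, and `split_into_columns`, `chars_per_line` and `lines_per_column` are hypothetical names and parameters.

```python
import textwrap


def split_into_columns(text, chars_per_line, lines_per_column, columns_per_page=3):
    """Wrap text, then group the lines into pages of column blocks in reading order."""
    lines = textwrap.wrap(text, width=chars_per_line)
    # Each block holds the lines of one column.
    blocks = [
        lines[i:i + lines_per_column]
        for i in range(0, len(lines), lines_per_column)
    ]
    # Each page holds up to columns_per_page consecutive blocks.
    return [
        blocks[i:i + columns_per_page]
        for i in range(0, len(blocks), columns_per_page)
    ]


long_string = " ".join(f"word{i}" for i in range(40))
pages = split_into_columns(long_string, chars_per_line=12, lines_per_column=4)
for page_number, page in enumerate(pages, start=1):
    print(page_number, [len(block) for block in page])
```

Each inner block could then be written to its own text box or cell. With a proportional font the line count would have to be measured some other way, which this sketch does not attempt.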

pdfplumber - How to extract table with no horizontal lines?

So I have a table like this one, with an unknown number of description lines. Some rows can have 1, 2, 5, or more lines, or even none:
(I removed all sensitive information.)
and I use:
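import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    pages = pdf.pages
    for page in pages:
        # extract_table() returns the table as a list of rows (lists of cell strings)
        table = page.extract_table()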
which does extract all the data from the table, but it treats the second column as one row.
I want to split the lines of the second column (or, better, all columns) at each small blank row, which I highlighted with red rectangles.
I know that I need to use table_settings={}, but I can't figure out yet which property (or properties) to use.
What I tried:
print(page.extract_table(table_settings={
    "horizontal_strategy": "text",
    "snap_y_tolerance": 3,
    "keep_blank_chars": True,
}))
which, again, splits the rows wherever it likes.
So, is it possible to extract a partly borderless table?
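One possible direction, offered only as a sketch, is to locate the blank gaps between rows yourself and hand them to pdfplumber as explicit row boundaries. The helper below is plain Python and does not call pdfplumber. It assumes its input is a list of dicts with "top" and "bottom" keys, which is the shape `page.extract_words()` returns; `find_row_breaks` and `min_gap` are hypothetical names, and the sample words are synthetic.

```python
def find_row_breaks(words, min_gap=6.0):
    """Return y positions midway through vertical gaps taller than min_gap."""
    breaks = []
    lowest_bottom = None
    for word in sorted(words, key=lambda w: w["top"]):
        if lowest_bottom is not None and word["top"] - lowest_bottom > min_gap:
            # The gap is tall enough to count as a blank row: put a boundary in its middle.
            breaks.append((lowest_bottom + word["top"]) / 2)
        if lowest_bottom is None:
            lowest_bottom = word["bottom"]
        else:
            lowest_bottom = max(lowest_bottom, word["bottom"])
    return breaks


# Synthetic words: two text lines 2 pt apart, then a 10 pt gap before the next row.
words = [
    {"text": "Item", "top": 100.0, "bottom": 110.0},
    {"text": "details", "top": 112.0, "bottom": 122.0},
    {"text": "Next", "top": 132.0, "bottom": 142.0},
]
breaks = find_row_breaks(words, min_gap=6.0)
print(breaks)  # [127.0]
```

The resulting positions, together with the table's top and bottom edges, could then be tried as `explicit_horizontal_lines` with `"horizontal_strategy": "explicit"` in `table_settings`. The right `min_gap` depends on the invoice layout and would need tuning.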

Reading an Excel file with merged cells in Python

I have an Excel table of the following type (the problem described below is caused by the presence of merged cells).
I am using read_excel from pandas to read it.
What I want: I would like to use the values in the first column as an index, and to have the values in the third column combined in one cell, e.g. like here.
What I get from directly applying read_excel can be seen here.
If needed, here is the code used to read the file (I am reading it from Google Drive in Google Colab):
import pandas as pd

path = '/content/drive/MyDrive/ExampleFile.xlsx'
pd.read_excel(path, header=0, index_col=0)
Could you please help?
Please let me know if anything in the question is unclear.
Here is one way to accomplish it. I created an xls file similar to yours, in which the first column has the heading sno.
# fill the null values with values from previous rows
df = df.ffill()
# combine the rows where class is the same and create a new column
df = df.assign(comb=df.groupby(['class'])['type'].transform(lambda x: ','.join(x)))
# drop the duplicated rows
df2 = df.drop_duplicates(subset=['class', 'comb'])[['class', 'comb']]
   class             comb
0  fruit     apple,orange
2   toys  car,truck,train
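The approach above can also be written so that the first column ends up as the index, as the question asks. This is a minimal self-contained sketch: the sample frame is made up to mimic what `read_excel` returns for merged cells (a value in the first row of each merged block and NaN below it), and the column names `sno`, `class` and `type` follow the answer's example.

```python
import pandas as pd

# Made-up data shaped like a sheet with merged cells in the first two columns.
df = pd.DataFrame({
    "sno": [1, None, 2, None, None],
    "class": ["fruit", None, "toys", None, None],
    "type": ["apple", "orange", "car", "truck", "train"],
})

# Forward-fill the gaps left by the merged cells.
df["sno"] = df["sno"].ffill().astype(int)
df["class"] = df["class"].ffill()

# Join the third column per group and use the first column as the index.
combined = (
    df.groupby(["sno", "class"], sort=False)["type"]
    .agg(",".join)
    .reset_index()
    .set_index("sno")
)
print(combined)
```

Using `agg` instead of `transform` returns one row per group directly, so no `drop_duplicates` step is needed.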

Use excel to analyze lab data and present preliminary findings?

I am trying to build an Excel file that takes soil lab test results, organizes them, and assigns them preliminary labels.
A sample test will include pH, SAR/ESP, and EC readings. Based on those readings I want to assign the results the label Normal, Saline, Saline-Sodic, or Sodic.
Each label has an associated range of values for each criterion. The simplest way to visualize what I'm looking for is a graph with two axes (SAR/ESP vs EC) and 4 quadrants. 3 of the quadrants refer to the same pH range.
I have a simple IF/THEN setup going right now that assigns each result all the possible labels based on each category, then assigns it the label that comes up the most. However, this is slow and ugly. Is there a way to consolidate this so that, when I import a table where each row is a test, I can have one column calculating this?
For example, pH is evaluated:
=IF($I$2<=8.5,"A B D","C")
With A = Saline, B = Saline-Sodic, C= Sodic, D = Normal.
Then SAR is evaluated:
=IF($I$3<=13,"A D","B C")
etc.
Then:
=COUNTIF($B$9:$B$12,"A*")
Iterated for each label.
The labels are then counted:
=INDEX(Table1[Column1],MATCH(MAX(Table1[Column3]),Table1[Column3],0))
Working properly: [screenshot]
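As a sketch of the consolidation asked about above, the four-quadrant decision can be expressed as one function per row instead of a vote across several columns. The Python below only illustrates the logic; the same branching could be written as a single nested IF in Excel. The SAR and pH cut-offs (13 and 8.5) come from the formulas in the question. The EC cut-off is not given there, so `ec_threshold` is a required argument, and treating Saline and Saline-Sodic as the high-EC quadrants is an assumption. `classify_soil` is a hypothetical name.

```python
def classify_soil(ph, sar, ec, ec_threshold, sar_threshold=13, ph_threshold=8.5):
    """Return (label, ph_consistent) for one test result."""
    high_sar = sar > sar_threshold
    high_ec = ec > ec_threshold  # assumption: Saline and Saline-Sodic are the high-EC quadrants
    if high_sar and high_ec:
        label = "Saline-Sodic"
    elif high_sar:
        label = "Sodic"
    elif high_ec:
        label = "Saline"
    else:
        label = "Normal"
    # Per the question's pH formula, only Sodic expects a pH above the threshold.
    expects_high_ph = label == "Sodic"
    ph_consistent = (ph > ph_threshold) == expects_high_ph
    return label, ph_consistent


# Illustrative readings only; the EC cut-off of 4 is a placeholder value.
print(classify_soil(ph=7.8, sar=15, ec=6, ec_threshold=4))
# ('Saline-Sodic', True)
```

The second return value flags rows whose pH disagrees with the quadrant, which replaces the "most frequent label" vote with an explicit check.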

Hide a column of a specific Row-Grouped Table in RDLC

I have a matrix that contains a row group and groups my products by Category. I have three categories: Laptops, Tablets, Televisions. My first two categories contain columns, e.g. RAM, that I don't want to display for Televisions. Each category is separated by a page break.
I'm trying to hide the column 'RAM' if the Category name is 'Televisions', but only for that specific page.
My structure:
[Categories]
[ProductID] [Processor] [RAM] [Colour] [etc]
Desired result:
[Laptop]
[125] [Intel Pentium] [250 MB RAM] [Black] [etc]
Desired result:
[Television]
[126] [Ix TV processor] [White] [etc]
Current result:
[Television]
[126] [Intel Pentium] [need to hide this] [White] [etc]
It is possible to do. You're having trouble with it because the matrix column falls outside the scope of the category row grouping. To hide the entire column, you have to move the category group above it. The easiest way to do that is to nest your matrix inside a list control. Put your Category grouping and page breaks at the List level, and then set the visibility of the matrix's RAM column (which is now entirely inside the category grouping scope, since it is inside the list control) based on the value of the Category, as follows:
=Iif(Fields!Category.Value = "Televisions",True,False)
