How to extract a table from a website (URL) using Python - python-3.x

The NIST dataset website contains some data for copper. How can I grab the table on the left (titled "HTML table format") from the website using a Python script, keep only the numbers in the second and third columns, and store all the data in a .csv file? I tried the code below, but it failed to get the correct format of the table.
import pandas as pd
# URL of the table
url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"
# Read the table into a pandas dataframe
df = pd.read_html(url, header=0, index_col=0)[0]
# Save the processed table to a CSV file
df.to_csv("nist_table.csv", index=False)

You could use:
.droplevel([0,1]) to remove the unwanted header rows
.dropna(axis=1, how='all') to remove the empty columns
.iloc[:,1:] to drop the first column and keep only the columns you need
Example
import pandas as pd
url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"
df = pd.read_html(url, header=[0,1,2,3])[1].droplevel([0,1], axis=1).dropna(axis=1, how='all').iloc[:,1:]
df
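To also store the result in a .csv file, as the question asks, one more line should do it (the filename here is arbitrary):
df.to_csv("nist_table.csv", index=False)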

For parsing HTML documents, BeautifulSoup is a great Python package; combined with the requests library, you can extract the data you want.
The code below should extract the desired data:
# import packages/libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

# define the URL variable, get the response and parse the HTML DOM contents
url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')

# declare the table variable and use soup to find the table in the HTML DOM
table = soup.find('table')

# iterate over table rows (tr) and append table data (td) to the rows list
rows = []
for i, row in enumerate(table.find_all('tr')):
    # only append data if it's after the 3rd row -> (MeV), (cm2/g), (cm2/g)
    if i > 3:
        rows.append([value.text.strip() for value in row.find_all('td')])

# create a DataFrame from the data appended to the rows list
df = pd.DataFrame(rows)

# export the data to a csv file called datafile.csv
df.to_csv(r"datafile.csv")

Related

How do I convert my response with byte characters to readable CSV - PYTHON

I am building an API to save CSVs from the SharePoint REST API using Python 3. I am using a public dataset as an example. The original CSV has 3 columns, Group, Team, FIFA Ranking, with the corresponding data in the rows.
After using data = response.content, the output of data is:
b'Group,Team,FIFA Ranking\r\nA,Qatar,50\r\nA,Ecuador,44\r\nA,Senegal,18\r\nA,Netherlands,8\r\nB,England,5\r\nB,Iran,20\r\nB,United States,16\r\nB,Wales,19\r\nC,Argentina,3\r\nC,Saudi Arabia,51\r\nC,Mexico,13\r\nC,Poland,26\r\nD,France,4\r\nD,Australia,38\r\nD,Denmark,10\r\nD,Tunisia,30\r\nE,Spain,7\r\nE,Costa Rica,31\r\nE,Germany,11\r\nE,Japan,24\r\nF,Belgium,2\r\nF,Canada,41\r\nF,Morocco,22\r\nF,Croatia,12\r\nG,Brazil,1\r\nG,Serbia,21\r\nG,Switzerland,15\r\nG,Cameroon,43\r\nH,Portugal,9\r\nH,Ghana,61\r\nH,Uruguay,14\r\nH,South Korea,28\r\n'
How do I convert the above to a CSV that pandas can manipulate, with the columns being Group, Team, FIFA Ranking and the corresponding data, dynamically so this method works for any CSV?
I tried:
data = response.content.decode('utf-8', 'ignore').split(',')
However, when I convert the data variable to a dataframe and then export the CSV, all the values end up in one column.
I tried:
data = response.content.decode('utf-8') or data = response.content.decode('utf-8', 'ignore') without the split
However, pandas does not accept this as a valid dataframe and raises an "invalid use of DataFrame constructor" error.
I tried:
data = json.loads(response.content)
However, the data itself is not valid JSON, so you get the error json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0).
Given:
data = b'Group,Team,FIFA Ranking\r\nA,Qatar,50\r\nA,Ecuador,44\r\nA,Senegal,18\r\n' #...
If you just want a CSV version of your data you can simply do:
with open("foo.csv", "wt", encoding="utf-8", newline="") as file_out:
file_out.writelines(data.decode())
If your objective is to load this data into a pandas dataframe and the CSV is not actually important, you can:
import io
import pandas
foo = pandas.read_csv(io.StringIO(data.decode()))
print(foo)
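Since the full byte string is given above, a quick sanity check of the parsed dataframe is possible:
print(foo.columns.tolist())  # -> ['Group', 'Team', 'FIFA Ranking']
print(foo.shape)             # -> (32, 3): one row per team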

Get second column of a data frame using pandas

I am new to Pandas in Python and I am having some difficulty returning the second column of a dataframe that has no column names, just numbers as indexes.
import pandas as pd
import os
directory = 'A://'
sample = 'test.txt'
# Test with Air Sample
fileAir = os.path.join(directory,sample)
dataAir = pd.read_csv(fileAir,skiprows=3)
print(dataAir.iloc[:,1])
The data I am working with would be similar to:
data = [[1,2,3],[1,2,3],[1,2,3]]
Then, using pandas, I want to get only the second column: [2, 2, 2].
You can use dataframe_name[column_index].values, e.g. df[1].values, or dataframe_name['column_name'].values, e.g. df['col1'].values.
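A minimal sketch of both forms (the frame here is built inline rather than read from test.txt; read_csv(..., header=None) would produce the same integer column labels):
import pandas as pd

# integer column labels, as read_csv(..., header=None) would produce
df = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
print(df[1].values)          # -> [2 2 2], the second column by label
print(df.iloc[:, 1].values)  # -> [2 2 2], the second column by position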

How to read comma separated data into data frame without column header?

I am getting a comma-separated data set as bytes which I need to:
Convert into a string from bytes
Create a CSV (can skip this if there is any way to jump straight to step 3)
Format and read as a data frame without treating the first row as column names
(Later I will use this df to compare with Oracle DB output.)
Input data:
val = '-8335,Q1,2017,2002-07-10 00:00:00.0,-,Mr. A,4342000,AnalystA,0,F\n-8336,Q1,2017,2002-07-11 00:00:00.0,-,Mr. B,4342001,Analyst A,0,F\n-8337,Q1,2017,2002-07-10 00:00:00.0,-,Mr. C,4342002,Analyst A,0,F\n'
type(val)
I managed to do up to step 3, but my first row is becoming the header. I am fine with giving the columns any values as headers, e.g. a, b, c, ...
#1 Code I tried to convert bytes to str
strval = val.decode('ascii').strip()
#2 Code to create the csv. First I created a blank csv and later appended the data
import csv
import pandas as pd

abc = ""
with open('csvfile.csv', 'w') as csvOutput:
    testData = csv.writer(csvOutput)
    testData.writerow(abc)
with open('csvfile.csv', 'a') as csvAppend:
    csvAppend.write(val)
#3 Now converting it into a dataframe
df = pd.read_csv('csvfile.csv')
# hdf = pd.read_csv('csvfile.csv', column=none) -- this gives NameError: name 'none' is not defined
output:
df
According to the read_csv documentation, it should be enough to add header=None as a parameter:
df = pd.read_csv('csvfile.csv', header=None)
In this way the header will be interpreted as a row of data. If you want to exclude this line then you need to add the skiprows=1 parameter:
df = pd.read_csv('csvfile.csv', header=None, skiprows=1)
You can do it without saving to a CSV file, like this; you don't need to convert the bytes to a string or save them to a file.
Here val is of type bytes; if it is of type str, as in your example, use io.StringIO instead of io.BytesIO.
import pandas as pd
import io
val = b'-8335,Q1,2017,2002-07-10 00:00:00.0,-,Mr. A,4342000,AnalystA,0,F\n-8336,Q1,2017,2002-07-11 00:00:00.0,-,Mr. B,4342001,Analyst A,0,F\n-8337,Q1,2017,2002-07-10 00:00:00.0,-,Mr. C,4342002,Analyst A,0,F\n'
buf_bytes = io.BytesIO(val)
pd.read_csv(buf_bytes, header=None)
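Since the question is fine with arbitrary column headers such as a, b, c, the names parameter of read_csv can supply them (the ten letters below assume the ten fields per row in the sample data; a fresh buffer is used because the one above has already been consumed):
pd.read_csv(io.BytesIO(val), header=None, names=list('abcdefghij'))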

Use Pandas to extract the values from a column based on some condition

I'm trying to pick a particular column from a csv file using Python's Pandas module, where I would like to fetch the Hostname if the column Group is SJ or DC.
Below is what I'm trying, but it's not printing anything:
import csv
import pandas as pd

pd.set_option('display.height', 500)
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 500)
low_memory = False

data = pd.read_csv('splnk.csv', usecols=['Hostname', 'Group'])
for line in data:
    if 'DC' and 'SJ' in line:
        print(line)
The data variable contains the values for the Hostname & Group columns as follows:
       Group    Hostname
11960    NaN    DB-Server
11961     DC    Sap-Server
11962     SJ    comput-server
Note: when printing, pandas truncated the data and did not show the complete output.
PS: I have used pandas.set_option to get the complete data on the terminal!
for line in data: doesn't iterate over row contents, it iterates over the column names. Pandas has several good ways to filter columns by their contents.
For example, you can use Series.isin() to select rows matching one of several values:
print(data[data['Group'].isin(['DC', 'SJ'])]['Hostname'])
If it's important that you iterate over rows, you can use df.iterrows():
for index, row in data.iterrows():
    if row['Group'] == 'DC' or row['Group'] == 'SJ':
        print(row['Hostname'])
If you're just getting started with Pandas, I'd recommend trying a tutorial to get familiar with the basic structure.
Try this:
import pandas as pd

data = pd.read_csv('splnk.csv', usecols=['Hostname', 'Group'])
hostnames = data[(data['Group'] == 'DC') | (data['Group'] == 'SJ')]['Hostname']  # corrected `hostname` to `Hostname`
print(hostnames)
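With the three sample rows shown above, either approach prints only Sap-Server and comput-server, since those are the rows whose Group is DC or SJ.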

Read from CSV and store in Excel tabs

I am reading multiple CSVs (via URL) into multiple Pandas DataFrames and want to store the results of each CSV into separate excel worksheets (tabs). When I keep writer.save() inside the for loop, I only get the last result in a single worksheet. And when I move writer.save() outside the for loop, I only get the first result in a single worksheet. Both are wrong.
import requests
import pandas as pd
from pandas import ExcelWriter

work_statements = {
    'sheet1': 'URL1',
    'sheet2': 'URL2',
    'sheet3': 'URL3'
}

for sheet, statement in work_statements.items():
    writer = pd.ExcelWriter('B.xlsx', engine='xlsxwriter')
    r = requests.get(statement)  # go to URL
    df = pd.read_csv(statement)  # read from URL
    df.to_excel(writer, sheet_name=sheet)
writer.save()
How can I get all three results in three separate worksheets?
You are re-initializing the writer object on every loop iteration. Initialize it once before the for loop and save the document once after the loop. Also, in the read_csv() line you should read the response content (wrapped in a buffer), not the URL (i.e., statement) saved in the dictionary:
import io

writer = pd.ExcelWriter('B.xlsx', engine='xlsxwriter')
for sheet, statement in work_statements.items():
    r = requests.get(statement)              # go to URL
    df = pd.read_csv(io.BytesIO(r.content))  # read from the response content
    df.to_excel(writer, sheet_name=sheet)
writer.save()
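As a side note, pd.ExcelWriter can also be used as a context manager, which saves the workbook automatically when the block exits; a minimal sketch under the same assumptions:
import io
import requests
import pandas as pd

with pd.ExcelWriter('B.xlsx', engine='xlsxwriter') as writer:
    for sheet, statement in work_statements.items():
        r = requests.get(statement)              # fetch the CSV from its URL
        df = pd.read_csv(io.BytesIO(r.content))  # parse the response body
        df.to_excel(writer, sheet_name=sheet)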
