How to extract different tables in an Excel sheet using Python - python-3.x

In one Excel file, sheet 1, there are 4 tables at different locations in the sheet. How do I read those 4 tables? For reference I have added a picture snap from Google as an example. Without using indexes, is there any other way to extract the tables?

I assume your tables are formatted as "Excel Tables".
You can create an Excel table by marking a range and then clicking Insert > Table.
There is a good guide from Samuel Oranyeli on how to import Excel Tables with Python. I have used his code and show it with examples below.
I have used the following data in Excel, where each colour in the original sheet represents a separate table.
Remarks about code:
The following part can be used to check which tables exist in the worksheet that we are working with:
# check what tables that exist in the worksheet
print({key : value for key, value in ws.tables.items()})
In our example this code will give:
{'Table2': 'A1:C18', 'Table3': 'D1:F18', 'Table4': 'G1:I18', 'Table5': 'J1:K18'}
Here you set the dataframe names. Be cautious: if the number of dataframes mismatches the number of tables, you will get an error.
# Extract all the tables to individually dataframes from the dictionary
Table2, Table3, Table4, Table5 = mapping.values()
# Print each dataframe
print(Table2.head(3)) # Print first 3 rows from df
print(Table2.head(3)) gives:
Index  first_name  last_name   address
0      Aleshia     Tomkiewicz  14 Taylor St
1      Evan        Zigomalas   5 Binney St
2      France      Andrade     8 Moor Place
Full code:
#import libraries
from openpyxl import load_workbook
import pandas as pd

# read file
wb = load_workbook("G:/Till/Tables.xlsx")  # Set the filepath + filename

# select the sheet where the tables are located
ws = wb["Tables"]

# check which tables exist in the worksheet
print({key: value for key, value in ws.tables.items()})

mapping = {}

# loop through all the tables and add them to a dictionary
for entry, data_boundary in ws.tables.items():
    # parse the data within the ref boundary
    data = ws[data_boundary]

    ### extract the data ###
    # the inner list comprehension gets the values for each cell in the table
    content = [[cell.value for cell in ent] for ent in data]
    header = content[0]
    # the contents, excluding the header
    rest = content[1:]

    # create a dataframe with the column names
    # and pair the table name with the dataframe
    df = pd.DataFrame(rest, columns=header)
    mapping[entry] = df

# print(mapping)

# Extract all the tables into individual dataframes from the dictionary
Table2, Table3, Table4, Table5 = mapping.values()

# Print each dataframe
print(Table2)
print(Table3)
print(Table4)
print(Table5)
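If you cannot rely on the number or order of tables, it is safer to pick the dataframes out of the dictionary by table name instead of unpacking them. A minimal sketch, reusing the mapping dictionary built above:

# Access one table by its Excel table name
table2 = mapping["Table2"]

# Or iterate over whatever tables were found, regardless of how many there are
for name, df in mapping.items():
    print(name, df.shape)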
Example data, example file:

first_name   last_name   address               city                            county                 postal
Aleshia      Tomkiewicz  14 Taylor St          St. Stephens Ward               Kent                   CT2 7PP
Evan         Zigomalas   5 Binney St           Abbey Ward                      Buckinghamshire        HP11 2AX
France       Andrade     8 Moor Place          East Southbourne and Tuckton W  Bournemouth            BH6 3BE
Ulysses      Mcwalters   505 Exeter Rd         Hawerby cum Beesby              Lincolnshire           DN36 5RP
Tyisha       Veness      5396 Forth Street     Greets Green and Lyng Ward      West Midlands          B70 9DT
Eric         Rampy       9472 Lind St          Desborough                      Northamptonshire       NN14 2GH
Marg         Grasmick    7457 Cowl St #70      Bargate Ward                    Southampton            SO14 3TY
Laquita      Hisaw       20 Gloucester Pl #96  Chirton Ward                    Tyne & Wear            NE29 7AD
Lura         Manzella    929 Augustine St      Staple Hill Ward                South Gloucestershire  BS16 4LL
Yuette       Klapec      45 Bradfield St #166  Parwich                         Derbyshire             DE6 1QN
Fernanda     Writer      620 Northampton St    Wilmington                      Kent                   DA2 7PP
Charlesetta  Erm         5 Hygeia St           Loundsley Green Ward            Derbyshire             S40 4LY
Corrinne     Jaret       2150 Morley St        Dee Ward                        Dumfries and Galloway  DG8 7DE
Niesha       Bruch       24 Bolton St          Broxburn, Uphall and Winchburg  West Lothian           EH52 5TL
Rueben       Gastellum   4 Forrest St          Weston-Super-Mare               North Somerset         BS23 3HG
Michell      Throssell   89 Noon St            Carbrooke                       Norfolk                IP25 6JQ
Edgar        Kanne       99 Guthrie St         New Milton                      Hampshire              BH25 5DF

You may convert your Excel sheet to a CSV file and then use the csv module or pandas to grab rows.

import pandas as pd

# Convert the Excel sheet to CSV
read_file = pd.read_excel("Test.xlsx")
read_file.to_csv("Test.csv", index=None, header=True)

# read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed
df = pd.read_csv("Test.csv")
print(df)

For a better approach, please provide a sample Excel file.
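Note that the CSV round trip is only needed if some downstream tool expects a CSV file; pandas can read the sheet directly. A minimal sketch, with "Test.xlsx" standing in for your file:

import pandas as pd

# read_excel returns a DataFrame directly; no intermediate CSV needed
df = pd.read_excel("Test.xlsx")
print(df)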

You need two things:
Access OpenXML data via python: https://github.com/python-openxml/python-xlsx
Find the tables in the file, via what is called a DefinedName: https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.spreadsheet.definedname?view=openxml-2.8.1
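As a rough sketch of that idea using openpyxl instead (this enumerates defined names, i.e. named ranges, which is one way such ranges are stored; the filename is hypothetical, and wb.defined_names behaving like a dict assumes openpyxl 3.1+):

from openpyxl import load_workbook

wb = load_workbook("Tables.xlsx")

# Each defined name maps to one or more (sheet, cell-range) destinations
for name, dn in wb.defined_names.items():
    for sheet_title, coord in dn.destinations:
        cells = wb[sheet_title][coord]  # tuple of row tuples
        print(name, sheet_title, coord, len(cells))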


Related

Incremental append updated rows (based on some columns) in PySpark Palantir Foundry

I would like to create a historical dataset to which I append all NEW records of a dataset.
By NEW records I mean records that are new or modified: all those that match an existing row on every column except the 'reference_date' one.
I insert here the piece of code that allows me to do it on all columns, but I can't figure out how to implement the exclusion condition for one column.
Inputs:

historical (previous):

ID  A    B         dt_run
1   abc  football  2022-02-14 21:00:00
2   dba  volley    2022-02-14 21:00:00
3   wxy  tennis    2022-02-14 21:00:00

input_df (new data):

ID  A    B
1   abc  football
2   dba  football
3   wxy  tennis
7   abc  tennis
DESIRED OUTPUT (new records marked with *):

ID  A    B         dt_run
1   abc  football  2022-02-14 21:00:00
2   dba  volley    2022-02-15 21:00:00
3   wxy  tennis    2022-02-01 21:00:00
2   dba  football  2022-03-15 14:00:00  *
7   abc  tennis    2022-03-15 14:00:00  *
My code, which doesn't work:

@incremental(snapshot_inputs=['input_df'])
@transform(historical=Output(....), input_df=Input(....))
def append(input_df, historical):
    input_df = input_df.dataframe().withColumn('dt_run', F.to_timestamp(F.lit(datetime.now())))
    historical = historical.write_dataframe(dataset_input_df.distinct()
                                            .subtract(historical.dataframe('previous', schema=input_df.schema)))
    return historical
I've tested the following script and it works. In the following example, you don't need to drop/select columns. Using withColumn you create the missing column in input_df and also change the values in the existing column in historical. This way you can safely do subtract on the whole dataframe. Later, since you append the data rows, the old historical rows will stay intact with their old timestamps.
from transforms.api import transform, Input, Output, incremental
from pyspark.sql import functions as F
from datetime import datetime

@incremental(snapshot_inputs=['input_df'])
@transform(
    historical=Output("...."),
    input_df=Input("....")
)
def append(input_df, historical):
    now = datetime.now()
    df_inp = input_df.dataframe().withColumn('dt_run', F.to_timestamp(F.lit(now)))
    df_hist = historical.dataframe('previous', df_inp.schema).withColumn('dt_run', F.to_timestamp(F.lit(now)))
    historical.write_dataframe(df_inp.subtract(df_hist))
You can use code similar to what is found here.
Once you have combined the previous output with the new input, you just need to use PySpark to determine which is the newest row and only keep that row instead of line 19.
A possible implementation for this could be using F.row_number e.g.
from datetime import datetime
from transforms.api import transform, Input, Output, incremental
import pyspark.sql.window as W
import pyspark.sql.functions as F

@incremental()
@transform(
    input_df=Input('/examples/input_df'),
    output_df=Output('/examples/output_df')
)
def incremental_group_by(input_df, output_df):
    # Get new rows
    new_input_df = input_df.dataframe().withColumn('dt_run', F.to_timestamp(F.lit(datetime.now())))

    # Union with the old rows
    out_schema = new_input_df.schema
    both_df = new_input_df.union(
        output_df.dataframe('previous', schema=out_schema)
    )

    partition_cols = ["A", "B"]
    # Keep only the most recent row per partition
    totals_df = both_df.withColumn(
        "row_number",
        F.row_number().over(W.Window.partitionBy(*partition_cols).orderBy(F.desc("dt_run")))
    ).where(F.col("row_number") == 1).drop("row_number")

    # To fully replace the output, we always set the output mode to 'replace'.
    # Checkpoint the totals dataframe before changing the output mode
    # (localCheckpoint returns a new dataframe, so reassign it).
    totals_df = totals_df.localCheckpoint(eager=True)
    output_df.set_mode('replace')
    output_df.write_dataframe(totals_df.select(out_schema.fieldNames()))
Edit: The main difference between my answer and the one above is whether you want multiple rows in the output where 'A' and 'B' are the same. Which one is better depends on your use case!
I have used the union() function along with dropDuplicates():

from datetime import datetime
import pyspark.sql.functions as fx

def append(df_input, df_hist):
    # union both frames (input rows lack dt_run) and drop duplicates on the key columns
    df_union = df_hist.unionByName(df_input, allowMissingColumns=True).dropDuplicates(['ID', 'A', 'B'])
    # coalesce fills dt_run only where it is null, i.e. for the newly appended rows
    historical = df_union.withColumn('dt_run', fx.coalesce('dt_run', fx.to_timestamp(fx.lit(datetime.now()))))
    return historical

df_hist = spark.createDataFrame(
    [(1, 'abc', 'football', '2022-02-14 21:00:00'),
     (2, 'dba', 'volley', '2022-02-14 21:00:00'),
     (3, 'wxy', 'tennis', '2022-02-14 21:00:00')],
    schema=['ID', 'A', 'B', 'dt_run'])
df_hist = df_hist.withColumn('dt_run', fx.col('dt_run').cast('timestamp'))

df_input = spark.createDataFrame(
    [(1, 'abc', 'football'), (2, 'dba', 'football'), (3, 'wxy', 'tennis'), (7, 'abc', 'tennis')],
    schema=['ID', 'A', 'B'])

df_historical = append(df_input, df_hist)
df_historical.show(truncate=False)

How to convert a non-fixed width spaced delimited file to a pandas dataframe

ID 0x4607
Delivery_person_ID INDORES13DEL02
Delivery_person_Age 37.000000
Delivery_person_Ratings 4.900000
Restaurant_latitude 22.745049
Restaurant_longitude 75.892471
Delivery_location_latitude 22.765049
Delivery_location_longitude 75.912471
Order_Date 19-03-2022
Time_Orderd 11:30
Time_Order_picked 11:45
Weather conditions Sunny
Road_traffic_density High
Vehicle_condition 2
Type_of_order Snack
Type_of_vehicle motorcycle
multiple_deliveries 0.000000
Festival No
City Urban
Time_taken (min) 24.000000
Name: 0, dtype: object
In an online exam, the machine learning training dataset has been split into multiple txt files. Each file contains data as shown above. I am unable to understand how to read this data in Python and convert it to a pandas dataframe. There are more than 45,000 txt files, each containing one record of the dataset. I will have to merge those 45,000 txt files into a single .csv file. Any help will be highly appreciated.
Each of your txt files seems to contain only 1 row (as a Series).
Unfortunately, these rows are not in an easy-to-read format (for the machines): it looks like they were just printed out and saved like that.
Because of this, in my solution the indices of the dataframe (which correspond to the Name in the last row of each file) won't be read: my final dataframe is reindexed.
You'll have to iterate through all your files. Just for my example, I'm using a list of the file names:

import pandas as pd

file_names = ['file0.txt', 'file1.txt']

rows = [pd.read_csv(file_name, sep=r'\s\s+', header=None, index_col=0,
                    skipfooter=1, engine='python').iloc[:, 0]
        for file_name in file_names]

df = pd.DataFrame(rows).reset_index(drop=True)
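To scale this to the 45,000 files from the question, you can collect the file names with glob instead of listing them by hand and write the combined frame straight to CSV. A sketch, assuming the files sit in a hypothetical exam_files/ folder:

import glob
import pandas as pd

file_names = glob.glob("exam_files/*.txt")

rows = [pd.read_csv(file_name, sep=r'\s\s+', header=None, index_col=0,
                    skipfooter=1, engine='python').iloc[:, 0]
        for file_name in file_names]

pd.DataFrame(rows).reset_index(drop=True).to_csv("merged.csv", index=False)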
You can simply use basic python to do it with something like:
data = """ID 0x4607
Delivery_person_ID INDORES13DEL02
Delivery_person_Age 37.000000
Delivery_person_Ratings 4.900000
Restaurant_latitude 22.745049
Restaurant_longitude 75.892471
Delivery_location_latitude 22.765049
Delivery_location_longitude 75.912471
Order_Date 19-03-2022
Time_Orderd 11:30
Time_Order_picked 11:45
Weather conditions Sunny
Road_traffic_density High
Vehicle_condition 2
Type_of_order Snack
Type_of_vehicle motorcycle
multiple_deliveries 0.000000
Festival No
City Urban
Time_taken (min) 24.000000"""
for line in data.split('\n'):
    content = line.split()
    name = ' '.join(content[:-1])
    value = content[-1]
    print(name, value)
And once you have the name and the value, you can add them to a pandas dataframe.
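For example, collecting the name/value pairs into a dict and wrapping it in a one-row frame (a sketch building on the loop above; each parsed file would contribute one such row):

import pandas as pd

record = {}
for line in data.split('\n'):
    content = line.split()
    record[' '.join(content[:-1])] = content[-1]

# one parsed file becomes one row
df = pd.DataFrame([record])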

How to realign column headers with the respective rows after importing a csv data set

I tried to load data from a csv file, but I can't seem to be able to re-align the column headers with the respective rows for a clearer data frame.
Below is the output of
df.head()
bookID,title,authors,average_rating,isbn,isbn13,language_code,# num_pages,ratings_count,text_reviews_count
0 1,Harry Potter and the Half-Blood Prince (Harr...
1 2,Harry Potter and the Order of the Phoenix (H...
2 3,Harry Potter and the Sorcerer's Stone (Harry...
3 4,Harry Potter and the Chamber of Secrets (Har...
4 5,Harry Potter and the Prisoner of Azkaban (Ha...
import pandas as pd
file = 'C:/Users/user/Documents/Temporary data sets for practise only/books.csv'
df = pd.read_csv(file, sep ='/t')
df.head()
You've set '/t' as your delimiter value (note that a tab would be '\t' anyway), while your document extract shows commas between the columns. Try

df = pd.read_csv(file, sep=',')

or just

df = pd.read_csv(file)

since ',' is the standard delimiter.

How to write content of a list into an Excel sheet using openpyxl

I have the following list:
d_list = ["No., Start Name, Destination, Distance (miles)",
"1,ALBANY,NY CRAFT,28",
"2,GRACO,PIONEER,39",
"3,FONDA,ROME,41",
"4,NICCE,MARRINERS,132",
"5,TOUCAN,SUBVERSIVE,100",
"6,POLL,CONVERGENCE,28",
"7,STONE HOUSE,HUDSON VALLEY,9",
"8,GLOUCESTER GRAIN,BLACK MUDD POND,75",
"9,ARMY LEAGUE,MUMURA,190",
"10,MURRAY,FARMINGDALE,123"]
So, basically, the list consists of thousands of elements (just showed here a sample of 10), each is a string of comma separated elements. I'd like to write this into a new worksheet in a workbook.
Note: the workbook already exists and contains other sheets, I'm just adding a new sheet with this data.
My code:
import openpyxl
wb = openpyxl.load_workbook('data.xlsx')
sheet = wb.create_sheet(title='distance')
for i in range(len(d_list)):
    sheet.append(list(d_list[i]))
I'm expecting (in this example) 11 rows of data, each with 4 columns. However, I'm getting 11 rows alright, but with each character of each string written to its own cell! I think I am almost there ... what am I missing? (Note: I've read through all the available posts related to this topic, but couldn't find any that answers this specific type of question, hence I'm asking.)
Many thanks!
You can use pandas to solve this:
1.) Convert your list into a dataframe:
In [231]: l
Out[231]:
['No., Start Name, Destination, Distance (miles)',
'1,ALBANY,NY CRAFT,28',
'2,GRACO,PIONEER,39',
'3,FONDA,ROME,41',
'4,NICCE,MARRINERS,132',
'5,TOUCAN,SUBVERSIVE,100',
'6,POLL,CONVERGENCE,28',
'7,STONE HOUSE,HUDSON VALLEY,9',
'8,GLOUCESTER GRAIN,BLACK MUDD POND,75',
'9,ARMY LEAGUE,MUMURA,190',
'10,MURRAY,FARMINGDALE,123']
In [228]: df = pd.DataFrame([i.split(",") for i in l])
In [229]: df
Out[229]:
      0                 1                2                 3
0   No.        Start Name      Destination  Distance (miles)
1     1            ALBANY         NY CRAFT                28
2     2             GRACO          PIONEER                39
3     3             FONDA             ROME                41
4     4             NICCE        MARRINERS               132
5     5            TOUCAN       SUBVERSIVE               100
6     6              POLL      CONVERGENCE                28
7     7       STONE HOUSE    HUDSON VALLEY                 9
8     8  GLOUCESTER GRAIN  BLACK MUDD POND                75
9     9       ARMY LEAGUE           MUMURA               190
10   10            MURRAY      FARMINGDALE               123
2.) Write the above dataframe to Excel, into a new sheet, in 4 columns:

import pandas as pd

path = "data.xlsx"

# mode='a' appends the new sheet to the existing workbook instead of overwriting it
with pd.ExcelWriter(path, engine='openpyxl', mode='a') as writer:
    df.to_excel(writer, sheet_name='distance')
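As an aside, the immediate bug in the question's code is that list() turns each string into a list of single characters; if you would rather stay with plain openpyxl, splitting each string on commas is enough. A minimal sketch, reusing d_list and data.xlsx from the question:

import openpyxl

wb = openpyxl.load_workbook('data.xlsx')
sheet = wb.create_sheet(title='distance')

for row in d_list:
    # split(',') yields the four fields; list(row) would yield single characters
    sheet.append(row.split(','))

wb.save('data.xlsx')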

Extracting selective text using beautiful soup and write the result in CSV

I am trying to extract selective text from the website https://data.gov.au/dataset?q=&groups=business&sort=extras_harvest_portal%20asc%2C%20score%20desc%2C%20metadata_modified%20desc&_organization_limit=0&organization=reservebankofaustralia&_groups_limit=0 and have written the following code using Beautiful Soup:

import urllib.request
import re
from bs4 import BeautifulSoup

wiki = "https://data.gov.au/dataset?q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&organization=reservebankofaustralia&_groups_limit=0"
page = urllib.request.urlopen(wiki)

soup = BeautifulSoup(page)
data2 = soup.find_all('h3', class_="dataset-heading")
data3 = []
getdata = []
for link in data2:
    data3 = soup.find_all("a", href=re.compile('/dataset/', re.IGNORECASE))
    for data in data3:
        getdata = data.text
        print(getdata)
len(getdata)
My HTML is like:

<a href="/dataset/banks-assets" class="label" data-format="xls">XLS</a>
When I run the above code I get the text that I want, but the word 'XLS' keeps appearing between the titles. I want to remove 'XLS' and write the remaining text into one CSV column. My output is:
Banks – Assets
XLS
Consolidated Exposures – Immediate and Ultimate
Risk Basis
XLS
Foreign Exchange Transactions and Holdings of
Official Reserve Assets
XLS
Finance Companies and General Financiers
– Selected Assets and Liabilities
XLS
Liabilities and Assets –
Monthly XLS Consolidated Exposures – Immediate Risk Basis –
International Claims by Country
XLS
and so on.......
I checked whether the above output is a list. It is given as a list, but it has only one element, while as shown above my output is many pieces of text.
Please help me out with it.
If the purpose is only to remove the XLS rows from the result column, then it can be done, for example, this way:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

wiki = "https://data.gov.au/dataset?q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&organization=reservebankofaustralia&_groups_limit=0"
page = urlopen(wiki)

soup = BeautifulSoup(page)
data2 = soup.find_all('h3', class_="dataset-heading")
data3 = []
getdata = []
for link in data2:
    data3 = soup.find_all("a", href=re.compile('/dataset/', re.IGNORECASE))
    for data in data3:
        if data.text.upper() != 'XLS':
            getdata.append(data.text)
print(getdata)
You will get a list with the text you need. Then it can easily be transformed, for example, into a DataFrame, where the data will appear as a column.
import pandas as pd
df = pd.DataFrame(columns=['col1'], data=getdata)
output:
col1
0 Banks – Assets
1 Consolidated Exposures – Immediate and Ultimat...
2 Foreign Exchange Transactions and Holdings of ...
3 Finance Companies and General Financiers – Sel...
4 Liabilities and Assets – Monthly
5 Consolidated Exposures – Immediate Risk Basis ...
6 Consolidated Exposures – Ultimate Risk Basis
7 Banks – Consolidated Group off-balance Sheet B...
8 Liabilities of Australian-located Operations
9 Building Societies – Selected Assets and Liabi...
10 Consolidated Exposures – Immediate Risk Basis ...
11 Banks – Consolidated Group Impaired Assets
12 Assets and Liabilities of Australian-Located O...
13 Managed Funds
14 Daily Net Foreign Exchange Transactions
15 Consolidated Exposures-Immediate Risk Basis
16 Public Unit Trust
17 Securitisation Vehicles
18 Assets of Australian-located Operations
19 Banks – Consolidated Group Capital
Putting it to csv:

# use a raw string so the backslashes in the Windows path are not treated as escapes
df.to_csv(r'C:\Users\Username\output.csv')
