I have setup a few argparse arguments for my script like so:
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--file", "-i", type=str, required=True)
parser.add_argument("--outfile", "-o", type=str, required=False)
parser.add_argument("--tab", "-t", type=str, required=False)
parser.add_argument("--tab_result", "-tr", type=str, required=False)
args = parser.parse_args()
#assign value too variables
infile = args.file
outfilepath = args.outfile
tabs = args.tab
tab_result = args.tab_result
I need to pass the variables of each the argparsers above into a function and assign values to a dataframe. I am trying to do this like so:
def func1():
print(infile)
doc = pd.DataFrame()
doc['file'] = infile
doc['output_table_name'] = outfilepath
doc['output_table_fields'] = json_normalized['index'] #from another df, works fine
doc['output_table_datatypes'] = json_normalized['dtypes.name'] #from another df, works fine
writer = pd.ExcelWriter(tabs)
doc.to_excel(writer,args.documentor_tab)
writer.save()
#print(infile)
#print(outfilename)
print(doc)
return doc
print('wrote document')
func1()
When I print this dataframe the infile and outpathfile argparse values dont get assigned to the dataframe columns, however all the rest of the argparse values do.
What I am doing wrong that not all values from argparse are getting assigned to the dataframe?
doc['file'] references the 'file' column so you can't set it to a string before there are any rows in the dataframe.
If there's only one row in json_normalized then you probably want something like this:
def func1(infile, outfilepath, tabs, tabs_result, json_normalized):
doc = pd.DataFrame(columns=['file', 'output_table_name', 'output_table_fields', 'output_table_datatypes'])
index = json_normalized['index'][0]
dtypes_name = json_normalized['dtypes.name'][0]
doc.loc[0] = [infile, outfilepath, index, dtypes_name]
...
return doc
or if you mean to write a whole column of index then swap the order:
def func1(infile, outfilepath, tabs, tabs_result, json_normalized):
doc = pd.DataFrame(columns=['file', 'output_table_name', 'output_table_fields', 'output_table_datatypes'])
doc['output_table_fields'] = json_normalized['index']
doc['output_table_datatypes'] = json_normalized['dtypes.name']
doc['output_table_name'] = outfilepath
doc['file'] = infile
...
return doc
argparse and passing/using variables in a function isn't the issue. The problem is with how you create the data frame.
Consider this stripped down example:
In [255]: doc = pd.DataFrame()
In [256]: doc['file'] = 'foobar'
In [257]: doc['outfile'] = 'anothername'
In [258]: doc
Out[258]:
Empty DataFrame
Columns: [file, outfile]
Index: []
In [259]: doc['col'] = [1,2,3,4]
In [260]: doc
Out[260]:
file outfile col
0 NaN NaN 1
1 NaN NaN 2
2 NaN NaN 3
3 NaN NaN 4
The initial assignments apply to an empty frame, one without any rows.
Assigning the constant values to the columns after creating the rows:
In [261]: doc['file'] = 'foobar'
In [262]: doc['outfile'] = 'anothername'
In [263]: doc
Out[263]:
file outfile col
0 foobar anothername 1
1 foobar anothername 2
2 foobar anothername 3
3 foobar anothername 4
Alternatively you could specify the row indices at the start:
In [265]: doc = pd.DataFrame(index=np.arange(5))
In [266]: doc
Out[266]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
In [267]: doc['file'] = 'foobar'
In [268]: doc['outfile'] = 'anothername'
In [269]: doc
Out[269]:
file outfile
0 foobar anothername
1 foobar anothername
2 foobar anothername
3 foobar anothername
4 foobar anothername
Related
Let's say we have many excel files with the multiple sheets as follows:
Sheet 1: 2021_q1_bj
a b c d
0 1 2 23 2
1 2 3 45 5
Sheet 2: 2021_q2_bj
a b c d
0 1 2 23 6
1 2 3 45 7
Sheet 3: 2019_q1_sh
a b c
0 1 2 23
1 2 3 45
Sheet 4: 2019_q2_sh
a b c
0 1 2 23
1 2 3 40
I wish to append all the sheets to one if the last part split by _ of sheet names are same across all excel files. ie., sheet 1 will append with sheet 2 since their both have common bj, if another excel file also have sheets with name bj, it will also be append to this one, same logic for sheet 3 and sheet 4.
How could I achieve that in Pandas or other Python packages?
The expected result for current excel file would be:
bj:
a b c d
0 1 2 23 2
1 2 3 45 5
2 1 2 23 6
3 2 3 45 7
sh:
a b c
0 1 2 23
1 2 3 45
2 1 2 23
3 2 3 40
Code for reference:
import os, glob
import pandas as pd
files = glob.glob("*.xlsx")
for each in files:
dfs = pd.read_excel(each, sheet_name=None, index_col=[0])
df_out = pd.concat(dfs.values(), keys=dfs.keys())
for n, g in df_out.groupby(df_out.index.to_series().str[0].str.rsplit('_', n=1).str[-1]):
g.droplevel(level=0).dropna(how='all', axis=1).reset_index(drop=True).to_excel(f'Out_{n}.xlsx', index=False)
Update:
You may download test excel files and final expected result from this link.
Try:
dfs = pd.read_excel('Downloads/WS_1.xlsx', sheet_name=None, index_col=[0])
df_out = pd.concat(dfs.values(), keys=dfs.keys())
for n, g in df_out.groupby(df_out.index.to_series().str[0].str.rsplit('_', n=1).str[-1]):
g.droplevel(level=0).dropna(how='all', axis=1).reset_index(drop=True).to_excel(f'Out_{n}.xlsx')
Update
import os, glob
import pandas as pd
files = glob.glob("Downloads/test_data/*.xlsx")
writer = pd.ExcelWriter('Downloads/test_data/Output_file.xlsx', engine='xlsxwriter')
excel_dict = {}
for each in files:
dfs = pd.read_excel(each, sheet_name=None, index_col=[0])
excel_dict.update(dfs)
df_out = pd.concat(dfs.values(), keys=dfs.keys())
for n, g in df_out.groupby(df_out.index.to_series().str[0].str.rsplit('_', n=1).str[-1]):
g.droplevel(level=0).dropna(how='all', axis=1).reset_index(drop=True).to_excel(writer, index=False, sheet_name=f'{n}')
writer.save()
writer.close()
I have achieved the whole process and get the final expected result with the code below.
Thanks to provide alternative and more concise solutions or give me some advices if it's possible:
import os, glob
import pandas as pd
from pandas import ExcelWriter
from datetime import datetime
def save_xls(dict_df, path):
writer = ExcelWriter(path)
for key in dict_df:
dict_df[key].to_excel(writer, key, index=False)
writer.save()
root_dir = './original/'
for root, subFolders, files in os.walk(root_dir):
# print(subFolders)
for file in files:
if '.xlsx' in file:
file_path = os.path.join(root_dir, file)
print(file)
f = pd.ExcelFile(file_path)
dict_dfs = {}
for sheet_name in f.sheet_names:
df_new = f.parse(sheet_name = sheet_name)
print(sheet_name)
## get the year and quarter from the sheet name
year, quarter, city = sheet_name.split("_")
# year, quarter, city = sheet_name.split("_")
df_new["year"] = year
df_new["quarter"] = quarter
df_new["city"] = city
dict_dfs[sheet_name] = df_new
save_xls(dict_df = dict_dfs, path = './add_columns_from_sheet_name/' + "new_" + file)
root_dir = './add_columns_from_sheet_name/'
list1 = []
df = pd.DataFrame()
for root, subFolders, files in os.walk(root_dir):
# print(subFolders)
for file in files:
if '.xlsx' in file:
# print(file)
city = file.split('_')[0]
# print(file)
file_path = os.path.join(root_dir, file)
# print(file_path)
dfs = pd.read_excel(file_path, sheet_name=None)
df_out = pd.concat(dfs.values(), keys=dfs.keys())
for n, g in df_out.groupby(df_out.index.to_series().str[0].str.rsplit('_', n=1).str[-1]):
print(n)
timestr = datetime.utcnow().strftime('%Y%m%d-%H%M%S%f')[:-3]
g.droplevel(level=0).dropna(how='all', axis=1).reset_index(drop=True).to_excel(f'./output/{n}_{timestr}.xlsx', index=False)
file_set = set()
file_dir = './output/'
file_list = os.listdir(file_dir)
for file in file_list:
data_type = file.split('_')[0]
file_set.add(data_type)
print(file_set)
file_dir = './output'
file_list = os.listdir(file_dir)
df1 = pd.DataFrame()
df2 = pd.DataFrame()
df3 = pd.DataFrame()
df4 = pd.DataFrame()
file_set = set()
for file in file_list:
if '.xlsx' in file:
# print(file)
df_temp = pd.read_excel(os.path.join(file_dir, file))
if 'bj' in file:
df1 = df1.append(df_temp)
elif 'sh' in file:
df2 = df2.append(df_temp)
elif 'gz' in file:
df3 = df3.append(df_temp)
elif 'sz' in file:
df4 = df4.append(df_temp)
# function
def dfs_tabs(df_list, sheet_list, file_name):
writer = pd.ExcelWriter(file_name,engine='xlsxwriter')
for dataframe, sheet in zip(df_list, sheet_list):
dataframe.to_excel(writer, sheet_name=sheet, startrow=0 , startcol=0, index=False)
writer.save()
# list of dataframes and sheet names
dfs = [df1, df2, df3, df4]
sheets = ['bj', 'sh', 'gz', 'sz']
# run function
dfs_tabs(dfs, sheets, './final/final_result.xlsx')
I have this weird Pandas problem, when I use the apply function using values from a data frame, it only gets applied to the first row:
import pandas as pd
# main data frame - to be edited
headerData = [['dataA', 'dataB']]
valuesData = [[10, 20], [10, 20]]
dfData = pd.DataFrame(valuesData, columns = headerData)
dfData.to_csv('MainData.csv', index=False)
readMainDataCSV = pd.read_csv('MainData.csv')
print(readMainDataCSV)
#variable data frame - pull values from this to edit main data frame
headerVariables = [['varA', 'varB']]
valuesVariables = [[2, 10]]
dfVariables = pd.DataFrame(valuesVariables, columns = headerVariables)
dfVariables.to_csv('Variables.csv', index=False)
readVariablesCSV = pd.read_csv('Variables.csv')
readVarA = readVariablesCSV['varA']
readVarB = readVariablesCSV['varB']
def formula(x):
return (x / readVarA) * readVarB
dfFormulaApplied = readMainDataCSV.apply(lambda x: formula(x))
print('\n', dfFormulaApplied)
Output:
dataA dataB
0 50.0 100.0
1 NaN NaN
But when I just use regular variables (not being called from a data frame), it functions just fine:
import pandas as pd
# main data frame - to be edited
headerData = [['dataA', 'dataB']]
valuesData = [[10, 20], [20, 40]]
dfData = pd.DataFrame(valuesData, columns = headerData)
dfData.to_csv('MainData.csv', index=False)
readMainDataCSV = pd.read_csv('MainData.csv')
print(readMainDataCSV)
# variables
readVarA = 2
readVarB = 10
def formula(x):
return (x / readVarA) * readVarB
dfFormulaApplied = readMainDataCSV.apply(lambda x: formula(x))
print('\n', dfFormulaApplied)
Output:
dataA dataB
0 50.0 100.0
1 100.0 200.0
Help please I'm pulling my hair out.
If you take readVarA and readVarB from the dataframe by selecting the column it is a pandas Series with an index, which gives a problem in the calculation (dividing a series by another series with a different index doesn't work).
You can take the first value from the series to get the value like this:
def formula(x):
return (x / readVarA[0]) * readVarB[0]
I have a utils.py and a main.py. In the utils.py file I want all my data load, formula defs and so on. I want to create a class Data_load() and make that ensure load of data I can pull directly from main.py.
I have this:
utils.py:
def readMyFile(filename):
file = []
with open(filename) as csvDataFile:
csvReader = csv.reader(csvDataFile, delimiter=';', )
for row in csvReader:
file.append(row[0])
return file
file = readMyFile('C:\\...\\count_all_terminate.csv')
file_load = pd.DataFrame(file)
Got this:
main.py reads (one column only and with no header??!!):
0
0 User Name
1 146166
2 146166
3 146166
4 146166
... ...
3987 200589
3988 194018
3989 194449
3990 174565
3991 175440
I wanted main.py to read this:
0 col 2 col 3 col n
0 User Name
1 146166
2 146166
3 146166
4 146166
... ...
3987 200589
3988 194018
3989 194449
3990 174565
3991 175440
How do I
place the def in class, something like the following...
class Data_load():
def __init__(self, ....):
self
def readMyFile(filename):
file = []
with open(filename) as csvDataFile:
csvReader = csv.reader(csvDataFile, delimiter=';', )
for row in csvReader:
file.append(row[0])
return file
..and how do I make it print all the columns I know to exist in the 'count_all_terminate.csv' file? Any help is appreciated, Happy New Year from Hubsandspokes
If you want to use a class, I would employ pandas. You could do it standalone as follows:
import pandas as pd
df = pd.DataFrame.read_csv(filepath, **kargs) #see docs for **kargs definitions
df_stats = df.describe() #Provides summary statistics on each column see docs for
# description
If you desire a class,
class Data_load:
def __init__(self, filepath, **kargs):
self.df = pd.read_csv(filepath, **kargs)
def summary_stats(self):
return seld.df.describe()
Then to use:
filepath = r"path to csv file of interest"
myData = Data_load(filepath)
EDA = myData.summary_stats()
I have created a script that collects the information on a website and puts it on a script. I'm on my process to become acquainted with python scraping and I would like some help as I would like to player numbers to be on a different column
# import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
import xlsxwriter
import xlwt
from xlwt import Workbook
# Workbook is created
wb = Workbook()
# add_sheet is used to create sheet.
sheet1 = wb.add_sheet('Sheet 1')
#send request
#url = 'http://fcf.cat/acta/1920/futbol-11/infantil-primera-divisio/grup-11/1i/sant-ildefons-ue-b/1i/lhospitalet-centre-esports-c'
url = 'https://www.fcf.cat/acta/2422183'
page = requests.get(url,timeout=5, verify=False)
soup = BeautifulSoup(page.text,'html.parser')
#read acta
#acta_text = []
#acta_text_element = soup.find_all(class_='acta-table')
#for item in acta_text_element:
# acta_text.append(item.text)
i = 0
acta = []
for tr in soup.find_all('tr'):
values = [td.text.strip() for td in tr.find_all('td') ]
print(values)
acta.append(values)
i = 1 + i
sheet1.write(i,0,values)
wb.save('xlwt example.xls')
print(acta)
Thanks,
Two things to consider:
You can separate the first element in the list by using values[0] then use values[1:] for the remaining items
Use isnumeric to check if a string value is a number
Try this code:
for tr in soup.find_all('tr'):
values = [td.text.strip() for td in tr.find_all('td') ]
print(values)
acta.append(values)
i = 1 + i
if len(values) and values[0].isnumeric(): # if first element is number
sheet1.write(i,0,values[0]) # number in column 1
sheet1.write(i,1,values[1:]) # rest of list in column 2
else:
sheet1.write(i,0,values) # all values in column 1
Excel output (truncated)
To take the team on the left, for example, try this:
tables = soup.select('table')
players = []
columns = ["Player","Shirt"]
titulars = [item for item in tables[1].text.strip().split('\n') if len(item)>0]
#tables[1] is where the data for the first team is; the other team is in tables[8]
for num, name in zip(titulars[2::2],titulars[1::2]):
player = []
player.extend((num,name))
players.append(player)
pd.DataFrame(players,columns=columns)
Output:
Player Shirt
0 TORNER ENCINAS, GONZALO 1
1 MACHUCA LOVERA, OSMAR SILVESTRE 3
2 JARA MARTIN, BLAI 4
3 AGUILAR LUQUE, DANIEL 5
4 FONT MURILLO, JOAQUIN 6
5 MARTÍNEZ ELVIR, RICHARD ADRIAN 7
6 MARQUEZ RODRIGUEZ, GERARD 8
7 PATUEL BATLLE, GERARD 10
8 EL MAHI ZAROUALI, BILAL 11
9 JAUME MORERA, ADRIA 14
10 DEL VALLE ESCANCIANO, MARTI 15
Initial question
I want to calculate the Levenshtein distance between multiple strings, one in a series, the other in a list. I tried my hands on map, zip, etc., but I only got the desired result using a for loop and apply. Is there a way to improve style and especially speed?
Here is what I tried and it does what it is supposed to do, but lacks of speed given a large series.
import stringdist
strings = ['Hello', 'my', 'Friend', 'I', 'am']
s = pd.Series(data=strings, index=strings)
c = ['me', 'mine', 'Friend']
df = pd.DataFrame()
for w in c:
df[w] = s.apply(lambda x: stringdist.levenshtein(x, w))
## Result: ##
me mine Friend
Hello 4 5 6
my 1 3 6
Friend 5 4 0
I 2 4 6
am 2 4 6
Solution
Thanks to #Dames and #molybdenum42, I can provide the solution I used, directly beneath the question. For more insights, please check their great answers below.
import stringdist
from itertools import product
strings = ['Hello', 'my', 'Friend', 'I', 'am']
s = pd.Series(data=strings, index=strings)
c = ['me', 'mine', 'Friend']
word_combinations = np.array(list(product(s.values, c)))
vectorized_levenshtein = np.vectorize(stringdist.levenshtein)
result = vectorized_levenshtein(word_combinations[:, 0],
word_combinations[:, 1])
result = result.reshape((len(s), len(c)))
df = pd.DataFrame(result, columns=c, index=s)
This results in the desired data frame.
Setup:
import stringdist
import pandas as pd
import numpy as np
import itertools
s = pd.Series(data=['Hello', 'my', 'Friend'],
index=['Hello', 'my', 'Friend'])
c = ['me', 'mine', 'Friend']
Options
option: an easy one-liner
df = pd.DataFrame([s.apply(lambda x: stringdist.levenshtein(x, w)) for w in c])
option: np.fromfunction (thanks to #baccandr)
#np.vectorize
def lavdist(a, b):
return stringdist.levenshtein(c[a], s[b])
df = pd.DataFrame(np.fromfunction(lavdist, (len(c), len(s)), dtype = int),
columns=c, index=s)
option: see #molybdenum42
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(stringdist.levenshtein)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
df = pd.DataFrame([word_combinations[:,1], word_combinations[:,1], result])
df = df.set_index([0,1])[2].unstack()
(the best) option: modified option 3
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(distance)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
result = result.reshape((len(s), len(c)))
df = pd.DataFrame(result, columns=c, index=s)
Performance testing:
import timeit
from Levenshtein import distance
import pandas as pd
import numpy as np
import itertools
s = pd.Series(data=['Hello', 'my', 'Friend'],
index=['Hello', 'my', 'Friend'])
c = ['me', 'mine', 'Friend']
test_code0 = """
df = pd.DataFrame()
for w in c:
df[w] = s.apply(lambda x: distance(x, w))
"""
test_code1 = """
df = pd.DataFrame({w:s.apply(lambda x: distance(x, w)) for w in c})
"""
test_code2 = """
#np.vectorize
def lavdist(a, b):
return distance(c[a], s[b])
df = pd.DataFrame(np.fromfunction(lavdist, (len(c), len(s)), dtype = int),
columns=c, index=s)
"""
test_code3 = """
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(distance)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
df = pd.DataFrame([word_combinations[:,1], word_combinations[:,1], result])
df = df.set_index([0,1])[2] #.unstack() produces error
"""
test_code4 = """
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(distance)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
result = result.reshape((len(s), len(c)))
df = pd.DataFrame(result, columns=c, index=s)
"""
test_setup = "from __main__ import distance, s, c, pd, np, itertools"
print("test0", timeit.timeit(test_code0, number = 1000, setup = test_setup))
print("test1", timeit.timeit(test_code1, number = 1000, setup = test_setup))
print("test2", timeit.timeit(test_code2, number = 1000, setup = test_setup))
print("test3", timeit.timeit(test_code3, number = 1000, setup = test_setup))
print("test4", timeit.timeit(test_code4, number = 1000, setup = test_setup))
Results
# results
# test0 1.3671939949999796
# test1 0.5982696900009614
# test2 0.3246431229999871
# test3 2.0100400850005826
# test4 0.23796007100099814
Using itertools, you can at least get all the required combinations. Using a vectorized version of stringcount.levenshtein (made using numpy.vectorize()) you can then get your desired result without looping at all, though I haven't tested the performance of the vectorized levenshtein function.
The code could look something like this:
import stringdist
import numpy as np
import pandas as pd
import itertools
s = pd.Series(["Hello", "my","Friend"])
c = ['me', 'mine', 'Friend']
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(stringdist.levenshtein)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
At this point you have the results in a numpy array, each corresponding to one of all the possible combinations of your two intial arrays. If you want to get it into the shape you have in your example, there's some pandas trickery to be done:
df = pd.DataFrame([word_combinations[:,0], word_combinations[:,1], result]).T
### initially looks like: ###
# 0 1 2
# 0 Hello me 4
# 1 Hello mine 5
# 2 Hello Friend 6
# 3 my me 1
# 4 my mine 3
# 5 my Friend 6
# 6 Friend me 5
# 7 Friend mine 4
# 8 Friend Friend 0
df = df.set_index([0,1])[2].unstack()
### Now looks like: ###
# Friend Hello my
# Friend 0 6 6
# me 5 4 1
# mine 4 5 3
Again, I haven't tested the performance of this method, so I recommend checking that out - it should be faster than iteration though.
EDIT:
User #Dames has a better suggestion for making the result all pretty-like:
result = result.reshape(len(c), len(s))
df = pd.DataFrame(result, columns=c, index=s)