Cannot get Pandas to concat/append - python-3.x

I'm trying to parse a website's tables and I'm still pretty new to this. From each link, only the second table/dataframe should be appended to the spreadsheet. There are multiple links, so it needs a loop. Using what little I could find online, I'm stuck with the following, which I'm pretty sure is totally off:
import pandas as pd
from pandas import ExcelWriter

a = 1
alist = []
writer = ExcelWriter('name.xlsx')

def dffunc():
    dfs = pd.read_html('http://websitepath{}.htm'.format(a))
    df = dfs[1]
    alist.append(df, ignore_index=True)
    alist = pd.concat(df, axis=0)

while a < 9:
    dffunc()
    a += 1

alist.to_excel(writer, index=False)
writer.save()

df=dfs[1] takes the second table in the list. Is that what you want?

old:
df = dfs[1]
alist.append(df, ignore_index=True)
alist = pd.concat(df, axis=0)
- You're appending the 2nd table in the dfs collection to the global alist.
- You're then assigning the 2nd table in the dfs collection to alist, undoing all previous steps.
- Given the second bullet, operating on a global variable that is only written to file once after the loop defeats the purpose of the loop; alist will only ever hold the 2nd table from the last query when you write to file.
new:
import pandas as pd
from pandas import ExcelWriter

writer = ExcelWriter('name.xlsx')
writer_kwargs = {'index': False}
A = 9

def dffunc(a):
    dfs = pd.read_html('http://websitepath{}.htm'.format(a))
    return pd.concat(dfs, axis=0)

def dfhandler(df, writer, sheet_name, **kwargs):
    df.to_excel(writer, sheet_name=sheet_name, **kwargs)

for a in range(1, A):  # range, not xrange, since this is Python 3
    dfhandler(dffunc(a), writer, sheet_name=str(a), **writer_kwargs)

writer.save()
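If the goal is everything on one sheet rather than one sheet per page, a minimal sketch of the accumulate-then-concat pattern (reusing the question's placeholder URL and second-table choice):
import pandas as pd

frames = []
for a in range(1, 9):
    tables = pd.read_html('http://websitepath{}.htm'.format(a))
    frames.append(tables[1])  # keep the second table, as in the question

pd.concat(frames, ignore_index=True).to_excel('name.xlsx', index=False)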

Related

Python - Creating a for loop to build a single csv file with multiple dataframes

I am new to Python and trying various things to learn the fundamentals. One of the things I'm currently stuck on is for loops. I have the following code and am positive it can be written more efficiently using a loop, but I'm not sure exactly how.
import pandas as pd
import numpy as np
url1 = 'https://www.cbssports.com/nfl/stats/player/receiving/nfl/regular/qualifiers/?page=1'
url2 = 'https://www.cbssports.com/nfl/stats/player/receiving/nfl/regular/qualifiers/?page=2'
url3 = 'https://www.cbssports.com/nfl/stats/player/receiving/nfl/regular/qualifiers/?page=3'
df1 = pd.read_html(url1)
df1[0].to_csv ('NFL_Receiving_Page1.csv', index=False) #index false gets rid of index listing that appears as the very first column in the csv
df2 = pd.read_html(url2)
df2[0].to_csv ('NFL_Receiving_Page2.csv', index=False) #index false gets rid of index listing that appears as the very first column in the csv
df3 = pd.read_html(url3)
df3[0].to_csv ('NFL_Receiving_Page3.csv', index=False) #index false gets rid of index listing that appears as the very first column in the csv
df_receiving_agg = pd.concat([df1[0], df2[0], df3[0]])
df_receiving_agg.to_csv('NFL_Receiving_Combined.csv', index=False) #index false gets rid of index listing that appears as the very first column in the csv
I'm ultimately trying to combine the data in the above URL's into a single table in a csv file.
You can try this:
urls = [url1, url2, url3]
df_receiving_agg = pd.DataFrame()
for url in urls:
    # read_html returns a list of tables; take the first one on the page
    df = pd.read_html(url)[0]
    df_receiving_agg = pd.concat([df_receiving_agg, df])
df_receiving_agg.to_csv('filepath.csv', index=False)
You can do this:
base_url = 'https://www.cbssports.com/nfl/stats/player/receiving/nfl/regular/qualifiers/?page='
dfs = []
for page in range(1, 4):
    url = f'{base_url}{page}'
    df = pd.read_html(url)[0]  # take the first table on each page
    df.to_csv(f'NFL_Receiving_Page{page}.csv', index=False)
    dfs.append(df)
df_receiving_agg = pd.concat(dfs)
df_receiving_agg.to_csv('NFL_Receiving_Combined.csv', index=False)

How to merge big data of csv files column wise into a single csv file using Pandas?

I have many large CSV files, one per country, and I want to merge their columns into a single CSV file. Each file has 'Year' as an index, and all files cover the same number of years. An example of a Japan.csv file was attached as a screenshot.
If anyone can help me please let me know. Thank you!!
Try using:
import pandas as pd
import glob
l = []
path = 'path/to/directory/'
csvs = glob.glob(path + "/*.csv")
for i in csvs:
    df = pd.read_csv(i, index_col=None, header=0)
    l.append(df)
df = pd.concat(l, ignore_index=True)
This should work. It goes over each file name, reads it and combines everything into one df. You can export this df to csv or do whatever with it. gl.
import pandas as pd
def combine_csvs_into_one_df(names_of_files):
    one_big_df = pd.DataFrame()
    for file in names_of_files:
        try:
            content = pd.read_csv(file)
        except PermissionError:
            print(file, "could not be read")
            continue
        one_big_df = pd.concat([one_big_df, content])
        print(file, " added!")
    print("------")
    print("Finished")
    return one_big_df
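Since the question asks for a column-wise merge keyed on 'Year', here is a minimal sketch of that variant; the directory path and the country-suffix scheme are just illustrative assumptions, and each file is assumed to have a 'Year' column:
import glob
import os
import pandas as pd

frames = []
for path in glob.glob('path/to/directory/*.csv'):
    country = os.path.splitext(os.path.basename(path))[0]  # e.g. 'Japan'
    df = pd.read_csv(path).set_index('Year')
    # suffix the columns with the country name so they stay distinguishable
    frames.append(df.add_suffix('_' + country))

# axis=1 lines the files up on the shared 'Year' index, column by column
combined = pd.concat(frames, axis=1)
combined.to_csv('combined_by_year.csv')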

Create multiple Dataframe from XML based on Specific Value

I am trying to parse an XML file and save the results in a Pandas dataframe. I have succeeded in saving the details in one single dataframe. However, now I am trying to save the results in multiple dataframes, split on one specific class value.
import pandas as pd
import xml.etree.ElementTree as ET
import os
from collections import defaultdict, OrderedDict
tree = ET.parse('PowerChange_76.xml')
root = tree.getroot()
df_list = []
for i, child in enumerate(root):
    for subchildren in child.findall('{raml20.xsd}header'):
        for subchildren in child.findall('{raml20.xsd}managedObject'):
            match_found = 0
            xml_class_name = subchildren.get('class')
            xml_dist_name = subchildren.get('distName')
            print(xml_class_name)
            df_dict = OrderedDict()
            for subchild in subchildren:
                header = subchild.attrib.get('name')
                df_dict['Class'] = xml_class_name
                df_dict['CellDN'] = xml_dist_name
                df_dict[header] = subchild.text
            df_list.append(df_dict)
df_cm = pd.DataFrame(df_list)
The expected result is the creation of multiple dataframes, one per distinct 'class' value. [The current output and the source XML file were attached to the original post.]
This was answered with the method below:
def ExtractMOParam(xmlfile2):
    tree2 = etree.parse(xmlfile2)
    root2 = tree2.getroot()
    df_list2 = []
    for i, child in enumerate(root2):
        for subchildren in (child.findall('{raml21.xsd}header') or child.findall('{raml20.xsd}header')):
            for subchildren in (child.findall('{raml21.xsd}managedObject') or child.findall('{raml20.xsd}managedObject')):
                xml_class_name2 = subchildren.get('class')
                xml_dist_name2 = subchildren.get('distName')
                if ((xml_class_name2 in GetMOClass) and (xml_dist_name2 in GetCellDN)):
                    #xml_dist_name2 = subchildren.get('distName')
                    #df_list1.append(xml_class_name1)
                    for subchild in subchildren:
                        df_dict2 = OrderedDict()
                        header2 = subchild.attrib.get('name')
                        df_dict2['MOClass'] = xml_class_name2
                        df_dict2['CellDN'] = xml_dist_name2
                        df_dict2['Parameter'] = header2
                        df_dict2['CurrentValue'] = subchild.text
                        df_list2.append(df_dict2)
    return df_list2

ExtractDump = pd.DataFrame(ExtractMOParam(inputdfile))
d = dict(tuple(ExtractDump.groupby('MOClass')))
for key in d:
    d[key] = d[key].reset_index().groupby(['CellDN', 'MOClass', 'Parameter'])['CurrentValue'].aggregate('first').unstack()
    d[key].reset_index(level=0, inplace=True)
    d[key].reset_index(level=0, inplace=True)
writer = pd.ExcelWriter('ExtractedDump.xlsx', engine='xlsxwriter')
for tab_name, dframe in d.items():
    dframe.to_excel(writer, sheet_name=tab_name, index=False)
writer.save()
Hope this will help others as well.
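For the simpler case in the question itself, a minimal sketch of splitting the single dataframe into one dataframe per class, assuming the df_cm built above with its 'Class' column:
# split df_cm into a dict of dataframes keyed by the 'Class' value
frames_by_class = {cls: grp.reset_index(drop=True) for cls, grp in df_cm.groupby('Class')}

# each value is now its own dataframe, keyed by the class name
for cls, frame in frames_by_class.items():
    print(cls, frame.shape)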

Fuzzy logic for excel data -Pandas

I have two dataframes: DF (~100k rows), which is the raw data file, and DF1 (15k rows), the mapping file. I'm trying to match the DF.Address and DF.Name columns to DF1.Address and DF1.Name. Once a match is found, DF1.ID should be populated into DF.ID (if DF1.ID is not None); otherwise DF1.top_ID should be populated into DF.ID.
I'm able to match the address and name with the help of fuzzy matching, but I'm stuck on how to use the matching result to populate the ID.
[DF1 (mapping file) and DF (raw data file) were shown as screenshots in the original post.]
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from operator import itemgetter
df=pd.read_excel("Test1", index=False)
df1=pd.read_excel("Test2", index=False)
df=df[df['ID'].isnull()]
zip_code=df['Zip'].tolist()
Facility_city=df['City'].tolist()
Address=df['Address'].tolist()
Name_list=df['Name'].tolist()
def fuzzy_match(x, choice, scorer, cutoff):
    return process.extractOne(x,
                              choices=choice,
                              scorer=scorer,
                              score_cutoff=cutoff)

for pin, city, Add, Name in zip(zip_code, Facility_city, Address, Name_list):
    #====Address Matching=====#
    choice = df1.loc[(df1['Zip'] == pin) & (df1['City'] == city), 'Address1']
    result = fuzzy_match(Add, choice, fuzz.ratio, 70)
    #====Name Matching========#
    if result is not None:
        if result[3] > 70:
            choice_1 = df1.loc[(df1['Zip'] == pin) & (df1['City'] == city), 'Name']
            result_1 = fuzzy_match(Name, choice_1, fuzz.ratio, 95)
            print(ID)
            if result_1 is not None:
                if result_1[3] > 95:
                    # Here populating the matching ID
                    print("ok")
                else:
                    continue
            else:
                continue
        else:
            continue
    else:
        continue
IIUC: Here is a solution:
from fuzzywuzzy import fuzz
import pandas as pd
#Read raw data from clipboard
raw = pd.read_clipboard()
#Read map data from clipboard
mp = pd.read_clipboard()
#Merge raw data and mp data as following
dfr = mp.merge(raw, on=['Hospital Name', 'City', 'Pincode'], how='outer')
#dfr will have many duplicate rows - eliminate duplicates
#To eliminate duplicates using token_sort_ratio, compare Address_x and Address_y
dfr['SCORE'] = dfr.apply(lambda x: fuzz.token_sort_ratio(x['Address_x'], x['Address_y']), axis=1)
#Filter only max ratio rows grouped by Address_x
dfr1 = dfr.iloc[dfr.groupby('Address_x').apply(lambda x: x['SCORE'].idxmax())]
#dfr1 shall have the desired result
This link has sample data to test the solution provided.
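To then connect a confirmed match back to the ID, a minimal sketch of the fill rule described in the question, assuming the merged dfr1 ends up carrying the mapping file's ID and top_ID columns:
# use the mapping file's ID where present, otherwise fall back to top_ID
# (column names assumed from the question's description)
dfr1['ID'] = dfr1['ID'].fillna(dfr1['top_ID'])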

Creating multiple dataframes with a loop

This undoubtedly reflects lack of knowledge on my part, but I can't find anything online to help. I am very new to programming. I want to load 6 csvs and do a few things to them before combining them later. The following code iterates over each file but only creates one dataframe, called df.
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs = ('df1', 'df2', 'df3', 'df4', 'df5', 'df6')
for df, file in zip(dfs, files):
    df = pd.read_csv(file)
    print(df.shape)
    print(df.dtypes)
    print(list(df))
Use a dictionary to store your DataFrames and access them by name:
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs_names = ('df1', 'df2', 'df3', 'df4', 'df5', 'df6')
dfs = {}
for dfn, file in zip(dfs_names, files):
    dfs[dfn] = pd.read_csv(file)
    print(dfs[dfn].shape)
    print(dfs[dfn].dtypes)
print(dfs['df3'])
Use a list to store your DataFrames and access them by index:
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs = []
for file in files:
    dfs.append(pd.read_csv(file))
    print(dfs[len(dfs)-1].shape)
    print(dfs[len(dfs)-1].dtypes)
print(dfs[2])
Or do not store the intermediate DataFrames at all; just process each one and append it to the resulting DataFrame:
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
df = pd.DataFrame()
for file in files:
    df_n = pd.read_csv(file)
    print(df_n.shape)
    print(df_n.dtypes)
    # do whatever you want to do with df_n here
    df = df.append(df_n)
print(df)
If you will process them differently, then you do not need a general structure to store them; just handle each one independently:
df = pd.DataFrame()

def do_general_stuff(d):  # here we do the common things with a DataFrame
    print(d.shape, d.dtypes)

df1 = pd.read_csv("data1.csv")
# do whatever you want with df1
do_general_stuff(df1)
df = df.append(df1)
del df1

df2 = pd.read_csv("data2.csv")
# do whatever you want with df2
do_general_stuff(df2)
df = df.append(df2)
del df2

df3 = pd.read_csv("data3.csv")
# do whatever you want with df3
do_general_stuff(df3)
df = df.append(df3)
del df3
# ... and so on
And one geeky way, but don't ask how it works:)
from collections import namedtuple
files = ['data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv']
df = namedtuple('Cdfs',
                ['df1', 'df2', 'df3', 'df4', 'df5', 'df6']
                )(*[pd.read_csv(file) for file in files])
for df_n in df._fields:
    print(getattr(df, df_n).shape, getattr(df, df_n).dtypes)
print(df.df3)
I think you think your code is doing something that it is not actually doing.
Specifically, this line: df = pd.read_csv(file)
You might think that in each iteration through the for loop this line is being executed and modified with df being replaced with a string in dfs and file being replaced with a filename in files. While the latter is true, the former is not.
Each iteration through the for loop reads a csv file and stores it in the variable df, effectively overwriting the dataframe that was read in during the previous iteration. In other words, df in your for loop is not being replaced with the variable names you defined in dfs.
The key takeaway here is that strings (e.g., 'df1', 'df2', etc.) cannot be substituted and used as variable names when executing code.
One way to achieve the result you want is to store each csv file read by pd.read_csv() in a dictionary, where the key is the name of the dataframe (e.g., 'df1', 'df2', etc.) and the value is the dataframe returned by pd.read_csv().
list_of_dfs = {}
for df, file in zip(dfs, files):
    list_of_dfs[df] = pd.read_csv(file)
    print(list_of_dfs[df].shape)
    print(list_of_dfs[df].dtypes)
    print(list(list_of_dfs[df]))
You can then reference each of your dataframes like this:
print(list_of_dfs['df1'])
print(list_of_dfs['df2'])
You can learn more about dictionaries here:
https://docs.python.org/3.6/tutorial/datastructures.html#dictionaries
A dictionary can store them too
import pandas as pd
from pprint import pprint
files = ('doms_stats201610051.csv', 'doms_stats201610052.csv')
dfsdic = {}
dfs = ('df1', 'df2')
for df, file in zip(dfs, files):
    dfsdic[df] = pd.read_csv(file)
    print(dfsdic[df].shape)
    print(dfsdic[df].dtypes)
    print(list(dfsdic[df]))
print(dfsdic['df1'].shape)
print(dfsdic['df2'].shape)
