Python - Creating a for loop to build a single csv file with multiple dataframes - python-3.x

I am new to Python and trying various things to learn the fundamentals. One of the things I'm currently stuck on is for loops. I have the following code and am positive it can be built more efficiently using a loop, but I'm not sure exactly how.
import pandas as pd
import numpy as np

url1 = 'https://www.cbssports.com/nfl/stats/player/receiving/nfl/regular/qualifiers/?page=1'
url2 = 'https://www.cbssports.com/nfl/stats/player/receiving/nfl/regular/qualifiers/?page=2'
url3 = 'https://www.cbssports.com/nfl/stats/player/receiving/nfl/regular/qualifiers/?page=3'

df1 = pd.read_html(url1)
df1[0].to_csv('NFL_Receiving_Page1.csv', index=False)  # index=False drops the index that would otherwise appear as the first column in the csv
df2 = pd.read_html(url2)
df2[0].to_csv('NFL_Receiving_Page2.csv', index=False)
df3 = pd.read_html(url3)
df3[0].to_csv('NFL_Receiving_Page3.csv', index=False)

df_receiving_agg = pd.concat([df1[0], df2[0], df3[0]])
df_receiving_agg.to_csv('NFL_Receiving_Combined.csv', index=False)
I'm ultimately trying to combine the data from the above URLs into a single table in a CSV file.

You can try this:
urls = [url1, url2, url3]
df_receiving_agg = pd.DataFrame()
for url in urls:
    df = pd.read_html(url)[0]  # read_html returns a list of tables; keep the first one
    df_receiving_agg = pd.concat([df_receiving_agg, df])
df_receiving_agg.to_csv('filepath.csv', index=False)

You can do this:
base_url = 'https://www.cbssports.com/nfl/stats/player/receiving/nfl/regular/qualifiers/?page='
dfs = []
for page in range(1, 4):
    url = f'{base_url}{page}'
    df = pd.read_html(url)[0]  # take the first table on the page
    df.to_csv(f'NFL_Receiving_Page{page}.csv', index=False)
    dfs.append(df)
df_receiving_agg = pd.concat(dfs)
df_receiving_agg.to_csv('NFL_Receiving_Combined.csv', index=False)
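If the per-page files are not needed, a more compact sketch of the same idea (assuming the same three pages) collects the tables with a list comprehension:
base_url = 'https://www.cbssports.com/nfl/stats/player/receiving/nfl/regular/qualifiers/?page='
# read_html returns a list of tables per page, so keep only the first table from each
pages = [pd.read_html(f'{base_url}{page}')[0] for page in range(1, 4)]
pd.concat(pages, ignore_index=True).to_csv('NFL_Receiving_Combined.csv', index=False)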

Related

Append values to Dataframe in loop and if conditions

Need help, please.
I have a script that reads rows from Excel sheets and appends them to a DataFrame if certain columns exist.
If the columns don't exist in a sheet, I need to record that sheet in a separate DataFrame, appending the filename and sheet name, and then write all of those file and sheet names to an Excel file. I also want the values to be unique.
I tried adding to dfErrorList, but it only showed the last sheet name and filename, repeated many times in the output Excel file.
from xlsxwriter import Workbook
import pandas as pd
import openpyxl
import glob
import os

path = 'filestoimport/*.xlsx'
list_of_dfs = []
list_of_dferror = []
dfErrorList = pd.DataFrame()  # create empty df

for filepath in glob.glob(path):
    xl = pd.ExcelFile(filepath)
    # Define an empty list to store individual DataFrames
    for sheet_name in xl.sheet_names:
        df = pd.read_excel(filepath, sheet_name=sheet_name)
        df['sheetname'] = sheet_name
        file_name = os.path.basename(filepath)
        df['sourcefilename'] = file_name
        if "Project ID" in df.columns and "Status" in df.columns:
            print('')
        else:
            dfErrorList['sheetname'] = df['sheetname']  # adds `sheet_name` into the column
            dfErrorList['sourcefilename'] = df['sourcefilename']
            continue
        list_of_dferror.append(dfErrorList)
        df['Status'].fillna('', inplace=True)
        df['Added by'].fillna('', inplace=True)
        list_of_dfs.append(df)

# Combine all DataFrames into one
data = pd.concat(list_of_dfs, ignore_index=True)
dataErrors = pd.concat(list_of_dferror, ignore_index=True)
dataErrors.to_excel(r'error.xlsx', index=False)
# data.to_excel("total_countries.xlsx", index=None)
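One way the error bookkeeping could be restructured (a sketch only, assuming the same folder layout and required columns as above) is to collect a plain filename/sheet-name record per failing sheet in a list, build the error DataFrame once at the end, and drop duplicates before writing:
import glob
import os
import pandas as pd

path = 'filestoimport/*.xlsx'
list_of_dfs = []
error_records = []  # one record per sheet that is missing the required columns

for filepath in glob.glob(path):
    file_name = os.path.basename(filepath)
    xl = pd.ExcelFile(filepath)
    for sheet_name in xl.sheet_names:
        df = pd.read_excel(filepath, sheet_name=sheet_name)
        if "Project ID" not in df.columns or "Status" not in df.columns:
            error_records.append({'sourcefilename': file_name, 'sheetname': sheet_name})
            continue
        df['sheetname'] = sheet_name
        df['sourcefilename'] = file_name
        list_of_dfs.append(df)

data = pd.concat(list_of_dfs, ignore_index=True)
dataErrors = pd.DataFrame(error_records).drop_duplicates()
dataErrors.to_excel('error.xlsx', index=False)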

Pandas - Add items to dataframe

I am trying to add row items to a dataframe, and I am not able to update it.
What I tried until now is commented out, as it doesn't do what I need.
I simply want to download the JSON file and store it in a dataframe with the given columns. It seems I am not able to extract the child components from the JSON file and store them in a brand new dataframe.
Please find my code below:
import requests, json, urllib
import pandas as pd

url = "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
data = pd.read_json(url)

headers = []
df = pd.DataFrame()
for key, item in data['vulnerabilities'].items():
    for k in item.keys():
        headers.append(k)

col = list(set(headers))
new_df = pd.DataFrame(columns=col)
for item in data['vulnerabilities'].items():
    print(item[1])
    # new_df['product'] = item[1]['product']
    # new_df['vendorProject'] = item[1]['vendorProject']
    # new_df['dueDate'] = item[1]['dueDate']
    # new_df['shortDescription'] = item[1]['shortDescription']
    # new_df['dateAdded'] = item[1]['dateAdded']
    # new_df['vulnerabilityName'] = item[1]['vulnerabilityName']
    # new_df['cveID'] = item[1]['cveID']
    # new_df.append(item[1], ignore_index = True)

new_df
At the end, my df is still blank.
The nested JSON data can be directly converted to a flattened dataframe using pd.json_normalize(). The headers are extracted from the JSON itself.
new_df = pd.DataFrame(pd.json_normalize(data['vulnerabilities']))
UPDATE: Unnested the vulnerabilities column specifically.
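For context, a minimal end-to-end version of that approach might look like this (a sketch; the columns printed at the end are the ones named in the question's commented-out code):
import pandas as pd

url = "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
data = pd.read_json(url)

# Each entry in the 'vulnerabilities' column is a nested dict; flatten the
# Series of dicts into one row per record.
new_df = pd.json_normalize(data['vulnerabilities'].tolist())
print(new_df[['cveID', 'vendorProject', 'product', 'dueDate']].head())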
It worked with this:
import requests, json, urllib
import pandas as pd

url = "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
data = pd.read_json(url)

headers = []
df = pd.DataFrame()
for key, item in data['vulnerabilities'].items():
    for k in item.keys():
        headers.append(k)

col = list(set(headers))
new_df = pd.DataFrame(columns=col)
for item in data['vulnerabilities'].items():
    new_df.loc[len(new_df.index)] = item[1]  # <-- this line

new_df.head()
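The key change is the .loc assignment: handing df.loc[...] a dict fills the matching columns of that row. A tiny illustration with toy data (not from the feed):
import pandas as pd

toy = pd.DataFrame(columns=['cveID', 'product'])
toy.loc[len(toy.index)] = {'cveID': 'CVE-0000-00000', 'product': 'example'}  # toy values for illustration
print(toy)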

How to edit columns in .CSV files using pandas

import urllib.request
import pandas as pd

# URL of the CSV file on the website
url = 'https://......CSV'

# Download the file
urllib.request.urlretrieve(url, "F:\.....A.CSV")

csvFilePath = "F:\.....A.CSV"
df = pd.read_csv(csvFilePath, sep='\t')
rows = [0, 1, 2, 3]
df2 = df.drop(rows, axis=0, inplace=True)
df.to_csv(r'F:\....New_A.CSV')
I tried doing this in code, but it makes the columns merge into a single column.
What I want to do is remove the top rows on the left, as shown in the picture.
I found the problem: change sep='\t' to sep=','.
Replace:
df = pd.read_csv(csvFilePath, sep='\t')
by:
df = pd.read_csv(csvFilePath, sep='\t', skiprows=5)
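Putting the two suggestions together, a sketch of the fixed read/write step might look like this (the paths are the question's placeholders, and the right skiprows value depends on how many unwanted rows the real file has at the top):
import pandas as pd

csvFilePath = "F:\.....A.CSV"  # placeholder path from the question
df = pd.read_csv(csvFilePath, sep=',', skiprows=4)  # comma-separated, skip the unwanted top rows
df.to_csv(r'F:\....New_A.CSV', index=False)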

How do you append rows to xlsx file when using beautifulsoup and pandas to scrape?

So, I've been looking all over and I can't figure out why I can't get the results from my scrape to write to an xlsx file.
I'm running a list of URLs from a .csv file. I throw 10 URLs in there and BeautifulSoup scrapes them. If I just print the dataframe, it comes out right.
If I try to save the results as an xlsx (which is preferred) or csv, it only gives me the results from the last URL.
If I run this, it prints out perfectly:
import csv
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

with open('G-Sauce_Urls.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    for line in csv_reader:
        r = requests.get(line[0]).text
        soup = BeautifulSoup(r, 'lxml')
        business = soup.find('title')
        companys = business.get_text()
        phones = soup.find_all(text=re.compile("Call (.*)"))
        Website = soup.select('head > link:nth-child(4)')
        profile = (Website[0].attrs['href'])
        data = {'Required': [companys], 'Required_no_Email': [phones], 'Business_Fax': [profile]}
        df = pd.DataFrame(data, columns=['Required', 'First', 'Last', 'Required_no_Email', 'Business_Fax'])
But I can't seem to get it to append to an xlsx file. I'm only getting the last result, which I figure is because it is just "writing" and not appending.
I've tried:
writer = pd.ExcelWriter("ProspectUploadSheetRob.xlsx", engine='xlsxwriter', mode='a')
df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=4, header=3)
workbook = writer.book
worksheet = writer.sheets['Sheet1']
writer.save()
AND
with ExcelWriter('path_to_file.xlsx', mode='a') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=4, header=3)
    writer.save()
AND
df = pd.DataFrame(data, columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
writer = pd.ExcelWriter("ProspectUploadSheetRob.xlsx", engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=4, header=3)
writer.save()
AND
I started reading into openpyxl, but at this point I am so confused that I don't understand it.
Any and all help is appreciated
You are iterating over your csv data line by line, but you are recreating your dataframe on every iteration, so you lose the value of the previous one each time. You need to create the df first, outside of the loop, and add data to it inside your for loop.
df = pd.DataFrame(columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
>>> df
Empty DataFrame
Columns: [Required, First, Last, Required_no_Email, Business_Fax]
Index: []
Your assumption of writing rather than appending is correct, but you need to append to the dataframe and then write it to Excel, not append data to the Excel file (if I understood correctly).
data = {'Required':[companys], 'Required_no_Email':[phones], 'Business_Fax':[profile] }
df = df.append(data, ignore_index=True) # use this instead of this part of your original code below:
# df = pd.DataFrame(data, columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
# this will not be required as you have already defined the df outside the loop
The pd.ExcelWriter will only produce the output when you run:
writer.save()
I have a similar code that opens the file with the following parameters and it works:
writer = pd.ExcelWriter(r'path_to_file.xlsx', engine='xlsxwriter')
... all my modifications ...
writer.save()
Note that, according to the documentation, 'w' (write) is the default mode, including when modifying an object. Although it is not explained in great detail, append is referenced only when adding entirely new Excel objects (sheets, etc.), or when "extending" the document with another dataframe that has the exact same format as the existing document structure.
For it to be reproducible you could add a template xlsx, but I hope this helps. Please let me know.
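Putting the advice together with the question's selectors, a minimal sketch of one possible rewrite (not the poster's exact code) collects one dict per URL in a plain list and writes the workbook once at the end:
import csv
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

rows = []
with open('G-Sauce_Urls.csv', 'r') as csv_file:
    for line in csv.reader(csv_file):
        soup = BeautifulSoup(requests.get(line[0]).text, 'lxml')
        companys = soup.find('title').get_text()
        phones = soup.find_all(text=re.compile("Call (.*)"))
        profile = soup.select('head > link:nth-child(4)')[0].attrs['href']
        rows.append({'Required': companys,
                     'Required_no_Email': phones,
                     'Business_Fax': profile})

# Build the full DataFrame once, then write the workbook a single time.
df = pd.DataFrame(rows, columns=['Required', 'First', 'Last', 'Required_no_Email', 'Business_Fax'])
df.to_excel('ProspectUploadSheetRob.xlsx', sheet_name='Sheet1', index=False)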

Creating multiple dataframes with a loop

This undoubtedly reflects a lack of knowledge on my part, but I can't find anything online to help. I am very new to programming. I want to load six CSVs and do a few things to them before combining them later. The following code iterates over each file but only creates one dataframe, called df.
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs = ('df1', 'df2', 'df3', 'df4', 'df5', 'df6')

for df, file in zip(dfs, files):
    df = pd.read_csv(file)
    print(df.shape)
    print(df.dtypes)
    print(list(df))
Use a dictionary to store your DataFrames and access them by name:
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs_names = ('df1', 'df2', 'df3', 'df4', 'df5', 'df6')
dfs = {}

for dfn, file in zip(dfs_names, files):
    dfs[dfn] = pd.read_csv(file)
    print(dfs[dfn].shape)
    print(dfs[dfn].dtypes)

print(dfs['df3'])
Use a list to store your DataFrames and access them by index:
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs = []

for file in files:
    dfs.append(pd.read_csv(file))
    print(dfs[-1].shape)
    print(dfs[-1].dtypes)

print(dfs[2])
Or do not store the intermediate DataFrames at all; just process each one and add it to the resulting DataFrame:
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
df = pd.DataFrame()

for file in files:
    df_n = pd.read_csv(file)
    print(df_n.shape)
    print(df_n.dtypes)
    # do whatever you want to do with df_n here
    df = df.append(df_n)

print(df)
If you will process each of them differently, then you do not need a general structure to store them; just handle each one independently:
df = pd.DataFrame()

def do_general_stuff(d):  # here we do the things common to every DataFrame
    print(d.shape, d.dtypes)

df1 = pd.read_csv("data1.csv")
# do whatever you want with df1
do_general_stuff(df1)
df = df.append(df1)
del df1

df2 = pd.read_csv("data2.csv")
# do whatever you want with df2
do_general_stuff(df2)
df = df.append(df2)
del df2

df3 = pd.read_csv("data3.csv")
# do whatever you want with df3
do_general_stuff(df3)
df = df.append(df3)
del df3

# ... and so on
And one geeky way, but don't ask how it works:)
from collections import namedtuple

files = ['data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv']
df = namedtuple('Cdfs',
                ['df1', 'df2', 'df3', 'df4', 'df5', 'df6']
                )(*[pd.read_csv(file) for file in files])

for df_n in df._fields:
    print(getattr(df, df_n).shape, getattr(df, df_n).dtypes)

print(df.df3)
I think you think your code is doing something that it is not actually doing.
Specifically, this line: df = pd.read_csv(file)
You might think that on each iteration of the for loop this line is executed with df replaced by a string from dfs and file replaced by a filename from files. While the latter is true, the former is not.
Each iteration of the for loop reads a csv file and stores it in the variable df, effectively overwriting the dataframe that was read in during the previous iteration. In other words, df in your for loop is not being replaced with the variable names you defined in dfs.
The key takeaway here is that strings (e.g., 'df1', 'df2', etc.) cannot be substituted and used as variable names when executing code.
One way to achieve the result you want is to store each csv file read by pd.read_csv() in a dictionary, where the key is the name of the dataframe (e.g., 'df1', 'df2', etc.) and the value is the dataframe returned by pd.read_csv().
list_of_dfs = {}
for df, file in zip(dfs, files):
    list_of_dfs[df] = pd.read_csv(file)
    print(list_of_dfs[df].shape)
    print(list_of_dfs[df].dtypes)
    print(list(list_of_dfs[df]))
You can then reference each of your dataframes like this:
print(list_of_dfs['df1'])
print(list_of_dfs['df2'])
You can learn more about dictionaries here:
https://docs.python.org/3.6/tutorial/datastructures.html#dictionaries
A dictionary can store them too
import pandas as pd
from pprint import pprint

files = ('doms_stats201610051.csv', 'doms_stats201610052.csv')
dfsdic = {}
dfs = ('df1', 'df2')

for df, file in zip(dfs, files):
    dfsdic[df] = pd.read_csv(file)
    print(dfsdic[df].shape)
    print(dfsdic[df].dtypes)
    print(list(dfsdic[df]))

print(dfsdic['df1'].shape)
print(dfsdic['df2'].shape)
