Pandas: reading Excel files and concatenating them into a single DataFrame

I have a question about the run-time efficiency of reading and concatenating files into a single DataFrame.
I have about 15 files, and for each one I want to read it, filter it, and concatenate it to the others.
Right now the average file size is 8,000 KB, and the code takes about 8 minutes to run.
So basically I want to ask whether there is a way to get a faster run time.
Thanks in advance!
(The code is on another PC, so I copied it over manually.)
import pandas as pd
import os

Path = ~mypath~  # placeholder for the folder containing the workbooks
Fields = pd.read_excel("Fields.xlsx")
Variables = Fields["Variable"].values.tolist()
Segments = Fields["Segment"].values.tolist()
Components = Fields["Component"].values.tolist()

li = []
for file in os.listdir(Path):
    if file.endswith(".xlsx"):
        df = pd.read_excel(os.path.join(Path, file))
        li.append(df[(df["Variable"].isin(Variables)) &
                     (df["Segment"].isin(Segments)) &
                     (df["Component"].isin(Components))])
Frame = pd.concat(li, axis=0, ignore_index=True)
EDIT:
Since I run the code on a VDI, performance is low.
I tried running it on a local PC, and the execution time was about a quarter of what it takes on the VDI.
I have tried to search for a faster method; one idea along those lines is sketched below.
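If the same workbooks are read on every run and their contents rarely change, one possible speedup is to cache each file in a faster binary format after the first read, so later runs skip the slow Excel parsing. A rough sketch (the folder path is a placeholder and the cache files are hypothetical):
import os
import pandas as pd

Path = r"C:\my\folder"  # placeholder for the folder with the workbooks
frames = []
for name in os.listdir(Path):
    if not name.endswith(".xlsx"):
        continue
    xlsx_path = os.path.join(Path, name)
    cache_path = xlsx_path.replace(".xlsx", ".pkl")
    if os.path.exists(cache_path):
        df = pd.read_pickle(cache_path)   # fast path on later runs
    else:
        df = pd.read_excel(xlsx_path)     # slow Excel parse, done only once
        df.to_pickle(cache_path)
    frames.append(df)
Frame = pd.concat(frames, ignore_index=True)
The filtering step from the original code can then be applied to each df before it is appended.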


Passing Key,Value into a Function

I want to check a YouTube video's views and keep track of them over time. I wrote a script that works great:
import requests
import re
import pandas as pd
from datetime import datetime
import time

def check_views(link):
    todays_date = datetime.now().strftime('%d-%m')
    now_time = datetime.now().strftime('%H:%M')
    # get the site
    r = requests.get(link)
    text = r.text
    tag = re.compile(r'\d+ views')
    views = re.findall(tag, text)[0]
    # get the digit number of views; it's returned in a list so I need to get that item out
    cleaned_views = re.findall(r'\d+', views)[0]
    print(cleaned_views)
    # append to the df
    df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
    #df = df.append([todays_date, now_time, int(cleaned_views)], axis=0)
    df.to_csv('views.csv')
    return df

df = pd.DataFrame(columns=['Date', 'Time', 'Views'])

while True:
    df = check_views('https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s')
    time.sleep(1800)
But now I want to use this function for multiple links. I want a different CSV file for each link. So I made a dictionary:
link_dict = {'link1': 'https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s',
             'link2': 'https://www.youtube.com/watch?v=ZPrAKuOBWzw'}
# this makes it easy for each csv file to be named for the corresponding link
The loop then becomes:
for key, value in link_dict.items():
    df = check_views(value)
That seems to work for passing the value from the dict (the link) into the function. Inside the function, I just made sure to load the correct CSV file at the beginning:
# existing csv files
df = pd.read_csv(k + '.csv')
But then I'm getting an error when I go to append a new row to the df (“cannot set a row with mismatched columns”). I don't get that, since it works just fine in the original code above. This is the part giving me the error:
df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
What am I missing here? This dictionary method seems like a super messy way of doing it (I only have 2 links I want to check, but rather than just duplicate the function I wanted to experiment a bit). Any tips? Thanks!
Figured it out! The problem was that I was saving the df as a CSV and then trying to read that CSV back later. When I saved the CSV, I didn't use index=False with df.to_csv(), so there was an extra column! When I was just testing with the dictionary, I was reusing the in-memory df, and even though I was saving it to a CSV, the script kept using that df to do the actual adding of rows.
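For anyone hitting the same error, a minimal illustration of the fix (the appended values here are made up):
import pandas as pd

df = pd.DataFrame(columns=['Date', 'Time', 'Views'])
df.loc[len(df)] = ['01-01', '12:00', 100]

# Without index=False the index is written as an extra column, so the file
# read back has 4 columns and a 3-item row assignment then fails.
df.to_csv('views.csv', index=False)

df = pd.read_csv('views.csv')
df.loc[len(df)] = ['01-01', '12:30', 120]  # shapes match again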

Python 3.9: For loop is not producing output files even though no errors are displayed

Hi everyone, I am fairly new to using Python for data analysis, so apologies for silly questions.
IDE: PyCharm
What I have: a massive .xyz file (with 4 columns) which is a combination of several datasets. Each dataset can be identified by the third column of the file, which runs from 10,000 to -10,000 with 0 in between and 100 as spacing, and then repeats (so every 201 rows is one dataset).
What I want to do: split the massive file into its individual datasets (201 rows each) and save each one under a different name.
What I have done so far:
# Import packages
import os
import pandas as pd
import numpy as np  # for next steps
import math  # for next steps

# Check and change directory
path = 'C:/Clayton/lines/profiles_aufmod'
os.chdir(path)
print(os.getcwd())  # correct path is printed

# split the xyz file into different files for each profile
main_xyz = 'bathy_SPO_1984_50x50_profile.xyz'
number_lines = sum(1 for row in (open(main_xyz)))
print(number_lines)  # 10854 is the output

rowsize = 201
for i in range(number_lines, rowsize):
    profile_raw_df = pd.read_csv(main_xyz, delimiter=',', header=None,
                                 nrows=rowsize, skiprows=i)
    out_xyz = 'Profile' + str(i) + '.xyz'
    profile_raw_df.to_csv(out_xyz, index=False, header=False, mode='a')
Problems I am facing:
The for loop was at first producing output files (see "Proof of output"), but now it does not produce any output, and it is not rewriting the previous files either. The other mystery is that I am not getting an error either (see "Code executed without error").
What I tried to fix the issue:
I updated all the packages and restarted PyCharm.
I ran each line of code one by one, and everything works until the for loop.
While counting the number of rows with
number_lines = sum(1 for row in (open(main_xyz)))
you have exhausted the iterator that loops over the lines of the file, and you never close the file. That should not prevent pandas from reading the same file again, though.
A better idiom would be:
with open(main_xyz) as fh:
    number_lines = sum(1 for row in fh)
Your for loop as it stands does not do what you probably want. I guess you want:
for i in range(0, number_lines, rowsize):
so that rowsize is the step size rather than the end value of the range.
If you want to number the output files by dataset, keep a count of the dataset, like this:
data_set = 0
for i in range(0, number_lines, rowsize):
    data_set += 1
    ...
    out_xyz = f"Profile{data_set}.xyz"
    ...
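Putting both corrections together, the loop might look like the sketch below (it still re-reads the source file once per chunk, which is fine for roughly 10,000 lines):
import pandas as pd

main_xyz = 'bathy_SPO_1984_50x50_profile.xyz'
rowsize = 201

# count the lines once, closing the file afterwards
with open(main_xyz) as fh:
    number_lines = sum(1 for row in fh)

# step through the file in 201-row chunks and write each chunk to its own file
for data_set, i in enumerate(range(0, number_lines, rowsize), start=1):
    profile_raw_df = pd.read_csv(main_xyz, delimiter=',', header=None,
                                 nrows=rowsize, skiprows=i)
    out_xyz = f'Profile{data_set}.xyz'
    profile_raw_df.to_csv(out_xyz, index=False, header=False)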

Export multiple CSV files using different filenames in Python

I am new to Python.
After researching some code based on my idea, which is extracting historical stock data,
I now have working code (see below) for extracting a single name and exporting it to a CSV file:
import investpy
import sys

sys.stdout = open("extracted.csv", "w")
df = investpy.get_stock_historical_data(stock='JFC',
                                        country='philippines',
                                        from_date='25/11/2020',
                                        to_date='18/12/2020')
print(df)
sys.stdout.close()
Now I'm trying to make it more advanced.
I want to run this code automatically for multiple stock names (about 300-plus names) and export each one to its own file.
I know it is possible, but I cannot find the exact terminology for this problem.
Hoping for your help.
Regards,
You can store the stock names in a list, then iterate through the list and save each dataframe to a separate file.
import investpy

stocks_list = ['JFC', 'AAPL', ....]  # your stock list
for stock in stocks_list:
    df = investpy.get_stock_historical_data(stock=stock,
                                            country='philippines',
                                            from_date='25/11/2020',
                                            to_date='18/12/2020')
    print(df)
    file_name = 'extracted_' + stock + '.csv'
    df.to_csv(file_name, index=False)
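With 300-plus names, a single failing ticker would stop the whole run, so it may be worth catching errors and moving on. A possible variation (the short ticker list here is just a placeholder):
import investpy

stocks_list = ['JFC']  # extend with your ~300 names
for stock in stocks_list:
    try:
        df = investpy.get_stock_historical_data(stock=stock,
                                                country='philippines',
                                                from_date='25/11/2020',
                                                to_date='18/12/2020')
    except Exception as exc:
        print(f'Skipping {stock}: {exc}')  # e.g. unknown ticker or network error
        continue
    df.to_csv(f'extracted_{stock}.csv', index=False)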

Python: iterating over multiple files

I have
file_2000.dta, file_2001.dta, file_2002.dta and so on.
I also have
file1_2000.dta, file1_2001.dta, file1_2002.dta and so on.
I want to iterate over the file year. In pseudocode:
Let (year) = 2000, 2001, 2002, etc
import file_(year) using pandas.
import file1_(year) using pandas.
file_(year)['name'] = file_(year).index
file1_(year)['name'] = file1_(year).index2
merged = pd.merge(file_(year), file1_(year), on='name')
write/export merged_(year).dta
It seems to me that, given the .dta extensions, you need the read_stata function to read the files in a loop, build a list of the separate dataframes so you can work with them individually, and then concatenate all the dataframes into one.
Something like:
import pandas as pd

list_of_files = ['file_2000.dta', 'file_2001.dta', 'file_2002.dta']  # full paths here...
frames = []
for f in list_of_files:
    df = pd.read_stata(f)
    frames.append(df)
consolidated_df = pd.concat(frames, axis=0, ignore_index=True)
These questions might be relevant to your case:
How to Read multiple files in Python for Pandas separate dataframes
Pandas read_stata() with large .dta files
As far as I know, there is no 'Let' keyword in Python. To iterate over multiple files in a directory you can simply use a for loop with the os module, like the following:
import os

directory = r'C:\Users\admin'
for filename in os.listdir(directory):
    if filename.startswith("file_200") and filename.endswith(".dta"):
        pass  # do something with the file here
    else:
        continue
Another approach is to use a regex to match the file names during the iteration; the pattern would be something like: pattern = r"file_20\d+"
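Putting the pieces of the question together, a rough sketch of the year loop (assuming the files sit in the working directory and that the merge key really does come from each file's index, as the pseudocode suggests):
import pandas as pd

for year in range(2000, 2003):  # extend the range to cover all your years
    df_a = pd.read_stata(f'file_{year}.dta')
    df_b = pd.read_stata(f'file1_{year}.dta')

    # build the merge key from the index ('.index2' in the pseudocode is
    # taken here to mean the index as well)
    df_a['name'] = df_a.index
    df_b['name'] = df_b.index

    merged = pd.merge(df_a, df_b, on='name')
    merged.to_stata(f'merged_{year}.dta', write_index=False)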

Why does pandas (Python) use disk space?

I have a PC with two disks:
110 GB SSD
1 TB HDD
There is around 18 GB free on the SSD.
When I run the Python code below, it "uses" all the free space on my SSD (I end up with only 1 GB free). The code iterates over all the SAS files in a folder, performs a group-by operation, and appends the results of each file to one big dataframe.
import pandas as pd
import os
import datetime
import numpy as np

# The function GetDailyPricePoints does the following:
# 1. Imports the file
# 2. Creates the "price" variable
# 3. Performs a group by
# 4. Decodes byte variables and converts salesdate to date type (if needed)
def GetDailyPricePoints(inpath, infile):
    intable = pd.read_sas(filepath_or_buffer=os.path.join(inpath, infile))
    # Create price column
    intable.loc[intable['quantity'] != 0, 'price'] = intable['salesvalue'] / intable['quantity']
    intable['price'] = round(intable['price'].fillna(0.0), 0)
    # Create outtable
    outtable = intable.groupby(["salesdate", "storecode", "price", "barcode"]).agg(
        {'key_row': 'count', 'salesvalue': 'sum', 'quantity': 'sum'}).reset_index().rename(
        columns={'key_row': 'Baskets', 'salesvalue': 'Sales', 'quantity': 'Quantity'})
    # Fix byte values and salesdate column
    for column in outtable:
        if column not in list(outtable.select_dtypes(include=[np.number]).columns.values):  # non-numeric columns
            outtable[column] = outtable[column].where(outtable[column].apply(type) != bytes,
                                                      outtable[column].str.decode('utf-8'))
        elif column == 'salesdate':  # numeric column named salesdate
            outtable[column] = pd.to_timedelta(outtable[column], unit='D') + pd.Timestamp('1960-1-1')
    return outtable

inpath = r'C:\Users\admin\Desktop\Transactions'
outpath = os.getcwd() + '\\Export'
outfile = 'DailyPricePoints'

dirs = os.listdir(inpath)
outtable = pd.DataFrame()
# loop through SAS files in folder
for file in dirs:
    if file[-9:] == '.sas7bdat':
        outtable.append(GetDailyPricePoints(inpath, file, decimals))
I would like to understand what exactly is using the disk space. I would also like to move wherever this "temporary work" is being saved to a path on my HDD.
You are copying all the data you have into RAM; you don't have enough of it in this case, so Python uses a page file or virtual memory instead. The only way to fix this would be to get more memory, or you could simply not store everything in one big dataframe, e.g. write each file's result to its own pickle with outtable.to_pickle('outtable.pkl').
However, if you insist on storing everything in one large CSV, you can append to a CSV by passing a file object as the first argument:
out = open('out.csv', 'a')
outtable.to_csv(out, index=False)
doing the .to_csv() step within your loop.
Also, the .append() method for dataframes does not modify the dataframe in place; it returns a new dataframe (unlike list.append). So your last block of code probably isn't doing what you expect.
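Reusing the imports, paths, and function from the question, a sketch of that last block with the append pitfall avoided (it also drops the decimals argument, which the posted function does not accept):
# collect each per-file result in a list and concatenate once at the end
frames = []
for file in dirs:
    if file.endswith('.sas7bdat'):
        frames.append(GetDailyPricePoints(inpath, file))

outtable = pd.concat(frames, ignore_index=True)
outtable.to_csv(os.path.join(outpath, outfile + '.csv'), index=False)
This still builds the full result in memory, so the virtual-memory issue above remains; writing each per-file result out separately avoids that.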
