Why does pandas (Python) use disk space? - python-3.x

I have a PC with two disks:
110GB SSD
1TB HDD
There is around 18GB free on the SSD.
When I run the Python code below, it "uses" nearly all the free space on my SSD (I end up with only 1GB free). The code iterates over all SAS files in a folder, performs a group-by, and appends the result of each file to one big dataframe.
import pandas as pd
import os
import datetime
import numpy as np

#The function GetDailyPricePoints does the following:
#1. Imports the file
#2. Creates the "price" variable
#3. Performs a group by
#4. Decodes byte variables and converts salesdate to date type (if needed)
def GetDailyPricePoints(inpath, infile):
    intable = pd.read_sas(filepath_or_buffer=os.path.join(inpath, infile))

    #Create price column
    intable.loc[intable['quantity'] != 0, 'price'] = intable['salesvalue'] / intable['quantity']
    intable['price'] = round(intable['price'].fillna(0.0), 0)

    #Create outtable
    outtable = intable.groupby(["salesdate", "storecode", "price", "barcode"]).agg(
        {'key_row': 'count', 'salesvalue': 'sum', 'quantity': 'sum'}
    ).reset_index().rename(columns={'key_row': 'Baskets', 'salesvalue': 'Sales', 'quantity': 'Quantity'})

    #Fix byte values and salesdate column
    for column in outtable:
        if not column in list(outtable.select_dtypes(include=[np.number]).columns.values):  #loop over non-numeric columns
            outtable[column] = outtable[column].where(outtable[column].apply(type) != bytes, outtable[column].str.decode('utf-8'))
        elif column == 'salesdate':  #numeric column named salesdate
            outtable[column] = pd.to_timedelta(outtable[column], unit='D') + pd.Timestamp('1960-1-1')
    return outtable

inpath = r'C:\Users\admin\Desktop\Transactions'
outpath = os.getcwd() + '\Export'
outfile = 'DailyPricePoints'
dirs = os.listdir(inpath)

outtable = pd.DataFrame()
#loop through SAS files in the folder
for file in dirs:
    if file[-9:] == '.sas7bdat':
        outtable.append(GetDailyPricePoints(inpath, file))
I would like to understand what exactly is using the disk space. I would also like to redirect wherever these temporary files are being written to a path on my HDD.

You are copying all the data you have into RAM; you don't have enough RAM in this case, so the operating system spills over into the page file (virtual memory) on your SSD. The only way to fix this would be to get more memory, or you could simply not store everything in one big dataframe, e.g. write each file's result to its own pickle with outtable.to_pickle('outfile.pkl').
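For example, a minimal sketch of the per-file pickle approach, reusing the GetDailyPricePoints function from the question (the HDD output folder is my own placeholder):
import os

inpath = r'C:\Users\admin\Desktop\Transactions'
outpath = r'D:\Export'  # assumed folder on the HDD, adjust to your setup

for file in os.listdir(inpath):
    if file.endswith('.sas7bdat'):
        result = GetDailyPricePoints(inpath, file)
        # one pickle per input file; nothing accumulates in RAM
        result.to_pickle(os.path.join(outpath, file.replace('.sas7bdat', '.pkl')))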
However, if you insist on storing everything in one large CSV, you can append to it by passing a file object opened in append mode as the first argument:
out = open('out.csv', 'a')
outtable.to_csv(out, index = False)
with the .to_csv() call placed inside your loop.
Also, the DataFrame .append() method does not modify the dataframe in place; it returns a new dataframe (unlike list.append). So your last block of code probably isn't doing what you expect.
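A minimal sketch of the corrected accumulation, using the same names as in the question: collect each file's result in a plain Python list and concatenate once at the end (this still needs enough RAM for the combined result).
results = []
for file in dirs:
    if file.endswith('.sas7bdat'):
        results.append(GetDailyPricePoints(inpath, file))  # list.append, not DataFrame.append
outtable = pd.concat(results, ignore_index=True)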

Related

Pandas Reading excel files and concatenate them

I have a question about the runtime efficiency of reading and concatenating files into a single DataFrame.
I have about 15 files, and I want to read, filter, and concatenate each one with the others.
Right now the average file size is 8,000KB, and the code takes about 8 minutes to run.
So basically I want to ask whether there is a way to get a faster run time.
Thanks in advance!
Note that the code is from another PC, so I copied it manually.
import pandas as pd
import os

Path = ~mypath~
Fields = pd.read_excel("Fields.xlsx")
Variable = Fields["Variable"].values.tolist()
Segments = Fields["Segment"].values.tolist()
Components = Fields["Component"].values.tolist()

li = []
for file in Path:
    df = None
    if file.endswith(".xlsx"):
        df = pd.read_excel(file)
        li.append(df[(df["Variable"].isin(Variable)) &
                     (df["Segment"].isin(Segments)) &
                     (df["Component"].isin(Components))])
Frame = pd.concat(li, axis=0, ignore_index=True)
EDIT:
Since I run the code on a VDI, the performance is low.
I tried running it on a local PC, and the execution time was a quarter of what it is on the VDI.
I have tried to search for a faster method.
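Not something from the thread, but before optimizing it may help to confirm whether the time goes into read_excel or into the filtering; a rough sketch with a hypothetical file name:
import time
import pandas as pd

start = time.perf_counter()
df = pd.read_excel("example.xlsx")  # hypothetical file name
read_seconds = time.perf_counter() - start

start = time.perf_counter()
filtered = df[df["Variable"].isin(Variable) &
              df["Segment"].isin(Segments) &
              df["Component"].isin(Components)]
filter_seconds = time.perf_counter() - start

print(f"read: {read_seconds:.1f}s, filter: {filter_seconds:.1f}s")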

Passing Key,Value into a Function

I want to check a YouTube video's views and keep track of them over time. I wrote a script that works great:
import requests
import re
import pandas as pd
from datetime import datetime
import time

def check_views(link):
    todays_date = datetime.now().strftime('%d-%m')
    now_time = datetime.now().strftime('%H:%M')
    #get the site
    r = requests.get(link)
    text = r.text
    tag = re.compile(r'\d+ views')
    views = re.findall(tag, text)[0]
    #get the number of views; it's returned in a list, so pull that item out
    cleaned_views = re.findall(r'\d+', views)[0]
    print(cleaned_views)
    #append to the df
    df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
    #df = df.append([todays_date, now_time, int(cleaned_views)],axis=0)
    df.to_csv('views.csv')
    return df

df = pd.DataFrame(columns=['Date','Time','Views'])

while True:
    df = check_views('https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s')
    time.sleep(1800)
But now I want to use this function for multiple links. I want a different CSV file for each link. So I made a dictionary:
link_dict = {'link1': 'https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s',
             'link2': 'https://www.youtube.com/watch?v=ZPrAKuOBWzw'}
#this makes it easy for each csv file to be named for the corresponding link
The loop then becomes:
for key, value in link_dict.items():
    df = check_views(value)
That seems to work for passing the dict value (the link) into the function. Inside the function, I just made sure to load the correct CSV file at the beginning:
#Existing csv files
df = pd.read_csv(k + '.csv')
But then I get an error when I go to append a new row to the df ("cannot set a row with mismatched columns"). I don't understand that, since it works just fine in the code written above. This is the part giving the error:
df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
What am I missing here? Using this dictionary method seems super messy (I only have 2 links I want to check, but rather than just duplicating the function I wanted to experiment a bit more). Any tips? Thanks!
Figured it out! The problem was that I was saving the df to a CSV and then reading that CSV back in later. Because I didn't pass index=False to df.to_csv(), the saved file had an extra index column. When I was just testing with the dictionary, I kept reusing the in-memory df, so even though I was saving it to a CSV, the script was still adding rows to the df itself and never hit the extra column.
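A minimal sketch of the fix they describe, with the scraping part left out (get_view_count is a hypothetical stand-in for the requests/re code from the question):
import pandas as pd
from datetime import datetime

def check_views(key, link):
    df = pd.read_csv(key + '.csv')  # columns: Date, Time, Views
    todays_date = datetime.now().strftime('%d-%m')
    now_time = datetime.now().strftime('%H:%M')
    cleaned_views = get_view_count(link)  # hypothetical helper wrapping the requests/re scraping above
    df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
    df.to_csv(key + '.csv', index=False)  # index=False keeps the file at exactly three columns
    return df

for key, value in link_dict.items():
    check_views(key, value)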

Python 3.9: For loop is not producing output files eventhough no errors are displayed

Everyone, I am fairly new to using Python for data analysis, so apologies for silly questions.
IDE: PyCharm
What I have: a massive .xyz file (with 4 columns) which is a combination of several datasets. Each dataset can be identified by the third column of the file, which goes from 10,000 to -10,000 (with 0 in between) in steps of 100 and then repeats, so every 201 rows is one dataset.
What I want to do: split the massive file into its individual datasets (201 rows each) and save each one under a different name.
What I have done so far :
# Import packages
import os
import pandas as pd
import numpy as np  # For next steps
import math  # For next steps

# Check and change directory
path = 'C:/Clayton/lines/profiles_aufmod'
os.chdir(path)
print(os.getcwd())  # Correct path is printed

# split the xyz file into different files for each profile
main_xyz = 'bathy_SPO_1984_50x50_profile.xyz'
number_lines = sum(1 for row in (open(main_xyz)))
print(number_lines)  # 10854 is the output

rowsize = 201
for i in range(number_lines, rowsize):
    profile_raw_df = pd.read_csv(main_xyz, delimiter=',', header=None, nrows=rowsize,
                                 skiprows=i)
    out_xyz = 'Profile' + str(i) + '.xyz'
    profile_raw_df.to_csv(out_xyz, index=False,
                          header=False, mode='a')
Problems I am facing:
The for loop was at first producing output files (see the "Proof of output" image), but now it does not produce any output and it is not rewriting the previous files either. The other mystery is that I am not getting any error either (see "Code executed without error").
What I tried to fix the issue:
I updated all the packages and restarted PyCharm.
I ran each line of code one by one, and everything works until the for loop.
While counting the number of rows with
number_lines = sum(1 for row in (open(main_xyz)))
you exhaust the iterator that loops over the lines of the file, and you never close the file. That by itself should not prevent pandas from reading the same file, though.
A better idiom would be
with open(main_xyz) as fh:
    number_lines = sum(1 for row in fh)
Your for loop as it stands does not do what you probably want. I guess you want:
for i in range(0, number_lines, rowsize):
so that rowsize is the step size, not the end value of the range.
If you want to number the output files by dataset, keep a count of the dataset, like this:
data_set = 0
for i in range(0, number_lines, rowsize):
    data_set += 1
    ...
    out_xyz = f"Profile{data_set}.xyz"
    ...
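Putting those pieces together, a minimal sketch of the whole loop (same file and variable names as in the question):
import pandas as pd

main_xyz = 'bathy_SPO_1984_50x50_profile.xyz'
rowsize = 201

with open(main_xyz) as fh:
    number_lines = sum(1 for row in fh)

data_set = 0
for i in range(0, number_lines, rowsize):
    data_set += 1
    profile_raw_df = pd.read_csv(main_xyz, delimiter=',', header=None,
                                 nrows=rowsize, skiprows=i)
    out_xyz = f"Profile{data_set}.xyz"
    # each profile goes to its own new file, so mode='a' is not needed
    profile_raw_df.to_csv(out_xyz, index=False, header=False)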

Appending data from multiple excel files into a single excel file without overwriting using python pandas

Here is my current code below.
I have a specific range of cells (from a specific sheet) that I am pulling out of multiple (~30) Excel files. I am trying to pull this information out of all these files and compile it into a single new file, appending to that file each time. I'm going to clean up the destination file manually for the time being, and I will improve this script going forward.
What I currently have works fine for a single sheet, but I overwrite my destination every time I add a new file to the read-in list.
I've tried adding mode='a' and a couple of different ways to concat at the end of my function.
import pandas as pd

def excel_loader(fname, sheet_name, new_file):
    xls = pd.ExcelFile(fname)
    df1 = pd.read_excel(xls, sheet_name, nrows=20)
    print(df1[1:15])
    writer = pd.ExcelWriter(new_file)
    df1.insert(51, 'Original File', fname)
    df1.to_excel(new_file)

names = ['sheet1.xlsx', 'sheet2.xlsx']
destination = 'destination.xlsx'
for name in names:
    excel_loader(name, 'specific_sheet_name', destination)
Thanks in advance for any help; I can't seem to find an answer to this exact situation on here. Cheers.
Ideally you want to loop through the files and read the data into a list, then concatenate the individual dataframes, and then write the new dataframe once. This assumes the data being pulled is the same size/shape and the sheet name is the same in every file. If the sheet name changes, look into the zip() function to pair each filename with its sheet name (as sketched after the code below).
This should get you started:
import pandas as pd

names = ['sheet1.xlsx', 'sheet2.xlsx']
destination = 'destination.xlsx'
sheet_name = 'specific_sheet_name'  # the sheet to pull from every workbook

#read all files first
df_hold_list = []
for name in names:
    xls = pd.ExcelFile(name)
    df = pd.read_excel(xls, sheet_name, nrows=20)
    df_hold_list.append(df)

#concatenate dfs
df1 = pd.concat(df_hold_list, axis=0)  # axis=0 stacks the file ranges vertically; use axis=1 to place them side by side

#write the combined frame once
df1.to_excel(destination, index=False)
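If the sheet name differs per workbook, the zip() idea mentioned above could look roughly like this (the sheet names here are hypothetical):
import pandas as pd

names = ['sheet1.xlsx', 'sheet2.xlsx']
sheets = ['specific_sheet_name', 'another_sheet_name']  # hypothetical, one per file

df_hold_list = []
for name, sheet in zip(names, sheets):
    df = pd.read_excel(name, sheet_name=sheet, nrows=20)
    df['Original File'] = name  # keep track of the source file, as in the question
    df_hold_list.append(df)

pd.concat(df_hold_list, axis=0, ignore_index=True).to_excel('destination.xlsx', index=False)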

How to combine multiple csv files based on file name

I have more than 1000 CSV files, and I want to combine the files whose filenames share the same first five digits into one CSV file.
input:
100044566.csv
100040457.csv
100041458.csv
100034566.csv
100030457.csv
100031458.csv
100031459.csv
import pandas as pd
import os
import glob

path_1 = ''
all_files_final = glob.glob(os.path.join(path_1, "*.csv"))
names_1 = [os.path.basename(x1) for x1 in all_files_final]

final = pd.DataFrame()
for file_1, name_1 in zip(all_files_final, names_1):
    file_df_final = pd.read_csv(file_1, index_col=False)
    #file_df['file_name'] = name
    final = final.append(file_df_final)
final.to_csv('', index=False)
I used the above code, but it merges all the files into one CSV file; I don't know how to make the selection based on the filename.
So from the above input:
output 1: combine the first three CSV files into one CSV file, because the first five digits of the filenames are the same.
output 2: combine the next four files into one CSV file, because the first five digits of the filenames are the same.
I would recommend you to approach the problem slightly differently.
Here's my solution:
import os
import pandas as pd

files = os.listdir('.')  # returns list of filenames in the current folder
files_of_interest = {}   # a dictionary that we will be using below

for filename in files:  # iterate over files in the folder
    if filename[-4:] == '.csv':  # check whether the file is in .csv format
        key = filename[:5]  # as mentioned in the question, the first five characters of the filename are of interest
        files_of_interest.setdefault(key, [])  # if we don't have such a key yet, .setdefault creates it and assigns an empty list to it
        files_of_interest[key].append(filename)  # append the new filename to the list

for key in files_of_interest:
    buff_df = pd.DataFrame()
    for filename in files_of_interest[key]:
        buff_df = buff_df.append(pd.read_csv(filename))  # read every file for this key and append it to buff_df
    files_of_interest[key] = buff_df  # replace the list of files with a dataframe
This code will create a dictionary of dataframes, where the keys are the unique first five characters of the .csv filenames.
Then you can iterate over the keys of the dictionary to save each corresponding dataframe as a .csv file.
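That saving step could look like this (the output filename pattern is my own choice):
for key, df in files_of_interest.items():
    df.to_csv(key + '_combined.csv', index=False)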
Hope my answer helped.
