Colab keeps crashing while running the Python script - python-3.x

I'm using the following code to merge two CSV files, each with around 800k rows. Is there another way to merge the files in the same fashion, or any other solution?
import pandas as pd
df = pd.read_csv("master file.csv")
df1 = pd.read_csv("onto_diseas.csv")
df4 = pd.merge(df, df1, left_on='extId', right_on='extId', how='inner')
df4

Try using Dask or Apache Drill for the merge, or specify dtypes that demand less memory (float16 instead of float64, and so forth) when creating the dataframes. I could show this in code if you provide links to your files.
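For illustration, here is a minimal sketch of both ideas, assuming the file names from the question and that 'extId' is an integer column (the dtype map is an assumption you would adjust to your actual data):
import pandas as pd
import dask.dataframe as dd

# Option 1: pandas with lighter dtypes (adjust the dtype map to your real columns)
df = pd.read_csv("master file.csv", dtype={"extId": "int32"})
df1 = pd.read_csv("onto_diseas.csv", dtype={"extId": "int32"})
df4 = pd.merge(df, df1, on="extId", how="inner")

# Option 2: Dask, which performs the merge in partitions instead of all at once
ddf = dd.read_csv("master file.csv")
ddf1 = dd.read_csv("onto_diseas.csv")
df4_dask = dd.merge(ddf, ddf1, on="extId", how="inner").compute()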

Related

How to fix NaN values in concatenated multiple tables in Jupyter Notebook

I am using Jupyter Notebook. I have concatenated multiple tables. When I run the head() command I cannot see the values in the age and gender columns of the table; instead it shows NaN values against each user_id.
The following image_1 shows the output when I concatenated the two different tables.
How can I sort out this issue, or is there another way to concatenate tables where I can see all of the table values?
Or do I need to access the tables separately and apply operations on each of them?
I am expecting to get the values in the age and gender columns rather than NaN values.
When I use these tables separately, they show correct results, but I have a big-data problem, so I need to concatenate the tables to access each of the feature columns. In the end, I can apply operations on the concatenated table's features.
I have been testing your problem with the two CSVs from your GitHub.
First of all, I loaded the two tables as 'df1' and 'df2' after importing the pandas library.
import pandas as pd
df1 = pd.read_csv('https://raw.githubusercontent.com/HaseebTariq7/Data_analysis_dataset/main/data_1.csv')
df2 = pd.read_csv('https://raw.githubusercontent.com/HaseebTariq7/Data_analysis_dataset/main/data_2.csv')
Then, using pandas, you can merge both dataframes on the connecting column, in this case 'user_id'.
final_df = pd.merge(df1, df2, on='user_id')
Finally, we have 'final_df' with all the information from both tables and without NaNs.
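As a side note on why the original concatenation produced NaNs: pd.concat stacks the frames, so each frame is missing the other's columns and those cells become NaN, whereas a merge on the shared key aligns the rows. A tiny sketch with hypothetical miniature tables:
import pandas as pd

# Hypothetical miniature versions of data_1.csv and data_2.csv
df1 = pd.DataFrame({"user_id": [1, 2], "age": [25, 31]})
df2 = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})

# Row-wise concat stacks the tables; each half lacks the other's columns, hence NaN
print(pd.concat([df1, df2]))

# Merging on the shared key aligns rows instead, so no NaN appears
print(pd.merge(df1, df2, on="user_id"))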

Using constant memory with pandas xlsxwriter

I'm trying to use the code below to write large pandas dataframes to Excel worksheets. If I write a dataframe directly, the system runs out of RAM. Is this a viable option, or are there any alternatives?
writer = pd.ExcelWriter('Python Output Analysis.xlsx', engine='xlsxwriter',options=dict(constant_memory=True))
The constant_memory mode of XlsxWriter can be used to write very large Excel files with low, constant memory usage. The catch is that the data needs to be written in row order and (as @Stef points out in the comments above) Pandas writes to Excel in column order. So constant_memory mode won't work with Pandas ExcelWriter.
As an alternative you could avoid ExcelWriter and write the data directly to XlsxWriter from the dataframe on a row by row basis. However, that will be slower from a Pandas point of view.
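A minimal sketch of that row-by-row approach, assuming df is the large dataframe already in memory and 'output.xlsx' is a placeholder filename; nan_inf_to_errors is set so NaN cells are written as Excel errors instead of failing:
import pandas as pd
import xlsxwriter

df = pd.read_csv('large_data.csv')  # placeholder for however you build the large dataframe

workbook = xlsxwriter.Workbook(
    'output.xlsx',
    {'constant_memory': True, 'nan_inf_to_errors': True},
)
worksheet = workbook.add_worksheet()

# constant_memory requires strictly row-ordered writes, so emit the header first...
worksheet.write_row(0, 0, df.columns)

# ...then each data row in order
for row_num, row in enumerate(df.itertuples(index=False), start=1):
    worksheet.write_row(row_num, 0, row)

workbook.close()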
If your data is large, consider saving it as a plain text file instead, e.g. csv, txt, etc.
df.to_csv('file.csv', index=False, sep=',')
df.to_csv('file.tsv', index=False, sep='\t')
Or split the DataFrame and save it to several smaller files.
df_size = df.shape[0]
chunksize = df_size//10
for i in range(0, df_size, chunksize):
    # print(i, i + chunksize)
    dfn = df.iloc[i:i + chunksize, :]
    dfn.to_excel('...')

Delete every pandas dataframe in final script

I'm using pandas dataframes in different scripts. For example:
script1.py:
import pandas as pd
df1 = pd.read_csv("textfile1.csv")
# do stuff with df1, including copying some columns to use
script2.py:
import pandas as pd
df2 = pd.read_csv("textfile2.csv")
# do stuff with df2, including using .loc to grab some specific rows
I then use these two dataframes (in reality I'm using about 50 dataframes) in different Flask views and Python scripts. However, when I go to the homepage of my Flask application and follow the steps to create a new result based on a different input file, it keeps giving me the old (or first) results file, based on the dataframes it read in the first time.
I tried (mostly in combination with one another):
- logout_user()
- session.clear()
- CACHE_TYPE=null
- gc.collect()
- SECRET_KEY = str(uuid.uuid4())
- for var in dir():
      if isinstance(eval(var), pd.core.frame.DataFrame):
          del globals()[var]
I can't (read: shouldn't) delete pandas dataframes after they are created, as it is all interconnected. But what I would like is to have a button at the end of the last page, and if I were to click it, it would delete every pandas dataframe that exists in every script or in memory. Is that a possibility? That would hopefully solve my problem.
Try using a class:
import pandas as pd

class Dataframe1():
    def __init__(self, data):
        self.data = data

d1 = Dataframe1(pd.read_csv("textfile1.csv"))
If you want to access the data:
d1.data
To delete it:
del d1
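Another option, if deleting objects one by one is impractical: keep all the frames in a single shared registry (the module and function names below are hypothetical) so the Flask view behind your button can clear everything in one call.
# dataframes.py (hypothetical shared module)
import pandas as pd

_registry = {}

def load(name, path):
    """Read a CSV once and register the resulting dataframe under a name."""
    _registry[name] = pd.read_csv(path)
    return _registry[name]

def get(name):
    return _registry[name]

def clear_all():
    """Drop every registered dataframe; the next request reloads fresh data."""
    _registry.clear()

# In a script:  df1 = load("df1", "textfile1.csv")
# In the Flask view behind the button:  clear_all()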

Compile a dataframe from multiple CSVs using list of dfs

I am trying to create a single dataframe from 50 CSV files. I need only two columns of each CSV, namely 'Date' and 'Close'. I tried using the df.join function inside the for loop, but it eats up a lot of memory and I get the error "Killed: 9" after processing almost 22-23 CSV files.
So now I am trying to create a list of dataframes with only the 2 columns inside the for loop, and then concat the dfs outside the loop.
I have the following issues to be resolved:
(i) Though most of the CSV files have a start date of 2000-01-01, a few have later start dates. I want the main dataframe to contain all the dates, with NaN or empty fields for the CSVs with later start dates.
(ii) I want to concat them with Date as the index.
My code is:
def compileData(symbol):
    with open("nifty50.pickle", "rb") as f:
        symbols = pickle.load(f)

    dfList = []
    main_df = pd.DataFrame()
    for symbol in symbols:
        df = pd.read_csv('/Users/uditvashisht/Documents/udi_py/stocks/stock_dfs/{}.csv'.format(symbol),
                         infer_datetime_format=True, usecols=['Date', 'Close'],
                         index_col=None, header=0)
        df.rename(columns={'Close': symbol}, inplace=True)
        dfList.append(df)

    main_df = pd.concat(dfList, axis=1, ignore_index=True, join='outer')
    print(main_df.head())
You can use index_col=0 in read_csv, or dfList.append(df.set_index('Date')), to put your Date column in the index of each dataframe. Then, with pd.concat and axis=1, Pandas will use intrinsic data alignment to align all the dataframes on the index.
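Putting that suggestion together, a sketch of the corrected function might look like this (same pickle and CSV paths as in the question; parse_dates is an assumption so the index is treated as dates):
import pickle
import pandas as pd

def compileData():
    with open("nifty50.pickle", "rb") as f:
        symbols = pickle.load(f)

    dfList = []
    for symbol in symbols:
        df = pd.read_csv(
            '/Users/uditvashisht/Documents/udi_py/stocks/stock_dfs/{}.csv'.format(symbol),
            usecols=['Date', 'Close'], parse_dates=['Date'])
        # index on Date and name the Close column after the symbol
        dfList.append(df.set_index('Date').rename(columns={'Close': symbol}))

    # axis=1 aligns on the Date index; the outer join keeps all dates and
    # leaves NaN where a symbol starts later than 2000-01-01
    main_df = pd.concat(dfList, axis=1, join='outer')
    print(main_df.head())
    return main_df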

Pandas / odo / bcolz selective loading of rows from a large CSV file

Say we have a large CSV file (e.g. 200 GB) where only a small fraction of the rows (e.g. 0.1% or less) contains data of interest.
Say we define such a condition as one specific column containing a value from a pre-defined list (e.g. 10K values of interest).
Does odo or Pandas provide methods for this type of selective loading of rows into a dataframe?
I don't know of anything in odo or pandas that does exactly what you're looking for, in the sense that you just call a function and everything else is done under the hood. However, you can write a short pandas script that gets the job done.
The basic idea is to iterate over chunks of the csv file that will fit into memory, keeping only the rows of interest, and then combining all the rows of interest at the end.
import pandas as pd
pre_defined_list = ['foo', 'bar', 'baz']
good_data = []
for chunk in pd.read_csv('large_file.csv', chunksize=10**6):
    chunk = chunk[chunk['column_to_check'].isin(pre_defined_list)]
    good_data.append(chunk)
df = pd.concat(good_data)
Add/alter parameters for pd.read_csv and pd.concat as necessary for your specific situation.
If performance is an issue, you may be able to speed things up by using an alternative to .isin, as described in this answer.
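One such alternative (not necessarily the one in the linked answer) is to filter each chunk with an inner merge against a one-column frame of the lookup values, which can beat .isin when the list is large:
import pandas as pd

pre_defined_list = ['foo', 'bar', 'baz']
lookup = pd.DataFrame({'column_to_check': pre_defined_list}).drop_duplicates()

good_data = []
for chunk in pd.read_csv('large_file.csv', chunksize=10**6):
    # the inner merge keeps only rows whose column_to_check appears in lookup
    good_data.append(chunk.merge(lookup, on='column_to_check', how='inner'))

df = pd.concat(good_data, ignore_index=True)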
