Reading Large CSV file using pandas - python-3.x

I have to read and analyse a logging file from a CAN bus, which is in CSV format. It has 161,180 rows and 566 columns separated by semicolons. This is the code I have:
import csv
import dtale
import pandas as pd

path = r'C:\Thesis\Log_Files\InputOutput\Input\Test_Log.csv'
raw_data = pd.read_csv(path, engine="python", chunksize=1000000, sep=";")
df = pd.DataFrame(raw_data)
#df
dtale.show(df)
When I run the code in a Jupyter Notebook, it fails with the error message below. Please help me with this. Thanks in advance!
MemoryError: Unable to allocate 348. MiB for an array with shape (161180, 566) and data type object
import time
import pandas as pd
import csv
import dtale

chunk_size = 1000000
batch_no = 1
for chunk in pd.read_csv(r"C:\Thesis\Log_Files\InputOutput\Input\Book2.csv", chunksize=chunk_size, sep=";"):
    chunk.to_csv('chunk' + str(batch_no) + '.csv', index=False)
    batch_no += 1

df1 = pd.read_csv('chunk1.csv')
df1
dtale.show(df1)
The above code works when I try it with only 10 rows and 566 columns, but with all 161,180 rows it fails. Could anyone help me with this? Thanks in advance!

You are running out of RAM when loading the data file. The best option is to split the file into chunks and read it chunk by chunk.
To read the first 999,999 (non-header) rows:
read_csv(..., nrows=999999)
If you want to read rows 1,000,000 ... 1,999,999
read_csv(..., skiprows=1000000, nrows=999999)
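Note that an integer skiprows also skips the header line; to keep the header while skipping the first 1,000,000 data rows, pass a range instead:
read_csv(..., skiprows=range(1, 1000001), nrows=999999)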
You'll probably also want to use chunksize:
This returns a TextFileReader object for iteration:
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
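Putting it together, a minimal sketch of that pattern (not the asker's exact workflow; the column name 'Signal' and the filter condition are placeholder assumptions): process each chunk as it arrives and keep only the reduced result, so the full 161,180 x 566 table never has to sit in memory at once.
import pandas as pd

path = r'C:\Thesis\Log_Files\InputOutput\Input\Test_Log.csv'
chunksize = 10 ** 5  # smaller chunks keep peak memory low

pieces = []
for chunk in pd.read_csv(path, sep=';', chunksize=chunksize):
    # Keep only the rows/columns you actually need from each chunk;
    # 'Signal' and the threshold are hypothetical placeholders.
    pieces.append(chunk[chunk['Signal'] > 0])

# Concatenate only the (much smaller) filtered pieces.
df = pd.concat(pieces, ignore_index=True)
print(df.shape)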

Related

How to get specific data from Excel

Any idea how I can access or get the box data (see image) under the TI_Binning tab of an Excel file using Python? What module or similar code can you recommend? I just need that specific data, to append it to another file such as a .txt file.
Getting the data you circled:
import pandas as pd
df = pd.read_excel('yourfilepath', 'TI_Binning', skiprows=2)
df = df[['Number', 'Name']]
To append to an existing text file:
import numpy as np

with open("filetoappenddata.txt", "ab") as f:
    # fmt='%s' is needed because df.values is an object array here
    np.savetxt(f, df.values, fmt='%s')
See the np.savetxt documentation for format options to fit your output needs.
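For example, to write the values as tab-separated strings (fmt and delimiter are standard np.savetxt parameters):
np.savetxt(f, df.values, fmt='%s', delimiter='\t')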

How do I interpolate the time based on a given condition?

I have over 50 CSV files to process, and each dataset looks like this:
[a simplified csv file example](https://i.stack.imgur.com/yoGo9.png)
There are three columns: Times, Index, Voltage
I want to interpolate the time at which the total voltage decrease [here (84 - 69) = 15] reaches 53% [i.e. 15 * 0.53] at index 2.
I will repeat this process for index 4, too. May I ask what I should do?
I am a beginner in Python and tried the following script:
import pandas as pd
import glob
import numpy as np
import os
import matplotlib.pyplot as plt
import xlwings as xw

path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.csv"))  # collect all csv files
for f in csv_files:  # process each dataset
    df = pd.read_csv(f)
    ta95 = df[df.Step_Index == 2]      # new dataframe for step index 2
    b = ta95.iloc[0]["Voltage(V)"]     # voltage in the first row
    d = ta95.iloc[-1]["Voltage(V)"]    # voltage in the last row
    e = (b - d) * 0.53                 # 53% of the voltage decrease
I don't know what I should do next in this script.
I appreciate your time and support if you can offer help.
If you have any recommended websites for me to read that would help me solve this kind of problem, I would appreciate that too. Thanks again.
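One possible next step (a sketch, not an accepted answer; the time column name "Times" is taken from the question's description and may differ in the actual files) is to interpolate the time at the target voltage with np.interp:
import numpy as np
import pandas as pd

df = pd.read_csv("example.csv")  # placeholder file name
ta95 = df[df.Step_Index == 2]

v_first = ta95.iloc[0]["Voltage(V)"]
v_last = ta95.iloc[-1]["Voltage(V)"]
v_target = v_first - (v_first - v_last) * 0.53  # voltage after a 53% drop

# np.interp needs increasing x-values; voltage decreases over time,
# so negate it to make it increasing before interpolating.
t_at_target = np.interp(-v_target,
                        -ta95["Voltage(V)"].to_numpy(),
                        ta95["Times"].to_numpy())
print(t_at_target)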

python3 - after successfully reading a CSV file into a 2D array - how to print a single cell by index?

I want to read a CSV file (filled in by temperature sensors) with Python 3.
Reading the CSV file into an array works fine, but printing a single cell by index fails. Please help with the right line of code.
This is the code.
import sys
import pandas as pd
import numpy as np
import array as ar

# Reading the CSV file
# date;seconds;Time;Aussen-Temp;Ruecklauf;Kessel;Vorlauf;Diff
# 20211019;0;20211019;12,9;24;22,1;24,8;0,800000000000001
# ...
# ... (2800 rows in total)
br = pd.read_csv('/var/log/LM92Temperature_U-20211019.csv',
                 header=0,
                 sep=";",
                 usecols=['date','seconds','Time','Aussen-Temp','Ruecklauf','Kessel','Vorlauf','Diff'])
print(br)  # works fine - prints the whole CSV table :-) !
#-----------------------------------------------------------
# Now I want to print the element [2][3] of the two-dimensional "CSV" array. How to manage that?
print(br[2][3])  # ... ends up with an error ...
# What is the correct code needed here, please?
Thanks in advance & Regards
Give the name of the column, not the index:
print(br['Time'][3])
As an aside, you can read your data with only the following, and you may want decimal=',' as well:
import pandas as pd
br = pd.read_csv('/var/log/LM92Temperature_U-20211019.csv', sep=';', decimal=',')
print(br)
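If you want purely positional access instead (row 2, column 3, both zero-based), pandas provides DataFrame.iloc:
print(br.iloc[2, 3])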

Problem with processing a large (>1 GB) CSV file

I have a large CSV file, and I have to sort it and write the sorted data to another CSV file. The CSV file has 10 columns. Here is my code for sorting:
data = [x.strip().split(',') for x in open(filename + '.csv', 'r').readlines() if x[0] != 'I']
data = sorted(data, key=lambda x: (x[6], x[7], x[8], int(x[2])))
with open(filename + '_sorted.csv', 'w') as fout:
    for x in data:
        print(','.join(x), file=fout)
It works fine with files below 500 MB but cannot process files larger than 1 GB. Is there any way I can make this process more memory efficient? I am running this code on Google Colab.
There are good blog posts about using pandas for large datasets; the examples there analyze datasets around 1 GB in size.
Simply use the following to read your CSV data into Python:
import pandas as pd
gl = pd.read_csv('game_logs.csv', sep = ',')
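If the data still does not fit in memory, a standard alternative (not covered in the answer above) is an external merge sort: sort chunks that fit in RAM, write each sorted run to disk, then stream-merge the runs with heapq.merge. A minimal sketch, reusing the question's sort key and its filter on lines starting with 'I'; file names are placeholders:
import heapq
import os

def sort_key(line):
    x = line.rstrip('\n').split(',')
    return (x[6], x[7], x[8], int(x[2]))

# Pass 1: sort RAM-sized chunks and write them out as sorted "runs".
run_paths = []
with open('big.csv') as f:  # placeholder input file
    while True:
        chunk = f.readlines(100_000_000)  # read roughly 100 MB of lines
        if not chunk:
            break
        lines = sorted((l for l in chunk if not l.startswith('I')), key=sort_key)
        run_path = 'run%d.csv' % len(run_paths)
        with open(run_path, 'w') as out:
            out.writelines(lines)
        run_paths.append(run_path)

# Pass 2: lazily merge the sorted runs into the output file.
runs = [open(p) for p in run_paths]
with open('big_sorted.csv', 'w') as fout:
    fout.writelines(heapq.merge(*runs, key=sort_key))
for r in runs:
    r.close()
for p in run_paths:
    os.remove(p)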

determine file path for pandas.read_csv in python

I use the following code to read a CSV file and save it as a pandas data frame, but the method always returns the iris dataset, not my data. What is the problem?
import pandas as pd
a = pd.read_csv(r"D:\data.csv")
print(a)
