How to get the number of rows in a SAS dataset in Python - python-3.x

Is there any way to get the number of rows in a SAS dataset ("xxxx.sas7bdat") without actually reading the dataset into Python? The reason for not reading it is that it is huge.

You might be able to do this by simply counting the lines in the file with the wc -l shell command and reading the result into a variable:
>>> import os
>>> stream = os.popen('wc -l example.txt')
>>> output = stream.read()
>>> output
' 3 example.txt\n'
You could further tokenize the output to get number of rows as a variable:
>>> output.split()
['3', 'example.txt']
>>> output.split()[0]
'3'
>>> int(output.split()[0])
3
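Note that wc -l counts newline characters, so it only works for line-oriented text files; a binary .sas7bdat file is not organized that way. As an alternative, here is a sketch using the pyreadstat package (assuming you can install it), which can read just the file's metadata without loading any data:
import pyreadstat

# Read only the header/metadata of the SAS file; the data itself is
# not loaded, so this stays cheap even for a huge file.
df, meta = pyreadstat.read_sas7bdat("xxxx.sas7bdat", metadataonly=True)
print(meta.number_rows)     # row count taken from the file metadata
print(meta.number_columns)  # column count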
Hope this helps.

Related

Reading Large CSV file using pandas

I have to read and analyse a logging file from CAN which is in CSV format. It has 161,180 rows and 566 columns separated by semicolons. This is the code I have:
import csv
import dtale
import pandas as pd
path = 'C:\Thesis\Log_Files\InputOutput\Input\Test_Log.csv'
raw_data = pd.read_csv(path,engine="python",chunksize = 1000000, sep=";")
df = pd.DataFrame(raw_data)
#df
dtale.show(df)
I get the following error message when I run the code in Jupyter Notebook. Please help me with this. Thanks in advance!
MemoryError: Unable to allocate 348. MiB for an array with shape (161180, 566) and data type object
import time
import pandas as pd
import csv
import dtale
chunk_size = 1000000
batch_no=1
for chunk in pd.read_csv("C:\Thesis\Log_Files\InputOutput\Input\Book2.csv",chunksize=chunk_size,sep=";"):
chunk.to_csv('chunk'+str(batch_no)+'.csv', index=False)
batch_no+=1
df1 = pd.read_csv('chunk1.csv')
df1
dtale.show(df1)
I used the above code with only 10 rows and 566 columns, and it works. When I use all 161,180 rows, it does not. Could anyone help me with this? Thanks in advance!
(The output was attached as an image.)
You are running out of RAM when loading the data file. The best option is to split the file into chunks and read it chunk by chunk.
To read the first 999,999 (non-header) rows:
read_csv(..., nrows=999999)
If you want to read rows 1,000,000 ... 1,999,999
read_csv(..., skiprows=1000000, nrows=999999)
You'll probably also want to use chunksize, which returns a TextFileReader object for iteration:
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
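Putting this together, a minimal sketch that keeps memory bounded by handling one chunk at a time and retaining only what you need (the path, separator, and column selection here are placeholders; adjust them to your file):
import pandas as pd

chunksize = 10 ** 5
total_rows = 0
pieces = []
for chunk in pd.read_csv("Test_Log.csv", sep=";", chunksize=chunksize):
    total_rows += len(chunk)
    # Keep only the columns you actually need from each chunk so the
    # concatenated result stays far smaller than the full file.
    pieces.append(chunk.iloc[:, :10])
small_df = pd.concat(pieces, ignore_index=True)
print(total_rows, small_df.shape)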

How do I interpolate the time based on a given condition?

I have over 50 CSV files to process, and each dataset looks like this:
[a simplified csv file example](https://i.stack.imgur.com/yoGo9.png)
There are three columns: Times, Index, Voltage
I want to interpolate the time at which the total voltage decrease [here (84-69) = 15] reaches 53% [that is, 15*0.53] at index 2.
I will repeat this process for index 4, too. May I ask what I should do?
I am a beginner in Python and tried the following script:
import pandas as pd
import glob
import numpy as np
import os
import matplotlib.pyplot as plt
import xlwings as xw

path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.csv"))  # read all csv files
for f in csv_files:  # process each dataset
    df = pd.read_csv(f)
    ta95 = df[df.Step_Index == 2]  # new dataframe for this index
    a = ta95.iloc[0]               # choose first row
    b = a["Voltage(V)"]            # voltage in the first row
    c = ta95.iloc[-1]              # choose last row
    d = c["Voltage(V)"]            # voltage in the last row
    e = (b - d) * 0.53             # 53% of the voltage decrease
I don't know what I should do next in this script.
I appreciate your time and support if you can offer help.
If you have any recommended websites that would help me learn to solve this kind of problem, I would appreciate that too. Thanks again.
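One possible way to finish the loop is to interpolate time as a function of voltage with np.interp. This is only a sketch; the column names Time(s), Step_Index, and Voltage(V) are assumptions based on the script above, so rename them to match your files:
import numpy as np
import pandas as pd

df = pd.read_csv("example.csv")  # placeholder path
for step in (2, 4):  # repeat the process for index 2 and index 4
    seg = df[df.Step_Index == step]
    v_first = seg["Voltage(V)"].iloc[0]
    v_last = seg["Voltage(V)"].iloc[-1]
    # Voltage level once 53% of the total decrease has happened.
    v_target = v_first - (v_first - v_last) * 0.53
    # np.interp needs increasing x values, and the voltage here is
    # decreasing, so reverse both columns before interpolating.
    t = np.interp(v_target, seg["Voltage(V)"][::-1], seg["Time(s)"][::-1])
    print(step, t)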

python3 - after successfully reading CSV file into 2D array - how to print a single cell by index?

I want to read a CSV file (filled in by temperature sensors) with Python 3.
Reading the CSV file into an array works fine; printing a single cell by index fails. Please help with the right line of code.
This is the code.
import sys
import pandas as pd
import numpy as np
import array as ar
#Reading the CSV File
# date;seconds;Time;Aussen-Temp;Ruecklauf;Kessel;Vorlauf;Diff
# 20211019;0;20211019;12,9;24;22,1;24,8;0,800000000000001
# ...
# ... (2800 rows in total)
np = pd.read_csv('/var/log/LM92Temperature_U-20211019.csv',
                 header=0,
                 sep=";",
                 usecols=['date','seconds','Time','Aussen-Temp','Ruecklauf','Kessel','Vorlauf','Diff'])
br = np # works fine
print (br) # works fine - prints whole CSV Table :-) !
#-----------------------------------------------------------
# Now I want to print the element [2] [3] of the two dimensional "CSV" array ... How to manage that ?
print (br [2] [3]) # ... ends up with an error ...
# what is the correct coding needed now, please?
Thanks in advance & Regards
Give the name of the column, not the index:
print(br['Time'][3])
As an aside, you can read your data with only the following, and you may want decimal=',' as well:
import pandas as pd
br = pd.read_csv('/var/log/LM92Temperature_U-20211019.csv', sep=';', decimal=',')
print(br)
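If you do want purely positional indexing, i.e. the cell at row position 2 and column position 3, pandas provides .iloc for that:
print(br.iloc[2, 3])  # row at position 2, column at position 3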

How can I fix a displaced header when reading data with pandas?

I have a problem opening a .prn file using pandas. My file contains spaces at the end of lines, which cause a shift in the headers. Below is an example of what I need. How can I get the behaviour I want?
import pandas as pd
filename='D:/EXP/7head+2cup 60kHz/k=15 I=80 mA 20 kg/1p.prn'
df=pd.read_csv(filename, sep='\t', header=[0])
a = list(df['Temp'])
print(df)
print(a)
(The input sample and the expected output were attached as images.)
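One common cause of this kind of shift is a trailing separator at the end of each data line: it adds an empty extra field, pandas then treats the first column as the index, and the headers slide over by one. A sketch of a possible fix with index_col=False, assuming tab-separated data:
import pandas as pd

filename = 'D:/EXP/7head+2cup 60kHz/k=15 I=80 mA 20 kg/1p.prn'
# index_col=False prevents pandas from using the first column as the
# index when data lines carry one more field than the header row.
df = pd.read_csv(filename, sep='\t', index_col=False)
print(df['Temp'].tolist())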

numpy - reading a csv file into a numpy array

I am new to Python and am using numpy to read a CSV into an array, so I tried two approaches:
Approach 1
train = np.asarray(np.genfromtxt(open("/Users/mac/train.csv","rb"),delimiter=","))
Approach 2
with open('/Users/mac/train.csv') as csvfile:
    rows = csv.reader(csvfile)
    for row in rows:
        newrow = np.array(row).astype(np.int)
        train.append(newrow)
I am not sure what the difference between these two approaches is. Which one is recommended?
I am not concerned with speed, since my data is small, but with differences in the resulting data type.
You can also use pandas; it is simpler to use.
import pandas as pd
import numpy as np
dataset = pd.read_csv('file.csv')
# get all headers in csv
values = list(dataset.columns.values)
# get the labels, assuming the last column in the csv holds them
y = dataset[values[-1:]]
y = np.array(y, dtype='float32')
X = dataset[values[0:-1]]
X = np.array(X, dtype='float32')
So what is the difference in the result?
genfromtxt is the numpy csv reader. It returns an array. No need for an extra asarray.
The second snippet is incomplete; it looks like it would produce a list of arrays, one for each line of the file. It uses the generic Python csv reader, which does little more than read each line and split it into strings.
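For a concrete view of the difference in the results, a small sketch (assuming a purely numeric train.csv):
import csv
import numpy as np

# genfromtxt: a single 2D float array.
arr = np.genfromtxt("train.csv", delimiter=",")
# csv.reader: a list of 1D arrays, one per line, converted by hand.
with open("train.csv") as f:
    rows = [np.array(r).astype(int) for r in csv.reader(f)]
print(type(arr), arr.dtype)       # one ndarray, float64
print(type(rows), rows[0].dtype)  # a list of int arrays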
