python read/write data IndexError: list index out of range

I'm trying to write a simple script to extract specific data columns from my measurement results (.txt files) and then save them into a new text file. Unfortunately I'm already stuck even before the writing part. The code below results in the following error: IndexError: list index out of range
How do I solve this? It seems to be related to the size of the data, i.e. the same code worked for a much smaller data file.
f = open('data.txt', 'r')
header1 = f.readline()
header2 = f.readline()
header3 = f.readline()
for line in f:
    line = line.strip()
    columns = line.split()
    name = columns[2]
    j = columns[3]
    print(name, j)

Before using an index, you should check the length of the split() result, or check the line's pattern using a regex.
Example of a length check to add right after columns = line.split():
if len(columns) < 4:
    continue
This way, a line that does not match the expected data format is skipped instead of crashing.
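A minimal sketch of the whole loop with this guard in place (same file and column layout as the question assumes):
with open('data.txt') as f:
    # skip the three header lines
    for _ in range(3):
        f.readline()
    for line in f:
        columns = line.split()
        if len(columns) < 4:
            continue  # skip short or malformed lines instead of crashing
        name, j = columns[2], columns[3]
        print(name, j)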


Instead of printing to console, create a dataframe for output

I am currently comparing the text of one file to that of another file.
The method: for each row in the source text file, check each row in the compare text file.
If the word is present in the compare file, write the word with 'present' next to it.
If the word is not present, write the word with 'not_present' next to it.
So far I can do this fine by printing to the console output as shown below:
import sys
filein = 'source.txt'
compare = 'compare.txt'
source = 'source.txt'
# change to lower case
with open(filein, 'r+') as fopen:
    string = ""
    for line in fopen.readlines():
        string = string + line.lower()
with open(filein, 'w') as fopen:
    fopen.write(string)
# search and list
with open(compare) as f:
    searcher = f.read()
if not searcher:
    sys.exit("Could not read data :-(")
# search and output the results
with open(source) as f:
    for item in (line.strip() for line in f):
        if item in searcher:
            print(item, ',present')
        else:
            print(item, ',not_present')
the output looks like this:
dog ,present
cat ,present
mouse ,present
horse ,not_present
elephant ,present
pig ,present
what I would like is to put this into a pandas dataframe, preferably with 2 columns, one for the word and the second for its state. I can't seem to get my head around doing this.
I am making several assumptions here to include:
Compare.txt is a text file consisting of a list of single words 1 word per line.
Source.txt is a free flowing text file, which includes multiple words per line and each word is separated by a space.
When comparing to determine if a compare word is in source, it is found if and only if no punctuation marks (i.e. " ' , . ? etc.) are appended to the word in source.
The output dataframe will only contain the words found in compare.txt.
The final output is a printed version of the pandas dataframe.
With these assumptions:
import pandas as pd
from collections import defaultdict

compare = 'compare.txt'
source = 'source.txt'
rslt = defaultdict(list)

def getCompareTxt(fid: str) -> list:
    clist = []
    with open(fid, 'r') as cmpFile:
        for line in cmpFile.readlines():
            clist.append(line.lower().strip('\n'))
    return clist

cmpList = getCompareTxt(compare)

if cmpList:
    with open(source, 'r') as fsrc:
        items = []
        for item in (line.strip().split(' ') for line in fsrc):
            items.extend(item)
    print(items)
    for cmpItm in cmpList:
        rslt['Name'].append(cmpItm)
        if cmpItm in items:
            rslt['State'].append('Present')
        else:
            rslt['State'].append('Not Present')
    df = pd.DataFrame(rslt, index=range(len(cmpList)))
    print(df)
else:
    print('No compare data present')
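As a follow-up, the same two-column frame can be built more compactly once cmpList and items exist (a sketch reusing the names above):
import pandas as pd

df = pd.DataFrame({
    'Name': cmpList,
    'State': ['Present' if w in items else 'Not Present' for w in cmpList],
})
print(df)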

How to read specific blocks of a data file within certain keywords

I have a text data file that looks as shown below:
BEGIN_CYCLE
..
start_data
2d_data1
end_data
..
..
END_CYCLE
BEGIN_CYCLE
..
start_data
2d_data2
end_data
BEGIN_CYCLE
..
start_data
2d_data3
end_data
...
END_CYCLE
and so on
I am only interested in data blocks that start with the start_data keyword and end with the end_data keyword, AND fall between BEGIN_CYCLE and a matching END_CYCLE. In the above example, I want to read 2d_data1 and 2d_data3. Notice that although 2d_data2 starts with start_data and ends with end_data, it is NOT bound by BEGIN_CYCLE and a matching END_CYCLE; it only has a BEGIN_CYCLE and no matching END_CYCLE keyword. Of course I can have any number of begin and end cycles, not just 3. My code below still reads 2d_data2 and actually skips over 2d_data3, then reads subsequent data blocks correctly. I do not know why exactly this is happening.
indexes = []
with open(file) as f:
    lines = f.readlines()
    for i, line in enumerate(lines):
        if line.startswith('BEGIN_CYCLE'):
            s = i
        elif line.startswith('END_CYCLE'):
            e = i
            indexes.append((s, e))
        else:
            pass
temp_list = [list(range(*idx)) for idx in indexes]
indexes = [item for sublist in temp_list for item in sublist]
data = []
with open(file) as f:
    for i, line in enumerate(f):
        if 'start_data' in line and i in indexes:
            chunk = []
            for line in f:
                if not line.startswith('end_data'):
                    chunk.append(''.join(line.strip().split()))
                else:
                    break
            data.append(chunk)
My thought process is to first identify valid test cycles (those with matching BEGIN_CYCLE and END_CYCLE keywords), which explains the first part of the code. Then, within these bounds, I search for the start_data and end_data keywords and append lines of data into chunks, which I eventually collect in a list. The problem with my code is that 2d_data2 is read rather than ignored. In fact, the code works fine whenever the test file has matching BEGIN_CYCLE and END_CYCLE keywords throughout. However, as soon as there are one or more missing END_CYCLE keywords, it includes the data blocks under those cycles instead of ignoring them. Any help or alternative solution is appreciated. Thanks.
Below works exactly like I wanted. However, I don't like the idea of opening the file each time I loop over the indexes. I cannot think of a fix for this.
import itertools

indexes = []
with open(file) as f:
    lines = f.readlines()
    for i, line in enumerate(lines):
        if line.startswith('BEGIN_CYCLE'):
            s = i
        elif line.startswith('END_CYCLE'):
            e = i
            indexes.append((s, e))
        else:
            pass
data = []
for idx in indexes:
    with open(file) as f:
        for line in itertools.islice(f, idx[0], idx[1]):
            if line.startswith('start_data'):
                chunk = []
                for line in f:
                    if not line.startswith('end_data'):
                        chunk.append(''.join(line.strip().split()))
                    else:
                        break
                data.append(chunk)
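As an aside on the first version: the inner for line in f advances the same file object that enumerate(f) is counting, so i drifts out of sync with the real line numbers after the first chunk, which is why 2d_data2 slips through. A single-pass sketch that avoids both the miscounting and reopening the file (assuming the keyword layout shown above):
data = []
with open(file) as f:
    pending = []       # chunks collected since the last BEGIN_CYCLE
    in_cycle = False
    chunk = None
    for line in f:
        if line.startswith('BEGIN_CYCLE'):
            in_cycle = True
            pending = []              # a new BEGIN_CYCLE discards unclosed chunks
            chunk = None
        elif line.startswith('END_CYCLE'):
            if in_cycle:
                data.extend(pending)  # keep chunks only from a closed cycle
            in_cycle = False
            pending = []
            chunk = None
        elif line.startswith('start_data') and in_cycle:
            chunk = []
        elif line.startswith('end_data'):
            if chunk is not None:
                pending.append(chunk)
                chunk = None
        elif chunk is not None:
            chunk.append(''.join(line.strip().split()))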

Python list() vs append()

I'm trying to create a list of lists from a csv file.
Row 1 of CSV is a line describing the data source
Row 2 of CSV is the header
Row 3 of CSV is where the data starts
There are two ways I can go about it, but I don't know why they're different.
The first uses list(), and for some reason the result ignores rows 1 and 2 of the CSV.
data = []
with open(datafile, 'rb') as f:
    for line in f:
        data = list(csv.reader(f, delimiter=','))
return (name, data)
Whereas if I use .append(), I'd have to use .next() to skip row 2
data = []
with open(datafile, 'rb') as f:
    file = csv.reader(f, delimiter=',')
    next(file)
    for line in file:
        data.append(line)
return (name, data)
Why does list() ignore the header rows whereas append() doesn't?
Actually, this is not related to Python's list() or append(); it is caused by the logic used in the first snippet.
The program is not skipping the header, it is replacing the whole list.
On the first pass through the loop, for line in f consumes a row, and then list(csv.reader(f, delimiter=',')) reads everything that remains and exhausts the file; each pass rebinds data to a brand-new list, overwriting whatever was there previously.
Correct code:
data = []
with open(datafile, 'rb') as f:
    next(f)  # skip the description row
    for line in f:
        data.append(line.strip().split(","))
return (name, data)
This appends each row as its own list, so data stays a list of lists; and there is no problem with the 2nd snippet.
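If the goal is simply a list of lists starting at row 3, a short Python 3 sketch (text mode, reusing datafile from the question):
import csv

with open(datafile, newline='') as f:
    reader = csv.reader(f)
    next(reader)          # row 1: data source description
    next(reader)          # row 2: header
    data = list(reader)   # rows 3 onward, as a list of lists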

Skip lines with strange characters when I read a file

I am trying to read some data files ('.txt'), and some of them contain strange random characters and even extra columns in random rows, like in the following example, where the second row is an example of a valid row:
CTD 10/07/30 05:17:14.41 CTD 24.7813, 0.15752, 1.168, 0.7954, 1497.¸ 23.4848, 0.63042, 1.047, 3.5468, 1496.542
CTD 10/07/30 05:17:14.47 CTD 23.4846, 0.62156, 1.063, 3.4935, 1496.482
I read the description of np.loadtxt and I have not found a solution for my problem. Is there a systematic way to skip rows like these?
The code that I use to read the files is:
import numpy as np
from io import StringIO

# Function to read a datafile
def Read(filename):
    # Change delimiters to spaces
    s = open(filename).read().replace(':', ' ')
    s = s.replace(',', ' ')
    s = s.replace('/', ' ')
    # Take the columns that we need
    data = np.loadtxt(StringIO(s), usecols=(4, 5, 6, 8, 9, 10, 11, 12))
    return data
This works without using csv like the other answer; it just reads line by line, checking whether each line is ASCII.
data = []

def isascii(s):
    return len(s) == len(s.encode())

with open("test.txt", "r") as fil:
    for line in fil:
        res = map(isascii, line)
        if all(res):
            data.append(line)
print(data)
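As a side note, on Python 3.7+ the built-in str.isascii() does the same check without the helper:
# Python 3.7+: str.isascii() replaces the manual encode() comparison
with open("test.txt") as fil:
    data = [line for line in fil if line.isascii()]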
You could use the csv module to read the file one line at a time and apply your desired filter.
import csv

def isascii(s):
    return len(s) == len(s.encode())

expected_length = 5  # assumption: number of comma-separated fields in a valid row

with open('file.csv') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        if len(row) == expected_length and all(isascii(x) for x in row):
            pass  # write row onto numpy array
I got the ascii check from this thread:
How to check if a string in Python is in ASCII?
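To tie the filtering back to the original Read() function: np.loadtxt accepts any iterable of lines, so a generator can drop bad rows before parsing. A minimal sketch, assuming 13 whitespace-separated fields per valid row after the delimiter replacement ('data.txt' is a stand-in for one of the measurement files):
import numpy as np

def clean_lines(filename, n_cols=13):
    # yield only ASCII lines with the expected number of fields
    with open(filename) as f:
        for line in f:
            line = line.replace(':', ' ').replace(',', ' ').replace('/', ' ')
            if line.isascii() and len(line.split()) == n_cols:
                yield line

data = np.loadtxt(clean_lines('data.txt'), usecols=(4, 5, 6, 8, 9, 10, 11, 12))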

I read a line on a csv file and want to know the item number of a word

The header line in my csv file is:
Number,Name,Type,Manufacturer,Material,Process,Thickness (mil),Weight (oz),Dk,Orientation,Pullback distance (mil),Description
I can open it and read the line, with no problems:
infile = open('CS_Data/_AD_LayersTest.csv','r')
csv_reader = csv.reader(infile, delimiter=',')
for row in csv_reader:
But I want to find out what the item number is for the "Dk".
The problem is that not only can the items be in any order, as decided by the user in a different application; there can also be up to 25 items in the line.
How do I quickly determine which item is "Dk", so I can write Dk = row[i] and extract it for all the data after the header?
I have tried this below on each of the potential 25 items and it works, but it seems like a waste of time and energy, and it offends my OCD.
while True:
    try:
        if row[0] == "Dk":
            DkColumn = 0
            break
        elif row[1] == "Dk":
            DkColumn = 1
            break
        ...
        elif row[24] == "Dk":
            DkColumn = 24
            break
        else:
            f.write('Stackup needs a "Dk" column.')
            break
    except:
        print("Exception occurred")
        break
Can't you get the index of the column (using list.index()) that has the value Dk in it? Something like:
import csv

infile = open('CS_Data/_AD_LayersTest.csv', 'r')
csv_reader = csv.reader(infile, delimiter=',')

# Store the header
headers = next(csv_reader, None)

# Get the index of the 'Dk' column
dkColumnIndex = headers.index('Dk')

for row in csv_reader:
    # Access values that belong to the 'Dk' column
    rowDkValue = row[dkColumnIndex]
    print(rowDkValue)
In the code above, we store the first line of the CSV as a list in headers. We then search the list to find the index of the item whose value is 'Dk'. That is the column index.
Once we have that column index, we can use it on each row to access the value in the column that Dk is the header of.
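If the 'Dk' column might be absent, list.index() raises ValueError, so a small guard (a sketch reusing the names above) preserves the original error message:
import sys

try:
    dkColumnIndex = headers.index('Dk')
except ValueError:
    sys.exit('Stackup needs a "Dk" column.')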
Use the pandas library to keep your column order and get access to each column by name, typing:
row["column_name"]
import pandas as pd

df = pd.read_csv(
    'CS_Data/_AD_LayersTest.csv',
    usecols=["Number", "Name", "Type", ...])

for index, row in df.iterrows():
    # do something
If I understand your question correctly, and you're not interested in using pandas (as suggested by Mikey; you should really consider his suggestion, however), you should be able to do something like the following:
import csv

with open('CS_Data/_AD_LayersTest.csv', 'r') as infile:
    csv_reader = csv.reader(infile, delimiter=',')
    header = next(csv_reader)
    col_map = {col_name: idx for idx, col_name in enumerate(header)}
    for row in csv_reader:
        row_dk = row[col_map['Dk']]
One solution would be to use pandas.
import pandas as pd

df = pd.read_csv('CS_Data/_AD_LayersTest.csv')
Now you can access 'Dk' easily, as long as the file is read correctly:
dk = df['Dk']
and you can access individual values of dk like:
for i in range(0, 10):
    temp_var = df.loc[i, 'Dk']
or however you want to access those indexes.
