Converting a csv file containing pixel values to its equivalent images - python-3.x

This is my first time working with such a dataset.
I have a .csv file containing pixel values (48x48 = 2304 columns) of images, with their labels in the first column and the pixels in the subsequent ones, as below:
A glimpse of the dataset
I want to convert these pixels into their images, and store them into different directories corresponding to their respective labels. Now I have tried the solution posted here but it doesn't seem to work for me.
Here's what I've tried to do:
labels = ['Fear', 'Happy', 'Sad']
with open('dataset.csv') as csv_file:
    csv_reader = csv.reader(csv_file)
    fear = 0
    happy = 0
    sad = 0
    # skip headers
    next(csv_reader)
    for row in csv_reader:
        pixels = row[1:] # without label
        pixels = np.array(pixels, dtype='uint8')
        pixels = pixels.reshape((48, 48))
        image = Image.fromarray(pixels)
        if csv_file['emotion'][row] == 'Fear':
            image.save('C:\\Users\\name\\data\\fear\\im'+str(fear)+'.jpg')
            fear += 1
        elif csv_file['emotion'][row] == 'Happy':
            image.save('C:\\Users\\name\\data\\happy\\im'+str(happy)+'.jpg')
            happy += 1
        elif csv_file['emotion'][row] == 'Sad':
            image.save('C:\\Users\\name\\data\\sad\\im'+str(sad)+'.jpg')
            sad += 1
However, upon running the above block of code, the following is the error message I get:
Traceback (most recent call last):
  File "<ipython-input-11-aa928099f061>", line 18, in <module>
    if csv_file['emotion'][row] == 'Fear':
TypeError: '_io.TextIOWrapper' object is not subscriptable
I referred to a bunch of posts that solved the above error (like this one), but I found that the people were trying their hand at a relatively different problem than mine, and others I couldn't understand.
This may well be a very trivial question, but as I mentioned earlier, this is my first time working with such a dataset. Kindly tell me what I am doing wrong and how I can fix my code.

Try -
if str(row[0]) == 'Fear':
And in a similar way for the other conditions:
elif str(row[0]) == 'Happy':
elif str(row[0]) == 'Sad':
(It is good practice to store the first value of the row, i.e. the label, in a variable and compare against that.)

The first problem that arose was that the first row was just the column names. In order to take care of this, I used the skiprows parameter like so:
raw = pd.read_csv('dataset.csv', skiprows = 1)
Secondly, since the labels were in the first column, I moved them to the last column for my own convenience.
Thirdly, after all the preparations were done, iterating over the dataset did not yield whole rows; it only took the value in the first row and first column, which caused an error when reshaping. So I used df.itertuples() instead, like so:
for row in data.itertuples(index = False, name = 'Pandas'):
Lastly, thanks to @HadarM's suggestions, I was able to get it to work.
Here is the modified version of the problematic block from above:
for row in data.itertuples(index = False, name = 'Pandas'):
    pixels = row[:-1] # without label
    pixels = np.array(pixels, dtype='uint8')
    pixels = pixels.reshape((48, 48))
    image = Image.fromarray(pixels)
    if str(row[-1]) == 'Fear':
        image.save('C:\\Users\\name\\data\\fear\\im'+str(fear)+'.jpg')
        fear += 1
    elif str(row[-1]) == 'Happy':
        image.save('C:\\Users\\name\\data\\happy\\im'+str(happy)+'.jpg')
        happy += 1
    elif str(row[-1]) == 'Sad':
        image.save('C:\\Users\\name\\data\\sad\\im'+str(sad)+'.jpg')
        sad += 1
print('done')
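For anyone who wants to avoid one counter and one branch per label, here is a minimal sketch of the bookkeeping part only. It assumes the layout from the question (label in the first column, then 2304 pixel values) and uses only the standard library. The function name plan_images and the base_dir default are made up for illustration, and it only plans the target paths and pixel grids; it does not write any image.

```python
import csv
import io
from collections import Counter
from pathlib import PurePosixPath

SIDE = 48  # image width and height stated in the question


def plan_images(csv_text, base_dir="data"):
    """Yield (target_path, grid) per data row; grid is a SIDE x SIDE list of ints.

    plan_images and base_dir are hypothetical names used for this sketch.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    counters = Counter()  # one running index per label instead of fear/happy/sad
    for row in reader:
        label = row[0]  # the label sits in the first column
        values = [int(v) for v in row[1:]]
        if len(values) != SIDE * SIDE:
            raise ValueError(f"expected {SIDE * SIDE} pixels, got {len(values)}")
        # Cut the flat pixel list into SIDE rows of SIDE values each.
        grid = [values[i:i + SIDE] for i in range(0, len(values), SIDE)]
        path = PurePosixPath(base_dir) / label.lower() / f"im{counters[label]}.jpg"
        counters[label] += 1
        yield str(path), grid
```

With NumPy and Pillow installed, each grid could then be saved along the lines of Image.fromarray(np.array(grid, dtype='uint8')).save(path), after creating the target directory with os.makedirs(..., exist_ok=True).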

Related

Iterating through an SRT file until index is found

This might sound like an "Iterate through file until condition is met" question (which I have already checked), but that approach doesn't work for me.
Given an SRT file (any) as srtDir, I want to go to a chosen index and get its timecode values and caption values.
I did the following, which is supposed to iterate through the SRT file until the condition is met:
import os
srtDir = "./media/srt/001.srt"
index = 100 #Index. Number is an example
found = False
with open(srtDir, "r") as SRT:
    print(srtDir)
    content = SRT.readlines()
    content = [x.strip() for x in content]
    for x in content:
        print(x)
        if x == index:
            print("Found")
            found = True
            break
if not found:
    print("Nothing was found")
As said, it is supposed to iterate until the index is found, but it prints "Nothing was found", which is weird, because I can see the number printed on screen.
What did I do wrong?
(I have checked the available libraries; AFAIK, none of them can return the timecode and captions for a given index.)
You have a type mismatch in your code: index is an int but x in your loop is a str. In Python, 100 == "100" evaluates to False. The solution to this kind of bug is to adopt a well-defined data model and write library methods that apply it consistently.
However, with something like this, it's best not to reinvent the wheel and let other people do the boring work for you.
import srt
# Sample SRT file
raw = '''\
1
00:31:37,894 --> 00:31:39,928
OK, look, I think I have a plan here.

2
00:31:39,931 --> 00:31:41,931
Using mainly spoons,

3
00:31:41,933 --> 00:31:43,435
we dig a tunnel under the city and release it into the wild.
'''
# Parse and get index
subs = list(srt.parse(raw))
def get_index(n, subs_list):
    for i in subs_list:
        if i.index == n:
            return i
    return None
s = get_index(2, subs)
print(s)
See:
https://github.com/cdown/srt
https://srt.readthedocs.io/en/latest/quickstart.html
https://srt.readthedocs.io/en/latest/api.html
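If installing a library is not an option, the type mismatch described above can also be fixed by hand. This is only a rough sketch under the assumption that the file is well formed (cue number on its own line, timecode line right after it, blank line between cues); find_cue is a name I made up, and a real parser like srt handles many more edge cases.

```python
def find_cue(lines, index):
    """Return (timecode, caption_lines) for the cue numbered `index`, or None.

    find_cue is a hypothetical helper written for this sketch.
    """
    wanted = str(index)  # lines read from a file are str, so compare str with str
    stripped = [line.strip() for line in lines]
    for pos, text in enumerate(stripped):
        # A cue starts with its number, followed by a line containing "-->".
        if text == wanted and pos + 1 < len(stripped) and "-->" in stripped[pos + 1]:
            caption = []
            for following in stripped[pos + 2:]:
                if not following:  # a blank line ends the cue
                    break
                caption.append(following)
            return stripped[pos + 1], caption
    return None
```

Checking that the next line contains "-->" is there so that a caption consisting only of a number is not mistaken for a cue index.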

Please help me to fix the "list index out of range" error

I wrote a program to calculate the ratio of minors (under 20 years of age) in the population of each prefecture of Japan, and it keeps producing this error: list index out of range, at line 19: ratio =(agerange[1]+agerange[2]+agerange[3]+agerange[4])/population*100.0
Link to csv: https://drive.google.com/open?id=1uPSMpgHw0csRx1UgAJzRLit9p6NrztFY
f=open("population.csv","r")
header=f.readline()
header=header.rstrip("\r\n")
while True:
    line=f.readline()
    if line=="":
        break
    line=line.rstrip("\r\n")
    field=line.split(sep=",")
    population=0
    ratio=0
    agerange=[ "pref" ]
    for age in range(1, len(field)):
        agerange.append(int(field[age]))
        population+=int(field[age])
    ratio =(agerange[1]+agerange[2]+agerange[3]+agerange[4])/population*100.0
    print(field[0],ratio)
On line 17, I assume you meant to write the following:
ratio =(agerange[0]+agerange[1]+agerange[2]+agerange[3])/population*100.0
Next time, please describe your error in more detail.
What you could do instead is get the sums of populations in the required age ranges and then perform the ratio calculation.
In Python, you can use the map function to convert the values in an iterable to ints, and make that into a list.
Once you have the list, you can use the sum function on it, or a part of it.
So, I came up with:
f = open("population.csv","r")
header = f.readline()
header = header.rstrip("\r\n")
while True:
    line = f.readline()
    if line == "":
        break
    line = line.rstrip("\r\n")
    field = line.split(sep=",")
    popData = list(map(int, field[1:]))
    youngPop = sum(popData[:4])
    oldPop = sum(popData[4:])
    ratio = youngPop / (youngPop + oldPop)
    print(field[0].ljust(12), ratio)
f.close()
Which outputs (just showing a portion here):
Hokkaido 0.1544532130777903
Aomori 0.1564945226917058
Iwate 0.16108452950558214
Miyagi 0.16831683168316833
Akita 0.14357429718875503
Yamagata 0.16515426497277677
Fukushima 0.16586921850079744
(I don't really know Python, so there could be some "better" or more conventional way.)
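For comparison, here is a sketch of the same calculation with the csv module, written so that blank or short lines cannot trigger an index error. It assumes, as the code above does, that the first four numeric columns are the under-20 age ranges; the function name minor_ratios and the young_columns parameter are invented for the example.

```python
import csv
import io


def minor_ratios(csv_text, young_columns=4):
    """Yield (prefecture, percentage of minors) for each usable data row.

    minor_ratios and young_columns are hypothetical names for this sketch.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    for row in reader:
        # Blank lines come back as [] and short rows lack age columns: skip both.
        if len(row) <= young_columns:
            continue
        counts = [int(v) for v in row[1:]]
        total = sum(counts)
        if total == 0:  # avoid dividing by zero on an all-zero row
            continue
        yield row[0], 100.0 * sum(counts[:young_columns]) / total
```

To run it on the real file, the text could be read with open("population.csv").read() and passed in.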

Analysis of Eye-Tracking data in python (Eye-link)

I have data from eye-tracking (.edf file - from Eyelink by SR-research). I want to analyse it and get various measures such as fixation, saccade, duration, etc.
Is there an existing package to analyse Eye-Tracking data?
Thanks!
At least for importing the .edf-file into a pandas DF, you can use the following package by Niklas Wilming: https://github.com/nwilming/pyedfread/tree/master/pyedfread
This should already take care of saccades and fixations - have a look at the readme. Once they're in the data frame, you can apply whatever analysis you want to it.
pyeparse appears to be another library (though it currently seems to be unmaintained) that can be used for eyelink data analysis.
Here is a short excerpt from their example:
import numpy as np
import matplotlib.pyplot as plt
import pyeparse as pp
fname = '../pyeparse/tests/data/test_raw.edf'
raw = pp.read_raw(fname)
# visualize initial calibration
raw.plot_calibration(title='5-Point Calibration')
# create heatmap
raw.plot_heatmap(start=3., stop=60.)
EDIT: After I posted my answer I found a nice list compiling lots of potential tools for eyelink edf data analysis: https://github.com/davebraze/FDBeye/wiki/Researcher-Contributed-Eye-Tracking-Tools
Hey, the question seems rather old, but maybe I can revive it, because I am currently facing the same situation.
To start, I recommend converting your .edf to an .asc file. That makes it easier to read and to get a first impression.
Many tools exist for this, but I used the SR-Research Eyelink Developers Kit (here).
I don't know your setup, but the Eyelink 1000 itself detects saccades and fixations. In my case, the .asc file looks like this:
SFIX L 10350642
10350642 864.3 542.7 2317.0
...
...
10350962 863.2 540.4 2354.0
EFIX L 10350642 10350962 322 863.1 541.2 2339
SSACC L 10350964
10350964 863.4 539.8 2359.0
...
...
10351004 683.4 511.2 2363.0
ESACC L 10350964 10351004 42 863.4 539.8 683.4 511.2 5.79 221
The first number corresponds to the timestamp, the second and third to x-y coordinates and the last is your pupil diameter (what the last numbers after ESACC are, I don't know).
SFIX -> start fixation
EFIX -> end fixation
SSACC -> start saccade
ESACC -> end saccade
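To make the end-of-event lines easier to work with, here is a small sketch that turns an EFIX or ESACC line into a dictionary. The first fields follow the sample above (eye, start, end, duration, coordinates). Naming the last two ESACC numbers amplitude and peak velocity is my reading of the EyeLink ASC layout and is an assumption that should be checked against the SR Research documentation. parse_event_line is a made-up name.

```python
def parse_event_line(line):
    """Parse one EFIX or ESACC line into a dict; return None for any other line.

    parse_event_line is a hypothetical helper written for this sketch.
    """
    parts = line.split()
    if not parts:
        return None
    if parts[0] == "EFIX" and len(parts) >= 8:
        return {
            "event": "fixation",
            "eye": parts[1],
            "start": int(parts[2]),
            "end": int(parts[3]),
            "duration": int(parts[4]),
            "x": float(parts[5]),
            "y": float(parts[6]),
            "pupil": float(parts[7]),
        }
    if parts[0] == "ESACC" and len(parts) >= 11:
        return {
            "event": "saccade",
            "eye": parts[1],
            "start": int(parts[2]),
            "end": int(parts[3]),
            "duration": int(parts[4]),
            "x_start": float(parts[5]),
            "y_start": float(parts[6]),
            "x_end": float(parts[7]),
            "y_end": float(parts[8]),
            # Assumption: these two are amplitude and peak velocity.
            "amplitude": float(parts[9]),
            "peak_velocity": float(parts[10]),
        }
    return None
```

The sketch does not handle missing values, so lines where a field is not a plain number would raise a ValueError.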
You can also check out PyGaze, I haven't worked with it, but searching for a toolbox, this one always popped up.
EDIT
I found this toolbox here. It looks cool and works fine with the example data, but sadly it does not work with mine.
EDIT No 2
Revisiting this question after working on my own eye-tracking data, I thought I might share a function I wrote to work with my data:
import csv
import numpy as np
import pandas as pd

def eyedata2pandasframe(directory):
    '''
    This function takes a path (the parameter is named directory, but it is opened as a single file) and tries to read in an ASCII file containing eyetracking data
    It returns
    eye_data: A pandas dataframe containing data from fixations AND saccades
    fix_data: A pandas dataframe containing only data from fixations
    sac_data: pandas dataframe containing only data from saccades
    fix_timestamps: pandas dataframe containing information about fixation onsets and offsets
    sac_timestamps: pandas dataframe containing information about saccade onsets and offsets
    blink_timestamps: pandas dataframe containing information about blink onsets and offsets
    trials: numpy array containing information about trial onsets
    sample_rate: the sampling rate read from the 'RATE' line
    '''
    eye_data = []
    fix_data = []
    sac_data = []
    data_header = {0: 'TimeStamp', 1: 'X_Coord', 2: 'Y_Coord', 3: 'Diameter'}
    event_header = {0: 'Start', 1: 'End'}
    start_reading = False
    in_blink = False
    in_saccade = False
    fix_timestamps = []
    sac_timestamps = []
    blink_timestamps = []
    trials = []
    sample_rate_info = []
    sample_rate = 0
    # read the file and store the data, depending on the messages
    # we have the following structure:
    # a header -- every line starts with a '**'
    # a bunch of messages containing information about calibration/validation and so on all starting with 'MSG'
    # followed by:
    # START 10350638 LEFT SAMPLES EVENTS
    # PRESCALER 1
    # VPRESCALER 1
    # PUPIL AREA
    # EVENTS GAZE LEFT RATE 500.00 TRACKING CR FILTER 2
    # SAMPLES GAZE LEFT RATE 500.00 TRACKING CR FILTER 2
    # followed by the actual data:
    # normal data --> [TIMESTAMP]\t [X-Coords]\t [Y-Coords]\t [Diameter]
    # Start of EVENTS [BLINKS FIXATION SACCADES] --> S[EVENTNAME] [EYE] [TIMESTAMP]
    # End of EVENTS --> E[EVENT] [EYE] [TIMESTAMP_START]\t [TIMESTAMP_END]\t [TIME OF EVENT]\t [X-Coords start]\t [Y-Coords start]\t [X_Coords end]\t [Y-Coords end]\t [?]\t [?]
    # Trial messages --> MSG timestamp\t TRIAL [TRIALNUMBER]
    try:
        with open(directory) as f:
            csv_reader = csv.reader(f, delimiter='\t')
            for i, row in enumerate(csv_reader):
                if any('RATE' in item for item in row):
                    sample_rate_info = row
                if any('SYNCTIME' in item for item in row): # only start reading after this message
                    start_reading = True
                elif any('SFIX' in item for item in row):
                    pass
                    #fix_timestamps[0].append (row)
                elif any('EFIX' in item for item in row):
                    fix_timestamps.append([row[0].split(' ')[4], row[1]])
                    #fix_timestamps[1].append (row)
                elif any('SSACC' in item for item in row):
                    #sac_timestamps[0].append (row)
                    in_saccade = True
                elif any('ESACC' in item for item in row):
                    sac_timestamps.append([row[0].split(' ')[3], row[1]])
                    in_saccade = False
                elif any('SBLINK' in item for item in row): # stop reading here because the blinks contain NaN
                    # blink_timestamps[0].append (row)
                    in_blink = True
                elif any('EBLINK' in item for item in row): # start reading again. the blink ended
                    blink_timestamps.append([row[0].split(' ')[2], row[1]])
                    in_blink = False
                elif any('TRIAL' in item for item in row):
                    # the first element is 'MSG', we don't need it, then we split the second element to separate the timestamp and only keep it as an integer
                    trials.append(int(row[1].split(' ')[0]))
                elif start_reading and not in_blink:
                    eye_data.append(row)
                    if in_saccade:
                        sac_data.append(row)
                    else:
                        fix_data.append(row)
        # drop the last data point, because it is the 'END' message
        eye_data.pop(-1)
        sac_data.pop(-1)
        fix_data.pop(-1)
        # convert every item in list into a float, subtract the start of the first trial to set the start of the first video to t0=0
        # then divide by 1000 to convert from milliseconds to seconds
        for row in eye_data:
            for i, item in enumerate(row):
                row[i] = float(item)
        for row in fix_data:
            for i, item in enumerate(row):
                row[i] = float(item)
        for row in sac_data:
            for i, item in enumerate(row):
                row[i] = float(item)
        for row in fix_timestamps:
            for i, item in enumerate(row):
                row[i] = (float(item)-trials[0])/1000
        for row in sac_timestamps:
            for i, item in enumerate(row):
                row[i] = (float(item)-trials[0])/1000
        for row in blink_timestamps:
            for i, item in enumerate(row):
                row[i] = (float(item)-trials[0])/1000
        sample_rate = float(sample_rate_info[4])
        # convert into pandas DataFrames for a better overview
        eye_data = pd.DataFrame(eye_data)
        fix_data = pd.DataFrame(fix_data)
        sac_data = pd.DataFrame(sac_data)
        fix_timestamps = pd.DataFrame(fix_timestamps)
        sac_timestamps = pd.DataFrame(sac_timestamps)
        trials = np.array(trials)
        blink_timestamps = pd.DataFrame(blink_timestamps)
        # rename header for an even better overview
        eye_data = eye_data.rename(columns=data_header)
        fix_data = fix_data.rename(columns=data_header)
        sac_data = sac_data.rename(columns=data_header)
        fix_timestamps = fix_timestamps.rename(columns=event_header)
        sac_timestamps = sac_timestamps.rename(columns=event_header)
        blink_timestamps = blink_timestamps.rename(columns=event_header)
        # subtract the first timestamp of trials to set the start of the first video to t0=0
        eye_data.TimeStamp -= trials[0]
        fix_data.TimeStamp -= trials[0]
        sac_data.TimeStamp -= trials[0]
        trials -= trials[0]
        trials = trials / 1000 # does not work with trials/=1000
        # divide TimeStamp to get time in seconds
        eye_data.TimeStamp /= 1000
        fix_data.TimeStamp /= 1000
        sac_data.TimeStamp /= 1000
        return eye_data, fix_data, sac_data, fix_timestamps, sac_timestamps, blink_timestamps, trials, sample_rate
    except Exception:
        print('Could not read ' + str(directory) + ' properly!!! Returned empty data')
        return eye_data, fix_data, sac_data, fix_timestamps, sac_timestamps, blink_timestamps, trials, sample_rate
Hope it helps you guys. You may need to change some parts of the code, like the index at which the strings are split to get the crucial information about event onsets and offsets. Or you may not want to convert your timestamps into seconds, or to set the onset of your first trial to 0. That is up to you.
Additionally, in my data we sent a message to know when we started measuring ('SYNCTIME'), and I had only ONE condition in my experiment, so there is only one 'TRIAL' message.
Cheers

Bad output when building a histogram

I am having some issues building a histogram.
Here is my code:
distribution = dict()
count = 0
name = input("Enter file:")
handle = open(name)
for line in handle:
    line = line.rstrip()
    if not line.startswith("From "):
        continue
    count = count + 1
    firstSplit = line.split() # This gets me the line of text
    time = firstSplit[5] # This gets me time - ex: 09:11:38
    # print(firstSplit[5])
    timeSplit = time.split(':')
    hr = timeSplit[1] # This gets me hrs - ex: 09
    # Gets me the histogram
    if hr not in distribution:
        distribution[hr[1]] = 1
    else:
        distribution[hr[1]] = distribution[hr[1]] + 1
print(distribution)
# print(firstSplit[5])
I read the text in, and I split it to get the lines, done by firstSplit. This line of text includes a time stamp. I do a second split to get the time, done by timeSplit.
From here, I try to build the histogram by checking whether the hour is already in the dictionary: if it is, add one; if not, add the hour. But this is where it goes wrong. My output looks like:
Example of Output
Any advice or suggestions would be great!
Seán
You are using an incorrect method to check if the hour is a key in the histogram. Here is the correct way to check:
if not (hr in list(distribution.keys())):
Also, you should check whether a value is a key and then use that same value as the key that you create or add to. Therefore, the above becomes:
if not (hr[1] in list(distribution.keys())):
These two changes should fix your code and build you a great histogram!
Note: Code is untested
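As a side note, "key in dictionary" already tests the keys directly, so the list(...keys()) wrapper is optional; what matters is that the value you test and the key you write are the same. Here is a small sketch that counts with collections.Counter and uses one variable for both. It assumes the time field looks like HH:MM:SS, so the hour is element 0 of the split. hour_histogram is a made-up name, and the function takes any iterable of lines, for example an open file handle.

```python
from collections import Counter


def hour_histogram(lines):
    """Count how often each hour appears in lines that start with 'From '.

    hour_histogram is a hypothetical helper written for this sketch.
    """
    counts = Counter()
    for line in lines:
        if not line.startswith("From "):
            continue
        fields = line.split()
        if len(fields) < 6:  # no time field on this line
            continue
        # Assumption: the sixth field is HH:MM:SS, so the hour comes first.
        hour = fields[5].split(":")[0]
        counts[hour] += 1  # the same value is tested and stored
    return dict(counts)
```

Calling it as hour_histogram(open(name)) would mirror the loop in the question.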

Openpyxl: Manipulation of cell values

I'm trying to pull cell values from an Excel sheet, do math with them, and write the output to a new sheet. I keep getting a TypeError. I've run the code successfully before, but I just added this part, so the code has been distilled to the snippet below:
import openpyxl
#set up ws from file, and ws_out write to new file
def get_data():
    first = 0
    second = 0
    for x in range (1, 1000):
        if ws.cell(row=x, column=1).value == 'string':
            for y in range (1, 10): #Only need next ten rows after 'string'
                ws_out.cell(row=y, column=1).value = ws.cell(row=x+y, column=1).value
                second = first #displaces first -> second
                first = ws.cell(row=x+y, column=1).value/100 #new value for first
                difference = first - second
                ws_out.cell(row=x+y+1, column=1).value = difference #add to output
            break
Throws a TypeError message:
    first = ws.cell(row=x+y, column=1).value/100
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
I assume this is referring to the ws.cell value and 100, respectively, so I've also tried:
first = int(ws.cell(row=x, column=1))/100 #also tried with float
Which raises:
TypeError: int() argument must be a string or a number
I've confirmed that every cell in the column is made up of numbers only. Additionally, openpyxl's cell.data_type returns 'n' (presumably for number as far as I can tell by the documentation).
I've also tested simpler math and get the same error.
All of my searching seems to point to openpyxl normally behaving like this. Am I doing something wrong, or is this simply a limitation of the module? If so, are there any programmatic workarounds?
As a bonus, advice on writing code more succinctly would be much appreciated. I'm just beginning, and feel there must be a cleaner way to write an idea like this.
Python 3.3, openpyxl-1.6.2, Windows 7
Summary
cfi's answer helped me figure it out, although I used a slightly different workaround. On inspection of the originating file, there was one empty cell (which I had missed earlier). Since I will be re-using this code later on columns with more sporadic empty cells, I used:
if ws.cell(row=x+r, column=40).data_type == 'n':
    second = first #displaces first -> second
    first = ws.cell(row=x+y, column=1).value/100 #new value for first
    difference = first - second
    ws_out.cell(row=x+y+1, column=1).value = difference #add to output
Thus, if a specified cell was empty, it was ignored and skipped.
Are you 100% sure (=have verified) that all the cells you are accessing actually hold a value? (Edit: Do a print("dbg> cell value of {}, {} is {}".format(row, 1, ws.cell(row=row, column=1).value)) to verify content)
Instead of going through a fixed range(1,1000), I'd recommend using openpyxl's introspection methods to iterate over the existing rows. E.g.:
from openpyxl import load_workbook
wb=load_workbook(inputfile)
for ws in wb.worksheets:
    for row in ws.rows:
        for cell in row:
            value = cell.value
When getting the values do not forget to extract the .value attribute:
first = ws.cell(row=x+y, column=1).value/100 #new value for first
As a general note: x, and y are useful variable names for 2D coordinates. Don't use them both for rows. It will mislead others who have to read the code. Instead of x you could use start_row or row_offset or something similar. Instead of y you could just use row and you could let it start with the first index being the start_row+1.
Some example code (untested):
def get_data():
    first = 0
    second = 0
    # max_row is the attribute in current openpyxl; older releases such as 1.6.2 used get_highest_row()
    for start_row in range (1, ws.max_row + 1):
        if ws.cell(row=start_row, column=1).value == 'string':
            for row in range (start_row+1, start_row+10):
                ws_out.cell(row=start_row, column=1).value = ws.cell(row=row, column=1).value
                second = first
                first = ws.cell(row=row, column=1).value/100
                difference = first - second
                ws_out.cell(row=row+1, column=1).value = difference
            break
Now with this code I still don't understand what you are trying to achieve. Is the break indented correctly? If yes, the first time you are matching string, the outer loop will be quit by the break. Then, what is the point of the variables first and second?
Edit: Also ensure that you're reading from and writing into cell().value, not just cell().
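To show the "skip empty cells" idea apart from the spreadsheet handling, here is a sketch that works on a plain list of values, such as one collected from cell.value. Empty cells come back as None, which is what triggered the TypeError above. scaled_differences and its divisor parameter are names I made up for the example; it is not part of openpyxl.

```python
from numbers import Number


def scaled_differences(values, divisor=100):
    """Return differences between consecutive numeric values after scaling.

    scaled_differences is a hypothetical helper; non-numeric entries such as
    None (an empty cell) or strings are skipped instead of raising TypeError.
    """
    differences = []
    previous = 0  # mirrors the question, where `first` starts at 0
    for value in values:
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, Number):
            continue
        current = value / divisor
        differences.append(current - previous)
        previous = current
    return differences
```

A column could be fed in as [ws.cell(row=r, column=1).value for r in some_rows], with the results written to ws_out in a separate loop.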
