Formatting specific rows in a Jupyter Notebook dataframe output - python-3.x

I'm presenting a data frame in Jupyter Notebook. The initial data type of the data frame is float. I want to present rows 1 & 3 of the printed table as integers and rows 2 & 4 as percentages. How do I do that? (I've spent numerous hours looking for a solution with no success)
Here's the code I'm using:
#Creating the table
clms = sales.columns
indx = ['# of Poeple','% of Poeple','# Purchased per Activity','% Purchased per Activity']
basic_stats = pd.DataFrame(data=np.nan,index=indx,columns=clms)
basic_stats.head()
#Calculating the # of people who took part in each activity
for clm in sales.columns:
    basic_stats.iloc[0][clm] = int(round(sales[sales[clm]>0][clm].count(),0))
#Calculating the % of people who took part in each activity from the total email list
for clm in sales.columns:
    basic_stats.iloc[1][clm] = round((basic_stats.iloc[0][clm] / sales['Sales'].count())*100,2)
#Calculating the # of people who took part in each activity AND that bought the product
for clm in sales.columns:
    basic_stats.iloc[2][clm] = int(round(sales[(sales[clm] >0) & (sales['Sales']>0)][clm].count()))
#Calculating the % of people who took part in each activity AND that bought the product
for clm in sales.columns:
    basic_stats.iloc[3][clm] = round((basic_stats.iloc[2][clm] / basic_stats.iloc[0][clm])*100,2)
#Present the table
basic_stats
Here's the printed table:
Output table of 'basic_stats' data frame in Jupyter Notebook

Integer representation
You already assign integers to the cells of rows 1 and 3. The reason these integers are printed as floats is that all columns have the data type float64, which is caused by the way you initially create the DataFrame. You can view the data types by printing the .dtypes attribute:
basic_stats = pd.DataFrame(data=np.nan,index=indx,columns=clms)
print(basic_stats.dtypes)
# Prints:
# column1 float64
# column2 float64
# ...
# dtype: object
If you don't provide the data keyword argument in the DataFrame constructor, the data type of each column will be object, which can hold any Python object:
basic_stats = pd.DataFrame(index=indx,columns=clms)
print(basic_stats.dtypes)
# Prints:
# column1 object
# column2 object
# ...
# dtype: object
When the data type of a cell is object, the content is formatted using its own built-in methods, which leads to integers being formatted properly.
Percentage representation
In order to display percentages, you can use a custom class that prints a float number the way you want:
class PercentRepr(object):
    """Represents a floating point number as percent"""
    def __init__(self, float_value):
        self.value = float_value
    def __str__(self):
        return "{:.2f}%".format(self.value*100)
Then just use this class for the values of rows 2 and 4 (index positions 1 and 3):
#Creating the table
clms = sales.columns
indx = ['# of Poeple','% of Poeple','# Purchased per Activity','% Purchased per Activity']
basic_stats = pd.DataFrame(index=indx,columns=clms)
basic_stats.head()
#Calculating the # of people who took part in each activity
for clm in sales.columns:
    basic_stats.iloc[0][clm] = int(round(sales[sales[clm]>0][clm].count(),0))
#Calculating the % of people who took part in each activity from the total email list
for clm in sales.columns:
    basic_stats.iloc[1][clm] = PercentRepr(basic_stats.iloc[0][clm] / sales['Sales'].count())
#Calculating the # of people who took part in each activity AND that bought the product
for clm in sales.columns:
    basic_stats.iloc[2][clm] = int(round(sales[(sales[clm] >0) & (sales['Sales']>0)][clm].count()))
#Calculating the % of people who took part in each activity AND that bought the product
for clm in sales.columns:
    basic_stats.iloc[3][clm] = PercentRepr(basic_stats.iloc[2][clm] / basic_stats.iloc[0][clm])
#Present the table
basic_stats
Note: This actually changes the data in your dataframe! If you want to do further processing with the data of rows 2 and 4, be aware that these rows no longer contain float objects.
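If you would rather keep the underlying floats untouched, a display-only alternative is the pandas Styler, which formats cells when the table is rendered without changing the stored values. A minimal sketch, assuming the original all-float basic_stats frame and a pandas version whose Styler.format accepts a subset argument:
import pandas as pd

int_rows = ['# of Poeple', '# Purchased per Activity']
pct_rows = ['% of Poeple', '% Purchased per Activity']
styled = (basic_stats.style
          .format("{:.0f}", subset=pd.IndexSlice[int_rows, :])
          .format("{:.2f}%", subset=pd.IndexSlice[pct_rows, :]))
styled  # the last expression in a notebook cell renders the styled table
Since the percentage rows already hold values multiplied by 100, the "{:.2f}%" formatter only appends the sign.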

Here's one way, kind of a hack, but if it's simply for pretty printing, it'll work.
df = pd.DataFrame(np.random.random(20).reshape(4,5))
# first and third rows display as integers
df.loc[0,] = df.loc[0,]*100
df.loc[2,] = df.loc[2,]*100
df.loc[0,:] = df.loc[0,:].astype(int).astype(str)
df.loc[2,:] = df.loc[2,:].astype(int).astype(str)
# second and fourth rows display as percents (with 2 decimals)
df.loc[1,:] = np.round(df.loc[1,:].values.astype(float),4).astype(float)*100
df.loc[3,:] = np.round(df.loc[3,:].values.astype(float),4).astype(float)*100

Related

Pandas or Python method for removing unwanted string elements in a column, based on strings in another column

I have a problem similar to this question.
I am importing a large .csv file into pandas for a project. One column in the dataframe ultimately contains 4 columns of concatenated data (I can't control the data I receive): a brand name (which I want to remove), a product description, a product size and a UPC. Please note that the brand description in the Item_UPC does not always == Brand.
For example:
import pandas as pd
df = pd.DataFrame({'Item_UPC': ['fubar baz dr frm prob onc dly wmn ogc 30vcp 06580-66-832',
                                'xxx stuff coll tides 20 oz 09980-66-832',
                                'hel world sambucus elder 60 chw 0392-67-491',
                                'northern cold ultimate 180 sg 06580-66-832',
                                'ancient nuts boogs 16oz 58532-42-123 '],
                   'Brand': ['FUBAR OF BAZ',
                             'XXX STUFF',
                             'HELLO WORLD',
                             'NORTHERN COLDNITES',
                             'ANCIENT NUTS']})
I want to remove the brand name from the Item_UPC column as this is redundant information among other issues. Currently I have a function, that takes the new df and pulls out the UPC and cleans it up to match what one finds on bottles and another database I have for a single brand, minus the last check sum digit.
def clean_upc(df):
    #take in a dataframe, expand the number of columns into a temp
    #dataframe
    temp = df["Item_UPC"].str.rsplit(" ", n=1, expand = True)
    #add columns to main dataframe from temp
    df.insert(0, "UPC", temp[1])
    df.insert(1, "Item", temp[0])
    #drop original combined column
    df.drop(columns= ["Item_UPC"], inplace=True)
    #remove leading zero and hyphens in UPC
    df["UPC"] = df["UPC"].apply(lambda x : x[1:] if x.startswith("0") else x)
    df["UPC"] = df["UPC"].apply(lambda x : x.replace('-', ''))
    col_names = df.columns
    #make all columns lower case to ease searching
    for cols in col_names:
        df[cols] = df[cols].apply(lambda x: x.lower() if type(x) == str else x)
After running this I have a data frame with three columns:
UPC, Item, Brand
The data frame has over 300k rows and 2300 unique brands in it. There is also no consistent manner in which they shorten names. When I run the following code
temp = df["Item"].str.rsplit(" ", expand = True)
temp has a shape of
temp.shape
(329868, 13)
which makes manual curating a pain when most of columns 9-13 are empty.
Currently my logic is to first split Brand into two columns while dropping the first column in temp:
brand = df["brand"].str.rsplit(" ", n=1,expand = True) #produce a dataframe of two columns
temp.drop(columns= [0], inplace=True)
and then do a string replace on temp[1] to see if it contains the regex in brand[1] and replace it with " " (or vice versa), then concatenate temp back together:
temp["combined"] = temp[1] + temp[2]....+temp[13]
and replace the existing Item column with the combined column
df["Item"] = temp["combined"]
Or is there a better way all around? There are many brands that only have one name, which may make everything faster. I have been struggling with regex; logically it seems like this would be faster, but I have a hard time thinking of the syntax to make it work.
Because the input does not follow any well-defined rules, this looks like more of an optimization problem. You can start by stripping exact matches:
df["Item_cleaned"] = df.apply(lambda x: x.Item_UPC.lstrip(x.Brand.lower()), axis=1)
output:
Item_UPC Brand Item_cleaned
0 fubar baz dr frm prob onc dly wmn ogc 30vcp 06... FUBAR OF BAZ dr frm prob onc dly wmn ogc 30vcp 06580-66-832
1 xxx stuff coll tides 20 oz 09980-66-832 XXX STUFF coll tides 20 oz 09980-66-832
2 hel world sambucus elder 60 chw 0392-67-491 HELLO WORLD sambucus elder 60 chw 0392-67-491
3 northern cold ultimate 180 sg 06580-66-832 NORTHERN COLDNITES ultimate 180 sg 06580-66-832
4 ancient nuts boogs 16oz 58532-42-123 ANCIENT NUTS boogs 16oz 58532-42-123
This method will strip any exact matches and output to a new column Item_cleaned. (One caveat: str.lstrip removes a set of leading characters rather than a literal prefix, so it can over-strip when the word following the brand starts with a letter that also appears in the brand.) If your input is abbreviated, you should apply a more complex fuzzy string matching algorithm. This may be prohibitively slow, however. In that case, I would recommend a two-step method: save all rows that have been cleaned by the approach above, and do a second pass with more complicated cleaning as needed.
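If the brands are abbreviated rather than exact prefixes, a token-level fuzzy comparison may be a middle ground before reaching for heavier fuzzy-matching libraries. A rough sketch using only the standard library difflib; the strip_brand_tokens helper and the 0.75 cutoff are illustrative assumptions you would need to tune:
import difflib

def strip_brand_tokens(item, brand, cutoff=0.75):
    # drop leading tokens of the item description that fuzzily match any brand token
    brand_tokens = brand.lower().split()
    kept = []
    for tok in item.split():
        if not kept and difflib.get_close_matches(tok, brand_tokens, n=1, cutoff=cutoff):
            continue  # still inside the leading brand part, skip this token
        kept.append(tok)
    return " ".join(kept)

df["Item_cleaned"] = df.apply(lambda x: strip_brand_tokens(x.Item_UPC, x.Brand), axis=1)
This only strips from the front of the string, so a brand word that legitimately appears later in the description is left alone.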

How do I extract specific values from a DataFrame and add them to a list?

Sample DataFrame:
id date price
93 6021501535 2014-07-25 430000
93 6021501535 2014-12-23 700000
313 4139480200 2014-06-18 1384000
313 4139480200 2014-12-09 1400000
first_list = []
second_list = []
I need to add the first price that corresponds to a specific ID to the first list and the second price for that same ID to the second list.
Example:
first_list = [430000, 1384000]
second_list = [700000, 1400000]
After which, I'm going to plot the values from both lists on a lineplot to compare the difference in price between the first and second list.
I've tried doing this with groupby and loc and I kept running into errors. I then tried iterating over each row using a simple for loop but ran into more problems...
I would appreciate some help.
Based on your question I think it's not necessary to save them into a list because you could also store them somewhere else (e.g. another DataFrame) and plot them. The functions below should help with filling wherever you want to store your data.
def date(your_id):
    first_date = df.loc[(df['id']==your_id)].iloc[0,1]
    second_date = df.loc[(df['id']==your_id)].iloc[1,1]
    return first_date, second_date

def price(your_id):
    first_date, second_date = date(your_id)
    price_first_date = df.loc[(df['id']==your_id) & (df['date']==first_date)].iloc[0,2]
    price_second_date = df.loc[(df['id']==your_id) & (df['date']==second_date)].iloc[0,2]
    return price_first_date, price_second_date
price_first_date, price_second_date = price(6021501535)
If now for example you want to store your data in a new df you could do something like:
selected_ids = [6021501535, 4139480200]
new_df = pd.DataFrame(index=np.arange(1,len(selected_ids)+1), columns=['price_first_date', 'price_second_date'])
for i in range(len(selected_ids)):
    your_id = selected_ids[i]
    new_df.iloc[i, 0], new_df.iloc[i, 1] = price(your_id)
new_df then contains all 'first date prices' in the first column and all 'second date prices' in the second column. Plotting should work out.
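If you do end up wanting the two lists themselves, groupby can avoid the per-id lookups entirely. A minimal sketch assuming the frame from the question; sorting by date first keeps "first" and "second" meaningful, and GroupBy.nth(n) then picks the n-th row per id (the resulting lists are ordered by id):
df_sorted = df.sort_values(["id", "date"])
first_list = df_sorted.groupby("id")["price"].nth(0).tolist()
second_list = df_sorted.groupby("id")["price"].nth(1).tolist()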

Spotfire: How to increment variables to build scoring mechanism?

I'm trying to figure out how I could use variables in Spotfire (online version) to build a scoring mechanism and populate a calculated column with the final result.
I have a couple of values stored in columns that I would use to evaluate and attribute a score like this:
if column1<10 then segment1 = segment1 + 1
if column1>10 then segment2 = segment2+1
...ETC...
In the end each "segment" should have a score and I would like to simply display the name of the segment that has the highest score.
Ex:
Segment1 has a final value of 10
Segment2 has a final value of 22
Segment3 has a final value of 122
I would display Segment3 as value for the calculated column
Using only "IF" would lead me to a complicated IF structure so I'm more looking for something that looks more like a script.
Is there a way to achieve this with Spotfire?
Thanks
Laurent
To cycle through the data rows and calculate a running score, you can use an IronPython script. The script below is reading the numeric data from Col1 and Col2 of a data table named "Data Table". It calculates a score value for each row and writes it to a tab delimited text string. When done, it adds it to the Spotfire table using the Add Columns function. Note, the existing data needs to have a unique identifier. If not, the RowId() function can be used to create a calculated column for a unique row id.
from Spotfire.Dxp.Data import *
from System.IO import StringReader, StreamReader, StreamWriter, MemoryStream, SeekOrigin
from Spotfire.Dxp.Data.Import import *
from System import Array
def add_column(table, text, col_name):
    # read the text data into memory
    mem_stream = MemoryStream()
    writer = StreamWriter(mem_stream)
    writer.Write(text)
    writer.Flush()
    mem_stream.Seek(0, SeekOrigin.Begin)
    # define the structure of the text data
    settings = TextDataReaderSettings()
    settings.Separator = "\t"
    settings.SetDataType(0, DataType.Integer)
    settings.SetColumnName(0, 'ID')
    settings.SetDataType(1, DataType.Real)
    settings.SetColumnName(1, col_name)
    # create a data source from the in memory text data
    data = TextFileDataSource(mem_stream, settings)
    # define the relationship between the existing table (left) and the new data (right)
    leftColumnSignature = DataColumnSignature("Store ID", DataType.Integer)
    rightColumnSignature = DataColumnSignature("ID", DataType.Integer)
    columnMap = {leftColumnSignature:rightColumnSignature}
    ignoredColumns = []
    columnSettings = AddColumnsSettings(columnMap, JoinType.LeftOuterJoin, ignoredColumns)
    # now add the column(s)
    table.AddColumns(data, columnSettings)
#get the data table
table=Document.Data.Tables["Data Table"]
#place data cursor on a specific column
cursorCol1 = DataValueCursor.CreateFormatted(table.Columns["Col1"])
cursorCol2 = DataValueCursor.CreateFormatted(table.Columns["Col2"])
cursorColId = DataValueCursor.CreateFormatted(table.Columns["ID"])
cursorsList = Array[DataValueCursor]([cursorCol1, cursorCol2, cursorColId])
text = ""
rowsToInclude = IndexSet(table.RowCount,True)
#iterate through table column rows to retrieve the values
for row in table.GetRows(rowsToInclude, cursorsList):
    score = 0
    # get the current values from the cursors
    col1Val = cursorCol1.CurrentDataValue.ValidValue
    col2Val = cursorCol2.CurrentDataValue.ValidValue
    id = cursorColId.CurrentDataValue.ValidValue
    # now apply rules for scoring
    if col1Val <= 3:
        score -= 3
    elif col1Val > 3 and col2Val > 50:
        score += 10
    else:
        score += 5
    text += "%d\t%f\r\n" % (id, score)
add_column(table, text, 'Score_Result')
For an approach with no scripting, but also no accumulation, you can use calculated columns.
To get the scores, you can use a calculated column with case statements. For Segment 1, you might have:
case
when [Col1] > 100 then 10
when [Col1] < 100 and [Col2] > 600 then 20
end
Then, once you have the scores, you can create a calculated column, say [MaxSegment]. The expression for this will be Max([Segment1],[Segment2],[Segment3]...). Then display the value of [MaxSegment].
The max function in this case is acting as a row expression and is calculating the max value across the row of the columns given.
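To display the name of the winning segment rather than its score, one option is another case expression over the same calculated columns; the column names below follow the example above, and you would add one branch per segment:
case
  when [Segment1] = Max([Segment1],[Segment2],[Segment3]) then "Segment1"
  when [Segment2] = Max([Segment1],[Segment2],[Segment3]) then "Segment2"
  else "Segment3"
end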

Analysis of Eye-Tracking data in python (Eye-link)

I have data from eye-tracking (.edf file - from Eyelink by SR-research). I want to analyse it and get various measures such as fixation, saccade, duration, etc.
Is there an existing package to analyse Eye-Tracking data?
Thanks!
At least for importing the .edf-file into a pandas DF, you can use the following package by Niklas Wilming: https://github.com/nwilming/pyedfread/tree/master/pyedfread
This should already take care of saccades and fixations - have a look at the readme. Once they're in the data frame, you can apply whatever analysis you want to it.
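A minimal usage sketch (the function name and return values are taken from the pyedfread readme; double-check the exact call against the version you install):
from pyedfread import edf

# returns gaze samples, events (fixations/saccades) and messages as pandas DataFrames
samples, events, messages = edf.pread('my_recording.edf')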
pyeparse seems to be another (though apparently unmaintained) library that can be used for Eyelink data analysis.
Here is a short excerpt from their example:
import numpy as np
import matplotlib.pyplot as plt
import pyeparse as pp
fname = '../pyeparse/tests/data/test_raw.edf'
raw = pp.read_raw(fname)
# visualize initial calibration
raw.plot_calibration(title='5-Point Calibration')
# create heatmap
raw.plot_heatmap(start=3., stop=60.)
EDIT: After I posted my answer I found a nice list compiling lots of potential tools for eyelink edf data analysis: https://github.com/davebraze/FDBeye/wiki/Researcher-Contributed-Eye-Tracking-Tools
Hey, the question seems rather old, but maybe I can reactivate it because I am currently facing the same situation.
To start, I recommend converting your .edf to an .asc file. That way it is easier to read and get a first impression.
For this there are many tools, but I used the SR-Research Eyelink Developers Kit (here).
I don't know your setup, but the Eyelink 1000 itself detects saccades and fixations. In my case the .asc file looks like this:
SFIX L 10350642
10350642 864.3 542.7 2317.0
...
...
10350962 863.2 540.4 2354.0
EFIX L 10350642 10350962 322 863.1 541.2 2339
SSACC L 10350964
10350964 863.4 539.8 2359.0
...
...
10351004 683.4 511.2 2363.0
ESACC L 10350964 10351004 42 863.4 539.8 683.4 511.2 5.79 221
The first number corresponds to the timestamp, the second and third to x-y coordinates and the last is your pupil diameter (what the last numbers after ESACC are, I don't know).
SFIX -> start fixation
EFIX -> end fixation
SSACC -> start saccade
ESACC -> end saccade
You can also check out PyGaze, I haven't worked with it, but searching for a toolbox, this one always popped up.
EDIT
I found this toolbox here. It looks cool and works fine with the example data, but sadly it does not work with mine.
EDIT No 2
Revisiting this question after working on my own eye-tracking data, I thought I might share a function I wrote to work with my data:
# imports the function relies on
import csv
import numpy as np
import pandas as pd

def eyedata2pandasframe(directory):
    '''
    This function takes a directory from which it tries to read in ASCII files containing eyetracking data.
    It returns:
    eye_data: a pandas dataframe containing data from fixations AND saccades
    fix_data: a pandas dataframe containing only data from fixations
    sac_data: a pandas dataframe containing only data from saccades
    fixation: numpy array containing information about fixation onsets and offsets
    saccades: numpy array containing information about saccade onsets and offsets
    blinks: numpy array containing information about blink onsets and offsets
    trials: numpy array containing information about trial onsets
    '''
    eye_data = []
    fix_data = []
    sac_data = []
    data_header = {0: 'TimeStamp', 1: 'X_Coord', 2: 'Y_Coord', 3: 'Diameter'}
    event_header = {0: 'Start', 1: 'End'}
    start_reading = False
    in_blink = False
    in_saccade = False
    fix_timestamps = []
    sac_timestamps = []
    blink_timestamps = []
    trials = []
    sample_rate_info = []
    sample_rate = 0
    # read the file and store the data depending on the messages
    # we have the following structure:
    # a header -- every line starts with a '**'
    # a bunch of messages containing information about calibration/validation and so on, all starting with 'MSG'
    # followed by:
    # START 10350638 LEFT SAMPLES EVENTS
    # PRESCALER 1
    # VPRESCALER 1
    # PUPIL AREA
    # EVENTS GAZE LEFT RATE 500.00 TRACKING CR FILTER 2
    # SAMPLES GAZE LEFT RATE 500.00 TRACKING CR FILTER 2
    # followed by the actual data:
    # normal data --> [TIMESTAMP]\t [X-Coords]\t [Y-Coords]\t [Diameter]
    # Start of EVENTS [BLINKS FIXATION SACCADES] --> S[EVENTNAME] [EYE] [TIMESTAMP]
    # End of EVENTS --> E[EVENT] [EYE] [TIMESTAMP_START]\t [TIMESTAMP_END]\t [TIME OF EVENT]\t [X-Coords start]\t [Y-Coords start]\t [X_Coords end]\t [Y-Coords end]\t [?]\t [?]
    # Trial messages --> MSG timestamp\t TRIAL [TRIALNUMBER]
    try:
        with open(directory) as f:
            csv_reader = csv.reader(f, delimiter='\t')
            for i, row in enumerate(csv_reader):
                if any('RATE' in item for item in row):
                    sample_rate_info = row
                if any('SYNCTIME' in item for item in row):  # only start reading after this message
                    start_reading = True
                elif any('SFIX' in item for item in row):
                    pass
                    #fix_timestamps[0].append(row)
                elif any('EFIX' in item for item in row):
                    fix_timestamps.append([row[0].split(' ')[4], row[1]])
                    #fix_timestamps[1].append(row)
                elif any('SSACC' in item for item in row):
                    #sac_timestamps[0].append(row)
                    in_saccade = True
                elif any('ESACC' in item for item in row):
                    sac_timestamps.append([row[0].split(' ')[3], row[1]])
                    in_saccade = False
                elif any('SBLINK' in item for item in row):  # stop reading here because the blinks contain NaN
                    #blink_timestamps[0].append(row)
                    in_blink = True
                elif any('EBLINK' in item for item in row):  # start reading again, the blink ended
                    blink_timestamps.append([row[0].split(' ')[2], row[1]])
                    in_blink = False
                elif any('TRIAL' in item for item in row):
                    # the first element is 'MSG', we don't need it; then we split the second element to separate the timestamp and keep only it as an integer
                    trials.append(int(row[1].split(' ')[0]))
                elif start_reading and not in_blink:
                    eye_data.append(row)
                    if in_saccade:
                        sac_data.append(row)
                    else:
                        fix_data.append(row)
        # drop the last data point, because it is the 'END' message
        eye_data.pop(-1)
        sac_data.pop(-1)
        fix_data.pop(-1)
        # convert every item in list into a float, subtract the start of the first trial to set the start of the first video to t0=0
        # then divide by 1000 to convert from milliseconds to seconds
        for row in eye_data:
            for i, item in enumerate(row):
                row[i] = float(item)
        for row in fix_data:
            for i, item in enumerate(row):
                row[i] = float(item)
        for row in sac_data:
            for i, item in enumerate(row):
                row[i] = float(item)
        for row in fix_timestamps:
            for i, item in enumerate(row):
                row[i] = (float(item)-trials[0])/1000
        for row in sac_timestamps:
            for i, item in enumerate(row):
                row[i] = (float(item)-trials[0])/1000
        for row in blink_timestamps:
            for i, item in enumerate(row):
                row[i] = (float(item)-trials[0])/1000
        sample_rate = float(sample_rate_info[4])
        # convert into pandas data frames for a better overview
        eye_data = pd.DataFrame(eye_data)
        fix_data = pd.DataFrame(fix_data)
        sac_data = pd.DataFrame(sac_data)
        fix_timestamps = pd.DataFrame(fix_timestamps)
        sac_timestamps = pd.DataFrame(sac_timestamps)
        trials = np.array(trials)
        blink_timestamps = pd.DataFrame(blink_timestamps)
        # rename header for an even better overview
        eye_data = eye_data.rename(columns=data_header)
        fix_data = fix_data.rename(columns=data_header)
        sac_data = sac_data.rename(columns=data_header)
        fix_timestamps = fix_timestamps.rename(columns=event_header)
        sac_timestamps = sac_timestamps.rename(columns=event_header)
        blink_timestamps = blink_timestamps.rename(columns=event_header)
        # subtract the first timestamp of trials to set the start of the first video to t0=0
        eye_data.TimeStamp -= trials[0]
        fix_data.TimeStamp -= trials[0]
        sac_data.TimeStamp -= trials[0]
        trials -= trials[0]
        trials = trials / 1000  # does not work with trials /= 1000
        # divide TimeStamp to get time in seconds
        eye_data.TimeStamp /= 1000
        fix_data.TimeStamp /= 1000
        sac_data.TimeStamp /= 1000
        return eye_data, fix_data, sac_data, fix_timestamps, sac_timestamps, blink_timestamps, trials, sample_rate
    except:
        print('Could not read ' + str(directory) + ' properly!!! Returned empty data')
        return eye_data, fix_data, sac_data, fix_timestamps, sac_timestamps, blink_timestamps, trials, sample_rate
Hope it helps you guys. You may need to change some parts of the code, such as the index at which to split the strings to get the crucial information about event on/offsets. Or maybe you don't want to convert your timestamps into seconds, or don't want to set the onset of your first trial to 0. That is up to you.
Additionally, in my data we sent a message to mark when we started measuring ('SYNCTIME'), and I had only ONE condition in my experiment, so there is only one 'TRIAL' message.
Cheers

Read query to SQLite DB hangs while using Python3 multiprocessing

I have a test list of 10,000 IDs and this is what I have to do:
For every test ID, calculate rank by comparing with other IDs i.e. people from same company
Check if the rank for this test ID is above 'normal' by a) calculating ranks (same as step 1) of 1000 randomly selected IDs b) comparing these 1000 ranks with the rank of test ID
Do this (step 1 and 2) for 10,000 test IDs with data from 10 different months.
To store the master data of 14000 IDs and the observations for 10 months I am using SQLite, as it makes querying and ranking easier and faster.
To reduce the run time, I am using 'multiprocessing' and parallelizing the calculations over months, i.e. ranks for different months are calculated on different cores. This works well for a smaller number of test IDs (<=2000) or fewer random ranks (>=200), but if I calculate ranks for all 10 months in parallel with 1000 random ranks for each ID, the script freezes after a few hours. No error is given. I believe SQLite is the culprit and need your help to figure out the issue.
Here is my code:
nproc = 10 ## Number of cores
randNum = 1000 ## Number of random ranks for each ID

def main():
    '''
    This will go through every specified column one by one, and for each entry
    a rank of the entry will be computed and compared with the ranks of 1000 randomly selected entries from the same column
    '''
    ## Read master file with 14000 rows X 20 cols, each row pertains to an ID;
    ## first 9 columns have info related to the ID and last 10 have observed values from 10 diff. months
    resList = List with 14000 entries Eg. [(123,"ABC",.....),(234,"DEF",........)....14000n]
    ## Read test file, for which ranks are to be calculated. Contains 10,000 IDs and their names
    global testIDList ## for p-value calculation
    testIDList = List with 10,000 entries Eg. [(123,"ABC"),(234,"DEF")..10,000n]
    ## Create identifier SET - used for random selection of IDs
    global idSET ## Used in rankCalcTest
    idSET = SET OF ALL IDs FROM MASTER FILE
    global trackTableName,coordsDB,chrLimit ## Globals for all other functions
    ## Specify column numbers in master file that have values for each ID from different months
    trackList = [10,11,12,13,14,15,16,17,18,19,20] ## Columns in file with 14000 rows each.
    ### Parallel
    allTrackPvals = PPResults(rankCalcTest,trackList)
    DO SOME PROCESSING
    SCRIPT ENDS

def rankCalcTest(col):
    '''
    Calculates ranks for test IDs using the column/month specified by the 'main()' function
    '''
    DB = '%s_%s.db' % (coordsDB.split('.')[0],col) ## New DB for every column/month - because this function is parallelized, every core works on one column/month
    conn = sqlite3.connect(DB)
    trackPvals = [] ## Temporary list that will hold ranks for a single column/month
    tableCols = [col] ## Column with observed values from a month, used to generate column-specific ranks
    ## Make sqlite3 table for current track
    trackTableName = 'track_%s' % (col) ## Here a table is created containing all IDs and observations from the specific column
    trackTableName = tableMaker(trackTableName,annoDict,resList,tableCols,conn) ## This module is not included in the example, as it works well - uses SQLite
    chrLimit = chrLimits(trackTableName,conn) ## This module is not included in the example, as it works well - uses SQLite
    for ent in testIDList: ## Total 10,000 entries
        ## Generate relative rank for the ID of interest
        mainID = ent[0] ## ID number
        mainRank = rankGenerator(mainID,trackTableName,chrLimit,conn) ## See below for function
        randomIDs = randomSelect(idSET,randNum)
        randomRanks = []
        for randID in randomIDs:
            randomRank = rankGenerator(randID,trackTableName,chrLimit,conn)
            randomRanks.append(randomRank)
        ### Some calculation
        probRR = DO SOME CALCULATION
        trackPvals.append(round(probRR,5))
    conn.close()
    return trackPvals

def rankGenerator(ID,trackTableName,chrLimit,conn):
    '''
    Generate a rank for each ID provided by the 'rankCalcTest' function
    '''
    print('\nRank is being calculated for ID:%s' % (ID))
    IDCoord = aDict[ID] ## Get required info to construct the read query
    company = IDCoord[0]
    intervalIDs = [] ## List to hold all the IDs in an interval
    rank = 0 ## Initialize
    cur = conn.cursor()
    print('ID class 0')
    cur.execute("SELECT ID,hours FROM %s WHERE chr = '%s' AND start between %s and %s ORDER BY hours desc" % (trackTableName,company))
    intIDs = cur.fetchall()
    intervalIDs.extend(intIDs) ## There is one more query in certain cases, removed for brevity of code
    Rank = SOME CALCULATION
    print('Relative Rank for %s: %s' % (ID,str(weigRelativeRank)))
    return Rank

def PPResults(module,alist):
    npool = Pool(int(nproc))
    res = npool.map_async(module, alist)
    results = (res.get())
    return results
The script freezes in 'rankGenerator' function:
Rank is being calculated for ID:1423187_at
Rank is being calculated for ID:1452528_a_at
Coordinates found for:1423187_at - 8,111940709,111952915
Coordinates found for:1452528_a_at - 19,43612500,43614912
ID class 0
As the run was performed in parallel, it's hard to say at which line the script is freezing, but it seems like the query in 'rankGenerator' is the freezing point. Is it related to locks in SQLite?
Sorry for the large amount of code. It is actually a very trimmed version that took me 3 hrs to prepare. I hope to get some help.
AK
