Parsing heterogenous data from a text file in Python

I am trying to parse raw results from a text file into an organised tuple, but I am having trouble getting it right.
My raw data from the text file looks something like this:
Episode Cumulative Results
EpisodeXD0281119
Date collected21/10/2019
Time collected10:00
Real time PCR for M. tuberculosis (Xpert MTB/Rif Ultra):
PCR result Mycobacterium tuberculosis complex NOT detected
Bacterial Culture:
Bottle: Type FAN Aerobic Plus
Result No growth after 5 days
EpisodeST32423457
Date collected23/02/2019
Time collected09:00
Gram Stain:
Neutrophils Occasional
Gram positive bacilli Moderate (2+)
Gram negative bacilli Numerous (3+)
Gram negative cocci Moderate (2+)
EpisodeST23423457
Date collected23/02/2019
Time collected09:00
Bacterial Culture:
A heavy growth of
1) Klebsiella pneumoniae subsp pneumoniae (KLEPP)
ensure that this organism does not spread in the ward/unit.
A heavy growth of
2) Enterococcus species (ENCSP)
Antibiotic/Culture KLEPP ENCSP
Trimethoprim-sulfam R
Ampicillin / Amoxic R S
Amoxicillin-clavula R
Ciprofloxacin R
Cefuroxime (Parente R
Cefuroxime (Oral) R
Cefotaxime / Ceftri R
Ceftazidime R
Cefepime R
Gentamicin S
Piperacillin/tazoba R
Ertapenem R
Imipenem S
Meropenem R
S - Sensitive ; I - Intermediate ; R - Resistant ; SDD - Sensitive Dose Dependant
Comment for organism KLEPP:
** Please note: this is a carbapenem-RESISTANT organism. Although some
carbapenems may appear susceptible in vitro, these agents should NOT be used as
MONOTHERAPY in the treatment of this patient. **
Please isolate this patient and practice strict contact precautions. Please
inform Infection Prevention and Control as contact screening might be
indicated.
For further advice on the treatment of this isolate, please contact.
The currently available laboratory methods for performing colistin
susceptibility results are unreliable and may not predict clinical outcome.
Based on published data and clinical experience, colistin is a suitable
therapeutic alternative for carbapenem resistant Acinetobacter spp, as well as
carbapenem resistant Enterobacteriaceae. If colistin is clinically indicated,
please carefully assess clinical response.
EpisodeST234234057
Date collected23/02/2019
Time collected09:00
Authorised by xxxx on 27/02/2019 at 10:35
MIC by E-test:
Organism Klebsiella pneumoniae (KLEPN)
Antibiotic Meropenem
MIC corrected 4 ug/mL
MIC interpretation Resistant
Antibiotic Imipenem
MIC corrected 1 ug/mL
MIC interpretation Sensitive
Antibiotic Ertapenem
MIC corrected 2 ug/mL
MIC interpretation Resistant
EpisodeST23423493
Date collected18/02/2019
Time collected03:15
Potassium 4.4 mmol/L 3.5 - 5.1
EpisodeST45445293
Date collected18/02/2019
Time collected03:15
Creatinine 32 L umol/L 49 - 90
eGFR (MDRD formula) >60 mL/min/1.73 m2
Creatinine 28 L umol/L 49 - 90
eGFR (MDRD formula) >60 mL/min/1.73 m2
Essentially the pattern is that ALL information starts with a unique EPISODE NUMBER and follows with a DATE and TIME and then the result of whatever test. This is the pattern throughout.
What I am trying to parse into my tuple is the date, time, name of the test and the result - whatever it might be. I have the following code:
from collections import namedtuple
with open(filename) as f:
    data = f.read()
data = data.splitlines()
DS = namedtuple('DS', 'date time name value')
parsed = list()
# indices of every "Date collected..." line; each pair delimits one chunk
idx_date = [i for i, r in enumerate(data) if r.strip().startswith('Date')]
for start, stop in zip(idx_date[:-1], idx_date[1:]):
    chunk = data[start:stop]
    date = time = name = value = None
    for row in chunk:
        if not row: continue
        row = row.strip()
        if row.startswith('Episode'): continue
        if row.startswith('Date'):
            _, date = row.split()
            date = date.replace('collected', '')
        elif row.startswith('Time'):
            _, time = row.split()
            time = time.replace('collected', '')
        else:
            name, value, *_ = row.split()
            print (name)
            parsed.append(DS(date, time, name, value))
print(parsed)
My error is that I am unable to find a way to parse the heterogeneity of the test RESULT in a way that I can use later, for example for the tuple DS ('DS', 'date time name value'):
DATE = 21/10/2019
TIME = 10:00
NAME = Real time PCR for M tuberculosis or Potassium
RESULT = Negative or 4.7
Any advice appreciated. I have hit a brick wall.
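One possible way to cope with the mixed result formats is sketched below. It rests on assumptions that the post does not confirm: a line ending in ":" is treated as the name of a test and the lines that follow it (up to the next heading or episode) as its free-text result, while a line outside any heading is treated as a one-line numeric result. The function name parse_report and the NUMERIC pattern are hypothetical, and the pattern would need tuning against the real file.

```python
import re
from collections import namedtuple

DS = namedtuple("DS", "date time name value")

# Hypothetical pattern for one-line numeric results such as
# "Potassium 4.4 mmol/L 3.5 - 5.1" (name, then the first number).
NUMERIC = re.compile(r"^(?P<name>.+?)\s+(?P<value>[<>]?\d+(?:\.\d+)?)(?:\s|$)")


def parse_report(text):
    records = []
    date = time = None
    heading, body = None, []

    def flush():
        # Emit the section collected so far, if any, then reset it.
        nonlocal heading, body
        if heading is not None:
            records.append(DS(date, time, heading, " ".join(body)))
        heading, body = None, []

    for raw in text.splitlines():
        row = raw.strip()
        if not row:
            continue
        if row.startswith("Episode"):
            flush()                      # close the previous episode's section
            date = time = None
        elif row.startswith("Date collected"):
            date = row[len("Date collected"):]
        elif row.startswith("Time collected"):
            time = row[len("Time collected"):]
        elif row.endswith(":"):          # assumption: a heading names a test
            flush()
            heading = row[:-1]
        elif heading is not None:        # free-text result under a heading
            body.append(row)
        else:                            # assumption: stand-alone numeric result
            match = NUMERIC.match(row)
            if match:
                records.append(DS(date, time, match["name"], match["value"]))
    flush()
    return records


sample = """Episode Cumulative Results
EpisodeXD0281119
Date collected21/10/2019
Time collected10:00
Real time PCR for M. tuberculosis (Xpert MTB/Rif Ultra):
PCR result Mycobacterium tuberculosis complex NOT detected
EpisodeST23423493
Date collected18/02/2019
Time collected03:15
Potassium 4.4 mmol/L 3.5 - 5.1
"""

records = parse_report(sample)
for record in records:
    print(record)
```

Keeping the whole free-text result as one string leaves the decision about how to interpret it (for example mapping "NOT detected" to "Negative") to a later step, which may be easier than trying to split every row into exactly two fields.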

Related

Avoiding cartesian when adding unique classifier to a list in python 3

I have 5 .csv files I am importing and all contain emails:
Donors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Donors Q1 2021 R12.csv",
                     usecols=["Email Address"])
Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Activists Q1 2021 R12.csv",
                        usecols=["Email"])
Low_Level_Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Low Level Activists Q1 2021 R12.csv",
                                  usecols=["Email"])
Ambassadors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Ambassadors Q1 2021.csv",
                          usecols=["Email Address"])
Volunteers = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Volunteers Q1 2021 R12.csv",
                         usecols=["Email Address"])
Followers = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Followers Q1 2021 R12.csv",
                        usecols=["Email"])
While I am only importing emails (annoyingly with two different naming conventions because of the systems they originate from), I am adding the import name as a classifier, i.e. Donors, Volunteers, etc.
Donors['Value'] = "Donors"
Activists['Value'] = "Activists"
Low_Level_Activists['Value'] = "Low_Level_Activists"
Ambassadors['Value'] = "Ambassadors"
Volunteers['Value'] = "Volunteers"
Followers['Value'] = 'Followers'
I then concatenate all the files and handle the naming issue. I am sure there is a more elegant way to do this but here's what I have:
S1 = pd.concat([Donors, Activists, Low_Level_Activists, Ambassadors, Volunteers, Followers], ignore_index=True)
S1['Handle'] = S1['Email Address'].where(S1['Email Address'].notnull(), S1['Email'])
S1= S1.drop(['Email','Email Address'], axis = 1)
print(S1['Handle'].count()) #checks full count
The total on that last line is 166,749
Here is my problem. I need to filter the emails for uniques, which is easy enough using .nunique(), but I also need to carry the classifier. So if a single email is a Donor but also an Activist, I pull both rows when I try to merge the unique values with the classifier.
I have been at this for many hours (and to the end of the Internet!) and can't seem to find a workable solution. I've tried dictionary for loops, merges, etc. ad infinitum. The unique email count is 165,923 (figured out via Python &/or excel:( ).
Essentially I would want to pull the earliest classifier in my list on a match. So if an email is a Donor and an Activist -> call them a Donor. Or if an email is a Volunteer and a Follower -> call them a Volunteer on one email record.
Any help would be greatly appreciated.

I'll give it a try with some made-up data:
import pandas as pd
fa = pd.DataFrame([['paul#mail.com', 'Donors'], ['max#mail.com', 'Donors']], columns=['Handle', 'Value'])
fb = pd.DataFrame([['paul#mail.com', 'Activists'], ['annie#mail.com', 'Activists']], columns=['Handle', 'Value'])
S1 = pd.concat([fa, fb])
print(S1)
gives
           Handle      Value
0   paul#mail.com     Donors
1    max#mail.com     Donors
0   paul#mail.com  Activists
1  annie#mail.com  Activists
You can group by Handle and then pick any Value you like, e.g. the first:
for handle, group in S1.groupby('Handle'):
    print(handle, group.reset_index().loc[0, 'Value'])
gives
annie#mail.com Activists
max#mail.com Donors
paul#mail.com Donors
or collect all roles of a person:
for handle, group in S1.groupby('Handle'):
    print(handle, group.Value.unique())
gives
annie#mail.com ['Activists']
max#mail.com ['Donors']
paul#mail.com ['Donors' 'Activists']
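If the frames are concatenated in priority order, a loop may not be needed at all. The sketch below is an alternative to the groupby loops, not the answerer's method: it uses the same made-up data and relies on drop_duplicates keeping the first row it sees for each Handle. The variable names are hypothetical.

```python
import pandas as pd

# Made-up data, concatenated in priority order (Donors before Activists).
donors = pd.DataFrame({"Handle": ["paul#mail.com", "max#mail.com"], "Value": "Donors"})
activists = pd.DataFrame({"Handle": ["paul#mail.com", "annie#mail.com"], "Value": "Activists"})

combined = pd.concat([donors, activists], ignore_index=True)

# keep="first" retains the earliest row per Handle, i.e. the classifier
# of the frame that came first in the concat list.
unique_handles = (
    combined.drop_duplicates(subset="Handle", keep="first")
    .reset_index(drop=True)
)
print(unique_handles)
```

This only gives the intended result when the order of the list passed to pd.concat matches the desired priority of the classifiers.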

What is the simplest way to complete a function on every row of a large table?

I want to run a one-sided Fisher's exact test on every row of a 3000+ row table in the format of the example below:
gene   sample_alt  sample_ref  population_alt  population_ref
One    4           556         770             37000
Two    5           555         771             36999
Three  6           554         772             36998
I would ideally like to make another column of the table equivalent to
[(4+556)!(4+770)!(770+37000)!(556+37000)!]/[4!(556!)770!(37000!)(4+556+770+37000)!]
for the first row of data, and so on and so forth for each row of the table.
I know how to do a fisher test in R for simple 2x2 tables, but I wouldn't know how I would apply the fisher.test() function to each row of a large table. I also can't use an excel formula because the numbers get so big with the factorials that they reach excel's digit limit and result in a #NUM error. What's the best way to simply complete this? Thanks in advance!

Beginning with a tab-delimited text file on the desktop (table.txt) in the same format as shown in the question:
if(!require(psych)){install.packages("psych")}
multiFisher = function(file="Desktop/table.txt", saveit=TRUE,
                       outfile="Desktop/table.csv", progress=T,
                       verbose=FALSE, digits=3, ... )
{
  require(psych)
  Data = read.table(file, skip=1, header=F,
                    col.names=c("Gene", "MD", "WTD", "MC", "WTC"), ...)
  if(verbose){print(str(Data))}
  Data$Fisher.p = NA
  Data$phi = NA
  Data$OR1 = format(0.123, nsmall=3)
  Data$OR2 = NA
  if(progress){cat("\n")}
  for(i in 1:length(Data$Gene)){
    Matrix = matrix(c(Data$WTC[i],Data$MC[i],Data$WTD[i],Data$MD[i]), nrow=2)
    Fisher = fisher.test(Matrix, alternative = 'greater')
    Data$Fisher.p[i] = signif(Fisher$p.value, digits=digits)
    Data$phi[i] = phi(Matrix, digits=digits)
    OR1 = (Data$WTC[i]*Data$MD[i])/(Data$MC[i]*Data$WTD[i])
    OR2 = 1 / OR1
    Data$OR1[i] = format(signif(OR1, digits=digits), nsmall=3)
    Data$OR2[i] = signif(OR2, digits=digits)
    if(progress) {cat(".")}
  }
  if(progress){cat("\n"); cat("\n")}
  if(saveit){write.csv(Data, outfile)}
  return(Data)
}
multiFisher()
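The factorial overflow mentioned in the question can also be sidestepped by working with logarithms of factorials. The following is a minimal Python sketch, not a replacement for fisher.test(): table_probability evaluates the factorial expression from the question through math.lgamma, and fisher_greater sums it over tables with a larger top-left cell. Treating sample_alt as that cell and "greater" as the direction of interest are assumptions, and both function names are hypothetical.

```python
from math import exp, lgamma


def log_factorial(n):
    # log(n!) without ever forming the huge integer n!
    return lgamma(n + 1)


def table_probability(a, b, c, d):
    """Probability of the 2x2 table [[a, b], [c, d]] with fixed margins."""
    n = a + b + c + d
    log_p = (
        log_factorial(a + b) + log_factorial(c + d)
        + log_factorial(a + c) + log_factorial(b + d)
        - log_factorial(a) - log_factorial(b)
        - log_factorial(c) - log_factorial(d)
        - log_factorial(n)
    )
    return exp(log_p)


def fisher_greater(a, b, c, d):
    """One-sided p-value: tables whose top-left cell is >= a, margins fixed."""
    p = 0.0
    k = 0
    while b - k >= 0 and c - k >= 0:
        p += table_probability(a + k, b - k, c - k, d + k)
        k += 1
    return p


# Rows from the question: gene, sample_alt, sample_ref, population_alt, population_ref
rows = [
    ("One", 4, 556, 770, 37000),
    ("Two", 5, 555, 771, 36999),
    ("Three", 6, 554, 772, 36998),
]
for gene, a, b, c, d in rows:
    print(gene, table_probability(a, b, c, d), fisher_greater(a, b, c, d))
```

For a small table such as [[3, 1], [1, 3]] the one-sided value can be checked by hand: (16 + 1) / 70.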

Failing to use sumproduct on date ranges with multiple conditions [Python]

From replacement data table (below on the image), I am trying to incorporate the solbox product replace in time series data format(above on the image). I need to extract out the number of consumers per day from the information.
What I need to find out:
On a specific date, which number of solbox product was active
On a specific date, which number of solbox product (which was a consumer) was active
I have used this line of code in excel but cannot implement this on python properly.
=SUMPRODUCT((Record_Solbox_Replacement!$O$2:$O$1367 = "consumer") * (A475>=Record_Solbox_Replacement!$L$2:$L$1367)*(A475<Record_Solbox_Replacement!$M$2:$M$1367))
I tried in python -
timebase_df['date'] = pd.date_range(start = replace_table_df['solbox_started'].min(), end = replace_table_df['solbox_started'].max(), freq = frequency)
timebase_df['date_unix'] = timebase_df['date'].astype(np.int64) // 10**9
timebase_df['no_of_solboxes'] = ((timebase_df['date_unix']>=replace_table_df['started'].to_numpy()) & (timebase_df['date_unix'] < replace_table_df['ended'].to_numpy() & replace_table_df['customer_type'] == 'customer'))
ERROR:
~\Anaconda3\Anaconda4\lib\site-packages\pandas\core\ops\array_ops.py in comparison_op(left, right, op)
232 # The ambiguous case is object-dtype. See GH#27803
233 if len(lvalues) != len(rvalues):
--> 234 raise ValueError("Lengths must match to compare")
235
236 if should_extension_dispatch(lvalues, rvalues):
ValueError: Lengths must match to compare
Can someone help me please? I can explain in comment section if I have missed something.
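The "Lengths must match" error arises because a column with one entry per date is compared element by element with a column that has one entry per replacement record. A sketch of the SUMPRODUCT idea using NumPy broadcasting follows. The data is made up, the column names started, ended and customer_type follow the attempted code, the label "consumer" follows the Excel formula, and no_of_consumers is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Made-up replacement table; the values are purely illustrative.
replace_table_df = pd.DataFrame({
    "started": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "ended": pd.to_datetime(["2021-01-04", "2021-01-03", "2021-01-05"]),
    "customer_type": ["consumer", "business", "consumer"],
})
timebase_df = pd.DataFrame({"date": pd.date_range("2021-01-01", "2021-01-04", freq="D")})

# Reshape so every date is compared with every record: (n_dates, 1) vs (1, n_records).
dates = timebase_df["date"].to_numpy()[:, None]
started = replace_table_df["started"].to_numpy()[None, :]
ended = replace_table_df["ended"].to_numpy()[None, :]
is_consumer = (replace_table_df["customer_type"] == "consumer").to_numpy()[None, :]

# Boolean grid: record j is active on date i when started <= date < ended.
active = (dates >= started) & (dates < ended)

timebase_df["no_of_solboxes"] = active.sum(axis=1)
timebase_df["no_of_consumers"] = (active & is_consumer).sum(axis=1)
print(timebase_df)
```

The grid holds n_dates x n_records booleans, which should be manageable for a table of roughly 1,400 records like the one in the formula, but would need chunking for much larger inputs.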

How to use custom mean, median, mode functions with array of 2500 in python?

So I am trying to solve mean, median and mode challenge on Hackerrank. I defined 3 functions to calculate mean, median and mode for a given array with length between 10 and 2500, inclusive.
I get an error with an array of 2500 integers, not sure why. I looked into python documentation and found no mentions of max length for lists. I know I can use statistics module but trying the hard way and being stubborn I guess. Any help and criticism is appreciated regarding my code. Please be honest and brutal if need be. Thanks
N = int(input())
var_list = [int(x) for x in input().split()]
def mean(sample_list):
    mean = sum(sample_list)/N
    print(mean)
    return
def median(sample_list):
    sorted_list = sorted(sample_list)
    if N%2 != 0:
        median = sorted_list[(N//2)]
    else:
        median = (sorted_list[N//2] + sorted_list[(N//2)-1])/2
    print(median)
    return
def mode(sample_list):
    sorted_list = sorted(sample_list)
    mode = min(sorted_list)
    max_count = sorted_list.count(mode)
    for i in sorted_list:
        if (i <= mode) and (sorted_list.count(i) >= max_count):
            mode = i
    print(mode)
    return
mean(var_list)
median(var_list)
mode(var_list)
Compiler Message
Wrong Answer
Input (stdin)
2500
19325 74348 68955 98497 26622 32516 97390 64601 64410 10205 5173 25044 23966 60492 71098 13852 27371 40577 74997 42548 95799 26783 51505 25284 49987 99134 33865 25198 24497 19837 53534 44961 93979 76075 57999 93564 71865 90141 5736 54600 58914 72031 78758 30015 21729 57992 35083 33079 6932 96145 73623 55226 18447 15526 41033 46267 52486 64081 3705 51675 97470 64777 31060 90341 55108 77695 16588 64492 21642 56200 48312 5279 15252 20428 57224 38086 19494 57178 49084 37239 32317 68884 98127 79085 77820 2664 37698 84039 63449 63987 20771 3946 862 1311 77463 19216 57974 73012 78016 9412 90919 40744 24322 68755 59072 57407 4026 15452 82125 91125 99024 49150 90465 62477 30556 39943 44421 68568 31056 66870 63203 43521 78523 58464 38319 30682 77207 86684 44876 81896 58623 24624 14808 73395 92533 4398 8767 72743 1999 6507 49353 81676 71188 78019 88429 68320 59395 95307 95770 32034 57015 26439 2878 40394 33748 41552 64939 49762 71841 40393 38293 48853 81628 52111 49934 74061 98537 83075 83920 42792 96943 3357 83393{-truncated-}
Download to view the full testcase
Expected Output
49921.5
49253.5
2184

Your issue seems to be that you are using standard list operations rather than calculating things on the fly while looping through the data once (for the average). sum(sample_list) can become really big; in other words, where the total is held in a fixed-width double it could exceed the limit, although Python integers themselves do not overflow.
Further reading
Calculating the mean, variance, skewness, and kurtosis on the fly
How do I determine the standard deviation (stddev) of a set of values?
Rolling variance algorithm
What is a good solution for calculating an average where the sum of all values exceeds a double's limits?
How to efficiently compute average on the fly (moving average)?

I figured out that you forgot to update the max_count variable inside the if block, which probably causes the wrong result. I tested the debugged version on my computer and it seems to work well when I compare its results with SciPy's built-in functions. The corrected mode function should be
def mode(sample_list):
    N = len(sample_list)
    sorted_list = sorted(sample_list)
    mode = min(sorted_list)
    max_count = sorted_list.count(mode)
    for i in sorted_list:
        # strict ">" keeps the smallest value when several values tie
        if (sorted_list.count(i) > max_count):
            mode = i
            max_count = sorted_list.count(i)
    print(mode)

I was busy with some stuff and now came back to completing this. I am happy to say that I have matured enough as a coder and solved this issue.
Here is the solution:
# Enter your code here. Read input from STDIN. Print output to STDOUT
# Input an array of numbers, convert it to integer array
n = int(input())
my_array = list(map(int, input().split()))
my_array.sort()
# Find mean
array_mean = sum(my_array) / n
print(array_mean)
# Find median
if (n%2) != 0:
    array_median = my_array[n//2]
else:
    array_median = (my_array[n//2 - 1] + my_array[n//2]) / 2
print(array_median)
# Find mode(I could do this using multimode method of statistics module for python 3.8)
def sort_second(array):
    return array[1]
modes = [[i, my_array.count(i)] for i in my_array]
modes.sort(key = sort_second, reverse=True)
array_mode = modes[0][0]
print(array_mode)
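Calling list.count() once per element makes the mode step quadratic in the length of the array. A possible single-pass variant using collections.Counter is sketched below. The function names and the sample list are made up, and ties are resolved in favour of the smallest value, which is what the expected output in the question suggests.

```python
from collections import Counter


def mean(values):
    return sum(values) / len(values)


def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def mode(values):
    counts = Counter(values)             # one pass to count every value
    best = max(counts.values())
    # Assumption: when several values share the top count, report the smallest.
    return min(value for value, count in counts.items() if count == best)


sample = [3, 1, 2, 2, 3, 5]              # made-up data with a tie between 2 and 3
print(mean(sample))
print(median(sample))
print(mode(sample))
```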

svm train output file has less lines than that of the input file

I am currently building a binary classification model and have created an input file for svm-train (svm_input.txt). This input file has 453 lines, 4 features and 2 classes [0,1].
i.e
0 1:15.0 2:40.0 3:30.0 4:15.0
1 1:22.73 2:40.91 3:36.36 4:0.0
1 1:31.82 2:27.27 3:22.73 4:18.18
0 1:22.73 2:13.64 3:36.36 4:27.27
1 1:30.43 2:39.13 3:13.04 4:17.39 ......................
My problem is that when I count the number of lines in the output model generated by svm-train (svm_train_model.txt), this has 12 fewer lines than that of the input file. The line count here shows 450, although there are obviously also 9 lines at the beginning showing the various parameters generated
i.e.
svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 441
rho -0.156449
label 0 1
nr_sv 228 213
SV
Therefore 12 lines in total from the original input of 453 have gone. I am new to svm and was hoping that someone could shed some light on why this might have happened?
Thanks in advance
Updated.........
I now believe that in generating the model, it has removed lines whereby the labels and all the parameters are exactly the same.
To explain............... My input is a set of miRNAs which have been classified as 1 and 0 depending on their involvement in a particular process or not (i.e 1=Yes & 0=No). The input file looks something like.......
0 1:22 2:30 3:14 4:16
1 1:26 2:15 3:17 4:25
0 1:22 2:30 3:14 4:16
Here, lines one and three are exactly the same and as a result will be removed from the output model. My question is both why the output model would do this and how I can get around it (whilst using the same features).
Whilst SOME OF the labels and their corresponding feature values are identical within the input file, these are still different miRNAs.
NOTE: The input file does not have a feature for the miRNA name (which would clearly show the differences between the lines). However, in terms of the features used (i.e. nucleotide percentage content), some of the miRNAs do have exactly the same percentage content of A, U, G & C. As a result they appear to be treated as duplicates and removed from the output model, even though they are not duplicates (hence there are fewer lines in the output model).
the format of the input file is:
Where:
Column 0 - label (i.e 1 or 0): 1=Yes & 0=No
Column 1 - Feature 1 = Percentage Content "A"
Column 2 - Feature 2 = Percentage Content "U"
Column 3 - Feature 3 = Percentage Content "G"
Column 4 - Feature 4 = Percentage Content "C"
The input file actually looks something like this (see the first two lines below: they appear identical, but each line represents a different miRNA):
1 1:23 2:36 3:23 4:18
1 1:23 2:36 3:23 4:18
0 1:36 2:32 3:5 4:27
1 1:14 2:41 3:36 4:9
1 1:18 2:50 3:18 4:14
0 1:36 2:23 3:23 4:18
0 1:15 2:40 3:30 4:15
In terms of software, I am using libsvm-3.22 and python 2.7.5
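For checking the line counts described above, it may help to read the model header programmatically. In libsvm's model format the lines after SV hold only the support vectors, and total_sv in the header gives their count, so a model file is not expected to have one line per training example. The sketch below uses a shortened, made-up model (the three SV lines are invented for illustration) and a hypothetical helper split_model.

```python
def split_model(lines):
    """Split a libsvm model file into its header fields and its SV lines."""
    header = {}
    for i, raw in enumerate(lines):
        line = raw.strip()
        if line == "SV":
            return header, lines[i + 1:]
        key, _, value = line.partition(" ")
        header[key] = value
    return header, []


# Header layout as shown in the question; the SV lines are made up.
model_text = """svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 3
rho -0.156449
label 0 1
nr_sv 2 1
SV
1 1:15.0 2:40.0 3:30.0 4:15.0
1 1:22.73 2:13.64 3:36.36 4:27.27
-1 1:22.73 2:40.91 3:36.36 4:0.0
"""

header, sv_lines = split_model(model_text.splitlines())
print("total_sv:", header["total_sv"], "lines after SV:", len(sv_lines))
```

Applied to the numbers in the question, 9 header lines plus total_sv 441 accounts for the 450 lines of the model file.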

My first observation is that you should align your input file properly. The libsvm code doesn't look for exactly 4 features; it identifies them by the string literals you have provided separating the features from the labels. I suggest manually converting your input file to create the desired input argument.
Try running the following Python code.
Requirements: h5py, if your input is from MATLAB (.mat file).
pip install h5py
import h5py
import numpy as np
f = h5py.File('traininglabel.mat', 'r')  # give label.mat file for training
variables = f.items()
labels = []
c = []
for var in variables:
    data = var[1]
    lables = data[()][0]  # read the whole dataset (data.value in older h5py)
trainlabels = []
for i in lables:
    trainlabels.append(str(i))
finaltrain = []
trainlabels = np.array(trainlabels)
# normalise the label strings to '0' / '1'
for i in range(0, len(trainlabels)):
    if trainlabels[i] == '0.0':
        trainlabels[i] = '0'
    if trainlabels[i] == '1.0':
        trainlabels[i] = '1'
    print(trainlabels[i])
f = h5py.File('training_features.mat', 'r')  # give features here
variables = f.items()
lables = []
file = open('traindata.txt', 'w+')
for var in variables:
    data = var[1]
    lables = data[()]
for i in range(0, 1000):  # no of training samples in file features.mat
    file.write(str(trainlabels[i]))
    file.write(' ')
    for j in range(0, 49):
        file.write(str(lables[j][i]))
        file.write(' ')
    file.write('\n')
file.close()
