extract position of char from each row & provide an aggregate of position across a list - python-3.x

I need some Python help with this problem and would appreciate any assistance. Thanks.
I need an extracted matrix of values enclosed between square brackets. A toy example is below:
The input will be in a txt file as below:
AB_1 Q[A]IHY[P]GVA
AB_2 Q[G][C]HY[R]GVA
AB_3 Q[G][C]HY[R]GV[D]
Expected out.txt: the script extracts the index of each character enclosed in square brackets "[]" for every row of the input and aggregates those index positions across the entire list. The aggregated indices are then used to extract all of those positions from the input file and produce a matrix as below:
Index 2,3,6,9
AB_1 [A],I,[P],A
AB_2 [G],[C],[R],A
AB_3 [G],[C],[R],[D]
Any help would be greatly appreciated. Thanks.

If you want to reduce your table to only those columns in which an entry with square-brackets appears, you can go with this:
import re

def transpose(matrix):
    return [[matrix[j][i] for j in range(len(matrix))] for i in range(len(matrix[0]))]

with open("test_table.txt", "r") as f:
    content = f.read()

# Skip blank lines (e.g. a trailing newline) so line.split()[1] never fails
rows = [re.findall(r'(\[.\]|.)', line.split()[1]) for line in content.split("\n") if line.strip()]
columns = transpose(rows)
matching_columns = [[str(i + 1)] + column for i, column in enumerate(columns) if "[" in "".join(column)]
matching_rows = transpose(matching_columns)
headline = ["Index {}".format(",".join(matching_rows[0]))]
target_table = headline + ["AB_{0} {1}".format((i + 1), ",".join(line)) for i, line in enumerate(matching_rows[1:])]

with open("out.txt", "w") as f:
    f.write("\n".join(target_table))
First of all you want the content of your .txt file to be represented in arrays. Unfortunately your input data has no separators yet (as in .csv files), so you need to take care of that yourself. To break up a string like "Q[A]IHY[P]GVA" I would recommend working with regular expressions:
import re
cells = re.findall(r'(\[.\]|.)', "Q[A]IHY[P]GVA")
The pattern within the r'' string matches either a single character within square brackets or a bare single character. The re.findall() method returns a list of all matching substrings, in this case: ['Q', '[A]', 'I', 'H', 'Y', '[P]', 'G', 'V', 'A']
rows = [re.findall(r'(\[.\]|.)', line.split()[1]) for line in content.split("\n") if line.strip()] applies this method to every line in your file. The line.split()[1] leaves out the row label "AB_X ", as it is not useful here.
Having your data sorted in columns is a better fit, because you want to preserve all columns that match a certain condition (contain an entry in brackets). For this you can just transpose rows, which is done by the transpose() function. If you have numpy imported, numpy.transpose(rows) would be the better option, I guess.
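For instance, this is what transpose() does to a small illustrative 3x2 matrix:
print(transpose([['a', 'b'], ['c', 'd'], ['e', 'f']]))
# -> [['a', 'c', 'e'], ['b', 'd', 'f']]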
Next you want to get all columns that satisfy your condition "[" in "".join(column). All done in one line by: matching_columns = [[str(i + 1)] + column for i, column in enumerate(columns) if "[" in "".join(column)] Here [str(i + 1)] prepends the 1-based column index that you want to use later.
The rest now is easy: Transpose the columns back to rows, relabel the rows, format the row data into strings that fit your desired output format and then write those strings to the out.txt file.
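Applied to the toy input above, target_table ends up holding exactly the four strings of the expected output:
# target_table == ['Index 2,3,6,9',
#                  'AB_1 [A],I,[P],A',
#                  'AB_2 [G],[C],[R],A',
#                  'AB_3 [G],[C],[R],[D]']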

Related

Replacing "DoIt.py" script with flexible functions that match DFs on partial string matching of column names [Python3] [Pandas] [Merge]

I spent too much time trying to write a generic solution to a problem (below). I ran into a couple of issues, so I ended up writing a Do-It script, which is here:
# No imports necessary
# Set file paths
annofh = "/Path/To/Annotation/File.tsv"
datafh = "/Path/To/Data/File.tsv"
mergedfh = "/Path/To/MergedOutput/File.tsv"
# Read all the annotation data into a dict:
annoD = {}
with open(annofh, 'r') as annoObj:
    h1 = annoObj.readline()
    for l in annoObj:
        l = l.strip().split('\t')
        k = l[0] + ':' + l[1] + ' ' + l[3] + ' ' + l[4]
        annoD[k] = l
keyset = set(annoD.keys())
with open(mergedfh, 'w') as oF:
    with open(datafh, 'r') as dataObj:
        # Write the header line to the output file
        h2 = dataObj.readline().strip()
        oF.write(h2 + '\t' + h1)
        # Read through the data to be annotated line-by-line:
        for l in dataObj:
            l = l.strip().split('\t')
            if "-" in l[13]:
                pos = l[13].split('-')
                l[13] = pos[0]
            key = l[12][3:] + ":" + l[13] + " " + l[15] + " " + l[16]
            if key in annoD:
                l = l + annoD[key]
                oF.write('\t'.join(l) + '\n')
            else:
                oF.write('\t'.join(l) + '\n')
The function of DoIt.py (which works correctly, above) is simple:
first, read a file containing annotation information into a dictionary;
then read through the data to be annotated line-by-line, and add annotation info to the data by matching on a key string constructed by pasting together 4 columns.
As you can see, this script contains hard-coded index positions, which I obtained by writing a quick awk one-liner to find the corresponding columns in both files and then putting those positions into the Python script.
Here's the thing: I do this kind of task all the time. I want to write a robust solution that will enable me to automate this task, even if the column names vary. My first goal is to use partial string matching; eventually it would be nice to be even more robust.
I got part of the way to doing this, but at present the below solution is actually no better than the DoIt.py script...
# Across many projects, the correct column names vary.
# For example, the name might be "#CHROM" or "Chromosome" or "CHR" for the first DF, but "Chrom" for the second DF.
# In any case, if I apply str.lower() and then search for a substring, it should match any of the above options.
MasterColNamesList = ["chr", "pos", "ref", "alt"]

def selectFields(h, columnNames):
    # Currently this only fixes lower-case/upper-case problems. Need to extend it to catch
    # any kind of mapping issue, like a partial string match (e.g., "chr" should match "#CHROM").
    indices = []
    h = [name.lower() for name in h]
    for fld in columnNames:
        if fld in h:
            indices.append(h.index(fld))
    # Now, this will work, but only if the field names are an exact match.
    return indices

def MergeDFsByCols(DF1, DF2, colnames):  # <-- Single set of colnames; no need to use indices
    # Eventually, need to write the merge statement; I could paste the cols together into a string,
    # make that the index for both DFs, and then match on the indices, for example.
    pass

def mergeData(annoData, studyData, MasterColNamesList):
    import pandas as pd
    aDF = pd.read_csv(annoData, header=0, sep='\t')
    sDF = pd.read_csv(studyData, header=0, sep='\t')
    annoFieldIdx = selectFields(list(aDF.columns.values), columnNames1)  # currently columnNames1; should be MasterColNamesList
    dataFieldIdx = selectFields(list(sDF.columns.values), columnNames2)
    MergeDFsByCols(aDF, sDF, MasterColNamesList)
Now, although the above approach works, it is actually no more automated than the DoIt.py script, because columnNames1 and columnNames2 are specific to each file and still need to be found manually ...
What I want to be able to do is enter a list of generic strings that, if processed, will result in the correct columns being pulled from both files, then merge the pandas DFs on those columns.
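To illustrate, here is a rough, untested sketch of the kind of partial matching I have in mind (match_columns is a hypothetical name, not something that exists yet):
import pandas as pd

def match_columns(df_columns, master_names):
    # Map each generic name (e.g. "chr") to the first actual column
    # whose lowercased name contains it (e.g. "#CHROM").
    matched = {}
    for generic in master_names:
        for col in df_columns:
            if generic in col.lower():
                matched[generic] = col
                break
    return matched

# The merge could then use the matched names directly, e.g.:
# anno_map = match_columns(aDF.columns, MasterColNamesList)
# data_map = match_columns(sDF.columns, MasterColNamesList)
# merged = aDF.merge(sDF,
#                    left_on=[anno_map[g] for g in MasterColNamesList],
#                    right_on=[data_map[g] for g in MasterColNamesList])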
Greatly appreciate your help.

Getting KeyError for pandas df column name that exists

I have
data_combined = pd.read_csv("/path/to/creole_data/data_combined.csv", sep=";", encoding='cp1252')
So, when I try to access these rows:
data_combined = data_combined[(data_combined["wals_code"]=="abk") &(data_combined["wals_code"]=="aco")]
I get a KeyError 'wals_code'. I then checked my list of col names with
print(data_combined.columns.tolist())
and saw the col name 'wals_code' in the list. Here are the first few items from the printout:
[',"wals_code","Order of subject, object and verb","Order of genitive and noun","Order of adjective and noun","Order of adposition and NP","Order of demonstrative and noun","Order of numeral and noun","Order of RC and noun","Order of degree word and adjective"]
Anyone have a clue what is wrong with my file?
The problem is the delimiter you're using when reading the CSV file. With sep=';', you instruct read_csv to use semicolons (;) as the separators for columns (cells and column headers), but it appears from your columns printout that your CSV file actually uses commas (,).
If you look carefully, you'll notice that your columns printout actually displays a list containing one long string, not a list of individual strings representing the column names.
So, use sep=',' instead of sep=';' (or just omit it entirely as , is the default value for sep):
data_combined = pd.read_csv("/path/to/creole_data/data_combined.csv", encoding='cp1252')
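If you're ever unsure which delimiter a file uses, a quick peek at the raw first line settles it:
with open("/path/to/creole_data/data_combined.csv", encoding="cp1252") as f:
    print(f.readline())  # shows whether the fields are separated by commas or semicolons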

Extract numbers only from specific lines within a txt file with certain keywords in Python

I want to extract numbers only from lines in a txt file that contain a certain keyword, add them up, compare the totals, and then print the highest and the lowest total. How should I go about this?
I want to print the highest and the lowest valid totals.
I managed to extract the lines with the "VALID" keyword in them, but now I want to get the numbers from these lines, add up the numbers in each line, compare these totals across the lines that share the keyword, and print the highest and the lowest valid totals.
My code so far:
#get file object reference to the file
file = open("shelfs.txt", "r")
#read content of file to string
data = file.read()
#close file
file.close()
#get number of occurrences of the substrings in the string
totalshelfs = data.count("SHELF")
totalvalid = data.count("|VALID")    # note: counting plain "VALID" would also match "INVALID"
totalinvalid = data.count("INVALID")
print('Number of total shelfs :', totalshelfs)
print('Number of valid books :', totalvalid)
print('Number of invalid books :', totalinvalid)
The txt file:
HEADER|
SHELF|2200019605568|
BOOK|20200120000000|4810.1|20210402|VALID|
SHELF|1591024987400|
BOOK|20200215000000|29310.0|20210401|VALID|
SHELF|1300001188124|
BOOK|20200229000000|11519.0|20210401|VALID|
SHELF|1300001188124|
BOOK|20200329001234|115.0|20210331|INVALID|
SHELF|1300001188124|
BOOK|2020032904567|1144.0|20210401|INVALID|
FOOTER|
What you need is to use the pandas library.
https://pandas.pydata.org/
You can read a csv file like this:
import pandas as pd

data = pd.read_csv('shelfs.txt', sep='|')
It returns a DataFrame object that makes it easy to select or sort your data. It will use the first row as the header; then you can select a specific column like a dictionary:
header = data['HEADER']
header is a Series object.
To select rows you can do:
shelfs = data.loc[data['HEADER']=='SHELF', :]
to keep only the rows whose first column is 'SHELF'.
I'm just not sure how pandas will handle the fact that you only have 1 header but 2 or 5 columns.
Maybe you should try to create one header per column in your csv, and add separators to make each row the same size first.
Edit (No External libraries or change in the txt file):
# Split into rows
data = data.split('\n')
# Split each row into columns
data = [d.split('|') for d in data]
# Fill empty cells so every row has the same number of columns
n_cols = max(len(d) for d in data)
for i in range(len(data)):
    while len(data[i]) < n_cols:
        data[i].append('')
# List VALID rows (the keyword sits in the 5th column of BOOK rows)
valid_rows = [d for d in data if d[4] == 'VALID']
# Sum the numeric fields of each row, then take min and max
valid_sums = [float(d[1]) + float(d[2]) + float(d[3]) for d in valid_rows]
valid_minimum = min(valid_sums)
valid_maximum = max(valid_sums)
print('Highest valid total :', valid_maximum)
print('Lowest valid total :', valid_minimum)
It's maybe not exactly what you want to do, but it solves part of your problem. I didn't test the code.

Sorting csv data by a column of numerical values

I have a csv file that has 10 columns and about 7,000 rows. I have to sort the data based on the 4th column, which is column #3 with 0-based counting. The column has a header, and the data in this column are numeric values. The largest value in this column is 7548375288960, so that row should be at the top of my results.
My code is below. Something interesting is happening. If I change the reverse=True to reverse=False then the 15 rows printed to the screen are the correct ones based on me manually sorting the csv file in Excel. But when I set reverse=True they are not the correct ones. Below are the first 4 that my print statement puts out:
999016759.26
9989694.0
99841428.0
998313048.0
Here is my code:
import csv
import operator
from itertools import islice

def display():
    theFile = open("MyData.csv", "r")
    mycsv = csv.reader(theFile)
    sort = sorted(mycsv, key=operator.itemgetter(3), reverse=True)
    for row in islice(sort, 15):
        print(row)
Appreciate any help!
OK, I got this solved. A couple of things:
The data in the column, while only containing numerical values, was in string format. To overcome this I did the following in the function that was generating the csv file.
concatDf["ColumnName"] = concatDf["ColumnName"].astype(float)
This converted all of the strings to floats. Then in my display function I changed the sort line of code to the following:
sort = sorted(reader, key=lambda x: int(float(x[3])), reverse=True)
Then I got a different error, which I realized came from trying to convert the header from string to float, which is impossible. To overcome that I added the following line to skip the header:
next(theFile, None)
This is what the function looks like now:
import csv
from itertools import islice

def display():
    theFile = open("MyData.csv", "r")
    next(theFile, None)  # skip the header line
    reader = csv.reader(theFile, delimiter=",", quotechar='"')
    sort = sorted(reader, key=lambda x: int(float(x[3])), reverse=True)
    for row in islice(sort, 15):
        print(row)
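For what it's worth, the odd reverse=True ordering in the original code came from comparing strings, which sort character by character rather than by numeric value:
print(sorted(["9989694.0", "99841428.0"], reverse=True))
# -> ['9989694.0', '99841428.0'], because "9989..." > "9984..." as text,
#    even though 99841428.0 is the larger number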

How do I take the punctuation off each line of a column of an xlsx file in Python?

I have an excel file (.xlsx) with a column having rows of strings. I used the following code to get the file:
import pandas as pd
df = pd.read_excel("file.xlsx")
db = df['Column Title']
I am removing the punctuation for the first line (row) of the column using this code:
import string
translator = str.maketrans('', '', string.punctuation)
sent_pun = db[0].translate(translator)
I would like to remove the punctuation for each line (until the last row). How would I correctly write this with a loop? Thank you.
Well, given that this code works for one value and produces the right kind of result, you can write the same thing as a loop over the whole column:
import string

translator = str.maketrans('', '', string.punctuation)
sent_pun = []
for value in db:
    # Strip punctuation from each row of the column in turn
    sent_pun.append(value.translate(translator))
sent_pun then holds one punctuation-free string per row, in the same order as the column.
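As a side note, pandas can also do this without an explicit Python loop, since Series.str.translate accepts the same table that str.maketrans produces (a vectorized sketch of the same idea):
sent_pun = db.astype(str).str.translate(translator)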
