How to read or open a qrel format file? - nlp

I was working with a TREC qrel file and I would like to have a look at it. How do I read a qrel file, or how can I open it? What is the format, and what library should I use?

If you rename the file with a .txt extension you will see that it has multiple whitespace-separated columns, one of which is the relevance judgment.
If you are used to working with CSV files and Python Pandas Dataframes you can opt to follow these steps:
Rename the qrel file with a .txt extension. (Just so that you can read it on a notepad or something)
Read the file as a usual .txt line by line and push it into a CSV file.
Off the top of my head, here is a quick snippet in Python which you could try:
import pandas as pd

rel_query = []
with open('/content/renamed_qrel.qrel.txt', 'r') as fp:
    lines = fp.readlines()
    for line in lines:
        # The line below may need to be changed based on the type of data in the qrel file
        rel_query.append(line.split())

qrel_df = pd.DataFrame(rel_query)
NOTE: Although this may or may not be the right way to do it, it should help you get started.
I think the right way of doing this would be as follows:
import pandas as pd

df = pd.read_csv('abcd.txt',
                 sep=r"\s+",                   # or whichever separator applies
                 names=["A", "B", "C", "D"])   # column header names

Related

Why can't I see the extra line when I directly open the csv file?

I'm learning Python on Windows and I ran into a problem when creating a CSV file.
Here is the code
import csv

with open("data.csv", "w") as file:
    writer = csv.writer(file)
    writer.writerow(["trasaction_id", "product_id", "price"])
    writer.writerow([1000, 1, 5])
When I open data.csv in VSCode, there is an extra blank line after each row:
trasaction_id,product_id,price
1000,1,5
I know this problem can be solved by adding newline="" to the open() call. But I don't understand why I can't see the extra line when I open this file directly in Notepad?
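For reference, the newline="" fix mentioned in the question would look like this (a minimal sketch): the csv module already writes \r\n line terminators, and on Windows the text-mode file translates the \n again, producing \r\r\n, which some editors render as an extra blank line.

import csv

# newline="" suppresses the text-mode translation, so csv.writer's "\r\n"
# endings are written as-is instead of becoming "\r\r\n" on Windows
with open("data.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["trasaction_id", "product_id", "price"])
    writer.writerow([1000, 1, 5])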

How to read all csv files that begin with a consonant?

import os

for file in os.listdir("/content/drive/MyDrive/BigData2021/Lecture23/datasets"):
    if file.endswith(".csv"):
        print(os.path.join(file))
cities.csv
airports.csv
data_scientist_salaries.csv
I want to read, with Spark, only the CSV files whose names begin with a consonant, without listing the filenames explicitly. How can I do that?
Using the wildcard [b-df-hj-np-tv-z]*.csv in the path should do the job:
df = spark.read.csv("/your_directory/datasets/[b-df-hj-np-tv-z]*.csv")
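If you want to double-check which files the pattern actually picked up, one option (a sketch, assuming a PySpark session named spark as above) is to look at the source file of each row:

from pyspark.sql import functions as F

df = spark.read.csv("/your_directory/datasets/[b-df-hj-np-tv-z]*.csv")
# List the distinct source files that matched the consonant-only pattern
df.select(F.input_file_name().alias("source_file")).distinct().show(truncate=False)

Note that the character class above only matches lowercase names; add B-DF-HJ-NP-TV-Z to the class if some filenames start with an uppercase consonant.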

read_csv one file from several files in a gzip?

I have several files in my tar.gz archive. I want to read only one of them into a pandas data frame. Is there any way to do that?
Pandas can read a file inside a .gz, but there seems to be no way to tell it to read a specific one when there are several files inside the archive.
Would appreciate any thoughts.
Babak
To read a specific file inside a compressed archive, you just need to give its name or position. For example, to read a specific CSV file in a zipped folder, you can open that member and read its content:
from zipfile import ZipFile
import pandas as pd

# Open the zip file in read mode and read the third member into a DataFrame
with ZipFile("results.zip") as z:
    read = pd.read_csv(z.open(z.infolist()[2].filename))
    print(read)
Here is the folder structure of results.zip; I want to read test.csv:
data_description.txt  sample_submission.csv  test.csv  train.csv
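Since the question mentions a tar.gz rather than a zip, the same idea works with the standard tarfile module. A minimal sketch, where archive.tar.gz and member.csv are placeholder names:

import tarfile
import pandas as pd

# Open the gzip-compressed tar archive and extract only the member we want
with tarfile.open("archive.tar.gz", "r:gz") as tar:
    member = tar.extractfile("member.csv")  # file-like object for that one member
    df = pd.read_csv(member)

print(df.head())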
If you use pardata, you can do this in one line:
import pardata
data = pardata.load_dataset_from_location('path-to-zip.zip')['table/csv']
The returned data variable should be a dictionary of all csv files in the zip archive.
Disclaimer: I'm one of the main co-authors of pardata.

Raw output data frame manipulation in python

Using Python 3, I need to process qPCR sequencing raw data outputs by searching for the first occurrence of a user-defined string and then making a new data frame using all lines after that string. I am trying to find solutions in the pandas docs, but so far without success.
This is a raw output .csv file that I need to process (I couldn't paste the complete CSV as it exceeds the character limit; this is lines 40-50, which I hope is useful). I need to tell pandas to create a new data frame that 1. starts at the line containing the first occurrence of the string "Sample Name", with that line as the header, and includes all lines that follow, and 2. keeps only the columns "Sample Name", "Target Name" and "CT".
Could someone please help me so that I can use Python to analyze biological data?
Many thanks,
Luke
40,Quantification Cycle Method,Ct,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
41,Signal Smoothing On,true,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
42,Stage where Melt Analysis is performed,Stage3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
43,Stage/ Cycle where Ct Analysis is performed,"Stage2, Step2",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
44,User Name,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
45,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
46,Well,Well Position,Omit,Sample Name,Target Name,Task,Reporter,Quencher,Quantity,Quantity Mean,SE,RQ,RQ Min,RQ Max,CT,Ct Mean,Ct SD,Delta Ct,Delta Ct Mean,Delta Ct SD,Delta Ct SE,Delta Delta Ct,Automatic Ct Threshold,Ct Threshold,Automatic Baseline,Baseline Start,Baseline End,Amp Status,Comments,Cq Conf,CQCONF,HIGHSD,OUTLIERRG,Tm1,Tm2,Tm3,Tm4
47,1,A1,False,WT1,AtTubulin,UNKNOWN,SYBR,None,,,,,,,23.357698440551758,23.4766845703125,0.5336655378341675,,,,,,True,20959.612776965325,True,3,17,Amp,,0.9588544573203085,N,Y,N,81.40960693359375,,,
48,2,A2,False,WT1,AtTubulin,UNKNOWN,SYBR,None,,,,,,,24.05980110168457,23.4766845703125,0.5336655378341675,,,,,,True,20959.612776965325,True,3,15,Amp,,0.9592687354496955,N,Y,N,81.40960693359375,,,
49,3,A3,False,WT1,AtTubulin,UNKNOWN,SYBR,None,,,,,,,23.012556076049805,23.4766845703125,0.5336655378341675,,,,,,True,20959.612776965325,True,3,16,Amp,,0.9592714462250367,N,Y,N,81.40960693359375,,,
50,4,A4,False,fla11fla12-1,AtTubulin,UNKNOWN,SYBR,None,,,,,,,23.803699493408203,24.419523239135742,0.5669151544570923,,,,,,True,20959.612776965325,True,3,17,Amp,,0.9671570584141241,N,Y,N,81.40960693359375,,,
This is the code that I have so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_excel("2019-02-27_161601 AtWAKL8 different version expressions.xls", sheet_name='Results').fillna(0)
data.to_csv('df1' + '.csv', index=True)
df1 = pd.read_csv("df1.csv")
You are having trouble with quoting, and grep is a better fit for .csv files than for .xlsx.

You are forking off a shell subprocess with a filename argument without correctly quoting the spaces in the filename. It would be simplest to rename the file, turning spaces into dashes, e.g. 2019-02-27_161601-AtWAKL8-different-version-expressions.xls. As it stands, you are trying to grep the string "Position" from a file named 2019-02-27_161601, from a 2nd file named AtWAKL8, a 3rd named different, and so on, which is unlikely to work.

An .xlsx spreadsheet is not the line-oriented text format that grep expects. You will be happier if you export or Save As .csv format within Excel, or if you execute data.to_csv('expressions.csv')
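For the original question of building a data frame that starts at the row containing "Sample Name" and keeps only a few columns, one possible approach once the data is in CSV form is to locate that row and re-read the file from there. A minimal sketch, using the intermediate df1.csv from the question and the column names visible in the pasted data:

import pandas as pd

# Read the raw export with no header so every line is kept as plain data
raw = pd.read_csv("df1.csv", header=None, dtype=str)

# Find the first row that contains "Sample Name"; that row becomes the header
header_idx = (raw == "Sample Name").any(axis=1).idxmax()

# Re-read, skipping everything above the header row, then keep only the wanted columns
results = pd.read_csv("df1.csv", skiprows=header_idx)
results = results[["Sample Name", "Target Name", "CT"]]
print(results.head())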

Why won't this Python script replace one variable with another variable?

I have a CSV file with two columns in it, the one on the left being an old string and the one directly to its right being the new one. I have a heap of .xml files that contain the old strings, which I need to replace/update with the new ones.
The script is supposed to open each .xml file one at a time and replace all of the old strings from the CSV file with the new ones. I have tried to use a replace function to replace instances of the old string, column[0], with the new string, column[1]. However, I must be missing something as this seems to do nothing. If I set the first argument of the replace function to an actual string with quotation marks, the replace function works; however, if both arguments are variables, it doesn't.
Does anyone know what I am doing wrong?
import os
import csv

with open('csv.csv') as csv:
    lines = csv.readline()
    column = lines.split(',')

fileNames = [f for f in os.listdir('.') if f.endswith('.xml')]
for f in fileNames:
    x = open(f).read()
    x = x.replace(column[0], column[1])
    print(x)
Example of CSV file:
oldstring1,newstring1
oldstring2,newstring2
Example of .xml file:
Word words words oldstring1 words words words oldstring2
What I want in the new .xml files:
Word words words newstring1 words words words newstring2
The problem here is that you are treating the CSV file as a normal text file and not looping over all the lines in it.
You need to read the file using the csv reader.
The following code will work for your task:
import os
import csv

with open('csv.csv') as csvfile:
    # Read every old/new pair up front so the list can be reused for each XML file
    rows = list(csv.reader(csvfile))

fileNames = [f for f in os.listdir('.') if f.endswith('.xml')]
for f in fileNames:
    x = open(f).read()
    for row in rows:
        x = x.replace(row[0], row[1])
    print(x)
It looks like this is better done using sed. However, if we want to use Python, it seems to me that what you want to do is best achieved by:
- reading all the old string / replacement pairs and storing them in a list of lists;
- looping over the .xml files, as specified on the command line, using the handy fileinput module, asking it to operate in place and to keep the backup files around;
- for every line in each of the .xml files, applying all the replacements;
- putting the modified line back into the original file (using simply a print, thanks to fileinput's magic), with end='' because we don't want to strip each line, so that any trailing whitespace is preserved.
import fileinput
import sys

old_new = [line.strip().split(',') for line in open('csv.csv')]

for line in fileinput.input(sys.argv[1:], inplace=True, backup='.bak'):
    for old, new in old_new:
        line = line.replace(old, new)
    print(line, end='')
If you save the code in replace.py, you will execute it like this
$ python3 replace.py *.xml subdir/*.xml another_one/a_single.xml
