How to remove strings like los30_9_ from the second column of a data file? - linux

I have data like the following in a file.
seq
AB los30_9_AAACCTGAGATGTGGC
CGD los28_6_AAACCTGCAGCTTCGG
CGD los28_3_AAACCTGCATAGTAAG
CRG mgj28_3_AAACCTGCATATACGC
CGD lkgd28_11_AAACCTGGTCTTCTCG
CRG lkgd28_3_AAACCTGTCAGTTGAC
AB lkgd35_5_AAACCTGTCTGGTATG
CD los30_9_AAACGGGCAACCGCCA
CD lkgd_8_AAACGGGGTTACCAGT
How can I remove prefixes like los30_9_, los28_6_, los28_3_, mgj28_3_, lkgd28_11_, lkgd28_3_, lkgd35_5_, and lkgd_8_ from the second column of a CSV file?

This Python 3 solution will do the job, and it also respects multiline fields in the CSV file.
#!/usr/local/bin/python3
import csv
import re

csvr = csv.reader(open('input.csv'), delimiter="\t")
next(csvr, None)  # skip the header line
for row in csvr:
    # strip the "name_number_" prefix from the second column
    row[1] = re.sub(r'[a-z0-9]+_[a-z0-9]+_', '', row[1])
    print("{}\t{}".format(row[0], row[1]))
Note: for this to work, your file must be tab-separated rather than space-separated. You can convert it by opening it in Excel and using "Save As...".
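If converting the file is not convenient, a small variation that splits each line on arbitrary whitespace instead of relying on tabs should also work. This is a minimal sketch, assuming the original space-separated file keeps the same hypothetical name input.csv and still has the one-line "seq" header:

#!/usr/local/bin/python3
import re

with open('input.csv') as f:
    next(f)  # skip the "seq" header line
    for line in f:
        parts = line.split()  # split on any run of whitespace
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # remove the "name_number_" prefix from the second column
        parts[1] = re.sub(r'[a-z0-9]+_[a-z0-9]+_', '', parts[1])
        print("\t".join(parts))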

Related

How do I convert multiple multiline txt files to Excel, ensuring each file is its own column and each line of text is its own row? - Python 3

Using openpyxl and Path I aim to:
Create multiple multiline .txt files,
then insert .txt content into a .xlsx file ensuring file 1 is in column 1 and each line has its own row.
My idea was to create a nested list and then loop through it to insert the text. I cannot figure out how to make sure that every string in the nested list gets written. This is what I have so far; it nearly does what I want, but it only repeats the first line of text.
from pathlib import Path
import openpyxl

listOfText = []
wb = openpyxl.Workbook()  # Create a new workbook to insert the text files
sheet = wb.active

for txtFile in range(5):  # create 5 text files
    createTextFile = Path('textFile' + str(txtFile) + '.txt')
    createTextFile.write_text(f'''Hello, this is a multiple line text file.
My Name is x.
This is text file {txtFile}.''')
    readTxtFile = open(createTextFile)
    listOfText.append(readTxtFile.readlines())  # nest the list from each text file into a parent list
    textFileList = len(listOfText[txtFile])  # get the number of lines of text from the file. They are all 3 as made above

    # Each column displays text from each text file
    for row in range(1, txtFile + 1):
        for col in range(1, textFileList + 1):
            sheet.cell(row=row, column=col).value = listOfText[txtFile][0]

wb.save('importedTextFiles.xlsx')
The output is 4 columns by 4 rows, all of which contain the same text: 'Hello, this is a multiple line text file.'
Appreciate any help with this!
The problem is in the for loop that writes the cells. Change the line sheet.cell(row=row, column=col).value = listOfText[txtFile][0] to sheet.cell(row=col, column=row).value = listOfText[row-1][col-1] and it will work.
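If the swapped indices are hard to follow, here is a minimal self-contained sketch of the same idea (not part of the original answer, and it assumes the five text files from the question have already been created): it reads each file and writes file N into column N + 1, one line per row.

from pathlib import Path
import openpyxl

wb = openpyxl.Workbook()
sheet = wb.active

for fileNum in range(5):
    # read the lines of textFile0.txt ... textFile4.txt created by the question's code
    lines = Path('textFile' + str(fileNum) + '.txt').read_text().splitlines()
    for lineNum, lineText in enumerate(lines, start=1):
        # file 0 goes to column 1, file 1 to column 2, ...; every line gets its own row
        sheet.cell(row=lineNum, column=fileNum + 1).value = lineText

wb.save('importedTextFiles.xlsx')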

Python: ignoring an unknown number of leading lines in a CSV file

I'm trying to read from a tab-delimited CSV file that is saved as a .DAT file.
I need to skip everything until I get to the data portion under the Date header, i.e. 05-29-2012 and everything on that row and the rows below it. I found plenty of documentation on how to skip the first few lines, but I don't know how many lines that may be. Between the "Data file created" line and the meat of the data, one file may have more lines of text than another: it could be 3 rows, it could be 10.
I have thousands of these files I'm trying to extract the data out of and plot. It would be easy to cut and paste in Excel, but I'm going for efficiency here.
This is the code I'm using. It reads the data perfectly, but only if I already know how many lines to skip. There are blank lines, and I understand how to bypass those, but extra lines of text before the data are what I can't bypass.
import pandas as pd
import csv
myfile = ('E:\\TTF Data Backup\\1X ARRAY MOD #2.dat')
df = pd.read_csv(myfile, skiprows=3 , delimiter='\t')
print(df.head(20))
Try this code:
import pandas as pd

myfile = r'E:\TTF Data Backup\1X ARRAY MOD #2.dat'
skipcnt = 0
with open(myfile) as f:  # auto-closes after the loop
    for row in f:
        skipcnt += 1
        if "Tension" in row and "Elong" in row:  # top of the header block
            break
skipcnt += 3  # skip the remaining header lines
df = pd.read_csv(myfile, skiprows=skipcnt, delimiter='\t')
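If the header keywords vary between files, a variation on the same idea (a sketch only, assuming every file has a line starting with "Date" directly above the data) is to search for that line and hand its index to skiprows, so pandas uses it as the header row:

import pandas as pd

myfile = r'E:\TTF Data Backup\1X ARRAY MOD #2.dat'

with open(myfile) as f:
    lines = f.readlines()

# index of the first line that starts with "Date" (the real column header row)
headerIdx = next(i for i, line in enumerate(lines) if line.lstrip().startswith("Date"))

# skip everything above that line; pandas then reads the "Date" line as the column names
df = pd.read_csv(myfile, skiprows=headerIdx, delimiter='\t')
print(df.head(20))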

Extract some data from a text file

I am not so experienced in Python.
I have a “CompilerWarningsAllProtocol.txt” file that contains something like this:
" adm_1 C:\Work\CompilerWarnings\adm_1.h type:warning Reason:wunused
adm_2 E:\Work\CompilerWarnings\adm_basic.h type:warning Reason:undeclared variable
adm_X C:\Work\CompilerWarnings\adm_X.h type:warning Reason: Unknown ID"
How can I extract these three paths (C:..., E:..., C:...) from the txt file and use them to fill an Excel column named "Affected Item"?
Can I do it with the re.findall or re.search methods?
For now the script checks whether the input txt file exists at my location and confirms it. After that it creates a blank Excel file with the headers, but I don't know how to populate the Excel file with these paths in the "Affected Item" column, let's say.
Thanks for the help. I will copy-paste the code:
import os
import os.path
import re
import xlsxwriter
import openpyxl
from jira import JIRA
import pandas as pd
import numpy as np

# Print an error message if no "CompilerWarningsAllProtocol.txt" file exists in the folder
inputpath = r'D:\Work\Python\CompilerWarnings\Python_CompilerWarnings\CompilerWarningsAllProtocol.txt'
if os.path.isfile(inputpath) and os.access(inputpath, os.R_OK):
    print(" 'CompilerWarningsAllProtocol.txt' exists and is readable")
else:
    print("Either the file is missing or not readable")

# Create a new Excel file and add a worksheet.
workbook = xlsxwriter.Workbook('CompilerWarningsFresh.xlsx')
worksheet = workbook.add_worksheet('Results')

# Widen the columns correspondingly.
worksheet.set_column('A:A', 20)
worksheet.set_column('B:AZ', 45)

# Create the headers
headers = ('Module', 'Affected Item', 'Issue', 'Class of Issue', 'Issue Root Cause', 'Type of Issue',
           'Source of Issue', 'Test sequence', 'Current Issue appearances in module')

# Write the headers in bold with a larger font size
format1 = workbook.add_format({'bold': True, 'font_color': 'black'})
format1.set_font_size(14)
format1.set_border()

row = col = 0
for item in headers:
    worksheet.write(row, col, item, format1)
    col += 1

workbook.close()
I agree with @dantechguy that csv is probably easier (and more lightweight) than writing a real xlsx file, but if you want to stick to the Excel format, the code below will work. Also, based on the code you've provided, you don't need to import openpyxl, jira, pandas or numpy.
The regex here matches full paths with any drive letter A-Z, followed by "type:warning". If you don't need to check for the warning and simply want to get every path in the file, you can delete everything in the regex after \S+. And if you know you'll only ever want drives C and E, just change A-Z to CE.
import re

warningPathRegex = r"[A-Z]:\\\S+(?=\s*type:warning)"
compilerWarningFile = r"D:\Work\Python\CompilerWarnings\Python_CompilerWarnings\CompilerWarningsAllProtocol.txt"

warningPaths = []
with open(compilerWarningFile, 'r') as f:
    fullWarningFile = f.read()
    warningPaths = re.findall(warningPathRegex, fullWarningFile)

# ... open the Excel file as before, then before workbook.close():
pathColumn = 1  # the "Affected Item" column
for num, warningPath in enumerate(warningPaths):
    worksheet.write(num + 1, pathColumn, warningPath)  # num + 1 to skip the header row
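As a quick sanity check (a hypothetical snippet, using the three sample lines from the question as an inline string), the regex pulls out exactly the three paths:

import re

sample = r"""adm_1 C:\Work\CompilerWarnings\adm_1.h type:warning Reason:wunused
adm_2 E:\Work\CompilerWarnings\adm_basic.h type:warning Reason:undeclared variable
adm_X C:\Work\CompilerWarnings\adm_X.h type:warning Reason: Unknown ID"""

warningPathRegex = r"[A-Z]:\\\S+(?=\s*type:warning)"
print(re.findall(warningPathRegex, sample))
# prints: ['C:\\Work\\CompilerWarnings\\adm_1.h', 'E:\\Work\\CompilerWarnings\\adm_basic.h', 'C:\\Work\\CompilerWarnings\\adm_X.h']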

How do I take the punctuation off each line of a column of an xlsx file in Python?

I have an Excel file (.xlsx) with a column containing rows of strings. I used the following code to read the file:
import pandas as pd
df = pd.read_excel("file.xlsx")
db = df['Column Title']
I am removing the punctuation for the first line (row) of the column using this code:
import string
translator = str.maketrans('', '', string.punctuation)
sent_pun = db[0].translate(translator)
I would like to remove the punctuation for each line (until the last row). How would I correctly write this with a loop? Thank you.
Well, given that this code works for one value and produces the right kind of result, you can apply it to every row with a loop:
import string

translator = str.maketrans('', '', string.punctuation)
sent_pun = []  # punctuation-free copy of every row
for i in range(len(db)):
    sent_pun.append(db[i].translate(translator))
Adjust the column name and the number of rows as per your need.
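Since the column already lives in a pandas Series, a vectorized alternative (a sketch, not the original answer) does the same thing without an explicit loop:

import string
import pandas as pd

df = pd.read_excel("file.xlsx")
translator = str.maketrans('', '', string.punctuation)
# apply str.translate to every cell of the column in one go
df['Column Title'] = df['Column Title'].apply(lambda s: s.translate(translator))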

How to convert a tab delimited text file to a csv file in Python

I have the following problem:
I want to convert a tab delimited text file to a csv file. The text file is the SentiWS dictionary which I want to use for a sentiment analysis ( https://github.com/MechLabEngineering/Tatort-Analyzer-ME/tree/master/SentiWS_v1.8c ).
The code I used to do this is the following:
import csv

txt_file = r"SentiWS_v1.8c_Positive.txt"
csv_file = r"NewProcessedDoc.csv"
in_txt = csv.reader(open(txt_file, "r"), delimiter='\t')
out_csv = csv.writer(open(csv_file, 'w'))
out_csv.writerows(in_txt)
This code writes everything in one row, but I need the data to be in three rows as intended by the file itself. There is also a blank line under each row of data and I don't know why.
I want the data to be in this form:
Row1 Row2 Row3
Word Data Words
Word Data Words
instead of
Row1
Word,Data,Words
Word,Data,Words
Can anyone help me?
import pandas
This will read the tab-delimited text file into a DataFrame:
dataframe = pandas.read_csv("SentiWS_v1.8c_Positive.txt", delimiter="\t")
Then write the DataFrame to a CSV file:
dataframe.to_csv("NewProcessedDoc.csv", encoding='utf-8', index=False)
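One thing to double-check (an assumption on my part, not something stated in the question): if the SentiWS text file has no header row, pass header=None so the first data line is not consumed as column names:

import pandas

# header=None: treat the first line as data, not as column names (assumes the file has no header row)
dataframe = pandas.read_csv("SentiWS_v1.8c_Positive.txt", delimiter="\t", header=None)
dataframe.to_csv("NewProcessedDoc.csv", encoding='utf-8', index=False, header=False)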
Try this:
import csv

txt_file = r"SentiWS_v1.8c_Positive.txt"
csv_file = r"NewProcessedDoc.csv"

with open(txt_file, "r") as in_text:
    in_reader = csv.reader(in_text, delimiter='\t')
    # newline='' belongs to open(), not csv.writer()
    with open(csv_file, "w", newline='') as out_csv:
        out_writer = csv.writer(out_csv)
        for row in in_reader:
            out_writer.writerow(row)
There is also a blank line under each row of data and I don't know why.
You're probably using a file created or edited in a Windows-based text editor. According to the Python 3 csv module docs:
If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n line endings on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.
