Extract n characters for the first match of a word in a file - python-3.x

I am a beginner in Python. I have a file containing a single line of data. My requirement is to extract "n" characters after certain words, for their first occurrence only. Also, those words are not sequential.
Data file: {"id":"1234566jnejnwfw","displayId":"1234566jne","author":{"name":"abcd#xyz.com","datetime":15636378484,"displayId":"23423426jne","datetime":4353453453}
I want to fetch the value after the first match of "displayId" and before "author", i.e. 1234566jne. Similarly for "datetime".
I tried breaking the line at the index of the word and writing the remainder to another file for further cleanup, to get the exact value.
displayId = '"displayId"'  # the word being searched for

tmpFile = "tmpFile.txt"
tmpFileOpen = open(tmpFile, "w+")
with open("data file") as openfile:
    for line in openfile:
        tmpFileOpen.write(line[line.index(displayId) + len(displayId):])
However, I am sure this is not a good solution to build on.
Can anyone please help me with this?

This answer should work for any displayId with a format similar to the one in your question. I decided not to load the JSON for this answer, because it wasn't needed to accomplish the task.
import re

tmpFile = "tmpFile.txt"
tmpFileOpen = open(tmpFile, "w+")

with open('data_file.txt', 'r') as input:
    lines = input.read()

# Use a regex to find the displayId elements
# example: "displayId":"1234566jne
# \W matches non-word characters, such as " and :
# \d matches digits
# {6,8} matches a run of 6 to 8 digits
# [a-z] matches lowercase ASCII characters
# {3} matches 3 lowercase ASCII characters
id_patterns = re.compile(r'\WdisplayId\W{3}\d{6,8}[a-z]{3}')
id_results = re.findall(id_patterns, lines)

# Use a list comprehension to clean the results
clean_results = [s.strip('"displayId":"') for s in id_results]

# Loop through the clean_results list
for id in clean_results:
    # Write each id to the temp file on a separate line
    tmpFileOpen.write('{} \n'.format(id))

# output in tmpFileOpen:
# 1234566jne
# 23423426jne
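One caveat worth knowing: str.strip removes any of the characters in its argument from both ends, not the literal prefix, so s.strip('"displayId":"') only works here by luck (an id starting or ending with one of those characters would lose it). A capture group avoids the cleanup step entirely; a minimal sketch using a shortened version of the sample line, assuming the ids keep the digits-plus-three-letters shape:

import re

# Shortened sample line from the question
line = '{"id":"1234566jnejnwfw","displayId":"1234566jne","author":{"displayId":"23423426jne"}'

# Capture just the value after "displayId":" - no post-hoc stripping needed
id_pattern = re.compile(r'"displayId":"(\d{6,8}[a-z]{3})')
print(id_pattern.findall(line))  # ['1234566jne', '23423426jne']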
This answer does load the JSON file, but it will fail if the JSON format changes. (Note that the sample line in the question is missing a closing brace, so it is not valid JSON as shown, and the duplicated "datetime" key means json.loads keeps only the second value.)
import json

tmpFile = 'tmpFile.txt'
tmpFileOpen = open(tmpFile, "w+")

# Load the JSON file
jdata = json.loads(open('data_file.txt').read())

# Find the first ID
first_id = jdata['displayId']

# Write the first ID to the temp file
tmpFileOpen.write('{} \n'.format(first_id))

# Find the second ID
second_id = jdata['author']['displayId']

# Write the second ID to the temp file
tmpFileOpen.write('{} \n'.format(second_id))

# output in tmpFileOpen:
# 1234566jne
# 23423426jne

If I understand your question correctly, you can achieve this by doing the following:
import json

tmpFile = "tmpFile.txt"
tmpFileOpen = open(tmpFile, "w+")
with open("data.txt") as openfile:
    for line in openfile:
        # Load the JSON into a dict in order to manipulate it easily
        data = json.loads(str(line))
        # Write only the first 3 characters of the
        # `displayId` field to the tmp file
        tmpFileOpen.write(data['displayId'][:3])
This can be done because the data in your file is JSON; however, if the format changes, it won't work.
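If loading JSON is not an option (as noted above, the sample line is not quite valid JSON as shown), plain string slicing is another route. A minimal sketch with str.partition, which splits on the first occurrence only, assuming each value sits between double quotes right after the key:

# Shortened sample line from the question
line = '{"id":"1234566jnejnwfw","displayId":"1234566jne","author":{"displayId":"23423426jne"}'

# partition splits on the first occurrence of the separator only
_, _, rest = line.partition('"displayId":"')
value = rest.split('"', 1)[0]
print(value)  # 1234566jne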

Related

How to read many files have a specific format in python

I am a little confused about how to read all the lines of many files whose names run from "datalog.txt.98" to "datalog.txt.120".
This is my code:
import json
file = "datalog.txt."
i = 97
for line in file:
    i += 1
    f = open(line + str(i), 'r')
    for row in f:
        print(row)
Here, you will find an example of one line in one of those files:
I really need your help.
I suggest using a loop to open the multiple files with their different numeric suffixes.
To better understand this project, I would recommend researching the following topics:
for loops,
string manipulation,
opening a file and reading its content,
list manipulation,
string parsing.
This is one of my favourite beginner guides.
To set the range of the integers at the end of the file name, I would look into Python for loops.
I think this is what you are trying to do:

# create a list to store all your file content
files_content = []

# the prefix is of type string
filename_prefix = "datalog.txt."

# loop over the integer suffixes 98 to 120
for i in range(98, 121):
    # build the filename from the prefix and the integer i,
    # which needs to be converted to a string
    filename = filename_prefix + str(i)
    # open the file and read all of its lines into a variable
    with open(filename) as f:
        content = f.readlines()
    # append the file content to the files_content list
    files_content.append(content)
To get rid of the whitespace left over from the file parsing, add the missing line just before the append (inside the loop):

    # strip the newline from the end of each line
    content = [x.strip() for x in content]
    files_content.append(content)
Here's an example of printing out files_content:

for file in files_content:
    print(file)
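Since the suffixes are plain integers, the standard library's glob module can also collect the files without building each name by hand. A sketch, under the assumption that the datalog files live in the current directory and every suffix is numeric:

import glob

filenames = glob.glob('datalog.txt.*')
# Sort numerically by the suffix; lexicographic order would put
# 'datalog.txt.100' before 'datalog.txt.98'
filenames.sort(key=lambda name: int(name.rsplit('.', 1)[-1]))

files_content = []
for filename in filenames:
    with open(filename) as f:
        files_content.append([line.strip() for line in f])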

Parse multiline fasta file using record.id for filenames but not in headers

My current multiline fasta file is as such:
>chr1|chromosome:Mt4.0v2:1:1:52991155:1
ATGC...
>chr2|chromosome:Mt4.0v2:2:1:45729672:1
ATGC...
...and so on.
I need to parse the fasta file into separate files containing only the record.description in the header (everything after the |), followed by the sequence. However, I need to use the record.ids as the filenames (chr1.fasta, chr2.fasta, etc.). Is there any way to do this?
My current attempt at solving this is below. It does produce only the description in the header, but it uses the last sequence's record.id as the filename. I need separate files.
from Bio import SeqIO

def yield_records(in_file):
    for record in SeqIO.parse(in_file, 'fasta'):
        record.description = record.id = record.id.split('|')[1]
        yield record

SeqIO.write(yield_records('/correctedfasta.fasta'), record.id + '.fasta', 'fasta')
Your code has almost everything that is needed. yield can also return more than one value, i.e. you could return both the filename and the record itself, e.g.

yield record.id.split('|')[0], record

but then Biopython would still bite you, because the id gets written to the FASTA header. You would therefore need to modify the id and overwrite the description (otherwise it gets concatenated to the id), or just assign identical values to both, as you did.
A simple solution would be:

from Bio import SeqIO

def split_record(record):
    # keep the part before the first | for the filename,
    # and make the rest the new id
    old_id = record.id.split('|')[0]
    record.id = '|'.join(record.id.split('|')[1:])
    record.description = ''
    return old_id, record

filename = 'multiline.fa'
for record in SeqIO.parse(filename, 'fasta'):
    record = split_record(record)
    with open(record[0] + '.fa', 'w') as f:
        SeqIO.write(record[1], f, 'fasta')
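For completeness, the yield-a-tuple idea mentioned above would look roughly like this (an untested sketch along the same lines, with the generator doing the splitting):

from Bio import SeqIO

def yield_named_records(in_file):
    for record in SeqIO.parse(in_file, 'fasta'):
        parts = record.id.split('|')
        record.id = '|'.join(parts[1:])
        record.description = ''  # otherwise it is concatenated to the id
        yield parts[0], record   # filename stem, record

for name, record in yield_named_records('multiline.fa'):
    with open(name + '.fasta', 'w') as f:
        SeqIO.write(record, f, 'fasta')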

Python: If a string is found, stop searching for that string, search for the next string, and output the matching strings

This code outputs the matching string once for every time it appears in the file being searched (so I end up with a huge list if the string is there repeatedly). I only want to know whether the strings from my list match, not how many times they match. I do want to know which strings match, so a True/False solution does not work, but I only want each matching string listed once. I do not really understand what the pattern = '|'.join(keywords) part is doing - I got that from someone else's code to get my file-to-file matching working, but I don't know if I need it. Your help would be much appreciated.
# imports the necessary libraries
import os, time, re, smtplib
from stat import *  # ST_SIZE etc.

# declares the files used
filenames = ['//Katie/Users/kitka/Documents/appreport.txt',
             '//Dallin/Users/dallin/Documents/appreport.txt',
             '//Aidan/Users/aidan/Documents/appreport.txt']

# parses each file
for filename in filenames:
    # finds the time the file was last modified and error checks
    try:
        st = os.stat(filename)
    except IOError:
        print("failed to get information about", filename)
    else:
        # creates a list of words to search for
        keywords = ['LoL', 'javaw']
        pattern = '|'.join(keywords)
        # searches the file for the strings in the list,
        # sorts the matches and collects the results
        results = []
        with open(filename, 'r') as f:
            for line in f:
                matches = re.findall(pattern, line)
                if matches:
                    results.append((line, len(matches)))
        results = sorted(results)
        # appends the results to the archive file
        with open("GameReport.txt", "a") as f:
            for line in results:
                f.write(filename + '\n')
                f.write(time.asctime(time.localtime(st[ST_MTIME])) + '\n')
                f.write(str(line) + '\n')
Untested, but this should work. Note that this only keeps track of which words were found, not which words were found in which files. I couldn't figure out whether or not that's what you wanted.
import fileinput

filenames = [...]
keywords = ['LoL', 'javaw']

# a set is like a list but with no duplicates, so even if a keyword
# is found multiple times, it will only appear once in the set
found = set()

# iterate over the lines of all the files
for line in fileinput.input(files=filenames):
    for keyword in keywords:
        if keyword in line:
            found.add(keyword)

print(found)
EDIT
If you want to keep track of which keywords are present in which files, then I'd suggest keeping a set of (filename, keyword) tuples:
filenames = [...]
keywords = ['LoL', 'javaw']
found = set()

for filename in filenames:
    with open(filename, 'rt') as f:
        for line in f:
            for keyword in keywords:
                if keyword in line:
                    found.add((filename, keyword))

for filename, keyword in found:
    print('Found the word "{}" in the file "{}"'.format(keyword, filename))
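If you would rather see the keywords grouped per file, a dict of sets is a small variation on the same idea (a sketch, untested against your report files):

from collections import defaultdict

filenames = [...]
keywords = ['LoL', 'javaw']

# map each filename to the set of keywords found in it
found = defaultdict(set)
for filename in filenames:
    with open(filename, 'rt') as f:
        for line in f:
            for keyword in keywords:
                if keyword in line:
                    found[filename].add(keyword)

for filename, words in sorted(found.items()):
    print('Found in "{}": {}'.format(filename, ', '.join(sorted(words))))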

How do I replace the 4th item in a list that is in a file that starts with a particular string?

I need to search for a name in a file and, in the line starting with that name, replace the fourth item in the list that is separated by commas. I have begun trying to program this with the following code, but I have not got it to work.
with open("SampleFile.txt", "r") as f:
newline=[]
for word in f.line():
newline.append(word.replace(str(String1), str(String2)))
with open("SampleFile.txt", "w") as f:
for line in newline :
f.writelines(line)
#this piece of code replaced every occurence of String1 with String 2
f = open("SampleFile.txt", "r")
for line in f:
if line.startswith(Name):
if line.contains(String1):
newline = line.replace(str(String1), str(String2))
#this came up with a syntax error
You could give some dummy data, which would help people to answer your question. I suggest you back up your data: you can save the edited data to a new file, or you can copy the old file to a backup folder before working on the data (think about using "from shutil import copyfile" and then "copyfile(src, dst)"). Otherwise, one mistake could easily ruin your data without any way to restore it.
You can't replace the string with "newline = line.replace(str(String1), str(String2))"! Think about "strong" as your search term and a line like "Armstrong,Paul,strong,44" - if you replace "strong" with "weak", you would get "Armweak,Paul,weak,44".
I hope the following code helps you:

filename = "SampleFile.txt"
filename_new = filename.replace(".", "_new.")
search_term = "Smith"

with open(filename) as src, open(filename_new, 'w') as dst:
    for line in src:
        if line.startswith(search_term):
            items = line.split(",")
            items[4-1] = items[4-1].replace("old", "new")  # 4th item has index 3
            line = ",".join(items)
        dst.write(line)
If you work with a CSV file, you should have a look at the csv module (a sketch follows the sample data below).
PS: My files contain the following data (the filenames are not in the files!):

SampleFile.txt:
Adams,George,m,old,34
Adams,Tracy,f,old,32
Smith,John,m,old,53
Man,Emily,w,old,44

SampleFile_new.txt:
Adams,George,m,old,34
Adams,Tracy,f,old,32
Smith,John,m,new,53
Man,Emily,w,old,44
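Here is the csv-module sketch promised above, run against the same sample data. csv.reader splits each line on the commas for you, so the fourth item is simply index 3:

import csv

search_term = "Smith"
with open("SampleFile.txt", newline="") as src, \
        open("SampleFile_new.txt", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        if row and row[0] == search_term:
            row[3] = row[3].replace("old", "new")  # 4th item is index 3
        writer.writerow(row)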

python csv format all rows to one line

I have a CSV file whose rows I would like to get into one column. I've tried importing it into MS Excel and formatting it with Notepad++, but with each try it treats a piece of data as a new row.
How can I format the file with Python's csv module so that it removes the string "BRAS" and corrects the format? Each row is found between double quotes " and the delimiter is a pipe |.
Update:
"aa|bb|cc|dd|
ee|ff"
"ba|bc|bd|be|
bf"
"ca|cb|cd|
ce|cf"
The above is supposed to be 3 rows; however, my editors see it as 5 or 6 rows, and so forth.
import csv
import fileinput

with open('ventoya.csv') as f, open('ventoya2.csv', 'w') as w:
    for line in f:
        if 'BRAS' not in line:
            w.write(line)
N.B. I get a Unicode error when trying to run this in Python:
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 18: character maps to <undefined>
This is a quick hack for small input files (the whole content is read into memory).

#!python2
fnameIn = 'ventoya.csv'
fnameOut = 'ventoya2.csv'

with open(fnameIn) as fin, open(fnameOut, 'w') as fout:
    data = fin.read()               # content of the input file
    data = data.replace('\n', '')   # make it one line
    data = data.replace('""', '|')  # split char instead of the doubled ""
    data = data.replace('"', '')    # remove the first and the last "
    print data
    for x in data.split('|'):       # split by bar
        fout.write(x + '\n')        # write to separate lines
Or, if the goal is only to fix the extra (unwanted) newlines to form a single-column CSV file, the file can be fixed first and then read through the csv module:

#!python2
import csv

fnameIn = 'ventoya.csv'
fnameFixed = 'ventoyaFixed.csv'
fnameOut = 'ventoya2.csv'

# Fix the input file.
with open(fnameIn) as fin, open(fnameFixed, 'w') as fout:
    data = fin.read()                  # content of the file
    data = data.replace('\n', '')      # remove the newlines
    data = data.replace('""', '"\n"')  # add the newlines back between the cells
    fout.write(data)

# It is overkill, but now the fixed file can be read
# using the csv module.
with open(fnameFixed, 'rb') as fin, open(fnameOut, 'wb') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for row in reader:
        writer.writerow(row)
You don't even need code to solve this:
1: Just open the file in Notepad++.
2: In the first line, select from the | symbol to the start of the next line.
3: Go to Replace and replace the selected text with |.
The search mode can be normal or extended :)
Well, since the line breaks are consistent, you could go in and do the find/replace as suggested, but you could also do a quick conversion with your Python script:
import csv
import fileinput

linecount = 0
with open('ventoya.csv') as f, open('ventoya2.csv', 'w') as w:
    for line in f:
        line = line.rstrip()
        # remove unwanted breaks by concatenating pairs of rows
        if linecount % 2 == 0:
            line1 = line
        else:
            full_line = line1 + line
            # remove spaces from the front of the 2nd half of the line
            full_line = full_line.replace(' ', '')
            # if you want comma delimiters, uncomment the next line:
            # full_line = full_line.replace('|', ',')
            if 'BRAS' not in full_line:
                w.write(full_line + '\n')
        linecount += 1
This works for me with the test data, and you can change the delimiters while writing to the file if you want. The nice thing about doing it with code is: 1. you can do it with code (always fun), and 2. you can remove the line breaks and filter the content written to the file at the same time.
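For what it's worth, csv.reader can also do the heavy lifting here: with the file opened using newline='', it treats each quoted blob that spans several physical lines as a single cell, so only the pipe-splitting is left to do. A sketch, assuming the file really is one quoted blob per record (pass the file's real encoding to open if you hit the UnicodeDecodeError from the question):

import csv

with open('ventoya.csv', newline='') as f, open('ventoya2.csv', 'w') as w:
    for row in csv.reader(f):
        for cell in row:
            if 'BRAS' in cell:
                continue
            # inside the quotes the real delimiter is the pipe
            for value in cell.replace('\n', '').split('|'):
                w.write(value + '\n')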
