I am new to regex and Python. I have to find a keyword in a text file, and after finding the string I have to extract the only number from it. But the number is printed 6 times, and I only need the first occurrence, stored in a variable as an integer. Here is my full code. The string I am looking for in the .txt file is "Lost\n7", and the number I want from this string is 7.
import re

with open('test.txt') as f:
    for line in f:
        # Capture one-or-more characters of non-whitespace after the initial match
        # rsrp = re.search(r'RSRP:(\S+)', line)
        packet_loss_search = re.search(r'Lost(\S+)', line)
        # Did we find a match?
        if packet_loss_search:
            # Yes, process it
            details = packet_loss_search.group(0)
            a = str(details)
            #a = a[-1]
            #print(a)
            temp = re.findall(r'\d+', a)
            res = list(map(int, temp))
            print(res[0])
OUTPUT:
7
7
7
7
7
7
I'd suggest reading the file into memory as a single string, since your expected match spans multiple lines. You could fix the code by replacing it with:
import re

with open('test.txt', 'r') as f:
    m = re.search(r'Lost\n(\d+)', f.read())
    if m:  # Check if there is a match
        print(m.group(1))
Here, f.read() will read the file contents into a single string, and Lost\n(\d+) will match and capture into Group 1 any one or more digits after Lost + a newline char.
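Since you mentioned you want the result stored as an integer, you can convert the captured group with int(); a small sketch along the same lines:

import re

with open('test.txt', 'r') as f:
    m = re.search(r'Lost\n(\d+)', f.read())

# The captured digits as an int; None when the pattern is absent
packet_loss = int(m.group(1)) if m else None
print(packet_loss)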
I am writing a Python (version 3) program to count strings in a specified field within each line of one or more csv files.
Where the csv file contains:
Field1, Field2, Field3, Field4
A, B, C, D
A, E, F, G
Z, E, C, D
Z, W, C, Q
the script is executed, for example:
$ ./script.py 1,2,3,4 file.csv
And the result is:
A 10
C 7
D 2
E 2
Z 2
B 1
Q 1
F 1
G 1
W 1
ERROR
the script is executed, for example:
$ ./script.py 1,2,3,4 file.csv file.csv file.csv
Where the error occurs:
for rowitem in reader:
    for pos in field:
        pos = rowitem[pos]  ##<---LINE generating error--->##
        if pos not in fieldcnt:
            fieldcnt[pos] = 1
        else:
            fieldcnt[pos] += 1
TypeError: list indices must be integers or slices, not str
Thank you!
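A note on the traceback: list indices must be integers, and the positions parsed from the command-line string 1,2,3,4 stay strings until they are converted. A minimal, hedged sketch of the likely fix, assuming field is built from sys.argv[1] (the script's actual argument parsing isn't shown):

import sys

# "1,2,3,4" -> [0, 1, 2, 3]; csv.reader rows are lists, so each
# position must be an int (the -1 assumes 1-based positions on the CLI)
field = [int(pos) - 1 for pos in sys.argv[1].split(",")]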
Judging from the output, I'd say that the fields in the csv file do not influence the count of the string. If string uniqueness is case-insensitive, remember to use yourstring.lower() so that different-case matches are actually counted as one. Also keep in mind that if your text is large, the number of unique strings could be very large as well, so some sort of sorting must be in place to make sense of it (otherwise it might be a long list of random counts, with a large portion of it being just 1s).
Now, to get a count of unique strings, the collections module is an easy way to go.
file = open('yourfile.txt', encoding="utf8")
a = file.read()

# if you have some words you'd like to exclude
stopwords = set(line.strip() for line in open('stopwords.txt'))
stopwords = stopwords.union(set(['<media', 'omitted>', "it's", 'two', 'said']))

# make an empty key-value dict to contain matched words and their counts
wordcount = {}
for word in a.lower().split():  # use the delimiter you want (a comma, I think?)
    # strip punctuation so it isn't counted as part of a word
    word = word.replace(".", "")
    word = word.replace(",", "")
    word = word.replace("\"", "")
    word = word.replace("!", "")
    if word not in stopwords:
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
That should do it. The wordcount dict will contain each word and its frequency. After that, just sort it using collections and print it out.
import collections

word_counter = collections.Counter(wordcount)
for word, count in word_counter.most_common(20):
    print(word, ": ", count)
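As a side note, collections.Counter can also do the counting itself in one step, skipping the manual dict; a minimal sketch, using a stand-in word list:

import collections

words = ['hello', 'world', 'hello']        # stands in for the cleaned, filtered words
word_counter = collections.Counter(words)
print(word_counter.most_common(2))         # [('hello', 2), ('world', 1)]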
I hope this solves your problem. Let me know if you run into issues.
Hi, I've got a problem set in CS50 and I'm having difficulties, as this is my first week in Python. I would appreciate it if you don't directly write an open answer but instead point me to the right functions or methods to use.
We've been given a long string sequence in a .txt file, one line and no white spaces. I have to find the longest consecutive run of a given DNA string.
example txt:
GGAGGCCAAAGTCTTGTGATATCGGGCAACTCCCCGGGAGGAACACAGGCCCACCGAAAACAGCTTGAAATGGGAAACGTTCCCGATCTACGCCGGGCCAGAGG
The original text is around 5000 characters, but it looks like the example above. My task is to find the longest consecutive run of the 'AGATC' string.
Let's say the first consecutive run repeats 23 times; if, as I keep reading, I find another consecutive run of 34 repetitions, I have to store the bigger number.
My problem is not finding a way to read and analyse a string in this way. I can read a string and find the total number of repetitions and so on, but finding the longest run hasn't made sense any way I've tried. I thought C was hard, but I could write this code in C so easily, as we can manipulate strings in so many ways in C. At least in C there are ways to read a given size at a time, but as far as I can see, Python reads everything at once and there is no control over the read. At my level of knowledge, it doesn't seem you can do much about that :/ Probably Python has one-line solutions for this; please don't judge, this is my 3rd day and 4th program in Python.
What functions or methods should I look at to analyze a string in this way? I've watched videos on a similar problem, but for a run of a single character, not a string. I also bought Python Crash Course to learn about string manipulation but couldn't find anything related to this case. I also checked the Python documentation, but it's obviously far too complicated for day 3 in Python.
Could anyone help me, please? TIA.
Here is my not-working and not-making-sense code:
import csv
import sys

# check the arguments count
if len(sys.argv) != 3:
    print("Usage: python dna.py data.csv sequence.txt")
    sys.exit(1)

# create a dictionary to store STR results
SEQ = {
    "AGATC": 0,
    "AATG": 0,
    "TATC": 0
}

counter = 0  # keeps the length of the sequence
seq = 0      # keeps the longest sequence
DNA = ''     # keeps the key of SEQ, "AGATC" etc.

# find the longest consecutive sequence of DNA
def findSEQ(file, DNA):  # get the sequences text file and the string of the key as parameters
    for DNA in (DNA, file):
        if file[i:i + len(DNA)] == DNA:  # if we find a match
            counter += 1  # count up the sequence
        else:
            if counter > seq:  # if the next thing it reads is not a sequence
                seq = counter
                counter = 0
    return seq
    seq = 0

# open sequence file and read
with open(sys.argv[2], 'r') as file:
    reader = csv.reader(file)
    # find the longest sequence of AGATC
    findSEQ("AGATC", file)
    # update the seq dictionary
    SEQ["AGATC"] = seq
    # find the longest sequence of AATG
    findSEQ(file, "AATG")
    # update the seq dictionary
    SEQ["AATG"] = seq
    # find the longest sequence of TATC
    findSEQ(file, "TATC")
    # update the seq dictionary
    SEQ["TATC"] = seq

# open and read database
with open(sys.argv[1], "r") as file:
    reader = csv.reader(file)
    # skip the first row
    next(reader)
    # compare the seq dictionary results with database
    for row in reader:
        seq1, seq2, seq3 = row[1], row[2], row[3]
        # if found any match print the name
        if SEQ[seq1] == row[1] and SEQ[seq2] == row[2] and SEQ[seq3] == row[3]:
            print(row[0])
        # otherwise print not found
        else:
            print("Not found any match.")
To elaborate on my comment, please find the following example:
import re

text = 'GGAGGCCAAGATCAAGTCTTGTGATATCGGGCAACTCCCCGGGAAGATCAGATCAGATCGGAACACAGGCCCACCGAAAACAGCTTGAAGATCAATGGGAAACGTTCCCGATCTACGCCGGGCCAGAGG'
sequence = 'AGATC'
pattern = f'(?:{sequence})+'

findings = sorted(re.findall(pattern, text), key=len)
longest_sequence = len(findings[-1]) // len(sequence)  # integer count of repeats
print(f'longest sequence: {longest_sequence}')
This program uses a regex (regular expression) to find runs of the pattern you're looking for. It then sorts the findings by length in ascending order, so the longest run ends up at the last index of the list.
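As an alternative sketch of the same idea: max() with key=len avoids sorting the whole list, and its default argument guards against the IndexError you would otherwise get when there is no match at all (the text here is a made-up toy example):

import re

text = 'GGAAGATCAGATCGG'  # hypothetical toy input
sequence = 'AGATC'
pattern = f'(?:{sequence})+'

# default='' means "no match found" yields a run of length 0
longest_run = max(re.findall(pattern, text), key=len, default='')
print(len(longest_run) // len(sequence))  # -> 2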
I am trying to convert a file.txt into a dictionary. I know that if the delimiter is only used one time, the code is as follows:
dict = {}
with open('file.txt') as input_file:
    for line in input_file:
        entry = line.split(":")
        dict[entry[0].strip()] = entry[1].strip()
However, how do you turn an input file into a dictionary with no clear delimiter?
file.txt:
cats****5
doggie**6
ox******7
output:
dict = {'cats':5, 'doggie':6, 'ox':7}
Thank you for your help :)
You can simply split on your delimiter as before, but take the first and last fields:
for line in input_file:
    entry = line.split("*")
    dict[entry[0].strip()] = entry[-1].strip()
Negative indices fetch elements from the back of the list - the index -1 is the last element, -2 is the second-to-last element, and so on.
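This works here because splitting on a repeated delimiter yields empty strings for the inner fields, so only the first and last entries carry data. For example:

entry = "cats****5".split("*")
print(entry)                # ['cats', '', '', '', '5']
print(entry[0], entry[-1])  # cats 5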
You can also use unpacking, which allows for self-documenting variable naming:
for line in input_file:
    key, *_, value = line.split("*")
    dict[key.strip()] = value.strip()
Here, *_ consumes an arbitrary number of values - but not the first or last, since key and value are before and after it and both consume exactly one value. The symbol * denotes the arbitrary size, while _ is a regular name that is just conventionally used for unused values.
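Concretely, with one of the sample lines:

key, *_, value = "doggie**6".split("*")
print(key, _, value)  # doggie [''] 6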
If your delimiter also appears in the value, splitting is not robust. Use a regular expression to define the grammar of your delimiter, and capture key and value. For example, if your delimiter is . and you expect float values, the following works:
import re

# ^(.+?)  capture the shortest possible run of any characters (the key)
# \.+     the longest run of the delimiter character
# (.+?)$  capture the shortest possible run of any characters (the value)
kv_pattern = re.compile(r'^(.+?)\.+(.+?)$')

data = {}
input_data = ['cats....5.0', 'doggie...6', 'ox.......7.']
for line in input_data:
    key, value = kv_pattern.match(line).groups()
    data[key.strip()] = value.strip()
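The question's expected output stores numeric values rather than strings. A small, hedged sketch of that conversion (the to_number helper is mine, assuming every value parses as an int or a float):

def to_number(text):
    """Return an int when possible, otherwise a float."""
    try:
        return int(text)
    except ValueError:
        return float(text)

data = {}
for line in ['cats****5', 'doggie**6', 'ox******7']:
    key, *_, value = line.split("*")
    data[key.strip()] = to_number(value.strip())

print(data)  # {'cats': 5, 'doggie': 6, 'ox': 7}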
I just want to append strings based on a condition. For example, strings starting with http won't be appended, but every other string with a length of 40 will be.
words = []
store1 = []
disregard = ["http", "gen"]

for all in glob.glob(r'MYDIR'):
    with open(all, "r", encoding="utf-16") as f:
        text = f.read()
        lines = text.split("\n")
        for each in lines:
            words += each.split()
        for each in words:
            if len(each) == 40 and each not in disregard:
                store1.append(each)
Update:
if disregard[0] not in each:
works, but how can I compare against all the contents of my list? Using disregard by itself doesn't work.
Here is my input text file:
http://1234ashajkhdajkhdajkhdjkaaaaaaad1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
genp://1234ashajkhdajkhdajkhdjkaaaaaaad1
a\a
The only string that should be appended is "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa".
I think the answer should depend on the number of words you want to disregard.
It's important to define what word means. If the word ends with spaces, should they all be stripped?
One solution could be to create a regular expression from all your words and use that to match the line.
import glob
import re

disregard = ["http", "gen"]
pattern = "|".join([re.escape(w) for w in disregard])

for all in glob.glob(r'MYDIR/*'):
    with open(all, "r", encoding="utf-16") as f:
        matched_words = []
        for line in f:
            line = line.rstrip("\n")
            if len(line) == 40 and not re.match(pattern, line):
                matched_words.append(line)
        print(matched_words)
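As an alternative sketch (not part of the original answer): str.startswith also accepts a tuple of prefixes, which sidesteps building and escaping a regex for this particular check:

disregard = ("http", "gen")

lines = ["http://example", "a" * 40, "genp://example"]
matched_words = [line for line in lines
                 if len(line) == 40 and not line.startswith(disregard)]
print(matched_words)  # only the 40-character 'a...a' line survives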
The basic structure looks OK; it seems the place where it's breaking is in setting up incorrect conditionals. You say you want to check whether each line starts with the supplied strings, but then you split each line and check for the existence of those strings. Use .startswith() instead. This also means there doesn't have to be a space after "http" for that string to be caught.
Also, either the conditional testing should be placed after the loop that builds the words list, or the words list should be reset at the start of each loop so you're not re-testing words you've already checked.
import glob

# adjusted some variable names for clarity
words = []
output = []
disregard = ["http", "gen"]

for fname in glob.glob(r'MYDIR'):
    with open(fname, "r", encoding="utf-16") as f:
        text = f.read()
    lines = text.split("\n")
    for line in lines:
        words += line.split()

for word in words:
    if len(word) == 40 and not any(word.startswith(dis) for dis in disregard):
        output.append(word)
I want to search for a particular keyword in a .json file and print the 10 lines above and below the line in which the keyword is present.
Note: the keyword might be present more than once in the file.
So far I have made this:
import sys
from collections import deque
from itertools import chain, islice

with open('loggy.json', 'r') as f:
    last_lines = deque(maxlen=5)
    for ln, line in enumerate(f):
        if "out_of_memory" in line:
            print(ln)
            sys.stdout.writelines(chain(last_lines, [line], islice(f, 5)))
            last_lines.append(line)
            print("Next Error")
print("No More Errors")
The problem with this is that the number of times it prints the keyword-containing line is equal to the number of times the keyword has been found.
It also only prints 5 lines below it, whereas I want it to print five lines above it as well.
If the json file was misused to store a really large amount of information, then processing it on the fly may be better. In that case, keep the history lines in a list that is shortened whenever it grows above a given limit. Then use a counter that indicates how many lines must still be displayed after observing a problem:
#!python3

def print_around_pattern(pattern, fname, numlines=10):
    """Prints the lines with the pattern from the fname text file.

    The pattern is a string, numlines is the number of lines printed before
    and after the line with the pattern (with the default value 10).
    """
    history = []
    cnt = 0
    with open(fname, encoding='utf8') as fin:
        for n, line in enumerate(fin):
            history.append(line)             # append the line
            history = history[-numlines-1:]  # keep only the tail, including the last line

            if pattern in line:
                # Print the separator and the history lines, including the pattern line.
                print('\n{!r} at line {} ----------------------------'.format(
                    pattern, n + 1))
                for i, h in enumerate(history):
                    # n + 1 is the current line's 1-based number; count backwards
                    # from it to number the buffered lines correctly
                    print('{:03d}: {}'.format(n - len(history) + i + 2, h), end='')
                cnt = numlines               # set the counter for the next lines
            elif cnt > 0:
                # The counter indicates we want to see this line.
                print('{:03d}: {}'.format(n + 1, line), end='')
                cnt -= 1                     # decrement the counter


if __name__ == '__main__':
    print_around_pattern('out_of_memory', 'loggy.json')
    ##print_around_pattern('out_of_memory', 'loggy.json', 3)  # three lines before and after
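Since the question already reaches for collections.deque, the manual list slicing above could also be replaced by a bounded deque; a minimal sketch of the same buffering idea (without the line-number prefixes):

from collections import deque

def print_around_pattern(pattern, fname, numlines=10):
    history = deque(maxlen=numlines + 1)  # current line plus up to numlines before it
    cnt = 0
    with open(fname, encoding='utf8') as fin:
        for line in fin:
            history.append(line)          # the deque drops old lines automatically
            if pattern in line:
                print(''.join(history), end='')
                cnt = numlines            # show the next numlines lines too
            elif cnt > 0:
                print(line, end='')
                cnt -= 1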