searching elements of list in file - python-3.x

The list is named disks and it is shown below:
disks
['5000cca025884d5\n', '5000cca025a1ee6\n']
The file is named p and its contents are below:
c0t5000CCA025884D5Cd0 solaris
/scsi_vhci/disk@g5000cca025884d5c
c0t5000CCA025A1EE6Cd0
/scsi_vhci/disk@g5000cca025a1ee6c
c3t50060E8007DB981Ad1
/pci@400/pci@1/pci@0/pci@8/SUNW,emlxs@0/fp@0,0/ssd@w50060e8007db981a,1
c3t50060E8007DB981Ad2
/pci@400/pci@1/pci@0/pci@8/SUNW,emlxs@0/fp@0,0/ssd@w50060e8007db981a,2
c3t50060E8007DB981Ad3
/pci@400/pci@1/pci@0/pci@8/SUNW,emlxs@0/fp@0,0/ssd@w50060e8007db981a,3
c3t50060E8007DB981Ad4
I want to search for the elements of the list in the file.

There are a couple of things to look at here:
I haven't actually used re.match() before, but I can see the first issue: your list of disks has a newline character after every entry, and that will mess up matches. Also, re.match() only matches from the start of the string, and your disk IDs appear partway through each line, so you need to search within the line using re.search(). Finally, you should make the comparison case-insensitive; one option is to lowercase each line, since your disks list is already lowercase.
Try adapting your loop like so:
import re

# .strip() will get rid of the newline and .lower() will make the line lowercase
for line in q:
    if re.search(disks[0].strip(), line.lower()):
        print(line)
If that doesn't fix it, I would try making it print out disks[0].strip() and line for every iteration of the loop (not just when it matches the if clause) to make sure it's reading in what you think it is.
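For completeness, here is a minimal sketch of the whole search, looping over every disk against every line. The file name p and the disks list come from the question; everything else is an assumption, not your actual code:

import re

disks = ['5000cca025884d5\n', '5000cca025a1ee6\n']

with open("p") as q:  # "p" is the file from the question; q stands in for your open file object
    for line in q:
        for disk in disks:
            # strip the trailing newline and compare case-insensitively
            if re.search(disk.strip(), line.lower()):
                print(line.rstrip())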

Related

How to make it so my code remembers what it has written in a text file?

Hello, Python newbie here.
I have code that prints names into a text file. It takes the names from a website, and on that website the same name may appear multiple times. It filters the duplicates down to one name without an issue by checking whether the name has already been written to the text file. But when I run the code again, it ignores the names that are already in the text file; it only filters the names it has written during the same session. So my question is: how do I make it remember what it has written?
[image of the text file]
kaupan_nimi = driver.find_element_by_xpath("//span[@class='store_name']").text
with open("mainostetut_yritykset.txt", "r+") as tiedosto:
    if kaupan_nimi in tiedosto:
        print("\033[33mNimi oli jo tiedostossa\033[0m")
    else:
        print("\033[32mUusi asiakas vahvistettu!\033[0m")
        # Writes the company name to the text file
        tiedosto.seek(0)
        data = tiedosto.read(100)
        if len(data) > 0:
            tiedosto.write("\n")
        tiedosto.write(kaupan_nimi)
There is the code that I think is the problem. Please correct me if I am wrong.
There are two main issues with your current code.
The first is that you are likely only going to be able to detect duplicated names if they are back to back. That is, if the prior name that you're seeing again was the very last thing written into the file. That's because all the lines in the file except the last one will have newlines at the end of them, but your names do not have newlines. You're currently looking for an exact match for a name as a line, so you'll only ever have a chance to see that with the last line, since it doesn't have a newline yet. If the list of names you are processing is sorted, the duplicates will naturally be clumped together, but if you add in some other list of names later, it probably won't pick up exactly where the last list left off.
The second issue is that your code will tend to clobber anything that gets written more than 100 characters into the file: once the file starts filling up, every new name is written starting at that position, on top of whatever is already there.
Let's look at the different parts of your code:
if kaupan_nimi in tiedosto:
This is your duplicate check: it treats the file as an iterator and reads each line, checking whether kaupan_nimi is an exact match to any of them. This will fail for most of the lines in the file because they end with "\n" while kaupan_nimi does not.
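To see why the raw membership test misses names, here is a small demonstration with made-up file contents (names.txt and the names are assumptions, not your real data):

# names.txt is a throwaway example file, not your real data
with open("names.txt", "w") as f:
    f.write("Alice\nBob\n")

with open("names.txt") as f:
    print("Alice" in f)                             # False: the lines are "Alice\n" and "Bob\n"

with open("names.txt") as f:
    print("Alice" in (line.strip() for line in f))  # True once the newlines are stripped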
I would suggest instead reading the file only once per batch of names, and keeping a set of names in your program's memory that you can check your names-to-be-added against. This will be more efficient, and won't require repeated reading from the disk, or run into newline issues.
tiedosto.seek(0)
data = tiedosto.read(100)
if len(data) > 0:
    tiedosto.write("\n")
This code appears to be checking if the file is empty or not. However, it always leaves the file position just past character 100 (or at the end of the file if there were fewer than 100 characters in it so far). You can probably fit several names in that first 100 characters, but after that, you'll always end up with the names starting at index 100 and going on from there. This means you'll get names written on top of each other.
If you take my earlier advice and keep a set of known names, you can check whether that set is empty instead. That doesn't require doing anything to the file, so the file position can simply remain at the end the whole time. Another option is to always end every line in the file with a newline; then you never need to decide whether to prepend a newline based on whether the file is empty, because at the end of the file you are always starting a fresh line. Just follow each name with a newline and you'll always be doing the right thing.
Here's how I'd put things together:
# if possible, do this only once, at the start of the website reading procedure:
with open("mainostetut_yritykset.txt", "r+") as tiedosto:
    known_names = set(name.strip() for name in tiedosto)  # names already in the file

    # do the next parts in some kind of loop over the names you want to add
    for name in something():
        if name in known_names:  # duplicate found
            print("\033[33mNimi oli jo tiedostossa\033[0m")
        else:  # not a duplicate
            print("\033[32mUusi asiakas vahvistettu!\033[0m")
            tiedosto.write(name)  # write out the name
            tiedosto.write("\n")  # and always add a newline afterwards
            # alternatively, if you can't have a trailing newline at the end, use:
            # if known_names:
            #     tiedosto.write("\n")
            # tiedosto.write(name)
            known_names.add(name)  # update the set of names

Remove 1 or multiple lines with pattern match?

I'm trying to figure out how to edit a feeder txt file. Previously, I was able to accomplish this with Word's Replace function using wildcards, but the most recent feeder file seems to be too big to open in Word, so I'm having to find some other way to replace the text.
The file looks something like this:
VSTHDR|data|data|data|data
...
VSTPMTH|data|1|
CRDHLDR|data|data|data
ADDR|data|data|data
VSTPMTR|data|data
VSTPMTA|data|
VSTPMTA|data
VSTPMTH|data|2|
CRDHLDR|data|data|data
VSTPMTR|data|data
VSTPMTH|data|3|
VSTPMTR|data|data
VSTPMTA|data
...
VST...
...
ADDR|data|data|data
and repeat. For all but the last VSTPMTH, there is always a CRDHLDR line. Under CRDHLDR, there may or may not be an ADDR line. Then there is always a VSTPMTR. There may or may not be VSTPMTA lines. There will be more lines that start with VST before finally ending with another ADDR line before the next VSTHDR.
My goal is to remove all CRDHLDR lines, and any ADDR lines that immediately follow them. In Word, I was able to use replace all "CRDHLDR*VSTPMTR" with "VSTPMTR".
I thought I had it with
sed '/CRDHLDR/,/^[^V]/d'
but with that, if there wasn't an ADDR line immediately after, it would delete all of the VST lines following.
Another idea I had was to take any line that starts with ADDR and append it to the line before it, then go back through to delete any CRDHLDR lines, and then add a newline back in before any remaining ADDR. However, all the scripts I've found for combining lines seem to be restricted by a hold buffer size, which this file quickly exceeds. If you can think of a set of commands that reduces the buffer use, I'll happily try it.
The closest I've been able to come to a solution so far has been to run:
sed '/CRDHLDR/,/VSTPMTR/d'
but that removes VSTPMTR which I don't want to delete. If I could get that to delete all but the last line of that selection (instead of the whole selection), that would be PERFECT.
I haven't seen any grep or awk solutions that seem quite right, but I'm willing to try any suggestions.
I think I found a two-step answer:
sed '/CRDHLDR/,/VSTPMTR/ {/ADDR/d}'
sed '/CRDHLDR/d'
The first command removes ADDR lines that fall between CRDHLDR and VSTPMTR; the second then removes all CRDHLDR lines.
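If sed ever becomes awkward here, the same two-step idea can also be expressed as a single streaming pass in Python. This is only a sketch; the file names feeder.txt and feeder_clean.txt are assumptions:

# feeder.txt / feeder_clean.txt are assumed names, not from the question
with open("feeder.txt") as src, open("feeder_clean.txt", "w") as dst:
    previous_was_crdhldr = False
    for line in src:
        if line.startswith("CRDHLDR"):
            previous_was_crdhldr = True
            continue                      # drop the CRDHLDR line itself
        if previous_was_crdhldr and line.startswith("ADDR"):
            previous_was_crdhldr = False
            continue                      # drop the ADDR line directly under it
        previous_was_crdhldr = False
        dst.write(line)                   # keep everything else unchanged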

Notepad++ Search and Replace with Multiple Lines, Lookahead, Wildcard Issues?

I have a tricky problem. I need to make a minor change to a large number of xml files (500+). The change involves switching a value from 'false' to 'true.' The line that needs to change looks like this:
<VoltageIsMeasuredLineLine>false</VoltageIsMeasuredLineLine>
And it needs to become:
<VoltageIsMeasuredLineLine>true</VoltageIsMeasuredLineLine>
Unfortunately there are numerous instances of this set of tags in each file, so we can't do a simple find and replace. The thing that makes this set of tags unique is that they come some lines after:
<CID>STATIONNAME.BUS.STATIONNAME.DKV</CID>
However, each file has a different station name, so I used wildcards in place of the station names.
<CID>.*.BUS.*.DKV</CID>
So the code looks like this:
<CID>STATIONNAME.BUS.STATIONNAME.DKV</CID>
<tag>Some Number of Other lines</tag>
<tag>Some Number of Other lines</tag>
<tag>Some Number of Other lines</tag>
<VoltageIsMeasuredLineLine>false</VoltageIsMeasuredLineLine>
And other sections in the code look like:
<CID>STATIONNAME.COLR.STATIONNAME.FCLR</CID>
<tag>Some Number of Other lines</tag>
<tag>Some Number of Other lines</tag>
<tag>Some Number of Other lines</tag>
<VoltageIsMeasuredLineLine>false</VoltageIsMeasuredLineLine>
So I'm using the CID .BUS .DKV line as a starting point. Basically I need to change the first occurrence of the VoltageIsMeasuredLineLine line that comes directly AFTER the CID .BUS .DKV line. But there are a lot of other lines in between (none of which are consistent from file to file) that I don't care about and that are messing up my search.
It was suggested that I try a lookahead, but it did not work. This is the code I was told to try:
(?!<CID>.*.BUS.*.DKV</CID>(.*?)<VoltageIsMeasuredLineLine>false</VoltageIsMeasuredLineLine>
However, that line also returns the lines without .BUS and .DKV, which are the really important factors in determining this section's uniqueness. How can I modify this lookahead so that it only returns sections that have .BUS and .DKV in the CID part?
Another idea I had was to select everything in between the CID and Voltage parts, keep the selections in groups, and then print the first two groups as-is, and replace the third. Like this:
(<CID>.*.BUS.*.DKV</CID>)(.*)(<VoltageIsMeasuredLineLine>false</VoltageIsMeasuredLineLine>)
And replace with
\1\2<VoltageIsMeasuredLineLine>true</VoltageIsMeasuredLineLine>
But something is still wrong with the CID part. I'm sure these wildcards are part of the problem but I've hit a wall. Any help appreciated!
Try the following in Notepad++ (version >= 6.0) with Replace.
Activate the option ". matches newline" and
set in Find what:
(<CID>[A-Za-z\.]*BUS[A-Za-z\.]*</CID>.*?<VoltageIsMeasuredLineLine>)false
and in Replace with:
\1true
The assumption is that every STATIONNAME.BUS.STATIONNAME.DKV has one corresponding VoltageIsMeasuredLineLine (as I read from your question).
The trick is to use a non-greedy search: the .*? stops at the first VoltageIsMeasuredLineLine that follows the CID line.
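Since there are 500+ files, here is a hedged Python sketch that applies the same regex across a whole directory; the stations/*.xml glob and the in-place rewrite are assumptions, so adjust paths to your setup:

import glob
import re

# same pattern as the Notepad++ answer above
pattern = re.compile(
    r'(<CID>[A-Za-z\.]*BUS[A-Za-z\.]*</CID>.*?<VoltageIsMeasuredLineLine>)false',
    re.DOTALL,  # Python's equivalent of Notepad++'s ". matches newline" option
)

for path in glob.glob("stations/*.xml"):  # assumed location of the 500+ files
    with open(path, encoding="utf-8") as f:
        text = f.read()
    new_text = pattern.sub(r"\1true", text)
    if new_text != text:
        with open(path, "w", encoding="utf-8") as f:
            f.write(new_text)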

Delete text with GREP in Textwrangler

I have the following source code from the Wikipedia page of a list of Games. I need to grab the name of the game from the source, which is located within the title attribute, as follows:
<td><i>007: Quantum of Solace</i><sup id="cite_ref-4" class="reference"><span>[</span>4<span>]</span></sup></td>
As you can see above, in the title attribute there's a string. I need to use GREP to search through every single line for when that occurs, and remove everything excluding:
title="Game name"
I have the following (in TextWrangler) which returns every single occurrence:
title="(.*)"
How can I now set it to remove everything surrounding that, while making sure it keeps either the string alone or title="string"?
I use a multi-step method to process these kinds of files.
First you want to have only one HTML tag per line; GREP works on each line, so you want to minimise the need for complicated patterns. I usually replace all > with >\n.
Then you want to develop a pattern for each occurrence of the item you want, in this case title=".*?". Put that in between parentheses (). Then you want to add some filling to that statement to find and replace all occurrences of this pattern: .*?(title=".*?").*
Replace everything that matches .*?(title=".*?").* with \1
Finally, make smart use of the Textwrangler function process lines containing, to filter any remaining rubbish.
Notes
The \1 refers to the first match captured between (); you can also reorder stuff using multiple parentheses and use something like (.*?), (.*) with \2, \1 to shuffle columns.
Learn how to do lazy regular expressions. The use of ? in these patterns is very powerful. Basically, ? makes the pattern stop at the next occurrence of the following part of the pattern, rather than the last one.
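As a quick illustration of the lazy-versus-greedy difference (the sample line below is made up, not from the Wikipedia source):

import re

line = '<td title="007: Quantum of Solace" class="x">text</td>'  # made-up sample line

print(re.search(r'title=".*"', line).group())   # greedy: title="007: Quantum of Solace" class="x"
print(re.search(r'title=".*?"', line).group())  # lazy:   title="007: Quantum of Solace"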
I've figured this problem out, it was quite simple. Instead of retrieving the content in the title attribute, I'd retrieve the page name.
To ensure I only matched the correct line where the content was, I used the following string for searching the code:
(.)/wiki/(.)"
Returning \2
After that, I simply remove any cases where there is HTML code:
<(.*)
Returning ''
Finally, I'll remove the remaining content after the page name:
"(.*)
Returning ''
A bit of cleaning up of the spacing and I have a list of all the game names.
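For reference, here is a rough Python equivalent of this multi-step approach, pulling out whatever sits in the title attribute; the input file name games_source.html is an assumption:

import re

names = []
with open("games_source.html", encoding="utf-8") as f:  # assumed file name for the saved page source
    for line in f:
        # grab the text of the title="..." attribute where present
        match = re.search(r'title="(.*?)"', line)
        if match:
            names.append(match.group(1))

print("\n".join(names))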

How do you delete everything but a specific pattern in Vim?

I have an XML file where I only care about the size attribute of a certain element.
First I used
global!/<proto name="geninfo"/d
to delete all lines that I don't care about. That leaves me a whole bunch of lines that look like this:
<proto name="geninfo" pos="0" showname="General information" size="174">
I want to delete everything but the value for "size."
My plan was to use substitute to get rid of everything not matching 'size="[digit]"', then remove the string 'size' and the quotes, but I can't figure out how to substitute the negation of a string.
Any idea how to do it, or ideas on a better way to achieve this? Basically I want to end up with a file with one number (the size) per line.
You can use matching groups:
:%s/^.*size="\([0-9]*\)".*$/\1/
This will replace lines that contain size="N" by just N and not touch other lines.
Explanation: this looks for a line that contains some random characters, then somewhere the string size=", then digits, then ", then some more random characters, then the end of the line. What I did is wrap the digits in (escaped) parentheses. That creates a group. In the second part of the search-and-replace command, I essentially say "I want to replace the whole line with just the contents of that first group" (referred to as \1).
:v:size="\d\+":d|%s:.*size="\([^"]\+\)".*:\1:
The first command (up to the |) deletes every line which does not match the size="<SOMEDIGIT(S)>" pattern; the second (the %s...) removes everything before and after the size attribute's value (the surrounding quotes are removed as well).
HTH
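If you ever want to do the same extraction outside Vim, here is a hedged Python sketch; the file name capture.xml is an assumption:

import re

with open("capture.xml", encoding="utf-8") as f:  # assumed file name for the XML export
    for line in f:
        match = re.search(r'<proto name="geninfo".*?size="([0-9]+)"', line)
        if match:
            print(match.group(1))  # one size value per line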
