Could someone tell me what the "line" number represents in the output of a cmp command? I ask this because, first, I can't find it explained anywhere. Second, I am getting results comparing a set of files where the "char" outputs are identical (as expected) but the "line" outputs differ wildly.
The "line" outputs reflect the number of newline characters seen prior to that point in the file.
For a file which is not in a textual format, the "line" output is not likely to be meaningful, and can be ignored; for a file which is in textual format, the line number returned could be used in a text editor to navigate to the area with the difference.
Per the POSIX spec for cmp:
For files which are not text files, line numbers simply reflect the presence of a <newline>, without any implication that the file is organized into lines.
Because by default cmp prints only the first difference seen, the line numbers between both files are guaranteed to be identical at that point. When passed -l, cmp continues beyond the first difference -- but no longer prints line numbers, thus avoiding any ambiguity as to which file's line number count is canonical.
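As a rough illustration of the counting rule described above (a sketch only, not how any particular cmp implementation is written), the following Python function reports the 1-based byte offset of the first differing byte, together with a line number computed as one plus the number of newline bytes seen before it. The name `first_difference` and the sample data are made up for this example, and the case where one file is a prefix of the other is left out.

```python
def first_difference(a: bytes, b: bytes):
    """Return (byte_offset, line_number) of the first differing byte, both 1-based.

    Returns None when no byte differs within the common length (the
    "EOF on shorter file" case is deliberately left out of this sketch).
    """
    line = 1
    for offset, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return offset, line
        if x == 0x0A:  # a newline seen before the difference bumps the line count
            line += 1
    return None


# Two pairs that differ at the same byte offset but have different numbers of
# newline bytes before it, as can easily happen with non-text data.
pair_1 = first_difference(b"aaaaaX", b"aaaaaY")
pair_2 = first_difference(b"\n\n\n\n\nX", b"\n\n\n\n\nY")
print(pair_1)
print(pair_2)
```

Both pairs report the same byte offset, but the second reports a much larger line number simply because more newline bytes precede the difference.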
Hello, Python newbie here.
I have code that prints names into a text file. It takes the names from a website, and on that website the same name may appear multiple times. Within a single run it filters the duplicates down to one name without an issue, by checking whether the name has already been written to the text file. But when I run the code again, it ignores the names that are already in the text file; it only filters out the names it has written in the same session. So my question is: how do I make it remember what it has written?
kaupan_nimi = driver.find_element_by_xpath("//span[@class='store_name']").text
with open("mainostetut_yritykset.txt", "r+") as tiedosto:
    if kaupan_nimi in tiedosto:
        print("\033[33mNimi oli jo tiedostossa\033[0m")  # "Name was already in the file"
    else:
        print("\033[32mUusi asiakas vahvistettu!\033[0m")  # "New customer confirmed!"
        # Writes the company name to the text file
        tiedosto.seek(0)
        data = tiedosto.read(100)
        if len(data) > 0:
            tiedosto.write("\n")
        tiedosto.write(kaupan_nimi)
There is the code that I think is the problem. Please correct me if I am wrong.
There are two main issues with your current code.
The first is that you are likely only going to be able to detect duplicated names if they are back to back. That is, if the prior name that you're seeing again was the very last thing written into the file. That's because all the lines in the file except the last one will have newlines at the end of them, but your names do not have newlines. You're currently looking for an exact match for a name as a line, so you'll only ever have a chance to see that with the last line, since it doesn't have a newline yet. If the list of names you are processing is sorted, the duplicates will naturally be clumped together, but if you add in some other list of names later, it probably won't pick up exactly where the last list left off.
The second issue in your code is that it will tend to clobber anything that gets written more than 100 characters into the file, starting every new line at that point, once it starts filling up a bit.
Let's look at the different parts of your code:
if kaupan_nimi in tiedosto:
This is your duplicate check. It treats the file as an iterator and reads each line, checking whether kaupan_nimi is an exact match for any of them. This will fail for every line that ends with "\n" (that is, every line except possibly the last), because kaupan_nimi does not end with a newline.
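To see the newline mismatch in isolation, here is a minimal sketch that uses an in-memory io.StringIO as a stand-in for the real file. The store names and the helper names `naive_check` and `stripped_check` are made up for the illustration.

```python
import io

# Stand-in for the names file: three made-up names, the last without a trailing newline.
CONTENT = "Kauppa A\nKauppa B\nKauppa C"


def naive_check(name: str) -> bool:
    # Same kind of test as `kaupan_nimi in tiedosto`: `in` iterates over the
    # file line by line and compares the name against each whole line.
    return name in io.StringIO(CONTENT)


def stripped_check(name: str) -> bool:
    # Strip the line endings first, then test membership in a set.
    return name in {line.strip() for line in io.StringIO(CONTENT)}


print(list(io.StringIO(CONTENT)))  # shows that all lines but the last keep their "\n"
print(naive_check("Kauppa A"), naive_check("Kauppa C"))
print(stripped_check("Kauppa A"))
```

Only the last name is found by the naive check, because it is the only line without a trailing newline.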
I would suggest instead reading the file only once per batch of names, and keeping a set of names in your program's memory that you can check your names-to-be-added against. This will be more efficient, and won't require repeated reading from the disk, or run into newline issues.
tiedosto.seek(0)
data = tiedosto.read(100)
if len(data) > 0:
    tiedosto.write("\n")
This code appears to be checking if the file is empty or not. However, it always leaves the file position just past character 100 (or at the end of the file if there were fewer than 100 characters in it so far). You can probably fit several names in that first 100 characters, but after that, you'll always end up with the names starting at index 100 and going on from there. This means you'll get names written on top of each other.
If you take my earlier advice and keep a set of known names, you could check that set to see if it is empty or not. This doesn't require doing anything to the file, so the position you're operating on it can remain at the end all of the time. Another option is to always end every line in the file with a newline so that you don't need to worry about whether to prepend a newline only if the file isn't empty, since you know that at the end of the file you'll always be writing a fresh line. Just follow each name with a newline and you'll always be doing the right thing.
Here's how I'd put things together:
# if possible, do this only once, at the start of the website reading procedure:
with open("mainostetut_yritykset.txt", "r+") as tiedosto:
    known_names = set(name.strip() for name in tiedosto) # names already in the file
    # do the next parts in some kind of loop over the names you want to add
    for kaupan_nimi in something(): # something() is a placeholder for your source of names
        if kaupan_nimi in known_names: # duplicate found
            print("\033[33mNimi oli jo tiedostossa\033[0m")
        else: # not a duplicate
            print("\033[32mUusi asiakas vahvistettu!\033[0m")
            tiedosto.write(kaupan_nimi) # write out the name
            tiedosto.write("\n") # and always add a newline afterwards
            # alternatively, if you can't have a trailing newline at the end, use:
            # if known_names:
            #     tiedosto.write("\n")
            # tiedosto.write(kaupan_nimi)
            known_names.add(kaupan_nimi) # update the set of names
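For something you can run without a website, here is a self-contained variation on the same idea. It is only a sketch: the function name `add_new_names`, the temporary path and the sample names are all made up, and it reads the file in "r" mode and then reopens it in "a" mode instead of using "r+", so that writes always land at the end. It also assumes every line already in the file ends with a newline.

```python
import os
import tempfile


def add_new_names(path, names):
    """Append names that are not yet in the file; return the names actually added."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            known = {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        known = set()  # first run: nothing has been written yet

    added = []
    with open(path, "a", encoding="utf-8") as f:  # append mode always writes at the end
        for name in names:
            if name not in known:
                f.write(name + "\n")  # every name is followed by a newline
                known.add(name)
                added.append(name)
    return added


# Two "sessions" against the same file; the second one remembers the first.
path = os.path.join(tempfile.mkdtemp(), "mainostetut_yritykset.txt")
first_run = add_new_names(path, ["Kauppa A", "Kauppa B", "Kauppa A"])
second_run = add_new_names(path, ["Kauppa B", "Kauppa C"])
print(first_run, second_run)
```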
I would like to know how can I check if a file is empty in a VERILOG or SYSTEMVERILOG testbench.
I have 2 ideas:
Check the file size using $system() task, and put there a linux command which can tell the number of bits or bytes.
Read the first line using $fgets. If the result is 0, it means the file is empty.
About the first method, I couldn't get a linux command which tells me just the number. I've tried for example ls -l and wc -c, but they give me much more than the number of bits.
About the second method, I really don't know how to read a specific line, in this case, it would be the first line of the file.
Assuming you've already checked the result from $fopen to see if the file exists, you can use $fgetc or $fgets and see if it returns a code less than 1. But the best option depends on what you plan to do with the file after finding it empty or non-empty.
I would like to see the actual file contents without it being formatted to print. For example, to show:
\n0.032,170\n0.034,290
Instead of:
0.032,170
0.34,290
Is there a command to echo the file's actual data in bash? I've tried using head, cat, more, etc. but all those seem to echo the "print-formatted" text. For example:
$ cat example.csv
0.032,170
0.34,290
How can I print the actual characters within the file?
This reads as if you misunderstand what the "actual characters in the file" are. You will not find the characters \ and n in that file, only a line feed, which is a single specific character. So utilities like cat do in fact output exactly the characters in the file.
Putting it the other way around: if you really had those two characters literally in the file, then a utility like cat would actually output them. I just checked that, just to be sure.
You can easily check that yourself if you open the file in a hex editor. There you will see the byte 0A (decimal 10), which is a line feed character. You will not see the pair of characters \ and n anywhere in that file.
Many programming languages and also shell environments use escape sequences like \n in string definitions and identify those as control characters which would not be typable otherwise. So maybe that is where your impression comes from that your files should contain those two characters.
To display newlines as \n, you might try:
awk 1 ORS='\\n' input-file
This is not the "actual characters in the file", as \n is merely a conventional method of displaying a newline, but this does seem to be what you want.
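The same distinction can be seen from Python. This is a sketch with made-up sample data written to a temporary file: reading the file in binary mode shows a single byte with value 10 per line ending and no backslash byte at all, while repr() renders that byte as the two-character escape \n purely for display.

```python
import os
import tempfile

# Write sample data in binary mode so no newline translation happens.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(path, "wb") as f:
    f.write(b"0.032,170\n0.34,290\n")

with open(path, "rb") as f:
    data = f.read()

has_line_feed = 0x0A in data   # the line feed byte is really in the file
has_backslash = b"\\" in data  # a literal backslash is not
shown = repr(data.decode("ascii"))

print(has_line_feed, has_backslash)
print(shown)  # the escape sequences appear only in this display form
```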
I tried reading :help errorformat and googling (mostly stackoverflow), but can't understand some of the patterns mentioned there:
%s - "specifies the text to search for to locate the error line. [...]"
um, first of all, trying to understand the sentence at all, where do I put the "text to search", after the %s? before it? or, I don't know, does it maybe taint the whole pattern? WTF?
secondly, what does this pattern actually do, how does it differ from regular text in a pattern, like some kinda set efm+=,foobar? the "foobar" here is for me also "text to search for"... :/
%+ - e.g. I've seen something like that used in one question: %+C%.%#
does it mean the whole line will be appended to a %m used in an earlier/later multiline pattern? if yes, then what if there was not %.%# (== regexp .*), but, let's say, %+Ccont.: %.%# - would something like that work to capture only stuff after a cont.: string into the %m?
also, what's the difference between %C%.%# and %+C%.%# and %+G?
also, what's the difference between %A and %+A, or %E vs. %+E?
finally, an example for Python in :help errorformat-multi-line ends with the following characters: %\\@=%m -- WTF does the %\\@= mean?
I'd be very grateful for some help understanding this stuff.
Ah, errorformat, the feature everybody loves to hate. :)
Some meta first.
Some Vim commands (such as :make and :cgetexpr) take the output of a compiler and parse it into a quickfix list. errorformat is a string that describes how this parsing is done. It's a list of patterns, each pattern being a sort of hybrid between a regexp and a scanf(3) format. Some of these patterns match single lines in the compiler's output, others try to match multiple lines (%E, %A, %C etc.), others keep various states (%D, %X), others change the way parsing proceeds (%>), while yet others simply produce messages in the qflist (%G), or ignore lines in the input (%-G). Not all combinations make sense, and it's quite likely you won't figure out all details until you look at Vim's sources. shrug
You probably want to write errorformats using let &efm='...' rather than set efm=.... The syntax is much more human-friendly.
You can experiment with errorformat using cgetexpr. cgetexpr expects a list, which it interprets as the lines in the compiler's output. The result is a qflist (or a syntax error).
qflists are lists of errors, each error being a Vim "dictionary". See :help getqflist() for the (simplified) format.
Errors can identify a place in a file, they can be simple messages (if essential data that identifies a place is missing), and they can be valid or invalid (the invalid ones are essentially the leftovers from parsing).
You can display the current qflist with something like :echomsg string(getqflist()), or you can see it in a nice window with :copen (some important details are not shown in the window though). :cc will take you to the place of the first error (assuming the first error in qflist actually refers to an error in a file).
Now to answer your questions.
um, first of all, trying to understand the sentence at all, where do I put the "text to search", after the %s? before it?
You don't. %s reads a line from the compiler's output and translates it to a pattern in the qflist. That's all it does. To see it at work, create a file efm.vim with this content:
let &errorformat ='%f:%s:%m'
cgetexpr ['efm.vim:" bar:baz']
echomsg string(getqflist())
copen
cc
" bar baz
" bar
" foo bar
Then run :so%, and try to understand what's going on. %f:%s:%m looks for three fields: a filename, the %s thing, and the message. The input line is efm.vim:" bar:baz, which is parsed into filename efm.vim (that is, current file), pattern ^\V" bar\$, and message baz. When you run :cc Vim tries to find a line matching ^\V" bar\$, and sends you there. That's the next-to-last line in the current file.
secondly, what does this pattern actually do, how does it differ from regular text in a pattern, like some kinda set efm+=,foobar?
set efm+=foobar\ %m will look for a line in the compiler's output starting with foobar, then assign the rest of the line to the message field in the corresponding error.
%s reads a line from the compiler's output and translates it to a pattern field in the corresponding error.
%+ - e.g. I've seen something like that used in one question: %+C%.%#
does it mean the whole line will be appended to a %m used in an earlier/later multiline pattern?
Yes, it appends the content of the line matched by %+C to the message produced by an earlier (not later) multiline pattern (%A, %E, %W, or %I).
if yes, then what if there was not %.%# (== regexp .*), but, let's say, %+Ccont.: %.%# - would something like that work to capture only stuff after a cont.: string into the %m?
No. With %+Ccont.: %.%# only the lines matching the regexp ^cont\.: .*$ are considered, the lines not matching it are ignored. Then the entire line is appended to the previous %m, not just the part that follows cont.:.
also, what's the difference between %C%.%# and %+C%.%# and %+G?
%Chead %m trail matches ^head .* trail$, then appends only the middle part to the previous %m (it discards head and trail).
%+Chead %m trail matches ^head .* trail$, then appends the entire line to the previous %m (including head and trail).
%+Gfoo matches a line starting with foo and simply adds the entire line as a message in the qflist (that is, an error that only has a message field).
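The difference between %C and %+C described above can be mimicked with an ordinary regular expression. The Python sketch below is only an analogy for the behaviour as stated in this answer, not Vim's actual parser: the function name `continuation`, the sample strings, and joining the pieces with a newline are all assumptions made for the illustration.

```python
import re

# Rough regex equivalent of the errorformat item "head %m trail".
LINE_RE = re.compile(r"^head (?P<m>.*) trail$")


def continuation(line: str, previous_message: str, plus: bool) -> str:
    """Mimic %Chead %m trail (plus=False) or %+Chead %m trail (plus=True)."""
    match = LINE_RE.match(line)
    if match is None:
        return previous_message  # the line is not handled by this pattern
    # Without "+": only the %m part is appended. With "+": the entire line is.
    appended = line if plus else match.group("m")
    return previous_message + "\n" + appended


without_plus = continuation("head details here trail", "first line", plus=False)
with_plus = continuation("head details here trail", "first line", plus=True)
print(repr(without_plus))
print(repr(with_plus))
```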
also, what's the difference between %A and %+A, or %E vs. %+E?
%A and %E start multiline patterns. %+ seems to mean "add the entire line being parsed to message, regardless of the position of %m".
finally, an example for Python in :help errorformat-multi-line ends with the following characters: %\\@=%m -- WTF does the %\\@= mean?
%\\@= translates to the regexp qualifier \@=, "matches preceding atom with zero width".
example code
diff -r -u -P a.c b.c > diff.patch
I've tried to search in man.
man says that diff -u is to unify the pattern of output, what is the meaning of that and when should we use it?
thanks a lot.
From Wikipedia (diff utility):
The unified format (or unidiff) inherits the technical improvements made by the context format, but produces a smaller diff with old and new text presented immediately adjacent. Unified format is usually invoked using the "-u" command line option. This output is often used as input to the patch program. Many projects specifically request that "diffs" be submitted in the unified format, making unified diff format the most common format for exchange between software developers.
...
The format starts with the same two-line header as the context format, except that the original file is preceded by "---" and the new file is preceded by "+++". Following this are one or more change hunks that contain the line differences in the file. The unchanged, contextual lines are preceded by a space character, addition lines are preceded by a plus sign, and deletion lines are preceded by a minus sign.
A hunk begins with range information and is immediately followed with the line additions, line deletions, and any number of the contextual lines. The range information is surrounded by double-at signs, and combines onto a single line what appears on two lines in the context format (above). The format of the range information line is as follows:
@@ -l,s +l,s @@ optional section heading
...
The idea of any format that diff throws at you is to transform a source file into a destination file following a series of steps. Let's see a simple example of how this works with unified format.
Given the following files:
from.txt
a
b
to.txt
a
c
The output of diff -u from.txt to.txt is:
--- from.txt 2015-03-17 04:34:47.076997087 -0430
+++ to.txt 2015-03-17 04:35:27.872996388 -0430
@@ -1,2 +1,2 @@
 a
-b
+c
Explanation. Header description:
--- from.txt 2015-03-17 22:42:18.575039925 -0430 <-- from-file time stamp
+++ to.txt 2015-03-17 22:42:10.495040064 -0430 <-- to-file time stamp
This diff contains just one hunk (only one set of changes to turn file from.txt into to.txt):
@@ -1,2 +1,2 @@ <-- A hunk, a block describing changes between both files; there could be several of these in the diff -u output
   ^    ^
   |    (+) means that this change starts at line 1 and involves 2 lines in the to.txt file
   (-) means that this change starts at line 1 and involves 2 lines of the from.txt file
Next, the list of changes:
a <-- This line remains the same in both files, so it won't be changed
-b <-- This line has to be removed from the "from.txt" file to transform it into the "to.txt" file
+c <-- This line has to be added to the "from.txt" file to transform it into the "to.txt" file
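The walkthrough above can be reproduced with Python's standard difflib module. This is a sketch using difflib rather than the diff utility itself: no timestamps are passed, so the header lines carry only the file names, and `parse_hunk_header` is a hypothetical helper written for this example that pulls the start line and line count for each side out of the range line.

```python
import difflib
import re

from_lines = ["a\n", "b\n"]  # contents of from.txt
to_lines = ["a\n", "c\n"]    # contents of to.txt

diff_lines = list(
    difflib.unified_diff(from_lines, to_lines, fromfile="from.txt", tofile="to.txt")
)
print("".join(diff_lines), end="")

# Range line: @@ -start,count +start,count @@ ; an omitted count means 1.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line: str):
    """Return (from_start, from_count, to_start, to_count) for a hunk range line."""
    match = HUNK_RE.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    from_start, from_count, to_start, to_count = match.groups()
    return (int(from_start), int(from_count or 1), int(to_start), int(to_count or 1))


hunk = parse_hunk_header(diff_lines[2])
print(hunk)
```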
Here are some StackOverflow answers with really nice info about this subject:
https://stackoverflow.com/a/10950496/1041822
https://stackoverflow.com/a/2530012/1041822
And some other useful documentation:
https://linuxacademy.com/blog/linux/introduction-using-diff-and-patch/
http://www.artima.com/weblogs/viewpost.jsp?thread=164293
The term unified was made up. Better, perhaps, would have been to call it "concise".
The point of diff -u is that it is a more concise representation than context diff. Quoting from the original description of Wayne Davison's posting of unidiff to comp.sources.misc (volume 14, 31 Aug 90):
I've created a new context diff format that combines the old and new chunks into
one unified hunk. The result? The unified context diff, or "unidiff."
Posting your patch using a unidiff will usually cut its size down by around
25% (I've seen from 12% to 48%, depending on how many redundant context lines
are removed). Even if the diffs are generated with only 2 lines of context,
the savings still average around 20%.
Keep in mind that *no information is lost* by the conversion process. Only
the redundancy of having multiple identical context lines. [...]
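The redundancy the posting refers to is easy to observe with Python's difflib, which can emit both formats. This is a sketch on made-up input: the file names and contents are invented, and the exact saving depends on the data, so only the direction of the difference is checked here, not the percentages quoted above.

```python
import difflib

# Made-up eight-line file with a single changed line in the middle.
old = [f"line {i}\n" for i in range(1, 9)]
new = list(old)
new[3] = "line four, changed\n"

unified = list(difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt"))
context = list(difflib.context_diff(old, new, fromfile="old.txt", tofile="new.txt"))

unified_size = sum(len(line) for line in unified)
context_size = sum(len(line) for line in context)

# The context format repeats the unchanged context lines once per side;
# the unified format lists them only once.
print(len(unified), "lines,", unified_size, "characters (unified)")
print(len(context), "lines,", context_size, "characters (context)")
```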
Here are some useful links:
How to read a patch or diff and understand its structure to apply it manually
What is the format of a patch file?
Not useful (and misleading)
2.2.2 Unified Format, which appears to omit attribution.