Find duplicate Files - linux

I used to use a program called finddupe on Windows (XP) which checked for duplicate files and offered to replace them with hard links.
It calculated a hash of the first 32K of each file and only checked the rest on a match. I have the source (for VC++6), but was wondering if there is a Linux/OSX equivalent before I try to port it, although I suspect it may be better to write a new program in a higher-level language.

I've found fdupes to be helpful for me.
If you are looking to write your own quick script, I would suggest looping over files and using cmp as it allows you to easily stop comparison after the first mismatched byte.
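For anyone who does want to write their own, here is a minimal Python sketch of the approach described in the question: group files by size and a hash of the first 32K, then confirm candidates with a full byte comparison. The function names and the choice of SHA-1 are assumptions for illustration, not what finddupe or fdupes actually do internally.

```python
import hashlib
import os
from collections import defaultdict
from filecmp import cmp as same_contents

PREFIX_BYTES = 32 * 1024  # only the first 32K is hashed, as in the question


def prefix_hash(path):
    """Return a SHA-1 digest of the first PREFIX_BYTES of the file."""
    with open(path, "rb") as handle:
        return hashlib.sha1(handle.read(PREFIX_BYTES)).hexdigest()


def find_duplicates(root):
    """Return a list of groups (sorted lists of paths) with identical contents."""
    candidates = defaultdict(list)
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path) and not os.path.islink(path):
                key = (os.path.getsize(path), prefix_hash(path))
                candidates[key].append(path)

    groups = []
    for paths in candidates.values():
        # Same size and same prefix hash: confirm with a full byte comparison.
        while len(paths) > 1:
            first, rest = paths[0], paths[1:]
            matches = [first] + [p for p in rest
                                 if same_contents(first, p, shallow=False)]
            if len(matches) > 1:
                groups.append(sorted(matches))
            paths = [p for p in rest if p not in matches]
    return groups
```

Calling find_duplicates(".") returns the groups of identical files under the current directory. Replacing duplicates with hard links is deliberately left out, since that step is destructive.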

There are many similar tools. See here
They may not be part of a standard distribution.
I have used fslint before and found it to be sufficient for my needs.

Related

ELF, Build-ID, is there a utility to recompute it?

I came across this useful feature in ELF binaries -- Build ID. "It ... is (normally) the SHA1 hash over all code sections in the ELF image." One can read it with GNU utility:
$ readelf -n /bin/bash
...
Displaying notes found at file offset 0x00000274 with length 0x00000024:
Owner Data size Description
GNU 0x00000014 NT_GNU_BUILD_ID (unique build ID bitstring)
Build ID: 54967822da027467f21e65a1eac7576dec7dd821
And I wonder if there is an easy way to recompute Build ID yourself? To check if it isn't corrupted etc.
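For reference, reading the stored value (as opposed to recomputing it) is straightforward. The sketch below parses the raw bytes of an ELF note section: a 12-byte header (name size, descriptor size, type) followed by the padded name and descriptor. It assumes a little-endian file and 4-byte padding, and the function name is made up for illustration.

```python
import struct

NT_GNU_BUILD_ID = 3  # note type used for the GNU build ID


def parse_build_id(note_bytes, byte_order="<"):
    """Return the build ID as a hex string, or None if no such note is present.

    note_bytes is the raw content of an ELF note section or segment.
    byte_order is "<" for little-endian files (an assumption; use ">" otherwise).
    """
    offset = 0
    while offset + 12 <= len(note_bytes):
        name_size, desc_size, note_type = struct.unpack_from(
            byte_order + "III", note_bytes, offset)
        offset += 12
        name = note_bytes[offset:offset + name_size]
        offset += (name_size + 3) & ~3  # name is padded to a 4-byte boundary
        desc = note_bytes[offset:offset + desc_size]
        offset += (desc_size + 3) & ~3  # descriptor is padded the same way
        if note_type == NT_GNU_BUILD_ID and name == b"GNU\x00":
            return desc.hex()
    return None
```

The sizes agree with the readelf output above: 12 bytes of header, 4 bytes for "GNU" plus its terminator, and 0x14 bytes of ID add up to the reported length of 0x24.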
So, I've got an answer from Mark. Since it is up-to-date information, I am posting it here. Basically, you guys are right: there is no tool for computing the Build-ID, and the intention of the Build-ID is not (1) to identify the file contents, nor even (2) to identify the executable (code) part of it, but (3) to capture the "semantic meaning" of a build, which is the hard part to formalize. (Numbers are for self-reference.)
Quote from the email:
-- "Is there a user tool recomputing the build-id from the file itself, to
check if it's not corrupted/compromised somehow etc?"
If you have time, maybe you could post an answer there?
Sorry, I don't have a stackoverflow account.
But the answer is: No, there is no such tool because the precise way a
build-id is calculated isn't specified. It just has to be universally
unique. Even the precise length of the build-id isn't specified. There
are various ways using different hashing algorithms a build-id could be
calculated to get a universally unique value. And not all data might
(still be) in the ELF file to recalculate it even if you knew how it was
created originally.
Apparently, the intentions of Build-ID changed
since the Fedora Feature page was written about
it.
And people's opinions diverge on what it is now.
Maybe in your answer you could include status of Build-ID and what it is
now as well?
I think things weren't very precisely formulated. If a tool changes the
build that creates the ELF file so that it isn't a "semantically
identical" binary anymore then it should get a new (recalculated)
build-id. But if a tool changes something about the file that still
results in a "semantically identical" binary then the build-id stays the
same.
What isn't precisely defined is what "semantically identical binary"
means. The intention is that it captures everything that a build was
made from. So if the source files used to generate a binary are
different then you expect different build-ids, even if the binary code
produced might happen to be the same.
This is why when calculating the build-id of a file through a hash
algorithm you use not just the (allocated) code sections, but also the
debuginfo sections (which will contain references to the source file
names).
But if you then for example strip the debuginfo out (and put it into a
separate file) then that doesn't change the build-id (the file was still
created from the same build).
This is also why, even if you knew the precise hashing algorithm used to
calculate the build-id, you might not be able to recalculate the
build-id. Because you might be missing some of the original data used in
the hashing algorithm to calculate the build-id.
Feel free to share this answer with others.
Cheers,
Mark
Also, for people interested in debuginfo (linux performance & tracing, anyone?), he mentioned a couple projects for managing them on Fedora:
https://fedoraproject.org/wiki/Changes/ParallelInstallableDebuginfo
https://fedoraproject.org/wiki/Changes/SubpackageAndSourceDebuginfo
The build ID is not a hash of the program, but rather a unique identifier for the build, and is to be considered just a "unique blob" — at least at some point it used to be defined as a hash of timestamp and absolute file path, but that's not a guarantee of stability either.
I wonder if there is an easy way to recompute Build ID yourself?
No, there isn't, by design.
The page you linked to itself links to the original description of what build-id is and what it's usable for. That page says:
But I'd like to specify it explicitly as being a unique identifier good
only for matching, not any kind of checksum that can be verified against
the contents.
(There are external general means for content verification, and I don't
think debuginfo association needs to do that.)
An additional complication is that the linker accepts any of:
--build-id
--build-id=sha1
--build-id=md5
--build-id=0xhexstring
So the build id is not necessarily an sha1 sum to begin with.

Writing a Script to match the architecture of system and software

I am trying to write a script that will cross-check two things:
The architecture for which the setup file was intended (32 or 64 bit)
The Architecture of the system.
The second part is quite easy and can be figured out using commands like lscpu and then extracting the specific line with a combination of grep and awk or sed. However, the first part is proving to be complicated. I tried using the file command, but its output is very irregular, which makes it very difficult to extract a specific column from it. I also tried using objdump, though it is not traditionally used for things like this. However, as expected, due to its limitations, it does not recognize most of the file types.
The rest of the script is dead simple: I would compare these values and proceed with my intended tasks. I would like your help with point 1 mentioned above.
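If the setup file is an ELF binary, one option that avoids parsing the output of file is to read the header directly: the fifth byte (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects. The Python sketch below assumes ELF input only (shell scripts, archives and other formats return None), and its system check is a rough heuristic based on the machine name. All function names are made up for illustration.

```python
import platform

ELF_MAGIC = b"\x7fELF"
ELF_CLASS_BITS = {1: 32, 2: 64}  # EI_CLASS values ELFCLASS32 and ELFCLASS64


def elf_bits(header):
    """Return 32 or 64 for an ELF header, or None if the bytes are not ELF."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return None
    return ELF_CLASS_BITS.get(header[4])


def file_bits(path):
    """Read the first bytes of a file and report its ELF word size."""
    with open(path, "rb") as handle:
        return elf_bits(handle.read(5))


def system_bits():
    """Guess the system word size from the machine name (rough heuristic)."""
    return 64 if "64" in platform.machine() else 32
```

A script could then compare file_bits("/path/to/setup") with system_bits() and stop when they differ, keeping in mind that many 64-bit systems can still run 32-bit binaries.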

Verifying two different build architectures (one a re-write of the other) are functionally equivalent

I'm re-writing a build that produces a number of things (shared/static libraries, jars, executables, etc). The question came up whether there's a way to verify that the results are functionally equivalent without doing a full top-to-bottom test of the resulting software.
However, that is proving to be more difficult to do than I anticipated.
As an example, I expected that two objects produced from the same source (Sun Studio C++ compiler) with the same command-line parameters would have the same md5 hash, but that isn't the case. I can build the file, rename it, build again, and the two have different hashes.
With that said ... is there a way to do a quick check to verify that two files produced from separate build architectures of the same source tree (e.g., two shared objects) are functionally equivalent?
Edit: I am sorry, I neglected to mention this is for a debug build ... when debugging flags aren't used the binaries are identical, but they've been using debugging flags by default for so many years that their stuff breaks when you remove the debugging flags (part of the reason I'm re-writing the build is to take that particular 'feature' out of the build so we can get some proper testing going).
Windows DLLs have a link timestamp (TimeDateStamp) as part of the PE image.
Looking at linker options, I don't see an option to suppress that. So re-linking a DLL (or an EXE) will always produce a different binary.
You could write a tool to zero out these timestamps (they sit at a fixed offset from the PE header, whose position is stored at file offset 0x3C), and compare MD5s afterwards. But you'll likely discover lots of other differences as well. In particular, any program that uses the __DATE__ or __TIME__ predefined macros will give you trouble.
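A minimal Python sketch of such a tool, working on the image as bytes, might look like this. It only clears the TimeDateStamp in the COFF file header; any other embedded timestamps are left alone, and the function names are made up for illustration.

```python
import hashlib
import struct


def zero_pe_timestamp(image):
    """Return a copy of a PE image (bytes) with the COFF TimeDateStamp zeroed."""
    if image[:2] != b"MZ":
        raise ValueError("not a PE image: missing MZ signature")
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)  # e_lfanew field
    if image[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("not a PE image: missing PE signature")
    # Skip the signature (4 bytes), Machine (2) and NumberOfSections (2).
    stamp_offset = pe_offset + 8
    patched = bytearray(image)
    patched[stamp_offset:stamp_offset + 4] = b"\x00" * 4
    return bytes(patched)


def normalized_md5(image):
    """MD5 of the image with the link timestamp removed."""
    return hashlib.md5(zero_pe_timestamp(image)).hexdigest()
```

Two builds can then be compared with normalized_md5(open(a, "rb").read()) == normalized_md5(open(b, "rb").read()), with the caveat above that other differences are likely to remain.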
We've had to work quite hard to achieve bit-identical rebuilds (using GNU toolchain). It's possible (at least for open-source tools, on Linux), but not easy (as you've discovered).
I forgot about this question; I'm revisiting so I can give the answer I came up with.
objcopy can be used to produce a new binary file in different formats. It's been a few years since I worked on this, so the specifics escape me, but here's what I recall:
objcopy can strip various things out (debug info, symbol information, etc), but even after stripping stuff out I was still seeing different hashes between objects.
In the end I found I could convert it from ELF to other formats. I ended up dumping it to another format (I think I chose SREC) that consistently provided the same MD5 for objects built at different times with identical source/flags.
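Roughly, the check looked like the Python sketch below. This is a reconstruction under assumptions rather than the original script: it relies on GNU objcopy being on the PATH, and the helper names are made up. The output file name is kept constant and relative, on the assumption that the SREC header record may embed it.

```python
import hashlib
import os
import subprocess
import tempfile


def objcopy_command(object_path, output_name="out.srec"):
    """Build the objcopy invocation that converts an object file to SREC."""
    return ["objcopy", "-O", "srec", os.path.abspath(object_path), output_name]


def srec_md5(object_path):
    """Convert an object file to SREC and return the MD5 of the result."""
    with tempfile.TemporaryDirectory() as scratch:
        # Run inside the scratch directory so the output name stays relative.
        subprocess.run(objcopy_command(object_path), cwd=scratch, check=True)
        with open(os.path.join(scratch, "out.srec"), "rb") as handle:
            return hashlib.md5(handle.read()).hexdigest()


def same_build_output(first, second):
    """True when both objects produce byte-identical SREC dumps."""
    return srec_md5(first) == srec_md5(second)
```

This compares only what survives the conversion, so it is a quick sanity check rather than proof of functional equivalence.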
I'm betting I could have done this a better way with objcopy (or perhaps another binutils tool), but it was good enough to satisfy our concerns.

Can a LabVIEW VI tell whether one of its output terminals is wired?

In LabVIEW, is it possible to tell from within a VI whether an output terminal is wired in the calling VI? Obviously, this would depend on the calling VI, but perhaps there is some way to find the answer for the current invocation of a VI.
In C terms, this would be like defining a function that takes arguments which are pointers to where to store output parameters, but will accept NULL if the caller is not interested in that parameter.
As was said, you can't do this in the natural way, but there's a workaround using data value references (requires LV 2009). It is the same idea as passing a NULL pointer for an output argument. The result is passed in as a data value reference (which plays the role of the pointer), and the SubVI checks it for Not a Reference. If it is null, the SubVI does nothing.
Here is the SubVI (case true does nothing of course):
And here is the calling VI:
Images are VI snippets so you can drag and drop on a diagram to get the code.
I'd suggest you're going about this the wrong way. If the compiler is not smart enough to avoid the calculation on its own, make two versions of this VI. One that does the expensive calculation, one that does not. Then make a polymorphic VI that will allow you to switch between them. You already know at design time which version you want (because you're either wiring the output terminal or not), so just use the correct version of the polymorphic VI.
Alternatively, pass in a variable that switches on or off a Case statement for the expensive section of your calculation.
Like Underflow said, the basic answer is no.
You can have a look here to get what is probably the most official and detailed answer which will ever be provided by NI.
Extending your analogy, you can do this in LV, except LV doesn't have the concept of null that C does. You can see an example of this here.
Note that the code in the link Underflow provided will not work in an executable, because the diagrams are stripped by default when building an EXE and because the RTE does not support some of the properties and methods used there.
Sorry, I see I misunderstood the question. I thought you were asking about an input, so the idea I suggested does not apply. The restrictions I pointed out do apply, though.
Why do you want to do this? There might be another solution.
Generally, no.
It is possible to do a static analysis on the code using the "scripting" features. This would require pulling the calling hierarchy, and tracking the wire references.
Pulling together a trial of this, there are some difficulties. Multiple identical sub-vi's on the same diagram are difficult to distinguish. Also, terminal references appear to be accessible mostly by name, which can lead to some collisions with identically named terminals of other vi's.
NI has done a bit of work on a variation of this problem; check out this.
In general, the LV compiler optimizes the machine code in such a way that unused code is not even built into the executable.
This does not apply to subVIs (because there's no way of knowing that you won't try to use the value of the indicators somehow, although LV could do it if it removes the FP when building an executable, and possibly does), but there is one way you can get it to apply to a subVI - inline the subVI, which should allow the compiler to see the outputs aren't used. You can also set its priority to subroutine, which will possibly also do this, but I wouldn't recommend that.
Officially, in-lining is only available in LV 2010, but there are ways of accessing the private VI property in older versions. I wouldn't recommend it, though, and it's likely that 2010 has some optimizations in this area that older versions did not.
P.S. In general, the details of the compiling process are not exposed and vary between LV versions as NI tweaks the compiler. The whole process is supposed to have been given a major upgrade in LV 2010 and there should be a webcast on NI's site with some of the details.

Are there any context-sensitive code search tools?

I have been getting very frustrated recently in dealing with a massive bulk of legacy code which I am trying to get familiar with.
Say I try to search for a particular function call, I get loads of results that turn out to be completely irrelevant; some of them are easy to spot, eg a comment saying
// Fixed functionality in foo() so don't need to handle this here any more
But others are much harder to spot manually, because they turn out to be calls from other functions in modules that are only compiled in certain cases, or are part of a much larger block of code that is #if 0'd out in its entirety.
What I'd like would be a search tool that would allow me to search for a term and give me the choice to include or exclude commented out or #if 0'd out code. Then the search results would be displayed alongside a list of #defines that are required in order for that snippet of code to be relevant.
I'm working in C / C++, but other than the specific comment syntax I guess the techniques should be more generally applicable.
Does such a tool exist?
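To make the request concrete, here is a naive Python sketch of the filtering described above: it blanks out comments, skips blocks disabled by a literal #if 0, and reports the remaining matches by line number. It is an illustration under heavy assumptions, not an existing tool. It ignores string literals and evaluates no preprocessor condition other than a literal 0, and all names are made up.

```python
import re

COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
DIRECTIVE = re.compile(r"\s*#\s*(if|ifdef|ifndef|else|elif|endif)\b(.*)")


def blank_comments(source):
    """Replace comments with spaces, keeping newlines so line numbers survive."""
    def blank(match):
        return re.sub(r"[^\n]", " ", match.group(0))
    return COMMENT.sub(blank, source)


def live_lines(source):
    """Yield (line_number, text) for lines outside comments and #if 0 blocks."""
    stack = []  # one flag per open #if: True while its branch is "#if 0"
    lines = blank_comments(source).splitlines()
    for number, line in enumerate(lines, start=1):
        directive = DIRECTIVE.match(line)
        if directive:
            keyword, rest = directive.group(1), directive.group(2).strip()
            if keyword in ("if", "ifdef", "ifndef"):
                stack.append(keyword == "if" and rest == "0")
            elif keyword in ("else", "elif") and stack:
                stack[-1] = False  # the alternative to "#if 0" counts as live
            elif keyword == "endif" and stack:
                stack.pop()
            continue
        if not any(stack):
            yield number, line


def search(source, term):
    """Return the line numbers where term appears in live code."""
    return [number for number, line in live_lines(source) if term in line]
```

Listing the #defines each hit depends on would require tracking the conditions on the stack instead of a single flag, which is where a real preprocessor-aware parser becomes necessary.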
Not entirely what you're after, but I find this quite handy.
GrepWin - A free visual "grep" tool for searching files.
I find it quite helpful because:
It's a separate app (doesn't lock up my editor)
Handles regular expressions
It's fast
Can specify what folder to search, and what file types (handles regexes here too)
Can limit by file size
Can include subdirs (or exclude by regex)
etc.
Almost any decent source browser will let you go to where a function is defined, and/or list all the calls of that function and take you directly to a call site. This will normally be based on a fairly complete parse of the source code so it will ignore comments, code that's excluded by the preprocessor, and so on (in fact, in at least one case, the parser used by the source browser is almost certainly better than the one used in the compiler itself).

Resources