I am new to Git, and it really helps me in my code projects. I have some Numbers files (Apple's spreadsheet format, similar to Excel) and I am not sure if it is a good idea to keep track of their changes with Git. I understand that Git cannot show what is different inside each file, but it can detect that something changed (more or fewer bytes). For me it would be useful to save the file whenever I want and write a title in the commit to remember what I have changed. Is Git a good option in these cases?
Indeed, committing binary files in git is not the best thing to do.
But when you want to track such files, that's sometimes the thing to do.
Here, LibreOffice is better than Microsoft Office because it has "flat" formats, where the entire saved file is a single XML file, which is much more source-control friendly!
But with Excel, you can also use a converter to diff the files in a human-friendly way. See https://stackoverflow.com/a/17106035/717372
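As an illustration of that approach, here is a minimal sketch using Git's textconv mechanism. The driver name "spreadsheet" and the converter command (the xlsx2csv tool) are assumptions; substitute whatever converter you actually use:

    # .gitattributes: send .xlsx files through a custom diff driver
    *.xlsx diff=spreadsheet

    # tell Git how to turn the binary file into text before diffing:
    git config diff.spreadsheet.textconv "xlsx2csv"

After this, git diff on an .xlsx file shows changes in the converted text instead of just "Binary files differ".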
Okay everyone: I'm setting up a git repository for researchers to share scripts and data for a research project. The researchers aren't programmers or particularly git-savvy, so I'm hoping to point desktop git clients at a shared repository — everyone has access to this in their local filesystem.
The problem: line endings. We have people using:
Windows (mainly R) (CRLF)
Linux and Mac scripts (mainly R and Python) (LF only)
Excel on Mac, saving as .CSV (CR only, yes this is an actual thing)
git's autocrlf doesn't understand Mac line endings for some reason, so that doesn't work well for me.
First, I want to track changes to these files without telling people "you can't use the tools you're familiar with" because then they will just store the data and scripts somewhere outside of the repo.
Second, I want to have the git repo not be full of stupid line ending commits and merge conflicts, because I will probably need to solve all the merge conflicts that happen.
Third, I'd like people to not have to manually run some "fix all the line endings" script because that would suck. If this is what I need to do... whatever, I guess.
Assuming "first, normalize the line endings" is the answer, any sense of which ones I should choose?
I'd thought about a pre-commit hook, but it sounds like this would involve somehow getting the same script to run on both Windows and unix, and that sounds terrible. Maybe this is a secretly practical option?
Thanks.
As Marek Vitek said in comments, you may need to write at least a tiny bit of code.
Second, for a bit of clarity, here's how Git itself deals—or doesn't deal—with data transformation:
Data (files) inside commits is sacrosanct. It literally can't be changed, so once something is inside a commit, it is forever. [1]
Data in the work-tree can and should be in a "host friendly" format. That is, if you're on a Mac running program Pmac that requires that lines end with CR, the data can be in that format. If you're on a Windows box running the equivalent Pwindows that requires that lines end with CR+LF, the data can be in that format.
Conversions to "host format" happen when files move from the index/staging-area to the work-tree. Conversions from "host format" to "internal storage format" happen when files move from the work-tree to the index/staging area.
Most of Git's built in filters do only CRLF to LF, or LF to CRLF, transformations. There is one "bigger" built in filter, called ident (not to be confused with indent), and you can define your own filters called clean and smudge, which can do arbitrary things. This means you can define a smudge filter that, on the Mac (but not on Windows) will (e.g.) change LF to CR. The corresponding Mac-only clean filter might then change CR to LF.
Note that many transformations are not data-preserving on raw binary data: there might be a byte that happens to resemble an LF, or CR, or two in a row that resemble CRLF, but are not meant to be interpreted that way. If you change these, you wreck the binary data. So it's important to apply filtering only to files where a byte that seems to be one of these things, really is one of these things. You can use .gitattributes path name matching, e.g., *.suffix, to infer which files get what filters applied.
The correct filtering actions to apply will, of course, depend on the host.
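As a concrete (untested) sketch of such host-dependent filtering, suppose the Macs need CR-only .csv files; the filter name maccsv and the use of tr are illustrative assumptions:

    # .gitattributes, committed to the repository:
    *.csv filter=maccsv

    # per-clone configuration, run only on the Macs:
    git config filter.maccsv.clean  'tr "\r" "\n"'    # work-tree -> index: CR becomes LF
    git config filter.maccsv.smudge 'tr "\n" "\r"'    # index -> work-tree: LF becomes CR

On the Linux and Windows machines you would simply leave the maccsv filter undefined (an undefined, non-required filter passes data through unchanged), or define it to do whatever those hosts need.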
Merges and "renormalize"
When doing a merge, Git normally just takes the files directly from the pure versions inside each of the commits involved. Since it's Git (and git diff) doing interpretation of lines, you generally want these to have Git's preferred "line" format, i.e., ending with LF (it's OK if they have or lack a CR before the LF as long as all three versions feeding into a three-way merge all have the same CR-before-LF-ness). You can use the "renormalize" setting, though, to make Git do a virtual pass through your smudge-and-then-clean filters before it does the three-way merging. You would need this only when existing commits (base and two branch tips) that you now intend to merge, were stored in a different way from the way you have all agreed now to keep inside the permanent commits. (I have not actually tried any of this, but the principle is straightforward enough.)
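If you do find yourself in that situation, the setting looks roughly like this (hedged: check the documentation for your Git version):

    # renormalize all three inputs before the three-way merge:
    git config merge.renormalize true

    # or enable it for a single merge only:
    git merge -X renormalize otherbranch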
[1] You can remove a commit, but to do so, you must also remove all of that commit's descendants. In practice, this means commits that have been shared / pushed generally never go away; only private commits can go away or be replaced with new-and-improved commits. It's difficult to get everyone who has commit a9f3c34... to ditch it in favor of the new and improved 07115c3..., even if you can get this word out to everyone.
I have loads of Notepad, .js, and .cs files in a folder that I use to refer back to when I'm developing. They are just in a folder on my laptop. Is anyone aware of a better way of storing all this guff in a more structured way? Thinking some kind of cloud website or something?
You can use a wiki for this kind of thing. There are wikis that are local, such as TiddlyWiki.
One way or another, to keep things safe, you should use source control, and/or backup to the cloud.
I keep my code samples that aren't project-specific in a revision-controlled directory tree, based on the language they're in; actual projects are also kept in revision control, but are kept separately. I have tons of them now.
For smaller idioms and snippets that are useful or that I forget as I switch between languages for a period of time, I pop them into a wiki, with different pages also based on which language they're in. I don't put whole files in there; I just extract the pieces that I tend to forget and pop them in there.
They do tend to build up as time goes on, so just putting the smaller pieces in is much more efficient for fast lookup.
I have read through the documentation on Perforce, and the "branching strategy" advice as well.
One thing that's left me baffled is that a simple concern does not seem to be adequately addressed.
When I am working on a project that touches many parts of our code base, I cannot check in my code at the end of the day without checking it into the trunk. So do I need to branch in this situation? I want to have a history of my changes on a long and hard project, so I can go back when I make a wrong turn.
The problem I see with branching is that I will be creating copies of almost the entire codebase. Am I missing an obvious solution here?
thanks
From the standpoint of the Perforce server, branches aren't actually copies of the files. Perforce uses a mechanism called "lazy copy" to minimize disk consumption. From their website, here is the definition of the term:
    A method used by Perforce to make internal copies of files without duplicating file content in the depot. Lazy copies minimize the consumption of disk space by storing references to the original file instead of copies of the file.
The best approach to working with Perforce is to work in a user/feature branch; then you can avoid checking into the trunk whilst still pushing your changes into the depot.
When creating a branch, you don't have to branch the entire trunk or source branch; you only need to branch the files you're working on, and you can map the rest of the files into your branch via your client spec (see the sketch below).
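Purely as an illustration (the depot paths and client name below are made up), a sparse branch plus client-spec mapping might look like this; note that later lines in a Perforce client view override earlier ones:

    # branch only the module you are changing:
    p4 integrate //depot/trunk/moduleA/... //depot/dev/yourname/moduleA/...
    p4 submit -d "Create sparse feature branch for moduleA"

    # client spec View: trunk everywhere, overlaid with your branch for moduleA
    //depot/trunk/...                //yourclient/src/...
    //depot/dev/yourname/moduleA/... //yourclient/src/moduleA/...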
TBH, just buy and read 'Practical Perforce'; it has heaps of useful info on how to do this and is very much worth the money if you're using Perforce on a daily basis.
Another very useful feature of Perforce is "jobs". It is often described only as a bug-tracking feature, but it's much more flexible: it allows you to store a changelist history attached to a tag, so you can create "metatags" such as 'NightlyBuild' or 'BreakingChanges' (or whatever you want) and attach revisions to them.
HTH.
The closest I know of is shelving, in which you can "shelve" your work in progress, saving a copy on the server. I typically do this to essentially checkpoint my work. I think this comes closest to addressing your need, where you can save your progress at the end of the day.
See here for a tutorial on shelving in p4v.
Or type p4 help shelve for help with the command line.
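The end-of-day checkpoint then looks roughly like this (the changelist number 12345 is just a placeholder):

    p4 shelve -c 12345      # store the files of pending changelist 12345 on the server
    p4 unshelve -s 12345    # later: restore the shelved files into a workspace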
Evaluate using PDB sparse branches. More information here: http://www.releng.com/p5layer.html
We are branching out beyond the development team and trying to get other groups within my company to use version control for important documents that need change tracking. One frequent need is for Excel spreadsheets. These are large spreadsheets, modified fairly frequently (weekly or monthly) but with only a small portion of the cells changed each time.
Just sticking the files in subversion (the particular tool we are using) gives a history of changes and keeps old versions. And the TortoiseSVN client makes it easy for non-technical users. Recent versions of TortoiseSVN even contain a script which can be used to perform nice visual diffs between Excel documents.
My remaining concern is disk space. These are large documents. The diffs between versions are small, but I worry that the version control will notice that the file is binary and fall back to storing each version separately. Does anyone know of a solution to this? For instance, a format we could save in where the diffs would be small, so only the differences would be stored, or a version control system that is specifically aware of Excel files? I have not yet done performance testing, but our version control server is already badly taxed, and if there is a better solution I'd love to know what it is.
Currently SVN cannot efficiently store those types of files. There has been some discussion about it, though:
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=651443
This SO question shows a graph of repository size when storing an OpenXML Office document; the results were pretty linear:
Will Subversion efficiently store OpenXML Office documents?
Although your question wasn't specifically about that format it may still apply. You might just need to run a test in SVN and see what kind of storage it takes. SVN is pretty good at storing binary files, so it might not be too terrible. The SO question above also mentions saving the file as a plain text XML 2003 document, which you might investigate also.
One consideration is using Team Foundation Server for source control (if that's an option), which will just store your delta changes, although it may be a bit heavy for what you're looking for.
From my understanding, binary vs. text doesn't have an impact on the storage size in SVN: http://help.collab.net/index.jsp?topic=/faq/svnbinary.html
I am using TortoiseSVN to synchronize our code.
But recently I found something that is not so convenient.
When I modify a file, let's say a.jsp, my colleague might also modify the same file, and this may result in a conflict. One of us has to check in his code first, and the other then needs to update to the latest version and resolve the conflicts one by one, which is really error-prone.
So I need some function in TortoiseSVN that can lock a.jsp while I am editing it, and prevent my colleague from modifying the file at the same time.
I have tried the "lock" function in TortoiseSVN, but it doesn't work:
when I lock the a.jsp file, my colleague can still modify it at the same time, without any prompt or alert like "your colleague is modifying this file, please wait until it is checked in"...
Is there any better solution?
Thanks in advance!!
Yes, there is a better solution; it consists of three parts:
Never lock; you don't need to.
Don't work on the same file, or at least the exact same part of the file, at the same time as someone else
If you do, be happy to merge.
Merging is a typical part of using a source control system like SVN. You shouldn't be afraid of it, you should embrace it happily.
Generally, the merge can be automatic, unless you are working in the exact same area. In that case you must make the changes manually (but the diff tool in TortoiseSVN will help you with this).
I would suggest that if this is happening a lot, you re-evaluate how you are assigning out work within your project.
As mentioned by others, the most flexible workflow is one where you don't need to lock. However, if you really do want to lock a file, Subversion provides this capability.
Add the property svn:needs-lock to the file(s) for which you want to require locking, and set the value to * (I don't think it actually matters what the property value is). Then when checking out or updating, Subversion will mark those file(s) as read-only in your working copy to remind you to lock them first.
You would lock the file with Subversion (I see you already found this option, but it doesn't do anything for files that don't need locking) which will make the file read-write on disk. Then make your changes. When you check in the file, Subversion will automatically release the lock (but there is an option to keep it if you like). Then the file will be read-only in your working copy again.
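A rough sketch of that workflow on the command line (report.xls is just an example file name):

    svn propset svn:needs-lock '*' report.xls    # mark the file as requiring a lock
    svn commit -m "require locking" report.xls

    svn lock report.xls                          # take the lock; file becomes read-write
    # ... edit the file ...
    svn commit -m "update figures" report.xls    # the commit releases the lock by default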
Note that locking in Subversion is "advisory" only. This means that anybody else can force acquisition of the lock even though somebody else might already have it. In this case, the workflow is still supported because somebody may still need to merge their changes as they would without locking.
Finally, locking files is the best way to deal with non-text files in Subversion, such as image files or MS Word files. Subversion cannot merge these types of files automatically, so advisory locking helps make sure that two people don't try to edit the same file at the same time.
TortoiseSVN has a "merge" option that you might want to try once you update your code with your colleague's changes.
There is a practice amongst SVN users (especially agile SVN users) called "The Check-In Dance". This simple practice can cut down immensely on the number of conflicts you have when checking code into an SVN repo. It goes like this:
When you're ready to check in some changes to the repo:
1. Do an update first to get everyone else's changes.
2. Run your build script (or just compile if you have no build script)
3. If all is well, commit your changes.
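In plain commands the dance is just the following (the build step here is a placeholder for whatever your project uses):

    svn update                        # 1. get everyone else's changes
    ./build.sh && ./run-tests.sh      # 2. run your build script (placeholder names)
    svn commit -m "describe change"   # 3. if all is well, commit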
Locking causes its own set of problems, not the least of which is that people tend to forget to "unlock" the file, leaving everyone else totally unable to work if they need to change that file.
Merging conflicts in SVN is fairly easy to do, so using locking should become a non-issue for you once you get used to using TortoiseSVN.
Hope this helps,
Lee