Splitting Files From a Git Repository
We recently started hacking up some Python scripts to be a (figurative) puppet master and couple together two earth models that we’re running. One is a hydrologic model, and the other is a glacier model. They run at different spatial scales (the hydrologic model is very coarse while the glacier model is very high resolution) and over different temporal domains. They represent the world in very different ways, and they are written in different languages (C++ and Fortran, respectively).
In almost every way they are non-ideal candidates for coupling. But we’re doing it anyway.
It’s quite a bit of work (and non-trivial math) to translate information from one model to the other. But we decided to use a high-level language (Python) to coordinate the work. Writing the code that way should take on the order of 1/10 of the effort it would in a lower-level language.
In any case, we started hacking on the scripts right inside our git repository for VIC, but that quickly started making things messy. I wanted to pull all of the Python code (the “conductor”) out of the VIC repo and start a new one. But we had already done about 40 commits worth of work on the scripts, and I wanted to keep that history.
There are a few ways to extract a subset of files from a git repo and start a new branch or repo with them. Most people will point you to git subtree, but unfortunately, it’s only useful if your files are already organized into proper directories. VIC’s were not. All the files are just thrown in together. So we wanted to extract just a couple of files.
Fortunately, there was a good discussion on Stack Overflow in which dbr described how to do it using git filter-branch, which essentially runs a shell script on every commit in a repo and rewrites it according to whatever rules you’ve written. So we just iterate through the commits and remove everything that we’re not interested in.
Start by making a clone of the repo that is a proper copy. It is very important to use the file:/// syntax rather than a plain path such as ./VIC. The latter just makes hard links to the other repo’s object files, and you may actually mess up your base repo.
git clone file:///home/james/code/git/VIC hydro-conductor
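If the file:/// URL feels awkward, git clone also has a --no-hardlinks option that forces a real copy of the object files when cloning from a plain local path. Here is a minimal sketch on a throwaway repository; the "base" and "copy" directory names and the demo identity are placeholders for illustration, not the real VIC paths.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
cd "$work"

# Throwaway "base" repository with a single commit (a stand-in for the real repo).
git init -q base
cd base
echo 'print("hello")' > conductor.py
git add conductor.py
git -c user.name=demo -c user.email=demo@example.com commit -q -m "add script"
cd ..

# Clone from a plain path, but ask git to copy objects instead of hard-linking them.
git clone -q --no-hardlinks base copy

# The copy carries the same history as the base.
git -C copy log --oneline
```

Either form should leave you with a clone you can rewrite freely without touching the original.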
Once you have a proper cloned copy, run your filter to keep only the files with a “.py” extension.
# Keep only top-level *.py files and delete everything else from each commit's tree.
# A POSIX case statement is used so the filter also works when /bin/sh is not bash.
git filter-branch --tree-filter 'for f in *; do case "$f" in *.py) ;; *) rm -rf -- "$f" ;; esac; done' --prune-empty master
# I missed the .hgignore file, because dotfiles are not included in the * glob expression
git filter-branch -f --tree-filter 'rm -f .hgignore' --prune-empty master
That’s it! Now we have a clean repo with only our “.py” files, but containing their full history.
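Before rewriting a real repo, it can help to rehearse the whole procedure on throwaway data. The sketch below builds a tiny mixed repository, clones it through a file:// URL, runs the two filter passes, and prints what survives. Everything in it is made up for the demo: the file names, the "base" and "conductor" directories, and the commit identity. The filter is written with a POSIX case statement so it runs under any /bin/sh, and it targets HEAD rather than a named branch so it does not depend on the default branch name. Setting FILTER_BRANCH_SQUELCH_WARNING only silences the advisory notice that newer git versions print.

```shell
#!/bin/sh
set -e
# Skip the advisory notice (and pause) that newer git versions add to filter-branch.
export FILTER_BRANCH_SQUELCH_WARNING=1
# Throwaway identity so the demo commits work on an unconfigured machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# Build a small "messy" repo: model source and Python scripts side by side.
git init -q base
cd base
echo 'int main() { return 0; }' > model.c
git add . && git commit -q -m "model only"
echo 'print("step 1")' > conductor.py
echo 'syntax: glob' > .hgignore
git add . && git commit -q -m "add conductor and ignore file"
echo 'print("step 2")' >> conductor.py
git add . && git commit -q -m "update conductor"
cd ..

# Independent copy via a file:// URL.
git clone -q "file://$work/base" conductor
cd conductor

# Pass 1: keep only top-level *.py entries. Pass 2: drop the dotfile the glob skipped.
git filter-branch --tree-filter 'for f in *; do case "$f" in *.py) ;; *) rm -rf -- "$f" ;; esac; done' --prune-empty HEAD >/dev/null
git filter-branch -f --tree-filter 'rm -f .hgignore' --prune-empty HEAD >/dev/null

# Inspect the result: remaining files and number of surviving commits.
files=$(git ls-tree -r --name-only HEAD)
commits=$(git rev-list --count HEAD)
echo "files: $files"
echo "commits: $commits"
```

If the file list shows only the Python script and the log still contains the commits that touched it, the same two passes should be safe to run on the real clone.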