Mac OS X Lion file versions, part 2

My previous post dealt with how versioning and non-versioning apps interact under the new Mac OS X Lion release. I now turn to some low-level sleuthing. The motivation is, in a way, similar: I am interested in how command-line tools such as rsync may deal with versioning. Again, the bottom line is that these tools will do exactly what they did under Snow Leopard and earlier releases, but that it is likely going to be hard to get them to back up and sync file versions. I need to do further testing on this point; feel free to provide any details you may have.

As with the previous post, useful background reading can be found at Ars Technica and Krypted.com.

Low-level details and storage space

Here is what I found. First, the state of any versioning-aware app (that is, whether or not a given file has been edited since a version was explicitly saved) is stored in ~/Library/Saved Application State, but the actual versions of files are in a separate directory structure off the root of the current volume, specifically /.DocumentRevisions-V100. That directory looks like this on my system:

d--x--x--x   7 root  wheel   238 Jul 18 22:39 .
drwxr-xr-x  36 root  wheel  1292 Jul 28 23:24 ..
drwx------   5 root  wheel   170 Jul 28 23:27 .cs
drw-------   2 root  wheel    68 Jul 18 22:39 ChunkTemp
d--x--x--x   3 root  wheel   102 Jul 18 22:39 PerUID
drwx------   4 root  wheel   136 Jul 28 23:27 db-V1
drwx--x--x   2 root  wheel    68 Jul 18 22:39 staging

If you drill down into the PerUID directory (there is a further subdirectory for each user ID, or UID, and further subdirectories off of that), you will see every single version of every single versioned file. The actual file names are replaced with hexadecimal codes (UUID-style strings), but extensions are preserved. You can actually open these files directly, e.g. using open file_name.ext from the command line. You may think this is very inefficient: imagine a very large Pages or TextEdit file, with versions differing only by a few characters. Does Lion actually save an entire copy of each version? This seems wasteful, and it is also not what version control systems typically do: rather, they save the “deltas,” or “patches,” needed to go from one revision to the next. It turns out that Lion is smarter than this, but the details are a bit opaque.

First, as the above Ars Technica article explains, Lion keeps track of “file chunks” and only saves those chunks that have changed; these are stored in binary blobs under the .cs directory. Figuring out which chunks have changed is challenging, and again Ars Technica provides hints and links illustrating the principles and heuristics at work. The whole system is sophisticated, but then again, it has to be: versions are stored on your internal hard disk (indeed, on your normal working “volume”); if you are editing a movie, you would not want to keep around 10 almost identical copies of a multi-gigabyte file, would you?
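To make the idea of "only saving the chunks that have changed" concrete, here is a minimal sketch of a chunk store. It is not Lion's implementation: the fixed 4096-byte chunk size, the use of SHA-256 and the ChunkStore class are all assumptions made for illustration.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical; Lion's real chunking scheme is not known here


def split_chunks(data, size=CHUNK_SIZE):
    """Cut data into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    """Toy content-addressed store: each distinct chunk is kept only once."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes

    def add_version(self, data):
        """Store one version; return the list of digests that rebuilds it."""
        recipe = []
        for chunk in split_chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # written only if new
            recipe.append(digest)
        return recipe

    def rebuild(self, recipe):
        """Reassemble a version from its list of digests."""
        return b"".join(self.chunks[d] for d in recipe)

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = ChunkStore()
v1 = b"".join(bytes([i]) * CHUNK_SIZE for i in range(4))  # 4 distinct chunks
v2 = v1[:-1] + b"!"                                       # only the last byte differs
r1 = store.add_version(v1)
r2 = store.add_version(v2)
print(store.stored_bytes())  # 5 chunks stored for two 4-chunk versions
```

Fixed-size chunks are the simplest choice, but they cope badly with insertions, which shift every later chunk boundary. That is presumably part of why the real heuristics are more involved.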

Now, you may remember that Time Machine does something similar to keep the size of backups under control. Again, Ars Technica comes to our aid. Instead of saving a copy of your entire hard disk each time it runs, Time Machine creates hard links to files and directories that have not changed since the last backup, and only copies files that have been modified. However, all this happens at the file level. Lion’s file versioning instead operates below the file level: it tracks and saves file chunks, which requires quite a bit more cleverness.
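The hard-link trick is easy to try out. The sketch below is only an illustration with made-up directory and file names (backup-1, backup-2, notes.txt); it covers files only, not the directory hard links that Time Machine also uses.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    backup1 = os.path.join(root, "backup-1")
    backup2 = os.path.join(root, "backup-2")
    os.mkdir(backup1)
    os.mkdir(backup2)

    original = os.path.join(backup1, "notes.txt")
    with open(original, "w") as f:
        f.write("unchanged between backups\n")

    # The second "backup" does not copy the file; it adds another name for it.
    linked = os.path.join(backup2, "notes.txt")
    os.link(original, linked)

    a, b = os.stat(original), os.stat(linked)
    same_inode = (a.st_ino, a.st_dev) == (b.st_ino, b.st_dev)
    link_count = a.st_nlink

print(same_inode, link_count)
```

Both names refer to the same inode, so the file content exists on disk only once, however many backups reference it.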

What’s intriguing is that the above does not explain why, when you look at files off the PerUID directory, you see what appear to be full copies of each version of any given file. For example, here’s what I get by issuing ls -l /.DocumentRevisions-V100/PerUID/505/3/com.apple.documentVersions:

total 0
-r--r--r--@ 1 marciano  staff   512 Aug  6 14:00 655BB90C-85A2-4F64-A9D0-9C469DABD56E.rtf
-r--r--r--@ 1 marciano  staff   324 Aug  5 17:29 A8DD2DAF-26F1-4CCA-8C9E-1811A379939A.rtf
-r--r--r--@ 1 marciano  staff   324 Aug  5 17:29 D8BFF539-F540-4EDA-A1A2-14E74C2CE5CA.rtf
-r--r--r--@ 1 marciano  staff  5057 Aug  6 14:02 DDA0C14D-DCC6-484F-BDB4-5F99A3923EE3.rtf

Again, you can open each of these files, and you will see the corresponding revision. The last one (dated August 6 at 14:02) is apparently 5057 bytes long. This matches what you see if you issue ls -l in the directory where I keep the file itself:

total 16
-rw-r--r--  1 marciano  staff  5057 Aug  6 14:02 external.rtf

However, notice the first line in each of the two listings. According to the ls manual entry, the total is the number of 512-byte blocks actually used by the files in the directory being listed. I assume that Lion saves files in 4096-byte blocks, so a 5057-byte file requires two such blocks, or exactly 16 512-byte blocks. But note that the versions directory requires zero 512-byte blocks, despite the fact that it seemingly contains 4 non-empty files! So, what gives?
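The block arithmetic can be checked with a few lines of code. The 4096-byte allocation block size is the same assumption made in the text, and blocks_512 is just a name chosen for this sketch.

```python
import math


def blocks_512(size_bytes, alloc_block=4096):
    """512-byte blocks occupied by a file, given the allocation block size."""
    alloc_blocks = math.ceil(size_bytes / alloc_block)
    return alloc_blocks * alloc_block // 512


print(blocks_512(5057))  # two 4096-byte blocks -> 16
```

For a real file, os.stat(path).st_blocks reports the same kind of count (512-byte blocks actually allocated), which makes it easy to compare against the size shown by ls -l.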

Here’s what I think is going on; if any of you readers happen to know for sure, please use the comments section :-) As you can see in this Wikipedia entry, each file in a modern file system is associated with a so-called “inode.” This, in turn, contains (among other things) a reference to the actual physical location on the hard disk where the file content is stored, called the “inode pointer structure.” When the OS needs to write a file to disk, it figures out where to place it, then creates an inode structure to keep track of it. However, in principle, the OS could also create an inode pointer structure pointing to physical locations where other files are stored. So, in particular, Lion could be writing to disk the file chunks it tracks for a particular versioned file, then creating an inode whose pointer structure points to these file chunks, and finally associating a file name with that inode. If this is what is going on, then files thus created would not take up any actual space, because they would merely be pointing to file chunks saved elsewhere, whose storage size is already accounted for. Indeed, it may well be that even the “original” file (external.rtf in the above example) has an inode pointer structure pointing to tracked file chunks.
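Here is a toy model of that hypothesis, purely to show what it would imply for space accounting. It says nothing about how HFS+ or Lion actually work; ToyVolume, the block ids and the file name version-A.rtf are invented for the example.

```python
class ToyVolume:
    """Toy model: file names map to lists of block ids; blocks may be shared."""

    BLOCK = 4096  # assumed allocation block size

    def __init__(self):
        self.files = {}     # name -> list of block ids (the "pointer structure")
        self.refcount = {}  # block id -> number of references to it

    def create(self, name, block_ids):
        self.files[name] = list(block_ids)
        for b in block_ids:
            self.refcount[b] = self.refcount.get(b, 0) + 1

    def delete(self, name):
        """Remove a name; return the bytes freed (blocks no longer referenced)."""
        freed = 0
        for b in self.files.pop(name):
            self.refcount[b] -= 1
            if self.refcount[b] == 0:
                del self.refcount[b]
                freed += self.BLOCK
        return freed

    def used_bytes(self):
        return len(self.refcount) * self.BLOCK


vol = ToyVolume()
vol.create("external.rtf", [1, 2])
vol.create("version-A.rtf", [1, 2])  # a version pointing at the same blocks
used_with_version = vol.used_bytes()
freed_by_version = vol.delete("version-A.rtf")
freed_by_original = vol.delete("external.rtf")
print(used_with_version, freed_by_version, freed_by_original)
```

In this model a version that shares all of its blocks costs nothing extra, deleting it frees nothing, and space is only returned when the last name pointing at the blocks goes away.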

Whatever the mechanism, it is a fact that Lion’s versioning does not take up unnecessary space. To check this, I created a roughly 16MB RTF file containing only text lines. I then added one more line to it. In the above directory off /.DocumentRevisions-V100/PerUID, I see both files, each 16MB in size. At this point, the Finder reported 131,297,953,160 bytes used. Then, I ran TextEdit and, using the Versions UI, deleted the second version. The Finder now reported 131,298,051,464 bytes used; there is a slight increase in hard disk usage, most likely due to virtual memory, intermediate files, or whatever, but the key thing is that hard disk usage did not go down by roughly 16MB. Then, I deleted the original version, thus keeping only the file in the regular, visible user folder: the Finder reported 131,298,071,944 bytes used. Finally, I closed TextEdit and deleted the file in my own folder; the Finder finally reported 131,281,700,232 bytes used, or about 16MB less. Bingo!
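For the record, the differences between the four Finder readings can be worked out as follows. The numbers are the ones quoted above; the labels are just shorthand for each step.

```python
readings = [
    ("both versions present", 131_297_953_160),
    ("second version deleted", 131_298_051_464),
    ("original version deleted", 131_298_071_944),
    ("document itself deleted", 131_281_700_232),
]

# Change in bytes used at each step, relative to the previous reading.
deltas = [
    (label, used - prev_used)
    for (_, prev_used), (label, used) in zip(readings, readings[1:])
]

for label, delta in deltas:
    print(f"{label}: {delta:+,} bytes")
# second version deleted: +98,304 bytes
# original version deleted: +20,480 bytes
# document itself deleted: -16,371,712 bytes
```

All three differences are exact multiples of 4096, which fits the assumption of 4096-byte allocation blocks, and only the last step frees anything close to the size of the document.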

The bottom line is that you can use file versioning and still retain full control over disk usage: Lion works hard not to keep redundant information on disk, and you can always decide to delete old, unused versions, thereby saving space. However, there are some implications for Dropbox (and, by extension, rsync):

  • first, if you keep your documents in the Dropbox folder, only the “current” revision is backed up: versions live outside the Dropbox folder, and simply do not get picked up.
  • second, if you think that you can address this issue by aliasing the DocumentRevisions-V100 directory, think again. Yes, it may work, but it’s a bad idea. Most likely, Dropbox would not be able to figure out that files off the PerUID directory occupy no space, and would instead create full copies of these files—a waste of useful (and paid-for) storage space. And that’s assuming this works at all—I haven’t tried and do not plan to!
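To get a feel for the gap between what a file-level tool would see and what the disk reports, one can compare apparent sizes with allocated blocks. This is a sketch under assumptions: apparent_and_allocated is a made-up helper, the demo files merely reuse the sizes from the listing above, and nothing here reflects how Dropbox or rsync are actually implemented.

```python
import os
import tempfile


def apparent_and_allocated(root):
    """Sum st_size (roughly what a file-level copy would transfer) and
    st_blocks * 512 (what the local file system reports as allocated)."""
    apparent = allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            apparent += st.st_size
            allocated += st.st_blocks * 512
    return apparent, allocated


# Demo on ordinary files with the sizes from the listing in this post.
with tempfile.TemporaryDirectory() as demo_dir:
    for i, size in enumerate([512, 324, 324, 5057]):
        with open(os.path.join(demo_dir, f"version-{i}.rtf"), "wb") as f:
            f.write(b"x" * size)
    apparent, allocated = apparent_and_allocated(demo_dir)

print(apparent, allocated)
```

Pointed at a versions directory like the one listed earlier (which requires root), the first number would be the sum of the sizes shown by ls -l, while the second would correspond to the "total 0" line.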

One thing: there is some (apparent) space used in the .cs/ChunkStorage directory. Maybe this gets cleaned up occasionally—I don’t know. There is one file, .../ChunkStorage/0/0/0/1, which contains chunks; it grew much bigger after the above exercise, even though in the end I deleted the file.

A caveat, and a trick

In the course of my investigations, I deleted the entire /.DocumentRevisions-V100 directory and its contents. Bad idea! I thought that these would be regenerated upon first saving a version of a document, but this is not the case. Actually, versioning apps will simply refuse to save! The aforelinked site krypted.com suggests recreating the .DocumentRevisions-V100 directory structure; this does enable saving and versioning, but not file chunk tracking. In fact, none of the files under db-V1, for instance, was recreated upon saving in a versioned app, although version files off of PerUID were created. What worked for me was using Disk Utility to repair the disk (not just the permissions: the entire disk). You need to run Disk Utility from either the recovery partition (hold CMD+R down while you restart) or your installation DVD, if you created one using the procedure described elsewhere on the internet.

Finally, a little trick: sudo -s gives you a root shell, without the need to actually enable the root user. Cool, and very useful if you, for instance, need to navigate a directory structure that is not accessible to regular users, such as /.DocumentRevisions-V100.

8 responses to “Mac OS X Lion file versions, part 2”

  6. > -r--r--r--@ 1 marciano staff 512 Aug 6 14:00 655BB90C-85A2-4F64-A9D0-9C469DABD56E.rtf

    The @ in the ls -l seems to indicate that extra meta-data exists, though that’s not specifically documented in Apple’s man page for ls; however the -@ option retrieves metadata, so the inference is probably good, and a test (below) shows there _is_ other meta-data. The meta-data probably tells the system where the actual file is, like a symlink but different, and also specifies any chunk changes. Here’s an example of what I saw (username redacted):

    sudo ls -l@ PerUID/501/1a/com.apple.documentVersions/BF9AF92D-30EA-46EB-9D1F-F411512E94B3.jpg
    -r--r--r--@ 1 xxxx staff 1855179 Sep 1 16:41 PerUID/501/1a/com.apple.documentVersions/BF9AF92D-30EA-46EB-9D1F-F411512E94B3.jpg
    com.apple.genstore.info 91
    com.apple.genstore.orig_perms_v1 1
    com.apple.genstore.origdisplayname 19

    I’d guess the 91 is the chunk-finder info, and that, like many SCM systems, Lion keeps the latest version in the file system, and diffs back to older ones on demand using the chunk info, etc.

    Since ls reports the data space used (not meta-data usage), the total used is shown as 0.

    Not a total WAG, but not a warrantied analysis either.
