Tuesday 2 November 2010

Making Large Repositories Smaller

Warning: Perform the operations below with care. Failure to follow correct procedures can damage or destroy your data. As always, create a checkpoint from your Perforce database and make a backup copy of your versioned file archives before proceeding!

How can I decrease the size of my large Perforce repository?

Perforce stores both the history of your data (metadata) and the data itself (versioned files). Metadata and versioned files are stored separately. By default they are both located in the P4ROOT directory, with the metadata stored in the db.* files and the versioned file trees stored in directories named after your depots. Over time, Perforce depots tend to grow. This growth happens as you add files and file revisions to your depot, and as the history of those files updates to track those changes.

There are several ways to reduce your repository size or divide file archives into smaller trees:

- Delete unused non-automatic labels and client workspaces.
- Rebuild your database (db.*) files to reclaim space.
- Remove large, obsolete, or unwanted binary files from the versioned file tree.
- Divide large depots into two or more smaller depots.

Each of these operations has its own implications with regard to time, effort, and results.

Over time even smaller Perforce servers can accumulate obsolete labels and workspaces that can take up considerable database space. Deleting unused labels and workspaces can reduce the size of several database tables, sometimes considerably.

Unused workspaces can have a large number of db.have entries and (to a lesser extent) db.working and db.lock records associated with files checked out for edit, delete, integrate, and so on. In most cases the user finished a project and deleted the contents of the local workspace root. Because Perforce is never aware of this activity, it does not remove the associated database information. Past experience shows that workspaces that have not been accessed in over 6 to 12 months are usually obsolete, and can be deleted with proper notification.

Note: Do not confuse the "Access" date of a workspace client with the "Update" date. "Access" refers to the last time the workspace specification was used to sync, modify or submit files; this does not include any instance in which the client was used as an argument, such as with p4 filelog or p4 files. "Update" refers to the last time the workspace specification was changed and saved by the user. It is possible to have a client that was created over a year ago that was accessed today.
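
As a rough illustration of the 6-to-12-month rule of thumb, the Python sketch below filters workspace records by their Access time rather than their Update time. It assumes the records come from tagged output such as that of p4 -ztag clients, with one "... field value" line per field, a blank line between records, and Access given in seconds since the epoch; check that format against your own server before relying on it. The function names, the sample data, and the 180-day threshold are illustrative and not part of Perforce.

```python
import time

SECONDS_PER_DAY = 86400


def parse_tagged(text):
    """Split tagged output ('... key value' lines) into a list of dicts."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("... "):
            key, _, value = line[4:].partition(" ")
            current[key] = value
        elif not line.strip() and current:
            # A blank line closes the current record.
            records.append(current)
            current = {}
    if current:
        records.append(current)
    return records


def stale_clients(records, max_idle_days=180, now=None):
    """Return names of clients whose Access time is older than max_idle_days."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    return [r["client"] for r in records
            if "Access" in r and int(r["Access"]) < cutoff]


# Hypothetical sample data in the assumed tagged format.
sample = """... client old_ws
... Update 1200000000
... Access 1210827015

... client new_ws
... Update 1200000000
... Access 1288000000
"""

# Evaluate as of 2 November 2010 (epoch 1288656000) for a repeatable result.
print(stale_clients(parse_tagged(sample), now=1288656000))  # ['old_ws']
```

The sketch only reports candidates. Notifying the owners and deleting the workspaces remain manual decisions.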

Unused labels primarily impact db.label, as each revision tagged or synced to a label is stored in the db.label file.

Note: Automatic labels do not have the same issue as standard labels created using p4 tag or p4 labelsync, as automatic labels use a change number (or other revision specifier) to specify the files associated with the label. See this Knowledge Base article on Static and Automatic Labels for more information.

Since db.label and db.have are typically the largest database tables in many Perforce installations, substantially reducing the size of those tables will have the largest benefit. Deleting other Perforce specifications (such as users, branches, jobs, and so on) might free up space, but it is likely to be negligible in comparison to workspaces and labels.

Note: After removing large labels and workspaces, or large numbers of specifications of any kind, you might need to rebuild your Perforce database files, as described below, to realize the full extent of the recovered space.

Perforce metadata is stored in the db.* files using btrees. To keep the btrees balanced and to reclaim any unused space, rebuild them from time to time. The table that usually benefits the most from this is db.have, which can often shrink by 20%. In addition to taking up less space, newly rebuilt db.* files are faster to access thanks to better btree organization.

To rebuild your database files:

1. Stop your Perforce server.
2. Take a checkpoint.
3. Move your db.* files out of the P4ROOT directory into a temporary location.
4. Recover from the checkpoint to re-create your database files.
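
The rebuild steps can be written down as a dry-run plan before anything is touched. The Python sketch below only builds and prints command strings; it executes nothing. It uses p4 admin stop, p4d -jc (take a checkpoint), and p4d -jr (recover from a checkpoint). The Unix mv command, the holding directory, and the checkpoint file name are assumptions for illustration; the actual checkpoint name depends on your server's journal counter.

```python
import shlex


def rebuild_plan(p4root, holding_dir, checkpoint_file):
    """Return the rebuild steps as shell command strings (nothing is executed)."""
    root = shlex.quote(p4root)
    holding = shlex.quote(holding_dir)
    checkpoint = shlex.quote(checkpoint_file)
    return [
        "p4 admin stop",                   # 1. stop the server
        f"p4d -r {root} -jc",              # 2. take a checkpoint
        f"mv {root}/db.* {holding}/",      # 3. move the db.* files aside
        f"p4d -r {root} -jr {checkpoint}", # 4. recover from the checkpoint
    ]


# Hypothetical paths and checkpoint name, for illustration only.
for step in rebuild_plan("/perforce", "/perforce_old_db", "checkpoint.42"):
    print(step)
```

Reviewing the printed plan against the Backup and Recovery documentation for your server version is a sensible check before running any of it by hand.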

Details can be found in the Backup and Recovery chapter of the Perforce System Administrator's Guide.

Note: Perforce server versions later than 2008.2 attempt to reorder the btrees to reduce wasted space, which might reduce the amount of space to be reclaimed using this technique.

Binary files can consume a lot of disk space because they are stored on a per-revision basis, unlike the reverse-delta storage used for text files. To minimize disk space consumption, store only required revisions of binary files. While the metadata for such files is quite small, the file revisions themselves can take up a lot of disk space. You can easily replace these with small stub files that indicate their removal.

If the file revision being replaced has been branched or integrated and serves as the source of a lazy copy for another file revision in the depot, the integrated revision is replaced as well. You can undo the lazy copy with the p4 snap command.

Note: Always consider the impact of a revision that has been branched multiple times. For example, consider a 100MB file that has been branched 3 times: replacing the initial revision saves 100MB, but undoing the lazy copies in the three branches adds 300MB, for a net increase of 200MB in disk space usage.
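
The arithmetic in the example above can be checked with a few lines of Python. This is a simplified model that ignores compression and the size of the stub file itself; the function name is illustrative.

```python
def net_disk_change_mb(file_size_mb, lazy_copies):
    """Net change in archive size (MB) after snapping copies and stubbing the original."""
    added_by_snap = file_size_mb * lazy_copies  # each branch gets its own full copy
    saved_by_stub = file_size_mb                # the original is replaced by a tiny stub
    return added_by_snap - saved_by_stub


print(net_disk_change_mb(100, 3))  # 200: three snapped copies cost more than the stub saves
print(net_disk_change_mb(100, 1))  # 0: one snapped copy cancels the saving
print(net_disk_change_mb(100, 0))  # -100: no lazy copies, the full 100MB is saved
```

Under this model, stubbing a revision only pays off when it has no lazy copies that need to be preserved.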

Note: Replacing the versioned file with a stub file will only work with binary file types. Files stored as RCS reverse deltas, such as text files, cannot be modified using this procedure.

To replace a revision of a file that is stored as an individual compressed revision in your versioned file tree, perform the following steps:

1. Create a text file with descriptive text, for example:

   The original version of this file has been deleted to save space!

2. Compress the text file with a gzip compression utility. If the file is not stored compressed, as with the ubinary file type, skip this step. For binary files of type "apple", gzip compression alone does not work. Instead, use Perforce to store the stub as an "apple" type file somewhere in a Perforce depot. Then use that stub file as it exists in the depot's versioned file tree as the source file in the remaining steps of this procedure.

3. Find the file representing the desired revision within the versioned file tree. File revisions are stored on a per-revision basis in a directory that has the full name of the file with ,d appended.

   For Perforce Server versions earlier than 2006.2, if the depot path of the file revision you are looking for is "//depot/path/too/big.exe#12" and the server's P4ROOT directory is "/perforce", the file for this revision is located at:

   /perforce/depot/path/too/big.exe,d/1.12.gz

   For Perforce Server versions 2006.2 and later, files are stored in the archive keyed to the file name and pending change number, rather than the file name and revision number. The pending change number can be found using "p4 fstat -Oc":

   # p4 fstat -Oc //depot/path/too/big.exe#12
   ... depotFile //depot/path/too/big.exe
   ... clientFile /Users/perforce/path/too/big.exe
   ... isMapped
   ... headAction edit
   ... headType binary
   ... headTime 1239673838
   ... headRev 12
   ... headChange 713
   ... headModTime 1210827015
   ... haveRev 12
   ... lbrFile //depot/path/too/big.exe
   ... lbrRev 1.713
   ... lbrType binary

   The lbrRev field shows the pending change number under which the archive file is stored (here, 713). If this server's P4ROOT is /perforce, then the file for //depot/path/too/big.exe#12 is located at:

   /perforce/depot/path/too/big.exe,d/1.713.gz

   If the file is not stored compressed, as with the ubinary storage format, the file name does not have .gz at the end.

4. Replace the desired versioned file in your versioned file tree with the new, smaller file that you created in step 1 (compressed in step 2 where applicable).

5. Run p4 verify -v on the file so that the server replaces the stored MD5 hash value of the old file with that of the new file. This prevents "BAD" errors during the next verification. For example:

   p4 verify -v //depot/path/too/big.exe#12

6. Run a test sync of the file to ensure you performed all the steps properly. For example, sync "//depot/path/too/big.exe#12" and verify that it is now the text file you created in step 1.
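
Two parts of this procedure lend themselves to a small Python sketch: creating the gzip-compressed stub, and working out where the archive file for a revision should live from the lbrFile and lbrRev fields. The sketch assumes the depot's archives sit directly under P4ROOT (the default described earlier) and that the server is 2006.2 or later. The function names are illustrative. The stub is written to a scratch directory, not into a live archive.

```python
import gzip
import os
import posixpath
import tempfile


def archive_path(p4root, lbr_file, lbr_rev, compressed=True):
    """Map lbrFile/lbrRev to the expected archive file path under P4ROOT."""
    relative = lbr_file.lstrip("/")  # '//depot/a/b' -> 'depot/a/b'
    name = lbr_rev + (".gz" if compressed else "")
    return posixpath.join(p4root, relative + ",d", name)


def write_stub(path, text, compressed=True):
    """Write the stub text to path, gzip-compressed unless compressed is False."""
    opener = gzip.open if compressed else open
    with opener(path, "wb") as handle:
        handle.write(text.encode("utf-8"))


print(archive_path("/perforce", "//depot/path/too/big.exe", "1.713"))
# /perforce/depot/path/too/big.exe,d/1.713.gz

# Write the stub to a scratch location and read it back as a sanity check.
scratch = os.path.join(tempfile.mkdtemp(), "1.713.gz")
write_stub(scratch, "The original version of this file has been deleted to save space!\n")
with gzip.open(scratch, "rb") as handle:
    print(handle.read().decode("utf-8"), end="")
```

Copying the stub over the real archive file, running p4 verify -v, and the test sync are deliberately left as manual steps, so that each one can be checked against a backup.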

To automate the process, you can write a script that traverses your versioned file tree and performs this substitution on files meeting specified criteria, such as age, size, number of revisions, and so on. In Perforce Server release 2004.2 and later, you can create a commit trigger that performs the substitution. Please contact Perforce Technical Support at support@perforce.com for additional assistance with this approach.
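
A first step toward such a script is simply finding candidates. The Python sketch below walks a directory tree, looks only inside directories whose names end in ,d, and reports files at or above a size threshold. It performs no substitution. The function name, the threshold, and the throwaway demonstration tree are assumptions for illustration; a real script would point at a copy of the versioned file tree and add whatever age or revision-count criteria apply.

```python
import os
import tempfile


def large_revision_files(archive_root, min_bytes):
    """Yield (path, size) for files inside ',d' directories that are at least min_bytes."""
    for dirpath, _dirnames, filenames in os.walk(archive_root):
        if not dirpath.endswith(",d"):
            continue
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size >= min_bytes:
                yield path, size


# Demonstration on a throwaway tree that mimics the ',d' layout.
root = tempfile.mkdtemp()
rev_dir = os.path.join(root, "depot", "path", "too", "big.exe,d")
os.makedirs(rev_dir)
with open(os.path.join(rev_dir, "1.713.gz"), "wb") as handle:
    handle.write(b"x" * 2048)  # large enough to be reported
with open(os.path.join(rev_dir, "1.714.gz"), "wb") as handle:
    handle.write(b"x" * 10)    # below the threshold, so skipped

for path, size in large_revision_files(root, min_bytes=1024):
    print(os.path.relpath(path, root), size)
```

Keeping the reporting pass separate from any replacement pass makes it easier to review the candidate list before archive files are changed.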

Dividing a depot into more than one depot allows you to separate your data into logical groups, resulting in each depot being smaller and more manageable than the original depot. Dividing a depot removes all metadata relationships between the resultant pieces. Any integration references between the pieces are gone when you are finished, as if the files never had any relationship whatsoever.

Dividing a depot is easiest if done at the branch level. For example, consider a depot with the following two main branches:

//depot/path_A/...
//depot/path_B/...

Assume a server root of "/perforce".

The steps required to split a depot into two depots at the branch level are as follows:

1. Take a checkpoint on the source server, where the path_A data is to reside.
2. Set up a server executable for the path_B server in another location (possibly on a different server machine) where the path_B data is to reside.
3. Copy the versioned files for path_B onto the new server location (use of your preferred tar/zip utility can make this easier).
4. Restore the path_B checkpoint in this new location.
5. Remove the path_B data from the path_A server using a p4 obliterate command. (See the note on optimal obliteration at the end of this document before proceeding!)
6. Remove the path_A data from the path_B server using a p4 obliterate command. (Again, see the note on optimal obliteration at the end of this document before proceeding!)
7. Start the servers.

If you have any questions about this procedure, please contact Perforce Technical Support at support@perforce.com.

Obliterating using server releases prior to 2005.1 can take an extremely long time. Detailed instructions on the use of the p4 obliterate command can be found in the Perforce Command Reference.

To optimize the operation and to ensure you obliterate only the desired files, use a client specification with a view that matches only the files you want to obliterate, similar to the following:

View:
    //depot/path_to_remove/... //client_name/path_to_remove/...

and a client-syntax p4 obliterate command as follows:

    p4 obliterate //client_name/...

To perform the requested action, you must specify the -y flag when you issue the p4 obliterate command. Without this flag the command defaults to "report" mode, indicating which files and database records would be purged when using the -y flag.
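
The narrow-view approach and the report-mode default can be combined in a small helper that builds, but does not run, the command. In the Python sketch below, the function names are illustrative, and the use of the global -c option to select the client is an assumption about how you would make the client-syntax path resolve; the -y flag is only added when explicitly requested.

```python
def view_line(depot_dir, client_name):
    """Build a View line that maps a single depot directory into the client."""
    depot_dir = depot_dir.rstrip("/")
    leaf = depot_dir.rsplit("/", 1)[-1]
    return f"{depot_dir}/... //{client_name}/{leaf}/..."


def obliterate_command(client_name, execute=False):
    """Return the p4 obliterate argument list; report mode unless execute is True."""
    args = ["p4", "-c", client_name, "obliterate"]
    if execute:
        args.append("-y")  # without -y, obliterate only reports what it would purge
    args.append(f"//{client_name}/...")
    return args


print(view_line("//depot/path_to_remove", "client_name"))
# //depot/path_to_remove/... //client_name/path_to_remove/...

print(" ".join(obliterate_command("client_name")))
# p4 -c client_name obliterate //client_name/...

print(" ".join(obliterate_command("client_name", execute=True)))
# p4 -c client_name obliterate -y //client_name/...
```

Running the report-mode form first and reading its output before adding -y gives a last chance to confirm that the view matches only the intended files.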
