Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PATH WALK III: Add 'git backfill' command #1820

Open
wants to merge 5 commits into
base: api-upstream
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@
/git-apply
/git-archimport
/git-archive
/git-backfill
/git-bisect
/git-blame
/git-branch
Expand Down
68 changes: 68 additions & 0 deletions Documentation/git-backfill.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
git-backfill(1)
===============

NAME
----
git-backfill - Download missing objects in a partial clone


SYNOPSIS
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Fri, Dec 06, 2024 at 08:07:16PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
> 
> Users may want to specify a minimum batch size for their needs. This is only
> a minimum: the path-walk API provides a list of OIDs that correspond to the
> same path, and thus it is optimal to allow delta compression across those
> objects in a single server request.

Okay, here you explicitly say that this is a minimum batch size, so this
is by design and with a proper reason. Good.

> We could consider limiting the request to have a maximum batch size in the
> future. For now, we let the path-walk API batches determine the
> boundaries.

Should we maybe rename `--batch-size` to `--min-batch-size` so that it
does not become awkward if we ever want to have a maximum batch size, as
well? Also helps to set expectations with the user.

[snip]
> Based on these experiments, a batch size of 50,000 was chosen as the
> default value.

Thanks for all the data, this is really helpful!

> diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
> index 0e10f066fef..9b0bae04e9d 100644
> --- a/Documentation/git-backfill.txt
> +++ b/Documentation/git-backfill.txt
> @@ -38,6 +38,14 @@ delta compression in the packfile sent by the server.
>  By default, `git backfill` downloads all blobs reachable from the `HEAD`
>  commit. This set can be restricted or expanded using various options.
>  
> +OPTIONS
> +-------
> +
> +--batch-size=<n>::
> +	Specify a minimum size for a batch of missing objects to request
> +	from the server. This size may be exceeded by the last set of
> +	blobs seen at a given path. Default batch size is 16,000.

This is stale: s/16,000/50,000/

> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index e5f2000d5e0..127333daef8 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -112,6 +112,8 @@ int cmd_backfill(int argc, const char **argv, const char *prefix,
> struct reposit
>                 .batch_size = 50000,
>         };
>         struct option options[] = {
> +               OPT_INTEGER(0, "batch-size", &ctx.batch_size,
> +                           N_("Minimun number of objects to request at a time")),

s/Minimun/Minimum

Patrick

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 12/16/24 3:01 AM, Patrick Steinhardt wrote:
> On Fri, Dec 06, 2024 at 08:07:16PM +0000, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <derrickstolee@github.com>
>>
>> Users may want to specify a minimum batch size for their needs. This is only
>> a minimum: the path-walk API provides a list of OIDs that correspond to the
>> same path, and thus it is optimal to allow delta compression across those
>> objects in a single server request.
>
> Okay, here you explicitly say that this is a minimum batch size, so this
> is by design and with a proper reason. Good.
>
>> We could consider limiting the request to have a maximum batch size in the
>> future. For now, we let the path-walk API batches determine the
>> boundaries.
>
> Should we maybe rename `--batch-size` to `--min-batch-size` so that it
> does not become awkward if we ever want to have a maximum batch size, as
> well? Also helps to set expectations with the user.

Good idea. Will do.

>> diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
>> index 0e10f066fef..9b0bae04e9d 100644
>> --- a/Documentation/git-backfill.txt
>> +++ b/Documentation/git-backfill.txt
>> @@ -38,6 +38,14 @@ delta compression in the packfile sent by the server.
>>   By default, `git backfill` downloads all blobs reachable from the `HEAD`
>>   commit. This set can be restricted or expanded using various options.
>>
>> +OPTIONS
>> +-------
>> +
>> +--batch-size=<n>::
>> +	Specify a minimum size for a batch of missing objects to request
>> +	from the server. This size may be exceeded by the last set of
>> +	blobs seen at a given path. Default batch size is 16,000.
>
> This is stale: s/16,000/50,000/

Thanks!

>> diff --git a/builtin/backfill.c b/builtin/backfill.c
>> index e5f2000d5e0..127333daef8 100644
>> --- a/builtin/backfill.c
>> +++ b/builtin/backfill.c
>> @@ -112,6 +112,8 @@ int cmd_backfill(int argc, const char **argv, const char *prefix,
>> struct reposit
>>                  .batch_size = 50000,
>>          };
>>          struct option options[] = {
>> +               OPT_INTEGER(0, "batch-size", &ctx.batch_size,
>> +                           N_("Minimun number of objects to request at a time")),
>
> s/Minimun/Minimum

Thanks for the careful eye for detail.

-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Fri, Dec 06, 2024 at 08:07:17PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
> 
> One way to significantly reduce the cost of a Git clone and later fetches is
> to use a blobless partial clone and combine that with a sparse-checkout that
> reduces the paths that need to be populated in the working directory. Not
> only does this reduce the cost of clones and fetches, the sparse-checkout
> reduces the number of objects needed to download from a promisor remote.
> 
> However, history investigations can be expensie as computing blob diffs will

s/expensie/expensive

Patrick

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Fri, Dec 20, 2024 at 04:29:51PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
> 
> Users may want to specify a minimum batch size for their needs. This is only
> a minimum: the path-walk API provides a list of OIDs that correspond to the
> same path, and thus it is optimal to allow delta compression across those
> objects in a single server request.
> 
> We could consider limiting the request to have a maximum batch size in the
> future. For now, we let the path-walk API batches determine the
> boundaries.
> 
> To get a feeling for the value of specifying the --batch-size parameter,

This should say `--min-batch-size`.

> diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
> index ece887831f6..e392517869c 100644
> --- a/Documentation/git-backfill.txt
> +++ b/Documentation/git-backfill.txt
> @@ -9,7 +9,7 @@ git-backfill - Download missing objects in a partial clone
>  SYNOPSIS
>  --------
>  [verse]
> -'git backfill' [<options>]
> +'git backfill' [--batch-size=<n>]
>  
>  DESCRIPTION
>  -----------

Here, as well.

> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index 177fd4286c7..ddccececc36 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -21,14 +21,14 @@
>  #include "path-walk.h"
>  
>  static const char * const builtin_backfill_usage[] = {
> -	N_("git backfill [<options>]"),
> +	N_("git backfill [--batch-size=<n>]"),

And here.

> @@ -111,9 +111,11 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
>  	struct backfill_context ctx = {
>  		.repo = repo,
>  		.current_batch = OID_ARRAY_INIT,
> -		.batch_size = 50000,
> +		.min_batch_size = 50000,
>  	};

Nit: it would be nice to adjust the name of this variable in the
preceding commit already so that it doesn't have to change again over
here.

Patrick

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Fri, Dec 20, 2024 at 04:29:52PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/path-walk.c b/path-walk.c
> index 136ec08fb0e..c7456a9c1c0 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -12,6 +12,7 @@
>  #include "object.h"
>  #include "oid-array.h"
>  #include "prio-queue.h"
> +#include "repository.h"
>  #include "revision.h"
>  #include "string-list.h"
>  #include "strmap.h"
> @@ -173,6 +174,23 @@ static int add_tree_entries(struct path_walk_context *ctx,
>  		if (type == OBJ_TREE)
>  			strbuf_addch(&path, '/');
>  
> +		if (ctx->info->pl) {
> +			int dtype;
> +			enum pattern_match_result match;
> +			match = path_matches_pattern_list(path.buf, path.len,
> +							  path.buf + base_len, &dtype,
> +							  ctx->info->pl,
> +							  ctx->repo->index);
> +
> +			if (ctx->info->pl->use_cone_patterns &&
> +			    match == NOT_MATCHED)
> +				continue;
> +			else if (!ctx->info->pl->use_cone_patterns &&
> +				 type == OBJ_BLOB &&
> +				 match != MATCHED)

For my own understanding: is there as pecific reason why one of the
branches uses `== NOT_MATCHED` whereas the other one uses `!= MATCHED`?

> diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
> index 7f2d409c5bc..61e845e5ec2 100644
> --- a/t/helper/test-path-walk.c
> +++ b/t/helper/test-path-walk.c
> @@ -65,7 +67,7 @@ static int emit_block(const char *path, struct oid_array *oids,
>  
>  int cmd__path_walk(int argc, const char **argv)
>  {
> -	int res;
> +	int res, stdin_pl = 0;
>  	struct rev_info revs = REV_INFO_INIT;
>  	struct path_walk_info info = PATH_WALK_INFO_INIT;
>  	struct path_walk_test_data data = { 0 };
> @@ -80,6 +82,8 @@ int cmd__path_walk(int argc, const char **argv)
>  			 N_("toggle inclusion of tree objects")),
>  		OPT_BOOL(0, "prune", &info.prune_all_uninteresting,
>  			 N_("toggle pruning of uninteresting paths")),
> +		OPT_BOOL(0, "stdin-pl", &stdin_pl,
> +			 N_("read a pattern list over stdin")),
>  		OPT_END(),
>  	};
>  

I was about to suggest giving this a more descriptive name, as it might
be confusing for anybody not intimately familiar with the code. But then
I noticed that this is part of the test helper, only, so it doesn't
matter as much. So feel free to ignore.

Patrick

--------
[verse]
'git backfill' [--batch-size=<n>] [--[no-]sparse]

DESCRIPTION
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Fri, Dec 06, 2024 at 08:07:15PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
> index 640144187d3..0e10f066fef 100644
> --- a/Documentation/git-backfill.txt
> +++ b/Documentation/git-backfill.txt
> @@ -14,6 +14,30 @@ SYNOPSIS
>  DESCRIPTION
>  -----------
>  
> +Blobless partial clones are created using `git clone --filter=blob:none`
> +and then configure the local repository such that the Git client avoids
> +downloading blob objects unless they are required for a local operation.
> +This initially means that the clone and later fetches download reachable
> +commits and trees but no blobs. Later operations that change the `HEAD`
> +pointer, such as `git checkout` or `git merge`, may need to download
> +missing blobs in order to complete their operation.

Okay.

> +In the worst cases, commands that compute blob diffs, such as `git blame`,
> +become very slow as they download the missing blobs in single-blob
> +requests to satisfy the missing object as the Git command needs it. This
> +leads to multiple download requests and no ability for the Git server to
> +provide delta compression across those objects.
> +
> +The `git backfill` command provides a way for the user to request that
> +Git downloads the missing blobs (with optional filters) such that the
> +missing blobs representing historical versions of files can be downloaded
> +in batches. The `backfill` command attempts to optimize the request by
> +grouping blobs that appear at the same path, hopefully leading to good
> +delta compression in the packfile sent by the server.

Hm. So we're asking the user to fix a usability issue of git-blame(1),
don't we? Ideally, git-blame(1) itself should know to transparently
batch the blobs it requires to compute its output, shouldn't it? That
usecase alone doesn't yet convince me that git-backfill(1) is a good
idea as I'd think we should rather fix the underlying issue.

So are there other usecases for git-backfill(1)? I can imagine that it
might be helpful in the context of scripts that know they'll operate on
a bunch of blobs.

> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index 38e6aaeaa03..e5f2000d5e0 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -1,16 +1,116 @@
>  #include "builtin.h"
> +#include "git-compat-util.h"
>  #include "config.h"
>  #include "parse-options.h"
>  #include "repository.h"
> +#include "commit.h"
> +#include "hex.h"
> +#include "tree.h"
> +#include "tree-walk.h"
>  #include "object.h"
> +#include "object-store-ll.h"
> +#include "oid-array.h"
> +#include "oidset.h"
> +#include "promisor-remote.h"
> +#include "strmap.h"
> +#include "string-list.h"
> +#include "revision.h"
> +#include "trace2.h"
> +#include "progress.h"
> +#include "packfile.h"
> +#include "path-walk.h"
>  
>  static const char * const builtin_backfill_usage[] = {
>  	N_("git backfill [<options>]"),
>  	NULL
>  };
>  
> +struct backfill_context {
> +	struct repository *repo;
> +	struct oid_array current_batch;
> +	size_t batch_size;
> +};
> +
> +static void clear_backfill_context(struct backfill_context *ctx)
> +{
> +	oid_array_clear(&ctx->current_batch);
> +}

Nit: our style guide says that this should rather be
`backfill_context_clear()`.

> +static void download_batch(struct backfill_context *ctx)
> +{
> +	promisor_remote_get_direct(ctx->repo,
> +				   ctx->current_batch.oid,
> +				   ctx->current_batch.nr);
> +	oid_array_clear(&ctx->current_batch);
> +
> +	/*
> +	 * We likely have a new packfile. Add it to the packed list to
> +	 * avoid possible duplicate downloads of the same objects.
> +	 */
> +	reprepare_packed_git(ctx->repo);
> +}
> +
> +static int fill_missing_blobs(const char *path UNUSED,
> +			      struct oid_array *list,
> +			      enum object_type type,
> +			      void *data)
> +{
> +	struct backfill_context *ctx = data;
> +
> +	if (type != OBJ_BLOB)
> +		return 0;
> +
> +	for (size_t i = 0; i < list->nr; i++) {
> +		off_t size = 0;
> +		struct object_info info = OBJECT_INFO_INIT;
> +		info.disk_sizep = &size;
> +		if (oid_object_info_extended(ctx->repo,
> +					     &list->oid[i],
> +					     &info,
> +					     OBJECT_INFO_FOR_PREFETCH) ||
> +		    !size)
> +			oid_array_append(&ctx->current_batch, &list->oid[i]);
> +	}
> +
> +	if (ctx->current_batch.nr >= ctx->batch_size)
> +		download_batch(ctx);

Okay, so the batch size is just "best effort". If we walk a tree that
makes us exceed the batch size then we wouldn't issue a fetch during the
tree walk. Is there any specific reason for this behaviour?

In any case, as long as this is properly documented I think this should
be fine in general.

> +	return 0;
> +}
> +
> +static int do_backfill(struct backfill_context *ctx)
> +{
> +	struct rev_info revs;
> +	struct path_walk_info info = PATH_WALK_INFO_INIT;
> +	int ret;
> +
> +	repo_init_revisions(ctx->repo, &revs, "");
> +	handle_revision_arg("HEAD", &revs, 0, 0);
> +
> +	info.blobs = 1;
> +	info.tags = info.commits = info.trees = 0;
> +
> +	info.revs = &revs;
> +	info.path_fn = fill_missing_blobs;
> +	info.path_fn_data = ctx;
> +
> +	ret = walk_objects_by_path(&info);
> +
> +	/* Download the objects that did not fill a batch. */
> +	if (!ret)
> +		download_batch(ctx);
> +
> +	clear_backfill_context(ctx);

Are we leaking `revs` and `info`?

> +	return ret;
> +}
> +
>  int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo)
>  {
> +	struct backfill_context ctx = {
> +		.repo = repo,
> +		.current_batch = OID_ARRAY_INIT,
> +		.batch_size = 50000,
> +	};
>  	struct option options[] = {
>  		OPT_END(),
>  	};
> @@ -23,7 +123,5 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
>  
>  	repo_config(repo, git_default_config, NULL);
>  
> -	die(_("not implemented"));
> -
> -	return 0;
> +	return do_backfill(&ctx);
>  }

The current iteration only backfills blobs as far as I can see. Do we
maybe want to keep the door open for future changes in git-backfill(1)
by implementing this via a "blob" subcommand?

Patrick

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 12/16/24 3:01 AM, Patrick Steinhardt wrote:
> On Fri, Dec 06, 2024 at 08:07:15PM +0000, Derrick Stolee via GitGitGadget wrote:

>> +The `git backfill` command provides a way for the user to request that
>> +Git downloads the missing blobs (with optional filters) such that the
>> +missing blobs representing historical versions of files can be downloaded
>> +in batches. The `backfill` command attempts to optimize the request by
>> +grouping blobs that appear at the same path, hopefully leading to good
>> +delta compression in the packfile sent by the server.
>
> Hm. So we're asking the user to fix a usability issue of git-blame(1),
> don't we? Ideally, git-blame(1) itself should know to transparently
> batch the blobs it requires to compute its output, shouldn't it? That
> usecase alone doesn't yet convince me that git-backfill(1) is a good
> idea as I'd think we should rather fix the underlying issue.

I've looked into making this change for 'git blame' and it is a
nontrivial change. I'm not sure on the timeline that we could expect
'git blame' to be improved.

But you're right that this is not enough justification on its own. It's
an example of a kind of command that has these problems, including 'git
log [-p|-L]'.

> So are there other usecases for git-backfill(1)? I can imagine that it
> might be helpful in the context of scripts that know they'll operate on
> a bunch of blobs.

One major motivation is that it can essentially provide a way to do
resumable clones: get the commits and trees in one go with a blobless
clone, then download the blobs in batches. If something interrupts the
'git backfill' command, then restarting it will only repeat the most
recent batch.

Finally, in a later patch we can see that the --sparse option allows a
user to operate as if they have a full clone but where they only include
blob data within their sparse-checkout, providing reduced disk space and
network time to get to that state.

>> +	if (ctx->current_batch.nr >= ctx->batch_size)
>> +		download_batch(ctx);
>
> Okay, so the batch size is just "best effort". If we walk a tree that
> makes us exceed the batch size then we wouldn't issue a fetch during the
> tree walk. Is there any specific reason for this behaviour?
>
> In any case, as long as this is properly documented I think this should
> be fine in general.

The main reason is that we expect the server to return a packfile that
has many delta relationships within the objects at a given path. If we
split the batch in the middle of a path batch, then we are artificially
introducing breaks in the delta chains that could be wasteful.

However, this batching pattern could be problematic if there are a
million versions of a single file and the batch is too large to download
efficiently. This "absolute max batch size" is currently left as a
future extension.

>> +	clear_backfill_context(ctx);
>
> Are we leaking `revs` and `info`?

At the moment. Will fix.

>> +	return ret;
>> +}
>> +
>>   int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo)
>>   {
>> +	struct backfill_context ctx = {
>> +		.repo = repo,
>> +		.current_batch = OID_ARRAY_INIT,
>> +		.batch_size = 50000,
>> +	};
>>   	struct option options[] = {
>>   		OPT_END(),
>>   	};
>> @@ -23,7 +123,5 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
>>
>>   	repo_config(repo, git_default_config, NULL);
>>
>> -	die(_("not implemented"));
>> -
>> -	return 0;
>> +	return do_backfill(&ctx);
>>   }
>
> The current iteration only backfills blobs as far as I can see. Do we
> maybe want to keep the door open for future changes in git-backfill(1)
> by implementing this via a "blob" subcommand?

I think that one tricky part is to ask "what could be missing?". With
Git's partial clone, it seems that we could have treeless or depth-
based tree restrictions. Technically, there could also be clones that
restrict to a set of sparse-checkout patterns, but I'm not aware of any
server that will respect those kinds of clones.  At the moment, the tree
walk would fault in any missing trees as they are seen, but this is
extremely inefficient.

I think that the path-walk API could be adjusted to be more careful to
check for the existence of an object before automatically loading it.
That would allow for batched downloads of missing trees, though a
second scan would be required to get the next layer of objects.

I'm not sure a subcommand is the right way to solve for this potential
future, but instead later we could adjust the logic to be better for
these treeless or tree-restricted clones.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Fri, Dec 20, 2024 at 04:29:50PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index 38e6aaeaa03..177fd4286c7 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
[snip]
> +static int fill_missing_blobs(const char *path UNUSED,
> +			      struct oid_array *list,
> +			      enum object_type type,
> +			      void *data)
> +{
> +	struct backfill_context *ctx = data;
> +
> +	if (type != OBJ_BLOB)
> +		return 0;
> +
> +	for (size_t i = 0; i < list->nr; i++) {
> +		off_t size = 0;
> +		struct object_info info = OBJECT_INFO_INIT;
> +		info.disk_sizep = &size;
> +		if (oid_object_info_extended(ctx->repo,
> +					     &list->oid[i],
> +					     &info,
> +					     OBJECT_INFO_FOR_PREFETCH) ||
> +		    !size)

So this is the object existence test? Is there a reason why we don't use
`repo_has_object_file()`, or its `_with_flags()` variant if we need to
pass `OBJECT_INFO_FOR_PREFETCH`?

> +			oid_array_append(&ctx->current_batch, &list->oid[i]);
> +	}
> +
> +	if (ctx->current_batch.nr >= ctx->batch_size)
> +		download_batch(ctx);
> +
> +	return 0;
> +}
> +
> +static int do_backfill(struct backfill_context *ctx)
> +{
> +	struct rev_info revs;
> +	struct path_walk_info info = PATH_WALK_INFO_INIT;
> +	int ret;
> +
> +	repo_init_revisions(ctx->repo, &revs, "");
> +	handle_revision_arg("HEAD", &revs, 0, 0);
> +
> +	info.blobs = 1;
> +	info.tags = info.commits = info.trees = 0;

Nit: this should be unnecessary as PATH_WALK_INFO_INIT already
initialized those fields for us, right?

> +	info.revs = &revs;
> +	info.path_fn = fill_missing_blobs;
> +	info.path_fn_data = ctx;
> +
> +	ret = walk_objects_by_path(&info);
> +
> +	/* Download the objects that did not fill a batch. */
> +	if (!ret)
> +		download_batch(ctx);
> +
> +	backfill_context_clear(ctx);

Nit: I think it's a bit funny that we're cleaning up the context over
here rather than in the caller.

Patrick

-----------

Blobless partial clones are created using `git clone --filter=blob:none`
and then configure the local repository such that the Git client avoids
downloading blob objects unless they are required for a local operation.
This initially means that the clone and later fetches download reachable
commits and trees but no blobs. Later operations that change the `HEAD`
pointer, such as `git checkout` or `git merge`, may need to download
missing blobs in order to complete their operation.

In the worst cases, commands that compute blob diffs, such as `git blame`,
become very slow as they download the missing blobs in single-blob
requests to satisfy the missing object as the Git command needs it. This
leads to multiple download requests and no ability for the Git server to
provide delta compression across those objects.

The `git backfill` command provides a way for the user to request that
Git downloads the missing blobs (with optional filters) such that the
missing blobs representing historical versions of files can be downloaded
in batches. The `backfill` command attempts to optimize the request by
grouping blobs that appear at the same path, hopefully leading to good
delta compression in the packfile sent by the server.

In this way, `git backfill` provides a mechanism to break a large clone
into smaller chunks. Starting with a blobless partial clone with `git
clone --filter=blob:none` and then running `git backfill` in the local
repository provides a way to download all reachable objects in several
smaller network calls than downloading the entire repository at clone
time.

By default, `git backfill` downloads all blobs reachable from the `HEAD`
commit. This set can be restricted or expanded using various options.

OPTIONS
-------

--min-batch-size=<n>::
Specify a minimum size for a batch of missing objects to request
from the server. This size may be exceeded by the last set of
blobs seen at a given path. The default minimum batch size is
50,000.

--[no-]sparse::
Only download objects if they appear at a path that matches the
current sparse-checkout. If the sparse-checkout feature is enabled,
then `--sparse` is assumed and can be disabled with `--no-sparse`.

SEE ALSO
--------
linkgit:git-clone[1].

GIT
---
Part of the linkgit:git[1] suite
11 changes: 10 additions & 1 deletion Documentation/technical/api-path-walk.txt
Original file line number Diff line number Diff line change
Expand Up @@ -56,8 +56,17 @@ better off using the revision walk API instead.
the revision walk so that the walk emits commits marked with the
`UNINTERESTING` flag.

`pl`::
This pattern list pointer allows focusing the path-walk search to
a set of patterns, only emitting paths that match the given
patterns. See linkgit:gitignore[5] or
linkgit:git-sparse-checkout[1] for details about pattern lists.
When the pattern list uses cone-mode patterns, then the path-walk
API can prune the set of paths it walks to improve performance.

Examples
--------

See example usages in:
`t/helper/test-path-walk.c`
`t/helper/test-path-walk.c`,
`builtin/backfill.c`
1 change: 1 addition & 0 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -1203,6 +1203,7 @@ BUILTIN_OBJS += builtin/am.o
BUILTIN_OBJS += builtin/annotate.o
BUILTIN_OBJS += builtin/apply.o
BUILTIN_OBJS += builtin/archive.o
BUILTIN_OBJS += builtin/backfill.o
BUILTIN_OBJS += builtin/bisect.o
BUILTIN_OBJS += builtin/blame.o
BUILTIN_OBJS += builtin/branch.o
Expand Down
1 change: 1 addition & 0 deletions builtin.h
Original file line number Diff line number Diff line change
Expand Up @@ -120,6 +120,7 @@ int cmd_am(int argc, const char **argv, const char *prefix, struct repository *r
int cmd_annotate(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_apply(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_archive(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_bisect(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_blame(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_branch(int argc, const char **argv, const char *prefix, struct repository *repo);
Expand Down
151 changes: 151 additions & 0 deletions builtin/backfill.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,151 @@
/* We need this macro to access core_apply_sparse_checkout */
#define USE_THE_REPOSITORY_VARIABLE

#include "builtin.h"
#include "git-compat-util.h"
#include "config.h"
#include "parse-options.h"
#include "repository.h"
#include "commit.h"
#include "dir.h"
#include "environment.h"
#include "hex.h"
#include "tree.h"
#include "tree-walk.h"
#include "object.h"
#include "object-store-ll.h"
#include "oid-array.h"
#include "oidset.h"
#include "promisor-remote.h"
#include "strmap.h"
#include "string-list.h"
#include "revision.h"
#include "trace2.h"
#include "progress.h"
#include "packfile.h"
#include "path-walk.h"

static const char * const builtin_backfill_usage[] = {
N_("git backfill [--batch-size=<n>] [--[no-]sparse]"),
NULL
};

struct backfill_context {
struct repository *repo;
struct oid_array current_batch;
size_t min_batch_size;
int sparse;
};

static void backfill_context_clear(struct backfill_context *ctx)
{
oid_array_clear(&ctx->current_batch);
}

static void download_batch(struct backfill_context *ctx)
{
promisor_remote_get_direct(ctx->repo,
ctx->current_batch.oid,
ctx->current_batch.nr);
oid_array_clear(&ctx->current_batch);

/*
* We likely have a new packfile. Add it to the packed list to
* avoid possible duplicate downloads of the same objects.
*/
reprepare_packed_git(ctx->repo);
}

static int fill_missing_blobs(const char *path UNUSED,
struct oid_array *list,
enum object_type type,
void *data)
{
struct backfill_context *ctx = data;

if (type != OBJ_BLOB)
return 0;

for (size_t i = 0; i < list->nr; i++) {
off_t size = 0;
struct object_info info = OBJECT_INFO_INIT;
info.disk_sizep = &size;
if (oid_object_info_extended(ctx->repo,
&list->oid[i],
&info,
OBJECT_INFO_FOR_PREFETCH) ||
!size)
oid_array_append(&ctx->current_batch, &list->oid[i]);
}

if (ctx->current_batch.nr >= ctx->min_batch_size)
download_batch(ctx);

return 0;
}

static int do_backfill(struct backfill_context *ctx)
{
struct rev_info revs;
struct path_walk_info info = PATH_WALK_INFO_INIT;
int ret;

if (ctx->sparse) {
CALLOC_ARRAY(info.pl, 1);
if (get_sparse_checkout_patterns(info.pl)) {
path_walk_info_clear(&info);
return error(_("problem loading sparse-checkout"));
}
}

repo_init_revisions(ctx->repo, &revs, "");
handle_revision_arg("HEAD", &revs, 0, 0);

info.blobs = 1;
info.tags = info.commits = info.trees = 0;

info.revs = &revs;
info.path_fn = fill_missing_blobs;
info.path_fn_data = ctx;

ret = walk_objects_by_path(&info);

/* Download the objects that did not fill a batch. */
if (!ret)
download_batch(ctx);

backfill_context_clear(ctx);
path_walk_info_clear(&info);
release_revisions(&revs);
return ret;
}

int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo)
{
struct backfill_context ctx = {
.repo = repo,
.current_batch = OID_ARRAY_INIT,
.min_batch_size = 50000,
.sparse = 0,
};
struct option options[] = {
OPT_INTEGER(0, "min-batch-size", &ctx.min_batch_size,
N_("Minimum number of objects to request at a time")),
OPT_BOOL(0, "sparse", &ctx.sparse,
N_("Restrict the missing objects to the current sparse-checkout")),
OPT_END(),
};

if (argc == 2 && !strcmp(argv[1], "-h"))
usage_with_options(builtin_backfill_usage, options);

argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
0);

repo_config(repo, git_default_config, NULL);

if (ctx.sparse < 0)
ctx.sparse = core_apply_sparse_checkout;

return do_backfill(&ctx);
}
1 change: 1 addition & 0 deletions command-list.txt
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,7 @@ git-annotate ancillaryinterrogators
git-apply plumbingmanipulators complete
git-archimport foreignscminterface
git-archive mainporcelain
git-backfill mainporcelain history
git-bisect mainporcelain info
git-blame ancillaryinterrogators complete
git-branch mainporcelain history
Expand Down
10 changes: 3 additions & 7 deletions dir.c
Original file line number Diff line number Diff line change
Expand Up @@ -1088,10 +1088,6 @@ static void invalidate_directory(struct untracked_cache *uc,
dir->dirs[i]->recurse = 0;
}

static int add_patterns_from_buffer(char *buf, size_t size,
const char *base, int baselen,
struct pattern_list *pl);

/* Flags for add_patterns() */
#define PATTERN_NOFOLLOW (1<<0)

Expand Down Expand Up @@ -1181,9 +1177,9 @@ static int add_patterns(const char *fname, const char *base, int baselen,
return 0;
}

static int add_patterns_from_buffer(char *buf, size_t size,
const char *base, int baselen,
struct pattern_list *pl)
int add_patterns_from_buffer(char *buf, size_t size,
const char *base, int baselen,
struct pattern_list *pl)
{
char *orig = buf;
int i, lineno = 1;
Expand Down
3 changes: 3 additions & 0 deletions dir.h
Original file line number Diff line number Diff line change
Expand Up @@ -467,6 +467,9 @@ void add_patterns_from_file(struct dir_struct *, const char *fname);
int add_patterns_from_blob_to_list(struct object_id *oid,
const char *base, int baselen,
struct pattern_list *pl);
int add_patterns_from_buffer(char *buf, size_t size,
const char *base, int baselen,
struct pattern_list *pl);
void parse_path_pattern(const char **string, int *patternlen, unsigned *flags, int *nowildcardlen);
void add_pattern(const char *string, const char *base,
int baselen, struct pattern_list *pl, int srcpos);
Expand Down
1 change: 1 addition & 0 deletions git.c
Original file line number Diff line number Diff line change
Expand Up @@ -509,6 +509,7 @@ static struct cmd_struct commands[] = {
{ "annotate", cmd_annotate, RUN_SETUP },
{ "apply", cmd_apply, RUN_SETUP_GENTLY },
{ "archive", cmd_archive, RUN_SETUP_GENTLY },
{ "backfill", cmd_backfill, RUN_SETUP },
{ "bisect", cmd_bisect, RUN_SETUP },
{ "blame", cmd_blame, RUN_SETUP },
{ "branch", cmd_branch, RUN_SETUP | DELAY_PAGER_CONFIG },
Expand Down
28 changes: 23 additions & 5 deletions path-walk.c
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
#include "object.h"
#include "oid-array.h"
#include "prio-queue.h"
#include "repository.h"
#include "revision.h"
#include "string-list.h"
#include "strmap.h"
Expand Down Expand Up @@ -173,6 +174,23 @@ static int add_tree_entries(struct path_walk_context *ctx,
if (type == OBJ_TREE)
strbuf_addch(&path, '/');

if (ctx->info->pl) {
int dtype;
enum pattern_match_result match;
match = path_matches_pattern_list(path.buf, path.len,
path.buf + base_len, &dtype,
ctx->info->pl,
ctx->repo->index);

if (ctx->info->pl->use_cone_patterns &&
match == NOT_MATCHED)
continue;
else if (!ctx->info->pl->use_cone_patterns &&
type == OBJ_BLOB &&
match != MATCHED)
continue;
}

if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) {
CALLOC_ARRAY(list, 1);
list->type = type;
Expand Down Expand Up @@ -583,10 +601,10 @@ void path_walk_info_init(struct path_walk_info *info)
memcpy(info, &empty, sizeof(empty));
}

void path_walk_info_clear(struct path_walk_info *info UNUSED)
void path_walk_info_clear(struct path_walk_info *info)
{
/*
* This destructor is empty for now, as info->revs
* is not owned by 'struct path_walk_info'.
*/
if (info->pl) {
clear_pattern_list(info->pl);
free(info->pl);
}
}
11 changes: 11 additions & 0 deletions path-walk.h
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@

struct rev_info;
struct oid_array;
struct pattern_list;

/**
* The type of a function pointer for the method that is called on a list of
Expand Down Expand Up @@ -48,6 +49,16 @@ struct path_walk_info {
* walk the children of such trees.
*/
int prune_all_uninteresting;

/**
* Specify a sparse-checkout definition to match our paths to. Do not
* walk outside of this sparse definition. If the patterns are in
* cone mode, then the search may prune directories that are outside
* of the cone. If not in cone mode, then all tree paths will be
* explored but the path_fn will only be called when the path matches
* the sparse-checkout patterns.
*/
struct pattern_list *pl;
};

#define PATH_WALK_INFO_INIT { \
Expand Down
Loading
Loading