23074 Commits

Author SHA1 Message Date
Taylor Blau
521263568b Merge branch 'ds/path-walk-1' into jch
Introduce a new API to visit objects in batches based on a common
path, or by type.

* ds/path-walk-1:
  path-walk: mark trees and blobs as UNINTERESTING
  path-walk: visit tags and cached objects
  path-walk: allow consumer to specify object types
  t6601: add helper for testing path-walk API
  test-lib-functions: add test_cmp_sorted
  path-walk: introduce an object walk by path
2024-11-01 15:40:55 -04:00
Taylor Blau
2f42bd7ce8 Merge branch 'as/show-index-uninitialized-hash' into jch
Regression fix for 'show-index' when run outside of a repository.

* as/show-index-uninitialized-hash:
  show-index: fix uninitialized hash function
2024-11-01 15:40:29 -04:00
Taylor Blau
ddedb0e9d1 Merge branch 'ps/leakfixes-part-9' into jch
More leakfixes.

* ps/leakfixes-part-9: (22 commits)
  list-objects-filter-options: work around reported leak on error
  builtin/merge: release outbut buffer after performing merge
  dir: fix leak when parsing "status.showUntrackedFiles"
  t/helper: fix leaking buffer in "dump-untracked-cache"
  t/helper: stop re-initialization of `the_repository`
  sparse-index: correctly free EWAH contents
  dir: release untracked cache data
  combine-diff: fix leaking lost lines
  builtin/tag: fix leaking key ID on failure to sign
  transport-helper: fix leaking import/export marks
  builtin/commit: fix leaking cleanup config
  trailer: fix leaking strbufs when formatting trailers
  trailer: fix leaking trailer values
  builtin/commit: fix leaking change data contents
  upload-pack: fix leaking URI protocols
  pretty: clear signature check
  diff-lib: fix leaking diffopts in `do_diff_cache()`
  revision: fix leaking bloom filters
  builtin/grep: fix leak with `--max-count=0`
  grep: fix leak in `grep_splice_or()`
  ...
2024-11-01 15:40:28 -04:00
Taylor Blau
f0176fb495 Merge branch 'jc/optional-path' into jch
Teach configuration values of type "pathname" a new ':(optional)'
suffix.

* jc/optional-path:
  parseopt: values of pathname type can be prefixed with :(optional)
  config: values of pathname type can be prefixed with :(optional)
  t7500: make each piece more independent
2024-11-01 15:40:26 -04:00
Taylor Blau
71cf160e07 Merge branch 'bc/drop-ancient-libcurl-and-perl' into jch
Drop support for older libcURL and Perl.

* bc/drop-ancient-libcurl-and-perl:
  gitweb: make use of s///r
  Require Perl 5.26.0
  INSTALL: document requirement for libcurl 7.61.0
  git-curl-compat: remove check for curl 7.56.0
  git-curl-compat: remove check for curl 7.53.0
  git-curl-compat: remove check for curl 7.52.0
  git-curl-compat: remove check for curl 7.44.0
  git-curl-compat: remove check for curl 7.43.0
  git-curl-compat: remove check for curl 7.39.0
  git-curl-compat: remove check for curl 7.34.0
  git-curl-compat: remove check for curl 7.25.0
  git-curl-compat: remove check for curl 7.21.5
2024-11-01 15:40:25 -04:00
Taylor Blau
04bc973222 Merge branch 'ew/cat-file-optim' into jch
"git cat-file --batch" has been optimized.

* ew/cat-file-optim:
  cat-file: use writev(2) if available
  cat-file: batch_write: use size_t for length
  cat-file: batch-command uses content_limit
  object_info: content_limit only applies to blobs
  packfile: packed_object_info avoids packed_to_object_type
  cat-file: use delta_base_cache entries directly
  packfile: inline cache_or_unpack_entry
  packfile: fix off-by-one in content_limit comparison
  packfile: allow content-limit for cat-file
  packfile: move sizep computation
2024-11-01 15:40:25 -04:00
Taylor Blau
96e57426b2 Merge branch 'jc/too-many-arguments' into jch
Error message clarification.

* jc/too-many-arguments:
  miscellaneous: avoid "too many arguments"
  notes: avoid "too many arguments"
  cat-file: avoid "too many arguments"
  refs: avoid "too many arguments"
2024-11-01 15:40:24 -04:00
Taylor Blau
7be00466ce Merge branch 'cc/promisor-remote-capability' into jch
The v2 protocol learned to allow the server to advertise possible
promisor remotes, and the client to respond with what promissor
remotes it uses, so that the server side can omit objects that the
client can lazily obtain from these other promissor remotes.

Comments?  I got an impression that this is premature without
finishing the discussion on a larger picture.
cf. <ZvpZv_fed_su4w2-@pks.im>

* cc/promisor-remote-capability:
  promisor-remote: check advertised name or URL
  Add 'promisor-remote' capability to protocol v2
  strbuf: refactor strbuf_trim_trailing_ch()
  version: refactor strbuf_sanitize()
2024-11-01 15:40:23 -04:00
Taylor Blau
5640c87545 Merge branch 'ps/reftable-detach' into jch
Isolates the reftable subsystem from the rest of Git's codebase by
using fewer pieces of Git's infrastructure.

* ps/reftable-detach:
  reftable/system: provide thin wrapper for lockfile subsystem
  reftable/stack: drop only use of `get_locked_file_path()`
  reftable/system: provide thin wrapper for tempfile subsystem
  reftable/stack: stop using `fsync_component()` directly
  reftable/system: stop depending on "hash.h"
  reftable: explicitly handle hash format IDs
  reftable/system: move "dir.h" to its only user
2024-11-01 15:40:22 -04:00
Taylor Blau
5215bb8aa3 Merge branch 'ps/mingw-rename' into jch
Teaches the MinGW compatibility layer to support POSIX semantics for
atomic renames when other process(es) have a file opened at the
destination path.

* ps/mingw-rename:
  compat/mingw: support POSIX semantics for atomic renames
  compat/mingw: allow deletion of most opened files
  compat/mingw: share file handles created via `CreateFileW()`
2024-11-01 15:40:21 -04:00
Taylor Blau
ce4ffbd3b9 Merge branch 'ej/cat-file-remote-object-info' into jch
"git cat-file --batch" and friends can optionally ask a remote
server about objects it does not have.

* ej/cat-file-remote-object-info:
  cat-file: add remote-object-info to batch-command
  cat-file: add declaration of variable i inside its for loop
  transport: add client support for object-info
  serve: advertise object-info feature
  fetch-pack: move fetch initialization
  fetch-pack: refactor packet writing
2024-11-01 15:40:20 -04:00
Taylor Blau
ff12b9087a Merge branch 'jk/left-right-bitmap' into jch
When called with '--left-right' and '--use-bitmap-index', 'rev-list'
will produce output without any left/right markers, which has been
corrected.

* jk/left-right-bitmap:
  rev-list: skip bitmap traversal for --left-right
2024-11-01 15:39:38 -04:00
Taylor Blau
1597170f39 Merge branch 'ps/upgrade-clar' into jch
Buildfix and upgrade of Clar to a newer version.

* ps/upgrade-clar:
  cmake: set up proper dependencies for generated clar headers
  cmake: fix compilation of clar-based unit tests
  Makefile: extract script to generate clar declarations
  Makefile: adjust sed command for generating "clar-decls.h"
  t/unit-tests: update clar to 206accb
2024-11-01 15:39:37 -04:00
Taylor Blau
1c5a712f26 Merge branch 'jk/dumb-http-finalize'
The dumb-http code regressed when the result of re-indexing a pack
yielded an *.idx file that differs in content from the *.idx file it
downloaded from the remote. This has been corrected by no longer
relying on the *.idx file we got from the remote.

* jk/dumb-http-finalize:
  packfile: use oidread() instead of hashcpy() to fill object_id
  packfile: use object_id in find_pack_entry_one()
  packfile: convert find_sha1_pack() to use object_id
  http-walker: use object_id instead of bare hash
  packfile: warn people away from parse_packed_git()
  packfile: drop sha1_pack_index_name()
  packfile: drop sha1_pack_name()
  packfile: drop has_pack_index()
  dumb-http: store downloaded pack idx as tempfile
  t5550: count fetches in "previously-fetched .idx" test
  midx: avoid duplicate packed_git entries
2024-11-01 12:53:32 -04:00
Taylor Blau
aebc4bd8ce Merge branch 'ak/more-typofixes'
More typofixes.

* ak/more-typofixes:
  t: fix typos
2024-11-01 12:53:29 -04:00
Taylor Blau
43ac23945c Merge branch 'rs/grep-lookahead'
Fix 'git grep' regression on macOS by disabling lookahead when
encountering invalid UTF-8 byte sequences.

* rs/grep-lookahead:
  grep: disable lookahead on error
2024-11-01 12:53:28 -04:00
Taylor Blau
81a5461518 Merge branch 'ak/t1016-cleanup'
Test cleanup.

* ak/t1016-cleanup:
  t1016: clean up style
2024-11-01 12:53:27 -04:00
Taylor Blau
59dc0ab83c Merge branch 'ua/atoi'
Replace various calls to atoi() with strtol_i() and strtoul_ui(), and
add improved error handling.

* ua/atoi:
  imap: replace atoi() with strtol_i() for UIDVALIDITY and UIDNEXT parsing
  merge: replace atoi() with strtol_i() for marker size validation
  daemon: replace atoi() with strtoul_ui() and strtol_i()
2024-11-01 12:53:26 -04:00
Taylor Blau
20ab7fa3b6 Merge branch 'sa/notes-edit'
Teach 'git notes add' and 'git notes append' a new '-e' flag,
instructing them to open the note in $GIT_EDITOR before saving.

* sa/notes-edit:
  notes: teach the -e option to edit messages in editor
2024-11-01 12:53:25 -04:00
Taylor Blau
8237b49ade Merge branch 'sk/t9101-cleanup'
Test cleanup.

* sk/t9101-cleanup:
  t9101: ensure no whitespace after redirect
2024-11-01 12:53:24 -04:00
Taylor Blau
07c6066f82 Merge branch 'kh/mv-breakage'
Demonstrate an assertion failure in 'git mv'.

* kh/mv-breakage:
  t7001: add failure test which triggers assertion
2024-11-01 12:53:21 -04:00
Taylor Blau
a524cc77ad Merge branch 'ua/t3404-cleanup'
Test update.

* ua/t3404-cleanup:
  t3404: replace test with test_line_count()
  t3404: avoid losing exit status with focus on `git show` and `git cat-file`
2024-11-01 12:53:18 -04:00
Taylor Blau
268fd2fe58 Merge branch 'ps/platform-compat-fixes'
Various platform compatibility fixes split out of the larger effort
to use Meson as the primary build tool.

* ps/platform-compat-fixes:
  t6006: fix prereq handling with `test_format ()`
  http: fix build error on FreeBSD
  builtin/credential-cache: fix missing parameter for stub function
  t7300: work around platform-specific behaviour with long paths on MinGW
  t5500, t5601: skip tests which exercise paths with '[::1]' on Cygwin
  t3404: work around platform-specific behaviour on macOS 10.15
  t1401: make invocation of tar(1) work with Win32-provided one
  t/lib-gpg: fix setup of GNUPGHOME in MinGW
  t/lib-gitweb: test against the build version of gitweb
  t/test-lib: wire up NO_ICONV prerequisite
  t/test-lib: fix quoting of TEST_RESULTS_SAN_FILE
2024-11-01 12:53:17 -04:00
Jeff King
16a186fede rev-list: skip bitmap traversal for --left-right
Running:

  git rev-list --left-right --use-bitmap-index one...two

will produce output without any left-right markers, since the bitmap
traversal returns only a single set of reachable commits. Instead we
should refuse to use bitmaps here and produce the correct output using a
traditional traversal.

This is probably not the only remaining option that misbehaves with
bitmaps, but it's particularly egregious in that it feels like it
_could_ work. Doing two separate traversals for the left/right sides and
then taking the symmetric set differences should yield the correct
answer, but our traversal code doesn't know how to do that.

It's not clear if naively doing two separate traversals would always be
a performance win. A traditional traversal only needs to walk down to
the merge base, but bitmaps always fill out the full reachability set.
So depending on your bitmap coverage, we could end up walking old bits
of history twice to fill out the same uninteresting bits on both sides.
We'd also of course end up with a very large --boundary set, if the user
asked for that.

So this might or might not be something worth implementing later. But
for now, let's make sure we don't produce the wrong answer if somebody
tries it.

The test covers this, but also the same thing with "--count" (which is
what I originally tried in a real-world case). Ironically the
try_bitmap_count() code already realizes that "--left-right" won't work
there. But that just causes us to fall back to the regular bitmap
traversal code, which itself doesn't handle counting (we produce a list
of objects rather than a count).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-11-01 11:02:27 -04:00
Derrick Stolee
c99f26cb1c path-walk: mark trees and blobs as UNINTERESTING
When the input rev_info has UNINTERESTING starting points, we want to be
sure that the UNINTERESTING flag is passed appropriately through the
objects. To match how this is done in places such as 'git pack-objects', we
use the mark_edges_uninteresting() method.

This method has an option for using the "sparse" walk, which is similar in
spirit to the path-walk API's walk. To be sure to keep it independent, add a
new 'prune_all_uninteresting' option to the path_walk_info struct.

To check how the UNINTERSTING flag is spread through our objects, extend the
'test-tool path-walk' command to output whether or not an object has that
flag. This changes our tests significantly, including the removal of some
objects that were previously visited due to the incomplete implementation.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-31 17:14:21 -04:00
Derrick Stolee
8fa5e62bac path-walk: visit tags and cached objects
The rev_info that is specified for a path-walk traversal may specify
visiting tag refs (both lightweight and annotated) and also may specify
indexed objects (blobs and trees). Update the path-walk API to walk
these objects as well.

When walking tags, we need to peel the annotated objects until reaching
a non-tag object. If we reach a commit, then we can add it to the
pending objects to make sure we visit in the commit walk portion. If we
reach a tree, then we will assume that it is a root tree. If we reach a
blob, then we have no good path name and so add it to a new list of
"tagged blobs".

When the rev_info includes the "--indexed-objects" flag, then the
pending set includes blobs and trees found in the cache entries and
cache-tree. The cache entries are usually blobs, though they could be
trees in the case of a sparse index. The cache-tree stores
previously-hashed tree objects but these are cleared out when staging
objects below those paths. We add tests that demonstrate this.

The indexed objects come with a non-NULL 'path' value in the pending
item. This allows us to prepopulate the 'path_to_lists' strmap with
lists for these paths.

The tricky thing about this walk is that we will want to combine the
indexed objects walk with the commit walk, especially in the future case
of walking objects during a command like 'git repack'.

Whenever possible, we want the objects from the index to be grouped with
similar objects in history. We don't want to miss any paths that appear
only in the index and not in the commit history.

Thus, we need to be careful to let the path stack be populated initially
with only the root tree path (and possibly tags and tagged blobs) and go
through the normal depth-first search. Afterwards, if there are other
paths that are remaining in the paths_to_lists strmap, we should then
iterate through the stack and visit those objects recursively.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-31 17:14:21 -04:00
Derrick Stolee
ca51c2d6d2 path-walk: allow consumer to specify object types
We add the ability to filter the object types in the path-walk API so
the callback function is called fewer times.

This adds the ability to ask for the commits in a list, as well. We
re-use the empty string for this set of objects because these are passed
directly to the callback function instead of being part of the
'path_stack'.

Future changes will add the ability to visit annotated tags.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-31 17:14:21 -04:00
Derrick Stolee
35e6cf99d9 t6601: add helper for testing path-walk API
Add some tests based on the current behavior, doing interesting checks
for different sets of branches, ranges, and the --boundary option. This
sets a baseline for the behavior and we can extend it as new options are
introduced.

It is important to mention that the behavior of the API will change soon as
we start to handle UNINTERESTING objects differently, but these tests will
demonstrate the change in behavior.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-31 17:14:21 -04:00
Derrick Stolee
44425c47ef test-lib-functions: add test_cmp_sorted
This test helper will be helpful to reduce repeated logic in
t6601-path-walk.sh, but may be helpful elsewhere, too.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-31 17:14:21 -04:00
Eric Ju
999bed85bd cat-file: add remote-object-info to batch-command
Since the `info` command in cat-file --batch-command prints object info
for a given object, it is natural to add another command in cat-file
--batch-command to print object info for a given object from a remote.
Add `remote-object-info` to cat-file --batch-command.

While `info` takes object ids one at a time, this creates overhead when
making requests to a server so `remote-object-info` instead can take
multiple object ids at once.

cat-file --batch-command is generally implemented in the following
manner:

 - Receive and parse input from user
 - Call respective function attached to command
 - Get object info, print object info

In --buffer mode, this changes to:

 - Receive and parse input from user
 - Store respective function attached to command in a queue
 - After flush, loop through commands in queue
    - Call respective function attached to command
    - Get object info, print object info

Notice how the getting and printing of object info is accomplished one
at a time. As described above, this creates a problem for making
requests to a server. Therefore, `remote-object-info` is implemented in
the following manner:

 - Receive and parse input from user
 If command is `remote-object-info`:
    - Get object info from remote
    - Loop through and print each object info
 Else:
    - Call respective function attached to command
    - Parse input, get object info, print object info

And finally for --buffer mode `remote-object-info`:
 - Receive and parse input from user
 - Store respective function attached to command in a queue
 - After flush, loop through commands in queue:
    If command is `remote-object-info`:
        - Get object info from remote
        - Loop through and print each object info
    Else:
        - Call respective function attached to command
        - Get object info, print object info

To summarize, `remote-object-info` gets object info from the remote and
then loop through the object info passed in, printing the info.

In order for remote-object-info to avoid remote communication overhead
in the non-buffer mode, the objects are passed in as such:

remote-object-info <remote> <oid> <oid> ... <oid>

rather than

remote-object-info <remote> <oid>
remote-object-info <remote> <oid>
...
remote-object-info <remote> <oid>

Helped-by: Jonathan Tan <jonathantanmy@google.com>
Helped-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Eric Ju  <eric.peijian@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-31 16:59:58 -04:00
Taylor Blau
6aa984af3b Merge branch 'sk/t7011-cleanup'
Test cleanup.

* sk/t7011-cleanup:
  t7011: ensure no whitespace after redirect
2024-10-30 13:08:07 -04:00
Taylor Blau
8d305fdaac Merge branch 'co/t6050-pipefix'
Avoid losing exit status by having Git command being tested on the
upstream side of a pipe.

* co/t6050-pipefix:
  t6050: avoid pipes with upstream Git commands
2024-10-30 13:08:06 -04:00
Taylor Blau
cec4461e3f Merge branch 'ks/t4205-fixup'
Testfix.

* ks/t4205-fixup:
  t4205: fix typo in 'NUL termination with --stat'
2024-10-30 13:08:05 -04:00
Taylor Blau
bc627658b0 Merge branch 'ps/reftable-strbuf'
Implements a new reftable-specific strbuf replacement to reduce
reftable's dependency on Git-specific data structures.

* ps/reftable-strbuf:
  reftable: handle trivial `reftable_buf` errors
  reftable/stack: adapt `stack_filename()` to handle allocation failures
  reftable/record: adapt `reftable_record_key()` to handle allocation failures
  reftable/stack: adapt `format_name()` to handle allocation failures
  t/unit-tests: check for `reftable_buf` allocation errors
  reftable/blocksource: adapt interface name
  reftable: convert from `strbuf` to `reftable_buf`
  reftable/basics: provide new `reftable_buf` interface
  reftable: stop using `strbuf_addf()`
  reftable: stop using `strbuf_addbuf()`
2024-10-30 13:08:01 -04:00
Patrick Steinhardt
dd6003f200 t6006: fix prereq handling with test_format ()
In df383b5842 (t/test-lib: wire up NO_ICONV prerequisite, 2024-10-16) we
have introduced a new NO_ICONV prerequisite that makes us skip tests in
case Git is not compiled with support for iconv. This change subtly
broke t6006: while the test suite still passes, some of its tests won't
execute because they run into an error.

    ./t6006-rev-list-format.sh: line 92: test_expect_%e: command not found

The broken tests use `test_format ()`, and the mentioned commit simply
prepended the new prerequisite to its arguments. But that does not work,
as the function is not aware of prereqs at all and will now treat all of
its arguments incorrectly.

Fix this by making the function aware of prereqs by accepting an
optional fourth argument. Adapt the callsites accordingly.

Reported-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-28 13:44:38 -04:00
Abhijeet Sonar
014299066b show-index: fix uninitialized hash function
As stated in the docs, show-index should use SHA1 as the default hash algorithm
when run outsize of a repository.  However, 'the_hash_algo' is currently left
uninitialized if we are not in a repository and no explicit hash function is
specified, causing a crash.  Fix it by falling back to SHA1 when it is found
uninitialized. Also add test that verifies this behaviour.

Signed-off-by: Abhijeet Sonar <abhijeet.nkt@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-27 20:09:04 -04:00
Patrick Steinhardt
ae0f757d2f compat/mingw: support POSIX semantics for atomic renames
By default, Windows restricts access to files when those files have been
opened by another process. As explained in the preceding commits, these
restrictions can be loosened such that reads, writes and/or deletes of
files with open handles _are_ allowed.

While we set up those sharing flags in most relevant code paths now, we
still don't properly handle POSIX-style atomic renames in case the
target path is open. This is failure demonstrated by t0610, where one of
our tests spawns concurrent writes in a reftable-enabled repository and
expects all of them to succeed. This test fails most of the time because
the process that has acquired the "tables.list" lock is unable to rename
it into place while other processes are busy reading that file.

Windows 10 has introduced the `FILE_RENAME_FLAG_POSIX_SEMANTICS` flag
that allows us to fix this usecase [1]. When set, it is possible to
rename a file over a preexisting file even when the target file still
has handles open. Those handles must have been opened with the
`FILE_SHARE_DELETE` flag, which we have ensured in the preceding
commits.

Careful readers might have noticed that [1] does not mention the above
flag, but instead mentions `FILE_RENAME_POSIX_SEMANTICS`. This flag is
not for use with `SetFileInformationByHandle()` though, which is what we
use. And while the `FILE_RENAME_FLAG_POSIX_SEMANTICS` flag exists, it is
not documented on [2] or anywhere else as far as I can tell.

Unfortunately, we still support Windows systems older than Windows 10
that do not yet have this new flag. Our `_WIN32_WINNT` SDK version still
targets 0x0600, which is Windows Vista and later. And even though that
Windows version is out-of-support, bumping the SDK version all the way
to 0x0A00, which is Windows 10 and later, is not an option as it would
make it impossible to compile on Windows 8.1, which is still supported.
Instead, we have to manually declare the relevant infrastructure to make
this feature available and have fallback logic in place in case we run
on a Windows version that does not yet have this flag.

On another note: `mingw_rename()` has a retry loop that is used in case
deleting a file failed because it's still open in another process. One
might be pressed to not use this loop anymore when we can use POSIX
semantics. But unfortunately, we have to keep it around due to our
dependence on the `FILE_SHARE_DELETE` flag. While we know to set that
sharing flag now, other applications may not do so and may thus still
cause sharing violations when we try to rename a file.

This fixes concurrent writes in the reftable backend as demonstrated in
t0610, but may also end up fixing other usecases where Git wants to
perform renames.

[1]: https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/ntifs/ns-ntifs-_file_rename_information
[2]: https://learn.microsoft.com/en-us/windows/win32/api/winbase/ns-winbase-file_rename_info

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-27 19:51:35 -04:00
Jeff King
479ab76c9f packfile: use object_id in find_pack_entry_one()
The main function we use to search a pack index for an object is
find_pack_entry_one(). That function still takes a bare pointer to the
hash, despite the fact that its underlying bsearch_pack() function needs
an object_id struct. And so we end up making an extra copy of the hash
into the struct just to do a lookup.

As it turns out, all callers but one already have such an object_id. So
we can just take a pointer to that struct and use it directly. This
avoids the extra copy and provides a more type-safe interface.

The one exception is get_delta_base() in packfile.c, when we are chasing
a REF_DELTA from inside the pack (and thus we have a pointer directly to
the mmap'd pack memory, not a struct). We can just bump the hashcpy()
from inside find_pack_entry_one() to this one caller that needs it.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-25 17:35:46 -04:00
Jeff King
63aca3f7f1 dumb-http: store downloaded pack idx as tempfile
This patch fixes a regression in b1b8dfde69 (finalize_object_file():
implement collision check, 2024-09-26) where fetching a v1 pack idx file
over the dumb-http protocol would cause the fetch to fail.

The core of the issue is that dumb-http stores the idx we fetch from the
remote at the same path that will eventually hold the idx we generate
from "index-pack --stdin". The sequence is something like this:

  0. We realize we need some object X, which we don't have locally, and
     nor does the other side have it as a loose object.

  1. We download the list of remote packs from objects/info/packs.

  2. For each entry in that file, we download each pack index and store
     it locally in .git/objects/pack/pack-$hash.idx (the $hash is not
     something we can verify yet and is given to us by the remote).

  3. We check each pack index we got to see if it has object X. When we
     find a match, we download the matching .pack file from the remote
     to a tempfile. We feed that to "index-pack --stdin", which
     reindexes the pack, rather than trusting that it has what the other
     side claims it does. In most cases, this will end up generating the
     exact same (byte-for-byte) pack index which we'll store at the same
     pack-$hash.idx path, because the index generation and $hash id are
     computed based on what's in the packfile. But:

       a. The other side might have used other options to generate the
          index. For instance we use index v2 by default, but long ago
          it was v1 (and you can still ask for v1 explicitly).

       b. The other side might even use a different mechanism to
          determine $hash. E.g., long ago it was based on the sorted
          list of objects in the packfile, but we switched to using the
          pack checksum in 1190a1acf8 (pack-objects: name pack files
          after trailer hash, 2013-12-05).

The regression we saw in the real world was (3a). A recent client
fetching from a server with a v1 index downloaded that index, then
complained about trying to overwrite it with its own v2 index. This
collision is otherwise harmless; we know we want to replace the remote
version with our local one, but the collision check doesn't realize
that.

There are a few options to fix it:

  - we could teach index-pack a command-line option to ignore only pack
    idx collisions, and use it when the dumb-http code invokes
    index-pack. This would be an awkward thing to expose users to and
    would involve a lot of boilerplate to get the option down to the
    collision code.

  - we could delete the remote .idx file right before running
    index-pack. It should be redundant at that point (since we've just
    downloaded the matching pack). But it feels risky to delete
    something from our own .git/objects based on what the other side has
    said. I'm not entirely positive that a malicious server couldn't lie
    about which pack-$hash.idx it has and get us to delete something
    precious.

  - we can stop co-mingling the downloaded idx files in our local
    objects directory. This is a slightly bigger change but I think
    fixes the root of the problem more directly.

This patch implements the third option. The big design questions are:
where do we store the downloaded files, and how do we manage their
lifetimes?

There are some additional quirks to the dumb-http system we should
consider. Remember that in step 2 we downloaded every pack index, but in
step 3 we may only download some of the matching packs. What happens to
those other idx files now? They sit in the .git/objects/pack directory,
possibly waiting to be used at a later date. That may save bandwidth for
a subsequent fetch, but it also creates a lot of weird corner cases:

  - our local object directory now has semi-untrusted .idx files sitting
    around, without their matching .pack

  - in case 3b, we noted that we might not generate the same hash as the
    other side. In that case even if we download the matching pack,
    our index-pack invocation will store it in a different
    pack-$hash.idx file. And the unmatched .idx will sit there forever.

  - if the server repacks, it may delete the old packs. Now we have
    these orphaned .idx files sitting around locally that will never be
    used (nor deleted).

  - if we repack locally we may delete our local version of the server's
    pack index and not realize we have it. So we'll download it again,
    even though we have all of the objects it mentions.

I think the right solution here is probably some more complex cache
management system: download the remote .idx files to their own storage
directory, mark them as "seen" when we get their matching pack (to avoid
re-downloading even if we repack), and then delete them when the
server's objects/info/refs no longer mentions them.

But since the dumb http protocol is so ancient and so inferior to the
smart http protocol, I don't think it's worth spending a lot of time
creating such a system. For this patch I'm just downloading the idx
files to .git/objects/tmp_pack_*, and marking them as tempfiles to be
deleted when we exit (and due to the name, any we miss due to a crash,
etc, should eventually be removed by "git gc" runs based on timestamps).

That is slightly worse for one case: if we download an idx but not the
matching pack, we won't retain that idx for subsequent runs. But the
flip side is that we're making other cases better (we never hold on to
useless idx files forever). I suspect that worse case does not even come
up often, since it implies that the packs are generated to match
distinct parts of history (i.e., in practice even in a repo with many
packs you're going to end up grabbing all of those packs to do a clone).
If somebody really cares about that, I think the right path forward is a
managed cache directory as above, and this patch is providing the first
step in that direction anyway (by moving things out of the objects/pack/
directory).

There are two test changes. One demonstrates the broken v1 index case
(it double-checks the resulting clone with fsck to be careful, but prior
to this patch it actually fails at the clone step). The other tweaks the
expectation for a test that covers the "slightly worse" case to
accommodate the extra index download.

The code changes are fairly simple. We stop using finalize_object_file()
to copy the remote's index file into place, and leave it as a tempfile.
We give the tempfile a real ".idx" name, since the packfile code expects
that, and thus we make sure it is out of the usual packs/ directory (so
we'd never mistake it for a real local .idx).

We also have to change parse_pack_index(), which creates a temporary
packed_git to access our index (we need this because all of the pack idx
code assumes we have that struct). It reads the index data from the
tempfile, but prior to this patch would speculatively write the
finalized name into the packed_git struct using the pack-$hash we expect
to use.

I was mildly surprised that this worked at all, since we call
verify_pack_index() on the packed_git which mentions the final name
before moving the file into place! But it works because
parse_pack_index() leaves the mmap-ed data in the struct, so the
lazy-open in verify_pack_index() never triggers, and we read from the
tempfile, ignoring the filename in the struct completely. Hacky, but it
works.

After this patch, parse_pack_index() now uses the index filename we pass
in to derive a matching .pack name. This is OK to change because there
are only two callers, both in the dumb http code (and the other passes
in an existing pack-$hash.idx name, so the derived name is going to be
pack-$hash.pack, which is what we were using anyway).

I'll follow up with some more cleanups in that area, but this patch is
sufficient to fix the regression.

Reported-by: fox <fox.gbr@townlong-yak.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-25 17:35:46 -04:00
Jeff King
019b21d402 t5550: count fetches in "previously-fetched .idx" test
We have a test in t5550 that looks at index fetching over dumb http. It
creates two branches, each of which is completely stored in its own
pack, then fetches the branches independently. What should (and does)
happen is that the first fetch grabs both .idx files and one .pack file,
and then the fetch of the second branch re-uses the previously
downloaded .idx files (fetching none) and grabs the now-required .pack
file.

Since the next few patches will be touching this area of the code, let's
beef up the test a little by checking that we're downloading the
expected items at each step.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-25 17:35:46 -04:00
Jeff King
8b5763e8fa midx: avoid duplicate packed_git entries
When we scan a pack directory to load the idx entries we find into the
packed_git list, we skip any of them that are contained in a midx. We
then load them later lazily if we actually need to access the
corresponding pack, referencing them both from the midx struct and the
packed_git list.

The lazy-load in the midx code checks to see if the midx already
mentions the pack, but doesn't otherwise check the packed_git list. This
makes sense, since we should have added any pack to both lists.

But there's a loophole! If we call close_object_store(), that frees the
midx entirely, but _not_ the packed_git structs, which we must keep
around for Reasons[1]. If we then try to look up more objects, we'll
auto-load the midx again, which won't realize that we've already loaded
those packs, and will create duplicate entries in the packed_git list.

This is possibly inefficient, because it means we may open and map the
pack redundantly. But it can also lead to weird user-visible behavior.
The case I found is in "git repack", which closes and reopens the midx
after repacking and then calls update_server_info(). We end up writing
the duplicate entries into objects/info/packs.

We could obviously de-dup them while writing that file, but it seems
like a violation of more core assumptions that we end up with these
duplicate structs at all.

We can avoid the duplicates reasonably efficiently by checking their
names in the pack_map hash. This annoyingly does require a little more
than a straight hash lookup due to the naming conventions, but it should
only happen when we are about to actually open a pack. I don't think one
extra malloc will be noticeable there.

[1] I'm not entirely sure of all the details, except that we generally
    assume the packed_git structs never go away. We noted this
    restriction in the comment added by 6f1e9394e2 (object: fix leaking
    packfiles when closing object store, 2024-08-08), but it's somewhat
    vague. At any rate, if you try freeing the structs in
    close_object_store(), you can observe segfaults all over the test
    suite. So it might be fixable, but it's not trivial.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-25 17:35:46 -04:00
Taylor Blau
55d12c24d7 Merge branch 'wm/shortlog-hash'
Teaches 'shortlog' to explicitly use SHA-1 when operating outside of
a repository.

* wm/shortlog-hash:
  builtin/shortlog: explicitly set hash algo when there is no repo
2024-10-25 14:02:49 -04:00
Taylor Blau
448022a7fb Merge branch 'bf/t-readme-mention-reftable'
Doc update.

* bf/t-readme-mention-reftable:
  t/README: add missing value for GIT_TEST_DEFAULT_REF_FORMAT
2024-10-25 14:02:21 -04:00
Taylor Blau
f25bb60393 Merge branch 'ak/typofix'
More typofixes.

* ak/typofix:
  t: fix typos
2024-10-25 14:02:08 -04:00
Taylor Blau
4d334e5205 Merge branch 'ak/typofixes'
Typofixes.

* ak/typofixes:
  t: fix typos
  t/helper: fix a typo
  t/perf: fix typos
  t/unit-tests: fix typos
  contrib: fix typos
  compat: fix typos
2024-10-25 14:02:04 -04:00
Taylor Blau
55bc7d54ab Merge branch 'ps/ci-gitlab-windows'
Enable Windows-based CI in GitLab.

* ps/ci-gitlab-windows:
  gitlab-ci: exercise Git on Windows
  gitlab-ci: introduce stages and dependencies
  ci: handle Windows-based CI jobs in GitLab CI
  ci: create script to set up Git for Windows SDK
  t7300: work around platform-specific behaviour with long paths on MinGW
2024-10-25 14:01:21 -04:00
Taylor Blau
6cbcc68ea7 Merge branch 'db/submodule-fetch-with-remote-name-fix'
A "git fetch" from the superproject going down to a submodule used
a wrong remote when the default remote names are set differently
between them.

* db/submodule-fetch-with-remote-name-fix:
  submodule: correct remote name with fetch
2024-10-25 14:01:09 -04:00
Usman Akinyemi
e36f009e69 merge: replace atoi() with strtol_i() for marker size validation
Replace atoi() with strtol_i() for parsing conflict-marker-size to
improve error handling. Invalid values, such as those containing letters
now trigger a clear error message.
Update the test to verify invalid input handling.

Signed-off-by: Usman Akinyemi <usmanakinyemi202@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-24 14:03:44 -04:00
Usman Akinyemi
cc4023477f daemon: replace atoi() with strtoul_ui() and strtol_i()
Replace atoi() with strtoul_ui() for --timeout and --init-timeout
(non-negative integers) and with strtol_i() for --max-connections
(signed integers). This improves error handling and input validation
by detecting invalid values and providing clear error messages.
Update tests to ensure these arguments are properly validated.

Signed-off-by: Usman Akinyemi <usmanakinyemi202@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-24 14:03:43 -04:00
Andrew Kreimer
f56f9d6c0b t: fix typos
Fix typos and grammar in documentation, comments, etc.

Via codespell.

Reported-by: Kristoffer Haugsbakk <kristofferhaugsbakk@fastmail.com>
Signed-off-by: Andrew Kreimer <algonell@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-24 12:45:53 -04:00