Semantic Diff Problems (Mini-Summit)

SES (shortest edit script) and alignment are the biggest consumers of time and space.
Profiling small subsets of code paths rather than the full context.
Adding more criterion benchmarks for code paths not currently profiled (like Diff Summaries).

SES is O(n^3) in the size of the tree.
Can try bounded SES (looks ahead by a fixed number of nodes).
Identify more comparisons we can skip (e.g. don't compare functions with array literals).
There do not appear to be more easy wins here (the algorithm already avoids unnecessary comparisons).
In some cases, diffing is expensive because we don't have finer-grained identifiers for certain diffs (e.g. a test file with 100 statement expressions).
Diffing against identifiers: use the edit distance between identifiers to decide whether to compare two terms with SES.
This could cause us to miss a function rename, though.
Not a catch-all, but it can improve performance in a large number of cases.
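
A minimal sketch of the identifier check described above, assuming each term exposes a node kind and an identifier string. The `Term` type, `should_compare`, and the `max_ratio` threshold are hypothetical names chosen for illustration, not part of the existing implementation.

```python
from typing import NamedTuple


class Term(NamedTuple):
    """Hypothetical stand-in for a syntax term: a node kind plus its identifier."""
    kind: str
    identifier: str


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def should_compare(left: Term, right: Term, max_ratio: float = 0.5) -> bool:
    """Return True when the pair looks similar enough to hand to the SES comparison.

    max_ratio is an assumed tuning knob: the largest edit distance, relative to
    the longer identifier, at which two terms are still compared.
    """
    if left.kind != right.kind:
        return False
    longest = max(len(left.identifier), len(right.identifier))
    if longest == 0:
        return True  # nothing to tell the terms apart, so let SES decide
    return edit_distance(left.identifier, right.identifier) / longest <= max_ratio
```

With this check, `parseUser` and `parseUsers` would still be compared, while a function renamed to something unrelated would be skipped, which is the missed-rename risk noted above.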

Random Walk Similarity.
Computes an approximation of the minimal edit script.
O(n log n) rather than O(n^3).
RWS does not rely on identifiers.
RWS solves our performance problem in the general case.
Can allow us to diff patches of patches (something we cannot do currently with our implementation of SES).
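
The idea of matching subtrees by similarity rather than by identifier can be illustrated roughly as follows. This is a simplified sketch, not the RWS algorithm itself: every subtree gets a fixed-size feature vector built by hashing each node (with its child labels) to a pseudo-random unit vector and summing, and a subtree in one tree is paired with the closest vector in the other. `Node`, `gram_vector`, `feature_vectors`, and `nearest` are hypothetical names, and the brute-force search stands in for the spatial index a real implementation would need to look up neighbours efficiently.

```python
from __future__ import annotations

import hashlib
import math
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical labelled tree node."""
    label: str
    children: list[Node] = field(default_factory=list)


def gram_vector(gram: str, dims: int) -> list[float]:
    """Map a gram to a deterministic pseudo-random unit vector."""
    seed = int.from_bytes(hashlib.sha256(gram.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(dims)]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def feature_vectors(root: Node, dims: int = 16) -> list[tuple[Node, list[float]]]:
    """Return a (subtree, feature vector) pair for every subtree, in post-order."""
    out: list[tuple[Node, list[float]]] = []

    def walk(node: Node) -> list[float]:
        # One gram per node: its label together with the labels of its children.
        gram = f"{node.label}({','.join(c.label for c in node.children)})"
        total = gram_vector(gram, dims)
        for child in node.children:
            total = [a + b for a, b in zip(total, walk(child))]
        out.append((node, total))
        return total

    walk(root)
    return out


def nearest(target: list[float],
            candidates: list[tuple[Node, list[float]]]) -> tuple[Node, list[float]]:
    """Brute-force nearest neighbour by Euclidean distance."""
    return min(candidates, key=lambda pair: math.dist(target, pair[1]))
```

Because the vectors depend only on tree shape and labels, an unchanged subtree that moved elsewhere in the file still finds its counterpart at distance zero, with no identifier required.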

Performance of Diff Summaries (DS) depends on diffing (Diff Terms, Interpreter, cost functions).

Pairing has been fantastic.
The SES algorithm requires some context and background before the code makes sense at the general / macro level.
Plan a bit before pairing to gain context.

Test on a couple of file server nodes and run semantic diff on JavaScript repos.
Collect the repos, files, and SHAs that contain error nodes to measure the error rate (%) and expose errors in the tree-sitter grammars.
If sources have errors, can we use an independent parser to validate that the source itself is correct?
Write a script, as language-independent as possible, that automates error collection but lets us specify an independent validating parser for each language.
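
One possible shape for such a script, sketched under stated assumptions: each sample already records whether the tree-sitter parse produced an error node, and each language can register its own validating parser as a plain function. `Sample`, `classify`, `error_rate`, and the outcome labels are hypothetical; Python's built-in `ast.parse` stands in for an independent validator, and a JavaScript validator would be registered the same way.

```python
import ast
from dataclasses import dataclass
from typing import Callable, Dict, List

# A validator takes source text and reports whether it is syntactically valid.
Validator = Callable[[str], bool]


@dataclass
class Sample:
    """Hypothetical record for one parsed file."""
    repo: str
    path: str
    sha: str
    language: str
    source: str
    has_error_node: bool  # assumed to come from the tree-sitter parse


def python_validator(source: str) -> bool:
    """Example independent validator, using the standard library parser."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def classify(sample: Sample, validators: Dict[str, Validator]) -> str:
    """Attribute an error node to the grammar or to the source itself."""
    if not sample.has_error_node:
        return "ok"
    validator = validators.get(sample.language)
    if validator is None:
        return "unverified"  # no independent parser registered for this language
    # Valid source that still produced an error node points at the grammar.
    return "grammar_error" if validator(sample.source) else "source_error"


def error_rate(samples: List[Sample]) -> float:
    """Percentage of samples whose parse contains an error node."""
    if not samples:
        return 0.0
    return 100.0 * sum(s.has_error_node for s in samples) / len(samples)
```

Keeping validators in a dictionary keyed by language means the collection loop stays language-independent, and adding a language only requires registering one more function.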