From 25f8b922aba725fdad08fa93051b8540aa7338b0 Mon Sep 17 00:00:00 2001
From: Rick Winfrey
Date: Tue, 21 Jun 2016 12:26:48 -0400
Subject: [PATCH 1/2] Add mini summit problem notes

---
 weekly/2016-06-21.md | 70 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)
 create mode 100644 weekly/2016-06-21.md

diff --git a/weekly/2016-06-21.md b/weekly/2016-06-21.md
new file mode 100644
index 000000000..6b2ed2c08
--- /dev/null
+++ b/weekly/2016-06-21.md
@@ -0,0 +1,70 @@
+# Semantic Diff Problems (Mini-Summit)
+
+### Performance (most significant problem)
+
+ - SES / Alignment are biggest time / space consumers
+ - Profiling small subsets of code paths rather than the full context.
+ - Adding more criterion benchmarks for code paths not currently profiled (like Diff Summaries)
+
+#### Alignment Performance
+
+ - Has to visit each child of each remaining line
+
+#### [SES](https://github.com/github/semantic-diff/files/22485/An.O.ND.Difference.Algorithm.and.its.Variations.pdf) Performance
+
+ - n^3 the size of the tree
+ - Can try bounded SES (looks ahead by a fixed size of nodes)
+ - Identify more comparisons we can skip (i.e. don't compare functions with array literals)
+ - Does not look like there are more easy wins here (algorithm is already implemented to prevent unnecessary comparisions).
+ - In some cases, the diffing is expensive because we don't have more
+ fine-grain identifiers for certain diffs. (e.g. a test file with 100 statement expressions)
+ - Diffing against identifiers (use the edit distance to determine whether to compare terms with SES or not)
+ - This could result in us missing a function rename though
+ - Not a catchall, but it can help increase performance in a larger number of cases
+
+#### [RWS](https://github.com/github/semantic-diff/files/325837/RWS-Diff.Flexible.and.Efficient.Change.Detection.in.Hierarchical.Data.pdf) Performance
+
+ - Random Walk Similarity
+ - computes approximation to the minimal edit script
+ - O(log N) rather than O(n^3)
+ - RWS does not rely on identifiers
+ - RWS solves our performance problem in the general form
+ - Can allow us to diff patches of patches (something we cannot do currently with our implementation of SES)
+
+#### Diff Summaries Performance
+
+ - Performance of DS is dependent on diffing (Diff Terms, Interpreter, cost functions)
+
+### Failing too hard when we fail (request is not completing if Semantic Diff fails)
+
+ - How can we fail better on dotcom?
+ - How can we fail better when parsing? (both in Semantic Diff and dotcom)
+
+### Responsiveness
+
+ - Async fetch diff summaries / diffs / progressive diffs or diff summaries
+
+### Improving grammars (getting Ruby parser fixed, testing C parser)
+
+### Measuring effectiveness of grammars
+
+### Tooling
+
+ - Why isn't parallelization of SES having the expected effect?
+ - Should focus on low hanging fruit but we're not going to write a debugger.
+
+### Time limitations with respect to solutions and team
+
+### Ramp up time is extremely variable.
+
+### Onboarding
+
+ - SES algorithm requires some context and background to understand the code at a macro.
+ - Plan a bit before pairing to gain context
+
+### Pre-launch Ideas
+
+ - Test on a couple file server nodes and run semantic diff on javascript repos.
+ - Collect repos, files, shas that contain error nodes to gain a % of error rates and expose errors in tree sitter grammars.
+ - If sources have errors, can we use a parser that validates the source is correct?
+ - Configure a script that is as language independent as possible that can automate the error collection process but allows us to specify an independent validating parser for each language.

From 5c25ae4719e479570cd62471cd24d556a0699de2 Mon Sep 17 00:00:00 2001
From: Rick Winfrey
Date: Tue, 21 Jun 2016 12:34:19 -0400
Subject: [PATCH 2/2] Small changes / typos

---
 weekly/2016-06-21.md | 56 ++++++++++++++++++++++++--------------------
 1 file changed, 30 insertions(+), 26 deletions(-)

diff --git a/weekly/2016-06-21.md b/weekly/2016-06-21.md
index 6b2ed2c08..14242a5fa 100644
--- a/weekly/2016-06-21.md
+++ b/weekly/2016-06-21.md
@@ -2,41 +2,41 @@

 ### Performance (most significant problem)

- - SES / Alignment are biggest time / space consumers
+ - SES / Alignment are biggest time / space consumers.
  - Profiling small subsets of code paths rather than the full context.
- - Adding more criterion benchmarks for code paths not currently profiled (like Diff Summaries)
+ - Adding more criterion benchmarks for code paths not currently profiled (like Diff Summaries).

-#### Alignment Performance
+##### Alignment performance

- - Has to visit each child of each remaining line
+ - Has to visit each child of each remaining line.

-#### [SES](https://github.com/github/semantic-diff/files/22485/An.O.ND.Difference.Algorithm.and.its.Variations.pdf) Performance
+##### [SES](https://github.com/github/semantic-diff/files/22485/An.O.ND.Difference.Algorithm.and.its.Variations.pdf) Performance

- - n^3 the size of the tree
- - Can try bounded SES (looks ahead by a fixed size of nodes)
- - Identify more comparisons we can skip (i.e. don't compare functions with array literals)
+ - n^3 the size of the tree.
+ - Can try bounded SES (looks ahead by a fixed size of nodes).
+ - Identify more comparisons we can skip (i.e. don't compare functions with array literals).
  - Does not look like there are more easy wins here (algorithm is already implemented to prevent unnecessary comparisions).
- - In some cases, the diffing is expensive because we don't have more
- fine-grain identifiers for certain diffs. (e.g. a test file with 100 statement expressions)
- - Diffing against identifiers (use the edit distance to determine whether to compare terms with SES or not)
- - This could result in us missing a function rename though
- - Not a catchall, but it can help increase performance in a larger number of cases
+ - In some cases, the diffing is expensive because we don't have more fine-grain identifiers for certain diffs. (e.g. a test file with 100 statement expressions).
+ - Diffing against identifiers (use the edit distance to determine whether to compare terms with SES or not).
+ - This could result in us missing a function rename though.
+ - Not a catchall, but it can help increase performance in a larger number of cases.

-#### [RWS](https://github.com/github/semantic-diff/files/325837/RWS-Diff.Flexible.and.Efficient.Change.Detection.in.Hierarchical.Data.pdf) Performance
+##### [RWS](https://github.com/github/semantic-diff/files/325837/RWS-Diff.Flexible.and.Efficient.Change.Detection.in.Hierarchical.Data.pdf) Performance

- - Random Walk Similarity
- - computes approximation to the minimal edit script
- - O(log N) rather than O(n^3)
- - RWS does not rely on identifiers
- - RWS solves our performance problem in the general form
- - Can allow us to diff patches of patches (something we cannot do currently with our implementation of SES)
+ - Random Walk Similarity.
+ - computes approximation to the minimal edit script.
+ - O(log N) rather than O(n^3).
+ - RWS does not rely on identifiers.
+ - RWS solves our performance problem in the general form.
+ - Can allow us to diff patches of patches (something we cannot do currently with our implementation of SES).

-#### Diff Summaries Performance
+##### Diff summaries performance

  - Performance of DS is dependent on diffing (Diff Terms, Interpreter, cost functions)

-### Failing too hard when we fail (request is not completing if Semantic Diff fails)
+### Failing too hard

+ - Request is not completing if Semantic Diff fails.
  - How can we fail better on dotcom?
  - How can we fail better when parsing? (both in Semantic Diff and dotcom)

@@ -44,9 +44,12 @@

  - Async fetch diff summaries / diffs / progressive diffs or diff summaries

-### Improving grammars (getting Ruby parser fixed, testing C parser)
+### Improving grammars

-### Measuring effectiveness of grammars
+ - Fix Ruby parser.
+ - Testing and verifying other grammars.
+
+### Measure effectiveness of grammars

 ### Tooling

@@ -59,8 +62,9 @@

 ### Onboarding

- - SES algorithm requires some context and background to understand the code at a macro.
- - Plan a bit before pairing to gain context
+ - Pairing has been fantastic.
+ - SES algorithm requires some context and background to understand the code at the general / macro level.
+ - Plan a bit before pairing to gain context.

 ### Pre-launch Ideas