Merge branch 'master' into random-walk-similarity

2024-12-24 23:42:31 +03:00 · 2016-06-22 15:19:32 -04:00 · 2016-06-22 15:19:32 -04:00 · fa64af054d
commit fa64af054d
parent f1d190326c 67ab1f11a2
1 changed files with 74 additions and 0 deletions
--- a/weekly/2016-06-21.md
+++ b/weekly/2016-06-21.md
@@ -0,0 +1,74 @@
+# Semantic Diff Problems (Mini-Summit)
+
+### Performance (most significant problem)
+
+  - SES and alignment are the biggest time and space consumers.
+  - Profiling small subsets of code paths rather than the full context.
+  - Adding more criterion benchmarks for code paths not currently profiled (like Diff Summaries).
+
+##### Alignment performance
+
+  - Has to visit each child of each remaining line.
+
+##### [SES](https://github.com/github/semantic-diff/files/22485/An.O.ND.Difference.Algorithm.and.its.Variations.pdf) Performance
+
+  - O(n^3) in the size of the tree.
+  - Can try bounded SES (looks ahead by a fixed size of nodes).
+  - Identify more comparisons we can skip (e.g. don't compare functions with array literals).
+  - Does not look like there are more easy wins here (algorithm is already implemented to prevent unnecessary comparisons).
+  - In some cases, the diffing is expensive because we don't have more fine-grained identifiers for certain diffs (e.g. a test file with 100 statement expressions).
+  - Diffing against identifiers (use the edit distance to determine whether to compare terms with SES or not).
+  - This could result in us missing a function rename though.
+  - Not a catchall, but it can help increase performance in a larger number of cases.
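The identifier check described above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `should_compare` helper, the normalised ratio, and the `max_ratio` threshold of 0.5 are all assumptions made for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed with a rolling row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def should_compare(left_id: str, right_id: str, max_ratio: float = 0.5) -> bool:
    """Hypothetical gate: only run the expensive term comparison when the
    identifiers are close. max_ratio is an assumed, tunable threshold."""
    longest = max(len(left_id), len(right_id))
    if longest == 0:
        return True  # nothing to tell the terms apart, so fall back to comparing
    return edit_distance(left_id, right_id) / longest <= max_ratio
```

With this gate, `should_compare("parseHeader", "parseHeaders")` lets the pair through, while two terms with unrelated names are skipped. That skip is exactly how a function rename could be missed, as noted in the list.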
+
+##### [RWS](https://github.com/github/semantic-diff/files/325837/RWS-Diff.Flexible.and.Efficient.Change.Detection.in.Hierarchical.Data.pdf) Performance
+
+  - Random Walk Similarity.
+  - Computes an approximation to the minimal edit script.
+  - O(n log n) rather than O(n^3).
+  - RWS does not rely on identifiers.
+  - RWS solves our performance problem in the general form.
+  - Can allow us to diff patches of patches (something we cannot do currently with our implementation of SES).
+
+##### Diff summaries performance
+
+  - DS performance depends on diffing (Diff Terms, Interpreter, cost functions).
+
+### Failing too hard
+
+  - Request is not completing if Semantic Diff fails.
+  - How can we fail better on dotcom?
+  - How can we fail better when parsing? (both in Semantic Diff and dotcom)
+
+### Responsiveness
+
+  - Fetch diffs and diff summaries asynchronously, or render them progressively.
+
+### Improving grammars
+
+  - Fix Ruby parser.
+  - Testing and verifying other grammars.
+
+### Measure effectiveness of grammars
+
+### Tooling
+
+  - Why isn't parallelization of SES having the expected effect?
+  - Should focus on low-hanging fruit, but we're not going to write a debugger.
+
+### Time limitations with respect to solutions and team
+
+### Ramp-up time is extremely variable
+
+### Onboarding
+
+  - Pairing has been fantastic.
+  - SES algorithm requires some context and background to understand the code at the general / macro level.
+  - Plan a bit before pairing to gain context.
+
+### Pre-launch Ideas
+
+  - Test on a couple of file server nodes and run semantic diff on JavaScript repos.
+  - Collect repos, files, and SHAs that contain error nodes to estimate the error rate and expose errors in tree-sitter grammars.
+  - If sources have errors, can we use a parser that validates the source is correct?
+  - Write a script, as language-independent as possible, that automates the error collection process but lets us specify an independent validating parser for each language.
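One possible shape for the error collection script from the list above is sketched here. Everything in it is illustrative: the `Sample` record, the `error_report` and `error_rate` helpers, and the bucket names are assumptions, and the samples are presumed to have been parsed already, with `has_error_node` recording whether the parse tree contained an error node. Python's standard `ast.parse` stands in as the independent validating parser for one language; other languages would plug in their own validator.

```python
import ast
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class Sample:
    """One parsed file (hypothetical record layout)."""
    repo: str
    path: str
    sha: str
    language: str
    source: str
    has_error_node: bool  # did the grammar's parse tree contain an error node?


Validator = Callable[[str], bool]  # returns True when the source is valid


def python_is_valid(source: str) -> bool:
    """Independent validator for Python sources, using the stdlib parser."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def error_report(samples: Iterable[Sample],
                 validators: Dict[str, Validator]) -> Dict[str, Counter]:
    """Count error-node files per language and split them by likely cause."""
    report: Dict[str, Counter] = defaultdict(Counter)
    for sample in samples:
        counts = report[sample.language]
        counts["total"] += 1
        if not sample.has_error_node:
            continue
        counts["with_errors"] += 1
        validator = validators.get(sample.language)
        if validator is None:
            counts["unverified"] += 1       # no independent parser configured
        elif validator(sample.source):
            counts["grammar_suspect"] += 1  # source is valid, grammar likely wrong
        else:
            counts["invalid_source"] += 1   # source itself does not parse
    return dict(report)


def error_rate(counts: Counter) -> float:
    """Fraction of files for a language whose parse contained an error node."""
    return counts["with_errors"] / counts["total"] if counts["total"] else 0.0
```

Keeping the validators in a plain dictionary keyed by language name is what keeps the rest of the script language-independent: a language without a validator is still counted, just reported as unverified.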