semantic/weekly/2016-06-21.md

# Semantic Diff Problems (Mini-Summit)

### Performance (most significant problem)

  - SES / Alignment are biggest time / space consumers.
  - Profiling small subsets of code paths rather than the full context.
  - Adding more criterion benchmarks for code paths not currently profiled (like Diff Summaries).

##### Alignment performance

  - Has to visit each child of each remaining line.

##### [SES](https://github.com/github/semantic-diff/files/22485/An.O.ND.Difference.Algorithm.and.its.Variations.pdf) Performance

  - O(n^3) in the size of the tree.
  - Can try bounded SES (looks ahead by a fixed size of nodes).
  - Identify more comparisons we can skip (e.g. don't compare functions with array literals).
  - Does not look like there are more easy wins here (the algorithm is already implemented to prevent unnecessary comparisons).
  - In some cases, the diffing is expensive because we don't have more fine-grained identifiers for certain diffs (e.g. a test file with 100 statement expressions).
  - Diffing against identifiers (use the edit distance to determine whether to compare terms with SES or not).
  - This could result in us missing a function rename though.
  - Not a catchall, but it can help increase performance in a larger number of cases.
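
The identifier idea above can be sketched as a cheap gate that runs before any SES comparison. This is a minimal illustration, not the project's implementation: `Term`, `worth_comparing`, the category labels, and the `max_ratio` threshold of 0.5 are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """Hypothetical stand-in for a syntax term; not the project's real type."""
    category: str    # e.g. "function" or "array_literal" (illustrative labels)
    identifier: str  # name used for the cheap pre-check; "" if the term has none


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only one previous row in memory."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def worth_comparing(left: Term, right: Term, max_ratio: float = 0.5) -> bool:
    """Decide whether two terms should be handed to the expensive SES step."""
    # Skip pairs that can never match, such as a function and an array literal.
    if left.category != right.category:
        return False
    # With no identifier to gate on, fall back to the full comparison.
    if not left.identifier or not right.identifier:
        return True
    longest = max(len(left.identifier), len(right.identifier))
    return edit_distance(left.identifier, right.identifier) / longest <= max_ratio
```

With this gate, `parseFile` and `parseFiles` are still compared, while `foo` and `barbaz` are skipped. That second case is exactly the trade-off noted above: a function renamed to something very different would never reach SES and would show up as a delete plus an insert.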

##### [RWS](https://github.com/github/semantic-diff/files/325837/RWS-Diff.Flexible.and.Efficient.Change.Detection.in.Hierarchical.Data.pdf) Performance

  - Random Walk Similarity.
  - Computes an approximation to the minimal edit script.
  - O(n log n) rather than O(n^3).
  - RWS does not rely on identifiers.
  - RWS solves our performance problem in the general form.
  - Can allow us to diff patches of patches (something we cannot do currently with our implementation of SES).

##### Diff summaries performance

  - Diff summary (DS) performance depends on diffing performance (Diff Terms, Interpreter, cost functions).

### Failing too hard

  - The request does not complete if Semantic Diff fails.
  - How can we fail better on dotcom?
  - How can we fail better when parsing? (both in Semantic Diff and dotcom)

### Responsiveness

  - Fetch diffs and diff summaries asynchronously, or render them progressively.

### Improving grammars

  - Fix Ruby parser.
  - Testing and verifying other grammars.

### Measure effectiveness of grammars

### Tooling

  - Why isn't parallelization of SES having the expected effect?
  - Should focus on low-hanging fruit, but we're not going to write a debugger.

### Time limitations with respect to solutions and team

### Ramp-up time is extremely variable

### Onboarding

  - Pairing has been fantastic.
  - SES algorithm requires some context and background to understand the code at the general / macro level.
  - Plan a bit before pairing to gain context.

### Pre-launch Ideas

  - Test on a couple of file server nodes and run Semantic Diff on JavaScript repos.
  - Collect the repos, files, and SHAs that contain error nodes to estimate the error rate and expose errors in the tree-sitter grammars.
  - If sources have errors, can we use a parser that validates the source is correct?
  - Write a script that automates the error collection process and is as language-independent as possible, while letting us specify an independent validating parser for each language.
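
The error collection idea above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Sample` record, the `has_error_node` flag (taken as already computed by whatever runs the grammar under test), and the choice of Python and JSON as example languages, picked only because the standard library ships validating parsers for them. A JavaScript validator would plug into the same `validators` table.

```python
import ast
import json
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

# A validator returns True when the source is valid for its language.
Validator = Callable[[str], bool]


def validates_python(source: str) -> bool:
    """Independent check using the standard library's Python parser."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def validates_json(source: str) -> bool:
    """Independent check using the standard library's JSON parser."""
    try:
        json.loads(source)
        return True
    except ValueError:
        return False


@dataclass(frozen=True)
class Sample:
    """Hypothetical record for one parsed file."""
    repo: str
    sha: str
    path: str
    language: str
    source: str
    has_error_node: bool  # did the grammar under test emit an error node?


def classify(sample: Sample, validators: Dict[str, Validator]) -> str:
    """Attribute an error node to the source, the grammar, or neither."""
    if not sample.has_error_node:
        return "ok"
    validator = validators.get(sample.language)
    if validator is None:
        return "unverified"  # no independent parser configured for this language
    # Valid source that still produced an error node points at the grammar.
    return "grammar_error" if validator(sample.source) else "source_error"


def error_report(samples: Iterable[Sample], validators: Dict[str, Validator]) -> dict:
    """Count outcomes and compute the share of files with error nodes."""
    counts = Counter(classify(sample, validators) for sample in samples)
    total = sum(counts.values())
    rate = (total - counts["ok"]) / total if total else 0.0
    return {"counts": dict(counts), "error_rate": rate}
```

Keeping the validators in a plain dictionary keyed by language is what makes the script language-independent: the collection and reporting code never changes, and files in a language without a validator are reported as `unverified` rather than being blamed on the grammar.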