semantic/2017-01-06.md at b80667f5b48505cfc10540a46b68e7084cb5f8dd

github/semantic

Fork 0

mirror of https://github.com/github/semantic.git synced 2024-12-28 09:21:35 +03:00

Rick Winfrey 13790348d4 Clarify homology description

2017-01-06 14:28:16 -08:00

4.0 KiB

Raw Blame History

January 6, 2017

Welcome back @robrix!

In place of our usual format we had an informal conversation identifying status of projects, concerns, and potential projects for the future. Also @joshvera gave us an excellent introduction to finding persistent homologies.

Below are the main talking points of the conversation (speakers are not identified):

Want to identify a mile stone for what to do next for diff summaries.

What's the minimal work necessary to get this in front of customers?
Dependents feature shipped only with Ruby support, so shipping only with JavaScript, Markdown, Ruby (?!) should be fine.
Possible rubric for evaluating a summary statement's value: can a screen reader easily understand the statement?
Consider limiting diff summary statements to only those statements with significant meaning.
- Create heuristics to determine what "significant" means in the context of diff summaries.
View diff summaries through other lenses (i.e. security). Example: does a change to this regex represent a security concern?
Par down feature to essential 2 or 3 aspects that drive most value for customer. Use this to drive conversations with product going forward.

What project or work can we identify and prioritize next for data-science?

Current approach to adding parsers is not scalable. Can we generate mappings somehow programmatically? @robrix possibly to look into this problem starting next week.

Persistent Homologies

Motivating problem:

The problem with all machine learning models is error correction must be done via an error-correcting step. Even then, bias and skewing occur as a natural bi-product of the data samples used to train and test the models.

Motivating questions:

Is there a way to find significant dimensions in a high dimensional data set that do not rely on probability?

Can we use linear functions and stochastic properties to determine what those significant features are?

Homologies:

Given a data set, a homology is a cycle or loop of points in that data set that are significant given a significance function (e.g. distance function).

With these data points, or topological invariants, we can use filtration to further refine the cycles or loops to find the most significant cycles. These are the strongest, or longest living, cycles. This helps us see what points in the n-dimensional data set are most important in relation to one another (rather than in relation to an assumption that is reified through a probabilistic test or approach).

How does one measure how "good" a persistent homology is? One can use a stability distance function to measure the stability of the distance function used to identify cycles.

Future project ideas

Semantic index: a semantically structured index of a GitHub project. Like ctags, the goal would be to create a link-based index of symbols, names or definitions of classes, methods, functions and vars.

This would allow a GitHub user to browse a project based on its semantic index to understand the projects layout, including its classes, methods contained within a class, and from where methods are invoked.

Other possible uses of this could be:

Visualize the weight of a class or method name based on its frequency of invocation in a project.
Visualize the churn weight of a class or method based on its frequency of change over time.
Security implications / risk assessment for the health and stability of a project: given that we know the most important classes and methods of a project (based on point 1), those that are most significant and change the most (point 2) represent a risk to the stability of the project. Projects with main code paths with high degree of churn represent projects that may not provide production level stability and safety for end users. Or indicate that they are still "works in progress" and not ready to be used in a production environment.
Indexed project would link to the source code (e.g. method name links to current master's method definition).

4.0 KiB Raw Blame History

4.0 KiB

Raw Blame History