Merge remote-tracking branch 'origin/master' into stack-graph · github/semantic@4f1e253 · GitHub
This repository was archived by the owner on Apr 1, 2025. It is now read-only.

Commit 4f1e253

Patrick Thomson committed
Merge remote-tracking branch 'origin/master' into stack-graph
2 parents 8125780 + 6636264 commit 4f1e253

18 files changed: +8191 -5818 lines changed

Dockerfile

Lines changed: 3 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -16,6 +16,9 @@ RUN go get github.com/golang/protobuf/proto && \
1616

1717
COPY --from=haskell /root/.cabal/bin/proto-lens-protoc /usr/local/bin/proto-lens-protoc
1818

19+
# Bit of a hack so that proto-lens-protoc actually runs
20+
COPY --from=haskell /opt/ghc/8.8.1/lib/ghc-8.8.1/* /opt/ghc/8.8.1/lib/ghc-8.8.1/
21+
1922
ENTRYPOINT ["/protobuf/bin/protoc", "-I/protobuf", "--plugin=protoc-gen-haskell=/usr/local/bin/proto-lens-protoc"]
2023

2124
# Build semantic

README.md

Lines changed: 0 additions & 1 deletion
@@ -85,7 +85,6 @@ Available options:
8585
| | PHP | 🚧 | 🚧 | 🚧 | 🚧| 🚧 | | | |
8686
| | Java | 🚧 | N/A | 🚧 | 🚧 || | | |
8787
| | JSON || N/A || N/A | N/A | N/A | N/A| |
88-
| | Java | 🚧 | 🚧 | 🚧 | 🔶 || | | |
8988
| | JSX |||| 🔶 | | | | |
9089
| | Haskell | 🚧 | 🚧 | 🚧 | 🔶 | 🚧 | | | |
9190
| | Markdown | 🚧 | 🚧 | 🚧 | 🚧 | N/A | N/A | N/A |   |

docs/codegen.md

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
1+
@ -1,216 +0,0 @@
2+
# CodeGen Documentation
3+
4+
CodeGen is the process for auto-generating language-specific, strongly-typed ASTs to be used in [Semantic](https://github.com/github/semantic-code/blob/d9f91a05dc30a61b9ff8c536d75661d417f3c506/design-docs/precise-code-navigation.md).
5+
6+
### Prerequisites
7+
To get started, first make sure your language has:
8+
9+
1. An existing [tree-sitter](http://tree-sitter.github.io/tree-sitter/) parser;
10+
2. An existing Cabal package in [tree-sitter](http://tree-sitter.github.io/tree-sitter/) for said language. This will provide an interface into tree-sitter's C source. [Here](https://github.com/tree-sitter/haskell-tree-sitter/tree/master/tree-sitter-python) is an example of a library for Python, a supported language that the remaining documentation will refer to.
11+
12+
### CodeGen Pipeline
13+
14+
During parser generation, tree-sitter produces a JSON file that captures the structure of a language's grammar. Based on this, we're able to derive datatypes representing surface languages, and then use those datatypes to generically build ASTs. This automates the engineering effort [historically required for adding a new language](https://github.com/github/semantic/blob/master/docs/adding-new-languages.md).
15+
16+
The following steps provide a high-level outline of the process:
17+
18+
1. [**Deserialize.**](https://github.com/github/semantic/blob/master/semantic-ast/src/AST/Deserialize.hs) First, we deserialize the `node-types.json` file for a given language into the desired shape of datatypes, using the parsing capabilities of the [Aeson](http://hackage.haskell.org/package/aeson) library. The `node-types.json` file represents four distinct kinds of types: sums, products, named leaves, and anonymous leaves.
19+
2. [**Generate Syntax.**](https://github.com/github/semantic/blob/master/semantic-ast/src/AST/GenerateSyntax.hs) We then use Template Haskell to auto-generate language-specific, strongly-typed datatypes that represent various language constructs. This API exports the top-level function `astDeclarationsForLanguage` to auto-generate datatypes at compile time, which is invoked by a given language's [AST](https://github.com/github/semantic/blob/master/semantic-python/src/Language/Python/AST.hs) module.
20+
3. [**Unmarshal.**](https://github.com/github/semantic/blob/master/semantic-ast/src/AST/Unmarshal.hs) Unmarshaling is the process of iterating over tree-sitter’s parse trees using its tree cursor API, and producing Haskell ASTs for the relevant nodes. We parse source code from tree-sitter and unmarshal the data we get to build these ASTs generically. This file exports the top-level function `parseByteString`, which takes source code and a language as arguments, and produces an AST.
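The four kinds of entries named in the Deserialize step can be sketched as a plain Haskell sum type. This is a minimal illustration using only `base`: the names `Datatype`, `RawEntry` and `classify` are hypothetical, the JSON decoding itself (done with Aeson in the real module) is left out, and the classification rule is a simplification rather than the logic in `AST.Deserialize`.

```haskell
-- Hypothetical, simplified model of the four shapes found in node-types.json.
-- None of these names are taken from AST.Deserialize.

-- | One classified entry of node-types.json.
data Datatype
  = SumType String [String]               -- supertype name and its subtypes
  | ProductType String [(String, String)] -- node name and (field name, field type) pairs
  | NamedLeaf String                      -- named node with no fields, e.g. "identifier"
  | AnonymousLeaf String                  -- anonymous token, e.g. "+"
  deriving (Eq, Show)

-- | Flattened view of a raw entry after JSON decoding (an assumption of this sketch).
data RawEntry = RawEntry
  { entryType     :: String
  , entryNamed    :: Bool
  , entrySubtypes :: [String]
  , entryFields   :: [(String, String)]
  }

-- | Assign an entry to one of the four kinds, checking subtypes first,
-- then fields, and finally falling back to a leaf.
classify :: RawEntry -> Datatype
classify (RawEntry ty named subtypes fields)
  | not (null subtypes) = SumType ty subtypes
  | not (null fields)   = ProductType ty fields
  | named               = NamedLeaf ty
  | otherwise           = AnonymousLeaf ty
```

Under these assumptions, the Python identifier entry (`"type": "identifier", "named": true`) classifies as `NamedLeaf "identifier"`.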
21+
22+
Here is an example that describes the relationship between a Python identifier represented in the tree-sitter generated JSON file, and a datatype generated by Template Haskell based on the provided JSON:
23+
24+
| Type | JSON | TH-generated code |
25+
|----------|--------------|------------|
26+
|Named leaf|<code>{<br>"type": "identifier",<br>"named": true<br>}|<code>data TreeSitter.Python.AST.Identifier a<br>= TreeSitter.Python.AST.Identifier {TreeSitter.Python.AST.ann :: a,<br>TreeSitter.Python.AST.bytes :: text-1.2.3.1:Data.Text.Internal.Text} -- Defined at TreeSitter/Python/AST.hs:10:1<br>instance Show a => Show (TreeSitter.Python.AST.Identifier a) -- Defined at TreeSitter/Python/AST.hs:10:1<br>instance Ord a => Ord (TreeSitter.Python.AST.Identifier a) -- Defined at TreeSitter/Python/AST.hs:10:1<br>instance Eq a => Eq (TreeSitter.Python.AST.Identifier a) -- Defined at TreeSitter/Python/AST.hs:10:1<br>instance Traversable TreeSitter.Python.AST.Identifier -- Defined at TreeSitter/Python/AST.hs:10:1<br>instance Functor TreeSitter.Python.AST.Identifier -- Defined at TreeSitter/Python/AST.hs:10:1<br>instance Foldable TreeSitter.Python.AST.Identifier -- Defined at TreeSitter/Python/AST.hs:10:1<br>instance Unmarshal TreeSitter.Python.AST.Identifier -- Defined at TreeSitter/Python/AST.hs:10:1<br>instance SymbolMatching TreeSitter.Python.AST.Identifier -- Defined at TreeSitter/Python/AST.hs:10:1|
27+
28+
The remainder of this document provides more details on generating ASTs, inspecting datatypes, and running tests, along with notes on design decisions in the relevant APIs.
29+
___
30+
31+
### Table of Contents
32+
- [CodeGen Documentation](#codegen-documentation)
33+
- [Prerequisites](#prerequisites)
34+
- [CodeGen Pipeline](#codegen-pipeline)
35+
- [Table of Contents](#table-of-contents)
36+
- [Generating ASTs](#generating-asts)
37+
- [Inspecting auto-generated datatypes](#inspecting-auto-generated-datatypes)
38+
- [Tests](#tests)
39+
- [Additional notes](#additional-notes)
40+
___
41+
42+
### Generating ASTs
43+
44+
To parse source code and produce ASTs locally:
45+
46+
1. Load the REPL for a given language package:
47+
48+
```
49+
cabal new-repl lib:semantic-python
50+
```
51+
52+
2. Set language extensions, `OverloadedStrings` and `TypeApplications`, and import relevant modules, `AST.Unmarshal`, `Source.Range` and `Source.Span`:
53+
54+
```
55+
:seti -XOverloadedStrings
56+
:seti -XTypeApplications
57+
58+
import Source.Span
59+
import Source.Range
60+
import AST.Unmarshal
61+
```
62+
63+
3. You can now call `parseByteString`, passing in the desired language you wish to parse (in this case Python is given by the argument `Language.Python.Grammar.tree_sitter_python`), and the source code (in this case an integer `1`). Since the function is constrained by `(Unmarshal t, UnmarshalAnn a)`, you can use type applications to provide a top-level node `t`, an entry point into the tree, in addition to a polymorphic annotation `a` used to represent range and span. In this case, that top-level root node is `Module`, and the annotation is given by `Span` and `Range` as defined in the [semantic-source](https://github.com/github/semantic/tree/master/semantic-source/src/Source) package:
64+
65+
```
66+
parseByteString @Language.Python.AST.Module @(Source.Span.Span, Source.Range.Range) Language.Python.Grammar.tree_sitter_python "1"
67+
```
68+
69+
This generates the following AST:
70+
71+
```
72+
Right (Module {ann = (Span {start = Pos {line = 0, column = 0}, end = Pos {line = 0, column = 1}},Range {start = 0, end = 1}), extraChildren = [R1 (SimpleStatement {getSimpleStatement = L1 (R1 (R1 (L1 (ExpressionStatement {ann = (Span {start = Pos {line = 0, column = 0}, end = Pos {line = 0, column = 1}},Range {start = 0, end = 1}), extraChildren = L1 (L1 (Expression {getExpression = L1 (L1 (L1 (PrimaryExpression {getPrimaryExpression = R1 (L1 (L1 (L1 (Integer {ann = (Span {start = Pos {line = 0, column = 0}, end = Pos {line = 0, column = 1}},Range {start = 0, end = 1}), text = "1"}))))})))})) :| []}))))})]})
73+
```
74+
75+
### Inspecting auto-generated datatypes
76+
77+
Datatypes are derived from a language and its `node-types.json` file using the `GenerateSyntax` API. You can view these datatypes in the REPL as you would any other datatype, using `:i` after loading the language-specific `AST.hs` module.
78+
79+
```
80+
:l semantic-python/src/Language/Python/AST.hs
81+
Ok, six modules loaded.
82+
*Language.Python.AST Source.Span Source.Range> :i Module
83+
```
84+
85+
This shows us the auto-generated `Module` datatype:
86+
87+
```Haskell
88+
data Module a
89+
= Module {Language.Python.AST.ann :: a,
90+
Language.Python.AST.extraChildren :: [(GHC.Generics.:+:)
91+
CompoundStatement SimpleStatement a]}
92+
-- Defined at /Users/aymannadeem/github/semantic/semantic-python/src/Language/Python/AST.hs:23:1
93+
instance Show a => Show (Module a)
94+
-- Defined at /Users/aymannadeem/github/semantic/semantic-python/src/Language/Python/AST.hs:23:1
95+
instance Ord a => Ord (Module a)
96+
-- Defined at /Users/aymannadeem/github/semantic/semantic-python/src/Language/Python/AST.hs:23:1
97+
instance Eq a => Eq (Module a)
98+
-- Defined at /Users/aymannadeem/github/semantic/semantic-python/src/Language/Python/AST.hs:23:1
99+
instance Traversable Module
100+
-- Defined at /Users/aymannadeem/github/semantic/semantic-python/src/Language/Python/AST.hs:23:1
101+
instance Functor Module
102+
-- Defined at /Users/aymannadeem/github/semantic/semantic-python/src/Language/Python/AST.hs:23:1
103+
instance Foldable Module
104+
-- Defined at /Users/aymannadeem/github/semantic/semantic-python/src/Language/Python/AST.hs:23:1
105+
```
106+
107+
### Tests
108+
109+
As of right now, Hedgehog tests are minimal and only in place for the Python library.
110+
111+
To run tests:
112+
113+
`cabal v2-test semantic-python`
114+
115+
### Additional notes
116+
117+
- [GenerateSyntax](https://github.com/tree-sitter/haskell-tree-sitter/blob/master/tree-sitter/src/TreeSitter/GenerateSyntax.hs) provides a way to pre-define certain datatypes for which Template Haskell is not used. Any datatypes among the node types that have already been defined in the module where the splice is run will be skipped, allowing customization of the representation of parts of the tree. While this gives us flexibility, we encourage using it sparingly, as it imposes an extra maintenance burden, particularly when the grammar changes. It may be used, for example, to parse literals into Haskell equivalents (such as parsing the textual contents of integer literals into `Integer`s), and may require defining `TS.UnmarshalAnn` or `TS.SymbolMatching` instances for (parts of) the custom datatypes, depending on where and how the datatype occurs in the generated tree, in addition to the usual `Foldable`, `Functor`, etc. instances provided for generated datatypes.
118+
- Annotations are captured by a polymorphic parameter `a`
119+
- [Unmarshal](https://github.com/tree-sitter/haskell-tree-sitter/blob/master/tree-sitter/src/TreeSitter/Unmarshal.hs) defines both generic and non-generic classes. This is because the generic behavior differs from what we get non-generically, and in the case of `Maybe` and `[]` we actually prefer doing things non-generically. Since `[]` is a sum, the generic behavior for `:+:` would be invoked, and it would expect repetitions to be represented in the parse tree as right-nested singly-linked lists (e.g., `(a (b (c (d))))`) rather than as consecutive sibling nodes (e.g., `(a b c ...d)`, which is what our trees have). We want to match the latter.
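The point about `[]` being a sum can be checked directly with `GHC.Generics`. The sketch below uses only `base`. `listBranch` shows that the generic representation of a list is a choice between a nil branch and a cons branch. `Node`, `siblingNames` and `nestedNames` are hypothetical stand-ins for parse-tree nodes, not part of the `Unmarshal` API, and contrast the two tree shapes described above.

```haskell
import GHC.Generics

-- | Report which branch of the generic sum representation a list value uses.
-- The representation of [a] is a sum of the nil and cons constructors.
listBranch :: [a] -> String
listBranch xs = case from xs of
  M1 (L1 _) -> "L1 (nil)"
  M1 (R1 _) -> "R1 (cons)"

-- | Hypothetical stand-in for a parse-tree node: a symbol name plus children.
data Node = Node String [Node]

-- | Non-generic handling of repetition: read consecutive sibling nodes directly.
siblingNames :: Node -> [String]
siblingNames (Node _ children) = [name | Node name _ <- children]

-- | The shape a purely generic treatment of the sum would expect instead:
-- a right-nested chain in which each node holds the rest of the list.
nestedNames :: Node -> [String]
nestedNames (Node name [])         = [name]
nestedNames (Node name (next : _)) = name : nestedNames next
```

Here `siblingNames` reads a flat run of children, which corresponds to the sibling layout the text says the parse trees have, while `nestedNames` only works on the right-nested layout.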

proto/semantic.proto

Lines changed: 42 additions & 0 deletions
@@ -27,6 +27,14 @@ message ParseTreeGraphResponse {
2727
repeated ParseTreeFileGraph files = 1;
2828
}
2929

30+
message StackGraphRequest {
31+
repeated Blob blobs = 1;
32+
}
33+
34+
message StackGraphResponse {
35+
repeated StackGraphFile files = 1;
36+
}
37+
3038
message ParseTreeFileGraph {
3139
string path = 1;
3240
string language = 2;
@@ -174,3 +182,37 @@ message Span {
174182
Position start = 1;
175183
Position end = 2;
176184
}
185+
186+
message StackGraphFile {
187+
string path = 1;
188+
string language = 2;
189+
repeated StackGraphNode nodes = 3;
190+
repeated StackGraphPath paths = 4;
191+
repeated ParseError errors = 5;
192+
}
193+
194+
message StackGraphNode {
195+
int64 id = 1;
196+
string name = 2;
197+
string line = 3;
198+
string kind = 4;
199+
Span span = 5;
200+
enum NodeType {
201+
ROOT_SCOPE = 0;
202+
JUMP_TO_SCOPE = 1;
203+
EXPORTED_SCOPE = 2;
204+
DEFINITION = 3;
205+
REFERENCE = 4;
206+
}
207+
NodeType node_type = 6;
208+
}
209+
210+
message StackGraphPath {
211+
repeated string starting_symbol_stack = 1;
212+
int64 starting_scope_stack_size = 2;
213+
int64 from = 3;
214+
string edges = 4;
215+
int64 to = 5;
216+
repeated int64 ending_scope_stack = 6;
217+
repeated string ending_symbol_stack = 7;
218+
}
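As a reading aid for the new messages, here is a hand-written Haskell mirror of `StackGraphNode` and its nested `NodeType` enum. It is a sketch only: it is not the code that `proto-lens-protoc` generates, the record field names and the helper `nodeTypeFromWire` are hypothetical, and the `span` field is omitted because `Span` is defined elsewhere in the proto file.

```haskell
import Data.Int (Int64)

-- | Mirrors the NodeType enum. Constructor order matches the proto tags,
-- so fromEnum yields the numeric value (ROOT_SCOPE = 0 ... REFERENCE = 4).
data NodeType
  = RootScope
  | JumpToScope
  | ExportedScope
  | Definition
  | Reference
  deriving (Bounded, Enum, Eq, Show)

-- | Mirrors the scalar fields of StackGraphNode (span omitted in this sketch).
data StackGraphNode = StackGraphNode
  { nodeId   :: Int64    -- proto field 1: id
  , nodeName :: String   -- proto field 2: name
  , nodeLine :: String   -- proto field 3: line
  , nodeKind :: String   -- proto field 4: kind
  , nodeType :: NodeType -- proto field 6: node_type
  }
  deriving (Eq, Show)

-- | Decode a numeric enum value, rejecting numbers outside the declared range.
nodeTypeFromWire :: Int -> Maybe NodeType
nodeTypeFromWire n
  | n >= fromEnum (minBound :: NodeType) && n <= fromEnum (maxBound :: NodeType) = Just (toEnum n)
  | otherwise = Nothing
```

Deriving `Enum` in declaration order keeps the numeric values aligned with the proto definition, so a reordering of constructors here would silently change the mapping.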

script/ghci-flags

Lines changed: 1 addition & 0 deletions
@@ -70,6 +70,7 @@ function flags {
7070

7171
# disable automatic selection of packages
7272
echo "-hide-all-packages"
73+
echo "-package proto-lens-jsonpb"
7374

7475
# run cabal and emit package flags from the environment file, removing comments & prefixing with -
7576
cabal v2-exec -v0 bash -- -c 'cat "$GHC_ENVIRONMENT"' | grep -v '^--' | sed -e 's/^/-/'

script/protoc

Lines changed: 2 additions & 1 deletion
@@ -12,7 +12,8 @@ PARENT_DIR=$(dirname $(pwd))
1212

1313
export PROJECT="github.com/github/semantic"
1414

15-
# Generate Haskell for semantic's protobuf types
15+
# Generate Haskell for semantic's protobuf types. See the entrypoint in
16+
# Dockerfile for where the protoc plugins are configured.
1617
docker run --rm --user $(id -u):$(id -g) -v $(pwd):/go/src/$PROJECT -w /go/src/$PROJECT \
1718
semantic-protoc --proto_path=proto \
1819
--haskell_out=./src \
