Need guidance on adding new filter, more metadata to index/select/templates #1010

@theodoreb

I'm using zoekt to index 9k repositories of Drupal contributed modules at https://search.tresbien.tech, and it's been a great fit so far: very nice to set up and surprisingly fast. For now I need to make an additional Ajax query to the JSON API to get everything I need to display the results. I'd like to avoid that, and also make it possible to build a query based on some of the extra metadata I'm indexing. I wanted to know what to expect from zoekt and the built-in webserver/templates. I don't have a separate frontend stack where I could massage the data. Let me know if that should be the way to go based on what I'm after :)

## Context

We need to search for code across all our repositories when we change an API to have an idea of the impact it'll have on the ecosystem. It informs things like the level of backwards compatibility to implement, how much publicity we should do, or if we should reach out directly to impacted projects before/after the change.

To do this efficiently we need some extra information about each project and its releases. So the plan is to display extra metadata about the repository on every result, as shown below. The second row of the file metadata header shows the release version with the associated core compatibility (as a PHP Composer version constraint), then the usage, and whether the project is covered by our security team or not:

or when multiple branches match:

## Current implementation

Currently I'm adding that information to the repository during indexing (having #432 would help, as I'd be able to drop a good chunk of code), and I can correctly read it back when querying the JSON endpoint:

"RawConfig": {

  // This is the "supported releases" from the project, with their drupal core compatibility
  "drupal-core": "3.5.x:^9.5 || ^10 || ^11;3.6.x:^9.5 || ^10 || ^11",

  // Simple yes/no value.
  "drupal-security": "covered",

  // The number of installs for every release of the module
  "drupal-usage": "3.0.x:5013;3.1.x:6058;3.2.x:3138;3.3.x:8302;3.4.x:32757;3.5.x:42092;3.6.x:167949;3.x:104;8.x-1.x:9374;8.x-2.x:19031",

  "name": "admin_toolbar",

  // Sum of all the usage data
  "priority": "293818",

  "web-url": "https://git.drupalcode.org/project/admin_toolbar",
  "web-url-type": "gitlab"
},
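The per-branch values above are packed into single strings, so anything consuming them has to unpack them first. A minimal Go sketch of that step, assuming the `branch:value;branch:value` layout shown in the sample; `parseBranchPairs` and `parseUsage` are hypothetical helpers, not part of zoekt:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBranchPairs splits a value such as "3.5.x:42092;3.6.x:167949" into a
// branch -> value map. The layout is taken from the RawConfig sample; the
// helper itself is hypothetical and not a zoekt API.
func parseBranchPairs(raw string) map[string]string {
	out := make(map[string]string)
	for _, pair := range strings.Split(raw, ";") {
		// Cut at the first colon only, so the value is kept verbatim.
		branch, value, ok := strings.Cut(pair, ":")
		if !ok || branch == "" {
			continue // skip empty or malformed entries
		}
		out[branch] = value
	}
	return out
}

// parseUsage does the same for "drupal-usage" and converts the install
// counts to integers, returning an error on a non-numeric count.
func parseUsage(raw string) (map[string]int, error) {
	out := make(map[string]int)
	for branch, v := range parseBranchPairs(raw) {
		n, err := strconv.Atoi(v)
		if err != nil {
			return nil, fmt.Errorf("branch %q: %w", branch, err)
		}
		out[branch] = n
	}
	return out, nil
}

func main() {
	core := parseBranchPairs("3.5.x:^9.5 || ^10 || ^11;3.6.x:^9.5 || ^10 || ^11")
	usage, err := parseUsage("3.5.x:42092;3.6.x:167949")
	if err != nil {
		panic(err)
	}
	for _, b := range []string{"3.5.x", "3.6.x"} {
		fmt.Printf("%s core=%q usage=%d\n", b, core[b], usage[b])
	}
}
```

Cutting at the first colon relies on branch names and Composer constraints not containing a colon themselves, which holds for the values in the sample.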

We have some branch-specific data. What I can index today is the priority, mapped to the sum of installs, but I'd like the priority to depend on the branches that are matched in the response. For example:

branch 3.5.x:
  - core compatibility: ^9.5 || ^10 || ^11
  - usage: 42092
branch 3.6.x: 
  - core compatibility: ^9.5 || ^10 || ^11
  - usage: 167949
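To make the intended behaviour concrete, here is a small Go sketch of a branch-dependent priority: the sum of the usage of the branches that matched. `matchedPriority` is a hypothetical function for illustration; zoekt does not compute priority this way today as far as I know.

```go
package main

import "fmt"

// matchedPriority sums the usage of the branches that matched a query.
// Hypothetical helper: each branch is counted once, and a branch without
// usage data contributes 0.
func matchedPriority(usage map[string]int, matched []string) int {
	seen := make(map[string]bool, len(matched))
	total := 0
	for _, branch := range matched {
		if seen[branch] {
			continue
		}
		seen[branch] = true
		total += usage[branch]
	}
	return total
}

func main() {
	// Usage figures from the two branches listed above.
	usage := map[string]int{"3.5.x": 42092, "3.6.x": 167949}

	fmt.Println(matchedPriority(usage, []string{"3.5.x"}))          // 42092
	fmt.Println(matchedPriority(usage, []string{"3.5.x", "3.6.x"})) // 210041
}
```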

## Questions

  1. Would it make sense to expose the repository metadata to results.html.tpl, i.e. make the repository RawConfig accessible from that template, so I can refer to it there instead of making an Ajax request to the repo list endpoint?
  2. Language detection: I see that language detection happens in https://github.com/sourcegraph/zoekt/blob/main/languages/languages.go. We have PHP files named with the extensions .module, .theme, .install and a few others. Ideally they'd show up when using lang:php. Could that live in a config file somewhere for zoekt to pick up, or would it be a case of improving upstream?
  3. Ideally I'd like to have a custom core: 11 query that uses the version constraint to filter results, and a security: yes to further filter things. I had a look at "Feature request: is it possible to add a query filter on "topics:"" (#783) and "add topic:XXX support to find data by github topic" (#939), but that seems as static as the archived/fork/public special keywords. Trying to make it more generic might make it too generic? I don't know; I could live with a config file to declare the additional filters.
  4. Could a different sorting method be used? I'd be interested in a sort based exclusively on usage/priority.
  5. Branch-specific data: linked to the question above. For example, if a search matched only the 3.5.x branch, the priority used would be 42092; if the search matched both branches, the priority would be 210041 (the sum of the two branches).
  6. In the same spirit, branch-specific RawConfig, so that in the template I can get the metadata associated with the branches matched for that result.
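To illustrate questions 1 and 6, this is roughly what I'd like to be able to write in a template if RawConfig were available on each result. It is only a sketch using Go's standard `html/template`: the `repoResult` type, its fields, and the template text are assumptions for illustration and do not reflect the data zoekt actually passes to results.html.tpl.

```go
package main

import (
	"fmt"
	"html/template"
	"strings"
)

// repoResult is a hypothetical stand-in for the per-result template data.
// zoekt's real template data may be shaped differently.
type repoResult struct {
	Name      string
	RawConfig map[string]string
}

// resultTpl reads optional RawConfig keys with the built-in "index"
// function; "with" skips the fragment when a key is missing or empty.
const resultTpl = `{{.Name}}` +
	`{{with index .RawConfig "drupal-security"}} [security: {{.}}]{{end}}` +
	`{{with index .RawConfig "priority"}} [installs: {{.}}]{{end}}`

// renderResult executes the template for one result and returns the text.
func renderResult(r repoResult) (string, error) {
	tpl, err := template.New("result").Parse(resultTpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tpl.Execute(&sb, r); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := renderResult(repoResult{
		Name: "admin_toolbar",
		RawConfig: map[string]string{
			"drupal-security": "covered",
			"priority":        "293818",
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

A per-branch variant would need the same kind of map keyed by branch, which is what question 6 is asking for.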

I used LLM-based tooling to do the setup, and it's very happy to patch the zoekt code and compile a custom version to implement some of the things above. I don't want the headache of maintaining a fork, so I wanted to know what level of customization to expect, or whether patching/maintaining a fork would be the preferred solution here.

I'm not a Go developer, but lately a few Go tools have become useful to me, so I'd be happy to take on a few things and get to know Go a bit better :)

In any case, it's already very useful as-is. Thanks!
