srgn - a code surgeon

A grep-like tool which understands source code syntax and allows for manipulation in addition to search.

Like grep, regular expressions are a core primitive. Unlike grep, additional capabilities allow for higher precision, with options for manipulation. This allows srgn to operate along dimensions regular expressions and IDE tooling (Rename all, Find all references, ...) alone cannot, complementing them.

srgn is organized around actions to take (if any), acting only within precise, optionally language grammar-aware scopes. In terms of existing tools, think of it as a mix of tr, sed, ripgrep and tree-sitter, with a design goal of simplicity: if you know regex and the basics of the language you are working with, you are good to go.

Quick walkthrough

Tip

All code snippets displayed here are verified as part of unit tests using the actual srgn binary. What is showcased here is guaranteed to work.

The simplest srgn usage works similarly to tr:

$ echo 'Hello World!' | srgn '[wW]orld' 'there' # replacement
Hello there!

Matches for the regular expression pattern '[wW]orld' (the scope) are replaced (the action) by the second positional argument. Zero or more actions can be specified:

$ echo 'Hello World!' | srgn '[wW]orld' # zero actions: input returned unchanged
Hello World!
$ echo 'Hello World!' | srgn --upper '[wW]orld' 'you' # two actions: replacement, afterwards uppercasing
Hello YOU!

Replacement is always performed first and specified positionally. Any other actions are applied after and given as command line flags.
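The ordering rule above (replacement first, remaining actions afterwards, everything confined to the scope) can be pictured with a few lines of Python. This is only an illustrative sketch, assuming a scope is a plain regex and actions run once per match; `run_actions` is a hypothetical helper and not part of srgn.

```python
import re
from typing import Optional


def run_actions(text: str, scope: str, replacement: Optional[str] = None,
                upper: bool = False) -> str:
    """Hypothetical helper: apply actions only to regex matches of `scope`."""

    def act(match: "re.Match[str]") -> str:
        out = match.group(0)
        if replacement is not None:
            out = replacement      # replacement always runs first
        if upper:
            out = out.upper()      # any other action runs afterwards
        return out

    return re.sub(scope, act, text)


print(run_actions("Hello World!", r"[wW]orld", "there"))           # Hello there!
print(run_actions("Hello World!", r"[wW]orld"))                     # Hello World!
print(run_actions("Hello World!", r"[wW]orld", "you", upper=True))  # Hello YOU!
```

Text outside the scope is never touched, which is why `Hello` keeps its casing in the last call.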

Multiple scopes

Similarly, more than one scope can be specified: in addition to the regex pattern, a language grammar-aware scope can be given, which scopes to syntactical elements of source code (think, for example, "all bodies of class definitions in Python"). If both are given, the regular expression pattern is then only applied within that first, language scope. This enables search and manipulation at a precision not normally possible using plain regular expressions, and serves a dimension different from tools such as Rename all in IDEs.

For example, consider this (pointless) Python source file:

"""Module for watching birds and their age."""

from dataclasses import dataclass


@dataclass
class Bird:
    """A bird!"""

    name: str
    age: int

    def celebrate_birthday(self):
        print("?")
        self.age += 1

    @classmethod
    def from_egg(egg):
        """Create a bird from an egg."""
        pass  # No bird here yet!


def register_bird(bird: Bird, db: Db) -> None:
    assert bird.age >= 0
    with db.tx() as tx:
        tx.insert(bird)

which can be searched using:

$ cat birds.py | srgn --python 'class' 'age'
11:    age: int
15:        self.age += 1

The string age was sought and found only within Python class definitions (and not, for example, in function bodies such as register_bird, where age also occurs and would be nigh impossible to exclude from consideration in vanilla grep). By default, this 'search mode' also prints line numbers. Search mode is entered if no actions are specified, and a language such as --python is given¹—think of it like 'ripgrep but with syntactical language elements'.
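To make "language scope first, regex second" concrete, here is a rough Python approximation. It is a sketch under stated assumptions: it uses Python's own `ast` module instead of tree-sitter, works line by line rather than on byte ranges, and runs against a shortened copy of the file above. `grep_in_classes` is a hypothetical name.

```python
import ast
import re

# Shortened copy of birds.py from above (from_egg omitted).
SOURCE = '''\
"""Module for watching birds and their age."""

from dataclasses import dataclass


@dataclass
class Bird:
    """A bird!"""

    name: str
    age: int

    def celebrate_birthday(self):
        print("?")
        self.age += 1


def register_bird(bird, db):
    assert bird.age >= 0
'''


def grep_in_classes(source, pattern):
    """Return (line number, line) pairs matching `pattern` inside class definitions."""
    lines = source.splitlines()
    hits = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            # First narrow to the class (the language scope) ...
            for lineno in range(node.lineno, node.end_lineno + 1):
                # ... then apply the regex only inside that scope.
                if re.search(pattern, lines[lineno - 1]):
                    hits[lineno] = lines[lineno - 1]
    return sorted(hits.items())


for lineno, line in grep_in_classes(SOURCE, r"age"):
    print(f"{lineno}:{line}")
```

The occurrences of `age` in the module docstring and in `register_bird` fall outside every `ClassDef` node and are therefore never seen by the regex.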

Searching can also be performed across lines, for example to find methods (aka def within class) lacking docstrings:

$ cat birds.py | srgn --python 'class' 'def .+:\n\s+[^"\s]{3}' # do not try this pattern at home
13:    def celebrate_birthday(self):
14:        print("?")

Note how this does not surface either from_egg (has a docstring) or register_bird (not a method, def outside class).

Multiple language scopes

Language scopes themselves can be specified multiple times as well. For example, in the Rust snippet

pub enum Genre {
    Rock(Subgenre),
    Jazz,
}

const MOST_POPULAR_SUBGENRE: Subgenre = Subgenre::Something;

pub struct Musician {
    name: String,
    genres: Vec<Subgenre>,
}

multiple items can be surgically drilled down into as

$ cat music.rs | srgn --rust 'pub-enum' --rust 'type-identifier' 'Subgenre' # AND'ed together
2:    Rock(Subgenre),

where only lines matching all criteria are returned, acting like a logical and between all conditions. Note that conditions are evaluated left-to-right, precluding some combinations from making sense: for example, searching for a Python class body inside of Python doc-strings usually returns nothing. The inverse works as expected however:

$ cat birds.py | srgn --py 'class' --py 'doc-strings' 
8:    """A bird!"""
19:        """Create a bird from an egg."""

No docstrings outside class bodies are surfaced!

The -j flag changes this behavior: from intersecting left-to-right, to running all queries independently and joining their results, allowing you to search multiple ways at once:

$ cat birds.py | srgn -j --python 'comments' --python 'doc-strings' 'bird[^s]'
8:    """A bird!"""
19:        """Create a bird from an egg."""
20:        pass  # No bird here yet!

The pattern bird[^s] was found inside of comments or docstrings likewise, not just "docstrings within comments".
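The difference between the default (intersecting) behavior and -j (joining) can be modeled with plain set operations. The sketch below is a simplification: it treats each scope as a set of line numbers that were read off the birds.py listing by hand, whereas the real tool narrows ranges of the input. All names are hypothetical.

```python
import re

# Relevant lines of birds.py, keyed by line number (copied by hand from the listing).
LINES = {
    1: '"""Module for watching birds and their age."""',
    8: '    """A bird!"""',
    19: '        """Create a bird from an egg."""',
    20: "        pass  # No bird here yet!",
}
CLASS_LINES = set(range(7, 21))  # `class Bird:` through the end of from_egg
DOCSTRING_LINES = {1, 8, 19}
COMMENT_LINES = {20}


def matching(line_numbers, pattern):
    """Keep only the line numbers whose text matches `pattern`."""
    return sorted(n for n in line_numbers
                  if n in LINES and re.search(pattern, LINES[n]))


# Default: scopes are AND'ed, each one narrowing the previous.
intersected = sorted(CLASS_LINES & DOCSTRING_LINES)
# With -j: scopes are OR'ed, then the regex is applied to the joined result.
joined = matching(COMMENT_LINES | DOCSTRING_LINES, r"bird[^s]")

print(intersected)  # [8, 19]
print(joined)       # [8, 19, 20]
```

Line 1 is a docstring too, but it only contains `birds`, which `bird[^s]` rejects.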

Working recursively

If standard input is not given, srgn knows how to find relevant source files automatically, for example in this repository:

$ srgn --python 'class' 'age'
docs/samples/birds
11:    age: int
15:        self.age += 1

docs/samples/birds.py
9:    age: int
13:        self.age += 1

It recursively walks its current directory, finding files based on file extensions and shebang lines, processing at very high speed. For example, srgn --go strings '\d+' finds and prints all ~140,000 runs of digits in literal Go strings inside the Kubernetes codebase of ~3,000,000 lines of Go code within 3 seconds on 12 cores of M3. For more on working with many files, see below.

Combining actions and scopes

Scopes and actions can be combined almost arbitrarily (though many combinations are not going to be use- or even meaningful). For example, consider this Python snippet (for examples using other supported languages see below):

"""GNU module."""

def GNU_says_moo():
    """The GNU function -> say moo -> ✅"""

    GNU = """
      GNU
    """  # the GNU...

    print(GNU + " says moo")  # ...says moo

against which the following command is run:

cat gnu.py | srgn --titlecase --python 'doc-strings' '(?<!The )GNU ([a-z]+)' '$1: GNU 🐂 is not Unix'

The anatomy of that invocation is:

--titlecase (an action) will Titlecase Everything Found In Scope

--python 'doc-strings' (a scope) will scope to (i.e., only take into consideration) docstrings according to the Python language grammar

'(?<!The )GNU ([a-z]+)' (a scope) sees only what was already scoped by the previous option, and will narrow it down further. It can never extend the previous scope. The regular expression scope is applied after any language scope(s).

(?<!) is negative lookbehind syntax, demonstrating how this advanced feature is available. Strings of GNU prefixed by The will not be considered.

'$1: GNU 🐂 is not Unix' (an action) will replace each matched occurrence (i.e., each input section found to be in scope) with this string. Matched occurrences are patterns of '(?<!The )GNU ([a-z]+)' only within Python docstrings. Notably, this replacement string demonstrates:

dynamic variable binding and substitution using $1, which carries the contents captured by the first capturing regex group. That's ([a-z]+), as (?<!The ) is not capturing.

full Unicode support (🐂).

The command makes use of multiple scopes (language and regex pattern) and multiple actions (replacement and titlecasing). The result then reads

"""Module: GNU 🐂 Is Not Unix."""

def GNU_says_moo():
    """The GNU function -> say moo -> ✅"""

    GNU = """
      GNU
    """  # the GNU...

    print(GNU + " says moo")  # ...says moo

where the changes are limited to:

- """GNU module."""
+ """Module: GNU 🐂 Is Not Unix."""

def GNU_says_moo():
    """The GNU function -> say moo -> ✅"""

Warning
While srgn is in beta (major version 0), make sure to only
(recursively) process files you can safely
restore.
Search mode does not overwrite files, so is always safe.

See below for the full help output of the tool.

Note
Supported languages are

C
C#
Go
HCL (Terraform)
Python
Rust
TypeScript


Installation
Prebuilt binaries
Download a prebuilt binary from the
releases.
cargo-binstall
This crate provides its binaries in a format
compatible
with cargo-binstall:

Install the Rust toolchain
Run cargo install cargo-binstall (might take a while)
Run cargo binstall srgn (couple seconds, as it downloads prebuilt
binaries from GitHub)

These steps are guaranteed to work™, as they are tested in
CI. They also work if no prebuilt binaries are available
for your platform, as the tool will fall back to compiling from
source.
Homebrew
A formula is available via:
brew install srgn

Nix
Available via unstable:
nix-shell -p srgn

Arch Linux
Available via the AUR.
MacPorts
A port is available:
sudo port install srgn

CI (GitHub Actions)
All GitHub Actions runner
images come with cargo
preinstalled, and cargo-binstall provides a convenient GitHub
Action:
jobs:
  srgn:
    name: Install srgn in CI
    # All three major OSes work
    runs-on: ubuntu-latest
    steps:
      - uses: cargo-bins/cargo-binstall@main
      - name: Install binary
        run: >
          cargo binstall
          --no-confirm
          srgn
      - name: Use binary
        run: srgn --version
The above concludes in just 5 seconds
total, as no
compilation is required. For more context, see cargo-binstall's advice on
CI.
Cargo (compile from source)

Install the Rust toolchain
A C compiler is required:


On Linux, gcc works.


On macOS, use clang.


On Windows, MSVC works.
Select "Desktop development with C++" on installation.



Run cargo install srgn


Cargo (as a Rust library)
cargo add srgn

See here for more.
Shell completions
Various
shells
are supported for shell completion scripts. For example, append eval "$(srgn --completions zsh)" to ~/.zshrc for completions in ZSH. An interactive session can
then look like:

Walkthrough
The tool is designed around scopes and actions. Scopes narrow down the parts of
the input to process. Actions then perform the processing. Generally, both scopes and
actions are composable, so more than one of each may be passed. Both are optional (but
taking no action is pointless); specifying no scope implies the entire input is in
scope.
At the same time, there is considerable overlap with plain
tr: the tool is designed to have close correspondence in the most common use
cases, and only go beyond when needed.
Actions
The simplest action is replacement. It is specially accessed (as an argument, not an
option) for compatibility with tr, and general ergonomics. All other actions are
given as flags, or options should they take a value.
Replacement
For example, simple, single-character replacements work as in tr:
$ echo 'Hello, World!' | srgn 'H' 'J'
Jello, World!
The first argument is the scope (literal H in this case). Anything matched by it is
subject to processing (replacement by J, the second argument, in this case). However,
there is no direct concept of character classes as in tr. Instead, by
default, the scope is a regular expression pattern, so its
classes can be used to
similar effect:
$ echo 'Hello, World!' | srgn '[a-z]' '_'
H____, W____!
The replacement occurs greedily across the entire match by default (note the UTS
character class,
reminiscent of tr's
[:alnum:]):
$ echo 'ghp_oHn0As3cr3T!!' | srgn 'ghp_[[:alnum:]]+' '*' # A GitHub token
*!!
Advanced regex features are
supported, for
example lookarounds:
$ echo 'ghp_oHn0As3cr3T' | srgn '(?<=ghp_)[[:alnum:]]+' '*'
ghp_*
Take care in using these safely, as advanced patterns come without certain safety and
performance guarantees. If they
aren't used, performance is not
impacted.
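For readers more familiar with Python's `re` module, the two token examples above behave the same way there. This is only a comparison sketch: Python's `re` does not support the `[[:alnum:]]` class, so an explicit ASCII range stands in for it.

```python
import re

ALNUM = "[0-9A-Za-z]"  # stand-in for [[:alnum:]], which Python's re lacks

# The whole match, prefix included, is replaced.
whole = re.sub(rf"ghp_{ALNUM}+", "*", "ghp_oHn0As3cr3T!!")
# The lookbehind asserts the prefix without making it part of the match.
tail_only = re.sub(rf"(?<=ghp_){ALNUM}+", "*", "ghp_oHn0As3cr3T")

print(whole)      # *!!
print(tail_only)  # ghp_*
```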
The replacement is not limited to a single character. It can be any string, for example
to fix this quote:
$ echo '"Using regex, I now have no issues."' | srgn 'no issues' '2 problems'
"Using regex, I now have 2 problems."
The tool is fully Unicode-aware, with useful support for certain advanced
character
classes:
$ echo 'Mood: 🙂' | srgn '🙂' '😀'
Mood: 😀
$ echo 'Mood: 🤮🤒🤧🦠 :(' | srgn '\p{Emoji_Presentation}' '😷'
Mood: 😷😷😷😷 :(
Variables
Replacements are aware of variables, which are made accessible for use through regex
capture groups. Capture groups can be numbered, or optionally named. The zeroth capture
group corresponds to the entire match.
$ echo 'Swap It' | srgn '(\w+) (\w+)' '$2 $1' # Regular, numbered
It Swap
$ echo 'Swap It' | srgn '(\w+) (\w+)' '$2 $1$1$1' # Use as many times as you'd like
It SwapSwapSwap
$ echo 'Call +1-206-555-0100!' | srgn 'Call (\+?\d-\d{3}-\d{3}-\d{4}).+' 'The phone number in "$0" is: $1.' # Variable `0` is the entire match
The phone number in "Call +1-206-555-0100!" is: +1-206-555-0100.
A more advanced use case is, for example, code refactoring using named capture groups
(perhaps you can come up with a more useful one...):
$ echo 'let x = 3;' | srgn 'let (?<var>[a-z]+) = (?<expr>.+);' 'const $var$var = $expr + $expr;'
const xx = 3 + 3;
As in bash, use curly braces to disambiguate variables from immediately adjacent
content:
$ echo '12' | srgn '(\d)(\d)' '$2${1}1'
211
$ echo '12' | srgn '(\d)(\d)' '$2$11' # will fail (`11` is unknown)
$ echo '12' | srgn '(\d)(\d)' '$2${11' # will fail (brace was not closed)
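As a point of comparison, the same substitutions can be reproduced with Python's `re` module. Note that the syntax differs and is Python's, not srgn's: groups are referenced as `\1` or `\g<name>`, and named groups are written `(?P<name>...)`.

```python
import re

# Numbered groups: srgn's `$2 $1` corresponds to `\2 \1` in Python.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "Swap It")

# Named groups: (?P<name>...) in the pattern, \g<name> in the replacement.
refactored = re.sub(
    r"let (?P<var>[a-z]+) = (?P<expr>.+);",
    r"const \g<var>\g<var> = \g<expr> + \g<expr>;",
    "let x = 3;",
)

# \g<1> plays the role of ${1}: it separates the group number from a literal 1.
disambiguated = re.sub(r"(\d)(\d)", r"\2\g<1>1", "12")

# Without the delimiters, the reference is read as group 11, which does not exist.
try:
    re.sub(r"(\d)(\d)", r"\2\11", "12")
    failed = False
except re.error:
    failed = True

print(swapped)        # It Swap
print(refactored)     # const xx = 3 + 3;
print(disambiguated)  # 211
print(failed)         # True
```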
Beyond replacement
Seeing how the replacement is merely a static string, its usefulness is limited. This is
where tr's secret sauce
ordinarily comes into play: using its character classes, which are valid in the second
position as well, neatly translating from members of the first to the second. Here,
those classes are instead regexes, and only valid in first position (the scope). A
regular expression being a state machine, it is impossible to match onto a 'list of
characters', which in tr is the second (optional) argument. That concept is out the
window, and its flexibility lost.
Instead, the offered actions, all of them fixed, are used. A peek at the most
common use cases for tr reveals that the provided set of
actions covers virtually all of them! Feel free to file an issue if your use case is not
covered.
Onto the next action.
Deletion
Removes whatever is found from the input. Same flag name as in tr.
$ echo 'Hello, World!' | srgn -d '(H|W|!)'
ello, orld

Note
As the default scope is to match the entire input, it is an error to specify
deletion without a scope.

Squeezing
Squeezes repeats of characters matching the scope into single occurrences. Same flag
name as in tr.
$ echo 'Helloooo Woooorld!!!' | srgn -s '(o|!)'
Hello World!
If a character class is passed, all members of that class are squeezed into whatever
class member was encountered first:
$ echo 'The number is: 3490834' | srgn -s '\d'
The number is: 3
Greediness in matching is not modified, so take care:
$ echo 'Winter is coming... 🌞🌞🌞' | srgn -s '🌞+'
Winter is coming... 🌞🌞🌞

Note
The pattern matched the entire run of suns, so there's nothing to squeeze. Summer
prevails.

Invert greediness if the use case calls for it:
$ echo 'Winter is coming... 🌞🌞🌞' | srgn -s '🌞+?' '☃️'
Winter is coming... ☃️

Note
Again, as with deletion, specifying squeezing without an explicit scope
is an error. Otherwise, the entire input is squeezed.
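The squeezing behavior described in this section, including the effect of greediness, can be approximated in Python as follows. This is a sketch of the observable behavior only, assuming that a run of adjacent matches collapses into the first match of that run; `squeeze` is a hypothetical helper, not srgn's implementation.

```python
import re


def squeeze(text: str, scope: str) -> str:
    """Collapse each run of adjacent matches of `scope` into the run's first match."""
    run = rf"({scope})(?:{scope})*"  # group 1 is the first match of the run
    return re.sub(run, lambda m: m.group(1), text)


print(squeeze("Helloooo Woooorld!!!", r"(o|!)"))  # Hello World!
print(squeeze("The number is: 3490834", r"\d"))   # The number is: 3
print(squeeze("zzz", r"z+"))    # zzz (greedy: a single match, nothing to squeeze)
print(squeeze("zzz", r"z+?"))   # z   (lazy: three adjacent matches, squeezed)
```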

Character casing
A good chunk of tr usage falls into this category. It's
very straightforward.
$ echo 'Hello, World!' | srgn --lower
hello, world!
$ echo 'Hello, World!' | srgn --upper
HELLO, WORLD!
$ echo 'hello, world!' | srgn --titlecase
Hello, World!
Normalization
Decomposes input according to Normalization Form
D, and then discards
code points of the Mark
category
(see examples). That roughly means:
take fancy character, rip off dangly bits, throw those away.
$ echo 'Naïve jalapeño ärgert mgła' | srgn -d '\P{ASCII}' # Naive approach
Nave jalapeo rgert mga
$ echo 'Naïve jalapeño ärgert mgła' | srgn --normalize # Normalize is smarter
Naive jalapeno argert mgła
Notice how mgła is out of scope for NFD, as it is "atomic" and thus not decomposable
(at least that's what ChatGPT whispers in my ear).
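The two steps described above (decompose to NFD, then drop code points of the Mark category) can be reproduced with Python's standard `unicodedata` module. This is an independent sketch of the technique, not srgn's code; `strip_marks` is a hypothetical name.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop every code point in a Mark category (Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.category(ch).startswith("M"))


# Escapes are used so the input is guaranteed to be in composed form.
sample = "Na\u00efve jalape\u00f1o \u00e4rgert mg\u0142a"
print(strip_marks(sample))  # Naive jalapeno argert mgła

# U+0142 (ł) has no canonical decomposition, so NFD leaves it untouched.
print(repr(unicodedata.decomposition("\u0142")))  # ''
```

The empty decomposition for U+0142 is what makes mgła come out unchanged.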
Symbols
This action replaces multi-character, ASCII symbols with appropriate single-code point,
native Unicode counterparts.
$ echo '(A --> B) != C --- obviously' | srgn --symbols
(A ⟶ B) ≠ C — obviously
Alternatively, if you're only interested in math, make use of scoping:
$ echo 'A <= B --- More is--obviously--possible' | srgn --symbols '<='
A ≤ B --- More is--obviously--possible
As there is a 1:1 correspondence between an
ASCII symbol and its replacement, the effect is reversible²:
$ echo 'A ⇒ B' | srgn --symbols --invert
A => B
There is only a limited set of symbols supported as of right now, but more can be added.
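Why a 1:1 correspondence makes the action reversible can be seen in a small table-driven sketch. The table below contains only a few pairs taken from the examples above and is not the tool's full symbol list; `substitute` is a hypothetical helper.

```python
import re

# A few ASCII/Unicode pairs from the examples above (not the complete list).
SYMBOLS = {"-->": "⟶", "!=": "≠", "<=": "≤", "=>": "⇒"}
# Because the mapping is one-to-one, flipping it gives the inverse action.
INVERTED = {v: k for k, v in SYMBOLS.items()}


def substitute(text: str, table: dict) -> str:
    """Replace every key of `table` found in `text` with its value."""
    # Longest keys first, so a longer symbol wins over a shorter one it contains.
    keys = sorted(table, key=len, reverse=True)
    pattern = "|".join(re.escape(k) for k in keys)
    return re.sub(pattern, lambda m: table[m.group(0)], text)


print(substitute("(A --> B) != C", SYMBOLS))  # (A ⟶ B) ≠ C
print(substitute("A ⇒ B", INVERTED))          # A => B
```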
German
This action replaces alternative spellings of German special characters (ae, oe, ue, ss)
with their native versions (ä, ö, ü, ß)³.
$ echo 'Gruess Gott, Neueroeffnungen, Poeten und Abenteuergruetze!' | srgn --german
Grüß Gott, Neueröffnungen, Poeten und Abenteuergrütze!
This action is based on a word list (compile without
german feature if this bloats your binary too much). Note the following features about
the above example:

empty scope and replacement: the entire input will be processed, and no replacement is
performed

Poeten remained as-is, instead of being naively and mistakenly converted to Pöten

as a (compound) word, Abenteuergrütze is not going to be found in any reasonable
word list, but was
handled properly nonetheless

while part of a compound word, Abenteuer remained as-is as well, instead of being
incorrectly converted to Abenteür

lastly, Neueroeffnungen sneakily forms a ue element neither constituent word
(neu, Eröffnungen) possesses, but is still processed correctly (despite the
mismatched casings as well)

On request, replacements may be forced, as is potentially useful for names:
$ echo 'Frau Loetter steht ueber der Mauer.' | srgn --german-naive '(?<=Frau )\w+'
Frau Lötter steht ueber der Mauer.
Through positive lookbehind, nothing but the name following the salutation was scoped and therefore changed.
Mauer correctly remained as-is, but ueber was not processed. A second pass fixes
this:
$ echo 'Frau Loetter steht ueber der Mauer.' | srgn --german-naive '(?<=Frau )\w+' | srgn --german
Frau Lötter steht über der Mauer.

Note
Options and flags pertaining to some "parent" are prefixed with their parent's name,
and will imply their parent when given, such that the latter does not need to be
passed explicitly. That's why --german-naive is named as it is, and --german
needn't be passed.
This behavior might change once clap supports subcommand
chaining.

Some branches are undecidable for this modest tool, as it operates without language
context. For example, both Busse (busses) and Buße (penance) are legal words. By
default, replacements are greedily performed if legal (that's the whole
point of srgn,
after all), but there's a flag for toggling this behavior:
$ echo 'Busse und Geluebte ' | srgn --german
Buße und Gelübte 
$ echo 'Busse 🚌 und Fussgaenger 🚶‍♀️' | srgn --german-prefer-original
Busse 🚌 und Fußgänger 🚶‍♀️
Combining Actions
Most actions are composable, unless doing so were nonsensical (like for
deletion). Their order of application is fixed, so the order of the flags
given has no influence (piping multiple runs is an alternative, if needed). Replacements
always occur first. Generally, the CLI is designed to prevent misuse and
surprises: it prefers
crashing to doing something unexpected (which is subjective, of course). Note that lots
of combinations are technically possible, but might yield nonsensical results.
Combining actions might look like:
$ echo 'Koeffizienten != Bruecken...' | srgn -Sgu
KOEFFIZIENTEN ≠ BRÜCKEN...
A more narrow scope can be specified, and will apply to all actions equally:
$ echo 'Koeffizienten != Bruecken...' | srgn -Sgu '\b\w{1,8}\b'
Koeffizienten != BRÜCKEN...
The word boundaries are
required as otherwise Koeffizienten is matched as Koeffizi and enten. Note how the
trailing periods cannot be, for example, squeezed. The required scope of \. would
interfere with the given one. Regular piping solves this:
$ echo 'Koeffizienten != Bruecken...' | srgn -Sgu '\b\w{1,8}\b' | srgn -s '\.'
Koeffizienten != BRÜCKEN.
Note: regex escaping (\.) can be circumvented using literal scoping.
The specially treated replacement action is also composable:
$ echo 'Mooood: 🤮🤒🤧🦠!!!' | srgn -s '\p{Emoji}' '😷'
Mooood: 😷!!!
Emojis are first all replaced, then squeezed. Notice how nothing else is squeezed.
Scopes
Scopes are the second driving concept of srgn. In the default case, the main scope is
a regular expression. The actions section showcased this use case in some
detail, so it's not repeated here. It is given as a first positional argument.
Language grammar-aware scopes
srgn extends this through prepared, language grammar-aware scopes, made possible
through the excellent tree-sitter
library. It offers a
queries feature,
which works much like pattern matching against a tree data
structure.
srgn comes bundled with a handful of the most useful of these queries. Through its
discoverable API (either as a library or via CLI, srgn --help), one
can learn of the supported languages and available, prepared queries. Each supported
language comes with an escape hatch, allowing you to run your own, custom ad-hoc
queries. The hatch comes in the form of --lang-query, where lang is a
language such as python. See below for more on this advanced topic.


Note
Language scopes are applied first, so whatever regex aka main scope you pass, it
operates on each matched language construct individually.

Prepared queries (sample showcases)
This section shows examples for some of the prepared queries.
Finding all unsafe code (Rust)
One advantage of the unsafe keyword in
Rust is its "grepability".
However, an rg 'unsafe' will of course surface all string matches (rg '\bunsafe\b'
helps to an extent), not just those of the actual Rust language keyword. srgn helps
make this more precise. For example:
// Oh no, an unsafe module!
mod scary_unsafe_operations {
    pub unsafe fn unsafe_array_access(arr: &[i32], index: usize) -> i32 {
        // UNSAFE: This function performs unsafe array access without bounds checking
        *arr.get_unchecked(index)
    }

    pub fn call_unsafe_function() {
        let unsafe_numbers = vec![1, 2, 3, 4, 5];
        println!("About to perform an unsafe operation!");
        let result = unsafe {
            // Calling an unsafe function
            unsafe_array_access(&unsafe_numbers, 10)
        };
        println!("Result of unsafe operation: {}", result);
    }
}
can be searched as

