# `Linx.NFT.Tokenizer`
[🔗](https://github.com/oshlabs/linx/blob/v0.2.0/lib/linx/nft/tokenizer.ex#L1)

Char-by-char lexer for the `~NFT` sigil and `.nft` files.

Mirrors the architecture of `Phoenix.LiveView.TagEngine.Tokenizer`
and of nft's own `src/scanner.l`: an explicit stack of
**start conditions** (lex states) lets context-sensitive
constructs add a new state without disturbing the rest of the
lexer.

The conditions in play:

  * `:default` — top-level lexing of keywords, identifiers,
    literals, operators, punctuation, statement separators.
  * `:line_comment` — `#` to end of line.
  * `:block_comment` — `/* ... */`; supports nesting (nft itself
    doesn't, but supporting nesting costs ~5 lines and prevents
    a real footgun on hand-edited files).
  * `:string` — `"..."` with `\\`/`\"`/`\n`/`\t`/`\r`/`\0`
    escapes. (String-internal Elixir interpolation is not yet
    supported — it'll push `:elixir_expr` from `:string` when
    added, no other change required.)
  * `:elixir_expr` — only enterable when the `:interpolation?`
    option is true. Scans an Elixir expression up to the
    matching `}`, skipping `}` characters that appear inside
    strings/charlists/comments inside the expression.
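
The stack of start conditions can be pictured with a small standalone sketch. Everything below is hypothetical (the `StackLexerSketch` module, the `step/3` helper, and the `{:char, ...}` token are illustrative names, not part of Linx); it only shows how `:default` pushes a comment condition and how `:block_comment` nests by pushing itself again.

```elixir
defmodule StackLexerSketch do
  # Hypothetical illustration, not the Linx implementation.
  # The state stack is a list; its head is the active start condition.

  def run(source), do: step(source, [:default], [])

  # End of input: return whatever was collected.
  defp step(<<>>, _stack, acc), do: Enum.reverse(acc)

  # :default pushes a new condition when it sees a comment opener.
  defp step(<<"/*", rest::binary>>, [:default | _] = stack, acc),
    do: step(rest, [:block_comment | stack], acc)

  defp step(<<"#", rest::binary>>, [:default | _] = stack, acc),
    do: step(rest, [:line_comment | stack], acc)

  # :block_comment nests by pushing itself again and pops on each closer.
  defp step(<<"/*", rest::binary>>, [:block_comment | _] = stack, acc),
    do: step(rest, [:block_comment | stack], acc)

  defp step(<<"*/", rest::binary>>, [:block_comment | stack], acc),
    do: step(rest, stack, acc)

  # :line_comment pops at the newline and leaves it for :default to handle.
  defp step(<<"\n", _::binary>> = rest, [:line_comment | stack], acc),
    do: step(rest, stack, acc)

  # Inside either comment condition, every other character is discarded.
  defp step(<<_::utf8, rest::binary>>, [state | _] = stack, acc)
       when state in [:block_comment, :line_comment],
       do: step(rest, stack, acc)

  # :default turns every other character into a placeholder :char token.
  defp step(<<c::utf8, rest::binary>>, [:default | _] = stack, acc),
    do: step(rest, stack, [{:char, <<c::utf8>>} | acc])
end
```

Because each clause matches on the head of the stack, adding a condition means adding clauses for that condition only. The sketch silently accepts an unterminated block comment; a real lexer would presumably report it.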

## Token shape

Each token is a 2- or 3-tuple:

    {:kind, meta}                # punctuation with no payload
    {:kind, value, meta}         # everything else

where `meta` is `%{line: pos_integer(), column: pos_integer()}`
pointing at the *start* of the token.

Identifiers are emitted as `{:identifier, "name", meta}` — the
parser decides which names are keywords. (Pattern-matching on
binaries is ergonomic in Elixir; this avoids a 200-entry
keyword table here.)
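
As a rough illustration of the parser-side keyword decision, the sketch below matches on the identifier's binary payload. The `KeywordMatchSketch` module, the two keywords it knows, the `{:keyword, ...}` and `{:name, ...}` result shapes, and the `:lbrace` kind used in the usage lines are all assumptions made for the example, not Linx's actual parser.

```elixir
defmodule KeywordMatchSketch do
  # Hypothetical parser-side helper; the keyword set is illustrative only.
  def classify({:identifier, "table", meta}), do: {:keyword, :table, meta}
  def classify({:identifier, "chain", meta}), do: {:keyword, :chain, meta}

  # Any other identifier is treated as a user-chosen name.
  def classify({:identifier, name, meta}), do: {:name, name, meta}

  # 2-tuples (punctuation with no payload) pass through unchanged.
  def classify({kind, meta}) when is_atom(kind) and is_map(meta), do: {kind, meta}
end

meta = %{line: 1, column: 1}
KeywordMatchSketch.classify({:identifier, "table", meta})
KeywordMatchSketch.classify({:identifier, "filter", meta})
KeywordMatchSketch.classify({:lbrace, meta})
```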

## Statement separators

In nft syntax, statements inside a `{ ... }` body are separated
by either `;` or a newline. To keep parsing simple, the
tokenizer emits a single `:stmt_sep` token for every `;` and
for every (possibly multi-line) run of newlines, collapsing
consecutive separators into one. Newlines that appear inside
brackets are still emitted — the parser ignores spurious
separators in positions where they're not meaningful.

Line continuations (a backslash immediately followed by a newline)
are consumed silently and emit no token.
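
The collapsing rule can be sketched as a pass over a token list. This is an assumption-laden illustration: the real tokenizer may well collapse separators while scanning rather than afterwards, and `StmtSepSketch.collapse/1` is a hypothetical name.

```elixir
defmodule StmtSepSketch do
  # Hypothetical post-pass: keep the first :stmt_sep of each consecutive
  # run and drop the ones that follow it.
  def collapse(tokens) do
    tokens
    |> Enum.reduce([], fn
      {:stmt_sep, _}, [{:stmt_sep, _} | _] = acc -> acc
      token, acc -> [token | acc]
    end)
    |> Enum.reverse()
  end
end
```

Keeping the first separator of a run means its `meta` still points at the place where the run started.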

## Numeric / address literals

Network primitives need a small lookahead to disambiguate:

  * `0x...` / `0X...` — hex integer.
  * `0b...` / `0B...` — binary integer.
  * `\d+` not followed by `.`, `:`, or `/` — plain decimal integer.
  * `\d+\.\d+\.\d+\.\d+` — IPv4 literal (optional `/N` CIDR).
  * IPv6: any run starting with hex chars that contains `:` and
    whose contents are valid IPv6 syntax.
  * MAC: six 2-char hex octets joined by `:`.

Identifiers that *happen* to begin with hex letters (e.g. `eth0`
or even `fe80`) are still tagged as identifiers when not
followed by `:`. If the identifier is all-hex and followed by
`:` plus a hex char, the lexer rewinds and re-scans as an
IPv6/MAC literal.
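
The classification rules above can be approximated on an already-delimited lexeme. The sketch below is a simplification under stated assumptions: it classifies a whole string with regexes instead of scanning char-by-char with lookahead and rewind, it leaves out IPv6 CIDR, and it delegates IPv6 validation to Erlang's `:inet.parse_ipv6strict_address/1`. The `LiteralSketch` module and its result tuples (which omit `meta`) are hypothetical.

```elixir
defmodule LiteralSketch do
  # Hypothetical classifier for a lexeme that has already been delimited.
  # Order matters: the more specific shapes are tried first.
  def classify(lexeme) do
    cond do
      lexeme =~ ~r/\A0[xX][0-9a-fA-F]+\z/ ->
        {:integer, parse_prefixed(lexeme, 16)}

      lexeme =~ ~r/\A0[bB][01]+\z/ ->
        {:integer, parse_prefixed(lexeme, 2)}

      lexeme =~ ~r/\A\d+\z/ ->
        {:integer, String.to_integer(lexeme)}

      lexeme =~ ~r{\A\d+\.\d+\.\d+\.\d+/\d+\z} ->
        {:cidr_v4, lexeme}

      lexeme =~ ~r/\A\d+\.\d+\.\d+\.\d+\z/ ->
        {:ipv4, lexeme}

      lexeme =~ ~r/\A[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}\z/ ->
        {:mac, lexeme}

      String.contains?(lexeme, ":") and valid_ipv6?(lexeme) ->
        {:ipv6, lexeme}

      # All-hex names such as "fe80" land here when no ":" follows.
      true ->
        {:identifier, lexeme}
    end
  end

  # Drops the two-character prefix ("0x", "0b") and parses the rest.
  defp parse_prefixed(<<_zero, _marker, digits::binary>>, base),
    do: String.to_integer(digits, base)

  defp valid_ipv6?(lexeme) do
    match?({:ok, _}, :inet.parse_ipv6strict_address(String.to_charlist(lexeme)))
  end
end
```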

## Errors

Anything the tokenizer can't classify raises a
`Linx.NFT.ParseError` with `{file, line, column}` and the
offending source line. The caller (sigil macro, `parse/1`,
`parse_file/1`) catches and either re-raises (compile-time) or
returns `{:error, %ParseError{}}`.
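
The raise-then-convert flow might look roughly like the following. The nested `ParseError` here is a stand-in defined only for this sketch, with assumed field names; it is not `Linx.NFT.ParseError`, and `safe/1` is a hypothetical wrapper rather than anything exported by Linx.

```elixir
defmodule ErrorFlowSketch do
  defmodule ParseError do
    # Stand-in exception for illustration; field names are assumptions.
    defexception [:file, :line, :column, :message]
  end

  # Runtime-style wrapper: turn a raised ParseError into a tagged tuple.
  # A compile-time caller would instead let the exception propagate.
  def safe(fun) do
    {:ok, fun.()}
  rescue
    e in ParseError -> {:error, e}
  end
end

ErrorFlowSketch.safe(fn ->
  raise ErrorFlowSketch.ParseError,
    file: "nofile",
    line: 2,
    column: 5,
    message: "unexpected character"
end)
```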

## Extensibility

The architecture here was chosen for **incremental extension**:
the supported grammar is the common ~85% subset, and the long
tail of nft constructs (synproxy, secmark, osf, fib, jhash,
advanced ct, dup/fwd, tproxy, xfrm, tunnel) will be added
per-construct over time. Each addition needs at most:

  1. (Optional) a new start condition pushed from somewhere in
     `:default` — add a clause and a step function.
  2. (Optional) a new token kind — extend the `@type token`
     union and the parser's pattern matches.

The stack discipline means none of these touch existing
conditions.

# `token`

```elixir
@type token() ::
  {:identifier, String.t(), token_meta()}
  | {:integer, integer(), token_meta()}
  | {:string, String.t(), token_meta()}
  | {:ipv4, String.t(), token_meta()}
  | {:ipv6, String.t(), token_meta()}
  | {:mac, String.t(), token_meta()}
  | {:cidr_v4, String.t(), token_meta()}
  | {:cidr_v6, String.t(), token_meta()}
  | {:elixir_expr, String.t(), token_meta()}
  | {:stmt_sep, token_meta()}
  | {atom(), token_meta()}
```

# `token_meta`

```elixir
@type token_meta() :: %{line: pos_integer(), column: pos_integer()}
```

# `tokenize`

```elixir
@spec tokenize(
  String.t(),
  keyword()
) :: {:ok, [token()]} | {:error, Linx.NFT.ParseError.t()}
```

Tokenizes `source` into a flat list of tokens.

## Options

  * `:file` — source filename for error messages
    (default `"nofile"`).
  * `:line` — starting line number (default `1`); useful when
    called from a `~NFT` sigil with `__CALLER__.line` to make
    error locations line up with the surrounding `.ex` source.
  * `:column` — starting column number (default `1`).
  * `:interpolation?` — whether to recognize `#{...}` Elixir
    interpolation (default `false`). The sigil sets this to
    `true`; `parse/1` / `parse_file/1` leave it `false`.

Returns `{:ok, tokens}` or `{:error, %Linx.NFT.ParseError{}}`.
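
A minimal calling sketch, assuming the `linx` package is available in the project. The wrapper module, the `:linx_not_loaded` fallback, the sample source string, and the `"rules.nft"` filename are all hypothetical; the `Code.ensure_loaded?/1` guard exists only so the snippet stays loadable without the dependency. No output is shown because the exact token list depends on the library.

```elixir
defmodule TokenizeUsageSketch do
  # Hypothetical wrapper around the documented tokenize/2 call.
  def tokens(source, opts \\ []) do
    if Code.ensure_loaded?(Linx.NFT.Tokenizer) do
      Linx.NFT.Tokenizer.tokenize(source, opts)
    else
      :linx_not_loaded
    end
  end
end

case TokenizeUsageSketch.tokens("table inet filter { }", file: "rules.nft", line: 10) do
  {:ok, tokens} -> IO.inspect(tokens, label: "tokens")
  {:error, error} -> IO.inspect(error, label: "parse error")
  :linx_not_loaded -> IO.puts("linx is not available in this session")
end
```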
