Move tokenization out of the single GPU processing goroutine and
into each request's HTTP handler goroutine. This allows the next
request's prompt to be tokenized on the CPU while the current
request is executing on the GPU.
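As a rough sketch of the new flow (handler, queue, and type names here are
assumptions, not the actual runner identifiers):

```go
package main

import (
	"encoding/json"
	"net/http"
)

type request struct {
	Prompt string `json:"prompt"`
}

type sequence struct {
	tokens []int
}

// pending feeds the single GPU processing goroutine.
var pending = make(chan *sequence, 16)

// tokenize stands in for the model's CPU-side tokenizer.
func tokenize(prompt string) []int { return nil }

// completion runs on a per-request handler goroutine, so the next
// prompt's tokenization happens on the CPU while the GPU goroutine is
// still executing the current request.
func completion(w http.ResponseWriter, r *http.Request) {
	var req request
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	pending <- &sequence{tokens: tokenize(req.Prompt)}
	// ... wait on the sequence and stream generated tokens back ...
}
```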
Use atomic.Int32 for Array.pinned and a sync.Mutex for the global
arrays slice so MLX arrays can be created and pinned from multiple
goroutines without racing on those structures. Convert Array value
receivers to pointer receivers and struct fields from Array to
*Array to avoid copying the atomic.
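A minimal sketch of the resulting shape (field and variable names are
assumptions):

```go
package mlx

import (
	"sync"
	"sync/atomic"
)

type Array struct {
	c      uintptr      // handle to the underlying MLX array
	pinned atomic.Int32 // pin count; atomic so any goroutine may pin/unpin
}

var (
	arraysMu sync.Mutex // guards the global arrays slice
	arrays   []*Array
)

// Pointer receivers: Array now contains an atomic.Int32, which must not
// be copied, so value receivers and Array-typed struct fields are out.
func (a *Array) Pin()   { a.pinned.Add(1) }
func (a *Array) Unpin() { a.pinned.Add(-1) }

func newArray(c uintptr) *Array {
	a := &Array{c: c}
	arraysMu.Lock()
	arrays = append(arrays, a)
	arraysMu.Unlock()
	return a
}
```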
This does not fully achieve thread safety even when building
completely independent graphs. The tracing flag and traceScratch
slice in compile.go are unprotected, so concurrent Compile calls
will race. MLX itself is not fully thread-safe either, though upstream
is working to improve this.
* app/ui: fix model picker showing stale model after switching chats
Optimistic messages created during streaming were storing the full
Model object instead of the model name string. When switching back
to a chat with cached streaming data, the restore effect read an
object where it expected a string, so the model picker failed to match
and stayed stuck on the previous chat's model.
* app/ui: fix two more instances of Model object passed as model name
Fix the same bug at lines 523 and 536 in the assistant_with_tools
event handler, where selectedModel (object) was used instead of
selectedModel.model (string).
launchInteractiveModel was introduced in PR #14609 without the
client.Show() capability-detection block that RunHandler uses.
This left opts.MultiModal always false in the TUI path, causing
image/audio file paths to always be treated as unknown commands
instead of being loaded as multimodal attachments.
Mirror the Show() call, pull-on-404 fallback, cloud auth handling,
and MultiModal/Think population from RunHandler into
launchInteractiveModel.
Fixes #15711
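A hedged sketch of the mirrored block (the api package's Show,
StatusError, and Capabilities shapes are assumptions from memory;
runOptions stands in for the real options struct):

```go
package cmd

import (
	"context"
	"errors"
	"net/http"
	"slices"

	"github.com/ollama/ollama/api"
	"github.com/ollama/ollama/types/model"
)

type runOptions struct {
	Model      string
	MultiModal bool
}

func detectCapabilities(ctx context.Context, client *api.Client, opts *runOptions) error {
	resp, err := client.Show(ctx, &api.ShowRequest{Model: opts.Model})
	if err != nil {
		var se api.StatusError
		if errors.As(err, &se) && se.StatusCode == http.StatusNotFound {
			// pull-on-404 fallback: pull the model (with cloud auth
			// handling), then retry Show; elided here.
		}
		return err
	}
	// Populate MultiModal the way RunHandler does, from the model's
	// advertised capabilities.
	opts.MultiModal = slices.Contains(resp.Capabilities, model.CapabilityVision)
	return nil
}
```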
When both filters are active, avoid paying for a full sort in top-P
and a partial sort in top-K. Single-filter paths are unchanged.
Improves generation throughput on gemma4:e4b by 1.5%.
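Illustratively (not the sampler's actual code), the combined path does
one O(V log K) selection, one O(K log K) sort, and then top-P over that
short sorted slice:

```go
package main

import (
	"container/heap"
	"sort"
)

type cand struct {
	idx  int
	prob float64
}

// minHeap keeps the K best candidates seen so far, smallest on top.
type minHeap []cand

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].prob < h[j].prob }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(cand)) }
func (h *minHeap) Pop() any {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

func topKTopP(probs []float64, k int, p float64) []cand {
	// Partial selection: O(V log K), no full-vocab sort.
	h := make(minHeap, 0, k)
	for i, pr := range probs {
		if len(h) < k {
			heap.Push(&h, cand{i, pr})
		} else if pr > h[0].prob {
			h[0] = cand{i, pr}
			heap.Fix(&h, 0)
		}
	}
	// Sort only the K survivors: O(K log K).
	kept := []cand(h)
	sort.Slice(kept, func(a, b int) bool { return kept[a].prob > kept[b].prob })
	// Top-P needs a sorted prefix anyway; reuse the one just built.
	out, cum := make([]cand, 0, len(kept)), 0.0
	for _, c := range kept {
		out = append(out, c)
		if cum += c.prob; cum >= p {
			break
		}
	}
	return out
}
```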
Match the ollamarunner and OpenAI semantics: raw, full-vocab log-softmax
with the top-K ranked by probability. The computation is skipped on the GPU
when the request doesn't ask for logprobs, so decode doesn't pay for it.
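In sketch form (names assumed): logprob(i) = logits[i] - logsumexp(logits),
and ranking by probability is the same as ranking by raw logit:

```go
package main

import (
	"math"
	"sort"
)

// fullVocabLogprobs computes a numerically stable log-softmax over the
// raw logits and returns the top-K entries ranked by probability.
func fullVocabLogprobs(logits []float64, k int) map[int]float64 {
	maxL := math.Inf(-1)
	for _, l := range logits {
		maxL = math.Max(maxL, l)
	}
	var sum float64
	for _, l := range logits {
		sum += math.Exp(l - maxL)
	}
	lse := maxL + math.Log(sum) // logsumexp(logits)

	idx := make([]int, len(logits))
	for i := range idx {
		idx[i] = i
	}
	sort.Slice(idx, func(a, b int) bool { return logits[idx[a]] > logits[idx[b]] })
	if k > len(idx) {
		k = len(idx)
	}

	out := make(map[int]float64, k)
	for _, i := range idx[:k] {
		out[i] = logits[i] - lse // logprob = logit - logsumexp
	}
	return out
}
```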
DeepSeek-V2-style aux-loss-free routing needs sigmoid(gates) twice: the
raw sigmoid output is gathered after top-k, while its post-bias negation
is the argpartition key. Fuse both computations into a single multi-output
Compiled kernel that returns the pair, saving two launches on the routing path
per token. Exposed as a general SigmoidRouter since the same pattern is
shared across DeepSeek-V2 descendants.
Improves glm4.7 generation performance by approximately 1%.
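The fused pattern, sketched against assumed wrapper names (the commit
confirms only the idea, not these signatures):

```go
// One compiled, multi-output kernel produces both values the router
// needs, instead of launching them separately:
//
//	scores = sigmoid(gates)         // gathered after top-k
//	key    = -(scores + expertBias) // argpartition key
//
// mlx.Compile, mlx.Sigmoid, mlx.Add, and mlx.Negative are assumptions
// about the wrapper API, not its actual signatures.
var sigmoidRouter = mlx.Compile(func(in []*mlx.Array) []*mlx.Array {
	gates, bias := in[0], in[1]
	scores := mlx.Sigmoid(gates)
	key := mlx.Negative(mlx.Add(scores, bias))
	return []*mlx.Array{scores, key}
})
```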
If you have a long-running create and start another ollama server with the
same model dir, the GC algorithm deletes the pending blobs and breaks the
create. Add a 1h grace period so that blobs belonging to in-flight create
operations are not deleted.
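A hedged sketch of the check (blob layout and names are assumptions, not
the actual server code):

```go
package server

import (
	"os"
	"time"
)

const createGracePeriod = time.Hour

// shouldDelete reports whether GC may collect an unreferenced blob. An
// unreferenced blob may belong to an in-flight create on another server
// sharing the model dir, so it is kept for 1h before being collected.
func shouldDelete(blobPath string, referenced map[string]bool) bool {
	if referenced[blobPath] {
		return false
	}
	info, err := os.Stat(blobPath)
	if err != nil {
		return false // can't tell; err on the side of keeping it
	}
	return time.Since(info.ModTime()) > createGracePeriod
}
```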
Following up on #15560, e2b/e4b now render differently from 26b/31b.
For backwards compatibility, we take the existing renderer name `gemma4`
and make it do dynamic resolution based on the model name/size, but the
intended use is for the models to be republished with the renderer
variant specified explicitly: `gemma4-small` or `gemma4-large`.
After the rotating buffer has wrapped (c.offset > c.maxSize), a subsequent
L>1 Update() went through a slice-to-[0, c.idx) path that discarded all
slots in [c.idx, Dim), losing the older-but-still-in-window tokens that the
first Q of the new batch needs for its sliding-window attention.
Linearize the circular buffer to logical order in that wrapped case so
the existing trim + concat preserves the last (maxSize - 1) old tokens.
When the buffer has not yet wrapped (c.offset <= c.maxSize), slots
[c.idx, Dim) are grow padding or stale post-rewind data, so keep
dropping them.
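The two cases, as a sketch (c.cache, token, and the method name are
stand-ins for the real cache types):

```go
type token int32

type circBuf struct {
	cache   []token // fixed-capacity rotating buffer
	idx     int     // next write slot
	offset  int     // total tokens ever written
	maxSize int
}

// window returns the cached tokens in logical (oldest-first) order.
func (c *circBuf) window() []token {
	if c.offset > c.maxSize {
		// Wrapped: [c.idx, Dim) holds the oldest in-window tokens and
		// [0, c.idx) the newest. Linearize so the existing trim +
		// concat keeps the last (maxSize - 1) old tokens.
		out := make([]token, 0, len(c.cache))
		out = append(out, c.cache[c.idx:]...)
		return append(out, c.cache[:c.idx]...)
	}
	// Not wrapped: [c.idx, Dim) is grow padding or stale post-rewind
	// data, so keep dropping it.
	return c.cache[:c.idx]
}
```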
Converts SiLU/GELUApprox to compiled kernels and adds SwiGLU,
matching upstream mlx/mlx_lm's activations pattern. Routes llama,
qwen3, qwen3_5 (dense + MoE), and glm4_moe_lite MLP paths through
mlx.SwiGLU so each MLP invocation runs as one fused Metal/CUDA
kernel rather than a chain of per-op launches.
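Roughly, an MLP forward routed this way looks like the following
(mlx.SwiGLU's exact signature and these field names are assumptions):

```go
// One fused Metal/CUDA kernel computes silu(gate) * up, replacing the
// separate sigmoid/multiply launches.
func (m *MLP) Forward(x *mlx.Array) *mlx.Array {
	gate := mlx.Matmul(x, m.GateProj)
	up := mlx.Matmul(x, m.UpProj)
	return mlx.Matmul(mlx.SwiGLU(gate, up), m.DownProj)
}
```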
Wraps MLX's mlx_compile API so Go functions can be traced into fused
kernels. Contiguous elementwise chains collapse into a single
Metal/CUDA kernel instead of launching one per op.
Exposes Compile plus arity helpers (Compile1/2/3) that mirror Python's
@mx.compile decorator shape, lazily building the closure on first call
so package-level declarations work before the MLX dylib loads.
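Usage would look roughly like this (Compile1's exact signature is an
assumption based on the description above):

```go
// Built lazily on first call, so a package-level var is safe even
// before the MLX dylib has loaded.
var silu = mlx.Compile1(func(x *mlx.Array) *mlx.Array {
	// Traced once; the elementwise chain fuses into a single kernel.
	return mlx.Mul(x, mlx.Sigmoid(x))
})

// y := silu(x) then runs as one fused Metal/CUDA launch.
```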
* gemma4: implement Gemma 4 model for MLX (text-only runtime)
* gemma4: two MoE + SWA prefill perf fixes
Two performance optimizations in the gemma4 forward pass:
1. Memoize the sliding-window prefill mask across layers.
2. Softmax only over the selected experts in Router.Forward (see the
   sketch below).
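Optimization 2 in plain-Go form (the real code uses MLX ops; this only
shows the technique):

```go
package main

import "math"

// routeWeights computes softmax over only the k selected expert logits
// instead of the full expert set, avoiding an nExperts-wide softmax per
// token. (The output is equivalent whenever the top-k weights are
// renormalized anyway, since a softmax restricted to a subset is just
// that subset renormalized.)
func routeWeights(logits []float64, topIdx []int) []float64 {
	maxL := math.Inf(-1)
	for _, i := range topIdx {
		maxL = math.Max(maxL, logits[i])
	}
	w := make([]float64, len(topIdx))
	var sum float64
	for j, i := range topIdx {
		w[j] = math.Exp(logits[i] - maxL)
		sum += w[j]
	}
	for j := range w {
		w[j] /= sum
	}
	return w
}
```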
* review comments
Gemma 4 prompts differ by model size when thinking is disabled:
26b/31b emit an empty thought block, while e2b/e4b do not.
Before #15490, our shared Gemma 4 renderer effectively matched the
e2b behavior. #15490 changed it to always emit the empty thought block,
which regressed e2b/e4b nothink behavior and led to #15536 (and
possibly others).
This change restores the previous shared behavior by removing the empty
trailing thought block. It also renames the checked-in upstream chat
templates so the e2b and 31b fixtures are tracked separately.
A follow-up will split Gemma 4 rendering by model size.
Fixes: #15536
For some versions of Xcode, cmake builds fail during the generate phase due
to header problems when cross-compiling. Since generate produces
arch-independent output, we can skip it when cross-compiling.
* mlx: add op wrappers for Conv2d, Pad, activations, trig, and masked SDPA
Add Conv2d, flexible Pad (with axes/mode), PadConstant, Maximum,
Minimum, Softplus, ReLU, GLU, Clamp, Sin, Cos, Clip,
ScaledDotProductAttentionMasked, and RoPEWithFreqs. Refactor
RoPEWithBase to delegate to RoPEWithFreqs.
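For flavor, a hedged sketch of what that delegation can look like: the
base parameterization expands into explicit per-pair frequencies
(freqs[i] = base^(2i/dims), the convention mlx_lm uses), which the freqs
variant consumes directly. Signatures here are assumptions:

```go
import "math"

func ropeFreqsFromBase(base float64, dims int) []float32 {
	freqs := make([]float32, dims/2)
	for i := range freqs {
		freqs[i] = float32(math.Pow(base, float64(2*i)/float64(dims)))
	}
	return freqs
}

// RoPEWithBase(x, dims, base, ...) then reduces to
// RoPEWithFreqs(x, dims, ropeFreqsFromBase(base, dims), ...).
```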
* review comments
* mlx: fix ScaledDotProductAttentionMasked to consult the mask argument
Improve the MLX model creation pipeline with several model-agnostic changes:
- Rewrite supportsVision to use vision_config instead of architecture name
- Add supportsAudio for audio encoder detection
- Add alignment checking (isAligned) for quantization group sizes (sketch below)
- Support per-projection mixed quantization in MoE expert packing
- Record per-tensor quant metadata in safetensors blobs
- Parse per-tensor quant metadata at model load time
- Validate quantize output is non-empty before storing
- Fix pin/unpin cleanup in expert group quantization
- Promote v_proj/k_proj/down_proj to INT8 for INT4 base quant
- Add MetalIsAvailable() utility
- Skip audio encoder tensors from quantization
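For the alignment item, a minimal sketch of the idea (the real function
name and call sites are assumptions): a tensor can only be
group-quantized if its last dimension divides evenly into groups:

```go
// isAligned reports whether a weight's last dimension is a multiple of
// the quantization group size, e.g. lastDim 14336 with groupSize 64 is
// aligned, while 2051 is not and must fall back to another scheme.
func isAligned(lastDim, groupSize int) bool {
	return groupSize > 0 && lastDim%groupSize == 0
}
```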