# Fake LLM Server
A simple server that mimics the OpenAI chat completions API (streaming and non-streaming) for testing purposes.
## Features
- Implements a basic version of the OpenAI chat completions API
- Supports both streaming and non-streaming responses
- Always responds with a "hello world" message
- Simulates a 429 rate limit error when the last message is `[429]`
- Configurable through environment variables
## Installation

```bash
npm install
```
## Usage
Start the server:
```bash
# Development mode
npm run dev

# Production mode
npm run build
npm start
```
### Example usage
```bash
curl -X POST http://localhost:3500/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say something"}],"model":"any-model","stream":true}'
```

The server will be available at `http://localhost:3500` by default.
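If your application uses the official `openai` Node client, you can also point it at the fake server instead of the real API. A minimal sketch, assuming the `openai` npm package is installed (it is not part of this server's dependencies):

```ts
import OpenAI from "openai";

// Point the client at the fake server; the key is just a placeholder,
// since this server only mimics the API shape.
const client = new OpenAI({
  baseURL: "http://localhost:3500/v1",
  apiKey: "fake-key",
});

async function main() {
  const completion = await client.chat.completions.create({
    model: "any-model",
    messages: [{ role: "user", content: "Say something" }],
  });

  // The fake server always answers "hello world".
  console.log(completion.choices[0].message.content);
}

main();
```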
## API Endpoints
### POST /v1/chat/completions
This endpoint mimics OpenAI's chat completions API.
#### Request Format
```json
{
  "messages": [{ "role": "user", "content": "Your prompt here" }],
  "model": "any-model",
  "stream": true
}
```
- Set `stream: true` to receive a streaming response
- Set `stream: false` or omit it for a regular JSON response
#### Response
For non-streaming requests, you'll get a standard JSON response:
```json
{
  "id": "chatcmpl-123456789",
  "object": "chat.completion",
  "created": 1699000000,
  "model": "fake-model",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "hello world"
      },
      "finish_reason": "stop"
    }
  ]
}
```
For streaming requests, you'll receive a series of server-sent events (SSE), each containing a chunk of the response.
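A sketch of consuming the stream with the same `openai` client (again an assumption; any SSE-capable client works). The SDK parses the SSE events and yields chunk objects whose deltas can be concatenated:

```ts
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:3500/v1",
  apiKey: "fake-key",
});

async function streamCompletion() {
  const stream = await client.chat.completions.create({
    model: "any-model",
    messages: [{ role: "user", content: "Say something" }],
    stream: true,
  });

  let text = "";
  for await (const chunk of stream) {
    // Each chunk carries a delta with part of the "hello world" reply.
    text += chunk.choices[0]?.delta?.content ?? "";
  }
  console.log(text);
}

streamCompletion();
```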
## Simulating Rate Limit Errors
To test how your application handles rate limiting, send a message with content exactly equal to `[429]`:
```json
{
  "messages": [{ "role": "user", "content": "[429]" }],
  "model": "any-model"
}
```
This will return a 429 status code with the following response:
```json
{
  "error": {
    "message": "Too many requests. Please try again later.",
    "type": "rate_limit_error",
    "param": null,
    "code": "rate_limit_exceeded"
  }
}
```
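For example, a test could trigger the error with Node's built-in `fetch` (Node 18+) and check both the status and the error code. The assertions below use `node:assert` as a stand-in for whatever test runner you use:

```ts
// Run as an ES module (e.g. a .mjs file) so top-level await works.
import assert from "node:assert/strict";

const res = await fetch("http://localhost:3500/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "any-model",
    messages: [{ role: "user", content: "[429]" }],
  }),
});

// The fake server responds with HTTP 429 and the error payload shown above.
assert.equal(res.status, 429);
const body = await res.json();
assert.equal(body.error.code, "rate_limit_exceeded");
```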
## Configuration

You can configure the server port by setting the `PORT` environment variable or by modifying the `PORT` variable in the code.
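For reference, the port handling might look roughly like the following; the Express setup and the environment-variable fallback are assumptions about the implementation, not the actual source:

```ts
import express from "express";

// Assumed pattern: default to 3500, allow an override via the PORT
// environment variable (or change the constant directly in the code).
const PORT = Number(process.env.PORT ?? 3500);

const app = express();
app.use(express.json());

app.listen(PORT, () => {
  console.log(`Fake LLM server listening on http://localhost:${PORT}`);
});
```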
## Use Case
This server is primarily intended for testing applications that integrate with OpenAI's API, allowing you to develop and test without making actual API calls to OpenAI.