I'm going to show you how to self-host a fast and reliable code-completion AI model
Skip the story and go to the step-by-step guide
First, a little backstory: after we saw the birth of Copilot, a lot of competitors came onto the scene, products like Supermaven, Cursor, etc. When I first saw this, I immediately thought: what if I could make it faster by not going over the network?
Self-hosting models
So I started digging into self-hosting AI models and quickly found out that Ollama could help with that. I also looked through various other ways to start using the vast number of models on Hugging Face, but all roads led to Rome. Hence, I ended up sticking with Ollama to get something running (for now).
Finding a model
I started by downloading Code Llama, DeepSeek Coder, and StarCoder, but I found all of them fairly slow, at least for code completion. I should mention that I've gotten used to Supermaven, which specializes in fast code completion. From everything I read about models, I figured that a model with a very low parameter count might give me something worth using. The catch is that a low parameter count usually results in worse output. But I also read that if you specialize a model to do less, you can make it great at that one thing. This led me to "codegpt/deepseek-coder-1.3b-typescript". This model is very small in terms of parameter count, and it's based on a deepseek-coder model that was then fine-tuned using only TypeScript code snippets.
Using the model
After I found a model that gave quick responses in the right language, the next problem was that it often hallucinated. To solve that, I found out you can modify code completion settings like temperature, topP, etc.
For my coding setup I use VS Code, and I found the Continue extension. This extension talks directly to Ollama without much setting up, it accepts settings for your prompts, and it supports multiple models depending on which task you're doing, chat or code completion.
I wanted predictable, high-speed, accurate output, so I gave it these settings:
{
  "temperature": 0.2,
  "topP": 0.15,
  "topK": 5,
  "presencePenalty": 0.1,
  "frequencyPenalty": 0.1,
  "stop": ["; ", "} "],
  "maxTokens": 200
}
A low temperature results in more predictable output, as it favors the most probable tokens. I wanted to give it some creativity, but I found that above 0.25 it starts to hallucinate too much.
The same goes for topP and topK, which also help with predictability: topK limits sampling to the K most likely tokens, and topP limits it to the smallest set of tokens whose combined probability reaches P.
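To make the effect of these three settings concrete, here is a minimal sketch of how temperature, topK, and topP can narrow down the candidate tokens. The logits are made up for illustration, the function name is hypothetical, and the order of the steps is an assumption; real runtimes may apply them in a different order.

```python
import math

def filter_candidates(logits, temperature=0.2, top_k=5, top_p=0.15):
    """Return the renormalized probabilities left after temperature, top-k, and top-p."""
    # Temperature: divide logits before the softmax; values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    probs = {tok: weight / total for tok, weight in weights.items()}

    # top-k: keep only the k most probable tokens.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}

# Hypothetical logits for the token that follows "const x" in TypeScript.
logits = {" =": 3.0, ":": 2.5, ";": 1.0, ",": 0.5, " as": 0.2, ")": -1.0}

# Strict settings from this post: only the single most likely token survives.
print(filter_candidates(logits))
# Looser settings: several candidates remain, so the output is less predictable.
print(filter_candidates(logits, temperature=1.0, top_k=5, top_p=0.9))
```

With the strict settings the model is left with almost no choice per token, which is why the completions become so predictable.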
presencePenalty and frequencyPenalty discourage the model from repeating tokens it has already produced. I kept both low (0.1) so it can still repeat itself often, which I found quite useful for most code.
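As a rough illustration, both penalties subtract from the score of tokens that were already generated, so small values such as 0.1 only mildly discourage repetition. The sketch below uses the formulation common in OpenAI-style APIs; that Ollama applies exactly this formula is an assumption, and the logits and function name are hypothetical.

```python
from collections import Counter

def apply_penalties(logits, generated, presence_penalty=0.1, frequency_penalty=0.1):
    """Lower the logit of every token that already appears in `generated`."""
    counts = Counter(generated)
    adjusted = {}
    for tok, logit in logits.items():
        seen = counts[tok]
        # The frequency penalty grows with each repeat; the presence penalty is a flat one-time cost.
        adjusted[tok] = logit - seen * frequency_penalty - (presence_penalty if seen else 0.0)
    return adjusted

# Hypothetical logits: three tokens that start out equally likely.
logits = {"this": 2.0, "const": 2.0, "return": 2.0}
history = ["this", "this", "this", "const"]

# "this" was used three times, "const" once, "return" never.
print(apply_penalties(logits, history))
```

At 0.1 the already used tokens lose only a few tenths of a point, so repeated identifiers and keywords stay likely, which is what you want in code.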
My stop settings are specific to TypeScript as well. Last but not least, maxTokens is set to 200 because I don't want it to generate very long examples of code. This way I leave less room for errors that I wouldn't catch at a glance.
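The stop and maxTokens settings can be pictured as a simple cut-off on the generated stream. This is a minimal sketch with a hypothetical function and a made-up token stream; it assumes the stop sequence itself is dropped from the result, which is how most completion APIs behave.

```python
def truncate_completion(tokens, stop=("; ", "} "), max_tokens=200):
    """Join tokens until a stop sequence appears or max_tokens is reached."""
    text = ""
    for count, tok in enumerate(tokens, start=1):
        text += tok
        # Cut at the earliest stop sequence found so far and drop it from the output.
        hits = [text.find(seq) for seq in stop if seq in text]
        if hits:
            return text[:min(hits)]
        if count >= max_tokens:
            break
    return text

# Hypothetical token stream: the model wants to continue with a second statement.
stream = ["const", " total", " =", " a", " +", " b", ";", " ", "console", ".log", "(total)"]

# Generation ends once "; " shows up, so only the first statement is kept.
print(repr(truncate_completion(stream)))
```

Note that these stop sequences include a trailing space, so a semicolon followed directly by a newline would not end the completion in this sketch.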
All these settings are something I will keep tweaking to get the best output and I'm also gonna keep testing new models as they become available.
I would love to see a quantized version of the typescript model I use for an additional performance boost.
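For a rough feel of what quantization would buy, the size of the weights can be estimated as parameter count times bits per weight. This is back-of-the-envelope arithmetic only: it assumes the unquantized model stores 16-bit weights (I have not verified this for this model) and it ignores the context cache and runtime overhead.

```python
def weight_size_gb(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return params_billions * bits_per_weight / 8

# Assumed precisions, for illustration only.
for label, params, bits in [
    ("1.3B at 16-bit", 1.3, 16),
    ("1.3B at 4-bit", 1.3, 4),
    ("7B at 4-bit", 7.0, 4),
]:
    print(f"{label}: ~{weight_size_gb(params, bits):.2f} GB")
```

Under these assumptions a 4-bit version of the 1.3B model would be about a quarter of the size, and smaller weights generally mean less memory traffic per generated token.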
Step-by-step guide
- Install Ollama (https://ollama.com/download)
- Add Continue to VS Code (https://marketplace.visualstudio.com/items?itemName=Continue.continue)
- Choose the local configuration for the Continue extension (https://docs.continue.dev/setup/configuration#local-and-offline-configuration)
- In your terminal run
ollama pull codegpt/deepseek-coder-1.3b-typescript
- Edit your Continue config
code ~/.continue/config.json
and replace the current tabAutocompleteModel with this:
"tabAutocompleteModel": {
  "title": "Deepseek Typescript 1.3B",
  "provider": "ollama",
  "model": "codegpt/deepseek-coder-1.3b-typescript",
  "completionOptions": {
    "temperature": 0.2,
    "topP": 0.15,
    "topK": 5,
    "presencePenalty": 0.1,
    "frequencyPenalty": 0.1,
    "stop": ["; ", "} "],
    "maxTokens": 200
  }
}
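Once the model is pulled, one way to sanity-check it outside the editor is to call Ollama's local REST API directly. The sketch below is not how Continue talks to Ollama internally. The key mapping is my own reading of the option names in Ollama's API documentation, the function names are hypothetical, and it sends a plain prompt rather than the fill-in-the-middle prompt an autocomplete extension would build. It assumes Ollama is listening on its default address, http://localhost:11434.

```python
import json
import urllib.request

# Option names from the Continue config mapped to the names used by Ollama's REST API.
OPTION_NAMES = {
    "temperature": "temperature",
    "topP": "top_p",
    "topK": "top_k",
    "presencePenalty": "presence_penalty",
    "frequencyPenalty": "frequency_penalty",
    "stop": "stop",
    "maxTokens": "num_predict",
}

def build_request(prompt, completion_options, model="codegpt/deepseek-coder-1.3b-typescript"):
    """Build the JSON body for POST /api/generate."""
    options = {OPTION_NAMES[key]: value for key, value in completion_options.items()}
    return {"model": model, "prompt": prompt, "stream": False, "options": options}

def complete(prompt, completion_options, host="http://localhost:11434"):
    """Send one non-streaming request to a locally running Ollama server."""
    body = json.dumps(build_request(prompt, completion_options)).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/generate", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["response"]

settings = {
    "temperature": 0.2,
    "topP": 0.15,
    "topK": 5,
    "presencePenalty": 0.1,
    "frequencyPenalty": 0.1,
    "stop": ["; ", "} "],
    "maxTokens": 200,
}
prompt = "function add(a: number, b: number): number {\n  return "

# Print the request body; this line does not contact the server.
print(json.dumps(build_request(prompt, settings), indent=2))
```

With Ollama running, calling `complete(prompt, settings)` should return the model's continuation as a string, which makes it easy to compare settings before changing the editor config.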
Enjoy your new coding buddy, and remember to tweak and test to your own needs. Please share if you find better alternatives.
Specs and misc
I thought I'd add my computer specs, since they affect performance.
I daily drive a 16-inch MacBook Pro with an M1 Max and 64 GB of RAM, which also includes active cooling.
Here is a link to the Geekbench scores:
https://browser.geekbench.com/macs/macbook-pro-16-inch-2021-apple-m1-max
Also, here is a link to a great MacBook AI performance comparison:
https://www.mrdbourke.com/apple-m3-machine-learning-test/
Top comments (3)
Is there a reason you used a small-parameter model (1.3b)? Does it make the autocomplete super fast?
I'm noting the Mac chip, and presume that's pretty fast for running Ollama, right? Could you get more benefit from a larger 7b model, or does it slow down too much?
I'm about to experiment with Continue and would love to know your experience with these models.
So I learned more. I want to add that it would be exciting to test 1.58-bit inference, especially once it's supported on GPU: github.com/microsoft/BitNet
It's substantially slower with any 7b-parameter model.
That holds even if you use a heavily quantized model, so a narrow use-case model with fewer params really makes a difference.