
Self-hosting Large Language Models?

  • Yes, it is possible.

Are they as capable as SaaS subscription models?

  • Generally not, but they are well suited to specific, custom use cases.

LLMWare Bling on a CPU

This example uses the Hugging Face pipeline API for easy access to a pre-trained model. LLMware BLING is a small instruction-following model designed to run on a CPU. Note that the model is loaded with trust_remote_code=True, which allows code from the Hugging Face repository to execute locally.

https://huggingface.co/llmware/bling-stable-lm-3b-4e1t-v0

bling-stable-lm-3b-4e1t-v0 is part of the BLING ("Best Little Instruction-following No-GPU-required") model series, RAG-instruct trained on top of a StabilityAI stablelm-3b-4e1t base model.

BLING models are fine-tuned with distilled, high-quality custom instruct datasets, targeted at a specific subset of instruct tasks, with the objective of providing a high-quality instruct model that is 'inference-ready' on a CPU laptop without any advanced quantization optimizations.

import os

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, logging, pipeline

model_id = "llmware/bling-stable-lm-3b-4e1t-v0"

# Optionally increase the logging level to see more details about the model download
logging.set_verbosity_info()

# Optionally point the Hugging Face cache at a preferred directory so the download
# does not land in the default location
# os.environ["TRANSFORMERS_CACHE"] = "/path/to/your/preferred/cache/directory"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Wrap the model and tokenizer in a text-generation pipeline and expose it to LangChain
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=500)
hf = HuggingFacePipeline(pipeline=pipe)
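
Before wiring the pipeline into a LangChain prompt, it can be sanity-checked directly. A minimal sketch, assuming the pipe and hf objects created above (the question string is only illustrative):

# Quick check that the wrapped pipeline generates text at all
raw_answer = hf.invoke("What is electroencephalography?")
print(raw_answer)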

Prompting the model with LangChain

from langchain.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

# Pipe the prompt template into the wrapped Hugging Face pipeline
chain = prompt | hf

question = "What is electroencephalography?"

answer = chain.invoke({"question": question})
print(answer)
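
The same chain can be reused for several questions in one call via LangChain's standard Runnable batch interface. A hedged sketch, assuming the chain defined above (the extra question is only illustrative):

# Run the prompt/model chain over several inputs at once
questions = [
    {"question": "What is electroencephalography?"},
    {"question": "What is a CPU-only language model good for?"},
]
answers = chain.batch(questions)
for q, a in zip(questions, answers):
    print(q["question"], "->", a)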

Listing and GitHub repo

The model's answer to this question can be found in the listing:
