QBRAG (QnA Boosted RAG) with Vectara

Jan 11, 2024

A company’s knowledge base often doesn’t cover the wide variety of questions users come up with. An ideal Customer Support system answers specific questions about the company’s systems and features, for example: “How can I enter Cash Flow in ThruThink?”. But sometimes the user asks a generic question, such as “What is Cash Flow?”, which is better answered from a large LLM’s general knowledge and / or the internet.

My submission is essentially a synthetic-data enrichment of the knowledge base that reuses generated questions and answers. These QnA pairs were originally generated for fine-tuning a model; the motivation there was to apply fine-tuning on top of RAG so the resulting agent would be even more specialized and grounded.

However, once we have the QnA dataset (more than 10,000 question-and-answer pairs), we can also feed it into the RAG pipeline to provide more material for augmentation. Specific user queries may match these focused pairs better than a “nonfocused”, generically chunked knowledge base.
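As a rough illustration, the sketch below shows how QnA pairs could be pushed into a Vectara corpus as individual documents. It assumes Vectara’s v1 REST indexing endpoint; the credentials, ids, and the qna_pairs.json file are placeholders, not values from the actual project:

```python
# Minimal sketch: index QnA pairs into a Vectara corpus so they can be
# retrieved alongside the regular knowledge-base chunks.
import json
import requests

CUSTOMER_ID = "1234567890"   # placeholder
CORPUS_ID = 1                # placeholder
API_KEY = "zqt_..."          # placeholder

def index_qna_pair(doc_id: str, question: str, answer: str) -> None:
    """Store one QnA pair as a single Vectara document."""
    payload = {
        "customerId": CUSTOMER_ID,
        "corpusId": CORPUS_ID,
        "document": {
            "documentId": doc_id,
            "title": question,
            # Keep question and answer together in one section so the
            # pair is retrieved (and cited) as a unit.
            "section": [{"text": f"Q: {question}\nA: {answer}"}],
        },
    }
    resp = requests.post(
        "https://api.vectara.io/v1/index",
        headers={"customer-id": CUSTOMER_ID, "x-api-key": API_KEY},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()

# qna_pairs.json: [{"question": "...", "answer": "..."}, ...] (placeholder file)
with open("qna_pairs.json") as f:
    for i, pair in enumerate(json.load(f)):
        index_qna_pair(f"qna-{i}", pair["question"], pair["answer"])
```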

As part of the effort, I contributed to the question_extractor open-source project, which offers semantic chunking of markdown knowledge bases:

  • Added the ability to use AnyScale’s endpoints
  • Added support for QnA JSON formats for Google models and certain Azure OpenAI models
  • Added client-side throttling to avoid hammering the embedding and LLM APIs and risking a ban (see the sketch after this list)
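The throttling contribution boils down to classic client-side rate limiting. Here is a minimal sketch of the idea; the class, the limit, and the function names are illustrative, not the project’s actual code:

```python
# Minimal sketch of client-side throttling: allow at most `max_calls`
# API requests per sliding `period`-second window.
import asyncio
import time

class RateLimiter:
    """Allow at most `max_calls` calls per `period` seconds."""
    def __init__(self, max_calls: int, period: float = 60.0):
        self.max_calls = max_calls
        self.period = period
        self.calls: list[float] = []
        self.lock = asyncio.Lock()

    async def wait(self) -> None:
        async with self.lock:
            now = time.monotonic()
            # Drop timestamps that have left the sliding window.
            self.calls = [t for t in self.calls if now - t < self.period]
            if len(self.calls) >= self.max_calls:
                # Sleep until the oldest call in the window expires.
                await asyncio.sleep(self.period - (now - self.calls[0]))
            self.calls.append(time.monotonic())

limiter = RateLimiter(max_calls=60)  # e.g. 60 LLM calls per minute

async def generate_question(chunk: str):
    await limiter.wait()  # block until a request slot is free
    # ... call the embedding / LLM API here (hypothetical call site)
```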

Technology used:

  1. Vectara: embedding, vector database, ranking, etc.
  2. AnyScale: LLM hosting, fine-tuning
  3. React: front-end

The React front-end provided an excellent agent UX with nicely interpolated, numbered references. Vectara applies enhanced techniques such as hybrid keyword-and-vector search and re-ranking for Retrieval-Augmented Generation. Even though some articles lament that large-context-window models will supersede vector databases, vector databases can still offer lower costs (prompts containing the whole knowledge base would incur proportionally large API costs and long latency) while potentially providing more control over the pipeline.
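For the curious, here is a hedged sketch of what such a hybrid query could look like against Vectara’s v1 REST query API. The lambda value, the reranker id, and the credentials are placeholders and assumptions on my part; consult Vectara’s documentation for the values valid on your account:

```python
# Minimal sketch: hybrid (keyword + vector) search with re-ranking and
# a grounded summary, assuming Vectara's v1 REST query endpoint.
import requests

CUSTOMER_ID = "1234567890"   # placeholder
CORPUS_ID = 1                # placeholder
API_KEY = "zqt_..."          # placeholder

def ask(question: str) -> dict:
    body = {
        "query": [{
            "query": question,
            "numResults": 10,
            "corpusKey": [{
                "customerId": CUSTOMER_ID,
                "corpusId": CORPUS_ID,
                # lambda > 0 blends exact keyword matching into the
                # neural relevance score (hybrid search).
                "lexicalInterpolationConfig": {"lambda": 0.025},
            }],
            # Reranker id is an assumption; look up the ids Vectara
            # currently exposes for your account.
            "rerankingConfig": {"rerankerId": 272725718},
            # Grounded summary with numbered references, which the
            # React front-end then renders.
            "summary": [{"responseLang": "en", "maxSummarizedResults": 5}],
        }]
    }
    resp = requests.post(
        "https://api.vectara.io/v1/query",
        headers={"customer-id": CUSTOMER_ID, "x-api-key": API_KEY},
        json=body,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```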

2,041 participants in 165 teams competed at the hackathon. Thanks to Ofer Mendelevitch for the technical guidance and thoughts on RAG vs. fine-tuning at the GitHub HQ event, and thanks to all the mentors and the tech sessions related to the hackathon as well. I keep marching forward and exploring customer support LLM agents for my startups.
