Research
Comparative analysis of retrieval systems in the real world
Our research paper presents a comparative analysis of state-of-the-art methods that pair large language models with retrieval techniques. We evaluated each method on two critical aspects: accuracy, measured by the RobustQA average score, and efficiency, measured by average response time. The study covers a diverse range of methods: Azure Cognitive Search Retriever with GPT-4, Pinecone’s Canopy framework, LangChain with Pinecone paired with different language models (OpenAI, Cohere), LlamaIndex with Weaviate Vector Store’s hybrid search, Google’s RAG implementation on Cloud VertexAI-Search, Amazon SageMaker’s RAG, and a novel approach combining a graph search algorithm with a language model and retrieval awareness (Writer Retrieval).
The impetus for this analysis stems from the increasing demand for robust and responsive question-answering systems across domains. As the complexity of queries and the volume of information grow, it becomes imperative to retrieve relevant information quickly while keeping the responses precise and adaptable. The RobustQA metric offers a nuanced view of how well these systems perform under diverse paraphrasings of a question, reflecting real-world querying scenarios.
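The two-metric protocol described above can be sketched in a few lines. This is a minimal illustration, not the harness used in the paper: the `evaluate` function, the stub pipeline, and the per-question scorers are all hypothetical stand-ins for a real retrieval system and the RobustQA scoring procedure.

```python
import time
from statistics import mean

def evaluate(pipeline, dataset):
    """Score a question-answering pipeline on accuracy and latency.

    `pipeline(question)` returns an answer string. `dataset` is a list of
    (question, scorer) pairs, where `scorer(answer)` maps an answer to a
    score in [0, 1]. Returns the average score and average response time,
    mirroring the two metrics used in the study.
    """
    scores, latencies = [], []
    for question, scorer in dataset:
        start = time.perf_counter()        # wall-clock timing per query
        answer = pipeline(question)
        latencies.append(time.perf_counter() - start)
        scores.append(scorer(answer))
    return {"avg_score": mean(scores), "avg_response_time_s": mean(latencies)}

# Toy dataset with substring-match scorers (a real setup would use a
# learned or annotated answer-correctness metric instead).
dataset = [
    ("What is the capital of France?", lambda a: float("paris" in a.lower())),
    ("Who wrote Hamlet?", lambda a: float("shakespeare" in a.lower())),
]
stub = lambda q: "Paris" if "France" in q else "William Shakespeare"

result = evaluate(stub, dataset)
```

Keeping the scorer pluggable is the point of the design: the same loop can rank very different systems (managed cloud search, LangChain chains, a graph-search retriever) on identical queries, so accuracy and latency stay directly comparable.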
Our findings indicate that the graph search algorithm combined with a language model and retrieval awareness (Writer Retrieval) stands out as the most effective method, balancing high accuracy with quick response times. LlamaIndex with Weaviate Vector Store also shows high accuracy. On the other hand, RAG implementations on Google Cloud and Amazon SageMaker lag in performance. This analysis suggests that specialized retrieval-aware methods combined with efficient language models lead to better performance in both accuracy and response time.
Key findings and takeaways:
- Performance metrics: We evaluated the methods using two primary metrics: accuracy, measured by the RobustQA average score, and efficiency, determined by the average response time.
- Top performers: Our findings highlight that the graph search algorithm combined with a language model and retrieval awareness (Writer Retrieval) excels in both accuracy and response time, making it the most effective method. LlamaIndex with Weaviate Vector Store also demonstrates high accuracy.
- Lagging methods: The RAG implementations on Google Cloud VertexAI-Search and Amazon SageMaker lag behind on both accuracy and response time.
- Efficiency and accuracy: Our analysis indicates that specialized retrieval-aware methods that incorporate efficient language models generally achieve better results in both accuracy and response time.
- Empirical evaluation: We conducted rigorous testing across eight different retrieval system configurations, providing a comprehensive view of the strengths and weaknesses of each method.
- RobustQA metric: The RobustQA metric was crucial in our evaluation, offering a detailed perspective on how well the systems handle diverse paraphrasing of questions, which is essential for real-world applications.
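The paraphrase-robustness idea behind the last point can be illustrated concretely. This is a hedged sketch of the concept, not the RobustQA implementation: the `item` structure, `paraphrase_robust_score` function, and substring scorer are hypothetical, chosen only to show why averaging over rephrasings penalizes brittle systems.

```python
from statistics import mean

def paraphrase_robust_score(pipeline, item):
    """Average a pipeline's score over a question and its paraphrases.

    A system that only answers the canonical wording scores well on one
    phrasing and poorly on the rest, so the average exposes brittleness.
    """
    questions = [item["question"], *item["paraphrases"]]
    return mean(item["scorer"](pipeline(q)) for q in questions)

item = {
    "question": "When was the transistor invented?",
    "paraphrases": [
        "What year did the transistor first appear?",
        "In which year was the transistor created?",
    ],
    # Substring check stands in for a real answer-correctness metric.
    "scorer": lambda answer: float("1947" in answer),
}

# A brittle stub that only recognizes one phrasing: it answers the
# canonical question correctly but misses both paraphrases.
brittle = lambda q: "1947" if q.startswith("When was") else "unknown"

score = paraphrase_robust_score(brittle, item)  # 1 correct out of 3 phrasings
```

Averaging over phrasings is what makes the metric reflect real-world querying: users rarely type the canonical form of a question, so a high paraphrase-averaged score is stronger evidence of retrieval quality than accuracy on a single fixed wording.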
This research provides valuable insights for our team and the broader community of AI/ML/NLP engineers, guiding the selection and implementation of technologies in AI-driven search and retrieval applications.