Searching for data, including images, with the advanced search tool on the Canadian Astronomy Data Centre (CADC) website can be difficult for users, as it requires knowledge of the Astronomical Data Query Language (ADQL) and involves multiple steps to narrow and refine search queries. The goal of this project is to leverage Large Language Models (LLMs) and autonomous agents to create a chatbot that helps users search for images in the CADC database using natural language. Our LLM-based agent accepts queries in English, converts them to ADQL code, executes the query against the database, and returns the results. The system is designed to handle common user errors, such as spelling mistakes, incorrect column names, and incorrect values. In such cases, the chatbot suggests a shortlist of similar but valid values that the user might have intended, and the user's selection is then used to retrieve the intended content. This robustness was achieved by incorporating Retrieval-Augmented Generation (RAG) and semantic search tools, which test query components against the database and verify them with the user before execution.
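
The validate-then-suggest step described above can be illustrated with a minimal sketch. It is not the system's implementation: it swaps in plain string similarity from Python's standard `difflib` module in place of RAG and semantic search, and the `KNOWN_VALUES` lookup, the helper names `check_value` and `build_adql`, the sample column values, and the use of the `caom2.Observation` table are all assumptions made for illustration.

```python
from difflib import get_close_matches

# Hypothetical sample of valid values per column. A real system would
# retrieve these from the database rather than hard-code them.
KNOWN_VALUES = {
    "collection": ["CFHT", "HST", "JCMT", "GEMINI", "DAO"],
    "instrument_name": ["MegaPrime", "WIRCam", "SCUBA-2", "GMOS-N"],
}


def check_value(column, value, n=3):
    """Return (is_valid, candidates) for one column/value pair.

    If the column is unknown, candidates are similar column names.
    If the value is unknown, candidates are similar valid values.
    If both are valid, candidates holds the canonical spelling of the value.
    """
    if column not in KNOWN_VALUES:
        similar_columns = get_close_matches(
            column, list(KNOWN_VALUES), n=n, cutoff=0.5
        )
        return False, similar_columns

    # Compare case-insensitively, but report the canonical spelling.
    canonical = {v.lower(): v for v in KNOWN_VALUES[column]}
    key = value.lower()
    if key in canonical:
        return True, [canonical[key]]

    close = get_close_matches(key, list(canonical), n=n, cutoff=0.5)
    return False, [canonical[c] for c in close]


def build_adql(column, value):
    """Build a simple ADQL query for a column/value pair that was validated."""
    escaped = value.replace("'", "''")  # escape single quotes in the literal
    return f"SELECT TOP 10 * FROM caom2.Observation WHERE {column} = '{escaped}'"


if __name__ == "__main__":
    ok, candidates = check_value("instrument_name", "MegaPrim")
    if ok:
        print(build_adql("instrument_name", candidates[0]))
    else:
        # Show the shortlist and let the user pick before running anything.
        print("Did you mean one of:", candidates)
```

In this sketch no query is built until the column and value have both been matched against known entries, which mirrors the idea of confirming query components with the user before execution. An embedding-based search could replace `get_close_matches` without changing the surrounding control flow.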
To evaluate the system, we created a dataset of questions in four categories: standard questions, spelling errors, incorrect columns, and incorrect values. The system achieves 80-90% accuracy on this benchmark, a significant improvement over existing systems built using OpenAI’s custom GPT, which achieved less than 20% accuracy on the same tests. Our solution streamlines the search process for CADC users, making data retrieval more efficient and accessible.