Editor’s note: Iryna Gurevych, PhD and Haritz Puerto are speakers for ODSC East 2023. Be sure to check out their talk, “SQuARE: Towards Multi-Domain and Few-Shot Collaborating Question Answering Agents,” there!

Are you fascinated by the power of Question Answering (QA) models but find yourself intimidated by technical challenges? Do you yearn to compare different QA models but dread the time-consuming process of setting them up? Are you curious about explainability methods like saliency maps but feel lost about where to begin? Or do you want to compare the capabilities of ChatGPT against regular fine-tuned QA models? Don’t worry, you’re not alone! We’ve been in your shoes too, and that’s why we created SQuARE: Software for Question Answering Research!

Question Answering is the task in Natural Language Processing that involves answering questions posed in natural language. The goal of QA is to create models that can understand the nuances of a question and some given evidence documents to provide an accurate and concise answer. QA is a critical area of research in NLP, with numerous applications such as virtual assistants, chatbots, customer support, and educational platforms.

SQuARE is a research project that aims to make QA research more accessible. The current high-speed development of Artificial Intelligence yields thousands of datasets and trained models in repositories such as GitHub and Hugging Face. Comparing and analyzing these models usually requires learning libraries, writing code to run the models, and unifying their formats to compare them, which makes this process time-consuming and not scalable. SQuARE simplifies the research process in QA by providing a user-friendly interface accessible from your web browser to run, compare, and analyze QA models.

Having all these models readily available in one place also allows us to explore the potential of combining them into multi-agent QA systems, i.e., systems in which multiple expert agents collaborate to answer questions. We developed a novel model called MetaQA that aggregates QA expert agents from different domains and outperforms multi-dataset models (i.e., models trained on multiple datasets and thus experts in many domains) by large margins. One benefit of multi-agent systems, and of MetaQA in particular, is that we can reuse pretrained agents already available in online model hubs such as SQuARE. Moreover, combining expert agents is a far easier task for neural networks to learn than end-to-end QA, which makes multi-agent systems very cheap to train: MetaQA requires only 16% of the data needed by multi-dataset models. Updating MetaQA is also straightforward. For instance, if a new state-of-the-art model for numerical reasoning becomes available, we just need to download it and have MetaQA call it instead of the old expert agent, i.e., no retraining is needed!
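To make the multi-agent idea concrete, here is a deliberately minimal sketch of answer aggregation across expert agents. The agent names and the confidence-based selection rule are illustrative assumptions for this toy example; the actual MetaQA model learns how to combine agent answers with a trained selection network rather than simply picking the highest confidence.

```python
# Toy multi-agent QA aggregation. NOTE: the agents and the max-confidence
# selection rule are illustrative assumptions, not the real MetaQA method,
# which learns the answer-selection step from data.

def aggregate_answers(question, agents):
    """Query every expert agent and return the most confident answer."""
    candidates = [agent(question) for agent in agents]
    return max(candidates, key=lambda c: c["confidence"])

# Stand-in expert agents: each returns an answer with a confidence score.
def span_extraction_agent(question):
    return {"agent": "span-extraction", "answer": "Paris", "confidence": 0.92}

def numerical_reasoning_agent(question):
    return {"agent": "numerical-reasoning", "answer": "42", "confidence": 0.31}

best = aggregate_answers(
    "What is the capital of France?",
    [span_extraction_agent, numerical_reasoning_agent],
)
print(best["answer"])  # → Paris (the span-extraction expert is most confident)
```

Swapping in a new expert is exactly the "no retraining" point above: replacing one agent function with a better one changes nothing else in the system.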

Another research line in SQuARE is making information retrieval more effective. Information retrieval is a key component of question answering, since it obtains the documents from which answers are extracted. One type of user question is information-seeking. These queries are particularly challenging because users may be unfamiliar with the topic and thus unaware of the common keywords in that domain. However, it is easy for them to tell whether a retrieved document is relevant. We have therefore integrated relevance feedback into neural re-ranking methods to improve retrieval effectiveness. This is particularly challenging because neural methods usually require large amounts of training data, while only a very limited amount of relevance feedback is available per query. To address this, we draw on recent advances in parameter-efficient fine-tuning and few-shot learning: we fine-tune a re-ranker model using only the relevance feedback for each query and obtain large performance gains in multiple domains, including news and COVID.
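The underlying intuition can be illustrated with a classical Rocchio-style relevance-feedback sketch: documents the user marked relevant re-weight the query representation before re-ranking. This toy bag-of-words version is only an analogue of the idea; the work described above instead fine-tunes a neural re-ranker per query with parameter-efficient methods.

```python
# Rocchio-style relevance feedback over bag-of-words vectors. This is an
# illustrative classical analogue, NOT the parameter-efficient neural
# fine-tuning approach described in the text.

from collections import Counter

def bag_of_words(text):
    return Counter(text.lower().split())

def score(query_vec, doc):
    doc_vec = bag_of_words(doc)
    return sum(query_vec[t] * doc_vec[t] for t in query_vec)

def rerank(query, docs, relevant_feedback=(), alpha=1.0, beta=0.75):
    # Start from the original query terms, then mix in terms from the
    # (few) documents the user marked relevant.
    query_vec = Counter({t: alpha * w for t, w in bag_of_words(query).items()})
    for doc in relevant_feedback:
        for t, w in bag_of_words(doc).items():
            query_vec[t] += beta * w / len(relevant_feedback)
    return sorted(docs, key=lambda d: score(query_vec, d), reverse=True)

docs = [
    "covid vaccine efficacy study",
    "football match results",
    "mrna vaccine trial outcomes",
]
print(rerank("covid vaccine", docs)[0])  # → covid vaccine efficacy study
# Marking the mRNA document as relevant boosts its terms, so it rises to the top:
print(rerank("covid vaccine", docs,
             relevant_feedback=["mrna vaccine trial outcomes"])[0])
```

Even a single feedback document changes the ranking, which mirrors the few-shot setting: very little feedback per query, yet a large effect on retrieval quality.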

Lastly, we are working on integrating recent work on Large Language Models such as ChatGPT. With SQuARE, we can simplify the analysis of ChatGPT’s capabilities by comparing it with regular fine-tuned state-of-the-art models. In addition, SQuARE can provide a platform to easily extend ChatGPT with external tools. For example, ChatGPT and similar models are known to struggle with mathematical operations, so SQuARE could provide a predefined prompt that calls a calculator to do the arithmetic and then returns the result to ChatGPT. Existing work has paired LLMs with ElasticSearch for document retrieval, calculators, and similar simple tools, but SQuARE has the potential to scale this up: users can define new operators in SQuARE and create prompts to call them. This way, ChatGPT does not need to do everything itself; it can rely on expert agents or external tools for what it can’t do.
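The tool-calling loop described above can be sketched in a few lines: the model is prompted to emit a marker when it needs arithmetic, the platform runs the tool, and the result is fed back into the conversation. The `CALC(...)` marker syntax and the `fake_llm` stand-in are assumptions made for this sketch; SQuARE’s actual operator interface may differ.

```python
# Minimal tool-calling loop. ASSUMPTIONS: the CALC(...) marker convention and
# fake_llm are invented for illustration; they are not SQuARE's real API.

import ast
import operator
import re

# Safe arithmetic evaluator (never eval() raw model output).
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr):
    def ev(node):
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

def answer_with_tools(llm, question):
    reply = llm(question)
    match = re.search(r"CALC\(([^)]+)\)", reply)
    if match:  # the model delegated the arithmetic to the calculator tool
        result = calc(match.group(1))
        reply = llm(f"{question}\nCALC result: {result}")
    return reply

def fake_llm(prompt):  # stand-in for a real LLM call
    if "CALC result:" in prompt:
        return "The answer is " + prompt.rsplit(": ", 1)[1]
    return "I need help: CALC(37 * 89)"

print(answer_with_tools(fake_llm, "What is 37 * 89?"))  # → The answer is 3293
```

A user-defined operator in SQuARE would slot in exactly where `calc` sits here: the loop stays the same, only the tool behind the marker changes.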

All these works, among others, are being integrated into SQuARE to facilitate reproducibility and research in QA. With SQuARE, you can finally explore the world of QA models without any technical barriers holding you back. Our easy-to-use online platform empowers researchers, data scientists, and enthusiasts alike to experiment with state-of-the-art QA models and compare their performance with ease, right from your web browser. Plus, our built-in QA ecosystem, including explainability, adversarial attacks, graph visualizations, and behavioral tests, allows you to analyze the models from multiple perspectives.


Iryna Gurevych (Ph.D. 2003, U. Duisburg-Essen, Germany) is a professor of Computer Science and director of the Ubiquitous Knowledge Processing (UKP) Lab at the Technical University (TU) of Darmstadt in Germany. Her main research interests are in machine learning for large-scale language understanding and text semantics. Iryna’s work has received numerous awards, including being named an ACL Fellow in 2020 and receiving the first Hessian LOEWE Distinguished Chair award (2.5 million Euro) in 2021. Iryna is co-director of the NLP program within ELLIS, a European network of excellence in machine learning. She is currently the president of the Association for Computational Linguistics. In 2022, she received an ERC Advanced Grant to support her vision for the next big step in NLP, “InterText – Modeling Text as a Living Object in a Cross-Document Context.”

Haritz Puerto is a Ph.D. candidate in Machine Learning & Natural Language Processing at UKP Lab in TU Darmstadt, supervised by Prof. Iryna Gurevych. His main research interests are reasoning for Question Answering and Graph Neural Networks. Previously, he worked at the Coleridge Initiative, where he co-organized the Kaggle Competition Show US the Data. He got his master’s degree from the School of Computing at KAIST, where he was a research assistant at IR&NLP Lab and was advised by Prof. Sung-Hyon Myaeng.