Publications

Are Key Phrases All that Reviewers Care About? A Comprehensive Benchmarking of Reviewer Matchmaking Systems!

Association for the Advancement of Artificial Intelligence AAAI (2025) [Core rank- A*]

Reviewer Matchmaking (RM) is a pivotal process in academic publishing that aligns manuscripts with appropriate reviewers based on their expertise and prior publications. The demand for an automated RM system has escalated with the significant surge in submissions over the past decade. State-of-the-art (SOTA) RM models are document-representation-based (DR-RM) and match a manuscript with a reviewer's past publications using a similarity measure defined on a high-dimensional vector space. However, they are far from accurate despite their large-scale usage. In this paper, we establish that conventional RM evaluation measures are unreliable and instead emphasize that standard correlation measures are adequate. For the first time, we compare the performance of six SOTA DR-RM models with that of fourteen SOTA Key-phrase Extraction-based RM (KPE-RM) models - an alternative, unexplored approach. We observe that KPE-RM models show comparable results in many cases, with the new best model being PatternRank-RM - a KPE-RM model beating the best DR-RM model, SPECTER2-RM (Pearson: +0.004; Spearman: +0.006; Kendall: +0.043). We conclude that KPE-RM models must be contextualized to the RM task and cannot be used plug-and-play.
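The correlation-based evaluation the abstract advocates can be illustrated with a small, self-contained sketch (not the paper's code; the scores below are hypothetical): a model's reviewer-relevance scores are correlated against gold expertise ratings using Pearson's r and Kendall's tau.

```python
# Illustrative sketch of correlation-based RM evaluation, using toy data.

def pearson(x, y):
    # Pearson's r: linear correlation between raw scores.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def kendall_tau(x, y):
    # Kendall's tau-a: (concordant - discordant) pairs over all pairs.
    n = len(x)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (n * (n - 1) / 2)

gold = [5, 3, 4, 1, 2]             # hypothetical human expertise ratings
model = [4.8, 2.9, 3.1, 1.2, 2.5]  # hypothetical model similarity scores

print(round(pearson(gold, model), 3))      # 0.953
print(round(kendall_tau(gold, model), 3))  # 1.0 (identical ordering)
```

In practice one would use `scipy.stats.pearsonr`, `spearmanr`, and `kendalltau` (with tie correction) rather than these minimal hand-rolled versions.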

PerSEval: Assessing Personalization in Text Summarizers

Transactions on Machine Learning Research TMLR (2024)

Personalized summarization models cater to individuals' subjective understanding of saliency, as represented by their reading history and current topics of attention. Existing personalized text summarizers are primarily evaluated based on accuracy measures such as BLEU, ROUGE, and METEOR. However, a recent study argued that accuracy measures are inadequate for evaluating the degree of personalization of these models and proposed EGISES, the first metric to evaluate personalized text summaries. It was suggested that accuracy is a separate aspect and should be evaluated standalone. In this paper, we challenge the necessity of an accuracy leaderboard, suggesting that relying on accuracy-based aggregated results might lead to misleading conclusions. To support this, we delve deeper into EGISES, demonstrating both theoretically and empirically that it measures the degree of responsiveness, a necessary but not sufficient condition for degree-of-personalization. We subsequently propose PerSEval, a novel measure that satisfies the required sufficiency condition. Based on the benchmarking of ten SOTA summarization models on the PENS dataset, we empirically establish that -- (i) PerSEval is reliable w.r.t. human-judgment correlation (Pearson's r = 0.73; Spearman's ρ = 0.62; Kendall's τ = 0.42), (ii) PerSEval has high rank-stability, (iii) PerSEval as a rank-measure is not entailed by EGISES-based ranking, and (iv) PerSEval can be a standalone rank-measure without the need for any aggregated ranking.


Are Large Language Models In-Context Personalized Summarizers? Get an iCOPERNICUS Test Done!

Empirical Methods in Natural Language Processing EMNLP (2024) [Core rank- A*]

Large Language Models (LLMs) have succeeded considerably in In-Context Learning (ICL)-based summarization. However, saliency is subject to the users' specific preference histories. Hence, we need reliable In-Context Personalization Learning (ICPL) capabilities within such LLMs. For any arbitrary LLM to exhibit ICPL, it needs the ability to discern contrast in user profiles. A recent study proposed, for the first time, a measure of degree-of-personalization called EGISES. EGISES measures a model's responsiveness to user profile differences. However, it cannot test whether a model utilizes all three types of cues provided in ICPL prompts: (i) example summaries, (ii) user's reading histories, and (iii) contrast in user profiles. To address this, we propose the iCOPERNICUS framework, a novel In-COntext PERsonalization learNIng sCrUtiny of Summarization capability in LLMs that uses EGISES as a comparative measure. As a case study, we evaluate 17 state-of-the-art LLMs based on their reported ICL performances and observe that 15 models' ICPL degrades (min: 1.6%; max: 3.6%) when probed with richer prompts, thereby showing a lack of true ICPL.
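The three cue types the framework probes can be sketched as progressively richer prompts. This is a schematic illustration only, not the iCOPERNICUS implementation; the prompt wording and function name are hypothetical.

```python
# Schematic sketch: building progressively richer ICPL probe prompts,
# adding the three cue types one at a time (all wording is illustrative).

def build_prompt(document, examples=None, history=None, contrast_profile=None):
    parts = ["Summarize the following document for the target user."]
    if examples:               # cue (i): example summaries
        parts.append("Example summaries:\n" + "\n".join(examples))
    if history:                # cue (ii): user's reading history
        parts.append("User's reading history:\n" + "\n".join(history))
    if contrast_profile:       # cue (iii): contrast in user profiles
        parts.append("A different user preferred:\n" + "\n".join(contrast_profile))
    parts.append("Document:\n" + document)
    return "\n\n".join(parts)

doc = "..."  # manuscript text would go here
p1 = build_prompt(doc, examples=["summary A"])
p2 = build_prompt(doc, examples=["summary A"], history=["article X"])
p3 = build_prompt(doc, examples=["summary A"], history=["article X"],
                  contrast_profile=["article Y"])
# A model with true ICPL should not degrade as the prompt grows richer
# from p1 to p3; the paper reports that 15 of 17 LLMs do degrade.
```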


Accuracy is not enough: Evaluating Personalization in Summarizers

Empirical Methods in Natural Language Processing EMNLP (2023) [Core rank- A*]

Text summarization models are evaluated in terms of their accuracy and quality using various measures such as ROUGE, BLEU, METEOR, BERTScore, PYRAMID, readability, and several other recently proposed ones. The central objective of all accuracy measures is to evaluate the model's ability to capture saliency accurately. Since saliency is subjective w.r.t. the readers' preferences, there cannot be a fit-all summary for a given document. This means that in many use-cases, summarization models need to be personalized w.r.t. user profiles. However, to our knowledge, there is no measure to evaluate the degree-of-personalization of a summarization model. In this paper, we first establish that existing accuracy measures cannot evaluate the degree of personalization of any summarization model, and then propose a novel measure, called EGISES, for automatically computing the same. Using the PENS dataset released by Microsoft Research, we analyze the degree of personalization of ten different state-of-the-art summarization models (both extractive and abstractive), five of which are explicitly trained for personalized summarization, while the remaining are adapted to exhibit personalization. We conclude by proposing a generalized accuracy measure, called P-Accuracy, for designing accuracy measures that also take personalization into account, and demonstrate the robustness and reliability of the measure through meta-evaluation.


AutoReco: A Tool for Recommending Requirements for their Non-Conformance with Requirement Templates (RTs)

IEEE 31st International Requirements Engineering Conference RE (2023) [Core rank- A]

RTs generally possess a fixed syntactic structure and comprise pre-defined slots, and requirements written in the format of RTs must conform to the template structure. If the requirements do not conform to the RT, manually rewriting them to adhere to the RT's structure is tedious. In this paper, we develop AutoReco, a tool for the automated recommendation of functional requirements for their non-conformance with requirement templates (RTs). Our preliminary results on nine case studies show an accuracy of 83.9% in providing recommendations for requirements that do not conform to RTs.
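The slot-based conformance idea can be sketched minimally. This is not AutoReco's method; it only assumes a hypothetical RT of the form "The &lt;system&gt; shall &lt;action&gt; &lt;object&gt;." and checks a requirement sentence against it.

```python
import re

# Minimal sketch (not AutoReco): conformance check against one
# hypothetical requirement template with three pre-defined slots:
#   "The <system> shall <action> <object>."
RT_PATTERN = re.compile(
    r"^The (?P<system>\w[\w ]*?) shall (?P<action>\w+) (?P<object>.+)\.$"
)

def conforms(requirement: str) -> bool:
    """Return True if the requirement fits the template's slot structure."""
    return RT_PATTERN.match(requirement) is not None

print(conforms("The controller shall log every fault."))  # True
print(conforms("Faults must always be logged somehow"))   # False
```

A real RT checker would parse the sentence rather than pattern-match it, and a recommender would additionally propose a conformant rewrite; the regex here only illustrates the fixed-slot notion.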


Inline Citation Classification Using Peripheral Context and Time-Evolving Augmentation

Pacific-Asia Conference on Knowledge Discovery and Data Mining PAKDD (2023) [Core rank- B]

Citation plays a pivotal role in determining the associations among research articles. It portrays essential information in indicative, supportive, or contrastive studies. The task of inline citation classification aids in extrapolating these relationships; however, existing studies are still immature and demand further scrutiny. Current datasets and methods used for inline citation classification use only citation-marked sentences, constraining the model to turn a blind eye to domain knowledge and neighboring contextual sentences. In this paper, we propose a new dataset, named 3Cext, which, along with the cited sentences, provides discourse information using the vicinal sentences to analyze the contrasting and entailing relationships, as well as domain information. We propose PeriCite, a Transformer-based deep neural network that fuses peripheral sentences and domain knowledge. Our model achieves state-of-the-art F1 on the 3Cext dataset against the best baseline. We conduct extensive ablations to analyze the efficacy of the proposed dataset and model fusion methods.


Improving Access to Science for Social Good

European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Database ECML-PKDD (2019) [Core rank- A]

One of the major goals of science is to make the world a socially better place to live. The old paradigm of scholarly communication through publishing has generated an enormous amount of heterogeneous data and metadata. However, most scientific results are not easily discoverable, in particular those results which benefit social good and are also sought by non-scientists. In this paper, we showcase a knowledge graph embedding (KGE) based recommendation system to be used by students involved in activities aimed at social good. The proposed recommendation system has been trained on a scholarly knowledge graph constructed for this specific goal. The obtained results highlight that the KGEs successfully encoded the structure of the knowledge graph, and therefore, our system could provide valuable recommendations.
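The KGE-based recommendation idea can be sketched with a toy example. The abstract does not specify which embedding model was used, so this sketch assumes TransE-style scoring, where a triple (head, relation, tail) is plausible when head + relation ≈ tail; all embeddings and the "recommended_for" relation are invented toy data.

```python
# Illustrative TransE-style scoring sketch (an assumed model, not
# necessarily the one used in the paper), with toy 3-d embeddings.

def transe_score(head, relation, tail):
    # Negative L2 distance between (head + relation) and tail;
    # higher (closer to 0) means a more plausible triple.
    return -sum((h + r - t) ** 2
                for h, r, t in zip(head, relation, tail)) ** 0.5

# Toy embeddings for a hypothetical "recommended_for" relation.
paper_a = [0.9, 0.1, 0.0]
paper_b = [0.0, 0.8, 0.3]
rel     = [0.1, 0.0, 0.2]
student = [1.0, 0.1, 0.2]

# Rank candidate papers for the student by the plausibility of the
# triple (paper, recommended_for, student).
ranked = sorted([("paper_a", paper_a), ("paper_b", paper_b)],
                key=lambda kv: transe_score(kv[1], rel, student),
                reverse=True)
print(ranked[0][0])  # paper_a: its embedding plus rel lands on the student
```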