Systematic searches of published literature are a vital component of systematic reviews. When search strings are not “sensitive,” they may miss many relevant studies, limiting, or even biasing, the range of evidence available for synthesis. Concerningly, conducting and reporting evaluations (validations) of the sensitivity of the search strings used is rare, according to our survey of published systematic reviews and protocols. Potential reasons may include a lack of familiarity with, or the inaccessibility of, complex sensitivity evaluation approaches. We first clarify the main concepts and principles of search string evaluation. We then present a simple procedure for estimating the relative recall of a search string, based on a pre-defined set of “benchmark” publications. The relative recall, that is, the sensitivity of the search string, is the retrieval overlap between the evaluated search string and a search string that captures only the benchmark publications. If there is little overlap (i.e., low recall or sensitivity), the evaluated search string should be improved to ensure that most of the relevant literature can be captured. The presented benchmarking approach can be applied to one or more online databases or search platforms. It is illustrated by five accessible, hands-on tutorials for commonly used online literature sources. Overall, our work provides an assessment of the current state of search string evaluations in published systematic reviews and protocols. It also paves the way to improved evaluation and reporting practices that make evidence synthesis more transparent and robust.
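As a minimal illustrative sketch (the function and variable names are ours, not the paper's), relative recall is simply the fraction of the pre-defined benchmark set that the evaluated search string retrieves:

```python
def relative_recall(retrieved_ids, benchmark_ids):
    """Fraction of benchmark publications captured by the evaluated search string."""
    benchmark = set(benchmark_ids)
    if not benchmark:
        raise ValueError("benchmark set must not be empty")
    return len(benchmark & set(retrieved_ids)) / len(benchmark)
```

A relative recall of, say, 0.5 would mean the evaluated search string retrieved only half of the benchmark publications, signalling that it should be broadened before being used for evidence synthesis.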
We propose an unsupervised, corpus-independent method to extract keywords from a single text. It is based on the spatial distribution of words and the response of this distribution to a random permutation of words. Our method has three advantages over existing unsupervised methods (such as YAKE). First, it is significantly more effective at extracting keywords from long texts in terms of precision and recall. Second, it allows inference of two types of keywords: local and global. Third, it extracts basic topics from texts. Additionally, our method is language-independent and applies to short texts. The results are obtained via human annotators with previous knowledge of texts from our database of classical literary works. The agreement between annotators is moderate to substantial. Our results are supported via human-independent arguments based on the average length of extracted content words and on the average number of nouns in extracted words. We discuss relations of keywords with higher-order textual features and reveal a connection between keywords and chapter divisions.
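The underlying idea can be sketched as follows (our own illustrative code, not the authors' implementation): compare the variance of the gaps between a word's occurrences in the original text with the average variance obtained after random permutations of the words. Spatially clustered content words stand out against the shuffled baseline, while function words do not:

```python
import random

def gap_variance(words, word):
    """Variance of the gaps between successive occurrences of `word`."""
    pos = [i for i, w in enumerate(words) if w == word]
    if len(pos) < 2:
        return 0.0
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    mean = sum(gaps) / len(gaps)
    return sum((g - mean) ** 2 for g in gaps) / len(gaps)

def clustering_score(words, word, trials=50, seed=0):
    """Ratio of the observed gap variance to its average under random
    permutations; values well above 1 suggest spatial clustering."""
    observed = gap_variance(words, word)
    rng = random.Random(seed)
    shuffled = list(words)
    baseline = 0.0
    for _ in range(trials):
        rng.shuffle(shuffled)
        baseline += gap_variance(shuffled, word)
    baseline /= trials
    return observed / baseline if baseline else 0.0
```

A word whose occurrences form bursts (small gaps within a burst, large gaps between bursts) has a much higher gap variance than the same occurrences scattered at random, which is the signal exploited for keyword extraction.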
Decompounding is an essential preprocessing step in text-processing tasks such as machine translation, speech recognition, and information retrieval (IR). Here, the IR issues are explored from five viewpoints. (A) Does word decompounding impact the Indian language IR? If yes, to what extent? (B) Can corpus-based decompounding models be used in the Indian language IR? If yes, how? (C) Can machine learning and deep learning-based decompounding models be applied in the Indian language IR? If yes, how? (D) Among the different decompounding models (corpus-based, hybrid machine learning-based, and deep learning-based), which provides the best effectiveness in the IR domain? (E) Among the different IR models, which provides the best effectiveness from the IR perspective? This study proposes different corpus-based, hybrid machine learning-based, and deep learning-based decompounding models in Indian languages (Marathi, Hindi, and Sanskrit). Moreover, we evaluate the effectiveness of each activity from an IR perspective only. It is observed that the different decompounding models improve IR effectiveness. The deep learning-based decompounding models outperform the corpus-based and hybrid machine learning-based models in Indian language IR. Among the different deep learning-based models, the Bi-LSTM-A model performs best and improves mean average precision (MAP) by 28.02% in Marathi. Similarly, the Bi-RNN-A model improves MAP by 18.18% and 6.1% in Hindi and Sanskrit, respectively. Among the retrieval models, the In_expC2 model outperforms others in Marathi and Hindi, and the BB2 model outperforms others in Sanskrit.
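In minimal form, a corpus-based decompounding model can be sketched as a split that is accepted only when both parts occur in the corpus vocabulary (the greedy strategy and all names here are illustrative assumptions, not the paper's exact method):

```python
def decompound(word, vocab, min_len=3):
    """Greedy corpus-based split: return the first split point where
    both halves occur in the corpus vocabulary; otherwise keep the word."""
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        if left in vocab and right in vocab:
            return [left, right]
    return [word]
```

For retrieval, the split constituents are then indexed alongside (or instead of) the compound, which is what allows a query on one constituent to match documents containing the full compound.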
The complexity of process plants and the growing demand for digitalization require efficient and accurate information retrieval throughout the lifecycle phases of a process plant. This paper discusses the concept of instantiation and introduces a method for identifying and multiplying required information in plant engineering using scalable so-called Instantiation Blocks linked to the Bill of Material. Core functionality, an ontology graph and a user interface based on Python and React are developed to demonstrate the implementation of the framework and validate its effectiveness in practice.
Taxonomies are key to creating certainty in language across the legal landscape. In this article Alice Laird and Katy Snell of Howard Kennedy examine why taxonomies are still important in the age of AI. They also discuss how they have built on their expertise on taxonomies for the benefit of their firm, and how they have become involved with a cross firm project called noslegal.
This article is based on a course on how to use Moys classification scheme,1 delivered by Helen Garner and Felicity Staveley-Taylor on behalf of the British and Irish Association of Law Librarians (BIALL) in November 2022. The article, which is also written by Helen and Felicity, provides guidance on how to use Moys classification, explaining the features that enable the scheme to be expanded to accommodate new subject areas. The article also explains some of the features which ensure the scheme remains relevant to legal libraries today. Sections from the classification scheme, as published in the fifth edition, appear in the article text. In addition, any references to page numbers are to the fifth edition.
Latent semantic analysis (LSA) and correspondence analysis (CA) are two techniques that use a singular value decomposition for dimensionality reduction. LSA has been extensively used to obtain low-dimensional representations that capture relationships among documents and terms. In this article, we present a theoretical analysis and comparison of the two techniques in the context of document-term matrices. We show that CA has some attractive properties as compared to LSA, for instance that effects of margins, that is, sums of row elements and column elements, arising from differing document lengths and term frequencies are effectively eliminated so that the CA solution is optimally suited to focus on relationships among documents and terms. A unifying framework is proposed that includes both CA and LSA as special cases. We empirically compare CA to various LSA-based methods on text categorization in English and authorship attribution on historical Dutch texts and find that CA performs significantly better. We also apply CA to a long-standing question regarding the authorship of the Dutch national anthem Wilhelmus and provide further support that it can be attributed to the author Datheen, among several contenders.
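The contrast can be sketched as follows (illustrative NumPy code under our own simplifying assumptions, not the authors' implementation): LSA applies a truncated SVD to the document-term matrix directly, whereas CA applies it to standardized residuals, which removes the row and column margin effects arising from document lengths and term frequencies:

```python
import numpy as np

def lsa(F, k):
    # LSA: truncated SVD of the raw document-term matrix.
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    return U[:, :k] * s[:k]                    # document coordinates

def ca(F, k):
    # CA: SVD of standardized residuals; subtracting the outer product
    # of the margins and rescaling removes length/frequency effects.
    P = F / F.sum()
    r = P.sum(axis=1, keepdims=True)           # row (document) margins
    c = P.sum(axis=0, keepdims=True)           # column (term) margins
    S = (P - r @ c) / np.sqrt(r @ c)
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return (U[:, :k] * s[:k]) / np.sqrt(r)     # principal row coordinates
```

A consequence of the margin removal is that two documents with identical term *profiles* but different lengths receive identical CA coordinates, whereas their LSA coordinates differ.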
Web search is an experience that naturally lends itself to recommendations, including query suggestions and related entities. In this article, we propose to recommend specific tasks to users, based on their search queries, such as planning a holiday trip or organizing a party. Specifically, we introduce the problem of query-based task recommendation and develop methods that combine well-established term-based ranking techniques with continuous semantic representations, including sentence representations from several transformer-based models. Using a purpose-built test collection, we find that our method is able to significantly outperform a strong text-based baseline. Further, we extend our approach to using a set of queries that all share the same underlying task, referred to as search mission, as input. The study is rounded off with a detailed feature and query analysis.
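One way such a combination can be sketched (the interpolation scheme and all names here are our illustrative assumptions, not the paper's exact model) is a weighted sum of a term-based retrieval score and the cosine similarity of query and task embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_score(term_score, query_vec, task_vec, alpha=0.5):
    """Interpolate a term-based score (e.g., BM25) with semantic similarity."""
    return alpha * term_score + (1 - alpha) * cosine(query_vec, task_vec)
```

Tasks would then be ranked for a query by this hybrid score, with `alpha` tuned on held-out data to balance the lexical and semantic signals.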
The availability of large web-based corpora has led to significant advances in a wide range of technologies, including massive retrieval systems and deep neural networks. However, leveraging this data is challenging, since web content is plagued by so-called boilerplate: ads, incomplete or noisy text, and remnants of the navigation structure, such as menus or navigation bars. In this work, we present a novel and efficient approach to extract useful and well-formed content from web-scraped data. Our approach takes advantage of language models and their implicit knowledge about correctly formed text, and we demonstrate here that perplexity is a valuable artefact that can contribute in terms of effectiveness and efficiency. As a matter of fact, the removal of noisy parts leads to lighter AI or search solutions that are effective and entail important reductions in resources spent. We exemplify here the usefulness of our method with two downstream tasks, search and classification, and a cleaning task. We also provide a Python package with pre-trained models and a web demo demonstrating the capabilities of our approach.
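The idea can be sketched with a toy unigram model standing in for the real language model (the actual system presumably uses a far stronger LM; all names and the threshold here are illustrative assumptions): well-formed text receives low perplexity, while boilerplate fragments such as menu items receive high perplexity and are filtered out:

```python
import math
from collections import Counter

def perplexity(text, unigram_probs, floor=1e-6):
    """Per-word perplexity of `text` under a toy unigram language model."""
    words = text.lower().split()
    log_p = sum(math.log(unigram_probs.get(w, floor)) for w in words)
    return math.exp(-log_p / max(len(words), 1))

def filter_boilerplate(lines, unigram_probs, threshold):
    """Keep lines whose perplexity falls below the threshold; noisy
    fragments (menus, nav bars, ads) tend to score far higher."""
    return [ln for ln in lines if perplexity(ln, unigram_probs) < threshold]
```

In practice the threshold would be calibrated on held-out clean text, trading off how aggressively noisy fragments are discarded against the risk of dropping well-formed content.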
Metonymy resolution (MR) is a challenging task in the field of natural language processing. The task of MR aims to identify the metonymic usage of a word that employs an entity name to refer to another target entity. Recent BERT-based methods yield state-of-the-art performance. However, they neither make full use of the entity information nor explicitly consider syntactic structure. In contrast, in this paper, we argue that the metonymic process should be completed in a collaborative manner, relying on both lexical semantics and syntactic structure (syntax). This paper proposes a novel approach to enhancing BERT-based MR models with hard and soft syntactic constraints by using different types of convolutional neural networks to model dependency parse trees. Experimental results on benchmark datasets (e.g., ReLocaR, SemEval 2007 and WiMCor) confirm that incorporating syntactic information into fine-tuned pre-trained language models benefits MR tasks.
Customer survey data is critical to supporting customer preference modeling in engineering design. We present a framework of information retrieval and survey design to ensure the collection of quality customer survey data for analyzing customers’ preferences in their consideration-then-choice decision-making and the related social impact. The utility of our approach is demonstrated through the survey design for customers in the vacuum cleaner market. Based on the data, we performed descriptive analysis and network-based modeling to understand customers’ preferences in consideration and choice.
Digital literacy is receiving increased scholarly attention as a potential explanatory factor in the spread of misinformation and other online pathologies. As a concept, however, it remains surprisingly elusive, with little consensus on definitions or measures. We provide a digital literacy framework for political scientists and test survey items to measure it with an application to online information retrieval tasks. There exists substantial variation in levels of digital literacy in the population, which we show is correlated with age and could confound observed relationships. However, this is obscured by researchers’ reliance on online convenience samples that select for people with computer and internet skills. We discuss the implications of these measurement and sample selection considerations for effect heterogeneity in studies of online political behavior. We argue that there is no universally applicable formula for selecting a given non-probability sample or operationalization of the concept of digital literacy; instead, we conclude, researchers should make theoretically informed arguments about how they select both sample and measure.
Named entities (NEs) are among the most relevant types of information that can be used to properly index digital documents and thus easily retrieve them. It has long been observed that NEs are key to accessing the contents of digital library portals, as they are contained in most user queries. However, most digitized documents are indexed through their optical character recognition (OCR) output, which includes numerous errors. Although OCR engines have considerably improved over the last few years, OCR errors still considerably impact document access. Previous works were conducted to evaluate the impact of OCR errors on named entity recognition (NER) and named entity linking (NEL) techniques separately. In this article, we experimented with a variety of OCRed documents with different levels and types of OCR noise to assess in depth the impact of OCR on named entity processing. We provide a deep analysis of the OCR errors that impact the performance of NER and NEL. We then present the resulting exhaustive study and subsequent recommendations on the adequate documents, the OCR quality levels, and the post-OCR correction strategies required to perform reliable NER and NEL.
This paper, written by Gineke Wiggers, Suzan Verberne, Gerrit-Jan Zwenne and Wouter Van Loon, addresses the concept of ‘relevance’ in relation to legal information retrieval (IR). They investigate whether the conceptual framework of relevance in legal IR, as described by Van Opijnen and Santos in their paper published in 2017, can be confirmed in practice.1 The research is conducted with a user questionnaire in which users of a legal IR system had to choose which of two results they would like to see ranked higher for a query, and were asked to provide a reason for their choice. To avoid questions with an obvious answer and extract as much information as possible about the reasoning process, the search results were chosen to differ on relevance factors from the literature, with one result scoring high on one factor and the other on another factor. The questionnaire had eleven pairs of search results. A total of 43 legal professionals participated, consisting of 14 legal information specialists, 6 legal scholars and 23 legal practitioners. The results confirmed the existence of domain relevance as described in the theoretical framework by Van Opijnen and Santos as published in 2017.2 Based on the factors mentioned by the respondents, the authors of this paper concluded that document type, recency, level of depth, legal hierarchy, authority, usability and whether a document is annotated are factors of domain relevance that are largely independent of the task context. The authors also investigated whether different sub-groups of users of legal IR systems (legal information specialists searching on behalf of others, legal scholars, and legal practitioners) differ in terms of the factors they consider in judging the relevance of legal documents outside of a task context. Using a PERMANOVA, no significant difference was found in the factors reported by these groups.
At present, therefore, there is no reason to treat these sub-groups differently in legal IR systems.
Causation in written natural language can express a strong relationship between events and facts. Causation in the written form can be referred to as a causal relation, where a cause event entails the occurrence of an effect event. A cause and effect relationship is stronger than a correlation between events, and therefore aggregated causal relations extracted from large corpora can be used in numerous applications, such as question-answering and summarisation, to produce superior results to those of traditional approaches. Techniques like logical consequence allow causal relations to be used in niche practical applications such as event prediction, which is useful for diverse domains such as security and finance. Until recently, the use of causal relations was a relatively unpopular technique because the causal relation extraction techniques were problematic, and the relations returned were incomplete, error-prone or simplistic. The recent adoption of language models and improved relation extractors for natural language, such as Transformer-XL (Dai et al. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860), has seen a surge of research interest in the possibilities of using causal relations in practical applications. Until now, there has not been an extensive survey of the practical applications of causal relations; therefore, this survey is intended precisely to demonstrate the potential of causal relations. It is a comprehensive survey of the work on the extraction of causal relations and their applications, while also discussing the nature of causation and its representation in text.
Bilingual corpora are an essential resource used to cross the language barrier in multilingual natural language processing tasks. Among bilingual corpora, comparable corpora have been the subject of many studies as they are both frequent and easily available. In this paper, we propose to make use of formal concept analysis to first construct concept vectors which can be used to enhance comparable corpora through clustering techniques. We then show how one can extract bilingual lexicons of improved quality from these enhanced corpora. We finally show that the bilingual lexicons obtained can complement existing bilingual dictionaries and improve cross-language information retrieval systems.
The concept of remotely operated, unmanned, and autonomous ships is creating increasing interest in the maritime domain, promising safety, increased efficiency, and sustainability. Shore control centers (SCCs) have been proposed to operate such vessels, and some industry projects have been initiated. This paper aims to bring knowledge about what an SCC is envisioned to be. It identifies and explores challenges related to designing and developing SCCs through semi-structured interviews with the research community and industry. We discuss tasks, functions, and interactions between human and machine.
Information retrieval (IR) aims at retrieving documents that are most relevant to a query provided by a user. Traditional techniques rely mostly on syntactic methods. In some cases, however, links at a deeper semantic level must be considered. In this paper, we explore a type of IR task in which documents describe sequences of events, and queries are about the state of the world after such events. In this context, successfully matching documents and query requires considering the events’ possibly implicit uncertain effects and side effects. We begin by analyzing the problem, then propose an action language-based formalization, and finally automate the corresponding IR task using answer set programming.
Language and communication are considered relevant to artificial intelligence. Linguists are not the only scientists wishing to test theories of language functioning: so do psychologists and neurophysiologists. This chapter briefly looks at samples of important and prescient early work, and shows two contrasting, slightly later, approaches to the extraction of content, evaluation, representation, and the role of knowledge. It considers a range of systems embodying natural language processing (NLP)/computational linguistics (CL) aspects since the early seventies, and divides them by their relationships to linguistic systems and in relation to concepts normally taken as central to AI, namely logic, knowledge, and semantics. Broadly, statistical methods imply the use of only numerical, quantitatively based methods for NLP/CL, rather than methods based on representations, whether those are assigned by humans or by computers. The chapter discusses the role of annotations to texts and the interpretability of core AI representations.