
    Martin Rajman

    Abstract: Document ranking for scientific publications involves a variety of specialized resources (e.g. author or citation indexes) that are usually difficult to use within standard general-purpose search engines, which typically operate on large-scale heterogeneous document collections for which the required specialized resources are not available for all documents. Integrating such resources into specialized information retrieval engines is therefore important to cope with community-specific user expectations that strongly influence the perception of relevance within the considered community. In this perspective, this paper extends the notion of ranking with various methods exploiting different types of bibliographic knowledge that represent a crucial resource for measuring the relevance of scientific publications. In our work, we experimentally evaluated the adequacy of two such ranking methods (one based on freshness, i.e. the publication date, and the other on a novel index, the download-Hirsch index, based on download frequencies) for information retrieval from the CERN scientific publication database in the domain of particle physics. Our experiments show that (i) the considered specialized ranking methods indeed represent promising candidates for extending the baseline ranking (which relies on download frequency), as they both lead to fairly small search-result overlaps; and (ii) extending the baseline ranking with the freshness-based method significantly improves retrieval quality: a 16.2% relative increase in Mean Reciprocal Rank (resp. a 5.1% relative increase in Success@10, i.e. the estimated probability of finding at least one relevant document among the top ten retrieved) when a local rank sum is used for aggregation. We plan to further validate the presented results by carrying out additional experiments with the specialized ranking method based on the download-Hirsch index, to further improve the performance of our aggregative approach.
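The abstract's evaluation pipeline (local rank-sum aggregation, scored with Mean Reciprocal Rank and Success@10) can be sketched in a few lines. This is an illustrative sketch, not the paper's actual implementation; the function names and tie-breaking choice are assumptions.

```python
def rank_sum_aggregate(ranking_a, ranking_b):
    """Aggregate two rankings by local rank sum: a document's score is the
    sum of its positions in the two lists (lower is better). Documents
    missing from one list get the worst possible rank; ties are broken by
    position in the first ranking."""
    pos_a = {doc: r for r, doc in enumerate(ranking_a)}
    pos_b = {doc: r for r, doc in enumerate(ranking_b)}
    docs = set(ranking_a) | set(ranking_b)
    worst = len(docs)
    return sorted(docs, key=lambda d: (pos_a.get(d, worst) + pos_b.get(d, worst),
                                       pos_a.get(d, worst)))

def mean_reciprocal_rank(results_per_query, relevant_per_query):
    """MRR: mean over queries of 1 / (rank of the first relevant document),
    0 for queries where no relevant document is retrieved."""
    total = 0.0
    for results, relevant in zip(results_per_query, relevant_per_query):
        for i, doc in enumerate(results, start=1):
            if doc in relevant:
                total += 1.0 / i
                break
    return total / len(results_per_query)

def success_at_k(results_per_query, relevant_per_query, k=10):
    """Success@k: fraction of queries with at least one relevant document
    among the top k retrieved."""
    hits = sum(1 for results, relevant in zip(results_per_query, relevant_per_query)
               if any(doc in relevant for doc in results[:k]))
    return hits / len(results_per_query)
```

With two rankings of the same three documents, the rank sum promotes documents ranked high by both lists; the two measures then quantify how early relevant documents appear.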
    @MISC{Chappelier01parsingdop, author = {Jean-Cedric Chappelier and Martin Rajman}, title = {Parsing DOP with Monte-Carlo Techniques}, year = {2001}}
    Abstract: The goal of this contribution is to present how the notion of dialogue management evaluation was integrated into the rapid dialogue prototyping methodology (RDPM), designed and experimented with by the authors in the framework of the InfoVox project. We first describe the proposed RDPM. The general idea of this methodology is to produce, for any given application, a quickly deployable dialogue-driven interface and to improve this interface through an iterative process based on Wizard-of-Oz experiments (i.e. dialogue simulations) ...
    Contrary to standard approaches to topic annotation, the technique used in this work does not centrally rely on some sort of -- possibly statistical -- keyword extraction. Instead, the proposed annotation algorithm uses a large-scale semantic database -- the EDR Electronic Dictionary -- that provides a concept hierarchy based on hyponym and hypernym relations. This concept hierarchy is used to generate a synthetic representation of the document by aggregating the words present in topically homogeneous document segments into a set of concepts best preserving the document's content. This new extraction technique uses an unexplored approach to topic selection. Instead of using semantic similarity measures based on a semantic resource, the latter is processed to extract the part of the conceptual hierarchy relevant to the document content. This conceptual hierarchy is then searched to extract the most relevant set of concepts to represent the topics discussed in the document.
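The core idea of aggregating segment words up a hypernym hierarchy can be sketched as follows. The EDR dictionary itself is a proprietary resource, so the toy hierarchy, the function names, and the "most specific common subsumer" strategy below are illustrative assumptions, not the paper's actual algorithm.

```python
# Toy concept hierarchy: child concept -> hypernym (parent); roots map to None.
HYPERNYMS = {
    "dog": "mammal", "cat": "mammal", "mammal": "animal",
    "sparrow": "bird", "bird": "animal", "animal": None,
}

def ancestors(concept):
    """All hypernyms of a concept, from most to least specific."""
    chain = []
    parent = HYPERNYMS.get(concept)
    while parent is not None:
        chain.append(parent)
        parent = HYPERNYMS.get(parent)
    return chain

def lowest_common_subsumer(c1, c2):
    """Most specific concept subsuming both c1 and c2 in the hierarchy."""
    candidates = set([c2] + ancestors(c2))
    for c in [c1] + ancestors(c1):
        if c in candidates:
            return c
    return None

def aggregate_segment(words):
    """Collapse the words of a topically homogeneous segment into the most
    specific single concept that subsumes all of them."""
    concept = words[0]
    for w in words[1:]:
        concept = lowest_common_subsumer(concept, w)
    return concept
```

For instance, a segment containing "dog" and "cat" collapses to "mammal", while adding "sparrow" forces the aggregation up to "animal".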
    Similarities for textual data The evaluation of similarities between textual entities (documents, sentences, words...) is one of the central issues for the implementation of efficient methods for tasks such as description and exploration of textual data, information retrieval or knowledge extraction (text mining). The main purpose of this contribution is to propose a comparative presentation of different approaches used to define the notion of similarity in fields such as Textual Data Analysis, Information Retrieval or Text Mining. We first discuss some of the linguistic treatments (tagging, lemmatization, …) necessary for the pre-processing of the textual data and then analyze some of the measures (cosine, chi-square, Kullback-Leibler) used to quantify the similarities (or dissimilarities) between textual entities. Finally, we present some techniques allowing to improve the quality of the similarities in the case where additional knowledge, such as external corpora or semantic gra...
    Keywords: document ranking; specialized search; scientific publication databases; rank aggregation; ranking criteria selection. Thèse École polytechnique fédérale de Lausanne EPFL, n° 5460 (2012). Programme doctoral Informatique, Communications et Information, Faculté informatique et communications, Laboratoire d'intelligence artificielle. DOI: 10.5075/epfl-thesis-5460. Record created on 2012-08-15, modified on 2017-05-12.
    In this paper we report on an experiment with an automated metric used to analyze the grammaticality of machine translation output. The approach (Rajman, Hartley, 2001) is based on the distribution of the linguistic information within a translated text, which is assumed to be similar between a learning corpus and the translation. This method is quite inexpensive, since it does not need any reference translation. First we describe the experimental method and the different tests we used. Then we show the promising results we obtained on the CESTA data, and how well they correlate with human judgments. 1 CESTA: Campagne d'Évaluation des Systèmes de Traduction Automatique, i.e. Machine Translation Evaluation Campaign
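One reference-free way to compare "the distribution of linguistic information" between a learning corpus and a translation is to compare their part-of-speech n-gram profiles. The sketch below is an assumption-laden illustration of that general idea (POS trigrams, total-variation distance), not the actual Rajman-Hartley metric, whose precise definition is not given in the abstract.

```python
from collections import Counter

def pos_trigram_profile(pos_tags):
    """Relative frequencies of POS trigrams in a tagged text."""
    grams = Counter(zip(pos_tags, pos_tags[1:], pos_tags[2:]))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def profile_distance(reference, candidate):
    """Total-variation distance between two POS-trigram profiles, in [0, 1].
    Under the abstract's assumption, a candidate whose trigram distribution
    stays close to the learning corpus is scored as more grammatical."""
    keys = set(reference) | set(candidate)
    return 0.5 * sum(abs(reference.get(k, 0.0) - candidate.get(k, 0.0)) for k in keys)
```

The appeal of such a scheme, as the abstract notes, is that it needs only a POS tagger and a monolingual learning corpus: no human reference translations are required.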
    Polynomial Tree Substitution Grammars (pSTSG), a subclass of STSG for which finding the most probable parse tree is no longer NP-hard but polynomial, are defined and characterized in terms of general properties of their elementary trees. Necessary and sufficient conditions for effective polynomiality are first given. Two selection principles are then presented, based on this characterization and allowing the automatic extraction of pSTSGs from a treebank. Moreover, standard STSGs use a generative probabilistic model, in which probabilities are conditioned on the roots of the trees and whose parameters are determined heuristically. Experiments show that this type of model induces undesirable behavior in parsing. We propose here a non-generative probabilistic approach for STSGs, according to the...
    The aim of this report is to describe the browsers that have been developed by various groups within the IM2 project, highlighting goals, design methodologies, key functionalities and evaluation methods used by each. The paper concludes with a tabular overview of the media, input and output modalities and special functionalities handled by each browser, as well as providing specific contact points and references.
    This paper presents the results of an experimental system aimed at performing a robust semantic analysis of analyzed speech input in the area of information system access. The goal of this experiment was to investigate the effectiveness of such a system in a pipelined architecture, where no control is possible over the morpho-syntactic analysis which precedes the semantic analysis and query formation.
    Standard general-purpose Web retrieval relies on centralized search engines that do not realistically scale when applied to the exponentially growing number of documents available on the Web. By taking advantage of the resource sharing principle, Peer-to-Peer (P2P) techniques represent a promising architectural alternative for building decentralized Web search engines offering true Web-size scalability, provided that enough peers are available. However, in all such P2P approaches proposed so far, excessive network bandwidth consumption during retrieval, caused by the necessary transmission of possibly very long posting lists, was identified as the major bottleneck for implementing truly scalable P2P full-text Web retrieval. The main objective of the present research is thus to find a decentralized indexing/retrieval strategy that fully exploits the distributed computation possibilities provided by P2P networks, but keeps the required network bandwidth consumption scalable, while guaranteeing an acceptable retrieval quality. To address this problem we introduce a novel indexing/retrieval model based on Highly Discriminative Keys (HDKs), which correspond to carefully selected indexing terms and term sets associated with posting lists truncated to the top-k most relevant documents with respect to the associated key. Using HDKs for indexing thus increases the number of indexing features but, at the same time, strictly limits the size of the associated posting lists. When combined with an adequate retrieval model, this leads to strongly reduced network traffic. More precisely, our experimental results show that HDK-based indexing and retrieval lead to storage and bandwidth requirements that remain manageable with respect to the growth of the document collection while preserving a fully acceptable retrieval quality.
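The HDK idea (index terms and term combinations, but keep only discriminative keys with short, top-k-truncated posting lists) can be illustrated with a toy index builder. The parameters `df_max` and `top_k`, the term-frequency scoring, and the restriction to term pairs are simplifying assumptions for illustration, not the actual HDK model.

```python
from collections import defaultdict
from itertools import combinations

def build_hdk_index(docs, df_max=2, top_k=3):
    """Toy HDK-style index: candidate keys are single terms and term pairs;
    only keys occurring in at most df_max documents ("discriminative" keys)
    are kept, and each posting list is truncated to the top_k documents with
    the highest score (here simply the term frequency)."""
    postings = defaultdict(dict)  # key (tuple of terms) -> {doc_id: score}
    for doc_id, text in docs.items():
        tf = defaultdict(int)
        for term in text.lower().split():
            tf[term] += 1
        vocab = sorted(tf)
        for term in vocab:
            postings[(term,)][doc_id] = tf[term]
        for pair in combinations(vocab, 2):
            # Score a pair by its weakest member's frequency.
            postings[pair][doc_id] = min(tf[pair[0]], tf[pair[1]])
    index = {}
    for key, plist in postings.items():
        if len(plist) <= df_max:  # drop non-discriminative (too frequent) keys
            index[key] = sorted(plist.items(), key=lambda kv: -kv[1])[:top_k]
    return index
```

The trade-off the abstract describes is visible here: a frequent term like a collection-wide stopword is dropped as a key on its own, but survives inside rarer term combinations, so posting lists stay short and cheap to ship between peers.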
    For most users, Web-based centralized search engines are the access point to distributed resources such as Web pages, items shared in file-sharing systems, etc. Unfortunately, existing search engines compute their results on the basis of structural information only, e.g., the Web graph structure or query-document similarity estimations. Users' expectations are rarely considered to enhance the subjective relevance of returned results. However, exploiting such information can help search engines satisfy users by tailoring search results. Interestingly, user interests typically follow the clustering property: users who were interested in the same topics in the past are likely to be interested in these same topics also in the future. It follows that search results considered relevant by a user belonging to a group of homogeneous users will likely also be of interest to other users from the same group. In this paper, we propose the architecture of a novel peer-to-peer system exploiting col...
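The clustering property the abstract relies on can be made concrete with a small sketch: group peers by the overlap of their past interest sets, then share results within a group. Jaccard overlap and the `threshold` parameter are illustrative assumptions; the paper's actual system architecture is not specified in the abstract.

```python
def jaccard(a, b):
    """Overlap between two users' interest sets, in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 0.0

def interest_cluster(target, profiles, threshold=0.5):
    """Peers whose interest profile overlaps the target's by at least
    `threshold`; results these peers found relevant are candidates for
    tailoring the target's own search results."""
    return [peer for peer, interests in profiles.items()
            if peer != target and jaccard(interests, profiles[target]) >= threshold]
```
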
    Automatic indexing is one of the important technologies used for Textual Data Analysis applications. Standard document indexing techniques usually identify the most relevant keywords in the documents. This paper presents an alternative approach that aims at performing document indexing by associating concepts with the document to index instead of extracting keywords out of it. The concepts are extracted out of the EDR Electronic Dictionary that provides a concept hierarchy based on hyponym/hypernym relations. An experimental evaluation based on a probabilistic model was performed on a sample of the INSPEC bibliographic database and we present the promising results that were obtained during the evaluation experiments.
    The GRACE action is the first example of applying the evaluation paradigm to part-of-speech taggers for French within a formal evaluation campaign, open to participation and using large-scale data. After a brief description of the organization and course of the action, as well as of the problems raised by the necessary establishment of a common reference standard for the evaluation, we present in detail the Precision-Decision metric developed within GRACE for the quantitative measurement of the performance of tagging systems. We then discuss the results obtained by the participants in the test phase of the campaign and indicate the aspects of the evaluation protocol that still remain to be validated on the collected data. Finally, we conclude by underlining the positive impact of an evaluation campaign such as GRACE on the field of language engineering.

    And 237 more