
    Michal Cutler

Web pages are not purely text, nor are they solely HTML. This paper surveys HTML web pages, focusing not only on textual content but also on higher-order visual features and supplementary technology. Using a crawler with an in-house-developed rendering engine, data on a pseudo-random sample of web pages is collected. First, several basic attributes are collected to verify the collection process and confirm certain assumptions about web page text. Next, we examine the distribution of different types of page content (text, images, plug-in objects, and forms) in terms of rendered visual area. These content types are then broken down into a detailed view of the ways in which the content is used, including the prevalence and usage of scripts and styles. We conclude that more complex page elements play a significant and underestimated role in the visually attractive, media-rich, and highly interactive web pages that are currently being added to the World Wide...
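As a rough illustration of the area-based breakdown described above, the sketch below aggregates rendered bounding-box areas by content type. The element types, box coordinates, and the overlap-free assumption are hypothetical; the paper's actual measurements come from its own crawler and rendering engine.

```python
from collections import defaultdict

def area_share_by_type(boxes):
    """Aggregate rendered area per content type.

    `boxes` is a list of (content_type, x, y, width, height) tuples, e.g. as
    produced by a rendering engine; overlap between boxes is ignored here.
    """
    totals = defaultdict(float)
    for content_type, _x, _y, w, h in boxes:
        totals[content_type] += w * h
    grand_total = sum(totals.values()) or 1.0
    return {t: a / grand_total for t, a in totals.items()}

# Hypothetical rendered boxes from a single page.
sample = [
    ("text",   0,   0, 600, 400),
    ("image",  0, 400, 300, 250),
    ("form", 300, 400, 300, 100),
]
print(area_share_by_type(sample))  # fraction of visible area per content type
```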
Automatic genre classification of Web pages is currently young compared to other Web classification tasks. Corpora are just starting to be collected and organized in a systematic way, feature extraction techniques are inconsistent and not well detailed, genres are constantly in dispute, and novel applications have not been implemented. This paper attempts to review and make progress in the area of feature extraction, an area that we believe can benefit all Web page classification, and genre classification in particular. We first present a framework for the extraction of various Web-specific feature groups from distinct data models, based on a tree of potential models and the transformations that create them. Then we introduce the concept of cost-sensitivity to this tree and provide an algorithm for performing wrapper-based feature selection on this tree. Finally, we apply the cost-sensitive feature selection algorithm to two genre corpora and analyze the performance of the class...
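To convey the flavor of wrapper-based, cost-sensitive selection over feature groups, here is a minimal greedy sketch on synthetic data. The feature groups, extraction costs, penalty weight, and the flat group list (rather than the paper's tree of data models and transformations) are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for extracted Web-page features; in the paper the groups
# come from distinct data models (HTML, rendered text, etc.).
X, y = make_classification(n_samples=300, n_features=12, n_informative=6, random_state=0)
groups = {"tags": [0, 1, 2], "text": [3, 4, 5, 6], "links": [7, 8], "style": [9, 10, 11]}
extraction_cost = {"tags": 1.0, "text": 2.0, "links": 1.5, "style": 3.0}  # assumed costs
penalty = 0.02  # assumed trade-off between accuracy and extraction cost

def score(selected):
    # Wrapper evaluation: cross-validated accuracy minus a cost penalty.
    cols = [c for g in selected for c in groups[g]]
    acc = cross_val_score(DecisionTreeClassifier(random_state=0), X[:, cols], y, cv=5).mean()
    return acc - penalty * sum(extraction_cost[g] for g in selected)

selected, best = [], float("-inf")
while True:
    candidates = [(score(selected + [g]), g) for g in groups if g not in selected]
    if not candidates:
        break
    s, g = max(candidates)
    if s <= best:          # stop when no group improves the cost-adjusted score
        break
    best, selected = s, selected + [g]
print(selected, round(best, 3))
```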
This paper presents optimization models for selecting a subset of software libraries, viz., collections of programs residing on floppy disks or compact disks, available on the market. Each library contains a variety of programs whose reliabilities are assumed to be known. The objective is to maximize the reliability of the computer system subject to a budget constraint on the total cost of the libraries selected. The paper includes six models, each of which applies to a different software structure and set of assumptions. A detailed branch-and-bound algorithm for solving one of the six models is described; it contains a simple greedy procedure for generating an initial solution. For solving the rest of the models, see Berman & Cutler (1995).
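A minimal sketch of a greedy procedure that could seed such a branch-and-bound search is shown below. The library data, the budget, and the serial "best available copy per program" reliability structure are illustrative assumptions and do not correspond to any particular one of the paper's six models.

```python
# Illustrative data (assumed): each library has a cost and, for each required
# program it contains, the reliability of its copy of that program.
libraries = {
    "L1": {"cost": 40, "programs": {"edit": 0.95, "sort": 0.90}},
    "L2": {"cost": 30, "programs": {"sort": 0.97, "search": 0.85}},
    "L3": {"cost": 55, "programs": {"edit": 0.99, "search": 0.96}},
}
required = ["edit", "sort", "search"]
budget = 100

def system_reliability(chosen):
    """Serial system: every required program must work, and each program uses the
    most reliable copy among the chosen libraries (a simplifying assumption, not
    one of the paper's six structures)."""
    rel = 1.0
    for p in required:
        rel *= max((libraries[l]["programs"].get(p, 0.0) for l in chosen), default=0.0)
    return rel

# Greedy seed for a branch-and-bound search: repeatedly add the affordable library
# with the largest reliability gain per unit cost. While no selection yet covers
# every program the gains tie at zero and the tie-break is arbitrary.
chosen, spent = [], 0
while True:
    gains = [
        ((system_reliability(chosen + [name]) - system_reliability(chosen)) / lib["cost"], name)
        for name, lib in libraries.items()
        if name not in chosen and spent + lib["cost"] <= budget
    ]
    if not gains:
        break
    _, pick = max(gains)
    chosen.append(pick)
    spent += libraries[pick]["cost"]
print(chosen, spent, round(system_reliability(chosen), 4))
```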
A distributed self-diagnosis algorithm for VLSI mesh arrays with small clusters of faults is presented. It allows only fault-free cells to make decisions and to propagate diagnosis results. Its time complexity is constant with respect to the number of processors, and the diagnosability is proportional to the array size.
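The following toy simulation conveys the general idea of comparison-based mesh diagnosis: neighbouring cells run the same test computation, compare results, and the largest set of mutually agreeing cells is taken to be fault-free. It is a centralized, illustrative approximation with assumed fault locations, not the paper's distributed constant-time algorithm.

```python
import random
from collections import deque

# Toy n x n mesh with a few assumed fault clusters.
n, correct = 8, 42
random.seed(1)
faulty = {(2, 2), (2, 3), (5, 6)}

def output(cell):
    # Fault-free cells return the correct test result; faulty cells return garbage.
    return correct if cell not in faulty else random.randint(100, 10**6)

result = {(i, j): output((i, j)) for i in range(n) for j in range(n)}

# Agreement graph over horizontal/vertical neighbour pairs.
adj = {c: [] for c in result}
for i in range(n):
    for j in range(n):
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < n and nj < n and result[(i, j)] == result[(ni, nj)]:
                adj[(i, j)].append((ni, nj))
                adj[(ni, nj)].append((i, j))

# Breadth-first search for the largest agreeing component; everything outside it
# is diagnosed as faulty (reasonable while fault clusters stay small).
seen, best = set(), []
for start in adj:
    if start in seen:
        continue
    comp, queue = [start], deque([start])
    seen.add(start)
    while queue:
        for nb in adj[queue.popleft()]:
            if nb not in seen:
                seen.add(nb)
                comp.append(nb)
                queue.append(nb)
    best = max(best, comp, key=len)
print(sorted(c for c in adj if c not in best))  # diagnosed-faulty cells
```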
The categorization of a Web user query by topic or category can be used to select useful Web sources that contain the required information. In pursuit of this goal, we explore methods for mapping user queries to category hierarchies under which deep Web resources are also assumed to be classified. Our sources for these category hierarchies, or directories, are Yahoo! Directory and Wikipedia. Forwarding an unrefined query (in our case a typical fact-finding query sent to a question answering system) directly to these directory resources usually returns no directories or incorrect ones. Instead, we develop techniques to generate more specific directory-finding queries from an unrefined query and use these to retrieve better directories. Despite these engineered queries, our two resources often return multiple directories that include many incorrect results, i.e., directories whose categories are not related to the query and whose Web resources are therefore unlikely to contain the required information. We develop methods for selecting the most useful ones, considering a directory useful if Web sources for any of its narrow categories are likely to contain the searched-for information. We evaluate our mapping system on a set of 250 TREC questions and obtain precision and recall in the 0.8 to 1.0 range.
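For the evaluation step, a minimal precision/recall sketch over a couple of hypothetical questions and directories is shown below; the real evaluation uses 250 TREC questions and categories drawn from Yahoo! Directory and Wikipedia, and the directory names here are made up.

```python
def precision_recall(returned, relevant):
    """Precision and recall of the directories returned for one query."""
    returned, relevant = set(returned), set(relevant)
    tp = len(returned & relevant)
    precision = tp / len(returned) if returned else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical (returned, relevant) directory lists per question.
results = {
    "Who invented the telephone?": (
        ["Science/Inventions", "History/Telecommunications"],
        ["Science/Inventions"],
    ),
    "What is the capital of Ghana?": (
        ["Regional/Africa/Ghana"],
        ["Regional/Africa/Ghana", "Geography/Capitals"],
    ),
}
pairs = [precision_recall(ret, rel) for ret, rel in results.values()]
print("macro precision:", sum(p for p, _ in pairs) / len(pairs))
print("macro recall:   ", sum(r for _, r in pairs) / len(pairs))
```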
... Kelvin Wu, Lei Yu, and Michal Cutler, Computer Science, Binghamton University {kwu, lyu, cutler}@binghamton.edu ... to join the army, to make garlic bread, to remove an ink stain, to poach eggs, to install Linux, to fight hay fever, to move a refrigerator, to treat poison ivy, to fertilize a ...
The general algorithms and heuristics developed in the module-level automatic test pattern generator called MATEG are described. MATEG is a generalization of the branch-and-bound test generation algorithm for module-based circuits. It retains the accuracy of deterministic test generation and reduces computation time by exploiting the hierarchy in the circuit under test. Some experimental results are presented to demonstrate the efficiency of this approach.
The problems of laying out printed circuits and large-scale integrated chips are very complex and are therefore usually approached by heuristic methods. This paper presents a more analytic approach to an elementary subset of these problems, using combinatorial and graph-theoretic arguments.
A special layout problem is considered and shown to be NP-complete. For a system with a small number of channels k, a polynomial algorithm is given that finds a layout whose width differs from the minimum by at most k + 1. For a system with a large number of channels k, an approximation algorithm is presented: given an ϵ > 0, it finds a layout whose width differs from the minimum by at most ϵZ + k + 1, where Z is a lower bound on the width. The complexity of this algorithm depends on ϵ.
The traveling salesman problem, in both its path and cycle versions, is NP-complete, and all known exact solutions are exponential. In the N-line planar traveling salesman problem the points lie on N lines in the plane. In this paper, simple and efficient low-degree polynomial solutions based on dynamic programming are given for some N-line (N = 2, 3) planar traveling salesman problems. Such problems arise in practical applications, for example, connecting nets in printed circuits.
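A small dynamic-programming sketch for the two-line path case is given below. It assumes, for illustration, that each line's points are visited in left-to-right order; the point coordinates are made up and the formulation is not necessarily the one used in the paper.

```python
from math import hypot, inf

def two_line_tsp_path(A, B):
    """Shortest Hamiltonian path over points A (on one line) and B (on the other),
    assuming each line's points are visited in left-to-right order.
    dp[i][j][e] = best length after visiting A[:i] and B[:j], currently at
    A[i-1] (e=0) or B[j-1] (e=1)."""
    A, B = sorted(A), sorted(B)
    nA, nB = len(A), len(B)
    d = lambda p, q: hypot(p[0] - q[0], p[1] - q[1])
    dp = [[[inf, inf] for _ in range(nB + 1)] for _ in range(nA + 1)]
    if nA: dp[1][0][0] = 0.0   # path may start at the leftmost point of either line
    if nB: dp[0][1][1] = 0.0
    for i in range(nA + 1):
        for j in range(nB + 1):
            for e in range(2):
                cur = dp[i][j][e]
                if cur == inf:
                    continue
                at = A[i - 1] if e == 0 else B[j - 1]
                if i < nA:   # visit the next unvisited point on line A
                    dp[i + 1][j][0] = min(dp[i + 1][j][0], cur + d(at, A[i]))
                if j < nB:   # visit the next unvisited point on line B
                    dp[i][j + 1][1] = min(dp[i][j + 1][1], cur + d(at, B[j]))
    return min(dp[nA][nB])

# Points on two horizontal lines y=0 and y=1 (illustrative data).
A = [(0, 0), (2, 0), (5, 0)]
B = [(1, 1), (3, 1), (4, 1)]
print(round(two_line_tsp_path(A, B), 3))
```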
Fault-tolerant software uses redundancy to improve reliability, but such redundancy requires additional resources and tends to be costly; the redundancy level therefore needs to be optimized. Our optimization models determine the optimal level of redundancy within a software system under the assumption that functionally equivalent software components fail independently. A framework illustrates the tradeoff between the cost of using N-version programming and the improved reliability for a software system. The two models deal with single-task and multitask software. These software systems consist of several modules, where each module performs a subtask and a major task is performed by sequential execution of the modules. The major assumptions are: 1) several versions of each module, each with an estimated cost and reliability, are available; 2) these module versions fail independently. Optimization models are used to select the optimal set of versions for each module such that system reliability is maximized and total cost remains within budget.
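As a toy illustration of the single-task model's structure, the sketch below exhaustively enumerates version subsets per module and keeps the feasible combination with the highest system reliability. The module names, version costs and reliabilities, and the budget are assumed; a real instance would call for the paper's optimization models rather than brute force.

```python
from itertools import combinations, product

# Illustrative instance (assumed data): for each module, the available versions
# with (cost, reliability). System reliability is the product over modules of the
# probability that at least one selected version of that module works.
modules = {
    "parse":  [(3, 0.90), (5, 0.95), (4, 0.92)],
    "solve":  [(6, 0.93), (8, 0.97)],
    "report": [(2, 0.88), (3, 0.91)],
}
budget = 20

def choices(versions):
    # All non-empty version subsets of a module, with their total cost and reliability.
    for r in range(1, len(versions) + 1):
        for subset in combinations(versions, r):
            cost = sum(c for c, _ in subset)
            fail = 1.0
            for _, rel in subset:
                fail *= (1.0 - rel)
            yield cost, 1.0 - fail

best = None
for combo in product(*(list(choices(v)) for v in modules.values())):
    cost = sum(c for c, _ in combo)
    if cost > budget:
        continue
    rel = 1.0
    for _, r in combo:
        rel *= r
    if best is None or rel > best[0]:
        best = (rel, cost, combo)
print("reliability %.4f at cost %d" % best[:2])
```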
Software is becoming central to every aspect of our life. Therefore, developing highly reliable software while considering budget limitations is becoming very important. Because of their increasing complexity, software products are implemented using available modules and by performing many programming, testing, and integration tasks. The purpose of this research is to allocate resources efficiently to these activities so that the reliability of a software package is maximized. The paper includes an optimization model for deriving cost allocations while satisfying a budget constraint. The model allows a decision maker to consider the use of modules available on the market as well as the option of developing them in-house.

It is becoming increasingly difficult to create software products that simultaneously provide high reliability, rapid delivery, and low cost. This research deals with the cost of achieving reliable software. Assume a software package has been designed and is ready for implementation. To implement it, a set of modules will have to be purchased and many programming and integration tasks will have to be performed. A programming task consists of the detailed design of a module, coding, and unit testing. An integration task consists of the additional testing and debugging needed when the code of separately tested tasks is joined together. The implementation process ends when the package has been integrated and tested. A model for deriving cost allocations is presented. The objective of the model is to maximize reliability while satisfying a budget constraint. Both the option of developing modules in-house and the option of purchasing them where available are considered in the optimization. The paper includes a branch-and-bound scheme to derive an optimal solution.
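A compact branch-and-bound sketch in the same spirit is shown below: each module is either purchased or developed in-house, and an optimistic bound (every remaining module at its most reliable option) prunes the search. The costs, reliabilities, and budget are illustrative assumptions, and the scheme is not the paper's exact algorithm.

```python
# Illustrative buy-versus-build branch and bound: pick one option per module to
# maximize the product of module reliabilities within a budget (assumed data).
modules = [
    {"buy": (7, 0.96), "build": (4, 0.90)},
    {"buy": (5, 0.92), "build": (6, 0.95)},
    {"buy": (9, 0.99), "build": (5, 0.91)},
]
budget = 18

# Optimistic bound: assume every remaining module gets its most reliable option.
best_rel = [max(r for _, r in m.values()) for m in modules]
suffix_bound = [1.0] * (len(modules) + 1)
for i in range(len(modules) - 1, -1, -1):
    suffix_bound[i] = suffix_bound[i + 1] * best_rel[i]

best = {"rel": 0.0, "plan": None}

def branch(i, cost, rel, plan):
    if rel * suffix_bound[i] <= best["rel"]:      # prune: cannot beat the incumbent
        return
    if i == len(modules):
        best["rel"], best["plan"] = rel, plan
        return
    for option, (c, r) in modules[i].items():
        if cost + c <= budget:
            branch(i + 1, cost + c, rel * r, plan + [option])

branch(0, 0, 1.0, [])
print(best)
```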