    Anastasia Bezzubtseva

    Internet companies use crowdsourcing to collect the large amounts of data needed to build products based on machine learning techniques. A significant source of such labels for OCR data sets is (re)CAPTCHA, which distinguishes humans from automated bots by asking them to recognize text and, at the same time, collects new labeled data. An important component of this approach to data collection is the reduction of noisy labels produced by bots and non-qualified users. In this paper, we address the problem of labeling text images via CAPTCHA, where user identification is generally impossible. We propose a new algorithm to aggregate multiple guesses collected through CAPTCHA. We employ incremental relabeling to minimize the number of guesses needed to obtain recognized text of good accuracy. The aggregation model and the stopping rule for our incremental relabeling are based on novel machine learning techniques and use meta features of CAPTCHA tasks and accumulated guesses. Our experiments show that our approach can provide a large amount of accurately recognized text using a minimal number of user guesses. Finally, we report substantial improvements of an optical character recognition model after implementing our approach at Yandex.
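    The idea of incremental relabeling with a stopping rule can be illustrated with a minimal sketch. It does not reproduce the paper's method: the learned aggregation model and learned stopping rule are swapped for plain majority voting and a fixed agreement threshold, and the function name and all parameters (`min_guesses`, `max_guesses`, `agreement_threshold`) are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def aggregate_incrementally(
    guesses: Iterable[str],
    min_guesses: int = 2,
    max_guesses: int = 5,
    agreement_threshold: float = 0.75,
) -> Tuple[Optional[str], int]:
    """Consume guesses one at a time and stop as soon as one answer dominates.

    Assumption: majority voting stands in for a learned aggregation model,
    and a fixed agreement threshold stands in for a learned stopping rule.
    Returns (label, guesses_used); label is None if no answer was accepted.
    """
    counts: Counter = Counter()
    used = 0
    for guess in guesses:
        # Normalize trivially different spellings of the same answer.
        counts[guess.strip().lower()] += 1
        used += 1
        label, votes = counts.most_common(1)[0]
        # Stopping rule: enough guesses seen and the top answer is dominant.
        if used >= min_guesses and votes / used >= agreement_threshold:
            return label, used
        # Budget exhausted without sufficient agreement.
        if used >= max_guesses:
            break
    return None, used
```

    Under these assumptions, an image whose first two guesses agree is labeled after two guesses, while a contested image keeps requesting guesses until the budget runs out and is then left unlabeled.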
    User behaviour data is essential for modern companies, as it allows them to measure the impact of the decisions they make and to gain new insights. A particular type of such data is user location trajectories, which can be clustered into Points of Interest, which, in turn, can be tied to certain venues (restaurants, schools, theaters, etc.). Machine learning is extensively used to detect and predict venue visits from location data, but it requires a sufficient sample of labeled visits. Few Internet services let a user check in, that is, send a signal that she is visiting a particular venue. However, for the majority of mobile applications it is unreasonable or far-fetched to introduce such functionality for labeling purposes only. In this paper, we present a novel approach to labeling large quantities of location data as visits based on the following intuition: if a user is connected to a Wi-Fi hotspot of some venue, she is visiting the venue. Namely, we address the problem of matching Wi-Fi hotspots with venues by means of machine learning, achieving 95% precision and 85% recall. The method has been deployed in production at one of the most popular global geo-based web services. We also release the dataset that we used to develop the matching model to facilitate research in this area.
    We study the problem of predicting future hourly earnings and task completion time for a crowdsourcing platform user who sees the list of available tasks and wants to select one of them to execute. Namely, for each task shown in the list, one needs an estimate of the user's performance (i.e., hourly earnings and completion time) should she select this task. We address this problem on real crowd tasks completed on one of the global crowdsourcing marketplaces through (1) a survey and an A/B test on real users, whose results confirm the dominance of monetary incentives and the importance of knowing hourly earnings for users; (2) an in-depth analysis of user behavior showing that the prediction problem is challenging: (a) users and projects are highly heterogeneous, and (b) there exists a so-called "learning effect" when a user selects a new task; and (3) a solution to the problem of predicting user performance that improves prediction quality by up to 25% for hourly earnings and up to 32% for completion time w.r.t. a naive baseline based solely on the historical performance of users on tasks. In our experiments, we use data on 18 million real crowdsourcing tasks performed by 161 thousand users on the crowd platform; we publish this dataset. The hourly earnings prediction has been deployed in Yandex.Toloka.
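    A naive baseline of the kind mentioned above, one that relies only on a user's past performance on a task, might look like the following sketch. The record layout, the function names, and the error-based definition of relative improvement are assumptions made for illustration; the abstract does not specify the paper's features or quality metric.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (user_id, task_id, reward, seconds_spent).
Record = Tuple[str, str, float, float]


def fit_historical_baseline(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Predict hourly earnings of a (user, task) pair as its historical average.

    Assumption: hourly earnings = total reward / total time spent, per hour.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # key -> [reward_sum, seconds_sum]
    for user, task, reward, seconds in records:
        totals[(user, task)][0] += reward
        totals[(user, task)][1] += seconds
    return {
        key: reward_sum / seconds_sum * 3600.0
        for key, (reward_sum, seconds_sum) in totals.items()
        if seconds_sum > 0  # skip pairs with no recorded working time
    }


def relative_improvement(baseline_error: float, model_error: float) -> float:
    """Share of the baseline's error removed by a model (assumed metric)."""
    return (baseline_error - model_error) / baseline_error
```

    Such a baseline has nothing to say about a user who has never attempted a task, which is one reason a learned model can outperform it.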
    This paper presents a preliminary analysis of hotel room prices in several European cities based on data from the Booking.com website. The main question raised in the study is whether early booking is indeed advantageous and, if so, how early one should book. First, a script was developed to download more than 600 thousand hotel offers for reservations from 25 March 2013 to 17 March 2014. Then an attempt was made to discover more details about the early booking effect via basic statistics, graphical data representation and hedonic pricing analysis. The analysis revealed that making reservations in advance can be profitable, although more data and research are needed to measure the exact numbers, as they depend at least on seasonality and city.
    This study introduces a novel feature selection approach, CMICOT, which is a further evolution of filter methods with sequential forward selection (SFS) whose scoring functions are based on conditional mutual information (MI). We state and study a novel saddle point (max-min) optimization problem to build a scoring function that is able to identify joint interactions between several features. This method fills the gap of MI-based SFS techniques with high-order dependencies. In this high-dimensional case, the estimation of MI has prohibitively high sample complexity. We mitigate this cost using a greedy approximation and binary representatives, which makes our technique usable in practice. The superiority of our approach is demonstrated by comparison with recently proposed interaction-aware filters and several interaction-agnostic state-of-the-art ones on ten publicly available benchmark datasets.
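    To make the family of methods concrete, here is a minimal sketch of sequential forward selection with a conditional-MI score. It is not CMICOT: it uses the simpler, well-known CMIM-style max-min criterion (pick the feature whose worst-case conditional MI with the target, given any single already-selected feature, is largest), plug-in estimates on discrete data, and hypothetical function names.

```python
from collections import Counter
from math import log
from typing import List, Sequence


def mutual_information(x: Sequence, y: Sequence) -> float:
    """Plug-in estimate of I(X; Y) in nats for discrete samples."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * log(c * n / (px[a] * py[b])) for (a, b), c in joint.items())


def conditional_mutual_information(x: Sequence, y: Sequence, z: Sequence) -> float:
    """Plug-in estimate of I(X; Y | Z) = sum_z p(z) * I(X; Y | Z = z)."""
    n = len(x)
    total = 0.0
    for value, count in Counter(z).items():
        idx = [i for i in range(n) if z[i] == value]
        total += (count / n) * mutual_information([x[i] for i in idx], [y[i] for i in idx])
    return total


def _score(f: int, selected: List[int], columns: Sequence[Sequence], target: Sequence) -> float:
    # First step: plain relevance. Later steps: worst case over selected features.
    if not selected:
        return mutual_information(columns[f], target)
    return min(conditional_mutual_information(columns[f], target, columns[s]) for s in selected)


def select_features(columns: Sequence[Sequence], target: Sequence, k: int) -> List[int]:
    """Greedy SFS: repeatedly add the feature with the highest max-min score."""
    selected: List[int] = []
    remaining = list(range(len(columns)))
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: _score(f, selected, columns, target))
        selected.append(best)
        remaining.remove(best)
    return selected
```

    Because this criterion conditions on only one selected feature at a time, it can capture pairwise interactions such as XOR but not dependencies among three or more features, which is the high-order gap the abstract refers to.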
    In this paper, we present a review of the existing typologies of Internet service users. We zoom in on social networking services, including blogs and crowdsourcing websites. Based on the results of an analysis of the considered typologies by means of FCA, we developed a new user typology for a certain class of Internet services, namely collaboration innovation platforms.