

Socially-Aware Multimedia Authoring

Doctoral thesis

Creating compelling multimedia productions is a non-trivial problem. This is true for both professional and personal content. For professional content, extensive production support is typically available during creation. Content assets are well structured, content fragments are professionally produced with high quality, and production assets are often highly annotated (within the scope of the production model). For personal content, nearly none of these conditions exist: content is a collection of assets that are structured only by linear recording time, of mediocre technical quality (on an absolute scale), and with only basic automatic annotations. These conditions limit the options open to casual authors and to viewers of rich multimedia content in creating and receiving focused, highly personal media presentations. The problem is compounded when authors want to integrate community media assets: media fragments donated from a potentially wide and anonymous recording community. In this thesis we reflect on the traditional multimedia authoring workflow and argue that a fresh look is required. Our experimental methodology aims at meeting the requirements of social communities that are not addressed by traditional authoring and sharing applications. We focus on the particular task of supporting socially-aware multimedia authoring, in which the relationships within particular social groups can be exploited to create highly personal media experiences. Our framework is centered on empowering users to tell stories and comment on personal media artifacts, taking into account the long-term social context of the user. The work has been evaluated through a number of prototype tools that allow users to explore, create, enrich and share rich multimedia artifacts. Results from our evaluation process provide useful insights into how a socially-aware multimedia authoring and sharing system should be designed and architected to help users recall personal memories and nurture their close-circle relationships.

Socially-Aware Multimedia Authoring

Rodrigo Laiola Guimarães

Copyright © 2014 by Rodrigo Laiola Guimarães
ISBN: 978-94-6259-028-1
Typeset with Microsoft Word
Cover picture: TrendyCovers.com
Printed and bound by Ipskamp Drukkers, Amsterdam, The Netherlands

All rights are reserved. Reproduction in whole or in part is prohibited without the written consent of the copyright owner.

The work reported in this thesis has been carried out at the Centrum Wiskunde & Informatica (CWI), under the auspices of the group Distributed and Interactive Systems (SEN5). The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. ICT-2007-214793.

To the memory of my beloved mother, Maria Célia, whose sacrifices, love, and dedication to family taught me the most important lessons in my life and showed me how to see the hand of God in all things.

Dedicated to the memory of my beloved mother, Maria Célia, whose sacrifice, love and dedication taught me the most important lessons of my life and showed me that God acts upon all things.

VRIJE UNIVERSITEIT

Socially-Aware Multimedia Authoring

ACADEMISCH PROEFSCHRIFT

Doctoral dissertation for the degree of Doctor at the Vrije Universiteit Amsterdam, by authority of the rector magnificus prof.dr. F.A. van der Duyn Schouten, to be defended in public before the doctoral committee of the Faculty of Exact Sciences on Tuesday, 28 January 2014, at 13.45, in the aula of the university, De Boelelaan 1105, by Rodrigo Laiola Guimarães, born in Vitória, Brazil.

promotor: prof.dr. D.C.A. Bulterman
copromotor: dr. P.S. Cesar

promotiecommissie: prof.dr. A. Eliens, prof.dr. F. van Harmelen, prof.dr. J. Murray, prof.dr. S. Boll, prof.dr. M.G.C. Pimentel

Contents

Acknowledgements
1 Introduction
  1.1 Research Questions
    1.1.1 Social Aspects
    1.1.2 Capture and Processing
    1.1.3 Access and Navigation
    1.1.4 Creation and Production
    1.1.5 Content Enrichment
  1.2 Our Aim
  1.3 Thesis Outline and Summary of the Contributions
  1.4 Related Work
    1.4.1 Conventional Authoring Systems
    1.4.2 Interactive Storytelling
    1.4.3 Community Video Mashups and Content Repurposing
2 Personalized Memories of Social Events: Studying Asynchronous Togetherness
  2.1 Methodology
    2.1.1 Content Recording and Preparation
    2.1.2 MyVideos Implementation
    2.1.3 Participants
  2.2 Generic Architecture for Socially-Aware Authoring Systems
    2.2.1 Social Science Principles
    2.2.2 Family Interviews and Focus Groups
    2.2.3 Requirements Gathering
    2.2.4 Guidelines
  2.3 Evaluation
  2.4 Discussion
3 Designing Socially-Aware Video Exploration from Community Assets
  3.1 Community-based Browsing
    3.1.1 Phase 1 Evaluation
    3.1.2 Lessons Learned
  3.2 Socially-Aware Media Browsing
  3.3 Evaluation
    3.3.1 Results and Findings
  3.4 Discussion
4 Automatic and Manual Processes for Creating Personalized Stories from Community Assets: Where is the Balance?
  4.1 Community-based Authoring
    4.1.1 Phase 1 Evaluation
    4.1.2 Requirements Gathering
  4.2 Hybrid Multimedia Authoring
    4.2.1 Profiling Users
    4.2.2 Automatic Generation of Stories
    4.2.3 End-User Personalization of Stories
  4.3 Evaluation
    4.3.1 Results and Findings
  4.4 Discussion
5 Supporting Personalized End-User Comments within Third-Party Online Videos
  5.1 Media Consumption and Commenting Practices
    5.1.1 Survey Research
    5.1.2 Requirements Gathering
  5.2 Media Commenting meets Multimedia Document Engineering
    5.2.1 Document Model
    5.2.2 Timed Text Content
    5.2.3 Temporal Hyperlinks
    5.2.4 Contextual Information
    5.2.5 Selective Viewing
  5.3 A Timed Text Video Commenting System
    5.3.1 Infrastructure
    5.3.2 User Interface
  5.4 Evaluation
    5.4.1 Commenting on Videos
    5.4.2 Close Captioning Videos
  5.5 Discussion
6 Conclusions
  6.1 Revisiting the Research Questions
  6.2 Reflection and Further Directions
    6.2.1 Media Encoding and Storage
    6.2.2 Media Classification and Annotation
    6.2.3 Customized Media Selection
    6.2.4 Content-Based Navigation
    6.2.5 Ownership and Digital Rights
    6.2.6 Security and Privacy Concerns
  6.3 Closing Thoughts
Bibliography
Summary
Nederlandse Samenvatting
Curriculum Vitae

Acknowledgements

I wish to publicly acknowledge those whom I have been so fortunate to have met or worked with during this long journey. My Ph.D. thesis is, without a doubt, a team effort, and I owe a very large debt to those who graciously guided me in making this book a reality. First, I would like to thank my mentor and dear friend Dick Bulterman. It is impossible to describe in a few words my admiration and how much I have learned from him during this period. In stressful times, he would always remind me that "we do what we do (research) simply for the fun of it!". And he was right! Second, I would like to gratefully and respectfully thank my equally important advisor, colleague and "big brother" Pablo Cesar. Thank you for doing such a patient and sterling job of guiding me through this process. I guess you should have gotten paid double, not only for shaping me as a researcher, but also for serving as my psychologist.
It was an honor and a pleasure to work with both of you. Thank you very much once again, and I truly hope we can cross paths and work together in the near future. I also would like to give a special thank you to the members of my reading committee – Anton Eliens, Frank van Harmelen, Janet Murray, Maria da Graça Pimentel and Susanne Boll – for taking the time to carefully read my thesis and for all the helpful comments. Your expert endorsement added great value to this work.

My time at CWI was enjoyable and many people contributed to this. I would like to take a moment to thank some colleagues and friends: Alexandra, Behnaz, Benjamin, Bram, Christoph, David, Emma, Enav, Eleni, Floor, Georgiana, Henke, Irma, Inken, Jannis, José, Karin, Krzysztof, Lara, Léon, Li, Mahdi, Margriet, Milad, Minnie, Natallia (& Mark), Rob, Sara, Shashi, Stephanie, Susanne, Willem, Young-Joo and any other name that might have slipped my mind. I also would like to thank all my colleagues who were part of SEN5 at some point in time: Bo, Chen, Diogo Martins, Diogo Pedrosa (& Uaiana), Fons, Jack, Ivan, Kees, Ketan, Marcio (& Suzana), Manzato, Marwin, Nelma, Pia, Rufael, Simon and Steven. A big thank you to my all-time best friends André, Fernando Mario, Ishan and Wagner. We have been through great moments together. Last, but not least, special thanks to my friend Bikkie, whose smile and cheerful "Bom dia" (good morning) made my days much happier!

During my Ph.D., I had the opportunity to work in the pan-European project Together Anywhere, Together Anytime (TA2). I would like to take this opportunity to thank in particular: Peter Stollenmayer (Eurescom); Ian Kegel, Doug Williams, Tim Stevens, Roland Craigie, Peter Glenn and Amanda Gower (British Telecom); Marian Ursu, Vilmos Zsombori and Michael Frantzis (Goldsmiths College London); Peter Ljungstrand (Interactive Institute); Rene Kaiser (Joanneum Research); Danil Korchagin (IDIAP); Marc Steen and Joke Kort (TNO); Orlando Verde (Alcatel-Lucent Belgium). Special thanks to all the users, families and schools (in the Netherlands and the UK) involved in the evaluation process discussed in this thesis. Due to privacy agreements I will not disclose any particular name.

I also had the pleasure to visit the Georgia Institute of Technology (GA Tech) and to intern at Samsung Research America (SRA) during my Ph.D. I would like to gratefully thank the people who kindly hosted me on these occasions: Janet Murray, Sergio Goldenberg (GA Tech); Henry, Ellen and Oma (Bulterman family); Alan Messer, Alex Bentley, Edwin, George Hsieh, Glenn Algie, Joakim Soderberg, Jun Nishimura, Mina Yoo, Raghuram Reddy, Scott Pan and Shai Kumar (SRA).

While working at CWI, I attended several conferences, workshops and summer schools, where I met great and interesting people. I would like to mention some names: Andrei Bursuc, Cássio Prazeres, Carlos Salles, Cesar Teixeira, Cyril Concolato, David Ayman, David Geerts, Diana Arellano, Ethan Munson, Fabien Cazenave, Florian Stegmaier, Frank Nack, George Lekakos, Hendrik Knoche, Joel dos Santos, Konstantinos Chorianopoulos, Marianna Obrist, Miriam Redi, Mo Rabbath, Pawel Filipczuk, Romualdo Costa, Rudinei Goularte and Sam Davies. I also would like to take this moment to thank all the inspiring teachers I have had over the years, in particular Flávio Miguel Varejão, Luiz Fernando Gomes Soares, José Gonçalves Pereira Filho, Maria da Graça Pimentel and Celso Saibel. A tremendous dank u wel (thank you) to my dear friends from A.V.V.
Swift – Chudi, Danny, Dennis, Edwin, Fred, Gio, Jesse, Jim, Johan, Kevin, Lodi, Marcel, Marius, Nordin, Patti, Paul, Ramalho, Remi, Romain, Thijs and anyone else whom I might have forgotten. Being part of the Zondag 5 family provided me with a regular escape from work and made me feel closer to home. I will miss playing football with you. I also would like to thank some friends I had the pleasure to meet in Amsterdam on the most varied occasions: Alex, Daniel, Edson, Frans, Hugo, Jacquelien, Jordan, Malcher, Martinelli, Renato, Ruben, Valéria, Vinicius, Willemijn, Yoko and Zambon. A big bedankt (thanks) to my Dutch friend Michiel Zwerus for the inspiring discussions, pool games and great dinners.

My entire family has been a steady source of support during my entire career. They don't quite get what exactly I have been doing all this time, but they are still so encouraging. I would like to mention my parents, Antônio and Maria Célia, and my sisters Roberta and Rafaela. Not to forget my parents-in-law, José Braz and Maria do Carmo. I would like to specially thank my uncle João Ademir Laiola (aka João Ameixa), whose encouragement always gave me inspiration to keep going. Finally, without my girlfriend, partner and best friend, Cristina, I would not have been able to manage throughout this journey. Her loving support, unflagging patience and zeal pulled me together at the most difficult turns along the way. I am so blessed for having you by my side, and thank you very much for everything. I love you, Namorada!

1 Introduction

Nearly a decade ago at an ACM SIGMM retreat, one of the grand challenges posed to the multimedia research community was to develop media authoring tools that would make creating complex media titles as easy as using a WYSIWYG (What You See Is What You Get) word processing system [41]. (SIGMM is the Association for Computing Machinery's Special Interest Group on Multimedia, which specializes in the field of multimedia computing: from underlying technologies to applications, theory to practice, and servers to networks to devices.) Since that time, a number of consumer-level video editing tools have been developed that would lead a casual observer to believe that multimedia authoring is a solved problem: using tools like iMovie or Windows Movie Maker (or even more sophisticated tools such as Adobe Premiere or Final Cut Pro), even relatively novice video editors can match their talents with the likes of Sergei Eisenstein (see Figure 1.1). (The services and technologies mentioned in this thesis, if unknown, can easily be identified via a simple online search; therefore they will not be Web-referenced.) The process was further simplified by modern content capture tools, such as smartphones, in which recording, (simple) editing and integrated uploading were combined into a single task. In many ways, video editing has been reduced to transferring content taken from a (personal) camera to a computer, throwing out frames that are unwanted, and uploading the resulting production to a video sharing site.

While it is indisputable that media capture and sharing are much easier than at any time in the past, we wonder if the resulting products of such authoring interfaces have provided any significant advances for the viewers of media content. It is even questionable whether there have been significant advances for content authors. Recent data suggests not.
In spite of the ubiquity of video cameras and the growth of video viewing on social networking sites, about 82% of Internet users have never uploaded even a single video [45]. Although most YouTube uploads are amateur content, professional videos are preferred to amateur productions online [38]. From the perspective of personal videos, the problem of creating and sharing content has several dimensions. At a low level of abstraction, video is not semantically linked; therefore, searching for and selecting the desired piece of content to share can become overly laborious. At a high level of abstraction, creating compelling videos – videos that meet the needs and desires of the viewer, not only the producer – is a complex task [69]. Viewers generally expect professionally produced content (in terms of shot selection, story pacing and logical narrative), which most amateur users cannot provide. Although a number of research efforts have addressed content creation from different perspectives [6][19][26][61], based on user studies [57][63] we observed that traditional authoring tools and current social media services fail to address the interpersonal relationships around sharing media that is personal and important to families and small social groups. Our assumption is validated by other studies [24], which concluded that social media applications like Facebook do not take into account the interpersonal tie strength of their users. Thus, we can conclude that the current media landscape demands a revision of traditional research on multimedia authoring to empower users in recalling and sharing personal media experiences with friends and family. This discussion leads to the following question:

Main Question: Is a new multimedia authoring paradigm required to enable end-users to share more personal media within their social circle?

(The terms 'end-user' and 'user' will be used interchangeably in this thesis to describe regular people who operate computer software with minimal technical expertise or previous training.)

During the past years our research work has focused on the study of socially-aware multimedia authoring. Working with a group of users at local high schools in two different countries (the UK and the Netherlands), the process involved research on different facets related to the creation and sharing of multimedia artifacts composed of personal videos. Apart from the underlying mechanisms for navigating and reusing personal content, this thesis argues that a new paradigm, socially-aware multimedia authoring, is necessary to better fit end-users' needs. One important aspect of our work is that we decided to follow an interdisciplinary approach in which both technology and social issues were addressed. As illustrated in Figure 1.2, the core of our methodology integrates knowledge from user-centered design (e.g., needs assessment, iterative prototyping and user evaluation) and document engineering.

Figure 1.1. From the authoring perspective, the main challenge has been to make content creation a manageable process.

The remainder of this chapter is organized as follows. In Section 1.1 the main question is split into a number of supporting research questions, which form the main focus of this thesis. Then, the contribution of the thesis is detailed in Section 1.2. Section 1.3 presents the thesis outline and a summary of each chapter's contribution with the respective supporting material. Finally, Section 1.4 overviews the related work, contextualizing the research problem.
Figure 1.2. Timeline of the implementations and evaluations of our system.

1.1 Research Questions

In this thesis, we consider a socially-aware multimedia authoring framework for personalizing video stories from a collection of community assets. The high-level architecture of our framework is sketched in Figure 1.3. The input material includes the video clips that parents agreed to upload, together with a master track recorded by the school. By contributing assets to a shared video repository, each participant gives permission to reuse their own contributions within the community. It is assumed that each participant has the rights to contribute their own material. Privacy and a protected scope for sharing are key components of our framework. Each media item is automatically associated with the person who uploaded it, and there are mechanisms for participants to restrict the sharing of certain clips. Participants can use their credentials for navigating the repository – those parts allowed to them – and for creating and sharing different video compilations intended for different people.

At capture time, there are no specific filming requirements for users. They can record what they wish using their own camera equipment. The goal is to recreate a realistic situation, in which friends and families are recording at a school concert. This flexibility comes at a cost, however, since most existing solutions that work well in analyzing audiovisual material are not that useful for our use case. As indicated in [59], handling user-generated content is challenging, since it is recorded using a variety of devices (e.g., mobile phones), the quality and lighting are not optimal, and the length of the clips is not standard.

Figure 1.4 lists four main stages (with the respective application services) that compose the socially-aware multimedia authoring workflow proposed in this thesis: Capture and Processing, Access and Navigation, Creation and Production, and Content Enrichment. We should not forget the importance of looking at the social aspects around personal media in this workflow. The key research questions emerging from each of the four stages are presented below. But first, we take a look at the social requirements.

Figure 1.3. High-level architecture of socially-aware multimedia authoring systems.

1.1.1 Social Aspects

Recognizing the importance of looking at personal media as a cornerstone for sharing family experiences [72], the intention of our research is to understand the intersection of social multimedia and social interactions in an asynchronous communication context [37]. We are interested in personal media, and how these can become memory artifacts: the content around which conversation happens. We aim to help small groups of people (such as a family, a school class or a sporting club) view, create, and share personalized multimedia. From the technical perspective, a system should combine the benefits of personal focus – knowing whom you are talking with – within the context of temporally asynchronous and spatially separated social meeting spaces.

Sociological theories [24] and user-centered approaches [3][25] have tackled different aspects of the multimedia workflow. For instance, human-centered efforts explore video-mediated communication to share the experience of watching videos together over a distance [4].
Similar to us, other studies investigate what people do with media in an asynchronous context, balancing the preponderance of techno-centric work with appropriate user-centric insight [19]. In our work we pay special attention to social theories and human-centered methodologies. Our work, which is predicated on the intrinsic desire to strengthen existing strong ties among people, tackles different aspects of the socially-aware multimedia workflow. That said, the following research question arises:

Question 1.1: Can a socially-aware multimedia authoring system be defined in terms of existing social science theories and human-centered processes, and if so, which?

1.1.2 Capture and Processing

The research question introduced above puts the accent on the social aspects around personal media. While knowledge from online social networks could be mined to determine the strength of ties among people [24], user interest and sentiment analysis could also be used to facilitate media annotation and content selection [3]. Given the characteristics of end-user content, in this thesis we have chosen to focus on studying the implicit social practices during video capture within groups of people with strong ties. Our assumption, which can be compared to research in the domain of user modeling [29], is that users' recording behavior can provide useful insights to better understand their interpersonal relationships.

Figure 1.4. Multimedia authoring workflow and application services.

The attempt to 'understand' end-users and their social practices is just the first step in the socially-aware multimedia authoring workflow. More important are the indications that this new paradigm brings an improvement over the state of the art in multimedia authoring. In this direction, the research question we have is the following:

Question 1.2: Does the functionality provided by a socially-aware multimedia authoring system provide an identifiable improvement over traditional authoring and sharing solutions? If so, how can these improvements be validated?

1.1.3 Access and Navigation

Media selection in socially-aware multimedia is not a case of 'finding as many essentially equivalent videos of an event as possible', but of 'finding the relatively few videos within that event that are relevant to me now, and structuring them into a story based on my context (and that of the people in the video)'. A key aspect in this process is to support interactive content selection. User interfaces are important for the user experience, but the underlying interaction design and recommendation mechanisms are even more so. In particular, we are interested in technological solutions that can help users access and navigate media content with which they have social affinity. Our work acknowledges previous research efforts in video abstraction/summarization [7], content recommendation [40], synchronization and organization of user-generated content from popular music events [44] and home video management and navigation [28]. However, we go a step further by integrating knowledge of the social relationships to improve content searching and selection by individual users of a shared media repository. With this in mind, we ask the research question:

Question 1.3: Does a socially-aware video exploration system provide an identifiable improvement over current approaches for accessing and navigating a repository of shared media?
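To make the notion of socially-aware selection concrete, consider the following minimal sketch. It is our own hypothetical illustration – the data layout, names and weights are assumptions, not part of any system described in this thesis – of ranking clips for a given viewer by the viewer's tie strength with the people featured in each clip:

```python
def rank_clips(viewer, clips, tie_strength, top_n=5):
    """Rank community clips for one viewer by social affinity.

    tie_strength maps (viewer, person) pairs to a weight in [0, 1],
    e.g. 1.0 for one's own child and a low value for a classmate.
    Each clip is assumed to carry a set of annotated featured people.
    """
    def affinity(clip):
        people = clip["featured"]
        if not people:
            return 0.0
        # Average tie strength over everyone featured in the clip.
        return sum(tie_strength.get((viewer, p), 0.0) for p in people) / len(people)

    return sorted(clips, key=affinity, reverse=True)[:top_n]

# Example: the father of 'tom' sees clips featuring his son first.
clips = [{"id": "c1", "featured": {"anna"}},
         {"id": "c2", "featured": {"tom", "anna"}},
         {"id": "c3", "featured": {"tom"}}]
ties = {("dad_of_tom", "tom"): 1.0, ("dad_of_tom", "anna"): 0.2}
print([c["id"] for c in rank_clips("dad_of_tom", clips, ties)])  # ['c3', 'c2', 'c1']
```

Chapter 3 discusses the browsing interfaces that explore this kind of relationship-driven filtering in practice.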
1.1.4 Creation and Production

Current media authoring is predicated on the notion that content creation is a one-time event. In socially-aware multimedia, content authoring becomes an incremental process of content refinement, sharing and repurposing. 'Old' assets remain living entities. This will foster a new generation of create-view-refine-share authoring systems. A key element of this approach is that media gets integrated into some larger narrative story, rather than the media object being the story itself. A number of research efforts have addressed this problem by focusing on community video remix [14], automatic generation of video mashups from YouTube content [59], social creation of photo albums [58] and configurable and interactive storytelling [49][52]. The main difference in our work lies in the fact that we do not aim at providing a complete description of an event based on the characteristics of individual media fragments, but personalized video stories (narratives) based on the social bonds between people. Convenience and personal effort are also important factors to consider when generating such narratives. In this context, the research question we have is:

Question 1.4: Where is the balance between automatic and manual processes when authoring personalized narratives users care about?

1.1.5 Content Enrichment

One of the foundations of socially-aware multimedia is that media can take on new meaning based on the insights of downstream viewers. As an example, consider end-user generated comments. They have the potential to enrich and transform the media viewing experience by allowing users to express themselves and interact with others. Currently, media commenting is supported at an overly coarse level. Still, the lack of an embedded 'media message' in most personal media content actually presents the viewer of such media with a golden opportunity to superimpose his/her own meaning on top (physically or logically) of the content provided by the media object. We believe that a socially-aware system should enable content enrichment beyond 'likes' and out-of-band text comments.

The analysis of user-generated comments around media has resulted in innovative work on the semantic and temporal structure of media events [13], and on user commenting patterns in video on demand [25] and in live video streaming platforms [34]. Nor should we forget studies on the aggregated behavior of people and social media; examples include motivations behind tagging in Flickr [48] and location-aware photo sharing systems [18]. This thesis acknowledges all these efforts for better understanding users and media. But instead of focusing on the aggregation of user interactions around media, we investigate solutions that allow any user – not necessarily the author – to incrementally add personalized comments within multimedia artifacts. By personalized we mean comments that could be used to highlight interesting things for other viewers, e.g., to make a point about a particular event within a video. This discussion leads to the following question:

Question 1.5: Does the support for timed end-user commenting within pre-authored narratives provide an identifiable improvement over current media commenting approaches?

1.2 Our Aim

This thesis investigates mechanisms and principles for togetherness and social connectivity around personal media.
The main contribution lies in a new paradigm, socially-aware multimedia authoring, which empowers users in telling stories and commenting on personal media artifacts. The work has been evaluated through prototype tools that allow users to explore, create, enrich and share rich multimedia artifacts. Results from our evaluation process provide useful insights into how a socially-aware multimedia authoring and sharing system should be designed and architected to help users recall personal memories and nurture their close-circle relationships. Our experimental methodology aims at meeting the requirements of social communities that are not addressed by traditional authoring and sharing applications. During this process the intention was not to focus on a specific piece of software, but to take a broader look at the process and its implications. The final goal is to reformulate the research problem of multimedia authoring by emphasizing the importance of the social relationships among casual media authors, featured subjects and recipients of the media.

1.3 Thesis Outline and Summary of the Contributions

We summarize below the content and main contributions of each chapter.

Chapter 2 sets the stage by presenting a community video use case in which the social relationships between the people involved play an essential role. Then, we detail our user-centered methodology, which involved requirements gathering, concert recordings, iterative prototyping and user evaluation. Motivated by social theories, preliminary interviews/focus groups and survey research about social practices around personal videos, we identify key requirements and specify guidelines for realizing socially-aware multimedia authoring systems. Finally, we report on a long-term evaluation process that validates our approach and shows that socially-aware multimedia authoring is a valid alternative for social interactions when apart. The contributions of this chapter, which directly respond to research Questions 1.1 and 1.2, include:

• Introduction of a community video use case and motivation of socially-aware multimedia authoring;
• Description of the user-centered methodology followed in this thesis;
• Identification of requirements and specification of general guidelines for realizing socially-aware multimedia authoring systems; and
• Discussion of a 4-year evaluation process that includes the validation of the proposed socially-aware multimedia authoring framework.

This chapter is based on the following papers:

R.L. Guimarães, P. Cesar, D.C.A. Bulterman, V. Zsombori, and I. Kegel. 2011. Creating personalized memories from social events: community-based support for multi-camera recordings of school concerts. In Proceedings of the 19th ACM international conference on Multimedia (MM '11). ACM, New York, NY, USA, 303-312. DOI=10.1145/2072298.2072339 http://doi.acm.org/10.1145/2072298.2072339 (17% acceptance rate)

R.L. Guimarães, P. Cesar, D.C.A. Bulterman, I. Kegel, and P. Ljungstrand. 2011. Social Practices around Personal Videos using the Web. In Proceedings of the ACM Web Science Conference (WebSci '11). Available at http://journal.webscience.org/437/ (15% acceptance rate)

Chapter 3 considers the development of innovative mechanisms to enable users to browse and navigate a repository of shared media. Context-aware user interfaces and filtering mechanisms are proposed by taking into account
relationships between users of the system and subjects featured in the videos. This chapter also discusses the importance of semantic annotations to describe personal media. Our approach is then compared to traditional (and less individual) media exploration tools. The contributions of this chapter, which directly address research Question 1.3, can be summarized as follows:

• Design and evaluation of an initial interface to facilitate the personalized exploration of a repository of shared media;
• Design and implementation of a new browsing interface based on the requirements elicited in the initial evaluation process; and
• User evaluation of the new interface, demonstrating that, when compared to traditional approaches, we have improved the ability to explore videos users care about among a pool containing parent-made recordings.

This chapter is based on the following paper:

D.C. Pedrosa, R.L. Guimarães, P. Cesar and D.C.A. Bulterman. 2013. Designing Socially-Aware Video Exploration: A Case Study Using School Concert Assets. In Proceedings of the 17th International Academic MindTrek Conference: Making Sense of Converging Media (MindTrek '13).

Chapter 4 compares automatic approaches for generating video stories (or media artifacts) from user-generated content with more manual mechanisms that reflect personal effort and intimacy. Our findings, which directly relate to research Question 1.4, indicate that a balanced combination of manual and automatic processes will be the basis for authoring tools that better fit end-users' needs. The contributions of this chapter are summarized below:

• Two-phased design, implementation and user evaluation of an authoring system to create personalized video stories from community assets; and
• Discussion of the benefits of a compromise between automatic and manual processes when creating personalized video artifacts.

This chapter contains extracts from the following papers:

R.L. Guimarães, P. Cesar and D.C.A. Bulterman. 2013. Personalized Presentations from Community Assets. In Proceedings of the 19th Brazilian Symposium on Multimedia and the Web (WebMedia '13). ACM, New York, NY, USA, 257-264. DOI=10.1145/2526188.2526208 http://doi.acm.org/10.1145/2526188.2526208 (33% acceptance rate) [Won, best multimedia paper]

R.L. Guimarães. 2010. Automatic and manual processes in end-user multimedia authoring tools: where is the balance? In Proceedings of the international conference on Multimedia (MM '10). ACM, New York, NY, USA, 1699-1700. DOI=10.1145/1873951.1874327 http://doi.acm.org/10.1145/1873951.1874327

V. Zsombori, M. Frantzis, R.L. Guimarães, M.F. Ursu, P. Cesar, I. Kegel, R. Craigie, and D.C.A. Bulterman. 2011. Automatic generation of video narratives from shared UGC. In Proceedings of the 22nd ACM conference on Hypertext and hypermedia (HT '11). ACM, New York, NY, USA, 325-334. DOI=10.1145/1995966.1996009 http://doi.acm.org/10.1145/1995966.1996009 (34% acceptance rate) [Nominated, best paper/best newcomer]

Chapter 5 presents mechanisms to support end-user commenting and enrichment of pre-authored video stories. This approach is used as a way to communicate the viewer's personal view by highlighting a particular event that might be interesting to his/her social circle.
This chapter, which directly responds to research Question 1.5, brings the following contributions:

• Motivation based on survey research about the media consumption and commenting habits of a group of Internet users;
• Specification and description of temporal document transformations that allow end-users to create and share personalized timed text comments within third-party online videos;
• Design and implementation of a video commenting tool that realizes such document transformations; and
• User evaluation showing that users appreciated the functionalities of our system and would use it to communicate.

This chapter is based on the following papers:

R.L. Guimarães, P. Cesar, and D.C.A. Bulterman. 2012. "Let me comment on your video": supporting personalized end-user comments within third-party online videos. In Proceedings of the 18th Brazilian Symposium on Multimedia and the Web (WebMedia '12). ACM, New York, NY, USA, 253-260. DOI=10.1145/2382636.2382690 http://doi.acm.org/10.1145/2382636.2382690 (30% acceptance rate)

R.L. Guimarães, P. Cesar, and D.C.A. Bulterman. 2010. Creating and sharing personalized time-based annotations of videos on the web. In Proceedings of the 10th ACM symposium on Document engineering (DocEng '10). ACM, New York, NY, USA, 27-36. DOI=10.1145/1860559.1860567 http://doi.acm.org/10.1145/1860559.1860567 (31% acceptance rate)

Chapter 6 is dedicated to open-ended questions and concluding remarks. This chapter contains extracts from the following article:

D.C.A. Bulterman, P. Cesar and R.L. Guimarães. 2013. Socially-Aware Multimedia Authoring: Past, Present and Future. ACM Transactions on Multimedia Computing, Communications and Applications (TOMCCAP), Volume 9, Issue 1s, Article 35 (October 2013), 23 pages. DOI=10.1145/2491893 http://doi.acm.org/10.1145/2491893

1.4 Related Work

We frame this work within a historical perspective on three areas: conventional authoring systems, interactive storytelling, and video mashups/content repurposing.

1.4.1 Conventional Authoring Systems

At the 1993 ACM Multimedia conference, Hardman et al. [43] presented a paper on structured multimedia authoring. Just over a decade later, this study was revised for the initial issue of ACM TOMCCAP [16]. At that time, multimedia authoring was seen by many as a seminal topic within the research community. As described in these publications, several paradigms existed for compositing (or binding) media objects, including:

• Structure-based composition: composition where the (often hierarchical) logical structure of the components serves as the basis for generating a particular presentation instance timeline;
• Timeline-based composition: composition in which a particular presentation instance determines the content relationships among objects;
• Graph-based composition: composition in which the relationships among objects have cause/effect relationships, but limited logical structure; and
• Script-based composition: composition where the inherent logical structure of elements is hidden as side effects of a procedural execution model.

All of these methods (of which structure-based remains the most compelling) are examples of relatively formal models, in the sense that an explicit authoring activity needs to take place in creating a presentation. This explicit activity was intended to manage the inherent complexity of selecting, editing, combining and positioning media in temporal and spatial dimensions.
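To make the structure-based paradigm concrete, the following minimal sketch derives a presentation instance timeline from a hierarchical logical structure. It is our own illustration, loosely mirroring the seq/par timing containers of declarative languages such as SMIL, and is not the model of any particular system discussed here:

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class Clip:
    uri: str
    dur: float  # duration in seconds

@dataclass
class Seq:
    children: List["Node"]  # children play one after another

@dataclass
class Par:
    children: List["Node"]  # children play simultaneously

Node = Union[Clip, Seq, Par]

def duration(node: Node) -> float:
    """Total duration implied by the logical structure."""
    if isinstance(node, Clip):
        return node.dur
    if isinstance(node, Seq):
        return sum(duration(c) for c in node.children)
    return max((duration(c) for c in node.children), default=0.0)

def schedule(node: Node, start: float = 0.0) -> List[Tuple[float, Clip]]:
    """Flatten the logical structure into a (start_time, clip) timeline."""
    if isinstance(node, Clip):
        return [(start, node)]
    if isinstance(node, Seq):
        timeline, t = [], start
        for child in node.children:
            timeline += schedule(child, t)
            t += duration(child)
        return timeline
    return [slot for child in node.children for slot in schedule(child, start)]

# A story: an intro clip, then two synchronized views of the same song.
story = Seq([Clip("intro.mp4", 5.0),
             Par([Clip("stage_wide.mp4", 120.0), Clip("solo_closeup.mp4", 120.0)])])
for t, clip in schedule(story):
    print(f"{t:6.1f}s  {clip.uri}")
```

The division of labor is the point of the paradigm: the author edits only the logical structure, and the concrete instance timeline is generated from it.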
In many ways, this explicit authoring process was similar to early text processing systems, in which formatting codes and layout directives needed to be directly and overtly inserted into a content stream. In general, formal authoring systems are based on an implicit model in which an editor is assumed to understand the basic aspects of content production. These include understanding:

a) The content alternatives available;
b) The interests (and attention spans) of the intended audience; and
c) The formal or informal narrative and cinematographic principles required to build a compelling story.

While significant steps have been made in better understanding the encoding of narrative structures [32], the management of content and the management of viewer-driven interests provide fruitful areas for new work. We argue that there are two primary reasons that personal content viewers are unresponsive to non-professional content. The first reason is that the opportunity for home editors represented by b) is largely unexploited by formal authoring systems. In many professional editing situations, all three of these aspects have been well understood, albeit for b) at an aggregate level of detail. For more personal content, home editors would seem to have a tremendous advantage: they typically know the person or persons for whom a particular content object is being created. Sometimes the intended audience is relatively diffuse (such as one's 1,000 closest Facebook friends), but other times it can be highly focused: the grandmother of a young high school musician. The second reason that personal content viewers are unresponsive to non-professional content is that conventional formal authoring systems maintain a push model of content rather than a pull model, in which the content viewer is intimately involved in the process of content selection and personalization. This means that the author/editor determines all of the choices, with little infrastructure support for end-user personalization at the detail level.

1.4.2 Interactive Storytelling

During the past decade, various Artificial Intelligence (AI) approaches have been suggested for the creation of configurable and interactive storytelling [49][54]. A main thread of investigation has so far focused on generated content, often involving intelligent animated characters (e.g., Ibanez et al. [35]). Also of note is the use of interactive video as a basis for scenario-driven interactive tours, with additional mini-games for elaborating on specific topics or tasks that arise during the exploration process [2]. Other representative examples are Vox Populi [68], in which rhetorical documentaries are created from a pool of media fragments, and the Narrative Structure Language (NSL), a production-independent framework for the authoring and delivery of configurable and interactive video narratives [52]. More recently, a system capable of creating different story variants from a baseline video was presented [5]. In general, these systems generate sequences of video shots while maintaining local video consistency. In order to support the automated generation of the interactive story, extensive use is made of metadata annotations on individual media objects. These systems have been applied to professionally produced media content, using well-defined (and generic) content and story descriptions.

Our view on socially-aware multimedia authoring differs from typical interactive storytelling approaches in two important ways.
First, the community content that we consider is not professionally produced and annotated. While we provide a reasonable degree of person and object recognition, the poor lighting and overall moderate quality of the content often require user intervention to classify and locate content fragments. A second difference is that, although we focus on storytelling, we explore this concept in the context of repositories of UGC (User-Generated Content). There is still a structured representation of an overall interactive story space, but there is no control over the way the content is captured. The content structures that can be made and exploited are only those emerging from the structure of the covered event itself.

1.4.3 Community Video Mashups and Content Repurposing

A second thread of more general story development is represented by work on video mashups and content repurposing. In this respect, it is interesting to note the current shift from local home video management systems [28][65] to global video sharing Internet services. Recent works [39][44] describe frameworks to synchronize and organize user-contributed content from live music events, creating an improved representation of the event that builds on automatic content matching. Shrestha et al. [59] report on an application for creating mashup videos from YouTube recordings of concerts. They present a number of content management mechanisms (e.g., temporal alignment and content quality assessment) that are then used for creating a multi-camera mashup. Saini et al. [53] go a step further by incorporating history-based diversity measurement, state-based video editing rules, and view quality in automated video mashup generation. Naci and Hanjalic [71] report on a video interaction environment for browsing recordings of music concerts, in which the underlying automatic analyzer extracts the instrumental solos and applause sections in the concert videos, as well as the level of excitement along the performances. Lately, crowdsourcing has proven to be a good ally for content analysis. For example, fans of a band can be useful for improving content retrieval mechanisms, as in a video search engine that allows user-provided feedback to improve, extend, and share automatically detected concepts in video footage recorded during a rock 'n' roll festival [11].

Our work builds on previous findings in event modeling [74] and identification [30][31], and in video abstraction/summarization [7]. The main difference lies in the fact that we do not aim at providing a complete description of the shared event, but a better understanding of how community media can serve individual needs. Other interesting works propose a community video remixing tool [14], a video repurposing tool [66] and a video enrichment system that enables reciprocity [56]. In this direction, we should mention current practices around news stories, where users can reuse fragments of video clips to express opinions [46]. When compared with all these approaches, socially-aware multimedia authoring intends to help end-users generate stories in which the social bonds between people play a major role. The previous approaches did not take into consideration the case in which video authors and the people depicted in the videos are closely related. Similar to us, recent work has proposed a media sharing application that takes into account interpersonal ties.
This tool is capable of producing audiovisual media shows based on events, people, locations, and time [75]. In comparison to our work, this application does not allow for the creation of a narrative-based story from multi-camera community recordings.

2 Personalized Memories of Social Events: Studying Asynchronous Togetherness

This chapter is based on the following papers:

R.L. Guimarães, P. Cesar, D.C.A. Bulterman, V. Zsombori, and I. Kegel. 2011. Creating personalized memories from social events: community-based support for multi-camera recordings of school concerts. In Proceedings of the 19th ACM international conference on Multimedia (MM '11). ACM, New York, NY, USA, 303-312. DOI=10.1145/2072298.2072339 http://doi.acm.org/10.1145/2072298.2072339 (17% acceptance rate)

R.L. Guimarães, P. Cesar, D.C.A. Bulterman, I. Kegel, and P. Ljungstrand. 2011. Social Practices around Personal Videos using the Web. In Proceedings of the ACM Web Science Conference (WebSci '11). Available at http://journal.webscience.org/437/ (15% acceptance rate)

The place: the Exhibition Hall in Prague. The date: August 23, 2009. Radiohead is about to start their concert. The band invites fans to capture personal videos, distributing 50 Flip cameras. After the concert the cameras are collected, and the videos are post-processed along with Radiohead's audio masters. The resulting DVD (available at http://radiohead-prague.nataly.fr; last accessed on May 15th, 2013) captures the concert from the viewpoint of the fans, making it more immersive and proximal than typical concert productions.

The Radiohead concert typifies a shift in the way music concerts – and other social events – are being captured, edited, and remembered. In the past, professionals created a full-featured video, often structured according to a generic and anonymous narrative. Today, advances in non-professional devices are making each attendee a potential cameraperson who can easily upload personalized material to the Web, mostly as collections of raw-cut or semi-edited fragments. From the multimedia research perspective, this shift makes us reflect on and reconsider traditional models for content analysis, authoring, and sharing.

This thesis considers the case in which performers and the audience belong to the same social circle (e.g., parents, siblings and classmates at a typical school concert). Each participating member of the audience records content for personal use, but they also capture content of potential group interest. This content may be interesting to the group for several reasons: it may break the monotony of a single camera viewpoint, it may provide alternative (and better) content for events of interest during the concert (solos, introductions, bloopers), or it may provide additional views of events that were not captured by a person's own camera. It is important to understand that the decision to use substitute or additional content will be made in the particular context of each user separately: the father of the trombone player is not necessarily interested in the content made by the mother of the bass player unless that content is directly relevant for the father's needs. Put another way, by integrating knowledge of the structure of the social relationships within the group, content classification can be improved, and content searching and selection by individual users can be made more effective.
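To give a concrete feel for the substitute-content scenario, consider the following toy sketch – our own hypothetical illustration, not part of the MyVideos system described below. Once all clips have been aligned on a common concert timeline (see Section 2.1.1), finding the community clips that cover moments missed by one's own camera reduces to simple interval arithmetic:

```python
def coverage_gaps(own_clips, concert_end, concert_start=0.0):
    """Return the moments of the concert not covered by one's own clips.

    Clips and gaps are (start, end) pairs, in seconds, on the common
    timeline produced by temporal alignment.
    """
    gaps, t = [], concert_start
    for start, end in sorted(own_clips):
        if start > t:
            gaps.append((t, start))
        t = max(t, end)
    if t < concert_end:
        gaps.append((t, concert_end))
    return gaps

def substitutes(own_clips, community_clips, concert_end):
    """Community clips overlapping at least one gap in one's own coverage."""
    gaps = coverage_gaps(own_clips, concert_end)
    return [clip for clip in community_clips
            if any(clip[0] < g_end and clip[1] > g_start
                   for g_start, g_end in gaps)]

# Example: the father filmed 0-120 s and 300-600 s of a 600 s concert;
# another parent's clip of the 150-250 s solo fills part of the gap.
print(substitutes([(0, 120), (300, 600)], [(150, 250), (10, 60)], 600))
# -> [(150, 250)]
```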
In order to understand the role of the social network among group members in a multi-camera setting, consider the comparison presented in Table 2.1. This table compares the use of multi-camera content in three situations: by a (professional) video crew creating an archival production, by a collection of anonymous users contributing to a conventional user-generated content mashup, and finally within a defined social circle as input for differentiated personal videos. (Semi-)professional DVD-style productions often follow a well-defined narrative model implemented by a human director, and are created to capture the essence of the event. Anonymous user-generated content mashups are created from ad-hoc content collections, often based on content classification methods [44][59]. In socially-aware communities, friends and family members capture, edit and share videos of small-scale social events with the main purpose of creating personal (and not group) memories. (Interested readers can find a video picturing the general concept of personalized community videos at http://www.youtube.com/user/TA2Project#p/u/6/re-uEyHszgM, and an example of a personal video at http://www.youtube.com/user/TA2Project#p/u/4/Ho1p_zcipyA. Last accessed on May 15th, 2013.)

Table 2.1. Handling Multi-Camera Recordings of Concerts.

                                  Professional DVD            Anonymous UGC Mashup           Socially-Aware Community
Preparation, Capturing            Scripted                    Ad-Hoc                         Ad-Hoc
Relationship between Performers
and People Recording              Professional                Similar likings, idols         Family & friends
Intelligent Processes             Human director (planning)   Video search, video analysis   Video analysis, video authoring
Purpose                           Complete coverage           Complete coverage              Memories, bonds

In particular, this chapter considers the following two research questions in the context of a multimedia authoring system built from community assets:

Question 1.1: Can a socially-aware multimedia authoring system be defined in terms of existing social science theories and human-centered processes, and if so, which?

Question 1.2: Does the functionality provided by a socially-aware multimedia authoring system provide an identifiable improvement over traditional authoring and sharing solutions? If so, how can these improvements be validated?

Our work focuses on parents, family members and friends of students participating in a high school concert. In this scenario, parents capture recordings of their children for later viewing and possible sharing with friends and relatives. Working with a test group at local high schools in two different countries (the UK and the Netherlands), we investigate how focused content can be extracted from a shared repository, and how content can be enhanced and tailored to form the basis of a personalized multimedia artifact that can eventually be transferred and shared with family and friends (each with different degrees of connectedness and tie strength with the performer and his/her parents). Results from a four-year evaluation process provide useful insights into how a socially-aware multimedia authoring and sharing system should be designed and architected to help users recall personal memories and nurture their close-circle relationships.

The remainder of this chapter is structured as follows. Section 2.1 discusses the user-centered methodology followed in this thesis, in which both technology and social issues were addressed. Then, motivated by social theories and interviews/focus groups with potential users, Section 2.2 identifies key requirements for socially-aware multimedia authoring and sharing systems. This section addresses the first research question by providing guidelines to realize systems that meet those requirements. Section 2.3 reports on results and findings regarding the utility and usefulness of the proposed framework, thus directly responding to the second research question. Lastly, Section 2.4 concludes the chapter.

2.1 Methodology

This thesis is part of an extended study to better understand the role that multimedia authoring tools can play in improving social communications between friends and families living apart. In particular, we are interested in understanding how individual users can personalize the use of community assets to make unique video stories that can be shared within a closed social circle (see Figure 2.1). This work has been realized in the context of the pan-European project Together Anywhere, Together Anytime (TA2, http://www.ta2-project.eu/). The goal of this project was to understand how technology can improve relationships between groups of people separated in space and time. We focused on an asynchronous authoring and sharing framework in which highly personalized music videos are constructed from a collection of independent parent-made recordings. For that, a system called MyVideos was developed as a collection of configurable processes, each of which allowed us to study one or more aspects of the development of socially-aware multimedia authoring systems.

Figure 2.1. Overview of the requirements and validation parameters for socially-aware multimedia authoring systems.

We have been actively investigating this problem for several years. The methodology reported in this section (and complemented in the next chapters) integrates knowledge from human factors (e.g., focus groups/interviews for needs assessment, iterative prototyping and user evaluation) and document engineering. Potential users have been involved in the design and evaluation process since the beginning of the project, starting with interviews and focus groups, leading up to the evaluation of a two-phased prototype system.

A set of parents from local high schools has actively collaborated with this research. Starting in December 2009, the parents were invited to a focus group that took place in Amsterdam; in April 2010 they recorded (together with some researchers) a concert of their children. From July to September 2010, these parents used our prototype application with the video material recorded at that concert. Based on the feedback and results, the software was re-designed in a second phase. This second time, we involved a high school in Woodbridge (UK), where a concert was recorded in November 2011. Subsequently, the parents that participated in that concert evaluated our second prototype implementation. During these years, we have systematically investigated mechanisms for helping users explore assets from a community collection of videos and for automatically generating 'stories' from these assets based on a narrative model.

2.1.1 Content Recording and Preparation

MyVideos has been tested and evaluated using data recorded at 4 different concerts, as summarized in Table 2.2: a school rehearsal in Woodbridge (http://www.woodbridge.suffolk.sch.uk) in the UK, a jazz concert by an Amsterdam local band called the Jazz Warriors (http://jazzwarriors.nl), a school concert at the St. Ignatius Gymnasium (http://www.ig.nl), and finally another school concert in Woodbridge.

Table 2.2. Data gathering events.

Event                           Date      Duration (approx.)  Musicians  Cameras (incl. master)  Videos Recorded  Media Objects
Woodbridge School (UK)          Dec/2008  50min               25         5 (1)                   100              100
Jazz Warriors (NL)              Nov/2009  50min               20         8 (2)                   220              220
St. Ignatius Gymnasium (NL)     Apr/2010  1h35min             20         12 (2)                  197              197
Woodbridge School Concert (UK)  Nov/2011  1h20min             18         12 (1)                  331              668

In December 2008, at the Woodbridge School (UK), a total of five cameras were used to capture the rehearsal. The master camera was placed in a fixed location, front and center to the stage, set to capture the entire scene (a 'wide' shot), with no camera movement and an external stereo microphone in a static location physically near to the rehearsal performance.

At the end of November 2009, a jazz concert was recorded as part of an asset collection process for MyVideos phase 1. The goal of the capture session was to gain experience with a user setup similar to that expected for the first trial. The concert took place on November 27th, 2009 at the Kompaszaal (http://www.kompaszaal.nl), a public restaurant and performance location in Amsterdam. The Jazz Warriors are a traditional big band with approximately 20 members. In total 8 cameras were used to capture the concert, where two cameras were considered 'masters' and were placed at fixed locations at stage left and stage right. In total, about 220 video clips and approximately 80 images were collected at the event. The longest video clip was 50 minutes, the shortest 5 seconds.

These first two concerts were primarily experimental. They were very useful for testing the automatic processes for analyzing and annotating video clips: a temporal alignment algorithm and a Semantic Video Annotation Suite. The temporal alignment tool is used to align all of the individual video clips to a common time base. The core of the temporal alignment algorithm is based on perceptual time-frequency analysis with a precision of 10 ms. Figure 2.2 sketches the temporal alignment of a recorded dataset (more information on the datasets will be provided below). The level of accuracy of our tool is around 99%, improving on state-of-the-art solutions [44][59]. Since the focus of this thesis is not on content analysis, we will not further detail this part of the system; the interested reader can find the algorithm and its evaluation elsewhere [20]. The Semantic Video Annotation Suite [64] provides basic analysis functions, similar to the ones reported in [59]. The tool is capable of automatically detecting potential shot boundaries, of fragmenting the clips into coherent units, and of annotating the resulting video sub-clips. In the next sections, we discuss the media gathering and annotation processes that preceded the user evaluations of the MyVideos phase 1 and phase 2 prototype implementations.

Figure 2.2. Temporal alignment of a real-life data set from a concert, where a community of users recorded video clips.
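The published alignment algorithm [20] relies on perceptual time-frequency analysis; as a rough, hypothetical sketch of the general idea (not the actual implementation), the offset between two recordings can be estimated by cross-correlating coarse audio energy envelopes, sampled here at 100 frames per second to match the 10 ms precision mentioned above:

```python
import numpy as np

def estimate_offset(ref_env, clip_env, fps=100):
    """Estimate how many seconds a clip starts after the reference recording.

    ref_env, clip_env: 1-D audio energy envelopes sampled at `fps` frames
    per second (100 fps gives 10 ms resolution).
    """
    # Normalize so the correlation reflects envelope shape, not loudness.
    a = (ref_env - ref_env.mean()) / (ref_env.std() + 1e-9)
    b = (clip_env - clip_env.mean()) / (clip_env.std() + 1e-9)
    corr = np.correlate(a, b, mode="full")
    lag = int(corr.argmax()) - (len(b) - 1)  # lag in envelope frames
    return lag / fps

# Example: a 10 s clip whose audio matches the master track 2.5 s in.
rng = np.random.default_rng(0)
master = rng.random(6000)             # a 60 s master-track envelope
clip = master[250:1250]               # a 10 s clip starting at t = 2.5 s
print(estimate_offset(master, clip))  # -> 2.5
```

With every clip placed on the common time base this way, the per-clip offsets are exactly what Figure 2.2 visualizes, and what the navigation and authoring services in the following chapters build upon.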
Then, motivated by social theories and interviews/focus groups with potential users, Section 2.2 identifies key requirements for socially-aware multimedia authoring and sharing systems. This section addresses the first research question by providing guidelines to realize systems that meet those requirements. Section 2.3 reports on results and findings regarding the utility and usefulness of the proposed framework, thus directly responding to the second research question. Lastly, Section 2.4 concludes the chapter.

2.1 Methodology

This thesis is part of an extended study to better understand the role that multimedia authoring tools can play in improving social communications between friends and families living apart. In particular, we are interested in understanding how individual users can personalize the use of community assets to make unique video stories that can be shared within a closed social circle (see Figure 2.1). This work has been realized in the context of the pan-European project Together Anywhere, Together Anytime4 (TA2). The goal of this project was to understand how technology can improve relationships between groups of people separated in space and time. We focused on an asynchronous authoring and sharing framework in which highly personalized music videos are constructed from a collection of independent parent-made recordings. For that, a system called MyVideos was developed as a collection of configurable processes, each of which allowed us to study one or more aspects of the development of socially-aware multimedia authoring systems.

4 http://www.ta2-project.eu/

Figure 2.1. Overview of the requirements and validation parameters for socially-aware multimedia authoring systems.

We have been actively investigating this problem for several years. The methodology reported in this section (and complemented in the next chapters) integrates knowledge from human factors (e.g., focus groups/interviews for needs assessment, iterative prototyping and user evaluation) and document engineering. Potential users have been involved in the design and evaluation process since the beginning of the project, starting with interviews and focus groups and leading up to the evaluation of a two-phased prototype system.

A set of parents from local high schools has actively collaborated with this research. Starting in December 2009, the parents were invited to a focus group that took place in Amsterdam; in April 2010 they recorded (together with some researchers) a concert of their children. From July to September 2010, these parents used our prototype application with the video material recorded at that concert. Based on the feedback and results, the software was re-designed in a second phase. This second time, we involved a high school in Woodbridge (UK), where a concert was recorded in November 2011. Subsequently, the parents that participated in that concert evaluated our second prototype implementation. During these years, we have systematically investigated mechanisms that help users explore assets from a community collection of videos and automatically generate ‘stories’ from these assets based on a narrative model.
2.1.1 Content Recording and Preparation

MyVideos has been tested and evaluated using data recorded at 4 different concerts, as summarized in Table 2.2: a school rehearsal in Woodbridge5 in the UK, a jazz concert by an Amsterdam local band called the Jazz Warriors6, a school concert at the St. Ignatius Gymnasium7, and finally another school concert in Woodbridge.

5 http://www.woodbridge.suffolk.sch.uk
6 http://jazzwarriors.nl
7 http://www.ig.nl

At the Woodbridge School rehearsal (UK) in December 2008, a total of five cameras were used to capture the rehearsal. The master camera was placed in a fixed location, front and center to the stage, set to capture the entire scene (a ‘wide’ shot), with no camera movement and an external stereo microphone in a static location physically near the rehearsal performance.

At the end of November 2009, a jazz concert was recorded as part of the asset collection process for MyVideos phase 1. The goal of the capture session was to gain experience with a user setup similar to that expected for the first trial. The concert took place on November 27th, 2009 at the Kompaszaal8, a public restaurant and performance location in Amsterdam. The Jazz Warriors is a traditional big band with approximately 20 members. In total 8 cameras were used to capture the concert, of which two were considered ‘masters’ and were placed at fixed locations at stage left and stage right. In total, about 220 video clips and approximately 80 images were collected at the event. The longest video clip was 50 minutes, the shortest 5 seconds.

8 http://www.kompaszaal.nl

Table 2.2. Data gathering events.

Event                            Date      Duration (approx.)  Musicians  Cameras (incl. master)  Videos Recorded  Media Objects
Woodbridge School Concert (UK)   Dec/2008  50min               25         5 (1)                   100              100
Jazz Warriors (NL)               Nov/2009  50min               20         8 (2)                   220              220
St. Ignatius Gymnasium (NL)      Apr/2010  1h35min             20         12 (2)                  197              197
Woodbridge School Concert (UK)   Nov/2011  1h20min             18         12 (1)                  331              668

These first two concerts were primarily experimental. They were very useful for testing the automatic processes for analyzing and annotating video clips: a temporal alignment algorithm and a Semantic Video Annotation Suite. The temporal alignment tool is used to align all of the individual video clips to a common time base. The core of the temporal alignment algorithm is based on perceptual time-frequency analysis with a precision of 10ms. Figure 2.2 sketches the temporal alignment of a recorded dataset (more information on the datasets will be provided below). The accuracy of our tool is around 99%, improving on state-of-the-art solutions [44][59]. Since the focus of this thesis is not on content analysis, we will not detail this part of the system further. The interested reader can find the algorithm and its evaluation elsewhere [20]. The Semantic Video Annotation Suite [64] provides basic analysis functions, similar to the ones reported in [59]. The tool is capable of automatically detecting potential shot boundaries, of fragmenting the clips into coherent units, and of annotating the resulting video sub-clips.

In the next sections, we discuss the media gathering and annotation processes that preceded the user evaluations of the MyVideos phase 1 and phase 2 prototype implementations.

Figure 2.2. Temporal alignment of a real-life data set from a concert, where a community of users recorded video clips.
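To make the alignment idea concrete, the sketch below estimates the offset of a clip against a master recording by cross-correlating per-frame audio energy envelopes. This is a deliberately simplified illustration of the general principle only; the actual algorithm, its perceptual time-frequency analysis and its evaluation are described in [20], and all function and variable names here are our own.

```javascript
// Toy illustration of temporal alignment by cross-correlation of audio
// energy envelopes. The real algorithm uses perceptual time-frequency
// analysis with ~10 ms precision; see [20]. All names here are our own.

// Compute a per-frame energy envelope from PCM samples.
function energyEnvelope(samples, sampleRate, frameMs) {
  const frameLen = Math.round(sampleRate * frameMs / 1000);
  const env = [];
  for (let i = 0; i + frameLen <= samples.length; i += frameLen) {
    let e = 0;
    for (let j = i; j < i + frameLen; j++) e += samples[j] * samples[j];
    env.push(e);
  }
  return env;
}

// Lag (in frames) of envelope b relative to master envelope a that
// maximizes their mean cross-correlation. Naive O(maxLag * n) search.
function bestLag(a, b, maxLag) {
  let best = { lag: 0, score: -Infinity };
  for (let lag = -maxLag; lag <= maxLag; lag++) {
    let score = 0, n = 0;
    for (let i = 0; i < a.length; i++) {
      const j = i - lag;
      if (j >= 0 && j < b.length) { score += a[i] * b[j]; n++; }
    }
    if (n > 0 && score / n > best.score) best = { lag, score: score / n };
  }
  return best.lag;
}

// Example: place a clip on the concert's common time base.
// const frameMs = 10;
// const lag = bestLag(energyEnvelope(masterPcm, 48000, frameMs),
//                     energyEnvelope(clipPcm, 48000, frameMs), 30000);
// const clipOffsetMs = lag * frameMs;
```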
2.1.1.1 Data Gathering for Phase 1

On April 16th, 2010, the concert of the Big Band – the school band of the St. Ignatius Gymnasium – was recorded. In this case a core group of parents took part in the recordings and provided the research team with all the material. In total around 197 media objects were collected for a concert lasting about 1 hour and 35 minutes. Twelve (12) cameras were used, two of them as master cameras.

Once the footage was captured, the process of tagging people, instruments and songs was realized in two stages. The first stage was carried out manually. This task was performed by looking through the videos and marking a line in a spreadsheet for each event (in practice, almost always multiple lines, to account for multiple people/instruments). There were 7 kits in this process; each kit included 10 video files, ranging in length from about 5 seconds to 5 minutes. The quickest person took about 1 hour to complete a kit, while the longest kit took about 6 hours. The total time spent annotating ‘manual’ kits was approximately 16 hours. Later, a second approach was implemented, using a pre-populated data spreadsheet and an annotation sheet with drop-down boxes taking data from the datasheet. This approach was more effective, and the total time spent annotating 8 kits was approximately 12 hours. Adding the time spent annotating the master track, a rough approximation of the total time spent annotating the concert is about 40 hours. After the annotation phase, the initial prototype was ready to be evaluated.

2.1.1.2 Data Gathering for Phase 2

For the evaluation of the second prototype implementation, new recordings took place again at the Woodbridge high school (UK) in November 2011. The concert lasted around 1 hour and 20 minutes, in which 18 students performed in 14 songs. A total of twelve cameras were used to capture the concert. The master camera was placed in a fixed location, front and to the side of the stage. Eight cameras were distributed among parents, relatives, and friends of performers. Members of the research team used the other 3 cameras. In total about 331 raw video clips were captured, some of which were recorded before or after the event.

For this dataset, a hired group of people manually sub-clipped and annotated songs and performers. The total amount of time spent examining, sub-clipping and preparing the footage was around 156 hours. This includes a number of tasks apart from annotating clips, such as importing and transcoding all the videos to the same format, sub-clipping the footage, assigning annotations, transferring the annotations to machine-readable CSV (Comma-Separated Values) files via OCR (Optical Character Recognition), and error checking. The outcome of this process was the creation of 668 sub-clips – or media objects – out of the 331 original videos (see Table 2.2), used in the evaluation of MyVideos phase 2.

2.1.2 MyVideos Implementation

The MyVideos application has been implemented as a Web-based application, targeting users with little technical background. From the user viewpoint this means that they only need access to the public Internet, and everything runs within a JavaScript-enabled Web browser on their device. The server components are hosted on a dedicated testbed with a high-bandwidth symmetrical Internet connection and virtualized processor clusters dedicated to hosting Web applications and serving video.
In our architecture, each school would rent space and functionality on the testbed in order to make systems like ours available to their community. The server side of our system includes a Mongrel Web application server (implemented in Ruby on Rails), a narrative engine (implemented in Java) that creates personalized narratives, a MySQL database that stores all the relational data concerning the media assets, and a media server that stores the recorded video clips and delivers them through HTTP (Hypertext Transfer Protocol) video streaming. The communication between the Web application and the narrative engine uses JavaScript Object Notation (JSON). Only the application server and the video server are directly accessible through the Internet, while the remaining components are hidden from the outside world.

The client side only requires a Web browser and the Ambulant Player9, for playing the video compilations in SMIL (Synchronized Multimedia Integration Language) [17]. The application on the clients’ devices was implemented using JavaScript and AJAX (Asynchronous JavaScript and XML). Additional JavaScript libraries have been used to simplify the development of the client-side software. In particular, YUI 2 and jQuery have been useful for event handling and AJAX interactions. For playback of individual video clips, two different solutions have been used. When supported by the browser, HTML5 video elements have been used (e.g., for an iPad implementation). Otherwise, we used an embedded Flash player (JW Player).

9 http://www.ambulantplayer.org

2.1.3 Participants

The number of participants in both phases was kept small so that we could establish direct and long-term relationships. The qualitative nature of our interactions provided us with a deep understanding of the ways in which people currently share experiences to foster strong ties. The participants involved in both phases represent a realistic sample for the intended use case: parents, relatives, and/or friends of the kids going to the same high school; all of them tend to record the kids, and some of them have some experience with multimedia editing tools. We believe that this sample of users provides a relevant picture of the ways people currently record videos of other people they care about, and how they use such footage to share experiences within their (probably restricted) social group.

Since our main focus is to better understand small groups of people with strong interpersonal ties, the evaluation of MyVideos was realized with a fixed selection of users. It would have been impossible to do crowdsourced testing, since we wanted to explore the fact that people had a social connection with the recorded footage. This section describes the subjects and the methodology applied in each evaluation phase.

2.1.3.1 Phase 1 Setting

Figure 2.3. The makeup, age and gender of participants in phase 1 evaluation.

As illustrated in Figure 2.3, 7 people – relatives and friends of the performers who attended the school concert in Amsterdam – were recruited. The rationale used for selecting the participants was diversity. We wanted to gather as many roles as possible to better understand the social needs of our potential users.
The participants were three high school students, a social scientist, a software engineer, an art designer and a visual artist, resulting in a variety of needs that may influence video capturing, editing and sharing behaviors. All participants were Dutch. The average age of the participants was 37.1 years (SD = 20.6 years); 3 participants (42.8%) were female. Among the participants, 3 had children (ranging from 14 to 17 years old). All participants were living in the Netherlands except the uncle of a performer, who lived in the US. He was recruited to serve as an external participant (the only one who was not present at the concert).

The prototype evaluation was conducted over a two-month span in the summer of 2010 (Jul-Sep). More interested in subjective results than in statistical data, we took an approach that was largely exploratory and interactive. The evaluation process consisted of 2 sessions. The initial one was used to collect background information about video recording habits, e.g., participants’ intentions and the social relations around media. We also used this session as an opportunity to understand how participants conceptualized the concert. The second (in-depth) session was dedicated to capturing the video editing practices and media sharing routines of the participants, based on their interactions with the system. We used the footage they had recorded during the high school concert in the spring of 2010 to evaluate our initial prototype system. Both sessions started with an ice-breaking activity on the whiteboard, followed by discussions around the research questions.

2.1.3.2 Phase 2 Setting

Figure 2.4. The makeup, age and gender of participants in phase 2 evaluation.

Thirteen (13) people (from 6 families) participated in the evaluation of our second prototype implementation. Participants consisted of performers, parents and other relatives of the teenagers that performed in the Woodbridge school concert, as illustrated in Figure 2.4. All participants were English speakers and were living in the UK. Seven of them (~54%) were 40+ years old; the other 6 were in the 11-20 age range, 4 of whom performed in the concert. Six (6) participants were female. Participants kindly volunteered, and the experiments were conducted over a two-month span at the beginning of 2012 (Jan-Feb).

We used a semi-structured approach for data collection. We started the individual interviews by explaining the high-level goals of our system and by asking participants about their video recording and sharing practices. Then, the participants were instructed to interact with the prototype system and to answer the evaluation questionnaires. Nine (9) out of the 13 participants committed to filling in the questionnaires discussed in Section 2.3.

2.2 Generic Architecture for Socially-Aware Authoring Systems

The motivation of our work is rooted in the inherent need of people to socialize and to nurture relationships. As discussed in the previous section, we followed an interdisciplinary approach in which both technology and social issues were addressed.
At the core of this approach was the establishment of a long-term relationship with a group of parents at local high schools (in the UK and in the Netherlands) as a basis for gathering requirements, evaluating prototype implementations and validating the socially-aware authoring concept proposed in this thesis. Motivated by social theories and focus groups/interviews with potential users, in this section we formalize the general guidelines for realizing socially-aware multimedia authoring and sharing systems. In Section 2.3 and in the next chapters, we discuss the evaluation of MyVideos, a system that realizes and validates these guidelines. The design and architecture of our socially-aware multimedia authoring framework are direct results of the long-term process reported in this thesis.

2.2.1 Social Science Principles

The experimental methodology presented in this thesis is based on two social science theories: social connectedness and the strength of interpersonal ties.

Social connectedness theory helps us to understand how social bonds are developed over time, and how existing relationships are maintained. Social connectedness happens when one person is aware of another person, and confirms his/her awareness [67]. Such awareness between people can be natural and intense or lightweight. As reported elsewhere, even photos [72] and sounds [42] can be good vehicles for creating the feeling of connectedness.

Figure 2.5. A schematic view of the perceived strength of a social bond over time in relation to our scenario.

Figure 2.5 illustrates a schematic view of the perceived strength of a social bond over time, showing recurring shared events (‘interaction rituals’ in the Durkheim sense [23]), with a fading strength of the social bond in between. The peaks in the figure correspond to intense and natural shared moments, when people participate in a joint activity (e.g., a music concert), re-affirming their relationships and extending their common pool of shared memories. The smaller peaks correspond to social connectedness actions, such as sending a text message or sharing a personalized video of the shared event, that build on such shared memories. If we were to follow social connectedness theory, we would design a system that mediates the smaller peaks and thus helps in fostering relationships over time.

Granovetter [55] defines interpersonal ties as: “… a combination of the amount of time, the emotional intensity, the intimacy (mutual confiding), and the reciprocal services which characterize the tie.” If we were to design a video sharing system intended for family and friends, we would exploit the social bonds between people by taking into account their personal relationships (intimacy). The system would provide mechanisms for personalizing the resulting videos (adding personal intensity) with some effort (amount of time), and would allow the recipient to acknowledge and reply to the creator (reciprocity).

2.2.2 Family Interviews and Focus Groups

In order to better understand the problem space, we involved a representative group of users at the beginning of the evaluation process. The first evaluation, in 2008, consisted of interviews with sixteen families across four countries (UK, Sweden, the Netherlands, and Germany).
The second evaluation, consisting of in-depth focus groups with three parents each, was run in the summer of 2009 in the UK and in December 2009 in the Netherlands.

As social connectedness theory suggests, many participants engaged in varied forms of media sharing. Participants felt that reliving memories and sharing experiences helped them (and other households) feel closer. Parents e-mailed pictures of the kids playing football to the grandparents, or shared holiday pictures via Picasa, on disk, or using Facebook, enabling friends and families to stay in touch with each other’s lives. Nevertheless, the interviewed people said that if they shared media, they would do so via communication methods they perceived as private, and then only with trusted contacts. There was a general reticence among the parents towards existing social networking sites. In the UK, the parents stressed that they would not share the videos with ‘the world’, but would share them with other family members for fun. For example, when asked about YouTube one parent said:

“I haven’t... my wife’s side of the family... they’re always putting clips of video on YouTube and all these sorts of things... that makes me cringe a bit... I think… well, why would I want to do that? Do I think that’s interesting to anybody?”

A number of parents reported photography as a hobby and would routinely edit their shared images. Their children, on the other hand, even if interested in photography, seemed less keen to manually edit pictures, and declared a strong preference for automatic edits or relied on their parents. The participants would then discuss the incidents relating to the pictures later on with friends and family, on the phone or at the next reunion. Home videos tended to be watched far less frequently, although the young pre-teen participants appreciated them and were described by their parents as having “worn the tape[s] down” from constant viewing when much younger.

Based on the interviews, we concluded that current social media approaches are not adequate for a family or a small social group to store and share collections of media that are personal and important to them [63]. Much richer systems are needed, and they will become an essential part of life for family relationships. In general, the participants’ responses converged on:

• A willingness to engage in diversified forms of recollection through recorded videos;
• A clear requirement for systems that could be trusted to ensure privacy;
• A positive reaction to the suggestion of automatic and intelligent mechanisms for managing home videos.

In each case, creating personalized video stories (tailored for family use) remained a core issue.

2.2.3 Requirements Gathering

Figure 2.6 a) shows the answers of the participants in Amsterdam to the questionnaires about video recording and editing practices during the phase 1 evaluation. Most participants said they often record videos at social events (e.g., family gatherings, vacation trips and/or school concerts). However, validating previous studies [19], they rarely look at the recorded material afterwards. According to the participants, one problem is the relatively high number of media assets captured during an event – for instance, around 200 media assets from 12 cameras for a concert lasting 1h35min. Another problem is that the footage, as captured, cannot be easily explored.
For most of them, video editing was considered time-consuming and far too complicated. Therefore, they rarely edit their videos. Most users said that they had an editing suite at home. PC users were familiar with Windows Movie Maker, Mac users with iMovie. Some participants described how they would create a movie about the high school concert using their preferred editing tool. They would choose some clips and drag them to the timeline. Then, they would use the visual effects, transitions and sounds that are usually provided with the video editing software. In general, they indicated that they would tell the story of the concert using their personal videos. Some participants mentioned that video editing could also demand high processing power, which would slow down the computer. As a workaround, they would occasionally (between ‘sometimes’ and ‘never’) perform minor editing operations (e.g., clipping) on their own video camera.

Figure 2.6. Results of the questionnaires about social practices around personal videos (phase 1 evaluation): a) video recording and editing practices; b) media sharing habits and social relations.

Figure 2.6 b) presents the results of the questionnaires about media sharing habits and social relations around the media. Participants said they sometimes watched videos on YouTube, and many of them used Facebook quite frequently (‘always’). However, they were not used to tagging videos and/or photos (between ‘never’ and ‘rarely’). When prompted about whether and how they shared their videos, they repeatedly said that in general they rarely posted personal videos on the Web. While the youngest participants argued that their personal videos were not interesting enough, for our older respondents privacy was the main reason not to share personal videos on the Web.

“It is personal… if I make a personal shot, a close-up of my daughter, for example, and I do this for personal reasons, I never do this for the others.” (Mother of a performer)

Figure 2.7. Social media habits (phase 2 evaluation).

Figure 2.7 shows some responses of the British participants to the background questions related to media capturing, editing, and sharing. Most subjects said they rarely record videos at social events (less frequently than the group in Amsterdam). They did, however, declare that they sometimes look at the videos they recorded after the event has taken place. Five (5) out of 9 participants said they were unfamiliar with video editing tools and, therefore, never edit their videos. The vast majority said they were quite concerned about sharing personal videos on the Internet, and were not used to doing so (6 participants said ‘never’, while 1 said ‘rarely’).

Based on these general user needs, social theories and the initial interviews with focus groups, we defined a number of requirements for socially-aware multimedia authoring and sharing systems, as follows:

i. Support social connectedness: the system should provide tools and mechanisms for maintaining relationships over time. The goal is not so much to support the high-intensity moments – the event itself – but the small peaks of awareness (recollection of the event);

ii. Support privacy and control: most parents in the interviews and the focus groups expressed that current video sharing models do not fit the needs of family and friends due to privacy issues.
Thus, new systems should address the parents’ concerns and provide adequate privacy mechanisms;

iii. Support effortless interaction: people are reluctant to invest time in processes they consider could be done automatically. Future systems should include automatic processes for analyzing, annotating, and creating videos of the shared event; and

iv. Support personal effort, intimacy and reciprocity: while such automatic processes lower the burden for the user, they do not conform to existing social theories. Since we do not want to limit the joy of handcrafting videos for others, systems should offer optional manual interfaces for personalization purposes.

We used these requirements as the basis for specifying the guidelines discussed in the next section.

2.2.4 Guidelines

In order to support the social theories described in Section 2.2.1 and the requirements identified in Section 2.2.3, our socially-aware multimedia authoring framework considers a number of automatic, semi-automatic and manual processes that assist in the exploration of media and the creation of personal memories of an event. These processes balance convenience and personal effort when making targeted, personalized videos. Emotional intensity is provided by a recommendation algorithm that searches for people and moments that might bring back memories to the user. For mediating intimacy, our framework proposes means to enrich videos for others by including highly personalized comments. With these features we intend to increase the feeling of connectedness, particularly among family members and friends who could not attend the social event.

2.2.4.1 Supporting Emotional Intensity

An assumption leading to the design of our socially-aware framework was that in a community setting, users are particularly interested in looking for video clips in which people close to them are featured (social-based searches). This assumption is validated in Chapter 3, which presents our efforts in designing and implementing an interface for browsing multi-camera recordings. The core of the navigation interface is a recommender algorithm that takes into account not only the filters selected by the user and the content quality assessment, but also the recording behavior of each user individually. This feature considers the semantic annotations associated with the user’s media and the subjects that most frequently appear in his/her recordings. For example, a father can make a request for his daughter playing ‘Cry Me a River’, since he remembers this was an emotive moment of the concert. Given an example query:

SelectedPersons = [Julia]; SelectedSong = [Cry Me a River].

The result will be:

QueryPersons(Julia) ∩ QueryEvents(Cry Me a River)

The query algorithm works as follows:

1. Select fragments of the video clips matching the query; in case of complex queries, select intersecting sets;
2. If the result consists of one fragment, return it;
3. If the result consists of more than one fragment, order the resulting list based on the following criteria:
• The requested person;
• The video clips uploaded by the logged-in user;
• The subjects that appear more frequently in the video clips uploaded by the logged-in user (affection parameter);
• The content quality assessment (e.g., shot type, resolution, duration).
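As a concrete illustration, the following JavaScript sketch implements the ordering above. The clip fields (persons, song, uploadedBy, quality) and the numeric weights are illustrative assumptions; the thesis specifies the criteria and their priority, not this particular implementation.

```javascript
// Illustrative sketch of the socially-aware query and ranking described
// above. Field names and weights are assumptions made for this example.

// How often each person appears in the logged-in user's own uploads
// (the 'affection parameter').
function affectionScores(clips, userId) {
  const counts = {};
  clips.filter(c => c.uploadedBy === userId)
       .forEach(c => c.persons.forEach(p => { counts[p] = (counts[p] || 0) + 1; }));
  return counts;
}

function queryFragments(clips, userId, selectedPersons, selectedSong) {
  // 1. Select fragments matching the query (intersection for complex queries).
  const matches = clips.filter(c =>
    selectedPersons.every(p => c.persons.includes(p)) &&
    (!selectedSong || c.song === selectedSong));
  // 2. A single match (or none) is returned as-is.
  if (matches.length <= 1) return matches;
  // 3. Otherwise, order by the criteria listed above. Weights are arbitrary.
  const affection = affectionScores(clips, userId);
  const score = c =>
    4 * selectedPersons.filter(p => c.persons.includes(p)).length + // requested person(s)
    3 * (c.uploadedBy === userId ? 1 : 0) +                         // own uploads
    2 * c.persons.reduce((s, p) => s + (affection[p] || 0), 0) +    // affection parameter
    c.quality;                               // quality assessment (e.g., 0..1)
  return matches.slice().sort((a, b) => score(b) - score(a));
}
```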
In addition to the query interface that allows users to find moments they particularly remember, a socially-aware multimedia authoring framework should offer optional manual interfaces for improving semantic annotations. When users are searching for specific memories, it might happen that results are not accurate due to errors in the annotations. Our approach considers that users can correct such annotations while previewing individual clips. For example, they can change/add/remove the name of the performer and the title of the song.

2.2.4.2 Reflecting Personal Effort

One of the major differentiators of our work is that its primary purpose is not the creation of an appealing video summary of the event or the creation of a collective, collaborative community work. Instead, our approach intends to facilitate the reuse of collective content for individual needs. Rather than using personal fragments to strengthen a common group video, our work takes group fragments to increase the value of a personal video. Each of the videos created by a socially-aware multimedia authoring system should be tailored to the needs of particular members of the group – the video created for the father of the trombone player will be different from the one for the mother of the bass player, even though they may share some common content.

Users should be able to automatically assemble a story based on a number of parameters, such as the people to be featured, the songs to be included, and the duration of the compilation. Such a selection triggers a narrative engine that creates an initial video using the multi-camera recordings. The narrative engine selects the most appropriate video fragments from the repository, based on the user preferences, and assembles them following basic narrative constructs. Given an example query:

SelectedPersons = [Julia]; SelectedSong = [Cry Me a River]; SelectedDuration = [3 minutes].

The algorithm extracts the chosen song from the master audio track, and uses its structure as the backbone for the narration. It then selects all the video content aligned with the selected audio fragment; the master video track provides a good foundation and a possible fallback for fragments that are not well covered by individual recordings. The audio object is the leading layer and, in turn, is made of AudioClips. This structure generates a sequence of all the songs that relate to the query. As soon as the length of the song sequence extends beyond the SelectedDuration, the compilation is terminated. The video object has the role of selecting appropriate video content in sync with the audio. An example of the selection criteria is the following:

1. Select a video clip that is in sync with the audio;
2. Ensure time continuity of the video;
3. If there are more potential clips that ensure continuity, select those with Person annotations matching the user choices stored in SelectedPersons;
4. If the result consists of more clips, select those whose Instruments annotation matches the instruments that are active in the audio layer;
5. If the result consists of more clips, select those whose Person annotation matches the persons currently playing.

Once the automatic authoring process is complete, a new video compilation is created in which the selected song and people are featured.
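A minimal sketch of this selection loop follows. The segment and clip fields (start, end, persons, instruments, activeInstruments, playingPersons, masterClip) are illustrative assumptions; the actual engine is a Java component that additionally applies the narrative constructs reported in [76] and emits the resulting compilation as a SMIL document for playback.

```javascript
// Toy illustration of the clip-selection criteria 1-5 above. The field
// names used here are our own assumptions, not the engine's actual API.

function overlaps(clip, seg) {            // criterion 1: in sync with the audio
  return clip.start < seg.end && clip.end > seg.start;
}

function pickClip(candidates, seg, prevClip, selectedPersons) {
  let pool = candidates.filter(c => overlaps(c, seg));

  // 2. Prefer clips that continue the previous shot (time continuity).
  if (prevClip) {
    const cont = pool.filter(c => c.start <= prevClip.end && c.end > prevClip.end);
    if (cont.length > 0) pool = cont;
  }

  // 3-5. Narrow down by requested persons, then by the instruments active
  // in the audio layer, then by the persons currently playing; each filter
  // is applied only when it leaves at least one candidate.
  const narrowers = [
    c => c.persons.some(p => selectedPersons.includes(p)),
    c => c.instruments.some(i => seg.activeInstruments.includes(i)),
    c => c.persons.some(p => seg.playingPersons.includes(p)),
  ];
  for (const keep of narrowers) {
    const narrowed = pool.filter(keep);
    if (narrowed.length > 0) pool = narrowed;
    if (pool.length === 1) break;
  }
  // The master track serves as a fallback for badly covered segments.
  return pool[0] || seg.masterClip;
}

// The audio layer leads: pickClip runs once per AudioClip segment of the
// selected song, until SelectedDuration is exceeded.
```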
As reported elsewhere [76], such narrative constructs have been developed and tested together with professional video editors. Our assumption, based on the social theories, was that automatic methods – while useful – are not sufficient for creating personal memories of an event. This assumption is validated in Chapter 4. Figure 2.8 shows a comparison between automatic and manual generation of mashups. Automatic techniques are better suited for group needs, such as complete coverage or a summary of the event, but are not capable of capturing subtle personal and affective bonds. We argue instead for hybrid solutions, in which manual processes allow users to add their personal view to automatically assembled videos.

Figure 2.8. Comparison between automatic and manual generation of video compilations. Automatic methods are not sufficient for creating personal and intimate memories.

A socially-aware multimedia authoring system should provide such interfaces for manually fine-tuning video compilations. Users can improve and personalize existing productions by including other video clips from the shared repository. For example, a parent can add more clips in which his daughter is featured when sharing with grandma, or he can instead add a particularly funny moment from the event when creating a version for his brother. As we will discuss in Chapter 4, participants liked this functionality, which automatic processes are not able to provide.

2.2.4.3 Supporting Intimacy and Enabling Reciprocity

Apart from allowing fine-tuning of assembled video stories, a socially-aware multimedia authoring system should enable users to perform enrichments. Users can record an introductory audio or video, leading to more personalized stories. As we will see in Chapter 4, this functionality (we call it ‘capture me’) was appreciated by most of our participants.

Our framework also addresses reciprocity by enabling life-long editing and enriching of compiled videos. As indicated before, videos created using our framework can be manually improved and enriched using other assets from the repository, and by adding personal video and audio recordings. In Chapter 5, we go a step further, discussing the possibility of recipients adding comments synchronized to specific moments within the video productions. Thus, users receiving an assembled video story can easily include further timed comments as a reciprocal action intended for the original sender. For example, a grandmother who receives a video story from her son might add an “Isn’t my granddaughter cute?!” reply as a reciprocal message within the video. The main benefit is that this functionality enables people to comment on and enrich existing video stories.

2.2.4.4 Guidelines relative to Requirements

In addition to supporting emotional intensity (requirement i), reflecting personal effort, and supporting intimacy and enabling reciprocity (requirement iv), our socially-aware multimedia authoring framework also meets the other requirements identified in Section 2.2.3, as discussed below.

Using a trusted media storage server (provided, for instance, by the school), we address the privacy issue (requirement ii). Parents can upload the material from the concerts to a common media repository.
The repository is a controlled environment, since it is provided and maintained by the school instead of being an external resource controlled by a third-party company. Moreover, all the media material is tagged and associated with the parent who uploaded it, and there are mechanisms so that parents can decide not to share certain clips in which their children appear. Users can use their credentials for navigating the repository – those parts allowed to them – and for creating different stories for different people.

The requirement on effortless interaction (requirement iii) is met by the provision of a number of automatic processes that analyze and annotate the videos, and that help users navigate media assets and create memories. As introduced in the previous subsections, users can navigate the video repository using a recommender algorithm, and they can automatically generate video compilations from the multi-camera recordings.

2.3 Evaluation

In this section we report on the evaluation of the utility and usefulness of our socially-aware multimedia authoring framework. In particular, our results address the requirements on social connectedness and on privacy and control (requirements i and ii, respectively). As described above, the evaluations of the prototype system have taken place in two different countries (the UK and the Netherlands) since 2008, when we started exploring this novel area of research. Our results have been obtained via questionnaires, user testing and observations.

During the phase 1 evaluation, users were instructed to interact with the MyVideos prototype system after responding to the background survey presented in Section 2.2.3.

Figure 2.9. Utility and usefulness of MyVideos. Results of the questionnaire from the phase 1 evaluation (Amsterdam/NL).

Figure 2.9 presents the answers regarding the overall assessment after users interacted with the system. In general, participants liked MyVideos and considered its functionality useful (Q1.1). Based on the received feedback, we can conclude that participants appreciated the benefits of our system and considered it a valuable vehicle for remembering events, thus improving social connectedness (requirement i). In particular, participants largely agreed that MyVideos would help them in recalling memories of social events (Q1.2). They also indicated that by using MyVideos they would share more videos with others (Q1.3). As shown in Figure 2.5, this feedback is aligned with the small peaks of awareness we intended to mediate with socially-aware multimedia authoring tools.

It might be surprising that although participants liked the system, some of them said that they did not find it much ‘safer’ than other video sharing services (Q1.4), or that they would not pay for it (Q1.5). As discussed earlier, the first issue is motivated by privacy concerns (requirement ii). Most senior users were reluctant to upload material beyond their own reach, i.e., their hard drive (even though our repository was a controlled environment). For the latter issue, we present more insights from the second evaluation process. Lastly, most of our subjects said that current home video management and sharing systems do not satisfy their needs (Q1.6). When questioned whether their own video material would be enough to create a compelling video, they mainly answered negatively (Q1.7).
They agreed that content captured by other people who participated in the same event could be interesting for others (Q1.8). However, most of the users asserted that current tools do not allow for easily watching and repurposing other parents’ footage.

Figure 2.10. Utility and usefulness of MyVideos. Results of the questionnaire from the phase 2 evaluation (Woodbridge/UK).

Figure 2.10 presents the answers to the questions related to the utility and usefulness of the second prototype system, including comparisons to other existing solutions. Overall, participants were enthusiastic about MyVideos (Q2.1). As in the phase 1 evaluation, all participants declared that our socially-aware multimedia authoring framework helped them to recall memories of social events (Q2.2), and that it made them feel more connected with their loved ones (Q2.3). These results directly meet requirement i.

“Overall, I had great fun. It was more than just getting into that concert again. It was doing something completely different. Almost like another activity. Which could almost have been anything. But the fact it was this concert, with my daughter in it, made it extra special.” (Father of a performer)

“I was especially keen to use this to create a video of my son playing cello to share with my father who lives in Wales… I actually don’t have any videos of him playing cello as it is often not the done thing to video concerts…” (Mother of a performer)

Similarly to the result obtained in the phase 1 evaluation, participants indicated they would share more videos if they had a tool like ours at hand (Q2.4). However, only 4 (out of 9) considered the system ‘safer’ than current video sharing services (Q2.5), while 5 said they would spend money on it (Q2.6). One user argued about the cost-benefit of having a system like MyVideos.

“Maybe I would pay for it, but it depends on cost and how much it would be used.” (Mother of a performer)

On the other hand, a teenager justified his opinion, which is common among his age group.

“I tend not to bother with paid services; I just do without the service.” (Brother of a performer)

It is important to highlight that most participants agreed that the material they usually capture is not sufficient to create a good video memory of an event (Q2.7). Therefore, it would be useful to have access to the content recorded by other parents (Q2.8). Based on the participants’ comments and answers, we get a strong sense that current tools are not enough to meet their needs. Current video sharing platforms on the Web do not allow a collection of families that may have limited interactions to be brought together by contributing media assets for common use.

2.4 Discussion

In this chapter we reformulated the research problem of multimedia authoring by investigating mechanisms and principles for togetherness and social connectivity around media. Over 4 years, our user-centered methodology involved interviews/focus groups with users, prototype implementations and user evaluations. Motivated by general user needs, social theories and the initial interviews, we specified a set of guidelines for the design and implementation of socially-aware multimedia authoring and sharing tools. We aim at nurturing strong ties and improving social connectedness by supporting emotional intensity, personal effort and intimacy, and by enabling reciprocity.
As shown in this chapter, our approach is aligned with the requirements of social communities that are not addressed by existing social media Web applications. These guidelines characterize the first contribution of this chapter, and directly answer the first research question.

The overall evaluation process of a system that realizes these guidelines represents the second contribution of this chapter. It contemplated a long-term process in the Netherlands and in the UK, in which people actively participated and recorded concerts of their relatives/friends. Results from the evaluation process show that the functionality provided by our socially-aware multimedia authoring system meets our requirements and brings an identifiable improvement over traditional approaches. These results, which are complemented by other findings in the next chapters, directly answer our second research question, and show that a system like ours is a valid alternative for social interaction while apart.

In the next chapters, we look in detail at each step that composes the socially-aware multimedia authoring workflow discussed in Chapter 1. First, in Chapter 3 we present our efforts in enabling community-based users to explore and navigate a large content space based on their personal interests. By following the emotional intensity guideline, our design meets requirement i (social connectedness). Then, in Chapter 4 we discuss the balance between convenience and personal effort when generating highly personalized video compilations of targeted interest within a social circle. This chapter addresses the personal effort guideline, and the evaluation results show that we meet requirements iii and iv (effortless interaction and personal effort/intimacy, respectively). Finally, following the intimacy and reciprocity guidelines, Chapter 5 turns its attention to supporting the recipient in commenting within a video story (requirement iv).

3 Designing Socially-Aware Video Exploration from Community Assets1

1 This chapter is based on the following paper: D.C. Pedrosa, R.L. Guimarães, P. Cesar and D.C.A. Bulterman. 2013. Designing Socially-Aware Video Exploration: A Case Study Using School Concert Assets. In Proceedings of the 17th International Academic MindTrek Conference: Making Sense of Converging Media (MindTrek ‘13).

The previous chapter provided the basis of socially-aware multimedia authoring. Our results validated the main assumptions, showing that users appreciate the importance of video sharing for building common experiences and for increasing the feeling of togetherness with others. Our results also indicated that current video sharing services fail to meet users’ needs, because they lack useful mechanisms for navigating media and do not take into account emotional intensity and intimacy. In this chapter we argue that there is a need for useful mechanisms for navigating and sharing media, and that socially-aware video management systems should provide efficient automatic processes to manage personal interests.

The wide availability of video recording devices in mobile telephones and pocket cameras has made documenting shared events easy (see Figure 3.1). The collected set of videos provides a rich archive from which users can enjoy content that matches their personal interest. Unfortunately, current browsing tools, including social networks, are not geared to supporting this form of selective consumption; these tools are geared towards throwing away unwanted content from a single collection, not towards browsing a broader community collection of temporally aligned alternatives. Current video tools often support only a high-level
abstraction of objects and events, and do not help users to explore community videos that portray people within their social circle. Even though social networks archive media based on higher-order social relationships, they do not provide support for searching and navigating media content that was captured at a particular event by different camera people.

Figure 3.1. Typical interface for watching videos on the Web. It does not take into account the social affinity between viewers and the subjects featured in the video.

Most social events have an inherent structure that can be used to aid searching for content. We can take advantage of this structure for the development of socially-aware video exploration interfaces. Most participants at an event will attach different levels of importance to any given sub-event, based on their personal/social preferences. If we consider a high school concert, it has a structure (the order of the songs), a sub-structure (individual songs) and multiple levels of sub-sub-structure: solos, duets, vocal announcements and other often-unpredictable happenings. As discussed in the previous chapter, not everyone at the concert (or viewing it) will be equally interested in all parts. Parents will focus on their own children, students on their friends, and invited guests on the clock.

This chapter focuses on our efforts in designing and implementing an interface for browsing community assets, in which the relationships between users of the system and the performers featured in the videos play an essential role in content selection. Our work, which follows the emotional intensity guideline (see Chapter 2), includes our findings and key results from the two-phased series of evaluations. In the next chapter we will show that such social bonds are key not only for navigating a shared media space, but also for authoring personalized stories users care about. Here, we focus on the importance of providing a rich representation of an event (in this case, a high school concert) in a way that helps users to navigate and explore a community repository based on their social/personal interests. The research question we address is:

Question 1.3 Does a socially-aware video exploration system provide an identifiable improvement over current approaches for accessing and navigating a repository of shared media?

To answer this research question we first present a browsing interface, and the underlying system infrastructure, that allow for socially-aware exploration of a collection of media assets captured at an event. Users can explore and navigate (fragments of) video clips recorded by several people based on their own personal/social interests. The design, deployment and evaluation of the system resulted in the identification of key requirements for this novel type of browsing interface. In particular, our approach 1) supports exploration based on the inherent event structure; 2) makes use of contextual information to help in the navigation process; 3) allows for flexible searches based on combinations of filters; and finally, 4) provides a way to switch between camera angles that might have captured different aspects of the event.
The structure of this chapter is as follows. First, Section 3.1 provides an overview of the design and evaluation of an initial prototype system for socially-aware video exploration. Based on the users’ feedback, a set of functional requirements was gathered. Then, in Section 3.2 we describe the design and implementation of the second version of the browsing interface, which addresses these requirements. Next, Section 3.3 reports on the evaluation of this system, analyzing the results. Finally, Section 3.4 provides a reflection on how our findings fit in the context of this thesis.

3.1 Community-based Browsing

The family interviews and focus groups at the beginning of this journey (see Chapter 2) provided us with valuable data for identifying a series of requirements. The conclusion was that current social media sharing interfaces are not adequate for satisfying the expectations of strong ties. In this chapter, we focus on innovative interfaces that help users explore a shared media repository they have social affinity with (the emotional intensity guideline defined in Chapter 2). The final goal is to provide interfaces that can help in shaping and sharing memories of important events with family members and friends.

The starting point of our investigation was traditional video browsing interfaces, such as YouTube (see Figure 3.1). Nevertheless, early in this process we realized that this kind of service does not provide social filters (e.g., to select videos by a particular performer) for concert videos, and it does not take advantage of the temporal relationships between videos belonging to the same event. To address these limitations of current video sharing services, our initial video browsing interface offered two views for exploring community-contributed video clips. The thumbnail view (Figure 3.2) displayed media assets in a paginated grid, while the timeline view (Figure 3.3) showed how the recorded videos temporally fitted the event timeline.

Figure 3.2. Initial prototype implementation for browsing videos (thumbnail view).

Figure 3.3. Initial prototype implementation for browsing videos (timeline view).

In both views a user could apply six different filters to refine a query. These filters were: all media, my media, cameras, people, instruments and events. ‘All media’ referred to all videos uploaded to the system. ‘My media’ restricted navigation to only the videos uploaded by the current user. The ‘cameras’, ‘people’, ‘instruments’, and ‘events’ filters would display the respective annotated video clips based on the filter selection (e.g., ‘Julia’ or ‘Drums’).

Besides allowing for navigating the video clips, our initial interface also enabled users to annotate media assets and to correct existing annotations that could be wrong. When showing a video clip, a user could ‘flip it’ over and access all annotations related to that clip (as shown in Figure 3.4).

Figure 3.4. Initial interface for adding and correcting annotations.

Using the footage recorded during the Big Band concert in Amsterdam, potential users were invited to evaluate our initial system. Details about the methodology and user assessment can be found in Chapter 2. In the remainder of this section, we discuss the results regarding media exploration obtained in the first evaluation phase.
From these results, a set of new requirements was elicited and used for the design of the second phase.

3.1.1 Phase 1 Evaluation

In general, participants’ feedback on the first version of our system was positive (see Figure 3.5). Four (4) out of 7 participants said it was better than traditional tools for finding people they cared about (Q1.1). We received slightly better feedback when we asked whether our system was better for browsing videos recorded by other parents (Q1.2). These results are directly aligned with the requirements of emotional intensity and ease of use.

Figure 3.5. Results of the questionnaires from the phase 1 evaluation.

During the evaluation session, participants were actively looking for video clips of their close friends and relatives. In particular, some participants wanted to immediately share video clips with members of their close circle. “Can I send it now?” was a common reaction after seeing a video clip they especially liked. When asked how they would share the videos, teenagers said they would rather download the video files to their local computers, send a link to a particular video by email, or share it on YouTube and/or Facebook. Parents, on the other hand, indicated that a ‘Burn to DVD’ functionality for the selected videos would also be convenient, given that grandparents usually do not have Internet access at home.

When prompted about what they remembered of the concert, most participants who attended it said that they recalled superficially the spatial arrangement of the stage (see Figure 3.6). At this point, some participants mentioned that it would be interesting to have a spatial representation of the concert venue to help in browsing the event footage.

Figure 3.6. Sketches from participants illustrating the concert setup.

When inquired about particular events they remembered, participants reported on solos performed by different musicians. Among the youngest participants, one event in particular was pointed out as the most memorable of the concert.

“I think that the jamming at the end I liked the most... I found that the most memorable of the whole evening...” (Friend of some performers)

In some cases participants complained – and were desperate – when the quality of the video was not good enough or when the metadata was wrong (see Figure 3.7). Most participants expressed that they would add/correct metadata with our system (Q1.3). However, they were quite resistant regarding the amount of time they would spend on this process, arguing that it demanded a lot of effort.

“It is not my problem (correct the wrong metadata)… people don’t have time to play with the system.” (Uncle of a performer)

Figure 3.7. Participants’ reactions during the evaluation process.

When questioned about the filter functionality, participants appreciated this feature because it would allow them to retrieve only the videos related to their interests. Nevertheless, almost all participants manifested interest in using a combination of filters when searching for videos (e.g., show all videos of the trombone player in the 3rd song). Despite being feasible with the recommendation algorithm presented in Chapter 2, this functionality was not contemplated in the first version of our user interface. Lastly, some participants also mentioned that a person or instrument should be considered featured in a video only if this was a prominent shot, e.g., a close-up or a solo.
They would not be interested in a video clip in which the subject of interest barely appears.

"If he (my nephew) is in the background but he is in the shadow it is OK, but I would like to see a video in which he really shows up… My mother (performer's grandma) would not enjoy seeing this video of him because there is not much to see." (Uncle of a performer)

3.1.2 Lessons Learned

In the first evaluation phase we followed an iterative approach, in which a number of new requirements were defined. The most relevant observation was the necessity of providing contextual information for browsing, searching and watching community assets. On the one hand, the thumbnail view did not show the temporal relationships between the video clips. On the other hand, the interface of the timeline view was considered complex. Participants were looking for a more intuitive and simple visualization model. We observed during the evaluation process that they tended to remember the inherent structure of the event (e.g., the concert program or the spatial arrangement). Rather than treating each media asset as a discrete entity, archival theory and practice suggest that digital videos should be managed, preserved and presented to users in a way that reflects the social and documentary context in which they were originally embedded [8]. This argument led us to the specification of the following requirement:

i. Support inherent event structure: users indicated the need for a more intuitive metaphor to organize or cluster community assets. Such an approach would help them in exploring and searching for people or events of interest;

Although the interface allowed users to add, remove and correct existing annotations, these were not directly accessible. In order to see and change any information regarding a video clip (e.g., associated performers, songs or instruments), users had to click on a button to show the annotation interface (see Figure 3.4). When playing a video, the same problem was evident: annotations were again 'hidden' behind the media. In some situations, users would simply click on and watch a video in order to learn more about its content. This was a time-consuming process that led to user frustration. Based on these issues, we introduce our second requirement:

ii. Make contextual information explicit: feedback from users suggested that clearly showing the associated annotations would facilitate the browsing experience. It would also minimize the chance of 'blind' navigation or of getting 'lost' in the media space;

In the previous section we said that both the thumbnail and timeline views offered a number of different filters for content selection. Despite appreciating this functionality, participants expressed interest in using more than one filter at the same time when searching for people or events of interest. The use of individual filters did not fulfill their needs. Based on this we present our next requirement:

iii. Allow combination of filters: users should be able to combine filters to compose more expressive queries. Such functionality would allow them to find videos of interest more effectively and faster;

User feedback also suggested that they would like to have a spatial representation of the videos, in which content recorded from different angles could be activated in parallel.
The work of Kennedy and Naaman [44] indicates that in a music scenario, like the one addressed in this thesis, alternative camera views can significantly reduce the time required to scan or watch the content, while still providing a complete overview. Along these lines, we introduce our last requirement:

iv. Allow multi-camera navigation: when watching a particular event (e.g., a solo), users should be able to switch between different camera angles (if any others are available). Such functionality would enrich the browsing experience by providing spatial context.

In this section we introduced a set of functional requirements based on user feedback and results from the first evaluation phase. These new functional requirements motivated the design and evaluation of a second prototype system. In the next section we discuss our efforts to provide more effective socially-aware visualization mechanisms and innovative navigation paradigms.

3.2 Socially-Aware Media Browsing

The first prototype was helpful for better understanding user requirements for socially-aware video exploration of community assets. The evaluation results suggested we were heading in the right direction and helped in identifying a number of requirements for improving the user experience. With these requirements in mind, we started a new design from scratch. The browsing component discussed in this section intends to simplify the exploration of media assets without compromising the flexibility of query specification [47].

To address our first functional user requirement, we designed an interface based on the concert program (Figure 3.8 (2)). This digital version of the original paper-based program handed out on the day of the event (Figure 3.9) clusters songs in two columns. Rather trivial in concept, it provides a general overview of the event schedule. In this interface, performers have a prominent position at the top (Figure 3.8 (1)). After all, these 'rising stars' are the main reason for users (friends and family) to use the system.

Figure 3.8. Browsing interface based on the concert program.

Figure 3.9. Paper-based concert program handed out on the day of the event.

For each song in the concert program, a few video clips are recommended. This design choice provides contextual information without the user having to select a specific song. We also implemented a clip hovering functionality that shows a key frame animation on mouse over. It aims at providing a summary of the video without the need to watch it. These design decisions are directly aligned with our second requirement.

Hovering the mouse over interface elements (i.e., performers, songs and clips) also provides efficient and informative feedback. For instance, when a user hovers the mouse cursor over a performer thumbnail, the associated songs and media clips containing that person are highlighted in the user interface. This functionality, which has been designed to react with rapid response times, reduces the short-term memory load [47] and makes the relationship between performers, songs and clips clear.

Another functionality supported in the new prototype is the specification of queries based on performers. When the user clicks on a particular performer, the selection is sent to the server, which recalculates the recommendations considering the selection. Our design also allows for more complex query specifications, such as the combination of two or more performers. In this case, a conjunction operator is used to connect the selection of performers. Thus, only the songs (and the respective video clips) in which there is an intersection among the selected performers are highlighted in the interface (see Figure 3.10). This functionality addresses our third requirement by allowing participants to search for videos using combined filters.

Figure 3.10. Supporting combination of filters. In this example, one performer is selected and the mouse cursor over another performer highlights the songs and video clips in which both performers played together.
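As an illustration of the conjunctive selection just described, the sketch below computes such an intersection for a set of selected performers. It is a minimal sketch under simplifying assumptions: the annotation store is reduced to an in-memory mapping from performers to annotated songs (and the performer names are illustrative), whereas the actual prototype resolves such queries against its annotation database on the server.

```python
def songs_featuring_all(selected_performers, annotations):
    """Return only the songs in which every selected performer appears.

    `annotations` maps each performer to the set of songs they are
    annotated in; the conjunction operator is a plain set intersection.
    """
    if not selected_performers:
        return set()
    return set.intersection(*(annotations.get(p, set())
                              for p in selected_performers))

# Example: highlight the songs in which two performers played together.
annotations = {
    "Julia": {"Song 1", "Song 3", "Song 7"},
    "Tom":   {"Song 3", "Song 5", "Song 7"},
}
print(songs_featuring_all(["Julia", "Tom"], annotations))  # {'Song 3', 'Song 7'}
```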
Next we present our efforts on integrating contextual information and video playback, and on supporting multi-camera navigation. As mentioned above, some video clips are listed under each song of the concert program. These videos are the entry points for media playback and multi-camera navigation. The video clip recommendations are based on the selected search terms and on the user profile (as we will see in the next chapter, the user profile is computed automatically from the user's recording behavior).

When the user clicks on one of the recommended video clips, the playback interface is launched, as illustrated in Figure 3.11. This interface is divided into three main areas: the media player (1), the video clip information panel (2), and the alternative views of a video clip (3). The information panel shows metadata associated with the video clip (e.g., who recorded it, the number of views). It also provides information that is constantly updated during video playback (e.g., the song's elapsed time). This panel also offers users a way to share a link to the current clip with someone by e-mail, to download the current clip, or to inform the system administrator that the clip has inappropriate content. The rationale behind this last functionality is to cope with the privacy concerns discussed in Chapter 2.

Figure 3.11. Interface for watching video clips.

The area of alternative views of a video clip (Figure 3.11 (3)) presents other camera angles recorded at the same time as the main video. In other words, this area shows concurrent video clips recorded by other people during the event. By design choice, only a limited number of alternative views is presented (or recommended) to the user; it is possible that more cameras were active at that point in time. The position of each camera is set when the player interface is launched and remains unchanged during playback. When the user clicks on an alternative camera view, it takes the place of the main video, and playback continues from the same point in time, as if the user had changed position or angle. This interface provides support for watching and navigating the media space, which directly addresses our fourth and last requirement about multi-camera navigation.

Due to performance limitations in a Web browser environment, alternative videos are not played at the same time as the main clip. As a workaround, our design provides a camera update functionality that, during the playback of the main clip, periodically changes the key frame of each alternative view. This approach aims to minimize the blind camera navigation problem discussed in Section 3.1.
3.3 Evaluation

Using the footage recorded during the Woodbridge high school concert (UK) at the beginning of November 2011, 13 people from 6 families participated in the evaluation of the new prototype system. While this section reports on the observations from the interactions of all 13 participants, the quantitative data shows the answers of 9 people (the others did not fill in the evaluation questionnaires). More information about the methodology and the participants' profiles can be found in Chapter 2. Next, we analyze the user responses and discuss the findings regarding socially-aware video exploration. Our results are based on a qualitative analysis of the interviews and on observation of the system usage.

3.3.1 Results and Findings

Figure 3.12 and Figure 3.13 present the results of the questionnaires. Overall, participants appreciated the browsing interface (Q2.1). They indicated that it is useful for finding videos of performers and that it is better than traditional tools for exploring the event media space (Q2.2 and Q2.3, respectively). Accordingly, users reported that they would find videos more efficiently using our system (Q2.4 and Q2.5). Compared with the results obtained in the initial evaluation (see Figure 3.5), there was a clear improvement, even though these were two distinct experiments.

"It (the browsing interface) has everything in one place and you can access other (people's) videos without having to import / open them." (Brother of a performer)

The concert program metaphor was well received by our participants. In general, they expressed that this inherent event structure provides a simple and intuitive overview of what happened during the concert. The performers' thumbnails displayed at the top of the user interface were also appreciated. Participants said this was a good way to quickly look for videos they were interested in.

"Very easy to use! Performers at top is a good idea, and the concert programme is very clear!" (Father of a performer)

When asked how much they liked the mouse over functionality in the concert program, 8 out of 9 participants said a lot, while the remaining participant said some (Q2.7). This was by far the most appreciated functionality of our prototype system. Participants enjoyed the rapid contextual feedback when they hovered over any of the interface elements (e.g., performers, songs or videos).

"I really liked the mouse over feature in the concert programme!" (Mother of a performer)

"This is really good!" (Performer about filters and mouse over functionality)

One of the participants mentioned that this mechanism was a bit slow, though. Rapid response time is critical to support effective feedback. Providing highly responsive interactive results is important for dynamic browsing interfaces like ours, and fast response times for query reformulation allow the user to try multiple queries rapidly [47].

One aspect that needs further investigation is how to present the recommendations for each song. Some users indicated that more recommendations could be shown: they assumed that more videos were available. Apart from that, they seemed to like the video recommendations (see Q2.8). As mentioned earlier in this chapter, our video recommender takes into account the social bonds between users and performers. In the next chapter we detail the profiling approach used by our video recommender.
"I'm wondering why it (the browsing interface) particularly picked those 2 videos." (Father of a performer)

While exploring the videos displayed in the concert program, users expressed that they were having an engaging experience, but they did not have an option to play a song from beginning to end. This feedback suggests the need for supporting more complex narrative alternatives that take into account not only the temporal alignment between videos, but also the preferences and social relationships of each individual user. This subject is the focus of the next chapter, which discusses the balance between the automatic generation of video narratives and the use of manual processes to reflect a personal imprint.

Figure 3.12. Results of the questionnaires from phase 2 evaluation.

Figure 3.13. More results of the questionnaires from phase 2 evaluation (Q2.7: mouse over functionality in the concert program; Q2.8: recommended video clips; Q2.9: alternative views of a clip; rated by the 9 respondents from 'None' to 'A lot').

"I think when you got individual songs, or individual pieces, [...] you might well want to, say, see the whole three minutes or something from the beginning." (Father of a performer)

Our participants appreciated the multi-camera navigation support (Q2.9). This functionality raised the demand for having all the alternative videos playing at the same time and for seamless transitions between camera angles. However, users were also aware of the browser and bandwidth limitations in our scenario. In some sessions there were technical problems when switching from one camera view to another: instead of continuing the new clip from the current time, playback would start the video from the beginning. This problem clearly led to frustration.

"I liked having a lot of different camera angles, which is something you don't get with anything else." (Performer)

"Found the alternative views slightly complicated as regards case of use – couldn't always tell whereabouts in the performance we were, seemed a bit jumpy. Probably just an issue of getting to grips with the programme though!" (Performer)

Still regarding the multi-camera navigation interface, some participants suggested that it would be nice to have a visual representation of the duration of each clip within the song, as it would help them situate themselves temporally. This goes in the same direction as the work proposed by Yu et al. [77].

Regarding video annotation, 7 participants declared they would explicitly rate videos while watching (Q2.6).

"I would rate a clip while watching to tell it does not belong to a song or it has poor quality… just to make sure it would not be recommended again!" (Mother of a performer)

"I would tag videos (thumbs up/down) as not being of good quality or in poor position e.g., performers face not visible as obscured by music stand." (Father of a performer)

A few participants mentioned they normally do not rate videos at all.

"I never really use the rating features of YouTube." (Brother of a performer)

3.4 Discussion

In this chapter we presented our efforts in designing and implementing an interface for browsing a collection of user-generated videos from a shared event.
The interface aimed at helping users easily access content based on their social interests. This chapter described a two-phase development and experimentation process. First, we discussed the design and development of our initial prototype system. The evaluation of this tool allowed us to identify a number of functional user requirements for interacting with a set of videos from the same concert. These findings guided the development and evaluation of a new video browsing interface. Results from the experiments show that our new prototype satisfied the requirements and led to a clear improvement when compared to the initial system. Using a concert program metaphor (requirement i), participants could search for videos using combined filters (requirement iii) and experience moments of interest from different camera angles (requirement iv). Moreover, our system provides efficient and informative feedback to help in this process (requirement ii).

Overall, our design decisions improved the ability to explore videos users care about within a pool containing the recordings of different parents. Our results clearly indicate that a socially-aware video exploration system like ours (which fulfills the emotional intensity guideline and the social connectedness requirement introduced in Chapter 2) provides an improvement over current tools for accessing and navigating a repository of shared media assets. These results directly answer the research question asked at the beginning of this chapter.

Enabling users to explore an event and search for video clips that they, and other participants, have recorded is an important step towards making personal media more accessible. But it is just the beginning. Individual video assets most of the time do not provide rewarding narrative experiences that help users remember important events. In the next chapter we discuss the balance between automatic and manual processes for creating personalized stories from community assets.

4 Automatic and Manual Processes for Creating Personalized Stories from Community Assets: Where is the Balance?¹

¹ This chapter contains extracts from the following papers:
R.L. Guimarães, P. Cesar and D.C.A. Bulterman. 2013. Personalized Presentations from Community Assets. In Proceedings of the 19th Brazilian Symposium on Multimedia and the Web (WebMedia '13). ACM, New York, NY, USA, 257-264. DOI=10.1145/2526188.2526208 (33% acceptance rate) [Won, best multimedia paper]
R.L. Guimarães. 2010. Automatic and manual processes in end-user multimedia authoring tools: where is the balance? In Proceedings of the international conference on Multimedia (MM '10). ACM, New York, NY, USA, 1699-1700. DOI=10.1145/1873951.1874327
V. Zsombori, M. Frantzis, R.L. Guimarães, M.F. Ursu, P. Cesar, I. Kegel, R. Craigie, and D.C.A. Bulterman. 2011. Automatic generation of video narratives from shared UGC. In Proceedings of the 22nd ACM conference on Hypertext and hypermedia (HT '11). ACM, New York, NY, USA, 325-334. DOI=10.1145/1995966.1996009 (34% acceptance rate) [Nominated, best paper/best newcomer]

In the previous chapter we have seen that the ability to search and browse content based on social bonds is very important for making personal media more accessible. Nevertheless, it is too often the case that personal recordings are abandoned on memory cards, or as downloaded files on hard drives, never to be accessed again [19]. The main reason for this is that, as captured, video is not ready for being looked at. Video, as a time-based medium, necessarily requires processing after capture.
Editing, for instance, can be performed on a handheld smartphone to trim out the poor and redundant content that is always captured alongside quality material. This is because video carries complex information, and quality judgments cannot always be made on the spot while filming. Editing is also required to create attractive artifacts out of what we believe could later become valuable memories that we want to watch and share with friends and family members. A simple juxtaposition of recorded fragments does not necessarily result in attractive mementos. But editing is not a simple process, and people often do not want to engage with it, as the results provided by our participants in Chapter 2 show. This is true for personal content from one source, but it is especially true when mixing content recorded at a single event from many sources: the community video problem.

This chapter provides an analysis of our efforts on multimedia authoring using community assets. As with browsing and navigation, we developed a first version of an authoring system, subjected it to extensive long-term user testing, and then developed an improved version that follows the guidelines of socially-aware multimedia authoring. As described in Chapter 2, our initial work was subjected to a 10-month evaluation process, enabling end-users to create stories that reuse collective content for individual needs. Our initial results showed a general enthusiasm from participants, which was validated in the first evaluation phase. The initial implementation, which was aligned with the personal effort guideline, made use of a narrative engine to automatically compile personalized stories based on the community media assets [76]. While the video compilations produced by the initial system were considered visually compelling, end-users missed the capability of personalizing them by adding their own 'imprint'. The complexity of authoring personalized stories from community assets has led to the consideration of the following research question:

Question 1.4 Where is the balance between automatic and manual processes when authoring personalized narratives users care about?

We have approached this research question from three more concrete and strongly interlinked perspectives. In particular, this chapter investigates:

1. The degree to which media authoring can be simplified by the use of a narrative engine to produce a 'rough cut' (an initial video story) automatically;

2. The degree to which this rough cut can be automatically tailored based on the relationships within an end-user's social network; and

3. The degree to which automatically generated video stories can be easily refined and further personalized using intuitive manual extensions with minimal extra effort.

The primary contribution of this chapter is a hybrid authoring system that allows users to create and share personalized media with others. This chapter is structured as follows. Section 4.1 motivates the problem of creating personalized stories from community assets and discusses the evaluation we carried out during the first phase.
Section 4.2 describes the design and implementation of a new hybrid (or semi-automatic) authoring system that meets the functional user requirements elicited in phase 1. Section 4.3 reports on the results from the user evaluation of our prototype, demonstrating the benefits of our hybrid authoring approach. Finally, Section 4.4 concludes the chapter with a discussion of the lessons learned.

4.1 Community-based Authoring

Creating compelling multimedia productions is a non-trivial problem. The problem is compounded when authors want to integrate community media assets: media fragments donated by a potentially wide and anonymous recording community. The purpose of this section is to describe our initial efforts to facilitate the creation of personalized stories from community assets. Our initial approach provided users with two independent authoring threads, one manual and one automatic (called Editor and Composer, respectively). The intention was to compare the quality of easy-to-create, fully automated compilations with the amount of effort required to manually create personalized video stories.

Figure 4.1 shows Composer, the thread for automatically assembling video compilations in our initial prototype system. Users only had to explicitly select the subject matter (people, songs, instruments) and two other parameters (style and duration). Then, by pressing the 'GO' button, a narrative engine would be triggered, and in less than three minutes a video using the assets captured by different cameras at the concert would be created. The narrative engine would select the most appropriate video fragments from the repository, based on the declared user parameters, and assemble them following narrative constructs.

Figure 4.1. Initial prototype implementation for automatic video editing.
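To make the Composer's input concrete, the sketch below shows the kind of request that pressing 'GO' might assemble for the narrative engine. The field names, values and payload format are hypothetical; the prototype only exposes the parameters named above (subject matter, style and duration).

```python
import json

# Hypothetical compilation request assembled from the Composer controls.
compilation_request = {
    "people": ["Julia"],           # subject matter: performers of interest
    "songs": ["Song 3"],           # subject matter: songs of interest
    "instruments": ["trombone"],   # subject matter: instruments of interest
    "style": "dynamic",            # illustrative style label
    "duration": 180,               # requested compilation length, in seconds
}
print(json.dumps(compilation_request, indent=2))
```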
As mentioned in Chapter 2, the automatic authoring capabilities of the system were also assessed using expert input. Three video professionals with between 5 and 20 years of experience were interviewed. All three agreed with the basic footage preparation and narrative structures that were used to build the video compilations. They were especially keen on the approach of using an audio track as a master timeline to drive the story development. They also concurred with our approach of automatically selecting alternative shots from cameras available in parallel tracks and of using rules that selected clips based on shot types [76].

Our initial prototype system also provided an interface for manually creating video compilations (Editor). To find videos of interest, users could use the same set of filters and views available in the video exploration tool (see Chapter 3). Using the Editor, users could simply drag and drop recommended video clips from the shared repository to the storyboard (see Figure 4.2). For example, a parent could add more clips in which his daughter was featured when sharing with grandma, or he could instead add a particularly 'funny' moment from the event when creating a version for his brother.

Figure 4.2. Initial prototype implementation for manually editing videos.

Apart from allowing fine-tuning of productions, the Editor also enabled users to perform enrichments. It provided mechanisms for including personal audio, video, and textual commentaries. For example, these could be subtitles aligned with the video clips, commenting on the event for others. Users could also record an introductory audio or video message, leading to more personalized stories. Figure 4.3 illustrates some elements of an authored video, where a parent has created his own version of a concert. He has also added some personal assets, such as an introductory image and a video recording of himself that acts as the message envelope. This functionality was called 'capture me'.

Figure 4.3. Elements of an authored video composition: the parent has included an introductory image and a video for making it more personal and intimate.

As reported in Chapter 2, the evaluation of the initial system was preceded by 3 social events. While the first two recording experiments mainly focused on the evaluation of the annotation processes and narrative structures, the third one, a school concert in Amsterdam, allowed us to engage a group of parents, relatives and friends of performers in evaluating the initial version of our system. In the remainder of this section we discuss the lessons learned about the authoring threads during the evaluation of the first phase.

4.1.1 Phase 1 Evaluation

In this study, all participants first interacted with the community-based browsing interface (see Chapter 3), and then they were introduced to the authoring threads. In general, they appreciated both approaches to creating personalized video compilations and considered the functionalities useful. Using our authoring tool they felt they could create more stories, faster and more easily (compared to traditional systems – Q1.1-Q1.3 in Figure 4.4 from the evaluation of the first phase). Overall, the automatically assembled videos were considered visually compelling (see the reactions in Figure 4.5), although participants also indicated that they would like to have more manual processes available to further personalize and fine-tune the video compilations (Q1.4-Q1.6).

"I want more portraits of my daughter (in this automatically generated compilation)... is it possible to edit an existing movie (in the Editor)?" (Father of a performer)

In the manual authoring thread participants could find and select their favorite video clips. However, a complaint was that they had to choose each and every clip for the compilation. Regarding optional processes, for most of our participants the 'capture me' function – for including personal assets in a video compilation – was seen as a way to personalize videos for a target audience. As shown in the results (Q1.5), this functionality was mostly appreciated. Participants indicated they would use it, for instance, when creating a birthday present video.

In the initial version of our prototype system users could either generate video compilations automatically (not being able to change these later on) or edit manually (having total control but starting from scratch). While automatic compilations were appreciated because of the shot selection and camera diversity, users provided important evidence that manual processes were indispensable to reflect intimacy and effort.

Figure 4.4. Results of the questionnaires from phase 1 evaluation.

Figure 4.5. Quotes of a participant using the automatic authoring thread.
Based on participants' comments, reactions, and answers to the questionnaires, we can conclude that they appreciated the benefits of our authoring system and considered it a valuable vehicle for creating enjoyable memories. While these results were highly relevant, we were aware that they were not complete. More important was the indication that, instead of the automatic or the manual authoring thread alone, a hybrid solution would better fit the participants' needs (Q1.4). In the next section, we discuss the functional requirements that motivated the design of a new version of our socially-aware multimedia authoring system.

4.1.2 Requirements Gathering

Regarding manual authoring, participants identified a number of issues that could improve the creation process. Even though some participants were familiar with end-user video editing tools, for most of them this process was time consuming and complicated. And even though they appreciated the filtering functionalities included in the Editor, they indicated that they would not like to start the process from scratch. Given the difficulties inherent in video editing, they would rather first use an automatic system that provided them with an already compiled story. Based on this feedback we introduce our first functional requirement:

i. Do not start from scratch: users indicated their preference for an authoring paradigm in which an initial narrative compilation is created on their behalf. Such an approach would simplify the authoring/editing task and increase their productivity;

Regarding automatic authoring, participants generally appreciated the ease of use. The interface for automatically generating stories only required users to select a number of parameters, such as the duration, people, instruments, and songs to be shown in the compilation (see Figure 4.1). After a few minutes, users could watch a static narrative story based on their preferences. Even though they generally enjoyed the final results, they would have preferred that the system selected some of the parameters for them. In particular, they asked for automatic methods capable of identifying their interpersonal relationships with the performers of the concert. This discussion leads to our next requirement:

ii. Consider implicit interpersonal relationships: participants assumed that the system could automatically identify and process their interpersonal relationships with performers when creating video stories;

A common frustration with automatically generated videos in the initial prototype was that the automated process created a video story that could not be modified. Participants indicated that they would like to fine-tune (or personalize) automatically generated stories using manual tools. They felt that the final result could be made more personal by adding assets and personal comments that more closely reflected their view of the event. This was of particular importance in video sharing situations, in which some participants wanted to send stories of the event to particular people within their social circle, such as an uncle or the grandmother of a performer. This result is consistent with our hypothesis that emotional intensity and intimacy should play a key role in socially-aware multimedia authoring systems (see Chapter 2). Guided by this discussion on personal effort, we present our last requirement:
iii. Allow for personal imprint: participants suggested that automatically generated compilations should be modifiable. They wanted to remain in control over the final production, being able to make small changes. This approach would allow them to create more personalized stories.

Based on these requirements, we concluded that a new version of the authoring system was needed. The new approach would allow users to request a first compilation based on their implicit preferences and interpersonal relationships with performers. The system would then present an initial narrative, which could be edited and personalized on a per-clip basis. This hybrid authoring system ambitiously brings together both automatic and manual processes, so that narrative segments can be compiled, adjusted and edited successively. In the next section we discuss our efforts in designing and implementing this new authoring paradigm.

4.2 Hybrid Multimedia Authoring

The high-level workflow of our new authoring tool is detailed in Figure 4.6. Since we intend to improve the creation of video compilations based on multi-camera recordings, the input material still includes the school master track and the actual video clips that users agreed to upload. As shown in Chapter 3, all video clips are stored in a shared video repository that also serves as a media clip browser in which parents, students, and authorized family members can explore (and selectively annotate) the videos.

Figure 4.6. High-level workflow of our hybrid authoring tool.

In the new design, event exploration is the starting point for the authoring process. To create a personalized video compilation based on a song, the user simply clicks on one of the songs in the concert program interface to trigger the narrative engine (Figure 4.7). The engine is in charge of creating a first montage from the video assets (and from video fragments) based on narrative structures and on interpersonal relationships (dependent on the identity of the user that is logged in). This compilation, from now on referred to as the Director's Cut, can later be modified by the end-user to make it more personal. Next, we discuss how the three requirements identified with our initial prototype have been considered in the design and implementation of the new authoring system.

Figure 4.7. Triggering the Director's Cut from the browsing interface.

4.2.1 Profiling Users

Profiles of the users logged into the system can facilitate the automatic creation of personalized video compilations. Traditional approaches to user profiling include implicit activity monitoring (logging) and the explicit insertion of personal data. While these approaches provide relevant results for a statistically significant group of people interacting over a long time span, they are not sufficient for our highly personalized environment. For this reason, we have implemented a mechanism to automatically compute the relationships between users and performers. This mechanism follows three steps.

First, we fill a database table with the songs each performer participated in. This is done by inspecting the annotations regarding performers in video clips, and looking for intersections with the songs that compose the event timeline. A key part of this procedure is that a weight is associated with each song/performer row in the database table. This weight (or ranking) is calculated based on several parameters: the number of annotations each performer has in that particular song, the duration of these annotations (how long a musician is featured within a video clip), the quality of the annotated videos (e.g., high-definition or low-quality), and the shot type annotation (e.g., close-up or wide shot). Note that the final ranking can be tweaked by giving different weights to the parameters. After all final weights have been calculated, they are normalized on a per-song basis. This means that the performer with the highest weight in a song gets 1, while a performer that is not featured in the same song (or has the lowest weight) gets 0 (zero). All the other song/performer weights then fall in the range [0, 1]. The result of this process is a table of normalized weights, which suggest the importance of each performer in each of the songs. The weighted song/performer table is used in the video selection process (compilation generation and alternative clip recommendations).
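The sketch below illustrates this first profiling step. It is a minimal sketch under stated assumptions: the raw score is one possible linear combination of the four parameters named above, and the coefficients and feature encodings are illustrative rather than the values used in the prototype.

```python
# Tweakable parameter weights for the raw song/performer score.
W_COUNT, W_DURATION, W_QUALITY, W_SHOT = 1.0, 0.1, 0.5, 0.5

def raw_weight(n_annotations, featured_seconds, quality, shot_score):
    """Combine the four annotation parameters into one raw score.

    `quality` and `shot_score` are assumed to be pre-encoded in [0, 1]
    (e.g., high-definition -> 1.0, close-up -> 1.0).
    """
    return (W_COUNT * n_annotations + W_DURATION * featured_seconds +
            W_QUALITY * quality + W_SHOT * shot_score)

def normalize_per_song(scores):
    """Min-max normalize {performer: raw score} for one song to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:                       # degenerate case: all scores equal
        return {p: 1.0 for p in scores}
    return {p: (s - lo) / (hi - lo) for p, s in scores.items()}

song_scores = {"Julia": raw_weight(4, 95.0, 1.0, 1.0),
               "Tom":   raw_weight(1, 20.0, 0.5, 0.0)}
print(normalize_per_song(song_scores))   # Julia -> 1.0, Tom -> 0.0
```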
Second, we make use of the capturing behavior of each recorder individually. By taking into account the same parameters discussed in the first step, we model the behavior of a recorder towards the musicians in each of the songs. For this, a similar database table with an extra column (recorder) is used. By computing a normalized weight for each recorder towards each of the performers in each of the songs, we can derive their affection level, which, as assumed, greatly influences the overall time a recorder spends capturing a specific musician. Based on these data, we can model relationships (was the performer his daughter? a friend of his daughter?), and thus provide information for the profiling process. Figure 4.8 shows the results of analyzing the metadata associated with the media captured during the high school concert in Amsterdam. In the figure, the recording behavior of a mother towards her kid is compared with the average behavior of the rest of the parents. We can observe that the normalized weight of a particular recorder closely reflects the affection level towards a performer. In other words, the recording habits provide an important cue about the social relationships between recorders and featured performers.

Figure 4.8. Comparison between the recording behavior of a mother towards her kid (in blue) and the average behavior of the rest of the parents (in red).

Finally, the profiling process takes into account the user activity when browsing the shared media repository (e.g., videos a user watched, videos a user liked, most watched videos overall, most liked videos overall). This approach provides dynamic information when compared with the previous steps.

The information from these three steps is stored globally in the database and is accessible by different engines. Based on the normalized weights, inputs can be provided to the narrative engine, so that automatic compilations take into account not only narrative constructs, but also the interpersonal relationships between the users of the system and the people depicted in the video clips. This approach is directly aligned with our second requirement.

4.2.2 Automatic Generation of Stories

The first requirement identified in Section 4.1 was to provide automated authoring functionality, so that the author does not have to start from scratch.
Our system includes a reimplementation of the narrative engine used in the first phase. The new engine provides an initial story in the form of a playlist of video fragments. By itself, this functionality addresses our first functional requirement.

The narrative server wraps a narrative engine as a Web application, so that engine instances can be launched on the server. The Web application runs inside a generic Java application server (Tomcat) and can handle requests from other applications. These requests include the command dispatcher for starting/stopping the engine and the playlist dispatcher for requesting playlists. Further information about the NSL language can be found elsewhere [51]. Figure 4.9 shows a video compilation created from the 'Adagio and Allegro from Sonata No 6 in E' song. As we will see below, the implementation of the narrative engine presented in Chapter 2 was modified to provide a set of alternatives (video clips) that can replace specific parts of the initial Director's Cut, while still maintaining the narrative structure and the story line.

Figure 4.9. Director's Cut: an initial video compilation is created automatically by the system.

4.2.3 End-User Personalization of Stories

The third requirement we identified was the need for fine-tuning and further personalizing the automatically generated productions. To support manual personalization, the narrative engine does not only create a Director's Cut, but also provides a set of alternative clips that can potentially replace parts of the compilation (see Figure 4.10). Once an initial compilation is ready, the user can modify it, allowing for personal imprint (third requirement). To enable this functionality we use a structured playlist format; in our work, we selected W3C's SMIL playlist profile [17].

Figure 4.10. Director's Cut: visualizing alternative clips available.

The benefit of SMIL is that it aims at integrating a set of independent multimedia objects (in our case, video fragments) into a synchronized multimedia presentation. It contains references to the media items, not the media content itself, along with instructions on how those media items should be combined spatially and temporally. Other approaches to video mashups typically produce a final encoded video item, in which it is not possible to modify or enrich individual sequences. In our case, the richness of the SMIL language permits the user to perform dynamic operations on the initial video stories by simply modifying a text document (the SMIL file). The actual process of manipulating the document is hidden from the author, who simply sees an interactive user interface in the browser's Web page.

The video compilation generated by the narrative engine contains a set of references to video fragments (using the clipBegin and clipEnd parameters). In addition, it provides a number of switch containers (<switch>) that hold the alternative clips (or sets of clips), which can be selected for personalizing the initial story. These alternative video clips have been selected by the narrative engine, so the narrative intent is not lost. For example, the engine will offer the option of selecting a different camera angle or a different point/person of interest. In addition to these features offered by the narrative engine, the end-user can decide to perform more radical modifications by adding other assets from the database or by enriching the video compilation (e.g., adding comments). All these modifications are incorporated into the original SMIL file.
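The sketch below illustrates one way such a per-clip substitution could be applied to the playlist document: the user's chosen alternative is moved to the front of its <switch>, so that a SMIL player evaluates it first. This is a minimal, namespace-free sketch; the actual documents follow the SMIL playlist profile, and the prototype may record the selection differently.

```python
import xml.etree.ElementTree as ET

# Toy fragment of a generated playlist with one switch of alternatives.
smil = ET.fromstring("""<smil><body><seq>
  <switch id="shot-12">
    <video src="clips/cam1.mp4" clipBegin="12s" clipEnd="19s"/>
    <video src="clips/cam3.mp4" clipBegin="12s" clipEnd="19s"/>
  </switch>
</seq></body></smil>""")

def choose_alternative(switch, src):
    """Promote the <video> with the given src to the switch's first slot."""
    for video in list(switch):
        if video.get("src") == src:
            switch.remove(video)
            switch.insert(0, video)

# The user picks the second camera angle for this part of the story.
choose_alternative(smil.find(".//switch[@id='shot-12']"), "clips/cam3.mp4")
print(ET.tostring(smil, encoding="unicode"))
```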
For viewing purposes we use the Ambulant Player, which provides a full implementation of the SMIL language (see Figure 4.11). A further benefit of using SMIL is that the recipient of the video can easily enrich and modify the video compilation further, and send it to others or return it to the original author, enabling reciprocity. In Chapter 5 we will discuss our efforts to support personalized end-user enrichment within third-party content.

Figure 4.11. Director's Cut: song player.

The combination of a profiling infrastructure based on interpersonal relationships, a narrative engine capable of creating attractive video compilations, and manual mechanisms for tweaking and personalizing those compilations results in a unique authoring tool. The validation of this authoring tool for the creation of highly personalized (but compelling) productions constitutes the major contribution of this chapter, as reported in the next section.

4.3 Evaluation

Nine (9) participants, enrolled in the second phase of the evaluations, filled in the questionnaires about the Director's Cut functionality (for more information about the evaluation process, please refer to Chapter 2). Based on our observations, the responses to the questionnaires, and an analysis of the collected audio/video material from the interviews, in this section we present the results and discuss the findings from the evaluation process.

4.3.1 Results and Findings

Figure 4.12 shows the answers given by the participants after making use of the Director's Cut functionality. In general, all participants appreciated the new prototype (Q2.1). Six participants said that the Director's Cut offers a better way to edit videos compared to the existing video editing software they know (Q2.2). The other 3 users claimed they were unfamiliar with such tools and were therefore unable to judge. Again, similarly to the results obtained with the initial system (Q1.1-Q1.3), almost all participants stated that they would create more video stories (Q2.3) and create them more quickly (Q2.4) because the tool was easy to use (Q2.5).

"It was very easy to use and it selected which videos I wanted well." (Brother of a performer about the automatic generation component)

"Very easy to use (editing based on alternative clips). I wouldn't want to spend hours looking at a help menu. This was simple enough for me." (Mother of a performer)

When asked whether they would add themselves to personalize a story (Q2.6), 6 users, mainly youngsters, mentioned this would be a good functionality. However, our senior participants claimed they would not do so. Similar feedback was obtained in the first evaluation process (Q1.6).

"It would be interesting to have a functionality to add other videos (not only the ones suggested)." (Father of a performer)

Figure 4.12. Results of the questionnaires in Woodbridge (UK).
All participants indicated they would like to create different types of productions (Q2.7). When questioned about the types of video stories they envisaged, the 'song-based' video came up as the first choice among most of them. Some argued that, depending on the social situation, they would create and share different versions with different audiences.

"If my family misses my performance I would send the full performance to them… but if I want to send it to my singing teacher I would share a more focused version." (Performer)

Figure 4.13 shows some participants during the evaluation process. A one-to-one comparison between the first and the second phases would be unfair (different users, events, and tools). What we can say is that in the second evaluation both the automatic generation of initial video stories and the manual tools for tweaking them received extremely good scores (Q2.8 and Q2.9, respectively). These results provide strong evidence that a hybrid framework builds on the best of each approach: assisting with complex tasks (not starting from scratch) while still making sure the user plays an active role in the process whenever desired (personal imprint).

Figure 4.13. Participants interacting with the Director's Cut.

4.4 Discussion

Creating compelling multimedia presentations remains a complex task. This is true for both professional and personal content. For professional content, extensive production support is typically available during creation. Content assets are well structured, content fragments are professionally produced with high quality, and production assets are often highly annotated (within the scope of the production model). For personal content, nearly none of these conditions exist: content is a collection of assets that are structured only by linear recording time, of mediocre technical quality (on an absolute scale), and with only basic automatic annotations. The problem is made worse when authors use community assets of an event. In events such as high school concerts, a single concert can generate hundreds of video clips, taken from multiple vantage points, using tens of cameras.

With our initial prototype we could generate syntactically correct automated stories that served generic needs (much like a conventional video mashup). Our users found these compilations compelling, but not their own: they missed a personal touch. In this chapter, we reported on a hybrid authoring approach that provides mixed support for the automated creation (requirement i) and manual enhancement of personalized video stories (requirement iii). We targeted small-scale events, where lightly annotated assets are provided. Our assumption is that editors at these events will want to highlight personal aspects: a particular instrument, a particular child, a particular solo (or goal). This places a demand on a system to help users select appropriate content of personal interest (requirement ii), and to help build compelling stories with minimum effort (in accordance with the personal effort guideline presented in Chapter 2).

We acknowledge that there are limitations to the amount of automated personalization that a system can provide. In the abstract, given unlimited personalized annotations and unlimited information on all members of a potential target user community, we suspect that great strides could be made in automated personalization. The reality is, however, that for community assets, personalized annotations are limited, and the target user group is lightly profiled. This requires an interface that allows direct user intervention in creating content.
Providing direct user intervention has tremendous benefits: the user knows his or her target audience best. The differences between uncle Henry's interests and those of Grandma are often clear in the head of the human author, but largely inaccessible to an automated system. At the same time, end-users have only a limited amount of time and energy to create personalized stories (many are busy recording new content rather than editing old content!). This requires a balance of complexity and functionality. We feel that our approach provides this balance. Based on user feedback gathered as part of our four-year study, we feel that we have shown that it is possible to satisfy casual content creators while still allowing extensive personalization to take place when needed. These results directly answer our research question (and fulfill the requirements on effortless interaction, personal effort and intimacy introduced in Chapter 2). We feel that the combination of automatic and manual processes is unique and powerful.

While concentrating on the creation process, we cannot forget that multimedia sharing can also stimulate user comments and reactions, which are likewise part of the authoring workflow. This is the topic of the next chapter, in which we present our efforts on empowering users to comment within personalized multimedia presentations.

5 Supporting Personalized End-User Comments within Third-Party Online Videos¹

¹ This chapter is based on the following papers:
R.L. Guimarães, P. Cesar, and D.C.A. Bulterman. 2012. "Let me comment on your video": supporting personalized end-user comments within third-party online videos. In Proceedings of the 18th Brazilian Symposium on Multimedia and the Web (WebMedia '12). ACM, New York, NY, USA, 253-260. DOI=10.1145/2382636.2382690 (30% acceptance rate)
R.L. Guimarães, P. Cesar, and D.C.A. Bulterman. 2010. Creating and sharing personalized time-based annotations of videos on the web. In Proceedings of the 10th ACM symposium on Document engineering (DocEng '10). ACM, New York, NY, USA, 27-36. DOI=10.1145/1860559.1860567 (31% acceptance rate)

In the previous chapters, we have reported on our efforts to empower end-users to browse a shared video collection based on personal interests and to create personalized, but still compelling, personal stories from it. In this chapter, we shift our focus from the author to the recipient of the story.

Successful commercial video sharing systems have provided ample proof that video is a first-class Web object. Even social networks like Facebook, originally conceived for status updates, have become important distribution channels for both consumer and professionally generated video [73]. In these sharing systems, video content serves both as a medium for communicating a story (using implicit or explicit cinematic rules) and as a catalyst for communication between third-party viewers of that content [12][25][50].

Recent developments by video service providers have extended the means for third-party communication in ways that have never been possible with conventional broadcast or personal video systems. In addition to the base video content, a typical YouTube page also provides space for end-user generated comments (Figure 5.1).

Figure 5.1. Typical end-user comment in YouTube. It appears statically underneath the third-party video.
These include implicit forms of commentary (such as the number of views or anonymous ratings, e.g., 'like' or 'dislike') and explicit comments for interested viewers. In the case of online video on demand, textual comments are usually placed statically underneath the media player. If desired, users need to make explicit any reference to a particular event that happens within the video object (e.g., "Look at that shiny, beautiful trombone at 1:56" in Figure 5.1). In YouTube, for example, when a user writes out a particular time code in a comment, it automatically turns into a 'temporal hyperlink' that, when clicked, takes the interested viewer to that part of the video. However, such comments do not reproduce the 'commenting while watching' activity people perform when consuming media together. In general, users cannot add comments that are synchronized with the video, unless the owner (who uploaded it) has given them editing rights to the base video content.
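The time-code conversion just described can be illustrated with a few lines of code. The sketch below rewrites mm:ss time codes in a plain-text comment into seek links; the `?t=<seconds>s` URL parameter mimics YouTube's convention, and the anchor markup produced is an assumption made for illustration.

```python
import re

def linkify_timecodes(comment, video_url):
    """Turn time codes such as 1:56 into temporal hyperlinks."""
    def to_link(match):
        minutes, seconds = int(match.group(1)), int(match.group(2))
        return '<a href="%s?t=%ds">%s</a>' % (
            video_url, minutes * 60 + seconds, match.group(0))
    return re.sub(r"\b(\d{1,2}):([0-5]\d)\b", to_link, comment)

print(linkify_timecodes("Look at that shiny, beautiful trombone at 1:56",
                        "https://example.com/watch"))
# ... trombone at <a href="https://example.com/watch?t=116s">1:56</a>
```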
Section 5.3 describes the design and implementation of a Web-based video commenting tool, which realizes such document transformations. In this section we also report on a predictive timing model for helping users to incidentally synchronize text comments with specific events within a video. Lastly, in Section 5.4 we present the results from the evaluation process, while in Section 5.5 we discuss the lessons learned and how these fit in the context of this thesis work.

5.1 Media Consumption and Commenting Practices

A sample group of 21 people was invited to participate in an evaluation process during the first quarter of 20122. All participants were regular Internet users. Eighteen (18) people were in the 21-40 age range, while the other 3 were over 40 years old. Participants were of different nationalities, including Brazilian, Chinese, Dutch, German, Hungarian and Irish. We used semi-structured electronic questionnaires to collect users' feedback. While multiple-choice questions allowed us to explore patterns and find trends (quantitative methods), open-ended questions aimed at capturing further insights into participants' opinions and perceptions. The user study was divided into 3 parts. The first part, which is the focus of this section, consisted of a questionnaire to gather background information about respondents' commenting practices when watching video content. Answers were anonymous.

2 This was an independent study, involving a different set of participants from the ones in the evaluation process discussed in the previous chapters.

5.1.1 Survey Research

Figure 5.2 summarizes the results obtained in our survey about media consumption and commenting habits. As users' practices differed, for each question we present the weighted average (colored column) and the respective standard deviation (bar). In Figure 5.2 the questions have also been clustered into two groups according to the consumption experience: synchronous (blue) and asynchronous (red) watching. For each scenario we also asked about participants' conversational and commenting practices around media.

A wide range of TV watching habits was reported by our participants. On average, our users watch TV every week (Q1.1). This was the second highest frequency score among our questions. Participants also reported having the habit (between occasionally and every week) of talking with family and friends about a TV show they have watched (Q1.2). When asked about the frequency with which they converse about a TV program with collocated people while watching, the average answer was occasionally (Q1.3). The lowest score, though, was obtained for the question about how often they send tweets related to a TV show they are watching (Q1.4). As reported elsewhere [13], this activity has become more popular over the years and, in some cases, it can be used as an interactive return channel through which the audience can influence live TV programs. In the second media consumption scenario, we asked our participants about the habit of watching live video feeds on the Web (e.g., Justin.tv, Ustream.tv). The average of their feedback was around occasionally (Q1.5). Regarding the activity of commenting on the video event while watching, we asked how often they make use of the built-in open textual chat rooms generally available on those services.
Again rather low: the frequency stayed between never and occasionally, slightly higher than the one reported for tweet messages (Q1.6). Regarding on demand (asynchronous) watching, we asked our participants about the usage of YouTube, Facebook and SoundCloud. Validating previous research [73], YouTube was often (between every week and every day) used by our participants to watch online videos (Q1.7). However, posting comments to the video page did not seem to be a common practice among our participants (Q1.8). In conjunction with the use of YouTube, we also witnessed a fairly high frequency of video viewing on social networking sites (Q1.9). In this case, though, participants habitually comment more on videos than they do in YouTube (Q1.10). One possible explanation for this behavior is that participants are more likely to post comments within their social circle than in the open.

Figure 5.2. Survey research about media consumption and commenting practices. Blue columns indicate synchronous watching and related conversational habits. In red, on demand (asynchronous) consumption and commenting.

Figure 5.3. Requirements gathering: utility and usefulness of timed comments within media.

Finally, we asked our users to report on their practices using online audio streaming services such as SoundCloud. On average, respondents occasionally listen to music on these platforms (Q1.11) and they do not have the habit of adding timed comments to songs (Q1.12).

In the second part of the survey we asked our participants about the possibility of adding timed text comments to particular events within a third-party online video (see Figure 5.3). Thirteen (13) out of 21 participants said they would possibly (yes or maybe) add timed comments to YouTube or Facebook videos if they could (Q1.13). One of the participants expressed that such a feature would be a "nice way to highlight sections/points of the video". When asked whether they would share these synchronized comments within their social circle, the number rose to 16 out of 21 (Q1.14), considerably higher than the number reported for sharing comments with everyone (Q1.15). In one case, a user justified this by writing "I'm never interested in sharing my comments with the public... but to have a link that I just could send to friends". Finally, we asked a question related to digital rights and ownership. In this case, only 2 participants expressed that they would mind if other people could add timed comments to their videos (Q1.16). Along these lines, one participant highlighted the necessity of having control over the commenting activity: "If everybody could add comments to any video it would become a real mess. Some people would use it to damage the image of others". This result seems to contradict the privacy issues discussed in the previous chapters, but it is important to keep in mind that here the videos were not necessarily personal (as opposed to the ones discussed in the extensive long-term evaluation process in the UK and the Netherlands).

5.1.2 Requirements Gathering

In the study presented above we looked at the media consumption and commenting practices of a group of Internet users using different applications.
Even though the group was small and provided results with a high variance, we obtained strong indications that while people consume media, they eventually comment on and share such moments within their close circle and, sometimes, with the general public. Our respondents also appreciated the utility and usefulness of synchronized comments, as they would comment on particular events within media (Q1.13-Q1.15). These results led to the specification of the requirements described below. These were used as the basis for designing and validating the online video commenting system presented in the remainder of this chapter.

i. Retain base video integrity: viewers should not be able to alter the base video content, either in terms of adding embedded captions/comments or of providing visual overlays on the base content – this right is reserved to the content owner;

ii. Allow multiple-video aggregation: on certain occasions, end-users might watch a collection of videos that are played as a continuous playlist (e.g., a personalized video story or compilation, as shown in the previous chapter). In these cases, end-users should be able to create comments that span across the multiple videos composing the playlist;

iii. Allow multiple-provider integration: the user should not be locked into a single video service provider (or source) for candidate content, but should be able to populate a playlist from multiple sources;

iv. Allow timed end-user comments: when watching an online video, viewers should be able to add comments that are time-synchronized. This feature reproduces (asynchronously) the watching and commenting activity people engage in when watching media collocated;

v. Allow micro-personalized timed comments: end-users should be able to create different sets of time-based comments for individual users/communities, or share these as 'broadcast' comments (similar to existing approaches in YouTube and similar systems);

vi. Allow selective end-user viewing: end-users should be able to select and watch comments by specific individuals and/or user communities, by topic, etc. This is important because some comments might be targeted at individual users while others might be intended for the general public; and

vii. Allow timed end-user navigation: end-user comments should be able to include direct navigation support via timed anchors in the text content. This allows others to navigate to other interesting parts within the same collection or to link to external media.

5.2 Media Commenting meets Multimedia Document Engineering

To address the requirements discussed above, we modeled the problem of creating timed comments within online videos from a multimedia document engineering perspective, and thus identified a set of document transformations. By document transformations we mean manipulations that can be applied to add non-embedded, flexible temporal end-user comments. Video commenting has been dealt with in many ways, ranging from the usage of models that are not timed (e.g., HTML) or are unstructured (e.g., Flash) to standards such as MPEG-7 [1] and NCL [33]. Based on our analysis [62], we rely on SMIL 3.0 [17] as the basic framework that meets our requirements. First, we create a structured multimedia document around the input video(s). The document model of SMIL 3.0 retains the base video integrity, and it allows multiple-video aggregation and multiple-provider integration.
Timed text content and temporal hyperlinks allow end-users to add synchronized comments and to include timed end-user navigation points. Contextual information allows targeting timed comments to different audiences. Finally, the structured underlying model enables selective viewing.

5.2.1 Document Model

SMIL can integrate and compose a collection of audio, graphics, image, text, and video media items into a single presentation. As Web resources are distributed by nature – and might be very large in size – SMIL media objects are included by reference (using a URI, a Uniform Resource Identifier). SMIL defines a single generic media object (the <ref> element) that allows the integration of external media resources into a SMIL presentation. However, it is also possible to use more intuitive tags when referencing external media resources (e.g., the <video> element is a more specific alias for the generic SMIL media reference element). Note that as an implication of the use of references, the integrity of the base media is preserved, meeting requirement i.

In addition, SMIL provides a powerful hierarchical composition model from which individual presentation timelines can be generated. The main temporal structuring elements are the parallel (<par>) and sequential (<seq>) containers, each of which provides a local time base for scheduling media objects (e.g., external videos) or child time containers. By using such time containers, it is possible to combine videos and comments in different temporal ways, as illustrated in Figure 5.4. In this example, three videos, stored on different video servers, are rendered as a continuous video, while the comments span across the videos. The structured temporal container behavior satisfies requirements ii and iii.

5.2.2 Timed Text Content

Unlike most text formats [15], text content in SMIL is constrained not only by its style and layout capabilities, but also by the temporal context of the presentation. For instance, text must be rendered simultaneously with related objects, and it must be hidden when these are finished. Authors can define small amounts of lightly formatted text containing embedded temporal markup within the context of a SMIL presentation. Such text may be used for labels within a presentation or for incidental comments or foreign-language subtitles. It is also possible to use large amounts of structured text (with or without temporal markup), but in this case the use of SMILText as a text media object is recommended, or the use of objects encoded in formats such as XHTML or DFXP (Distribution Format eXchange Profile) [27]. SMILText also defines a set of additional elements and attributes to control timed text rendering (see Figure 5.4). All SMILText content is processed in a manner consistent with other SMIL media. The SMILText profile also allows SMILText to be used as an external format. Moreover, since the smilText elements and attributes are defined in a series of modules, designers of other markup languages may reuse these modules when they wish to include a simple form of timed text functionality in their language. SMILText, as a text container with an explicit content model for defining timed text, meets requirement iv.

Figure 5.4. SMIL document model and temporal containers.
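To make the transformation concrete, the following sketch, in the spirit of Figure 5.4, shows how a host SMIL 3.0 document can compose three provider-hosted videos with a parallel smilText comment stream. The element and attribute names come from SMIL 3.0 [17]; the layout, URIs, comment text and timings are illustrative assumptions rather than output of our tools.

    <!-- A minimal sketch: a continuous playlist of three externally
         hosted videos with one synchronized comment stream rendered
         in its own region. -->
    <smil xmlns="http://www.w3.org/ns/SMIL">
      <head>
        <layout>
          <root-layout width="640" height="520"/>
          <region id="vid" top="0" width="640" height="480"/>
          <region id="cmt" top="480" width="640" height="40"/>
        </layout>
      </head>
      <body>
        <par>
          <!-- Requirements ii/iii: aggregation across providers, by
               reference only, so the base videos remain untouched
               (requirement i). -->
          <seq>
            <video region="vid" src="http://provider-a.example/clip1.mp4"/>
            <video region="vid" src="http://provider-b.example/clip2.mp4"/>
            <video region="vid" src="http://provider-c.example/clip3.mp4"/>
          </seq>
          <!-- Requirement iv: timed comments; <tev> marks when each
               entry appears and <clear> when the region is wiped. -->
          <smilText region="cmt">
            <tev begin="5s"/>Look at that shiny, beautiful trombone!
            <clear begin="12s"/>
            <tev begin="95s"/>This second clip spans the playlist boundary.
            <clear begin="102s"/>
          </smilText>
        </par>
      </body>
    </smil>

Because the <smilText> stream runs in parallel with the whole <seq>, its entries can span the boundaries between the individual clips, which is precisely what requirement ii asks for.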
Figure 5.5. Timed text content and temporal hyperlinks.

5.2.3 Temporal Hyperlinks

The SMIL 3.0 Linking modules define the document attributes and elements for navigational hyperlinking. SMIL hyperlinks may be triggered by user interaction or by other events, such as temporal events. SMIL 3.0 provides only inline link elements. Links are limited to unidirectional single-headed links (i.e., all links have exactly one source and one destination resource). As with styled time-based text comments, adding temporal hyperlinks via text content can enrich the content viewing experience for end-users and for their social circle. This association makes SMIL meet requirement vii. It is important to highlight that our document model allows links to be added to content without violating the legal rights of any party. This is possible because navigation points within the video are encoded as a series of content events in the SMIL document. Two classes of links can be provided, as illustrated in Figure 5.5:

• Intra-video Navigation Link: a text link that takes the viewer to another location within the active video; and
• Inter-video Navigation Link: a text link that takes the viewer to another piece of content, outside the active video.

5.2.4 Contextual Information

Current Web-based video solutions provide limited support for including metadata related to comments. For example, they do not allow end-users, at authoring time, to create different views on the comments depending on the target audience. As discussed in the previous chapter, one might not send the same set of comments to one's family and to one's singing teacher. SMIL 3.0 allows associating meta-information with any element within the document body, including timed text comments. This makes it possible to provide information with semantic intent within the presentation information, by binding relevant nodes with meta-information. As mentioned before, SMILText allows text comments to be described as single structured units that can be targeted to different audiences. Therefore, we can consider each comment entry as the smallest unit that can be tagged. In order to share a video with comments, we should add contextual metadata, such as who has created the comment, when, why, how, and for whom [60]. Support for targeted comments might increase the authoring overhead, but it provides a level of personalization that is lacking in common Web environments. SMIL can tackle the contextual problem, requirement v, by allowing metadata to be associated with timed text comments. Figure 5.6 illustrates this process. Here we see a master comment stream that has been composed by Dick and specifically targeted at all viewers within his social circle.

5.2.5 Selective Viewing

One shortcoming of current video captioning/commenting systems – whether closed captions or streams of comments on a Web page – is that every user visualizes the same collection of information. It is doubtful that even the most interested reader will go through the dozens of comments created by unknown individuals – but there is a much stronger incentive to view the 20 or so comments that are likely to be generated by family members or close personal friends (as indicated by the results presented in Section 5.1). In order to deal with this problem, video commenting tools can make use of the structured nature of SMIL to selectively present content, requirement vi.
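The fragment below sketches one plausible encoding of a comment entry that combines requirements v-vii. The chapter does not prescribe this exact markup: the SMIL <a> element (Linking modules) supplies an intra-presentation navigation point via a fragment identifier; a <metadata> child, which SMIL 3.0 permits on body elements, carries the contextual information; and the customTest facility from the SMIL Content Control modules is used here, purely as an illustration, to switch audience-specific comment streams on or off. Identifiers, URIs and times are invented for the example; layout is omitted.

    <smil xmlns="http://www.w3.org/ns/SMIL">
      <head>
        <customAttributes>
          <!-- Audience switches for selective viewing (requirement vi) -->
          <customTest id="closeCircle" defaultState="true"/>
          <customTest id="public" defaultState="false"/>
        </customAttributes>
      </head>
      <body>
        <par>
          <video id="tromboneSolo" src="http://provider-a.example/concert.mp4"/>
          <!-- Requirement vii: activating the comment seeks to the element
               whose id the link names (an intra-video navigation link) -->
          <a href="#tromboneSolo">
            <smilText customTest="closeCircle">
              <metadata>
                <!-- Requirement v: who wrote the comment, when and for
                     whom; an RDF payload would normally go here (elided) -->
              </metadata>
              <tev begin="236s"/>Look at that shiny, beautiful trombone!
              <clear begin="243s"/>
            </smilText>
          </a>
        </par>
      </body>
    </smil>

A player can then realize selective viewing simply by flipping the state of the relevant customTest, without touching the comment entries themselves.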
Video commenting tools can enable users to – besides the traditional turning of all comments on or off – select and watch comments created by a certain individual or community, about specific topics, or created on a certain day. Moreover, aggregated comments and metadata can be used for generating diagrams of hotspots within videos. All of this is possible thanks to the document model – structured text comments can be analyzed – and to the contextual information associated with the comments. Figure 5.6 illustrates a scenario in which a viewer is interested in a certain category of comments.

Figure 5.6. Contextual information and selective viewing.

5.3 A Timed Text Video Commenting System

Based on the temporal transformations discussed in the previous section, we designed and implemented a video commenting system as an independent application. Our solution allows end-users to easily add timed text comments to particular events within third-party online videos. In the remainder of this section we detail the technical aspects of this commenting system.

5.3.1 Infrastructure

The high-level workflow of our video commenting system is illustrated in Figure 5.7. The interaction starts when a user requests a video. For that we make use of the YouTube Data API (Application Programming Interface), which provides programmatic access to the videos stored in YouTube. It allows us to retrieve a set of videos matching a user-specified search term or to retrieve standard feeds (e.g., most viewed today). The data is requested using AJAX (Asynchronous JavaScript and XML) and returned in the XML (eXtensible Markup Language) format, then parsed and presented to the end-user. For video playback we use the YouTube Player API, which is exposed via JavaScript. It allows us to control not only the 'look and feel' of the player, but also the playback behavior of the videos in our Web application. With the current YouTube infrastructure, the client Web browser must be HTML5 compliant or have Flash Player 10.1, or higher, installed. Most importantly, the Player API provides the necessary time information to synchronize the text comments within a video. This information is obtained by listening to specific events, which are fired accordingly (e.g., the time update event). A similar infrastructure would be necessary for making the commenting tool available for videos hosted by other providers. In that case, the YouTube Data and Player APIs would have to be replaced and the interfaces of the new provider adapted accordingly.

Since the viewer has no rights to add comments to the base video, the timed comments are stored separately on our Web server. As mentioned previously, the actual format used to encapsulate the multimedia presentation (base video plus a layered collection of timed comments) is W3C's SMIL 3.0. In fact, timed comments are specified in SMILText, the embedded text format for use within SMIL 3.0. SMIL allows us to respect the video owner's rights and to keep a provider-agnostic enriched video. As such, comments can be shared, modified and analyzed independently. For the synchronized playback of end-user comments we implemented a SMILText JavaScript engine that runs in the client's Web browser. Its API allows us to embed SMILText functionalities in Web pages and have the presentation controlled by an external source, in this case the YouTube video player.
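Since the viewer's comments live apart from the base video, what the engine consumes is a standalone timed text document. The following is a minimal sketch of such a comment stream under the SMILText External Profile; the file name, comment text and timings are illustrative assumptions, not output of our tool.

    <!-- comments-dick.st: one user's comment stream, stored on our Web
         server, kept apart from the base video (which stays at its
         provider). -->
    <smilText xmlns="http://www.w3.org/ns/SMIL">
      <tev begin="236s"/>Look at that shiny, beautiful trombone!
      <clear begin="243s"/>
      <tev begin="301s"/>Here comes the clarinet section.
      <clear begin="308s"/>
    </smilText>

Keeping each user's stream in its own document is what lets comments be shared, modified and analyzed independently of the video and of each other.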
The SMILText engine has reasonably complete coverage of the features defined in the SMIL 3.0 SMILText External Profile. The API also provides a number of other utilities for adding and manipulating timed text content, making it possible to create applications such as the commenting tool presented in this chapter.

Figure 5.7. Workflow of our online video commenting system.

5.3.2 User Interface

In order to allow end-users to comment on the videos, we need a user interface that hides all the complexity from them. This is achieved with our video commenting system, which wraps the video content and all the timed text comments in a multimedia presentation. The commenting interface (Figure 5.8) is composed of a video rendering area (1), a rendering space for comments (2), an input area (3) and the sidebar controls (4). In most cases, relatively passive end-users will simply watch a piece of video content forwarded to them. If the content itself has embedded comments, these can be selectively turned on or off via the sidebar controls.

Figure 5.8. Video commenting interface.

During playback users may also choose to insert new comments in the input area (Figure 5.8 (3)). One key feature of our video commenting system is its ability to (semi-)automatically compute the temporal alignment of user-generated comments. To explain how this functionality works, consider the example in Figure 5.8. In a typical usage scenario, we assume users will interact after a certain moment of interest has passed (e.g., after seeing the trombone on screen). In this case, comments need to be synchronized in such a way as to avoid situations in which the comment – "Look at that shiny, beautiful trombone!" – appears after the instrument is no longer visible. Our approach for this use case is as follows. When an end-user indicates s/he wants to add a comment, the video playback is paused and the input area gains focus (Figure 5.8 (3)). As the interaction is performed right after listening to or watching an event of interest, we assume that the current moment (tnow) is the end of the comment (tend = tnow). As an initial guess, we consider that the start time of the comment (tstart) is equal to the current time (tnow) minus a minimal duration (MinDur) that a comment should stay on screen to be effectively read (tstart = tguess = tnow - MinDur). Based on our prediction model and its parameters – e.g., the number of words in a comment (N), the average duration of a character/phoneme of a word in a specific language (α) and the average duration of pauses (β) – tguess is recalculated, with tstart then determined by the maximum value among tguess, the end of the previous existing comment (tend') and zero. Figure 5.9 illustrates scenarios in which tstart assumes different values.

Figure 5.9. Predictive timing support for end-user comments.

In the example of Figure 5.8, the start and end times computed for "Look at that shiny, beautiful trombone!" stayed around 3min57s. When the user saves the comment, the video playback is resumed. The predictive timing functionality often provides coarse temporal support; users may fine-tune the timing if desired. In our experience, such fine-tuning is not necessary unless tightly coupled subtitles are being created.
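The chapter names the model's parameters but not the exact formula that combines them; one plausible formalization, consistent with the description above (an assumption of this sketch, not a verbatim reproduction of the implemented model), is:

    tend = tnow
    tguess = tend - max(MinDur, α·C + β·P)
    tstart = max(tguess, tend', 0)

where C is the total number of characters/phonemes over the N words of the comment and P the number of pauses, both introduced here for illustration. The first clamp guarantees the comment stays on screen for at least MinDur or its expected reading time, whichever is larger; clamping at tend' keeps successive comments from overlapping; and clamping at zero keeps the start within the video.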
5.4 Evaluation

As mentioned before, the survey discussed in Section 5.1 was followed by 2 other experiments. First, participants were instructed to interact with the prototype system presented in Section 5.3 and then to fill in a questionnaire reporting their experiences. In the second and last part, they were asked to further explore the commenting activity by closed captioning a sample video (approx. 7 minutes in duration) and to fill in another questionnaire. Table 5.1 summarizes the number of participants involved in each part of the evaluation process presented in this chapter. In the next sections we present the results and discuss the findings from the evaluation of our online video commenting tool.

Table 5.1. Composition of participants in the evaluation process.

Questionnaire                                          Number of Respondents   % of Respondents
1. Current video watching and commenting practices     21                      100.0%
2. Commenting on videos with our prototype system      18                      85.7%
3. Captioning videos using our prototype system        12                      57.1%

5.4.1 Commenting on Videos

In general, participants' feedback regarding our video commenting tool was very positive (see Figure 5.10). When asked how much they liked the service (Q2.1), 13 out of 18 answered some or a lot. All respondents reported that our video commenting tool is helpful for adding synchronized comments to YouTube videos (Q2.2). Some expressed such appreciation by saying that "synchronization is much better" and "I can easily add comments to specific moments in the video. In Facebook I think I can't. In YouTube I can but I have to type the time moment in the comment". When compared to regular comment threads in YouTube or Facebook, 9 users said our tool is better or much better (Q2.3). A user justified his/her answer by saying that "the possibility to comment on a specific moment in the video adds a lot of functionality. Instead of saying, 'after 16 seconds he does this', you can just comment at that moment. This also works quite well on SoundCloud as far as I have seen". On the other hand, 5 participants said they were unable to judge. One of them explained: "I have never added comments to Facebook nor YouTube. However, the way to add comments in this (video commenting) tool is intuitive".

5.4.2 Closed Captioning Videos

The last experiment was the most time consuming one and, for this reason, only 12 participants committed to complete it. Users were asked to closed caption a 7-minute speech video. This task was first performed using our video commenting tool, and later using a standard video player and a text editor. This experiment allowed us to evaluate the effectiveness and usefulness of the time prediction algorithm provided in our commenting tool. Using our tool, participants spent on average 61 minutes (Standard Deviation: SD = 48 minutes), and without it 101 minutes (SD = 33 minutes). The utility of our commenting system is also reflected in the answers to the questionnaire (see Figure 5.11). When asked how much easier it was to add closed captions with our system compared to the other method, all respondents said it was much easier or easier (Q3.1). Similar feedback was obtained for the question regarding participants' appreciation of the predictive synchronization of captions/comments (Q3.2). In this case, 7 users reported having liked it a lot. In one case, one participant mentioned that "most captions were synchronized nice to the video, and the prediction algorithm does work. It saves a lot of time having not to
fine tune the start and end points, as you have to do with the SRT format". And another user added: "the prediction works really good, the captions are usually where they are supposed to be!".

Figure 5.10. Results from the evaluation of our commenting tool.

Figure 5.11. Results from the closed captioning activity using our prototype system.

Although our primary objective was not to provide a closed caption authoring tool, the point we want to make here is that video commenting systems like ours should not only allow users to add timed comments, but also help them by offering automatic processes that make the commenting task simpler and more intuitive.

5.5 Discussion

In this chapter we presented our efforts in supporting personalized content enrichment. Motivated by a survey on media watching and commenting practices, we introduced and evaluated a video commenting paradigm that follows the intimacy and reciprocity guidelines introduced in Chapter 2. Results from the evaluation process show that users appreciated the functionalities of our system and would potentially use it to communicate with their close circle (requirements on intimacy and reciprocity) and, also, with the general public.

The survey research about media watching and commenting practices represents the first major contribution of this chapter. While this study is relevant for analyzing user behavior, it is even more relevant as motivation for our work. Do people want to add timed comments within videos? Our results provide evidence that regular Internet users would add synchronized comments while consuming video on demand if they had the appropriate tools for doing so. From a document model perspective, all the requirements presented in Section 5.1 are met by using a structured multimedia language like SMIL. In this work we focused on text, but a similar approach could be used for other types of user-generated enrichments [56][61].

The video commenting tool reported in this chapter also addresses the functional requirements. The transformation process starts when a video URL is given as input. Next, our video commenting system applies a document model transformation, which respects the owner's rights by retaining the video integrity (requirement i) and allows compilations that include video clips from multiple sources (requirement iii). Timed text content is applied as soon as a user clicks the input area (requirement iv). This means that, given a multimedia document, our tool adds a parallel container that synchronizes comments with a particular video. Whenever a new comment entry is inserted, implicit metadata is automatically added (requirement v). As these comments can be targeted to different audiences, they can be selectively rendered (requirement vi). Multiple-video aggregation and timed end-user navigation (requirements ii and vii, respectively) can be met by integrating the personalized narratives presented in the previous chapter.

The evaluation of our video commenting system represents the second major contribution of this chapter. It shows that this paradigm brings a measurable improvement over existing commenting systems. It also shows that the burden of synchronizing comments can be minimized by the use of predictive timing. These results answer our research question.
Finally, we do not claim that synchronized comments should replace traditional ones, but rather that they are complementary. Regular comments are targeted at a fundamentally different use case than the ones offered by our system. On the one hand, in Facebook or YouTube, people can comment about a video, but also give feedback to the author or start a conversation about something unrelated. On the other hand, our video commenting system can be used to highlight interesting things for other viewers, perhaps to make a point about a particular event within the video. Such textual comments should preferably be simple; otherwise viewers will have difficulty reading them while watching a video.

6 Conclusions1

During the past 20 years, authoring has been part of the multimedia community's research agenda. Unfortunately, multimedia authoring has been seen as an initial enterprise that occurs before 'real' content processing takes place. This limits the options open to authors and to viewers of rich multimedia content in creating and receiving focused, highly personal media presentations. In this thesis we reflect on the multimedia authoring workflow and argue that a fresh new look is required. We focused on the particular task of supporting socially-aware multimedia authoring, in which the relationships within particular social groups of authors and viewers can be exploited to create highly personal media experiences. Our framework is centered on empowering users in telling stories and commenting on personal media artifacts, considering the long-term social context of the user. We provided an overview of the requirements and characteristics of socially-aware multimedia authoring within the context of exploiting community content. In particular, our research involved the study of different mechanisms to allow users to explore, create, enrich and share videos based on personal relationships. Our methodology integrated knowledge from Human-Computer Interaction (e.g., focus groups/interviews for needs assessment, iterative prototyping and user evaluation) and document engineering.

1 This chapter contains extracts from the following article:
D.C.A. Bulterman, P. Cesar and R.L. Guimarães. 2013. Socially-Aware Multimedia Authoring: Past, Present and Future. ACM Transactions on Multimedia Computing, Communications and Applications (TOMCCAP), Volume 9, Issue 1s, Article 35 (October 2013), 23 pages. DOI=10.1145/2491893 http://doi.acm.org/10.1145/2491893

In this chapter we first revisit and answer the research questions of the thesis. We then reflect on the lessons learned, before concluding with a discussion of the issues that we feel can provide a fruitful basis for future multimedia authoring support. We argue that providing support for socially-aware multimedia authoring can have a profound impact on the nature and architecture of the entire multimedia information processing pipeline.

6.1 Revisiting the Research Questions

Much of the media landscape has been, and continues to be, dominated by commercially produced content. Whether image, video, audio or (to a lesser extent) text, users today have become accustomed to experiencing highly polished media messages. In spite of the dramatic impact of user-contributed content sites (such as YouTube and Facebook), the amount of personal content being shared with family and friends (to say nothing of wide anonymous audiences) is minimal.
A conservative estimate of media use indicates that average owners of smartphones and portable cameras capture hours of video yearly, but that only minutes (or seconds) of content are being shared. Does this mean that user-generated content is less important? No. Personal archives have a high degree of personal value: photos of family and friends, videos of small children, audio fragments that capture the sounds of people who have played an important role in one's life. Although there may always be exceptions, it is clear that a short video showing a child's first violin solo will not attract the same audience as, say, a slickly produced commercial music video. This does not make the violin fragment less valuable. In this thesis, we focused on community authoring applications, where content is contributed from many amateur sources and distributed within a relatively closed circle of viewers who have varying degrees of affinity with the content produced. We concentrated on support for situations in which both the original presentation creator and the presentation viewer play a role in determining presentation content. Given this context, we discuss and answer each of the research questions according to the work presented in the bulk of the thesis.

Question 1.1 Can a socially-aware multimedia authoring system be defined in terms of existing social science theories and human-centered processes, and if so, which?

In this thesis we reformulated the research problem of multimedia authoring by investigating mechanisms and principles for togetherness and social connectivity around personal media. Our focus was on parents, family members and friends of students participating in a small-scale social event. In this scenario, parents capture recordings of their children for later viewing and possible sharing with friends and relatives. Based on a 4-year evaluation process, we specified a set of guidelines for the design and implementation of socially-aware multimedia authoring and sharing tools. We aim at nurturing strong ties and improving social connectedness by supporting emotional intensity, personal effort and intimacy, and by enabling reciprocity. With these guidelines we intend to increase the feeling of connectedness, particularly among family members and friends who could not attend the social event. As shown in Chapter 2, our socially-aware multimedia authoring paradigm is aligned with the requirements needed for social communities that are not addressed by existing social media Web applications. These guidelines directly address research Question 1.1.

Question 1.2 Does the functionality provided by a socially-aware multimedia authoring system provide an identifiable improvement over traditional authoring and sharing solutions? If so, how can these improvements be validated?

To evaluate the utility and usefulness of socially-aware multimedia authoring, we realized the guidelines mentioned above in a two-phased prototype system called MyVideos. We actively participated in the design, implementation and integration of this system, and our contributions enabled us to perform extensive field trials, which were a major part of TA2's success2.
Working with a test group at local high schools in two different countries (the UK and the Netherlands), we investigated how focused content can be extracted from a shared repository, and how content can be enhanced and tailored to form the basis of a personalized multimedia artifact that can eventually be transferred and shared with family and friends (each with different degrees of connectedness with the performer and his/her parents). Results from a long-term evaluation process show that all our participants (from phases 1 and 2) liked the functionality provided by our system and considered it a valid alternative for strengthening social interactions when apart. Using our system, they would therefore share more videos with friends and family. These results – complemented by more specific findings on media exploration, creation of personal memories and content enrichment (Chapters 3-5) – directly answer research Question 1.2.

2 The pan-European Project Together Anywhere, Together Anytime – http://ta2-project.eu.

Question 1.3 Does a socially-aware video exploration system provide an identifiable improvement over current approaches for accessing and navigating a repository of shared media?

While following the emotional intensity guideline, in Chapter 3 we discussed the two-phased design, development and experimentation of an interface for browsing a collection of user-generated videos from a shared event. Users could explore and navigate (fragments of) video clips recorded by several people based on their own personal/social interests. The design, deployment and evaluation of the system resulted in the identification of key requirements for this novel type of browsing interface. In particular, our approach 1) supports exploration based on the inherent event structure; 2) makes use of contextual information to help in the navigation process; 3) allows for flexible searches based on a combination of filters; and finally, 4) provides a way to switch between camera angles that might have captured different aspects of the event. Results of the evaluation process show that all participants appreciated the browsing interface and indicated that it is better than traditional tools for exploring videos they care about. They would therefore find videos more efficiently using our system. These results clearly indicate that a socially-aware video exploration system like ours provides an improvement over current tools for accessing and navigating a repository of shared media assets, directly answering research Question 1.3.

Question 1.4 Where is the balance between automatic and manual processes when authoring personalized narratives users care about?

As with browsing a shared video collection, social relationships are key to authoring personalized stories users care about. In Chapter 4 we reported on our efforts to support the creation of personalized video stories reusing collective content. We developed a first version of an authoring system, subjected it to user testing, and then developed an improved version that follows the personal effort guideline of socially-aware multimedia authoring. Our initial results showed a general enthusiasm from participants, which was validated in the first evaluation phase. While the video compilations automatically produced by the initial system were considered visually compelling, users missed the capability of personalizing them by adding their own 'imprint'.
To address this limitation, we proposed a hybrid authoring approach that provides mixed support for automated creation, by selecting content of personal interest, and manual enhancement of personalized video stories. Based on user feedback as part of our four-year study, we have demonstrated that it is possible to satisfy casual content creators while still allowing extensive personalization to take place, if needed. These results directly answer research Question 1.4. We believe that the combination of automatic and manual processes provides the right balance of complexity and functionality.

Question 1.5 Does the support for timed end-user commenting within pre-authored narratives provide an identifiable improvement over current media commenting approaches?

While concentrating on the creation process, we cannot forget that content enrichment also plays an important role in the socially-aware multimedia authoring workflow. Motivated by a survey on media watching and commenting practices, in Chapter 5 we reported on the design, implementation and user-centric evaluation of a video commenting framework that follows the intimacy and reciprocity guidelines. To realize this framework, we specified and described a set of temporal transformations for multimedia documents. Our approach allows end-users to create and share personalized timed text comments within third-party online videos. The benefit over current solutions lies in the usage of a rich commenting format – in our case SMIL [17] – which is not embedded into a specific video encoding format. The evaluation of a video commenting system that realizes our framework clearly indicates that participants appreciated our system (13/18, or 72%, of the participants) and considered it helpful (100%). Our results also show that 50% of the participants considered our video commenting approach better than the one offered in YouTube and/or Facebook. These results show that our commenting framework brings a measurable improvement over existing commenting systems, and directly answer research Question 1.5.

6.2 Reflection and Further Directions

In this thesis we provide useful insights into how a socially-aware multimedia authoring and sharing system should be designed and architected to help users in recalling personal memories and in nurturing their close circle relationships. The main contribution of our work does not lie in the use of a specific technology (e.g., SMIL, NCL or Web standards) but in further understanding the fundamental tradeoffs that enable better sharing of 'personal' media. Results from our evaluation process show that socially-aware multimedia authoring provides a more fruitful approach than earlier work. Although our research has reached its aims, there are some unavoidable limitations. First, the total amount of time spent annotating the footage (see Chapter 2) demonstrates that this is still a very challenging problem, especially when we consider dimly lit user-generated content of varying quality, encoding, etc. Although these annotations are essential in our authoring framework, this thesis does not aim at solving this problem. As to the number of subjects participating in the evaluation, we agree that 'more is more', but note: each subject needed to agree to spend several hours per evaluation (about 1h30min recording concerts plus 2h in lab studies). We found it difficult to find high school parents who would commit to this load. We are pleased that our parents – about 25% of the concert participants!
– were motivated to contribute this block of time. The goals of the study made it impossible to do crowdsourced testing, given the focus on common personal media. Moreover, we are not aware of other studies that provide the same breadth. Another limitation could be that we focused on a particular use case scenario. We reiterate that our participants represent a realistic sample of users: actual family members from two countries (NL and UK) who were involved in the concert recordings and prototype evaluation. We agree that generalization to other events is an important problem, but before getting there we need to start somewhere. We see this as a topic for future work.

Providing support for socially-aware multimedia will significantly impact the support required for effective encoding, storage, classification, selection, transmission, protection and sharing of (potentially composite) media artifacts. The principal reason for this is that the context in which media is used will strongly determine how it is classified and accessed. Annotations and metadata will become multifaceted and dynamic, and will be determined by use rather than by design. In the following subsections we highlight some opportunities for future research in socially-aware multimedia.

6.2.1 Media Encoding and Storage

At present, media encoding is based on an agnostic view of content. This has been used to great advantage on sharing Websites and physical distribution media. The assumption has been, however, that all of the fragments related to a single story are compressed into a single fixed media object. There are usually no facilities for packaging custom versions of content from a single base encoding. Each personal version of a video (or video fragment) must be re-encoded in a new document. One important difference required to support end-user composition is that small logical groups of media would be stored on several servers, each as individual fragments. These fragments could be mixed and matched dynamically at viewing time to support the interests of the viewer. In terms of our school concert example, this would mean that all of the individual assets captured by all of the parents could be saved in the cloud, across servers. Individual presentations could then be stitched together on demand. Having a logical media object constructed out of dynamically combined physical fragments allows customized navigation to be supported. In YouTube (as in other commercial video sharing systems), dynamic mashups are not supported. End-users have to find suitable source material, cut it into shots, and assemble an encoded final video. While this solution does not impose hard requirements on delivery and rendering, it is limited in terms of adaptability, user interaction and seamless playback [36]. One approach to implementing such dynamic combination is supported by DASH (Dynamic Adaptive Streaming over HTTP), a system for HTTP-based streaming [70]. Although some efforts have investigated the use of DASH with Rich Media services [9][10], at present it is typically used for storing pre-defined fragment encodings, nearly always to support bitrate-adaptive resolutions (during presentation, the quality of the content can be adjusted based on environmental factors such as available bandwidth or end-user screen size). Similarly, dynamic media compositions could be achieved using a combination of HTML5 and W3C Media Fragments [21] and/or JavaScript code (e.g., Popcorn.js or the Kaltura Video Platform).
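As an illustration of what such fragment-level composition might look like in SMIL terms (the format used throughout this thesis), the sketch below stitches clips stored on different servers into one logical presentation using the clipBegin/clipEnd attributes; the same selection could alternatively be expressed with W3C Media Fragment URIs (e.g., appending #t=12,47 to a source URI). Server names, paths and times are illustrative assumptions, not part of any deployed system.

    <!-- A sketch of fragment-level stitching: one logical video
         assembled on demand from clips stored on different servers. -->
    <seq>
      <!-- 35 seconds of the trombone solo, from server A -->
      <video src="http://server-a.example/concert/song2.mp4"
             clipBegin="npt=12s" clipEnd="npt=47s"/>
      <!-- the finale, from server B -->
      <video src="http://server-b.example/concert/finale.mp4"
             clipBegin="npt=0s" clipEnd="npt=30s"/>
    </seq>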
Adaptivity in our work can leverage this support, but our main interests are in supporting a more abstract form of content selection: providing more trombone content to the father of the trombone player and more clarinet content to the mother of the clarinetist. This is a matter of dynamic content selection rather than (or at least in addition to) dynamic encoding selection. The selection (or generation) of dynamic content requires more elusive criteria for content selection, such as a profile of the viewer in addition to profiles of the available content, and a content-wide temporal model that exposes logical divergence and convergence points for creating content streams. It also requires a container format that allows differential segment lengths to exist across candidate segments. To support this, the current model of content streaming would need to be revised: the seamless integration of individual content fragments (as opposed to encoding fragments) into a logical whole is a composition concept that most media servers and media container languages are as yet ill-equipped to support.

6.2.2 Media Classification and Annotation

Personal media classification and annotation remains a challenge for supporting effective content sharing. Professional content is often highly segmented along the lines of established commercial distribution models. For personal content, the situation is vastly different. This shift in emphasis is new for multimedia, but there are many established examples in music, art and literature where the intentions of the composer, artist or writer are decoupled from the applications of the media itself. At present, personal content annotation is driven by device-supplied metadata (e.g., clocks, location coordinates, file names, as well as objects and faces). For socially-aware multimedia, it is also necessary to encode relative social relationships among interested parties – and to maintain those relationships over time. As with any large software system, the long-term maintenance costs of media will dominate the short-term development costs. This will require a new generation of iterative, socially-aware media classification tools. The analysis of content then becomes a continuing task, not an import-time activity. In the same vein, content recommendation needs not only to use such information, but also to be sensitive to the context of use: are you watching alone, with your spouse, with your children, with your friends?

6.2.3 Customized Media Selection

Perhaps the most significant innovation in (broadcast) content selection occurred with the introduction of the video tape recorder. For the first time in history, it was the viewer who determined when content would be watched – on the precondition that it had been broadcast and recorded earlier. A next, but more minor, innovation came with the introduction of the digital set-top box, which included an embedded program guide, providing the opportunity for more automated content selection and recording. The next logical development is to remove the TV guide altogether and to have the system itself recommend content for the family, which it finds based on metadata encoded by the content providers. One drawback of many home content systems is that a set-top box is typically not aware of who is actually watching TV. Some form of personalization is supported, but at a fairly impersonal level.
At present, much research is being expended on recommender system technology. These systems depend heavily on producer-generated metadata for determining available candidate content. For socially-aware multimedia, the granularity of the metadata needs to be refocused on personal content. Another change in focus is that content selection will need to move from selecting 'programs' to selecting fragments of content. For a given viewing experience, several fragments will typically need to be dynamically combined to support end-user engagement.

6.2.4 Content-Based Navigation

One of the challenges with temporal searching along a timeline is that it is a highly unstructured activity. For instance, in a conventional YouTube interface for navigating through a video object, users can only select key frames without any higher-level narrative guidance. We note that even 1980s-generation DVD technology provided more significant control through its chaptering interface. In general, the time axis provides no information on the logical structuring of the event, let alone the performers in the concert or their relationships. Still, in the absence of any semantic structuring of content, it is often all that is available. It will be necessary to study new mechanisms to replace timeline searching with navigation based on an overlay of structure components. One approach to providing this structure in our school concert use case is based on graphs of performers, instruments, songs or solos. It could also be based on cinematographic classifications, such as long shots, pan shots, and tight shots.

6.2.5 Ownership and Digital Rights

Reusing content brings with it questions of ownership. In printed documents, this is a solved problem: even though the base content is copyright protected, there is a clear distinction between 'my' media and that of the original authors. For web pages and online content, the relationship is less simple. If transparent sheets had been placed between all of the pages of a book, we could take all of the user's comments and distribute them as separate items – all fully within current law. The added content could be further aggregated with the context created across a social network (or across the Internet), and analyzed. What are the most marked-up pages in the book? Do these represent the most interesting or most unclear sections of text? Do the markup patterns change over time? Which comments are appropriate for which users? When annotating a piece of media – whether it is text, audio, images, or whatever – the implication has been that the annotations are of a highly personal nature. Of course, if many of these personal notes are collected and analyzed, they could provide valuable insights into the reusability of personal media assets. Even a simple density analysis of multiple media annotations could provide interesting clues for socially-aware recommender systems.

6.2.6 Security and Privacy Concerns

Content can be used or misused by various members of a user community, depending on their (possibly time-variant) relationships. Research is required to support content access and content protection that reflect time-variant social and personal relationships. One aspect of security and privacy in socially-aware multimedia is that personal metadata will likely become too sensitive to simply place on a third-party storage system (like Facebook or Google): all of us will want to take back our identity and maintain our own control over our life-long information.
This will require convenient interfaces. It will also probably require users to become accustomed to paying for media access and sharing services.

6.3 Closing Thoughts

Much has changed in the 'world' of multimedia. Who would have expected twenty years ago that within two decades, it would be commonplace not only to listen to music via your computer, but to buy it there as well? That books would not only be written on a PC, but that the PC and its technological 'cousins' would become a handy way to read them, or to have them read aloud. That the computer would threaten to replace not only the television, but also the movie theatre as a venue for the shared watching of content. And, perhaps more significantly in the long term, that the computer would not only render a wide range of real and artificial images, but that it would attempt to understand them as well.

In this thesis we have outlined what we mean by socially-aware multimedia. We have argued that the impact of supporting the user-in-the-small transcends the incremental and provides a number of (fascinating) new challenges that require fundamental research results across a wide range of multimedia disciplines. This thesis has presented the idea of socially-aware multimedia as a next step in the evolution of media authoring. By introducing the notion of temporally-variant social content into media storage, access and sharing, we hope to stimulate a new generation of media research in which the multimedia user is given the central role that she deserves.

Bibliography

[1] 2002. MPEG-7: The Generic Multimedia Content Description Standard, Part 1. IEEE MultiMedia 9, 2 (April 2002), 78-87. DOI=10.1109/93.998074 http://dx.doi.org/10.1109/93.998074
[2] A. Eliëns, H.C. Huurdeman, M.R. van de Watering and W. Bhikharie. 2008. XIMPEL Interactive Video - between narrative(s) and game play. In Proceedings of GAMEON, 132-136.
[3] A. Hanjalic and L. Xu. 2005. Affective video content representation and modeling. IEEE Transactions on Multimedia 7, 1 (February 2005), 143-154. DOI= http://doi.ieeecomputersociety.org/10.1109/TMM.2004.840618
[4] A. Macaranas, G. Venolia, K. Inkpen, and J. Tang. 2013. Sharing Experiences over Video: Watching Video Programs Together at a Distance. In Proceedings of the 14th IFIP TC13 Conference on Human-Computer Interaction (INTERACT '13). Springer-Verlag Berlin Heidelberg.
[5] A. Piacenza, F. Guerrini, N. Adami, R. Leonardi, J. Porteous, J. Teutenberg, and M. Cavazza. 2011. Generating story variants with constrained video recombination. In Proceedings of the 19th ACM international conference on Multimedia (MM '11). ACM, New York, NY, USA, 223-232. DOI=10.1145/2072298.2072329 http://doi.acm.org/10.1145/2072298.2072329
[6] B. Adams, S. Venkatesh, and R. Jain. 2005. IMCE: Integrated media creation environment. ACM Trans. Multimedia Comput. Commun. Appl. 1, 3 (August 2005), 211-247. DOI=10.1145/1083314.1083315 http://doi.acm.org/10.1145/1083314.1083315
[7] B.T. Truong and S. Venkatesh. 2007. Video abstraction: A systematic review and classification. ACM Trans. Multimedia Comput. Commun. Appl. 3, 1, Article 3 (February 2007). DOI=10.1145/1198302.1198305 http://doi.acm.org/10.1145/1198302.1198305
[8] C.A. Lee, H.R. Tibbo, D. Howard, Y. Song, T. Russell, and P. Jones. 2006. Keeping the context: an investigation in preserving collections of digital video. In Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries (JCDL '06). ACM, New York, NY, USA, 363-363.
6.2.5 Ownership and Digital Rights

Reusing content brings with it questions of ownership. In printed documents, this is a solved problem: even though the base content is copyright protected, there is a clear distinction between ‘my’ media and that of the original authors. For web pages and online content, the relationship is less simple. If annotations were kept on (virtual) transparent sheets placed between all of the pages, we could take all of the user’s comments and distribute them as separate items – all fully within current law. The content added could be further aggregated with the context created across a social network (or across the Internet), and analyzed. What are the most marked-up pages in the book? Do these represent the most interesting or the most unclear sections of text? Do the markup patterns change over time? Which comments are appropriate for which users? When annotating a piece of media – whether it is text, audio, images, or something else – the implication has been that the annotations are of a highly personal nature. Of course, if many of these personal notes were collected and analyzed, they could provide valuable insights into the reusability of personal media assets. Even a simple density analysis of multiple media annotations could provide interesting clues for socially-aware recommender systems.
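A minimal version of such a density analysis is sketched below: annotation timestamps from multiple viewers are bucketed into fixed-size bins, and the densest bins surface as candidate highlights. The timestamps and bin size are invented for illustration.

```python
from collections import Counter

# A sketch of annotation density analysis: timestamps from several viewers
# are bucketed into fixed-size bins, and the densest bins become candidate
# highlights. The timestamps and bin size are invented for illustration.

def hotspots(timestamps, duration, bin_size=10, top=3):
    """Return the most heavily annotated segments as (start, end, count)."""
    bins = Counter(int(t // bin_size) for t in timestamps if 0 <= t <= duration)
    return [(b * bin_size, (b + 1) * bin_size, n) for b, n in bins.most_common(top)]

# Comments from multiple viewers clustering around two moments of a video:
stamps = [12, 14, 15, 15, 16, 88, 90, 91, 300]
print(hotspots(stamps, duration=600))
# [(10, 20, 5), (90, 100, 2), (80, 90, 1)]
```

Even this crude signal, aggregated over a social circle, would already distinguish heavily discussed moments from untouched ones, which is exactly the kind of clue a socially-aware recommender could exploit.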
6.2.6 Security and Privacy Concerns

Content can be used or misused by various members of a user community, depending on their (possibly time-variant) relationships. Research is required to support content access and content protection that reflect time-variant social and personal relationships. One aspect of security and privacy in socially-aware multimedia is that personal metadata will likely become too sensitive to simply place on a third-party storage system (like Facebook or Google): all of us will want to take back our identity and maintain our own control of our life-long information. This will require convenient interfaces. It will also probably require users to become accustomed to paying for media access and sharing services.
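The sketch below illustrates, on invented data, what time-variant access control might look like: what a viewer may see depends on the relationship that held when the media was captured and when it is viewed. The relationship log, the policy table, and the 'most permissive of the two' rule are all assumptions made for this example, not a proposed standard.

```python
from datetime import date

# A sketch of time-variant access control: what a viewer may see depends
# on the relationship that held when the media was captured and when it
# is viewed. The relationship log and policy table are invented here.

# (person, relationship, valid_from, valid_to or None while it still holds)
RELATIONSHIPS = [
    ('carol', 'partner',      date(2008, 1, 1), date(2011, 6, 30)),
    ('carol', 'acquaintance', date(2011, 7, 1), None),
    ('dave',  'close_friend', date(2005, 3, 1), None),
]

POLICY = {'partner': 'full', 'close_friend': 'full',
          'acquaintance': 'thumbnails_only'}
LEVELS = {'none': 0, 'thumbnails_only': 1, 'full': 2}

def relationship_at(person, when):
    """Look up the relationship that held on a given date."""
    for who, rel, start, end in RELATIONSHIPS:
        if who == person and start <= when and (end is None or when <= end):
            return rel
    return 'stranger'

def access(person, captured_on, viewed_on):
    """Grant the more permissive of the capture-time and viewing-time
    rights -- one possible policy choice among many."""
    rels = (relationship_at(person, captured_on), relationship_at(person, viewed_on))
    return max((POLICY.get(r, 'none') for r in rels), key=LEVELS.get)

# A photo from 2010 stays fully visible to carol, a partner at capture time:
print(access('carol', date(2010, 5, 1), date.today()))   # 'full'
```

Whether rights should follow the capture-time relationship, the current one, or some combination is precisely the kind of open design question this research direction raises.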
6.3 Closing Thoughts

Much has changed in the ‘world’ of multimedia. Who would have expected, twenty years ago, that it would become commonplace not only to listen to music via your computer, but to buy it there as well? That books would not only be written on a PC, but that the PC and its technological ‘cousins’ would become a handy way to read them, or to have them read aloud. That the computer would threaten to replace not only the television, but also the movie theatre as a venue for the shared watching of content. And, perhaps more significantly in the long term, that the computer would not only render a wide range of real and artificial images, but that it would attempt to understand them as well.

In this thesis we have outlined what we mean by socially-aware multimedia. We have argued that the impact of supporting the user-in-the-small transcends the incremental and poses a number of fascinating new challenges that require fundamental research results across a wide range of multimedia disciplines. This thesis has presented the idea of socially-aware multimedia as a next step in the evolution of media authoring. By introducing the notion of a temporally-variant social context into media storage, access and sharing, we hope to stimulate a new generation of media research in which the multimedia user is given the central role that she deserves.
Summary

Creating compelling multimedia productions is a non-trivial problem. This is true for both professional and personal content. For professional content, extensive production support is typically available during creation. Content assets are well structured, content fragments are professionally produced with high quality, and production assets are often highly annotated (within the scope of the production model). For personal content, nearly none of these conditions exist: content is a collection of assets that are structured only by linear recording time, of mediocre technical quality (on an absolute scale), and with only basic automatic annotations. These conditions limit the options open to casual authors and to viewers of rich multimedia content in creating and receiving focused, highly personal media presentations. The problem is compounded when authors want to integrate community media assets: media fragments donated from a potentially wide and anonymous recording community.
In this thesis we reflect on the traditional multimedia authoring workflow and we argue that a fresh new look is required. Our experimental methodology aims at meeting the requirements needed for social communities that are not addressed by traditional authoring and sharing applications. We focus on the particular task of supporting socially-aware multimedia authoring, in which the relationships within particular social groups can be exploited to create highly personal media experiences. Our framework is centered on empowering users in telling stories and commenting on personal media artifacts, considering the long-term social context of the user. The work has been evaluated through a number of prototype tools that allow users to explore, create, enrich and share rich multimedia artifacts. Results from our evaluation process provide useful insights into how a socially-aware multimedia authoring and sharing system should be designed and architected, for helping users in recalling personal memories and in nurturing their close circle relationships.

Nederlandse Samenvatting

Goede, aantrekkelijke multimedia producties maken is een ingewikkeld probleem. Dit geldt voor zowel professionele als persoonlijke mediadocumenten. Voor professionele producties is in de regel een uitgebreid instrumentarium ter beschikking. De fragmenten zijn goed gestructureerd, professioneel gemaakt, van uitstekende kwaliteit en vaak van annotaties voorzien (binnen de kaders van het productie model). Voor persoonlijke mediadocumenten gelden bijna geen van deze condities: het materiaal is doorgaans een verzameling fragmenten met als enige structuur de lineaire opnametijd, een (op een absolute schaal) matige technische kwaliteit en minimaal van annotaties voorzien. Deze condities beperken de mogelijkheden voor minder ervaren producenten en consumenten om verrijkte multimedia presentaties te maken die toegespitst en in hoge mate persoonlijk zijn. Dit is nog lastiger als men ook gemeenschappelijk materiaal wil gebruiken: media fragmenten beschikbaar gesteld door andere, potentieel uiteenlopende en vaak anonieme bronnen.

In dit proefschrift bekijken wij de traditionele manier van werken en stellen dat een nieuwe kijk nodig is. Onze experimentele aanpak is gericht op de behoefte van groepen mensen die niet geboden wordt door de gebruikelijke programmatuur voor het bewerken en verspreiden van multimedia. Wij kijken in het bijzonder naar ondersteuning voor het vervaardigen van ‘sociaalbewuste multimediale producties’, waarbij relaties binnen bepaalde sociale groepen kunnen worden gebruikt om zeer persoonlijke multimediale ervaringen te creëren. Onze opzet is erop gericht om mensen in staat te stellen om hun persoonlijk verhaal te vertellen en commentaar te geven bij persoonlijke media artefacten, waarbij de bestendige sociale context van de gebruiker een rol speelt. Dit werk is geëvalueerd door een aantal prototypen te ontwikkelen, waarmee mensen hun multimedia producten kunnen overzien, bewerken, verrijken en verspreiden. Resultaten van deze evaluatie leveren goed bruikbare inzichten op hoe men het beste systemen kan ontwerpen voor sociaal-bewuste media productie en verspreiding waarbij gebruikers in staat worden gesteld om hun persoonlijke herinneringen op te halen en daarmee de relaties in hun eigen kring te onderhouden.

Curriculum Vitae

1981 Born on July 9th in Vitória/ES, Brazil.
1996–1999 Technical High School (Electrotechnics), Federal Center for Technological Education of Espírito Santo (CEFETES), Vitória/ES, Brazil.

1999–2004 Engineer (Computer Engineering), Federal University of Espírito Santo (UFES), Vitória/ES, Brazil. Final project: Desenvolvimento da Infraestrutura Computacional de Gerenciamento e Comunicação de Dados do Ambiente MultiJADE, supervised by prof.dr. F.M. Varejão.

2005–2007 M.Sc. in Informatics (Computer Networks and Hypermedia Systems), Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Rio de Janeiro/RJ, Brazil. Dissertation: Composer – An Authoring Tool of NCL Documents for Interactive Digital TV, supervised by prof.dr. L.F.G. Soares.

2007–2012 Ph.D. student in the Distributed and Interactive Systems group, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands. Thesis: Socially-Aware Multimedia Authoring, supervised by prof.dr. D.C.A. Bulterman and dr. P.S. Cesar.