@phdthesis{oai:nagoya.repo.nii.ac.jp:00007046,
  author = {Yonezawa, Tomoko and 米澤, 朋子},
  month  = {Mar},
  note   = {The purpose of this dissertation is to propose and investigate multimodal expression of personified media, focusing on the expressive strength of voice and gesture. The results are expected to be used in the practical construction of personified representation. There are many research works on, and products of, personified media, such as educational puppets, humanoid robots, virtual agents, and communication avatars, which exploit people's tendency to treat an inanimate being as human. They aim to create an illusion of personification, for virtual agents that present information and for embodied robots in communication support systems. In those cases, it is desirable to build a shape, internal states, and expressions like those of humans or living beings, as the designer expects. Although detailed personified expressions have been partially discussed, their consolidated representation in personification has received little research attention, and most aspects of personified representation are still in the research phase. People usually communicate with each other by combining multimodal expressions with verbal and non-verbal cues. Their daily communication includes detailed expressions of vocal tone, facial expression, and body motion, and they interpret the other's detailed expressions as a general representation of the person. Consequently, in this dissertation we experimentally discuss the effectiveness of multimodal representation focusing on vocal and gestural expressions for generally personified media. The naturalness of multimodal representation does not necessarily correspond to the naturalness of each modality. This dissertation therefore describes investigations of both the expression in each modality and the effect of their coordination (cross-modality). Personified expressions involve the timing, kind, and strength of the component modalities. Although there are studies related to the former two, the latter has received little discussion. Expressive strength changes over time in human expression; accordingly, for the simulation of human expression we need to produce a desired, continuous strength of multimodal representation and of each constituent modal expression. This dissertation is organized into the following chapters, presenting the proposed method for personified expressions and its verification. Chapter 1 clarifies the purpose of this dissertation in the context of background research on multimodal and personified expressions and their problems. Chapter 2 gives an overview of studies related to each personified medium and to multimodal expression, confirming the standpoint and perspective of this dissertation. Chapter 3 discusses characteristics of puppet interfaces as a basis for our system construction in personified media. To clarify the effects of the characteristics of puppets, a puppet-robot was used in a non-face-to-face communication experiment, which allowed expression via motion and vocal cues. Two conditions were explored: one in which the subject could see the real puppet-robot, and the other in which the subject viewed the puppet-robot via a video link. The results demonstrated the importance of the physical presence of the puppet-robot for communication. We also explored the effect of the appearance of the puppet-robot: in one condition a robot was placed inside a stuffed bear, whereas in the other condition a skeletal robot was used.
Based on analysis of the subjects' movements and utterances, we found that the appearance of the stuffed-bear puppet-robot was more effective for communication. From these results, we clarify that the physical presence and appearance of the puppet-robot play an important role in non-verbal communication. Chapter 4 is devoted to proposing and evaluating a method for synthesizing continuous expression in the singing voice by gradually changing the musical expression, based on a speech morphing algorithm applied to differently expressed singing voices. The importance and advantages of perceptually continuous expression are experimentally shown by comparison with binary discrete transformation between different expressions. In order to synthesize various and continuous strengths of singing-voice expression, a singing voice without expression, “normal,” is used as the base of morphing, and singing voices of the same singer with three different expressions, “dark,” “whispery,” and “wet,” are used as targets. Through statistical analyses of perceptual evaluations, we confirmed that i) the proposed morphing algorithm permits perceptually continuous interpolation of expressive strength; ii) perception remained non-categorical (continuous) even though the response curve was non-linear, as in categorical perception; iii) an approximate equation of the perceptual sense can be used to calculate the morph ratio for perceptually linear intervals; and iv) our gradual transformation method works effectively on perceived naturalness. Chapter 5 describes the HandySinger system, a personified tool developed to naturally express a singing voice controlled by the gestures of a hand puppet. Assuming that a singing voice is a kind of musical expression, natural expression of the singing voice is important for personification. We adopt a singing-voice morphing algorithm that effectively smooths the expressive strength delivered in a singing voice. The system consists of a hand puppet and an internal glove with seven bend sensors and two pressure sensors; it sensitively captures the user's motion as the personified puppet's gesture. This configuration enables intuitive control of singing-voice expression. In our experiments, we found that i) the morphing algorithm interpolates expressive strength in a perceptual sense, ii) the hand-puppet interface provides gesture data at sufficient resolution, and iii) the gestural mapping of the current system works as planned. Chapter 6 attempts to clarify the relationship between the expressive strengths of gesture and voice for embodied and personified interfaces. We conduct perceptual tests using a puppet interface, while controlling singing-voice expression, to empirically determine the naturalness and strength of various combinations of gesture and voice. The results show that i) the strength of cross-modal perception is affected more by gestural expression than by the expression of the singing voice, and ii) the appropriateness of cross-modal perception is affected by the expressive combination of singing voice and gesture in personified expressions. As a promising solution, we suggest balancing singing-voice and gestural expressions by extending and correcting the width and shape of the curve of expressive strength in the singing voice. Finally, Chapter 7 concludes this dissertation and describes future directions.
We summarize the conclusions of the proposed method for multimodal representation and its verification through statistical analyses of perceptual tests. Personified media, which have been assigned the role of reducing the psychological burden of user interactions, are being adapted, through advances in embodiment and intelligence, to a new role as alternatives to living beings, such as companions and pets. They are personified differently according to purpose and situation, as androids, animal-shaped robots, and so on; however, they require highly natural expression regardless of their shape. It is important to construct multimodal representation while considering the perceived strength of the expression. Consequently, the results of this dissertation are expected to provide a practical foundation for constructing personified representation in embodied media. Doctoral dissertation, Nagoya University (名古屋大学). Degree: Doctor of Information Science (course doctorate). Date of conferral: March 23, 2007 (Heisei 19).},
  school = {名古屋大学, Nagoya University},
  title  = {Multimodal Representation of Personified Media with Expressive Strengths of Voice and Gesture},
  year   = {2007}
}
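The pipeline the abstract describes in Chapters 4 and 5 (glove gesture -> expressive strength -> perceptually corrected morph ratio -> interpolated singing-voice features) can be pictured with a minimal sketch. The Python below is not the dissertation's implementation: the power-law correction is an assumed stand-in for its unspecified "approximate equation of the perceptual sense," and the uniform sensor weighting is a placeholder for HandySinger's actual gestural mapping; gamma, the feature dimensions, and all function names are hypothetical.

```python
# Minimal sketch, NOT the dissertation's code: continuous expression morphing
# between a "normal" base voice and one expressive target ("dark", "whispery",
# or "wet"), driven by a scalar strength derived from glove sensors.

import numpy as np

TARGETS = ("dark", "whispery", "wet")  # expressive targets named in the abstract

def perceptual_to_physical(strength: float, gamma: float = 1.6) -> float:
    """Map desired perceived strength in [0, 1] to a physical morph ratio.

    Assumes a power-law response (perceived ~ ratio ** (1 / gamma)); gamma is
    a hypothetical constant that would be fitted from perceptual-test data,
    standing in for the dissertation's approximate perceptual equation.
    """
    s = min(max(strength, 0.0), 1.0)
    return s ** gamma

def morph_features(base: np.ndarray, target: np.ndarray, strength: float) -> np.ndarray:
    """Linearly interpolate per-frame acoustic feature vectors of time-aligned
    base and target singing at a perceptually corrected ratio."""
    r = perceptual_to_physical(strength)
    return (1.0 - r) * base + r * target

def glove_to_strength(bend: np.ndarray, pressure: np.ndarray) -> float:
    """Reduce normalized glove readings (seven bend sensors, two pressure
    sensors, each in [0, 1]) to one expressive-strength scalar; a plain mean
    is a placeholder for the system's actual gestural mapping."""
    readings = np.concatenate([bend, pressure])
    return float(np.clip(readings.mean(), 0.0, 1.0))

if __name__ == "__main__":
    frames, dims = 100, 40                # toy feature sequence
    normal = np.zeros((frames, dims))     # stands in for the "normal" voice
    dark = np.ones((frames, dims))        # stands in for the "dark" target
    s = glove_to_strength(np.full(7, 0.6), np.full(2, 0.3))
    morphed = morph_features(normal, dark, s)
    print(f"strength={s:.2f}, ratio={perceptual_to_physical(s):.2f}, "
          f"mean feature={morphed.mean():.2f}")
```

With gamma > 1 the physical ratio grows slower than the perceived strength at the low end, which matches the abstract's point that equal physical morph steps need not feel equally spaced; the real curve shape and its correction are established in the dissertation's perceptual tests, not here.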