Published on September 29, 2007
Semantic Analysis for Video Contents Extraction - Spotting by Association in News Video
Paper by Yuichi Nakamura and Takeo Kanade
Presented by Hemant Joshi

Introduction
- An enormous amount of multimedia data is available.
- Goal: link two related news items together (semantic linking).
- Closed captions are used along with the video.

Video Content Spotting by Association
- Multiple modalities are necessary: video content extraction from language data or image data alone is not reliable.
- Example: "they say" is difficult to resolve without semantics.

Situation Spotting by Association
- The association between language clues and image clues is the important key.
- Two advantages:
  - Detection is more reliable because both images and language are used.
  - Data explained by both modalities is clearly understandable to users.

Language Clue Detection
- Simple keyword spotting.
- Direct vs. indirect narration.
- Keyword usage for speech, and for meeting and visiting situations.

Screening Keywords
To avoid falsely detecting keywords unrelated to the subject matter of interest, parse each sentence in the transcript, check the role of each keyword, and check the semantics of the subject, the verb, and the objects. Also consider the following:
- Whether a word counts as a keyword depends on its part of speech, e.g. "talk" only as a verb.
- If the keyword is a verb, its subject or object is checked semantically, using the hypernym relation in WordNet.
- Negative sentences and sentences in the future tense can be ignored.
- A location name following prepositions such as "in" or "to" is treated as a language clue.
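The screening rules above could be sketched as follows. This is a minimal illustration, not the paper's implementation: the tiny hypernym table stands in for WordNet, and the keyword list, thresholds, and function names are assumptions.

```python
# Toy hypernym relation, child -> parent (WordNet would supply this).
HYPERNYMS = {
    "senator": "person",
    "spokesman": "person",
    "person": "entity",
    "committee": "group",
    "group": "entity",
}

def is_a(word, category):
    """Walk the hypernym chain to test semantic class membership."""
    while word is not None:
        if word == category:
            return True
        word = HYPERNYMS.get(word)
    return False

# Illustrative keyword list: words that signal a speech situation when used as verbs.
SPEECH_VERBS = {"talk", "say", "announce"}

def screen_keyword(verb, subject, negated=False, future=False):
    """Accept a speech keyword only if it is a known verb whose subject is
    a person or group, and the sentence is neither negated nor future tense."""
    if negated or future:
        return False
    if verb not in SPEECH_VERBS:
        return False
    return is_a(subject, "person") or is_a(subject, "group")
```

For example, `screen_keyword("announce", "senator")` passes the screen, while the same sentence in future tense is rejected.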
Process: Conditions for Key-Sentence Detection
- Keywords are detected from the transcripts.
- Each keyword is syntactically and semantically checked and evaluated using the parsing results.
- By focusing only on subjects and verbs, the results become more acceptable (80% correct on CNN news headlines).
- A sentence including one or more words that satisfy these conditions is considered a key-sentence.

Process: Key-Sentence Detection Results
The figures (X/Y/Z) in each table give the key-sentence detection counts:
- X: the number of sentences that include keywords
- Y: the number of sentences removed by the keyword screening above
- Z: the number of sentences incorrectly removed

Image Clue Detection: Key Images
Image clues are face close-ups, people images, and outdoor scenes.
- Face close-ups: typically accompany a person's speech or statement.
- People images: used to describe crowds, such as people in a demonstration.
- Outdoor scenes: the images describe the place, the extent of a disaster, etc.

Key Image Detection
- Face close-up detection: human faces are detected by a neural-network-based face detection program. Most face close-ups are easy to detect because they are large and frontal; as a result, most frontal faces are found, but fewer than half of the small faces and profile views are.
- People image and outdoor scene detection: for images containing many people the problem becomes harder, because small faces and human figures are more difficult to detect; the same holds for outdoor scene detection. Automatic detection of these is still under development, so for the experiments in this paper they were picked manually.
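Picking one representative image per cut, used in the manual selection step, could be sketched as below. This is a hedged illustration assuming a simple histogram-difference cut detector; the slides do not describe the actual shot-segmentation method, and the function names and threshold are assumptions.

```python
def hist_diff(h1, h2):
    """L1 distance between two normalized frame histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def cut_starts(histograms, threshold=0.5):
    """Indices where a new cut begins (large inter-frame histogram change)."""
    starts = [0]
    for i in range(1, len(histograms)):
        if hist_diff(histograms[i - 1], histograms[i]) > threshold:
            starts.append(i)
    return starts

def representative_frames(histograms, threshold=0.5):
    """Pick the middle frame of each cut as that cut's representative image."""
    starts = cut_starts(histograms, threshold) + [len(histograms)]
    return [(s + e - 1) // 2 for s, e in zip(starts, starts[1:])]
```

With the cuts found automatically, a person only has to glance at one frame per cut, which is why picking key images from a 30-minute video takes just a few minutes.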
Since the representative image of each cut is detected automatically, picking those images from a 30-minute news video takes only a few minutes.

Association by Dynamic Programming
Basic idea:
- The detected data are a sequence of key images and a sequence of key-sentences, each with a starting and ending time.
- If a key image's duration and a key-sentence's duration overlap sufficiently (or lie close to each other), and the suggested situations are compatible, the two should be associated.

Basic assumption:
- The order of the key image sequence and that of the key-sentence sequence are the same.

The basic idea is to minimize the following penalty value P:

  P = Σ_{j ∈ Sn} Skip_s(j) + Σ_{k ∈ In} Skip_i(k) + Σ_{j ∈ S, k ∈ I} Match(j, k)

where S and I are the key-sentences and key images that have corresponding clues in the other modality, and Sn and In are those without corresponding clues. Skip_s is the penalty for a key-sentence without an inter-modal correspondence, Skip_i is the penalty for a key image without one, and Match(j, k) is the penalty for the correspondence between the j-th key-sentence and the k-th key image.

Association by DP: Cost Evaluation
Skipping cost (Skip):
- The penalty values are determined by the importance of the data, i.e. the likelihood that each datum has an inter-modal correspondence.
- The importance of each clue is calculated as

    E = Etype · Edata

  where Etype is the evaluation of the clue type (e.g. of the type "face close-up") and Edata is the evaluation of the individual clue (e.g. the face size for a face close-up). The skip penalty Skip is taken as -E.
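The minimization of P can be sketched as an edit-distance-style DP over the two sequences. This is a minimal sketch, not the paper's code: the cost functions are passed in, and the example costs below are assumptions (skip costs are taken as positive and proportional to importance, with good matches negative, so the minimization prefers associating important clues; the slides' Skip = -E convention pairs with a match-penalty scale that is not fully specified here).

```python
def associate(n_s, n_i, skip_s, skip_i, match):
    """Minimize P over monotone alignments of key-sentences 0..n_s-1 and
    key images 0..n_i-1: every element is either matched or skipped, and
    matched pairs preserve temporal order (the basic assumption above)."""
    INF = float("inf")
    d = [[INF] * (n_i + 1) for _ in range(n_s + 1)]
    d[0][0] = 0.0
    for j in range(n_s + 1):
        for k in range(n_i + 1):
            if j > 0:  # skip key-sentence j-1
                d[j][k] = min(d[j][k], d[j - 1][k] + skip_s(j - 1))
            if k > 0:  # skip key image k-1
                d[j][k] = min(d[j][k], d[j][k - 1] + skip_i(k - 1))
            if j > 0 and k > 0:  # associate sentence j-1 with image k-1
                d[j][k] = min(d[j][k], d[j - 1][k - 1] + match(j - 1, k - 1))
    return d[n_s][n_i]  # backtracking over d would recover the associated pairs

# Toy example: two sentences, two images; compatible pairs lie on the diagonal.
cost = associate(2, 2,
                 skip_s=lambda j: 1.0,
                 skip_i=lambda k: 1.0,
                 match=lambda j, k: -2.0 if j == k else 1.0)
```

Here the optimum associates both diagonal pairs rather than skipping anything; backtracking through `d` yields the actual (sentence, image) associations.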
Example cost definitions:
- Key-sentence: speech 1.0, meeting 0.6, crowd 0.6, travel/visit 0.6, location 0.6
- Key image: face 1.0, people 0.6, scene 0.6

Matching cost (Match):
The evaluation of a correspondence is calculated as

  Match(i, j) = Mtime(i, j) · Mtype(i, j)

where Mtime is the duration compatibility between an image and a sentence: the more their durations overlap, the smaller the penalty becomes. A key image's duration (di) is the duration of the cut from which the key image is taken; the starting and ending times of the sentence in the speech are used as the key-sentence duration (ds). Where the exact speech time is difficult to obtain, it is substituted by the time at which the closed caption appears. The actual values of Mtype are shown in the table; they were determined roughly from the number of correspondences in the sample videos.

Experiments & Results

Usage of Results
Summarization and presentation tool:
- Around 70 segments are spotted in each 30-minute news video, i.e. an average of about 3 segments per minute.
- If a topic is not too long, all of its segments can be placed in one window. This view serves both as a good presentation of a topic and as a summarization tool.
- Each picture-sentence pair is an associated pair: the picture is a key image and the sentence is a key-sentence. The position of each pair is determined by the situations defined.
- This view lets us see at a glance how a topic is organized: visit and place information is given first, meeting information second, then a few public speeches and opinions.
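The matching cost Match(i, j) = Mtime · Mtype described earlier could be sketched as below. The exact mapping from duration overlap to Mtime is an assumption; the slides only state that the penalty shrinks as the durations overlap more.

```python
def overlap_ratio(a, b):
    """Fraction of the shorter of two (start, end) intervals covered by
    their intersection: 1.0 for full overlap, 0.0 for disjoint durations."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    shorter = min(a[1] - a[0], b[1] - b[0])
    return inter / shorter if shorter > 0 else 0.0

def match_cost(img_dur, sent_dur, m_type):
    """More overlap -> lower (more negative) penalty, scaled by the
    type-compatibility factor Mtype (e.g. face close-up vs. speech)."""
    return -overlap_ratio(img_dur, sent_dur) * m_type
```

For instance, a cut spanning seconds 0-10 and a sentence spanning 5-15 overlap for half of the shorter duration, giving a cost of -0.5 when Mtype is 1.0.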
Data tagging to video segments

News Video Topic Explainer (category + time order)

Details in the Topic Explainer

Conclusion
- The idea of Spotting by Association in news video: video segments with typical semantics are detected by associating language clues with image clues.
- Most of the detected segments fit the typical situations.
- New applications using the detected news segments were proposed.
- Future work: improve key image and key-sentence detection, and check the effectiveness of the method on other kinds of video.

Questions?