Tension Analysis in Survivor Interviews: A Computational Approach

This study aims to develop computational techniques to analyze and identify points of tension in interviews with survivors of the 1994 Rwandan genocide. Oral history interviews are a dialogical source composed of questions and answers, producing a conversational narrative. Yet survivor testimony is often approached as though the questions did not exist. This article examines a digital tool that helps us visualize and better understand the underlying interview dynamic that is the heart of oral history and qualitative research more generally. Our tension detection tool identifies those moments in the interview when the interviewer and interviewee are trying to pull the conversation in different directions. This is part of the natural give-and-take of the interview. Hedging, deflection, hesitation, and boosting are all critical components of this interviewer-interviewee tension. By making the interview dynamic central to our analysis, we aim to better understand how it shapes what is being said and what is left unsaid. In this study, we address key components of interview tension and propose a natural language processing model that can efficiently incorporate these components in text-based oral history interviews to identify tension points. With experiments on an annotated transcript, we verify the efficacy of our model. This model provides a framework that can be utilized in future research on the dialogic of the interview.


Introduction
Survivor testimony is central to our understanding of mass violence and its consequences.
More often than not, however, this testimony has been treated by researchers as eyewitness accounts rather than as interviews. As a result, the researcher's questions are usually suppressed in the analysis and rarely included in published excerpts of these first-person accounts. To do so risks undermining their experiential authority.
The interview context is effectively hidden or obscured. Yet, an interview is a dialogical process between the interviewer and interviewee, and the resulting question-and-answer structure largely determines what is and is not said. The interview dynamic is therefore central, leading some oral historians to call the recorded interview a "conversational narrative," as it is effectively co-produced (Grele and Terkel 1991, p. 135). How then can researchers analyze the interview dynamic to better understand how it influences survivor testimony?
An interview is a dialogical source consisting of questions and answers. It is essential that we understand better the role played by the interviewer in directing the conversation, but also in understanding the agency of the interviewee and the underlying interview dynamic itself (Tripp 1983;Koro-Ljungberg 2008;Tanggaard 2009). The interview dynamic is influenced by many factors, including the social and political distance between the interviewer and the interviewee and the chemistry between the two. There is no perfect location for the interviewer, but it is essential that we understand what their positionality affords and forecloses. Back in the 1930s, US President Franklin D. Roosevelt's administration undertook an oral history project with Americans who experienced slavery before 1865. One black man was interviewed twice by mistake; one interviewer was a white woman, and the other was a black man: You wouldn't know it was the same interviewee. Knowing this, we realize that these interviews reveal much about race relations in the US South during the 1930s, a period of Jim Crow segregation and widespread lynching of Black men. It is therefore important that we consider the interview context (Davidson and Lytle 2004). A genocide survivor interviewing another survivor will not be the same as an interview conducted by someone who did not experience it firsthand. Similarly, an interview between family members will not be the same as between strangers. This is not to say that one or the other is "better" positioned, but it influences and shapes the resulting conversation in myriad ways. Gender, race, and class all have immediate bearing on the intelligibility of this "mutual encounter" (Portelli 1991). To what degree, and at what point, are the interviewer and interviewee on the same "wavelength"? How is trust built over the course of the interview? At what point do the interviewer and interviewee struggle to connect, be heard, or work at cross-purposes? 
Put simply, tension is a "strained state or condition resulting from forces acting in opposition to each other." In a tension state, interviewees may prefer not to discuss a given topic or even challenge the validity of a question being posed (Donovan-Kicken et al. 2013). Or, they may reframe the question or redirect the conversation (Greenspan 2010). These points of tension are not problems that need to be "fixed," though it is useful for interviewers to be able to read these situations. Our tension tool enables us to visualize the underlying interview dynamic and the ways that this conversation structures the transcribed life story. It also contributes to a more grounded oral history training.
The tension tool helps the researcher map the interview relationship and understand this very important interplay between what is asked and what is answered. Knowledge of where tension arises also offers a new way of investigating interview data. Where is tension most likely to surface in an interview? Are certain types of questions more likely to generate tension? How does an interviewer's positionality and their social distance from the interviewee influence the interview dynamic and the relative presence of tension points? What do we learn about interviewee agency and the co-creation process in the process? These are just a few research questions that will allow us to better interpret qualitative interviews as a dialogical source and will make a significant original contribution to understanding survivor narratives and improving our training of potential interviewers.
Tension expresses itself through various linguistic cues and conversational strategies, such as reticence in answering or asking questions (Layman 2009;Greenspan 2010), deflection or redirection (Donovan-Kicken et al. 2013), or in explicit disagreement. With natural language processing and machine learning techniques, we built a tension detection tool that automatically identifies places in the interview where these tension moments can be detected. One usage scenario of the tool is that the researchers need to answer the questions in the previous paragraph and identify patterns in a large amount of interview data (e.g., over 100 interviews). The tool emerges out of the Living Archives of the Rwandan Diaspora, a Social Sciences and Humanities Research Council of Canada-funded partnership development project between the Centre for Oral History and Digital Storytelling (COHDS) and PAGE-Rwanda, which represents Rwandan genocide survivors living in Montreal. The project's goal is to produce an online platform (https://livingarchivesvivantes.org/) where researchers, community members, and students can listen to, and work with, the testimony of thirty survivors of the 1994 genocide that killed hundreds of thousands of Rwandan Tutsi. To facilitate this listening, the project has developed a suite of tools that enable us to search, map, and listen to survivor testimony in new and diverse ways (Caquard and Dimitrovas 2017). The tension tool, developed by one of the authors of this study as his master's thesis in Computer Science, came as a result.
The life story interviews, which vary in duration from ninety minutes to twelve hours, were recorded between 2007 and 2012 by the Montreal Life Stories project, another COHDS-based partnership project that recorded 500 life stories of Montrealers displaced by war, genocide, and other human rights violations. These interviews were then integrated into live theatre performances, radio programs, online digital stories, audio walks, art installations, pedagogical units, and a museum exhibition, and 500 Montreal metro cars were equipped with audio portraits that allowed citizens to listen to these stories. A large number of books and articles (High, Little, and Duong 2014;High 2014;High 2015;Miller, Little, and High 2017) have been written about this earlier project, including some preliminary tool development (Xiao, Luo, and High 2013;Jessee, Zembrzycki, and High 2011;High and Sworn 2009). The Living Archives of the Rwandan Diaspora is one of many initiatives that have built on this research foundation since 2012. In this study, our long-term objective is to identify the tensions in Rwandan genocide victims' transcribed and translated interview transcripts.
To achieve this objective, we explore computational methods to automatically identify the tension moments in the transcripts. In this paper, we report our tension detection tool. The rest of the paper is structured as follows: Firstly, we review the related literature on detecting tension or similar phenomena in interview transcripts. We also discuss earlier works on detecting hedges and emotions from text, as these are crucial components of our architecture for tension detection. Then we discuss in detail our tension analysis framework and our experimental results on a survivor interview and give a thorough analysis of the system. Lastly, we give a summary of the research work that has been done in this study. We also give direction for further work that can be done in this field.

Related work
Though interview dynamics have been studied to some degree in the past (Misztal 2003;Layman 2009;Bornat 2010;Thompson 2017;Ponterotto 2018), there is very little work that has been done to automate the process of detecting tension in interviews with computational approaches. Burnap and colleagues performed conversational analysis and used different text mining rules to identify spikes in tension in social media (Burnap et al. 2015). They illustrated how lexicons of abusive or expletive terms can identify high levels of tension separated from low levels. Their proposed tension detection engine relies solely on the lexicons and membership categorization analysis (MCA) (Sacks 1995). They demonstrated that their model has consistently outperformed several machine learning approaches and sentiment analysis tools.
Distress is a negative affective condition that people experience when they feel upset.
Distress is closely related to tension. McCubbin and colleagues discussed how stressor events produce tension and how stress becomes distress when it is subjectively defined as unpleasant (McCubbin, Sussman, and Patterson 2014). Buechel and colleagues considered the problem of distress and empathy prediction as a regression problem (Buechel et al. 2018). They used a Feed-Forward Neural Network with Fast-Text word embeddings as their inputs and a CNN system with one convolutional layer with three different filter sizes. They claim that CNN models can capture semantic effects from the word order and found that such models are especially successful in detecting distress when compared with detecting empathy from text. The researchers provided the first publicly available dataset for text-based distress and empathy prediction.
While these early studies illustrate the possibility of detecting tensions in interviews using machine learning and natural language processing techniques, they failed to fully leverage the indicators of tensions that are identified from the literature.
For instance, tension can be shown as reticence in the interview. Layman discussed how reticence can cause the interviewees to shift the conversation, thus restricting the interviewees' responses (Layman 2009). It is a common strategy embraced by the interviewees in order to avoid either complete refusal to reply or full disclosure.
Layman also discussed how necessary it is to be conscious of these circumstances so that the interviewer can better judge whether the interviewee should be questioned (Layman 2009). For example, the use of discourse markers such as "not really," "not that I remember," or "well, anyway" in responses shows how reticence in an interview might be influential. This phenomenon reveals tension points in an interview and gives an idea to interviewees that the conversational stream has been interrupted somewhat. Layman also showed how certain topics can lead interviewees to use such strategies to avoid answers to certain questions (Layman 2009). Most commonly, these answers are reticent and short or dismissive. Subjects that address individual trauma, whether tormenting or frightful or humiliating, will probably trigger hesitant narrator-induced reactions. This leads to the judgement of the interviewers whether the interviewee is to be pressed if it is clear that they are unwilling to speak on certain issues.
Conceptually speaking, tension moments are where the interviewer and interviewee are working at cross-purposes or are not quite on the same page. Usually, this involves moments when the interviewer wants the conversation to go in one direction, but the survivor either doesn't want to go "there" (deflection) or wants to go in another direction (booster). It also includes moments of outright, though often subtle, disagreement (Ahn 2010). Hesitation is also significant in an interview, particularly when the subject being explored is mass violence. From the language use perspective, these moments are expected to have hedging or booster words/phrases. Hedging refers to the technique used to add fuzziness to a speaker's propositional content. According to De Figueiredo-Silva, hedging can be viewed as a speaker's reserved attitude towards a claim and towards their audience (De Figueiredo-Silva 2001). It can be as simple as saying "maybe," "almost," or "somewhat" in ordinary discourse. It is a common strategy of hesitation embraced by narrators in oral history interviews. It gives narrators an opportunity to think and organize their thoughts in order to plan safe answers when they are asked difficult questions. For example, the usage of "I think …" or "Well …" in interviews gives interviewees the authority to shape their stories. Similarly, the sentence, "I assume he was involved in it," shows how the usage of the hedge word "assume" can weaken the propositional content "he was involved in it." Phrases such as "In other words" or "In my understanding" can also be used to shift a topic either completely or partially.
It can be used as a filler or delaying tactic. This is frequent when there is a disjuncture between the interviewer and the narrator. Often interviewees insist on individualizing their narrative, because they either do not feel authorized to speak for the group, or they have a realization that their story is theirs. Often, as a substitute for hedge words, discourse markers are used during oral history interviews. A discourse marker can be an utterance or a word or a phrase (such as oh, like, well, and you know) that either directs or redirects the flow of conversation without adding any significant meaning to the discourse (Schiffrin 1987). Ponterotto demonstrated how hedging in talks is used to tackle controversial issues (Ponterotto 2018). On the other hand, boosting, using terms such as "obviously," "clearly," and "absolutely," is a communicative strategy for expressing firm commitment to statements. It limits the negotiating room for the audience. It plays a vital role in creating conversational solidarity (Holmes 1984) and in constructing an authoritative persona in interviews (Weiyun He 1993). Interestingly, if booster words are preceded by negated words (e.g., not, without), it can act as hedging (e.g., not sure).
Besides the detection of reticence and the use of hedging and/or booster words/phrases, the presence of negative emotions can also be an indicator of tension moments.
Jurek and colleagues discussed how negative emotion can lead to tension (Jurek, Mulvenna, and Bi 2015). Misztal discussed how emotions lead directly to the past and bring the past somatically and vividly into the present (Misztal 2003). In survivor interviews, interviewees may experience different negative emotions (e.g., anger, sadness, fear, etc.) and feel discomfort. If the interviewer notices this and shifts the topics, the interviewee may come back to the calm state. If the interviewer keeps pushing, however, the interviewee may become too uncomfortable and stop cooperating (e.g., refusing to answer questions). Emotion, therefore, can act as a strong signal of tension in the conversation. The following examples from our research data, the Rwandan survivor interviews, demonstrate the strong negative emotions that interviewees may carry, including at moments when words fail. (In all transcript excerpts, the questions by the interviewer will be indicated by "Interviewer," and the interviewee's response by "Narrator." The interview transcripts can be found at http://livingarchivesvivantes.org/. Note: The interviewees gave full consent to use the interview transcripts for research purposes.)

1. Interviewer: You've felt different emotions because of the events in Rwanda, but are there things that have stayed with you even to this day?
Narrator: Yes … I couldn't understand how one can commit acts like these, how one can hate and carry out such atrocities against another human being.

2. Narrator: I'm not going to waste my time praying in these circumstances because it's completely-it's hogwash.
Interviewer: Tell me-
Narrator: But what is even more serious is that there are Canadians, especially Quebecers, who stand behind the factions and are even more extremist than we are!
Interviewer: Indeed.

Tension analysis framework
The two core components of our proposed framework for detecting tension in interview transcripts are: the Emotion Recognition Module and the Hedge Detection Module. In this section, we provide a brief overview of these components. We also discuss other important features (booster words, markers, etc.) that we found useful during our study in this section. At the end of the section, we provide pseudo-code incorporating all of these components.

Emotion recognition
Emotion plays an important role in recognizing conditions of tension during survivor interviews, as we discussed earlier. To analyze whether and how the interviewee's emotional aspect indicates the tension during the interview, we developed an emotion recognition tool to recognize the interviewee's emotions from the interview transcript. There is often a misconception about sentiments and emotions as these subjectivity terms have been used interchangeably (Munezero et al. 2014). Munezero and colleagues differentiate these two terms along with other subjectivity terms and provide the computational linguistics community with clear concepts for effective analysis of text (Munezero et al. 2014). While sentiment classification tasks (Pang and Lee 2008;Cambria et al. 2017)  Kalchbrenner and colleagues proposed a dynamic CNN model that utilizes a dynamic k-max pooling mechanism (Kalchbrenner, Grefenstette, and Blunsom 2014). Their model is able to generate a feature graph, which captures a variety of word relations.
They showed the efficacy of their model by achieving high performances on binary and multi-class sentiment classification tasks without any feature engineering.
More recently, Islam and colleagues proposed a multi-channel convolutional neural architecture with the incorporation of different lexical features in the neural network model, which significantly improves the performance of emotion and sentiment identification tasks (Islam, Mercer, and Xiao 2019). In this study, in order to identify emotion of an interviewee from interview transcripts, we utilize the model discussed in Islam, Mercer, and Xiao (Islam, Mercer, and Xiao 2019).

Hedge detection
Hedging is a widely used conversational management strategy to show the lack of commitment of the speaker to what they say, which can signify conflicts among the speakers. People use hedging when they try to avoid criticism or evade questions in conversations (Crystal 1988). Identifying hedges in conversational text is another core component of our tension analysis framework. Martín discussed four common hedging strategies: Indetermination, Camouflage, Subjectivization, and Depersonalization (Martín 2003). We provide brief details about these strategies motivated by the description found in Alonso Alonso and colleagues (Alonso Alonso, Alonso Alonso, and Torrado Mariñas 2012). Strategy of Indetermination includes the usage of various epistemic modalities, for example, epistemic verbs (assume, suspect, think), epistemic adverbs (presumably, probably, possibly), epistemic adjectives (apparent, unsure, probable), modal verbs (might, could) and approximators (usually, generally). The use of such epistemic modalities in the interviewee's response creates vagueness and ambiguity. Strategy of Camouflage includes the use of different adverbs (e.g., generally speaking, actually). This approach serves as a lexical tool to stop the interviewer from having a negative reaction. Strategy of Subjectivization is activated by the usage of first-person pronouns followed by verbs of cognition, for example, "I think" or "I feel." These expressions have been given the term "Shield" in Prince, Frader, and Bosk (Prince, Frader, and Bosk 1982). In certain cases, this approach allows the interviewees to openly express their opinions and have them heard. Strategy of Depersonalization includes the use of impersonal pronouns or constructs, for example, "we," "you," or "people." This makes it possible for interviewees to hide behind an unknown subject.
The following two examples from a conversational interview transcript demonstrate the use of hedging for these purposes:

1. Narrator: Well, I think we have the duty to our children to teach them where they come from.

2. Narrator: I don't know if I want to talk about my brothers and sisters just yet.

The use of hedge terms "I think" and "I don't know" demonstrates the instability in their narrative. Besides hedge words, people use discourse markers to hedge in conversations. These can be an utterance or a word or a phrase (such as "oh," "like," "well," and "you know") that either direct or redirect the flow of conversation without adding any significant meaning to the discourse (Schiffrin 1987). For example, "Well, I don't know if there are other things I'd like to share, except that I think that we still have a very, very long journey to go as a nation." Our rule-based hedge detection algorithm leverages lexicons we compiled for hedge words, discourse markers, and booster words. We included different epistemic words in our hedge words lexicon that show their hedging act, such as verbs (suppose, think, presume), adverbs (arguably, barely, seemingly), adjectives (unlikely, unsure, unclear), and modal verbs (might, maybe). We also included various approximators (such as generally, usually) in the lexicon. People also use discourse markers when hedging in conversations. These markers have a variety of functions. For example, when making an unexpected contrast (even though, despite the fact that), making a contrast between two separate things, people, ideas, etc. (anyway, however, rather), clarifying and re-stating (in other words, in a sense, I mean), or to change topic or return to the topic (well, anyway). We also compiled a list of such discourse markers. In order to measure the comparability between the discourse markers of our lexicon and the phrases from the input sentences, we used Jaccard distance, complementary to the Jaccard index. We have built a lexicon for boosting words as well. Boosting, using terms such as absolutely, clearly, and obviously, is a communicative strategy for expressing a firm commitment to statements. Interestingly, if booster words are preceded by negation words such as "not" or "without," they can act as hedges. 
For example, "I'm still not sure if I would go back; I don't know what it would be like." Here, "sure" is a booster word. However, since it is preceded by a negation word "not," it changes the meaning completely. We handle this kind of situation in our hedge detection algorithm.
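The lexicon lookup described above can be sketched roughly as follows. This is a minimal illustration, not the study's implementation: the three small lexicons are hypothetical subsets built from terms mentioned in the text, the sliding n-gram window and the `max_distance` threshold are assumptions, and treating only a directly preceding negation word as flipping a booster into a hedge is likewise an assumption.

```python
# Hypothetical mini-lexicons; the study's compiled lexicons are larger and not reproduced here.
DISCOURSE_MARKERS = ["in other words", "i mean", "even though", "well", "anyway"]
BOOSTERS = {"sure", "absolutely", "clearly", "obviously"}
NEGATIONS = {"not", "without"}


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?\"'()[]") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def jaccard_distance(a, b):
    """Jaccard distance = 1 - |A intersect B| / |A union B| over two token collections."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def find_discourse_markers(sentence, max_distance=0.0):
    """Return lexicon markers that match some same-length window of the sentence.

    A window matches when its Jaccard distance to the marker is <= max_distance
    (assumed threshold; 0.0 means the token sets must be identical).
    """
    tokens = tokenize(sentence)
    found = []
    for marker in DISCOURSE_MARKERS:
        marker_tokens = marker.split()
        n = len(marker_tokens)
        for i in range(len(tokens) - n + 1):
            if jaccard_distance(tokens[i:i + n], marker_tokens) <= max_distance:
                found.append(marker)
                break
    return found


def classify_boosters(sentence):
    """Label each booster word; a booster directly preceded by a negation counts as a hedge."""
    tokens = tokenize(sentence)
    labels = []
    for i, tok in enumerate(tokens):
        if tok in BOOSTERS:
            negated = i > 0 and tokens[i - 1] in NEGATIONS
            labels.append((tok, "hedge" if negated else "booster"))
    return labels
```

Applied to the sentence quoted above, `classify_boosters` would label "sure" as a hedge because it follows "not". Because the distance is computed over token sets, word order inside a window is ignored; a stricter matcher would need to compare sequences.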
Hedging disambiguation is an important part of our algorithm, given that some commonly used hedge terms also have non-hedge senses in conversational interviews.
We apply rules to disambiguate these terms based on the syntactic structure of the sentences. Islam and colleagues discussed several hedge disambiguation rules which we used in this study (Islam, Mercer, and Xiao 2020). We used the Stanford CoreNLP (Manning et al. 2014) parser to parse the sentences (https://stanfordnlp.github.io/CoreNLP/download.html). One of the main reasons we chose this rule-based approach over a learning-based approach was that there is no large enough benchmark annotated dataset available in this genre that could have been leveraged to build a good classifier.
A statistical model can be very useful in discovery of latent relations between features, which is difficult with a rule-based approach. However, with this study, we mark the start of producing an annotated dataset supervised by the experts in this field that can be utilized in future research. The following is a brief review of a subset of the rules used in our research with examples from our interview datasets.
Hedge Term: Feel, Suggest, Believe, Consider, Doubt, Guess, Hope Rule: If token t is (i) a root word, (ii) has the part-of-speech VB*, and (iii) has an nsubj (nominal subject) dependency with the dependent token being a first person pronoun (i, we), t is a hedge, otherwise, it is a non-hedge.
Hedge: I hope to, someday, but no, I haven't reached it yet.
Non-hedge: A message of hope and daring to shed light on everything we see.

Hedge Term: Think
Rule: If token t is followed by a token with part-of-speech IN, t is a non-hedge, otherwise, hedge.

Hedge: I think it's a little odd.
Non-hedge: I think about this all the time.

Hedge Term: Assume
Rule: If token t has a ccomp (clausal complement) dependent, t is a hedge, otherwise, non-hedge.

Hedge: I assume he was involved in it.
Non-hedge: He wants to assume the role of a counsellor.
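
The three rules above can be expressed compactly in code. The sketch below is illustrative only: the `Token` class is a hypothetical container for the information a dependency parser would supply (it is not a CoreNLP class), and the part-of-speech tags and dependencies in the usage example are hand-written assumptions about how the example sentences would be parsed.

```python
from dataclasses import dataclass, field

FIRST_PERSON = {"i", "we"}
ROOT_SUBJECT_TERMS = {"feel", "suggest", "believe", "consider", "doubt", "guess", "hope"}


@dataclass
class Token:
    """Hypothetical view of one parsed token (not part of any parser's API)."""
    lemma: str
    pos: str                 # Penn Treebank tag, e.g. "VBP", "NN", "IN"
    is_root: bool = False
    # Outgoing dependencies as (relation, dependent_word) pairs.
    deps: list = field(default_factory=list)
    next_pos: str = ""       # tag of the following token, "" if there is none


def is_hedge(token):
    """Apply the three quoted rules; return None when no rule covers the lemma."""
    lemma = token.lemma.lower()
    if lemma in ROOT_SUBJECT_TERMS:
        first_person_subject = any(
            rel == "nsubj" and word.lower() in FIRST_PERSON
            for rel, word in token.deps
        )
        return token.is_root and token.pos.startswith("VB") and first_person_subject
    if lemma == "think":
        # "think" followed by a preposition (tag IN) is treated as a non-hedge.
        return token.next_pos != "IN"
    if lemma == "assume":
        # "assume" with a clausal complement is treated as a hedge.
        return any(rel == "ccomp" for rel, _ in token.deps)
    return None


# Hand-annotated versions of the example sentences (assumed parses).
hope_verb = Token("hope", "VBP", is_root=True, deps=[("nsubj", "I")])
hope_noun = Token("hope", "NN")
think_clause = Token("think", "VBP", is_root=True, deps=[("nsubj", "I")], next_pos="PRP")
think_about = Token("think", "VBP", is_root=True, deps=[("nsubj", "I")], next_pos="IN")
assume_clause = Token("assume", "VBP", is_root=True, deps=[("nsubj", "I"), ("ccomp", "involved")])
assume_role = Token("assume", "VB", deps=[("dobj", "role")])
```

Keeping each rule as a separate branch keyed on the lemma makes it straightforward to add further terms from the full rule set without touching the existing ones.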

Tension detection
In addition to the two modules we discussed above, our proposed tension analysis framework makes use of a few additional features that proved to be important during our research. We provide brief details about these features along with the pseudocode of our proposed algorithm below.

Markers
It is interesting that markers (e.g., laughter, silence, sigh) are used in these interview transcripts. These have various functions. Sometimes markers like "laughter" indicate invitations to the interviewer to ask the next question. At other times they represent hesitation or nervous deflection (i.e., the tension). In this work, we have compiled a list of such markers/cues but acknowledge that further exploration is needed to interpret these cues. The example below shows a use of the marker "laugh":
Interviewer: And what would you like Rwandans, your community, to know about you and that maybe we don't already know, maybe we … ? If it were necessary …
Narrator: [laughs] … I don't know. It's a difficult question…. I don't know since … I think that all Rwandans, well, every Rwandan has his or her own experience, and I'm not sure that I should be asking them to think about me in a certain way.

Asking questions back
When the interviewee asks for clarification by posing a question back to the interviewer, this is also a good marker for recognizing tension points. During our research, we found that asking a question back to the interviewer may be a sign that the interviewee is trying to negotiate. We use this as a possible criterion in our tension detection algorithm. The following example illustrates such a situation:
Interviewer: So going back now to the period of 1994, during the genocide-you saw it coming, but how did you live through that time?
Narrator: How do you mean?

Outliers
In cases where an interviewee gives an unusually long or short answer to a particular question form, that is, an answer more than three standard deviations shorter or longer than the average length of responses of that sort (for example, wh-questions, yes/no questions, etc.), that is an important indicator of some sort of change in interview dynamics. During our group discussions, we felt that this type of dynamic could be a sign of tension, so we added this as one of the criteria in our tension detection algorithm. We find the mean (Equation 1) and standard deviation (Equation 2) for each question type. (In this study, we considered wh-question, how, yes-no, and mixed [mix of several question types] as the prime question types.) Here, $\mu_{q_t}$ indicates the mean for the question type $q_t$, $\sigma_{q_t}$ indicates the standard deviation for the question type $q_t$, $w_i(q_t)$ indicates the total number of words in excerpt $e_i$ belonging to $q_t$, and $N(q_t)$ indicates the total number of excerpts belonging to each $q_t$. We consider a response to be an outlier, thus a possible point for tension, if its length falls below $\mu_{q_t} - 3\sigma_{q_t}$ or is above $\mu_{q_t} + 3\sigma_{q_t}$.
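
The length-outlier criterion can be illustrated with a short sketch. It assumes that response length is a whitespace-delimited word count and that the standard deviation is the population form (dividing by the number of excerpts of that question type); the text does not settle either choice. The function name and the `(question_type, response)` input format are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def find_length_outliers(excerpts, k=3.0):
    """Return indices of responses more than k standard deviations from the
    mean word count of their question type.

    excerpts: list of (question_type, response_text) pairs.
    """
    lengths_by_type = defaultdict(list)
    for q_type, response in excerpts:
        lengths_by_type[q_type].append(len(response.split()))

    # Per-question-type mean and (assumed) population standard deviation.
    stats = {
        q_type: (mean(lengths), pstdev(lengths))
        for q_type, lengths in lengths_by_type.items()
    }

    outliers = []
    for i, (q_type, response) in enumerate(excerpts):
        mu, sigma = stats[q_type]
        n_words = len(response.split())
        # sigma == 0 means every response of this type has the same length.
        if sigma > 0 and abs(n_words - mu) > k * sigma:
            outliers.append(i)
    return outliers
```

One practical consequence of the three-sigma threshold is that a question type needs a reasonable number of responses before any single one can be flagged: with a population standard deviation, a lone extreme value among n responses can sit at most sqrt(n - 1) standard deviations from the mean, so at least eleven responses of a given type are required.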

Algorithm
Here, we provide the pseudo-code for our tension detection algorithm. Our algorithm detects tension on excerpt level considering different factors (emotions, hedging, markers etc.) that are present in the sentences of an excerpt. If an excerpt is not labelled as having tension by our algorithm, this indicates that the algorithm did not find any tension-causing phenomena that we discussed earlier in this article, though we acknowledge there can be a few cases where the algorithm fails due to the constraints posed by transcribed texts versus the actual video interview.
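
As a rough illustration of how the factors discussed in this article could be combined at the excerpt level, consider the sketch below. It is not the authors' pseudo-code: flagging an excerpt when any single signal is present is an assumption, the emotion label and the hedge, booster, and outlier flags are taken as precomputed inputs, and the marker pattern and the question-mark heuristic for "asking questions back" are hypothetical simplifications.

```python
import re

NEGATIVE_EMOTIONS = {"anger", "sadness", "fear"}
# Bracketed non-verbal cues as they appear in the transcript excerpts, e.g. "[laughs]".
MARKER_PATTERN = re.compile(r"\[(laughs?|laughter|silence|sighs?)\]", re.IGNORECASE)


def detect_tension(response, emotion=None, has_hedge=False,
                   has_booster=False, is_length_outlier=False):
    """Return the list of tension signals found in one narrator response.

    An empty list means no signal was found, so the excerpt is not flagged.
    """
    signals = []
    if emotion in NEGATIVE_EMOTIONS:
        signals.append("negative_emotion")
    if has_hedge:
        signals.append("hedge")
    if has_booster:
        signals.append("booster")
    if MARKER_PATTERN.search(response):
        signals.append("marker")
    # Assumed heuristic: a response that ends in a question is a question back.
    if response.rstrip().endswith("?"):
        signals.append("question_back")
    if is_length_outlier:
        signals.append("length_outlier")
    return signals


def has_tension(response, **features):
    """Excerpt-level decision under the 'any signal' assumption."""
    return bool(detect_tension(response, **features))
```

Returning the list of signals rather than a bare yes/no keeps the output interpretable: a researcher can see why an excerpt was flagged and discount cues, such as laughter, whose meaning depends on context.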

Evaluation of the tension detection tool
In this section, we provide details about the experiments to examine how well our computational approach performs in comparison to the human performance when it is applied to an annotated interview transcript to identify the tensions in the interviewee's responses. Then, we compared the results with the analysis performed by student researchers. There is considerable messiness in manual annotations as researchers identified varying points in the interview. We then identified those places where the majority of the student researchers identified tension and then compared these to the computational results. We believe that this real-life comparison has considerable merit. It also opened up a space in the oral history classroom to discuss these issues. It is a valuable pedagogical exercise in its own right.

Interview transcripts and annotation process
The transcripts were annotated by a group of students taking a public history course with a focus on the Living Archives of the Rwandan Diaspora (http://livingarchivesvivantes.org/). These students had been watching interviews each week, working with the transcripts, and learning about oral history and mass violence. They had been introduced to the interviewer-interviewee dynamic, which is at the heart of the conversational narrative of the oral history interviews. Then, the students were paired to work together to annotate the whole transcript of one interview following the instructions of the instructor, who is a co-author of this paper. Specifically, they annotated four types of occurrences in the interviewee's responses: the points of tension (T), the interviewee's hesitation (H), the deflection in the response (D), and the interviewee's boosting (B). Tension is often used as an umbrella term for when the interviewer and interviewee work at cross-purposes, whereas hesitation and deflection have more specific meaning. Deflection can also be followed by boosting.
Interviewees tend to use boosting when they try to drag the interviewer somewhere in the interview. We also acknowledge the fact that transcribed and translated interviews are not the same as the recorded ones. It's an "echo of an echo." What sounded abrupt on the transcribed interview might be perfectly normal (and tension free) in the video.
Similarly, what sounded normal on the transcribed texts might contain tension in the actual video interview as tension might be present in facial and vocal expressions that might not be captured in text.
There were 15 teams in total, each with 2 students. Team members first discussed the transcript with each other, so the annotated results reflect each team's shared interpretation of the categories and the interviewee responses. In total, these teams annotated 116 interviewee responses, which we used in this study to evaluate our algorithm. Of the four categories, the point of tension

The comparison of the annotations by the teams vs. our tool
With this transcript, we first segmented the text according to the turns by the interviewer and the interviewee. We applied the tension detection tool to the segmented data and classified each interviewee response as tension or no tension. In total, our tool identified 55 tension points out of 116 interviewee responses.
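The segmentation step can be sketched as follows. This is a minimal illustration that assumes a plain-text transcript in which every turn begins on a new line with a speaker label followed by a colon; the labels `Interviewer` and `Interviewee` and the function names are hypothetical, and real transcripts may need a different parser.

```python
def segment_turns(transcript, speakers=("Interviewer", "Interviewee")):
    """Split a transcript into (speaker, text) turns.

    Assumes each turn starts on a new line as 'Speaker: text'. Lines
    without a recognised speaker label are appended to the current turn.
    """
    turns = []
    for raw_line in transcript.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        label, sep, rest = line.partition(":")
        if sep and label.strip() in speakers:
            turns.append([label.strip(), rest.strip()])
        elif turns:
            # Continuation of the previous speaker's turn.
            turns[-1][1] = (turns[-1][1] + " " + line).strip()
    return [(speaker, text) for speaker, text in turns]


def interviewee_responses(turns, speaker="Interviewee"):
    """Keep only the interviewee's turns, which are the units classified."""
    return [text for who, text in turns if who == speaker]


sample = """Interviewer: Where were you born?
Interviewee: In Kigali.
We moved later.
Interviewer: When did you move?
Interviewee: Well, I don't really know."""

responses = interviewee_responses(segment_turns(sample))
print(len(responses), responses)
```

Each returned response can then be passed to the classifier independently, which mirrors the per-response labelling described above.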
We compared the performance of our tool with the student teams' annotations from three aspects. First, we examined whether the tool would be able to identify all the possible tension points identified by a researcher. To do so, we considered an interviewee response to be a human-annotated tension point if any team marked it as T. Our tool  Acknowledging that our model considers hedging, boosting, and deflection as indicators of tension points, we considered another aspect in the comparison: an interviewee response is marked as a point of tension by human annotators if any team has annotated any category on it. From this aspect, the tool was able to identify 16 of 28 hedging annotations, 21 of 34 boosting annotations, and 7 of 32 deflection annotations. It also incorrectly marked annotations in each of these categories: 32, 25, and 12, respectively.
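The per-category counts above can be turned into recall and precision with a few lines of Python. This is a sketch under one stated assumption: the "incorrectly marked" counts (32, 25, and 12) are read as false positives for the respective category, which is an interpretation rather than something the counts themselves confirm.

```python
def recall(true_positives, total_annotated):
    """Share of human-annotated items that the tool also flagged."""
    return true_positives / total_annotated


def precision(true_positives, false_positives):
    """Share of tool-flagged items that humans had also annotated."""
    return true_positives / (true_positives + false_positives)


# (identified by tool, annotated by humans, incorrectly marked by tool)
# Treating the third number as false positives is an assumption.
reported = {
    "hedging": (16, 28, 32),
    "boosting": (21, 34, 25),
    "deflection": (7, 32, 12),
}

scores = {
    category: {"recall": recall(tp, total), "precision": precision(tp, fp)}
    for category, (tp, total, fp) in reported.items()
}

for category, s in scores.items():
    print(f"{category}: recall={s['recall']:.2f}, precision={s['precision']:.2f}")
```

Under that reading, recall is roughly 0.57, 0.62, and 0.22 for hedging, boosting, and deflection, and precision is roughly 0.33, 0.46, and 0.37.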
In the third aspect of the evaluation, we utilized a voting system to determine the final annotation of each interviewee response. First, we merged all four categories used by the teams into a single category representing a tension point (T). Next, we assigned a label to a response when at least 8 of the 15 participating teams agreed on that label. Our tool was able to identify all 4 of the annotations classified as T. It also incorrectly classified 51 annotations as T.
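The voting step can be sketched as below. The data layout (one set of category codes per team, empty when the team marked nothing) and the function name are assumptions made for illustration; only the collapse of T, H, D, and B into a single label and the threshold of 8 out of 15 teams come from the description above.

```python
from collections import Counter


def majority_label(team_annotations, min_votes=8):
    """Collapse each team's categories into one label and take a vote.

    team_annotations: one entry per team, each a set of category codes
    such as {"H", "D"}; an empty set means the team marked nothing.
    Returns "T", "none", or None when no label reaches min_votes.
    """
    collapsed = ["T" if categories else "none" for categories in team_annotations]
    label, votes = Counter(collapsed).most_common(1)[0]
    return label if votes >= min_votes else None


# Hypothetical response: 9 of 15 teams marked some category on it.
example = [{"T"}, {"H"}, {"D", "B"}, {"T"}, {"B"}, {"H"}, {"T"}, {"D"}, {"T"}] + [set()] * 6
print(majority_label(example))
```

With 15 teams and two possible labels, one label always receives at least 8 votes, so every response gets a final label. The reported counts (4 correctly identified and 51 incorrectly classified out of the 55 responses the tool flagged) correspond to a recall of 4/4 = 1.0 and a precision of 4/55, or about 0.07, which is consistent with using the tool as a filter rather than as a final classifier.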

Discussion
As a filtering device that facilitates, rather than replaces, the researcher's qualitative analysis of interview data, this tool shows promising results. Specifically, our evaluation results show that, overall, the tool is able to identify the majority of the places in the interview that were annotated as containing tensions or indicators of tensions.
However, as shown in the above section, there is room for improvement, mainly in decreasing the number of responses that the tool labels as tension points but the human annotators do not. Language techniques can be explored to improve this performance.
For example, one of the problems that must be tackled by any description of discourse markers is their poly-functionality, which means it is very important to distinguish the usage of different markers. Although we tried to disambiguate a number of hedge terms in this work, we need a clearer understanding of certain discourse markers.
One of the problems with our strategy is its failure to discern the discourse functions of the marker "well." "Well" has various functions. It has been well investigated by many scholars over the years (Ponterotto 2018; Jucker 1993). It appears in seemingly different contexts. According to Jucker, "well" can be used as a marker of insufficiency, as a face-threat mitigator, as a frame, or as a delay device (Jucker 1993).
Tensions can build up over time during conversations, and various factors can contribute to that, such as the interviewer's questions, the topics covered right before this response, etc. Our current framework has only considered the interviewee response.
We will explore the potential of these contextual factors in detecting tensions in the conversations.
As mentioned in the introduction section, people can keep their tensions internal without letting their conversation partners notice them. Our work is aimed at detecting those tensions that have external markers in the communication. The external markers can exist in various communication channels: the conversation content; body language (such as hand movement and facial expression); and voice and sound (such as tone and pitch). This study has mainly examined the markers in the conversation content, along with a few related to voice and sound (e.g., laughter and silence). A prior study has shown that prosodic features can be indicative of tensions in interviews (Zhang and Xiao 2020). One of our next steps is to integrate the audio and video recordings of the interview data into the tension detection model.
The last limitation we recognize is the use of Twitter data for training the emotion recognition tool. The interviews were conducted in a conversational style, which offers similarity to the free form of tweets in that sense. On the other hand, interviewees' responses were often much longer than a tweet, and the interview context being about mass violence is very different from day-to-day tweets. These differences between the training data and testing data also put a constraint on the performance of emotion recognition in our study.
Besides survivor interviews, we anticipate that tension between the interviewer and the interviewee may occur in many interviews about sensitive topics. "Sensitive topics" are topics that require participants to reveal their deep personal feelings and/or experiences that are emotionally difficult or stressful for them (Cowles 1988; Johnson and Clarke 2003; Lee 1993), for example, domestic violence, child maltreatment, and sexual behaviour. Unstructured or semi-structured interviews are common in sensitive topic research. Therefore, our work of analyzing tensions in the survivor interviews is expected to contribute to a larger research community that studies sensitive topics through the interview approach (Miller, Little, and High 2017). Our tool is openly accessible at https://github.com/jumayel06/Tension-Analysis. We encourage other scholars in Digital Humanities to conduct tension analysis in their interview projects and further improve the tool.

Conclusion
Oral history has a pivotal role to play in educating individuals and communities about the social preconditions, experiences, and long-term repercussions of mass violence. Among other things, life story interviews offer us "unique glimpses into the lived interior" of survivors (Thomson 1999, p. 26 Moore 2007; Mason 2007; Corti, Witzel, and Bishop 2005). New digital tools and techniques are therefore needed. To start, we must go beyond what Savage calls the "juicy quotes syndrome," to engage with interviews in deeper and more holistic ways (Savage 2005). Our tension tool does that, allowing us to research the interview dynamic that is at the very heart of the interview. We agree with Mayernik, who has argued that: "Digital research data, if curated and made broadly available, promise to enable researchers to ask new kinds of questions and use new kinds of analytical methods in the study of critical scientific and societal issues" (Mayernik et al. 2012).
In this work, we explored interview dynamics and how various factors influence this phenomenon. We also talked about survivor interviews and why analyzing the responses of a narrator to identify situations of tension is important. We provided details about our tension analysis architecture and discussed the components of it.
We utilized a multi-channel convolutional neural network model, which was trained on social media data, to identify emotions from our transcribed interview data, which is a core component of our framework. We have observed how emotion fluctuates throughout survivor interviews, and negative emotion appears to be the source of a stress situation most of the time. Next, we presented a discussion about hedging and boosting in speakers' narratives. These phenomena are crucial in tension detection studies and demonstrate the mood of an interviewee during a conversation. We utilize three manually constructed lexicons of hedge words, discourse markers, and booster words and apply predefined rules to disambiguate hedge terms based on the syntactic structure of the sentences. Our framework also takes length of interviewees' responses and various markers used in such interviews into consideration. We discussed our process of integrating all the discussed components and features by providing an algorithm for detecting tension in oral history interviews.
Our proposed algorithm achieves a high recall score on the transcript that we worked on, and because of this high recall, it can be used as a filtering tool to assist researchers in this area. Since very little work has been done in this research field, we hope that our findings will be beneficial to any continuation of this study. It is crucial to have a good understanding of the tension phenomenon in order to better analyze such data. Domain experts at Concordia University's Centre for Oral History and Digital Storytelling are going to perform further analysis on the interview transcripts and will provide us with more insights about the dynamics of such interviews. They are also in the process of annotating more interview data, which will help us evaluate our model more thoroughly in the future. We also plan to have our interview data annotated with different emotion categories and to train our emotion recognition model with data from the same domain, which might improve the performance of the model. Another future direction is to identify and integrate tension markers from various communication channels into the tension detection framework, which has so far mostly considered the communication content and ignored other sources such as the audio and video recordings of the interviews.
Our work of analyzing tensions in the survivor interviews sheds light on the analysis of interviews and conversations that are expected to have tensions (e.g., interviews about sensitive topics). We make our tension detection tool open source, encouraging scholars to apply the tension analysis in their interview research and/or to further improve the tool.