• Hacking Humanities Midterm Exam

    Reed Schubert

    Introduction

    For this project, I analyzed a text file for word frequency and compared those frequencies across the different stories it contains. I created a word cloud and a combined line and stacked bar chart to visualize the most frequently used words in the entire document and how their use fluctuates across the stories. This approach gave me insight into the themes of the text, showing how specific words can dominate some stories while appearing less frequently in others.

    Sources

    The dataset I used was Zitkala-Sa-American-Indian Stories.txt, a plain-text copy of Zitkala-Sa’s book American Indian Stories. The file comes from Project Gutenberg, which hosts freely available public-domain texts. American Indian Stories is a collection of autobiographical narratives describing the cultural struggles that Native Americans faced in the early 20th century. Because the book is structured as a series of short stories, I split the original file into ten .txt files, each containing a single story. Restructuring the dataset in this way let me analyze how word frequency changed between stories.
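
    The write-up does not say how the split was carried out, and it could just as easily be done by hand. As a minimal sketch of one way to automate it, the Python below cuts a text into stories wherever a line exactly matches a known story title. The function name split_by_titles, the sample text, and the placeholder titles are all hypothetical and are not taken from the Project Gutenberg file.

```python
def split_by_titles(text, titles):
    """Split text into {title: body}, using lines that exactly match a title."""
    lines = text.splitlines()

    # Record the line index of every heading line that matches a known title.
    starts = []
    for i, line in enumerate(lines):
        if line.strip() in titles:
            starts.append((i, line.strip()))

    # Each story runs from the line after its title to the next title (or the end).
    stories = {}
    for n, (start, title) in enumerate(starts):
        end = starts[n + 1][0] if n + 1 < len(starts) else len(lines)
        stories[title] = "\n".join(lines[start + 1:end]).strip()
    return stories


# Hypothetical two-story sample standing in for the full book.
sample = """FIRST STORY
My mother spoke softly.

SECOND STORY
The school was far away."""

stories = split_by_titles(sample, ["FIRST STORY", "SECOND STORY"])
```

    Each entry in stories could then be written to its own .txt file and uploaded separately. This approach assumes every title sits alone on its own line, which would need to be checked against the actual file.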

    Processes

    To analyze my data, I used the online tool Voyant. I uploaded all of the individual .txt files I had created, which let me generate a comprehensive analysis of the data and visualize trends between stories. Voyant offers multiple tools for analyzing word frequency in a dataset, but I was particularly interested in the word cloud and the trend graph. The word cloud scans the entire dataset and creates a visualization of the most frequently appearing words. The more often a word appears in the dataset, the larger it is in the word cloud. The trend graph, by contrast, displays the frequency of these most common words in each individual story. It does this with a combined line and stacked bar chart, which I felt was the best way to visualize the fluctuations in word usage between stories.
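
    To make the two views concrete, here is a minimal Python sketch of the counting that sits behind them. It is not Voyant's implementation. It assumes common function words are filtered out with a stop list, and it reads the trend graph as relative frequency (a word's count divided by the number of counted words in that story). The stop list, the function names, and the two sample stories are illustrative assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); a real tool would use a far longer one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "my", "me", "her", "she"}


def tokenize(text):
    """Lowercase the text and return its words, leaving out stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def corpus_counts(stories):
    """Count every remaining word across all stories (the word cloud view)."""
    total = Counter()
    for text in stories.values():
        total.update(tokenize(text))
    return total


def relative_frequency(term, text):
    """Share of a story's counted words that equal term (the trend graph view)."""
    tokens = tokenize(text)
    return tokens.count(term) / len(tokens) if tokens else 0.0


# Hypothetical two-story corpus, not text from the book.
stories = {
    "story_one": "My mother told me of the white man and my mother wept.",
    "story_two": "The Indian woman walked to the school in the snow.",
}

top_terms = corpus_counts(stories).most_common(3)
trend = {name: relative_frequency("mother", text) for name, text in stories.items()}
```

    In this sample, "mother" is the most frequent word overall, yet it appears only in story_one. That is the kind of contrast the word cloud and trend graph are meant to surface when viewed together.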

    Presentation

    I embedded the word cloud and trend graph in my WordPress website using custom HTML blocks, which lets visitors to the site interact with the visualizations. In the word cloud, users can change the number of terms displayed, so they can see more than the 25 most frequent words in the dataset if they wish. In the trend graph, users can change the type of visualization. By default it is a combined line and stacked bar chart, but they can switch to other options such as an area or column graph. For the design of the website, I changed the theme from the default to use a new color palette and typography, which I thought would make the presentation look cleaner and more professional. I also displayed the word cloud and trend graph side by side, as this makes it easier to directly compare the overall frequency of words in the dataset with the fluctuation of those words between the stories.

    Significance

    One insight that arises from this project is that word frequency can reveal the underlying themes within a text. Many of the more frequent words, such as “woman”, “mother”, “white”, and “Indian”, reflect the central ideas of American Indian Stories. By examining how these words fluctuate between the individual narratives, we can also see how these themes are emphasized in specific stories. For example, the word “mother” is very prominent in Impressions of an Indian Childhood and An Indian Teacher Among Indians, but much less so in the others. Similarly, the word “Indian” appears most frequently in A Dream of Her Grandfather, America’s Indian Problem, The Widespread Enigma Concerning Blue-Star Woman, and An Indian Teacher Among Indians.


    This project relates to the Digital Arts and Humanities rather than data science, as it focuses on drawing cultural and thematic meaning from these word frequencies. While data science typically focuses on statistical analysis and prediction, the Digital Arts and Humanities emphasize the interpretation of data through a cultural or humanistic lens. In this project, the word frequencies were analyzed not for their numerical significance but for what they reveal about the themes of Zitkala-Sa’s American Indian Stories. Examining the fluctuations of word usage across these autobiographical narratives sheds light on the underlying themes of Native Americans’ experiences in the early 20th century. This analysis can lead to further cultural and historical exploration.