Exploring Text Analysis for Digital Scholarship: Tools, AI, and Responsible Practices

A text analysis research project typically follows five steps:

  1. Identify the critical gap that your research will address
  2. Find your texts
  3. Clean and prepare your texts
  4. Analyze your texts
  5. Visualize your texts

1. Identify the critical gap that your research will address

Some example research projects that might interest you:

 

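Eileen Chang's Collected Correspondence (張愛玲往來書信集), 2 volumes

  • Analyze how personal relationships and emotions change over forty years of correspondence
  • Explore how the letters' content connects to the themes of Eileen Chang's literary works
  • Reveal the letters' observations and reflections on their historical and cultural context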

 

Shakespeare Digital Analysis

  • Comprehensive analysis of all 37 plays using multiple tools

  • Character network analysis and thematic pattern identification

  • Language evolution tracking across different periods and genres

 

Historical Newspaper Topic Modeling

  • Processing of 20,000+ historical articles

  • Identification of emerging themes during significant historical periods

  • Visualization of topic evolution over time

 

Social Media Sentiment Analysis

  • Large-scale processing of social media content

  • Real-time public opinion tracking

  • Bias pattern identification across different communities

2. Find your texts

There are many ways to collect your texts (your data or corpus). Possible sources, grouped by category:

Corpus

  • Brown Corpus : A standard reference corpus for English, useful for linguistic research.
  • British National Corpus (BNC) : A 100-million-word collection of samples of written and spoken British English.
  • COCA (Corpus of Contemporary American English) : A large, balanced corpus of American English from 1990 to present.
  • CHILDES : A corpus for studying child language acquisition.

Data

  • DATA.GOV.HK : Various types of datasets across different categories from different providers.
  • Humanitarian Data Exchange (HDX) : A platform for sharing humanitarian data to improve decision-making during crises.
  • Kaggle : A platform with datasets across various domains for analysis and machine learning.
  • UCI Machine Learning Repository : A collection of datasets for machine learning and statistics.
  • Open Data Portal (Data.gov) : The home of the U.S. government's open data.
  • World Bank Data : Access to data for global development indicators.

Literary Texts

  • Project Gutenberg : A digital library offering over 60,000 free eBooks, including classic literature.
  • Perseus Digital Library : A collection of historical texts in Greek, Latin, and other ancient languages.
  • Internet Archive : A vast collection of books, movies, music, and other digital materials.
  • HathiTrust Digital Library : A large-scale collaborative repository of digitized books and journals, including rare texts.
  • Early English Books Online (EEBO) : A collection of texts from early modern English literature, history, and culture.
  • LibriVox : Free audiobooks of public domain texts, read by volunteers.

Historical Documents

  • Europeana : Access to millions of digitized items from European cultural heritage institutions.
  • Chronicling America : A collection of historic American newspapers from 1789 to 1963.
  • World Digital Library : Historical documents, maps, photos, and more from cultures around the world.
  • British History Online : Primary and secondary sources for the history of Britain and Ireland.
  • Avalon Project : Historical legal documents from ancient times to modern, curated by Yale Law School.
  • National Archives (UK) : Access to British government documents, wills, and military records.

Linguistics

  • Ethnologue : A comprehensive reference on world languages.
  • SIL International : Resources for language development and documentation.
  • Glottolog : A bibliographic database for lesser-known languages.
  • Linguist List : A global online forum for linguists that includes job postings, resources, and discussions.

 

3. Clean and prepare your texts

Text pre-processing involves tasks such as:

  • Tokenization (splitting text into individual words or tokens)

  • Stopword removal (filtering out very common words that carry little meaning, such as "a," "an," and "the")

  • Stemming or lemmatization (reducing words to a root form by trimming suffixes, or to their dictionary base form)
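
The three tasks above can be sketched in a few lines of plain Python. This is a minimal illustration, not a recommended pipeline: the stopword list and the `crude_stem` helper are hypothetical stand-ins invented for this example, and a real project would more likely rely on a library such as NLTK for a fuller stopword list and a proper stemmer or lemmatizer.

```python
import re

# Assumption: a tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "the", "and", "of", "to", "in", "is", "was"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def crude_stem(token):
    """Strip a few common English suffixes (a toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, remove stopwords, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stopwords(tokenize(text))]

print(preprocess("The captain was walking in the gardens and admired the roses."))
# ['captain', 'walk', 'garden', 'admir', 'ros']
```

Note that stems such as "admir" and "ros" are not dictionary words. That is the practical difference between stemming and lemmatization, which would return "admire" and "rose" instead.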

4. Analyze your texts

Traditional Tools

  • TAPoR (Text Analysis Portal for Research) : A gateway for discovering text analysis tools and resources.
  • Voyant Tools : A web-based environment for reading, analyzing, and visualizing texts.
  • AntConc : A desktop concordancer for keyword-in-context searches, word lists, and collocation statistics.
  • MALLET : A Java-based machine learning toolkit run from the command line, known for its strong topic modeling and document clustering capabilities.
  • NLTK (Natural Language Toolkit) : A Python library that enables users to process and analyze human language data through classification, tokenization, tagging, and other methods.
  • R Packages : Add-on libraries for R, a programming language commonly used by humanists for statistical analysis and visualization.
  • And more…
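
To make the idea of concordance analysis concrete, here is a minimal keyword-in-context (KWIC) sketch in plain Python. It is not how AntConc or any other tool listed above is implemented. The `kwic` function, its `width` parameter, and the sample sentence are all invented for illustration.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each match of keyword."""
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            # Take up to `width` tokens on each side of the match.
            left = " ".join(tokens[max(0, i - width): i])
            right = " ".join(tokens[i + 1: i + 1 + width])
            hits.append((left, token, right))
    return hits

# Hypothetical sample text, written for this example.
sample = ("The letter arrived late. She read the letter twice "
          "before she answered the letter.")

for left, word, right in kwic(sample, "letter", width=2):
    print(f"{left:>15} | {word} | {right}")
```

Each output row shows one occurrence of the keyword with its neighbouring words, which is the basic view a concordancer gives before any statistics are added.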

Traditional Tools vs. AI-Based Tools

  • Traditional tools like MALLET, NLTK, and R require command-line or programming skills (for example, Python for NLTK).
  • Lack of coding knowledge creates a barrier for many humanities scholars.
  • AI-powered tools let users request complex analyses through plain-language prompts.
  • These tools can help bridge the gap between technical expertise and scholarly research.

5. Visualize your texts

Traditional tools - Voyant Tools

AI-based tools - (1) Prompt an AI to write Python code

1. Describe your goal

2. Copy the code the AI generates and paste it into an IDE* (e.g. Google Colab)

3. Prompt example:

- Draft Python code to fetch the full text of "Persuasion" by Jane Austen from this site: https://www.gutenberg.org/

- The output should be a .txt file.

* IDE (Integrated Development Environment): a tool for writing, testing, and running Python code.

4. Run the Python code in the IDE

5. If an error occurs, copy the error message and prompt the AI to debug the code

6. Run the revised code

7. Download the result (e.g. a text file or CSV file)

8. Verify the content: ensure it includes Chapter I through Chapter XXIV
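
The code an AI returns for the prompt above will vary, but it might resemble the following sketch, which also automates the verification in step 8. Several details are assumptions, not part of the original guide: the download URL (which assumes Project Gutenberg serves "Persuasion" as ebook #105 at this address, so confirm it on the site first), the function names, and the rule that a chapter heading sits on its own line as "Chapter" followed by a Roman or Arabic numeral.

```python
import re
import urllib.request

# Assumption: "Persuasion" is ebook #105 and its plain text lives at this URL.
PERSUASION_URL = "https://www.gutenberg.org/cache/epub/105/pg105.txt"

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}

def roman_to_int(numeral):
    """Convert a Roman numeral such as 'XXIV' to an integer."""
    values = [ROMAN_VALUES[c] for c in numeral.upper()]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (IV = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total

def download_text(url, path):
    """Download a plain-text file, save it as UTF-8, and return its content."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8-sig")
    text = text.replace("\r\n", "\n")  # normalize line endings
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return text

def chapters_found(text):
    """Return the set of chapter numbers that appear as standalone headings."""
    pattern = re.compile(r"^[ \t]*chapter[ \t]+([ivxlc]+|\d+)\.?\s*$",
                         re.IGNORECASE | re.MULTILINE)
    numbers = set()
    for match in pattern.finditer(text):
        label = match.group(1)
        numbers.add(int(label) if label.isdigit() else roman_to_int(label))
    return numbers

def missing_chapters(text, expected=24):
    """List the chapter numbers from 1 to `expected` that were not found."""
    return sorted(set(range(1, expected + 1)) - chapters_found(text))

# Example usage (needs internet access):
# text = download_text(PERSUASION_URL, "persuasion.txt")
# print(missing_chapters(text))  # an empty list means all 24 chapters were found
```

An empty list from `missing_chapters` suggests the download is complete. A non-empty list is a cue to inspect the file by hand, since the heading format in the downloaded edition may differ from the assumption above.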

AI-based tools - (2) Conduct preliminary text analysis using AI bots

1. Upload the .txt file to an AI bot
2. Prompt the AI to perform text analysis


Some examples:

(a) Text-based analysis

Perplexity Pro

https://www.perplexity.ai/search/perform-text-analysis-on-conco-6bwu5333QWO43iimPuRD9w?0=d#0

Poe Assistant

https://poe.com/s/GRL0e7eNjYBpAB7yYLIS

Claude-Sonnet-4

https://poe.com/s/2xWr0cluORHqCMbeSjgz

 

(b) Interactive relationship network diagram

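e.g. an analysis of character relationships in the parts of Eileen Chang's collected correspondence (張愛玲往來書信集 - 書不盡言) that discuss "Love in a Fallen City" (傾城之戀)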

 

Claude-Sonnet-4

https://poe.com/s/NtgESmfqKEWq7xN0MRsr

 

Open the link above and click the diagram to explore it interactively