Some example research questions that you might be interested in:
【張愛玲往來書信集】 (the collected correspondence of Eileen Chang), 2 volumes
Shakespeare Digital Analysis
- Comprehensive analysis of all 37 plays using multiple tools
- Character network analysis and thematic pattern identification
- Language evolution tracking across different periods and genres

Historical Newspaper Topic Modeling
- Processing of 20,000+ historical articles
- Identification of emerging themes during significant historical periods
- Visualization of topic evolution over time

Social Media Sentiment Analysis
- Large-scale processing of social media content
- Real-time public opinion tracking
- Bias pattern identification across different communities
There are many ways to collect your texts (data or corpus).

Possible sources for collecting texts:
| Category | Resource Name | Description | Link |
| --- | --- | --- | --- |
| Corpus | Brown Corpus | A standard reference corpus for English, useful for linguistic research. | |
| | British National Corpus (BNC) | A 100-million-word collection of samples of written and spoken British English. | |
| | COCA (Corpus of Contemporary American English) | A large, balanced corpus of American English from 1990 to present. | |
| | CHILDES | A corpus for studying child language acquisition. | |
| Data | DATA.GOV.HK | Various types of datasets across different categories from different providers. | DATA.GOV.HK |
| | Humanitarian Data Exchange | A platform for sharing humanitarian data to improve decision-making during crises. | HDX |
| | Kaggle | A platform with datasets across various domains for analysis and machine learning. | Kaggle |
| | UCI Machine Learning Repository | A collection of datasets for machine learning and statistics. | UCI Repository |
| | Open Data Portal (data.gov) | The home of the U.S. government’s open data. | |
| | World Bank Data | Data on global development indicators. | |
| Literary Texts | Project Gutenberg | A digital library offering over 60,000 free eBooks, including classic literature. | |
| | Perseus Digital Library | A collection of historical texts in Greek, Latin, and other ancient languages. | |
| | Internet Archive | A vast collection of books, movies, music, and other digital materials. | |
| | HathiTrust Digital Library | A large-scale collaborative repository of digitized books and journals, including rare texts. | |
| | Early English Books Online (EEBO) | A collection of texts from early modern English literature, history, and culture. | |
| | LibriVox | Free audiobooks of public domain texts, read by volunteers. | |
| Historical Documents | Europeana | Access to millions of digitized items from European cultural heritage institutions. | |
| | Chronicling America | A collection of historic American newspapers from 1789 to 1963. | |
| | World Digital Library | Historical documents, maps, photos, and more from cultures around the world. | |
| | British History Online | Primary and secondary sources for the history of Britain and Ireland. | |
| | Avalon Project | Historical legal documents from ancient times to the modern era, curated by Yale Law School. | |
| | National Archives (UK) | Access to British government documents, wills, and military records. | |
| Linguistics | Ethnologue | A comprehensive reference on world languages. | |
| | SIL International | Resources for language development and documentation. | |
| | Glottolog | A bibliographic database for lesser-known languages. | |
| | Linguist List | A global online forum for linguists that includes job postings, resources, and discussions. | |
Text Pre-processing involves tasks such as:
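Common examples are lowercasing, tokenizing, and removing stop words. These three tasks are an assumption chosen for illustration, not a list taken from this page. A minimal Python sketch of them follows; the function name `preprocess` and the tiny `STOP_WORDS` set are hypothetical, and a real project would likely use a fuller stop-word list.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "it"}


def preprocess(text: str) -> list[str]:
    """Lowercase the text, split it into word tokens, and drop stop words."""
    # Words are runs of letters, optionally followed by an apostrophe part ("anne's").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


# Example usage:
# preprocess("The Captain was in the garden; Anne's heart sank.")
```

The regular expression only recognises unaccented Latin letters, so texts in other scripts would need a different tokenizer.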
Traditional Tools vs. AI-Based Tools
Traditional tools - Voyant Tools
AI-based tools - (1) Prompt AI to write Python code

1. Describe your goal to the AI.
2. Copy and paste the generated code into an IDE* (e.g. Google Colab).
3. Prompt example:
   - Draft Python code to fetch the full text of "Persuasion" by Jane Austen from this site: https://www.gutenberg.org/
   - The output should be a .txt file.
4. Run the Python code in the IDE.
5. If an error occurs, copy the error message and prompt the AI to debug it.
6. Run the revised code again.
7. Download the result, e.g. a text file or CSV file.
8. Verify the content: ensure it includes Chapter I through Chapter XXIV.

* Integrated Development Environment (IDE): a tool for writing, testing, and running Python code.
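For reference, the kind of code an AI might draft for the prompt above could look roughly like the sketch below. It uses only the Python standard library. The download URL is an assumption: it presumes "Persuasion" is Project Gutenberg ebook #105 and that the plain-text file lives at that path, so confirm both on the site before relying on it. The helper names (`download_text`, `chapters_found`, `missing_chapters`) are hypothetical.

```python
import re
import urllib.request

# Assumption: ebook #105 is "Persuasion"; check the ID and path on gutenberg.org.
PERSUASION_URL = "https://www.gutenberg.org/cache/epub/105/pg105.txt"

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}
# Matches headings such as "CHAPTER I.", "Chapter XXIV" or "Chapter 3" at line start.
CHAPTER_PATTERN = re.compile(r"^\s*chapter\s+([ivxlc]+|\d+)\b", re.IGNORECASE | re.MULTILINE)


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral such as 'XXIV' to an integer."""
    total, previous = 0, 0
    for symbol in reversed(numeral.upper()):
        value = ROMAN_VALUES[symbol]
        if value < previous:
            total -= value  # subtractive notation, e.g. the I in IV
        else:
            total += value
            previous = value
    return total


def chapters_found(text: str) -> set[int]:
    """Return the set of chapter numbers that appear as headings in the text."""
    found = set()
    for match in CHAPTER_PATTERN.finditer(text):
        label = match.group(1)
        found.add(int(label) if label.isdigit() else roman_to_int(label))
    return found


def missing_chapters(text: str, expected: int = 24) -> list[int]:
    """List chapter numbers from 1 to `expected` that have no heading (step 8)."""
    return sorted(set(range(1, expected + 1)) - chapters_found(text))


def download_text(url: str, out_path: str) -> str:
    """Fetch a plain-text file and save it locally as UTF-8 (steps 4 and 7)."""
    request = urllib.request.Request(url, headers={"User-Agent": "text-analysis-demo"})
    with urllib.request.urlopen(request, timeout=30) as response:
        text = response.read().decode("utf-8-sig")
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return text


# Example usage (needs internet access):
# text = download_text(PERSUASION_URL, "persuasion.txt")
# print(missing_chapters(text))  # an empty list means all 24 chapters were found
```

The chapter check is deliberately loose: it accepts Roman or Arabic numbering because the heading style of the downloaded file is not known in advance.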
AI-based tools - (2) Conduct preliminary text analysis using AI bots

1. Upload the .txt file to an AI bot. Some examples:

(a) Text-based analysis
- Perplexity Pro: https://www.perplexity.ai/search/perform-text-analysis-on-conco-6bwu5333QWO43iimPuRD9w?0=d#0
- Poe Assistant: https://poe.com/s/GRL0e7eNjYBpAB7yYLIS
- Claude-Sonnet-4

(b) Interactive relationship network diagram, e.g. an analysis of the character relationships in the parts of 【張愛玲往來書信集- 書不盡言】 that concern 「傾城之戀」 (Love in a Fallen City)
- Claude-Sonnet-4: https://poe.com/s/NtgESmfqKEWq7xN0MRsr
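To sanity-check a relationship diagram produced by an AI bot, it can help to compute a rough version of the underlying data yourself. The sketch below is one simple approach and not necessarily what the bots do: it counts how often two names appear in the same paragraph and treats each count as the weight of a network edge. The function name `cooccurrence_edges` and the sample names are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_edges(text: str, names: list[str]) -> Counter:
    """Count, for each pair of names, the paragraphs in which both appear."""
    edges = Counter()
    # Paragraphs are assumed to be separated by blank lines.
    for paragraph in re.split(r"\n\s*\n", text):
        # Simple case-sensitive substring match; sorted so each pair has one fixed order.
        present = sorted({name for name in names if name in paragraph})
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges


# Example usage with made-up sentences:
# text = "Anne met Wentworth.\n\nAnne spoke to Elizabeth and Wentworth."
# cooccurrence_edges(text, ["Anne", "Wentworth", "Elizabeth"])
```

Substring matching will miss nicknames and titles, so the list of names usually needs manual curation before the counts are meaningful.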