The Corpus of Mid-20th Century Hong Kong Cantonese (Phase 2)

Welcome to the Corpus

In 2012, with the support of an internal research grant from the Education University of Hong Kong (then the Hong Kong Institute of Education), we developed the first phase of The Corpus of Mid-20th Century Hong Kong Cantonese. The corpus was designed to provide language data for studying an earlier stage of Hong Kong Cantonese.

In 2013, we obtained a research grant under the Early Career Scheme (ECS) from the Research Grants Council to develop the second phase of the corpus. Altogether 60 movies were transcribed, yielding about 770,000 character tokens. The movies selected by our project team are balanced in terms of genre and speakers (including gender), so that the corpus data can represent the Cantonese spoken in the mid-20th century.

The second phase of the corpus was transcribed with the ELAN software from the Max Planck Institute for Psycholinguistics.

Part-of-Speech Tagging

The transcribed utterances in the corpus have been segmented into words, and each word has been assigned a part-of-speech tag.

Jyutping Romanisation

To enhance the searchability and usefulness of the corpus data, Chinese characters have been romanised into Jyutping.

Flexible Search

An advanced search interface has been developed to enable users to search by full text, or by specific token patterns using a combination of Chinese characters, Jyutping romanisation and POS tags.
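
To illustrate what a token-pattern search of this kind involves, here is a minimal Python sketch. It is not the corpus's actual implementation or query syntax. The `Token` and `Constraint` classes, the `find_pattern` function and the POS labels (`PRON`, `VERB`, `NOUN`) are all hypothetical names chosen for this example. The sketch assumes each word is stored with its characters, its Jyutping and a POS tag, and that a pattern is a sequence of per-token constraints in which any field may be left unspecified.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    """One segmented word (hypothetical representation)."""
    chars: str      # Chinese characters
    jyutping: str   # Jyutping romanisation, syllables separated by spaces
    pos: str        # part-of-speech tag (placeholder labels, not the corpus tag set)


@dataclass(frozen=True)
class Constraint:
    """Per-token condition; a field left as None matches anything."""
    chars: Optional[str] = None
    jyutping: Optional[str] = None
    pos: Optional[str] = None

    def matches(self, token: Token) -> bool:
        pairs = (
            (self.chars, token.chars),
            (self.jyutping, token.jyutping),
            (self.pos, token.pos),
        )
        return all(want is None or want == got for want, got in pairs)


def find_pattern(tokens: List[Token], pattern: List[Constraint]) -> List[int]:
    """Return the start index of every run of tokens that satisfies the pattern."""
    n = len(pattern)
    if n == 0:
        return []
    return [
        i
        for i in range(len(tokens) - n + 1)
        if all(c.matches(t) for c, t in zip(pattern, tokens[i:i + n]))
    ]


# Illustrative utterance: 我 食 飯 ("I eat rice")
utterance = [
    Token("我", "ngo5", "PRON"),
    Token("食", "sik6", "VERB"),
    Token("飯", "faan6", "NOUN"),
]

# Pattern: any verb, immediately followed by a token romanised as "faan6"
pattern = [Constraint(pos="VERB"), Constraint(jyutping="faan6")]
hits = find_pattern(utterance, pattern)
```

In this sketch `hits` holds the starting position of each match within the utterance. Mixing fields inside one constraint, for example `Constraint(chars="食", pos="VERB")`, narrows the match to tokens that satisfy both conditions.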