In 2012, with the support of an internal research grant from the Education University of Hong Kong (then the Hong Kong Institute of Education), we developed the first phase of The Corpus of Mid-20th Century Hong Kong Cantonese. The corpus was designed to provide language data for studying an earlier stage of Hong Kong Cantonese.
In 2013, we obtained a research grant under the Early Career Scheme (ECS) from the Research Grants Council to develop the second phase of the corpus. Altogether, 60 movies were transcribed, yielding about 770,000 character tokens. The movies selected by our project team are balanced in terms of genre and speakers (including gender), so that the corpus data can represent the Cantonese spoken in the mid-20th century.
The transcribed utterances in the corpus have been segmented and each word has been assigned a part-of-speech tag.
To enhance the searchability and usefulness of the corpus data, Chinese characters have been romanised into Jyutping.
An advanced search interface has been developed to enable users to search by full text, or by specific token patterns that combine Chinese characters, Jyutping romanisation and POS tags.
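To illustrate the idea of token-pattern search, the following is a minimal Python sketch. It is not the corpus's actual implementation or query syntax, which are not described here. The class and function names (`Token`, `TokenPattern`, `find_matches`, `full_text_search`), the in-memory data layout and the POS labels (`PRON`, `VERB`, `NOUN`) are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    """One segmented word (hypothetical layout, not the corpus's file format)."""
    chars: str      # Chinese characters of the word
    jyutping: str   # Jyutping romanisation of the word
    pos: str        # part-of-speech tag (illustrative labels only)


@dataclass(frozen=True)
class TokenPattern:
    """Constraint on a single token; a field left as None matches anything."""
    chars: Optional[str] = None
    jyutping: Optional[str] = None
    pos: Optional[str] = None

    def matches(self, token: Token) -> bool:
        return (
            (self.chars is None or self.chars == token.chars)
            and (self.jyutping is None or self.jyutping == token.jyutping)
            and (self.pos is None or self.pos == token.pos)
        )


def find_matches(utterance: List[Token], patterns: List[TokenPattern]) -> List[int]:
    """Return start indices where consecutive tokens satisfy the pattern sequence."""
    n, m = len(utterance), len(patterns)
    if m == 0:
        return []
    return [
        start
        for start in range(n - m + 1)
        if all(p.matches(utterance[start + k]) for k, p in enumerate(patterns))
    ]


def full_text_search(utterance: List[Token], query: str) -> bool:
    """Full-text search: ignore segmentation and look for the raw character string."""
    return query in "".join(t.chars for t in utterance)


# Example utterance: 我 食 飯 ("I eat rice"), with illustrative POS labels.
utterance = [
    Token("我", "ngo5", "PRON"),
    Token("食", "sik6", "VERB"),
    Token("飯", "faan6", "NOUN"),
]

# Any verb immediately followed by the word 飯.
verb_then_rice = find_matches(
    utterance, [TokenPattern(pos="VERB"), TokenPattern(chars="飯")]
)  # [1]

# A single token identified by its Jyutping form alone.
by_jyutping = find_matches(utterance, [TokenPattern(jyutping="ngo5")])  # [0]

# Full-text search across word boundaries.
found = full_text_search(utterance, "食飯")  # True
```

In this sketch, each position in a pattern can constrain any combination of the three annotation layers, which is why segmentation, POS tagging and romanisation together make such queries possible.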