2012 International Conference on Asian Language Processing

A Corpus-based Analysis of Mixed Code in Hong Kong Speech

John Lee
Halliday Centre for Intelligent Applications of Language Studies
Department of Chinese, Translation and Linguistics
City University of Hong Kong
[email protected] edu. hk

Abstract: We present a corpus-based analysis of the use of mixed code in Hong Kong speech. From transcriptions of Cantonese television programs, we identify English words embedded within Cantonese utterances, and investigate the motivations for such code-switching.
Among the many motivations observed in previous research, we found that four alone account for more than 95% of the use of English words in our speech data across genres, genders, and age groups. We performed analyses over more than 60 hours of transcribed speech, resulting in one of the largest empirical studies to date on this linguistic phenomenon.

Keywords: code-mixing; code-switching; Cantonese; English; corpus linguistics

I. INTRODUCTION

While Cantonese is the mother tongue of the vast majority of the people in Hong Kong, English is also spoken by 43% of the population [1], reflecting the city’s heritage as a British colony. A well-known feature of the speech in Hong Kong is code-switching, i.e., “the juxtaposition of passages of speech belonging to two different grammatical systems or sub-systems, within the same exchange” [2]. Specifically, in the case of Hong Kong, the two grammatical systems are Cantonese and English. The former serves as the ‘matrix language’, and the latter as the ‘embedded language’, resulting in Cantonese sentences with English segments such as (example taken from [3]):

heoi3 canteen jam2 caa4
‘let’s go to the canteen for lunch’

Here, the English segment contains only one word (‘canteen’), but in general it can be a whole clause. We will use the general term ‘code-switching’ rather than the more specific term ‘code-mixing’, which refers to switching below the clause level, although most English segments in our corpus indeed contain only one or two words (see Table 3).

There is already a large body of literature devoted to the study of Cantonese-English code-switching from the theoretical linguistic point of view [3,4,5]. This paper investigates the motivations behind the use of mixed code, on the basis of a large dataset of speech transcribed from television programs. In Section II, we outline previous research on the motivations of code-switching, and discuss how our investigation complements it. In Section III, we describe our methodology for corpus construction, in particular the design of the taxonomy of code-switching motivations. In Section IV, we present an analysis of these motivations according to genre, gender and age.

II. PREVIOUS RESEARCH

The first major framework for classifying code-switching motivations in Hong Kong consists of two categories: ‘expedient’ and ‘orientational’ [6]. Central to this framework is the distinction between words in ‘high Cantonese’ and ‘low Cantonese’. In everyday conversations, a speaker sometimes cannot find any word from ‘low Cantonese’ to describe an object, institution or idea (e.g., ‘application form’). Using a word from ‘high Cantonese’ (e.g., biu2 gaak3), however, would sound too formal and therefore stylistically inappropriate.
In expedient mixing, the speaker resorts to an English word; the mixing is pragmatically motivated. In contrast, orientational mixing is socially motivated. The speaker chooses to use English (e.g., ‘barbecue’) despite the availability of equivalent words from both ‘low Cantonese’ (e.g., siu2 je5 sik6) and ‘high Cantonese’ (e.g., siu1 haau1), since the speaker perceives the subject matter to be inherently more ‘western’. This dichotomy has been criticized as overly simplistic, because of the ambiguity in defining lexical and stylistic equivalents among ‘low Cantonese’, ‘high Cantonese’, and English. Instead, a four-way taxonomy is proposed: euphemism, specificity, bilingual punning, and the principle of economy [7]. This taxonomy is then further extended, in a study of code-switching in text media [8], to include quotations, doubling, identity marking, and interjection. These categories will be explained in detail in Section III.

While these classification systems are comprehensive and well grounded, they do not per se convey any sense of the relative importance or distribution of the various motivations.
Our goal is, first, to empirically verify the coverage of these classification systems on a large dataset of transcribed speech; and, second, to provide quantitative answers to questions such as: Which types of motivations are the most prominent? Does the range of motivations vary according to the speech genre, or to the speaker’s gender or age? We now turn our attention to the methodology for constructing and annotating a speech corpus for these research purposes.

978-0-7695-4886-9/12 $26.00 © 2012 IEEE. DOI 10.1109/IALP.2012.10

III. DATA

A. Source Material

Our corpus is constructed from television programs broadcast in Hong Kong within the last four years by Television Broadcasts Limited (TVB). The programs belong to a variety of genres, including two drama series, three current-affairs shows, a news program, and a talk show. The news program, TVB News at Six-Thirty, carries the most formal register, containing mostly pre-planned speech by the anchor. The current-affairs shows, Tuesday Report, Sunday Report and Hong Kong Connection, are serious in tone but contain spontaneous discussions. The talk show, My Sweets, is about food and drink.
It also contains spontaneous discussions, but the topics tend to be lighter. Although pre-planned, the speech in both drama series, Moonlight Resonance and Yes Sir, Sorry Sir, is arguably the least formal in register, designed to reflect natural speech in everyday life. Details of these TV programs are presented in Table 1.

Table 1: Television programs that serve as the source material of our corpus.

Genre            Program                                                Length
Current affairs  Tuesday Report, Sunday Report, Hong Kong Connection    135 episodes x 20 minutes
Talk show        My Sweets                                              24 episodes x 30 minutes
News             TVB News at Six-Thirty                                 5 episodes x 20 minutes
Drama            Moonlight Resonance, Yes Sir, Sorry Sir                4 episodes x 45 minutes

B. Data Processing

From the television programs listed in Table 1, all code-mixed utterances were transcribed, preserving the original languages, either Cantonese or English. Following standard practice, loan words are not considered to be mixed code; in our context, all English words (e.g., ‘taxi’) that have been adapted into Cantonese phonology (e.g., dik1 si2) were excluded. The TV captions corresponding to each of these utterances are also recorded as part of the corpus. These captions are in standard Chinese, rather than Cantonese. Furthermore, alignments between the Chinese word(s) in the caption and the English word(s) in the utterance are annotated. This information will be used in the classification of motivations. Finally, two types of metadata about the speaker are recorded: gender (male or female) and age group (teenager or adult).

C. Taxonomy of Code-Switching Motivations

Our goal is to quantitatively characterize the motivations behind code-switching; to this end, each English segment in the Cantonese sentences in our corpus is to be labeled with a motivation. Due to time constraints, this classification was performed only on the current-affairs and talk shows.

The ‘expedient’ vs. ‘orientational’ classification system is too coarse for our purpose. Instead, we adopted the taxonomy in [7,8] as our starting point, then introduced some new categories to accommodate our data. The categories in [7] are as follows (see footnote 1; all examples are taken from [7]):

Euphemism: When a Cantonese word explicitly mentions something that the speaker finds embarrassing, s/he may opt for an English word that contains no such mention. For example, to avoid the female body part hung1 ‘breast’ in the word hung1 wai4 ‘bra’, the speaker may prefer to use the English ‘bra’:

tau3 bra gaak3 gaak3
‘A princess whose bra is visible’

Specificity: “Sometimes an English expression is preferred because its meaning is more general or specific compared with its near-synonymous counterparts,” [7] in either low or high Cantonese. For example, the verb ‘to book’ means ‘to make a reservation for which no money or deposit is required’, which is more specific than its closest equivalent in Cantonese, deng6 ‘to make a reservation’. It is typically used in sentences such as:

ngo5 soeng2 book saam1 dim2
‘I want to book three o’clock’

Principle of Economy: “An English expression may also be preferred because it is shorter and thus requires less linguistic effort compared with its Chinese/Cantonese equivalent.” [7] While the word ‘check-in’ has two syllables, its Cantonese equivalent baan6 lei5 dang1 gei1 sau2 zuk6 ‘check-in [on a plane]’ has six. The principle of economy is thus likely the reason behind mixed code such as:

nei5 check-in zo2 mei6 aa3
‘Have you checked in already?’

Footnote 1: A fourth category, ‘bilingual punning’, is excluded from our taxonomy. As may be expected, punning is rarer in speech, and is indeed not found in our corpus.

The taxonomy in [8] builds on the one in [7], further enriching it with the categories below (see footnote 2):

Quotation: When citing text or someone else’s speech, one often prefers to use the original code to avoid having to perform translation. An example is direct speech:

jau5 go3 pang4 jau5 man6 ngo5 what do you think
‘A friend asked me, “What do you think?”’

Doubling: Originally named ‘Emphasis or avoidance of repetition’ [8], it will be referred to as ‘Doubling’ [9] here to make it explicit, as this category refers to English words that are embedded alongside Cantonese words that have the same or nearly the same meaning. The purpose is to emphasize the idea or to avoid repetitions. In the following sentence, it serves as an emphasis:

excellent m4 co3 aa1
‘Superb, excellent!’

Interjection: English interjections may be inserted into the Cantonese sentence. For example:

anyway nei5 hou2 sai1 lei6 ak1
‘Anyway, you’re awesome!’

Footnote 2: Among these categories is ‘identity marking’, for mixed code that marks “social characteristics such as social status, education status, occupation, as well as regional affiliation.” [8] We found it difficult to objectively identify this motivation, and excluded it from our taxonomy.

A significant amount of mixed code in our corpus, however, still does not fit into any of the above categories. Most of it falls under one of two reasons, ‘Personal Name’ and ‘Register’.
We therefore added them to our taxonomy:

Register: This is roughly equivalent to the ‘expedient’ category in [6], but will be referred to as ‘Register’ in this paper to make the motivation explicit. Sometimes, the speaker cannot find any equivalent ‘low Cantonese’ word, but feels awkward using a more formal ‘high Cantonese’ word (e.g., paai1 deoi3 ‘party’). As a result, s/he resorts to an English equivalent instead. For example:

hoi1 ci2 laa1 ngo5 dei6 go3 party
‘Our party is starting’

Personal Name: It is common practice among Hong Kong people to adopt an English name. Although this phenomenon may be considered ‘orientational’ code-mixing in terms of the ‘western’ perception [6], it is given its own category, because it is very specific and accounts for a substantial amount of our data. A typical example is:

Teresa ngo5 dei6 zing2 dak1 leng3 m4 leng3
‘Teresa, did we make it well?’

D. Annotation Procedure

We thus have a total of eight categories in our taxonomy of code-switching motivations. Five of these categories, namely ‘euphemism’, ‘quotation’, ‘doubling’, ‘interjection’, and ‘personal name’, can usually be unambiguously discerned. The annotator, however, has often found it difficult to distinguish between ‘specificity’, ‘register’, and ‘principle of economy’. To maintain consistency, we adopted the following procedure. When an English segment does not fit into any of the five “easy” categories, the annotator is to decide whether it has the same meaning as the Chinese word in the caption to which it is aligned. If it is deemed not to have the same meaning, then it is assigned ‘specificity’. If it is equivalent in meaning, and the annotator cannot think of any equivalent in ‘low Cantonese’, then it is labeled ‘register’.
Finally, if there is a ‘low Cantonese’ equivalent, but its number of syllables is greater than that of the English segment, then the motivation is ‘principle of economy’.

IV. ANALYSIS

This section presents some preliminary analyses on this corpus. We first consider the frequency and length of English segments in Cantonese speech (Section A), then discuss the distribution of the categories of motivations, both overall and with respect to genres, genders, and age groups (Section B).

A. Density and Length of English Segments

It is well known that English words are sprinkled rather liberally in the Cantonese speech in Hong Kong. We measure how the frequency of English segments varies across different genres. As shown in Table 2, the frequency correlates with the register of the genre (see Section III.A). In the drama series, the most colloquial genre, roughly one and a half English words (1.4) are uttered per minute on average. The talk show occupies second place, and the current-affairs shows have slightly less frequent English words. In the news program, where the speech is pre-planned, the anchor did not utter any English word.

Table 2: The total number of Cantonese sentences containing English segments, and the total number of English words transcribed. The last column shows how often an English word is uttered.

Program genre    # sentences with English   # English words   Frequency (words/min)
Drama            219                        259               1.4
Talk show        487                        625               0.87
Current affairs  1495                       1995              0.74
News             0                          0                 0

Second, we measure the length of the English segments. Table 3 shows that the vast majority of English segments contain no more than two words. Across all genres, more than 80% of the English segments consist of only one English word. This figure is comparable to the 81.4% for text data reported in [8].

Table 3: Percentage of English segments with only one (e.g., “canteen”) or two words (e.g., “thank you”).

Program genre    One-word   Two-word
Drama            85%        11%
Current affairs  85%        11%
Talk show        81%        17%

B. Motivations for the Use of Mixed Code

A plethora of motivations have been posited for the use of mixed code in Hong Kong (see Section II). Applying our proposed classification system (see Section III.C) to our corpus of transcribed speech, we aim now to discern the relative prevalence of the various types of code-switching motivations. Table 4 shows the distribution of these motivations in the current-affairs and the talk shows.
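The motivation labels counted in this distribution come from the annotation procedure of Section III.D. That procedure can be sketched roughly as a decision function. This is only an illustration under stated assumptions: the Segment fields and the classify name are hypothetical, the annotator's human judgements are modeled as precomputed inputs, and the case where a low Cantonese equivalent exists but is not longer than the English segment is not covered by the procedure, so the sketch returns None for it.

```python
from dataclasses import dataclass
from typing import Optional

# The five categories that can usually be discerned unambiguously (Section III.D).
EASY_CATEGORIES = {"euphemism", "quotation", "doubling", "interjection", "personal name"}


@dataclass
class Segment:
    """One English segment plus the annotator's judgements (hypothetical field names)."""
    english: str
    english_syllables: int
    easy_category: Optional[str]            # one of EASY_CATEGORIES, or None
    same_meaning_as_caption: bool           # compared with the aligned Chinese caption word
    low_cantonese_syllables: Optional[int]  # None if no low Cantonese equivalent comes to mind


def classify(seg: Segment) -> Optional[str]:
    """Return a motivation label following the order of checks in Section III.D."""
    if seg.easy_category is not None:
        if seg.easy_category not in EASY_CATEGORIES:
            raise ValueError(f"unknown easy category: {seg.easy_category}")
        return seg.easy_category
    if not seg.same_meaning_as_caption:
        return "specificity"
    if seg.low_cantonese_syllables is None:
        return "register"
    if seg.low_cantonese_syllables > seg.english_syllables:
        return "principle of economy"
    # Assumption: the procedure does not say what to do here, so leave it unlabeled.
    return None


# Illustrative input: 'check-in' has two syllables and its Cantonese equivalent six.
# Treating that equivalent as the low Cantonese form is an assumption of this example.
print(classify(Segment("check-in", 2, None, True, 6)))  # principle of economy
```

Ordering the checks this way mirrors the text: the five easy categories are settled first, and syllable counts are compared only once meaning equivalence and the existence of a low Cantonese equivalent have been decided.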
Four dominant motivations, chiefly ‘register’, but also ‘personal name’, ‘principle of economy’, and ‘specificity’, account for more than 95% of the English segments. This trend is the same across genres (current-affairs and talk shows), genders (see Table 6), and age groups (see Table 5). All other categories, including quotations, euphemism, doubling, and interjection, are relatively infrequent.

Genres. Among the four dominant motivations, ‘register’ (the use of appropriately informal words) is the most frequent motivation in both the current-affairs and talk shows. Its proportion, however, is considerably higher in the talk show (47.4%) than in current affairs (36.4%), reflecting the more informal nature of the former.

Table 4: Distribution of code-switching motivations, contrasted between genres.

Motivation            Current affairs   Talk show
Register              36.4%             47.4%
Personal Name         26.8%             24.5%
Principle of economy  19.0%             17.6%
Specificity           13.2%             8.2%
Quotation             2.1%              1.0%
Doubling              1.4%              0.4%
Interjection          0.9%              1.0%
Euphemism             0.3%              0%

Age groups. Table 5 contrasts the distributions of code-switching motivations between adults and teenagers in the current-affairs shows (see footnote 3). As mentioned above, the four major motivations remain constant. However, teenagers are much more likely than adults to use English words to achieve a more informal register (52.4% vs. 35.1%). They also tend more to opt for English to save effort (23.8% vs. 18.6%). Somewhat surprisingly at first glance, teenagers address others by English names much less often than adults (2.4% vs. 28.8%); it turns out that in the conversations in our corpus, teenagers often prefer to address adults with the more formal Chinese names, likely out of respect.
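As a quick arithmetic check on the 95% figure for the two genres, the shares of the four dominant motivations in Table 4 can be summed per genre. The sketch below only re-adds the published percentages; the variable and function names are illustrative, and because the inputs are rounded to one decimal place the sums are approximate.

```python
# Percentages copied from Table 4 (current-affairs shows vs. talk show).
TABLE_4 = {
    "current affairs": {
        "register": 36.4, "personal name": 26.8, "principle of economy": 19.0,
        "specificity": 13.2, "quotation": 2.1, "doubling": 1.4,
        "interjection": 0.9, "euphemism": 0.3,
    },
    "talk show": {
        "register": 47.4, "personal name": 24.5, "principle of economy": 17.6,
        "specificity": 8.2, "quotation": 1.0, "doubling": 0.4,
        "interjection": 1.0, "euphemism": 0.0,
    },
}

DOMINANT = ("register", "personal name", "principle of economy", "specificity")


def dominant_share(distribution: dict) -> float:
    """Sum the shares of the four dominant motivations, rounded to one decimal."""
    return round(sum(distribution[m] for m in DOMINANT), 1)


for genre, distribution in TABLE_4.items():
    print(f"{genre}: {dominant_share(distribution)}%")
# current affairs: 95.4%
# talk show: 97.7%
```

Both genre totals exceed 95%, in line with the figure quoted for Table 4.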
Table 5: Distribution of code-switching motivations, contrasted between age groups.

Motivation            Adults   Teenagers
Register              35.1%    52.4%
Personal Name         28.8%    2.4%
Principle of economy  18.6%    23.8%
Specificity           13.1%    14.3%
Quotation             1.9%     4.0%
Doubling              1.3%     2.4%
Interjection          0.9%     0%
Euphemism             0.3%     0.8%

Footnote 3: The speakers in the talk show are predominantly adults.

Genders. Finally, we investigate whether code-switching motivations are biased according to gender. Aggregating statistics from both the current-affairs and talk shows, Table 6 compares the motivations of males and those of females. Females are shown to be more likely to use English names to address others (32.9% vs. 18.9%); males, on the other hand, more frequently use English words to reduce effort (22.9% vs. 14.8%).

Table 6: Distribution of code-switching motivations, contrasted between genders.

Motivation            Female   Male
Register              37.5%    40.7%
Personal Name         32.9%    18.9%
Principle of economy  14.8%    22.9%
Specificity           10.9%    13.2%
Quotation             1.9%     1.7%
Doubling              1.1%     1.3%
Interjection          0.7%     1.1%
Euphemism             0.3%     0.2%

V. CONCLUSIONS

We have described the construction of a corpus of Cantonese-English mixed code, based on speech transcribed from television programs in Hong Kong. Drawn from more than 60 hours of speech, this corpus is among the largest of its kind. A novel feature of the corpus is the annotation of the motivation behind each code-mixed utterance. Having proposed a classification system for these motivations, we applied it to our corpus, and reported differences in the use of mixed code between genres, genders and age groups. A key finding is that four main motivations (‘register’, ‘personal name’, ‘principle of economy’, and ‘specificity’) account for more than 95% of the embedded English segments.

ACKNOWLEDGMENT

This project was partially funded by a Small-Scale Research Grant from the Department of Chinese, Translation and Linguistics at City University of Hong Kong. We thank Man Chong Mak and Hiu Yan Wong for compiling the corpus and performing annotation.

REFERENCES

[1] K. H. Y. Chen, “The Social Distinctiveness of Two Code-mixing Styles in Hong Kong,” in Proceedings of the 4th International Symposium on Bilingualism, MA: Cascadilla Press, 2005, pp. 527-541.
[2] J. Gumperz, “The sociolinguistic significance of conversational code-switching,” in RELC Journal 8(2), 1977, pp. 1-34.
[3] J. Gibbons, “Code-mixing and koineizing in the speech of students at the University of Hong Kong,” in Anthropological Linguistics 21(3), 1979, pp. 113-123.
[4] B. H.-S. Chan, “How does Cantonese-English code-mixing work?” in Language in Hong Kong at Century’s End, M. C. Pennington (ed.), Hong Kong: Hong Kong University Press, 1998, pp. 191-216.
[5] D. C. S. Li, “Linguistic convergence: Impact of English on Hong Kong Cantonese,” in Asian Englishes 2(1), 1999, pp. 5-36.
[6] K. K. Luke, “Why two languages might be better than one: motivations of language mixing in Hong Kong,” in Language in Hong Kong at Century’s End, M. C. Pennington (ed.), Hong Kong: Hong Kong University Press, 1998, pp. 145-159.
[7] D. C. S. Li, “Cantonese-English code-switching research in Hong Kong: a Y2K review,” in World Englishes 19(3), 2000, pp. 305-322.
[8] H. Cao, “Development of a Cantonese-English code-mixing speech recognition system,” PhD dissertation, Chinese University of Hong Kong, 2011.
[9] R. Appel and P. Muysken, Language Contact and Bilingualism. London: Arnold, 1987.
