English Usage Comparison between Native and Non-Native English Speakers in Academic Writing

Bei Yu (beiyu@uiuc.edu), Qiaozhu Mei (qmei2@uiuc.edu), Chengxiang Zhai (czhai@cs.uiuc.edu)
University of Illinois at Urbana-Champaign

Introduction

Discovering the differences in language usage between native and non-native speakers contributes to contrastive linguistics research and second language acquisition. Unlike grammatical errors, such usage differences are grammatically correct but do not conform to native expressions. For example, if an English phrase is commonly used in a native English speaker group (G1) but not in a Chinese speaker group (G2), or vice versa, one possible reason is the impact of the first language (L1) on the second language (L2). Such an impact may stem from grammar differences or even cultural differences. Isolating these usage differences is never an easy task. There are generally two approaches to detecting them: controlled experiments and corpus-based analysis. Many English as a Second Language (ESL) studies were conducted in a controlled environment, for example, by manually comparing short in-class writing samples from a small number of college students in G1 and G2. The small data sets and small subject groups undermined the generalizability of the results, and conflicting conclusions were sometimes drawn about the same language usage. In contrast, corpus-based approaches (for example, computational stylistics) facilitate analysis of larger sample sets written by many subjects. Such approaches involve three major steps: 1) building a corpus of two comparable subsets; 2) automatically extracting a language usage feature set; 3) selecting a subset of the features that distinguishes G1 and G2. Given a corpus and a feature set, we can transform the above task into a text categorization problem.
The goal is to categorize the texts by the authors' language background. Picking the subset is then transformed into the feature selection problem in text categorization. Oakes used the chi-square test to find vocabulary subsets more typical of British English or American English. Oakes' work focused on identifying two feature categories: (1) features common in G1 but not in G2; and (2) features common in G2 but not in G1. We hypothesize the existence of a third discriminative feature category: (3) features common in both G1 and G2, but with different usage frequencies. All other features are considered irrelevant. Distinguishing these categories of features allows us to discover subtle differences in language usage. In this paper, we propose a simple approach of comparative feature analysis to compare the language usage of native English speakers and Chinese speakers in academic writing. We first select candidate features that are common in at least one group and then categorize them into the three feature categories above. Features in categories (1) and (2) are ranked by their differences in document frequency, and features in category (3) are ranked by their difference indices, as defined in the next few paragraphs. We use two heuristic constraints to select a "good" candidate discriminative feature:

•Constraint (1): The feature should be common within at least one of the groups.
•Constraint (2): The feature should be contrastive across the groups.

We use the normalized document frequency (DF), the fraction of documents containing the feature, to measure the feature's commonality within a group. Denote DF_1 and DF_2 as the normalized document frequencies in G1 and G2, respectively, and let T=0.3 be a DF threshold. A feature falls into

•category (1) if (DF_1 - DF_2) >= T;
•category (2) if (DF_2 - DF_1) >= T;
•category (3) if |DF_1 - DF_2| < T and (DF_1 >= T) and (DF_2 >= T).
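The categorization rules above can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation; the representation of a document as a set of feature strings and the `normalized_df` helper are assumptions, while the threshold T = 0.3 and the three rules follow the text.

```python
T = 0.3  # normalized document-frequency threshold from the text

def normalized_df(feature, docs):
    """Fraction of documents in `docs` that contain `feature`.
    Each document is assumed to be a set of feature strings."""
    return sum(1 for d in docs if feature in d) / len(docs)

def categorize(feature, g1_docs, g2_docs, t=T):
    """Return 1, 2, 3, or None (irrelevant) per the rules above."""
    df1 = normalized_df(feature, g1_docs)
    df2 = normalized_df(feature, g2_docs)
    if df1 - df2 >= t:
        return 1  # common in G1 but not in G2
    if df2 - df1 >= t:
        return 2  # common in G2 but not in G1
    if abs(df1 - df2) < t and df1 >= t and df2 >= t:
        return 3  # common in both groups, with comparable DFs
    return None   # fails constraint (1) or (2): irrelevant
```

Note that a feature can fail all three tests (e.g., DF_1 = 0.2 and DF_2 = 0.1), which corresponds to the "irrelevant" case in the text.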
Intuitively, the features in category (3) are those that are sufficiently popular in both G1 and G2 and have comparable DFs in the two groups. After categorizing the features, we define a Difference Index (DI) to measure a feature's discriminating power and rank the features in category (3) by DI. We define TF as the number of occurrences of a feature in a document. Let m_1 and m_2 be the mean TF values of a feature in G1 and G2, respectively. We define DI as

DI = sig(m_1 - m_2) * max(m_1, m_2) / min(m_1, m_2),

where sig(m_1 - m_2) is 1 if m_1 > m_2 and -1 otherwise. A positive DI means the feature is more heavily used in G1, and a negative DI means it is more popular in G2. The larger |DI| is, the bigger the difference.

Corpus Construction

We propose the following fairness constraints for corpus construction: 1) the subjects in each group should have similar English proficiency; 2) the text genre and the topic should not interfere with the language usage analysis. We collect two datasets that satisfy these constraints. The first has 40 selected electronic theses and dissertations (ETD) from the ETD database at Virginia Tech, of which 20 are from Chinese students and 20 from American students, all from computer science and electrical engineering departments to avoid genre interference. The second has 40 selected research articles downloaded from Microsoft Research (MSR), of which 20 are contributed by Chinese researchers in Beijing, China and 20 by their British colleagues in Cambridge, UK. The documents in the ETD collection are long and each is strictly attributed to one author, while the documents in the MSR collection are much shorter and many are co-authored. We restrict the co-authors to the same language background. The biographies in the theses and the resumes on the MSR website help us identify the authors' first and second languages.

Feature Extraction

We extracted plain text from the original pdf and ps files.
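The DI formula above can be sketched directly. This is a minimal illustration under one stated assumption: both mean TF values are positive, which is reasonable for category-(3) features since they are common in both groups (the per-group TF lists in the example are hypothetical).

```python
def difference_index(tf_g1, tf_g2):
    """DI = sig(m_1 - m_2) * max(m_1, m_2) / min(m_1, m_2),
    where m_1, m_2 are the mean TF of the feature in each group.
    Assumes both means are positive (category-(3) features are
    common in both groups, so this holds in practice)."""
    m1 = sum(tf_g1) / len(tf_g1)
    m2 = sum(tf_g2) / len(tf_g2)
    sig = 1 if m1 > m2 else -1
    return sig * max(m1, m2) / min(m1, m2)
```

For example, if a feature occurs on average 3 times per document in G1 and once per document in G2, DI = 3.0; swapping the groups yields DI = -3.0, matching the sign convention in the text.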
We limit our search for features to the lexical level because syntactic parsing does not perform well on such a technical writing corpus, due to formulas, tables, figure captions, etc. To avoid topic interference, we choose a feature set popular in computational stylistic analysis (Koppel et al.), the n-gram (we set 0