
Research into the Structure of Corpora and Its Application

Author ZhangMin
Tutor LiXueNing
School Jiangnan University
Course English Language and Literature
Keywords Corpus; Structure; Continuum Theory; Application
Type Master's thesis
Year 2012
Based on authentic language data, corpus linguistics conducts scientific research with a panoramic view by means of probabilistic analysis. In this sense, corpus linguistics offers a new research paradigm. As the foundation of corpus linguistics, a comprehensive and highly representative corpus is essential for justifying research findings. When it comes to building a corpus, many suggest that a series of construction issues should be taken into account, such as the size of the corpus, the source of the material, and the type of the material. Representativeness, that is, whether the data can fully represent the whole research domain, is accordingly reflected in how reasonable the structure is. Corpus structure mainly refers to the stratification criteria and the relative percentage of each stratum. Starting with an investigation of the structures of well-established Western corpora, this paper explores the underlying rules of structure arrangement in light of continuum theory. Halliday, the founder of Systemic Functional Grammar, views language as a whole as a continuum with spoken and written language situated at either end. In particular, the overlapping parts of the continuum have both written and spoken characteristics, and later evolve into typical spoken or written forms. Dismissing arguments that the spoken form dominates language, or the other way around, continuum theory presents a dialectical account of language. Guided by this theory, this paper finds that the SEU Corpus, Brown Corpus, LOB Corpus and ICE-GB Corpus have fully taken language form into consideration, especially the SEU Corpus. Within the SEU Corpus, all data are divided into written origin, scripted to be spoken, and spoken origin, which reflects the transition from written English to spoken English. In addition, the scripted-to-be-spoken category includes talks, plays, scripted orations and news.
This structure layout matches the overlapping part, or intermediate segment, of the continuum quite well, which indicates a scientific corpus construction. By contrast, the Brown Corpus and LOB Corpus exclude spoken language. Nevertheless, these corpora exemplify how to arrange the structure of a corpus consisting solely of written material. Modeled on the continuum theory diagram, the main stratification criteria and percentages have been drawn and marked on the diagram. The symmetrical graphical representation shows a high degree of consistency, indicating that the analyzed corpora are highly representative. However, language formality is far from the only criterion for dividing structures. Not unexpectedly, the BNC Corpus, LLELC Corpus and MCLC Corpus adopt subject field as the criterion for differentiating structure. Moreover, further study ascertains that the two methods are not isolated from each other. In ICE-GB, both serve as guidance under the learned and popular branches. The above analysis makes it clear that stratification by language formality and by subject field are currently the two most common approaches. This paper is not confined to the above findings. It further discusses structure layout in an attempt to build a Learner Corpus for English Majors that supplies the relevant knowledge. First of all, this study surveys society's actual needs for English. Only in this way can the corpus benefit English students who have to face society after years of learning. The required data on society's demand for English were extracted from the internship logs of 102 students due to graduate in 2006. On close examination, 34 students had chosen careers that have nothing to do with the use of English. Therefore, this part is eliminated from the database.
From the main concerns students express in their logs, the most needed English knowledge and skills can be readily identified. They are mainly classified as foreign trade English, teaching English, English for translation and so on. Each field's percentage is calculated according to the number of students involved in that field. Combining this with the stratification criterion of subject field, this paper finally arrives at the structure configuration by trimming the data further. This paper studies corpus structures from both a theoretical perspective and the practical perspective of building a corpus. However, owing to limited research time and effort, further work is still needed. As corpus linguistics attracts more and more attention among linguistic researchers, this paper is expected to be helpful for corpus construction.
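The percentage calculation described in the abstract (drop the logs unrelated to English, then weight each subject field by the number of students in it) can be illustrated with a minimal Python sketch. The field labels and counts below are hypothetical placeholders chosen only to show the arithmetic; they are not the thesis's data, and the function name `stratum_percentages` is an assumption introduced for this example.

```python
from collections import Counter


def stratum_percentages(fields):
    """Return each field's share, in percent, of the retained logs."""
    counts = Counter(fields)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no logs left after filtering")
    return {field: round(100 * n / total, 1) for field, n in counts.items()}


# Hypothetical toy sample: one label per internship log.
# None marks a log whose career does not involve English.
logs = (
    ["foreign trade"] * 3
    + ["teaching"] * 2
    + ["translation"] * 1
    + [None] * 2
)

# Step 1: eliminate logs unrelated to the use of English.
retained = [field for field in logs if field is not None]

# Step 2: compute each field's percentage from the retained logs.
shares = stratum_percentages(retained)
print(shares)
```

Under this scheme the proportions are always computed over the retained logs only, so the shares of the remaining fields sum to roughly 100 percent (up to rounding).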

