We are thrilled to announce the Lancaster-Northern Arizona Corpus (LANA), a new corpus of spoken and written American English that will be freely available for non-commercial research upon completion. We hope that this corpus will enable linguists to carry out descriptive studies of spoken and written American English and to improve language teaching, learning, and assessment, and that it will aid a wide range of other applications. We anticipate that, like its British English counterpart, the BNC 2014, LANA will be used to research various phenomena, such as word and phrase frequency, communicative purpose, and social justice issues.
The corpus will be as representative of natural spoken and written language in the US as possible. This objective of course comes with inherent challenges, such as how to address the question “How do we actually go about trying to represent language in the US?” The country comprises millions of individuals with various language and educational backgrounds, occupations, races/ethnicities, and living situations, who live in different regions and use language in specific and varied contexts. The team, which includes researchers from Lancaster University (Tony McEnery, Vaclav Brezina, Paul Baker, Gavin Brookes, and Isobelle Clarke) and Northern Arizona University (Jesse Egbert, Tove Larsson, Elizabeth Hanks, Doug Biber, and Randi Reppen), is hard at work taking steps toward overcoming these and other challenges. We also have the generous support of Lancaster University’s Global Advancement Fund, which is instrumental in recruiting participants from various populations across the United States.
The NAU team (with a great deal of help from our colleagues at Lancaster) is spearheading the spoken component of the corpus, the Lancaster-Northern Arizona Corpus of American Spoken English (LANA-CASE). We will collect recordings of unplanned spoken discourse from a variety of conversational contexts. Careful recruitment and sampling are integral to building a corpus that effectively represents the United States in terms of age, race/ethnicity, geographic region, and setting (i.e., urban, suburban, or rural). This wouldn’t be possible without the support of our CL community and friends, and here is where you come in: once we are ready to start data collection, we would love to get help recruiting participants who would be open to devoting some time to the project. You can find more information on how to get involved here.
Your help in recruiting participants is vital to compiling a spoken corpus that is highly representative of the U.S. population. It goes without saying that the findings from all future research using this corpus will only be as generalizable as the corpus is representative. In other words, a corpus that only marginally reflects the United States population will yield only marginally generalizable results, while a corpus that effectively reflects the U.S. population will yield largely generalizable results. This is why your support is invaluable. By working together to recruit participants from the diverse populations listed above, we can compile a highly representative corpus that will be used as a resource for linguistics research for many years to come.
We would love for you to follow along with the project and help out to the extent you are able. Feel free to visit our website, follow us on Twitter (@LANA_corpus), and/or sign up to receive project updates. Those who help to coordinate contributions will have our undying gratitude, in addition to early access to LANA! To echo Tony McEnery’s sentiment when announcing LANA at the AACL conference on September 10, “[We] really want to see what colleagues [we] respect and admire can make of what promises to be really fascinating data!”