KUCERA (Kucera-Francis Word-frequency Count)

Notes provided by Roger Mitton,
Dept of Computer Science, Birkbeck College,
Malet Street, London WC1E 7HX
November 1984

KUCERA contains over 50,000 entries from the Kucera-Francis Frequency Count of items in the corpus of text collected at Brown University (commonly referred to as the Brown Corpus). Details of the corpus are given in 'Computational Analysis of Present-day American English' by Henry Kucera and W. Nelson Francis, Brown University Press, 1967, and also in 'Frequency Analysis of English Usage: Lexicon and Grammar' by the same authors, published by Houghton Mifflin, 1982.

The following is from the latter book:

'The corpus consists of approximately 1,014,000 graphic words of running text, all of which was first printed in the United States in the year 1961. The text is divided into five hundred samples of about two thousand words each, which are assigned to fifteen categories or genres.'

[The categories are types of text such as newspaper editorials, learned journals, detective fiction and so on.]

They give this explanation of 'graphic words':

'Graphic words are sorted and counted simply on the basis of their graphic form without regard to their lexical status. This means that homographs are not distinguished but are counted together. The frequency figure given for 'will', for example, includes the totals for the modal auxiliary (future-tense marker), the lexical verb meaning 'to wish, desire and bequeath', and the noun designating 'purpose, desire', or 'a document expressing intent to bequeath'. [Conversely,] all graphic words are treated as separate types, so that variant forms, whether inflectional variants or variant spellings, are counted and listed separately'.

Each line of the file contains one entry of 57 characters.
Columns 1-5 give the frequency with which the word occurs in the corpus; 7-8 give the number of categories the word occurred in; 10-12 give the number of samples it occurred in, and columns 14 onwards give the word itself.

As can be seen from the extracts below, some items that qualify as 'graphic words' are not what one would normally consider as words, such as '.09' and '$1,000'. These occur mostly at the beginning and end of the file.

It appears that the second tape block of the file (see the first 50 lines of the file in the appendix) has been corrupted.

Additional note:

The Brown Corpus from which this count was obtained has been the subject of much further work and revision since this file was prepared. In particular, a syntactically analysed form has been prepared. Up to date details are available from the Text Archive, from ICAME (Oslo) - and of course from Kucera himself at Brown University.


First 50 lines, then 10 lines every 4000, then the last 50.

Line 1
    1 01 001 .0044**K
    1 01 001 .01
    1 01 001 .020
    2 01 001 .027
    1 01 001 .028
    1 01 001 .05
    1 01 001 .05**K
    3 01 001 .07
    1 01 001 .076
    1 01 001 .09
    1 01 001 .1
    1 01 001 .130
    1 01 001 .143
    1 01 001 .179
   12 02 002 .22
    3 03 003 .22-CALIBER
    1 01 001 .222'S
    1 01 001 .243
    1 01 001 .255
    2 01 001 .264
^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ -^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^^ ^ ^ ^ ^ ^ ^ ^ ^ ^^ ^ ^ ^ ^^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ` ^ ^ ^ '- ^ ^ ^ ^ ^
    4 01 002 .45
    1 01 001 .45-CALIBER
    1 01 001 .455
    2 01 001 .458
    2 02 002 .5
    1 01 001 .50
    1 01 001 .500
    1 01 001 .7
    1 01 001 .75
    1 01 001 .7854

Line 4001
    2 01 001 AZALEA
    3 02 002 AZALEAS
    1 01 001 AZERBAIJAN
    1 01 001 AZUSA
  117 09 064 B
    1 01 001 B+
    1 01 001 B**.+*O
    2 02 002 B**.*A
    1 01 001 B**.*B**.*C
    1 01 001 B**.*B**.*C**.'S

Line 8001
    3 01 002 CHATHAM
    3 03 003 CHATTANOOGA
    1 01 001 CHATTE
    2 02 002 CHATTED
    1 01 001 CHATTELS
    7 05 006 CHATTER
    3 03 003 CHATTERED
    6 05 006 CHATTERING
    2 02 002 CHATTING
    1 01 001 CHATTY

Line 12001
    4 03 003 DENMARK
    3 03 003 DENNIS
    2 01 001 DENNY
    1 01 001 DENNY'S
    1 01 001 DENOMINATED
    8 04 005 DENOMINATION
    2 01 001 DENOMINATION'S
    9 04 005 DENOMINATIONAL
    1 01 001 DENOMINATIONALLY
   15 03 005 DENOMINATIONS

Line 16001
    1 01 001 FAR-OUT
    1 01 001 FAR-RANGING
    4 03 004 FAR-REACHING
    1 01 001 FAR-SIGHTED
    3 02 003 FARCE
    1 01 001 FARCES
    1 01 001 FARDULLI'S
    7 04 007 FARE
    3 01 003 FARES
   14 10 014 FAREWELL

Line 20001
 2439 15 408 HAS
    1 01 001 HASH
    1 01 001 HASHER
    2 01 001 HASKELL
    1 01 001 HASKINS
   20 11 014 HASN'T
    1 01 001 HASPS
    2 01 001 HASSELTINE
    1 01 001 HAST
    9 06 007 HASTE

Line 24001
   23 07 014 KILLING
    4 01 001 KILLINGSWORTH
   19 01 001 KILLPATH
    6 01 001 KILLPATH'S
    8 07 007 KILLS
    8 01 001 KILOMETER
    3 02 003 KILOMETERS
    1 01 001 KILOTON
    1 01 001 KILOWATT
    4 01 001 KILOWATT-HOUR

Line 28001
    3 01 001 MONIC
    2 02 002 MONICA
    1 01 001 MONIES
    1 01 001 MONILIA
    3 02 003 MONITOR
    1 01 001 MONITORED
   13 03 003 MONITORING
    2 01 001 MONITORS
    1 01 001 MONIUSZKO'S
   16 05 007 MONK

Line 32001
    1 01 001 PETITE
   15 05 009 PETITION
    4 03 004 PETITIONED
   29 02 002 PETITIONER
    2 01 001 PETITIONER'S
    8 02 004 PETITIONS
    2 02 002 PETITS
    1 01 001 PETRARCHAN
    3 01 001 PETRIE
    2 02 002 PETRIFIED

Line 36001
    8 05 007 REQUESTING
   14 07 013 REQUESTS
   86 10 066 REQUIRE
  182 12 107 REQUIRED
   27 07 022 REQUIREMENT
   83 09 051 REQUIREMENTS
   57 11 051 REQUIRES
   16 06 014 REQUIRING
    1 01 001 REQUISITES
    1 01 001 REQUISITION

Line 40001
    1 01 001 SOCIALLY-ORIENTED
    4 02 002 SOCIETAL
    1 01 001 SOCIETE
   41 08 023 SOCIETIES
  237 14 101 SOCIETY
    3 02 003 SOCIETY'S
    1 01 001 SOCINIANISM
    1 01 001 SOCIO-ARCHAEOLOGICAL
    3 02 002 SOCIO-ECONOMIC
    1 01 001 SOCIO-POLITICAL

Line 44001
  832 15 336 TOO
    1 01 001 TOO-EXPENSIVE
    1 01 001 TOO-HEARTY
    2 02 002 TOO-LARGE
    1 01 001 TOO-NAKED
    1 01 001 TOO-SHINY
    1 01 001 TOO-SIMPLE-TO-BE-TRUE
    2 01 001 TOOBIN
    2 01 001 TOODLE
  426 15 227 TOOK

Line 48001
    1 01 001 WORSENED
    1 01 001 WORSENS
   36 12 024 WORSHIP
    1 01 001 WORSHIPED
    2 02 002 WORSHIPFUL
    3 01 002 WORSHIPING
    2 02 002 WORSHIPPED
    1 01 001 WORSHIPPERS
    1 01 001 WORSHIPPING
   34 11 030 WORST

Line 50357
    1 01 001 9**C40
    1 01 001 9**C47
    1 01 001 9**J*N
    1 01 001 9**JA
    1 01 001 9**JB
    1 01 001 9**JE
    1 01 001 9-1/2
    1 01 001 9-11
    1 01 001 9-6
    1 01 001 9-7
    1 01 001 9/32
    1 01 001 9,273
    1 01 001 9,748,000
    1 01 001 9,910,741
    7 01 002 9TH
   12 08 012 90
    3 01 002 90*+0
    1 01 001 90*+0*F
    4 04 004 90**K
    1 01 001 90-DAY
    2 01 001 90-DEGREE
    1 01 001 90,000
    1 01 001 90S
    3 03 003 900
    1 01 001 900-CALORIE
    1 01 001 900-STUDENT
    1 01 001 900,000
    1 01 001 91
    3 03 003 92
    1 01 001 92.5
    1 01 001 920
    1 01 001 923,076
    1 01 001 9230
    3 02 003 93
    1 01 001 9329
    1 01 001 940*Y
    1 01 001 943
    1 01 001 944
    1 01 001 949
    6 04 006 95
    1 01 001 950
    1 01 001 954
    5 03 003 96
    1 01 001 960-**J*MC
    1 01 001 963
    4 03 003 97
    3 03 003 98
    1 01 001 989
    8 03 003 99
    1 01 001 99.1
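

Reading the file: a sketch

The column layout given in these notes (frequency in columns 1-5, categories in 7-8, samples in 10-12, word from column 14 onwards) can be read with simple string slicing. The following Python is a minimal sketch only and is not part of the original distribution. The names Entry, parse_entry and read_entries are hypothetical, and it assumes that the numeric fields are padded with spaces and that each word is padded with trailing spaces up to the 57-character record length. Records that do not fit the layout, such as those in the corrupted second tape block, are skipped rather than repaired.

```python
from typing import Iterable, Iterator, NamedTuple


class Entry(NamedTuple):
    """One KUCERA record (hypothetical container, not from the original notes)."""
    frequency: int    # occurrences of the graphic word in the corpus
    categories: int   # number of categories (genres) it occurred in
    samples: int      # number of samples it occurred in
    word: str         # the graphic word itself


def parse_entry(line: str) -> Entry:
    """Parse one fixed-width record using the column positions in the notes."""
    line = line.rstrip("\r\n")
    return Entry(
        frequency=int(line[0:5]),     # columns 1-5 (int() ignores padding spaces)
        categories=int(line[6:8]),    # columns 7-8
        samples=int(line[9:12]),      # columns 10-12
        word=line[13:].rstrip(),      # columns 14 onwards, assumed space-padded
    )


def read_entries(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield an Entry for each well-formed line; skip lines that do not parse."""
    for line in lines:
        try:
            yield parse_entry(line)
        except ValueError:
            # A numeric field was empty or non-numeric (e.g. corrupted block).
            continue
```

For example, passing an open file object for the KUCERA file to read_entries would yield one Entry per intact line. Because homographs are counted together, the frequency for a form such as WILL covers every use of that spelling.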