Diacriticized Characters from Library of Congress Bibliographic Records

Here is a data file containing information about diacriticized character combinations found in two Library of Congress bibliographic record files, MUMS Books and JACKPHY.

The file contains 1156 records, one for each diacriticized character combination found in either file. Each record contains seven columns, separated by the semicolon (;) character. The columns are as follows:

1. Fully decomposed Unicode equivalent of the combination (characters in 4-digit hexadecimal separated by plus signs). This is the sort key.
2. Fully composed Unicode character equivalent (in 4-digit hexadecimal), or null string if there is no such equivalent.
3. The original ALA EBCDIC encoding of the characters (in 2-digit hexadecimal concatenated).
4. USMARC extended ASCII encoding of the characters (in 2-digit hexadecimal concatenated).
5. Number of occurrences of the character combination in the MUMS Books bibliographic database.
6. Number of occurrences of the character combination in the JACKPHY bibliographic database.
7. Name of the character combination. This is based on the Unicode 2.0 name with the words "LATIN" and "LETTER" omitted for brevity. Combinations that do not have Unicode equivalents are named by analogy to the Unicode 2.0 naming system.
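
The seven-column layout described above can be read with a few lines of Python. This is only a sketch: the names DiacriticRecord and parse_record are hypothetical, the counts are assumed to be plain decimal digits, and the sample line in the usage note is invented for illustration and is not copied from the data file.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DiacriticRecord:
    """One semicolon-separated record (hypothetical container, not part of the data file)."""
    decomposed: List[int]      # column 1: code points of the decomposed form
    composed: Optional[int]    # column 2: composed code point, or None if the field is empty
    ala_ebcdic: bytes          # column 3: ALA EBCDIC bytes
    usmarc: bytes              # column 4: USMARC extended ASCII bytes
    mums_count: int            # column 5: occurrences in MUMS Books
    jackphy_count: int         # column 6: occurrences in JACKPHY
    name: str                  # column 7: name of the combination


def parse_record(line: str) -> DiacriticRecord:
    """Split one line on ';' and convert each column to a Python value."""
    # maxsplit=6 keeps the name intact even if it were to contain a semicolon.
    fields = [f.strip() for f in line.rstrip("\r\n").split(";", 6)]
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}: {line!r}")
    dec, comp, ebcdic, marc, mums, jackphy, name = fields
    return DiacriticRecord(
        decomposed=[int(h, 16) for h in dec.split("+")],
        composed=int(comp, 16) if comp else None,  # null string means no composed form
        ala_ebcdic=bytes.fromhex(ebcdic),
        usmarc=bytes.fromhex(marc),
        mums_count=int(mums),      # assumption: plain digits, no thousands separators
        jackphy_count=int(jackphy),
        name=name,
    )
```

For example, parse_record("0061+0301;00E1;4A61;E261;123;45;SMALL A WITH ACUTE") yields a record whose decomposed field is [0x61, 0x301] and whose composed field is 0xE1. The encoded bytes and counts in that line are made-up placeholders.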

WARNING: An earlier version of this page claimed that the JACKPHY frequency appeared in column 5, and the MUMS Books frequency appeared in column 6. That was incorrect.

The MUMS Books database contains 3,279,507 records and 1,353,406,304 characters. There are 9,948,061 characters with diacritics, of which 57,255 exhibit more than one diacritic.

The JACKPHY database contains 335,589 records and 139,542,423 characters. There are 2,473,467 characters with diacritics, of which 415 exhibit more than one diacritic.
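
To compare the two databases, the totals quoted above can be turned into proportions. The short Python sketch below uses only the figures given in this page; the dictionary layout and the function names are hypothetical.

```python
# Totals as quoted in the text above.
STATS = {
    "MUMS Books": {
        "records": 3_279_507,
        "chars": 1_353_406_304,
        "diacritic": 9_948_061,
        "multi": 57_255,
    },
    "JACKPHY": {
        "records": 335_589,
        "chars": 139_542_423,
        "diacritic": 2_473_467,
        "multi": 415,
    },
}


def diacritic_share(s: dict) -> float:
    """Fraction of all characters that carry at least one diacritic."""
    return s["diacritic"] / s["chars"]


def multi_share(s: dict) -> float:
    """Fraction of diacriticized characters that carry more than one diacritic."""
    return s["multi"] / s["diacritic"]


if __name__ == "__main__":
    for db, s in STATS.items():
        print(f"{db}: {diacritic_share(s):.2%} of characters have diacritics; "
              f"{multi_share(s):.3%} of those have more than one")
```

By these figures, characters with diacritics make up roughly 0.7% of MUMS Books and roughly 1.8% of JACKPHY.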

The data should not be accepted uncritically, as transcription errors are almost certainly present: for example, the 16 occurrences of LATIN SMALL LETTER A WITH ACUTE AND ACUTE most likely represent a single double-acute mark rather than two vertically stacked acute accents.
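
One way to find candidates for this kind of error is to look for column-1 values in which the same combining mark appears twice. The Python sketch below assumes that the first code point in the decomposed form is the base letter and the rest are combining marks, and that A WITH ACUTE AND ACUTE is recorded as 0061+0301+0301; both are assumptions about the data, and repeated_marks is a hypothetical helper.

```python
from typing import List


def repeated_marks(decomposed: str) -> List[str]:
    """Return the combining marks that occur more than once in a column-1 value.

    Assumption: the first code point is the base letter, the rest are marks.
    """
    marks = [m.strip().upper() for m in decomposed.split("+")[1:]]
    seen = set()
    repeated = []
    for mark in marks:
        if mark in seen and mark not in repeated:
            repeated.append(mark)
        seen.add(mark)
    return repeated
```

With these assumptions, repeated_marks("0061+0301+0301") reports 0301 (COMBINING ACUTE ACCENT) as doubled, which marks the record as one to check against U+030B COMBINING DOUBLE ACUTE ACCENT. A flagged record is only a candidate; some doubled marks may be legitimate.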

The data were provided courtesy of James Agenbroad of the Library of Congress, and were typed in and massaged by John Cowan <cowan@ccil.org>.