Author: John Cowan
Status: Informational
Last Edited: 2009-04-22
Thesis: Structured elements should not be used to represent the components of personal names or telephone numbers, because those components are too variable in nature and too easy to get wrong when parsing, and the consequences of getting them wrong are disastrous. People wind up being (perhaps rudely) designated by the wrong parts of their names, cannot be found in alphabetical lists where others expect to find them, and have phone numbers that are confusing or hard to dial.
Some names are simple: a first name (personal name) and a last name (surname, family name), like "John Jones". We Americans address people as "John" if we know them well, "John Jones" or "Mr. Jones" if we don't. In some contexts, like the military or certain kinds of jobs, we use just "Jones". Some people have different titles (Ms., Dr., The Most Reverend), and some people have generation numbers (Jr., Sr., II, XXVI).
Unfortunately, this doesn't begin to cover the full complexity of the situation. Some people have a middle name which functions as a second personal name, as in John Jacob Astor. Contrariwise, some people have a middle name which functions as part of the surname, like David Lloyd George -- but you can't tell this without inside information, as "Lloyd" is most often a personal name. Some people have two middle names, or a single-letter name which is not an abbreviation for anything (Harry S Truman). Some people have multiple names with and without hyphenations: Leone Sextus Denys Oswolf Fraudatifilius Tollemache-Tollemache de Orellana Plantagenet Tollemache-Tollemache, a real person (his surname is the second instance of "Tollemache-Tollemache" and all the rest are personal names).
Some family names begin with old nobility or origin labels: in Germany, "von" traditionally marked the nobility, whereas "Von" is part of the surname, meaning that the family is "from" a certain place. Similar are Dutch "van" or "Van", "ter", and "van der"; likewise French "de", "du", "de l'", and "de la". Multi-syllabic French surnames with "de" keep it when the name is given in full, like "Alexis de Tocqueville", but drop it when just the surname is used: "Tocqueville". Americans care strongly about whether their surname is spelled "de Camp", "De Camp", "DeCamp", or "Decamp" (the French original was "de Camp", but things often changed after Ellis Island).
Nor can you always tell by looking which part is the family name. The great Hungarian physicist Szilárd Leó moved to the United States and became the great American physicist Leo Szilard, because Hungarians put the surname first. The same is true of Chinese names, where the first character is always the surname (almost: a tiny number of families have multi-character surnames): 毛泽东 (Mao Zedong) had the surname 毛 (Mao). Likewise in most other countries of East Asia, but in Vietnam, there are so few surnames that Nguyễn Cao Kỳ is called Kỳ in formal contexts, as Nguyễn would be excessively ambiguous -- about 40% of Vietnamese bear that surname. When East Asians move outside the area, though, they often invert their names: the cellist 马友友, or Mǎ Yǒuyǒu in romanization, is called Yo-Yo Ma in the West. As an exception to the exception, some Chinese and Japanese emigrants keep their name order in romanization -- it's unpredictable.
After the rest of the name come various suffixes indicating academic titles and membership in various organizations (Jesuits put "S.J." after their names). There are hundreds of these, and they can be written with or without periods in various combinations of upper and lower case; there can be more than one, usually separated by commas.
Even in the family-name-last parts of Europe, things are not simple: Spanish names (both in Spain and in other Spanish-speaking countries) are of the form Federico García Lorca, where García is his surname and Lorca is the surname of his mother; sometimes this second surname is dropped when its bearers move to anglophone countries. In French the convention is to write all surnames in CAPITALS, so we have Catherine DENEUVE, John JONES, SZILÁRD Leó, MAO Zedong. (If only everyone did this!) Icelanders don't have surnames, only patronymics (the father's first name plus -sson or -sdottir): consequently, their phone directory is sorted in first name order, as there are more first names in use than patronymics, and they address each other solely by first names. Russians have personal names, patronymics, and family names, and use them in varying combinations to express different social relationships: first name plus patronymic is the rough equivalent of title plus family name in the United States. In Scandinavia, most people have two personal names and often use both of them.
Some cultures don't even have the personal name vs. family name distinction. In Indonesia, almost everyone has only one name; however, former President Megawati uses the patronymic Sukarnoputri (not a family name), to emphasize that she is the daughter of the founding President, Sukarno. The mathematician Ramanujan bore a typical South Indian name: in full it was S. (for Srinivasa, his father's name) Ramanujan Iyengar (a clan name), but "Ramanujan" was his effective name both informally and for most formal purposes. Arabic names can be very complex, with a personal name, a chain of zero or more patronymics, and sometimes a surname, and are furthermore extremely subject to variations in romanization, as there is no generally accepted standard for romanizing Arabic. The titled members of the British royal family don't actually have family names (though their untitled relatives use the surname "Mountbatten-Windsor"): the current Prince of Wales's name is simply "Charles Philip Arthur George".
And then there's the Hawaiian name "Daniel Keiki Oʻkalani kala hoʻo lewa Kamanaʻo" and the Arabic name "Abu Karim Muhammad al-Jamil ibn Nidal ibn Abdulaziz al-Filistini". You get to figure out which parts are what.
In short, it would be nice if we could identify the parts of people's names that are relevant in (a) sorting, (b) formal use, (c) informal use; but we cannot, not with any generality. Even if the parts are already dissected for us by a form, the chances are good that the form is too rigid to handle all possible cases. To make things worse, there are already a lot of name parsing systems out there, and they are not even vaguely consistent with one another; the chance is very small that your parser will work the same in all cases as other well-established parsers.
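To make the point concrete, here is a minimal sketch of the kind of parser that gets written all the time. The function `naive_split` and the `EXAMPLES` list are hypothetical, made up for illustration; the "correct" surnames are simply the ones stated in the paragraphs above.

```python
def naive_split(full_name: str) -> tuple[str, str]:
    """Hypothetical naive parser: the last space-separated token is taken
    to be the surname, and everything before it the personal name."""
    given, _, family = full_name.rpartition(" ")
    return given, family


# (name as written, surname according to the text above)
EXAMPLES = [
    ("John Jones", "Jones"),
    ("David Lloyd George", "Lloyd George"),
    ("Szilárd Leó", "Szilárd"),
    ("Mao Zedong", "Mao"),
    ("Federico García Lorca", "García"),
]


def misparsed(examples):
    """Return the names whose guessed surname disagrees with the text."""
    return [name for name, surname in examples
            if naive_split(name)[1] != surname]


if __name__ == "__main__":
    for name in misparsed(EXAMPLES):
        guess = naive_split(name)[1]
        print(f"wrong surname for {name!r}: guessed {guess!r}")
```

Only "John Jones" survives; the other four are filed under the wrong word. Adding special cases for each of them just moves the errors somewhere else.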
I also recommend a post by James Clark (about Thai names, but with comments on many other cultures). For more details on Russian names, see this blog comment I wrote.
For telephone numbers there is more of a standard: "+" followed by the country code followed by the in-country phone number. The "+" is supposed to be replaced by the code for international dialing (011 in North America, usually 00 elsewhere). Country codes are variable-length (1 digit to 3 digits) but unambiguously so: 1 is the country code for North America, so there are no country codes like 10 or 11. The total code has a maximum of 15 digits excluding the "+" sign; spaces (but not punctuation marks) may be used as separators.
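Because no country code is a prefix of another, a number in "+" form can be split by a simple table lookup. The following is only a sketch: the names `SAMPLE_COUNTRY_CODES`, `split_country_code`, and `dial_string` are hypothetical, the table is a small illustrative subset rather than the full list, and the code assumes country codes are at most three digits long.

```python
# Illustrative subset only; a real table would list every assigned code.
SAMPLE_COUNTRY_CODES = frozenset(
    {"1", "7", "33", "34", "36", "44", "49", "62", "81", "84", "86", "354"}
)


def _digits(text: str) -> str:
    """Keep only ASCII digits, dropping '+' and space separators."""
    return "".join(ch for ch in text if ch in "0123456789")


def split_country_code(number: str, codes=SAMPLE_COUNTRY_CODES) -> tuple[str, str]:
    """Split a '+' form number into (country code, in-country digits)."""
    if not number.startswith("+"):
        raise ValueError("expected a number beginning with '+'")
    digits = _digits(number)
    if len(digits) > 15:
        raise ValueError("more than 15 digits after '+'")
    # Codes are prefix-free, so at most one of these lengths can match.
    for length in (1, 2, 3):
        if digits[:length] in codes:
            return digits[:length], digits[length:]
    raise ValueError("country code not in table")


def dial_string(number: str, international_prefix: str) -> str:
    """Replace '+' with the caller's international dialing code
    (for example '011' in North America, often '00' elsewhere)."""
    if not number.startswith("+"):
        raise ValueError("expected a number beginning with '+'")
    return international_prefix + _digits(number)


if __name__ == "__main__":
    print(split_country_code("+1 212 555 1212"))
    print(dial_string("+44 20 7946 0958", "011"))
```

Note that this splits off the country code and nothing more; it makes no attempt to subdivide the in-country part.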
The hard part is how to format the country-specific part of the number in local use. In North America, there is a rigid pattern of "212-555-1212", where 212 is the area code, 555 is the exchange, and 1212 is the local number. Area codes are mapped to geography, though sometimes there is more than one area code in the same location. (New York City has 212 and 646 for Manhattan, 718 and 347 for the other boroughs, and 917 for pagers and cell phones -- but not all of them -- throughout the city.) Sometimes the written format is "(212) 555-1212", which used to be universal. A few special-purpose numbers are only three digits long: 411 (directory assistance), 911 (emergency).
There are some legacy cases: a few businesses still use the 2-letter plus 5-digit style of local numbers that used to be universal, and some use 7 letters. There are also a few hundred "non-dialable points" remaining, telephones that are so remote from any central office (and often use such obsolete equipment) that they can only be connected manually by an operator. These have names rather than numbers: MARY'S FARM #2, Nevada; DYE J D, Texas.
In other countries, these conventions don't come close to holding. Many countries don't have anything like area codes. Others use area codes in some parts of the country but not others. The number of digits following the area code, if any, varies from country to country, and need not be consistent even within the country -- the U.K. has variable-length phone numbers, longer in urban areas than in rural ones. Punctuation also varies: periods may be used instead of hyphens or spaces in Europe, for example. People mentally segment numbers differently: 125437 may come out "twelve, fifty-four, thirty-seven" rather than "one two five, four three seven". Lastly, people write and use abbreviated forms differently: in the U.K., many numbers can be dialed locally with just a few digits, or with the full form (which begins with "0"); in North America, there is in general only one way to dial any number, and any attempt to use a full form (beginning with "1") where a shortened form is required, or vice versa, signals an error.
The best thing to do, therefore, is simply to accept telephone numbers as they come, with or without punctuation, and not to try to subdivide them into child elements or reformat them as strings (at least not without giving the user a chance to override the new format).
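One way to follow this advice in code is sketched below, as an assumption of mine rather than a prescribed design. The class `PhoneNumber` and its members are hypothetical names. The string is stored and displayed exactly as entered; a digits-only key is derived on the fly purely for comparing two entries and is never written back.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhoneNumber:
    """Hypothetical record: one opaque string, no child elements."""

    as_entered: str  # exactly what the user typed; this is what is shown

    @property
    def match_key(self) -> str:
        """ASCII digits only, for comparison; never displayed or stored."""
        return "".join(ch for ch in self.as_entered if ch in "0123456789")

    def same_digits(self, other: "PhoneNumber") -> bool:
        """True if both entries contain the same non-empty digit sequence."""
        key = self.match_key
        return bool(key) and key == other.match_key


if __name__ == "__main__":
    a = PhoneNumber("212-555-1212")
    b = PhoneNumber("(212) 555-1212")
    print(a.same_digits(b), a.as_entered, b.as_entered)
```

The comparison key is deliberately lossy and will be wrong for lettered numbers and for non-dialable points, which is one more reason to keep it out of the stored value.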
APPENDIX: How to separate a country code from the country-specific phone number
This algorithm is correct as of June 2019, provided you know a country code is present: