So what’s this massive file of Haitian names I keep going on about?
In July 2015, the Haitian ministry of education–the Ministère de l’Éducation Nationale et de la Formation Professionnelle, or MENFP for short–issued a set of documents listing all the accredited teachers in Haiti at public and private schools, divided by region. The list gives the teachers’ names, their schools, their sex, and because privacy apparently isn’t a thing, their birthdates. It’s the holy grail of Haitian name research: a source that gives people’s gender–have you ever tried to assign sex to Haitian names? No, you haven’t, because you’re still sane enough to sit still and read this–AND gives their birthdates, AND represents a wide range of ages, AND theoretically represents a complete data set, AND is large enough to draw conclusions from.
How large is it? Oh, my sisters and brothers in scholarship, it’s huge. It’s vast. It’s bigger than many cities. Over 67,000 unique individuals, every one a delectable morsel of verifiable information. Studying name frequencies makes one a size queen, and I am the happiest queen on the planet right now. There’s nothing so delightful as opening a spreadsheet so huge that Excel starts doing Lamaze exercises in the background.
It’s big… and it’s dirty. How dirty? Filthy dirty. There are last names in the personal name column, personal names in the last name column, compound names leaving fragments everywhere. Either teachers can work at more than one school or the list was compiled from an old list that wasn’t updated correctly, because the sheets are sprinkled with duplicate entries. The duplicates reveal another problem, which is that the compilers had a fine disregard for detail. Was someone born on 11/3/58 or 3/11/58? Or were they born on 8/11/88? 3, 5, 8, months, days, these distinctions are so subtle. Who’s to tell? Is someone named Jean Marie or Marie Jean? Eh, either will work. Shall I type with my fingers today, or with my face? Sainit Jen Dufrqesne is a lovely change from standard spelling, don’t you think?
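For the spreadsheet-inclined, here is a rough sketch of how the two tamest problems above could be caught by machine: birthdates with the day and month flipped, and names with the two halves flipped. This is not the method I actually used, just an illustration. The function names are made up, and it assumes dates arrive as plain text in the x/y/yy shape shown above.

```python
def parse_parts(date_text):
    """Split an 'x/y/yy' string into three integers, without guessing
    which of the first two numbers is the day and which is the month."""
    first, second, year = (int(part) for part in date_text.split("/"))
    return first, second, year


def is_day_month_swap(date_a, date_b):
    """True when two dates share a year and differ only by having the
    first two numbers transposed (11/3/58 versus 3/11/58)."""
    a_first, a_second, a_year = parse_parts(date_a)
    b_first, b_second, b_year = parse_parts(date_b)
    return (
        a_year == b_year
        and a_first == b_second
        and a_second == b_first
        and a_first != a_second  # identical dates are not a swap
    )


def name_key(personal, family):
    """Order-insensitive key (hypothetical helper), so that
    'Jean Marie' and 'Marie Jean' land in the same bucket."""
    tokens = (personal + " " + family).lower().split()
    return tuple(sorted(tokens))
```

Two rows with the same `name_key` and a swapped birthdate are probably the same teacher entered twice. Nothing here will rescue Sainit Jen Dufrqesne; typing with one's face calls for fuzzy matching, which is a bigger job.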
And then the gender. Augh, the gender. You have two options: M and F. The distinction, in a traditional culture, is not subtle. The letters on the keyboard are far apart. This should be easy. And yet there are a whole lot of men named Marie Roseline on the list.
There’s enough good data that I have a rough baseline for finding the bad data, but it’s taken me at least 20 hours to strip the data down to the point where it’s no longer telling me Paul is a common female name. And that’s just the female data. The male data is going to be even worse.
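The baseline idea can be sketched in a few lines: count how often each name shows up under each sex, and flag the rows that disagree with a lopsided majority. This is a minimal illustration, not my actual workflow; the function name and both thresholds (`min_count`, `min_share`) are arbitrary assumptions picked for the example.

```python
from collections import Counter, defaultdict


def flag_suspect_genders(records, min_count=5, min_share=0.9):
    """Return the (name, sex) rows that contradict a strong majority.

    records is a list of (personal_name, sex) pairs with sex 'M' or 'F'.
    A name is only judged when it appears at least min_count times and
    one sex accounts for at least min_share of those appearances.
    Both thresholds are illustrative guesses, not tuned values.
    """
    counts = defaultdict(Counter)
    for name, sex in records:
        counts[name][sex] += 1

    suspects = []
    for name, sex in records:
        tally = counts[name]
        total = sum(tally.values())
        majority_sex, majority_n = tally.most_common(1)[0]
        lopsided = total >= min_count and majority_n / total >= min_share
        if lopsided and sex != majority_sex:
            suspects.append((name, sex))
    return suspects
```

Fed nineteen women and one man named Marie Roseline, this hands back the one man for a closer look. It only flags rows; deciding whether the row is a typo or a genuinely unusual name is still a job for a human.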
It’s agony. But it’s sweet, sweet agony.