Phonetically Boring Languages

September 11, 2018 at 03:08 | Posted in English, John Nash, Creative, Spanish, Hebrew, Phonetics, Lists | Leave a comment

[This may be my longest-sitting draft that I finally made into a post. I started it in June 2017, shortly after this graphic made the rounds.]

After studying phonetics in my first year at Tel-Aviv University, I developed a pet theory. See, every phenomenon we came across that was "unique" (or rare, or marked) seemed to have somehow skipped over the Hebrew language. No crazy nasals, no retroflexes, no gutturals (in the standard Israeli dialect), no clicks, no ingressives, no voiced alveolar lateral fricative, just the five canonical vowels, et cetera. My thinking was that, since Modern Hebrew is a revived language co-learned by people from very distinct linguistic backgrounds in a relatively messy (high-entropy) distribution, its phonetics evolved toward the lowest common denominator, making it the most phonetically boring language out there.

But where there's data, there's a chance to test out pet theories. So as soon as I got word of the phoible dataset I immediately jumped and put my theory to the test (then waited 15 months for absolutely no reason to actually publish my findings).

Phoible is an open, simple-format data source for the phonetic inventories of languages. A few clicks, and you have a table of all phonemes across all languages. Thus my definition of a language's boringness converged to:

A language is boring if it chooses boring sounds for its inventory.

Now I need to figure out which sounds (phonemes) are boring, but that I just define based on their frequency across languages, which gives me a clear algorithm:

  1. Calculate each phoneme's frequency in the database
  2. For each language, calculate average phoneme frequency from its inventory
  3. Rank languages from high to low average (high = boring)
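The three steps above can be sketched in a few lines of Python. This is only an illustration, not the code behind the numbers in this post: the `inventories` mapping and its toy phoneme sets are made up, and "frequency" is assumed to mean the share of languages whose inventory contains the phoneme.

```python
from collections import Counter

# Hypothetical toy inventories (language -> set of phonemes); not Phoible data.
inventories = {
    "A": {"m", "a", "i", "k"},
    "B": {"m", "a", "i", "k", "t"},
    "C": {"m", "a", "ǃ", "ǂ"},
}

def phoneme_frequencies(inventories):
    """Step 1: share of languages whose inventory contains each phoneme."""
    counts = Counter(p for inv in inventories.values() for p in inv)
    n_languages = len(inventories)
    return {p: c / n_languages for p, c in counts.items()}

def boringness_scores(inventories):
    """Step 2: mean phoneme frequency over each language's inventory."""
    freq = phoneme_frequencies(inventories)
    return {
        lang: sum(freq[p] for p in inv) / len(inv)
        for lang, inv in inventories.items()
    }

def rank_by_boringness(inventories):
    """Step 3: languages sorted from highest score (most boring) to lowest."""
    scores = boringness_scores(inventories)
    return sorted(scores, key=scores.get, reverse=True)

scores = boringness_scores(inventories)
ranking = rank_by_boringness(inventories)
```

In this toy setup the language with the two click phonemes ends up last, since nobody else shares them.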

That's it! For my first finding, Modern Hebrew got a score of 0.415, which ranked it 1,426th most boring in a field of 2,155 languages, an utter refutation of my hypothesis. The most boring language according to this metric is Southern Nuautl with a score of 0.764; the most interesting is !xóõ (yes, that's a click sound it's starting with) with an astounding 0.103. The average score was 0.464, and the scores look roughly normally distributed:

Phonetic Score CDF

(CDF = cumulative distribution function, meaning: y many languages have score up to x)

Here's a taste of some languages I thought could be of interest. Check out that lovely long tail of Igbo, which has many phonemes but a lot of the frequent ones; or how fast Quechua plummets from frequent to semi-frequent to rare phonemes; or how boring Swahili, a high-contact language dominating the 20's of the x-axis, is (it's what I expected the Hebrew situation to be, and even so its score is a very unboring 0.292); or how English falls so quickly in the beginning, with all its weird vowels and labiodentals and taps and flaps.


There are a few possible reasons why I didn't get what I expected:

  1. Like always, data is dirty (or at least, this data, for my analysis needs). In this case, well-documented languages may have more phonemes recorded in the dataset, probably including some rare ones, than languages that have been researched less extensively.
  2. My metric must suck. See how Hebrew has the fewest phonemes in the selected sample? That's gotta account for boringness and yet, with my mean it doesn't. Look how many frequent phonemes Swahili has, and yet its average is very low. Let's consider some other metrics ("Future work". Remind me to upload the data if I don't do so soon):
    1. Number of phonemes (a boring metric for a boringness question)
    2. % of phonemes above a boringness threshold p.
    3. Area under the boringness curve (this is just the sum of boringnesses again – but maybe cut it off at some point?)
    4. Deep neural net trained on all these features with the single data point <Hebrew, TRUE>.
  3. A bug in my code. As soon as I find it I'll see if I can look deeper. I mean it's kind of a miracle that I have the post-processed TSVs around, to be honest.
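For what it's worth, the first three alternative metrics listed under item 2 are easy to sketch. This is an illustration under assumptions: each language is represented by a plain list of its phonemes' cross-language frequencies, the function names are made up, and "cutting off" the area is read here as summing only the `k` most frequent phonemes, which is just one possible interpretation.

```python
def inventory_size(freqs):
    """Metric 1: number of phonemes in the inventory."""
    return len(freqs)

def share_above_threshold(freqs, p):
    """Metric 2: fraction of phonemes whose frequency is above p."""
    return sum(1 for f in freqs if f > p) / len(freqs)

def truncated_area(freqs, k):
    """Metric 3: sum of the k highest frequencies (area under the curve, cut off at k)."""
    return sum(sorted(freqs, reverse=True)[:k])

# Hypothetical frequencies for one language; not taken from the dataset.
toy_freqs = [0.9, 0.8, 0.5, 0.1]

size = inventory_size(toy_freqs)
share = share_above_threshold(toy_freqs, p=0.5)
area = truncated_area(toy_freqs, k=2)
```

Unlike the plain mean, the truncated sum doesn't punish a language for having a long tail of rare phonemes beyond the cutoff.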


Learning to Represent Words by how They’re Spelled

March 19, 2018 at 23:02 | Posted in English, Computational, Spelling | Leave a comment

A fundamental question in Natural Language Processing (NLP) is how to represent words. If we have a paragraph we want to translate, or a product review we want to classify as positive or negative, or a question we want to answer, ultimately the easiest building block to start from is the individual word. The main problem with this approach is that treating each word as just a symbol loses a lot of information. How can we tell from such a representation that the relationship between the symbol PAGE and the symbol PAPER is not the same as that between PAGE and MOON?
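To make the problem concrete, here is a small sketch assuming the simplest symbolic encoding, a one-hot vector per word. The encoding and the three-word vocabulary are assumptions chosen for illustration, not something the original post specifies.

```python
def one_hot(word, vocab):
    """Encode a word as a vector with a single 1 at its vocabulary index."""
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    """Dot product, used here as a crude similarity measure."""
    return sum(a * b for a, b in zip(u, v))

vocab = ["PAGE", "PAPER", "MOON"]
page, paper, moon = (one_hot(w, vocab) for w in vocab)

# Every pair of distinct words gets the same similarity (zero),
# so PAGE looks exactly as unrelated to PAPER as it does to MOON.
page_vs_paper = dot(page, paper)
page_vs_moon = dot(page, moon)
```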

Some popular techniques exist that try to learn an abstract representation which identifies these relationships and preserves them. In essence, what these methods do is go over a huge body of text (a corpus), like the entire English Wikipedia, word by word, and come up…

View original post (518 more words)

Bureaucratic Path to PhD Studies

May 12, 2017 at 20:09 | Posted in English, John Nash, Creative, Administrative | Leave a comment

Here's the path I took, as a non-US-citizen, before moving to the US and starting a CS PhD program. I took the GRE and TOEFL around summer of 2015 and moved in August 2016.

I provide it as a reference for similar-minded folk, but keep in mind times change, circumstances vary, and I may have forgotten crucial steps. In any case, enjoy.

(Created using GraphViz)


Google's Translation Overhaul – Interview on IDF Radio

April 17, 2017 at 17:48 | Posted in 15 Minutes, English, English language, Digital World, Translation | 1 comment

This February, I gave an interview to Ido Kenan on Galei Tzahal (IDF Radio) about Google's upgraded Machine Translation system, including its claims that it learns an intermediary abstract language representation, an "Interlingua".

You can listen to the interview here, or read my writeup here on Kenan's blog. Problem is, it's all in Hebrew! Well, what better than to use the fancy new Google Translate to render the thing into English?

Here it is, untouched. See how much you understand. (Retrieved March 16, 2017)

Continue Reading Google's Translation Overhaul – Interview on IDF Radio…

Turnout, Burnout

November 10, 2016 at 19:35 | Posted in English, Politics | Leave a comment
Since Tuesday's elections, I've been hearing a lot about the alleged irresponsibility of the American voter, not turning out for the elections. At the same time, there has been the usual fuss over the electoral college system and how in some states it's pointless to bother going out to vote.

I have yet to see a piece trying to tie the two together (please correct me if I'm wrong).

My claim is simple: citing the nationwide 56.5% figure as a strong indicator of voter apathy is somewhat misleading. If a Californian doesn't see the point in voting (and registering beforehand), that's different from a Pennsylvania voter feeling the same way (in this election cycle at least, but pretty much usually). It's unfair, but understandable, if there's a (say) 15 point difference in their turnout rates.

Let's look at the numbers then, shall we? On the x axis, we'll place the ultimate victory margin (collected Thursday from Wikipedia) as a proxy for how inclined an average voter was to believe that his vote would be crucial. It's not a perfect proxy of course, as there were some state-level surprises. Maybe poll margins prior to registration deadlines would have been a better one. The y axis will denote the voting turnout (collected from

Before the chart, observational data: of the 11 states with highest voter turnout rate, 10 ended up with a margin under 5%. Of the 10 states with margin under 4%, only one had a turnout of less than 60%. Now you can look at the chart, including a simple linear trend line.


Forgive my dataviz-unsavviness. I wish I knew how to add state labels to each point on gsheets.

As you can see, the results are pretty straightforward. With a not-bad correlation of 0.21, it seems voters chose to turn out based on how close they anticipated the race to be in their state. I didn't leave out the outliers, but they're not shown in this chart (DC is always ridiculous, this time with an 86% victory margin; Hawaii significantly undervoted with a 34% turnout, way under the next lowest, California at 45.5%). It was cool to see Utah as a special case here with its 3-way race – a 19% D-R margin brought significantly fewer people to the polls than in Montana or Washington state, which ended up with about the same margin.
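For anyone who wants to redo this kind of analysis outside a spreadsheet, here is a small sketch of the two computations involved: a Pearson correlation and a least-squares trend line. The `margins` and `turnout` values are made-up placeholders, not the state data used for the chart, so the numbers it produces say nothing about the actual election.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def trend_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

# Hypothetical placeholder data: victory margin (%) and turnout (%) per state.
margins = [1.0, 5.0, 10.0, 20.0, 30.0]
turnout = [68.0, 64.0, 62.0, 58.0, 56.0]

r = pearson(margins, turnout)
slope, intercept = trend_line(margins, turnout)
```

With placeholder data like this, where turnout drops as the margin grows, both the coefficient and the slope come out negative.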

All in all, the voters who mattered in this Presidential election (tough phrasing but that's the way it is) came in at about 65%, much higher than the national average.

It's worth noting that the numbers, even for the swing states, are still low compared to most of the democratic world. But I'll also note that the US has other factors going for it, such as the no-day-off thing, or the huge number of expats allowed to vote, which is a unique characteristic. According to electproject, these compose roughly 2% of the eligible electorate.

Perfect Ambiguity

January 12, 2015 at 13:19 | Posted in English, English language, Academia, Semantics | 1 comment
I just ran into the most perfect case of ambiguity in a signup form to remain of anonymous origin:
Last (Name First)
So, what does the text in parentheses mean?

  1. It's to be parsed as a template: <First> <Last>, meaning the first name should come first;
  2. It's an English-grammatical instruction (as in "first things first"): "Put your first name last", meaning the last name should come first.

I'm going with the first interpretation, but you gotta admit that this is a case where trying to make things clear only makes them confusing.

Series of tubes

December 27, 2013 at 12:53 | Posted in English, English language, Digital World, Politics, Culture | 1 comment

In 2006 (here I go, topical as ever) Senator Ted Stevens made his infamous "series of tubes" gaffe, winning him eternal ridicule.

My pet theory for Stevens's misconception has been that he'd heard of Youtube (the timeframe checks out; this is about the time it started emerging) and mis-analyzed it as "U-tube", a serial letter coding a type of tube. Generalizing, the whole Internet must be a series of the A-tube, B-tube, 1-tube and HL/9-tube, no?

I recently tried searching for some verification of this theory, or at least fellow holders of this opinion. To my surprise, I found none. Stevens himself did not give this as an explanation, but I assume it would have been an embarrassing admission the higher Youtube's popularity soared. When asked, he claimed some Internet mavens actually gave him positive feedback for the tube description (equating tubes with "pipes"). Sadly, we can't ask him anymore.

Are you a fellow U-tube-theorist?

Do you know of any?

Alternatively, can you clearly refute this theory?

The Fastest Dash

August 10, 2012 at 19:35 | Posted in English, John Nash, Sports | Leave a comment
Which Olympic dash is the fastest? It's clear that 400 meters is already "tactical" in strength-saving, so 200 meters is faster than it. The numbers consistently confirm this – while the record for 200 meters has been under 20 seconds for quite a while, the 400 meter record has not yet gone under 43 seconds. So we're left with two – the 200 meters and the 100 meters. (Hurdles? Come on.)

When I was growing up, the answer to this question seemed to be 200 meters: the second half of the 200 meter race has no start to slow it down, and apparently the runner's battery is still charged enough for the entire 200 meters. The 200m record was consistently under twice the 100m record. Michael Johnson's 0.34-second world record improvement at the Atlanta games in 1996 could only convince me further.

Over the last few years, however, the situation has been reversed. The 100m record is 9.572 seconds, while the 200m record is at 19.19, half of which is 9.595, slower than the 100m record. Moreover, both were made by the same runner, Usain Bolt. I looked into the situation in the last 45-or-so years, since just before the accurate automatic measurement was introduced, and it turns out that even though most of the time the halved 200m record was better than the 100m record, there were several transitions from one state of affairs to the other.
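The comparison boils down to one subtraction, spelled out here as a tiny sketch. The function name `dash_gap` is made up for illustration; the two record times are the ones quoted above.

```python
def dash_gap(record_100m, record_200m):
    """100m record minus half the 200m record, in seconds.

    Negative: the 100m is currently the faster dash per 100 meters.
    Positive: the 200m is.
    """
    return record_100m - record_200m / 2

# Record times as quoted in the text above.
gap = dash_gap(9.572, 19.19)
```

With these figures the gap is roughly -0.023 seconds, i.e. the 100m is ahead by about two hundredths of a second.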

Can anybody help me with this? How can it be that there's no definite physiological answer to this question? Is there a known "ideal distance" which balances the slowdown of the start with the slowdown of fatigue? Or does it depend on the strength and expertise of contemporary runners?

After unsuccessfully trying to embed the Google Docs-provided HTML in WordPress, here's a screenshot of the largest-scale chart below, and here's a link to the spreadsheet with the data and a "playable" chart. Each data point is where either a new 100m or 200m record was set, and the y-axis represents the difference between the two. A rise is a new 200m record, a fall is a new 100m record.

Blazing dashes: the difference between the 100m world record and half the 200m world record (in seconds)

Translation Errors, Sans Explanation

December 12, 2011 at 12:12 | Posted in English, English language, Hebrew, Translation | 2 comments

Perhaps inspired by this collection of PR photos without the accompanying press release, I believe non-Hebrew speakers might find the following list amusing: some of the examples I've come across, and taken the trouble of writing down, of horrendous Hebrew subtitling of English dialogue in TV and film. Some mistakes may be easy to figure out, some are not fair of me because I'm also leaving out context, others will hopefully keep you flat-out stumped.

If you do understand Hebrew, it might be more fun starting here before looking for the explanation in any of the posts (Hebrew) where I do explain the (sometimes conjectured) origins.

So here they are, in no particular order (last updated: Dec. 12, 2011):

TV shows:

| Show | English | Hebrew subtitle (my re-translation) |
|---|---|---|
| Sex and the City | the Lennox Lewis fight | the fight between Lennox and Lewis |
| Entourage | Are you Indian now? | Are you a Native American now? |
| How I met Your Mother | demeaning | demanding |
| Family Guy | women's retreat | women's shelter |
| Cleveland | thrifty | thriving |
| American Dad! | boysenberry pancakes | boys and blueberry pancakes |
| The Simpsons | Boy, is my face red | That irritates me |
| | the protagonist | the antagonist |
| | strangle | struggle |
| Futurama | phone carrier | phone case |
| Friends | I call shotgun! | I'll call a plumber! |
| | my fish | my face |
| That 70's show | bitchin' | damn it |
| | They don't let me hang out around the bleachers | They don't let me hang out near the cool kids |
| Life On Mars (UK) | want to go to the pictures? | want to go see pictures? |
| CSI: Las Vegas | scent | accent |
| | refine search | define search |
| 30 Rock | heroes turning in communists | heroes turning into communists |
| | struck by lightning | struck by the lighting pole |
| | lady airline pilots | "Lady Airline" pilots |
| | pageant girls | pregnant girls |
| Seinfeld | I don't know, got a two in Zagat's | I don't know, I have a reservation for two at Zagat's |
| Frasier | global warming | central heating |


| Film | English | Hebrew subtitle (my re-translation) |
|---|---|---|
| Armageddon | Roger that | How will that help? |
| The Depraved (1957) | how's tricks? | How's (a person named) Tricks? |
| A King in New York | contempt | content |
| Zelig | I hate the country (-side) | I hate the country (=state) |
| | socialites | socialists |
| | it can be traced to | it can be tracked by |
| | the great potato famine | the great potato |
| | Leonard struck him and the other doctors with a rake | Leonard struck him and the other doctors with the change he had undergone |
| (title; completely idiomatic in modern-day usage) | (a civilization, feminine in Hebrew) Gone with the Wind | (masculine) Gone with the Wind |
