Semantic Null Distribution

This app calculates a null distribution of semantic similarity scores (using the dot product) between a target word and positive versus negative sets of words.
The null distribution is based on randomly selected words that fit within a template sentence. The template sentence uses a $ symbol to mark the position of the word to be randomized. The template sentence is paired with a part-of-speech tag (as assigned by the pos_tag function from NLTK, using the Penn Treebank tagset) that determines the grammatical form allowed for random words. Examples of tags are: 'JJ' for adjectives, 'NN' for nouns, 'RB' for adverbs, 'VB' for verbs.
To pick words fully at random, use only a $ for the sentence and keep the part-of-speech tag empty.
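The template check described above can be sketched as follows. This is only an illustration, not the app's actual code: the function name fits_template, the whitespace tokenization, and the toy_tagger stand-in are all assumptions. In practice nltk.pos_tag, which takes a list of tokens and returns (token, tag) pairs, could be passed in as the tagger.

```python
def fits_template(word, template, required_tag, tagger):
    """Check whether `word` is tagged `required_tag` when placed at the $ slot.

    `tagger` is any callable mapping a list of tokens to a list of
    (token, tag) pairs, which is the shape nltk.pos_tag returns.
    """
    # An empty tag means no grammatical constraint: every word is allowed.
    if not required_tag:
        return True
    tokens = template.split()  # assumes $ is a whitespace-separated token
    slot = tokens.index("$")
    tokens[slot] = word
    return tagger(tokens)[slot][1] == required_tag


def toy_tagger(tokens):
    # Hypothetical stand-in for nltk.pos_tag so the sketch runs without NLTK data.
    adjectives = {"happy", "tired", "green"}
    return [(t, "JJ" if t in adjectives else "NN") for t in tokens]


candidates = ["happy", "table", "green"]
# Keep only words that are tagged as adjectives inside "I am $".
allowed = [w for w in candidates if fits_template(w, "I am $", "JJ", toy_tagger)]
# With "$" as the sentence and an empty tag, nothing is filtered out.
unconstrained = [w for w in candidates if fits_template(w, "$", "", toy_tagger)]
```

Tagging the word in the context of the whole sentence, rather than in isolation, matters because many words can take different tags depending on their surroundings.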
Pre-calculated similarity scores can also be entered; p-values are then calculated for these scores as well.
If there is a problem with the input (e.g., an unrecognized word or tag), an empty distribution will be returned and any p-values will be set to 666, a value that cannot occur as a real p-value.

The method is similar to the tests for biophilic associations in language usage in Gladwin, Markwell & Panno (2022).


Enter sets of words separated by commas.

Target word(s) to test:

Target-contrast word:

Positive word(s):

Negative word(s):

Pre-calculated similarity scores:

Template sentence, where $ represents the word to randomize:

Randomized words' part-of-speech tag:


To test whether the drinks beer, coffee, and wine are related to drunkenness versus alertness in the model, take "beer,coffee,wine" as the target words, "hungover, tired, drunk" as the positive words, and "awake, alert" as the negative words.
Random words will then have their similarities calculated with the positive and negative sets (note that "positive" and "negative" don't refer to valence here).
The null distribution consists of the relative semantic similarity scores of the randomly selected words.
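One plausible reading of a "relative semantic similarity score" is the mean dot product with the positive words minus the mean dot product with the negative words. The sketch below assumes that definition; the app's exact formula may differ, and the two-dimensional vectors are made up purely for illustration.

```python
def dot(u, v):
    # Dot product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))


def mean_similarity(vec, word_set, vectors):
    # Average dot product between `vec` and each word in `word_set`.
    return sum(dot(vec, vectors[w]) for w in word_set) / len(word_set)


def relative_score(vec, positive, negative, vectors):
    # Assumed definition: similarity to the positive set minus the negative set.
    return (mean_similarity(vec, positive, vectors)
            - mean_similarity(vec, negative, vectors))


# Made-up 2-D vectors; a real model would supply high-dimensional embeddings.
vectors = {
    "drunk": [1.0, 0.0],
    "tired": [0.8, 0.2],
    "awake": [0.0, 1.0],
    "alert": [0.2, 0.8],
    "beer": [0.9, 0.1],
    "coffee": [0.1, 0.9],
}
positive = ["drunk", "tired"]
negative = ["awake", "alert"]

beer_score = relative_score(vectors["beer"], positive, negative, vectors)
coffee_score = relative_score(vectors["coffee"], positive, negative, vectors)
```

With these toy vectors, the score for "beer" comes out positive and the score for "coffee" negative. Applying the same function to each randomly selected word yields the scores that make up the null distribution.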
The template sentence could be "I am $", with part-of-speech tag "JJ", for adjectives.
A target-contrast word can be specified to test whether a target's similarity remains significant when the contrast word's vector is subtracted from it.
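The contrast step can be illustrated with a small sketch. It assumes the same relative score as a difference of mean dot products (an assumption, not a confirmed detail of the app) and uses made-up vectors, with "wine" as the target and "beer" as a hypothetical contrast word.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def relative_score(vec, positive_vecs, negative_vecs):
    # Assumed definition: mean dot product with positives minus negatives.
    pos = sum(dot(vec, p) for p in positive_vecs) / len(positive_vecs)
    neg = sum(dot(vec, n) for n in negative_vecs) / len(negative_vecs)
    return pos - neg


def subtract(u, v):
    # Element-wise difference: target vector minus contrast vector.
    return [a - b for a, b in zip(u, v)]


# Made-up 2-D vectors, for illustration only.
wine = [0.7, 0.3]              # target
beer = [0.9, 0.1]              # hypothetical contrast word
positive_vecs = [[1.0, 0.0]]   # e.g. "drunk"
negative_vecs = [[0.0, 1.0]]   # e.g. "awake"

plain = relative_score(wine, positive_vecs, negative_vecs)
contrasted = relative_score(subtract(wine, beer), positive_vecs, negative_vecs)
contrast_alone = relative_score(beer, positive_vecs, negative_vecs)
```

Because the dot product is linear, the contrasted score equals the target's score minus the contrast word's score, so under this definition the test asks whether the target is more strongly associated with the positive set than the contrast word is.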
The output gives the randomization-based p-values for the input target words and pre-calculated scores. Small p-values indicate statistical significance; the conventional threshold is p < .05.
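A randomization-based p-value can be sketched as the proportion of null scores that are at least as large as the observed score. This is a one-sided version chosen for illustration; whether the app uses a one-sided or two-sided test is not stated, and the function name and the simulated null scores are assumptions.

```python
import random


def randomization_p_value(observed, null_scores, sentinel=666):
    # An empty null distribution signals an input problem, so return the
    # sentinel value instead of a real p-value.
    if not null_scores:
        return sentinel
    # One-sided: fraction of null scores at least as large as the observed one.
    n_extreme = sum(1 for s in null_scores if s >= observed)
    return n_extreme / len(null_scores)


# Simulated null scores standing in for the scores of random words.
rng = random.Random(0)
null_scores = [rng.gauss(0.0, 1.0) for _ in range(1000)]

p_large = randomization_p_value(3.5, null_scores)    # far in the upper tail
p_typical = randomization_p_value(0.0, null_scores)  # near the middle
p_failed = randomization_p_value(0.64, [])           # empty null distribution
```

An observed score far in the upper tail of the null distribution gives a small p-value, a score near the middle gives a p-value around one half, and an empty distribution gives the sentinel 666.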