Tokenized word terms were collected from three sources, all dealing with water quality issues: controlled vocabulary headings, user keyword searches, and HTML documents. Distances between word pairs were calculated using the Jaccard formula. Distances from the three sources were compared using Spearman rank correlations, and clusters were calculated with the SAS pseudo-centroid method on distances transformed to correct for non-normality. Word pair distances from controlled vocabularies were more closely correlated with those from users’ keyword searches than document distances were. The mean distance for controlled vocabularies was also closer to that for users’ keyword searches. Clusters produced from the three sources were most similar for word pairs with small distances.
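The distance and correlation steps described above can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: each source is treated as a list of tokenized records, a word is represented by the set of records it occurs in, and the distance for a word pair is one minus the Jaccard coefficient of those two sets. The function names and the toy records are hypothetical and are not the study's data or code.

```python
from itertools import combinations


def jaccard_distances(records):
    """Return {(word_a, word_b): distance} for every word pair in the records.

    Assumption: a word is represented by the set of records containing it.
    """
    # Map each word to the set of record indices in which it appears.
    postings = {}
    for idx, tokens in enumerate(records):
        for tok in set(tokens):
            postings.setdefault(tok, set()).add(idx)
    distances = {}
    for a, b in combinations(sorted(postings), 2):
        shared = len(postings[a] & postings[b])
        either = len(postings[a] | postings[b])
        # Jaccard distance = 1 - |A ∩ B| / |A ∪ B|
        distances[(a, b)] = 1.0 - shared / either
    return distances


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical toy records standing in for two of the three sources.
    headings = [["water", "quality"], ["water", "pollution"], ["quality", "standards"]]
    searches = [["water", "quality"], ["water", "quality", "standards"], ["pollution"]]
    d_head = jaccard_distances(headings)
    d_search = jaccard_distances(searches)
    # Compare only word pairs present in both sources.
    pairs = sorted(set(d_head) & set(d_search))
    rho = spearman([d_head[p] for p in pairs], [d_search[p] for p in pairs])
    print(f"Spearman rho over {len(pairs)} shared pairs: {rho:.3f}")
```

Restricting the comparison to word pairs shared by both sources is a design choice of this sketch; the abstract does not say how unmatched pairs were handled, nor which transformation was applied before clustering.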