Beware the lure of crowdsourced data
Crowdsourced data can often be inconsistent, messy or downright wrong
We all like something for nothing; that’s why open source software is so popular. (It’s also why the Pirate Bay exists.) But sometimes things that seem too good to be true are just that.
Repustate is in the text analytics game, which means we need lots and lots of data to model certain characteristics of written text. We need common words, grammar constructs, human-annotated corpora of text, and so on to make our various language models work as quickly and as well as they do.
We recently embarked on the next phase of our text analytics adventure: semantic analysis. Semantic analysis is the process of taking arbitrary text and assigning meaning to its individual, relevant components. For example, it means identifying “apple” as a fruit in the sentence “I went apple picking yesterday” but identifying “Apple” as the company in “I can’t wait for the new Apple product announcement”. (Note: even though I used title case in the latter example, casing should not matter.)
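To make the “apple” example concrete, here is a minimal sketch of context-based disambiguation in Python. It is a toy illustration, not how Repustate does it: the function name `disambiguate_apple` and the hand-written cue word lists are assumptions made up for this example, and a real system would learn such associations from a large amount of data. The text is lowercased before matching, so casing does not affect the result.

```python
import re

# Hypothetical cue words for each sense of "apple" (hand-picked for this sketch;
# a real system would learn these associations from annotated data).
SENSE_CUES = {
    "fruit": {"picking", "orchard", "pie", "eat", "juice", "tree"},
    "company": {"product", "announcement", "iphone", "stock", "launch", "ceo"},
}


def disambiguate_apple(sentence):
    """Guess which sense of "apple" a sentence uses.

    Returns "fruit" or "company", or None if "apple" is absent or
    no cue word from either list appears in the sentence.
    """
    # Lowercase first so that "Apple" and "apple" are treated identically.
    tokens = set(re.findall(r"[a-z']+", sentence.lower()))
    if "apple" not in tokens:
        return None

    # Score each sense by how many of its cue words appear in the sentence.
    scores = {sense: len(tokens & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate_apple("I went apple picking yesterday"))
print(disambiguate_apple("i can't wait for the new apple product announcement"))
```

Counting overlapping cue words is about the simplest approach that works on these two sentences. It breaks down quickly on real text, which is exactly why this kind of analysis needs so much data.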