
Towards smarter data - accuracy and precision


There is a huge amount of information out there, and it is growing. To stay efficient and increase our competitive advantage we need to evolve and start using information in a smart way, concentrating on data that drives business value because it is accurate, actionable, and agile. Accuracy is a key measure of the quality of data processing solutions.

How is accuracy calculated?

Calculating accuracy is easy with structured data, because the requirements can be formalized. It is less obvious with unstructured data, e.g. a stream of social feeds or any data set that involves natural language. Sentences in natural language are open to multiple interpretations and therefore allow a degree of subjectivity. For example, should the sentence ‘I haven’t been on a sea cruise for a long time’ qualify for a data set of people interested in going on a cruise? Both answers, yes and no, seem valid.

In these cases an argument has been put forward that a consensus approach, which polls data providers, is the best way to judge data accuracy. This approach essentially claims that the attributes with the highest consensus across data providers are the most accurate.

At nmodes we deal with unstructured data all the time because we process natural language messages, primarily from social networks. We do not favor this simplistic approach: it is biased, inviting people to make assumptions based on what they already believe to be true, and it makes no distinction between precision and accuracy. The difference is that precision measures only what you got right, while accuracy accounts for both what you got right and what you got wrong. Accuracy is a more inclusive and therefore more valuable characteristic.
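To make the distinction concrete, here is a minimal sketch using the standard binary-classification definitions of these two measures (the counts are invented for illustration and are not nmodes data):

# Illustrative only: standard binary-classification definitions,
# not nmodes' internal scoring; the counts are made-up examples.
tp = 80   # relevant messages correctly included (e.g. genuine cruise intent)
fp = 10   # irrelevant messages wrongly included
fn = 15   # relevant messages missed
tn = 95   # irrelevant messages correctly excluded

precision = tp / (tp + fp)                   # of what we included, how much was right
accuracy = (tp + tn) / (tp + fp + fn + tn)   # right and wrong decisions together

print("precision:", round(precision, 3))  # 0.889
print("accuracy: ", round(accuracy, 3))   # 0.875

A system can look strong on precision while missing a large share of relevant messages; accuracy also counts those misses, which is why we treat it as the more telling measure.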

Our approach is:

a) to validate data against independent third-party sources (typically of academic origin) that contain trusted sets and reliable demographics. Validating nmodes data against third-party sources allows us to verify that our data achieves the greatest possible balance of scale and accuracy (see the sketch after this list).

b) to enrich the existing test sets by purposefully including examples that are ambiguous in meaning and intent, and providing additional levels of categorization to cover them.
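A minimal sketch of point a), assuming a hypothetical trusted reference set of labelled messages (the message ids and labels are invented for illustration, not nmodes' actual pipeline):

# Hypothetical sketch: compare our labels against a trusted reference set.
reference = {   # message id -> label from an independent (e.g. academic) source
    "m1": "interested_in_cruise",
    "m2": "not_interested",
    "m3": "interested_in_cruise",
}
ours = {        # message id -> label assigned by our system
    "m1": "interested_in_cruise",
    "m2": "interested_in_cruise",
    "m3": "interested_in_cruise",
}

shared = reference.keys() & ours.keys()
agreement = sum(reference[m] == ours[m] for m in shared) / len(shared)
print("agreement with reference set:", round(agreement, 2))  # 0.67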

Accuracy becomes increasingly important as businesses move from the rudimentary data use typical of the first Big Data years to the more measured and careful approach of today. Understanding how it is calculated and the value it brings helps achieve long-term sustainability and success.

 

Interested in reading more? Check out our other blogs:

WHY ALL CONVERSATIONAL AI SOLUTIONS ARE CURRENTLY CUSTOM MADE


All quality conversational AI solutions, such as chatbots, voice bots, and virtual assistants, are customized. The reason is that conversational AI solutions have a component, called AI training, that has to be individually tailored to the needs of each business. Currently the AI industry does not have a suitable way to automate this component.


There are, of course, easy-to-use, scalable products such as Chatfuel, ManyChat and others, but they do not provide sufficient quality and therefore do not add value to the professional sales or customer service process.


The next generation of conversational AI solutions will be scalable while still delivering the level of quality required by businesses and professional organizations. nmodes is among a limited number of AI companies with a sufficient level of technological knowledge and a deep enough understanding of the underlying linguistic processes to bring this kind of solution to the market as quickly as possible. In the meantime, customized AI solutions, with a personalized AI training component, are the industry's best option.

 

Abundance of Information Is Often a Liability

A massive change has occurred in the world during the last ten to twenty years. Until recently, and throughout the history of mankind, information was hard to access. Obtaining and sharing information was either a laborious process or impossible, and the underlying assumption was that there could never be enough of it.

Today, of course, we have the opposite picture. Not only is information easily available, it keeps pouring in from a growing number of sources, and we continuously find ourselves in situations where there is more information than we want or are able to process.

A major task we, as a species, are facing is therefore how to reduce information and filter it down to what is relevant. It is, to repeat, the direct opposite of the task we have been accustomed to during all previous centuries, which was how to obtain information.

Since this change took place only recently, within the lifetime of a single generation, we have not had time to develop an efficient set of procedures to address the new problem. But the work has started and will only accelerate with time.
