The “briefing questions for unstructured data” were published by ESOMAR last week, and not a moment too soon. It was high time to replace the “24 Questions to help buyers of social media research”, published in 2012, with a new, updated document that provides guidance to buyers of social intelligence and text analytics solutions, covering sources beyond social media as well. This comprehensive work includes 26 questions and took a year to complete, from inception to publication.
In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80-90% of all potentially usable business information may originate in unstructured form.[1] Today we see even stronger statements floating around, such as the claim that over 90% of all human knowledge accumulated since the beginning of time is unstructured data. This includes text, images, audio and video. We can think of the other 10% as numbers in tables (structured data), which are the primary output of any quantitative market or marketing research.
Other than reading, listening to, or viewing unstructured data, there is another way to understand its meaning. Especially when we are dealing with big data, there is only one way to discover and understand the information hidden in mega-, giga-, tera-, peta- or n-ta-bytes of data: artificial intelligence. With machine learning – the discipline that produces A.I. – we have the ability to create models that can process large files of text or images in seconds, and annotate sentences, paragraphs, sections, objects or even whole documents with topics, sentiment and specific emotions. Sentiment and semantic analysis are the two most popular ways to analyse and understand unstructured data using machine learning or a rules-based approach. When the unstructured data to be analysed is in text format, the discipline falls under Computer Science (not linguistics, funnily enough) and is called Natural Language Processing (NLP) or Text Analytics.
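To make the idea of annotation concrete, here is a minimal sketch of the rules-based approach mentioned above, written in Python purely for illustration. The word lists and example posts are invented; real solutions rely on machine-learned models and far richer lexicons.

```python
# Minimal, hypothetical rules-based sentiment annotator (illustration only).
# Real tools use trained models and much larger lexicons.

POSITIVE = {"love", "great", "excellent", "amazing", "happy"}
NEGATIVE = {"hate", "terrible", "awful", "broken", "disappointed"}

def annotate_sentiment(text: str) -> str:
    """Label a piece of text as positive, negative or neutral."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

posts = [
    "I love the new packaging, it looks great",
    "The app is broken again, terrible experience",
    "Ordered on Monday, arrived on Wednesday",
]
for post in posts:
    print(annotate_sentiment(post), "-", post)
```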
There seems to be a lot of perceived complexity in using artificial intelligence or other engineered approaches to analyse text and images in an automated way. We need to either radically simplify, or acquire enough knowledge to understand what may seem complex and difficult at first glance. Whatever the case, this article and certainly the ESOMAR briefing aim to educate and simplify at the same time.
Some machine learning basics
Before we dive into the various use cases of NLP and practical applications of the briefing, let’s set the stage with some basic information on what is possible, and what less so.
What is possible with machine learning:
- over 80% agreement between humans and the sentiment and semantic annotations of a machine learning model (a sketch of how such agreement can be measured follows these lists)
- to achieve the above accuracy regardless of text language
- to annotate text for multiple emotions that go beyond positive and negative sentiment
- to automatically caption millions of images using text that can then be analysed for topics and sentiment
- to analyse data from any source – machine learning is not only language but also data source agnostic
What is really difficult to achieve with machine learning:
- 100% agreement of multiple humans with the annotations of a machine learning model (100% agreement of one human with the machine learning model is achievable)
- over 70% accuracy in a given language with a subject-generic machine learning model
- over 70% accuracy using a rules-based approach
- combined accuracy for brand, sentiment and topics over 70%
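The agreement figures quoted in both lists can be measured rather than guessed. As a minimal sketch (the two label sequences below are made-up, not real benchmark data), one can compute simple percent agreement between a human coder and a model, and Cohen’s kappa, which corrects that figure for chance agreement:

```python
# A minimal sketch of measuring human-machine agreement on sentiment labels.
# The two label lists below are made-up examples, not real benchmark data.
from collections import Counter

human =   ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "pos", "neg", "pos"]
machine = ["pos", "neg", "neu", "pos", "pos", "pos", "neu", "neg", "neg", "pos"]

# Simple percent agreement: the kind of figure quoted as "over 80%" above.
n = len(human)
observed = sum(h == m for h, m in zip(human, machine)) / n

# Cohen's kappa corrects percent agreement for agreement expected by chance.
h_counts, m_counts = Counter(human), Counter(machine)
labels = set(human) | set(machine)
expected = sum((h_counts[l] / n) * (m_counts[l] / n) for l in labels)
kappa = (observed - expected) / (1 - expected)

print(f"percent agreement: {observed:.0%}")   # 80% for this toy data
print(f"Cohen's kappa:     {kappa:.2f}")
```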
Use cases for unstructured data analytics
There is a multitude of users, data sources and use cases within an organisation, and all of them can benefit from the document ESOMAR has published. Let’s start with relevant data sources:
- Social Media
- Other public websites
- Answers to open ended questions
- Transcripts of in-depth interviews and focus group discussions
- Call centre conversations with customers
- Organic conversations on private online communities
ESOMAR mainly caters to market researchers in organisations globally, but there are many more users of text and image analytics solutions, sitting in different departments, who can benefit from this briefing. Here is a combined list of users, with use case examples for each one; it is not exhaustive by any means:
- Market research – for insights from social and other unstructured data sources
- Public relations – to manage brand and corporate reputation
- Customer service – to respond to questions, complaints and requests
- Advertising – to leverage positive testimonials
- Marketing – to find and leverage influencers
- Product Development – to learn about missing product features or ones that are not appreciated by consumers
- Innovation (beyond new product development) – to learn about emerging trends and new product use cases
- Competitive Intelligence – to gauge how competitors are doing in an industry or product category
- Operations – to learn about issues that need fixing
- Finance (together with marketing) – to find out about sentiment towards pricing
- Board – to benchmark and track sentiment on governance
- Sales – to find sales leads who express purchase intent
Because there are so many use cases, there are many tools that initially started with a single use case in mind – the most popular ones being public relations, reputation management and customer care. As time went by, these tools were looking for growth, so they – almost without exception – decided to dabble in the market research sector. This very fact created an immense problem for the market research industry: it led to a delay in the adoption of social intelligence solutions, i.e. the use of text and image analytics to process and annotate unsolicited opinions on the web for consumer insights purposes.
To offer some more clarity, the delay happened because insights professionals tried out some of the social media monitoring tools that were around in 2010-2012, and figured out that their accuracy was so low that they could not be used for market research purposes. This is why ESOMAR had created the 24 questions to help buyers of social media research back in 2012. By that time the market research world had already written off social media listening – as many called it – as not accurate, not representative, and by extension not only useless but also possibly harmful.
Fast forward to 2019 and, thankfully, perceptions have changed. It has been proven to the powers that be that social media data can be cleaned of irrelevant posts, and that text analytics can be accurate enough for market research purposes. This is what makes the ESOMAR guide, comprising 26 questions to ask before you buy your way into an automated text and image analytics capability, so timely and necessary, not just for market researchers but for all prospective buyers out there.
Practical applications of the questions in this briefing
There are five sections in the document that are meant to guide buyers of related tools and services to ask vendors the right questions. The answers to these questions will enable buyers to make an informed purchasing decision. Here are the five sections:
Company Profile and Capabilities
First of all, it is important to know whom we are dealing with. Is this a pure technology company with a tool, or do they have any subject matter expertise? For example, market research and insights expertise would be nice if the buyer is an insights professional.
Data sources and types
Does this solution make use of specific data sources that it provides as part of the service, or is it just an analytics solution, meaning the client should provide the data for the analysis? Even if the company provides data, is the technology source agnostic? In other words, can it process and accurately annotate all six source examples listed above?
Software design and capabilities
This section is one of the two most important ones. It helps the buyer understand how the data processing, annotation and analysis are done, in which languages, and what types of data are analysed.
Data quality and validation
This is the other of the two most important sections in the briefing. We all know the saying: garbage in, garbage out. This section is about cleaning the data before processing and annotating them.
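To make the garbage-in, garbage-out point concrete, here is a minimal, hypothetical cleaning pass: dropping verbatim duplicates, off-topic posts and obvious spam before annotation. The brand terms and spam markers are invented for illustration; real vendors use far more sophisticated relevance and spam models.

```python
# A minimal, illustrative cleaning pass before annotation: drop exact duplicates
# and posts that look irrelevant or spammy. The keyword lists are hypothetical.

RELEVANT_TERMS = {"acmecola", "acme cola"}          # hypothetical brand terms
SPAM_MARKERS = {"click here", "win a free", "http://bit.ly"}

def clean(posts: list[str]) -> list[str]:
    seen = set()
    kept = []
    for post in posts:
        text = post.lower().strip()
        if text in seen:                             # remove verbatim duplicates
            continue
        seen.add(text)
        if not any(term in text for term in RELEVANT_TERMS):
            continue                                 # off-topic for the brand
        if any(marker in text for marker in SPAM_MARKERS):
            continue                                 # obvious spam
        kept.append(post)
    return kept

raw = [
    "Loving the new AcmeCola flavour!",
    "Loving the new AcmeCola flavour!",
    "Win a free phone, click here http://bit.ly/xyz",
    "Stuck in traffic again...",
]
print(clean(raw))   # only the first post survives
```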
Ethical and legal compliance
The ESOMAR code of conduct has always been stricter than the law, and this briefing is no different. Not only should the vendor be GDPR compliant, but they should also ensure that no harm is done to the subjects of the research, no matter how insignificant that harm may seem.
For some questions there are no right or wrong answers; the vendor just needs to have a plausible answer – if they do not, then that in itself would constitute a red flag. As an example, if, on question 18 about the vendor’s minimum accuracy, the answer is “What do you mean?”, then a good next step for the buyer would be to walk away… and fast!
Tip for the uninitiated
Consumer research has typically been performed by asking questions in surveys or qualitative research. For many insights professionals, social media intelligence, or intelligence extracted from other unstructured data sources, is fairly new. If this guide is your first exposure to Natural Language Processing or image analytics, then it is possible that some of the questions, or the explanations provided for context, will not be enough to give you a thorough understanding of the issue the guide is trying to address. In such a case, feel free to contact ESOMAR or the project team co-chairs directly with your questions.
If it turns out that we need to create answers to frequently asked questions about the 26 questions and their possible answers, then this may imply that we did not do such a good job simplifying for our audience. The only consolation is that, even if the briefing contains a lot of complexity, it is a step in the right direction. Thank you ESOMAR for being open, flexible and very supportive of this initiative.