“Any researcher worth their salt was acutely aware of sample design and delivery on that design back in the 1970s. We knew that it would make or break our study and our competitive advantage lay in our data quality.”
So said Butch Rice when I spoke to him for this article. Rice is a South African market research industry pioneer who started one of that country’s most successful agencies, Research Surveys, which was later subsumed by TNS and then Kantar.
Over time, sampling has become such a fundamental and pervasive part of our industry that it has become almost an afterthought for many researchers. However, our once-dogged focus on sampling is beginning to resurface in new ways. With the large-scale adoption of AI technologies, we’re finding the need to think more carefully about the data samples that we use to train our AI models, and we can learn from our industry’s diligent focus on sampling in the past.
Much has been said about bias in AI – bias towards specific races, genders, age groups and so on – but little has been said about the inherent biases in AI models that make them less relevant to the market research industry than they could be. There is a disconnect between the language of insights – the shared paradigm between researchers and clients that allows us to quickly convey concepts about brands – and the language of the internet that is encoded into many AI models. Industry stalwart Larry Friedman highlighted the disconnect between traditional brand concepts and new data sources like social media for me:
“Too often, researchers haven’t been careful enough when attempting to ‘translate’ established survey constructs like Brand Consideration or Purchase Intent into equivalent social metrics. How people discuss their interest in buying brands on Twitter may not correspond simply into a survey top-box score; it needs to be thought through very carefully.”
How we construct AI models
Often, we don’t give enough thought to this consilience, or lack thereof. Most AI models are created in one of two ways:
- A training dataset that consists of inputs (variables, text verbatims, etc.) and associated examples of what we are trying to predict is used to train an AI model. Often these are datasets shared publicly by academia or industry, unless a company invests in creating its own.
- A public model that encodes the relationships within a massive dataset is used to predict where our new, unseen data falls based on these previously encoded relationships. These datasets and models are often based on huge swathes of the internet.
In the first scenario, traditional sampling considerations are very relevant when it comes to constructing a training dataset from scratch. Does the training dataset capture the same domain or paradigm as the unseen data that I am going to be applying the model to? Is the training dataset ‘representative’ of the way that whatever I am trying to predict falls out in real life? These kinds of questions will resonate with any market researcher and are vitally important to consider when constructing an AI model.
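The representativeness question above can be made concrete with a quick check. The following is a minimal sketch, assuming you have the labels from your training dataset and a smaller reference sample that you trust to reflect real-life proportions. The function names and the toy data are hypothetical, and total variation distance is just one of several reasonable ways to compare two label distributions.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the total as a dict."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(train_labels, reference_labels):
    """Half the sum of absolute differences in label shares.

    0.0 means the two distributions are identical;
    1.0 means they share no labels at all.
    """
    p = label_shares(train_labels)
    q = label_shares(reference_labels)
    all_labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in all_labels)


# Hypothetical data: sentiment labels in a training set versus a
# trusted reference sample of the verbatims the model will be applied to.
train = ["positive"] * 70 + ["neutral"] * 20 + ["negative"] * 10
reference = ["positive"] * 40 + ["neutral"] * 35 + ["negative"] * 25

gap = total_variation_distance(train, reference)
print(f"Label distribution gap: {gap:.2f}")
```

A large gap does not prove that the model will perform badly, but it is a prompt to re-weight, re-sample or collect more examples of the under-represented classes before training.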
In the second scenario, rather than starting from scratch each time, AI practitioners increasingly leverage datasets and models shared in the public domain by organisations that have the resources to collect and process large portions of the internet.
Fine-tuning required?
For a long time, we thought that these models incorporated so much data that they captured universal relationships that could be leveraged in most contexts. However, what has become apparent over time is that these models, while amazing public goods, still need to be tweaked, or “fine-tuned”, to be sensitive towards specific contexts. This process is a form of “transfer learning” – a popular field of AI research and application at the moment. These public models represent massive, solid foundations to build on, but we still need to choose the paint colour, window shape and roof style (to butcher an analogy).
When it comes to these public models, the promise of big data was that we didn’t need to sample anymore. Why would we when we had access to “all” the data? Many of the public resources that AI models are based on do draw on a vast share of the internet. However, regardless of how big these datasets might be, the reality is that we seldom have all the data. When we do have access to large datasets, they are often biased towards specific domains; few ever capture the market research paradigm with its unique concepts like brand equity or common product attributes. Consumers on the internet just don’t frame their discussions around these topics in the same way we talk about them as brand researchers and owners.
Models created using millions (or even hundreds of millions) of tweets, for example, encode the semantic associations and perceptions of a self-selected group of vocal, passionate and opinionated people. Models based on IMDb, Amazon or Yelp reviews encode consumer language specific to certain categories, and Wikipedia, boon to humanity that it is, captures dry relationships between facts that hardly reflect how real people talk or think.
Models based on these sources are used by most AI practitioners around the world. Released by companies such as OpenAI, Hugging Face, DeepMind, Google and Facebook, they are surely valuable public goods, but they often fail to give us the quality that we need in the market research industry out of the box when we apply them to our own text, voice, image and video data.
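One rough way to gauge how far a public source sits from our own data is to measure how much of our language it has never seen. The sketch below is an illustration under stated assumptions, not a standard diagnostic: the function names and the toy texts are hypothetical, and the word-level tokeniser is deliberately crude. Many modern public models split text into sub-word pieces, so a word-level out-of-vocabulary rate is only a coarse signal of domain mismatch.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def out_of_vocabulary_rate(source_texts, target_texts):
    """Share of target-domain tokens that never appear in the source corpus."""
    source_vocab = {tok for text in source_texts for tok in tokenize(text)}
    target_tokens = [tok for text in target_texts for tok in tokenize(text)]
    if not target_tokens:
        return 0.0
    unseen = sum(1 for tok in target_tokens if tok not in source_vocab)
    return unseen / len(target_tokens)


# Hypothetical data: a few review-style source texts versus
# survey-style target texts written in brand research language.
source = ["this phone is amazing", "worst movie ever", "love this brand"]
target = ["I would consider this brand", "purchase intent is high"]

rate = out_of_vocabulary_rate(source, target)
print(f"Out-of-vocabulary rate: {rate:.0%}")
```

On real corpora, a high rate suggests that the source rarely discusses the concepts you care about in the words you use for them, which is one sign that fine-tuning or a purpose-built training dataset may be needed.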
So what next?
This brings us back to the concept of sampling, or at least thinking deeply about the data that goes into your models, and market researchers can teach the AI community a thing or two about diligence in this area.
When constructing training datasets or when fine-tuning public models, careful thought needs to be given to the alignment between domains – was the training dataset or public model created using data from a similar domain to the one you are applying it to? If not, how should you go about creating a new training dataset that can be used to create a new model or fine-tune an existing one? How many classes (metrics, tags, KPIs, etc.) are you trying to predict? How many training examples do you need to cover all these classes appropriately? Where are you sourcing this data from? How are you going to go about coding this data? Will you do it in-house? Will you use a coding team? Will you use a crowd-sourcing company such as Amazon’s Mechanical Turk or Figure Eight? How many coders should review each document? And so on…
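The question of how many coders should review each document can be informed by measuring how often coders agree. The sketch below assumes two coders have labelled the same documents and uses Cohen’s kappa, a common chance-corrected agreement statistic, alongside a simple majority vote for resolving several coders’ labels. The function names and the example codes are hypothetical.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders on the same documents."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same non-empty set of documents")
    n = len(coder_a)
    # Observed agreement: share of documents given the same code by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement if each coder assigned codes at random
    # according to their own overall code frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical code throughout
    return (observed - expected) / (1 - expected)


def majority_label(votes):
    """Return the most common code, or None if the top two codes tie."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Hypothetical codes from two coders for the same ten verbatims.
coder_a = ["pos", "pos", "neg", "neg", "neu", "pos", "neg", "neu", "pos", "neg"]
coder_b = ["pos", "neg", "neg", "neg", "neu", "pos", "neg", "pos", "pos", "neg"]

kappa = cohens_kappa(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")
print(majority_label(["pos", "pos", "neg"]))
```

If agreement is low, that points to an ambiguous code frame or to the need for more coders per document. Documents where `majority_label` returns `None` are natural candidates for review by a senior coder.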
It’s important that AI practitioners do not overlook this crucial aspect of building AI models, and there is much that experienced marketing scientists and other insights professionals can impart in this regard.
AI has been a bit like the Wild West, but it’s time to straighten things up with a bit of market research wisdom.