Are You Making Erroneous Business Decisions Based on Poor Analysis of Social Media and Text Data?

Many companies take advantage of natural language processing (NLP) for text analysis of market research or social media data. But how accurate are the techniques you are using? We interview data scientists almost every week as we constantly expand our team, and when questioning candidates we notice the same thing: widespread misconceptions about natural language processing, and a large number of people using inferior, misleading techniques while believing they understand them. Almost every data scientist we speak with believes they are very familiar with these techniques and that semantic analysis of text is easy. The ML techniques in common use are reasonable in themselves, but examine them critically and you find the data they are fed is often inaccurate.

One can write algorithms that analyze keywords or phrases, assume a positive or negative meaning, classify them accordingly, add an intensity score, and then feed the results into machine learning or another AI technique. However, this can cause serious inaccuracies in the resulting predictions because these techniques are deeply flawed. Everything in AI comes down to the data: garbage in, garbage out. Deep linguistic analysis and machine learning working hand in hand are the optimal approach to text analytics. For text analysis to be accurate when applying machine learning, you need a two-stage process: first, a proper deep linguistic analysis to ensure accurate identification, classification, context and meaning of complex text data; then, apply your machine learning algorithm to make predictions, as in the sketch below.
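To make the two-stage idea concrete, here is a minimal sketch in Python. The `linguistic_features` function is a hypothetical toy stand-in for the first stage (a real system would use a full deep linguistic analysis built by computational linguists); the second stage is an ordinary scikit-learn classifier trained on the enriched features. The feature names and training sentences are invented for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def linguistic_features(text: str) -> dict:
    """Stage 1 (toy stand-in): derive linguistically informed features,
    not just raw keywords."""
    tokens = text.lower().split()
    return {
        "has_negation": int(any(t in {"not", "never", "no"} for t in tokens)),
        "has_intensifier": int(any(t in {"very", "really", "much"} for t in tokens)),
        "n_tokens": len(tokens),
        **{f"tok={t}": 1 for t in tokens},
    }

# Stage 2: a standard ML classifier trained on the enriched features.
train_texts = [
    "this laptop is much better than my old laptop",
    "this laptop is not much better than my old laptop",
]
train_labels = ["positive", "negative"]

model = make_pipeline(DictVectorizer(), LogisticRegression())
model.fit([linguistic_features(t) for t in train_texts], train_labels)
print(model.predict([linguistic_features("not much better at all")]))
```

The point of the sketch is only the shape of the pipeline: the ML stage sees linguistically informed features rather than bare keywords.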

In pharma applications, a critical component of this is having a team of computational linguists build a custom dictionary for the therapy areas and language you are monitoring. Once that is done correctly, applying ML algorithms will provide far more accurate results. This is where we see so many in the AI field failing at social media listening. Let me give you an example. Suppose 'sick' is among the keywords you are monitoring, and you have gone straight to the ML approach without a deep linguistic analysis first: your algorithm will classify it as a negative word. Your chances of being correct are 50/50. Then, if you feed that classification into your ML algorithms, what do you think the accuracy of your result will be? Yes, you guessed it: 50/50. A word can have different meanings depending on context, and even on who is saying it. A teenager calling something 'sick' means something positive, unlike someone in their 80s using the same word. Yet much of the NLP we have seen simply assigns the word a positive or negative label plus an intensity value, which leads to inaccurate classification.
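A toy illustration (not Eularis's actual method) of why a flat keyword-to-sentiment dictionary fails for a word like 'sick': the polarity depends on context, approximated here by neighbouring words. The context cues are invented for the demo.

```python
CONTEXT_LEXICON = {
    "sick": [
        # (context cues, polarity) - hypothetical cues for illustration
        ({"fever", "hospital", "vomiting", "felt"}, "negative"),
        ({"awesome", "beat", "trick", "song"}, "positive"),
    ],
}

def polarity(word: str, sentence: str) -> str:
    tokens = set(sentence.lower().replace(",", " ").split())
    for cues, label in CONTEXT_LEXICON.get(word, []):
        if cues & tokens:
            return label
    return "unknown"  # refuse to guess rather than flip a 50/50 coin

print(polarity("sick", "I felt sick after the new dose"))   # negative
print(polarity("sick", "that song is sick, awesome drop"))  # positive
```

Returning "unknown" when no context cue fires is deliberate: an honest abstention beats a coin-flip classification feeding downstream models.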

One also needs to examine more than just the keywords, as the sentence or phrase structure also has an impact. The same words in a different order can carry a very different meaning, and many analyses fail to spot the difference: compare "I've never had a fever before" with "I had a fever like never before". Linguistic structure is also critical to understanding meaning: "This laptop is much better than my old laptop" is positive, while "This laptop is not much better than my old laptop" is negative. Intensity must also be captured; the more intense the feeling, the higher or lower the score, so the analysis needs to handle intensifiers such as 'very', 'really', 'much', 'extremely', 'tremendously', 'awfully', 'exceedingly', and so on. All of this needs to be handled before any ML stage, as sketched below.
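A minimal compositional scorer sketch: intensifiers scale the score and negators flip it, so "not much better" lands below "much better". The word lists and weights are illustrative assumptions, not a shipped lexicon.

```python
INTENSIFIERS = {"very": 1.5, "really": 1.5, "much": 1.3, "extremely": 2.0,
                "tremendously": 2.0, "awfully": 1.8, "exceedingly": 1.8}
NEGATORS = {"not", "never", "no"}
POLARITY = {"better": 1.0, "wonderful": 1.5, "bad": -1.0, "expensive": -0.8}

def score(sentence: str) -> float:
    total, weight, flip = 0.0, 1.0, 1
    for tok in sentence.lower().split():
        if tok in NEGATORS:
            flip = -1                      # negation flips what follows
        elif tok in INTENSIFIERS:
            weight *= INTENSIFIERS[tok]    # intensifiers scale what follows
        elif tok in POLARITY:
            total += flip * weight * POLARITY[tok]
            weight, flip = 1.0, 1          # reset after each sentiment word
    return total

print(score("this laptop is much better than my old laptop"))      # +1.3
print(score("this laptop is not much better than my old laptop"))  # -1.3
```

Even this crude scorer separates the two laptop sentences, which a pure bag-of-keywords approach would score identically.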

However, when you are examining language, the issues go far beyond sentiment scores. Can the approach use syntactic rules to determine what an entity is when the same word has multiple meanings, based on context? For example: Barack Obama. Is that a person, a school, an avenue, or something else? If you say, "I am going to Barack Obama", are you going to see the person, to scope out the school for your child, to visit the avenue because you are considering buying a property there, or something else? Most NLP we have seen does not take overall context into account to distinguish these cases. We need analytics that distinguishes entity types based on context, as the example above shows, or the classifications and analytics built on top will be badly wrong. On top of that, the analytics must distinguish concepts correctly, because different sentences built from the same words can mean different things depending on context, and correctly identifying the intended meaning matters. And, of course, handling multiple classifications within a single sentence, so that the accurate overall intended meaning is uncovered, is critical. If you said, "This laptop is wonderful but far too expensive, and the screen is too small", you are clearly saying you will not buy it, but a simple keyword analysis may stop at "This laptop is wonderful" and conclude the phrase is positive.
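Here is an illustrative sketch of context-based entity typing for the 'Barack Obama' example: the surrounding words, not the name itself, decide whether the mention is a person, a school, or a street. The cue lists are invented for the demo; a production system would use full syntactic and semantic analysis rather than cue overlap.

```python
TYPE_CUES = {
    "SCHOOL": {"enroll", "school", "pupils", "teachers", "class"},
    "STREET": {"avenue", "property", "address", "corner", "block"},
    "PERSON": {"met", "president", "spoke", "said", "interview"},
}

def entity_type(mention: str, sentence: str) -> str:
    """Pick the entity type whose context cues best match the sentence."""
    tokens = set(sentence.lower().replace(",", " ").split())
    best = max(TYPE_CUES, key=lambda t: len(TYPE_CUES[t] & tokens))
    return best if TYPE_CUES[best] & tokens else "UNKNOWN"

print(entity_type("Barack Obama",
                  "I am going to Barack Obama to enroll my child in school"))
# -> SCHOOL
print(entity_type("Barack Obama",
                  "I met Barack Obama when he was president"))
# -> PERSON
```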

In addition, you need topic detection and proper negation handling: "This laptop is really not that bad" is mildly positive, not negative. You also need to consider linguistic structure in similarly worded sentences such as "Qlik takes the lead from Tableau" versus "Qlik takes over Tableau", which share their key words but describe very different events (see the sketch below). Don't get me wrong: we love, and live by, artificial intelligence algorithms. Nevertheless, it is critical that you think about the data, because the quality of your results depends on it.
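A quick sketch of why surface keywords are not enough here: 'takes the lead from' and 'takes over' share words but describe different relationships between the two companies. The two hard-coded patterns are purely for the demo; a real system would use a syntactic parse rather than regular expressions.

```python
import re

PATTERNS = [
    (re.compile(r"(\w+) takes the lead from (\w+)", re.I),
     lambda m: f"{m.group(1)} is now ahead of {m.group(2)} (competition)"),
    (re.compile(r"(\w+) takes over (\w+)", re.I),
     lambda m: f"{m.group(1)} acquires/absorbs {m.group(2)} (acquisition)"),
]

def interpret(sentence: str) -> str:
    for pattern, reading in PATTERNS:
        m = pattern.search(sentence)
        if m:
            return reading(m)
    return "no structural pattern matched"

print(interpret("Qlik takes the lead from Tableau"))
print(interpret("Qlik takes over Tableau"))
```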

Conclusion

If you are using text data, whether high quality (journals) or low quality (Twitter, forums), don't take the simplistic approach of throwing it into an algorithm and expecting great accuracy. Start by making sure you genuinely understand the data you are working with. For text data, that means engaging a computational linguist (probably several, if projects are to be completed in a timely manner) to create a custom dictionary for the therapy area and language involved, having them apply deep linguistic analyses to classify the data, and only then applying ML algorithms to generate predictions. I am not a linguist, but I would not allow anyone on my data science team to conduct any kind of text analysis without the computational linguists working on the data first. Eularis lives by its results, so we need to get the most from the data to ensure the best accuracy and outcomes for our clients.

For more information on this topic, please contact the author – Dr Andrée Bates – at Eularis: https://www.eularis.com.

Found this article interesting?

To learn more about how Eularis can help you find the best solutions to the challenges faced by healthcare teams, please drop us a note or email the author at abates@eularis.com.

Contact Us

Write your name, email and enquiry, and we will get back to you as soon as we can.