Elon’s Tremendous Tweets
Natural Language Processing
One of the earliest things humans learn is how to use words to express what we’re feeling and thinking. Humans are pretty good at this; society and every relationship we have depend on our ability to communicate and understand others (at least mostly understand).
How do we teach this to a computer? How can a computer understand the meaning and sentiment behind words? One of the ways we’ve found is through natural language processing (NLP).
Now, to be clear, it is incredibly difficult to have a computer interpret all of the sentiment behind even a simple statement. Take the tweet above: it’s certainly positive, but there’s also admiration, excitement, and more behind it. Or, depending on context, the tweet could be entirely sarcastic.
In order to simplify the task for the machine, we would take this statement at face value and simply assign it one of three values: positive, neutral, or negative.
Luckily enough, there’s a large dataset with tweet text, sentiment, date, and user included, available here.
Word Reduction
Word reduction or normalization involves taking different words and reducing them down to some root word.
This is critical to making the model more computationally efficient, since it reduces the amount of information the model is given, and it helps the model pick up on trends in sentiment across different forms of the same word.
Generally there are two approaches: stemming and lemmatization. Stemming trims words down to a root by chopping off suffixes, while lemmatization maps each word to its dictionary form (its lemma). Let’s use lemmatization, at least until the data becomes too large to handle efficiently.
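To make the difference concrete, here’s a minimal sketch using NLTK (one common library choice; the exact preprocessing in this project may differ):

```python
# Minimal sketch using NLTK; one common library choice, not necessarily
# the exact preprocessing used in this project.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # the lemmatizer needs the WordNet corpus

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))          # 'studi' -- crude suffix chopping
print(lemmatizer.lemmatize("studies"))  # 'study' -- a real dictionary word
print(stemmer.stem("feet"))             # 'feet'  -- irregular forms are missed
print(lemmatizer.lemmatize("feet"))     # 'foot'  -- the lemmatizer handles them
```

Stemming can produce non-words like “studi,” while lemmatization always returns a real dictionary word, which is part of why we prefer it here.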
Model Selection, Building, and Training
Building
Let’s choose from two different model styles: a linear support vector classifier (LSVM) and a standard logistic regression. Let’s call these models DirectionandMagnitude (LSVM) and boring old LogRegNLP (logistic regression).
LSVM
The goal of SVM is to find the optimal hyperplane (an (n-1)-dimensional object that separates n-dimensional space into two disconnected regions) separating the classes of data.
Kernel functions extend SVMs to data that are not linearly separable by mapping the data into a new space; in this case, we use a linear kernel.
The support vectors are the data points lying closest to the hyperplane. They are generally the most difficult points to classify, and they are the ones used to build the loss function.
The goal is to maximize the distance between the hyperplane (the object separating the data) and the support vectors (the closest points).
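For the separable case, this can be written down directly. With labels $y_i \in \{-1, +1\}$, the standard hard-margin formulation is

$$\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \ \text{ for all } i,$$

where minimizing $\lVert w\rVert^2$ is equivalent to maximizing the margin $2/\lVert w\rVert$. In practice, implementations use a soft-margin variant that adds slack terms for points that violate the margin.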
LSVMs are generally recommended for text classification, due to the linear nature of text (at least English), their computational efficiency (important given the sheer number of features text generates), and their speed, which for large text datasets can be a serious concern.
Logistic Regression
Logistic regression is a much older technique than SVM approaches. The regression returns, for each data point, the probability that it lies in one class or the other, with the classification threshold at 0.50.
The probabilities and the characteristic S-shaped curve of logistic regression come out of the following equation, where X is the predictor and p is the response:
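$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$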
The coefficients of logistic regression come out of maximum likelihood estimation (MLE). The likelihood is constructed from the probability of the observed responses Y given the values of the predictor variable X, viewed as a function of the coefficients.
Maximizing the likelihood, typically by applying a root-finding method (say Newton’s or the secant method) to its derivative, allows us to find the values of the estimators and construct the equation above.
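Concretely, for binary labels $y_i \in \{0, 1\}$, the log-likelihood being maximized is

$$\ell(\beta_0, \beta_1) = \sum_{i=1}^{n} \Big[\, y_i \log p(x_i) + (1 - y_i) \log\big(1 - p(x_i)\big) \Big],$$

and setting its derivatives with respect to the coefficients to zero gives the equations that a method like Newton’s solves.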
Logistic regression is much less computationally expensive and handles larger datasets well. The probabilities it produces can also be read as confidence estimates, indicating how sure the model is about each classification.
Selection and Training
Now that we’ve learned about the basis for these different models, let’s look at how DirectionandMagnitude and LogRegNLP do on this task. Both models were trained on 190,000 data points, and their performance was evaluated on 10,000 data points. Both of these are much smaller subsets of the full dataset, to save my poor CPU.
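For reference, a training setup along these lines might look like the following. This is a minimal sketch using scikit-learn; the file name, column names, and vectorizer settings are assumptions, not the project’s exact code.

```python
# Minimal sketch with scikit-learn; "tweets.csv" and its column names
# are hypothetical stand-ins for the real dataset.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("tweets.csv")  # assumed columns: "text" (lemmatized), "sentiment" (0/1)
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["sentiment"],
    train_size=190_000, test_size=10_000, random_state=42)

# DirectionandMagnitude: TF-IDF features into a linear SVM
direction_and_magnitude = make_pipeline(TfidfVectorizer(), LinearSVC())
direction_and_magnitude.fit(X_train, y_train)

# LogRegNLP: the same features into a logistic regression
log_reg_nlp = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
log_reg_nlp.fit(X_train, y_train)

print("LSVM accuracy:  ", direction_and_magnitude.score(X_test, y_test))
print("LogReg accuracy:", log_reg_nlp.score(X_test, y_test))
```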
DirectionandMagnitude
Let’s see how this LSVM model performed on the testing data.
Overall, it seems like this model can identify the sentiment of the text data correctly 81.6% of the time. Let’s look at more of the details on how this model performed.
It seems like there wasn’t really a significant difference in accuracy between classifying a sentiment as negative (0) and classifying it as positive (1).
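The per-class breakdown above can be produced with something like this sketch, reusing the trained pipeline and test split from earlier:

```python
from sklearn.metrics import classification_report

# Per-class precision and recall for the LSVM; assumes the
# direction_and_magnitude pipeline and X_test/y_test from the sketch above.
preds = direction_and_magnitude.predict(X_test)
print(classification_report(y_test, preds, target_names=["negative", "positive"]))
```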
LogRegNLP
Let’s see how the logistic regression model handled the testing data.
It seems as though this model can identify sentiment slightly more accurately than the LSVM, getting it right 82.63% of the time. As with the LSVM, let’s look at the details.
Similar to the LSVM, the logistic regression didn’t seem to have a significant difference between the accuracy of classification as positive or negative.
It seems like LogRegNLP performed slightly better than DirectionandMagnitude. However, given how close they were in performance, I will still use DirectionandMagnitude to determine the sentiment of some other text data.
Applications
Does Elon Musk Have a Positive Message?
Now let’s use our model to determine some sentiments. An NLP model could be useful to companies for determining how people feel about them: sentiment analysis for marketing campaigns, brand or product sentiment, and so on.
But given that this model was trained on tweet data, let’s analyze some tweet data. When I think of Twitter, I generally think of two personalities: Elon Musk and Donald Trump. My friend Caleb Stevens, who coded this project with me, decided to do his analysis on Donald Trump, so I will do mine on Elon Musk.
The question “Does Elon Musk have a positive message?” can be answered simply by running the DirectionandMagnitude model on his tweet data (after following the previous steps) and seeing how many are rated positive compared to the total number of tweets.
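In code, that amounts to something like the following sketch; "elon_tweets.csv" and its column name are hypothetical, and direction_and_magnitude is the trained pipeline from earlier.

```python
import pandas as pd

# Hypothetical file of tweets pulled via the Twitter API; the file and
# column names are assumptions, not the project's exact data.
elon = pd.read_csv("elon_tweets.csv")

# The pipeline vectorizes the text itself, so we can predict directly
# (after applying the same lemmatization step used in training).
preds = direction_and_magnitude.predict(elon["text"])
positive_share = (preds == 1).mean()
print(f"{positive_share:.1%} of tweets rated positive")
```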
Elon Musk’s tweets from 2016 through 2020 (gathered using the Twitter API) should give a good idea of his overall sentiment. When this tweet data is run through DirectionandMagnitude, a solid 74.5% of those tweets are rated as positive. Good for Mr. Musk!
Overall, I’d say Elon Musk is a pretty positive guy, at least on Twitter.