Using Natural Language Processing to evaluate correlations between sentiment and trading activity.
Opinions shared on the social media platform Twitter influence market data (number of sales, sales volume, floor price and market capitalization) for non-fungible tokens (NFTs), but to what extent? This article examines the strength of the relationship between short social media messages and market data for NFT collections. The working hypothesis is that market movements are correlated with tweet sentiment.
The dataset consists of tweets and market data from 90 of the largest NFT collections by market capitalization, which are considered to be representative of the NFT market. The aim is to calculate a macro sentiment index based on a natural language tool. The index indicates whether the mood of the NFT market is more positive or negative. The sentiment index is intended to provide investors with a reliable assessment of the NFT market and to help them decide whether it makes sense to invest or sell at this time.
There are several steps involved. The first step is to get the data. The selected accounts are followed and their tweets are exported. Pre-processing aims to present all the elements of the text in such a way that they can be evaluated by natural language processing. This includes, for example, replacing emoticons and emojis with the corresponding text. Natural language processing then gives the text a score that indicates whether the content is positive, neutral or negative. The approaches used look at the words, sentence order and grammar of the underlying tweet.
The tweets used in this work are taken from a database we have built. The Twitter API is used to retrieve the tweet and additional information beyond the tweet itself. This includes the UserID, the creation date, the number of likes, the number of retweets, the name of the collection and much more information that is not necessarily needed for this task. Since a tweet is limited to 280 characters, every element in the tweet must be evaluated and taken into account. This requires pre-processing steps, which are described below.
The first step is to replace abbreviations. These include common abbreviations that are used independently of specific topics, such as “asap” for “as soon as possible”. Similarly, abbreviations used in the financial and crypto/NFT sectors in particular, such as “dyor” for “do your own research”, will be replaced by the full words. For the abbreviations in the NFT space, a directory of standard abbreviations was created. This is necessary because abbreviations cannot be recognised by the sentiment algorithms used later.
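The abbreviation replacement can be sketched with a lookup table and a word-boundary regex. The entries below are a small hypothetical excerpt of the abbreviation directory described above, not the actual list used:

```python
import re

# Hypothetical excerpt of the abbreviation directory; the real list is larger.
ABBREVIATIONS = {
    "asap": "as soon as possible",
    "dyor": "do your own research",
    "gm": "good morning",
}

def expand_abbreviations(text: str) -> str:
    """Replace known abbreviations with their full wording (word-boundary match)."""
    pattern = re.compile(r"\b(" + "|".join(ABBREVIATIONS) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: ABBREVIATIONS[m.group(0).lower()], text)
```

Matching on word boundaries avoids accidentally rewriting abbreviations that appear inside longer words.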
The next step was to remove URLs pointing to third-party sites or other websites, as these should not be evaluated. There is a risk that a website with the URL “www.everythingisgood.com” would be rated positively by the wording, but the tweet would have a negative bias. This will lead to an incorrect sentiment score.
Next, emojis and emoticons are translated into text. The “emot” package, which can be used for Python, is particularly useful for this purpose. An example use case for this package is replacing “:-)” with “happy face smiling”. Emoticons and emojis cannot be recognised by the dictionary-based and rule-based approaches used later, but the text stored in place of the emoticon can. Emoticons are used to express emotions, so they need to be replaced with the appropriate text.
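The article uses the “emot” package for this step; as a simplified stand-in, the idea can be sketched with a small manual dictionary, where the descriptions are illustrative rather than the package's exact output:

```python
# Simplified stand-in for the "emot" package: a small emoticon dictionary.
# The descriptions are illustrative, not emot's exact replacement strings.
EMOTICONS = {
    ":-)": "happy face smiling",
    ":-(": "sad face frowning",
    ":D": "laughing face",
}

def replace_emoticons(text: str) -> str:
    """Swap each known emoticon for a textual description the NLP models can score."""
    for emoticon, description in EMOTICONS.items():
        text = text.replace(emoticon, description)
    return text
```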
Fourth, similar to the URL, the HTML tags are removed. The content of these tags gives a false impression of the overall sentiment score of the tweet. Furthermore, this is not content that was written by the user and does not give an insight into their sentiment.
Next, hashtags are presented in a way that allows them to be evaluated. This means that the hashtag character before the actual hashtag is removed so that there are no special characters in the string. This is important because hashtags often give insight into sentiment and trends.
Sixthly, the name tags of other users are removed, as these can also distort the results of the sentiment analysis. Moreover, this is content that was not written by the users themselves, and the name of another user or account provides no information about the sentiment of the creator.
Finally, all words are converted to lowercase using Python’s “casefold” string method. This is a general formatting step to avoid future problems. Figure 3 shows the impact of the pre-processing.
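The remaining cleaning steps (URLs, HTML tags, hashtags, name tags, lowercasing) could be combined roughly as follows; the regular expressions are illustrative simplifications, not the exact patterns used in the study:

```python
import re

def preprocess(tweet: str) -> str:
    """Apply the cleaning steps described above to a single tweet."""
    tweet = re.sub(r"https?://\S+|www\.\S+", "", tweet)  # drop URLs
    tweet = re.sub(r"<[^>]+>", "", tweet)                # drop HTML tags
    tweet = re.sub(r"@\w+", "", tweet)                   # drop user name tags
    tweet = tweet.replace("#", "")                       # keep hashtag words, drop '#'
    tweet = tweet.casefold()                             # lowercase via casefold
    return re.sub(r"\s+", " ", tweet).strip()            # tidy whitespace
```

A quick example: `preprocess("Check www.everythingisgood.com #NFT @alice <b>WOW</b>")` keeps the hashtag word but removes the URL, the name tag and the markup.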
The blue lines show that the total number of tweets rated as 100% neutral has decreased from 23.1% to 21.1%. The average positive and negative score of each tweet increases, which means that the existing tweets can be better evaluated. In particular, the average negative score increases from 0.0227 to 0.0446, an increase of almost 100%. In general, there are more tweets to rate and these tweets have a higher negative/positive rating.
Natural Language Processing
Four algorithms are used to score the adjusted tweet. The common denominator of these approaches is that they perform well on text analysis, especially on short messages. In addition, all four approaches produce a score ranging from -1 to +1, where -1 indicates negative sentiment, 0 neutral sentiment and +1 positive sentiment. Four approaches are used because a single approach alone is not reliable. In the subsequent selection, a tweet is only taken into account if at least two of the four approaches agree on the same class (positive, neutral or negative).
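The agreement rule could be sketched as follows, assuming each approach has already returned a score in [-1, +1]; note that with four scores and three classes, at least two approaches always share a class, so the sketch simply returns that majority class:

```python
from collections import Counter

def label(score, eps=1e-9):
    """Map a score in [-1, +1] to a discrete sentiment class."""
    if score > eps:
        return "positive"
    if score < -eps:
        return "negative"
    return "neutral"

def majority_label(scores):
    """Return the class that at least two of the four approaches agree on.
    With four scores and three classes, such a class always exists."""
    cls, count = Counter(label(s) for s in scores).most_common(1)[0]
    return cls
```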
The first is the SenticNet algorithm, a lexicon-based approach. Unlike traditional methods that rely on statistical techniques, SenticNet uses a combination of semantic analysis and machine learning to capture emotions more accurately. It takes into account words, grammar and sentence structure to establish relations between terms. SenticNet’s default output is a four-dimensional matrix with scores for “introspection”, “temper”, “attitude” and “sensitivity”, of which the “attitude” score in particular is used for further steps. As other sources show, SenticNet is particularly well suited to sentiment analysis of tweets.
The second lexicon-based approach is the SentiWordNet algorithm. This algorithm is organized in cognitive synonyms called synsets. The background to this is that the focus is mainly on individual words and their meaning. Grammar and sentence structure are not directly considered. An analysis of sentiment derived from the context of a sentence is therefore not possible. Nevertheless, SentiWordNet is particularly well suited for short texts, since here the actual sentiment often depends on a few words.
The third algorithm used is the rule-based Valence Aware Dictionary and sEntiment Reasoner (VADER) analyser. This algorithm was developed specifically for social media and, unlike the two lexicon-based algorithms, it combines lexical and heuristic rules. VADER takes into account individual words, sentence structure, grammatical rules and the context of the words to assign a reliable sentiment score to short texts. Each tweet scored by these algorithms receives a single numerical value that represents the sentiment of the text. Of course, this score only quantifies whether the content of the text is positive, neutral or negative.
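To illustrate the difference between a purely lexical and a lexical-plus-heuristic approach, here is a toy scorer in the spirit of VADER. It is not the real algorithm; the lexicon entries, booster weights and negation handling are all invented for the example:

```python
# Toy illustration of the lexical-plus-heuristic idea behind VADER (not the
# real algorithm): word valences from a lexicon, adjusted by negation and
# booster heuristics, clamped to [-1, +1]. All values here are invented.
LEXICON = {"good": 0.5, "great": 0.8, "bad": -0.5, "terrible": -0.8}
BOOSTERS = {"very": 1.3, "extremely": 1.5}
NEGATIONS = {"not", "never", "no"}

def toy_score(text):
    words = text.casefold().split()
    total = 0.0
    for i, word in enumerate(words):
        valence = LEXICON.get(word)
        if valence is None:
            continue
        if i > 0 and words[i - 1] in BOOSTERS:       # intensify "very good"
            valence *= BOOSTERS[words[i - 1]]
        if any(w in NEGATIONS for w in words[max(0, i - 3):i]):
            valence *= -0.7                          # dampen and flip "not good"
        total += valence
    return max(-1.0, min(1.0, total))
```

A plain dictionary lookup would score “not good” positively; the heuristic layer is what flips it, which is why rule-based approaches tend to do better on conversational short texts.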
The final algorithm used is the robustly optimised Bidirectional Encoder Representations from Transformers approach (roBERTa). This algorithm was developed by Facebook AI Research as an extension of the original BERT model. In this case, a pre-trained model focused on Twitter data was used. RoBERTa considers the context of words in a sentence as well as their relationship to surrounding words. This approach, therefore, captures not only meaning but also semantic structure. Because it is a neural network model, it is computationally intensive and takes much longer to analyse unstructured data than the approaches described above.
The following figure shows the progression of the sentiment curve for each approach. It can be seen that SentiWordNet, VADER and roBERTa follow a similar course and, in particular, agree at the peak shortly after 2022-12-25.
The NFT market data used for comparison with the sentiment data are the number of sales, the volume of sales, the market capitalisation and the floor price. The market data has been scaled from 0 to 1 in order to display it on a graph. The data shown is based on all transactions made on the OpenSea platform. The following figure shows the market data over the same period as the sentiment data.
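The 0-to-1 scaling mentioned above is a standard min-max normalisation, which could be sketched as:

```python
def minmax_scale(values):
    """Scale a series to [0, 1] so heterogeneous market metrics share one axis."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant series: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

This preserves the shape of each curve while making number of sales, volume, market capitalisation and floor price comparable on a single chart.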
It is clear that volume, market capitalisation and floor price have strong peaks in the data. The raw data shows that the NFT market is heavily influenced by market manipulation, known as wash trades. In wash trades, NFT owners sell their NFTs to an address they control at a price well above or below what the NFT is actually worth. The aim is to artificially influence the price of the collection. This is because the number of NFTs sold does not increase over the same period, while the prices and volumes traded change dramatically.
For hit rates below 50%, more than half of the predictions are wrong. Both VADER and SenticNet perform particularly badly in this first comparison, while roBERTa shows the highest correlation. In this first comparison, the sentiment values were aggregated over a period of 12 hours and a prediction was then made for the next two hours. The next step is to determine which lag period works best. The sentiment values are still aggregated over a 12-hour window; what varies is the delay after which a reaction can be observed in the market, and different time periods are examined.
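A hit rate of this kind could be computed as the fraction of windows in which the sign of the sentiment change matches the sign of the market change a given number of windows later. This is a hypothetical simplification of the evaluation described above, not the study's exact procedure:

```python
def hit_rate(sentiment_changes, market_changes, lag=1):
    """Fraction of windows where the sign of the sentiment change predicts the
    sign of the market change `lag` windows later (illustrative simplification)."""
    pairs = list(zip(sentiment_changes, market_changes[lag:]))
    hits = sum(1 for s, m in pairs if (s > 0) == (m > 0))
    return hits / len(pairs)
```

Sweeping `lag` over candidate values and comparing the resulting hit rates is one way to produce a table like Table 2.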
Table 2 shows that the correlation values peak at a lag of 6 hours. There is an outlier in the number of sales for the 12-hour period. Therefore, for all further considerations, the market data is aggregated for a period of 6 hours. Compared to the previous table, the results are significantly better. It can therefore be assumed that the sentiment data for a 12-hour period will lead to an immediate reaction within the next 6 hours. As the moving average is used as a time series forecasting technique, it is still necessary to check how many previous values are included in the forecast. In the classical financial sector, there are moving averages in technical analysis that include 50, 100 or even 200 of the previous values. As the NFT market is much more volatile, the size of the window length varies between 2 and 20.
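The moving-average forecast with a configurable window length could be sketched as:

```python
def moving_average_forecast(series, window=2):
    """Forecast each point as the mean of the previous `window` observations."""
    return [sum(series[i - window:i]) / window for i in range(window, len(series))]
```

Varying `window` between 2 and 20, as described above, then amounts to re-running this forecast and comparing the resulting hit rates.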
Table 3 shows that a window length of 2 upstream values gives the best results. This is especially true for the floor price and the market capitalisation. The sales volume varies little and is only minimally affected by the chosen window length. Finally, we examine the influence of tweet attributes on the forecast. The roBERTa score is extended to include the like count, the retweet count and the frequency, i.e. the number of tweets present in the 12-hour period.
Table 4 shows that extending the actual sentiment score only adds noise to the data. With the exception of the number of sales in the extension with the number of retweets, only a deterioration of the results can be expected. The sentiment score, therefore, remains a stand-alone score based on the content of the tweets.
In summary, the roBERTa approach is most consistent with the NFT market data. For further research, it would be interesting to determine to what extent the hit rate can be improved. For example, are there times when the hit rate is significantly higher than average? In addition, the sentiment could be adjusted by replacing Twitter accounts with other accounts until the most important sentiment drivers are reflected in the sentiment. In addition to predicting a move, it would also be very useful to predict the strength of the move. This means that the percentage change in the sentiment index can also be used to predict the percentage change in market data. Figure 7 clearly shows that the number of sales is the parameter where there are no large outliers.
Figure 7 shows that there is a visible correlation between the number of sales and the roBERTa score. The correlation is particularly striking at the peak shortly after 2023-01-25. In further analysis, the market data is categorised. The subdivision is based on the floor price of the collections. A distinction is made between:
- Low ⇒ Collections with a floor price below 0.5 ETH
- Middle ⇒ Collections with a floor price between 0.5 ETH and 2 ETH
- High ⇒ Collections with a floor price above 2 ETH
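The categorisation by floor price can be expressed as a simple bucketing function; the handling of the exact boundary values is an assumption for illustration:

```python
def price_category(floor_price_eth):
    """Assign a collection to the low/middle/high bucket by its floor price.
    Boundary handling (< 0.5, 0.5..2, > 2) is an assumption for illustration."""
    if floor_price_eth < 0.5:
        return "low"
    if floor_price_eth <= 2:
        return "middle"
    return "high"
```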
In order to translate the previous findings into a trading strategy, certain aspects of the data are examined more closely, particularly those where the hit rate is significantly higher than average. The time period for which the predictions are made is suitable for such an investigation: initial results show, for example, that the hit rate for the period between 0 and 6 a.m. is significantly above average. Other promising filters are the number of tweets within certain time periods and the removal of data points affected by price or market manipulation. This applies in particular to collections with a floor price greater than 0.5 ETH and less than 2 ETH, where first evaluations show a hit rate between 70% and 80%. This is especially true for the floor price, which is certainly the most interesting market value for a collection, as money can be made from buying and selling.
In order to be able to predict the strength of the movement in addition to the actual market movement, it is helpful to establish a percentage change correlation. The number of sales would be a good parameter to start with. In order to use the number of sales, the floor price and the market capitalisation, the existing raw data would first have to be post-processed to at least remove the periods of price manipulation on a more reliable basis.
The final step in this process is to back-test the correlations. This is set up so that an amount x is invested at the beginning of the period. The trades are then executed, taking into account constraints such as transaction fees. At the end of the period, the final amount should ideally be greater than the amount x invested at the beginning.
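Such a back-test could be sketched as a naive long-only simulation with a proportional transaction fee. The fee value and the all-in/all-out trading rule are assumptions for illustration, not the study's actual setup:

```python
def backtest(prices, signals, start_capital=1.0, fee=0.02):
    """Naive long-only back-test: buy everything on a +1 signal, sell everything
    on a -1 signal, paying a proportional fee per trade. Purely illustrative."""
    cash, units = start_capital, 0.0
    for price, signal in zip(prices, signals):
        if signal > 0 and cash > 0:          # buy with all available cash
            units = cash * (1 - fee) / price
            cash = 0.0
        elif signal < 0 and units > 0:       # sell the entire position
            cash = units * price * (1 - fee)
            units = 0.0
    return cash + units * prices[-1]         # mark any open position to last price
```

Comparing the returned amount against `start_capital` over the test period is the success criterion described above.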
About the author
Max is a PhD student at the joint project “DeFi Risk Advisor AI” of the University of Ulm & Blockbrain, specialising in the convergence of decentralised finance, non-fungible tokens (NFTs) and AI. Due to his previous studies in natural sciences as well as his interest in start-ups and distributed ledger technology, this research area fits perfectly with his strengths and interests.