Time Series with NLP Stock Prediction

Can we use Elon’s Tweets to make a better Multivariate Time Series Model?

Libraries

  • scipy
  • statsmodels
  • plotly
  • pandas_datareader
  • seaborn
  • streamlit
  • wordcloud
  • PIL
  • bs4
  • bokeh
  • gensim
  • huggingface
  • ktrain
  • pytorch
  • tensorflow

How Data was collected

  • Financial data was collected with pandas_datareader and Twitter data with GetOldTweets3
  • The user handle was elonmusk, and the collection period was Dec 1, 2011 to July 31, 2020
  • A non-parametric test comparing the no-tweet, personal-tweet, and business-tweet groups showed a difference in closing price among the three
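
A comparison of this kind can be sketched as follows. The write-up does not name the test that was used, so the Kruskal-Wallis H-test from scipy is an assumption here, and the prices are made-up numbers for illustration only.

```python
from scipy.stats import kruskal

# Hypothetical closing prices, grouped by the kind of tweet posted that day.
# These values are invented for the example; they are not the project's data.
no_tweet = [21.0, 22.5, 20.8, 23.1, 21.7, 22.0]
personal = [25.2, 26.8, 24.9, 27.3, 25.5, 26.1]
business = [30.4, 31.9, 29.7, 32.2, 30.8, 31.5]

# Kruskal-Wallis is a rank-based test for three or more independent groups.
# It does not assume the prices are normally distributed.
statistic, p_value = kruskal(no_tweet, personal, business)

print(f"H statistic: {statistic:.3f}, p-value: {p_value:.5f}")
if p_value < 0.05:
    print("At least one group differs in closing price.")
```

A small p-value only says that at least one group differs; it does not say which one, so a pairwise follow-up test would be needed for that.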

Text Classification

  • More traditional classification models were tried first, before moving to a transformer architecture.
  • Their F1 scores were low; the best was logistic regression at 0.62.
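
A baseline of this kind might look like the sketch below. The TF-IDF features, the macro averaging of the F1 score, the label names, and the example tweets are all assumptions made for illustration; the write-up only states that logistic regression reached 0.62.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Hypothetical labelled tweets (invented text, not the project's data).
train_texts = [
    "Model 3 production ramp is going well",
    "Tesla deliveries hit a new record this quarter",
    "New Gigafactory will start battery production soon",
    "Supercharger network expanding across Europe",
    "Just watched a great movie with the kids",
    "Coffee is the best invention ever",
    "Playing video games all weekend",
    "My dog loves the beach",
]
train_labels = ["business"] * 4 + ["personal"] * 4

test_texts = [
    "Tesla production numbers for the quarter look strong",
    "Great weekend at the beach with my dog",
]
test_labels = ["business", "personal"]

# Bag-of-words style features followed by a linear classifier.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(train_texts, train_labels)

predicted = baseline.predict(test_texts)
macro_f1 = f1_score(test_labels, predicted, average="macro")
print(f"macro F1: {macro_f1:.2f}")
```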

  • With BERT and DistilBERT, the F1 score was around 0.80
  • I used DistilBERT because the model is much smaller and would be much easier to launch as a web app

Time Series Models

  • Again, I tried sklearn models before turning to neural networks.
  • I tried XGBoost and random forest; however, given the structure of the data and how volatile Tesla’s stock price was, these algorithms probably wouldn’t be suitable.
  • I also tried ARIMA, but that is usually only used for univariate time series.

RNN, LSTM, GRU

  • GRU gave the best results.
  • I changed some of the GRU’s structure, namely the activation and the recurrent activation, which gave a slightly better prediction.
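
To show where those two settings enter a GRU, here is a minimal NumPy sketch of a single GRU step. It is not the model from this project: the write-up does not say which framework or which activations were used, so the layer sizes, random weights, and the piecewise-linear `hard_sigmoid` stand-in are all assumptions. In Keras, the corresponding knobs are the `activation` and `recurrent_activation` arguments of the GRU layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hard_sigmoid(x):
    # Piecewise-linear stand-in for the sigmoid (an illustrative choice).
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

def gru_step(x, h_prev, params, activation=np.tanh, recurrent_activation=sigmoid):
    """One GRU time step; the gates use recurrent_activation, the candidate uses activation."""
    W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h = params
    z = recurrent_activation(x @ W_z + h_prev @ U_z + b_z)  # update gate
    r = recurrent_activation(x @ W_r + h_prev @ U_r + b_r)  # reset gate
    h_candidate = activation(x @ W_h + (r * h_prev) @ U_h + b_h)
    # Blend the old state and the candidate state.
    return z * h_prev + (1.0 - z) * h_candidate

def init_params(n_features, n_units, rng):
    # Input weights, recurrent weights, and bias for each of: update, reset, candidate.
    params = []
    for _ in range(3):
        params.append(rng.normal(scale=0.5, size=(n_features, n_units)))
        params.append(rng.normal(scale=0.5, size=(n_units, n_units)))
        params.append(np.zeros(n_units))
    return params

def run_gru(sequence, params, n_units, **kwargs):
    h = np.zeros(n_units)
    for x in sequence:
        h = gru_step(x, h, params, **kwargs)
    return h

rng = np.random.default_rng(0)
n_features, n_units, n_steps = 3, 4, 5  # hypothetical sizes
params = init_params(n_features, n_units, rng)
sequence = rng.normal(size=(n_steps, n_features))

h_default = run_gru(sequence, params, n_units)
h_variant = run_gru(sequence, params, n_units, recurrent_activation=hard_sigmoid)
print("tanh + sigmoid:     ", np.round(h_default, 3))
print("tanh + hard_sigmoid:", np.round(h_variant, 3))
```

Swapping either function changes the final hidden state even with identical weights, which is why these two settings are worth tuning alongside the layer size.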

Out of sample data

  • A 0.90 train / 0.10 test split did not, I felt, give me much data to work with
  • I additionally used August 1, 2020 to August 13, 2020 as a validation set
  • The results show a huge miss on August 12, 2020, the announcement of the stock split.
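
The split described above can be sketched as follows. The frame is filled with random numbers on a business-day index, and the column name `close` is hypothetical; only the date ranges and the 0.90/0.10 ratio come from the write-up.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data on business days (random values, not real prices).
dates = pd.bdate_range("2011-12-01", "2020-08-13")
rng = np.random.default_rng(0)
frame = pd.DataFrame(
    {"close": rng.normal(size=len(dates)).cumsum() + 100.0},
    index=dates,
)

# Keep August 1 to August 13, 2020 aside as out-of-sample data.
in_sample = frame.loc[:"2020-07-31"]
out_of_sample = frame.loc["2020-08-01":"2020-08-13"]

# Chronological 0.90 / 0.10 split: no shuffling, so the test set
# always comes after the training set in time.
split_at = int(len(in_sample) * 0.90)
train = in_sample.iloc[:split_at]
test = in_sample.iloc[split_at:]

print(f"train: {len(train)} rows, up to {train.index.max().date()}")
print(f"test: {len(test)} rows, from {test.index.min().date()}")
print(f"out of sample: {len(out_of_sample)} rows")
```

Splitting by position rather than at random keeps future prices out of the training data, which matters for any time series model.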

What to improve on

  • Use 8-K data from the SEC: scrape the filings and analyze the text
  • Use prominent business tweeters instead of Elon, as they talk more about the business of Tesla
  • Instead of classifying tweets by type and sentiment, separate them into bins of gains and losses based on the text.

  • More information and details on the write-up can be found in the web app. The GitHub link is also below:

Tesla Web App

Tesla Github Repo