Stock Movement Modeling Based on the Analysis of Negative Correlation

This research presents a data-driven modeling method that derives a combined trading model from the analysis of negative correlations among the top five most active stocks in each sector of the Thailand stock market. The negative movements are computed from the closing price direction of major stocks in the eight biggest sectors. Stocks that are highly negatively correlated across market groups are then used to build predictive trading models with three algorithms: regression analysis, generalized linear modeling, and chi-square automatic interaction detection. An ensemble combining the best two models is then created. The experimental results show that the proposed method of trading based on negative movement analysis predicts the closing price of an active stock with a low error rate of around 1.86%.


Introduction
Equity securities such as stocks, which have monetary value and can be traded, are a primary financial instrument for companies to raise new capital to fund their operations. Investors traditionally earn profit from the securities they hold by selling them on the stock exchange at a higher price than they paid. Buying low and selling high is a common strategy for earning profit, but applying it successfully requires complicated analysis to manage a trading plan.
Some analysts use knowledge of fundamental aspects, such as a company's financial statements, to evaluate the intrinsic value of that company's security. Measuring stock value based on a company's performance is known as fundamental analysis. Another measurement paradigm, used by many short- and medium-term traders, is technical analysis, which evaluates stock value based on past trading statistics such as prices and volumes. The work proposed in this paper follows the line of technical analysis.
In the technical analysis of historical stock price movement patterns, visualization tools such as candlestick and bar charts are commonly used to identify movement trends. Many movement indicators are used for graph plotting, for instance, moving average (MA), moving average convergence divergence (MACD), and relative strength index (RSI). Trend identification is mostly based on the movement pattern of a single stock of interest. Trend forecasting for a specific stock over consecutive time intervals relies on serial correlation to determine the future price of the stock. Such correlation is based on univariate analysis.
Another kind of correlation analysis that has recently appeared in the literature is the cross-correlation between markets or among different sectors in the same market [1]-[7]. Positive cross-correlation between markets or among sectors is called co-movement because the index changes or price changes are in the same direction. In this work, we propose the opposite study of stock movement: our analysis is based on the negative correlation between market sectors. We call this kind of study reverse movement analysis.
Our idea is based on the observation that, in the Thailand stock market, major stocks in the same sector are positively correlated with each other and move in the same direction. Investing only in positively correlated stocks leaves a portfolio vulnerable to great loss. Managing a portfolio to include some negatively correlated stocks can offer a degree of protection against market volatility. After the literature review section, we explain our methodology for analyzing the reverse movement of stocks across sectors. We then present the implementation of the methodology and the experimental results. In the conclusion, we discuss the advantages of the proposed approach as well as future modifications of this research to overcome some current limitations.

Literature Review
Accurately predicting the direction of stock movement is the main focus of both fundamental and technical analysts; they differ only in the data used for the analysis. Fundamental analysts base their predictions on the performance of the industry and other economic factors, whereas technical analysts evaluate their investments based on statistics generated by market activity, such as historical stock prices and trading volumes. The majority of research work in the literature focuses on technical analysis indicators to derive trading rules [8] and to forecast stock price direction [9]-[12].
The stock direction forecasting techniques that many researchers over the past decade have shown to yield satisfactory results are those based on machine learning and soft computing [13]-[16]. Machine learning tasks can be categorized into three main groups: classification (on a categorical target) or prediction (on a numerical target), clustering, and association. The machine learning techniques most applicable to the finance domain belong to the prediction group, and the most extensively adopted learning algorithm is the artificial neural network (ANN). Tilakaratne et al. [17] applied ANN to predict trading signals of the Australian index. Ticknor [18] introduced a Bayesian regularized network that assigns probabilities to the network weights; the proposed ANN was tested on Microsoft and Goldman Sachs stocks. Chen and Seneviratna [19] used a feed-forward back-propagation ANN to predict the price index of the Colombo stock exchange, Sri Lanka.
Researchers have also adopted other learning algorithms to predict stocks, including decision tree induction [20], [21], genetic algorithms [22]-[24], state space modeling [25], and optimization techniques [26]. To improve the predictive accuracy of the induced models, many researchers have considered a fusion approach that combines results from a model ensemble, such as a hybrid adaptive neuro inference system [27], bagging of tree-based classifiers [28], [29], an integrated forecasting system using a wavelet neural network and artificial bee colony [30], a deep learning approach incorporating two-directional principal component analysis [31], and a support vector machine integrated with probabilistic AdaBoost [32].
Clustering and association analyses are two machine learning tasks that have also been applied to the finance domain. Renugadevi et al. [33] applied hierarchical agglomerative clustering to rank Indian stocks based on the difference between the closing price and the opening price. Ta and Lin [34] adopted k-means and hierarchical clustering methods to visualize patterns in the Vietnamese stock market. Peachavanish [35] also detected Thailand stock trends using cluster analysis. Hsieh et al. [36] applied the concept of association analysis by discovering closed itemsets among numerous profit rules, with the main purpose of rapidly searching for high-profit rules for real-time trading. Bhoopathi and Rama [37] also applied association analysis to derive efficient trading rules.
Stock clustering based on similarity in price movement can help investors make decisions on managing their portfolios. Association among stocks also supplements portfolio management in the sense that it reveals movement directions that tend to be helpful for trading decisions. Cross-correlation between stock markets and interactions among market sectors have been extensively studied recently [38]. Studies of cross-correlation and market movement are mostly based on correlation analysis and the application of random matrix theory [39]-[42].
In this work, we focus our study on cross-correlation among market sectors. Instead of performing co-movement analysis as other researchers have done, we propose to anticipate stock price movement from the observed inverse correlation of stocks from different sectors. The details of our proposed research framework are explained in the next section.

Methodology for Trading Model Creation
Our trading model is data-driven in that the model is learned through the systematic analysis of historical trading events. Our model creation process consists of seven steps, as shown in Fig. 1. The steps are explained as follows:
Step 1 Data Extraction: Investors specify the market sectors of interest and the trading indicators, such as opening price, closing price, trading volume, and so on. The proposed systematic process then extracts, from each sector, data for the top five stocks based on historical trading volumes.
Step 2 Data Transformation: The extracted data are ordered in such a way that rows are consecutive trading dates and the indicators of all selected stocks are in the same row. For example, suppose traders select only two trading indicators: opening price and closing price. If they specify only two market sectors, then ten stocks will be selected in step 1. We name the selected stocks S1, S2, …, S10. Thus, each row of the ordered data in this step holds a trading date followed by the opening and closing prices of S1 through S10.
International Journal of e-Education, e-Business, e-Management and e-Learning
Step 3 Reverse Movement Analysis: At this step, traders identify the target indicator. For instance, if we are interested in predicting the closing price of a specific stock, then the closing price of that stock will be the focus of our movement analysis. Based on the example in step 2, the correlations among the closing prices of all ten selected stocks are computed and the negative ones are identified.
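The row layout of step 2 and the correlation computation of step 3 can be sketched as follows. This is a minimal illustration only: the records, the column names such as S1_cls, and the use of the Pearson coefficient are assumptions, since the text does not state which correlation coefficient is computed.

```python
from math import sqrt

# Hypothetical long-format records: (trading date, stock, opening price, closing price).
records = [
    ("2016-01-04", "S1", 10.0, 10.5),
    ("2016-01-04", "S2", 20.0, 19.0),
    ("2016-01-05", "S1", 10.5, 11.0),
    ("2016-01-05", "S2", 19.0, 18.5),
    ("2016-01-06", "S1", 11.0, 10.8),
    ("2016-01-06", "S2", 18.5, 18.9),
]


def to_wide(records):
    """Step 2: one row per trading date, all stocks' indicators side by side."""
    wide = {}
    for date, stock, opn, cls in records:
        row = wide.setdefault(date, {})
        row[f"{stock}_opn"] = opn
        row[f"{stock}_cls"] = cls
    # ISO-formatted dates sort chronologically as plain strings.
    return [dict(date=d, **wide[d]) for d in sorted(wide)]


def pearson(xs, ys):
    """Step 3 (assumed coefficient): Pearson correlation of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rows = to_wide(records)
s1_cls = [r["S1_cls"] for r in rows]
s2_cls = [r["S2_cls"] for r in rows]
r_s1_s2 = pearson(s1_cls, s2_cls)  # negative: S1 and S2 close in opposite directions
```

In the toy data the two closing-price series move in opposite directions on every date, so the coefficient comes out strongly negative.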
Step 4 Stock Selection: Stocks from different market groups showing highly negative correlation are selected for further processing. By default, a negative correlation is considered high when its strength (absolute value) is at least 0.7. This threshold can be adjusted according to the traders' decision. The output of this step is a set of negatively correlated stocks from different market sectors.
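A minimal sketch of this selection rule is given below, assuming the pairwise coefficients are already available as a dictionary. The stock names, sector labels, coefficient values, and the function name select_reverse_pairs are hypothetical and serve only to illustrate the cross-sector filter with the default strength of 0.7.

```python
# Hypothetical sector membership and pairwise closing-price correlations.
sector = {"A1": "SectorA", "A2": "SectorA", "B1": "SectorB", "C1": "SectorC"}
corr = {
    ("A1", "A2"): 0.85,   # same sector, positive: not a candidate
    ("A1", "B1"): -0.75,
    ("A1", "C1"): -0.76,
    ("A2", "B1"): -0.40,  # negative but too weak
    ("A2", "C1"): -0.72,
    ("B1", "C1"): 0.10,
}


def select_reverse_pairs(corr, sector, threshold=0.7):
    """Keep cross-sector pairs whose correlation is negative with strength >= threshold."""
    selected = [
        (a, b, r)
        for (a, b), r in corr.items()
        if sector[a] != sector[b] and r <= -threshold
    ]
    # Strongest negative correlation first.
    return sorted(selected, key=lambda pair: pair[2])


default_pairs = select_reverse_pairs(corr, sector)
strict_pairs = select_reverse_pairs(corr, sector, threshold=0.75)
```

Raising the threshold from 0.7 to 0.75 drops the weakest of the three qualifying pairs, which mirrors how a trader could tighten the selection.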
Step 5 Data Imputation: This step is performed only if there are missing items in the set of negatively correlated stocks. We impute the missing attributes with the nearest neighbor method.
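The text does not give the details of the nearest neighbor method, so the following is only one plausible reading: each record with missing attributes borrows them from the single closest fully observed record, with distance measured on the attributes the incomplete record does have. The attribute values and the function name impute_nearest are hypothetical.

```python
from math import sqrt


def impute_nearest(rows):
    """Fill None entries from the closest complete row (1-nearest neighbor, assumed variant)."""
    complete = [r for r in rows if None not in r.values()]
    filled = []
    for row in rows:
        if None not in row.values():
            filled.append(dict(row))
            continue
        observed = [k for k, v in row.items() if v is not None]
        # Euclidean distance over the attributes observed in the incomplete row.
        nearest = min(
            complete,
            key=lambda c: sqrt(sum((row[k] - c[k]) ** 2 for k in observed)),
        )
        filled.append({k: nearest[k] if v is None else v for k, v in row.items()})
    return filled


# Hypothetical daily records; the last one lacks opening, highest, and lowest prices.
rows = [
    {"opn": 170.0, "hgh": 172.0, "low": 169.0, "cls": 171.0},
    {"opn": 180.0, "hgh": 183.0, "low": 179.0, "cls": 182.0},
    {"opn": None, "hgh": None, "low": None, "cls": 171.5},
]
filled = impute_nearest(rows)
```

The sketch assumes at least one complete record exists and that the attributes share a comparable scale; an attribute such as trading volume would need rescaling before entering the distance.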
Step 6 Model Creation: From the obtained set of negatively correlated stocks, the traders pick one stock as the target for the creation of predictive models. In our proposed methodology, we adopt three numerical prediction methods: linear regression, generalized linear model (GLM), and chi-square automatic interaction detector (CHAID).
GLM is an extension of the linear modeling framework that allows response variables that are not normally distributed [43]. The GLM framework includes a broad class of models such as ANOVA, Poisson regression, log-linear models, and many more. The generality of GLM is achieved through the choice of link function, which handles different kinds of data distribution.
CHAID [44] is a tree-based learning algorithm that recursively partitions a large dataset of mixed distributions into smaller data subsets with a homogeneous distribution at each terminal (or leaf) node. For a continuous response (or target) variable, each terminal node of the CHAID tree represents a regression-type predictor.
At the end of this step, our analysis methodology generates three predictive models to forecast stock price movement.
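For reference, the role of the link function in GLM can be written in the standard textbook form below. The notation (response y_i, predictors x_{ij}, coefficients \beta_j, link g) is introduced here for illustration and is not taken from the original text.

```latex
% Generic GLM: the link function g relates the conditional mean of the
% response to a linear predictor built from p explanatory variables.
\mu_i = \mathbb{E}[\,y_i \mid x_{i1}, \dots, x_{ip}\,], \qquad
g(\mu_i) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}

% Identity link with a normally distributed response:
% the model reduces to ordinary linear regression.
g(\mu_i) = \mu_i

% Log link (as used in Poisson regression): the mean is the
% exponential of the linear predictor.
g(\mu_i) = \log \mu_i \quad\Longrightarrow\quad
\mu_i = \exp\Bigl(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\Bigr)
```

The identity-link case explains why a GLM with a normal response can be expected to behave like linear regression on the same predictors.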
Step 7 Model Evaluation: The three generated predictive models have to be assessed on their predictive accuracy. We use the hold-out method for model evaluation. The set of negatively correlated stocks output by step 5 is split into two subsets in a proportion of around 70:30. The larger portion (70% of the data) is used for training and generating the three predictive models as explained in step 6, whereas the remainder (30% of the data) is used to test the models in this step. Among the three candidate models, the one with the highest accuracy (or the lowest prediction error) is selected as the final result of our methodology. If the accuracies of the three models do not differ distinctly, we propose a combination method that fuses two or three models into an ensemble.
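The hold-out evaluation can be sketched as follows. The data are synthetic, a one-predictor least-squares line stands in for the fitted models, and the split is chronological (the first 70% of rows for training); all three choices are assumptions, since the text does not say how the records are assigned to the two subsets.

```python
def holdout_split(rows, train_fraction=0.7):
    """Assumed chronological split: the first part trains, the rest tests."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def mae(actual, predicted):
    """Mean absolute error between two equal-length series."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Synthetic (predictor close, target close) pairs with a negative relationship
# and a small alternating disturbance of +/- 0.5.
data = [(x, 200 - 2 * x + (0.5 if i % 2 else -0.5)) for i, x in enumerate(range(10, 30))]

train, test = holdout_split(data)
slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
predicted = [slope * x + intercept for x, _ in test]
test_mae = mae([y for _, y in test], predicted)
```

The error is measured only on the rows the model never saw, which is what makes the hold-out estimate a fair basis for comparing candidate models.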

Experimentation and Model Evaluation Results
To demonstrate our proposed methodology on real-world data, we use daily trading data of the Thailand stock exchange acquired from the online source "http://siamchart.com/stock/". The stock trading data in our experiments cover January to December 2016 and comprise 244 records. We explain the implementation and experimentation details of all seven steps as follows:

Step 1 Data Extraction. We extract the top five most active stocks from each market sector. There are eight major sectors in the Thailand stock market. The sector names and the most active stocks in each sector are summarized in Table 1.
Step 2 Data Transformation. For all 40 extracted stocks, we select five indicators and the trading date for further analysis. The selected indicators are opening price (opn), highest value of the day (hgh), lowest value of the day (low), closing price (cls), and trading volume (vol). All 40 stocks, S1 … S40, with their attributes are arranged in the tabular format shown in Fig. 2.
Step 3 Reverse Movement Analysis. The correlations of all 40 stocks are then computed based on their closing price attribute. We focus our analysis on the industrial sector, where we found only one stock, STANLY, showing negative cross-sector correlation with SABINA in the Consump sector (coefficient = -0.75) and TPIPL in the Propcon sector (coefficient = -0.76). The negative movements among STANLY, SABINA, and TPIPL are shown graphically in Fig. 3.
Step 4 Stock Selection. In our experimentation, we require the correlation strength to be at least 0.75 and the direction to be negative. Based on the correlation analysis results, the set of negatively correlated cross-sector stocks is {STANLY, SABINA, TPIPL}.
Step 5 Data Imputation. The target of our prediction is the closing price of the STANLY stock. However, the STANLY trading data from step 4 contain missing values in 39 of the 244 records, in three attributes: Open, High, and Low. Thus, we perform data imputation with the nearest neighbor method.
Steps 6-7 Model Creation and Evaluation. To predict the closing price of the STANLY stock based on the cross-sector reverse movement of the SABINA and TPIPL stocks, we apply three numerical prediction algorithms: linear regression, GLM, and CHAID. For the GLM algorithm, we experiment with log and exponential link functions for the normal, Gaussian, and binomial distributions. The experimental results of GLM modeling are as good as those of linear regression. Both the GLM and linear regression models yield lower error than the CHAID model. The performances of the linear regression, GLM, and CHAID models are summarized in Table 2.

Final Trading Model
Based on the model evaluation results, the best models are linear regression and GLM, and the second best is CHAID. We thus perform another step of modeling by combining two models: linear regression and CHAID. To reduce model assessment bias, we split the 244 data records obtained from step 5 into two separate subsets. The larger subset (167 records, around 68%) is used as a training dataset to create a combined linear regression + CHAID model. The hold-out subset (77 records, around 32%) is used as a test set to evaluate the predictive performance of the combined model. Evaluation results of the combined regression + CHAID model, compared with the single linear regression and CHAID models, are given in Table 3. The results show that the combined regression + CHAID model predicts the closing price of a target stock more accurately than a single model. The mean absolute error (MAE) on out-of-sample data is as low as 3.19, whereas the linear regression and CHAID models predict the same test data with higher errors of 3.40 and 3.38, respectively. The combined regression + CHAID model is shown in Fig. 4.
Fig. 4. The combined regression + CHAID model to predict the closing price of STANLY stock.
When the combined model is applied to predict the value of the target variable, both the regression and CHAID predictions are made, and the scores from both are then combined into a single ensemble prediction. The advantage of scoring with multiple models is that it avoids the limitations of each single model and thus yields higher predictive accuracy.
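The way the two scores are merged is not specified in the text. The sketch below assumes the simplest option, an unweighted average of the two predictions, and uses made-up prediction values purely to show the mechanics; it is not the combination rule of the model in Fig. 4.

```python
def mae(actual, predicted):
    """Mean absolute error between two equal-length series."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def average_ensemble(*prediction_lists):
    """Assumed combination rule: unweighted mean of the models' predictions."""
    return [sum(values) / len(values) for values in zip(*prediction_lists)]


# Hypothetical closing prices and the two models' out-of-sample predictions.
actual = [170.0, 172.0, 168.0, 175.0]
regression_pred = [173.0, 170.0, 171.0, 172.0]
chaid_pred = [168.0, 175.0, 166.0, 177.0]

ensemble_pred = average_ensemble(regression_pred, chaid_pred)

mae_regression = mae(actual, regression_pred)
mae_chaid = mae(actual, chaid_pred)
mae_ensemble = mae(actual, ensemble_pred)
```

In this toy case the two models err in opposite directions on every record, so averaging cancels most of the error. Averaging does not guarantee an improvement when the models' errors point the same way.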

Conclusion
In this work, we propose a systematic approach that analyzes the reverse movement of stocks among different market sectors and extracts only highly negatively correlated stocks to build predictive models. Based on sample data from the Thailand stock exchange for the year 2016, we found that linear regression and CHAID are two accurate models. We then combine these statistical and tree-based modeling methods to generate the final trading model ensemble. In our experiment on predicting the closing price of the STANLY stock with the combined model, we observed a mean absolute error of 3.199 on 77 out-of-sample trading records. Given the mean closing price of the STANLY stock of 171.891, the error rate of our combined model is around 1.86%. This percentage error is computed as (3.199/171.891)*100.
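Written out, the error measures behind these figures are as follows. The MAE formula is the standard definition; the symbols y_i (actual closing price), \hat{y}_i (predicted closing price), and \bar{y} (mean closing price) are notation introduced here.

```latex
% Mean absolute error over the n = 77 out-of-sample records
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert

% Error rate relative to the mean closing price of STANLY
\text{error rate} = \frac{\mathrm{MAE}}{\bar{y}} \times 100
                  = \frac{3.199}{171.891} \times 100 \approx 1.86\%
```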
The proposed method creates a trading model ensemble from a statistical approach (linear regression) and a tree-based learning algorithm (CHAID). As the result is promising, we plan to extend our study by investigating other learning methods such as support vector machines and deep learning. We also plan to incorporate lagged correlation into the model building process to assist long-term trade forecasting.