12.1 Order book analysis · Short-horizon price direction prediction from high-frequency limit order book data — via multi-class SVM

Source: https://uqer.io/community/share/5660665bf9f06c6c8a91b1a0

Abstract:

The content below follows the framework of the paper "Modeling high-frequency limit order book dynamics with support vector machines". Because the granularity of the available high-frequency data is limited, only part of that framework is implemented here. For a full understanding of the problem and its implementation, please read the paper. What follows is a brief walkthrough of the framework.

Model construction

The authors use both the message book and the order book as data sources. DataYes (通联) does not provide the former, so everything below uses only the level-1 order book snapshots with five bid/ask depth levels as model input. I only implement predicting the direction of the mid price from order book data, with three classes: upward, downward, and stationary. The bid-ask spread crossing task is handled in a similar way, so I omit it for now.

Feature selection

After preprocessing the order book data, we can extract the feature vectors we need. The features fall into three categories: basic, time-insensitive, and time-sensitive. From our data we can obtain all of the basic and time-insensitive features, plus some of the time-sensitive ones; see the figure below, or read the paper for details.

Figure 1: the three feature categories (basic, time-insensitive, time-sensitive)

    #importing packages
    import numpy as np
    import pandas as pd
    from matplotlib import pyplot as plt
    from sklearn import svm
    from CAL.PyCAL import *
    #global parameters for the model
    date = '20151130'
    securityID = '000002.XSHE' #Vanke A
    trainSetNum = 900
    testSetNum = 600
    #loading the LOB data
    dataSet = DataAPI.MktTicksHistOneDayGet(securityID=securityID, date=date,pandas='1')
    #Feature representation
    ##Basic set
    ###V1: price and volume (10 levels)
    featV1 = dataSet[['askPrice1','askPrice2','askPrice3','askPrice4','askPrice5','askVolume1','askVolume2','askVolume3','askVolume4','askVolume5','bidPrice1','bidPrice2','bidPrice3','bidPrice4','bidPrice5','bidVolume1','bidVolume2','bidVolume3','bidVolume4','bidVolume5']]
    featV1 = np.array(featV1)
    ##Time-insensitive set
    ###V2: bid-ask spreads and mid prices
    temp1 = featV1[:,0:5] - featV1[:,10:15]
    temp2 = (featV1[:,0:5] + featV1[:,10:15])*0.5
    featV2 = np.zeros([temp1.shape[0],temp1.shape[1]+temp2.shape[1]])
    featV2[:,0:temp1.shape[1]] = temp1
    featV2[:,temp1.shape[1]:] = temp2
    ###V3: price differences
    temp1 = featV1[:,4] - featV1[:,0]
    temp2 = featV1[:,10] - featV1[:,14]
    temp3 = abs(featV1[:,1:5] - featV1[:,0:4])
    temp4 = abs(featV1[:,11:15] - featV1[:,10:14])
    featV3 = np.zeros([temp1.shape[0],1+1+temp3.shape[1]+temp4.shape[1]])
    featV3[:,0] = temp1
    featV3[:,1] = temp2
    featV3[:,2:2+temp3.shape[1]] = temp3
    featV3[:,2+temp3.shape[1]:] = temp4
    ###V4: mean prices and volumes
    temp1 = np.mean(featV1[:,0:5],1)
    temp2 = np.mean(featV1[:,10:15],1)
    temp3 = np.mean(featV1[:,5:10],1)
    temp4 = np.mean(featV1[:,15:],1)
    featV4 = np.zeros([temp1.shape[0],1+1+1+1])
    featV4[:,0] = temp1
    featV4[:,1] = temp2
    featV4[:,2] = temp3
    featV4[:,3] = temp4
    ###V5: accumulated differences
    temp1 = np.sum(featV2[:,0:5],1)
    temp2 = np.sum(featV1[:,5:10] - featV1[:,15:],1)
    featV5 = np.zeros([temp1.shape[0],1+1])
    featV5[:,0] = temp1
    featV5[:,1] = temp2
    ##Time-sensitive set
    ###V6: price and volume derivatives
    temp1 = featV1[1:,0:5] - featV1[:-1,0:5]
    temp2 = featV1[1:,10:15] - featV1[:-1,10:15]
    temp3 = featV1[1:,5:10] - featV1[:-1,5:10]
    temp4 = featV1[1:,15:] - featV1[:-1,15:]
    featV6 = np.zeros([temp1.shape[0]+1,temp1.shape[1]+temp2.shape[1]+temp3.shape[1]+temp4.shape[1]]) #differencing drops one row, so pad it back with zeros
    featV6[1:,0:temp1.shape[1]] = temp1
    featV6[1:,temp1.shape[1]:temp1.shape[1]+temp2.shape[1]] = temp2
    featV6[1:,temp1.shape[1]+temp2.shape[1]:temp1.shape[1]+temp2.shape[1]+temp3.shape[1]] = temp3
    featV6[1:,temp1.shape[1]+temp2.shape[1]+temp3.shape[1]:] = temp4
    ##combining the features
    feat = np.zeros([featV1.shape[0],sum([featV1.shape[1],featV2.shape[1],featV3.shape[1],featV4.shape[1],featV5.shape[1],featV6.shape[1]])])
    feat[:,:featV1.shape[1]] = featV1
    feat[:,featV1.shape[1]:featV1.shape[1]+featV2.shape[1]] = featV2
    feat[:,featV1.shape[1]+featV2.shape[1]:featV1.shape[1]+featV2.shape[1]+featV3.shape[1]] = featV3
    feat[:,featV1.shape[1]+featV2.shape[1]+featV3.shape[1]:featV1.shape[1]+featV2.shape[1]+featV3.shape[1]+featV4.shape[1]] = featV4
    feat[:,featV1.shape[1]+featV2.shape[1]+featV3.shape[1]+featV4.shape[1]:featV1.shape[1]+featV2.shape[1]+featV3.shape[1]+featV4.shape[1]+featV5.shape[1]] = featV5
    feat[:,featV1.shape[1]+featV2.shape[1]+featV3.shape[1]+featV4.shape[1]+featV5.shape[1]:] = featV6
    ##normalizing the features (note: statistics are computed per snapshot, across all of its features)
    numFeat = feat.shape[1]
    meanFeat = feat.mean(axis=1)
    meanFeat.shape = [meanFeat.shape[0],1]
    stdFeat = feat.std(axis=1)
    stdFeat.shape = [stdFeat.shape[0],1]
    normFeat = (feat - meanFeat.repeat(numFeat,axis=1))/stdFeat.repeat(numFeat,axis=1)
    #print(normFeat)
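The normalization above computes the mean and standard deviation per snapshot (axis=1), i.e. across heterogeneous features. An alternative convention is to standardize each feature column across snapshots instead, so that no feature dominates purely because of its scale. A minimal sketch on synthetic stand-in data (the array `feat` here is hypothetical, not the real feature matrix):

```python
import numpy as np

# Synthetic stand-in for the real feature matrix built above:
# 8 snapshots x 3 features with very different scales.
rng = np.random.default_rng(0)
feat = rng.normal(loc=[10.0, 0.5, 200.0], scale=[1.0, 0.1, 50.0], size=(8, 3))

# Column-wise z-score: each feature column gets zero mean and unit
# standard deviation across snapshots (axis=0).
colNormFeat = (feat - feat.mean(axis=0)) / feat.std(axis=0)

print(colNormFeat.mean(axis=0))  # each entry ~ 0
print(colNormFeat.std(axis=0))   # each entry ~ 1
```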

Data labeling

The time step is the smallest interval available from DataYes (3 s):

  • if the next tick's mid price is greater than the current mid price, label it upward;
  • if the next tick's mid price is less than the current mid price, label it downward;
  • if the next tick's mid price equals the current mid price, label it stationary.
    ##mid-price movement labels: upward(0), downward(1) or stationary(2)
    upY = featV2[1:,5] > featV2[:-1,5]
    upY = np.append(upY,0)
    numUp = sum(upY)
    downY = featV2[1:,5] < featV2[:-1,5]
    downY = np.append(downY,0)
    numDown = sum(downY)
    statY = featV2[1:,5] == featV2[:-1,5]
    statY = np.append(statY,0)
    numStat = sum(statY)
    #Y = np.zeros([upY.shape[0],3])
    #Y[:,0] = upY
    #Y[:,1] = downY
    #Y[:,2] = statY
    pUp = np.where(upY==1)[0]
    pDown = np.where(downY==1)[0]
    pStat = np.where(statY==1)[0]
    multiY = np.zeros([upY.shape[0],1])
    multiY[pUp] = 0
    multiY[pDown] = 1
    multiY[pStat] = 2
    ##dividing the dataset into a train set and a test set
    numTrain = 1200
    numTest = 500
    #rebalancing the ratio of upward, downward and stationary samples
    numTrainUp = 250
    numTrainDown = 250
    numTrainStat = 400
    pUpTrain = pUp[:numTrainUp]
    pDownTrain = pDown[:numTrainDown]
    pStatTrain = pStat[:numTrainStat]
    pTrainTemp = np.append(pUpTrain,pDownTrain)
    pTrain = np.append(pTrainTemp,pStatTrain)
    trainSet = normFeat[pTrain,:]
    #trainSet = normFeat[1:numTrain+1,:]
    testSet = normFeat[numTrain+1:numTrain+numTest+1,:]
    #trainY = Y[1:numTrain+1,:]
    trainMultiYTemp = np.append(multiY[pUpTrain],multiY[pDownTrain])
    trainMultiY = np.append(trainMultiYTemp,multiY[pStatTrain])
    #trainMultiY = multiY[1:numTrain+1]
    testMultiY = multiY[numTrain+1:numTrain+numTest+1]

Classification model

The classifier is a multi-class SVM using a one-vs-all scheme. I have not tuned the parameters much, so the model you see here is in fact quite crude. If you are interested, you can also try tree-based ML methods such as random forests.

    ##training a multi-class SVM model
    Model = svm.LinearSVC(C=2.)
    Model.fit(trainSet,trainMultiY)
    pred = Model.predict(testSet)
    ap = Model.score(testSet,testMultiY)
    print(ap)
    # output: 0.522
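As suggested above, tree-based methods are a drop-in alternative to the SVM. A minimal sketch with scikit-learn's RandomForestClassifier, run here on synthetic stand-ins for trainSet/testSet (random data, so the score itself is meaningless):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for (trainSet, trainMultiY) and (testSet, testMultiY);
# in the notebook these would come from the feature pipeline above.
rng = np.random.RandomState(7)
XTrain = rng.randn(300, 10)
yTrain = rng.randint(0, 3, size=300)   # three classes: up/down/stationary
XTest = rng.randn(100, 10)
yTest = rng.randint(0, 3, size=100)

# Random forests handle multi-class labels natively and, unlike the SVM,
# do not require feature normalization.
rfModel = RandomForestClassifier(n_estimators=100, random_state=0)
rfModel.fit(XTrain, yTrain)
rfAcc = rfModel.score(XTest, yTest)
print(rfAcc)
```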

Results

I use Vanke A (万科A) ticks from November 30 as the data source. I picked Vanke A because I had been bullish on it since early November, then failed to hold it mid-month, missed buying the dip, and watched it take off at the end of the month — a stock I love and hate. The plot below shows the prediction results: the blue line is the mid price series of the test set, red dots mark ticks where the model predicts the next move is upward, green dots where it predicts downward, and the absence of a dot means the model predicts no change.

    testMidPrice = featV2[numTrain+1:numTrain+numTest+1,5]
    pUpTest = np.where(pred==0)[0]
    pDownTest = np.where(pred==1)[0]
    pStatTest = np.where(pred==2)[0]
    plt.figure(figsize=(16,5))
    plt.plot(range(numTest),testMidPrice,'b-',pUpTest,testMidPrice[pUpTest],'r.',pDownTest,testMidPrice[pDownTest],'g.')
    plt.grid()
    plt.xlabel('time')
    plt.ylabel('midPrice')

Figure 2: predicted direction overlaid on the test-set mid price series

Aside

What you see here is extremely rough. The framework in the original paper is far more elaborate, including cross-validation on the training set, rolling updates of the data, bid-ask spread crossing, and a toy strategy built on top of it (though trading at this frequency cannot yet be run on the platform :)) — none of that is implemented here. I simply took the first 1200 ticks, normalized and rebalanced them, and predicted the following 500. I am a second-year graduate student and swamped, I can only write at night, and I need to finish my thesis paper and then rush to find an internship, so a more polished version may come later if I get the chance. Finally, thanks to the folks at DataYes for specially opening up the historical high-frequency API for me!
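The training-set cross-validation mentioned above, which this notebook skips, could be sketched as a small grid search over the SVM's regularization parameter C with scikit-learn's cross_val_score. This runs on synthetic stand-in data; for real tick data, a time-ordered splitter such as TimeSeriesSplit would be preferable to avoid look-ahead bias:

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (trainSet, trainMultiY).
rng = np.random.RandomState(1)
X = rng.randn(200, 20)
y = rng.randint(0, 3, size=200)

# 5-fold cross-validation for each candidate C; the C with the best
# mean score would then be used to fit the final model.
for C in [0.5, 1.0, 2.0]:
    scores = cross_val_score(svm.LinearSVC(C=C, max_iter=5000), X, y, cv=5)
    print(C, scores.mean())
```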