At the moment, two things seem worth pursuing: first, whether AI quant can produce effective strategies that can be delivered to live trading; second, how to get into AIGC effectively.
Today, let's start with factor analysis.
Here is the code. alphalens is fairly heavyweight, and in practice we usually only look at two numbers, IC and Rank IC, so we implement them ourselves:
from typing import Tuple

import pandas as pd


def calc_ic(pred: pd.Series, label: pd.Series, date_col="date", dropna=False) -> Tuple[pd.Series, pd.Series]:
    """Per-date IC (Pearson) and Rank IC (Spearman) between factor and forward return."""
    df = pd.DataFrame({"pred": pred, "label": label})
    ic = df.groupby(date_col).apply(lambda g: g["pred"].corr(g["label"]))
    ric = df.groupby(date_col).apply(lambda g: g["pred"].corr(g["label"], method="spearman"))
    if dropna:
        return ic.dropna(), ric.dropna()
    return ic, ric


if __name__ == '__main__':
    from engine.datafeed.dataloader import CSVDataloader
    from engine.config import etfs_indexes, DATA_INDEX

    loader = CSVDataloader(DATA_INDEX.resolve(), etfs_indexes.values(), start_date="20120101")
    df = loader.load(fields=['ta("MACD",close)', 'shift(close,-5)/close-1'],
                     names=['factor', 'label'])
    print(df)
    # note: pred is the factor, label the forward return
    # (the original had the two arguments swapped)
    ic, ric = calc_ic(pred=df['factor'], label=df['label'], dropna=True)
    print(ic.mean(), ric.mean())
    ic.plot()
    import matplotlib.pyplot as plt
    plt.show()
Factor analysis of the 5-day RSI against the 5-day forward return: the mean IC/Rank IC is about 0.04, so the factor does carry some correlation.
For now, even with a factor at 0.04 or 0.05, the out-of-sample accuracy for classifying the 5-day forward return is still only around 10%. If we reduce the task to a binary up/down classification, out-of-sample accuracy can reach 60%+.
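As a rough sanity check on what a 0.04–0.05 correlation buys you directionally, here is a small Monte-Carlo sketch (my own toy model, not part of the engine): the factor and the forward return are jointly normal with correlation `ic`, and we measure how often the factor's sign matches the return's sign.

```python
import numpy as np

def directional_accuracy(ic: float, n: int = 200_000, seed: int = 0) -> float:
    """Fraction of samples where a factor with correlation `ic` gets the return's sign right."""
    rng = np.random.default_rng(seed)
    ret = rng.standard_normal(n)                      # simulated forward returns
    noise = rng.standard_normal(n)
    factor = ic * ret + np.sqrt(1 - ic ** 2) * noise  # factor correlated with returns at level `ic`
    # Predict "up" whenever the factor is positive.
    return float(np.mean((factor > 0) == (ret > 0)))

if __name__ == "__main__":
    for ic in (0.0, 0.05, 0.2):
        print(ic, round(directional_accuracy(ic), 4))
```

Under this toy model, an IC of 0.05 lifts directional accuracy only slightly above 50%, which suggests the 60%+ out-of-sample figure reflects a trained model combining many signals rather than any single raw factor.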
So we still pin our hopes on ranking.
Investing comes down to stock selection and market timing, and timing itself is a generalized form of selection.
And selection is itself a ranking: pick the better candidates out of the pool.
The classification algorithms everyone knows best also count as a kind of ranking in AI quant: use a classifier to predict each stock's probability of rising, then sort by that probability. This corresponds to the pointwise approach. Its obvious weakness is that we predict on each stock's cross-sectional data independently, without modeling the relative relationships between stocks. And as we know, predicting return probabilities is very hard, while judging the relative ordering between stocks is considerably easier.
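A minimal illustration of pointwise ranking (the symbols and scores below are made up): each stock gets an independently predicted score, and the cross-section is ranked simply by sorting those scores per date.

```python
import pandas as pd

# Hypothetical per-stock model outputs for one date (pointwise: each score is
# produced independently, with no pairwise comparison between stocks).
preds = pd.DataFrame({
    "date":   ["2023-01-06"] * 4,
    "symbol": ["A", "B", "C", "D"],
    "score":  [0.52, 0.61, 0.48, 0.57],
})

# Ranking = sort by score within each date, then take the top K.
top_k = (preds.sort_values(["date", "score"], ascending=[True, False])
              .groupby("date")
              .head(2))
print(top_k["symbol"].tolist())  # prints ['B', 'D']
```

The listwise/pairwise approaches below differ only in how the scores are trained, not in this final sort-and-select step.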
Refactoring GBMRanker
import os

import joblib
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import train_test_split

from engine.datafeed.dataset import DataSet
from engine.models.model_base import ModelBase


class LGBRanker(ModelBase):
    def __init__(self, name, load_model, feature_cols=None):
        super(LGBRanker, self).__init__(name, load_model)
        self.feature_cols = feature_cols
        self.label_col = 'label'

    def _prepare_groups(self, df):
        # LightGBM ranking needs group sizes: the number of samples per query
        # (here, one query per trading day).
        df['day'] = df.index
        group = df.groupby('day')['day'].count()
        return group.values

    def predict(self, data):
        data = data.copy(deep=True)
        if self.feature_cols:
            data = data[self.feature_cols]
        return self.ranker.predict(data)

    def train(self, df, split_date):
        if split_date:
            df_train = df[df.index < split_date]
            df_val = df[df.index >= split_date]
        query_train = self._prepare_groups(df_train.copy(deep=True))
        query_val = self._prepare_groups(df_val.copy(deep=True))

        ranker = lgb.LGBMRanker()
        ranker.fit(df_train[self.feature_cols], df_train[self.label_col],
                   group=query_train,
                   eval_set=[(df_val[self.feature_cols], df_val[self.label_col])],
                   eval_group=[query_val],
                   eval_at=[1, 2, 5, 10, 20],
                   early_stopping_rounds=50)
        self.ranker = ranker

        print(ranker.n_features_)
        print(ranker.feature_importances_)
        print(ranker.feature_name_)
        # Feature importances, sorted descending.
        score, names = zip(*sorted(zip(ranker.feature_importances_, ranker.feature_name_),
                                   reverse=True))
        print(score)
        print(names)


if __name__ == '__main__':
    from engine.config import etfs_indexes, DATA_INDEX
    from engine.datafeed.dataloader import CSVDataloader
    from engine.datafeed.alpha import AlphaLit

    loader = CSVDataloader(path=DATA_INDEX.resolve(), symbols=etfs_indexes.values())
    ds = DataSet("行业指数数据集", loader=loader, handler=AlphaLit(), cache=False)
    df = ds.data
    df.dropna(inplace=True)
    print(df)
    LGBRanker('', feature_cols=ds.get_feature_names(), load_model=False).train(df, '2022-01-01')
    # lightgbm.plot_importance(ranker, figsize=(12, 8))
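One practical caveat (my addition, not shown in the code above): LightGBM's default lambdarank objective expects non-negative integer relevance grades as labels, not raw continuous returns. A common workaround is to bucket each day's forward returns into quantile grades before calling `fit`; the helper below is a hedged sketch of that idea.

```python
import pandas as pd

def returns_to_relevance(df: pd.DataFrame, ret_col="label", n_grades=5) -> pd.Series:
    """Convert continuous forward returns into integer relevance grades 0..n_grades-1,
    computed per day so the grades reflect the cross-sectional ranking on that day."""
    # rank(method="first") breaks ties so qcut always gets distinct bin edges.
    return (df.groupby("date")[ret_col]
              .transform(lambda s: pd.qcut(s.rank(method="first"),
                                           n_grades, labels=False)))

if __name__ == "__main__":
    df = pd.DataFrame({
        "date":  ["d1"] * 10,
        "label": [0.01, -0.02, 0.03, 0.00, 0.05, -0.01, 0.02, -0.03, 0.04, 0.015],
    })
    grades = returns_to_relevance(df)
    print(sorted(grades.unique()))  # the five distinct relevance grades
```

The highest-return names on each day end up with the highest grade, which is exactly what lambdarank's NDCG-style objective optimizes for.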
Empirical results leave us with one clear conclusion:
More factors are not better. Even though GBDT can screen factors, factors can conflict with one another, and carelessly adding redundant factors degrades training performance. You need to verify the correlation between factors and whether each one contributes positively. So don't try to dump a pile of factors into the model and let it sort them out; if that worked, this would all be too easy.
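The correlation check mentioned above can be as simple as the sketch below (column names are illustrative, not the article's factor set): compute the absolute correlation matrix of the factors and drop one factor out of every pair above a threshold.

```python
import numpy as np
import pandas as pd

def drop_correlated(factors: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return the factor columns to keep after dropping one of each highly-correlated pair."""
    corr = factors.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [c for c in factors.columns if not any(upper[c] > threshold)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.standard_normal(500)
    df = pd.DataFrame({
        "mom20": base,
        "mom21": base + 0.01 * rng.standard_normal(500),  # near-duplicate of mom20
        "vol20": rng.standard_normal(500),
    })
    print(drop_correlated(df))  # prints ['mom20', 'vol20']
```

This only catches linear redundancy; checking that each surviving factor has a positive standalone IC is a separate step.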
Model backtesting: rolling (walk-forward) backtest
from engine.config import etfs_indexes, DATA_INDEX
from engine.datafeed.dataset import DataSet
from engine.datafeed.dataloader import CSVDataloader
from engine.datafeed.alpha import AlphaLit
from engine.env import Env

from engine.algo.algos import *
from engine.algo.algo_weights import *
from engine.algo.algo_model import ModelWFA
from engine.models.lgb_ranker import LGBRanker

loader = CSVDataloader(path=DATA_INDEX.resolve(), symbols=etfs_indexes.values())
ds = DataSet("行业指数数据集", loader=loader, handler=AlphaLit(), cache=False)
print(ds.data)

model = LGBRanker(name='滚动回测——排序', load_model=False,
                  feature_cols=ds.get_feature_names())
# model.train(ds.data, '2022-01-01')  # not needed here: ModelWFA refits inside the loop

env = Env(ds.data)
env.set_algos([
    RunWeekly(),
    ModelWFA(model=model),
    SelectTopK(K=2, order_by='pred_score', b_ascending=False),
    WeightEqually(),
])
env.backtest_loop()
env.show_results()
As for the second thing, getting into AIGC effectively: right now many self-media accounts, and even many engineers, are still at the spectator stage, or working at the periphery, calling the ChatGPT API or image-generation APIs with so-called prompts to churn out articles and pictures. For an ordinary person or a startup to genuinely participate in this productivity upgrade, training a large model from scratch is neither realistic nor necessary. Just treat the large model as a traditional "pre-trained" model: traditionally we never trained BERT or GPT from zero either, we fine-tuned on top of a pre-trained checkpoint, and the logic is similar for general-purpose large models.
And the direction to enter should be vertical: domain-specific knowledge solving domain problems. The easiest landing spot I can think of is still finance; much like AI quant, FinGPT should be interesting too. The team behind FinRL has open-sourced FinGPT, and the vision at least is excellent. I will keep following it.
Today, let's find time to implement the "walk-forward" rolling training and backtest.
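The walk-forward idea can be sketched independently of the engine (this is my own illustration, not the `ModelWFA` implementation): refit the model on a trailing window of history, predict the next slice, then roll the window forward and repeat.

```python
import pandas as pd

def walk_forward_splits(dates, train_years=2, step_months=3):
    """Yield (train_dates, test_dates) pairs: a trailing training window
    followed by the next out-of-sample slice, rolled forward through time."""
    dates = pd.DatetimeIndex(sorted(set(dates)))
    cut = dates.min() + pd.DateOffset(years=train_years)
    while cut < dates.max():
        nxt = cut + pd.DateOffset(months=step_months)
        train_idx = dates[(dates >= cut - pd.DateOffset(years=train_years)) & (dates < cut)]
        test_idx = dates[(dates >= cut) & (dates < nxt)]
        if len(test_idx):
            yield train_idx, test_idx
        cut = nxt

if __name__ == "__main__":
    days = pd.date_range("2019-01-01", "2022-12-31", freq="B")
    for train, test in walk_forward_splits(days):
        print(train.min().date(), "->", train.max().date(),
              "| predict:", test.min().date(), "->", test.max().date())
```

The key invariant is that every training window ends strictly before its test slice begins, so the backtest never peeks ahead.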
Judging from 20-day momentum, choosing these 29 industry sectors as the candidate set was the right call.
That's because the market differentiates by industry, especially when sentiment is choppy and opinions diverge:
Equal-weight backtest of the 29 industries:
annualized return 0.075, max drawdown -0.318, Sharpe ratio 0.441.
Equal weight with weekly rebalancing:
annualized return 0.146, max drawdown -0.283, Sharpe ratio 0.748. Rebalancing works best for portfolios of low-correlation assets, and the divergence across these industries is quite pronounced.
WFA rolling training with LightGBM:
annualized return 0.219, max drawdown -0.259, Sharpe ratio 0.861.
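For reference, the three statistics reported above can be computed from a daily equity curve roughly as follows (assuming 252 trading days per year; the engine's `show_results` may differ in detail):

```python
import numpy as np
import pandas as pd

def perf_stats(equity: pd.Series, periods: int = 252) -> dict:
    """Annualized return, max drawdown, and Sharpe ratio from a daily equity curve."""
    rets = equity.pct_change().dropna()
    years = len(rets) / periods
    annual = (equity.iloc[-1] / equity.iloc[0]) ** (1 / years) - 1
    drawdown = equity / equity.cummax() - 1          # distance from the running peak
    sharpe = np.sqrt(periods) * rets.mean() / rets.std()
    return {"annual_return": annual, "max_drawdown": drawdown.min(), "sharpe": sharpe}

if __name__ == "__main__":
    # Toy equity curve: rises linearly from 1.0 to 1.2 over ~2 years.
    idx = pd.date_range("2020-01-01", periods=504, freq="B")
    equity = pd.Series(np.linspace(1.0, 1.2, 504), index=idx)
    print(perf_stats(equity))
```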
etfs = [
    '159870.SZ', '512400.SH', '515220.SH', '515210.SH', '516950.SH',
    '562800.SH', '515170.SH', '512690.SH', '159996.SZ', '159865.SZ',
    '159766.SZ', '515950.SH', '159992.SZ', '159839.SZ', '512170.SH',
    '159883.SZ', '512980.SH', '159869.SZ', '515050.SH', '515000.SH',
    '515880.SH', '512480.SH', '515230.SH', '512670.SH', '515790.SH',
    '159757.SZ', '516110.SH', '512800.SH', '512200.SH',
]

import numpy as np
import pandas as pd
from tqdm import tqdm

from engine.env import Env
from engine.datafeed.dataset import DataSet, AlphaLit, Alpha
from engine.context import ExecContext
from engine.algo.algos import *
from engine.algo.algo_model import ModelWFA
from engine.models.gbdt_l2r import LGBModel

ds = DataSet(etfs, start_date='20190101', handler=Alpha(), cache=True)
# ds = DataSet(['000300.SH', '399006.SZ'], start_date='20120101', handler=Alpha())
env = Env(ds.data)

model = LGBModel(load_model=False, feature_cols=ds.get_feature_names())
env.set_algos([
    RunWeekly(),
    ModelWFA(model=model),
    SelectTopK(K=2, order_by='pred_score', b_ascending=False),
    WeightEqually(),
])
env.backtest_loop()
env.show_results()
Below is the code defining the factor set:
class Alpha(AlphaBase):
    def __init__(self):
        pass

    @staticmethod
    def parse_config_to_fields():
        # ['CORD30', 'STD30', 'CORR5', 'RESI10', 'CORD60', 'STD5', 'LOW0',
        #  'WVMA30', 'RESI5', 'ROC5', 'KSFT', 'STD20', 'RSV5', 'STD60', 'KLEN']
        fields = []
        names = []

        # kbar
        fields += [
            "(close-open)/open",
            "(high-low)/open",
            "(close-open)/(high-low+1e-12)",
            "(high-greater(open, close))/open",
            "(high-greater(open, close))/(high-low+1e-12)",
            "(less(open, close)-low)/open",
            "(less(open, close)-low)/(high-low+1e-12)",
            "(2*close-high-low)/open",
            "(2*close-high-low)/(high-low+1e-12)",
        ]
        names += [
            "KMID", "KLEN", "KMID2", "KUP", "KUP2",
            "KLOW", "KLOW2", "KSFT", "KSFT2",
        ]

        # =========== price ==========
        feature = ["OPEN", "HIGH", "LOW", "CLOSE"]
        windows = range(5)
        for field in feature:
            field = field.lower()
            fields += ["shift(%s, %d)/close" % (field, d) if d != 0 else "%s/close" % field
                       for d in windows]
            names += [field.upper() + str(d) for d in windows]

        # ================ volume ===========
        fields += ["shift(volume, %d)/(volume+1e-12)" % d if d != 0 else "volume/(volume+1e-12)"
                   for d in windows]
        names += ["VOLUME" + str(d) for d in windows]

        # ================= rolling ====================
        windows = [5, 10, 20, 30, 60]
        fields += ["shift(close, %d)/close" % d for d in windows]
        names += ["ROC%d" % d for d in windows]
        fields += ["mean(close, %d)/close" % d for d in windows]
        names += ["MA%d" % d for d in windows]
        fields += ["std(close, %d)/close" % d for d in windows]
        names += ["STD%d" % d for d in windows]
        fields += ["slope(close, %d)/close" % d for d in windows]
        names += ["BETA%d" % d for d in windows]
        fields += ["max(high, %d)/close" % d for d in windows]
        names += ["MAX%d" % d for d in windows]
        fields += ["min(low, %d)/close" % d for d in windows]
        names += ["MIN%d" % d for d in windows]
        fields += ["corr(close/shift(close,1), log(volume/shift(volume, 1)+1), %d)" % d
                   for d in windows]
        names += ["CORD%d" % d for d in windows]
        fields += ['close/shift(close,20)-1']
        names += ['roc_20']

        return fields, names
Publisher: 股市刺客. Please credit the source when reprinting: https://www.95sca.cn/archives/104090