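A 3.4-million luxury watch, a luxury car worth over 100 million, a 600-million-dollar mansion: technology is humanity's sea of stars; quantlab3.0 integrates gplearn factor mining.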

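Before we start, let me show off an AI heavyweight to set the tone.

To my mind, the goal that invites the least resentment of wealth and offers the strongest sense of achievement and calling is to push the frontier of technology, bring benefits to all of humanity, and then earn enormous wealth through the capital markets (primary-market equity and options).

Take this man, for example: Sam Altman, the CEO of OpenAI.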

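We have written about gplearn before, but those posts relied on a built-in vectorized backtest engine:

"Hands-on code for gplearn factor mining on futures and multiple stocks (code + data download)"

"167% annualized, Sharpe ratio above 7: high-frequency factor mining on stock index futures with gplearn"

Now we need to integrate gplearn into quantlab3.0.

Multi-factor mining is the future of quant, and factor mining runs through the whole journey from Quant 2.0 to Quant 3.0.

We laid this out earlier in "Quantlab3.0 progress, with thoughts on Quant4.0: fully automated, explainable AI quant is the future".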

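Quant 2.0: moves quant research from the small "genius workshop" model to an industrialized factor pipeline. Quant 2.0 is mainly about mining factors, and it is already fairly common among private funds trading stocks and futures.

Automatic factor-mining tools such as gplearn are basically standard equipment: "Hands-on code for gplearn factor mining on futures and multiple stocks (code + data download)".

Quant 3.0: puts more weight on deep learning models. Even with relatively simple factors, deep learning still has potential, thanks to its strong end-to-end learning ability and flexible model fitting. The catch is that deep learning needs a large amount of data.

We tried end-to-end factor synthesis in DeepAlphaGen: "End-to-end factor mining framework: DeepAlphaGen V1.0 code release, supporting the latest qlib".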

gplearn was originally built for symbolic regression.

For example, the code below tries to recover y = cos(x1) - sin(x2):

# A brief walkthrough of how to use gplearn
import numpy as np
import pandas as pd
from gplearn import fitness
from gplearn.genetic import SymbolicRegressor
from datetime import datetime


def score_func_basic(y, y_pred, sample_weight):  # fitness function: the metric each program is scored on
    # Minimize the residual sum of squares; sample_weight is ignored in this toy example
    return np.sum((np.asarray(y_pred) - np.asarray(y)) ** 2)


# make_fitness expects function(y, y_pred, sample_weight) returning a floating point number:
# y is the target vector, y_pred is the genetic program's output, sample_weight is the sample weight vector.
m = fitness.make_fitness(function=score_func_basic,
                         greater_is_better=False,  # a smaller residual is better
                         wrap=False)  # skip the pickling wrapper; fine when running single-threaded

cmodel_gp = SymbolicRegressor(population_size=500,  # number of programs (formulas) in each generation
                              generations=10,  # number of generations to evolve
                              metric=m,  # fitness metric: here the residual sum of squares defined above
                              tournament_size=50,  # programs drawn for each tournament; the fittest one is selected to breed or mutate
                              function_set=('add', 'sub', 'mul', 'abs', 'neg', 'sin', 'cos', 'tan'),  # functions used to build and evolve programs
                              const_range=(-1.0, 1.0),  # range of constants allowed in a program
                              parsimony_coefficient='auto',
                              # Penalty on large trees (default 0.001). 'auto' uses c = Cov(l, f) / Var(l), where Cov(l, f) is the
                              # covariance between program size l and program fitness f in the population, and Var(l) is the variance of program sizes.
                              # stopping_criteria=100.0,  # stop early once the metric reaches this value
                              init_depth=(2, 4),  # initial tree depth: at least 2 levels, at most 4
                              init_method='half and half',  # tree shape: 'grow' gives asymmetric trees, 'full' gives bushy trees, 'half and half' mixes both
                              p_crossover=0.2,  # crossover probability (alternative: 0.8)
                              p_subtree_mutation=0.2,  # subtree mutation probability
                              p_hoist_mutation=0.2,  # hoist mutation probability (alternative: 0.15)
                              p_point_mutation=0.2,  # point mutation probability
                              p_point_replace=0.2,  # in point mutation, probability that any given node is replaced
                              max_samples=1.0,  # fraction of samples drawn from X to evaluate each program on
                              feature_names=None, warm_start=False, low_memory=False,
                              n_jobs=1,
                              verbose=1,
                              random_state=0
                              )

if __name__ == '__main__':
    start = datetime.now()
    np.random.seed(0)  # make the random feature column reproducible
    LenD = 1000

    # Target 1: a plain sum of the two features
    X1 = pd.DataFrame(data={'a': range(LenD), 'b': np.random.randint(-10, 10, LenD)})
    Y1 = X1.sum(axis=1)  # .values
    print("Target 1: Y1 = X1.sum(axis=1)")
    cmodel_gp.fit(X1, Y1)
    print(cmodel_gp)  # prints the best program found
    print("------------------------------------------------------------------------------------")
    print("\n" * 5)
    print("------------------------------------------------------------------------------------")

    # Target 2: y = cos(a) - sin(b)
    X2 = pd.DataFrame(data={'a': range(LenD), 'b': np.random.randint(0, 10, LenD)})
    Y2 = np.cos(X2['a']) - np.sin(X2['b'])
    cmodel_gp.fit(X2, Y2)
    print(cmodel_gp)
    print("------------------------------------------------------------------------------------")
    end = datetime.now()
    elapsed = end - start
    print("Time elapsed:", elapsed)
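The comment on `parsimony_coefficient='auto'` in the script above quotes the formula c = Cov(l, f) / Var(l). To make it concrete, here is a minimal numpy sketch of that formula on a made-up population of four programs. The helper names `auto_parsimony` and `penalized` are mine, not gplearn's, and the penalty direction (subtract when larger fitness is better, add otherwise) is the usual convention rather than a copy of gplearn's internals.

```python
import numpy as np


def auto_parsimony(lengths, fitnesses):
    """Textbook form of c = Cov(l, f) / Var(l) for one generation."""
    lengths = np.asarray(lengths, dtype=float)
    fitnesses = np.asarray(fitnesses, dtype=float)
    cov_lf = np.cov(lengths, fitnesses)[0, 1]  # sample covariance (ddof=1)
    var_l = np.var(lengths, ddof=1)            # sample variance, same ddof
    return cov_lf / var_l


def penalized(fitnesses, lengths, c, greater_is_better=True):
    """Raw fitness minus (or plus) a penalty proportional to program length."""
    sign = 1.0 if greater_is_better else -1.0
    return np.asarray(fitnesses, dtype=float) - sign * c * np.asarray(lengths, dtype=float)


# Hypothetical population: longer programs happen to score higher
lengths = [3, 5, 7, 9]
raw_fitness = [1.0, 2.0, 3.0, 4.0]

c = auto_parsimony(lengths, raw_fitness)
adjusted = penalized(raw_fitness, lengths, c)
print(c)         # 0.5, up to floating-point rounding
print(adjusted)  # every program ends up at about -0.5
```

In this toy population the extra fitness comes entirely from extra length, so after the penalty all four programs are tied and the longer ones no longer win tournaments just for being bigger.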

After a few generations of evolution, gplearn produces a symbolic expression.
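With default feature names, the printed expression takes a form like `sub(cos(X0), sin(X1))`. To show what such an expression means, here is a small pure-Python sketch that evaluates a program stored as a flat list in prefix order. The `FUNCTIONS` table and the `evaluate` helper are hypothetical and only similar in spirit to what a genetic programming library does internally: strings are function names, ints are feature indices, floats are constants.

```python
import math

# Function table: name -> (arity, implementation)
FUNCTIONS = {
    "add": (2, lambda a, b: a + b),
    "sub": (2, lambda a, b: a - b),
    "mul": (2, lambda a, b: a * b),
    "neg": (1, lambda a: -a),
    "abs": (1, abs),
    "sin": (1, math.sin),
    "cos": (1, math.cos),
}


def evaluate(program, row):
    """Evaluate a prefix-ordered program on one row of feature values."""

    def walk(pos):
        node = program[pos]
        if isinstance(node, str):
            arity, func = FUNCTIONS[node]
            args = []
            pos += 1
            for _ in range(arity):
                value, pos = walk(pos)  # each child consumes part of the list
                args.append(value)
            return func(*args), pos
        if isinstance(node, int):       # feature index, e.g. 0 means X0
            return row[node], pos + 1
        return float(node), pos + 1     # numeric constant

    value, _ = walk(0)
    return value


# sub(cos(X0), sin(X1)) written in prefix order
program = ["sub", "cos", 0, "sin", 1]
result = evaluate(program, [0.0, math.pi / 2])
print(result)  # cos(0) - sin(pi/2) = 1 - 1 = 0.0
```

Evolution then amounts to editing these lists (swapping subtrees, replacing nodes) and keeping the edits that score better on the fitness function.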

This mechanism is a natural fit for factor mining: all we need to do is swap in a suitable fitness function.
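As a rough idea of what such a fitness can look like, here is a self-contained numpy sketch of a "long when the factor is positive, short when it is negative" rule scored by annualized return over maximum drawdown. It is a stand-in I wrote for illustration, not the backtest engine used in this series. `BARS_PER_YEAR` is an assumed value (250 trading days of 240 one-minute bars), and costs, slippage and overlapping holding periods are all ignored.

```python
import numpy as np

BARS_PER_YEAR = 250 * 240  # assumption: 250 trading days x 240 one-minute bars


def sign_strategy_fitness(y, y_pred, sample_weight):
    """Annualized mean return divided by max drawdown of a sign-following rule.

    y      : realized forward log returns, one per bar
    y_pred : factor values produced by a candidate program
    sample_weight is accepted to match the (y, y_pred, sample_weight)
    signature that gplearn's make_fitness expects, but it is not used.
    """
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    position = np.sign(y_pred)   # +1 long, -1 short, 0 flat
    pnl = position * y           # per-bar log return of the rule
    equity = np.cumsum(pnl)      # cumulative log return

    # Running peak of the equity curve, starting from 0 before the first bar
    peak = np.maximum.accumulate(np.concatenate(([0.0], equity)))[1:]
    max_drawdown = np.max(peak - equity)
    if max_drawdown == 0:
        return 0.0               # same guard as the script: no drawdown scores 0

    annualized_mean = pnl.mean() * BARS_PER_YEAR
    return float(annualized_mean / max_drawdown)


# Toy check on four bars
y_demo = np.array([0.01, -0.02, 0.03, -0.01])
factor_demo = np.array([1.0, -1.0, -2.0, 0.5])
score = sign_strategy_fitness(y_demo, factor_demo, None)
print(score)
```

Because the function takes exactly three positional arguments and returns a plain float, it has the shape `fitness.make_fitness(function=..., greater_is_better=True)` asks for. Note the side effect of the zero-drawdown guard: a factor that never loses also scores 0, so you may prefer a different denominator such as the return volatility.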

Here we mine factors on futures minute bars and compute the factor values:

import pandas as pd
import numpy as np
from gplearn import fitness
from gplearn.genetic import SymbolicRegressor

train_data = pd.read_csv('../backtesting/test/IC_train.csv', index_col=0, parse_dates=[0])
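feature_names = list(train_data.columns)  # record the feature columns before the label is added
# Label: log return from the open 1 bar ahead to the open 4 bars ahead
train_data.loc[:, 'y'] = np.log(train_data['Open'].shift(-4) / train_data['Open'].shift(-1))
train_data.dropna(inplace=True)  # the last rows have no forward return
print(train_data)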

def my_gplearn(function_set, score_func_basic, pop_num=100, gen_num=3, tour_num=10, random_state=42, feature_names=None):
    # Some candidate values for (pop_num, gen_num, tour_num): (500, 5, 50); (1000, 3, 20); (1000, 15, 100)
    # make_fitness expects function(y, y_pred, sample_weight) returning a floating point number:
    # y is the target vector, y_pred is the genetic program's output, sample_weight is the sample weight vector.
    metric = fitness.make_fitness(function=score_func_basic,
                        greater_is_better=True,  # a higher return/drawdown ratio is better
                        wrap=False)  # skip the pickling wrapper: faster, fine when the model is not saved
    return SymbolicRegressor(population_size=pop_num,  # number of programs (formulas) in each generation, e.g. 500 or 100
                              generations=gen_num,  # number of generations to evolve, e.g. 10 or 3
                              metric=metric,  # fitness metric: long when the factor is > 0, short when < 0, scored by cumulative return / max drawdown
                              tournament_size=tour_num,  # programs drawn for each tournament; the fittest one is selected to breed or mutate
                              function_set=function_set,
                              const_range=(-1.0, 1.0),  # range of constants allowed in a program
                              parsimony_coefficient='auto',
                              # Penalty on large trees (default 0.001). 'auto' uses c = Cov(l, f) / Var(l), where Cov(l, f) is the
                              # covariance between program size l and program fitness f in the population, and Var(l) is the variance of program sizes.
                              stopping_criteria=100.0,  # stop early once the metric (here return/drawdown) reaches this value
                              init_depth=(2, 3),  # initial tree depth: at least 2 levels, at most 3
                              init_method='half and half',  # tree shape: 'grow' gives asymmetric trees, 'full' gives bushy trees, 'half and half' mixes both
                              p_crossover=0.8,  # crossover probability
                              p_subtree_mutation=0.05,  # subtree mutation probability
                              p_hoist_mutation=0.05,  # hoist mutation probability (alternative: 0.15)
                              p_point_mutation=0.05,  # point mutation probability
                              p_point_replace=0.05,  # in point mutation, probability that any given node is replaced
                              max_samples=1.0,  # fraction of samples drawn from X to evaluate each program on
                              feature_names=feature_names, warm_start=False, low_memory=False,
                              n_jobs=1,
                              verbose=1,
                              random_state=random_state)

# Generate factors
# Function set used to build and evolve programs
function_set = ['add', 'sub', 'mul', 'div', 'sqrt', 'log',
                'abs', 'neg', 'inv', 'sin', 'cos', 'tan', 'max', 'min',
                # 'if', 'gtpn', 'andpn', 'orpn', 'ltpn', 'gtp', 'andp', 'orp', 'ltp', 'gtn', 'andn', 'orn', 'ltn', 'delayy', 'delta', 'signedpower', 'decayl', 'stdd', 'rankk'
                ]  # the commented-out line lists our custom functions; for now results are better without them

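def score_func_basic(y, y_pred, sample_weight):  # factor evaluation metric
    # NOTE: `bt` is not defined in this snippet. It is presumably the vectorized backtest
    # engine from the earlier gplearn posts and must exist before fitting; otherwise the
    # except branch below silently scores every factor as 0.
    try:
        result = bt.run_(factor=y_pred)
        # max_drawdown can be swapped for annualized_std
        factor_ret = result['annualized_mean'] / result['max_drawdown'] if result['max_drawdown'] != 0 else 0
    except Exception:
        factor_ret = 0  # any backtest failure scores the factor as 0
    return factor_ret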

factor_num = 1  # factor index
my_cmodel_gp = my_gplearn(function_set, score_func_basic, random_state=0,
                          feature_names=feature_names)  # change random_state to generate different factors
# Features are the columns up to and including 'rank_num'; the label is 'y'
my_cmodel_gp.fit(train_data.loc[:, :'rank_num'].values, train_data.loc[:, 'y'].values)
print(my_cmodel_gp)

test_data = pd.read_csv('../backtesting/test/IC_test.csv', index_col=0, parse_dates=[0])
factor = my_cmodel_gp.predict(test_data.values)
print(factor)
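Two details in the script are easy to get wrong: how the label is aligned, and how to judge the factor once it has been computed on the test set. The sketch below rebuilds the label on a tiny made-up price series so the alignment is visible, then scores a factor with a rank correlation against that label. The helper names `forward_log_return` and `rank_ic` are hypothetical, and rank IC is just one common sanity check, not something the original script computes.

```python
import math

import numpy as np
import pandas as pd


def forward_log_return(open_price, enter=1, exit_=4):
    """Log return from the open `enter` bars ahead to the open `exit_` bars ahead."""
    return np.log(open_price.shift(-exit_) / open_price.shift(-enter))


def rank_ic(factor, fwd_ret):
    """Rank correlation between factor and forward return (Pearson on ranks)."""
    df = pd.concat({"f": factor, "r": fwd_ret}, axis=1).dropna()
    return df["f"].rank().corr(df["r"].rank())


# Made-up open prices for seven consecutive bars
open_price = pd.Series([100.0, 101.0, 103.0, 102.0, 104.0, 107.0, 106.0])
y = forward_log_return(open_price)
print(y)  # the last 4 bars have no label and come out as NaN

# The label at bar 0 only uses prices from bars 1 and 4, never bar 0 itself
first_label = y.iloc[0]
expected_first = math.log(104.0 / 101.0)

# A factor identical to the label ranks perfectly; its negative ranks in reverse
ic_same = rank_ic(y, y)
ic_flipped = rank_ic(-y, y)
print(ic_same, ic_flipped)
```

On real data you would pass the array returned by `predict` (wrapped in a Series with the test index) as `factor` and the forward return built from the test set's `Open` column as `fwd_ret`.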