云子量化 (Yunzi Quant)

Efficient Python 16 --- A sort_values pitfall that demands extreme caution

Author: yunjinqi  Category: Programming  Date: 2025-02-15 11:29:24

A sort_values pitfall that demands extreme caution

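Background
While optimizing the empyrical module today, I found that a test case that passed on Windows 11 failed on Ubuntu 18.04. Tracing it down, the culprit turned out to be sort_values. When you call DataFrame.sort_values(by="value1") and the value1 column contains equal values, the default algorithm (kind='quicksort') is not a stable sort, so the relative order of the tied rows is not guaranteed and the result on Windows may differ from the result on Ubuntu.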
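To make the problem concrete, here is a minimal sketch on a toy frame (the column name "row" and the data are made up for illustration, not taken from empyrical). It only relies on what a stable sort guarantees: rows with equal keys keep their original relative order. For the default sort, no particular order of the tied rows is assumed.

```python
import pandas as pd

# Toy frame: "value1" contains ties; "row" records the original order.
df = pd.DataFrame({
    "value1": [0, 1, 0, 1, 0, 1, 0, 1],
    "row": list("abcdefgh"),
})

# Default kind="quicksort": keys come out ascending, but the order of
# rows inside each group of equal keys is not guaranteed.
default_sorted = df.sort_values(by="value1")

# kind="stable": rows with equal keys keep their original relative order.
stable_sorted = df.sort_values(by="value1", kind="stable")

print(stable_sorted["row"].tolist())  # ['a', 'c', 'e', 'g', 'b', 'd', 'f', 'h']
```

Whether default_sorted happens to match stable_sorted on a given machine is exactly the thing you cannot rely on.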

Example

Gitee link: https://gitee.com/yunjinqi/empyrical/blob/master/tests/test_one_function.py

import numpy as np
import pandas as pd
from numpy.testing import assert_almost_equal


def beta_fragility_heuristic_aligned(returns, factor_returns):
    """Estimate fragility to drop in beta.

    Parameters
    ----------
    returns : pd.Series or np.ndarray
        Daily returns of the strategy, noncumulative.
        - See full explanation in :func:`~empyrical.stats.cum_returns`.
    factor_returns : pd.Series or np.ndarray
        Daily noncumulative returns of the factor to which beta is
        computed. Usually a benchmark such as the market.
        - This is in the same style as returns.

    Returns
    -------
    float, np.nan
        The beta fragility of the strategy.

    Note
    ----
    If they are pd.Series, expects returns and factor_returns have already
    been aligned on their labels.  If np.ndarray, these arguments should have
    the same shape.

    See also::
    `A New Heuristic Measure of Fragility and Tail Risks: Application to Stress Testing`
    https://www.imf.org/external/pubs/ft/wp/2012/wp12216.pdf
    An IMF Working Paper describing the heuristic
    """
    if len(returns) < 3 or len(factor_returns) < 3:
        return np.nan

    # combine returns and factor returns into pairs
    returns_series = pd.Series(returns)
    factor_returns_series = pd.Series(factor_returns)
    pairs = pd.concat([returns_series, factor_returns_series], axis=1)
    pairs.columns = ['returns', 'factor_returns']

    # exclude any rows containing nan
    pairs = pairs.dropna()
    # sort by factor returns; kind='stable' keeps tied rows in their
    # original order, so the result is the same on every platform
    pairs = pairs.sort_values(by=['factor_returns'], kind='stable')
    print(pairs)  # debug output: inspect the sorted order
    # find the three vectors, using median of 3
    start_index = 0
    mid_index = int(np.around(len(pairs) / 2, 0))
    end_index = len(pairs) - 1

    (start_returns, start_factor_returns) = pairs.iloc[start_index]
    (mid_returns, mid_factor_returns) = pairs.iloc[mid_index]
    (end_returns, end_factor_returns) = pairs.iloc[end_index]

    factor_returns_range = (end_factor_returns - start_factor_returns)
    start_returns_weight = 0.5
    end_returns_weight = 0.5

    # find weights for the start and end returns
    # using a convex combination
    if factor_returns_range != 0:
        start_returns_weight = \
            (mid_factor_returns - start_factor_returns) / \
            factor_returns_range
        end_returns_weight = \
            (end_factor_returns - mid_factor_returns) / \
            factor_returns_range

    # calculate fragility heuristic
    heuristic = (start_returns_weight * start_returns) + \
        (end_returns_weight * end_returns) - mid_returns

    return heuristic


if __name__ == '__main__':
    mixed_returns = pd.Series(
        np.array([np.nan, 1., 10., -4., 2., 3., 2., 1., -10.]) / 100,
        index=pd.date_range('2000-1-30', periods=9, freq='D'))
    simple_benchmark = pd.Series(
        np.array([0., 1., 0., 1., 0., 1., 0., 1., 0.]) / 100,
        index=pd.date_range('2000-1-30', periods=9, freq='D'))
    actual_value = beta_fragility_heuristic_aligned(mixed_returns, simple_benchmark)
    expected_value = 0.09
    assert_almost_equal(actual_value, expected_value)
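Why does the order of tied rows change the test result? The heuristic reads the first, middle and last rows after sorting, and in this data half of the rows share the smallest factor return. The sketch below reuses the same numbers (with the NaN row already dropped) and a condensed helper, heuristic_from_sorted, which is a hypothetical name introduced here for illustration. It compares two orderings that are both valid ascending sorts by factor_returns and differ only in how ties are arranged.

```python
import numpy as np
import pandas as pd


def heuristic_from_sorted(pairs):
    """Apply the start/mid/end formula to an already sorted frame."""
    start = pairs.iloc[0]
    mid = pairs.iloc[int(np.around(len(pairs) / 2, 0))]
    end = pairs.iloc[-1]
    factor_range = end["factor_returns"] - start["factor_returns"]
    w_start = w_end = 0.5
    if factor_range != 0:
        w_start = (mid["factor_returns"] - start["factor_returns"]) / factor_range
        w_end = (end["factor_returns"] - mid["factor_returns"]) / factor_range
    return w_start * start["returns"] + w_end * end["returns"] - mid["returns"]


pairs = pd.DataFrame({
    "returns": np.array([1., 10., -4., 2., 3., 2., 1., -10.]) / 100,
    "factor_returns": np.array([1., 0., 1., 0., 1., 0., 1., 0.]) / 100,
})

# Ordering 1: stable sort, ties stay in their original order.
stable = pairs.sort_values(by="factor_returns", kind="stable")

# Ordering 2: reverse the rows first, then stable sort, so the keys are
# still ascending but every group of ties is in the opposite order.
ties_reversed = pairs.iloc[::-1].sort_values(by="factor_returns", kind="stable")

print(heuristic_from_sorted(stable))         # approximately 0.09
print(heuristic_from_sorted(ties_reversed))  # approximately -0.11
```

Both frames are correctly sorted by factor_returns, yet the heuristic differs, so a test that expects 0.09 can only pass reliably when the tie order is pinned down.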

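Recommendation
The official documentation for sort_values is at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values. When calling it, I recommend passing kind='mergesort' or kind='stable' so that rows with equal keys keep their original order and the sorted result is the same on every platform.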
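As a quick check of this recommendation, the sketch below sorts a small made-up frame with both stable options and confirms they agree. The helper has_ties is a hypothetical name, not part of pandas; it simply reports whether the sort key contains duplicates, which is the situation where the choice of kind matters.

```python
import pandas as pd

df = pd.DataFrame({
    "factor_returns": [0.0, 0.01, 0.0, 0.01, 0.0],
    "returns": [0.10, 0.01, 0.02, -0.04, -0.10],
})

by_stable = df.sort_values(by="factor_returns", kind="stable")
by_mergesort = df.sort_values(by="factor_returns", kind="mergesort")

# Both stable algorithms should produce the same row order.
assert by_stable.equals(by_mergesort)


def has_ties(frame, column):
    """Hypothetical helper: True if the sort key has duplicate values."""
    return bool(frame[column].duplicated().any())


print(has_ties(df, "factor_returns"))  # True
print(by_stable.index.tolist())        # [0, 2, 4, 1, 3]
```

If has_ties returns False for your key, every sort algorithm gives the same order and the default is safe; when it returns True, a stable kind is the simplest way to get reproducible results.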
