Pandas Describe By - Additional Parameters

I see that the pandas library has a Describe by function which returns some useful statistics. However, is there a way to add additional rows to the output such as standard deviat

Solution 1:

The default describe output looks like this:

import numpy as np
import pandas as pd

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(100, 5), columns=list('ABCDE'))

df.describe()

                A           B           C           D           E
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.495871    0.472939    0.455570    0.503899    0.451341
std      0.303589    0.291968    0.294984    0.269936    0.284666
min      0.006453    0.001559    0.001068    0.015311    0.009526
25%      0.239379    0.219141    0.196251    0.294371    0.202956
50%      0.529596    0.456548    0.376558    0.532002    0.432936
75%      0.759452    0.739666    0.665563    0.730702    0.686793
max      0.999799    0.994510    0.997271    0.981551    0.979221

Updated for pandas > 0.21.0: I'd write my own describe, as below. It should be obvious how to add more statistics.

def describe(df, stats):
    d = df.describe()
    # keep only the columns describe() reported, then stack the extra statistics underneath
    return d.append(df.reindex(d.columns, axis=1).agg(stats))

describe(df, ['skew', 'mad', 'kurt'])

                A           B           C           D           E
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.495871    0.472939    0.455570    0.503899    0.451341
std      0.303589    0.291968    0.294984    0.269936    0.284666
min      0.006453    0.001559    0.001068    0.015311    0.009526
25%      0.239379    0.219141    0.196251    0.294371    0.202956
50%      0.529596    0.456548    0.376558    0.532002    0.432936
75%      0.759452    0.739666    0.665563    0.730702    0.686793
max      0.999799    0.994510    0.997271    0.981551    0.979221
skew    -0.014942    0.048054    0.247244   -0.125151    0.066156
mad      0.267730    0.249968    0.254351    0.228558    0.242874
kurt    -1.323469   -1.223123   -1.095713   -1.083420   -1.148642
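
On pandas 2.0 and later, DataFrame.append and DataFrame.mad no longer exist, so the helper above needs adjusting there. A minimal sketch of the same idea using pd.concat follows. The names mean_abs_dev and extra_funcs are hypothetical (not part of pandas or of the original answer), and the random data is only illustrative:

```python
import numpy as np
import pandas as pd


def mean_abs_dev(s):
    """Mean absolute deviation around the mean (hypothetical stand-in for the removed .mad())."""
    return (s - s.mean()).abs().mean()


def describe(df, stats, extra_funcs=None):
    """Stack extra statistics under df.describe().

    stats: non-empty list of aggregation names understood by DataFrame.agg,
           e.g. ['skew', 'kurt'].
    extra_funcs: optional dict mapping a row label to a function that takes
           one column (a Series) and returns a scalar. This parameter is an
           assumption of this sketch.
    """
    d = df.describe()
    numeric = df[d.columns]          # only the columns describe() reported
    rows = [d, numeric.agg(stats)]
    for name, func in (extra_funcs or {}).items():
        # apply() calls func once per column; turn the result into a one-row frame
        rows.append(numeric.apply(func).rename(name).to_frame().T)
    return pd.concat(rows)


rng = np.random.default_rng(0)       # illustrative data, not the frame used above
df = pd.DataFrame(rng.random((100, 5)), columns=list("ABCDE"))
out = describe(df, ["skew", "kurt"], {"mad": mean_abs_dev})
print(out)
```

Passing the custom function through apply rather than agg keeps the per-column call explicit, which avoids depending on how agg treats callables in a given pandas version.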

Updated for pandas 0.20: I'd write my own describe, as below. It should be obvious how to add more statistics.

def describe(df, stats):
    d = df.describe()
    return d.append(df.reindex_axis(d.columns, 1).agg(stats))

describe(df, ['skew', 'mad', 'kurt'])

                A           B           C           D           E
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.495871    0.472939    0.455570    0.503899    0.451341
std      0.303589    0.291968    0.294984    0.269936    0.284666
min      0.006453    0.001559    0.001068    0.015311    0.009526
25%      0.239379    0.219141    0.196251    0.294371    0.202956
50%      0.529596    0.456548    0.376558    0.532002    0.432936
75%      0.759452    0.739666    0.665563    0.730702    0.686793
max      0.999799    0.994510    0.997271    0.981551    0.979221
skew    -0.014942    0.048054    0.247244   -0.125151    0.066156
mad      0.267730    0.249968    0.254351    0.228558    0.242874
kurt    -1.323469   -1.223123   -1.095713   -1.083420   -1.148642

Old Answer

def describe(df):
    return pd.concat([df.describe().T,
                      df.mad().rename('mad'),
                      df.skew().rename('skew'),
                      df.kurt().rename('kurt'),
                     ], axis=1).T

describe(df)

                A           B           C           D           E
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.495871    0.472939    0.455570    0.503899    0.451341
std      0.303589    0.291968    0.294984    0.269936    0.284666
min      0.006453    0.001559    0.001068    0.015311    0.009526
25%      0.239379    0.219141    0.196251    0.294371    0.202956
50%      0.529596    0.456548    0.376558    0.532002    0.432936
75%      0.759452    0.739666    0.665563    0.730702    0.686793
max      0.999799    0.994510    0.997271    0.981551    0.979221
mad      0.267730    0.249968    0.254351    0.228558    0.242874
skew    -0.014942    0.048054    0.247244   -0.125151    0.066156
kurt    -1.323469   -1.223123   -1.095713   -1.083420   -1.148642

Solution 2:

The answer from piRSquared makes the most sense to me, but reindex_axis gives me a deprecation warning (running under Python 3.5). This works for me:

    stats = data.describe()
    stats.loc['IQR'] = stats.loc['75%'] - stats.loc['25%'] # appending interquartile range instead of recalculating it
    stats = stats.append(data.reindex(stats.columns, axis=1).agg(['skew', 'mad', 'kurt']))
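
As a quick check of the IQR row, the sketch below rebuilds it on illustrative random data and compares it with the quartiles computed directly by DataFrame.quantile. It uses pd.concat instead of append and leaves out 'mad' so that it also runs on pandas 2.0 and later; the frame named data here is made up for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)                      # illustrative data only
data = pd.DataFrame(rng.random((100, 3)), columns=list("ABC"))

stats = data.describe()
# reuse the quartiles describe() already computed
stats.loc["IQR"] = stats.loc["75%"] - stats.loc["25%"]
# stack further statistics underneath, restricted to the described columns
stats = pd.concat([stats, data[stats.columns].agg(["skew", "kurt"])])

# the same interquartile range, recalculated from scratch for comparison
iqr_direct = data.quantile(0.75) - data.quantile(0.25)
print(stats)
print(np.allclose(stats.loc["IQR"], iqr_direct))
```

Both routes use the default linear interpolation for quantiles, so the two IQR values should agree up to floating-point error.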

Solution 3:

Try this:

 df.describe()

       num1  num2
count   3.0   3.0
mean    2.0   5.0
std     1.0   1.0
min     1.0   4.0
25%     1.5   4.5
50%     2.0   5.0
75%     2.5   5.5
max     3.0   6.0

Build a second DataFrame.

 pd.DataFrame(df.mad(), columns=["Mad"]).T

         num1      num2
Mad  0.666667  0.666667

Join the two DataFrames.

 pd.concat([df.describe(), pd.DataFrame(df.mad(), columns=["Mad"]).T])

           num1      num2
count  3.000000  3.000000
mean   2.000000  5.000000
std    1.000000  1.000000
min    1.000000  4.000000
25%    1.500000  4.500000
50%    2.000000  5.000000
75%    2.500000  5.500000
max    3.000000  6.000000
Mad    0.666667  0.666667
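
The same steps can be sketched end to end without df.mad(), which was removed in pandas 2.0. The input frame is not shown in this answer, so the values below are an assumption, chosen only because they are consistent with the statistics printed above:

```python
import pandas as pd

# Assumed data: not given in the answer, picked to match the printed statistics.
df = pd.DataFrame({"num1": [1, 2, 3], "num2": [4, 5, 6]})

# Mean absolute deviation around the mean, computed by hand per column.
mad = (df - df.mean()).abs().mean()

# Turn the Series into a one-row DataFrame labelled "Mad".
mad_row = mad.to_frame("Mad").T

# Join the two DataFrames row-wise.
result = pd.concat([df.describe(), mad_row])
print(result)
```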
