Pandas Describe By - Additional Parameters
I see that the pandas library has a Describe by function which returns some useful statistics. However, is there a way to add additional rows to the output such as standard deviat
Solution 1:
The default describe looks like this:
import numpy as np
import pandas as pd
np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(100, 5), columns=list('ABCDE'))
df.describe()
                A           B           C           D           E
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.495871    0.472939    0.455570    0.503899    0.451341
std      0.303589    0.291968    0.294984    0.269936    0.284666
min      0.006453    0.001559    0.001068    0.015311    0.009526
25%      0.239379    0.219141    0.196251    0.294371    0.202956
50%      0.529596    0.456548    0.376558    0.532002    0.432936
75%      0.759452    0.739666    0.665563    0.730702    0.686793
max      0.999799    0.994510    0.997271    0.981551    0.979221
Updated for pandas > 0.21.0
I'd make my own describe like below. It should be obvious how to add more.
def describe(df, stats):
    d = df.describe()
    return d.append(df.reindex(d.columns, axis=1).agg(stats))
describe(df, ['skew', 'mad', 'kurt'])
                A           B           C           D           E
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.495871    0.472939    0.455570    0.503899    0.451341
std      0.303589    0.291968    0.294984    0.269936    0.284666
min      0.006453    0.001559    0.001068    0.015311    0.009526
25%      0.239379    0.219141    0.196251    0.294371    0.202956
50%      0.529596    0.456548    0.376558    0.532002    0.432936
75%      0.759452    0.739666    0.665563    0.730702    0.686793
max      0.999799    0.994510    0.997271    0.981551    0.979221
skew    -0.014942    0.048054    0.247244   -0.125151    0.066156
mad      0.267730    0.249968    0.254351    0.228558    0.242874
kurt    -1.323469   -1.223123   -1.095713   -1.083420   -1.148642
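On pandas 2.0 and later, DataFrame.append and DataFrame.mad no longer exist, so the function above raises an error there. Here is a minimal sketch of the same idea for newer versions, assuming pd.concat as the stand-in for append. The helper name mean_abs_dev is hypothetical (it is not part of pandas) and computes the mean absolute deviation around the mean by hand.

```python
import numpy as np
import pandas as pd


def mean_abs_dev(frame):
    # Hypothetical helper: column-wise mean absolute deviation around the
    # mean, standing in for the removed DataFrame.mad.
    return (frame - frame.mean()).abs().mean()


def describe(df, stats):
    d = df.describe()
    # Aggregate only the columns describe() kept, then stack the new rows
    # underneath the default ones.
    extra = df[d.columns].agg(stats)
    return pd.concat([d, extra])


np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(100, 5), columns=list('ABCDE'))

result = describe(df, ['skew', 'kurt'])
# Statistics without a built-in name can be added as one more labelled row.
result.loc['mad'] = mean_abs_dev(df[result.columns])
print(result)
```

Anything agg accepts by name ('sem', 'var', 'median', and so on) can go in the stats list. Hand-written statistics are added as extra rows afterwards, which avoids passing custom callables through agg.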
Updated for pandas 0.20
I'd make my own describe like below. It should be obvious how to add more.
def describe(df, stats):
    d = df.describe()
    return d.append(df.reindex_axis(d.columns, 1).agg(stats))
describe(df, ['skew', 'mad', 'kurt'])
                A           B           C           D           E
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.495871    0.472939    0.455570    0.503899    0.451341
std      0.303589    0.291968    0.294984    0.269936    0.284666
min      0.006453    0.001559    0.001068    0.015311    0.009526
25%      0.239379    0.219141    0.196251    0.294371    0.202956
50%      0.529596    0.456548    0.376558    0.532002    0.432936
75%      0.759452    0.739666    0.665563    0.730702    0.686793
max      0.999799    0.994510    0.997271    0.981551    0.979221
skew    -0.014942    0.048054    0.247244   -0.125151    0.066156
mad      0.267730    0.249968    0.254351    0.228558    0.242874
kurt    -1.323469   -1.223123   -1.095713   -1.083420   -1.148642
Old Answer
def describe(df):
    return pd.concat([df.describe().T,
                      df.mad().rename('mad'),
                      df.skew().rename('skew'),
                      df.kurt().rename('kurt'),
                     ], axis=1).T
describe(df)
                A           B           C           D           E
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.495871    0.472939    0.455570    0.503899    0.451341
std      0.303589    0.291968    0.294984    0.269936    0.284666
min      0.006453    0.001559    0.001068    0.015311    0.009526
25%      0.239379    0.219141    0.196251    0.294371    0.202956
50%      0.529596    0.456548    0.376558    0.532002    0.432936
75%      0.759452    0.739666    0.665563    0.730702    0.686793
max      0.999799    0.994510    0.997271    0.981551    0.979221
mad      0.267730    0.249968    0.254351    0.228558    0.242874
skew    -0.014942    0.048054    0.247244   -0.125151    0.066156
kurt    -1.323469   -1.223123   -1.095713   -1.083420   -1.148642
Solution 2:
The answer from piRSquared makes the most sense to me, but reindex_axis gives me a deprecation warning (it comes from pandas; I'm running Python 3.5). This works for me:
stats = data.describe()
stats.loc['IQR'] = stats.loc['75%'] - stats.loc['25%'] # appending interquartile range instead of recalculating it
stats = stats.append(data.reindex(stats.columns, axis=1).agg(['skew', 'mad', 'kurt']))
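The IQR line works because describe() already stores the quartiles under the row labels '25%' and '75%', so nothing has to be recomputed. Here is a small check of that idea against DataFrame.quantile. The frame and its column names are made up purely for illustration.

```python
import pandas as pd

# Illustrative data (an assumption, not from the question).
data = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                     'y': [10.0, 20.0, 40.0, 80.0, 160.0]})

stats = data.describe()
# Reuse the quartile rows that describe() already computed.
stats.loc['IQR'] = stats.loc['75%'] - stats.loc['25%']

# The same quantity computed directly from the data, for comparison.
direct = data.quantile(0.75) - data.quantile(0.25)
print(stats.loc['IQR'])
print(direct)
```

Both approaches use pandas' default linear interpolation for quantiles, so the two results should agree.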
Solution 3:
Try this:
df.describe()
       num1  num2
count   3.0   3.0
mean    2.0   5.0
std     1.0   1.0
min     1.0   4.0
25%     1.5   4.5
50%     2.0   5.0
75%     2.5   5.5
max     3.0   6.0
Build a second DataFrame.
pd.DataFrame(df.mad(), columns=["Mad"]).T
num1 num2
Mad 0.666667 0.666667
Join the two DataFrames.
pd.concat([df.describe(), pd.DataFrame(df.mad(), columns=["Mad"]).T])
           num1      num2
count  3.000000  3.000000
mean   2.000000  5.000000
std    1.000000  1.000000
min    1.000000  4.000000
25%    1.500000  4.500000
50%    2.000000  5.000000
75%    2.500000  5.500000
max    3.000000  6.000000
Mad    0.666667  0.666667
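The same two-step approach can be sketched without df.mad(), which is gone in pandas 2.0 and later. The input frame is never shown above, so the values here are an assumption, chosen because they reproduce the count, mean, std, min and max in the output. The mean absolute deviation is computed by hand.

```python
import pandas as pd

# Assumed data: consistent with the describe() output shown above.
df = pd.DataFrame({'num1': [1, 2, 3], 'num2': [4, 5, 6]})

# Column-wise mean absolute deviation around the mean.
mad = (df - df.mean()).abs().mean()

# Turn the Series into a one-row DataFrame labelled "Mad".
mad_row = mad.to_frame(name='Mad').T

# Stack it under the default describe() rows.
result = pd.concat([df.describe(), mad_row])
print(result)
```

Using Series.to_frame(name=...) names the row explicitly instead of relying on the columns argument of the DataFrame constructor.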