Indexing And Data Columns In Pandas/pytables
Solution 1:
You should just try it.
In [22]: df = DataFrame(np.random.randn(5,2),columns=['A','B'])
In [23]: store = pd.HDFStore('test.h5',mode='w')
In [24]: store.append('df_only_indexables',df)
In [25]: store.append('df_with_data_columns',df,data_columns=True)
In [26]: store.append('df_no_index',df,data_columns=True,index=False)
In [27]: store
Out[27]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df_no_index frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])
/df_only_indexables frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index])
/df_with_data_columns frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])
In [28]: store.close()
you automatically get the index of the stored frame as a queryable column. By default NO other columns can be queried.
If you specify
data_columns=True
ordata_columns=list_of_columns
, then these are stored separately and can then be subsequently queried.If you specify
index=False
then aPyTables
index is not automatically created for the queryable column (eg. theindex
and/ordata_columns
).
To see the actual indexes being created (the PyTables
indexes), see the output below. colindexes
defines which columns have an actual PyTables
index created. (I have truncated it somewhat).
/df_no_index/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"A": Float64Col(shape=(), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
/df_no_index/table._v_attrs (AttributeSet), 15 attributes:
[A_dtype := 'float64',
A_kind := ['A'],
B_dtype := 'float64',
B_kind := ['B'],
CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'A',
FIELD_2_FILL := 0.0,
FIELD_2_NAME := 'B',
NROWS := 5,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer']
/df_only_indexables/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1)}
byteorder := 'little'
chunkshape := (2730,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/df_only_indexables/table._v_attrs (AttributeSet), 11 attributes:
[CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'values_block_0',
NROWS := 5,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer',
values_block_0_dtype := 'float64',
values_block_0_kind := ['A', 'B']]
/df_with_data_columns/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"A": Float64Col(shape=(), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
autoindex := True
colindexes := {
"A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"B": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/df_with_data_columns/table._v_attrs (AttributeSet), 15 attributes:
[A_dtype := 'float64',
A_kind := ['A'],
B_dtype := 'float64',
B_kind := ['B'],
CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'A',
FIELD_2_FILL := 0.0,
FIELD_2_NAME := 'B',
NROWS := 5,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer']
So if you want to query a column, make it a data_column
. If you don't then they will be stored in blocks by dtype (faster / less space).
You normally always want to index a column for retrieval, BUT, if you are creating and then appending multiple files to a single store, you usually turn off the index creation and do it at the end (as this is pretty expensive to create as you go).
See the cookbook for a menagerie of questions.
Post a Comment for "Indexing And Data Columns In Pandas/pytables"