Indexing And Data Columns In Pandas/pytables

September 07, 2023

http://pandas.pydata.org/pandas-docs/stable/io.html#indexing I'm really confused about this concept of Data columns in Pandas HDF5 IO. Plus there's very little to no information ab

Solution 1:

You should just try it.

In [20]: import numpy as np

In [21]: import pandas as pd

In [22]: df = pd.DataFrame(np.random.randn(5,2),columns=['A','B'])

In [23]: store = pd.HDFStore('test.h5',mode='w')

In [24]: store.append('df_only_indexables',df)

In [25]: store.append('df_with_data_columns',df,data_columns=True)

In [26]: store.append('df_no_index',df,data_columns=True,index=False)

In [27]: store
Out[27]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df_no_index                     frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])
/df_only_indexables              frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index])          
/df_with_data_columns            frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])

In [28]: store.close()

You automatically get the index of the stored frame as a queryable column. By default, no other columns can be queried.
If you specify data_columns=True or data_columns=list_of_columns, those columns are stored separately and can subsequently be queried.
If you specify index=False, a PyTables index is not automatically created for the queryable columns (i.e. the index and/or the data_columns). The columns remain queryable; they just have no index behind them.
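As a rough illustration of the three points above, the sketch below queries the same frame stored with and without data columns. It assumes PyTables is installed, writes to a throwaway temporary file (the name query_demo.h5 is made up for this example), and reuses the key names from the session above.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 2), columns=["A", "B"])

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "query_demo.h5")

with pd.HDFStore(path, mode="w") as store:
    store.append("df_only_indexables", df)
    store.append("df_with_data_columns", df, data_columns=True)

    # The index is queryable whether or not data_columns was given.
    by_index = store.select("df_only_indexables", where="index >= 3")

    # A is a data column in this table, so it may appear in a where clause.
    positive_a = store.select("df_with_data_columns", where="A > 0")

    # The same query is rejected when A lives inside a values block.
    try:
        store.select("df_only_indexables", where="A > 0")
        query_failed = False
    except ValueError:
        query_failed = True

print(len(by_index), len(positive_a), query_failed)
```

The failing query is the practical meaning of "by default no other columns can be queried": pandas refuses the where expression because A is not a known reference in that table.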

To see the actual indexes being created (the PyTables indexes), see the output below. colindexes defines which columns have an actual PyTables index created. (I have truncated it somewhat).

/df_no_index/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  /df_no_index/table._v_attrs (AttributeSet), 15 attributes:
   [A_dtype := 'float64',
    A_kind := ['A'],
    B_dtype := 'float64',
    B_kind := ['B'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'A',
    FIELD_2_FILL := 0.0,
    FIELD_2_NAME := 'B',
    NROWS := 5,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer']
/df_only_indexables/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df_only_indexables/table._v_attrs (AttributeSet), 11 attributes:
   [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'values_block_0',
    NROWS := 5,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer',
    values_block_0_dtype := 'float64',
    values_block_0_kind := ['A', 'B']]
/df_with_data_columns/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoindex := True
  colindexes := {
    "A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "B": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df_with_data_columns/table._v_attrs (AttributeSet), 15 attributes:
   [A_dtype := 'float64',
    A_kind := ['A'],
    B_dtype := 'float64',
    B_kind := ['B'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'A',
    FIELD_2_FILL := 0.0,
    FIELD_2_NAME := 'B',
    NROWS := 5,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer']
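A compact way to compare the three tables, without reading the full dumps, is to look at the underlying PyTables table of each key. This is only a sketch: it assumes PyTables is installed, uses a temporary file with a made-up name, and relies on the table's colnames and colindexes attributes to summarise what the listings above show.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 2), columns=["A", "B"])

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "layout_demo.h5")

layout = {}
with pd.HDFStore(path, mode="w") as store:
    store.append("df_only_indexables", df)
    store.append("df_with_data_columns", df, data_columns=True)
    store.append("df_no_index", df, data_columns=True, index=False)

    for key in ["df_only_indexables", "df_with_data_columns", "df_no_index"]:
        table = store.get_storer(key).table  # the underlying PyTables Table
        # colnames: physical columns on disk; colindexes: columns with an index.
        layout[key] = (list(table.colnames), sorted(table.colindexes))

for key, (columns, indexed) in layout.items():
    print(f"{key}: columns={columns} indexed={indexed}")
```

Printing the table object itself (print(table)) gives a listing in the same style as the dumps above.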

So if you want to query a column, make it a data_column. If you don't, the columns are stored together in blocks by dtype, which is faster and takes less space.

You normally want to index a column for retrieval. However, if you are creating a store and then appending multiple files to it, you usually turn off index creation and build the indexes once at the end, since they are fairly expensive to maintain as you go.
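The append-then-index pattern might look roughly like this. The three random chunks stand in for the multiple files, the key "df" and the file name are invented for the example, and the optlevel and kind values are just one possible choice, not a recommendation from the answer.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-ins for the separate files that would be appended in practice.
chunks = [pd.DataFrame(np.random.randn(5, 2), columns=["A", "B"]) for _ in range(3)]

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "bulk_append_demo.h5")

with pd.HDFStore(path, mode="w") as store:
    for chunk in chunks:
        # index=False skips PyTables index maintenance on every append.
        store.append("df", chunk, data_columns=True, index=False)

    indexed_before = sorted(store.get_storer("df").table.colindexes)

    # Build the indexes once, after all the data has been written.
    store.create_table_index("df", optlevel=9, kind="full")

    indexed_after = sorted(store.get_storer("df").table.colindexes)
    nrows = store.get_storer("df").nrows

print(indexed_before, indexed_after, nrows)
```

With no columns argument, create_table_index covers the index and every data column, which matches what a plain append with data_columns=True would have indexed along the way.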

See the cookbook for a menagerie of questions.
