How To Create An Edge List Dataframe From An Adjacency Matrix In Python?

November 09, 2024

I have a pandas dataframe (think of it as a weighted adjacency matrix of nodes in a network) of the form, df, A B C D A 0 0.5 0.5 0 B 1 0 0 0 C 0.

Solution 1:

Mark the diagonal as NaN, then stack:

np.fill_diagonal(df.values, np.nan)  # in place; needs df.values to be a writable view
df
Out[172]: 
     A    B    C    D
A  NaN  0.5  0.5  0.0
B  1.0  NaN  0.0  0.0
C  0.8  0.0  NaN  0.2
D  0.0  0.0  1.0  NaN
df.stack().reset_index()
Out[173]: 
   level_0 level_1    0
0        A       B  0.5
1        A       C  0.5
2        A       D  0.0
3        B       A  1.0
4        B       C  0.0
5        B       D  0.0
6        C       A  0.8
7        C       B  0.0
8        C       D  0.2
9        D       A  0.0
10       D       B  0.0
11       D       C  1.0
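For reference, a self-contained variant of this approach on the sample matrix might look like the sketch below. It uses DataFrame.mask, which avoids writing into df.values, and calls dropna explicitly because whether stack drops NaN by default depends on the pandas version. The column names Source, Target and Weight are borrowed from the other answers and are not part of this solution.

```python
import numpy as np
import pandas as pd

# Sample weighted adjacency matrix, rebuilt from the output shown above.
labels = list("ABCD")
df = pd.DataFrame(
    [[0.0, 0.5, 0.5, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.8, 0.0, 0.0, 0.2],
     [0.0, 0.0, 1.0, 0.0]],
    index=labels, columns=labels,
)

# Replace the diagonal (self-loops) with NaN without mutating df.
off_diag = df.mask(np.eye(len(df), dtype=bool))

# Stack to long form, drop the NaN diagonal, and name the columns.
edges = (
    off_diag.stack()
            .dropna()
            .rename_axis(["Source", "Target"])
            .reset_index(name="Weight")
)
print(edges)
```

Zero-weight entries off the diagonal are kept here, just as in the output above.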

Solution 2:

Using rename_axis + reset_index + melt:

df.rename_axis('Source')\
  .reset_index()\
  .melt('Source', value_name='Weight', var_name='Target')\
  .query('Source != Target')\
  .reset_index(drop=True)

   Source Target  Weight
0       B      A     1.0
1       C      A     0.8
2       D      A     0.0
3       A      B     0.5
4       C      B     0.0
5       D      B     0.0
6       A      C     0.5
7       B      C     0.0
8       D      C     1.0
9       A      D     0.0
10      B      D     0.0
11      C      D     0.2

DataFrame.melt was added as a method in pandas 0.20; on older versions, use pd.melt instead:

v = df.rename_axis('Source').reset_index()
df = pd.melt(
      v, 
      id_vars='Source', 
      value_name='Weight', 
      var_name='Target'
).query('Source != Target')\
 .reset_index(drop=True)
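The melt result is ordered by Target first (all edges into A, then into B, and so on), whereas the stack-based result in Solution 1 is ordered by Source. If the Source-first ordering matters, one option is to sort after filtering. A minimal sketch on the sample matrix, with the sort step being an addition that is not in the original answer:

```python
import pandas as pd

# Sample weighted adjacency matrix from the question.
labels = list("ABCD")
df = pd.DataFrame(
    [[0.0, 0.5, 0.5, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.8, 0.0, 0.0, 0.2],
     [0.0, 0.0, 1.0, 0.0]],
    index=labels, columns=labels,
)

edges = (
    df.rename_axis("Source")
      .reset_index()
      .melt("Source", var_name="Target", value_name="Weight")
      .query("Source != Target")           # drop self-loops
      .sort_values(["Source", "Target"])   # added step: Source-first order
      .reset_index(drop=True)
)
print(edges)
```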

Timings

x = np.random.randn(1000, 1000)
np.fill_diagonal(x, 0)

df = pd.DataFrame(x)

%%timeit
df.index.name ='Source'
df.reset_index()\
  .melt('Source', value_name='Weight', var_name='Target')\
  .query('Source != Target')\
  .reset_index(drop=True)

1 loop, best of 3: 139 ms per loop

# Wen's solution

%%timeit
np.fill_diagonal(df.values, np.nan)
df.stack().reset_index()

10 loops, best of 3: 45 ms per loop
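The %%timeit cells above only run inside IPython or Jupyter. Outside of those, a rough equivalent can be put together with the standard timeit module. The sketch below is an assumption-laden stand-in rather than the setup used for the numbers above: the function names via_melt and via_stack are made up, the matrix is smaller (200 x 200) so it runs quickly, via_melt filters with a boolean mask instead of query, and via_stack uses mask plus dropna instead of writing into df.values.

```python
import timeit

import numpy as np
import pandas as pd


def via_melt(df):
    # Long form via melt, then drop self-loops with a boolean mask.
    m = (df.rename_axis("Source")
           .reset_index()
           .melt("Source", var_name="Target", value_name="Weight"))
    return m[m["Source"] != m["Target"]].reset_index(drop=True)


def via_stack(df):
    # NaN out the diagonal, stack, and drop the NaN entries.
    masked = df.mask(np.eye(len(df), dtype=bool))
    return masked.stack().dropna().reset_index()


n = 200  # smaller than the 1000 used above, to keep the run short
x = np.random.randn(n, n)
np.fill_diagonal(x, 0)
df = pd.DataFrame(x)

for fn in (via_melt, via_stack):
    # Best of 3 repeats, 3 calls each, reported per call.
    best = min(timeit.repeat(lambda: fn(df), number=3, repeat=3)) / 3
    print(f"{fn.__name__}: {best * 1e3:.1f} ms per loop")
```

Absolute numbers will differ from those quoted in this post, since they depend on the machine, the matrix size and the library versions.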

Solution 3:

Two approaches using NumPy tools -

Approach #1

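def edgelist(df):
    a = df.values
    c = df.columns
    n = len(c)

    # Build an (n, n, 2) array holding every (source, target) label pair.
    c_ar = np.array(c)
    out = np.empty((n, n, 2), dtype=c_ar.dtype)
    out[..., 0] = c_ar[:, None]   # source label varies along rows
    out[..., 1] = c_ar            # target label varies along columns

    # Keep only the off-diagonal entries (no self-loops).
    mask = ~np.eye(n, dtype=bool)
    # A flat list of names; a nested list would create MultiIndex columns.
    df_out = pd.DataFrame(out[mask], columns=['Source', 'Target'])
    df_out['Weight'] = a[mask]
    return df_out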

Sample run -

In [155]: df
Out[155]: 
     A    B    C    D
A  0.0  0.5  0.5  0.0
B  1.0  0.0  0.0  0.0
C  0.8  0.0  0.0  0.2
D  0.0  0.0  1.0  0.0

In [156]: edgelist(df)
Out[156]: 
   Source Target  Weight
0       A      B     0.5
1       A      C     0.5
2       A      D     0.0
3       B      A     1.0
4       B      C     0.0
5       B      D     0.0
6       C      A     0.8
7       C      B     0.0
8       C      D     0.2
9       D      A     0.0
10      D      B     0.0
11      D      C     1.0
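A variant of the same masking idea: instead of materialising an (n, n, 2) array of label pairs, take the row and column positions of the off-diagonal mask with np.nonzero and index the labels with them. The function name edgelist_nonzero is made up for this sketch, and it assumes a square DataFrame whose index and columns hold the node labels.

```python
import numpy as np
import pandas as pd


def edgelist_nonzero(df):
    """Off-diagonal entries of a square adjacency DataFrame as an edge list."""
    a = df.to_numpy()
    # Row/column positions of every off-diagonal cell, in row-major order.
    rows, cols = np.nonzero(~np.eye(len(df), dtype=bool))
    return pd.DataFrame({
        "Source": df.index.to_numpy()[rows],
        "Target": df.columns.to_numpy()[cols],
        "Weight": a[rows, cols],
    })


labels = list("ABCD")
df = pd.DataFrame(
    [[0.0, 0.5, 0.5, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.8, 0.0, 0.0, 0.2],
     [0.0, 0.0, 1.0, 0.0]],
    index=labels, columns=labels,
)
result = edgelist_nonzero(df)
print(result)
```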

Approach #2

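# https://stackoverflow.com/a/46736275/ @Divakar
def skip_diag_strided(A):
    # Strided view over the off-diagonal elements; A must be C-contiguous.
    m = A.shape[0]
    strided = np.lib.stride_tricks.as_strided
    s0,s1 = A.strides
    return strided(A.ravel()[1:], shape=(m-1,m), strides=(s0+s1,s1))

# onecold is called below but is not defined in this post. This version is a
# reconstruction (an assumption, not copied from the linked answer).
def onecold(a):
    n = len(a)
    s = a.strides[0]
    strided = np.lib.stride_tricks.as_strided
    b = np.concatenate((a, a[:-1]))
    return strided(b[1:], shape=(n-1,n), strides=(s,s))

# https://stackoverflow.com/a/48234170/ @Divakar
def combinations_without_repeat(a):
    n = len(a)
    out = np.empty((n,n-1,2),dtype=a.dtype)
    out[:,:,0] = np.broadcast_to(a[:,None], (n, n-1))   # source labels
    out = out.reshape(n-1,n,2)
    out[:,:,1] = onecold(a)                             # target labels
    return out.reshape(-1,2)

# 'U1' keeps single-character labels as str ('S1' would give bytes on Python 3)
cols = df.columns.values.astype('U1')
df_out = pd.DataFrame(combinations_without_repeat(cols), columns=['Source', 'Target'])
df_out['Weight'] = skip_diag_strided(df.values.copy()).ravel()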
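To see what the strided view does, it may help to run skip_diag_strided on a small array whose values equal their flat positions. The sketch below repeats the function so that it runs on its own. The 4 x 4 test array is only an illustration and assumes a C-contiguous input.

```python
import numpy as np


def skip_diag_strided(A):
    # Start one element past A[0, 0] and step m + 1 elements per row,
    # so every row of the view skips exactly one diagonal element.
    m = A.shape[0]
    strided = np.lib.stride_tricks.as_strided
    s0, s1 = A.strides
    return strided(A.ravel()[1:], shape=(m - 1, m), strides=(s0 + s1, s1))


A = np.arange(16).reshape(4, 4)   # diagonal holds 0, 5, 10, 15
off_diag = skip_diag_strided(A).ravel().tolist()
print(off_diag)
# → [1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13, 14]
```

The view has shape (m-1, m) rather than (m, m-1), but flattening it yields the off-diagonal values in row-major order, which is the order the Weight column needs.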

Runtime test

Using @cᴏʟᴅsᴘᴇᴇᴅ's timing setup:

In [704]: x = np.random.randn(1000, 1000)
     ...: np.fill_diagonal(x, 0)
     ...: 
     ...: df = pd.DataFrame(x)

# @cᴏʟᴅsᴘᴇᴇᴅ's soln
In [705]: %%timeit
     ...: df.index.name = 'Source'
     ...: df.reset_index()\
     ...:   .melt('Source', value_name='Weight', var_name='Target')\
     ...:   .query('Source != Target')\
     ...:   .reset_index(drop=True)
10 loops, best of 3: 67.4 ms per loop

# @Wen's soln
In [706]: %%timeit
     ...: np.fill_diagonal(df.values, np.nan)
     ...: df.stack().reset_index()
100 loops, best of 3: 19.6 ms per loop

# Proposed in this post - Approach #1
In [707]: %timeit edgelist(df)
10 loops, best of 3: 24.8 ms per loop

# Proposed in this post - Approach #2
In [708]: %%timeit
     ...: cols = df.columns.values.astype('S1')
     ...: df_out = pd.DataFrame(combinations_without_repeat(cols))
     ...: df_out['Weight'] = skip_diag_strided(df.values.copy()).ravel()
100 loops, best of 3: 17.4 ms per loop

Solution 4:

Using NetworkX 2.x API:

import networkx as nx

In [246]: G = nx.from_pandas_adjacency(df, create_using=nx.MultiDiGraph())

In [247]: G.edges(data=True)
Out[247]: OutMultiEdgeDataView([('A', 'B', {'weight': 0.5}), ('A', 'C', {'weight': 0.5}),
                                ('B', 'A', {'weight': 1.0}), ('C', 'A', {'weight': 0.8}),
                                ('C', 'D', {'weight': 0.2}), ('D', 'C', {'weight': 1.0})])

In [248]: nx.to_pandas_edgelist(G)
Out[248]:
  source target  weight
0      A      B     0.5
1      A      C     0.5
2      B      A     1.0
3      C      A     0.8
4      C      D     0.2
5      D      C     1.0
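Note that this edge list has 6 rows, not the 12 produced by the other solutions: pairs with weight 0 do not show up as edges, presumably because a zero entry in the adjacency matrix is read as "no edge". If that is the desired result but NetworkX is not available, a pandas-only sketch that filters out zero weights could look like this. The lowercase column names simply mirror the output above.

```python
import pandas as pd

# Sample weighted adjacency matrix from the question.
labels = list("ABCD")
df = pd.DataFrame(
    [[0.0, 0.5, 0.5, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.8, 0.0, 0.0, 0.2],
     [0.0, 0.0, 1.0, 0.0]],
    index=labels, columns=labels,
)

edges = (
    df.stack()
      .rename_axis(["source", "target"])
      .reset_index(name="weight")
)
# Keep only pairs with a non-zero weight; the zero diagonal goes too.
edges = edges[edges["weight"] != 0].reset_index(drop=True)
print(edges)
```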
