Buffers cleared during .backward()

I’m running into an issue where the backprop fails with this message:

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

The odd thing is that the first few loop iterations work, and if I run this locally instead of on the cluster, it runs fine as well. Has anyone run into something similar?

There is only a single .backward() call in my code.
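
From what I’ve read, this error can occur even with a single backward() call if a tensor created in one iteration is reused in the next, so that the second backward pass reaches into the first iteration’s (already freed) graph. A minimal standalone sketch of that failure mode (hypothetical, not my actual code):

import torch

w = torch.randn(3, requires_grad=True)
state = torch.ones(3)  # carried across iterations

for step in range(2):
    state = state * w  # on the 2nd pass, 'state' still holds the 1st pass's graph
    loss = state.sum()
    loss.backward()    # 2nd pass raises the RuntimeError above

# Detaching the carried tensor each iteration, e.g. state = state.detach() * w,
# avoids backpropagating through the old graph.

But I don’t see where my code would be holding on to an old tensor like that.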

Cluster GPU: Tesla K80
Local GPU: GTX 1660

I’ve narrowed it down to this troublesome function, but I’m having trouble understanding why:

def pool_values(g_input, g_connector, g_out, x):
    # Input: input graph, connector graph, output graph, node features x
    # Output: pooled node values of g_out
    # Description: pools a DGL graph by propagating the input nodes'
    # features through g_connector to the output nodes

    n_input_nodes = len(g_input)
    n_output_nodes = len(g_out)

    # Propagate data
    in_nodes = list(range(n_input_nodes))
    out_nodes = list(range(n_input_nodes, n_input_nodes + n_output_nodes))
    g_connector.nodes[in_nodes].data['h'] = x

    g_connector.update_all(gcn_msg, gcn_reduce)

    x = g_connector.nodes[out_nodes].data['h']
    return x
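
(gcn_msg and gcn_reduce are defined elsewhere in my code; for this post, assume the standard message/reduce pair from the DGL GCN tutorial:)

import dgl.function as fn

# Copy each source node's 'h' feature onto its out-edges as message 'm',
# then sum the incoming messages into 'h' on each destination node
gcn_msg = fn.copy_src(src='h', out='m')
gcn_reduce = fn.sum(msg='m', out='h')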

Can you try adding

g_input = g_input.local_var()
g_connector = g_connector.local_var()
g_out = g_out.local_var()

before n_input_nodes = len(g_input)? This gives you local views of the graphs, so any node or edge features written inside the function stay local instead of mutating the shared graph objects. Without this, the features stored on g_connector in one iteration can keep that iteration’s autograd graph alive, and the next call to backward() ends up backpropagating through it a second time.
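
Applied to your function, the fix would look like this (the same function with the three local_var() lines added at the top):

def pool_values(g_input, g_connector, g_out, x):
    # Work on local views so feature writes don't leak into the shared graphs
    g_input = g_input.local_var()
    g_connector = g_connector.local_var()
    g_out = g_out.local_var()

    n_input_nodes = len(g_input)
    n_output_nodes = len(g_out)

    in_nodes = list(range(n_input_nodes))
    out_nodes = list(range(n_input_nodes, n_input_nodes + n_output_nodes))
    g_connector.nodes[in_nodes].data['h'] = x

    g_connector.update_all(gcn_msg, gcn_reduce)

    x = g_connector.nodes[out_nodes].data['h']
    return x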


That worked beautifully, thank you!