Why do I get a segmentation fault (SIGSEGV) when calling a forward pass?

Hi there,

I have been getting a segmentation violation (SIGSEGV) when calling a forward pass in my graph convolutional neural network, for which I am using torch version 1.5.0.

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

The error occurs after I have created an instance of my graph convolutional neural network, when I call its forward method for the first time and pass in the data.

I have been adapting the code from the DGL implementation of the Relational Graph Convolutional Network for link prediction.

My LinkPredictHetero class is similar to the DGL class LinkPredict.

This is part of my script:

# create model
model: LinkPredictHetero = LinkPredictHetero(
    g=deephet.graph,
    h_dim=h_dim,
    out_dim=h_out,
    num_hidden_layers=n_hidden_layers,
)
    
# call forward pass
embed = model(deephet.graph)

The graph is a heterogeneous graph created with dgl.heterograph.
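
For reference, this is roughly how such a graph can be built (a minimal sketch with made-up node and edge types, not my actual schema):

import dgl

# one relation between two node types, given as a list of (src, dst) pairs
# (the edge-list format of DGL 0.4.x); for each node type, the IDs must be
# consecutive integers starting from 0
graph = dgl.heterograph({
    ('drug', 'treats', 'disease'): [(0, 0), (1, 0), (2, 1)]
})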

The forward() method of the model calls the forward method of the base layer class, which is similar to the DGL BaseRGCN forward method and is defined as follows:

def forward(self, graph):
    graph = graph.local_var()
    # apply the per-node-type embedding layers (the ModuleDict described below)
    hs = {ntype: self.embed_layer[ntype](graph.nodes[ntype].data['feature'])
          for ntype in graph.ntypes}
    for layer in self.layers:
        hs = layer(graph, inputs=hs)
    return hs

Related to this forward method, self.embed_layer is a torch.nn.ModuleDict():

self.embed_layer = torch.nn.ModuleDict()
for node_type in self.graph.ntypes:
    in_size = self.graph.nodes[node_type].data['feature'].shape[1]
    self.embed_layer[node_type] = torch.nn.Linear(in_size, self.embed_size)

and self.layers is a torch.nn.ModuleList() to which different layer implementations are appended.
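
For context, the list is populated roughly like this (a sketch; the layer classes, dimensions, and activation shown here are placeholders, not my literal code):

import torch
import torch.nn.functional as F

self.layers = torch.nn.ModuleList()
# input-to-hidden layer
self.layers.append(GraphConvHetero(self.embed_size, self.h_dim, activation=F.relu))
# hidden-to-output layer
self.layers.append(GraphConvHetero(self.h_dim, self.out_dim))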

So when I call model(deephet.graph) in the script, which should call the forward() method shown above, the segmentation violation ends my program and also terminates Python.

Could you please help me find out why this happens? I would really appreciate it.
Thank you, dear community.


Could you provide a complete code snippet so that we can try reproducing your issue?


Hi sopkri,
Here is an example of implementing RGCN link prediction using the heterogeneous graph APIs: https://github.com/classicsong/dgl/tree/rgcn-link/examples/pytorch/rgcn-hetero. But I have not had time to tune the performance.


Yes, sure I can. I have adapted DGL's implementation of GraphConv:

# Sample script for testing segmentation violation

import torch as th
from torch import nn
from torch.nn import init

import dgl.function as fn
from dgl.base import DGLError


# pylint: disable=W0235
class GraphConvHetero(nn.Module):
    r"""Apply graph convolution over an input signal.
    Graph convolution is introduced in `GCN <https://arxiv.org/abs/1609.02907>`__
    and can be described as below:
    .. math::
      h_i^{(l+1)} = \sigma(b^{(l)} + \sum_{j\in\mathcal{N}(i)}\frac{1}{c_{ij}}h_j^{(l)}W^{(l)})
    where :math:`\mathcal{N}(i)` is the neighbor set of node :math:`i`. :math:`c_{ij}` is equal
    to the product of the square root of node degrees:
    :math:`\sqrt{|\mathcal{N}(i)|}\sqrt{|\mathcal{N}(j)|}`. :math:`\sigma` is an activation
    function.
    The model parameters are initialized as in the
    `original implementation <https://github.com/tkipf/gcn/blob/master/gcn/layers.py>`__ where
    the weight :math:`W^{(l)}` is initialized using Glorot uniform initialization
    and the bias is initialized to be zero.
    Notes
    -----
    Zero in-degree nodes could lead to an invalid normalizer. A common practice
    to avoid this is to add a self-loop for each node in the graph, which
    can be achieved by:
    >>> g = ... # some DGLGraph
    >>> g.add_edges(g.nodes(), g.nodes())
    Parameters
    ----------
    in_feats : int
        Input feature size.
    out_feats : int
        Output feature size.
    norm : str, optional
        How to apply the normalizer. If `'right'`, divide the aggregated messages
        by each node's in-degree, which is equivalent to averaging the received messages.
        If `'none'`, no normalization is applied. Default is `'both'`,
        where the :math:`c_{ij}` in the paper is applied.
    weight : bool, optional
        If True, apply a linear layer. Otherwise, aggregate the messages
        without a weight matrix.
    bias : bool, optional
        If True, adds a learnable bias to the output. Default: ``True``.
    activation: callable activation function/layer or None, optional
        If not None, applies an activation function to the updated node features.
        Default: ``None``.
    Attributes
    ----------
    weight : torch.Tensor
        The learnable weight tensor.
    bias : torch.Tensor
        The learnable bias tensor.
    """

    def __init__(self,
                 in_feats,
                 out_feats,
                 norm='both',
                 weight=True,
                 bias=True,
                 activation=None):
        super(GraphConvHetero, self).__init__()
        if norm not in ('none', 'both', 'right'):
            raise DGLError('Invalid norm value. Must be either "none", "both" or "right".'
                           ' But got "{}".'.format(norm))
        self._in_feats = in_feats
        self._out_feats = out_feats
        self._norm = norm

        if weight:
            self.weight = nn.Parameter(th.Tensor(in_feats, out_feats))
        else:
            self.register_parameter('weight', None)

        if bias:
            self.bias = nn.Parameter(th.Tensor(out_feats))
        else:
            self.register_parameter('bias', None)

        self.reset_parameters()

        self._activation = activation

    def reset_parameters(self):
        """Reinitialize learnable parameters."""
        if self.weight is not None:
            init.xavier_uniform_(self.weight)
        if self.bias is not None:
            init.zeros_(self.bias)

    def forward(self, graph, feat, weight=None):
        r"""Compute graph convolution.
        Notes
        -----
        * Input shape: :math:`(N, *, \text{in_feats})` where * means any number of additional
          dimensions, :math:`N` is the number of nodes.
        * Output shape: :math:`(N, *, \text{out_feats})` where all but the last dimension are
          the same shape as the input.
        * Weight shape: :math:`(\text{in_feats}, \text{out_feats})`.
        Parameters
        ----------
        graph : DGLGraph
            The graph.
        feat : torch.Tensor
            The input feature
        weight : torch.Tensor, optional
            Optional external weight tensor.
        Returns
        -------
        torch.Tensor
            The output feature
        """
        with graph.local_scope():
            [src, dst] = graph.ntypes

            if self._norm == 'both':
                degs = graph.out_degrees().to(feat.device).float().clamp(min=1)
                norm = th.pow(degs, -0.5)
                shp = norm.shape + (1,) * (feat.dim() - 1)
                norm = th.reshape(norm, shp)
                feat = feat * norm

            if weight is not None:
                if self.weight is not None:
                    raise DGLError('External weight is provided while at the same time the'
                                   ' module has defined its own weight parameter. Please'
                                   ' create the module with flag weight=False.')
            else:
                weight = self.weight

            if self._in_feats > self._out_feats:
                # mult W first to reduce the feature size for aggregation.
                if weight is not None:
                    feat = th.matmul(feat, weight)
                graph.nodes[src].data['h'] = feat
                graph.update_all(fn.copy_src(src='h', out='m'),
                                 fn.sum(msg='m', out='h'))
                rst = graph.nodes[dst].data['h']
            else:
                # aggregate first then mult W
                graph.nodes[src].data['h'] = feat

                # SEGMENTATION VIOLATION HAPPENS HERE
                graph.update_all(fn.copy_src(src='h', out='m'),
                                 fn.sum(msg='m', out='h'))

                rst = graph.nodes[dst].data['h']
                if weight is not None:
                    rst = th.matmul(rst, weight)

            if self._norm != 'none':
                degs = graph.in_degrees().to(feat.device).float().clamp(min=1)
                if self._norm == 'both':
                    norm = th.pow(degs, -0.5)
                else:
                    norm = 1.0 / degs
                shp = norm.shape + (1,) * (feat.dim() - 1)
                norm = th.reshape(norm, shp)
                rst = rst * norm

            if self.bias is not None:
                rst = rst + self.bias

            if self._activation is not None:
                rst = self._activation(rst)

            return rst

    def extra_repr(self):
        """Set the extra representation of the module,
        which will come into effect when printing the model.
        """
        summary = 'in={_in_feats}, out={_out_feats}'
        summary += ', normalization={_norm}'
        if '_activation' in self.__dict__:
            summary += ', activation={_activation}'
        return summary.format(**self.__dict__)

It appears that the fn.sum() function is not recognised. The reason I found is that the import dgl.function as fn statement redirects to this path of the GitHub repository, where no sum() function is found.

Maybe also important to mention: the version of DGL I am using is 0.4.3post2.
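
A quick way to check whether the installed DGL actually exposes fn.sum (a sanity-check snippet, separate from my model code):

import dgl.function as fn

# DGL generates its built-in reducers (sum, max, min, ...) dynamically at
# import time, which is why searching the source for a literal "def sum"
# comes up empty
print(fn.sum)                    # the built-in sum reducer
print(fn.sum(msg='m', out='h'))  # a concrete reducer instance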

I tried your code and it seems to be working fine on my machine. fn.sum is defined here.


@mufeili Thank you for testing it! That’s good news, now I just have to figure out why it is not working on my system.

Could you tell me which versions of which packages you used in your environment? The Python version could also be relevant.

Thanks in advance!

@mufeili
I have been digging into the DGL code and printing what is executed to see exactly where the segmentation violation occurs.

At this line in the runtime.py file (which I uncommented), I am getting these print statements:

Feat _z4 = READ_COL(src_frame, "h")
Feat _z7 = COPY_REDUCE(sum, _z2, 0, _z4, 34, _z5, _z6)

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

Given this information, would you be able to tell me why the segmentation violation is happening and how I can resolve it?

@classicsong Thank you for the pointer. I would actually really like to try to use this implementation, but I would need to adapt it a bit for my purpose.

Right now it is really hard to figure out what is happening in the functions, since there is no documentation of the parameters or of their types and sizes. Are you planning to add this to make the code reproducible?

Then I would also be able to use it.

Could you try running the following code?

import torch
import dgl

model = GraphConvHetero(2, 3)
g = dgl.bipartite((torch.tensor([1, 2]), torch.tensor([1, 3])))
node_feats = torch.randn(3, 2)
model(g, node_feats)

I used Python 3.6.10 with PyTorch 1.5.1.

@mufeili Yes, I just did and it did not throw any error.
I printed the output of model(g, node_feats):

tensor([[ 0.0000,  0.0000,  0.0000],
        [ 0.4112,  0.3156, -0.4092],
        [ 0.0000,  0.0000,  0.0000],
        [ 0.5122,  0.5028, -0.5659]], grad_fn=<AddBackward0>)

Could you try reproducing your error with a toy example of a graph and features and share it with me?


@mufeili I invited you to a test repository on GitHub that has the essential code.

You can run the script by cloning the repo and then running:

$ python3 -m pip install -e .

$ python3 -m redrugnn.rgcn_hetero.deployer

Hope this works - as in, you get the same error :smiley:
Segmentation fault: 11

Ok, I see what’s going on here. The code crashes here with graph.out_degrees(). Basically, you have 3 drug-typed nodes with IDs 13, 16, and 20. In DGL, we expect the node IDs to be consecutive integers starting from 0 for each node type. That is, you should use 0, 1, 2 as the IDs of these drug nodes.
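
For anyone hitting the same problem, the fix amounts to relabeling the raw IDs to consecutive integers per node type before constructing the graph. A minimal sketch (raw_drug_ids is illustrative):

# raw drug IDs as they appear in the data
raw_drug_ids = [13, 16, 20]

# map each raw ID to a consecutive ID starting from 0
id_map = {raw: new for new, raw in enumerate(sorted(set(raw_drug_ids)))}
# id_map == {13: 0, 16: 1, 20: 2}

# relabel edge endpoints with this mapping before calling dgl.heterograph
compact_ids = [id_map[i] for i in raw_drug_ids]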


As an update: as of v0.5 (coming soon), DGL will automatically check during graph construction whether the node IDs are valid.


@mufeili I was able to adapt my code and it works now. Thank you so much for the pointer! This was a lot of help! :pray: