Papers100M download failed

I am not able to download and run papers100M, which was not a problem in August. The error is as follows:

cc@exp:~$ python3 dgl/examples/pytorch/graphsage/node_classification_1.py
Training in mixed mode.
Loading data
Traceback (most recent call last):
  File "/usr/lib/python3.8/urllib/request.py", line 1354, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/usr/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1302, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1251, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1011, in _send_output
    self.send(msg)
  File "/usr/lib/python3.8/http/client.py", line 951, in send
    self.connect()
  File "/usr/lib/python3.8/http/client.py", line 922, in connect
    self.sock = self._create_connection(
  File "/usr/lib/python3.8/socket.py", line 787, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/usr/lib/python3.8/socket.py", line 918, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "dgl/examples/pytorch/graphsage/node_classification_1.py", line 135, in <module>
    dataset = AsNodePredDataset(DglNodePropPredDataset('ogbn-papers100M'))
  File "/home/cc/ogb/ogb/nodeproppred/dataset_dgl.py", line 69, in __init__
    self.pre_process()
  File "/home/cc/ogb/ogb/nodeproppred/dataset_dgl.py", line 98, in pre_process
    if decide_download(url):
  File "/home/cc/ogb/ogb/utils/url.py", line 12, in decide_download
    d = ur.urlopen(url)
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 1383, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.8/urllib/request.py", line 1357, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -2] Name or service not known>

I saw similar issues and followed the suggestion there to change the proxy but it did not work for me. Using the mainland proxy gave me ConnectionRefusedError: [Errno 111] Connection refused; using the IP of the virtual machine or my physical IP gave urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>. I am wondering how I can solve this. Thanks in advance!

Were you able to download directly with the following link?
http://snap.stanford.edu/ogb/data/nodeproppred/papers100M-bin.zip
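
The first traceback ends in socket.gaierror: [Errno -2] Name or service not known, which means the hostname lookup failed before any connection was attempted. A minimal sketch for checking name resolution from the affected machine, using only the standard library (the helper name can_resolve is made up for illustration):

```python
import socket


def can_resolve(host, port=80):
    """Return True if `host` resolves to at least one address, else False."""
    try:
        # Same lookup that urllib performs internally before connecting.
        results = socket.getaddrinfo(host, port, 0, socket.SOCK_STREAM)
        return len(results) > 0
    except socket.gaierror:
        # Raised when the name cannot be resolved (e.g. DNS is misconfigured).
        return False


# Example call on the machine that fails to download:
#   can_resolve("snap.stanford.edu")
```

If this returns False for snap.stanford.edu, the problem is likely DNS or proxy configuration on the machine rather than OGB or DGL.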

Sorry for the late reply. The zip file downloaded successfully, but the process ran out of memory (OOM) while saving the DGL objects the first time I ran the script:

Processing graphs...
100%|██████████| 1/1 [00:00<00:00, 9986.44it/s]
Converting graphs into DGL objects...
100%|██████████| 1/1 [00:04<00:00,  4.30s/it]
Saving...
Killed

When I reran it, it showed:

cc@v1:~/exp$ python3 dgl/examples/pytorch/graphsage/node_classification.py
Training in mixed mode.
Loading data
Traceback (most recent call last):
  File "dgl/examples/pytorch/graphsage/node_classification.py", line 128, in <module>
    dataset = AsNodePredDataset(DglNodePropPredDataset('ogbn-papers100M'))
  File "/home/cc/.local/lib/python3.8/site-packages/ogb/nodeproppred/dataset_dgl.py", line 69, in __init__
    self.pre_process()
  File "/home/cc/.local/lib/python3.8/site-packages/ogb/nodeproppred/dataset_dgl.py", line 76, in pre_process
    self.graph, label_dict = load_graphs(pre_processed_file_path)
  File "/home/cc/.local/lib/python3.8/site-packages/dgl/data/graph_serialize.py", line 182, in load_graphs
    return load_graph_v2(filename, idx_list)
  File "/home/cc/.local/lib/python3.8/site-packages/dgl/data/graph_serialize.py", line 194, in load_graph_v2
    return [gdata.get_graph() for gdata in heterograph_list], label_dict
  File "/home/cc/.local/lib/python3.8/site-packages/dgl/data/graph_serialize.py", line 194, in <listcomp>
    return [gdata.get_graph() for gdata in heterograph_list], label_dict
  File "/home/cc/.local/lib/python3.8/site-packages/dgl/data/heterograph_serialize.py", line 54, in get_graph
    ndict = {ntensor[i]: F.zerocopy_from_dgl_ndarray(ntensor[i+1]) for i in range(0, len(ntensor), 2)}
  File "/home/cc/.local/lib/python3.8/site-packages/dgl/data/heterograph_serialize.py", line 54, in <dictcomp>
    ndict = {ntensor[i]: F.zerocopy_from_dgl_ndarray(ntensor[i+1]) for i in range(0, len(ntensor), 2)}
  File "/home/cc/.local/lib/python3.8/site-packages/dgl/backend/pytorch/tensor.py", line 357, in zerocopy_from_dgl_ndarray
    if data.shape == (0,):
  File "/home/cc/.local/lib/python3.8/site-packages/dgl/_ffi/ndarray.py", line 177, in shape
    return tuple(self.handle.contents.shape[i] for i in range(self.handle.contents.ndim))
AttributeError: 'NoneType' object has no attribute 'contents'
Segmentation fault (core dumped)

Specifically, line 128 of the script is modified to dataset = AsNodePredDataset(DglNodePropPredDataset('ogbn-papers100M')).

I tried on my side and did not hit your issue, though it failed somewhere else.
dgl-cu102 0.10a221108

Can I ask what the memory requirement is for preprocessing this dataset? My machine has 126GB and it still ran out of memory.

Mine is 377GB. I don't know the minimum value.
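
For a rough lower bound on the in-memory size, here is a back-of-the-envelope sketch. It assumes the published OGB statistics for ogbn-papers100M (111,059,956 nodes, 1,615,685,872 edges, 128-dimensional node features), float32 features, and int64 edge endpoints. None of these numbers come from this thread. Peak usage during conversion and saving is likely well above this figure, because intermediate copies of the arrays exist at the same time.

```python
# Assumed dataset statistics (from the OGB dataset description, not this thread).
NUM_NODES = 111_059_956
NUM_EDGES = 1_615_685_872
FEAT_DIM = 128

BYTES_FLOAT32 = 4  # assumed feature dtype
BYTES_INT64 = 8    # assumed dtype of edge endpoints


def estimate_bytes(num_nodes=NUM_NODES, num_edges=NUM_EDGES, feat_dim=FEAT_DIM):
    """Lower-bound size of one copy of the features plus one edge list."""
    feature_bytes = num_nodes * feat_dim * BYTES_FLOAT32
    edge_bytes = num_edges * 2 * BYTES_INT64  # source and destination IDs
    return feature_bytes + edge_bytes


total_gib = estimate_bytes() / 2**30
print(f"Approximate lower bound: {total_gib:.1f} GiB")
```

A single copy already takes a large share of a 126GB machine, so one or two temporary copies during serialization could plausibly exhaust it.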

I can probably preprocess the data on another machine with more RAM, scp the processed data over, and then run the training script. Specifically, I am thinking of using from dgl import load_graphs, loading the data with g, label_dict = load_graphs('~/exp/dataset/ogbn_papers100M/processed/dgl_data_processed'), and continuing with the rest of the script. I am not sure which processed file should be passed or whether this will work. Thanks!
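
A minimal sketch of that idea, assuming the directory layout that OGB's DglNodePropPredDataset appears to use (<root>/ogbn_papers100M/processed/dgl_data_processed) and assuming the labels are stored under the key 'labels'. The helper names are made up for illustration. Note that a literal '~' is not expanded by the loader, so the path is expanded explicitly here.

```python
import os


def processed_graph_path(root="dataset", name="ogbn-papers100M"):
    """Build the path of the preprocessed DGL file (assumed OGB layout)."""
    dir_name = "_".join(name.split("-"))  # 'ogbn-papers100M' -> 'ogbn_papers100M'
    return os.path.join(
        os.path.expanduser(root), dir_name, "processed", "dgl_data_processed"
    )


def load_processed(root="dataset", name="ogbn-papers100M"):
    """Load the graph and labels from the preprocessed file."""
    # Imported lazily so the path helper works without DGL installed.
    from dgl import load_graphs

    graphs, label_dict = load_graphs(processed_graph_path(root, name))
    # Assumption: a single graph is stored, with labels under 'labels'.
    return graphs[0], label_dict["labels"]
```

This loads only the graph and labels. The train/validation/test split is stored separately by OGB and is not returned here.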

Or just try running dataset = AsNodePredDataset(DglNodePropPredDataset('ogbn-papers100M')) on a machine with large RAM and save it to disk.

I ran dataset = AsNodePredDataset(DglNodePropPredDataset('ogbn-papers100M')), which produced dataset/ogbn_papers100M with all the processed files in it. I then copied that directory to my current working machine with scp and ran the same script as above, which gave the following:

Training in mixed mode.
Loading data
Traceback (most recent call last):
  File "dgl/examples/pytorch/graphsage/node_classification.py", line 128, in <module>
    dataset = AsNodePredDataset(DglNodePropPredDataset('ogbn-papers100M'))
  File "/home/cc/.local/lib/python3.8/site-packages/dgl/data/adapter.py", line 88, in __init__
    super().__init__(self.dataset.name + '-as-nodepred',
  File "/home/cc/.local/lib/python3.8/site-packages/dgl/data/dgl_dataset.py", line 99, in __init__
    self._load()
  File "/home/cc/.local/lib/python3.8/site-packages/dgl/data/dgl_dataset.py", line 192, in _load
    self.save()
  File "/home/cc/.local/lib/python3.8/site-packages/dgl/data/adapter.py", line 151, in save
    utils.save_graphs(os.path.join(self.save_path, 'graph_{}.bin'.format(self.hash)), [self.g])
  File "/home/cc/.local/lib/python3.8/site-packages/dgl/data/graph_serialize.py", line 130, in save_graphs
    save_heterographs(filename, g_list, labels)
  File "/home/cc/.local/lib/python3.8/site-packages/dgl/data/heterograph_serialize.py", line 29, in save_heterographs
    _CAPI_SaveHeteroGraphData(filename, gdata_list, tensor_dict_to_ndarray_dict(labels))
  File "dgl/_ffi/_cython/./function.pxi", line 293, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 225, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 215, in dgl._ffi._cy3.core.FuncCall3
dgl._ffi.base.DGLError: [04:10:00] /opt/dgl/third_party/dmlc-core/src/io/local_filesys.cc:38: Check failed: std::fwrite(ptr, 1, size, fp_) == size: FileStream.Write incomplete
Stack trace:
  [bt] (0) /home/cc/.local/lib/python3.8/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7fa515209bff]
  [bt] (1) /home/cc/.local/lib/python3.8/site-packages/dgl/libdgl.so(dmlc::io::FileStream::Write(void const*, unsigned long)+0x88) [0x7fa5161a3a58]
  [bt] (2) /home/cc/.local/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::Save(dmlc::Stream*) const+0x20d) [0x7fa51555394d]
  [bt] (3) /home/cc/.local/lib/python3.8/site-packages/dgl/libdgl.so(dgl::UnitGraph::Save(dmlc::Stream*) const+0x17f) [0x7fa5156a72ef]
  [bt] (4) /home/cc/.local/lib/python3.8/site-packages/dgl/libdgl.so(dgl::HeteroGraph::Save(dmlc::Stream*) const+0x12b) [0x7fa51559afdb]
  [bt] (5) /home/cc/.local/lib/python3.8/site-packages/dgl/libdgl.so(dgl::serialize::SaveHeteroGraphs(std::string, dgl::runtime::List<dgl::serialize::HeteroGraphData, void>, std::vector<std::pair<std::string, dgl::runtime::NDArray>, std::allocator<std::pair<std::string, dgl::runtime::NDArray> > > const&)+0x489) [0x7fa515637a29]
  [bt] (6) /home/cc/.local/lib/python3.8/site-packages/dgl/libdgl.so(+0x7c871f) [0x7fa51563871f]
  [bt] (7) /home/cc/.local/lib/python3.8/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7fa515530138]
  [bt] (8) /home/cc/.local/lib/python3.8/site-packages/dgl/_ffi/_cy3/core.cpython-38-x86_64-linux-gnu.so(+0x1633c) [0x7fa4ff99233c]
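
One possible cause of "FileStream.Write incomplete" is that the target filesystem ran out of space while the graph was being written. This is an assumption, since the trace itself does not state the reason. A minimal standard-library sketch for checking free space at the save location (the helper name free_gib is made up for illustration):

```python
import os
import shutil


def free_gib(path):
    """Return free space in GiB on the filesystem that would hold `path`."""
    path = os.path.abspath(os.path.expanduser(path))
    # Walk up to the nearest existing directory, since the target may not exist yet.
    while not os.path.exists(path):
        path = os.path.dirname(path)
    return shutil.disk_usage(path).free / 2**30


# Example call on the directory where the processed graph is written:
#   free_gib("~/.dgl")
```

If the reported value is smaller than the size of the processed graph, pointing the save location at a larger disk is one thing to try.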

Specify save_dir so that it points to the path where you store the processed data.

for reference:

>>> ds2=dgl.data.AsNodePredDataset(dgl.data.RedditDataset(), save_dir='home/ubuntu/.dgl/', verbose=True)
Done loading data from cached files.

I changed the line to dataset = dgl.data.AsNodePredDataset(DglNodePropPredDataset('ogbn-papers100M'), save_dir='~/ogbn_papers100M/', verbose=True), and it prompted me to download the dataset. Can the step of downloading the raw dataset be skipped, since the processed data already exist? FYI, because of the limited disk space on the node, I did not scp the raw directory under dataset/ogbn_papers100M/.

How about using DglNodePropPredDataset directly? AsNodePredDataset is not necessary.

This works! Thanks! Now it fails at the same place as shown in your previous screenshot.

Great. I think that issue can be resolved by casting it to long.