Error when trying to train Deep Generative Models of Graphs (DGMG)

Hi all,

I’m trying to use the dgl-lifesci example on this page to train a model for molecule generation on my own data:

pytorch version: 1.9.0
rdkit version: 2018.09.3.0

Using the code below

%run train.py -d none -o random -tf smiles.csv -vf val.csv

leads to the following error message:

Prepare logging directory...
Created directory ./training_results/none_random_2021-09-20_14-23-53
Saved settings to ./training_results/none_random_2021-09-20_14-23-53/settings.txt
Configure for new dataset none...
Processing smiles 1/96
---------------------------------------------------------------------------
ArgumentError                             Traceback (most recent call last)
~/dgllife/train.py in <module>
    178 
    179     args = parser.parse_args()
--> 180     args = setup(args, train=True)
    181 
    182     if args['num_processes'] == 1:

~/dgllife/utils.py in setup(args, train)
    183 
    184     if train:
--> 185         setup_dataset(args)
    186         args['checkpoint_dir'] = os.path.join(log_dir, 'checkpoint.pth')
    187         pprint(args)

~/dgllife/utils.py in setup_dataset(args)
    159     else:
    160         print('Configure for new dataset {}...'.format(args['dataset']))
--> 161         configure_new_dataset(args['dataset'], args['train_file'], args['val_file'])
    162 
    163 def setup(args, train=True):

~/dgllife/utils.py in configure_new_dataset(dataset, train_file, val_file)
    670     path_to_atom_and_bond_types = '_'.join([dataset, 'atom_and_bond_types.pkl'])
    671     if not os.path.exists(path_to_atom_and_bond_types):
--> 672         atom_types, bond_types = get_atom_and_bond_types(all_smiles)
    673         with open(path_to_atom_and_bond_types, 'wb') as f:
    674             pickle.dump({'atom_types': atom_types, 'bond_types': bond_types}, f)

~/dgllife/utils.py in get_atom_and_bond_types(smiles, log)
    454             print('Processing smiles {:d}/{:d}'.format(i + 1, n_smiles))
    455 
--> 456         mol = smiles_to_standard_mol(s)
    457         if mol is None:
    458             continue

~/dgllife/utils.py in smiles_to_standard_mol(s)
    411     """
    412     mol = Chem.MolFromSmiles(s)
--> 413     return standardize_mol(mol)
    414 
    415 def mol_to_standard_smile(mol):

~/dgllife/utils.py in standardize_mol(mol)
    393     """
    394     reactions = initialize_neuralization_reactions()
--> 395     Chem.Kekulize(mol, clearAromaticFlags=True)
    396     mol = neutralize_charges(mol, reactions)
    397     return mol

ArgumentError: Python argument types in
    rdkit.Chem.rdmolops.Kekulize(NoneType)
did not match C++ signature:
    Kekulize(RDKit::ROMol {lvalue} mol, bool clearAromaticFlags=False)

Am I doing something wrong or leaving out some important information?
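The `ArgumentError` indicates that `Chem.MolFromSmiles` returned `None` for one of the inputs (an unparseable SMILES string), so `Chem.Kekulize` received a `NoneType` instead of a molecule. A minimal sketch of a loader that skips a header row and blank lines before the strings ever reach RDKit (`load_smiles` is a hypothetical helper for illustration, not part of dgl-lifesci, which expects one SMILES per line with no header):

```python
def load_smiles(text):
    """Return SMILES strings from raw file text, one per line.

    Hypothetical helper: drops blank lines and a leading header row
    such as 'smiles', which RDKit cannot parse as a molecule.
    """
    rows = [line.strip() for line in text.splitlines() if line.strip()]
    # Drop a header row if the first entry is the literal column name.
    if rows and rows[0].lower() in {"smiles", "smile"}:
        rows = rows[1:]
    return rows

# Example: a file whose first row is the header word "smiles".
print(load_smiles("smiles\nCCO\nc1ccccc1\n"))  # ['CCO', 'c1ccccc1']
```

Guarding the input this way (or checking `mol is None` right after `MolFromSmiles`) turns the opaque C++ signature error into a skipped row.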

The problem was in my SMILES file: the header word ‘smiles’ was in the first row, and RDKit cannot parse that as a molecule. After removing it I was able to re-run the code, but now I get a new error:

Prepare logging directory...
Created directory ./training_results/Tox21_random_2021-09-20_17-45-38
Saved settings to ./training_results/Tox21_random_2021-09-20_17-45-38/settings.txt
Configure for new dataset Tox21...
Processing smiles 1/103
Processing smiles 2/103
...
Processing smiles 103/103
Processing 1/94
Processing 2/94
...
Processing 94/94
Processing 1/9
Processing 2/9
Processing 3/9
Processing 4/9
Processing 5/9
Processing 6/9
Processing 7/9
Processing 8/9
Processing 9/9
{'batch_size': 1,
 'checkpoint_dir': './training_results/Tox21_random_2021-09-20_17-45-38/checkpoint.pth',
 'dataset': 'Tox21',
 'dropout': 0.2,
 'log_dir': './training_results/Tox21_random_2021-09-20_17-45-38',
 'lr': 0.0001,
 'master_ip': '127.0.0.1',
 'master_port': '12345',
 'nepochs': 400,
 'node_hidden_size': 128,
 'num_processes': 32,
 'num_propagation_rounds': 2,
 'order': 'random',
 'seed': 0,
 'train_file': 'smiles.csv',
 'val_file': 'val.csv',
 'warmup_epochs': 10}
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/deepchem/Cleaned_up/train.py in <module>
    187         for rank in range(args['num_processes']):
    188             procs.append(mp.Process(target=launch_a_process, args=(rank, args, main), daemon=True))
--> 189             procs[-1].start()
    190         for p in procs:
    191             p.join()

~/opt/anaconda3/envs/deepchem/lib/python3.7/multiprocessing/process.py in start(self)
    110                'daemonic processes are not allowed to have children'
    111         _cleanup()
--> 112         self._popen = self._Popen(self)
    113         self._sentinel = self._popen.sentinel
    114         # Avoid a refcycle if the target function holds an indirect

~/opt/anaconda3/envs/deepchem/lib/python3.7/multiprocessing/context.py in _Popen(process_obj)
    282         def _Popen(process_obj):
    283             from .popen_spawn_posix import Popen
--> 284             return Popen(process_obj)
    285 
    286     class ForkServerProcess(process.BaseProcess):

~/opt/anaconda3/envs/deepchem/lib/python3.7/multiprocessing/popen_spawn_posix.py in __init__(self, process_obj)
     30     def __init__(self, process_obj):
     31         self._fds = []
---> 32         super().__init__(process_obj)
     33 
     34     def duplicate_for_child(self, fd):

~/opt/anaconda3/envs/deepchem/lib/python3.7/multiprocessing/popen_fork.py in __init__(self, process_obj)
     18         self.returncode = None
     19         self.finalizer = None
---> 20         self._launch(process_obj)
     21 
     22     def duplicate_for_child(self, fd):

~/opt/anaconda3/envs/deepchem/lib/python3.7/multiprocessing/popen_spawn_posix.py in _launch(self, process_obj)
     40         tracker_fd = semaphore_tracker.getfd()
     41         self._fds.append(tracker_fd)
---> 42         prep_data = spawn.get_preparation_data(process_obj._name)
     43         fp = io.BytesIO()
     44         set_spawning_popen(self)

~/opt/anaconda3/envs/deepchem/lib/python3.7/multiprocessing/spawn.py in get_preparation_data(name)
    170     # or through direct execution (or to leave it alone entirely)
    171     main_module = sys.modules['__main__']
--> 172     main_mod_name = getattr(main_module.__spec__, "name", None)
    173     if main_mod_name is not None:
    174         d['init_main_from_name'] = main_mod_name

AttributeError: module '__main__' has no attribute '__spec__'

See if this thread helps.
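The `AttributeError: module '__main__' has no attribute '__spec__'` comes from launching spawn-based multiprocessing under IPython's `%run`, which does not set `__main__.__spec__` the way a normal `python train.py …` invocation does. Running the script directly from a terminal avoids it entirely; alternatively, a commonly cited workaround (a sketch, assuming you want to stay inside an IPython session) is to define the attribute before the worker processes start:

```python
import sys

# multiprocessing's spawn start method inspects __main__.__spec__ when
# preparing child processes. Under IPython's %run the attribute may be
# missing entirely; defining it as None restores the behaviour of a
# plain script launch. (Workaround sketch, not part of dgl-lifesci.)
main_module = sys.modules["__main__"]
if not hasattr(main_module, "__spec__"):
    main_module.__spec__ = None
```

With the attribute present, `spawn.get_preparation_data` falls back to its script-execution path instead of raising.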


Thank you.

I no longer get that error when I run the code in my terminal, but when the code runs it seems as though my dataset has no samples:

Process SpawnProcess-13:
Process SpawnProcess-9:
Traceback (most recent call last):
  File "/Users/abc/opt/anaconda3/envs/deepchem/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Users/abc/opt/anaconda3/envs/deepchem/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/abc/Downloads/dgl-lifesci-master/examples/generative_models/dgmg/utils.py", line 229, in launch_a_process
    target(rank, args)
  File "/Users/abc/Downloads/dgl-lifesci-master/examples/generative_models/dgmg/train.py", line 56, in main
    shuffle=True, collate_fn=dataset.collate)
  File "/Users/abc/opt/anaconda3/envs/deepchem/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 270, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
  File "/Users/abc/opt/anaconda3/envs/deepchem/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 103, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0

(Each spawned process prints the same traceback.)

The only file generated in my training_results folder is a settings txt file:

seed 0
warmup_epochs 10
dataset None
order random
train_file smiles.csv
val_file val.csv
log_dir ./training_results/None_random_2021-09-22_09-06-15
num_processes 32
master_ip 127.0.0.1
master_port 12345
node_hidden_size 128
num_propagation_rounds 2
lr 0.0001
dropout 0.2000
nepochs 400
batch_size 1

Can you try stepping into the code with a debugging tool like ipdb? You will need to disable multiprocessing for it. If your dataset is very large, you can use a small subset of it, say 1000 SMILES strings, to verify the data loading process.
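Before stepping through with a debugger, one quick check is whether the train/val files actually yield a positive number of samples, since an empty (or entirely filtered-out) dataset is exactly what produces `num_samples=0` in the `RandomSampler`. A minimal sketch with a hypothetical `count_nonempty_lines` helper and a throwaway file standing in for `smiles.csv`:

```python
import os
import tempfile

def count_nonempty_lines(path):
    """Count non-blank lines in a file.

    Hypothetical helper, not part of dgl-lifesci; a lower bound on how
    many samples the dataset can contain before RDKit filtering.
    """
    with open(path) as f:
        return sum(1 for line in f if line.strip())

# Demo on a temporary file standing in for smiles.csv / val.csv.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("CCO\nc1ccccc1\n")
    tmp_path = f.name

n = count_nonempty_lines(tmp_path)
os.remove(tmp_path)
assert n > 0, "an empty file would give num_samples=0 in the DataLoader"
print(n)  # 2
```

If the raw line count is positive but the dataset length is still zero, the samples are being dropped during preprocessing (e.g. every `MolFromSmiles` call returning `None`), which is where a debugger session without multiprocessing pays off.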
