CSVDataset returns flattened feature tensor

Chaves2021 · January 9, 2024, 3:16am

Hi guys.
I’m reading my data from a CSV file with CSVDataset, and I’m using a custom parser for a categorical edge and node feature named type, based on the tutorial 4.6 Loading data from CSV files. The problem is that the parser returns a dict, and the feature in it is converted into a 1D torch.tensor.
I tried returning a NumPy array already in 2D format, but the function expects a dict. How can I parse the categorical feature so that it ends up with a shape like (N, in_size)?
Thanks in advance.

Note: here is the parser code, in case it helps:

from enum import Enum

import numpy as np
import pandas as pd

class ParseCategoricalFeature:
  def __call__(self, df: pd.DataFrame):
    EnumType = Enum('EnumType', ['ADDRESS','ARGV','BLOCK','FILE','IATTR','LINK','MMAPED_FILE','PATH','PIPE','PROCESS_MEMORY','SHM','SOCKET','TASK','XATTR','ACCEPT',
                                 'ACCEPT_SOCKET','ARG','BIND','CLONE','CLONE_MEM','CONNECT','EXEC','EXEC_TASK','FILE_LOCK','FILE_RCV','FREE','GETATTR','GETXATTR','GETXATTR_INODE',
                                 'LISTXATTR','MEMORY_READ','MEMORY_WRITE','MMAP','MMAP_EXEC','MMAP_READ','MMAP_WRITE','MUNMAP','NAMED','OPEN','PERM_APPEND','PERM_EXEC','PERM_READ',
                                 'PERM_WRITE','READ','READ_IOCTL','READ_LINK','RECEIVE','RECEIVE_MSG','RECEIVE_UNIX','SEND','SEND_MSG','SEND_UNIX','SETATTR','SETATTR_INODE','SETUID',
                                 'SETXATTR','SETXATTR_INODE','SH_ATTACH_READ','SH_ATTACH_WRITE','SH_CREATE_READ','SH_CREATE_WRITE','SHMDT','SH_READ','SH_WRITE','SOCKET_CREATE','SOCKET_PAIR_CREATE',
                                 'TERMINATE_PROC','TERMINATE_TASK','UNLINK','VERSION_ACTIVITY','VERSION_ENTITY','WRITE','WRITE_IOCTL'], start=0)
    parsed = {}
    for header in df:
      dt = df[header].to_numpy().squeeze()
      if header == 'type':
        list_type = []
        for e in dt:
          list_type.append((EnumType[str(e).upper()].value) * 1.0)
        dt = np.array(list_type)
      parsed[header] = dt
    return parsed

Chaves2021 · January 9, 2024, 3:11pm

The solution was simpler than I thought: reshape the encoded array to (N, 1) before storing it in the dict. The code I used:

class ParseCategoricalFeature:
  def __call__(self, df: pd.DataFrame):
    EnumType = Enum('EnumType', ['ADDRESS','ARGV','BLOCK','FILE','IATTR','LINK','MMAPED_FILE','PATH','PIPE','PROCESS_MEMORY','SHM','SOCKET','TASK','XATTR','ACCEPT',
                                 'ACCEPT_SOCKET','ARG','BIND','CLONE','CLONE_MEM','CONNECT','EXEC','EXEC_TASK','FILE_LOCK','FILE_RCV','FREE','GETATTR','GETXATTR','GETXATTR_INODE',
                                 'LISTXATTR','MEMORY_READ','MEMORY_WRITE','MMAP','MMAP_EXEC','MMAP_READ','MMAP_WRITE','MUNMAP','NAMED','OPEN','PERM_APPEND','PERM_EXEC','PERM_READ',
                                 'PERM_WRITE','READ','READ_IOCTL','READ_LINK','RECEIVE','RECEIVE_MSG','RECEIVE_UNIX','SEND','SEND_MSG','SEND_UNIX','SETATTR','SETATTR_INODE','SETUID',
                                 'SETXATTR','SETXATTR_INODE','SH_ATTACH_READ','SH_ATTACH_WRITE','SH_CREATE_READ','SH_CREATE_WRITE','SHMDT','SH_READ','SH_WRITE','SOCKET_CREATE','SOCKET_PAIR_CREATE',
                                 'TERMINATE_PROC','TERMINATE_TASK','UNLINK','VERSION_ACTIVITY','VERSION_ENTITY','WRITE','WRITE_IOCTL'], start=0)
    parsed = {}
    for header in df:
      dt = df[header].to_numpy().squeeze()
      if header == 'type':
        list_type = []
        for e in dt:
          list_type.append((EnumType[str(e).upper()].value) * 1.0)
        dt = np.array(list_type)
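        # Reshape to (N, 1) so the feature is stored as a 2D tensor.
        dt = np.reshape(dt, (dt.shape[0], 1))
      # Assign dt, so columns other than 'type' do not hit an undefined dt_new.
      parsed[header] = dt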
    return parsed
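
For anyone landing here later, the reshape step can be tried on its own with a minimal sketch like the one below. It assumes a shortened, hypothetical category list (CATEGORIES) and helper names (encode_type, one_hot_type) that are not part of the parser above. The one-hot variant is not what the thread used; it is only one possible way to get a width larger than 1 if a shape like (N, in_size) is really needed.

```python
import numpy as np

# Hypothetical, shortened category list for illustration only;
# the parser above uses the full EnumType list.
CATEGORIES = ['ADDRESS', 'ARGV', 'BLOCK', 'FILE']
INDEX = {name: i for i, name in enumerate(CATEGORIES)}

def encode_type(values):
    """Map category names to float indices and return an (N, 1) column."""
    idx = np.array([INDEX[str(v).upper()] for v in values], dtype=np.float64)
    return idx.reshape(-1, 1)

def one_hot_type(values):
    """Alternative (not from the thread): one-hot rows of shape (N, len(CATEGORIES))."""
    idx = np.array([INDEX[str(v).upper()] for v in values])
    return np.eye(len(CATEGORIES), dtype=np.float32)[idx]

col = encode_type(['file', 'argv', 'block'])
hot = one_hot_type(['file', 'argv', 'block'])
print(col.shape)  # (3, 1)
print(hot.shape)  # (3, 4)
```

Either array could be stored as parsed['type'] in the parser's returned dict in place of the flat array.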
