Quick Start: Data Processing with MindSpore Pandas
Data preprocessing is critical to model training: good feature engineering can substantially improve training accuracy. This section walks through a data processing flow with MindSpore Pandas, using the feature engineering of a recommender system as an example.
Setting the MindSpore Pandas Execution Mode
MindSpore Pandas supports multithread and multiprocess execution modes. This example uses the multithread mode (see the MindSpore Pandas execution mode introduction and configuration guide for details) and sets the partition shape to (16, 3):
[1]:
import numpy as np
import mindpandas as pd
import random
pd.set_concurrency_mode("multithread")
pd.set_partition_shape((16, 3))
Data Generation
Generate a two-dimensional dataset of 10,000 rows and 40 columns, containing a label, dense features, and sparse features. The label is a random value of 0 or 1, each dense feature is a random integer in the range [-10, 10000], and each sparse feature is a random 8-character hexadecimal string.
[2]:
DENSE_NUM = 13
SPARSE_NUM = 26
ROW_NUM = 10000
cat_val, int_val, lab_val = [], [], []
def gen_cat_feature(length):
    # Fixed-width random uppercase hex string; the upper bound is
    # 16 ** length - 1 so the result never exceeds `length` digits
    result = hex(random.randint(0, 16 ** length - 1))[2:].upper()
    return result.zfill(length)

def gen_int_feature():
    return random.randint(-10, 10000)

def gen_lab_feature():
    return random.randint(0, 1)
for i in range(ROW_NUM * SPARSE_NUM):
    cat_val.append(gen_cat_feature(8))
np_cat = np.array(cat_val).reshape(ROW_NUM, SPARSE_NUM)
df_cat = pd.DataFrame(np_cat, columns=[f'C{i + 1}' for i in range(SPARSE_NUM)])

for i in range(ROW_NUM * DENSE_NUM):
    int_val.append(gen_int_feature())
np_int = np.array(int_val).reshape(ROW_NUM, DENSE_NUM)
df_int = pd.DataFrame(np_int, columns=[f'I{i + 1}' for i in range(DENSE_NUM)])

for i in range(ROW_NUM):
    lab_val.append(gen_lab_feature())
np_lab = np.array(lab_val).reshape(ROW_NUM, 1)
df_lab = pd.DataFrame(np_lab, columns=['label'])
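The fixed-width hex helper can be sanity-checked in isolation. This standalone sketch (standard library only, independent of the tutorial's data) mirrors its logic and confirms every generated string has the requested width:

```python
import random

def gen_cat_feature(length):
    # Random integer below 16 ** length, rendered as
    # zero-padded uppercase hex of exactly `length` digits
    return hex(random.randint(0, 16 ** length - 1))[2:].upper().zfill(length)

random.seed(0)
samples = [gen_cat_feature(8) for _ in range(1000)]
assert all(len(s) == 8 for s in samples)
assert all(c in '0123456789ABCDEF' for s in samples for c in s)
```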
Data Preprocessing
Concatenate the label, dense features, and sparse features into the dataset to be processed. The result is shown below:
[3]:
df = pd.concat([df_lab, df_int, df_cat], axis=1)
df.to_pandas().head(5)
[3]:
 | label | I1 | I2 | I3 | I4 | I5 | I6 | I7 | I8 | I9 | ... | C17 | C18 | C19 | C20 | C21 | C22 | C23 | C24 | C25 | C26
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 5795 | 7051 | 8277 | 785 | 9305 | 7521 | 5206 | 6240 | 172 | ... | A5AE1E6D | 25A100C3 | C6B8E0A4 | A94F6B56 | B27D726B | EB9F3C73 | D98D17B2 | 793AB315 | 8C12657F | AFCEEBFF |
1 | 0 | 6968 | 8389 | 4352 | 3312 | 4021 | 5087 | 2254 | 4249 | 4411 | ... | EEAC1040 | BDC711B9 | 16269D1B | D59EA7BB | 460218D4 | F89E137C | F488ED52 | C1DDB598 | AE9C21C9 | 11D47A2A |
2 | 1 | 1144 | 9327 | 9399 | 7745 | 8144 | 7189 | 1663 | 1005 | 6421 | ... | 54EE530F | 68D2F7EF | EFD65C79 | B2F2CCF5 | 86E02110 | 31617C19 | 44A2DFA4 | 032C30D1 | C8098BAD | CE4DD8BB |
3 | 1 | 6214 | 3183 | 9229 | 938 | 9160 | 2783 | 2680 | 4775 | 4436 | ... | 639D80AA | 3A14B884 | 9FC92B4F | 67DB3280 | 1EE1FC45 | CE19F4C1 | F34CC6FD | C3C9F66C | CA1B3F85 | F184D01E |
4 | 1 | 3220 | 3235 | 2243 | 50 | 5074 | 6328 | 6894 | 6838 | 3063 | ... | 7671D909 | 126B3F69 | 1262514D | 25C18137 | 2BA958DE | D6CE7BE3 | 18D4EEE1 | 315D0FFB | 7C25DB1D | 6E4ABFB1 |
5 rows × 40 columns
Feature Engineering
Collect the maximum and minimum of every dense-feature column, in preparation for the normalization that follows.
[4]:
max_dict, min_dict = {}, {}
for i, j in enumerate(df_int.max()):
    max_dict[f'I{i + 1}'] = j
for i, j in enumerate(df_int.min()):
    min_dict[f'I{i + 1}'] = j
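As a minimal illustration with standard pandas (toy columns and values, chosen here only for the example), the same per-column extrema can be collected with a dict comprehension:

```python
import pandas as pd

# Toy dense columns (hypothetical values, for illustration only)
df_int = pd.DataFrame({'I1': [3, -10, 7], 'I2': [100, 0, 50]})

max_dict = {col: df_int[col].max() for col in df_int.columns}
min_dict = {col: df_int[col].min() for col in df_int.columns}

assert max_dict == {'I1': 7, 'I2': 100}
assert min_dict == {'I1': -10, 'I2': 0}
```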
Slice out columns 2 through 40 of df, i.e. every column except the label.
[5]:
features = df.iloc[:, 1:40]
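For reference, `iloc[:, 1:40]` is purely positional: it keeps the columns at index 1 through 39, dropping the leading label column. A small standard-pandas sketch (toy columns, for illustration only):

```python
import pandas as pd

df = pd.DataFrame([[0, 11, 22, 'AB']], columns=['label', 'I1', 'I2', 'C1'])
features = df.iloc[:, 1:]  # drop the first (label) column by position
assert list(features.columns) == ['I1', 'I2', 'C1']
assert features.shape == (1, 3)
```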
Apply a custom function to the "label" column of df, converting each value into a NumPy array, and write the result back to the "label" column.
[6]:
def get_label(x):
return np.array([x])
df['label'] = df['label'].apply(get_label)
Apply a custom function to features: normalize the dense values and fill the remaining (sparse) positions with 1, then store the result in a new "weight" column of df.
[7]:
def get_weight(x):
    ret = []
    for index, val in enumerate(x):
        if index < DENSE_NUM:
            col = f'I{index + 1}'
            ret.append((val - min_dict[col]) / (max_dict[col] - min_dict[col]))
        else:
            ret.append(1)
    return ret

feat_weight = features.apply(get_weight, axis=1)
df['weight'] = feat_weight
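The normalization applied to each dense value is plain min-max scaling, (val - min) / (max - min), which maps a column's observed range onto [0, 1]. A tiny standalone check using the tutorial's dense-feature bounds:

```python
def min_max(val, lo, hi):
    # Maps lo -> 0.0 and hi -> 1.0
    return (val - lo) / (hi - lo)

assert min_max(-10, -10, 10000) == 0.0
assert min_max(10000, -10, 10000) == 1.0
assert min_max(4995, -10, 10000) == 0.5
```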
Apply a custom function to features: use the (1-based) column index as the id for the dense features and the hash of the value for the sparse features, then store the result in a new "id" column of df.
[8]:
def get_id(x):
    ret = []
    for index, val in enumerate(x):
        if index < DENSE_NUM:
            ret.append(index + 1)
        else:
            ret.append(hash(val))
    return ret

feat_id = features.apply(get_id, axis=1)
df['id'] = feat_id
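Note that Python's built-in `hash()` on strings is randomized per interpreter process (controlled by PYTHONHASHSEED), so the sparse-feature ids above are not reproducible across runs. If stable ids are needed, one hedged alternative, not part of the original tutorial, is to derive them from a cryptographic digest (the bucket size 2 ** 31 here is an arbitrary choice):

```python
import hashlib

def stable_id(value, buckets=2 ** 31):
    # Deterministic across processes, unlike the built-in hash()
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % buckets

assert stable_id('A5AE1E6D') == stable_id('A5AE1E6D')
assert 0 <= stable_id('A5AE1E6D') < 2 ** 31
```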
Dataset Splitting
Add an "is_training" column to df, marking the first 70% of the rows as training data and the rest as non-training data. The result is shown below:
[9]:
m_train_len = int(len(df) * 0.7)
df['is_training'] = [1] * m_train_len + [0] * (len(df) - m_train_len)
df = df[['id', 'weight', 'label', 'is_training']]
df.to_pandas()
[9]:
 | id | weight | label | is_training
---|---|---|---|---
0 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 31... | [0.5799200799200799, 0.705335731414868, 0.8280... | [0] | 1 |
1 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3... | [0.6971028971028971, 0.8390287769784173, 0.435... | [0] | 1 |
2 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 71... | [0.11528471528471529, 0.9327537969624301, 0.94... | [1] | 1 |
3 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 38... | [0.6217782217782217, 0.3188449240607514, 0.923... | [1] | 1 |
4 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3... | [0.3226773226773227, 0.3240407673860911, 0.225... | [1] | 1 |
... | ... | ... | ... | ... |
9995 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6... | [0.09270729270729271, 0.3959832134292566, 0.03... | [0] | 0 |
9996 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 12... | [0.5147852147852148, 0.48810951239008793, 0.46... | [1] | 0 |
9997 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -2... | [0.4792207792207792, 0.4045763389288569, 0.514... | [1] | 0 |
9998 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6... | [0.550949050949051, 0.1035171862509992, 0.2167... | [0] | 0 |
9999 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -4... | [0.9004995004995004, 0.9000799360511591, 0.826... | [0] | 0 |
10000 rows × 4 columns
At this point, data generation, data preprocessing, and feature engineering are complete, and the processed data can be passed to a model for training.
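To feed such per-row list columns into a model, one common final step is to stack them into dense arrays. The sketch below uses standard pandas and NumPy with a toy frame shaped like the processed df above (the values are illustrative, and the exact input format ultimately depends on your training framework):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the processed df (illustrative values only)
df = pd.DataFrame({
    'id': [[1, 2, 3], [1, 2, 3]],
    'weight': [[0.5, 1, 1], [0.2, 1, 1]],
    'label': [[0], [1]],
    'is_training': [1, 0],
})

train = df[df['is_training'] == 1]
ids = np.stack(train['id'].to_numpy())          # shape (n_train, 3)
weights = np.stack(train['weight'].to_numpy())  # shape (n_train, 3)
labels = np.stack(train['label'].to_numpy())    # shape (n_train, 1)
assert ids.shape == (1, 3) and labels.shape == (1, 1)
```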