Quick Start: Data Processing with MindPandas
Data preprocessing is essential for model training, and good feature engineering can substantially improve training accuracy. This chapter takes the feature engineering of a recommender system as an example to introduce the workflow of processing data with MindPandas.
Setting the MindPandas Execution Mode
MindPandas supports both multithread and multiprocess modes. This example uses the multithread mode (see the MindPandas execution mode introduction and configuration guide for details) and sets the partition shape to 16×3, as shown below:
[1]:
import numpy as np
import mindpandas as pd
import random

# Use the multithread backend and split each DataFrame into a 16 x 3 grid of partitions
pd.set_concurrency_mode("multithread")
pd.set_partition_shape((16, 3))
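If a workload needs more parallelism than a single process's thread pool offers, MindPandas also provides a multiprocess mode. The snippet below is a minimal sketch of switching backends; the exact requirements of the multiprocess mode (for example, any additional cluster setup) should be checked against the execution mode documentation referenced above.
# Sketch: switch MindPandas to the multiprocess backend instead of multithread.
# Verify any extra setup this mode needs in the MindPandas execution mode docs.
pd.set_concurrency_mode("multiprocess")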
Data Generation
Generate two-dimensional data with 10000 rows and 40 columns, containing labels, dense features, and sparse features. The label is a random value of 0 or 1, the dense features are random integers between -10 and 10000, and the sparse features are random character strings.
[2]:
DENSE_NUM = 13
SPARSE_NUM = 26
ROW_NUM = 10000
cat_val, int_val, lab_val = [], [], []

def gen_cat_feature(length):
    # Random uppercase hexadecimal string of exactly `length` characters
    result = hex(random.randint(0, 16 ** length - 1)).replace('0x', '').upper()
    if len(result) < length:
        result = '0' * (length - len(result)) + result
    return str(result)

def gen_int_feature():
    # Random integer between -10 and 10000
    return random.randint(-10, 10000)

def gen_lab_feature():
    # Random 0/1 label
    x = random.randint(0, 1)
    return round(x)

# Sparse (categorical) features: ROW_NUM x SPARSE_NUM random strings
for i in range(ROW_NUM * SPARSE_NUM):
    cat_val.append(gen_cat_feature(8))
np_cat = np.array(cat_val).reshape(ROW_NUM, SPARSE_NUM)
df_cat = pd.DataFrame(np_cat, columns=[f'C{i + 1}' for i in range(SPARSE_NUM)])

# Dense (numeric) features: ROW_NUM x DENSE_NUM random integers
for i in range(ROW_NUM * DENSE_NUM):
    int_val.append(gen_int_feature())
np_int = np.array(int_val).reshape(ROW_NUM, DENSE_NUM)
df_int = pd.DataFrame(np_int, columns=[f'I{i + 1}' for i in range(DENSE_NUM)])

# Labels: ROW_NUM random 0/1 values
for i in range(ROW_NUM):
    lab_val.append(gen_lab_feature())
np_lab = np.array(np_lab_vals := lab_val).reshape(ROW_NUM, 1)
df_lab = pd.DataFrame(np_lab, columns=['label'])
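As an optional sanity check (not part of the original flow), the shapes of the three generated DataFrames can be verified before concatenation; this sketch uses to_pandas(), which is also used later in this tutorial.
# Expected shapes: (10000, 26) sparse, (10000, 13) dense, (10000, 1) labels
print(df_cat.to_pandas().shape, df_int.to_pandas().shape, df_lab.to_pandas().shape)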
Data Preprocessing
Concatenate the labels, dense features, and sparse features into the dataset to be processed. The result is shown below:
[3]:
df = pd.concat([df_lab, df_int, df_cat], axis=1)
df.to_pandas().head(5)
[3]:
| | label | I1 | I2 | I3 | I4 | I5 | I6 | I7 | I8 | I9 | ... | C17 | C18 | C19 | C20 | C21 | C22 | C23 | C24 | C25 | C26 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
0 | 0 | 153 | 4326 | 4239 | 3998 | 4394 | 8434 | 8463 | 7862 | 9993 | ... | 938379C6 | 9878C0E2 | A75A4A8C | D9F9E0F2 | 173E6F23 | 004968BA | E66F6B9F | 287A48D1 | AC62D5CE | A723AB7F |
1 | 1 | 1962 | 6771 | 372 | 1754 | 7408 | 9176 | 6414 | 751 | 7680 | ... | 1613C18C | CE911717 | 8B35FF3E | 585C6D76 | 5A4EF600 | 3FA13F3A | 1B8B88AD | C232D96E | CD630ACA | AB435A6A |
2 | 1 | 8665 | 1485 | 3321 | 5368 | 2658 | 6317 | 2848 | 2780 | 2522 | ... | 193587B6 | 17AC3A54 | 025D3F81 | 5E2D04CB | D28747FF | D6A6A51A | C4E08EE7 | C520A45C | B8CB53F1 | 3933626E |
3 | 1 | 7794 | 5804 | 9079 | 4813 | 1912 | 4740 | 212 | 373 | 620 | ... | 8C816BC2 | F5AA01BE | 08CBECA8 | DC884327 | 9F95F1D4 | 9C389A00 | 7CFFC865 | DC9203DB | 86DC5DC2 | EFFF0EAC |
4 | 0 | 3331 | 4672 | 9741 | 6430 | 4610 | 8867 | 9055 | 3170 | 7955 | ... | E18EF1EB | 0905B30C | 1A584C44 | BAC91CC4 | 8DAAC9B4 | 7298201D | 73A30ED7 | 9560AB20 | 6B452601 | D7754942 |
5 rows × 40 columns
Feature Engineering
Get the maximum and minimum of each dense-feature column, in preparation for the normalization step that follows.
[4]:
max_dict, min_dict = {}, {}
# Per-column maximum and minimum of the dense features
for i, j in enumerate(df_int.max()):
    max_dict[f'I{i + 1}'] = j
for i, j in enumerate(df_int.min()):
    min_dict[f'I{i + 1}'] = j
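The same lookup tables can be written more compactly as dictionary comprehensions; this equivalent sketch relies only on iterating over the results of df_int.max() and df_int.min(), exactly as the loop above does.
# Equivalent one-liners for the per-column extrema used during normalization
max_dict = {f'I{i + 1}': v for i, v in enumerate(df_int.max())}
min_dict = {f'I{i + 1}': v for i, v in enumerate(df_int.min())}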
Take columns 2 through 40 of df, i.e. all feature columns.
[5]:
features = df.iloc[:, 1:40]
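Equivalently, the same block can be selected by column labels instead of positions. This is a sketch that assumes label-based loc slicing behaves as it does in pandas; if in doubt, keep the positional iloc form above.
# Label-based equivalent of df.iloc[:, 1:40]: every column except 'label'
features = df.loc[:, 'I1':'C26']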
Apply a custom function to the "label" column of df to convert each value into a NumPy array, and write the result back to the "label" column of df.
[6]:
def get_label(x):
    return np.array([x])

df['label'] = df['label'].apply(get_label)
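An optional check (assuming the object-dtype column survives the to_pandas() conversion unchanged): each label entry should now be a length-1 NumPy array.
# Each label should now be a numpy array of shape (1,)
first_label = df['label'].to_pandas().iloc[0]
print(type(first_label), first_label.shape)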
Apply a custom function to features: normalize the dense features and fill the remaining positions with 1, then write the result to a new "weight" column of df.
[7]:
def get_weight(x):
    ret = []
    for index, val in enumerate(x):
        if index < DENSE_NUM:
            # Min-max normalization for dense features
            col = f'I{index + 1}'
            ret.append((val - min_dict[col]) / (max_dict[col] - min_dict[col]))
        else:
            # Sparse features get a constant weight of 1
            ret.append(1)
    return ret

feat_weight = features.apply(get_weight, axis=1)
df['weight'] = feat_weight
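Optionally, the result can be spot-checked: each weight vector should hold 13 normalized dense values followed by 26 ones, 39 entries in total.
# Each weight vector: 13 normalized values + 26 constant ones = 39 entries
sample_weight = df['weight'].to_pandas().iloc[0]
print(len(sample_weight), sample_weight[:3], sample_weight[-3:])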
Apply a custom function to features: use the column index for the dense features and the hash of the value for the other (sparse) features, then write the result to a new "id" column of df.
[8]:
def get_id(x):
    ret = []
    for index, val in enumerate(x):
        if index < DENSE_NUM:
            # Dense features are identified by their (1-based) column index
            ret.append(index + 1)
        else:
            # Sparse features are identified by the hash of their string value
            ret.append(hash(val))
    return ret

feat_id = features.apply(get_id, axis=1)
df['id'] = feat_id
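Note that Python's built-in hash() for strings is salted per interpreter process (see PYTHONHASHSEED), so the sparse-feature ids above change between runs. If reproducible ids matter, a deterministic digest can be substituted; the helper below is an illustrative alternative using the standard library, not part of the original tutorial.
import hashlib

def stable_hash(val):
    # Deterministic 64-bit id derived from an MD5 digest of the string value
    return int(hashlib.md5(str(val).encode('utf-8')).hexdigest()[:16], 16)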
Dataset Splitting
Add an "is_training" column to df that marks the first 70% of the rows as training data and the rest as non-training data. The result is shown below:
[9]:
m_train_len = int(len(df) * 0.7)
df['is_training'] = [1] * m_train_len + [0] * (len(df) - m_train_len)
df = df[['id', 'weight', 'label', 'is_training']]
df.to_pandas()
[9]:
| | id | weight | label | is_training |
| --- | --- | --- | --- | --- |
0 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 89... | [0.016285343191127986, 0.4332400559664201, 0.4... | [0] | 1 |
1 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 70... | [0.19702267958837047, 0.6775934439336398, 0.03... | [1] | 1 |
2 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -5... | [0.8667199520431611, 0.14931041375174894, 0.33... | [1] | 1 |
3 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 40... | [0.7796982715556, 0.5809514291425145, 0.907992... | [1] | 1 |
4 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 64... | [0.3337995803776601, 0.467819308414951, 0.9741... | [0] | 1 |
... | ... | ... | ... | ... |
9995 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 87... | [0.8151663502847437, 0.962722366580052, 0.5130... | [1] | 0 |
9996 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 47... | [0.6402237985812769, 0.9683190085948431, 0.948... | [1] | 0 |
9997 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -7... | [0.9435508042761515, 0.9097541475114931, 0.313... | [0] | 0 |
9998 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 36... | [0.6173443900489559, 0.41225264841095344, 0.92... | [1] | 0 |
9999 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 53... | [0.869017883904486, 0.8232060763541875, 0.5049... | [0] | 0 |
10000 rows × 4 columns
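With the is_training flag in place, the training and evaluation subsets can be materialized by boolean filtering; a short sketch, assuming boolean-mask selection behaves as in pandas:
# Split the processed data by the is_training flag
train_df = df[df['is_training'] == 1]
eval_df = df[df['is_training'] == 0]
print(len(train_df), len(eval_df))  # expected: 7000 3000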
At this point, data generation, data preprocessing, and feature engineering are complete, and the processed data can be passed to a model for training.
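One possible way to hand the data to a training pipeline (an illustrative sketch, not prescribed by this tutorial) is to convert the list-valued columns back into NumPy arrays via to_pandas(), after which they can be wrapped by whatever dataset interface the training framework expects.
# Illustrative conversion of the processed columns to NumPy arrays
pdf = df.to_pandas()
ids = np.stack(pdf['id'].tolist())          # shape: (10000, 39)
weights = np.stack(pdf['weight'].tolist())  # shape: (10000, 39)
labels = np.stack(pdf['label'].tolist())    # shape: (10000, 1)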