Quick Start: MindPandas Data Processing
Data preprocessing is vital for model training, and good feature engineering can significantly improve training accuracy. This tutorial uses feature engineering for a recommender system as an example to introduce the procedure for processing data with MindPandas.
Setting MindPandas Execution Mode
MindPandas supports two execution modes: multithread mode and multiprocess mode. This example uses multithread mode and sets the partition shape to 16×3, as shown below:
[1]:
import random

import numpy as np
import mindpandas as pd

# Run MindPandas in multithread mode and split each DataFrame into 16x3 partitions.
pd.set_concurrency_mode("multithread")
pd.set_partition_shape((16, 3))
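For larger workloads, the same two configuration calls can switch the script to multiprocess mode instead. A minimal sketch, using only the functions shown above:

# Sketch: the same setup in multiprocess mode instead of multithread mode.
pd.set_concurrency_mode("multiprocess")
pd.set_partition_shape((16, 3))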
Data Generation
A two-dimensional dataset with 10,000 rows and 40 columns is generated, consisting of a label, dense features, and sparse features. The label is a random value of 0 or 1, the dense features are random integers in the range [-10, 10000], and the sparse features are random hexadecimal strings.
[2]:
DENSE_NUM = 13
SPARSE_NUM = 26
ROW_NUM = 10000
cat_val, int_val, lab_val = [], [], []

def gen_cat_feature(length):
    # Generate a random fixed-length hexadecimal string as a sparse feature,
    # left-padded with zeros so it always has exactly `length` characters.
    result = hex(random.randint(0, 16 ** length - 1)).replace('0x', '').upper()
    return result.zfill(length)

def gen_int_feature():
    # Generate a random integer in [-10, 10000] as a dense feature.
    return random.randint(-10, 10000)

def gen_lab_feature():
    # Generate a random 0/1 label.
    return random.randint(0, 1)

for i in range(ROW_NUM * SPARSE_NUM):
    cat_val.append(gen_cat_feature(8))
np_cat = np.array(cat_val).reshape(ROW_NUM, SPARSE_NUM)
df_cat = pd.DataFrame(np_cat, columns=[f'C{i + 1}' for i in range(SPARSE_NUM)])

for i in range(ROW_NUM * DENSE_NUM):
    int_val.append(gen_int_feature())
np_int = np.array(int_val).reshape(ROW_NUM, DENSE_NUM)
df_int = pd.DataFrame(np_int, columns=[f'I{i + 1}' for i in range(DENSE_NUM)])

for i in range(ROW_NUM):
    lab_val.append(gen_lab_feature())
np_lab = np.array(lab_val).reshape(ROW_NUM, 1)
df_lab = pd.DataFrame(np_lab, columns=['label'])
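As an optional sanity check before moving on, the dimensions of the three frames can be confirmed; this is a sketch, assuming the MindPandas DataFrame exposes a pandas-style shape attribute:

# Optional check: expect (10000, 26), (10000, 13) and (10000, 1).
print(df_cat.shape, df_int.shape, df_lab.shape)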
Data Preprocessing
The label, dense features, and sparse features are concatenated to form the dataset to be processed. The results are shown as follows:
[3]:
df = pd.concat([df_lab, df_int, df_cat], axis=1)
df.to_pandas().head(5)
[3]:
|  | label | I1 | I2 | I3 | I4 | I5 | I6 | I7 | I8 | I9 | ... | C17 | C18 | C19 | C20 | C21 | C22 | C23 | C24 | C25 | C26 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 5795 | 7051 | 8277 | 785 | 9305 | 7521 | 5206 | 6240 | 172 | ... | A5AE1E6D | 25A100C3 | C6B8E0A4 | A94F6B56 | B27D726B | EB9F3C73 | D98D17B2 | 793AB315 | 8C12657F | AFCEEBFF |
| 1 | 0 | 6968 | 8389 | 4352 | 3312 | 4021 | 5087 | 2254 | 4249 | 4411 | ... | EEAC1040 | BDC711B9 | 16269D1B | D59EA7BB | 460218D4 | F89E137C | F488ED52 | C1DDB598 | AE9C21C9 | 11D47A2A |
| 2 | 1 | 1144 | 9327 | 9399 | 7745 | 8144 | 7189 | 1663 | 1005 | 6421 | ... | 54EE530F | 68D2F7EF | EFD65C79 | B2F2CCF5 | 86E02110 | 31617C19 | 44A2DFA4 | 032C30D1 | C8098BAD | CE4DD8BB |
| 3 | 1 | 6214 | 3183 | 9229 | 938 | 9160 | 2783 | 2680 | 4775 | 4436 | ... | 639D80AA | 3A14B884 | 9FC92B4F | 67DB3280 | 1EE1FC45 | CE19F4C1 | F34CC6FD | C3C9F66C | CA1B3F85 | F184D01E |
| 4 | 1 | 3220 | 3235 | 2243 | 50 | 5074 | 6328 | 6894 | 6838 | 3063 | ... | 7671D909 | 126B3F69 | 1262514D | 25C18137 | 2BA958DE | D6CE7BE3 | 18D4EEE1 | 315D0FFB | 7C25DB1D | 6E4ABFB1 |
5 rows × 40 columns
Feature Engineering
Get the maximum and minimum values of each column of the dense data in preparation for subsequent normalization.
[4]:
max_dict, min_dict = {}, {}
# Record the per-column max/min of the dense features for later normalization.
for i, j in enumerate(df_int.max()):
    max_dict[f'I{i + 1}'] = j
for i, j in enumerate(df_int.min()):
    min_dict[f'I{i + 1}'] = j
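Since df_int already carries the column names 'I1' to 'I13', the same dictionaries could be built in one step, assuming MindPandas mirrors the pandas Series.to_dict() API (an assumption, not verified here):

# Hypothetical one-step equivalent, if Series.to_dict() is supported:
max_dict = df_int.max().to_dict()
min_dict = df_int.min().to_dict()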
Select columns 2 to 40 of df (the dense and sparse feature columns, excluding the label) and copy them into a new DataFrame named features.
[5]:
features = df.iloc[:, 1:40]
Apply a custom function to the “label” column of df to convert each numeric value into a NumPy array, and write the result back to the “label” column.
[6]:
def get_label(x):
    # Wrap the scalar label in a NumPy array.
    return np.array([x])

df['label'] = df['label'].apply(get_label)
Apply a custom function to features: normalize the dense features, fill the sparse-feature positions with 1, and store the result in a new “weight” column of df.
[7]:
def get_weight(x):
    ret = []
    for index, val in enumerate(x):
        if index < DENSE_NUM:
            # Min-max normalize each dense feature using the recorded bounds.
            col = f'I{index + 1}'
            ret.append((val - min_dict[col]) / (max_dict[col] - min_dict[col]))
        else:
            # Sparse features get a constant weight of 1.
            ret.append(1)
    return ret

feat_weight = features.apply(get_weight, axis=1)
df['weight'] = feat_weight
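By construction, the first DENSE_NUM entries of every weight vector are min-max normalized and should fall in [0, 1]. A quick spot check on the first row, reusing the to_pandas() conversion shown earlier:

# Spot-check: dense weights of the first row lie in [0, 1].
first_weight = df.to_pandas()['weight'].iloc[0]
assert all(0.0 <= w <= 1.0 for w in first_weight[:DENSE_NUM])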
Apply a custom function to features: dense features are replaced by their 1-based column index, sparse features by the hash of their value, and the result is stored in a new “id” column of df.
[8]:
def get_id(x):
    ret = []
    for index, val in enumerate(x):
        if index < DENSE_NUM:
            # Dense features use their 1-based column index as the id.
            ret.append(index + 1)
        else:
            # Sparse features use the hash of the string value.
            ret.append(hash(val))
    return ret

feat_id = features.apply(get_id, axis=1)
df['id'] = feat_id
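Note that Python’s built-in hash() for strings is salted per interpreter run, so the sparse-feature ids above are not reproducible across runs unless PYTHONHASHSEED is fixed. If stable ids are needed, a deterministic digest could be substituted; this is a hypothetical alternative, not part of the original tutorial:

import hashlib

def stable_hash(val):
    # Hypothetical replacement for hash(): a deterministic 64-bit id
    # taken from the first 16 hex digits of the string's MD5 digest.
    return int(hashlib.md5(val.encode('utf-8')).hexdigest()[:16], 16)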
Splitting the Dataset
Add an “is_training” column: the first 70% of the rows are marked as training data (1) and the remaining rows as non-training data (0). The results are shown below:
[9]:
m_train_len = int(len(df) * 0.7)
# Flag the first 70% of rows as training data, the rest as non-training data.
df['is_training'] = [1] * m_train_len + [0] * (len(df) - m_train_len)
# Keep only the processed columns.
df = df[['id', 'weight', 'label', 'is_training']]
df.to_pandas()
[9]:
|  | id | weight | label | is_training |
|---|---|---|---|---|
| 0 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 31... | [0.5799200799200799, 0.705335731414868, 0.8280... | [0] | 1 |
| 1 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3... | [0.6971028971028971, 0.8390287769784173, 0.435... | [0] | 1 |
| 2 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 71... | [0.11528471528471529, 0.9327537969624301, 0.94... | [1] | 1 |
| 3 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 38... | [0.6217782217782217, 0.3188449240607514, 0.923... | [1] | 1 |
| 4 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3... | [0.3226773226773227, 0.3240407673860911, 0.225... | [1] | 1 |
| ... | ... | ... | ... | ... |
| 9995 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6... | [0.09270729270729271, 0.3959832134292566, 0.03... | [0] | 0 |
| 9996 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 12... | [0.5147852147852148, 0.48810951239008793, 0.46... | [1] | 0 |
| 9997 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -2... | [0.4792207792207792, 0.4045763389288569, 0.514... | [1] | 0 |
| 9998 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6... | [0.550949050949051, 0.1035171862509992, 0.2167... | [0] | 0 |
| 9999 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -4... | [0.9004995004995004, 0.9000799360511591, 0.826... | [0] | 0 |
10000 rows × 4 columns
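With the is_training flag in place, the two subsets can later be recovered with ordinary boolean indexing; a sketch, assuming MindPandas supports pandas-style boolean masks:

# Sketch: recover the training and evaluation subsets from the flag.
train_data = df[df['is_training'] == 1]
eval_data = df[df['is_training'] == 0]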
At this point, data generation, preprocessing, and feature engineering are complete. The processed data can now be passed to the model for training.