Quick Start: MindPandas Data Processing

Data preprocessing is vital for model training, and good feature engineering can significantly improve training accuracy. This tutorial uses feature engineering for a recommender system as an example to show how to process data with MindPandas.

Setting MindPandas Execution Mode

MindPandas supports two execution modes: multithread mode and multiprocess mode. This example uses multithread mode and sets the partition shape to 16*3, as shown below:

[1]:
import numpy as np
import mindpandas as pd
import random

pd.set_concurrency_mode("multithread")
pd.set_partition_shape((16, 3))
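
To give a feel for what a 16*3 partition shape means, the sketch below splits the index ranges of a 10,000-row, 40-column table into 16 row blocks and 3 column blocks in plain Python. It assumes that the first element of the shape counts row partitions and the second counts column partitions, and that blocks are made as even as possible. The helper `split_ranges` is hypothetical, and the way MindPandas partitions data internally may differ.

```python
def split_ranges(total, parts):
    """Split range(total) into `parts` contiguous, near-equal (start, stop) ranges."""
    base, extra = divmod(total, parts)
    ranges, start = [], 0
    for i in range(parts):
        # The first `extra` blocks take one additional element each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

row_blocks = split_ranges(10000, 16)  # assumed: 16 partitions along the rows
col_blocks = split_ranges(40, 3)      # assumed: 3 partitions along the columns
blocks = [(r, c) for r in row_blocks for c in col_blocks]
print(len(blocks), row_blocks[0], col_blocks)
```

Under these assumptions every block covers 625 rows and 13 or 14 columns, so each worker would handle a similar amount of data.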

Data Generation

A two-dimensional dataset of 10,000 rows and 40 columns is generated, consisting of a label, dense features, and sparse features. The label is a random value of either 0 or 1, the dense features are random integers in the range -10 to 10000, and the sparse features are random 8-character hexadecimal strings.

[2]:
DENSE_NUM = 13
SPARSE_NUM = 26
ROW_NUM = 10000
cat_val, int_val, lab_val = [], [], []

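def gen_cat_feature(length):
    # randint includes both endpoints, so stop at 16 ** length - 1
    # to guarantee the value fits in `length` hexadecimal digits.
    result = hex(random.randint(0, 16 ** length - 1)).replace('0x', '').upper()
    if len(result) < length:
        result = '0' * (length - len(result)) + result
    return str(result)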

def gen_int_feature():
    return random.randint(-10, 10000)

def gen_lab_feature():
    # randint already returns the integer 0 or 1, so no rounding is needed
    return random.randint(0, 1)

for i in range(ROW_NUM * SPARSE_NUM):
    cat_val.append(gen_cat_feature(8))
np_cat = np.array(cat_val).reshape(ROW_NUM, SPARSE_NUM)
df_cat = pd.DataFrame(np_cat, columns=[f'C{i + 1}' for i in range(SPARSE_NUM)])

for i in range(ROW_NUM * DENSE_NUM):
    int_val.append(gen_int_feature())
np_int = np.array(int_val).reshape(ROW_NUM, DENSE_NUM)
df_int = pd.DataFrame(np_int, columns=[f'I{i + 1}' for i in range(DENSE_NUM)])

for i in range(ROW_NUM):
    lab_val.append(gen_lab_feature())
np_lab = np.array(lab_val).reshape(ROW_NUM, 1)
df_lab = pd.DataFrame(np_lab, columns=['label'])
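
As a more compact alternative for generating the sparse features, the following sketch draws random bits and formats them as zero-padded uppercase hexadecimal. The function name `gen_hex_feature` is illustrative and not part of the tutorial. The block uses only the Python standard library.

```python
import random

def gen_hex_feature(length):
    """Return a random uppercase hex string with exactly `length` characters."""
    value = random.getrandbits(4 * length)  # each hex digit encodes 4 bits
    return format(value, f'0{length}X')     # left-pad with zeros to `length` digits

random.seed(0)  # optional; only makes the sample repeatable
sample = [gen_hex_feature(8) for _ in range(5)]
print(sample)
```

Because the number of random bits is tied to the string length, the result can never exceed `length` characters, and the format specifier handles the zero padding.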

Data Preprocessing

The label, dense features, and sparse features are concatenated column-wise to form the dataset to be processed. The first five rows are shown below:

[3]:
df = pd.concat([df_lab, df_int, df_cat], axis=1)
df.to_pandas().head(5)
[3]:
label I1 I2 I3 I4 I5 I6 I7 I8 I9 ... C17 C18 C19 C20 C21 C22 C23 C24 C25 C26
0 0 5795 7051 8277 785 9305 7521 5206 6240 172 ... A5AE1E6D 25A100C3 C6B8E0A4 A94F6B56 B27D726B EB9F3C73 D98D17B2 793AB315 8C12657F AFCEEBFF
1 0 6968 8389 4352 3312 4021 5087 2254 4249 4411 ... EEAC1040 BDC711B9 16269D1B D59EA7BB 460218D4 F89E137C F488ED52 C1DDB598 AE9C21C9 11D47A2A
2 1 1144 9327 9399 7745 8144 7189 1663 1005 6421 ... 54EE530F 68D2F7EF EFD65C79 B2F2CCF5 86E02110 31617C19 44A2DFA4 032C30D1 C8098BAD CE4DD8BB
3 1 6214 3183 9229 938 9160 2783 2680 4775 4436 ... 639D80AA 3A14B884 9FC92B4F 67DB3280 1EE1FC45 CE19F4C1 F34CC6FD C3C9F66C CA1B3F85 F184D01E
4 1 3220 3235 2243 50 5074 6328 6894 6838 3063 ... 7671D909 126B3F69 1262514D 25C18137 2BA958DE D6CE7BE3 18D4EEE1 315D0FFB 7C25DB1D 6E4ABFB1

5 rows × 40 columns

Feature Engineering

  1. Get the maximum and minimum values of each column of the dense data, to prepare for subsequent normalization.

[4]:
max_dict, min_dict = {}, {}
for i, j in enumerate(df_int.max()):
    max_dict[f'I{i + 1}'] = j

for i, j in enumerate(df_int.min()):
    min_dict[f'I{i + 1}'] = j
  2. Select columns 2 to 40 of df and copy them into a new dataframe named features.

[5]:
features = df.iloc[:, 1:40]
  3. Apply a custom function to the “label” column of df that wraps each numeric value in a NumPy array, and write the result back to the “label” column.

[6]:
def get_label(x):
    return np.array([x])
df['label'] = df['label'].apply(get_label)
  4. Apply a custom function to each row of features: normalize the dense values with min-max scaling, set the weight of every sparse value to 1, and store the result in the “weight” column of df.

[7]:
def get_weight(x):
    ret = []
    for index, val in enumerate(x):
        if index < DENSE_NUM:
            col = f'I{index + 1}'
            ret.append((val - min_dict[col]) / (max_dict[col] - min_dict[col]))
        else:
            ret.append(1)
    return ret
feat_weight = features.apply(get_weight, axis=1)
df['weight'] = feat_weight
  5. Apply a custom function to each row of features: use the 1-based column index as the ID of each dense feature and the hash value of each sparse feature as its ID, then store the result in the “id” column of df.

[8]:
def get_id(x):
    ret = []
    for index, val in enumerate(x):
        if index < DENSE_NUM:
            ret.append(index + 1)
        else:
            ret.append(hash(val))
    return ret
feat_id = features.apply(get_id, axis=1)
df['id'] = feat_id
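
The weight and ID steps above can be traced on a single row in plain Python. The sketch below is illustrative only: the row values and the per-column minimums and maximums are assumed, `DENSE_NUM` is reduced to 3, and the helpers `min_max` and `stable_id` are hypothetical names. Note that Python's built-in `hash()` for strings is salted per interpreter process unless `PYTHONHASHSEED` is fixed, so the IDs from the step above can differ between runs. The sketch swaps in `zlib.crc32` as one deterministic alternative, which is not what the tutorial itself uses and, being only 32 bits wide, can produce collisions.

```python
import zlib

DENSE_NUM = 3  # small value for illustration; the tutorial uses 13

def min_max(val, lo, hi):
    """Scale val into [0, 1]; return 0.0 for a constant column to avoid dividing by zero."""
    return (val - lo) / (hi - lo) if hi != lo else 0.0

def stable_id(text):
    """Deterministic 32-bit ID for a string (swapped in for the built-in hash())."""
    return zlib.crc32(text.encode('utf-8'))

row = [5795, 42, 10000, 'A5AE1E6D', '25A100C3']  # 3 dense values, 2 sparse values
mins = [-10, 42, -10]       # assumed per-column minimums of the dense data
maxs = [10000, 42, 10000]   # assumed per-column maximums of the dense data

# Dense positions are normalized, sparse positions get weight 1.
weights = [min_max(v, mins[i], maxs[i]) if i < DENSE_NUM else 1
           for i, v in enumerate(row)]
# Dense positions get their 1-based index, sparse positions a stable hash.
ids = [i + 1 if i < DENSE_NUM else stable_id(v) for i, v in enumerate(row)]
print(weights)
print(ids)
```

The guard in `min_max` matters when a dense column is constant, because the maximum then equals the minimum and the plain formula would divide by zero.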

Splitting Dataset

Add an “is_training” column. The first 70% of the rows are marked as training data (1) and the remaining rows as non-training data (0). The results are shown below:

[9]:
m_train_len = int(len(df) * 0.7)
df['is_training'] = [1] * m_train_len + [0] * (len(df) - m_train_len)
df = df[['id', 'weight', 'label', 'is_training']]
df.to_pandas()
[9]:
id weight label is_training
0 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 31... [0.5799200799200799, 0.705335731414868, 0.8280... [0] 1
1 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3... [0.6971028971028971, 0.8390287769784173, 0.435... [0] 1
2 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 71... [0.11528471528471529, 0.9327537969624301, 0.94... [1] 1
3 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 38... [0.6217782217782217, 0.3188449240607514, 0.923... [1] 1
4 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3... [0.3226773226773227, 0.3240407673860911, 0.225... [1] 1
... ... ... ... ...
9995 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6... [0.09270729270729271, 0.3959832134292566, 0.03... [0] 0
9996 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 12... [0.5147852147852148, 0.48810951239008793, 0.46... [1] 0
9997 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -2... [0.4792207792207792, 0.4045763389288569, 0.514... [1] 0
9998 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6... [0.550949050949051, 0.1035171862509992, 0.2167... [0] 0
9999 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -4... [0.9004995004995004, 0.9000799360511591, 0.826... [0] 0

10000 rows × 4 columns
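
The positional 70/30 split can also be sketched without a dataframe. The helper `training_flags` below is a hypothetical name, and it uses integer arithmetic for the split point instead of multiplying by 0.7, which sidesteps floating-point truncation for other row counts or percentages.

```python
def training_flags(n_rows, train_percent=70):
    """Return a list with 1 for the leading training rows and 0 for the rest."""
    n_train = n_rows * train_percent // 100  # integer arithmetic, no float rounding
    return [1] * n_train + [0] * (n_rows - n_train)

flags = training_flags(10)
print(flags)  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

Because the split is positional, it relies on the rows already being in random order, as they are in this generated dataset.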

At this point, data generation, preprocessing, and feature engineering are complete. The processed data can be passed to the model for training.