Quick Start: MindPandas Data Processing
Data preprocessing is vital for model training, and good feature engineering can significantly improve training accuracy. This tutorial uses feature engineering for a recommender system as an example to introduce the procedure for processing data with MindPandas.
Setting MindPandas Execution Mode
MindPandas supports two execution modes: multithread mode and multiprocess mode. This example uses multithread mode and sets the partition shape to 16×3, as shown below:
[1]:
import random

import numpy as np
import mindpandas as pd

# Run MindPandas in multithread mode and split each DataFrame into 16x3 partitions.
pd.set_concurrency_mode("multithread")
pd.set_partition_shape((16, 3))
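For larger workloads, the same two configuration calls can switch the script to multiprocess mode instead. A minimal sketch, using only the functions shown above:

# Sketch: the same setup in multiprocess mode instead of multithread mode.
pd.set_concurrency_mode("multiprocess")
pd.set_partition_shape((16, 3))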
Data Generation
A two-dimensional dataset with 10,000 rows and 40 columns is generated, consisting of a label, dense features, and sparse features. The label is a random value of 0 or 1, the dense features are random integers in the range [-10, 10000], and the sparse features are random hexadecimal strings.
[2]:
DENSE_NUM = 13
SPARSE_NUM = 26
ROW_NUM = 10000
cat_val, int_val, lab_val = [], [], []

def gen_cat_feature(length):
    # Generate a random fixed-length hexadecimal string as a sparse feature,
    # left-padded with zeros so it always has exactly `length` characters.
    result = hex(random.randint(0, 16 ** length - 1)).replace('0x', '').upper()
    return result.zfill(length)

def gen_int_feature():
    # Generate a random integer in [-10, 10000] as a dense feature.
    return random.randint(-10, 10000)

def gen_lab_feature():
    # Generate a random 0/1 label.
    return random.randint(0, 1)

for i in range(ROW_NUM * SPARSE_NUM):
    cat_val.append(gen_cat_feature(8))
np_cat = np.array(cat_val).reshape(ROW_NUM, SPARSE_NUM)
df_cat = pd.DataFrame(np_cat, columns=[f'C{i + 1}' for i in range(SPARSE_NUM)])

for i in range(ROW_NUM * DENSE_NUM):
    int_val.append(gen_int_feature())
np_int = np.array(int_val).reshape(ROW_NUM, DENSE_NUM)
df_int = pd.DataFrame(np_int, columns=[f'I{i + 1}' for i in range(DENSE_NUM)])

for i in range(ROW_NUM):
    lab_val.append(gen_lab_feature())
np_lab = np.array(lab_val).reshape(ROW_NUM, 1)
df_lab = pd.DataFrame(np_lab, columns=['label'])
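As an optional sanity check before moving on, the dimensions of the three frames can be confirmed; this is a sketch, assuming the MindPandas DataFrame exposes a pandas-style shape attribute:

# Optional check: expect (10000, 26), (10000, 13) and (10000, 1).
print(df_cat.shape, df_int.shape, df_lab.shape)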
Data Preprocessing
The label, dense features, and sparse features are concatenated to form the dataset to be processed. The results are shown as follows:
[3]:
df = pd.concat([df_lab, df_int, df_cat], axis=1)
df.to_pandas().head(5)
[3]:
|  | label | I1 | I2 | I3 | I4 | I5 | I6 | I7 | I8 | I9 | ... | C17 | C18 | C19 | C20 | C21 | C22 | C23 | C24 | C25 | C26 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 5795 | 7051 | 8277 | 785 | 9305 | 7521 | 5206 | 6240 | 172 | ... | A5AE1E6D | 25A100C3 | C6B8E0A4 | A94F6B56 | B27D726B | EB9F3C73 | D98D17B2 | 793AB315 | 8C12657F | AFCEEBFF |
| 1 | 0 | 6968 | 8389 | 4352 | 3312 | 4021 | 5087 | 2254 | 4249 | 4411 | ... | EEAC1040 | BDC711B9 | 16269D1B | D59EA7BB | 460218D4 | F89E137C | F488ED52 | C1DDB598 | AE9C21C9 | 11D47A2A |
| 2 | 1 | 1144 | 9327 | 9399 | 7745 | 8144 | 7189 | 1663 | 1005 | 6421 | ... | 54EE530F | 68D2F7EF | EFD65C79 | B2F2CCF5 | 86E02110 | 31617C19 | 44A2DFA4 | 032C30D1 | C8098BAD | CE4DD8BB |
| 3 | 1 | 6214 | 3183 | 9229 | 938 | 9160 | 2783 | 2680 | 4775 | 4436 | ... | 639D80AA | 3A14B884 | 9FC92B4F | 67DB3280 | 1EE1FC45 | CE19F4C1 | F34CC6FD | C3C9F66C | CA1B3F85 | F184D01E |
| 4 | 1 | 3220 | 3235 | 2243 | 50 | 5074 | 6328 | 6894 | 6838 | 3063 | ... | 7671D909 | 126B3F69 | 1262514D | 25C18137 | 2BA958DE | D6CE7BE3 | 18D4EEE1 | 315D0FFB | 7C25DB1D | 6E4ABFB1 |
5 rows × 40 columns
Feature Engineering
Get the maximum and minimum values of each column of the dense data in preparation for subsequent normalization.
[4]:
max_dict, min_dict = {}, {}
# Record the per-column max/min of the dense features for later normalization.
for i, j in enumerate(df_int.max()):
    max_dict[f'I{i + 1}'] = j
for i, j in enumerate(df_int.min()):
    min_dict[f'I{i + 1}'] = j
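Since df_int already carries the column names 'I1' to 'I13', the same dictionaries could be built in one step, assuming MindPandas mirrors the pandas Series.to_dict() API (an assumption, not verified here):

# Hypothetical one-step equivalent, if Series.to_dict() is supported:
max_dict = df_int.max().to_dict()
min_dict = df_int.min().to_dict()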
Select columns 2 to 40 of df (the dense and sparse feature columns, excluding the label) and copy them into a new DataFrame named features.
[5]:
features = df.iloc[:, 1:40]
Apply a custom function to the “label” column of df to convert each numeric value into a NumPy array, and write the result back to the “label” column.
[6]:
def get_label(x):
    # Wrap the scalar label in a NumPy array.
    return np.array([x])

df['label'] = df['label'].apply(get_label)
Apply a custom function to features: normalize the dense features, fill the sparse-feature positions with 1, and store the result in a new “weight” column of df.
[7]:
def get_weight(x):
    ret = []
    for index, val in enumerate(x):
        if index < DENSE_NUM:
            # Min-max normalize each dense feature using the recorded bounds.
            col = f'I{index + 1}'
            ret.append((val - min_dict[col]) / (max_dict[col] - min_dict[col]))
        else:
            # Sparse features get a constant weight of 1.
            ret.append(1)
    return ret

feat_weight = features.apply(get_weight, axis=1)
df['weight'] = feat_weight
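By construction, the first DENSE_NUM entries of every weight vector are min-max normalized and should fall in [0, 1]. A quick spot check on the first row, reusing the to_pandas() conversion shown earlier:

# Spot-check: dense weights of the first row lie in [0, 1].
first_weight = df.to_pandas()['weight'].iloc[0]
assert all(0.0 <= w <= 1.0 for w in first_weight[:DENSE_NUM])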
Apply a custom function to features: dense features are replaced by their 1-based column index, sparse features by the hash of their value, and the result is stored in a new “id” column of df.
[8]:
def get_id(x):
    ret = []
    for index, val in enumerate(x):
        if index < DENSE_NUM:
            # Dense features use their 1-based column index as the id.
            ret.append(index + 1)
        else:
            # Sparse features use the hash of the string value.
            ret.append(hash(val))
    return ret

feat_id = features.apply(get_id, axis=1)
df['id'] = feat_id
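Note that Python’s built-in hash() for strings is salted per interpreter run, so the sparse-feature ids above are not reproducible across runs unless PYTHONHASHSEED is fixed. If stable ids are needed, a deterministic digest could be substituted; this is a hypothetical alternative, not part of the original tutorial:

import hashlib

def stable_hash(val):
    # Hypothetical replacement for hash(): a deterministic 64-bit id
    # taken from the first 16 hex digits of the string's MD5 digest.
    return int(hashlib.md5(val.encode('utf-8')).hexdigest()[:16], 16)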
Splitting the Dataset
Add an “is_training” column: the first 70% of the rows are marked as training data (1) and the remaining rows as non-training data (0). The results are shown below:
[9]:
m_train_len = int(len(df) * 0.7)
# Flag the first 70% of rows as training data, the rest as non-training data.
df['is_training'] = [1] * m_train_len + [0] * (len(df) - m_train_len)
# Keep only the processed columns.
df = df[['id', 'weight', 'label', 'is_training']]
df.to_pandas()
[9]:
|  | id | weight | label | is_training |
|---|---|---|---|---|
| 0 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 31... | [0.5799200799200799, 0.705335731414868, 0.8280... | [0] | 1 |
| 1 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3... | [0.6971028971028971, 0.8390287769784173, 0.435... | [0] | 1 |
| 2 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 71... | [0.11528471528471529, 0.9327537969624301, 0.94... | [1] | 1 |
| 3 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 38... | [0.6217782217782217, 0.3188449240607514, 0.923... | [1] | 1 |
| 4 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3... | [0.3226773226773227, 0.3240407673860911, 0.225... | [1] | 1 |
| ... | ... | ... | ... | ... |
| 9995 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6... | [0.09270729270729271, 0.3959832134292566, 0.03... | [0] | 0 |
| 9996 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 12... | [0.5147852147852148, 0.48810951239008793, 0.46... | [1] | 0 |
| 9997 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -2... | [0.4792207792207792, 0.4045763389288569, 0.514... | [1] | 0 |
| 9998 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6... | [0.550949050949051, 0.1035171862509992, 0.2167... | [0] | 0 |
| 9999 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -4... | [0.9004995004995004, 0.9000799360511591, 0.826... | [0] | 0 |
10000 rows × 4 columns
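With the is_training flag in place, the two subsets can later be recovered with ordinary boolean indexing; a sketch, assuming MindPandas supports pandas-style boolean masks:

# Sketch: recover the training and evaluation subsets from the flag.
train_data = df[df['is_training'] == 1]
eval_data = df[df['is_training'] == 0]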
At this point, data generation, preprocessing, and feature engineering are complete. The processed data can now be passed to the model for training.