{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true, "pycharm": { "name": "#%% md\n" } }, "source": [ "# 快速入门:MindSpore Pandas数据处理\n", "\n", "[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.0/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r2.0/docs/mindpandas/docs/source_zh_cn/mindpandas_quick_start.ipynb)\n", "\n", "数据预处理对于模型训练非常重要,好的特征工程可以大幅度提升训练精度。本章节以推荐系统的特征工程为例,介绍使用MindSpore Pandas处理数据的流程。\n", "\n", "## MindSpore Pandas执行模式设置\n", "\n", "MindSpore Pandas支持多线程与多进程模式,本示例使用多线程模式,更多详见[MindSpore Pandas执行模式介绍及配置说明](https://www.mindspore.cn/mindpandas/docs/zh-CN/r0.2/mindpandas_configuration.html),并设置切片维度为16*3,示例如下:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "is_executing": true, "name": "#%%\n" } }, "outputs": [], "source": [ "import numpy as np\n", "import mindpandas as pd\n", "import random\n", "\n", "pd.set_concurrency_mode(\"multithread\")\n", "pd.set_partition_shape((16, 3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 数据生成\n", "\n", "生成10000行、40列的二维数据,包含标签、稠密特征以及稀疏特征等信息。标签是值为“0”或“1“的随机数、稠密特征是取值范围为(-10, 10000)的随机数、稀疏特征为随机字符串。" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "DENSE_NUM = 13\n", "SPARSE_NUM = 26\n", "ROW_NUM = 10000\n", "cat_val, int_val, lab_val = [], [], []\n", "\n", "def gen_cat_feature(length):\n", " result = hex(random.randint(0, 16 ** length)).replace('0x', '').upper()\n", " if len(result) < length:\n", " result = '0' * (length - len(result)) + result\n", " return str(result)\n", "\n", "def gen_int_feature():\n", " return random.randint(-10, 10000)\n", "\n", "def gen_lab_feature():\n", " x = random.randint(0, 1)\n", " return round(x)\n", "\n", "for i in range(ROW_NUM * SPARSE_NUM):\n", " cat_val.append(gen_cat_feature(8))\n", "np_cat = np.array(cat_val).reshape(ROW_NUM, SPARSE_NUM)\n", "df_cat = pd.DataFrame(np_cat, columns=[f'C{i + 1}' for i in 
range(SPARSE_NUM)])\n", "\n", "for i in range(ROW_NUM * DENSE_NUM):\n", " int_val.append(gen_int_feature())\n", "np_int = np.array(int_val).reshape(ROW_NUM, DENSE_NUM)\n", "df_int = pd.DataFrame(np_int, columns=[f'I{i + 1}' for i in range(DENSE_NUM)])\n", "\n", "for i in range(ROW_NUM):\n", " lab_val.append(gen_lab_feature())\n", "np_lab = np.array(lab_val).reshape(ROW_NUM, 1)\n", "df_lab = pd.DataFrame(np_lab, columns=['label'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 数据预处理\n", "\n", "将标签、稠密特征、稀疏特征等拼接为待处理的数据集,结果如下所示:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelI1I2I3I4I5I6I7I8I9...C17C18C19C20C21C22C23C24C25C26
005795705182777859305752152066240172...A5AE1E6D25A100C3C6B8E0A4A94F6B56B27D726BEB9F3C73D98D17B2793AB3158C12657FAFCEEBFF
10696883894352331240215087225442494411...EEAC1040BDC711B916269D1BD59EA7BB460218D4F89E137CF488ED52C1DDB598AE9C21C911D47A2A
21114493279399774581447189166310056421...54EE530F68D2F7EFEFD65C79B2F2CCF586E0211031617C1944A2DFA4032C30D1C8098BADCE4DD8BB
3162143183922993891602783268047754436...639D80AA3A14B8849FC92B4F67DB32801EE1FC45CE19F4C1F34CC6FDC3C9F66CCA1B3F85F184D01E
413220323522435050746328689468383063...7671D909126B3F691262514D25C181372BA958DED6CE7BE318D4EEE1315D0FFB7C25DB1D6E4ABFB1
\n", "

5 rows × 40 columns

\n", "
" ], "text/plain": [ " label I1 I2 I3 I4 I5 I6 I7 I8 I9 ... C17 \\\n", "0 0 5795 7051 8277 785 9305 7521 5206 6240 172 ... A5AE1E6D \n", "1 0 6968 8389 4352 3312 4021 5087 2254 4249 4411 ... EEAC1040 \n", "2 1 1144 9327 9399 7745 8144 7189 1663 1005 6421 ... 54EE530F \n", "3 1 6214 3183 9229 938 9160 2783 2680 4775 4436 ... 639D80AA \n", "4 1 3220 3235 2243 50 5074 6328 6894 6838 3063 ... 7671D909 \n", "\n", " C18 C19 C20 C21 C22 C23 C24 \\\n", "0 25A100C3 C6B8E0A4 A94F6B56 B27D726B EB9F3C73 D98D17B2 793AB315 \n", "1 BDC711B9 16269D1B D59EA7BB 460218D4 F89E137C F488ED52 C1DDB598 \n", "2 68D2F7EF EFD65C79 B2F2CCF5 86E02110 31617C19 44A2DFA4 032C30D1 \n", "3 3A14B884 9FC92B4F 67DB3280 1EE1FC45 CE19F4C1 F34CC6FD C3C9F66C \n", "4 126B3F69 1262514D 25C18137 2BA958DE D6CE7BE3 18D4EEE1 315D0FFB \n", "\n", " C25 C26 \n", "0 8C12657F AFCEEBFF \n", "1 AE9C21C9 11D47A2A \n", "2 C8098BAD CE4DD8BB \n", "3 CA1B3F85 F184D01E \n", "4 7C25DB1D 6E4ABFB1 \n", "\n", "[5 rows x 40 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.concat([df_lab, df_int, df_cat], axis=1)\n", "df.to_pandas().head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 特征工程\n", "\n", "1. 获取稠密数据每一列的最大值与最小值,为后续的归一化做准备。" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "max_dict, min_dict = {}, {}\n", "for i, j in enumerate(df_int.max()):\n", " max_dict[f'I{i + 1}'] = j\n", "\n", "for i, j in enumerate(df_int.min()):\n", " min_dict[f'I{i + 1}'] = j" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 截取df的第2列到第40列。" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "features = df.iloc[:, 1:40]" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "3. 
对df的“label”列应用自定义函数,将“label”列中的数值转换为numpy数组,将数据添加到df的“label”列。" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def get_label(x):\n", " return np.array([x])\n", "df['label'] = df['label'].apply(get_label)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "4. 对features应用自定义函数,将稠密数据进行归一化处理,其他数据填充为1,将数据添加到df的“weight”列。" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def get_weight(x):\n", " ret = []\n", " for index, val in enumerate(x):\n", " if index < DENSE_NUM:\n", " col = f'I{index + 1}'\n", " ret.append((val - min_dict[col]) / (max_dict[col] - min_dict[col]))\n", " else:\n", " ret.append(1)\n", " return ret\n", "feat_weight = features.apply(get_weight, axis=1)\n", "df['weight'] = feat_weight" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "5. 对features应用自定义函数,获取稠密数据的索引,其他数据使用其哈希值进行填充,将数据添加到df的“id”列。" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def get_id(x):\n", " ret = []\n", " for index, val in enumerate(x):\n", " if index < DENSE_NUM:\n", " ret.append(index + 1)\n", " else:\n", " ret.append(hash(val))\n", " return ret\n", "feat_id = features.apply(get_id, axis=1)\n", "df['id'] = feat_id" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 数据集划分\n", "\n", "df新增\"is_training\"列,将数据的前70%设为训练数据,其他数据标记为非训练数据。结果如下所示:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idweightlabelis_training
0[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 31...[0.5799200799200799, 0.705335731414868, 0.8280...[0]1
1[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3...[0.6971028971028971, 0.8390287769784173, 0.435...[0]1
2[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 71...[0.11528471528471529, 0.9327537969624301, 0.94...[1]1
3[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 38...[0.6217782217782217, 0.3188449240607514, 0.923...[1]1
4[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3...[0.3226773226773227, 0.3240407673860911, 0.225...[1]1
...............
9995[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6...[0.09270729270729271, 0.3959832134292566, 0.03...[0]0
9996[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 12...[0.5147852147852148, 0.48810951239008793, 0.46...[1]0
9997[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -2...[0.4792207792207792, 0.4045763389288569, 0.514...[1]0
9998[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6...[0.550949050949051, 0.1035171862509992, 0.2167...[0]0
9999[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -4...[0.9004995004995004, 0.9000799360511591, 0.826...[0]0
\n", "

10000 rows × 4 columns

\n", "
" ], "text/plain": [ " id \\\n", "0 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 31... \n", "1 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3... \n", "2 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 71... \n", "3 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 38... \n", "4 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3... \n", "... ... \n", "9995 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6... \n", "9996 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 12... \n", "9997 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -2... \n", "9998 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6... \n", "9999 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -4... \n", "\n", " weight label is_training \n", "0 [0.5799200799200799, 0.705335731414868, 0.8280... [0] 1 \n", "1 [0.6971028971028971, 0.8390287769784173, 0.435... [0] 1 \n", "2 [0.11528471528471529, 0.9327537969624301, 0.94... [1] 1 \n", "3 [0.6217782217782217, 0.3188449240607514, 0.923... [1] 1 \n", "4 [0.3226773226773227, 0.3240407673860911, 0.225... [1] 1 \n", "... ... ... ... \n", "9995 [0.09270729270729271, 0.3959832134292566, 0.03... [0] 0 \n", "9996 [0.5147852147852148, 0.48810951239008793, 0.46... [1] 0 \n", "9997 [0.4792207792207792, 0.4045763389288569, 0.514... [1] 0 \n", "9998 [0.550949050949051, 0.1035171862509992, 0.2167... [0] 0 \n", "9999 [0.9004995004995004, 0.9000799360511591, 0.826... 
[0] 0 \n", "\n", "[10000 rows x 4 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m_train_len = int(len(df) * 0.7)\n", "df['is_training'] = [1] * m_train_len + [0] * (len(df) - m_train_len)\n", "df = df[['id', 'weight', 'label', 'is_training']]\n", "df.to_pandas()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "At this point, data generation, data preprocessing, and feature engineering are complete, and the processed data is ready to be fed into a model for training." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.13" } }, "nbformat": 4, "nbformat_minor": 1 }