{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true, "pycharm": { "name": "#%% md\n" } }, "source": [ "# Quick Start: Data Processing with MindSpore Pandas\n", "\n", "[![View Source File](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.0/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r2.0/docs/mindpandas/docs/source_zh_cn/mindpandas_quick_start.ipynb)\n", "\n", "Data preprocessing is critical to model training, and good feature engineering can substantially improve training accuracy. This section uses feature engineering for a recommender system as an example to walk through the data processing workflow with MindSpore Pandas.\n", "\n", "## Setting the MindSpore Pandas Execution Mode\n", "\n", "MindSpore Pandas supports multithread and multiprocess modes. This example uses multithread mode and sets the partition shape to 16*3. For details, see [MindSpore Pandas Execution Mode Introduction and Configuration Instructions](https://www.mindspore.cn/mindpandas/docs/zh-CN/r0.2/mindpandas_configuration.html). The settings are applied as follows:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "is_executing": true, "name": "#%%\n" } }, "outputs": [], "source": [ "import numpy as np\n", "import mindpandas as pd\n", "import random\n", "\n", "pd.set_concurrency_mode(\"multithread\")\n", "pd.set_partition_shape((16, 3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Generation\n", "\n", "Generate two-dimensional data with 10000 rows and 40 columns, containing a label, dense features, and sparse features. The label is a random value of 0 or 1, each dense feature is a random integer in the range [-10, 10000], and each sparse feature is a random 8-character hexadecimal string." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "DENSE_NUM = 13\n", "SPARSE_NUM = 26\n", "ROW_NUM = 10000\n", "cat_val, int_val, lab_val = [], [], []\n", "\n", "def gen_cat_feature(length):\n", "    # randint is inclusive on both ends, so cap at 16 ** length - 1 to stay within `length` hex digits.\n", "    result = hex(random.randint(0, 16 ** length - 1)).replace('0x', '').upper()\n", "    # Left-pad with zeros so every string has exactly `length` characters.\n", "    if len(result) < length:\n", "        result = '0' * (length - len(result)) + result\n", "    return str(result)\n", "\n", "def gen_int_feature():\n", "    return random.randint(-10, 10000)\n", "\n", "def gen_lab_feature():\n", "    x = random.randint(0, 1)\n", "    return round(x)\n", "\n", "for i in range(ROW_NUM * SPARSE_NUM):\n", "    cat_val.append(gen_cat_feature(8))\n", "np_cat = np.array(cat_val).reshape(ROW_NUM, SPARSE_NUM)\n", "df_cat = pd.DataFrame(np_cat, columns=[f'C{i + 1}' for i in
range(SPARSE_NUM)])\n", "\n", "for i in range(ROW_NUM * DENSE_NUM):\n", "    int_val.append(gen_int_feature())\n", "np_int = np.array(int_val).reshape(ROW_NUM, DENSE_NUM)\n", "df_int = pd.DataFrame(np_int, columns=[f'I{i + 1}' for i in range(DENSE_NUM)])\n", "\n", "for i in range(ROW_NUM):\n", "    lab_val.append(gen_lab_feature())\n", "np_lab = np.array(lab_val).reshape(ROW_NUM, 1)\n", "df_lab = pd.DataFrame(np_lab, columns=['label'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preprocessing\n", "\n", "Concatenate the label, dense features, and sparse features into the dataset to be processed. The result is shown below:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "|   | label | I1 | I2 | I3 | I4 | I5 | I6 | I7 | I8 | I9 | ... | C17 | C18 | C19 | C20 | C21 | C22 | C23 | C24 | C25 | C26 |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 0 | 5795 | 7051 | 8277 | 785 | 9305 | 7521 | 5206 | 6240 | 172 | ... | A5AE1E6D | 25A100C3 | C6B8E0A4 | A94F6B56 | B27D726B | EB9F3C73 | D98D17B2 | 793AB315 | 8C12657F | AFCEEBFF |\n",
"| 1 | 0 | 6968 | 8389 | 4352 | 3312 | 4021 | 5087 | 2254 | 4249 | 4411 | ... | EEAC1040 | BDC711B9 | 16269D1B | D59EA7BB | 460218D4 | F89E137C | F488ED52 | C1DDB598 | AE9C21C9 | 11D47A2A |\n",
"| 2 | 1 | 1144 | 9327 | 9399 | 7745 | 8144 | 7189 | 1663 | 1005 | 6421 | ... | 54EE530F | 68D2F7EF | EFD65C79 | B2F2CCF5 | 86E02110 | 31617C19 | 44A2DFA4 | 032C30D1 | C8098BAD | CE4DD8BB |\n",
"| 3 | 1 | 6214 | 3183 | 9229 | 938 | 9160 | 2783 | 2680 | 4775 | 4436 | ... | 639D80AA | 3A14B884 | 9FC92B4F | 67DB3280 | 1EE1FC45 | CE19F4C1 | F34CC6FD | C3C9F66C | CA1B3F85 | F184D01E |\n",
"| 4 | 1 | 3220 | 3235 | 2243 | 50 | 5074 | 6328 | 6894 | 6838 | 3063 | ... | 7671D909 | 126B3F69 | 1262514D | 25C18137 | 2BA958DE | D6CE7BE3 | 18D4EEE1 | 315D0FFB | 7C25DB1D | 6E4ABFB1 |\n",
"\n",
"5 rows × 40 columns
\n", "\n", "|   | id | weight | label | is_training |\n",
"|---|---|---|---|---|\n",
"| 0 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 31... | [0.5799200799200799, 0.705335731414868, 0.8280... | [0] | 1 |\n",
"| 1 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3... | [0.6971028971028971, 0.8390287769784173, 0.435... | [0] | 1 |\n",
"| 2 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 71... | [0.11528471528471529, 0.9327537969624301, 0.94... | [1] | 1 |\n",
"| 3 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 38... | [0.6217782217782217, 0.3188449240607514, 0.923... | [1] | 1 |\n",
"| 4 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -3... | [0.3226773226773227, 0.3240407673860911, 0.225... | [1] | 1 |\n",
"| ... | ... | ... | ... | ... |\n",
"| 9995 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6... | [0.09270729270729271, 0.3959832134292566, 0.03... | [0] | 0 |\n",
"| 9996 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 12... | [0.5147852147852148, 0.48810951239008793, 0.46... | [1] | 0 |\n",
"| 9997 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -2... | [0.4792207792207792, 0.4045763389288569, 0.514... | [1] | 0 |\n",
"| 9998 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -6... | [0.550949050949051, 0.1035171862509992, 0.2167... | [0] | 0 |\n",
"| 9999 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, -4... | [0.9004995004995004, 0.9000799360511591, 0.826... | [0] | 0 |\n",
"\n",
"10000 rows × 4 columns
\n", "