News Recommendation Dataset: Data Analysis

Blog editor (34) 2024-09-04 17:01:01

News Recommendation: Task 02, Data Analysis

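The data consists of the training-set user click logs, the test-set user click logs, the article metadata, and the article embedding vectors.
Why analyze the data: to get familiar with the dataset as a whole, that is, which data each file contains, what each field actually means, and how the features correlate with one another.
For news recommendation, the main things to analyze are the users' own state, the relationship between users and articles, the correlations between articles, and the basic attributes of the articles themselves. Understanding these helps when choosing recall strategies and deciding the direction of feature engineering later on.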

Import libraries

%matplotlib inline
import os
import re
import sys
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.rc('font', family='SimHei', size=13)  # set the plot font (SimHei can render Chinese labels)
warnings.filterwarnings('ignore')  # suppress runtime warnings

Load the dataset

path = r'D:/新闻推荐/'
trn_click = pd.read_csv(path + 'train_click_log.csv')
item_df = pd.read_csv(path + 'articles.csv')
# rename the article key so it matches the column name used in the click logs
item_df = item_df.rename(columns={'article_id': 'click_article_id'})
item_emb_df = pd.read_csv(path + 'articles_emb.csv')
tst_click = pd.read_csv(path + 'testA_click_log.csv')

Inspect the loaded data and look for obvious relationships between features

trn_click.head(3)

	click_article_id	click_timestamp	click_environment	click_deviceGroup	click_os	click_country	click_region	click_referrer_type
0		90	4	1	17	1	13	1
1	5408	78	4	1	17	1	13	1
2	50823	78	4	1	17	1	13	1

item_df.head(3)

	click_article_id	category_id	words_count
0	0	0	168
1	1	1	189
2	2	1	250

item_emb_df.head(3)

	article_id	emb_0	emb_1	emb_2	emb_3	emb_4	emb_5	emb_6	emb_7	emb_8	...	emb_240	emb_241	emb_242	emb_243	emb_244	emb_245	emb_246	emb_247	emb_248	emb_249
0	0	-0.	-0.	-0.	0.050855	0.	0.	-0.	-0.	-0.	...	0.	0.	0.	0.	0.	-0.	0.	-0.	0.	0.
1	1	-0.	-0.	0.	0.	0.	0.	-0.	-0.	-0.	...	-0.	0.	0.	-0.	0.	0.	-0.	0.	0.	-0.
2	2	-0.	-0.	-0.	-0.	0.044748	-0.	-0.	-0.066126	-0.	...	0.	0.	0.	-0.	-0.	0.	-0.	-0.	0.	-0.

3 rows × 251 columns

Convert the timestamps in the user logs into a rank feature that is easier to interpret and analyze

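Notes:

Use groupby() with user_id as the key to group the click log by user.
Apply rank() to click_timestamp within each group. A larger timestamp means a later click, so rank in descending order (ascending=False); each user's most recent click then gets rank 1 and the ranking is unambiguous.
rank() returns floats, so cast the result to an integer type. With the default method='average', tied timestamps receive fractional ranks (for example 1.5), which astype(int) truncates.
transform() takes the function to apply (or its name, such as 'count') and returns a result aligned with the original rows.
merge() takes the table to merge in as its first argument; the 'on' parameter names the key column(s) the join relies on, and the 'how' parameter sets the join type (for example 'left').
describe() shows summary statistics of the data (count, mean, std, min, quartiles, max).
info() shows how the DataFrame itself is stored: index, column names, non-null counts, dtypes and memory usage.
nunique() returns the number of distinct values.
unique() returns the distinct values themselves.
count() returns the number of non-null elements.
value_counts() returns every distinct value together with the number of times it occurs.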
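The rank and transform notes above can be tried on a small table before running them on the full logs. The following is a minimal sketch on made-up values (the toy DataFrame is an assumption for illustration, not data from the competition files). It also passes method='first', which is not what the post's own code does, to show one way of keeping every rank a whole number when timestamps tie.

```python
import pandas as pd

# Toy click log with hypothetical values, not taken from the real dataset.
clicks = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "click_article_id": [10, 11, 12, 10, 13],
    "click_timestamp": [100, 300, 200, 500, 400],
})

# Descending rank per user: the most recent click gets rank 1.
# method="first" breaks ties by row order, so ranks are whole numbers
# and the cast to int does not truncate anything.
clicks["rank"] = (
    clicks.groupby("user_id")["click_timestamp"]
    .rank(ascending=False, method="first")
    .astype(int)
)

# transform("count") returns one value per original row, so the per-user
# click total can be assigned straight back as a new column.
clicks["click_cnts"] = (
    clicks.groupby("user_id")["click_timestamp"].transform("count")
)

print(clicks)
```

With the default method='average', two clicks sharing a timestamp would both receive a fractional rank such as 1.5, and the integer cast would silently turn both into 1.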

trn_click['rank'] = trn_click.groupby('user_id')['click_timestamp'].rank(ascending = False).astype(int)
tst_click['rank'] = tst_click.groupby('user_id')['click_timestamp'].rank(ascending = False).astype(int)

trn_click['click_cnts'] = trn_click.groupby(['user_id'])['click_timestamp'].transform('count')

trn_click.head(2)

	user_id	click_article_id	click_timestamp	click_environment	click_deviceGroup	click_os	click_country	click_region	click_referrer_type	rank	click_cnts
0			90	4	1	17	1	13	1	11	11
1		5408	78	4	1	17	1	13	1	10	11

trn_click = trn_click.merge(item_df, how = 'left', on = ['click_article_id'])
trn_click.head(2)
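A left merge keeps every click row even when an article has no metadata, so it can be worth checking how many rows failed to match. The sketch below uses two made-up tables (all values are hypothetical) together with the indicator and validate options of pandas merge; the post itself does not use these options.

```python
import pandas as pd

# Hypothetical click rows; article 99 has no metadata row below.
clicks = pd.DataFrame({
    "user_id": [1, 1, 2],
    "click_article_id": [10, 11, 99],
})

# Hypothetical article metadata, one row per article.
articles = pd.DataFrame({
    "click_article_id": [10, 11, 12],
    "category_id": [281, 4, 26],
    "words_count": [173, 118, 162],
})

# how="left" keeps all click rows; on= names the join key.
# indicator=True adds a "_merge" column recording where each row came from.
# validate="many_to_one" raises if the article table has duplicate keys.
merged = clicks.merge(
    articles,
    how="left",
    on="click_article_id",
    indicator=True,
    validate="many_to_one",
)

# Rows present only in the click log have no article metadata (NaN columns).
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

If unmatched is empty on the real data, the later statistics on category_id and words_count cover every click.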

	user_id	click_article_id	click_timestamp	click_environment	click_deviceGroup	click_os	click_country	click_region	click_referrer_type	rank	click_cnts	category_id	created_at_ts	words_count
0			90	4	1	17	1	13	1	11	11	281	00	173
1		5408	78	4	1	17	1	13	1	10	11	4	00	118

trn_click.sort_values(by = 'user_id')

	user_id	click_article_id	click_timestamp	click_environment	click_deviceGroup	click_os	click_country	click_region	click_referrer_type	rank	click_cnts	category_id	created_at_ts	words_count
	0		20	4	1	17	1	25	2	1	2	281	00	370
	0	30760	20	4	1	17	1	25	2	2	2	26	00	162
	1	63746	89	4	1	17	1	25	6	1	2	133	00	162
	1		89	4	1	17	1	25	6	2	2	418	00	176
	2		95	4	3	20	1	25	2	1	2	297	00	215
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
			88	4	1	17	1	13	1	1	11	352	00	202
			86	4	1	17	1	13	1	6	11	281	00	285
		42223	86	4	1	17	1	13	1	5	11	67	00	186
			64	4	1	17	1	13	1	8	11	250	00	240
0			90	4	1	17	1	13	1	11	11	281	00	173

rows × 14 columns

trn_click.describe()

	user_id	click_article_id	click_timestamp	click_environment	click_deviceGroup	click_os	click_country	click_region	click_referrer_type	rank	click_cnts	category_id	created_at_ts	words_count
count	1.e+06	1.e+06	1.e+06	1.e+06	1.e+06	1.e+06	1.e+06	1.e+06	1.e+06	1.e+06	1.e+06	1.e+06	1.e+06	1.e+06
mean	1.e+05	1.e+05	1.e+12	3.e+00	1.e+00	1.e+01	1.e+00	1.e+01	1.e+00	7.e+00	1.e+01	3.056176e+02	1.e+12	2.011981e+02
std	5.e+04	9.e+04	3.e+08	3.e-01	1.035170e+00	6.e+00	1.e+00	7.e+00	1.e+00	1.016095e+01	1.e+01	1.e+02	8.e+09	5.e+01
min	0.000000e+00	3.000000e+00	1.e+12	1.000000e+00	1.000000e+00	2.000000e+00	1.000000e+00	1.000000e+00	1.000000e+00	1.000000e+00	2.000000e+00	1.000000e+00	1.e+12	0.000000e+00
25%	7.e+04	1.e+05	1.e+12	4.000000e+00	1.000000e+00	2.000000e+00	1.000000e+00	1.e+01	1.000000e+00	2.000000e+00	4.000000e+00	2.e+02	1.e+12	1.e+02
50%	1.e+05	2.038900e+05	1.e+12	4.000000e+00	1.000000e+00	1.e+01	1.000000e+00	2.e+01	2.000000e+00	4.000000e+00	8.000000e+00	3.e+02	1.e+12	1.e+02
75%	1.e+05	2.e+05	1.e+12	4.000000e+00	3.000000e+00	1.e+01	1.000000e+00	2.e+01	2.000000e+00	8.000000e+00	1.e+01	4.e+02	1.e+12	2.e+02
max	1.e+05	3.e+05	1.e+12	4.000000e+00	5.000000e+00	2.000000e+01	1.e+01	2.e+01	7.000000e+00	2.e+02	2.e+02	4.e+02	1.e+12	6.e+03

trn_click.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index:  entries, 0 to 
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   user_id               non-null  int64
 1   click_article_id      non-null  int64
 2   click_timestamp       non-null  int64
 3   click_environment     non-null  int64
 4   click_deviceGroup     non-null  int64
 5   click_os              non-null  int64
 6   click_country         non-null  int64
 7   click_region          non-null  int64
 8   click_referrer_type   non-null  int64
 9   rank                  non-null  int32
 10  click_cnts            non-null  int64
 11  category_id           non-null  int64
 12  created_at_ts         non-null  int64
 13  words_count           non-null  int64
dtypes: int32(1), int64(13)
memory usage: 123.1 MB

trn_click['user_id'].nunique()

trn_click.groupby(['user_id'])['click_article_id'].count().min()
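The counting helpers used here (nunique, unique, count, value_counts) are easy to confuse, so the following sketch runs them side by side on a made-up click table. The values are hypothetical and only illustrate what each call returns.

```python
import pandas as pd

# Toy click log with hypothetical values.
clicks = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "click_article_id": [10, 11, 10, 10, 13],
})

# Number of distinct users.
n_users = clicks["user_id"].nunique()

# The distinct user ids themselves, in order of first appearance.
user_ids = clicks["user_id"].unique()

# Non-null clicks per user, then the smallest of those counts.
clicks_per_user = clicks.groupby("user_id")["click_article_id"].count()
min_clicks = clicks_per_user.min()

# How many times each article was clicked, most clicked first.
article_counts = clicks["click_article_id"].value_counts()

print(n_users, list(user_ids), min_clicks)
print(article_counts)
```

On the real training log, the same min() call gives the fewest clicks any single user has.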
