
PySpark date handling and comparison

Author: 高景洋 Date: 2021-01-28 11:38:26 Views: 1953

Business requirements

1. Pull the data out of HBase with PySpark

2. With PySpark, count rows by UpdatedDate, EnteredDate, and DeletedDate, grouped by the WebsiteID field


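Naive approach:

1. Pull the HBase data with Spark

2. rdd1 = hbase_result_rdd.map('process the date fields')

     df1 = rdd_to_df(rdd1) # convert the RDD to a DataFrame

     df2 = df1.filter(df1['UpdatedDate'] > datetime.date.today()) # how it was supposed to look

     How it looked in practice: all kinds of date type conversion errors, because the field values in the data come in several forms, such as None \ 2021-01-22 18:47:48 \ 2021-01-22T18:47:48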
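To make the problem concrete, here is a minimal plain-Python sketch of what the per-record date handling in the map step would have to cope with. The helper name parse_hbase_date and the sample list are hypothetical; the value forms are the ones listed above, plus the '--' placeholder that the final code filters out.

```python
import datetime
from typing import Optional

# The two string layouts seen in the data (space-separated and ISO 'T'-separated).
KNOWN_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S")


def parse_hbase_date(value: Optional[str]) -> Optional[datetime.datetime]:
    """Return a datetime for a known layout, or None for missing or unparseable values."""
    if value is None:
        return None
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next layout
    return None  # e.g. a placeholder such as '--'


# Hypothetical sample covering each value form
samples = [None, "2021-01-22 18:47:48", "2021-01-22T18:47:48", "--"]
parsed = [parse_hbase_date(s) for s in samples]
print(parsed)
```

Every new value form needs another branch in a helper like this, which is part of why the hand-rolled route gets tedious.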

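Smarter approach:

1. So we changed tack and checked whether Spark has built-in date functions. We had already used the built-in random function rand(), the built-in MD5 function md5(), and the built-in lit(), which creates a literal column.

2. Sure enough, PySpark has built-in date functions:

     to_timestamp: converts a column to a timestamp

     current_date: returns the current date

3. Finally, here is the resulting code: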

     


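from pyspark.sql.functions import rand,md5,lit,to_timestamp,current_date

# list_filter_websiteids is assumed to be the DataFrame built from the HBase scan
print('更新时间')  # log label meaning "update time"; it appears in the output below
# Drop the '--' placeholder (this comparison also drops nulls), then parse the strings into timestamps
df_update_date = list_filter_websiteids.filter(list_filter_websiteids['UpdatedDate']!='--').select('WebsiteID',to_timestamp(list_filter_websiteids['UpdatedDate']).alias('UpdatedDate'))
df_update_date.show()
# Keep rows updated after midnight today, then count them per WebsiteID
df_update_date_1 = df_update_date.filter(df_update_date['UpdatedDate']>current_date()).groupby('WebsiteID').count()
df_update_date_1.show()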


Execution result:

28-01-2021 11:05:54 CST hbase_scan INFO - 更新时间
28-01-2021 11:05:54 CST hbase_scan INFO - +---------+-------------------+
28-01-2021 11:05:54 CST hbase_scan INFO - |WebsiteID|        UpdatedDate|
28-01-2021 11:05:54 CST hbase_scan INFO - +---------+-------------------+
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-28 08:26:46|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-28 08:59:03|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-28 08:45:54|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - +---------+-------------------+
28-01-2021 11:05:54 CST hbase_scan INFO - only showing top 20 rows
28-01-2021 11:05:54 CST hbase_scan INFO - 
28-01-2021 11:05:55 CST hbase_scan INFO - 
28-01-2021 11:05:58 CST hbase_scan INFO - [Stage 24:=============================>                            (1 + 1) / 2]
28-01-2021 11:05:58 CST hbase_scan INFO -                                                                                 
28-01-2021 11:05:59 CST hbase_scan INFO - +---------+-----+
28-01-2021 11:05:59 CST hbase_scan INFO - |WebsiteID|count|
28-01-2021 11:05:59 CST hbase_scan INFO - +---------+-----+
28-01-2021 11:05:59 CST hbase_scan INFO - |      108|    1|
28-01-2021 11:05:59 CST hbase_scan INFO - |      521|   55|
28-01-2021 11:05:59 CST hbase_scan INFO - |       71|  762|
28-01-2021 11:05:59 CST hbase_scan INFO - |       23|   29|
28-01-2021 11:05:59 CST hbase_scan INFO - |       70|   16|
28-01-2021 11:05:59 CST hbase_scan INFO - |      167|   17|
28-01-2021 11:05:59 CST hbase_scan INFO - |        1| 1220|
28-01-2021 11:05:59 CST hbase_scan INFO - |      235|  138|
28-01-2021 11:05:59 CST hbase_scan INFO - |      292|   28|
28-01-2021 11:05:59 CST hbase_scan INFO - |      171|    3|
28-01-2021 11:05:59 CST hbase_scan INFO - |       58|  263|
28-01-2021 11:05:59 CST hbase_scan INFO - |      243|    1|
28-01-2021 11:05:59 CST hbase_scan INFO - |      168|   16|
28-01-2021 11:05:59 CST hbase_scan INFO - |       24|  549|
28-01-2021 11:05:59 CST hbase_scan INFO - |       83|  540|
28-01-2021 11:05:59 CST hbase_scan INFO - |      492|    5|
28-01-2021 11:05:59 CST hbase_scan INFO - |      251|    1|
28-01-2021 11:05:59 CST hbase_scan INFO - |      279|    4|
28-01-2021 11:05:59 CST hbase_scan INFO - |       91|  565|
28-01-2021 11:05:59 CST hbase_scan INFO - |       94|  244|
28-01-2021 11:05:59 CST hbase_scan INFO - +---------+-----+
28-01-2021 11:05:59 CST hbase_scan INFO - only showing top 20 rows


Permanent link to this article:
<a href="http://r4.com.cn/art172.aspx">PySpark date handling and comparison</a>