PySpark Date Handling and Comparison
Author: 高景洋 Date: 2021-01-28 11:38:26 Views: 1953
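Business requirements:
1. Pull the data out of HBase with PySpark.
2. With PySpark, count the records per WebsiteID for each of UpdatedDate, EnteredDate, and DeletedDate.
Beginner's approach:
1. Pull the HBase data with Spark.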
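2. rdd1 = hbase_result_rdd.map(process_date_fields) # process_date_fields is a placeholder for a function that normalizes the date fields
df1 = rdd_to_df(rdd1) # convert the RDD to a DataFrame
df2 = df1.filter(df1['UpdatedDate'] > datetime.datetime.today().date()) # how it would ideally look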
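How it looks in reality: all kinds of date type conversion errors, because the field values in the data come in several shapes, for example None, 2021-01-22 18:47:48, and 2021-01-22T18:47:48.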
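To make the beginner's approach concrete, here is a minimal sketch of the kind of per-value function one would have to pass to `map`, assuming the only value shapes are the three listed above plus the `'--'` placeholder that the final code filters out. The helper name `parse_hbase_date` and the format list are assumptions for illustration, not part of the original job.

```python
from datetime import datetime
from typing import Optional

# Assumed formats: the two string shapes mentioned in the text.
_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S")


def parse_hbase_date(value: Optional[str]) -> Optional[datetime]:
    """Return a datetime for a recognised string, or None otherwise.

    Hypothetical helper: None, placeholders such as '--', and any other
    unparseable text all map to None instead of raising.
    """
    if value is None:
        return None
    text = value.strip()
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next known format
    return None


samples = [None, "2021-01-22 18:47:48", "2021-01-22T18:47:48", "--"]
parsed = [parse_hbase_date(s) for s in samples]
```

Every new value shape means another entry in the format list, which is part of why handing the parsing to Spark's built-in functions is attractive.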
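Better approach:
1. So we change tack and check whether Spark has built-in date functions. We had already used several built-ins: the random function rand(), the MD5 function md5(), and lit(), which creates a literal column (handy when adding a column).
2. Sure enough, PySpark has built-in date functions:
to_timestamp: converts a column to a timestamp
current_date: returns the current date
3. Finally, here is the final code: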
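from pyspark.sql.functions import rand, md5, lit, to_timestamp, current_date
print('更新时间')  # log label, "Update time"
# Drop rows whose UpdatedDate is the '--' placeholder, then parse the rest into timestamps.
df_update_date = (
    list_filter_websiteids
    .filter(list_filter_websiteids['UpdatedDate'] != '--')
    .select('WebsiteID', to_timestamp(list_filter_websiteids['UpdatedDate']).alias('UpdatedDate'))
)
df_update_date.show()
# Keep rows whose UpdatedDate is later than today's date, then count them per WebsiteID.
df_update_date_1 = df_update_date.filter(df_update_date['UpdatedDate'] > current_date()).groupby('WebsiteID').count()
df_update_date_1.show()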
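What the last filter and `groupby` compute can be modelled in plain Python. This is only a sketch of the semantics, assuming that comparing a timestamp with `current_date()` treats the date as midnight at the start of that day and that rows with a null timestamp never pass the filter. The function name and the sample rows are illustrative. The WebsiteID 1 values are taken from the output of the run, the WebsiteID 71 rows are made up.

```python
from collections import Counter
from datetime import date, datetime, time


def count_updated_since_midnight(rows, today):
    """Count rows per website whose timestamp is strictly after midnight of `today`.

    rows: iterable of (website_id, datetime or None) pairs.
    """
    midnight = datetime.combine(today, time.min)
    counts = Counter()
    for website_id, updated in rows:
        # A missing timestamp never satisfies the comparison, so the row is skipped.
        if updated is not None and updated > midnight:
            counts[website_id] += 1
    return dict(counts)


rows = [
    (1, datetime(2021, 1, 28, 8, 26, 46)),
    (1, datetime(2021, 1, 22, 18, 47, 48)),
    (1, datetime(2021, 1, 28, 8, 59, 3)),
    (71, datetime(2021, 1, 28, 0, 0, 0)),  # exactly midnight: not strictly greater
    (71, None),                            # missing value: skipped
]
result = count_updated_since_midnight(rows, date(2021, 1, 28))
```

Because the comparison is strict, a record stamped exactly at midnight is excluded. Using `>=` would include it.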
Execution result:
28-01-2021 11:05:54 CST hbase_scan INFO - 更新时间
28-01-2021 11:05:54 CST hbase_scan INFO - +---------+-------------------+
28-01-2021 11:05:54 CST hbase_scan INFO - |WebsiteID|        UpdatedDate|
28-01-2021 11:05:54 CST hbase_scan INFO - +---------+-------------------+
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-28 08:26:46|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-28 08:59:03|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-28 08:45:54|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - |        1|2021-01-22 18:47:48|
28-01-2021 11:05:54 CST hbase_scan INFO - +---------+-------------------+
28-01-2021 11:05:54 CST hbase_scan INFO - only showing top 20 rows
28-01-2021 11:05:54 CST hbase_scan INFO -
28-01-2021 11:05:55 CST hbase_scan INFO -
28-01-2021 11:05:58 CST hbase_scan INFO - [Stage 24:=============================> (1 + 1) / 2]
28-01-2021 11:05:58 CST hbase_scan INFO -
28-01-2021 11:05:59 CST hbase_scan INFO - +---------+-----+
28-01-2021 11:05:59 CST hbase_scan INFO - |WebsiteID|count|
28-01-2021 11:05:59 CST hbase_scan INFO - +---------+-----+
28-01-2021 11:05:59 CST hbase_scan INFO - |      108|    1|
28-01-2021 11:05:59 CST hbase_scan INFO - |      521|   55|
28-01-2021 11:05:59 CST hbase_scan INFO - |       71|  762|
28-01-2021 11:05:59 CST hbase_scan INFO - |       23|   29|
28-01-2021 11:05:59 CST hbase_scan INFO - |       70|   16|
28-01-2021 11:05:59 CST hbase_scan INFO - |      167|   17|
28-01-2021 11:05:59 CST hbase_scan INFO - |        1| 1220|
28-01-2021 11:05:59 CST hbase_scan INFO - |      235|  138|
28-01-2021 11:05:59 CST hbase_scan INFO - |      292|   28|
28-01-2021 11:05:59 CST hbase_scan INFO - |      171|    3|
28-01-2021 11:05:59 CST hbase_scan INFO - |       58|  263|
28-01-2021 11:05:59 CST hbase_scan INFO - |      243|    1|
28-01-2021 11:05:59 CST hbase_scan INFO - |      168|   16|
28-01-2021 11:05:59 CST hbase_scan INFO - |       24|  549|
28-01-2021 11:05:59 CST hbase_scan INFO - |       83|  540|
28-01-2021 11:05:59 CST hbase_scan INFO - |      492|    5|
28-01-2021 11:05:59 CST hbase_scan INFO - |      251|    1|
28-01-2021 11:05:59 CST hbase_scan INFO - |      279|    4|
28-01-2021 11:05:59 CST hbase_scan INFO - |       91|  565|
28-01-2021 11:05:59 CST hbase_scan INFO - |       94|  244|
28-01-2021 11:05:59 CST hbase_scan INFO - +---------+-----+
28-01-2021 11:05:59 CST hbase_scan INFO - only showing top 20 rows
Permanent link to this article:
<a href="http://r4.com.cn/art172.aspx">PySpark Date Handling and Comparison</a>