<!DOCTYPE html><html class="theme-next mist use-motion" lang="zh-Hans"><head><meta name="generator" content="Hexo 3.9.0"><meta charset="UTF-8"><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1"><meta name="theme-color" content="#222"><script src="/lib/pace/pace.min.js?v=1.0.2"></script><link href="/lib/pace/pace-theme-minimal.min.css?v=1.0.2" rel="stylesheet"><meta http-equiv="Cache-Control" content="no-transform"><meta http-equiv="Cache-Control" content="no-siteapp"><link href="/lib/fancybox/source/jquery.fancybox.css?v=2.1.5" rel="stylesheet" type="text/css"><link href="/lib/font-awesome/css/font-awesome.min.css?v=4.6.2" rel="stylesheet" type="text/css"><link href="/css/main.css?v=5.1.3" rel="stylesheet" type="text/css"><link rel="apple-touch-icon" sizes="180x180" href="/images/apple-touch-icon-240x240-playpi.png?v=5.1.3"><link rel="icon" type="image/png" sizes="32x32" href="/images/favicon-32x32-playpi.png?v=5.1.3"><link rel="icon" type="image/png" sizes="16x16" href="/images/favicon-16x16-playpi.png?v=5.1.3"><link rel="mask-icon" href="/images/logo-playpi.svg?v=5.1.3" color="#222"><meta name="keywords" content="Elasticsearch,Hadoop,Date"><link rel="alternate" href="/atom.xml" title="虾丸派" type="application/atom+xml"><meta name="description" content="最近在项目中遇到一个由 Elasticsearch 版本差异引起的奇怪现象,导致程序异常,一开始还以为是程序的问题,后来排查发现是由 Elasticsearch 的 Date 类型字段引起的,本文记录解决过程。开发环境基于 Elasticsearch v1.7.5、Elasticsearch v2.4.5。"><meta name="keywords" content="Elasticsearch,Hadoop,Date"><meta property="og:type" content="article"><meta property="og:title" content="es-hadoop 遇上 Elasticsearch 的 Date 类型字段"><meta property="og:url" content="https://www.playpi.org/2018041801.html"><meta property="og:site_name" content="虾丸派"><meta property="og:description" content="最近在项目中遇到一个由 Elasticsearch 版本差异引起的奇怪现象,导致程序异常,一开始还以为是程序的问题,后来排查发现是由 Elasticsearch 的 Date 类型字段引起的,本文记录解决过程。开发环境基于 Elasticsearch v1.7.5、Elasticsearch 
v2.4.5。"><meta property="og:locale" content="zh-Hans"><meta property="og:image" content="https://raw.githubusercontent.com/iplaypi/img-playpi/master/img/2018/20200215180154.png"><meta property="og:image" content="https://raw.githubusercontent.com/iplaypi/img-playpi/master/img/2018/20200215180503.png"><meta property="og:image" content="https://raw.githubusercontent.com/iplaypi/img-playpi/master/img/2018/20200215190242.png"><meta property="og:image" content="https://raw.githubusercontent.com/iplaypi/img-playpi/master/img/2018/20200215191343.png"><meta property="og:image" content="https://raw.githubusercontent.com/iplaypi/img-playpi/master/img/2018/20200215184320.png"><meta property="og:image" content="https://raw.githubusercontent.com/iplaypi/img-playpi/master/img/2018/20200215191943.png"><meta property="og:image" content="https://raw.githubusercontent.com/iplaypi/img-playpi/master/img/2018/20200215192158.png"><meta property="og:image" content="https://raw.githubusercontent.com/iplaypi/img-playpi/master/img/2018/20200215192346.png"><meta property="og:updated_time" content="2020-01-16T12:14:18.000Z"><meta name="twitter:card" content="summary"><meta name="twitter:title" content="es-hadoop 遇上 Elasticsearch 的 Date 类型字段"><meta name="twitter:description" content="最近在项目中遇到一个由 Elasticsearch 版本差异引起的奇怪现象,导致程序异常,一开始还以为是程序的问题,后来排查发现是由 Elasticsearch 的 Date 类型字段引起的,本文记录解决过程。开发环境基于 Elasticsearch v1.7.5、Elasticsearch v2.4.5。"><meta name="twitter:image" content="https://raw.githubusercontent.com/iplaypi/img-playpi/master/img/2018/20200215180154.png"><script type="text/javascript" id="hexo.configurations">var 
NexT=window.NexT||{},CONFIG={root:"/",scheme:"Mist",version:"5.1.3",sidebar:{position:"left",display:"hide",offset:12,b2t:!1,scrollpercent:!0,onmobile:!1},fancybox:!0,tabs:!0,motion:{enable:!0,async:!1,transition:{post_block:"fadeIn",post_header:"slideDownIn",post_body:"slideDownIn",coll_header:"slideLeftIn",sidebar:"slideUpIn"}},duoshuo:{userId:"0",author:"博主"},algolia:{applicationID:"",apiKey:"",indexName:"",hits:{per_page:10},labels:{input_placeholder:"Search for Posts",hits_empty:"We didn't find any results for the search: ${query}",hits_stats:"${hits} results found in ${time} ms"}}}</script><link rel="canonical" href="https://www.playpi.org/2018041801.html"><title>When es-hadoop Meets Elasticsearch Date Fields | 虾丸派</title></head><body itemscope itemtype="http://schema.org/WebPage" lang="zh-Hans"><div class="container sidebar-position-left page-post-detail"><div class="headband"></div><header id="header" class="header" itemscope itemtype="http://schema.org/WPHeader"><div class="header-inner"><div class="site-brand-wrapper"><div class="site-meta"><div class="custom-logo-site-title"><a href="/" class="brand" rel="start"><span class="logo-line-before"><i></i></span> <span class="site-title">虾丸派</span> <span class="logo-line-after"><i></i></span></a></div><h1 class="site-subtitle" itemprop="description">烂笔头</h1></div><div class="site-nav-toggle"><button><span class="btn-bar"></span> <span class="btn-bar"></span> <span class="btn-bar"></span></button></div></div><nav class="site-nav"><ul id="menu" class="menu"><li class="menu-item menu-item-home"><a href="/" rel="section"><i class="menu-item-icon fa fa-fw fa-home"></i><br>Home</a></li><li class="menu-item menu-item-tags"><a href="/tags/" rel="section"><i class="menu-item-icon fa fa-fw fa-tags"></i><br>Tags</a></li><li class="menu-item menu-item-categories"><a href="/categories/" rel="section"><i class="menu-item-icon fa fa-fw fa-th"></i><br>Categories</a></li><li class="menu-item menu-item-archives"><a href="/archives/" 
rel="section"><i class="menu-item-icon fa fa-fw fa-archive"></i><br>Archives</a></li><li class="menu-item menu-item-about"><a href="/about/" rel="section"><i class="menu-item-icon fa fa-fw fa-user"></i><br>About</a></li><li class="menu-item menu-item-books"><a href="/books/" rel="section"><i class="menu-item-icon fa fa-fw fa-book"></i><br>Books</a></li><li class="menu-item menu-item-guide"><a href="/guide/" rel="section"><i class="menu-item-icon fa fa-fw fa-location-arrow"></i><br>Guide</a></li><li class="menu-item menu-item-search"><a href="javascript:;" class="popup-trigger"><i class="menu-item-icon fa fa-search fa-fw"></i><br>Search</a></li></ul><div class="site-search"><div class="popup search-popup local-search-popup"><div class="local-search-header clearfix"><span class="search-icon"><i class="fa fa-search"></i> </span><span class="popup-btn-close"><i class="fa fa-times-circle"></i></span><div class="local-search-input-wrapper"><input autocomplete="off" placeholder="搜索..." spellcheck="false" type="text" id="local-search-input"></div></div><div id="local-search-result"></div></div></div></nav></div></header><main id="main" class="main"><div class="main-inner"><div class="content-wrap"><div id="content" class="content"><div id="posts" class="posts-expand"><article class="post post-type-normal" itemscope itemtype="http://schema.org/Article"><div class="post-block"><link itemprop="mainEntityOfPage" href="https://www.playpi.org/2018041801.html"><span hidden itemprop="author" itemscope itemtype="http://schema.org/Person"><meta itemprop="name" content="虾丸派"><meta itemprop="description" content="记录知识 | 分享技术"><meta itemprop="image" content="/images/favicon-1536x1536-playpi.png"></span><span hidden itemprop="publisher" itemscope itemtype="http://schema.org/Organization"><meta itemprop="name" content="虾丸派"></span><header class="post-header"><h2 class="post-title" itemprop="name headline">When es-hadoop Meets Elasticsearch Date Fields</h2><div class="post-meta"><span class="post-time"><span 
class="post-meta-item-text">Posted on</span> <time title="创建于" itemprop="dateCreated datePublished" datetime="2018-04-18T20:14:18+08:00">2018-04-18 </time></span><span class="post-category"><span class="post-meta-divider">|</span> <span class="post-meta-item-text">In</span> <span itemprop="about" itemscope itemtype="http://schema.org/Thing"><a href="/categories/series-of-fixbug/" itemprop="url" rel="index"><span itemprop="name">Pitfalls Series</span> </a></span></span><span id="busuanzi_container_page_pv" style="display:none"><span class="post-meta-divider">|</span> Views <span id="busuanzi_value_page_pv"></span></span><div class="post-wordcount"><span class="post-meta-item-text">Word count</span> <span title="字数统计">1,850 words </span><span class="post-meta-divider">|</span> <span class="post-meta-item-text">Reading time ≈</span> <span title="阅读时长">8 min</span></div></div></header><div class="post-body" itemprop="articleBody"><p>I recently hit a strange failure in a project that turned out to be caused by a difference between <code>Elasticsearch</code> versions. At first I suspected a bug in our own program, but further investigation showed that a <code>Date</code> field in <code>Elasticsearch</code> was responsible. This post records how the problem was resolved. The environment is based on <code>Elasticsearch v1.7.5</code> and <code>Elasticsearch v2.4.5</code>.</p><a id="more"></a><h1 id="问题出现"><a href="# 问题出现" class="headerlink" title="问题出现"></a>The Problem</h1><p>The job uses the official <code>es-hadoop</code> library to read data from <code>Elasticsearch</code>, runs it through a series of <code>ETL</code> steps, and writes the result back to <code>Elasticsearch</code>. While processing a routine batch, it threw the following exception:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">java.lang.IllegalArgumentException: 2017/07/23</span><br><span class="line">... omitted </span><br><span class="line">org.elasticsearch.hadoop.util.DateUtils.parseDateJdk(DateUtils.java:62)</span><br><span class="line">org.elasticsearch.hadoop.serialization.builder.JdkValueReader.parseDate(JdkValueReader:351)</span><br><span class="line">... 
omitted </span><br></pre></td></tr></table></figure><p><img src="https://raw.githubusercontent.com/iplaypi/img-playpi/master/img/2018/20200215180154.png" alt="异常信息" title="异常信息"></p><p>As a result the <code>Spark</code> job failed to start and the program exited. The exception clearly points to a date conversion error: the value <code>2017/07/23</code> could not be converted. Several similar exceptions followed, which suggests that dates in <code>yyyy/MM/dd</code> format cannot be converted.</p><p>The index <code>mapping</code> in <code>Elasticsearch</code> defines a <code>publish_date</code> field of type <code>Date</code> with the custom format <code>yyyy/MM/dd HH:mm:ss||yyyy/MM/dd</code>, which matches the values that triggered the error.</p><p><img src="https://raw.githubusercontent.com/iplaypi/img-playpi/master/img/2018/20200215180503.png" alt="publish_date 字段" title="publish_date 字段"></p><p>So this <code>Date</code> field caused the exception. I went back through earlier successful job runs and found that their data also had a <code>publish_date</code> field, but its type was <code>long</code> and it stored a timestamp in seconds, so the problem never appeared.</p><p>A closer look at production showed that <code>Elasticsearch</code> had been upgraded (some services now use a new <code>Elasticsearch</code> cluster) to <code>v2.4.5</code> from the earlier <code>v1.7.5</code>. Both versions currently coexist, and the rest will probably be upgraded gradually.</p><p>With the situation understood, the next step is to fix it.</p><h1 id="分析解决"><a href="# 分析解决" class="headerlink" title="分析解决"></a>Analysis and Solution</h1><p>First, look at the conversion logic in the source code (based on <code>elasticsearch-hadoop v2.1.0</code>). It only parses dates in the international standard format, for example <code>2018-02-07T05:01:05+08:00</code> (<code>ISO date</code>), which includes a time zone offset. A formatted date string such as <code>2017/07/23</code> cannot be parsed.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span 
class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line">public static Calendar parseDateJdk(String value) {</span><br><span class="line">    // check for colon in the time offset</span><br><span class="line">    int timeZoneIndex = value.indexOf("T");</span><br><span class="line">    if (timeZoneIndex &gt; 0) {</span><br><span class="line">        int sign = value.indexOf("+", timeZoneIndex);</span><br><span class="line">        if (sign &lt; 0) {</span><br><span class="line">            sign = value.indexOf("-", timeZoneIndex);</span><br><span class="line">        }</span><br><span class="line"></span><br><span class="line">        // +4 means it's either hh:mm or hhmm</span><br><span class="line">        if (sign &gt; 0) {</span><br><span class="line">            // +3 points to either : or m</span><br><span class="line">            int colonIndex = sign + 3;</span><br><span class="line">            // +hh - need to add :mm</span><br><span class="line">            if (colonIndex &gt;= value.length()) {</span><br><span class="line">                value = value + ":00";</span><br><span class="line">            }</span><br><span class="line">            else if (value.charAt(colonIndex) != ':') {</span><br><span class="line">                value = value.substring(0, colonIndex) + ":" + value.substring(colonIndex);</span><br><span class="line">            }</span><br><span class="line">        }</span><br><span class="line">    }</span><br><span class="line"></span><br><span class="line">    return DatatypeConverter.parseDateTime(value);</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p><img src="https://raw.githubusercontent.com/iplaypi/img-playpi/master/img/2018/20200215190242.png" alt="源码查看" title="源码查看"></p><p>What can be done? The documentation describes an <code>elasticsearch-hadoop</code> setting that controls whether date values are parsed: <code>es.mapping.date.rich</code>. It defaults to <code>true</code>, meaning <code>Date</code> fields are converted automatically: as in the source above, the library tries to parse the value into a <code>Calendar</code> object.</p><p>When it meets a date value it cannot parse, however, it throws an exception. In that case the option can be turned off by setting it to <code>false</code>, so the value is read as a raw string without conversion. Validation of the field is then left to our own <code>ETL</code> code, which can discard and log invalid values without affecting the rest of the program.</p><p>The figure below shows the parsing flow in the source code, which is controlled by the <code>richDate</code> flag.</p><p><img src="https://raw.githubusercontent.com/iplaypi/img-playpi/master/img/2018/20200215191343.png" alt="源码查看" title="源码查看"></p><p>Note that this setting must be supported by the <code>elasticsearch-hadoop</code> version in use: <code>v1.x</code> does not support it, so <code>v2.x</code> or later is required.</p><p>Alternatively, if you would rather not disable the conversion, use <code>es.read.field.include</code> to list only the fields you need (leaving out <code>publish_date</code>). The <code>publish_date</code> field is then never read, so no conversion takes place. In that case, make sure the processed data is not written back to the original index, because the documents would be overwritten and <code>publish_date</code> would be lost. If you must write back to the original index, use the <code>update</code> write operation rather than <code>index</code>.</p><h1 id="扩展"><a href="# 扩展" class="headerlink" title="扩展"></a>Going Further</h1><p>What happens when <code>Elasticsearch</code> stores the date as a millisecond timestamp? How does <code>elasticsearch-hadoop</code> handle it on read? Let's verify.</p><p>First, write some test data to a test index, with one field, <code>publish_timestamp</code>, holding a millisecond timestamp. One document picked from <code>Elasticsearch</code> looks like this:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td 
class="code"><pre><span class="line">{</span><br><span class="line">  "took": 6,</span><br><span class="line">  "timed_out": false,</span><br><span class="line">  "_shards": {</span><br><span class="line">    "total": 80,</span><br><span class="line">    "successful": 80,</span><br><span class="line">    "skipped": 0,</span><br><span class="line">    "failed": 0</span><br><span class="line">  },</span><br><span class="line">  "hits": {</span><br><span class="line">    "total": 1,</span><br><span class="line">    "max_score": 11.363798,</span><br><span class="line">    "hits": [</span><br><span class="line">      {</span><br><span class="line">        "_index": "ds-banyan-newsforum-post-year-2019-v3",</span><br><span class="line">        "_type": "post",</span><br><span class="line">        "_id": "ae75c92981148654195408f9f5260930",</span><br><span class="line">        "_score": 11.363798,</span><br><span class="line">        "_source": {</span><br><span class="line">          "id": "ae75c92981148654195408f9f5260930",</span><br><span class="line">          "url": "https://fxhh.jd.com/detail.html?id=226608850",</span><br><span class="line">          "publish_timestamp": 1575129850000</span><br><span class="line">        }</span><br><span class="line">      }</span><br><span class="line">    ]</span><br><span class="line">  }</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p><img src="https://raw.githubusercontent.com/iplaypi/img-playpi/master/img/2018/20200215184320.png" alt="查看测试数据" title="查看测试数据"></p><p>Configure <code>pom.xml</code> to pull in the <code>v2.4.5</code> <code>elasticsearch-hadoop</code> dependency.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span 
class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line">&lt;!-- Version 2.4.5 obtains nodes in a way that is compatible with 2.1.0, but Chinese fields are lost when reading ES data --&gt;</span><br><span class="line">&lt;elasticsearch-hadoop.version&gt;2.4.5&lt;/elasticsearch-hadoop.version&gt;</span><br><span class="line"></span><br><span class="line">&lt;!-- es-spark: a newer es-hadoop version must be specified --&gt;</span><br><span class="line">&lt;!-- Both of the following dependencies are required --&gt;</span><br><span class="line">&lt;dependency&gt;</span><br><span class="line">    &lt;groupId&gt;org.elasticsearch&lt;/groupId&gt;</span><br><span class="line">    &lt;artifactId&gt;elasticsearch-hadoop&lt;/artifactId&gt;</span><br><span class="line">    &lt;version&gt;${elasticsearch-hadoop.version}&lt;/version&gt;</span><br><span class="line">    &lt;!-- Must be excluded: it conflicts with the one in spark-core_2.10 --&gt;</span><br><span class="line">    &lt;exclusions&gt;</span><br><span class="line">        &lt;exclusion&gt;</span><br><span class="line">            &lt;groupId&gt;com.esotericsoftware&lt;/groupId&gt;</span><br><span class="line">            &lt;artifactId&gt;kryo&lt;/artifactId&gt;</span><br><span class="line">        &lt;/exclusion&gt;</span><br><span class="line">    &lt;/exclusions&gt;</span><br><span class="line">&lt;/dependency&gt;</span><br><span class="line">&lt;dependency&gt;</span><br><span class="line">    &lt;groupId&gt;javax.servlet&lt;/groupId&gt;</span><br><span class="line">    &lt;artifactId&gt;javax.servlet-api&lt;/artifactId&gt;</span><br><span class="line">    &lt;version&gt;4.0.1&lt;/version&gt;</span><br><span class="line">&lt;/dependency&gt;</span><br></pre></td></tr></table></figure><p>The test program is a simple flow that reads the data and runs the <code>ETL</code> step.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">JavaPairRDD&lt;String, Map&lt;String, Object&gt;&gt; esRDD = JavaEsSpark.esRDD(jsc, 
sparkConf);</span><br></pre></td></tr></table></figure><p>The <code>ETL</code> step reads the <code>publish_timestamp</code> field, so we can inspect its value in a local <code>debug</code> session.</p><p>By default <code>es.mapping.date.rich</code> is enabled (its value is <code>true</code>, so date fields are converted automatically). In a local <code>debug</code> session, the <code>publish_timestamp</code> value has already been converted to a <code>Java</code> <code>Date</code> (value <code>Sun Dec 01 00:04:10 CST 2019</code>).</p><p><img src="https://raw.githubusercontent.com/iplaypi/img-playpi/master/img/2018/20200215191943.png" alt="转为 Date 类型" title="转为 Date 类型"></p><p>Next, disable <code>es.mapping.date.rich</code> and <code>debug</code> again: the <code>publish_timestamp</code> value is still the millisecond timestamp (<code>1575129850000</code>).</p><p><img src="https://raw.githubusercontent.com/iplaypi/img-playpi/master/img/2018/20200215192158.png" alt="仍旧是时间戳格式" title="仍旧是时间戳格式"></p><p>Converting this millisecond timestamp to a formatted date gives <code>Sun Dec 1 00:04:10 CST 2019</code>, which matches the <code>debug</code> result above.</p><p><img src="https://raw.githubusercontent.com/iplaypi/img-playpi/master/img/2018/20200215192346.png" alt="格式化时间戳" title="格式化时间戳"></p><h1 id="备注"><a href="# 备注" class="headerlink" title="备注"></a>Notes</h1><p>Choose the <code>elasticsearch-hadoop</code> version carefully. It needs to match the version of your <code>Elasticsearch</code> environment, and there are some pitfalls to watch for.</p><p>For example, with <code>Elasticsearch</code> <code>v2.4.5</code> and <code>elasticsearch-hadoop</code> <code>v2.1.0</code>, <code>Date</code> fields are not fully supported, and the program fails because it cannot parse the <code>Date</code> field values. Setting <code>es.mapping.date.rich</code> to <code>false</code> turns off the conversion logic.</p><p>It is still best to upgrade <code>elasticsearch-hadoop</code> to the same version as <code>Elasticsearch</code>, for example <code>v2.4.5</code>.</p><p>However, <code>elasticsearch-hadoop</code> <code>v2.4.5</code> has a pitfall of its own (a serious <code>bug</code>): when reading data it filters out Chinese fields, so they are missing from what you read, which affects the <code>ETL</code> logic in between. Worse, if the processed data is then written back to the original <code>Elasticsearch</code> index with the <code>index</code> operation, the documents are overwritten and all Chinese fields are lost. The <code>update</code> operation does not overwrite the data.</p><p>The missing Chinese fields problem only affects certain versions. For a write-up of that issue, see my other post: <a 
href="https://www.playpi.org/2017102301.html">Chinese fields lost when reading with es-hadoop</a>.</p></div><div><div id="wechat_subscriber" style="display:block;padding:10px 0;margin:20px auto;width:100%;text-align:center"><img id="wechat_subscriber_qcode" src="/images/wechat-qr-personal.jpg" alt="虾丸派 wechat" style="width:200px;max-width:100%"><div>Scan the code to add the author and join the tech discussion group to learn together</div></div></div><div><div style="padding:10px 0;margin:20px auto;width:90%;text-align:center"><div>Never stop</div><button id="rewardButton" disable="enable" onclick='var qr=document.getElementById("QR");"none"===qr.style.display?qr.style.display="block":qr.style.display="none"'><span>Tip</span></button><div id="QR" style="display:none"><div id="wechat" style="display:inline-block"><img id="wechat_qr" src="/images/wechat-pay-playpi.png" alt="虾丸派 微信支付"><p>WeChat Pay</p></div></div></div></div><div><ul class="post-copyright"><li class="post-copyright-author"><strong>Author: </strong> 虾丸派</li><li class="post-copyright-link"><strong>Link: </strong> <a href="https://www.playpi.org/2018041801.html" title="es-hadoop 遇上 Elasticsearch 的 Date 类型字段">https://www.playpi.org/2018041801.html</a></li><li class="post-copyright-license"><strong>License: </strong>Unless otherwise stated, all posts on this blog are licensed under <a href="https://creativecommons.org/licenses/by-nc-sa/3.0/" rel="external nofollow" target="_blank">CC BY-NC-SA 3.0</a>. Please credit the source when reposting.</li></ul></div><footer class="post-footer"><div class="post-tags"><a href="/tags/Elasticsearch/" rel="tag"><i class="fa fa-tag"></i> Elasticsearch</a> <a href="/tags/Hadoop/" rel="tag"><i class="fa fa-tag"></i> Hadoop</a> <a href="/tags/Date/" rel="tag"><i class="fa fa-tag"></i> Date</a></div><div class="post-nav"><div class="post-nav-next post-nav-item"><a href="/2018041301.html" rel="next" title="Elasticsearch 错误:None of the configured nodes are available"><i class="fa fa-chevron-left"></i> Elasticsearch Error: None of the configured nodes are available</a></div><span class="post-nav-divider"></span><div class="post-nav-prev post-nav-item"><a 
href="/2018051401.html" rel="prev" title="Elasticsearch 常用 HTTP 接口">Common Elasticsearch HTTP APIs <i class="fa fa-chevron-right"></i></a></div></div></footer></div></article><div class="post-spread"></div></div></div><div class="comments" id="comments"><div id="vcomments"></div></div></div><div class="sidebar-toggle"><div class="sidebar-toggle-line-wrap"><span class="sidebar-toggle-line sidebar-toggle-line-first"></span> <span class="sidebar-toggle-line sidebar-toggle-line-middle"></span> <span class="sidebar-toggle-line sidebar-toggle-line-last"></span></div></div><aside id="sidebar" class="sidebar"><div class="sidebar-inner"><ul class="sidebar-nav motion-element"><li class="sidebar-nav-toc sidebar-nav-active" data-target="post-toc-wrap">Table of Contents</li><li class="sidebar-nav-overview" data-target="site-overview-wrap">Site Overview</li></ul><section class="site-overview-wrap sidebar-panel"><div class="site-overview"><div class="site-author motion-element" itemprop="author" itemscope itemtype="http://schema.org/Person"><img class="site-author-image" itemprop="image" src="/images/favicon-1536x1536-playpi.png" alt="虾丸派"><p class="site-author-name" itemprop="name">虾丸派</p><p class="site-description motion-element" itemprop="description">Recording knowledge | Sharing technology</p></div><nav class="site-state motion-element"><div class="site-state-item site-state-posts"><a href="/archives/"><span class="site-state-item-count">144</span> <span class="site-state-item-name">Posts</span></a></div><div class="site-state-item site-state-categories"><a href="/categories/index.html"><span class="site-state-item-count">13</span> <span class="site-state-item-name">Categories</span></a></div><div class="site-state-item site-state-tags"><a href="/tags/index.html"><span class="site-state-item-count">294</span> <span class="site-state-item-name">Tags</span></a></div></nav><div class="feed-link motion-element"><a href="/atom.xml" rel="alternate"><i class="fa fa-rss"></i> RSS</a></div><div class="links-of-author motion-element"><span 
class="links-of-author-item"><a href="https://github.com/iplaypi" target="_blank" title="GitHub"><i class="fa fa-fw fa-github"></i>GitHub</a> </span><span class="links-of-author-item"><a href="https://weibo.com/u/3086148515" target="_blank" title="微博"><i class="fa fa-fw fa-weibo"></i>Weibo</a> </span><span class="links-of-author-item"><a href="mailto:playpi@qq.com" target="_blank" title="E-Mail"><i class="fa fa-fw fa-envelope"></i>E-Mail</a></span></div><div class="cc-license motion-element" itemprop="license"><a href="https://creativecommons.org/licenses/by-nc-sa/4.0/" class="cc-opacity" target="_blank" rel="external nofollow"><img src="/images/cc-by-nc-sa.svg" alt="Creative Commons"></a></div><div class="links-of-blogroll motion-element links-of-blogroll-inline"><div class="links-of-blogroll-title"><i class="fa fa-fw fa-link"></i> Links</div><ul class="links-of-blogroll-list"><li class="links-of-blogroll-item"><a href="https://github.com/iplaypi" title="GitHub" target="_blank" rel="external nofollow">GitHub</a></li><li class="links-of-blogroll-item"><a href="https://weibo.com/u/3086148515" title="Weibo" target="_blank" rel="external nofollow">Weibo</a></li><li class="links-of-blogroll-item"><a href="https://www.playpi.org" title="虾丸派" target="_blank" rel="external nofollow">虾丸派</a></li><li class="links-of-blogroll-item"><a href="https://www.playpi.org" title="playpi" target="_blank" rel="external nofollow">playpi</a></li><li class="links-of-blogroll-item"><a href="https://www.liaoxuefeng.com" title="廖雪峰" target="_blank" rel="external nofollow">廖雪峰</a></li><li class="links-of-blogroll-item"><a href="http://www.ruanyifeng.com" title="阮一峰" target="_blank" rel="external nofollow">阮一峰</a></li><li class="links-of-blogroll-item"><a href="https://travis-ci.org/iplaypi/iplaypi.github.io" title="travis-ci" target="_blank" rel="external nofollow">travis-ci</a></li><li class="links-of-blogroll-item"><a href="https://www.vultr.com/?ref=7861302-4F" title="Vultr" target="_blank" 
rel="external nofollow">Vultr</a></li></ul></div></div></section><section class="post-toc-wrap motion-element sidebar-panel sidebar-panel-active"><div class="post-toc"><div class="post-toc-content"><ol class="nav"><li class="nav-item nav-level-1"><a class="nav-link" href="#问题出现"><span class="nav-number">1.</span> <span class="nav-text">The Problem</span></a></li><li class="nav-item nav-level-1"><a class="nav-link" href="#分析解决"><span class="nav-number">2.</span> <span class="nav-text">Analysis and Solution</span></a></li><li class="nav-item nav-level-1"><a class="nav-link" href="#扩展"><span class="nav-number">3.</span> <span class="nav-text">Going Further</span></a></li><li class="nav-item nav-level-1"><a class="nav-link" href="#备注"><span class="nav-number">4.</span> <span class="nav-text">Notes</span></a></li></ol></div></div></section></div></aside></div></main><footer id="footer" class="footer"><div class="footer-inner"><div class="copyright">© 2016–<span itemprop="copyrightYear">2021</span> <span class="post-meta-divider">|</span> <span class="with-love"><i class="fa fa-heart"></i> </span><span class="author" itemprop="copyrightHolder">虾丸派</span> <span class="post-meta-divider">|</span> <span class="post-meta-item-icon"><i class="fa fa-area-chart"></i> </span><span class="post-meta-item-text">Site word count</span> <span title="全站字数统计">326.3k words</span></div><div class="powered-by">Powered by <a class="theme-link" target="_blank" href="https://hexo.io" rel="external nofollow">Hexo</a></div><span class="post-meta-divider">|</span><div class="theme-info">Theme <a class="theme-link" target="_blank" href="https://github.com/iissnan/hexo-theme-next" rel="external nofollow">NexT.Mist</a><script async src="//busuanzi.ibruce.info/busuanzi/2.3/busuanzi.pure.mini.js"></script><span id="busuanzi_container_site_pv" style="display:none"><span class="post-meta-divider">|</span> Total page views: <span id="busuanzi_value_site_pv"></span> </span><span id="busuanzi_container_site_uv" style="display:none"><span class="post-meta-divider">|</span> 
Total visitors: <span id="busuanzi_value_site_uv"></span></span></div><div class="busuanzi-count"><script async src="https://dn-lbstatics.qbox.me/busuanzi/2.3/busuanzi.pure.mini.js"></script></div></div></footer><div class="back-to-top"><i class="fa fa-arrow-up"></i> <span id="scrollpercent"><span>0</span>%</span></div></div><script type="text/javascript">"[object Function]"!==Object.prototype.toString.call(window.Promise)&&(window.Promise=null)</script><script type="text/javascript" src="/lib/jquery/index.js?v=2.1.3"></script><script type="text/javascript" src="/lib/fastclick/lib/fastclick.min.js?v=1.0.6"></script><script type="text/javascript" src="/lib/jquery_lazyload/jquery.lazyload.js?v=1.9.7"></script><script type="text/javascript" src="/lib/velocity/velocity.min.js?v=1.2.1"></script><script type="text/javascript" src="/lib/velocity/velocity.ui.min.js?v=1.2.1"></script><script type="text/javascript" src="/lib/fancybox/source/jquery.fancybox.pack.js?v=2.1.5"></script><script type="text/javascript" src="/js/src/utils.js?v=5.1.3"></script><script type="text/javascript" src="/js/src/motion.js?v=5.1.3"></script><script type="text/javascript" src="/js/src/scrollspy.js?v=5.1.3"></script><script type="text/javascript" src="/js/src/post-details.js?v=5.1.3"></script><script type="text/javascript" src="/js/src/bootstrap.js?v=5.1.3"></script><script src="//unpkg.com/valine@1.3.7/dist/Valine.min.js"></script><script type="text/javascript">new Valine({av:AV,el:"#comments",verify:!1,notify:!1,app_id:"FC5Jijeg1meo2K2OzPYWK327-gzGzoHsz",app_key:"6A1ReY8tjhPutK00F01YbJSq",placeholder:"没有问题吗?"})</script><script type="text/javascript">var isfetched=!1,isXml=!0,search_path="search.xml";0===search_path.length?search_path="search.xml":/json$/i.test(search_path)&&(isXml=!1);var 
path="/"+search_path,onPopupClose=function(t){$(".popup").hide(),$("#local-search-input").val(""),$(".search-result-list").remove(),$("#no-result").remove(),$(".local-search-pop-overlay").remove(),$("body").css("overflow","")};function proceedsearch(){$("body").append('<div class="search-popup-overlay local-search-pop-overlay"></div>').css("overflow","hidden"),$(".search-popup-overlay").click(onPopupClose),$(".popup").toggle();var t=$("#local-search-input");t.attr("autocapitalize","none"),t.attr("autocorrect","off"),t.focus()}var searchFunc=function(t,e,s){"use strict";$("body").append('<div class="search-popup-overlay local-search-pop-overlay"><div id="search-loading-icon"><i class="fa fa-spinner fa-pulse fa-5x fa-fw"></i></div></div>').css("overflow","hidden"),$("#search-loading-icon").css("margin","20% auto 0 auto").css("text-align","center"),$.ajax({url:t,dataType:isXml?"xml":"json",async:!0,success:function(t){isfetched=!0,$(".popup").detach().appendTo(".header-inner");var o=isXml?$("entry",t).map(function(){return{title:$("title",this).text(),content:$("content",this).text(),url:$("url",this).text()}}).get():t,n=document.getElementById(e),r=document.getElementById(s),t=function(){var m=n.value.trim().toLowerCase(),x=m.split(/[\s\-]+/);1<x.length&&x.push(m);var e,w=[];0<m.length&&o.forEach(function(t){var e=!1,o=0,h=0,n=t.title.trim(),r=n.toLowerCase(),s=t.content.trim().replace(/<[^>]+>/g,""),a=s.toLowerCase(),i=decodeURIComponent(t.url),c=[],l=[];if(""!=n&&(x.forEach(function(t){function e(t,e,o){var n=t.length;if(0===n)return[];var r,s=0,a=[];for(o||(e=e.toLowerCase(),t=t.toLowerCase());-1<(r=e.indexOf(t,s));)a.push({position:r,word:t}),s=r+n;return a}c=c.concat(e(t,r,!1)),l=l.concat(e(t,a,!1))}),(0<c.length||0<l.length)&&(e=!0,o=c.length+l.length)),e){function p(t,e,o,n){for(var r=n[n.length-1],s=r.position,a=r.word,i=[],c=0;s+a.length<=o&&0!=n.length;){a===m&&c++,i.push({position:s,length:a.length});var 
l=s+a.length;for(n.pop();0!=n.length&&(s=(r=n[n.length-1]).position,a=r.word,s<l);)n.pop()}return h+=c,{hits:i,start:e,end:o,searchTextCount:c}}[c,l].forEach(function(t){t.sort(function(t,e){return e.position!==t.position?e.position-t.position:t.word.length-e.word.length})});t=[];0!=c.length&&t.push(p(0,0,n.length,c));for(var u=[];0!=l.length;){var f=l[l.length-1],d=f.position,g=f.word,v=d-20,f=d+80;v<0&&(v=0),(f=f<d+g.length?d+g.length:f)>s.length&&(f=s.length),u.push(p(0,v,f,l))}u.sort(function(t,e){return t.searchTextCount!==e.searchTextCount?e.searchTextCount-t.searchTextCount:t.hits.length!==e.hits.length?e.hits.length-t.hits.length:t.start-e.start});e=parseInt("1");function $(o,t){var n="",r=t.start;return t.hits.forEach(function(t){n+=o.substring(r,t.position);var e=t.position+t.length;n+='<b class="search-keyword">'+o.substring(t.position,e)+"</b>",r=e}),n+=o.substring(r,t.end)}0<=e&&(u=u.slice(0,e));var C="";0!=t.length?C+="<li><a href='"+i+"' class='search-result-title'>"+$(n,t[0])+"</a>":C+="<li><a href='"+i+"' class='search-result-title'>"+n+"</a>",u.forEach(function(t){C+="<a href='"+i+'\'><p class="search-result">'+$(s,t)+"...</p></a>"}),C+="</li>",w.push({item:C,searchTextCount:h,hitCount:o,id:w.length})}}),1===x.length&&""===x[0]?r.innerHTML='<div id="no-result"><i class="fa fa-search fa-5x" /></div>':0===w.length?r.innerHTML='<div id="no-result"><i class="fa fa-frown-o fa-5x" /></div>':(w.sort(function(t,e){return t.searchTextCount!==e.searchTextCount?e.searchTextCount-t.searchTextCount:t.hitCount!==e.hitCount?e.hitCount-t.hitCount:e.id-t.id}),e='<ul 
class="search-result-list">',w.forEach(function(t){e+=t.item}),e+="</ul>",r.innerHTML=e)};n.addEventListener("input",t),$(".local-search-pop-overlay").remove(),$("body").css("overflow",""),proceedsearch()}})};$(".popup-trigger").click(function(t){t.stopPropagation(),!1===isfetched?searchFunc(path,"local-search-input","local-search-result"):proceedsearch()}),$(".popup-btn-close").click(onPopupClose),$(".popup").click(function(t){t.stopPropagation()}),$(document).on("keyup",function(t){27===t.which&&$(".search-popup").is(":visible")&&onPopupClose()})</script><script>!function(){var t=document.createElement("script"),e=window.location.protocol.split(":")[0];t.src="https"===e?"https://zz.bdstatic.com/linksubmit/push.js":"http://push.zhanzhang.baidu.com/push.js";e=document.getElementsByTagName("script")[0];e.parentNode.insertBefore(t,e)}()</script><script type="text/javascript" src="/js/src/js.cookie.js?v=5.1.3"></script><script type="text/javascript" src="/js/src/scroll-cookie.js?v=5.1.3"></script><script src="/live2dw/lib/L2Dwidget.min.js?094cbace49a39548bed64abff5988b05"></script><script>L2Dwidget.init({pluginRootPath:"live2dw/",pluginJsPath:"lib/",pluginModelPath:"assets/",tagMode:!1,debug:!1,model:{scale:1,jsonPath:"/live2dw/assets/hijiki.model.json"},display:{position:"left",width:100,height:200,hOffset:0,vOffset:-20},mobile:{show:!1,motion:!0,scale:.3},log:!1})</script></body></html>