Modifying the Xapian source: group by
June 15, 2016, 16:27
(This post was last modified: June 15, 2016, 16:35 by rootzkb.)
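Because of business requirements at my company, I modified the Xapian source to implement aggregate functions. In Xapian, aggregation is implemented by MatchSpy: in multimatch.cc the matcher calls into the MatchSpy through its function-call operator (it is a functor) to do the aggregation. Xapian itself only implements counting how many documents fall into each group after grouping. That is not enough for many needs; for example, I want the maximum value of each group after grouping.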
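To make the goal concrete, here is a minimal standalone sketch of "group, then take max/min/sum/count per group" in plain C++. It is not the Xapian code attached to this post: `GroupAgg` and `group_aggregate` are hypothetical names used only for illustration, and the values are assumed to be numeric already.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-group aggregate record (illustration only).
struct GroupAgg {
    double max_val;
    double min_val;
    double sum;
    int count;
};

// Group (key, value) rows by key and aggregate the values of each group.
std::map<std::string, GroupAgg>
group_aggregate(const std::vector<std::pair<std::string, double> >& rows) {
    std::map<std::string, GroupAgg> groups;
    for (std::size_t i = 0; i < rows.size(); ++i) {
        const std::string& key = rows[i].first;
        const double v = rows[i].second;
        std::map<std::string, GroupAgg>::iterator it = groups.find(key);
        if (it == groups.end()) {
            // First row of a new group seeds every aggregate.
            GroupAgg first = { v, v, v, 1 };
            groups.insert(std::make_pair(key, first));
        } else {
            GroupAgg& g = it->second;
            if (v > g.max_val) g.max_val = v;
            if (v < g.min_val) g.min_val = v;
            g.sum += v;
            ++g.count;
        }
    }
    return groups;
}
```

In the real code the key comes from one or more value slots of the document and the value from another slot, so each row costs at least two slot reads in addition to the map lookup.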
So I implemented aggregate functions such as maximum, minimum and sum on top of grouping. The class I designed inherits from MatchSpy and overrides operator(). Inside it I rely on std::map: the key is the field (or fields) to group by, and the value is a struct holding the maximum, the minimum and so on. That gives me the grouping, and the map's find method cuts down the time spent deciding whether a document belongs to an existing group. Following Xapian's approach, for every document I fetch the value from the document by slot and compare it to get the maximum and minimum; the loop over all the data is the main loop in multimatch's get_mset.

So much for the preamble. My problem is this: the functionality is correct, but the performance is not good enough. A plain MatchAll query over 24 million documents takes about 10 s; my group by with max on top of MatchAll takes 22 s. I would appreciate advice on how to optimise this. All I can think of is the cost of the map's find and, most importantly, the cost of fetching values from the document, because fetching a value from a document means building a key and looking it up in the BrassTableTree.

[hr]
Appendix: part of the core modified source

//matchspy.h : add zkb
/** @file matchspy.h
 * @brief MatchSpy implementation.
 */
/* Copyright (C) 2007,2008,2009,2010,2012 Olly Betts
 * Copyright (C) 2007,2009 Lemur Consulting Ltd
 * Copyright (C) 2010 Richard Boulton
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
 */

#ifndef XAPIAN_INCLUDED_MATCHSPY_H
#define XAPIAN_INCLUDED_MATCHSPY_H

#include <xapian/base.h>
#include <xapian/enquire.h>
#include <xapian/termiterator.h>
#include <xapian/visibility.h>

#include <string>
#include <map>
#include <set>
#include <string>
#include <vector>

namespace Xapian {

class Document;
class Registry;

/** Abstract base class for match spies.
 *
 * The subclasses will generally accumulate information seen during the match,
 * to calculate aggregate functions, or other profiles of the matching
 * documents.
 */
class XAPIAN_VISIBILITY_DEFAULT MatchSpy {
  private:
    /// Don't allow assignment.
    void operator=(const MatchSpy &);

    /// Don't allow copying.
    MatchSpy(const MatchSpy &);

  protected:
    /// Default constructor, needed by subclass constructors.
    MatchSpy() {}

  public:
    /** Virtual destructor, because we have virtual methods. */
    virtual ~MatchSpy();

    /** Register a document with the match spy.
     *
     * This is called by the matcher once with each document seen by the
     * matcher during the match process.  Note that the matcher will often not
     * see all the documents which match the query, due to optimisations which
     * allow low-weighted documents to be skipped, and allow the match process
     * to be terminated early.
     *
     * @param doc The document seen by the match spy.
     * @param wt The weight of the document.
     */
    virtual void operator()(const Xapian::Document &doc,
                            Xapian::weight wt) = 0;

    /** Clone the match spy.
     *
     * The clone should inherit the configuration of the parent, but need not
     * inherit the state.  ie, the clone does not need to be passed
     * information about the results seen by the parent.
     *
     * If you don't want to support the remote backend in your match spy, you
     * can use the default implementation which simply throws
     * Xapian::UnimplementedError.
     *
     * Note that the returned object will be deallocated by Xapian after use
     * with "delete".  If you want to handle the deletion in a special way
     * (for example when wrapping the Xapian API for use from another
     * language) then you can define a static <code>operator delete</code>
     * method in your subclass as shown here:
     * [url]http://trac.xapian.org/ticket/554#comment:1[/url]
     */
    virtual MatchSpy * clone() const;

    /** Return the name of this match spy.
     *
     * This name is used by the remote backend.  It is passed with the
     * serialised parameters to the remote server so that it knows which class
     * to create.
     *
     * Return the full namespace-qualified name of your class here - if your
     * class is called MyApp::FooMatchSpy, return "MyApp::FooMatchSpy" from
     * this method.
     *
     * If you don't want to support the remote backend in your match spy, you
     * can use the default implementation which simply throws
     * Xapian::UnimplementedError.
     */
    virtual std::string name() const;

    /** Return this object's parameters serialised as a single string.
     *
     * If you don't want to support the remote backend in your match spy, you
     * can use the default implementation which simply throws
     * Xapian::UnimplementedError.
     */
    virtual std::string serialise() const;

    /** Unserialise parameters.
     *
     * This method unserialises parameters serialised by the @a serialise()
     * method and allocates and returns a new object initialised with them.
     *
     * If you don't want to support the remote backend in your match spy, you
     * can use the default implementation which simply throws
     * Xapian::UnimplementedError.
     *
     * Note that the returned object will be deallocated by Xapian after use
     * with "delete".  If you want to handle the deletion in a special way
     * (for example when wrapping the Xapian API for use from another
     * language) then you can define a static <code>operator delete</code>
     * method in your subclass as shown here:
     * [url]http://trac.xapian.org/ticket/554#comment:1[/url]
     *
     * @param s A string containing the serialised results.
     * @param context Registry object to use for unserialisation to permit
     *                MatchSpy subclasses with sub-MatchSpy objects to be
     *                implemented.
     */
    virtual MatchSpy * unserialise(const std::string & s,
                                   const Registry & context) const;

    /** Serialise the results of this match spy.
     *
     * If you don't want to support the remote backend in your match spy, you
     * can use the default implementation which simply throws
     * Xapian::UnimplementedError.
     */
    virtual std::string serialise_results() const;

    /** Unserialise some results, and merge them into this matchspy.
     *
     * The order in which results are merged should not be significant, since
     * this order is not specified (and will vary depending on the speed of
     * the search in each sub-database).
     *
     * If you don't want to support the remote backend in your match spy, you
     * can use the default implementation which simply throws
     * Xapian::UnimplementedError.
     *
     * @param s A string containing the serialised results.
     */
    virtual void merge_results(const std::string & s);

    /** Return a string describing this object.
     *
     * This default implementation returns a generic answer, to avoid forcing
     * those deriving their own MatchSpy subclasses from having to implement
     * this (they may not care what get_description() gives for their
     * subclass).
     */
    virtual std::string get_description() const;
};

/** Class for counting the frequencies of values in the matching documents. */
class XAPIAN_VISIBILITY_DEFAULT ValueCountMatchSpy : public MatchSpy {
  public:
    struct Internal;

#ifndef SWIG // SWIG doesn't need to know about the internal class
    struct XAPIAN_VISIBILITY_DEFAULT Internal
        : public Xapian::Internal::RefCntBase {
        /// The slot to count.
        Xapian::valueno slot;

        /// Total number of documents seen by the match spy.
        Xapian::doccount total;

        /// The values seen so far, together with their frequency.
        std::map<std::string, Xapian::doccount> values;

        Internal() : slot(Xapian::BAD_VALUENO), total(0) {}
        explicit Internal(Xapian::valueno slot_) : slot(slot_), total(0) {}
    };
#endif

  protected:
    Xapian::Internal::RefCntPtr<Internal> internal;

  public:
    /// Construct an empty ValueCountMatchSpy.
    ValueCountMatchSpy() : internal() {}

    /// Construct a MatchSpy which counts the values in a particular slot.
    explicit ValueCountMatchSpy(Xapian::valueno slot_)
        : internal(new Internal(slot_)) {}

    /** Return the total number of documents tallied. */
    size_t get_total() const {
        return internal.get() ? internal->total : 0;
    }

    /** Get an iterator over the values seen in the slot.
     *
     * Items will be returned in ascending alphabetical order.
     *
     * During the iteration, the frequency of the current value can be
     * obtained with the get_termfreq() method on the iterator.
     */
    TermIterator values_begin() const;

    /** End iterator corresponding to values_begin() */
    TermIterator values_end() const {
        return TermIterator();
    }

    /** Get an iterator over the most frequent values seen in the slot.
     *
     * Items will be returned in descending order of frequency.  Values with
     * the same frequency will be returned in ascending alphabetical order.
     *
     * During the iteration, the frequency of the current value can be
     * obtained with the get_termfreq() method on the iterator.
     *
     * @param maxvalues The maximum number of values to return.
     */
    TermIterator top_values_begin(size_t maxvalues) const;

    /** End iterator corresponding to top_values_begin() */
    TermIterator top_values_end(size_t) const {
        return TermIterator();
    }

    /** Implementation of virtual operator().
     *
     * This implementation tallies values for a matching document.
     *
     * @param doc The document to tally values for.
     * @param wt The weight of the document (ignored by this class).
     */
    void operator()(const Xapian::Document &doc, Xapian::weight wt);

    virtual MatchSpy * clone() const;
    virtual std::string name() const;
    virtual std::string serialise() const;
    virtual MatchSpy * unserialise(const std::string & s,
                                   const Registry & context) const;
    virtual std::string serialise_results() const;
    virtual void merge_results(const std::string & s);
    virtual std::string get_description() const;
};

// add zkb: there is no header comparable to ommatchspy, so the data class
// SinagleGroupItem is declared and implemented entirely inside the .cc file,
// and all grouping/aggregation functionality is concentrated in this class.
// That data class is for internal use only and is not exposed.
class SinagleGroupIterator;

// Class that computes the aggregate functions.
class XAPIAN_VISIBILITY_DEFAULT GroupMatchSpy : public MatchSpy {
    friend class SinagleGroupIterator;

  public:
    class Internal;

    // add zkb: aggregate operations (max, min, sum, count) on a single field
    typedef enum {
        MAX_VAL = 0x20,
        MIN_VAL = 0x40,
        SUM_VAL = 0x80,
        COUNT_VAL = 0x100
    } value_op;

    GroupMatchSpy(const GroupMatchSpy &other);

    /*
     * Grouping class
     * @slot        slot of the value the aggregate function is computed on
     * @group_slot  slot of the field to group by (at least one field; for
     *              multi-field grouping call add_group_slot to add more)
     * @op          type of the aggregate function
     * @limit       limit on the number of groups
     */
    GroupMatchSpy(Xapian::valueno slot, Xapian::valueno group_slot,
                  const int op, const
                  int limit);

    ~GroupMatchSpy();

    // Interface for multi-field grouping.
    void add_group_slot(Xapian::valueno group_slot);

    // Iterator-style access interface exposed to callers; declared const so
    // that these methods do not modify the object's own members.
    SinagleGroupIterator begin() const;
    SinagleGroupIterator end() const;

    size_t size();
    bool empty();
    bool has_sum_val();
    bool has_min_val();
    bool has_max_val();
    bool has_count_val();

    /** Implementation of virtual operator().
     *
     * This implementation tallies values for a matching document.
     *
     * @param doc The document to tally values for.
     * @param wt The weight of the document (ignored by this class).
     */
    void operator()(const Xapian::Document &doc, Xapian::weight wt);

    virtual std::string get_description() const;

  private:
    /// @private @internal Reference counted internals.
    Xapian::Internal::RefCntPtr<Internal> internal;
};

// Class for the aggregate result set.
class XAPIAN_VISIBILITY_DEFAULT SinagleGroupIterator {
  public:
    SinagleGroupIterator(int index, const GroupMatchSpy& spy)
        : m_index_i(index)
        , m_spy(spy)
    {
    }

    int m_index_i;
    GroupMatchSpy m_spy;

    // Iterator operators. They work through the map iterator held inside the
    // MatchSpy: with hundreds of thousands of groups a find() per access
    // would be slow, whereas stepping the iterator forward one element at a
    // time is cheap. The binary comparison operators are overloaded as
    // non-member (friend) functions, because as members their first operand
    // would be the implicit this pointer.
    friend bool operator!=(const SinagleGroupIterator &a,
                           const SinagleGroupIterator &b);
    friend bool operator==(const SinagleGroupIterator &a,
                           const SinagleGroupIterator &b);
    SinagleGroupIterator & operator++();

    void get_group_name(std::vector<std::string>& group_name);
    const std::string& get_max_val();
    const std::string& get_min_val();
    double get_sum_val();
    int get_count_val();
};

// Comparison operators of the iterator class.
inline bool operator!=(const SinagleGroupIterator &a,
                       const SinagleGroupIterator &b)
{
    return (a.m_index_i != b.m_index_i);
}

inline bool operator==(const SinagleGroupIterator &a,
                       const SinagleGroupIterator &b)
{
    return (a.m_index_i == b.m_index_i);
}

}

#endif // XAPIAN_INCLUDED_MATCHSPY_H

//matchspy.cc : add zkb
/** @file matchspy.cc
 * @brief MatchSpy implementation.
 */
/* Copyright (C) 2007,2008,2009,2010,2013,2014,2015 Olly Betts
 * Copyright (C) 2007,2009 Lemur Consulting Ltd
 * Copyright (C) 2010 Richard Boulton
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
 */

#include <config.h>
#include <xapian/matchspy.h>

#include <xapian/document.h>
#include <xapian/error.h>
#include <xapian/queryparser.h>
#include <xapian/registry.h>

#include <map>
#include <string>
#include <vector>
#include <net/length.h>
//add zkb
#include <sstream>
#include "pack.h"

#include "autoptr.h"
#include "debuglog.h"
#include "omassert.h"
#include "serialise.h"
#include "stringutils.h"
#include "str.h"
#include "termlist.h"

#include <cfloat>
#include <cmath>

using namespace std;
using namespace Xapian;

MatchSpy::~MatchSpy() {}

MatchSpy *
MatchSpy::clone() const {
    throw UnimplementedError("MatchSpy not suitable for use with remote searches - clone() method unimplemented");
}

string
MatchSpy::name() const {
    throw UnimplementedError("MatchSpy not suitable for use with remote searches - name() method unimplemented");
}

string
MatchSpy::serialise() const {
    throw UnimplementedError("MatchSpy not suitable for use with remote searches - serialise() method unimplemented");
}

MatchSpy *
MatchSpy::unserialise(const string &, const Registry &) const {
    throw UnimplementedError("MatchSpy not suitable for use with remote searches - unserialise() method unimplemented");
}

string
MatchSpy::serialise_results() const {
    throw UnimplementedError("MatchSpy not suitable for use with remote searches - serialise_results() method unimplemented");
}

void
MatchSpy::merge_results(const string &) {
    throw UnimplementedError("MatchSpy not suitable for use with remote searches - merge_results() method unimplemented");
}

string
MatchSpy::get_description() const {
    return "Xapian::MatchSpy()";
}

XAPIAN_NORETURN(static void unsupported_method());
static void unsupported_method() {
    throw Xapian::InvalidOperationError("Method not supported for this type of termlist");
}

/// A termlist iterator over the contents of a ValueCountMatchSpy
class ValueCountTermList : public TermList {
  private:
    map<string, Xapian::doccount>::const_iterator it;
    bool started;
    Xapian::Internal::RefCntPtr<Xapian::ValueCountMatchSpy::Internal> spy;

  public:
    ValueCountTermList(ValueCountMatchSpy::Internal * spy_) : spy(spy_) {
        it = spy->values.begin();
        started = false;
    }

    string get_termname() const {
        Assert(started);
        Assert(!at_end());
        return it->first;
    }

    Xapian::doccount get_termfreq() const {
        Assert(started);
        Assert(!at_end());
        return it->second;
    }

    TermList * next() {
        if (!started) {
            started = true;
        } else {
            Assert(!at_end());
            ++it;
        }
        return NULL;
    }

    TermList * skip_to(const string & term) {
        while (it != spy->values.end() && it->first < term) {
            ++it;
        }
        started = true;
        return NULL;
    }

    bool at_end() const {
        Assert(started);
        return it == spy->values.end();
    }

    Xapian::termcount get_approx_size() const {
        unsupported_method();
        return 0;
    }
    Xapian::termcount get_wdf() const {
        unsupported_method();
        return 0;
    }
    Xapian::PositionIterator positionlist_begin() const {
        unsupported_method();
        return Xapian::PositionIterator();
    }
    Xapian::termcount positionlist_count() const {
        unsupported_method();
        return 0;
    }
};

/** A string with a corresponding frequency.
 */
class StringAndFrequency {
    std::string str;
    Xapian::doccount frequency;

  public:
    /// Construct a StringAndFrequency object.
    StringAndFrequency(const std::string & str_, Xapian::doccount frequency_)
        : str(str_), frequency(frequency_) {}

    /// Return the string.
    std::string get_string() const { return str; }

    /// Return the frequency.
    Xapian::doccount get_frequency() const { return frequency; }
};

/** Compare two StringAndFrequency objects.
 *
 * The comparison is firstly by frequency (higher is better), then by string
 * (earlier lexicographic sort is better).
 */
class StringAndFreqCmpByFreq {
  public:
    /// Default constructor
    StringAndFreqCmpByFreq() {}

    /// Return true if a has a higher frequency than b.
    /// If equal, compare by the str, to provide a stable sort order.
    bool operator()(const StringAndFrequency &a,
                    const StringAndFrequency &b) const {
        if (a.get_frequency() > b.get_frequency()) return true;
        if (a.get_frequency() < b.get_frequency()) return false;
        return a.get_string() < b.get_string();
    }
};

/// A termlist iterator over a vector of StringAndFrequency objects.
class StringAndFreqTermList : public TermList {
  private:
    vector<StringAndFrequency>::const_iterator it;
    bool started;

  public:
    vector<StringAndFrequency> values;

    /** init should be called after the values have been set, but before
     * iteration begins.
     */
    void init() {
        it = values.begin();
        started = false;
    }

    string get_termname() const {
        Assert(started);
        Assert(!at_end());
        return it->get_string();
    }

    Xapian::doccount get_termfreq() const {
        Assert(started);
        Assert(!at_end());
        return it->get_frequency();
    }

    TermList * next() {
        if (!started) {
            started = true;
        } else {
            Assert(!at_end());
            ++it;
        }
        return NULL;
    }

    TermList * skip_to(const string & term) {
        while (it != values.end() && it->get_string() < term) {
            ++it;
        }
        started = true;
        return NULL;
    }

    bool at_end() const {
        Assert(started);
        return it == values.end();
    }

    Xapian::termcount get_approx_size() const {
        unsupported_method();
        return 0;
    }
    Xapian::termcount get_wdf() const {
        unsupported_method();
        return 0;
    }
    Xapian::PositionIterator positionlist_begin() const {
        unsupported_method();
        return Xapian::PositionIterator();
    }
    Xapian::termcount positionlist_count() const {
        unsupported_method();
        return 0;
    }
};

/** Get the most frequent items from a map from string to frequency.
 *
 * This takes input such as that in ValueCountMatchSpy::Internal::values and
 * returns a vector of the most frequent items in the input.
 *
 * @param result A vector which will be filled with the most frequent
 *               items, in descending order of frequency.  Items with
 *               the same frequency will be sorted in ascending
 *               alphabetical order.
 *
 * @param items The map from string to frequency, from which the most
 *              frequent items will be selected.
 *
 * @param maxitems The maximum number of items to return.
 */
static void
get_most_frequent_items(vector<StringAndFrequency> & result,
                        const map<string, doccount> & items,
                        size_t maxitems)
{
    result.clear();
    result.reserve(maxitems);
    StringAndFreqCmpByFreq cmpfn;
    bool is_heap(false);

    for (map<string, doccount>::const_iterator i = items.begin();
         i != items.end(); ++i) {
        Assert(result.size() <= maxitems);
        result.push_back(StringAndFrequency(i->first, i->second));
        if (result.size() > maxitems) {
            // Make the list back into a heap.
            if (is_heap) {
                // Only the new element isn't in the right place.
                push_heap(result.begin(), result.end(), cmpfn);
            } else {
                // Need to build heap from scratch.
                make_heap(result.begin(), result.end(), cmpfn);
                is_heap = true;
            }
            pop_heap(result.begin(), result.end(), cmpfn);
            result.pop_back();
        }
    }

    if (is_heap) {
        sort_heap(result.begin(), result.end(), cmpfn);
    } else {
        sort(result.begin(), result.end(), cmpfn);
    }
}

void
ValueCountMatchSpy::operator()(const Document &doc, weight) {
    Assert(internal.get());
    ++(internal->total);
    string val(doc.get_value(internal->slot));
    if (!val.empty()) ++(internal->values[val]);
}

TermIterator
ValueCountMatchSpy::values_begin() const
{
    Assert(internal.get());
    return Xapian::TermIterator(new ValueCountTermList(internal.get()));
}

TermIterator
ValueCountMatchSpy::top_values_begin(size_t maxvalues) const
{
    Assert(internal.get());
    AutoPtr<StringAndFreqTermList> termlist(new StringAndFreqTermList);
    get_most_frequent_items(termlist->values, internal->values, maxvalues);
    termlist->init();
    return Xapian::TermIterator(termlist.release());
}

MatchSpy *
ValueCountMatchSpy::clone() const {
    Assert(internal.get());
    return new ValueCountMatchSpy(internal->slot);
}

string
ValueCountMatchSpy::name() const {
    return "Xapian::ValueCountMatchSpy";
}

string
ValueCountMatchSpy::serialise() const {
    Assert(internal.get());
    string result;
    result += encode_length(internal->slot);
    return result;
}

MatchSpy *
ValueCountMatchSpy::unserialise(const string & s, const Registry &) const
{
    const char * p = s.data();
    const char * end = p + s.size();

    valueno new_slot;
    decode_length(&p, end, new_slot);
    if (p != end) {
        throw NetworkError("Junk at end of serialised ValueCountMatchSpy");
    }

    return new ValueCountMatchSpy(new_slot);
}

string
ValueCountMatchSpy::serialise_results() const {
    LOGCALL(REMOTE, string, "ValueCountMatchSpy::serialise_results", NO_ARGS);
    Assert(internal.get());
    string result;
    result += encode_length(internal->total);
    result +=
        encode_length(internal->values.size());
    for (map<string, doccount>::const_iterator i = internal->values.begin();
         i != internal->values.end(); ++i) {
        // Commented out in this modified version; note that merge_results()
        // below still decodes a length prefix before each value.
        //result += encode_length(i->first.size());
        result += i->first;
        result += encode_length(i->second);
    }
    RETURN(result);
}

void
ValueCountMatchSpy::merge_results(const string & s) {
    LOGCALL_VOID(REMOTE, "ValueCountMatchSpy::merge_results", s);
    Assert(internal.get());
    const char * p = s.data();
    const char * end = p + s.size();

    Xapian::doccount n;
    decode_length(&p, end, n);
    internal->total += n;

    map<string, doccount>::size_type items;
    decode_length(&p, end, items);
    while (p != end) {
        while (items != 0) {
            size_t vallen;
            decode_length_and_check(&p, end, vallen);
            string val(p, vallen);
            p += vallen;
            doccount freq;
            decode_length(&p, end, freq);
            internal->values[val] += freq;
            --items;
        }
    }
}

string
ValueCountMatchSpy::get_description() const {
    string d = "ValueCountMatchSpy(";
    if (internal.get()) {
        d += str(internal->total);
        d += " docs seen, looking in ";
        d += str(internal->values.size());
        d += " slots)";
    } else {
        d += ")";
    }
    return d;
}

//add zkb
// As more aggregate functions are added, the chance that a switch/case hits
// on its first test gets lower.
// Data record: all the information for one group is stored here.
struct SinagleGroupItem {
    std::string m_max_str;
    std::string m_min_str;
    double m_sum_d;
    int m_count_i;

    SinagleGroupItem() {
        m_max_str.clear();
        m_min_str.clear();
        m_sum_d = 0.0;
        m_count_i = 0;
    };
};

// A function pointer costs an indirect call, but the decision about which
// aggregate to run only has to be made once.
typedef void (GroupMatchSpy::Internal::*compare_fun)(SinagleGroupItem &, std::string &);

class GroupMatchSpy::Internal : public Xapian::Internal::RefCntBase {
  public:
    // Flags saying whether each aggregate's data is valid.
    bool m_is_max_effective;
    bool m_is_sum_effective;
    bool m_is_min_effective;
    bool m_is_count_effective;

    static const int INIT_CONUNT = 1;

    // Slot of the value to aggregate.
    Xapian::valueno m_slot;

    // Slots of the fields to group by.
    std::set<Xapian::valueno> m_group_slot_set;

    // Limit on the number of groups.
    const int m_limit_i;

    // The function pointers are stored here.
    std::vector<compare_fun> m_compare_vec;

    // The grouping lookup uses a map.
    std::map<std::string, SinagleGroupItem> m_total_map;
    std::map<std::string, SinagleGroupItem>::iterator
        m_total_iter;

    Internal(Xapian::valueno slot, Xapian::valueno group_slot,
             const int op, const int limit)
        : m_is_max_effective(false)
        , m_is_sum_effective(false)
        , m_is_min_effective(false)
        , m_is_count_effective(false)
        , m_slot(slot)
        , m_limit_i(limit)
    {
        m_total_map.clear();
        m_compare_vec.clear();
        m_group_slot_set.clear();
        m_group_slot_set.insert(group_slot);
        // Add the calculation methods selected by op.
        add_calculation(op);
    }

    // Interface for adding another group-by field.
    void add_group_slot(Xapian::valueno slot_) {
        m_group_slot_set.insert(slot_);
    }

    // Check whether the string is purely numeric.
    bool is_num_string(std::string& str, double& d) {
        std::stringstream jud(str);
        char c;
        if (!(jud >> d)) {
            return false;
        } else if (jud >> c) {
            return false;
        } else {
            return true;
        }
    }

    // Maximum.
    void set_max_value(SinagleGroupItem& key, std::string& val) {
        if (val > key.m_max_str) {
            key.m_max_str = val;
        }
    }

    // Minimum.
    void set_min_value(SinagleGroupItem& key, std::string& val) {
        if (val < key.m_min_str) {
            key.m_min_str = val;
        }
    }

    // Sum.
    void set_sum_value(SinagleGroupItem& key, std::string& val) {
        double val_num = 0.0;
        if (is_num_string(val, val_num)) {
            key.m_sum_d += val_num;
        } else {
            m_is_sum_effective = false;
            // add_calculation() always pushes sum last, so pop_back()
            // removes exactly the sum function.
            m_compare_vec.pop_back();
        }
    }

    // The unnamed dummy parameter keeps the signature the same as the
    // functions above; counts the documents in each group.
    void set_count_value(SinagleGroupItem& key, std::string&) {
        key.m_count_i++;
    }

    // The actual per-document processing. This is the only place where the
    // map (and its iterator) is modified; nothing modifies it afterwards, so
    // traversing it later should be safe.
    void deal_group_fun(const Xapian::Document &doc) {
        std::string key_tag;
        for (std::set<Xapian::valueno>::const_iterator i = m_group_slot_set.begin();
             i != m_group_slot_set.end(); ++i) {
            // Prefix each value with its length so that the combined key is
            // unambiguous; an empty value contributes a length of 0.
            pack_string(key_tag, doc.get_value(*i));
        }

        std::string val(doc.get_value(m_slot));
        m_total_iter = m_total_map.find(key_tag);
        if (m_total_iter == m_total_map.end()) {
            if (m_limit_i > (int)m_total_map.size()) {
                // Even if val is empty, still create the group entry.
                m_total_map[key_tag].m_max_str = val;
                m_total_map[key_tag].m_min_str = val;
                m_total_map[key_tag].m_count_i = INIT_CONUNT;
                // If the first string is a valid number, store it. If it is
                // not, check whether sum was requested: if so, remove it;
                // otherwise nothing needs to be done.
                double val_num = 0.0;
                if (is_num_string(val, val_num)) {
                    m_total_map[key_tag].m_sum_d = val_num;
                } else if (m_is_sum_effective) {
                    // If a single value is not a valid number, sum is treated
                    // as invalid for the whole data set.
                    m_is_sum_effective = false;
                    m_compare_vec.pop_back();
                }
            }
        } else {
            // The loop needs only one check per call, whereas a switch/case
            // would need several tests.
            for (size_t i = 0; i < m_compare_vec.size(); i++) {
                (this->*m_compare_vec[i])((m_total_iter->second), val);
            }
        }
    }

    void add_calculation(const int op_) {
        m_compare_vec.clear();
        switch (op_) {
        case Xapian::GroupMatchSpy::MAX_VAL: {
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_max_value);
            m_is_max_effective = true;
        }
        break;
        case Xapian::GroupMatchSpy::MIN_VAL: {
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_min_value);
            m_is_min_effective = true;
        }
        break;
        case Xapian::GroupMatchSpy::SUM_VAL: {
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_sum_value);
            m_is_sum_effective = true;
        }
        break;
        case Xapian::GroupMatchSpy::COUNT_VAL: {
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_count_value);
            m_is_count_effective = true;
        }
        break;
        case (Xapian::GroupMatchSpy::MAX_VAL | Xapian::GroupMatchSpy::MIN_VAL) : {
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_max_value);
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_min_value);
            m_is_max_effective = true;
            m_is_min_effective = true;
        }
        break;
        case (Xapian::GroupMatchSpy::MAX_VAL | Xapian::GroupMatchSpy::SUM_VAL) : {
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_max_value);
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_sum_value);
            m_is_sum_effective = true;
            m_is_max_effective = true;
        }
        break;
        case (Xapian::GroupMatchSpy::MIN_VAL | Xapian::GroupMatchSpy::SUM_VAL) : {
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_min_value);
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_sum_value);
            m_is_sum_effective = true;
            m_is_min_effective = true;
        }
        break;
        case (Xapian::GroupMatchSpy::COUNT_VAL | Xapian::GroupMatchSpy::MAX_VAL) : {
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_max_value);
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_count_value);
            m_is_max_effective = true;
            m_is_count_effective = true;
        }
        break;
        case (Xapian::GroupMatchSpy::COUNT_VAL | Xapian::GroupMatchSpy::MIN_VAL) : {
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_min_value);
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_count_value);
            m_is_count_effective = true;
            m_is_min_effective = true;
        }
        break;
        case (Xapian::GroupMatchSpy::COUNT_VAL | Xapian::GroupMatchSpy::SUM_VAL) : {
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_count_value);
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_sum_value);
            m_is_sum_effective = true;
            m_is_count_effective = true;
        }
        break;
        case (Xapian::GroupMatchSpy::COUNT_VAL | Xapian::GroupMatchSpy::MAX_VAL | Xapian::GroupMatchSpy::MIN_VAL) : {
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_max_value);
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_min_value);
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_count_value);
            m_is_min_effective = true;
            m_is_max_effective = true;
            m_is_count_effective = true;
        }
        break;
        case (Xapian::GroupMatchSpy::COUNT_VAL | Xapian::GroupMatchSpy::MIN_VAL | Xapian::GroupMatchSpy::SUM_VAL) : {
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_min_value);
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_count_value);
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_sum_value);
            m_is_count_effective = true;
            m_is_sum_effective = true;
            m_is_min_effective = true;
        }
        break;
        case (Xapian::GroupMatchSpy::COUNT_VAL | Xapian::GroupMatchSpy::MAX_VAL | Xapian::GroupMatchSpy::SUM_VAL) : {
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_max_value);
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_count_value);
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_sum_value);
            m_is_max_effective = true;
            m_is_count_effective = true;
            m_is_sum_effective = true;
        }
        break;
        case (Xapian::GroupMatchSpy::MAX_VAL | Xapian::GroupMatchSpy::MIN_VAL | Xapian::GroupMatchSpy::SUM_VAL) : {
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_max_value);
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_min_value);
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_sum_value);
            m_is_max_effective = true;
            m_is_min_effective = true;
            m_is_sum_effective = true;
        }
        break;
        case (Xapian::GroupMatchSpy::MAX_VAL | Xapian::GroupMatchSpy::MIN_VAL | Xapian::GroupMatchSpy::SUM_VAL | Xapian::GroupMatchSpy::COUNT_VAL) : {
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_max_value);
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_min_value);
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_count_value);
            m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_sum_value);
            m_is_sum_effective = true;
            m_is_max_effective = true;
            m_is_min_effective = true;
            m_is_count_effective = true;
        }
        break;
        default: {
            throw Xapian::InvalidArgumentError("invalid group by op ArgumentError add zkb");
        }
        break;
        }
    }
}; // The semicolon here must not be omitted.

GroupMatchSpy::GroupMatchSpy(Xapian::valueno slot, Xapian::valueno group_slot,
                             const int op, const int limit)
    : internal(new GroupMatchSpy::Internal(slot, group_slot, op, limit))
{
    static const int UPPER_LIMIT = 1000000;
    static const int LOWER_LIMIT = 1;
    if (limit > UPPER_LIMIT || limit < LOWER_LIMIT) {
        throw Xapian::InvalidArgumentError("Beyond the upper limit 1000000 or Below the lower limit 1");
    }
}

GroupMatchSpy::~GroupMatchSpy() {
}

GroupMatchSpy::GroupMatchSpy(const Xapian::GroupMatchSpy & other)
    : internal(other.internal) {
}

void GroupMatchSpy::add_group_slot(Xapian::valueno slot_) {
    // Assert(internal.get() != 0);
    internal->add_group_slot(slot_);
}

void GroupMatchSpy::operator()(const Xapian::Document &doc, Xapian::weight wt) {
    // Assert(internal.get() != 0);
    (void)wt;
    internal->deal_group_fun(doc);
}

// Iterator-style access interface exposed to callers; these are const member
// functions, so they cannot modify the object's own member variables.
SinagleGroupIterator GroupMatchSpy::begin() const {
    Assert(internal.get() != 0);
    // Point the iterator at the first element.
    internal->m_total_iter = internal->m_total_map.begin();
    return SinagleGroupIterator(0, *this);
}

SinagleGroupIterator GroupMatchSpy::end() const {
    Assert(internal.get() != 0);
    return SinagleGroupIterator(internal->m_total_map.size(), *this);
}

size_t GroupMatchSpy::size() {
    Assert(internal.get() != 0);
    return internal->m_total_map.size();
}

bool GroupMatchSpy::empty() {
    Assert(internal.get() != 0);
    return internal->m_total_map.empty();
}

bool GroupMatchSpy::has_max_val() {
    return internal->m_is_max_effective;
}

bool GroupMatchSpy::has_min_val() {
    return internal->m_is_min_effective;
}

bool GroupMatchSpy::has_sum_val() {
    return internal->m_is_sum_effective;
}

bool GroupMatchSpy::has_count_val() {
    return internal->m_is_count_effective;
}

std::string GroupMatchSpy::get_description() const {
    return "Xapian::SinagleGroupMatchSpy get_description add by zkb";
}

SinagleGroupIterator & SinagleGroupIterator::operator++() {
    Assert(m_spy.internal.get() != 0 && (m_spy.internal->m_total_map.size() != (size_t)m_index_i));
    m_index_i++;
    // Coding directly against the shared map iterator is not pretty, but for
    // efficiency this is the best I have come up with for now.
    m_spy.internal->m_total_iter++;
    return *this;
}

void SinagleGroupIterator::get_group_name(std::vector<std::string>& group_name) {
    Assert(m_spy.internal.get() != 0);
    const char* start = m_spy.internal->m_total_iter->first.data();
    const char* end = start + m_spy.internal->m_total_iter->first.size();
    // Extract the group names from the packed key.
    std::string tmp;
    while (start != end) {
        unpack_string(&start, end, tmp);
        group_name.push_back(tmp);
    }
}

const std::string& SinagleGroupIterator::get_max_val() {
    Assert(m_spy.internal.get() != 0);
    // As long as no elements are added or removed, the iterator should
    // remain valid.
    return (m_spy.internal->m_total_iter->second.m_max_str);
}

const std::string& SinagleGroupIterator::get_min_val() {
    Assert(m_spy.internal.get() != 0);
    return (m_spy.internal->m_total_iter->second.m_min_str);
}

double
SinagleGroupIterator::get_sum_val() {
    Assert(m_spy.internal.get() != 0);
    return (m_spy.internal->m_total_iter->second.m_sum_d);
}

int SinagleGroupIterator::get_count_val() {
    Assert(m_spy.internal.get() != 0);
    return (m_spy.internal->m_total_iter->second.m_count_i);
}

[hr]
I would be grateful for any pointers on how to speed up group by in Xapian, and I am happy to discuss any good ideas. My email: zhoukuanbin@163.com
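For anyone who wants to look at the map-lookup part in isolation from Xapian, here is a minimal standalone sketch of one direction that might be worth measuring. It does a single tree descent per document (lower_bound plus a hinted insert, instead of find followed by several operator[] calls) and parses numbers with strtod instead of building a std::stringstream each time. `Agg`, `parse_number` and `add_value` are hypothetical names, not part of the patch above, and the behaviour is deliberately simplified: there is no group limit, a non-numeric value is just skipped for the sum, and strtod does not accept exactly the same inputs as the stream-based check (for example it rejects trailing whitespace and accepts "inf"). Whether this helps at all next to the cost of fetching values from the document is something I have not measured.

```cpp
#include <cstdlib>
#include <map>
#include <string>

// Hypothetical aggregate record, mirroring the fields of SinagleGroupItem.
struct Agg {
    std::string max_str;
    std::string min_str;
    double sum;
    int count;
    Agg() : sum(0.0), count(0) {}
};

typedef std::map<std::string, Agg> GroupMap;

// Returns true and sets d if the whole string parses as a number.
bool parse_number(const std::string& s, double& d) {
    if (s.empty()) return false;
    char* end = 0;
    d = std::strtod(s.c_str(), &end);
    return end == s.c_str() + s.size();
}

// One tree descent per document: lower_bound() finds either the existing
// group or the position where a new one belongs, and that position is then
// passed to insert() as a hint.
void add_value(GroupMap& groups, const std::string& key,
               const std::string& val) {
    GroupMap::iterator it = groups.lower_bound(key);
    const bool is_new =
        (it == groups.end() || groups.key_comp()(key, it->first));
    if (is_new) {
        it = groups.insert(it, GroupMap::value_type(key, Agg()));
        // The first value of a new group seeds max and min.
        it->second.max_str = val;
        it->second.min_str = val;
    } else {
        if (val > it->second.max_str) it->second.max_str = val;
        if (val < it->second.min_str) it->second.min_str = val;
    }
    double d;
    if (parse_number(val, d)) it->second.sum += d;
    ++it->second.count;
}
```

As in the patch, max and min are compared as strings here, so numeric ordering only holds if the stored values are encoded so that they sort correctly.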