Modifying the Xapian source: group by
2016-06-15, 16:27 (This post was last modified: 2016-06-15 16:35 by rootzkb.)
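Because of a business requirement at my company, I modified the Xapian source to implement aggregate functions. In Xapian, aggregation is done by MatchSpy: multimatch.cc calls into the MatchSpy through its function-call operator (it is a functor) to perform the aggregation. Out of the box, Xapian only counts how many documents fall into each group after grouping. That does not cover many needs, for example when I want the maximum value of each group after grouping.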
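So I implemented aggregate functions such as max, min and sum over the groups. The class I designed inherits from MatchSpy and overrides operator(). Inside it I rely on std::map: the key is the field to group by, and the value is a struct holding the max, min and so on. That gives me the grouping, and the map's find method cuts the time needed to decide whether a document belongs to an existing group. Following Xapian's approach, for each document I fetch the value from the document by slot and compare it to get the max and min. The loop over all the data is the main loop of multimatch's get_mset.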
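The design described above can be pictured with a minimal, stdlib-only sketch. It uses no Xapian types, and `GroupStats` and `GroupAggregator` are hypothetical names chosen for illustration, not part of the posted code. It assumes the caller has already extracted the group key, the raw value and its numeric form from the document.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical per-group accumulator (illustrative names only).
struct GroupStats {
    std::string max_val;
    std::string min_val;
    double sum;
    int count;
    GroupStats() : sum(0.0), count(0) {}
};

// Groups observations by key and keeps max/min/sum/count per group.
class GroupAggregator {
    std::map<std::string, GroupStats> groups_;
    std::size_t limit_;
public:
    explicit GroupAggregator(std::size_t limit) : limit_(limit) {
        assert(limit > 0);
    }

    void add(const std::string& key, const std::string& val, double num) {
        std::map<std::string, GroupStats>::iterator it = groups_.find(key);
        if (it == groups_.end()) {
            // New group: ignore it once the cap on groups is reached.
            if (groups_.size() >= limit_) return;
            it = groups_.insert(std::make_pair(key, GroupStats())).first;
            it->second.max_val = val;
            it->second.min_val = val;
        } else {
            // Existing group: values compare as strings, as in the post.
            if (val > it->second.max_val) it->second.max_val = val;
            if (val < it->second.min_val) it->second.min_val = val;
        }
        // The iterator is reused, so the map is not searched again here.
        it->second.sum += num;
        ++it->second.count;
    }

    // Returns a null pointer when the group does not exist.
    const GroupStats* get(const std::string& key) const {
        std::map<std::string, GroupStats>::const_iterator it = groups_.find(key);
        return it == groups_.end() ? 0 : &it->second;
    }

    std::size_t size() const { return groups_.size(); }
};
```

Because max and min are compared as strings, numeric fields only order correctly if they were stored in a form whose byte order matches numeric order.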
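So much for the background. My problem is that the feature is correct but too slow. A plain MatchAll query over 24 million documents takes about 10 s; my group by with max on top of MatchAll takes 22 s. I would welcome advice on how to optimise this. The costs I can think of are the map's find and, most importantly, fetching the value from the document, because that requires building a key and looking it up in the BrassTableTree.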
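Two of the in-process costs can be tried in isolation before touching the backend. The sketch below is an untested idea, not a measured fix: it swaps the ordered map for `std::unordered_map` (C++11) and parses numbers with `std::strtod` instead of building a `std::stringstream` per document. `parse_double` and `HashedSum` are hypothetical names. Note that, unlike a stringstream-based check, this `parse_double` rejects trailing whitespace, and groups come back in no particular order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <unordered_map>

// Parse the whole string as a double without a stringstream.
// Returns false for empty input or if any character is left over.
inline bool parse_double(const std::string& s, double& out) {
    if (s.empty()) return false;
    const char* begin = s.c_str();
    char* end = 0;
    double v = std::strtod(begin, &end);
    if (end != begin + s.size()) return false;
    out = v;
    return true;
}

struct SumCount {
    double sum;
    int count;
    SumCount() : sum(0.0), count(0) {}
};

// Sum and count per group, keyed through a hash table.
class HashedSum {
    std::unordered_map<std::string, SumCount> groups_;
public:
    explicit HashedSum(std::size_t expected_groups) {
        groups_.reserve(expected_groups);  // avoid rehashing while filling
    }

    // Returns false (and changes nothing) when val is not numeric.
    bool add(const std::string& key, const std::string& val) {
        double num = 0.0;
        if (!parse_double(val, num)) return false;
        SumCount& g = groups_[key];  // one hashed lookup, inserts if new
        g.sum += num;
        ++g.count;
        return true;
    }

    double sum_of(const std::string& key) const {
        std::unordered_map<std::string, SumCount>::const_iterator it = groups_.find(key);
        assert(it != groups_.end());
        return it->second.sum;
    }

    int count_of(const std::string& key) const {
        std::unordered_map<std::string, SumCount>::const_iterator it = groups_.find(key);
        assert(it != groups_.end());
        return it->second.count;
    }
};
```

Whether this helps depends on where the 12 extra seconds actually go; profiling the spy separately from the value fetch would show whether the map and the parsing matter at all.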
[hr]
Appendix: part of the core modified source
//matchspy.h : add zkb
/** @file matchspy.h
* @brief MatchSpy implementation.
*/
/* Copyright (C) 2007,2008,2009,2010,2012 Olly Betts
* Copyright (C) 2007,2009 Lemur Consulting Ltd
* Copyright (C) 2010 Richard Boulton
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/

#ifndef XAPIAN_INCLUDED_MATCHSPY_H
#define XAPIAN_INCLUDED_MATCHSPY_H

#include <xapian/base.h>
#include <xapian/enquire.h>
#include <xapian/termiterator.h>
#include <xapian/visibility.h>

#include <string>
#include <map>
#include <set>
#include <string>
#include <vector>

namespace Xapian {

class Document;
class Registry;

/** Abstract base class for match spies.
*
* The subclasses will generally accumulate information seen during the match,
* to calculate aggregate functions, or other profiles of the matching
* documents.
*/
class XAPIAN_VISIBILITY_DEFAULT MatchSpy {
private:
/// Don't allow assignment.
void operator=(const MatchSpy &);

/// Don't allow copying.
MatchSpy(const MatchSpy &);

protected:
/// Default constructor, needed by subclass constructors.
MatchSpy() {}

public:
/** Virtual destructor, because we have virtual methods. */
virtual ~MatchSpy();

/** Register a document with the match spy.
*
* This is called by the matcher once with each document seen by the
* matcher during the match process. Note that the matcher will often not
* see all the documents which match the query, due to optimisations which
* allow low-weighted documents to be skipped, and allow the match process
* to be terminated early.
*
* @param doc The document seen by the match spy.
* @param wt The weight of the document.
*/
virtual void operator()(const Xapian::Document &doc,
Xapian::weight wt) = 0;

/** Clone the match spy.
*
* The clone should inherit the configuration of the parent, but need not
* inherit the state. ie, the clone does not need to be passed
* information about the results seen by the parent.
*
* If you don't want to support the remote backend in your match spy, you
* can use the default implementation which simply throws
* Xapian::UnimplementedError.
*
* Note that the returned object will be deallocated by Xapian after use
* with "delete". If you want to handle the deletion in a special way
* (for example when wrapping the Xapian API for use from another
* language) then you can define a static <code>operator delete</code>
* method in your subclass as shown here:
* http://trac.xapian.org/ticket/554#comment:1
*/
virtual MatchSpy * clone() const;

/** Return the name of this match spy.
*
* This name is used by the remote backend. It is passed with the
* serialised parameters to the remote server so that it knows which class
* to create.
*
* Return the full namespace-qualified name of your class here - if your
* class is called MyApp::FooMatchSpy, return "MyApp::FooMatchSpy" from
* this method.
*
* If you don't want to support the remote backend in your match spy, you
* can use the default implementation which simply throws
* Xapian::UnimplementedError.
*/
virtual std::string name() const;

/** Return this object's parameters serialised as a single string.
*
* If you don't want to support the remote backend in your match spy, you
* can use the default implementation which simply throws
* Xapian::UnimplementedError.
*/
virtual std::string serialise() const;

/** Unserialise parameters.
*
* This method unserialises parameters serialised by the @a serialise()
* method and allocates and returns a new object initialised with them.
*
* If you don't want to support the remote backend in your match spy, you
* can use the default implementation which simply throws
* Xapian::UnimplementedError.
*
* Note that the returned object will be deallocated by Xapian after use
* with "delete". If you want to handle the deletion in a special way
* (for example when wrapping the Xapian API for use from another
* language) then you can define a static <code>operator delete</code>
* method in your subclass as shown here:
* http://trac.xapian.org/ticket/554#comment:1
*
* @param s A string containing the serialised results.
* @param context Registry object to use for unserialisation to permit
* MatchSpy subclasses with sub-MatchSpy objects to be
* implemented.
*/
virtual MatchSpy * unserialise(const std::string & s,
const Registry & context) const;

/** Serialise the results of this match spy.
*
* If you don't want to support the remote backend in your match spy, you
* can use the default implementation which simply throws
* Xapian::UnimplementedError.
*/
virtual std::string serialise_results() const;

/** Unserialise some results, and merge them into this matchspy.
*
* The order in which results are merged should not be significant, since
* this order is not specified (and will vary depending on the speed of
* the search in each sub-database).
*
* If you don't want to support the remote backend in your match spy, you
* can use the default implementation which simply throws
* Xapian::UnimplementedError.
*
* @param s A string containing the serialised results.
*/
virtual void merge_results(const std::string & s);

/** Return a string describing this object.
*
* This default implementation returns a generic answer, to avoid forcing
* those deriving their own MatchSpy subclasses from having to implement
* this (they may not care what get_description() gives for their
* subclass).
*/
virtual std::string get_description() const;
};


/** Class for counting the frequencies of values in the matching documents.
*/
class XAPIAN_VISIBILITY_DEFAULT ValueCountMatchSpy : public MatchSpy {
public:
struct Internal;

#ifndef SWIG // SWIG doesn't need to know about the internal class
struct XAPIAN_VISIBILITY_DEFAULT Internal
: public Xapian::Internal::RefCntBase
{
/// The slot to count.
Xapian::valueno slot;

/// Total number of documents seen by the match spy.
Xapian::doccount total;

/// The values seen so far, together with their frequency.
std::map<std::string, Xapian::doccount> values;

Internal() : slot(Xapian::BAD_VALUENO), total(0) {}
explicit Internal(Xapian::valueno slot_) : slot(slot_), total(0) {}
};
#endif

protected:
Xapian::Internal::RefCntPtr<Internal> internal;

public:
/// Construct an empty ValueCountMatchSpy.
ValueCountMatchSpy() : internal() {}

/// Construct a MatchSpy which counts the values in a particular slot.
explicit ValueCountMatchSpy(Xapian::valueno slot_)
: internal(new Internal(slot_)) {}

/** Return the total number of documents tallied. */
size_t get_total() const {
return internal.get() ? internal->total : 0;
}

/** Get an iterator over the values seen in the slot.
*
* Items will be returned in ascending alphabetical order.
*
* During the iteration, the frequency of the current value can be
* obtained with the get_termfreq() method on the iterator.
*/
TermIterator values_begin() const;

/** End iterator corresponding to values_begin() */
TermIterator values_end() const {
return TermIterator();
}

/** Get an iterator over the most frequent values seen in the slot.
*
* Items will be returned in descending order of frequency. Values with
* the same frequency will be returned in ascending alphabetical order.
*
* During the iteration, the frequency of the current value can be
* obtained with the get_termfreq() method on the iterator.
*
* @param maxvalues The maximum number of values to return.
*/
TermIterator top_values_begin(size_t maxvalues) const;

/** End iterator corresponding to top_values_begin() */
TermIterator top_values_end(size_t) const {
return TermIterator();
}

/** Implementation of virtual operator().
*
* This implementation tallies values for a matching document.
*
* @param doc The document to tally values for.
* @param wt The weight of the document (ignored by this class).
*/
void operator()(const Xapian::Document &doc, Xapian::weight wt);

virtual MatchSpy * clone() const;
virtual std::string name() const;
virtual std::string serialise() const;
virtual MatchSpy * unserialise(const std::string & s,
const Registry & context) const;
virtual std::string serialise_results() const;
virtual void merge_results(const std::string & s);
virtual std::string get_description() const;
};

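//add zkb: there is no header analogous to ommatchspy, so the data class SinagleGroupItem
//is declared and implemented entirely in the .cc file, and all group-aggregation
//functionality is concentrated in this class. It is internal only and not exposed.
class SinagleGroupIterator;

// Class that computes the aggregate functions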
class XAPIAN_VISIBILITY_DEFAULT GroupMatchSpy : public MatchSpy {
friend class SinagleGroupIterator;
public:
class Internal;

//add zkb: max/min (and sum/count) query flags for a single field
typedef enum
{
MAX_VAL = 0x20,
MIN_VAL = 0x40,
SUM_VAL = 0x80,
COUNT_VAL = 0x100
}value_op;

GroupMatchSpy(const GroupMatchSpy &other);

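/*
* Grouping class
* @slot slot whose value the aggregate function is computed over
* @group_slot field to group by (at least one; for multi-field grouping call add_group_slot to add more)
* @op type of aggregate function
* @limit cap on the number of groups
*/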
GroupMatchSpy(Xapian::valueno slot, Xapian::valueno group_slot, const int op, const int limit);
~GroupMatchSpy();

// Interface for grouping by multiple fields
void add_group_slot(Xapian::valueno group_slot);

// Iterator-style access interface exposed to callers; declared const so these methods do not modify this object's own members
SinagleGroupIterator begin() const;
SinagleGroupIterator end() const;
size_t size();
bool empty();

bool has_sum_val();
bool has_min_val();
bool has_max_val();
bool has_count_val();

/** Implementation of virtual operator().
*
* This implementation tallies values for a matching document.
*
* @param doc The document to tally values for.
* @param wt The weight of the document (ignored by this class).
*/
void operator()(const Xapian::Document &doc, Xapian::weight wt);
virtual std::string get_description() const;
private:
/// @private @internal Reference counted internals.
Xapian::Internal::RefCntPtr<Internal> internal;
};

// Class for the aggregate result set
class XAPIAN_VISIBILITY_DEFAULT SinagleGroupIterator {
public:
SinagleGroupIterator(int index, const GroupMatchSpy& spy) : m_index_i(index)
, m_spy(spy) {
}

int m_index_i;
GroupMatchSpy m_spy;

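// Implements the operators an iterator needs by delegating to the iterator held inside the MatchSpy.
// With several hundred thousand groups a find() per access would be slow, whereas stepping the iterator
// forward is cheap. A member binary operator takes this as its implicit first operand, so the comparison
// operators are written as non-member functions.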
friend bool operator!=(const SinagleGroupIterator &a, const SinagleGroupIterator &b);
friend bool operator==(const SinagleGroupIterator &a, const SinagleGroupIterator &b);
SinagleGroupIterator & operator++();
void get_group_name(std::vector<std::string>& group_name);
const std::string& get_max_val();
const std::string& get_min_val();
double get_sum_val();
int get_count_val();
};

// Comparison operators for the iterator class
inline bool operator!=(const SinagleGroupIterator &a, const SinagleGroupIterator &b) {
return (a.m_index_i != b.m_index_i);
}

inline bool operator==(const SinagleGroupIterator &a, const SinagleGroupIterator &b) {
return (a.m_index_i == b.m_index_i);
}

}

#endif // XAPIAN_INCLUDED_MATCHSPY_H

//matchspy.cc : add zkb
/** @file matchspy.cc
* @brief MatchSpy implementation.
*/
/* Copyright (C) 2007,2008,2009,2010,2013,2014,2015 Olly Betts
* Copyright (C) 2007,2009 Lemur Consulting Ltd
* Copyright (C) 2010 Richard Boulton
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/

#include <config.h>
#include <xapian/matchspy.h>

#include <xapian/document.h>
#include <xapian/error.h>
#include <xapian/queryparser.h>
#include <xapian/registry.h>

#include <map>
#include <string>
#include <vector>
#include <net/length.h>

//add zkb
#include <sstream>
#include "pack.h"

#include "autoptr.h"
#include "debuglog.h"
#include "omassert.h"
#include "serialise.h"
#include "stringutils.h"
#include "str.h"
#include "termlist.h"

#include <cfloat>
#include <cmath>

using namespace std;
using namespace Xapian;

MatchSpy::~MatchSpy() {}

MatchSpy *
MatchSpy::clone() const {
throw UnimplementedError("MatchSpy not suitable for use with remote searches - clone() method unimplemented");
}

string
MatchSpy::name() const {
throw UnimplementedError("MatchSpy not suitable for use with remote searches - name() method unimplemented");
}

string
MatchSpy::serialise() const {
throw UnimplementedError("MatchSpy not suitable for use with remote searches - serialise() method unimplemented");
}

MatchSpy *
MatchSpy::unserialise(const string &, const Registry &) const {
throw UnimplementedError("MatchSpy not suitable for use with remote searches - unserialise() method unimplemented");
}

string
MatchSpy::serialise_results() const {
throw UnimplementedError("MatchSpy not suitable for use with remote searches - serialise_results() method unimplemented");
}

void
MatchSpy::merge_results(const string &) {
throw UnimplementedError("MatchSpy not suitable for use with remote searches - merge_results() method unimplemented");
}

string
MatchSpy::get_description() const {
return "Xapian::MatchSpy()";
}

XAPIAN_NORETURN(static void unsupported_method());
static void unsupported_method() {
throw Xapian::InvalidOperationError("Method not supported for this type of termlist");
}

/// A termlist iterator over the contents of a ValueCountMatchSpy
class ValueCountTermList : public TermList {
private:
map<string, Xapian::doccount>::const_iterator it;
bool started;
Xapian::Internal::RefCntPtr<Xapian::ValueCountMatchSpy::Internal> spy;
public:

ValueCountTermList(ValueCountMatchSpy::Internal * spy_) : spy(spy_) {
it = spy->values.begin();
started = false;
}

string get_termname() const {
Assert(started);
Assert(!at_end());
return it->first;
}

Xapian::doccount get_termfreq() const {
Assert(started);
Assert(!at_end());
return it->second;
}

TermList * next() {
if (!started) {
started = true;
} else {
Assert(!at_end());
++it;
}
return NULL;
}

TermList * skip_to(const string & term) {
while (it != spy->values.end() && it->first < term) {
++it;
}
started = true;
return NULL;
}

bool at_end() const {
Assert(started);
return it == spy->values.end();
}

Xapian::termcount get_approx_size() const { unsupported_method(); return 0; }
Xapian::termcount get_wdf() const { unsupported_method(); return 0; }
Xapian::PositionIterator positionlist_begin() const {
unsupported_method();
return Xapian::PositionIterator();
}
Xapian::termcount positionlist_count() const { unsupported_method(); return 0; }
};

/** A string with a corresponding frequency.
*/
class StringAndFrequency {
std::string str;
Xapian::doccount frequency;
public:
/// Construct a StringAndFrequency object.
StringAndFrequency(const std::string & str_, Xapian::doccount frequency_)
: str(str_), frequency(frequency_) {}

/// Return the string.
std::string get_string() const { return str; }

/// Return the frequency.
Xapian::doccount get_frequency() const { return frequency; }
};

/** Compare two StringAndFrequency objects.
*
* The comparison is firstly by frequency (higher is better), then by string
* (earlier lexicographic sort is better).
*/
class StringAndFreqCmpByFreq {
public:
/// Default constructor
StringAndFreqCmpByFreq() {}

/// Return true if a has a higher frequency than b.
/// If equal, compare by the str, to provide a stable sort order.
bool operator()(const StringAndFrequency &a,
const StringAndFrequency &b) const {
if (a.get_frequency() > b.get_frequency()) return true;
if (a.get_frequency() < b.get_frequency()) return false;
return a.get_string() < b.get_string();
}
};

/// A termlist iterator over a vector of StringAndFrequency objects.
class StringAndFreqTermList : public TermList {
private:
vector<StringAndFrequency>::const_iterator it;
bool started;
public:
vector<StringAndFrequency> values;

/** init should be called after the values have been set, but before
* iteration begins.
*/
void init() {
it = values.begin();
started = false;
}

string get_termname() const {
Assert(started);
Assert(!at_end());
return it->get_string();
}

Xapian::doccount get_termfreq() const {
Assert(started);
Assert(!at_end());
return it->get_frequency();
}

TermList * next() {
if (!started) {
started = true;
} else {
Assert(!at_end());
++it;
}
return NULL;
}

TermList * skip_to(const string & term) {
while (it != values.end() && it->get_string() < term) {
++it;
}
started = true;
return NULL;
}

bool at_end() const {
Assert(started);
return it == values.end();
}

Xapian::termcount get_approx_size() const { unsupported_method(); return 0; }
Xapian::termcount get_wdf() const { unsupported_method(); return 0; }
Xapian::PositionIterator positionlist_begin() const {
unsupported_method();
return Xapian::PositionIterator();
}
Xapian::termcount positionlist_count() const { unsupported_method(); return 0; }
};

/** Get the most frequent items from a map from string to frequency.
*
* This takes input such as that in ValueCountMatchSpy::Internal::values and
* returns a vector of the most frequent items in the input.
*
* @param result A vector which will be filled with the most frequent
* items, in descending order of frequency. Items with
* the same frequency will be sorted in ascending
* alphabetical order.
*
* @param items The map from string to frequency, from which the most
* frequent items will be selected.
*
* @param maxitems The maximum number of items to return.
*/
static void
get_most_frequent_items(vector<StringAndFrequency> & result,
const map<string, doccount> & items,
size_t maxitems)
{
result.clear();
result.reserve(maxitems);
StringAndFreqCmpByFreq cmpfn;
bool is_heap(false);

for (map<string, doccount>::const_iterator i = items.begin();
i != items.end(); ++i) {
Assert(result.size() <= maxitems);
result.push_back(StringAndFrequency(i->first, i->second));
if (result.size() > maxitems) {
// Make the list back into a heap.
if (is_heap) {
// Only the new element isn't in the right place.
push_heap(result.begin(), result.end(), cmpfn);
} else {
// Need to build heap from scratch.
make_heap(result.begin(), result.end(), cmpfn);
is_heap = true;
}
pop_heap(result.begin(), result.end(), cmpfn);
result.pop_back();
}
}

if (is_heap) {
sort_heap(result.begin(), result.end(), cmpfn);
} else {
sort(result.begin(), result.end(), cmpfn);
}
}

void
ValueCountMatchSpy::operator()(const Document &doc, weight) {
Assert(internal.get());
++(internal->total);
string val(doc.get_value(internal->slot));
if (!val.empty()) ++(internal->values[val]);
}

TermIterator
ValueCountMatchSpy::values_begin() const
{
Assert(internal.get());
return Xapian::TermIterator(new ValueCountTermList(internal.get()));
}

TermIterator
ValueCountMatchSpy::top_values_begin(size_t maxvalues) const
{
Assert(internal.get());
AutoPtr<StringAndFreqTermList> termlist(new StringAndFreqTermList);
get_most_frequent_items(termlist->values, internal->values, maxvalues);
termlist->init();
return Xapian::TermIterator(termlist.release());
}

MatchSpy *
ValueCountMatchSpy::clone() const {
Assert(internal.get());
return new ValueCountMatchSpy(internal->slot);
}

string
ValueCountMatchSpy::name() const {
return "Xapian::ValueCountMatchSpy";
}

string
ValueCountMatchSpy::serialise() const {
Assert(internal.get());
string result;
result += encode_length(internal->slot);
return result;
}

MatchSpy *
ValueCountMatchSpy::unserialise(const string & s, const Registry &) const
{
const char * p = s.data();
const char * end = p + s.size();

valueno new_slot;
decode_length(&p, end, new_slot);
if (p != end) {
throw NetworkError("Junk at end of serialised ValueCountMatchSpy");
}

return new ValueCountMatchSpy(new_slot);
}

string
ValueCountMatchSpy::serialise_results() const {
LOGCALL(REMOTE, string, "ValueCountMatchSpy::serialise_results", NO_ARGS);
Assert(internal.get());
string result;
result += encode_length(internal->total);
result += encode_length(internal->values.size());
for (map<string, doccount>::const_iterator i = internal->values.begin();
i != internal->values.end(); ++i) {
// Length prefix restored: merge_results() decodes a length before each value.
result += encode_length(i->first.size());
result += i->first;
result += encode_length(i->second);
}
RETURN(result);
}

void
ValueCountMatchSpy::merge_results(const string & s) {
LOGCALL_VOID(REMOTE, "ValueCountMatchSpy::merge_results", s);
Assert(internal.get());
const char * p = s.data();
const char * end = p + s.size();

Xapian::doccount n;
decode_length(&p, end, n);
internal->total += n;

map<string, doccount>::size_type items;
decode_length(&p, end, items);
while (p != end) {
while (items != 0) {
size_t vallen;
decode_length_and_check(&p, end, vallen);
string val(p, vallen);
p += vallen;
doccount freq;
decode_length(&p, end, freq);
internal->values[val] += freq;
--items;
}
}
}

string
ValueCountMatchSpy::get_description() const {
string d = "ValueCountMatchSpy(";
if (internal.get()) {
d += str(internal->total);
d += " docs seen, looking in ";
d += str(internal->values.size());
d += " slots)";
} else {
d += ")";
}
return d;
}

//add zkb
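// As more aggregate functions are added, the chance that a single switch/case hits directly gets lower

// Data record: all the information for one group is stored here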
struct SinagleGroupItem{
std::string m_max_str;
std::string m_min_str;
double m_sum_d;
int m_count_i;

SinagleGroupItem() {
m_max_str.clear();
m_min_str.clear();
m_sum_d = 0.0;
m_count_i = 0;
};
};

// Calling through a member-function pointer costs an indirection, but the choice of operation only has to be made once
typedef void (GroupMatchSpy::Internal::*compare_fun)(SinagleGroupItem &,
std::string &);

class GroupMatchSpy::Internal : public Xapian::Internal::RefCntBase {
public:
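// Whether each aggregate is currently in effect
bool m_is_max_effective;
bool m_is_sum_effective;
bool m_is_min_effective;
bool m_is_count_effective;

static const int INIT_CONUNT = 1;
// Slot holding the value to aggregate
Xapian::valueno m_slot;

// Slots of the fields to group by
std::set<Xapian::valueno> m_group_slot_set;

// Cap on the number of groups
const int m_limit_i;

// Member-function pointers for the selected operations
std::vector<compare_fun> m_compare_vec;

// The grouping lookup uses a map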
std::map<std::string, SinagleGroupItem> m_total_map;
std::map<std::string, SinagleGroupItem>::iterator m_total_iter;

Internal(Xapian::valueno slot, Xapian::valueno group_slot, const int op, const int limit)
: m_is_max_effective(false)
, m_is_sum_effective(false)
, m_is_min_effective(false)
, m_is_count_effective(false)
, m_slot(slot)
, m_limit_i(limit)
{
m_total_map.clear();
m_compare_vec.clear();
m_group_slot_set.clear();

m_group_slot_set.insert(group_slot);
// Register the operations selected by op
add_calculation(op);
}

// Interface for adding another grouping field
void add_group_slot(Xapian::valueno slot_) {
m_group_slot_set.insert(slot_);
}

// Check whether the string is purely numeric; on success the parsed value is stored in d
bool is_num_string(std::string& str, double& d) {
std::stringstream jud(str);
char c;

if (!(jud >> d)) {
return false;
}
else if (jud >> c) {
return false;
}
else {
return true;
}
}

// Maximum
void set_max_value(SinagleGroupItem& key, std::string& val) {
if (val > key.m_max_str) {
key.m_max_str = val;
}
}

// Minimum
void set_min_value(SinagleGroupItem& key, std::string& val) {
if (val < key.m_min_str) {
key.m_min_str = val;
}
}

// Sum
void set_sum_value(SinagleGroupItem& key, std::string& val) {
double val_num = 0.0;
if (is_num_string(val, val_num)) {
key.m_sum_d += val_num;
}
else {
m_is_sum_effective = false;
// Insertion order guarantees that sum is stored last
m_compare_vec.pop_back();
}
}

// An unnamed dummy parameter keeps the signature consistent with the functions above; counts the documents in each group
void set_count_value(SinagleGroupItem& key, std::string&) {
key.m_count_i++;
}

// The actual per-document work. The map iterator is only modified here and is
// not changed afterwards, so later traversal should be safe.
void deal_group_fun(const Xapian::Document &doc) {
    std::string key_tag;
    for (std::set<Xapian::valueno>::const_iterator i = m_group_slot_set.begin(); i != m_group_slot_set.end(); ++i) {
        // Prefix each value with its length so the combined key is unambiguous;
        // an empty value is packed as length 0.
        pack_string(key_tag, doc.get_value(*i));
    }
    std::string val(doc.get_value(m_slot));

    m_total_iter = m_total_map.find(key_tag);
    if (m_total_iter == m_total_map.end()) {
        if (m_limit_i > (int)m_total_map.size()) {
            // First document of a new group: insert once and reuse the
            // reference instead of indexing the map repeatedly. An empty
            // value still claims the entry.
            SinagleGroupItem &item = m_total_map[key_tag];
            item.m_max_str = val;
            item.m_min_str = val;
            item.m_count_i = INIT_CONUNT;

            // Only parse the number while sum is still wanted. If one value
            // is not numeric, sum is treated as invalid for all groups.
            if (m_is_sum_effective) {
                double val_num = 0.0;
                if (is_num_string(val, val_num)) {
                    item.m_sum_d = val_num;
                }
                else {
                    m_is_sum_effective = false;
                    m_compare_vec.pop_back();
                }
            }
        }
    }
    else {
        // The loop dispatches each selected operation directly, whereas a
        // switch/case would need several checks per document.
        for (size_t i = 0; i < m_compare_vec.size(); i++) {
            (this->*m_compare_vec[i])((m_total_iter->second), val);
        }
    }
}

void add_calculation(const int op_) {
    m_compare_vec.clear();

    const int all_ops = Xapian::GroupMatchSpy::MAX_VAL | Xapian::GroupMatchSpy::MIN_VAL |
                        Xapian::GroupMatchSpy::SUM_VAL | Xapian::GroupMatchSpy::COUNT_VAL;
    // Reject an empty selection or any bit outside the four known flags.
    if (op_ == 0 || (op_ & ~all_ops) != 0) {
        throw Xapian::InvalidArgumentError("invalid group by op ArgumentError add zkb");
    }

    // Test each flag once instead of enumerating all 15 combinations.
    // The push order is max, min, count, sum; sum must stay last because
    // set_sum_value() removes itself with pop_back().
    if (op_ & Xapian::GroupMatchSpy::MAX_VAL) {
        m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_max_value);
        m_is_max_effective = true;
    }
    if (op_ & Xapian::GroupMatchSpy::MIN_VAL) {
        m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_min_value);
        m_is_min_effective = true;
    }
    if (op_ & Xapian::GroupMatchSpy::COUNT_VAL) {
        m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_count_value);
        m_is_count_effective = true;
    }
    if (op_ & Xapian::GroupMatchSpy::SUM_VAL) {
        m_compare_vec.push_back(&Xapian::GroupMatchSpy::Internal::set_sum_value);
        m_is_sum_effective = true;
    }
}
}; // the trailing semicolon is required

GroupMatchSpy::GroupMatchSpy(Xapian::valueno slot, Xapian::valueno group_slot, const int op, const int limit)
: internal(new GroupMatchSpy::Internal(slot, group_slot, op, limit)) {
static const int UPPER_LIMIT = 1000000;
static const int LOWER_LIMIT = 1;
if (limit > UPPER_LIMIT || limit < LOWER_LIMIT) {
throw Xapian::InvalidArgumentError("Beyond the upper limit 1000000 or Below the lower limit 1");
}
}

GroupMatchSpy::~GroupMatchSpy() {

}

GroupMatchSpy::GroupMatchSpy(const Xapian::GroupMatchSpy & other) : internal(other.internal){

}

void GroupMatchSpy::add_group_slot(Xapian::valueno slot_) {
// Assert(internal.get() != 0);
internal->add_group_slot(slot_);
}

void GroupMatchSpy::operator()(const Xapian::Document &doc, Xapian::weight wt) {
// Assert(internal.get() != 0);
(void)wt;
internal->deal_group_fun(doc);
}

// Iterator-style access interface exposed to callers; declared const to restrict these methods from modifying member variables
SinagleGroupIterator GroupMatchSpy::begin() const {
Assert(internal.get() != 0);
// Point the shared iterator at the first group
internal->m_total_iter = internal->m_total_map.begin();
return SinagleGroupIterator(0, *this);
}

SinagleGroupIterator GroupMatchSpy::end() const {
Assert(internal.get() != 0);
return SinagleGroupIterator(internal->m_total_map.size(), *this);
}

size_t GroupMatchSpy::size() {
Assert(internal.get() != 0);
return internal->m_total_map.size();
}

bool GroupMatchSpy::empty() {
Assert(internal.get() != 0);
return internal->m_total_map.empty();
}

bool GroupMatchSpy::has_max_val() {
return internal->m_is_max_effective;
}

bool GroupMatchSpy::has_min_val() {
return internal->m_is_min_effective;
}

bool GroupMatchSpy::has_sum_val() {
return internal->m_is_sum_effective;
}

bool GroupMatchSpy::has_count_val() {
return internal->m_is_count_effective;
}

std::string GroupMatchSpy::get_description() const {
return "Xapian::SinagleGroupMatchSpy get_description add by zkb";
}

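SinagleGroupIterator & SinagleGroupIterator::operator++() {
    // Both conditions must hold: the internals exist and we are not already at the end.
    Assert(m_spy.internal.get() != 0 && (m_spy.internal->m_total_map.size() != size_t(m_index_i)));
    m_index_i++;
    // Using the shared iterator directly is not pretty, but it is the best approach found so far for efficiency
    m_spy.internal->m_total_iter++;
    return *this;
}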

void SinagleGroupIterator::get_group_name(std::vector<std::string>& group_name) {
Assert(m_spy.internal.get() != 0);

const char* start = m_spy.internal->m_total_iter->first.data();
const char* end = start + m_spy.internal->m_total_iter->first.size();

// Decode the group names; stop if the packed key turns out to be malformed
std::string tmp;
while (start != end) {
if (!unpack_string(&start, end, tmp)) break;
group_name.push_back(tmp);
}
}

const std::string& SinagleGroupIterator::get_max_val() {
Assert(m_spy.internal.get() != 0);
// As long as no elements are added or removed, the iterator should remain valid
return (m_spy.internal->m_total_iter->second.m_max_str);
}

const std::string& SinagleGroupIterator::get_min_val() {
Assert(m_spy.internal.get() != 0);
return (m_spy.internal->m_total_iter->second.m_min_str);
}

double SinagleGroupIterator::get_sum_val() {
Assert(m_spy.internal.get() != 0);
return (m_spy.internal->m_total_iter->second.m_sum_d);
}

int SinagleGroupIterator::get_count_val() {
Assert(m_spy.internal.get() != 0);
return (m_spy.internal->m_total_iter->second.m_count_i);
}
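In the listing above, every SinagleGroupIterator reads through the single m_total_iter stored in the spy's internals, so two iterators over the same spy move each other. One possible alternative, shown here as a stdlib-only sketch with hypothetical names (`Stats`, `GroupMap`, `GroupCursor`), is to let each iterator own its position in the map. It is an illustration of the idea, not a drop-in replacement for the posted class.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-ins for the per-group record and the group map.
struct Stats {
    std::string max_val;
    int count;
    Stats() : count(0) {}
};
typedef std::map<std::string, Stats> GroupMap;

// Each cursor holds its own map iterator, so advancing one cursor
// never changes what another cursor sees.
class GroupCursor {
    GroupMap::const_iterator it_;
    GroupMap::const_iterator end_;
public:
    GroupCursor(GroupMap::const_iterator it, GroupMap::const_iterator end)
        : it_(it), end_(end) {}

    GroupCursor& operator++() {
        assert(it_ != end_);  // must not step past the end
        ++it_;
        return *this;
    }

    const std::string& key() const {
        assert(it_ != end_);
        return it_->first;
    }

    const Stats& stats() const {
        assert(it_ != end_);
        return it_->second;
    }

    friend bool operator==(const GroupCursor& a, const GroupCursor& b) {
        return a.it_ == b.it_;
    }
    friend bool operator!=(const GroupCursor& a, const GroupCursor& b) {
        return !(a == b);
    }
};

inline GroupCursor cursor_begin(const GroupMap& m) {
    return GroupCursor(m.begin(), m.end());
}

inline GroupCursor cursor_end(const GroupMap& m) {
    return GroupCursor(m.end(), m.end());
}
```

Stepping a map iterator stays as cheap as in the original design, so the reason given for avoiding a find() per access still holds.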
[hr]
I would appreciate any pointers on how to speed up group by in Xapian, or good ideas we could discuss together. My email: zhoukuanbin@163.com