Home > A Survey of Bitmap Index Compression Algorithms for Big Data

A Survey of Bitmap Index Compression Algorithms for Big Data

Page 1
TSINGHUA SCIENCE AND TECHNOLOGY ISSNll1007-0214ll0?/??llpp???-??? Volume 20, Number 1, February 2015
A Survey of Bitmap Index Compression Algorithms for Big Data
Zhen Chen , Yuhao Wen, Junwei Cao, Wenxun Zheng, Jiahui Chang, Yinjun Wu, Ge Ma, Mourad Hakmaoui, and Guodong Peng
Abstract: With the growing popularity of Internet applications and the widespread use of mobile Internet, Internet traffic has maintained rapid growth over the past two decades. Internet Traffic Archival Systems (ITAS) for packets or flow records have become more and more widely used in network monitoring, network troubleshooting, and user behavior and experience analysis. As one of three key technologies in ITAS, we focus on bitmap index compression algorithm and give a detailed survey in this paper. The current state-of-the-art bitmap index encoding schemes include: BBC, WAH, PLWAH, EWAH, PWAH, CONCISE, and COMPAX. Based on differences in segmentation, chunking, merge compress, and Near Identical (NI) features, we provide a thorough categorization of the state- of-the-art bitmap index compression algorithms. We also propose some new bitmap index encoding algorithms, such as SECOMPAX, ICX, MASC, PLWAH+ and present the state diagrams for their encoding algorithms. We then evaluate their CPU and GPU implementations with a real Internet trace from CAIDA. Finally, we summarize and discuss the future direction of bitmap index compression algorithms. Beyond the application in network security and network forensic, bitmap index compression with faster bitwise-logical operations and reduced search space, are widely used in analysis in genome data, geographical information system, graph databases, image retrieval, Internet of things etc. It is expected that bitmap index compression will thrive and prosperous again in Big Data era since 1980s. Key words: Internet traffic; big data; traffic archival; network security; bitmap index; bitmap compression algorithm Zhen Chen and Junwei Cao are with the Research Institute of Information Technology, Tsinghua University, Beijing 100084, China. E-mail: fzhenchen, jcaog@tsinghua.edu.cn. Yuhao Wen is with Department of Electronic Engineering, Tsinghua University, Beijing 100084, China. Wenxun Zheng is with Department of Physics, Tsinghua University, Beijing 100084, China. Jiahui Chang is with Department of Aerospace Engineering, Tsinghua University, Beijing 100084, China. Guodong Peng is with Department of Mechanical Engineering, Tsinghua University, Beijing 100084, China. Yinjun Wu and Ge Ma are with Department of Automation, Tsinghua University, Beijing 100084, China. Mourad Hakmaoui is with Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China. Zhen Chen, Yuhao Wen, Wenxun Zheng, Jiahui Chang, Guodong Peng, Yinjun Wu, Ge Ma, Mourad Hakmaoui, and Junwei Cao are also with Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing 100084, China. To whom correspondence should be addressed. Manuscript received: 2014-12-05; accepted: 2014-12-28
1 Introduction
1.1 Big data The Internet has brought with its access to enormous quantities of rapidly changing data in all fields. As the leading search engine, Google provides powerfully customizable search capabilities to individuals[1]. Google records each users�� search behavior, including their Web access path and the access time for each page. The system is able to process more than 34000 requests per second. Important scientific events can generate staggering quantities of data, at enormous rates of growth. For example, the experiments of the European Large Hadron Collider (LHC) produced more than 15 petabytes of data at up to 1.5 gigabytes per second[2]. Network monitors, communication services, sensor networks, and financial services produce unlimited continuously streaming data, growing in real

Page 2
2 Tsinghua Science and Technology, February 2015, 20(1): 000-000
time. Streaming data is characterized by its growing data volume. Traditional relational database has been unable to meet the requirements of the storage, index, query and analytics of the growing streaming data which is critical to make timely decisions. 1.2 Internet traffic data With the commercial popularity of Internet applications and mobile wireless networks, Internet traffic is growing at an accelerating pace. A report from Cisco[3] predicts that Internet traffic will grow four- fold from 2011 to 2016, and reach 1.3 zettabytes (a zettabyte is one trillion gigabytes) in 2016. On the Internet, information is transmitted in packets, and transmitted along different paths in one or more networks, and reassembled at the destination. Internet data transmission is based on TCP/IP protocol, which
Fig. 1 Pcap file format.
divides the network packets into IP packets, TCP/UDP packets, and application packets due to different information they contain. Figure 1 shows the format of original data acquired from a physical link to the Internet. Pcap (Packet capture) is an internet ��packet sniffing�� API, which stores captured data as Pcap files. Raw network packets are stored sequentially in Pcap files. The Pcap file header describes the properties of Pcap files (e.g., file size). Packet header contains the description information (e.g., timestamps and size). The packet payload includes the contents of the complete TCP / IP network packets, see Fig. 2. Figure 2a shows an IP packet format, including packet header and payload. It has the following fields: version (4 bits), header length (4 bits), type of services (8 bits), total length (16 bits), identification (16 bits), flags (3 bits), fragment offset (13 bits), Time To Live (TTL) (8 bits), protocol (8 bits), header checksum (16 bits), source IP address (32 bits), and destination IP address (32 bits). Figure 2b shows the format of a TCP header formats, including the packet header and payload. It has the following fields: source port number (16 bits),
Fig. 2 TCP/IP packet format.

Page 3
Zhen Chen et al.: A Survey of Bitmap Index-Compression Algorithms for Big Data 3
destination port number (16 bits), sequence number (32 bits), acknowledgement number (32 bits), data offset (4 bits), reserved (6 bits), emergency bit URG, confirm bit ACK, reset bit RST, synchronization bit SYN, termination bit FIN, window size (16 bits), checksum (16 bits), urgent pointer (16 bits), option field, and padding field. Network flow records describe network message and transmission characteristics. The record can describe a User Datagram Protocol (UDP) connection or a Transmission Control Protocol (TCP) connection. The flow usually refers to a quintuple of the source IP address, source port, destination IP address, destination port, and protocol. A network flow record structure obtained from an Internet router is shown in Table 1. It has the following fields: source Autonomous System (AS) number, destination AS number, start and end time, and number of network packets. This is the Netflow V5 flow record format, developed by CISCO; it is the most popular flow record format. Netflow determines which flow packets belong to based on seven fields: source IP address, destination IP address, source port number, target port number, third-layer protocol type, TOS byte (DSCP), network equipment input (or output), and logical network port (ifIndex). CISCO��s Netflow V9, also known as IPFIX (IP Flow Information Export), became an IETF standard. (The Internet Engineering Task Force develops and promotes voluntary Internet standards.) It defines how IP flow information is to be formatted and transferred from an exporter to a collector. Mobile Internet operators obtain Call Detail Records (CDR) traffic information, their format shown in Table 2. It includes terminal attributes, location area code, CI number, service gateway, start time, end time,
Table 1 Network flow record structure (Router). Field Remark SrcIP Source IP address SrcPort Source port DstIP Destination IP DstPort Destination port Proto Layer 3 protocol (e.g., TCP UDP) TCPflags Cumulative TCP flags (e.g., SYN ACK FIN) SrcAS Source autonomous system number DestAS Destination autonomous system number Octs # Bytes exchanged in the flow Pkts # Packets exchanged in the flow Start Flow start time Duration Flow duration others Table 2 Flow record format (wireless base station). Field Remark Cell phone number May not contain a prefix such as +86,0086,86 Location area code LAC CI number Select the first CI when a network switches Terminal IMEI/ IMSI Start time YYYY-MM-DD HH:MM:SS.1234567 Duration (s) RAT type 1 represents 3G, 2 represents 2G Traffic (Byte) Upstream/ Downstream/total Terminal IP Traffic( in bytes) Source port/ Destination port APN 3gwap,3gnet,uniwap,uninet,cmwap,cmnet SGSN/ GGSN IP First access IP Http protocol User Agent/Content-type/URL/Status Code Others
TCP/IP information, HTTP application information, etc. 1.3 Traffic archiving and retrieval Network monitoring has been the core function of network management, network fault diagnosis, and network security. In addition to real-time firewalls and intrusion detection systems, traffic-archiving systems are important to network forensic. An Internet Traffic- Archiving System (ITAS) captures packet or flow records for subsequent analysis and processing. Such systems have many important applications. In addition, statistics from China Unicom China Unicom[4] shows that current mobile Internet users daily generate more than 30 billion records and 8.4 TB of data in the telecommunications business, resulting in trillions of records, requiring petabyte storage capacity. With mobile Internet users doubling about every six months, the number of Internet records they generate will further increase[5]. Network security has become a major challenge, and technologies such as intrusion detection systems, signature detection, and security scanning technology have arisen to meet it[6]. But Internet data volumes have exceeded current real-time detection and analysis capabilities. It has therefore become essential to capture Internet traffic for forensic analysis[7–9]. This paper is organized as follows: The structure and function of traffic archive systems is introduced

Page 4
4 Tsinghua Science and Technology, February 2015, 20(1): 000-000
in Section 2. Section 3 describes the key technologies of packet capture, compression, and bitmap index encoding of network flow archiving systems. We conclude the paper and discuss future work in Section 4.
2 ITAS
2.1 ITAS structures and functions Internet traffic archiving systems, or ITAS, typically include flow data acquisition, index storage, and query processing. There are two categories of flow data- acquisition, corresponding to two types of data: (1) packet-level network data and (2) flow-level network data. A typical traffic-archiving system structure is shown in Fig. 3. Figure 3 describes a typical traffic archiving system. After being captured, network data can be routed to an index module in real time, or stored to disk for subsequent processing. The index module updates the index with the new data, typically compressing it to reduce storage requirement. The indices are stored by rows or columns after indices compression. In order to reduce storage overhead, we adopt the method of storage compression in the process of , such as zip etc. When a query arrives, the query processor calls the current index and returns the corresponding result. 2.2 Key technologies in ITAS 2.2.1 High-performance packet/flow capture technologies Internet traffic��s volume and velocity are very large, and still growing. There are tens of millions of 64- byte packets to deal with each second and similar orders of magnitude of flow records from routers. In a 10Gbps link, for example, the network will reach 14 million packets per second with 64 bytes per packet. Even after polymerization, there are still millions of flow records. Though only dealing with flow data, a backbone router��s link of telecom operators would generate millions of records per second which will reach 30 billion per day[10–13]. Therefore, how to capture and store high arriving speed of packets and streams in real time is a major challenge. 2.2.2 High-performance packet/flow storage technologies In many ITAS, packets and flow records are stored in relational databases by rows, this consumes much storage and exhibits inefficient retrieval characteristics. Current systems have begun to store the data in columns, compressing it to reduce storage overhead. A popular compression method is LZO (Lempel-Ziv-Oberhumer), known for its focus on performance. Other new methods such as RasterZip and BreadZip[14] include information-flow optimization. 2.2.3 High-performance packet/flow indexing technologies The accepted approach to indexing large volume of data is to employ an inverted index, which is widely used, for example, in search engines. However, a more efficient approach is using bitmap index which is a special case of inverted index also widely used by search engines. Bitmap Index is very efficient in dealing with the network flow queries, especially forensic analysis. The ��index space�� for an enormous datastore will explode by huge; consequently, companies such as Google have paid much attention to its compression[15, 16]. In order to create an efficient index, the general index space should be fragmented (i.e., shard or segmentation), and the index file should be compressed at the same time. Bitmap compression
Fig. 3 The structure of an Internet traffic-archiving system.

Page 5
Zhen Chen et al.: A Survey of Bitmap Index-Compression Algorithms for Big Data 5
algorithm is an effective method for compressing the bitmap index, to solve the index-space explosion issue. 2.3 Categories of ITAS We compare and summarize the features of Internet traffic-archiving systems in Table 3. Raw data can be stored in a database by rows or by columns. In bitmap indexes, the usual choice is ��by columns��. For example, NET-FLi, an efficient on- the-fly compression, archiving and indexing system for streaming network traffic, stores data in columns, and uses the LZO compression algorithm to reduce the storage size. Traffic archiving entails packet capture, storage, indexing, and querying. Different traffic- archiving systems focus on different aspects. This paper takes the bitmap index compression algorithm as a starting point for understanding traffic-archiving systems.
3 Key Technologies of Bitmap Index Compression Coding
Efficient indexing of network packets or flow is central to traffic archiving systems. Indexing of traffic data has the following characteristic: (1) Large volume of data: The number of index messages is massive, even for brief periods. (2) High rates of incoming data: To keep up with the rate of packet influx, systems must be highly efficient. (3) Fixed data structure: The index information for each network packet has a fixed format with a fixed length. (4) Appending without modification: Network packet index information will increase only. Once the information is generated, it can��t be changed. (5) High redundancy: Data items of network data are frequently repeated. Due to the characteristics of network flow data, relational databases are not appropriate for this task. Traditional relational databases are optimized for handling frequent changes. They use B or B+ trees, which are not particularly suitable for indexing traffic data. Another common technology of large-data retrieval systems is the inverted index. The core data structure of the inverted index is the posting list. Sequence of integers of a KEY is stored in inverted lists, such as a timestamp and offset, and the most typical KEY is the position list, which shows the position where a word appears. Due to the characteristics of network flow data, net flow archiving systems use bitmap indexing rather than inverted indexes. 3.1 Introduction to bitmap indexing The concept of bitmap index was first introduced by Spiegler and Maayan in 1985[29]. The first commercial database product was published to implement a bitmap index in 1987[30–35]. It uses a sequence of bits to indicate the the presence or absence of an item in the indexed data. With bitmap indexing, logical operations, such as AND, OR, NOT and XOR etc., can be used to answer complex queries. Bitmap index once was designed for scientific data and database, which is usually generated by scientific instruments or scientific simulation. Scientific data are extremely large and without further modification change. Bitmap index databases solve the problem on how to quickly identify a small amount of selected data in a mass of scientific data, while the traditional relational database is not suitable for this work. The technologies used in bitmap index databases are bitmap indexing, bitmap compression, and classification. In bitmap index database, the data are stored in columnar and a corresponding bitmap index is therein. A bitmap index example is shown in Table 4.
Table 3 A comparison of different traffic-archiving systems. Data format Indexing Storage Packet Flow record Bitmap Hash Tree Column based Row based TelegraphCQ[17] No Yes PostgreSQL(DBS) No No Gigascope[18] No Yes No No Yes No No MIND[19] No Yes No Yes No No No Hyperion[20] Yes No BloomFilter Multi-level indexing No No No FastBit[21]+TelegraphCQ No Yes WAH, PostgreSQL(DBS) Yes No TimeMachine[22] Yes No No Yes No No Yes NET-FLi No Yes COMPAX No No LZO No NetStore[23] No Yes No No No Segmentation No RASTERZip[24] No Yes Yes No No Yes No PcapIndex[25] Yes No COPMAX No No Yes No TifaFlow[26–28] Yes No Yes No No No No

Page 6
6 Tsinghua Science and Technology, February 2015, 20(1): 000-000 Table 4 An example of a bitmap index which consists of six tuples (rows). The column as a attribute can have one of four values, 0, 1, 2, and 3. Row Column Bitmap index =1 =2 =3 =4 1 1 1 0 0 0 2 2 0 1 0 0 3 4 0 0 0 1 4 3 0 0 1 0 5 2 0 1 0 0 6 4 0 0 0 1
Currently, classical bitmap index compression algorithms are Byte-aligned Bitmap Compression (BBC)[36], Word-Aligned Hybrid (WAH)[37, 38], Position List Word-Aligned Hybrid (PLWAH)[39], Enhanced Word-Aligned Hybrid (EWAH)[40], COmpressed N Composable Integer SEt (CONCISE)[41], Partitioned Word-Aligned Hybrid (PWAH)[42], COMPressed Adaptive indeX (COMPAX)[43], UCB[44], VLC[45], VAL-WAH[46], RoaringBitmap[47], RLH[48], and DFWAH[49]. There are improved versions of PLWAH (PLWAH with adaptive counter), such as PLWAH+[50] and APLWAH. An improved version of COMPAX (COMPAX + oLSH and COMPAX2) has also presented. Figure 4 shows
Fig. 4 The advancement of bitmap index compression algorithm.
the various bitmap index compression algorithms in chronological order. 3.2 Categories of bitmap index compression algorithms To summarize the characteristics of bitmap index compression algorithms, we compare the bitmap index compression algorithms based on different dimensions, i.e. Segmentation, Chunking, Merge Compress, and Near Identical (NI), as shown in Table 5. 3.3 Bitmap index compression algorithms 3.3.1 BBC Antoshenkov[36] proposed the byte-aligned bitmap compression method in 1995, the bitmap bytes are classified as gaps and maps. Gaps containing only 0��s or only 1��s and maps containing a mixture of both. Continuous gaps are encoded by their byte length and a fill bit is used to differ 0 and 1 gaps. Map bytes are directly compiled after the control byte without encoding. A pair (gap, map) is encoded into a single atom which is composed as ��control byte + gap length + map��. Control byte can be divided into four categories: (1) The bitmap bytes which contain 0-3 gaps and 0- 15 maps can be encoded as follows: 1 [fill bit] [fill length (2 bits) ] [tail length (4 bits)]. A fill bit is used to represent all 0��s or all 1��s; fill length is used to represent the length of gap; and tail length is used to indicate the length of a map. (2) Bitmap bytes that contains 0-3 gaps, with 1 bit different between the gap and the map, can be encoded as follows: 01 [fill bit] [fill length (2 bits) ] [odd bit position (3 bits)]. ��Odd-bit position�� is the different 1 bit (dirty bit) position between the map and the gap. (3) Bitmap bytes that contain more than 3 gaps and
Table 5 Bitmap index encoding algorithms. Methods Segmentation Chunking (bit) Merge Compress Near Identical Piggyback LFL/FLF Dirty Bits Dirty Bytes NI-0 NI-1 NI-0 NI-1 BBC No 8 Yes No Yes Yes No No WAH No 31 No No No No No No EWAH No 31 No No No No No No PWAH No 63 Yes No No No No No PLWAH No Yes Yes No Yes No No No CONCISE No Yes Yes No Yes No No No COMPAX Yes 31 No Yes No No No Yes SECOMPAX Yes 31 Yes Yes No No Yes Yes ICX Yes 31 Yes Yes No No Yes Yes MASC Yes No No Yes No No Yes Yes PLWAH+ Yes No Yes Yes Yes Yes No No

Page 7
Zhen Chen et al.: A Survey of Bitmap Index-Compression Algorithms for Big Data 7
0-15 maps can be encoded as follows: 001 [fill bit] [tail length (4 bits)]. Gap lengths follow the control byte. (4) Bitmap bytes that contains more than 3 gaps with 1 bit different between the gap and the map can be encoded as follows: 0001 [fill bit] [odd bit position (3 bits)]. 3.3.2 WAH WAH is the default bitmap index compression algorithm in Fastbit database. It makes the bit sequences into 31 bits (63 bits for WAH 64) as a group. There are two types of group: original Literal, which contains 0 and 1 in 31 bits; and original Fill, which contains all 0��s or all 1��s in 31bits. Literal group: The type flag is 0; the remaining 31 bits are the original Literal group. Fill group: Divided into 1-Fill and 0-Fill. The type flag is 1, and the second bit is the sign of Fill type (0-Fill is 0, 1- Fill is 1). The remaining 30bits are a counter, which indicates the number of consecutive 0-Fill (or 1-Fill) groups. 3.3.3 PLWAH PLWAH groups the bit sequence into 31-bit units. There are Literal and Fill groups, but there are some changes in compression. In WAH, there are an extra flag to distinct Fill group and Literal group, and each group codes as 32 bits. For Fill groups, the lower 25 bits are a counter. By introducing the definitions of ��nearly identical�� and ��position list��, PLWAH further enriches the codebook, increasing the encoding types and improving compression efficiency. A slight improvement over PLWAH is achieved by the method of ��PLWAH + adaptive counter��. When there is a large number of consecutive 0��s or 1��s, so that a word can��t fully accommodate a counter, this method uses the same type of two consecutive Fill-word, making two counters into a single large counter. The first Fill-word records the 25 least-significant bits, and the second Fill-word records the 25 most-significant bits. Meanwhile, the position list of the first Fill-word is empty, and the position list of the second Fill-word is calculated as usual. 3.3.4 EWAH EWAH defines a 32-bit field that contains consecutive 0��s or 1��s as a ��clean�� segment and defines a 32- bit field that contains mixed 0��s and 1��s as a ��dirty�� segment. EWAH algorithm also uses two types of words, where the first type is a 32-bit verbatim word; the second type of word is a marker word, whose first bit indicates which clean word will follow, and 16 bits are used to store the number of clean words. The remaining 15bits are used to store the number of dirty words following the clean words. EWAH encoding bitmaps begin with a marker word, shown in Fig. 5. Because EWAH uses only 16 bits to store the number of clean words, it may be less efficient than WAH when there are many consecutive sequences of clean words. Three ways are proposed to alleviate this compression overhead over clean words. First, more than half of the bits can be allocated to encode the runs of clean words. Second, when a marker word indicates a run of 216 clean words, the next word will be used to indicate the number of remaining clean words. Finally, this compression penalty is less relevant when using 64- bit words instead of 32-bit words. When there are few dirty words in the bitmaps, WAH might be preferable. Even considering EWAH and WAH indexes of similar sizes, each EWAH marker word needs to be accessed three times to determine the running bits and two running lengths, while no word needs to be accessed more than twice with WAH. When there are lots of long runs of dirty words in the bitmaps, EWAH might be preferable. EWAH will access each dirty word at most once, while WAH needs to check the first bit of each dirty word to ascertain it is a dirty word. The EWAH decoder can skip a sequence of dirty words, while the WAH decoder must access them all. 3.3.5 CONCISE CONCISE is based on WAH. In a compressed 32- bit segment, when the leftmost bit is 1, the following 31bits are uncompressed; when the leftmost bit is 0, it means Fill, and the second bit indicates the type of Fill. The following 5bits are the position of a ��flipped�� bit within the first 31-bit block of the Fill. The remaining 25bits count the number of 31-blocks that compose the fill minus one. When position bits equal
Fig. 5 EWAH encoding method.

Page 8
8 Tsinghua Science and Technology, February 2015, 20(1): 000-000
0 (binary ��00000��), the word is a ��pure�� fill. Otherwise, position bits indicate the bit to switch within the first 31- bit block of the sequence represented by the Fill word. That means that 1 (binary ��00001��) indicates that it has to flip the rightmost bit, while 31 (binary ��11111��) indicates that it has to flip the leftmost bit. In Fig. 6, for example, words #0, #3, and #5 are Literal words, and words #1, #2, and #4 are Fill words. Word #0 is used to represent integers in the range 0–30; word #1 for integers in the range 31–92; and word #2 for integers in the range 93–1022. Word #2 also indicates that integers in the range 94–1022 are missing, but 93 is in the set, since position bits say that the first number of the ��missing numbers�� sequence is an exception. Word #3 represents for integers in the range 1023–1053, word #4 for integers in the range 1054– 1040187391, and word #5 for integers in the range 1 040 187 392–1 040 187 422. 3.3.6 PWAH Based on WAH, PWAH extends word length to 64 bits, and makes a word into P pieces, starting with a P- bit header, which is used to indicate the type of Fill or Literal. The length of Literal can be indicated flexibly to save space. Figure 7 shows the PWAH-4 bitmap encoding method. There are three kinds of PWAH algorithm, PWAH- 2, PWAH-4, and PWAH-8. PWAH-2 uses WAH compression coding. In the experiments, PWAH-8 is used more frequently than the others. Combined with the Nuutila algorithm[51], PWAH-8 can be effective in solving the accessibility query problem of large-scale
Fig. 6 CONCISE encoding method.
graphs. 3.3.7 COMPAX There are more codewords in COMPAX. The definitions of Literal and Fill are same as with WAH and PLWAH. Every 31 bits are a chunk, and the chunks are divided into four groups according to the following features: (1) Literal-Fill-Literal (LFL); (2) Fill-Literal-Fill (FLF); (3) Fill (F); and (4) Literal (L). The COMPAX+oLSH (online-Locality-Sensitive- Hashing) compression method is the same as that of COMPAX, but this compression method needing the reordering of input data stream in advance, which improves compression rate considerably. 3.3.8 SECOMPAX/ICX and MASC/PLWAH+ 3.3.8.1 New ideas in bitmap index encoding SECOMPAX/ICX/MASC/PLWAH+ are proposed based on COMPAX/PLWAH. SECOMPAX/ICX is based on COMPAX2, introduces the concept of NI-1 Literal, and extends the number of dirty bytes to 2; while MASC/PLWAH+ is based on PLWAH and introduces new features from run-length encoding. In terms of ��piggyback�� (the combination of two symbols) and ��FLF/LFL�� (the combination of three symbols), the current bitmap indexing encoding algorithm can be classified as two types. According to these categories, the state-of-the-art bitmap index compression algorithms are shown in Fig. 8. (1) Scope Extended COMPAX (SECOMPAX) SECOMPAX[52] is based on COMPAX2, an enhanced version of COMPAX. The differences are as follows: (a) Nearly Identical Literal (NI-L) SECOMPAX extends the ��dirty byte�� concept to consist of 0-NI-L and 1-NI-L, i.e., a Literal word can be nearly identical to a 0-Fill word or a 1-Fill word. This maintains symmetry in the codebook. (b) Extended LFL: In addition to the 0-NI-L + F + 0-NI-L type, there are three others: 0-NI- L + F + 1-NI-L, 1-NI-L + F + 0-NI-L, and 1-NI-L + F + 1-NI-L. Extended FLF: In addition to the 0F + 0-NI-L + 0F1F+0-NI-L+1F type, there are six others: 0F + 1- NI-L + 0F, 1F + 1-NI-L + 1F; 1F + 1-NI-L + 0F, 0F + 1-NI-L + 1F; 1F + 0-NI-L + 0F, and 0F + 0-NI-L + 1F. By the inclusion of new codewords, some kinds
Fig. 7 PWAH-4 bitmap encoding method.

Page 9
Zhen Chen et al.: A Survey of Bitmap Index-Compression Algorithms for Big Data 9 Fig. 8 New ideas in bitmap index encoding.
of origin sequences that COMPAX cannot further compress can be dealt with by SECOMPAX: (a) In the form of FLF, if two Fill words are of different types, or a Literal word is nearly identical to Fill word, COMPAX does not try to encode them as FLF. COMPAX encodes them as an F, an L, and an F, while SECOMPAX can encode them as FLF. Obviously, in this circumstance, the result encoded by COMPAX is twice as long as those encoded by SECOMPAX. (b) In the form of LFL, COMPAX can only deal with that both of the two Literal-words are nearly identical to 0-Fill words. If one or both of the Literal words are identical to a 1-Fill word, COMPAX encodes them as an L-word, an F-word, and an L-word (L+F+L, 3 words in total), not as LFL (1 word in total). However, SECOMPAX can encode all of those circumstances as LFL, which saves a great deal of room compared with COMPAX. (2) Improved COMPAX (ICX) ICX is proposed to further compress the bitmap by considering the possible two dirty-bytes cases, which are not considered in the COMPAX. And we extend PLWAH nearly identically concept to consider one or two dirty bytes case, and represent these cases with Nearly Identical bits in the new codebook. By this new codebook, more possible LF combinations can be encoded, and they can be easily compressed into one word. ICX can achieve a better compression ratio than COMPAX. In the cases where the number of 1��s is comparable to the number of 0��s, ICX performs especially better than both WAH and COMPAX, since they haven��t taken those cases into consideration. (3) MAximized Stride with Carrier (MASC) MASC is proposed to further improve the compression performance without impairing query performance. MASC uses as long a stride size as possible—not limited to 31bits as in PLWAH and COMPAX—to encode the consecutive zero bits and nonzero bits in a more compact way. MASC records origin bitmap index sequences into a new format. MASC is a new extended version of PLWAH. The concept ��carrier�� in MASC and ��piggybacked�� in PLWAH are similar. However, the carrier can carry at most 30 nonzero bits, while PLWAH can piggyback only a single nonzero bit. In addition, we generalize the concept of Literal word and eventually obsolete it. As a consequence, several (no more than 30) nonzero bits can be carried by the former 0-Fill word and output a 1- carried 0-Fill word, while PLWAH has to encode them in a Literal word, or two Literal words in the worst case, when consecutive nonzero bits are located in two adjacent chunks. Considering zero bits�� and nonzero bits�� distribution in real data sets, they usually appear in batches—especially after being sorted by the hash value of each record. Thus MASC can perform better than PLWAH. (4) Position List Word-Aligned Hybrid Plus

Page 10
10 Tsinghua Science and Technology, February 2015, 20(1): 000-000
(PLWAH+) We propose the PLWAH+ bitmap compression scheme based on PLWAH. First, we add the definition of an LF word that can piggyback more NI words, which is not considered in PLWAH. According to the experiment, with about a 20% reduction in the amount of Literal words, it can save 3% or more of the storage, compared to PLWAH. And the result shows that the PLWAH+ is more suitable for streaming network traffic. Furthermore, the definition of the NI Chunk is enlarged, which performs better in some cases, where the ratio of set bits is not at a low level. PLWAH+ further classifies NI into two kinds: Nearly Identical 0 Fill word (NI-0 word) and Nearly Identical 1 Fill word (NI-1 word). According to a number of tests, a prediction can be made that PLWAH+ is more suitable for complex databases than PLWAH. 3.3.8.2 Encoding algorithm state diagrams The state diagrams of SECOMPAX, ICX, MASC, and PLWAH+ are shown in Figs. 9–12. Figure 9 shows the state diagram of the SECOMPAX encoding algorithm. The states of the encoder represent the different word types that have been stored. Figure 10 introduces the encoding state diagram for ICX. As ICX handles new cases when the number of ��dirty bytes�� is 2, there are more edges than in SECOMPAX. Figure 11 shows the state diagram of the MASC encoding algorithm. As MASC encodes the maximu
A, Fill; B, NI-Fill; C, Literal
Fig. 9 Encoding algorithm of SECOMPAX. (x, y) means the next character is x, and the output is y except if y=null, output nothing, and if y is underlined, the output is a codeword as a whole.
A, Fill; B, NI-Fill; C, Literal
Fig. 10 Encoding automaton in ICX. Fig. 11 Encoding automaton in MASC. States 0: 0-Fill; 1: 1-Fill; and C: 1-carried-0-Fill (0-Fill word carried 1��s). The meaning of transfer edge (x, y, z) is as follows: x, input bit (0 or 1); y, output bit; z, the counter. If label (x, y) does not occur, the counter will increase by 1, by default. Fig. 12 Encoding automaton in PLWAH+. ��Start�� is the start state of a compression procedure, symbol ��F�� stands for codeword Fill word, symbol ��P�� stands for codeword NI word, and symbol ��L�� stands for codeword Literal word.

Page 11
Zhen Chen et al.: A Survey of Bitmap Index-Compression Algorithms for Big Data 11 Fig. 13 Bitmap index encoding with PLWAH/COMPAX2/ SECOMPAX/ICX/MASC.
length of consecutive 0��s or 1��s, the counter is not in multiples of 31, as in WAH. The PLWAH+��s FSM is shown in Fig. 12. The pair (x;y) labels the edge to stand for an action taken at a shift of states, which means when x is the input from a WAH result, the state moves along the corresponding edge and y is the output to the final result. In particular, if y equals ��null��, it outputs nothing. If y has one symbol like ��FP��, it outputs a combination of the two codewords, i.e., FL. If y has two symbols, such as ��F�� and ��L��, it outputs two codewords, ��F�� and ��L��. 3.3.8.3 Implementation and evaluation We evaluate SECOMPAX, ICX, MASC, and PLWAH+ with a real Internet traffic trace from CAIDA[53]. This Internet traffic trace was anonymized and captured from a core router by CAIDA at the end of 2013. There are 13581181 packets in this trace. First, we reorder packets with the mechanism based on the principle of locality-based hashing used in Ref. [54]. Then we convert five-tuples hSIP, sport, DIP, dport, protoi to bitmaps, and compress the bitmaps in each column with a fixed block size of 4Kb, which is also used in Ref. [54]. In these experiments, source IP (4 bytes), destination IP (4bytes), source port (2bytes), and destination port (2 bytes) are compressed with PLWAH, COMPAX, and SECOMPAX. For SrcIP, there are four bytes, and each byte expands to 256 columnar files. There are 1024 columnar files, which contain 13 581 181 bits (the number of packets in the trace). The situation is similar for the DstIP and Ports fields. Figure 13 shows the average size of a compressed columnar file with PLWAH, COMPAX2, and SECOMPAX, ICX, and MASC, where each original columnar file has 13 581 181 bits (the number of packets in the trace). Compared with PLWAH, COMPAX2 and SECOMPAX reduce the size of the index for a source IP address by 6.74%, and for a destination IP by 6.05%. While compared with COMPAX2, SECOMPAX reduces the size of the source IP by 4.01%, and destination IP by 3.97%. ICX and MASC show further reduction in compressing the bitmap files. A detailed comparison is illustrated for source IP (4 bytes), destination IP (4 bytes), source port (2 bytes), and destination port (2bytes) in Figs.14–17. From Figs.14–17, it is clear that SECOMPAX/ICX/MASC have a better compression ratio and smaller space consumption than COMPAX2 and PLWAH, especially in Byte 0 in SrcIP and DstIP. Compared with PLWAH, SECOMPAX can reduce the size of the index for SrcIP Byte 0 by 7.62% and DstIP by 8.32%. Among these, MASC has the best performance, as the improvement it shows reaches about 16%–18%. 3.3.8.4. GPU implementation Usually, the encoding and decoding operation in the process of compressing runs in a CPU. To further
Fig. 14 Size after compression using five encoding schemes in DstIP. Fig. 15 Size after compression using five encoding schemes in DstPort.

Page 12
12 Tsinghua Science and Technology, February 2015, 20(1): 000-000 Fig. 16 Size after compression using five encoding schemes in SrcIP. Fig. 17 Size after compression using five encoding schemes in srcPort.
accelerate the compression, we propose a GPU-based solution, to offload the indexing and parallelize the encoding. As the CPU is also responsible for overall system operation, in consideration of total performance issues, it is better to offload the bitmap indexing operation to a GPU. Andrzejewski and Wrembel[55, 56] introduced GPU-based WAH and PLWAH. However, those implementations could not avoid extending the original data into bitmap form before processing, which increased memory consumption and decreased performance. A new way to compress bitmaps without extending the original data was introduced in Ref. [40], which will be also used in this paper. Especially in Ref. [57], Fusco et al. evaluated the GPU-based WAH and PLWAH with a sequence of random integers to mimic the five-tuples of Internet trace, and prove the potential of a GPU to achieve the speed of indexing a million of packets per second. GPU-based SECOMPAX, ICX, and MASC algorithms are implemented with Thrust, a C++ library provided with the NVIDIA SDK, designed to enhance code productivity and, more importantly, performance and portability across NVIDIA GPUs. To evaluate the performance of our implementation, we used a similarly priced CPU and GPU: a 3.4-GHz Intel i7-2600K processor with 8 Mb of cache and an NVIDIA GTX-760 GPU, fitted in a PCI-e Gen 2.0 slot. The input data we used is an anonymous Internet trace data set from CAIDA of 13 581 810 packets too. We extract their five tuples (source IP, destination IP, source port, destination port, and protocol number). For simulating circumstances in practice, we cut those packets by 3968 (12831), and created bitmap indexes for 14 (bytes) 3968 one at a time. We construct and compress a bitmap index using GPU-MASC, and compare its result with the encoding result of MASC. The memory consumption is the same, and is shown in Fig. 18. However, GPU-MASC can build bitmap indexes for 12831 packets in 8.057 ms, while MASC takes 157.3ms, because MASC cannot encode in parallel on a CPU. Thus, GPU-MASC improves encoding speed by 19.5 times. Based on Fig. 19, the throughput of GPU-MASC is 492 491 packets per second. However, GPU-MASC constructs and encodes bitmap indexes for all 14 bytes in the five tuples for Internet trace packets, while other algorithms on GPU only construct two bytes of the five tuples, one at a time. So the equivalent throughput for
Fig. 18 Encoding-speed comparison between GPU-MASC and CPU-MASC. Fig. 19 Encoding time of GPU-WAH and GPU-MASC.

Page 13
Zhen Chen et al.: A Survey of Bitmap Index-Compression Algorithms for Big Data 13
GPU-MASC is 3 447 437 packets per second.
4 Outlooks
Bitmap Indexing is a powerful technique to accelerate ad-hoc query in Big Data. Bitmap index compression with potentially higher compression rates, faster bitwise-logical operations, and reduced search space, are booming research area since 1980s and thriving and prosperous again in Big Data era. Beyond scientific data management[58–60], novel bitmap data representation and compacting of index has a wide usage in many other area, e. g., biological network[61], gene context analysis[62], RFID-based Item Tracking[63], Relational XML Twig Query Processing[64], geographical data warehouses[65], Geographical Information System (GIS)[66], Graph Databases[67], Content-based image retrieval (CBIR)[68], inverted indexes in Search engines[70? ] and many other data analysis area etc.
5 Conclusions and Futures Works
We introduce and analyze the traffic-archiving systems and the key technologies of bitmap index compression. First, we provide an introduction to classical traffic-archiving systems in recent years; then we present a survey of bitmap index compression algorithms. We summarize bitmap index compression algorithms in terms of Segmentation, Chunking, Merge Compress, and Near Identical. We also propose some new bitmap encoding algorithms, such as SECOMPAX, ICX, MASC, and PLWAH+, and show their state diagrams for encoding procedures. We also evaluate their CPU and GPU implementations with a real Internet trace from CAIDA. Finally, we summarize and discuss the future direction of bitmap index compression algorithms. With the rapid growth of Internet traffic, traffic- archiving systems and bitmap index compression scheme research will face new challenges, which must be overcome to design more efficient high-speed indexing technologies. These improved bitmap index compression schemes will have important research value, which will provide powerful technical support for high-performance network-traffic archiving systems. Acknowledgements We are grateful to the teachers and students both in NSLAB and QoSlab. The authors would like to thank Prof. Jun Li of NSLAB from RIIT for his guidance. This work was supported by the National Key Basic Research and Development (973) Program of China (Nos. 2012CB315801 and 2011CB302805), the National Natural Science Foundation of China A3 Program (No. 61161140320), and the National Natural Science Foundation of China (No. 61233016). This work was also supported by Intel Research Council with the title of Security Vulnerability Analysis based on Cloud Platform with Intel IA Architecture. References
[1] Google, Web History, http://www.google.com/psearch, 2008.08.10. [2] Business Intelligence Lowdown, http://www. businessintelligeneelowdown.corn/2007/02/top 10 largest. html, 2008.08.10. [3] Cisco Visual Networking Index: Forecast and Methodology, 20132018, http://www.cisco.com/c/en/us/ solutions/collateral/serviceprovider/ip-ngn-ip-next- generationnetwork/white paper c11-481360.html, 2014.06.10. [4] W. Huang, Z. Chen, W. Dong, H. Li, B. Cao, and J. Cao, Mobile internet big data platform in China Unicom, Tsinghua Science and Technology, vol. 19, no. 1, pp. 95– 101, 2014. [5] Z. Chen, W. Huang, and J. Cao, Big Data Engineering for Internet Traffic, (in Chinese). Beijing, China: Tsinghua University Press, 2014. [6] R. Lin, B. Wu, F. Yang, Y. Zhao, and J. Hou, An efficient adaptive failure detection mechanism for cloud platform based on volterra series, Communications, China, vol. 11, no. 4, pp. 1–12, 2014. [7] Y. Meng, Z. Luan, and D. Qian, Differentiating data collection for cloud environment monitoring, Communications, China, vol. 11, no. 4, pp. 13–24, 2014. [8] L. Pu, J. Xu, B. Yu, and J. Zhang, Smart cafe: A mobile local computing system based on indoor virtual cloud, Communications, China, vol. 11, no. 4, pp. 38–49, 2014. [9] M. Li, A. Lukyanenko, S. Tarkoma, and A. Yla-Jaaski, MPTCP incast in data center networks, Communications, China, vol. 11, no. 4, pp. 25–37, 2014. [10] L. Rizzo, Device polling support for FreeBSD, http://info.iet.unipi.it/luigi/polling/, 2002. [11] L. Deri, Improving passive packet capture: Beyond device polling, in Proceedings of SANE, 2004, pp. 85–93. [12] L. Rizzo, Netmap: A novel framework for fast packet I/O, in Proceedings of the 2012 USENIX Conference on Annual Technical Conference, USENIX Association, 2012, p. 9. [13] A. Papadogiannakis, M. Polychronakis, and E. P. Markatos, Scap: Stream-oriented network traffic capture and analysis for high-speed networks, in Proceedings of the 2013 Conference on Internet Measurement Conference, 2013, pp. 441–454. [14] G. Ma, Z. Guo, X. Li, Z. Chen, J. Cao, Y. Jiang, and X. Guo, BreadZip: A combination of network traffic data and bitmap index encoding algorithm, in Systems, Man and Cybernetics (SMC), in 2014 IEEE International Conference on IEEE, 2014, pp. 3235–3240.

Page 14
14 Tsinghua Science and Technology, February 2015, 20(1): 000-000 [15] L.A.Barroso, J.Clidaras, andU.Hlzle, Thedatacenter as a computer: An introduction to the design of warehouse-scale machines, Synthesis Lectures on Computer Architecture, 2013, doi:10.2200/ S00516ED2V01Y201306CAC024. [16] J. Dean, Challenges in building large-scale information retrieval systems, in Proceedings of the Second International Conference on Web Search and Web Data Mining, WSDM 2009, Barcelona, Spain, 2009. [17] S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, and M. A. Shah, TelegraphCQ: Continuous dataflow processing, in Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data ACM, 2003, p. 668. [18] C. Cranor, T. Johnson, O. Spataschek, and V. Shkapenyuk, Gigascope: A stream database for network applications, in Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data ACM, 2003, pp. 647– 651. [19] X. Li, F. Bian, H. Zhang, C. Diot, R. Govindan, and W. H. Hong, Advanced indexing techniques for widearea network monitoring, in Data Engineering Workshops, 2005. 21st International Conference on IEEE, 2005, pp. 1184–1184. [20] P. Desnoyers and P. J. Shenoy, Hyperion: High volume stream archival for retrospective querying, in USENIX Annual Technical Conference, 2007, pp. 45–58. [21] K. Wu, FastBit: An efficient indexing technology for accelerating data-intensive science, Journal of Physics: Conference Series, vol. 16. no. 1, pp. 556–560, 2005. [22] G. Maier, R. Sommer, H. Dreger, A. Feldmann, V. Paxson, and F. Schneider, Enriching network security analysis with time travel, ACM SIGCOMM Computer Communication Review, vol. 38, no. 4, pp. 183–194, 2008. [23] P. Giura and N. Memon, NetStore: An efficient storage infrastructure for network forensics and monitoring, in Recent Advances in Intrusion Detection. Springer Berlin, 2010, pp. 277–296. [24] F. Fusco, M. Vlachos, and X. Dimitropoulos, RasterZip: Compressing network monitoring data with support for partial decompression, in Proceedings of the 2012 ACM Conference on Internet Measurement Conference, 2012, pp. 51–64. [25] F. Fusco, X. Dimitropoulos, M. Vlachos, and L. Deri, PcapIndex: An index for network packet traces with legacy compatibility, ACM SIGCOMM Computer Communication Review, vol. 42, no. 1, pp. 47–53, 2012. [26] J. Li, S. Ding, M. Xu, F. Y. Han, X. Guan, and Z. Chen, TIFA: Enabling real-time querying and storage of massive stream data, in Networking and Distributed Computing (ICNDC), 2011 Second International Conference on, 2011, pp. 61–64. [27] Z. Chen, L. Ruan, J. Cao, Y. Yu, and X. Jiang, TIFAflow: Enhancing traffic archiving system with flow granularity for forensic analysis in network security, Tsinghua Science and Technology, vol. 18, no. 4, pp. 406–417, 2013. [28] Z. Chen, X. Shi, L. Ruan, F. Xie, and J. Li, High speed traffic archiving system for flow granularity storage and querying, presented at ICCCN 2012 Workshop on PMECT, 2012. [29] I. Spiegler and R. Maayan, Storage and retrieval considerations of binary data bases, Information Processing and Management, vol. 21, no. 3, pp. 233–254, 1985. [30] P. Cheng, Bitmap index techniques and its research advancement, Science and Technologies Information, vol. 27, no. 26, pp. 134–135, 2010. [31] Z. Huang, W. Lv, and J. Huang, Improved BLAST algorithm based on bitmap indexes and B+ tree, Computer Engineering and Applications, vol. 49, no. 11, pp. 118– 120, 2013. [32] B. Yang, Y. Qi, Y. Xue, and J. Li, Bitmap data structure: Towards high-performance network algorithms designing, Computer Engineering and Applications, vol. 45, no. 15, pp. 1–5, 2009. [33] H. Garcia-Molina, J. D. Ullman, and J. Widom, Database System Implementation, Second Edition. Prentice Hall, 2009. [34] P. E. O��Neil, Model 204 architecture and performance, in High Performance Transaction Systems. Springer Berlin Heidelberg, 1989, pp. 39–59. [35] P. O��Neil and D. Quass, Improved query performance with variant indexes, ACM Sigmod Record, vol. 26, no. 2, pp. 38–49, 1997. [36] G. Antoshekov, Byte-aligned bitmap compression, in Proceedings of the Data Compression Conference (DCC��95), 1995, p. 476. [37] K. Wu, E. J. Otoo, and Arie Shoshani, Compressing bitmap indexes for faster search operations, in Scientific and Statistical Database Management, 2002. Proceedings. 14th International Conference on, 2002, pp. 99–108. [38] K. Wu, E. J. Otoo, and A. Shoshani, Optimizing bitmap indices with efficient compression, ACM Transactions on Database Systems (TODS), vol. 31, no. 1, pp. 1–38, 2006. [39] F. Deli��ege and T. B. Pedersen, Position list word aligned hybrid: Optimizing space and performance for compressed bitmaps, in Proceeding of the 13th International Conference on Extending Database Technology, 2010. [40] D. Lemire, O. Kaser, and K. Aouiche, Sorting improves word-aligned bitmap indexes, Data & Knowledge Engineering, vol. 69, no. 1, pp. 3–28, 2010. [41] A. Colantonio and R. Di Pietro, Concise: Compressed ncomposable integer set, Information Processing Letters, vol. 110, no. 16, pp. 644–650, 2010. [42] S. J. van Schaik and O. de Moor, A memory efficient reachability data structure through bit vector compression, in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, 2011, pp. 913–924. [43] F. Fusco, M. P. Stoecklin, and M. Vlachos, Net-fli: Onthe- fly compression, archiving and indexing of streaming network traffic, Proceedings of the VLDB Endowment, vol. 3, nos. 1&2, pp. 1382–1393, 2010.

Page 15
Zhen Chen et al.: A Survey of Bitmap Index-Compression Algorithms for Big Data 15 [44] G. Canahuate, M. Gibas, and H. Ferhatosmanoglu, Update conscious bitmap indices, in 19th IEEE International Conference on Scientific and Statistical Database Management SSBDM��07, 2007. [45] F. Corrales, D. Chiu, and J. Sawin, Variable length compression for bitmap indices, in Database and Expert Systems Applications, A. Hameurlain, S. W. Liddle, K.- D. Schewe, and X. Zhou, eds. Springer Berlin Heidelberg, 2011. [46] G. Guzun, G. Canahuate, D. Chiu, and J. Sawin, A tunable compression framework for bitmap indices, in IEEE International Conference on Data Engineering (ICDE 2014), 2014, pp. 484–495. [47] S. Chambi, D. Lemire, O. Kaser, and R. Godin, Better bitmap performance with Roaring bitmaps, arXiv preprint arXiv:1402.6407, 2014. [48] M. Stabno and R. Wrembel, RLH: Bitmap compression technique based on run-length and Huffman encoding, Information Systems, vol. 34, no. 4, pp. 400–414, 2009. [49] A. Schmidt, D. Kimmig, and M. Beine, A proposal of a new compression scheme of medium-sparse bitmaps, International Journal on Advances in Software, vol. 4, nos. 3&4, pp. 401–411, 2011. [50] J. Chang, Z. Chen, W. Zheng, Y. Wen, J. Cao, and W. L. Huang, PLWAH+: A bitmap index compressing scheme based on PLWAH, in Proceedings of the Tenth ACM/IEEE Symposium on Architectures for Networking and Communications Systems, 2014, pp. 257–258. [51] E. Nuutila. An efficient transitive closure algorithm for cyclic digraphs. Information Processing Letters, vol. 52: 207–213, 1994. [52] Y. Wen, Z. Chen, G. Ma, J. Cao, W. Zheng, G. Peng, and W. L. Huang, SECOMPAX: A bitmap index compression algorithm, in Computer Communication and Networks (ICCCN), 2014 23rd International Conference on, IEEE, 2014, pp. 1–7. [53] M. F. Oberhumer, The Lempel-Ziv-Oberhumer Packer, http://www.lzop.org/, November 2010. [54] F. Reiss, K. Stockinger, K. Wu, A. Shoshani, and J. M. Hellerstein, Enabling real-time querying of live and historical stream data, in Scientific and Statistical Database Management, 2007. SSBDM��07. 19th International Conference on, IEEE, 2007, p. 28. [55] W. Andrzejewski and R. Wrembel, GPU-WAH: Applying GPUs to compressing bitmap indexes with word aligned hybrid, in Database and Expert Systems Applications, P. G. Bringas, A. Hameurlain, and G. Quirchmayr, eds. Springer Berlin Heidelberg, 2010, pp. 315–329. [56] W. Andrzejewski and R. Wrembel, GPU-PLWAH: GPU- based implementation of the PLWAH algorithm for compressing bitmaps, Control & Cybernetics, vol. 40, no. 3, pp. 627–650, 2011. [57] F. Fusco, M. Vlachos, X. Dimitropoulos, and L. Deri, Indexing million of packets per second using GPUs, in Proceedings of the 2013 Conference on Internet Measurement Conference, 2013, pp. 327–332. [58] S. Lakshminarasimhan, D. A. Boyuka, S. V. Pendse, X. Zou, J. Jenkins, V. Vishwanath, and N. F. Samatova, Scalable in situ scientific data encoding for analytical query processing, in Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, 2013, pp. 1–12. [59] R. R. Sinha, S. Mitra, and M. Winslett, Bitmap indexes for large scientific data sets: A case study, in Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International, 2006, pp. 10–pp. [60] R. R. Sinha, and M. Winslett, Multi-resolution bitmap indexes for scientific data, ACM Transactions on Database Systems (TODS), vol. 32(3), no. 16. [61] Y. Zhang, F. N. Abu-Khzam, N. E. Baldwin, E. J. Chesler, M. A. Langston, and N. F. Samatova, Genome- scale computational approaches to memory-intensive applications in systems biology, in Proceedings of the ACM/IEEE SC 2005 Conference on Supercomputing��2005, 2005, p. 12. [62] A. Romosan, A. Shoshani, K. Wu, V. Markowitz, and K. Mavrommatis, Accelerating gene context analysis using bitmaps, in Proceedings of the 25th International Conference on Scientific and Statistical Database Management, 2013, p. 26. [63] Y. Hu, S. Sundara, T. Chorma, and J. Srinivasan, Supporting RFID-based item tracking applications in Oracle DBMS using a bitmap datatype, in Proceedings of the 31st international conference on Very large data bases, 2005, pp. 1140–1151. [64] K. H. Lee and B. Moon, Bitmap indexes for relational XML twig query processing, in Proceedings of the 18th ACM conference on Information and knowledge management, 2009, pp. 465–474. [65] T. L. Lopes Siqueira, R. R. Ciferri, V. C. Times, and C. D. de Aguiar Ciferri, A spatial bitmap-based index for geographical data warehouses, in Proceedings of the 2009 ACM symposium on Applied Computing, 2009, pp. 1336– 1342. [66] J. Zhang and S. You, Dynamic tiled map services: Supporting query-based visualization of large-scale raster geospatial data, in Proceedings of the 1st International Conference and Exhibition on Computing for Geospatial Research & Application,2010, p. 19. [67] N. Martnez-Bazan, M. . guila-Lorente, V. Munts-Mulero, D. Dominguez-Sal, S. Gmez-Villamor, and J. L. Larriba- Pey, Efficient graph management based on bitmap indices, in Proceedings of the 16th International Database Engineering & Applications Sysmposium, 2012, pp. 110– 119. [68] G. H. Cha, Bitmap indexing method for complex similarity queries with relevance feedback, in Proceedings of the 1st ACM international workshop on Multimedia databases, 2003, pp. 55–62. [69] M. Fontoura, M. Gurevich, V. Josifovski, and S. Vassilvitskii, Efficiently encoding term co-occurrences in inverted indexes, in Proceedings of the 20th ACM

Page 16
16 Tsinghua Science and Technology, February 2015, 20(1): 000-000 international conference on Information and knowledge management, 2011, pp. 307–316. [70] D. Dash, J. Rao, N. Megiddo, A. Ailamaki, and G. Lohman, Dynamic faceted search for discovery-driven analysis, in Proceedings of the 17th ACM conference on Information and knowledge management, 2008, pp. 3–12. Zhen Chen is an associate professor in Research Institute of Information Technology in Tsinghua University. He received his BEng and PhD degrees from Xidian University in 1998 and 2004, respectively. He once worked as postdoctoral researcher in Network Institute of Department of Computer Science in Tsinghua University during 2004 to 2006. He is also a visiting scholar in UC Berkeley ICSI in 2006. He joined Research Institute of Information Technology in Tsinghua University since 2006. His research interests include network architecture, computer security, and data analysis. He has published around 80 academic papers. Yuhao Wen is an undergraduate student studying in Department of Electronic Engineering at Tsinghua University. His research interests include big data and networks. Wenxun Zheng is an undergraduate student studying in Department of Physics at Tsinghua University. His research interests is now on bitmap index compression. Jiahui Chang is an undergraduate student studying in Department of Aerospace Engineering at Tsinghua University. He majors in Engineering Mechanics. Guodong Peng is an undergraduate student studying in Department of Mechanical Engineering at Tsinghua University. His research interests include mechanical design and network security. Yinjun Wu is an undergraduate student studying in Department of Automation at Tsinghua University. His research interests include bitmap indexing algorithms. Ge Ma is currently a master student studying in Department of Automation at Tsinghua University. He got his BEng degree from Tsinghua University in 2013. His research interests include future network and bitmap index compressing. Mourad Hakmaoui is a postgraduate student studying for PhD degree in Department of Computer Science and Technologies at Tsinghua University. He got his master and bachelor degrees from University of Mohammed 5th-Morocco in 2008 and 2012, respectively. His research interests include bitmap indexing algorithms and fast query for Big Data.

Page 17
Zhen Chen et al.: A Survey of Bitmap Index-Compression Algorithms for Big Data 17 Junwei Cao is currently a professor and deputy director of Research Institute of Information Technology, Tsinghua University, China. He is also the Director of Open Platform and Technology Division, Tsinghua National Laboratory for Information Science and Technology. His research is focused on advanced computing technology and applications. Before joining Tsinghua in 2006, Junwei Cao was a research scientist of Massachusetts Institute of Technology, USA. Before that he worked as a research staff member of NEC Europe Ltd., Germany. Junwei Cao got his PhD degree in computer science from University of Warwick, UK, in 2001. He got his MEng and BEng degrees from Tsinghua University in 1998 and 1996, respectively. Junwei Cao has published over 130 academic papers and books, cited by international researchers for over 3000 times. Junwei Cao is a senior member of the IEEE Computer Society and a member of the ACM and CCF.

Set Home | Add to Favorites

All Rights Reserved Powered by Free Document Search and Download

Copyright © 2011
This site does not host pdf,doc,ppt,xls,rtf,txt files all document are the property of their respective owners. complaint#nuokui.com
TOP