Thursday, July 19, 2007

InnoDB Row Counting using Indexes

This is always mentioned that InnoDB is slower in giving results for COUNT(*) as compared to MyISAM. But as Peter points out in his blog that this fact only applies to COUNT(*) queries without WHERE clause. This text is from Peter's blog only - "If you have query like SELECT COUNT(*) FROM IMAGE WHERE USER_ID=5 this query will be executed same way both for MyISAM and Innodb tables by performing index rage scan. This can be faster or slower both for MyISAM and Innodb depending on various conditions." Let's see what EXPLAIN has in store for us.

   1: mysql> CREATE TABLE `test_index` (
   2:   `id` int(11) NOT NULL AUTO_INCREMENT,
   3:   `x` int(11) DEFAULT NULL,
   4:   `y` int(11) DEFAULT NULL,
   5:   `z` varchar(255) NOT NULL DEFAULT 'testing',
   6:   PRIMARY KEY (`id`),
   7:   KEY `x` (`x`,`y`)
   8: ) ENGINE=InnoDB;
   9:  
  10: mysql> EXPLAIN SELECT COUNT(*) FROM test_index \G
  11: *************************** 1. row ***************************
  12:            id: 1
  13:   select_type: SIMPLE
  14:         table: test_index
  15:          type: index
  16: possible_keys: NULL
  17:           key: PRIMARY
  18:       key_len: 4
  19:           ref: NULL
  20:          rows: 4875772
  21:         Extra: Using index
  22: 1 row in set (0.01 sec)
  23:  
  24: mysql> SELECT COUNT(*) FROM test_index;
  25: +----------+
  26: | COUNT(*) |
  27: +----------+
  28: |  4915200 |
  29: +----------+
  30: 1 row in set (2.61 sec)

The explain states that the counting is going to be done on PRIMARY index and Using Index. The best part is that since it is going to use PRIMARY index and since it is NOT NULL, MySQL will actually count the values from the index itself. So, contrary to the thought that something like COUNT(1) will work faster is not true in this case. Here is an interesting case from a bug.

   1: mysql> CREATE TABLE `test` (
   2:   `id` int(11) NOT NULL,
   3:   `int_k` int(11) NOT NULL,
   4:   `data1` varchar(255) NOT NULL,
   5:   `data2` varchar(255) NOT NULL,
   6:   PRIMARY KEY (`id`),
   7:   KEY `int_k` (`int_k`)
   8: ) ENGINE=InnoDB DEFAULT CHARSET=latin1
   9: 1 row in set (0.00 sec)
  10:  
  11: mysql> CREATE PROCEDURE populate()
  12: begin
  13:   declare i int;
  14:   set i = 0;
  15:   start transaction;
  16:   while i < 300000 do
  17:     insert into test (id, int_k, data1, data2)
  18:       values (i, i, repeat("-", 250), repeat("-", 250));
  19:     set i = i + 1;
  20:     if i % 1000 = 0 then
  21:       start transaction;
  22:     end if;
  23:   end while;
  24:   commit;
  25: end
  26: 1 row in set (0.00 sec)
  27:  
  28: mysql> call populate;
  29: Query OK, 0 rows affected (1 min 0.65 sec)
  30:  
  31: mysql> SELECT COUNT(*) FROM test;
  32: +----------+
  33: | COUNT(*) |
  34: +----------+
  35: |   300000 |
  36: +----------+
  37: 1 row in set (1.08 sec)
  38:  
  39: mysql> SELECT COUNT(*) FROM test use index (int_k);
  40: +----------+
  41: | COUNT(*) |
  42: +----------+
  43: |   300000 |
  44: +----------+
  45: 1 row in set (0.12 sec)

Using a secondary index is faster. But why? Generally speaking, the PRIMARY index should be faster because it is usually in order and can be read with sequential I/O at around 15 times more speed than generally fragmented secondary index. Actually this is a special case, since the secondary index is inserted into the table in perfect order, which is very rare. Also, as Heikki points out in the bug

"Since the minimum record size of InnoDB is about 20 bytes, and the fill-factor of a secondary index is typically 70 %, we can calculate that if the row length is > 15 * 1.5 * 20 = 450 bytes, then scanning the secondary index would probably be a better option." Till this feature is implemented and we have a much better optimized count(*) for InnoDB, please use secondary index explicitly for counting rows if you satisfy any of above conditions. Hope this blog was helpful to you. Keep posting your comments.

del.icio.us Tags: , , ,

4 comments:

Eric Bergen said...

How can innodb use only the secondary index to scan rows. Doesn't it have to go back to the primary key to determine if that row is valid for the current snapshot?

Parvesh Garg said...

Eric,

Good point. It has to go back to the primary key, not primary index. Remember, the cost of scanning the primary index is primarylen+rowlen, but for secondary index it is secondarylen+primarylen. The calculation by Heikki, cited in the post, shows that it is a good bet to scan secondary index in this
case.

Thanks for pointing this out.

Regards,
Parvesh

Eric Bergen said...

I followed up on this. As it turns out innodb stores the transaction id in every index so it doesn't have to go back to the primary key. It can get all information from the secondary key scan.

Parvesh Garg said...

Eric,

Thanks for following this up. It actually makes sense to do so, as it is not required only in counting rows but in any query using indexes.