Index and Key have same meaning. Index is general word but Key is Database specific word.
Index is for searching record effectively.
$ sudo apt-get install ipython python-mysql
$ ipython
In [1]: import MySQLdb
In [2]: con = MySQLdb.connect(user='testuser', passwd='password', db='testdb')
In [3]: cur = con.cursor()
In [4]: cur.execute("CREATE table test1 (id INTEGER PRIMARY KEY AUTO_INCREMENT, val INTEGER)")
Out[4]: 0L
In [5]: values = [(i, i*2) for i in range(1, 100000)]
In [6]: cur.executemany("INSERT INTO test1 VALUES (%s, %s)", values)
Out[6]: 99999L
In [7]: con.commit()
$ mysql -u testuser -ppassword testdb
mysql> select * from test1 where id=100;
+-----+------+
| id | val |
+-----+------+
| 100 | 200 |
+-----+------+
1 row in set (0.00 sec)
mysql> select * from test1 where val=200;
+-----+------+
| id | val |
+-----+------+
| 100 | 200 |
+-----+------+
1 row in set (0.02 sec)
As you can see, select by PK consumes 0.00 sec and select by where consumes 0.02 sec on table having 100000 records.
If 1M records in Table, selecting without Index may consume 0.2sec. This means only 5 (* CPU cores) queries can be executed in one second.
We need to execute thousands of queries in one second.
InnoDB is primary storage engine of MySQL.
InnoDB uses tree data structure to store and search records.
id | name |
---|---|
1 | Kabul |
2 | Qandahar |
3 | Herat |
4 | Mazar |
5 | Amsterdam |
6 | Rotterdam |
This table is stored like this:
This figure is not accurate, but illustrate essence.
Important note:
SELECT * FROM country WHERE id=5:
As you saw, InnoDB doesn’t need to compare key to all records. This is why search by Key is fast.
Note
If using sorted array, binary search is fast as Tree. But tree structure is faster on deleting and inserting.
SELECT * FROM country WHERE id>=2 LIMIT 3:
Since leaf node is linked to next leaf, InnoDB can scan rows efficiently.
Searching with primary key reaches to records directory. If you create other keys (index or constraint), InnoDB creates secondary index.
Secondary index is Tree too. But it’s leaf node stores primary key instead of actual record.
For example:
schema:
create table example (
id INTEGER PRIMARY KEY,
a INTEGER,
b INTEGER,
c INTEGER,
KEY a (a),
KEY ab (a, b)
)
table:
+---------+-----+-----+-----+
| id(PK) | a | b | c |
+=========+=====+=====+=====+
| 1 | 10 | 11 | 21 |
+---------+-----+-----+-----+
| 2 | 20 | 22 | 32 |
+---------+-----+-----+-----+
| 3 | 30 | 33 | 43 |
+---------+-----+-----+-----+
| 4 | 40 | 44 | 54 |
+---------+-----+-----+-----+
| 5 | 50 | 55 | 65 |
+---------+-----+-----+-----+
| 6 | 60 | 66 | 76 |
+---------+-----+-----+-----+
index on (a):
+-----+-----+
| a | id |
+=====+=====+
| 10 | 1 |
+-----+-----+
| 20 | 2 |
+-----+-----+
| 30 | 3 |
+-----+-----+
| 40 | 4 |
+-----+-----+
| 50 | 5 |
+-----+-----+
| 60 | 6 |
+-----+-----+
index on (a, b):
+-----+-----+-----+
| a | b | id |
+=====+=====+=====+
| 10 | 11 | 1 |
+-----+-----+-----+
| 20 | 22 | 2 |
+-----+-----+-----+
| 30 | 33 | 3 |
+-----+-----+-----+
| 40 | 44 | 4 |
+-----+-----+-----+
| 50 | 55 | 5 |
+-----+-----+-----+
| 60 | 66 | 6 |
+-----+-----+-----+
Normally, searching by secondary index cause two step lookup: (1) search PK by index, (2) search record by PK.
But when all required columns are contained in the index, MySQL doesn’t search actual record.
For example, SELECT id FROM example WHERE a BETWEEN 20 AND 50 only requires a and id.
All index is sorted by lexicographical order:
(1, 1) < (1, 2) < (1, 3) < (2, 1) < (2, 2) < ...
SELECT * FROM example WHERE a BETWEEN 20 AND 30 ORDER BY (a, b) is efficient.
SELECT * FROM example WHERE b BETWEEN 20 AND 30 ORDER BY (a, b) is not efficient because MySQL can’t use the index (a, b) for selecting. On this case, MySQL scan full table to find records matches b BETWEEN 20 AND 30 and copy them to temporary table. After scan, MySQL sorts the temporary table.
MySQL 5.5 can show how SELECT query executed by EXPLAIN SELECT... query.
Note
MySQL 5.6 supports EXPLAINing INSERT, UPDATE, DELETE, ... queries too. On MySQL 5.5, you can explain UPDATE and DELETE by replace it to SELECT.
example:
mysql> explain select * from test1 where val=200;
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | test1 | ALL | NULL | NULL | NULL | NULL | 100808 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)
mysql> explain select * from test1 where id=100;
+----+-------------+-------+-------+---------------+---------+---------+-------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+-------+------+-------+
| 1 | SIMPLE | test1 | const | PRIMARY | PRIMARY | 4 | const | 1 | |
+----+-------------+-------+-------+---------------+---------+---------+-------+------+-------+
1 row in set (0.00 sec)
Important columns:
If type is ALL, MySQL scan whole table. If type is index, MySQL scan whole index. Otherwise, MySQL uses index efficiently.
key means what key(index) is used.
rows means estimated rows to test. If this value is large (>1000), it’s problem.
Index has significant cost. So you should not create index everywhere.
Since finding slow query is easier than finding unnecessary index, I recommend start with minimum, obviously required indexes.
Slow query log is useful feature to find slow query. It logs queries consumes specified execution time.
You can insert many dummy data to development environment. Before releasing, check slowlog, find slow queries and consider how to solve. (Sometimes there is better way than creating index.)
To use slow query log:
slow_query_log - on to enable slow query log.
log_output - where slow query log saved.
- TABLE - log saved to mysql.slow_log table.
- FILE - log saved to file specified by slow_query_log_file.
long_query_time - queries takes longer this value are logged. (0.01~0.1)
log_queries_not_using_indexes - on to log queries doesn’t using indexes.
min_examined_row_limit - Suppress slow query log when examined rows below this value.
See also: http://dev.mysql.com/doc/refman/5.5/en/slow-query-log.html