master與standby的switchover
repmgr可以透過 repmgr standby switchover
協助我們進行master與stanby的角色互換,由於switchover操作非常複雜,所有事前準備工作必須嚴謹看待,避免實際錯誤switchover所造成的二次災難。在開始前,請再三確認下列事項是否都先確認過:
- 設定好兩方節點的SSH免密碼登入,詳細請參考前面章節1的"設定SSH免密碼登入"部份。
- repmgr standby switchover底層是使用pg_rewind,請確保兩邊節點的postgres使用者可搜尋並執行該指令。Debian/Ubuntu預設是無法直接執行pg_rewind,請參考章節1的"加入pg_ctl搜尋路徑部份"。
- wal_log_hints = on 必須要開啟。
總是加上參數-v來執行repmgr standby switchover,確保沒有漏掉任何除錯訊息。
switchover必須執行在standby機器上,而不是master上。
為避免網路狀況不佳,導致switchover不完全,建議讓switchover, replication走獨立內部網路。
switchover只適用於只有兩個節點(master, stnadby)的情況下,若有第3個standby節點,在switchover後必須手動執行repmgr standby follow。
若有使用repmgrd做自動failover,在手動switchover之前一定要先停掉所有節點的repmgrd,否則在switchover後,standby會被自動拉起來作master。
先查看並確認目前master掛在哪個節點上,以下圖為例目前master長在node2上,請到將被提升為master的standby機器上,也就是node1上執行所有作業。
postgres@node1:~$ repmgr cluster show
Role | Name | Upstream | Connection String
----------+-------|----------|----------------------------------------------
* master | node2 | | host=192.168.252.51 user=repmgr dbname=repmgr
standby | node1 | node2 | host=192.168.252.50 user=repmgr dbname=repmgr
[on node1]
直接執行repmgr -v standby switchover
postgres@node1:~$ repmgr -v standby switchover
NOTICE: looking for configuration file in current directory
NOTICE: looking for configuration file in /etc
NOTICE: configuration file found at: /etc/repmgr.conf
NOTICE: switching current node 1 to master server and demoting current master to standby...
NOTICE: 0 files copied to /tmp/repmgr-node2-archive
NOTICE: current master has been stopped
ERROR: connection to database failed: could not connect to server: Connection refused
Is the server running on host "192.168.252.51" and accepting
TCP/IP connections on port 5432?
NOTICE: promoting standby
NOTICE: promoting server using 'pg_ctl -D /var/lib/postgresql/10/main promote'
waiting for server to promote.... done
server promoted
NOTICE: STANDBY PROMOTE successful
NOTICE: Executing pg_rewind on old master server
NOTICE: 0 files copied to /var/lib/postgresql/10/main
NOTICE: restarting server using 'pg_ctl -w -D /var/lib/postgresql/10/main -m fast restart'
pg_ctl: PID file "/var/lib/postgresql/10/main/postmaster.pid" does not exist
Is server running?
starting server anyway
ERROR: unable to restart server
NOTICE: node 2 is replicating in state "streaming"
NOTICE: replication slot "repmgr_slot_1" deleted on node 2
NOTICE: switchover was successful
Note: 從-v選項可看到一些ERROR,因switchover在某些階段,會需要暫停某一邊的服務,造成暫時無法連線的情況,但不一定是有問題,最後還是要從結果來看。
最後確認是否正確switchover,從下面結果可看出,master已從原本node2轉移到node1
postgres@node1:~$ repmgr cluster show
Role | Name | Upstream | Connection String
----------+-------|----------|----------------------------------------------
standby | node2 | | host=192.168.252.51 user=repmgr dbname=repmgr
* master | node1 | | host=192.168.252.50 user=repmgr dbname=repmgr
且node1也有接上Streaming Replication
postgres=# SELECT * FROM pg_replication_slots ;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
---------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
repmgr_slot_2 | | physical | | | f | t | 2443 | | | 0/B000AB8 |
(1 row)
但在repmgr3.3執行完switch over後,若直接去查看repl_nodes,有時會發現原standby node的狀態沒有更新到
repmgr=# SELECT * FROM repmgr_dbc.repl_nodes ORDER BY id;
id | type | upstream_node_id | cluster | name | conninfo | slot_name | priority | active
----+--------+------------------+---------+-------+-----------------------------------------------+---------------+----------+--------
1 | master | | dbc | node2 | host=192.168.252.51 user=repmgr dbname=repmgr | repmgr_slot_2 | 100 | t
2 | master | | dbc | node1 | host=192.168.252.50 user=repmgr dbname=repmgr | repmgr_slot_1 | 100 | f
(2 rows)
若遇到該狀況,請直接手動從資料表中刪除
repmgr=# DELETE FROM repmgr_dbc.repl_nodes WHERE name = 'node1';
並在switchover後的standby host重新註冊repmgr
postgres@node1:~$ repmgr standby register