PG Replication on 10 Master / Slave Configuration using traditional cp

How to do Streaming / Replication Setup on PG :
*************************************

Stage 1 : Note down, In this Stage 1, The below example shows that Primary and Standby replication only..No fail over & Fail Back..
For failover , refer stage 2 and fail back refer stage 3 .
Primary :

Primary Side : 192.168.100.11
Secondary : 192.168.100.12

1. CREATE USER replicator WITH REPLICATION ENCRYPTED PASSWORD 'replicator';

2. Edit postgresql.conf as below
ALTER SYSTEM SET wal_level TO 'replica'; Must be at least set to hot_standby until version 9.5 or replica
ALTER SYSTEM SET archive_mode TO 'ON';
ALTER SYSTEM SET max_wal_senders TO '3';
ALTER SYSTEM SET wal_keep_segments TO '10';
ALTER SYSTEM SET listen_addresses TO '*';
ALTER SYSTEM SET hot_standby TO 'ON';
ALTER SYSTEM SET archive_command TO 'cp %p /mnt/server/archivedir/%f';

You may not need archive_mode or archive_command or restore_command for streaming replication. But, when you setup streaming replication and due to a network lag or whatever reason, if the Slave is falling behind, also if the WALs in pg_xlog or pg_wal are recycled, without archiving the WALs the slave can never get back to sync with Master. Thus, we made sure to ensure we suggest the best practices while building streaming replication, rather than just building it. In every Production environment, these parameters are a must.

3. Add the below on pg_hba.conf file.

host replication replicator 192.168.100.12/32 md5

4. As a postgres user, export PATH=$PATH:/usr/pgsql-10/bin

pg_ctl -D /var/lib/pgsql/10/data restart -mf

Standby Server Side :
=====================

Remove existing custom ts directories or Data directory ..

pg_basebackup -h 192.168.100.11 -U replicator -p 5432 -D /var/lib/pgsql/10/data -P -Xs -R

pg_basebackup -D /var/lib/postgresql/10/replica1 -U rep_user -w -P -R # -X stream

-R : Creates recovery.conf on standby which is mandatory for recovery process.

standby_mode = 'on'
primary_conninfo = 'host=192.168.0.28 port=5432 user=replicator password=replicator'
restore_command = 'cp /path/to/archive/%f %p'
archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'

pg_ctl -D /var/lib/pgsql/10/data start

One way of checking how up to date is our replication is by checking the amount of WAL records generated in the primary, but not yet applied in the standby.

Validation :

On Master
==========
$ ps -eaf | grep sender

You can see more details by querying the master’s pg_stat_replication view.

Shell

$ psql
postgres=# \x
Expanded display is on.
postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid | 1287
usesysid | 24615
usename | replicator
application_name | walreceiver
client_addr | 192.168.0.28
client_hostname |
client_port | 36924
backend_start | 2018-09-07 10:40:48.074496-04
backend_xmin |
state | streaming
sent_lsn | 0/50000D68
write_lsn | 0/50000D68
flush_lsn | 0/50000D68
replay_lsn | 0/50000D68
write_lag |
flush_lag |
replay_lag |
sync_priority | 0
sync_state | async

On Slave
==========
$ ps -eaf | grep receiver
$ ps -eaf | grep startup

1. goto production db, create one db and check the same in secondary node.

postgres=# SELECT pg_current_wal_lsn();
pg_current_wal_lsn
--------------------
0/50002C8
(1 row)

Slave:

postgres=# SELECT pg_last_wal_receive_lsn();
pg_last_wal_receive_lsn
-------------------------
0/50002C8
(1 row)

postgres=# SELECT pg_last_wal_replay_lsn();
pg_last_wal_replay_lsn
------------------------
0/50002C8
(1 row)
Note: These functions are for PostgreSQL 10. For previous versions, you need to use: SELECT pg_last_xlog_receive_location(); and SELECT pg_last_xlog_replay_location();

We can use the following query to get the lag in seconds.

PostgreSQL 10:

SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
THEN 0
ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
END AS log_delay;
Previous Versions:

SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()
THEN 0
ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
END AS log_delay;

Ref : https://www.percona.com/blog/2018/09/07/setting-up-streaming-replication-postgresql/

STAGE 2 : The below is the continuation of Stage 1. Now standby is getting data from production.
the below is going to show that, We are going to crash the primary and do testing for Fail Over. ( Means : Standby to Primary conversion )

Ref : https://blog.2ndquadrant.com/pgrewind-and-pg95/

Step 1 : List out the background process running on primary...

bash-4.2$ ps -ef|grep -i post
root 1285 1268 0 19:14 pts/0 00:00:00 sudo su postgres
root 1286 1285 0 19:14 pts/0 00:00:00 su postgres
postgres 1288 1286 0 19:14 pts/0 00:00:00 bash
postgres 2358 1 0 19:35 pts/0 00:00:00 /usr/pgsql-10/bin/postgres -D /var/lib/pgsql/10/data
postgres 2360 2358 0 19:35 ? 00:00:00 postgres: logger process
postgres 2365 2358 0 19:35 ? 00:00:00 postgres: checkpointer process
postgres 2366 2358 0 19:35 ? 00:00:00 postgres: writer process
postgres 2367 2358 0 19:35 ? 00:00:00 postgres: stats collector process
postgres 3227 2358 0 19:47 ? 00:00:00 postgres: wal writer process
postgres 3228 2358 0 19:47 ? 00:00:00 postgres: autovacuum launcher process
postgres 3229 2358 0 19:47 ? 00:00:00 postgres: archiver process last was 00000002.history
postgres 3230 2358 0 19:47 ? 00:00:00 postgres: bgworker: logical replication launcher
postgres 12267 12229 0 22:43 ? 00:00:00 postgres: wal sender process repuser 192.168.100.11(57610) streaming 0/12000140 ===> Note Sender Process here.

step 2 : List out the background process running on standby...

root 1285 1268 0 19:14 pts/0 00:00:00 sudo su postgres
root 1286 1285 0 19:14 pts/0 00:00:00 su postgres
postgres 1288 1286 0 19:14 pts/0 00:00:00 bash
postgres 2358 1 0 19:35 pts/0 00:00:00 /usr/pgsql-10/bin/postgres -D /var/lib/pgsql/10/data
postgres 2360 2358 0 19:35 ? 00:00:00 postgres: logger process
postgres 2361 2358 0 19:35 ? 00:00:02 postgres: startup process recovering 00000001000000000000 0010
postgres 2365 2358 0 19:35 ? 00:00:00 postgres: checkpointer process
postgres 2366 2358 0 19:35 ? 00:00:00 postgres: writer process
postgres 2367 2358 0 19:35 ? 00:00:00 postgres: stats collector process
postgres 15007 14998 0 22:43 ? 00:00:06 postgres: wal receiver process streaming 0/12000140 ===> Note Receiver Process here.

Step 3 : On Primary execute the command as postgres user ==> pg_ctl -D /var/lib/pgsql/10/data stop ( assuming that Crashed )

on Standby execute the below.

pg_ctl -D ${STANDBY1_PGDATA} promote

Note : If you check log file, now it will display to accept dbc connections. Earlier it shows accept read only.

Note : Hereafter, This stdby instance called as Promoted Standby ..Even if you startup the Production DB also, , no changes will be visible from the promoted standby since it is prod now.

Now Promoted standby is the New Master Node ( Primary )

STAGE 3 : Now, We assume that We resolved the disk issues in OLD Primary Server and We have to change the state from OLD Primary to Standby mode.

Step 1 : Make sure that no processes running on old primary(pgnode1 )

pg_ctl -D ${OLD_MASTER_PGDATA} stop

Step 2 :

Run the pg rewind command on OLD primary ( ie., pgnode1 Node ),

/var/lib/pgsql/10/data

bash-4.2<pgnode1>$ pg_rewind --target-pgdata=/var/lib/pgsql/10/data --source-server=" host=pgnode2 port=5432 user=postgres dbname=postgres"
servers diverged at WAL location 0/10000098 on timeline 1
rewinding from last common checkpoint at 0/10000028 on timeline 1
Done!
bash-4.2$

Note : If you get any errors like could not find previous WAL record at 0/501E680, Failure Exiting then

pg_rewind --target-pgdata=${OLD_MASTER_PGDATA} \
--source-server="port=5433 user=postgres dbname=postgres"
servers diverged at WAL position 0/501E680 on timeline 1
could not open file "/var/lib/pgsql/9.5/master/pg_xlog/000000010000000000000005": No such file or directory

the procedure is to manually copy the missing WALs and put them into the pg_wal/pg_xlog of the master(pgnode2), then run pg_rewind again on pgnode1

Step 3 : On New Primary Node (Promoted primary node - pgnode2 ), You have to edit pg_hba.conf file to include new standby server IP.

#local replication all peer
local replication repuser trust
host replication repuser 192.168.100.11/32 md5 =====================> Change the Stanby Node IP ( Now OLD Primary is the standby)
host replication all ::1/128 ident

Step 4 : Restart cluster services to take affect pg_ctl -D $PG_DATA restart on Promoted Standby ( Primary Node now )

Step 4 : On Standby Node ( OLD Primary Node - pgnode1 ), go to $PG_DATA location and make changes on recovery.conf file.
This file is responsible to know who is the Primary for this standby recovery process.

Rename the resolve.done to resolve.conf file on pgnode1..Change the host IP address. It should match the Actual PROD running now. ( pgnode2 )

bash-4.2$ hostname
pgnode1
bash-4.2$ cat recovery.conf
standby_mode = 'on'
primary_conninfo = 'user=repuser password=repuser123 host=192.168.100.12 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'
restore_command = 'cp /var/lib/pgsql/10/backups/archivedir/%f %p'
recovery_target_timeline = 'latest'
bash-4.2$

Step 5 : On pgnode1 ( standby node ) : Run the below command to start and check the logs.

pg_ctl -D /var/lib/pgsql/10/data start

Note : Now you can check the updates from logs on both New Primary and converted secondary nodes..Create any table on new primary should reflect to converted standby ..

My PG Experience Test Cases

Search This Blog

PG Replication on 10 Master / Slave Configuration using traditional cp

Comments

Post a Comment

Popular posts from this blog

How to configure Barman along with One Node Wal Streaming.

DB Link Test on Local Cluster across Two DBs with Same Owner and Public Schema