Tuesday, March 12, 2013

Getting sysbench to work with Postgres on Flash SSD

System information:

OS: RHEL 5.8.
Sysbench: 0.4.12
Postgres: 9.2.3
RAM: 64GB
CPU: 16 cores E5-2650 0 @ 2.00GHz
SSD: STEC s1120 2TB PCIe
STEC s840 2TB SAS
Running EXT3 on the SSDs

Installing Sysbench.

1. Download sysbench 0.4.12 from http://sourceforge.net/projects/sysbench/, unpack the tarball with tar zxvf, and run configure in the extracted directory:
./configure --without-mysql --with-pgsql
2. Edit ./sysbench/drivers/pgsql/drv_pgsql.c
as per
http://jkshah.blogspot.com/2010/10/postgres-9-and-sysbench-0412.html
otherwise sysbench prepare will take a very long time to populate the table.

3. make
Make barfs at libtool with a weird error that looks like it's from X11:
../libtool: line 838: X--tag=CC: command not found
../libtool: line 871: libtool: ignoring unknown tag : command not found
../libtool: line 838: X--mode=link: command not found
../libtool: line 1004: *** Warning: inferring the mode of operation is deprecated.: command not found
../libtool: line 1005: *** Future versions of Libtool will require --mode=MODE be specified.: command not found


This isn't an X11 error. It's because libtool sets ECHO=echo and then uses $echo, which is unset, so the first argument ends up being run as a command. Edit libtool so all occurrences of $echo are changed to $ECHO, and make will work.
4. make install 
Installs sysbench into /usr/local/bin by default. Make sure you have that in your path.
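
The libtool edit from step 3 can be done with a single sed command rather than by hand. This is a minimal sketch, assuming GNU sed (for -i and the \> end-of-word anchor); the function name is mine, not something sysbench ships. The end-of-word anchor is there so that longer variable names that merely start with "echo" are left alone.

```shell
# Sketch: rewrite $echo to $ECHO in the generated libtool script.
# Assumes GNU sed. A backup of the original is kept as <file>.bak.
fix_libtool_echo() {
    # $1: path to the libtool script
    sed -i.bak 's/\$echo\>/$ECHO/g' "$1"
}

# Usage, from the sysbench source directory:
#   fix_libtool_echo ./libtool && make
```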

Installing/initializing Postgres

1. Follow the instructions for installing the PG RPMs here:
http://yum.postgresql.org/
and here
http://wiki.postgresql.org/wiki/YUM_Installation

Make sure the right drive is mounted properly on /var/lib/pgsql before running any initialization commands.

2. Start up PG:
/usr/pgsql-9.2/bin/pg_ctl -D /var/lib/pgsql/9.2/data start
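
To check that the right drive really is mounted on /var/lib/pgsql before initializing anything, something like the following can help. It is only a sketch: the function name is made up, and it relies on the usual /proc/mounts column order (device, mount point, filesystem type, options).

```shell
# Print "device fstype options" for a mount point.
# Returns non-zero if nothing is mounted there.
# $1: mount point; $2: mounts table (defaults to /proc/mounts)
mount_info() {
    awk -v mp="$1" '
        $2 == mp { print $1, $3, $4; found = 1 }
        END      { exit !found }
    ' "${2:-/proc/mounts}"
}

# Usage:
#   mount_info /var/lib/pgsql || echo "nothing mounted on /var/lib/pgsql" >&2
```

The options column is also a quick way to confirm that the mount options from the tuning section took effect.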

Configuration/tuning

1. Mount ext3 FS on the SSDs with the following options.
data=writeback,noatime,nodiratime
Ext4 has better options for SSDs, but is recommended only for RHEL 6.x distributions with the newer kernel.
2. The Postgres docs recommend raising kernel.shmmax and kernel.shmall; however, with 64G of memory these values appear to be pretty large anyway.
3. Changes to postgresql.conf:

shared_buffers = 32GB        # Half of available memory
work_mem = 12GB              # Memory for in-memory sorts and hashes. Applies per operation, per session, so total use can be several times this.
maintenance_work_mem = 1GB    # Used in vacuum    
effective_io_concurrency = 16      # Number of concurrent I/O requests Postgres assumes the storage can service. Could probably bump this higher for SSD.
checkpoint_segments = 256   # Checkpoint every 4GB of WAL (256 x 16MB segments) instead of every 48MB, which is the default with value = 3
effective_cache_size = 32GB      # Half of available memory.
default_statistics_target = 1000   # Used by query planner
update_process_title = off           # Don't update ps -e with SQL statements.
autovacuum_max_workers = 16      # Increase to # of cores available to speed up autovacuum.
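
The checkpoint_segments figures above come from simple arithmetic, sketched here assuming the default 16MB WAL segment size (it can be changed at build time, in which case the numbers differ). The function name is just for illustration.

```shell
# WAL volume between checkpoints, in MB, for a given checkpoint_segments value.
# Assumes the default 16MB WAL segment size.
checkpoint_mb() {
    echo $(( $1 * 16 ))
}

checkpoint_mb 3      # default setting: prints 48
checkpoint_mb 256    # setting used above: prints 4096
```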

I didn't test this, but apparently changing random_page_cost to 1 can help as well.
http://www.databasesoup.com/2012/05/random-page-cost-revisited.html
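
Coming back to kernel.shmmax and kernel.shmall from step 2: Postgres 9.2 allocates shared_buffers from System V shared memory, so shmmax (in bytes) has to exceed shared_buffers plus some overhead, and shmall is counted in pages. A rough sketch of the sizing follows; the 2GB headroom is my assumption, not a documented figure, and the function name is made up.

```shell
# Print sysctl settings large enough for a given shared_buffers size.
# $1: shared_buffers in GB
# $2: headroom in GB (an assumption; defaults to 2)
# $3: page size in bytes (defaults to getconf PAGE_SIZE)
shm_sysctl() (
    gb=$1
    headroom=${2:-2}
    page=${3:-$(getconf PAGE_SIZE)}
    bytes=$(( (gb + headroom) * 1024 * 1024 * 1024 ))
    echo "kernel.shmmax = $bytes"                # largest single segment, in bytes
    echo "kernel.shmall = $(( bytes / page ))"   # total shared memory, in pages
)

# Usage: compare against the running values.
#   shm_sysctl 32
#   sysctl -n kernel.shmmax kernel.shmall
```

If the running values are already larger than what this prints, there is nothing to change.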

Running sysbench

1. First create the PG database:
su - postgres
createdb dbtest
# Confirm it's been created
psql -c "\l"

2. Run sysbench prepare which populates the table in PG (sbtest by default)
sysbench --test=oltp --pgsql-user=postgres_user --pgsql-password=postgres_user_password --pgsql-host="" --pgsql-db=dbtest --db-driver=pgsql --oltp-table-size=100000000 --num-threads=16 prepare

We're creating a 100M row table here which consumes about 30G of space. This can take about 10 minutes or so.

3. Run the actual test:
THREADS="8 16 24 32 48 60 72 84 100"
for t in $THREADS
do
  # First clear cache
  vmstat -SM
  sync; echo 3 > /proc/sys/vm/drop_caches
  vmstat -SM
  sysbench --test=oltp   --pgsql-user=postgres_user --pgsql-password=postgres_user_password --pgsql-host="" --pgsql-db=dbtest --db-driver=pgsql --oltp-dist-type=gaussian --oltp-table-size=100000000  --oltp-test-mode=complex --num-threads=$t  --max-time=600 --max-requests=25000000 run
done
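
Since the prepare and run commands share most of their options, it can be less error-prone to build the command line in one place and do a dry run first. A sketch of that idea, using only the options already shown above; the variable, function, and log file names are made up.

```shell
# Options common to prepare and run, copied from the commands above.
SB_COMMON="--test=oltp --db-driver=pgsql --pgsql-user=postgres_user --pgsql-password=postgres_user_password --pgsql-host= --pgsql-db=dbtest --oltp-table-size=100000000"

# Print (not execute) the run command for a given thread count.
# $1: number of threads
sb_run_cmd() {
    echo "sysbench $SB_COMMON --oltp-dist-type=gaussian --oltp-test-mode=complex --num-threads=$1 --max-time=600 --max-requests=25000000 run"
}

# Dry run: show what would be executed for each thread count.
for t in 8 16 24 32 48 60 72 84 100; do
    sb_run_cmd "$t"
done

# To execute for real and keep one log per run (log names are hypothetical):
#   sh -c "$(sb_run_cmd "$t")" > "sysbench_${t}.log" 2>&1
```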

If you run sysbench with oltp-test-mode=complex, it seems to behave differently with MySQL/InnoDB vs. Postgres as described here:
http://jkshah.blogspot.com/2010/11/sysbench-postgresql-90-and-oltp-complex.html

If your table size is small (e.g. 1M rows) or if you use more threads, you will run into deadlocks.
I worked around this by increasing the number of rows in sbtest to 100M and restricting max-requests to 25M. Need to work with the sysbench folks to fix this.
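
To see whether a given run hit deadlocks, the sysbench report can be boiled down to one line per run. This sketch assumes the 0.4.12 report contains lines shaped like the ones in the comment below; I'm going from memory on the exact layout, so check it against your own output. The function name is made up.

```shell
# Print "threads tps deadlocks" from one sysbench report read on stdin.
# Assumes report lines of this form:
#     Number of threads: 16
#     transactions:      10000  (166.63 per sec.)
#     deadlocks:         0      (0.00 per sec.)
sb_summary() {
    awk '
        /Number of threads:/ { threads = $NF }
        /transactions:/      { tps = $3; gsub(/[()]/, "", tps) }
        /deadlocks:/         { dl = $2 }
        END                  { print threads, tps, dl }
    '
}

# Usage, with a log file saved from a run (file name is hypothetical):
#   sb_summary < sysbench_16.log
```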


Tuesday, September 25, 2012

Installing SCST and SRP for RHEL (Centos/OEL) 6.2, building kernel etc

(0) First, make sure Infiniband works.

1. Install OFED driver (./mlnxofedinstall)
2. Start the following:
    a. /etc/init.d/openibd start. Check /etc/infiniband/openib.conf, and make sure openibd is enabled at boot (chkconfig openibd on).
    b. /etc/init.d/opensmd start. This is only needed if you're not going through a switch, or if your switch doesn't have a subnet manager.
To test the health of your network, run: cd /etc/infiniband/; osmtest -f c ; osmtest
3.  ibstatus should show an active port at this point if your cable is connected properly.
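
The ibstatus check in step 3 can be scripted. A minimal sketch, assuming ibstatus prints one "state: 4: ACTIVE" style line per port (plus a separate "phys state:" line, which is skipped here); the function name is made up.

```shell
# Count ports whose logical link state is ACTIVE in ibstatus output on stdin.
# Lines starting with "phys state:" do not match and are ignored.
active_ib_ports() {
    grep -c '^[[:space:]]*state:.*ACTIVE' || true
}

# Usage:
#   [ "$(ibstatus | active_ib_ports)" -ge 1 ] || echo "no active IB port" >&2
```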

(I) Build a new kernel.
Download scst source from sourceforge, e.g: http://sourceforge.net/projects/scst/files/scst/2.2.0/
1. yum install yum-utils

2. Create /etc/yum.repos.d/src.repo to point to RHEL source:
[rhel-src]
name=Red Hat Enterprise Linux $releasever - $basearch - Source
baseurl=ftp://ftp.redhat.com/pub/redhat/linux/enterprise/$releasever/en/os/SRPMS/
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
# End file
3. yumdownloader --source kernel
4. yum install rpm-build
5. yum install unifdef
6. useradd mockbuild; su - mockbuild
7. rpm -i {kernel source rpm}
This will create SOURCES, SPECS, and SRPMS dirs under ~mockbuild/rpmbuild, assuming you ran rpm -i as user mockbuild.
8. mkdir -p ~mockbuild/rpmbuild/BUILD
9. rpmbuild -bp SPECS/kernel.spec
10. cd BUILD/kernel*/linux*/
Apply patches for scst as per README_RHEL:

   patch -p1 <${SCST_SOURCE_DIR}/scst/kernel/rhel/scst_exec_req_fifo-rhel5.patch
   patch -p1 <${SCST_SOURCE_DIR}/iscsi-scst/kernel/patches/rhel/put_page_callback-rhel5.patch

Don't forget to change EXTRAVERSION in the Makefile by adding the 3-digit suffix. The kernel rev is 2.6.32-220 in stock RHEL 6.2, but the kernel source RPM is for 2.6.32-279, and the Makefile doesn't reflect the "279" unless you change EXTRAVERSION.
11. make
Results in this error:

crypto/signature/ksign-publickey.c:2:17: error: key.h: No such file or directory
crypto/signature/ksign-publickey.c: In function ‘ksign_init’:
crypto/signature/ksign-publickey.c:10: error: ‘ksign_def_public_key’ undeclared (first use in this function)
crypto/signature/ksign-publickey.c:10: error: (Each undeclared identifier is reported only once
crypto/signature/ksign-publickey.c:10: error: for each function it appears in.)
crypto/signature/ksign-publickey.c:11: error: ‘ksign_def_public_key_size’ undeclared (first use in this function)
make[2]: *** [crypto/signature/ksign-publickey.o] Error 1
make[1]: *** [crypto/signature] Error 2
make: *** [crypto] Error 2

12. Fix:
yum install ncurses-devel
cd BUILD/kernel*/linux*/
make menuconfig
a) Select "Enable loadable module support", then "Module signature verification (EXPERIMENTAL)". Disable it.
b) Then go back to the main menu, select "Cryptographic API" then "In-kernel signature checker (EXPERIMENTAL)" and disable that one too.

13. make
14. make modules
15. make modules_install
16. make install
You may get errors if you installed a library module for the currently running kernel. This just means you'll need to re-install the library module after you boot into the new one. Hope you saved it.
17. Change /boot/grub/menu.lst so the new kernel is booted up by default (i.e. default=0 or default=1)
18. reboot
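
The EXTRAVERSION change from step 10 can be scripted as well. This is a sketch, assuming GNU sed and a Makefile with a single "EXTRAVERSION = ..." line; the function name is made up, and the new value must not contain "/" or "&" since it is pasted into a sed expression.

```shell
# Set EXTRAVERSION in a kernel Makefile, keeping a backup as <file>.bak.
# $1: path to the Makefile; $2: new EXTRAVERSION value
set_extraversion() {
    sed -i.bak "s/^EXTRAVERSION[[:space:]]*=.*/EXTRAVERSION = $2/" "$1"
}

# Usage, from the kernel source directory:
#   set_extraversion Makefile -279
#   make kernelversion    # should now include the suffix
```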

(II) Install scst

cd ${SCST_SOURCE_DIR} (e.g. scst-2.2.0)
make enable_proc
make
make install

Make won't work unless you have the kernel mods from above. Procfs support means scst will create /proc/scsi_tgt, where you'll be able to configure vdisks and so on. This is deprecated and replaced by /sys support. However, procfs is much simpler to grok, so it was enabled.
Make install installs scst.ko as a kernel module, among other things (e.g. scst_vdisk.ko, scst_tgt.ko etc). You should be able to run "modprobe scst" and "modprobe scst_vdisk", and confirm via lsmod that they were loaded.

(III) Install scstadmin

This is a cooler way to configure scst than "echo 1 > blah/blah/blah". So first download it:

wget http://sourceforge.net/projects/scst/files/scstadmin/2.2.0/scstadmin-2.2.0.tar.bz2/download
bunzip2, untar, cd scstadmin-2.2.0
make enable_proc
make 
make install

(IV) Install srpt

SCSI RDMA Protocol (SRP) lets you access SCSI devices via RDMA. Using RDMA means you can directly access memory without going through the networking stack, operating system etc.
Srpt manages the target, i.e. where you have the storage.
Download from http://sourceforge.net/projects/scst/files/srpt/2.2.0/
make
make install
This installs a bunch of binaries and a kernel module, ib_srpt.

(V) Configuration

Startup scripts:
Installing scstadmin should have created /etc/init.d/scst. Don't forget to chkconfig scst on.
Go into the startup script and change one line:
SCST_MODULES="scst scst_vdisk ib_srpt"
# touch /etc/scst.conf so the startup script won't complain.

scst_vdisk is the handler for virtual disks (files, devices, ISOs etc). There is another handler, scst_disk, which is a passthrough for disks. Some IB adapters don't support scst_disk. ib_srpt is the target implementation of SRP.

/etc/init.d/scst start/stop. If it works, you should be able to see all three modules from above via lsmod.
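
Rather than eyeballing lsmod, the check for the three modules can be a small filter. A sketch, assuming the usual lsmod layout with a header line followed by one module name in the first column per line; the function name is made up.

```shell
# Print any of the named modules that do not appear in lsmod output on stdin.
# $@: module names to look for
missing_modules() {
    awk -v want="$*" '
        NR > 1 { loaded[$1] = 1 }    # skip the "Module Size Used by" header
        END {
            n = split(want, m, " ")
            for (i = 1; i <= n; i++)
                if (!(m[i] in loaded)) print m[i]
        }'
}

# Usage: prints nothing if all three are loaded.
#   lsmod | missing_modules scst scst_vdisk ib_srpt
```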

Creating devices:
Use the following to create scst devices backed by real devices in /dev. The syntax is different for the version of scstadmin which uses sysfs.
scstadmin -adddev DISK0 -handler vdisk -path /dev/skd0 -options BLOCKIO -blocksize 512
scstadmin -SetT10DeviceId 100 -device DISK0 -handler vdisk
scstadmin -assigndev DISK0 -group Default -lun 0
scstadmin -WriteConfig /etc/scst.conf

Did it work? The adddev and SetT10DeviceId options created this:
# cat /proc/scsi_tgt/vdisk/vdisk
Name     Size(MB)  Block size  Options  File name  T10 device id
DISK0    934792    512         BIO      /dev/skd0  100

Assigndev created this:
# cat /proc/scsi_tgt/groups/Default/devices
Device (host:ch:id:lun or name)  LUN  Options
DISK0                            0

Check /etc/scst.conf also to ensure changes were written.
Now /etc/init.d/scst reload to re-read /etc/scst.conf
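
If you have more than one device to export, the same four commands can be generated in a loop. This sketch only prints the commands so they can be reviewed first. The DISK<n> names, the 100 + n T10 ids, one LUN per device, and /dev/skd1 are my assumptions, extended from the single-device example above.

```shell
# Print the scstadmin commands for a list of block devices, one LUN per device.
# $@: block device paths
scst_export_cmds() (
    n=0
    for dev in "$@"; do
        echo "scstadmin -adddev DISK$n -handler vdisk -path $dev -options BLOCKIO -blocksize 512"
        echo "scstadmin -SetT10DeviceId $((100 + n)) -device DISK$n -handler vdisk"
        echo "scstadmin -assigndev DISK$n -group Default -lun $n"
        n=$((n + 1))
    done
    echo "scstadmin -WriteConfig /etc/scst.conf"
)

# Review the output, then pipe it to sh to actually run it.
scst_export_cmds /dev/skd0 /dev/skd1
```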

(VI) Configuring the initiator

1. ibstatus to make sure IB ports are up
2. modprobe ib_srp to load the SRP initiator.
3. ibsrpdm -c to figure out target information. 
4.  ibsrpdm -c | while read target_info; do echo "${target_info}" > /sys/class/infiniband_srp/${SRP_HCA_NAME}/add_target; done 

Here ${SRP_HCA_NAME} is the name of the SRP host entry under /sys/class/infiniband_srp/. Check /var/log/messages to follow progress. If nothing was screwed up this far, you should see new devices (sdc, sdd etc).
fdisk -l to check them out, create filesystems etc.
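
The loop in step 4 can be wrapped in a function that takes the SRP host directory as an argument, which avoids having to set SRP_HCA_NAME by hand. A sketch under the assumption that the entries under /sys/class/infiniband_srp/ are named srp-<hca>-<port>; the function name and the srp-mlx4_0-1 example are hypothetical.

```shell
# Write each target description read from stdin to one SRP host's add_target file.
# $1: SRP host directory, e.g. one of /sys/class/infiniband_srp/srp-*
add_srp_targets() {
    while read -r target_info; do
        echo "$target_info" > "$1/add_target"
    done
}

# List the candidate host directories, then add targets through one of them
# (use whatever name ls shows on your system):
#   ls -d /sys/class/infiniband_srp/srp-*
#   ibsrpdm -c | add_srp_targets /sys/class/infiniband_srp/srp-mlx4_0-1
```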

Hope this helps. If you're running a later version of RHEL or a different distro of Linux, drop me an email and I might be able to help.

