Archive for the ‘Troubleshooting’ Category

Documents related to troubleshooting

Wednesday, September 8th, 2010

The Check Point knowledge base contains a lot of useful documents related to troubleshooting. Here’s a selection. Feel free to send an email to blog@lachmann.org if you think a document is missing from the list.

SmartSPLAT – very nice SSH GUI client for SPLAT

Thursday, September 2nd, 2010

Today I became aware of the SmartSPLAT project, and I’d like to share it with you.

Cagdas Ulucan, CCSE+ from Turkey, developed a nice GUI that uses a simple SSH connection to log in to your SPLAT-based box and display, change and collect a lot of useful information.


The three shell windows show the output of fw monitor, the current fw logging and the main commands; the parameters for them can be set in the GUI.

When you click on a button (for example “debug vpn”), you can actually see what commands are issued to the shell, so here you have a learning effect.

The tool has a built-in FTP and syslog server, so the debug files it produces can be uploaded easily.

At first you’re overwhelmed by all the tabs that address different (troubleshooting) topics, but I think the GUI will improve and Cagdas will find a way to enhance the presentation of his tool.

What is really cool is the cluster view, where you have a window with two panes, each representing one cluster member. An easy way to send commands to both cluster members and compare the results!

Try his tool, it’s completely free and very, very useful.
Send him your suggestions for improvement and help make it even better.

Tobias Lachmann

code generator for fw monitor and tcpdump

Wednesday, August 25th, 2010

Joost de Cock has a PHP application running on this site which allows you to easily create INSPECT code to use with the fw monitor command, or an equivalent expression to use with tcpdump.

A very handy tool, try it!
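
To give an idea of what such a generator produces, here is a minimal sketch that prints a matching pair of filters for a single host. The function name capture_filters is made up for illustration, and it only covers the simplest case (one IP address as source or destination); Joost’s tool handles far more.

```shell
#!/bin/bash
# Hypothetical helper: print equivalent fw monitor and tcpdump
# commands that capture traffic from or to one IP address.
capture_filters() {
    local ip="$1"
    # INSPECT filter expression for fw monitor
    printf "fw monitor -e 'accept src=%s or dst=%s;'\n" "$ip" "$ip"
    # Equivalent capture filter for tcpdump
    printf "tcpdump -nn 'host %s'\n" "$ip"
}

capture_filters 10.1.1.1
```

The two printed lines can be pasted into an expert shell as they are.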

Tobias Lachmann

Determine current Antivirus version

Friday, August 20th, 2010

We’ve seen problems with updating the AntiVirus patterns in the past on UTM-1 appliances.
Somehow the reported version numbers seemed wrong.

But where can you check what the current version is?

Easy answer to that:
http://sigcheck.checkpoint.com/Siglist2.txt

Compare your version from SmartView Monitor or avsu_client to the version you see on the above page.

Tobias Lachmann

Display errors in SmartView Monitor

Tuesday, August 17th, 2010

Sometimes SmartView Monitor gets confused and displays wrong (cached) information.

To clear this up, do the following:

- issue cpstop on the Security Management server
- delete $FWDIR/conf/applications.C,
$FWDIR/conf/applications.C.backup,
$FWDIR/conf/CPMILinksMgr.db
and $FWDIR/conf/CPMILinksMgr.db.private
- issue cpstart
- install policy again
- open SmartView Monitor again
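
The command-line part of this list could be wrapped in a small script. This is only a sketch: the names run, reset_monitor_cache and DRY_RUN are my own, and the -f flag on rm is an addition so that a missing file does not stop the script. By default it only prints what it would do.

```shell
#!/bin/bash
# Dry run by default: commands are only printed. Set DRY_RUN=0 to execute.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop the services, remove the cached files, start the services again.
# FWDIR must be set, as it is in a Check Point expert shell.
reset_monitor_cache() {
    local conf="${FWDIR:?FWDIR is not set}/conf"
    run cpstop
    run rm -f "$conf/applications.C" "$conf/applications.C.backup" \
              "$conf/CPMILinksMgr.db" "$conf/CPMILinksMgr.db.private"
    run cpstart
}

# Usage on the Security Management server:
#   reset_monitor_cache              # show the commands only
#   DRY_RUN=0 reset_monitor_cache    # actually run them
```

Installing the policy and reopening SmartView Monitor still have to be done by hand afterwards.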

Tobias Lachmann

Update to R71 – enlarging UTM-1 appliance root partitions

Friday, June 18th, 2010

In one of my previous blog entries I described a way to enlarge partitions of UTM-1 appliances. This was necessary especially for the older x50 series appliances, as they had a smaller hard drive and a bad partition layout.

In the past I only enlarged the partition that held the log files, because that’s where you have the most data. The procedure worked just fine and I was happy.

A couple of days ago I started updating x50 series appliances from R65 to R71. Even after cleaning unused files off the system right before the update, I got into serious trouble. The cause was that the root partition was nearly full.

The update process itself finished without an error, but once the appliance was in operation the root partition filled up completely in no time. Updating the URL Filtering database in particular, which is now about 370MB, filled the root partition quickly.

When I tried to enlarge the root partition with the procedure I had described, it failed.

Resizing requires unmounting the partition first, but you can’t unmount the root partition of a running system.

So I had to find another way to modify the partition sizes of the appliance.

Here’s what I did:

I downloaded an ISO image of grml, a Linux live system for sysadmins. Then I modified the ISO to display its output on the serial console. You can download this modified ISO here.

I connected a USB DVD drive to the appliance and booted the ISO image.

On the boot screen I added some parameters for the startup process:

Some information and boot options available via keys F2 - F10. http://grml.org/
grml 2010.04 - Release Codename Grmlmonster 2010.04.29
boot: serial debug=noscreen lang=de lvm

When grml had finished booting, it gave me a console with all the needed tools. LVM was already loaded and I was good to go.

I checked for the volume groups on the hard drive with the vgscan command:

root@grml ~ # vgscan -v
Wiping cache of LVM-capable devices
Wiping internal VG cache
Reading all physical volumes. This may take a while...
Finding all volume groups
Finding volume group "vg_splat"
Found volume group "vg_splat" using metadata type lvm2

Then I activated the logical volumes with vgchange:

root@grml ~ # vgchange -a y
6 logical volume(s) in volume group "vg_splat" now active

You can display the volume group with vgdisplay:

root@grml ~ # vgdisplay
--- Volume group ---
VG Name vg_splat
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 7
VG Access read/write
VG Status resizable
MAX LV 255
Cur LV 6
Open LV 0
Max PV 255
Cur PV 1
Act PV 1
VG Size 72.47 GiB
PE Size 4.00 MiB
Total PE 18553
Alloc PE / Size 7424 / 29.00 GiB
Free PE / Size 11129 / 43.47 GiB
VG UUID dCQA6u-z70X-LIsE-Xhmb-n5ho-ZMrX-JyBePy

You can display the logical volumes with lvscan:

root@grml ~ # lvscan
ACTIVE '/dev/vg_splat/lv_current' [5.00 GiB] inherit
ACTIVE '/dev/vg_splat/lv_log' [10.00 GiB] inherit
ACTIVE '/dev/vg_splat/lv_hfa' [5.00 GiB] inherit
ACTIVE '/dev/vg_splat/lv_upgrade' [5.00 GiB] inherit
ACTIVE '/dev/vg_splat/lv_fcd' [2.00 GiB] inherit
ACTIVE '/dev/vg_splat/lv_fcd62' [2.00 GiB] inherit

Then I resized the logical volumes to more suitable values:

root@grml ~ # lvresize -L 11GB /dev/vg_splat/lv_current
Extending logical volume lv_current to 11.00 GiB
Logical volume lv_current successfully resized

root@grml ~ # lvresize -L 25G /dev/vg_splat/lv_log
Extending logical volume lv_log to 25.00 GiB
Logical volume lv_log successfully resized

Keep in mind that you will need some free space for imaging purposes, so don’t use up all the space on the hard drive!
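
To see how much room is left after the two enlargements, the numbers from the vgdisplay and lvscan output above can be put into a quick calculation. This is just a sketch of the arithmetic; the variable names are my own.

```shell
#!/bin/bash
# Values taken from the vgdisplay and lvscan output above.
pe_size_mib=4                          # PE Size
free_pe=11129                          # Free PE before resizing
grow_gib=$(( (11 - 5) + (25 - 10) ))   # lv_current 5 -> 11, lv_log 10 -> 25

# Number of extents the two enlargements consume, and what remains.
needed_pe=$(( grow_gib * 1024 / pe_size_mib ))
left_pe=$(( free_pe - needed_pe ))

echo "extents needed: $needed_pe"
echo "extents left:   $left_pe ($(( left_pe * pe_size_mib / 1024 )) GiB, rounded down)"
```

With these values about 22 GiB of the volume group stay unallocated.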

Then a file system check has to be done, followed by the resizing of the file system.

root@grml ~ # e2fsck -f /dev/vg_splat/lv_current
e2fsck 1.41.11 (14-Mar-2010)
Superblock last mount time is in the future.
(by less than a day, probably due to the hardware clock being incorrectly set) Fix? yes

Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/vg_splat/lv_current: ***** FILE SYSTEM WAS MODIFIED *****
/dev/vg_splat/lv_current: 26973/655360 files (0.1% non-contiguous), 384238/1310720 blocks

root@grml ~ # resize2fs /dev/vg_splat/lv_current
resize2fs 1.41.11 (14-Mar-2010)
Resizing the filesystem on /dev/vg_splat/lv_current to 2883584 (4k) blocks.
The filesystem on /dev/vg_splat/lv_current is now 2883584 blocks long.

root@grml ~ # e2fsck -f /dev/vg_splat/lv_log
e2fsck 1.41.11 (14-Mar-2010)
Superblock last mount time is in the future.
(by less than a day, probably due to the hardware clock being incorrectly set) Fix? yes

Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vg_splat/lv_log: ***** FILE SYSTEM WAS MODIFIED *****
/dev/vg_splat/lv_log: 56/1310720 files (3.6% non-contiguous), 49409/2621440 blocks

root@grml ~ # resize2fs /dev/vg_splat/lv_log
resize2fs 1.41.11 (14-Mar-2010)
Resizing the filesystem on /dev/vg_splat/lv_log to 6553600 (4k) blocks.
The filesystem on /dev/vg_splat/lv_log is now 6553600 blocks long.
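
The block counts reported by resize2fs can be cross-checked against the new volume sizes: with 4 KiB blocks, one GiB holds 262144 blocks. A small sketch of that check (the function name is my own):

```shell
#!/bin/bash
# Convert a volume size in GiB to the number of 4 KiB file system blocks.
gib_to_4k_blocks() {
    echo $(( $1 * 1024 * 1024 / 4 ))
}

gib_to_4k_blocks 11   # lv_current
gib_to_4k_blocks 25   # lv_log
```

The results, 2883584 and 6553600, match the resize2fs output above, so both file systems fill their volumes completely.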

To finish, deactivate the logical volumes:

root@grml ~ # vgchange -a n
0 logical volume(s) in volume group "vg_splat" now active

root@grml ~ # lvscan
inactive '/dev/vg_splat/lv_current' [11.00 GiB] inherit
inactive '/dev/vg_splat/lv_log' [25.00 GiB] inherit
inactive '/dev/vg_splat/lv_hfa' [5.00 GiB] inherit
inactive '/dev/vg_splat/lv_upgrade' [5.00 GiB] inherit
inactive '/dev/vg_splat/lv_fcd' [2.00 GiB] inherit
inactive '/dev/vg_splat/lv_fcd62' [2.00 GiB] inherit
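
If you have to do this on several appliances, the sequence can be condensed into a small script. This is a sketch under a few assumptions: the names run, grow_volume and DRY_RUN are my own, the volume group is called vg_splat as above, and the file systems are ext2/ext3 (which is why e2fsck and resize2fs are used). By default it only prints the commands.

```shell
#!/bin/bash
# Dry run by default: commands are only printed. Set DRY_RUN=0 to execute.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Grow one logical volume and the file system on it.
# $1 = logical volume name, $2 = new size as accepted by lvresize -L
grow_volume() {
    local dev="/dev/vg_splat/$1"
    run lvresize -L "$2" "$dev"
    run e2fsck -f "$dev"
    run resize2fs "$dev"
}

run vgchange -a y
grow_volume lv_current 11G
grow_volume lv_log 25G
run vgchange -a n
```

Run it from the grml live system, never from the running appliance, and check the printed commands before setting DRY_RUN=0.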

That’s it. Reboot and start SecurePlatform.

Check with df -h that you have the desired partition layout:

[Expert@cpmodule]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_splat-lv_current
11G 1.4G 8.9G 14% /
none 11G 1.4G 8.9G 14% /dev/pts
/dev/hdc1 145M 13M 125M 9% /boot
none 502M 0 502M 0% /dev/shm
/dev/mapper/vg_splat-lv_log
25G 33M 24G 1% /var/log

Tobias Lachmann

Delete all ARP entries on SPLAT

Wednesday, May 19th, 2010

We stumbled over this one yesterday: some servers behind a gateway had a problem with ARP resolution and we wanted to make sure that ARP worked. To verify this we tried to delete all ARP entries and see if the ARP cache was filled up again (and correctly).

While Windows has arp -d * as a working command to delete all entries at once, under Linux and therefore SPLAT you have to try something different.

This little script will do the job for you:

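#!/bin/bash
# Delete every entry in the ARP cache, one address at a time.
# The first column of /proc/net/arp holds the IP address; the pattern skips the header line.
for arpentry in $(awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { print $1 }' /proc/net/arp)
do
    arp -d "$arpentry"
done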
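
Before deleting anything on a production gateway, it can be reassuring to see which commands would be issued. The following sketch only prints them; the function name list_arp_deletes is my own, and the pattern simply accepts four dot-separated groups of digits in the first column.

```shell
#!/bin/bash
# Print the "arp -d" command for every IPv4 address found in the first
# column of an ARP table in /proc/net/arp format. Nothing is deleted.
# $1 = file to read (defaults to /proc/net/arp)
list_arp_deletes() {
    awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { print "arp -d " $1 }' \
        "${1:-/proc/net/arp}"
}

# Usage:
#   list_arp_deletes          # show the commands
#   list_arp_deletes | sh     # execute them
```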

Tobias Lachmann

Well done, Royi!

Monday, May 3rd, 2010

Just had an amazing “support experience” with Check Point:
My customer suffered from sudden loss of VPN connectivity as the SmartCenter CA died because of a database corruption.
Check Point needed only 30 minutes from answering my call to providing a hotfix that solved the problem!
Well done, guys! Very well done!

Tobias Lachmann

Critical error messages and logs

Thursday, April 22nd, 2010

Today I want to bring your attention to SecureKnowledge article sk33219, which deals with “Critical error messages and logs”.

There we have a nice list of possible error messages, each with a short explanation of why the error occurred.

What I miss are hints on how to resolve each issue, or pointers to a related SK. But all in all it’s a very useful article that you should bookmark for further reference.

Tobias Lachmann

Neighbour table overflow

Sunday, January 17th, 2010

Under SecurePlatform you can sometimes see the following message in /var/log/messages:

Jan 15 13:44:08 fw1 kernel: Neighbour table overflow.

This refers to the ARP cache a.k.a. Neighbour table.

If you’re running a gateway with lots of interfaces or big subnets, it may see many nodes at Layer 2. Communication with them fills the ARP table and can overflow it, which can lead to connectivity errors.

The ARP cache table has a maximum size, which can be displayed with cat /proc/sys/net/ipv4/neigh/default/gc_thresh3.
You can check the current number of ARP entries either with arp -an | wc -l or with ip neighbor show | wc -l. Proxy ARP entries are only displayed when using the arp command.
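
Putting the two numbers side by side shows how close the table is to overflowing. A minimal sketch; the function name arp_usage is my own and the figures in the example call are made up:

```shell
#!/bin/bash
# Report how full the ARP cache is relative to gc_thresh3.
# $1 = current number of entries, $2 = value of gc_thresh3
arp_usage() {
    echo "$1 of $2 entries used ($(( $1 * 100 / $2 ))%)"
}

# On a live system the two numbers could be gathered like this:
#   arp_usage "$(ip neighbor show | wc -l)" \
#             "$(cat /proc/sys/net/ipv4/neigh/default/gc_thresh3)"
arp_usage 900 1024
```

The percentage is rounded down, so the example call reports 87%.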

The entries in the ARP cache are checked periodically and automatically: at a set interval, a garbage collector runs and removes entries that are no longer used. The interval can be displayed with cat /proc/sys/net/ipv4/neigh/default/gc_interval; by default it’s 30 seconds.

The garbage collector is controlled by three variables:
gc_thresh1, which is the minimum number of entries in the ARP cache. If the actual number of entries is below this value, the garbage collector will not run.

gc_thresh2, which is the soft maximum number of entries. If the actual number of entries is above this value for more than 5 seconds, the garbage collector will run.

gc_thresh3, which is the hard maximum number of entries. If the actual number of entries is above this value, the garbage collector will run immediately.

gc_thresh3 is also the maximum number of ARP entries that can be kept in the table.

The default values are quite low, so you might want to increase them.

You can do this on the fly with the following CLI commands:

sysctl -w net.ipv4.neigh.default.gc_thresh3=4096
sysctl -w net.ipv4.neigh.default.gc_thresh2=2048
sysctl -w net.ipv4.neigh.default.gc_thresh1=1024

This does not survive a reboot.

To make the change survive a reboot, add these lines to the /etc/sysctl.conf file:

net.ipv4.neigh.default.gc_thresh3 = 4096
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh1 = 1024

Afterwards run the command sysctl -p to load the settings from the file. They take effect immediately, so a reboot is only needed if you want to confirm that they persist.
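
If you script this change, it helps to make it repeatable, so that running it twice does not leave duplicate lines in the file. The following is a sketch only; the function name set_sysctl_line is my own invention.

```shell
#!/bin/bash
# Set "key = value" in a sysctl configuration file: drop any existing
# line that assigns the key, then append the new assignment.
# $1 = file, $2 = key, $3 = value
set_sysctl_line() {
    local file="$1" key="$2" value="$3" tmp
    tmp="$(mktemp)" || return 1
    touch "$file"
    # Keep every line whose name part (text before "=") is not the key.
    awk -F= -v k="$key" '{ name = $1; gsub(/[ \t]/, "", name)
                           if (name != k) print }' "$file" > "$tmp"
    echo "$key = $value" >> "$tmp"
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Usage:
#   set_sysctl_line /etc/sysctl.conf net.ipv4.neigh.default.gc_thresh3 4096
#   set_sysctl_line /etc/sysctl.conf net.ipv4.neigh.default.gc_thresh2 2048
#   set_sysctl_line /etc/sysctl.conf net.ipv4.neigh.default.gc_thresh1 1024
#   sysctl -p
```

Writing the result back with cat instead of mv keeps the owner and permissions of the original file.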

Tobias Lachmann

Update: Hardware Monitoring on UTM-1 appliances

Wednesday, December 2nd, 2009

Check Point found the error and described it as follows:
While installing NGX R65 HFA50, the net-snmp packages are updated to version 5.3.1.0. When upgrading to R70.1, the upgrade tries to install net-snmp-5.0.9, which fails because newer packages are already installed.

Check Point provided me with a new R70.1 upgrade package, which I will test in the next few days. Sadly, the updated package hasn’t made it to the download page so far; there you can still get the old, buggy version.

Hopefully they will re-release this upgrade package soon.

If they don’t and you experience problems while upgrading, please ask support for the fix and refer to SR 11-72334871.

UPDATE: Check Point released a SecureKnowledge article for this issue: sk43340

UPDATE 2: Today I spoke to the support guys again. I convinced them that this fix is interesting for everybody out there with a UTM-1 installation and HFA50, which is quite a lot of people, I guess. Now they’re thinking about changing the SK entry and adding a direct download link to the fixed package.

UPDATE 3: Support will release a new article for this issue including the download links: sk43350. At the moment the SK is not publicly available.

Tobias Lachmann

Hardware Monitoring on UTM-1 appliances

Sunday, November 22nd, 2009

The hardware monitoring feature on UTM-1 appliances, available since R70.1, is missing when upgrading the appliance from NGX R65 with Messaging Security HFA50.

Check Point just confirmed this and stated that some RPM packages are not updated, which produces the error.
When you upgrade directly from NGX R65 with Messaging Security, the error does not occur.

We’ll see what solution the developers come up with for this problem. Since this is a general problem, I hope for new upgrade packages instead of some fixes to be applied afterwards. Seems cleaner to me.

I’ll keep you informed.

Tobias Lachmann

My favorite troubleshooting command

Sunday, November 8th, 2009

Do you know how to troubleshoot connection issues the easy way? Instead of looking into SmartView Tracker for the reason of a connection drop, just enter the shell. Then issue fw ctl zdebug drop and you’ll see the dropped packets in real time, together with the reason for each drop. This is an undocumented command, which is actually a shortcut for a couple of debugging commands. A developer from Check Point was too tired of typing the needed debug lines again and again, so he introduced the zdebug command. His first name began with the letter Z, and that is why the command is called zdebug.

The output is very nice, shows the reason for the drop and can easily be filtered with the grep command for IP addresses:

fw_log_drop: Packet proto=17 10.255.253.21:20031 -> 10.255.253.255:20031 dropped by fw_antispoof_log Reason: Address spoofing

fw_log_drop: Packet proto=17 192.243.100.205:58999 -> 224.0.0.1:9996 dropped by fw_handle_first_packet Reason: Rulebase drop - rule 243

fw_log_drop: Packet proto=1 10.68.111.2:1281 -> 10.68.111.5:1669 dropped by fw_icmp_stateless_checks Reason: ICMP redirect packets are not allowed

fw_log_drop: Packet proto=6 192.243.119.238:80 -> 91.96.46.174:49543 dropped by fw_first_packet_state_checks Reason: First packet isn't SYN

Since this is real-time debug output, you need to have live traffic through the firewall to see if a packet is dropped. When you try to investigate the reason for a drop of an older connection, you have to go to SmartView Tracker.
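
When a lot of packets are dropped, a summary per drop reason is easier to read than the raw lines. Here is a small sketch that counts the reasons in captured output. The function name count_drop_reasons and the file /var/tmp/drops.txt are examples of my own, and the parsing assumes the line format shown above.

```shell
#!/bin/bash
# Count drop reasons in "fw ctl zdebug drop" output read from stdin.
# Everything after "Reason: " is used as the key.
count_drop_reasons() {
    awk -F'Reason: ' '/dropped by/ && NF > 1 { n[$2]++ }
                      END { for (r in n) print n[r], r }' | sort -rn
}

# On a gateway, the live output could be narrowed to one address and
# saved for later (stop the capture with Ctrl-C):
#   fw ctl zdebug drop | grep 10.68.111.2 | tee /var/tmp/drops.txt
#   count_drop_reasons < /var/tmp/drops.txt
```

Each output line holds a count followed by the reason, most frequent first.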

Tobias Lachmann

Capacity Optimization

Tuesday, October 27th, 2009

Because one of my customers recently ran into this problem, it’s probably a good idea to mention it again.

The firewall has a limit on its maximum number of concurrent connections. This is necessary to limit the amount of memory allocated.

But if you reach the limit, the firewall stops accepting new connections. You may experience this as a partial loss of connectivity.

To check the current number of connections and the peak value, run fw tab -t connections -s on the command line:

[Expert@fw1]# fw tab -t connections -s
HOST NAME ID #VALS #PEAK #SLINKS
localhost connections 8158 108437 166360 378754

The memory allocation and use of connections can also be shown with fw ctl pstat.

[Expert@fw1]# fw ctl pstat

Machine Capacity Summary:
Memory used: 12% (203MB out of 1604MB) - below low watermark
Concurrent Connections: 15% (79242 out of 499900) - below low watermark
Aggressive Aging is not active

If your concurrent connections are near the limit, you can increase the number using the SmartDashboard. Just edit the properties of the gateway object under capacity optimization and set a higher value. Please note that the memory allocation will also increase when you change something here, so make sure you’ve got enough free memory.
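
To keep an eye on this without reading the whole pstat output, the relevant line can be parsed with a few lines of shell. This is a sketch: the function name check_connections and the default warning threshold of 80 percent are my own choices, and the parsing assumes the line format shown above.

```shell
#!/bin/bash
# Read fw ctl pstat output from stdin, extract the current and maximum
# number of concurrent connections, and warn when usage reaches the
# threshold given as $1 (default: 80 percent).
check_connections() {
    local warn="${1:-80}"
    awk -v warn="$warn" '
        /Concurrent Connections:/ {
            # Expected: "Concurrent Connections: 15% (79242 out of 499900) - ..."
            if (match($0, /\([0-9]+ out of [0-9]+\)/)) {
                split(substr($0, RSTART + 1, RLENGTH - 2), f, " ")
                pct = int(f[1] * 100 / f[4])
                status = (pct >= warn) ? "WARNING" : "OK"
                printf "%s: %d of %d connections (%d%%)\n", status, f[1], f[4], pct
            }
        }'
}

# Usage on the gateway:
#   fw ctl pstat | check_connections 80
```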


Tobias Lachmann

Some stuff posted in MCS customer mag

Friday, October 23rd, 2009

My employer MCS Moorbek Computer Systeme GmbH publishes a customer magazine on a regular basis.

I had some articles about Check Point topics in this magazine, written in German:

Really basic stuff, actually, but worth noting.

If you’re into Solaris or MySQL, check out the “Admin Tipps & Tricks” in the other issues of the magazine. Usually they’re located on the last pages. Use the link above to access current and older magazines.

Tobias Lachmann

Presentations from Check Point User Group Conference (CPUGCON) 2009

Wednesday, October 21st, 2009

I think I’ll get this blog started by posting links to the presentations I gave at the Check Point User Group Conference in Chur, Switzerland.

The first presentation is purely for beginners: Troubleshooting in the Check Point environment, Part I

The second one, which the crowd at CPUGCON liked better, is really advanced troubleshooting: Troubleshooting in the Check Point environment, Part II

For these two presentations I benefited from my daily work with a Check Point Collaborative Support Provider (CCSP), as they reflect the things I’m constantly facing.

From the project side, I did lots of migrations from distributed Check Point installations to Check Point UTM-1 Full-Cluster. This means that the firewall / VPN part runs as an active/standby cluster, and we also have Management High Availability with the two SmartCenters. This is described in the presentation: Migration from a Distributed Environment to a UTM-1 Cluster

Best regards

Tobias Lachmann