Using GSiteCrawler with slscart

GSiteCrawler (GSC) generates a Google sitemap for your website. Check out the references at the bottom of the page.

GSiteCrawler runs on your local computer or laptop.

Setting up GSiteCrawler

If you have a small website, I would not recommend a sitemap: the search engines will find your changes quickly on their own, and the extra time spent dealing with a sitemap is not worth it.

Prerequisites

Google recommends putting the sitemap file in the root directory of your web site. The default name of the file is sitemap.xml.

Robots.txt Entry

Your robots.txt file should contain the full URL of the sitemap file. Use https if your site is secure.

Sitemap: https://www.mydomain.com/sitemap.xml

or

Sitemap: https://www.mydomain.com/sitemap.xml.gz

for a compressed (i.e. smaller) sitemap. GSC can generate both. The compressed version is better because it takes less bandwidth to download. If you use it, the robots.txt entry must point to sitemap.xml.gz.
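A quick way to confirm the entry is readable is to parse robots.txt the way a crawler would. This is only a sketch using Python's standard urllib.robotparser module (site_maps() needs Python 3.8 or later). The function name sitemap_entries and the sample robots.txt text are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def sitemap_entries(robots_txt):
    """Return the list of Sitemap URLs declared in robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


# Hypothetical robots.txt contents, for illustration only.
example = """User-agent: *
Disallow: /logs/
Sitemap: https://www.mydomain.com/sitemap.xml.gz
"""

print(sitemap_entries(example))
```

An empty list means the Sitemap line is missing or misspelled.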

Download and Install

Download gsitecrawler and install it.

Setup

Most entries are self-explanatory but there are a few that may trip you up.

Project Tab

Main URL: url to your home page (e.g. http://www.mydomain.com). Use https:// if your site is secure.

Settings - General tab

Uncheck: remove trailing slash on folder names

File extensions to check

There is no reason to put images in the sitemap since they will be picked up by the search engines in a different manner. Delete the following image extensions:

jpg
jpeg
gif
png
tif

Location of project files

Create a folder called 'gsitecrawler' in your local website folder and enter the full path to it here (e.g. C:\mydomain.com\gsitecrawler\). Note this is a folder and not a file. It is not necessary, nor advisable, to keep your gsitecrawler files online.

Location and name of sitemap file

Enter the full path to your local website folder, ending with sitemap.xml (e.g. C:\mywebsite\sitemap.xml).

Settings - FTP tab

FileZilla will be used for secure FTP transfers, so this tab does not need to be filled in. In FileZilla you can use your cPanel login and password with the SFTP protocol for a secure transfer.

Settings - Automation tab

Not used

Filter Tab

Ban URLs tab

Add any directories, such as logs, that you do not want gsitecrawler to look at. For example, on slscart:

/gsitecrawler/
/logs/
/sbconf/
/storeadmin/
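When hand checking the URL list later, a small script can flag any URL that falls under one of these directories. This is a sketch only: it assumes a simple "path contains the banned text" test, which may not be exactly how GSC applies its ban list, and the name is_banned is made up for illustration.

```python
from urllib.parse import urlparse

# The slscart directories listed above.
BANNED = ["/gsitecrawler/", "/logs/", "/sbconf/", "/storeadmin/"]


def is_banned(url, banned=BANNED):
    """Return True if the URL path contains any banned directory."""
    path = urlparse(url).path
    return any(part in path for part in banned)


print(is_banned("https://www.mydomain.com/logs/access.log"))
print(is_banned("https://www.mydomain.com/products/widget.html"))
```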

Drop Parts

Not used

Remove Parameters

Not used

URL List Tab

Generally only the first URL is checked ‘Manually’, since it is the URL of the site itself.

‘Delete all non-manual links’ is basically a fresh start: all links except the manual ones are erased. Use it if the sitemap has not been generated for a while.

Statistics tab

Used after GSC is run.

Running GSiteCrawler

Click the '(Re)Crawl' button on the top toolbar and wait until it finishes. A dialog box will appear when GSC is done, but it closes itself after 10 seconds. On large sites the crawl will take a long time. On incorrectly designed sites GSiteCrawler may never finish.

The date GSC was last run is shown on the Statistics tab, next to the Type: dropdown.

Generate Sitemaps

When the crawler is finished, click the 'Generate' button in GSiteCrawler. A dialog will appear. Check 'Generate Google Sitemap file'. (There is no reason to check the Yahoo urllist file since Yahoo now recognizes the sitemap file.) This will generate new sitemap.xml and sitemap.xml.gz files in the directory specified under 'Location and name of sitemap file'.

Review the statistics (Statistics tab) and correct any errors. If errors occurred, rerun the crawler.

Upload the sitemap.xml and sitemap.xml.gz files to the root directory of your site (where robots.txt is located).
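Before uploading, it can be worth confirming that both files parse and list the same URLs. The following is a minimal sketch using only the Python standard library. The function names and the sample XML are made up for illustration. It matches the loc elements by local tag name, because I am not certain which sitemap namespace version a given GSC release writes.

```python
import gzip
import xml.etree.ElementTree as ET


def sitemap_urls(xml_bytes):
    """Return every <loc> value found in sitemap XML."""
    root = ET.fromstring(xml_bytes)
    urls = []
    for element in root.iter():
        # Compare the local tag name so any sitemap namespace is accepted.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


def load_sitemap(path):
    """Read sitemap.xml or sitemap.xml.gz from disk and return its URLs."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        return sitemap_urls(handle.read())


# Hypothetical sitemap contents, for illustration only.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.mydomain.com/</loc></url>
  <url><loc>https://www.mydomain.com/products.html</loc></url>
</urlset>"""

print(sitemap_urls(SAMPLE))
```

On your own machine, call load_sitemap on the path you entered under 'Location and name of sitemap file' and again on the same path with .gz added. The two lists should be identical.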

URL List Tab

‘Confirm Existance’ (sic) checks each URL to see if it is valid.

Hand check to make sure all the URLs are present. Use XENU to help.

Statistics tab

There are several types of statistics. Click on ‘generate statistics’ on each page. The stats, along with the date, are updated to reflect the last time GSC was run.

Aborted URLs stat

This lists the URLs that are invalid. They need to be fixed.

Duplicate URLs stat

This lists cases where GSiteCrawler found identical files, or an old link that was redirected to a new link. Search your code for the old link and correct it.

General Statistics

Shows:

Main URL
Number of URLs listed total - total number of URLs GSC found plus any that were manually entered
Number of URLs to be included - number of URLs included in sitemap.xml
Number of URLs listed to be crawled
Number of URLs still waiting in the crawler
Number of URLs aborted in the crawler - a summary of the Aborted URLs stat. These need to be fixed.

Last ROBOTS.TXT stat

Shows the current robots.txt contents.

Page-speed statistics

These stats can help track down bloated web pages.

Closing

If you wish to view the sitemap online you need to upload a file called gss.xsl to the same directory as the sitemap.xml file. The gss.xsl file is in the directory where the sitemap.xml file is created on your computer. Note this file is not necessary for Google or any other search engine to use your sitemap file.

Troubleshooting

Questions and answers for troubleshooting GSC issues.

Q: gsitecrawler never finishes.

A:
1) Make sure your links are OK. Use XENU to analyze your links.
2) Perhaps it finished crawling and you missed the popup message saying so.
3) Check Crawler Watch.
4) Check the URL list (and refresh).
5) Check Statistics.
6) If it has finished, once the URL list is done you can modify the priority and frequency of any URL or block of URLs.
7) Don't forget to click the button to Generate > Google sitemap.
8) Did you check for any error reports?
9) Perhaps the server decided to block your IP.
10) Or maybe your robots.txt file blocks certain areas.
11) Or some areas need a login.
12) Or other URLs do not follow the same canonical form (www or non-www) as the URL you started with.
13) Or your computer went into sleep mode, or rebooted (maybe there was an automatic update during the night or something).
14) If your homepage (or other pages, such as subforums) loads very slowly, GSC might have timed out.
15) If you have not actually stopped the crawlers and flushed the crawler queue, crawling resumes where it left off when you reopen the program.
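The canonical-form problem in item 12 can be spotted with a short script once you have a list of URLs (for example, copied out of the URL List tab or reported by XENU). This is a rough sketch. The function name off_canonical and the sample URLs are made up for illustration, and it only compares the scheme and host against the Main URL.

```python
from urllib.parse import urlparse


def off_canonical(main_url, urls):
    """Return the URLs whose scheme or host differs from the main URL."""
    main = urlparse(main_url)
    wanted = (main.scheme, main.netloc.lower())
    mismatched = []
    for url in urls:
        parts = urlparse(url)
        if (parts.scheme, parts.netloc.lower()) != wanted:
            mismatched.append(url)
    return mismatched


# Hypothetical URLs, for illustration only.
found = [
    "https://www.mydomain.com/a.html",
    "https://mydomain.com/b.html",
    "http://www.mydomain.com/c.html",
]
print(off_canonical("https://www.mydomain.com/", found))
```

Any URL it prints mixes www with non-www, or http with https, and is worth correcting in your pages.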

Q: when I click 'generate statistics' the statistics are not updated.

A: Go to the project file directory (named in the Settings - General tab). Make sure the following files are NOT read-only (right click on the file > Properties):

Aborted.txt
Last_Stats.txt
Robots.txt
StatDup.txt
StatSpeed.txt
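Instead of opening Properties on each file, the check can be scripted. This is a sketch only. The function name read_only_files is made up, and the folder passed in is whatever you entered under 'Location of project files'. It looks at the file's write permission bit, which on Windows reflects the read-only attribute.

```python
import os
import stat

# The statistics files listed above.
STAT_FILES = ["Aborted.txt", "Last_Stats.txt", "Robots.txt",
              "StatDup.txt", "StatSpeed.txt"]


def read_only_files(project_dir, names=STAT_FILES):
    """Return the names of existing statistics files that are read-only."""
    locked = []
    for name in names:
        path = os.path.join(project_dir, name)
        # Files that do not exist yet are skipped rather than reported.
        if os.path.exists(path) and not (os.stat(path).st_mode & stat.S_IWRITE):
            locked.append(name)
    return locked


# Example folder from the Settings - General tab section.
print(read_only_files(r"C:\mydomain.com\gsitecrawler"))
```

An empty list means none of the files that exist are read-only.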

Q: gsitecrawler only crawls the root url

A: It could be many things.
1) using meta refresh on the homepage to redirect to another page
2) JavaScript or flash only navigation
3) navigation buried in frames/iframes
4) switch from www to non www urls (or vice-versa) for the other links on the homepage
5) badly broken code that makes the code with the links inaccessible
6) a robots.txt file which disallows other urls
7) the use of a robots meta tag that specifies "nofollow" (for example, on the home page)
8) the use of rel="nofollow" on your other links
9) the robot being blocked by the server
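Several of the causes above (meta refresh, frames, and the nofollow cases) can be looked for in the home page's HTML source. The following is a minimal sketch using Python's standard html.parser module. The class and function names and the sample HTML are made up for illustration, and it does not cover JavaScript navigation, robots.txt, or server blocks.

```python
from html.parser import HTMLParser


class CrawlBlockerFinder(HTMLParser):
    """Collect markup that can stop a crawler from following links."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        attrs = {key: (value or "") for key, value in attrs}
        if tag == "meta":
            content = attrs.get("content", "").lower()
            if attrs.get("name", "").lower() == "robots" and "nofollow" in content:
                self.findings.append("robots meta tag with nofollow")
            if attrs.get("http-equiv", "").lower() == "refresh":
                self.findings.append("meta refresh")
        elif tag == "a" and "nofollow" in attrs.get("rel", "").lower():
            self.findings.append("link with rel=nofollow: " + attrs.get("href", ""))
        elif tag in ("frame", "iframe"):
            self.findings.append(tag + " element")


def find_crawl_blockers(html):
    finder = CrawlBlockerFinder()
    finder.feed(html)
    return finder.findings


# Hypothetical home page source, for illustration only.
SAMPLE_PAGE = """<html><head>
<meta name="robots" content="index, nofollow">
</head><body>
<a href="/shop/" rel="nofollow">Shop</a>
<a href="/about.html">About</a>
</body></html>"""

print(find_crawl_blockers(SAMPLE_PAGE))
```

Paste your own home page source (View Source in the browser) in place of SAMPLE_PAGE to check it.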

Q: gsitecrawler does not run or only scans the root url

A:
1) GSC must be run as administrator.
2) Try running it in compatibility mode (see below).

Compatibility mode

In Windows 10:

In the Start menu, under Softplus GSiteCrawler, right-click GSiteCrawler and choose Open File Location
Right-click GSiteCrawler and click Properties
Under the Shortcut tab, click Advanced and check 'Run as administrator'
Under the Compatibility tab, run the compatibility troubleshooter. I believe GSC should be run in compatibility mode for Windows Vista (Service Pack 2). Click Apply

References

Learn about Sitemaps - Google explains why a sitemap is/is not necessary

https://www.sitemaps.org/protocol.html

XENU - analyzes a website for missing or malformed URLs. Gives a good idea of what a bot sees.
