Using GSiteCrawler with slscart
GSiteCrawler (GSC) generates a Google sitemap for your website. Check out the references at the bottom of the page.
GSiteCrawler runs on your local computer or laptop.
Setting up GSiteCrawler
If you have a small website, I would not recommend a sitemap: the search engines will find your changes quickly anyway, and the extra time spent maintaining a sitemap is not worth it.
Google recommends putting the sitemap file in the root directory of your website. The default name of the file is sitemap.xml, or sitemap.xml.gz for a compressed (i.e. smaller) sitemap. GSC can generate both; the compressed version is better in that downloading it uses less bandwidth.
Your robots.txt file should contain the path to the sitemap file. Use https if your site is secure, and note that if you use the compressed version, sitemap.xml.gz is the name that must be specified in robots.txt.
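As a sketch, the relevant robots.txt lines might look like this (the domain and the disallowed directory are illustrative, not from a real site):

```text
User-agent: *
Disallow: /logs/

Sitemap: https://www.mydomain.com/sitemap.xml
```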
Download and Install
Download GSiteCrawler and install it.
Most entries are self-explanatory but there are a few that may trip you up.
Main URL: the URL of your home page (e.g. http://www.mydomain.com). Use https:// if your site is secure.
Settings - General tab
Uncheck: remove trailing slash on folder names
File extensions to check
There is no reason to put images in the sitemap, since search engines pick them up in a different manner. Delete the following image extensions:
Location of project files
Create a folder called 'gsitecrawler' in your local website folder and enter the full path to it (e.g. C:\mydomain.com\gsitecrawler\). Note this is a folder, not a file. It is not necessary, nor advisable, to keep your GSiteCrawler project files online.
Location and name of sitemap file
Enter the full path to your local website folder, ending with sitemap.xml (e.g. C:\mywebsite\sitemap.xml).
Settings - FTP tab
FileZilla will be used for secure FTP transfers, so this tab does not need to be filled in. (In FileZilla you can use your cPanel login and password with the SFTP protocol for a secure transfer.)
Settings - Automation tab
Ban URLs tab
Add any directories, such as logs, that you do not want GSiteCrawler to look at. For example, on slscart:
URL List Tab
Generally only the first URL is checked ‘Manually’, as this is the URL of the site.
‘Delete all non-manual links’ is basically a start-over: all links except the manual ones are erased. This should be used if the sitemap has not been generated for a while.
Used after GSC is run.
Click the '(Re)Crawl' button on the top toolbar and wait until it finishes. A dialog box appears when GSC is done, but it closes itself after 10 seconds. Large sites will take a long time; on incorrectly designed sites, GSiteCrawler may never finish.
The last time GSC was run is shown under the Statistics tab, next to the Type: dropdown.
When the crawler is finished, click the 'Generate' button in GSiteCrawler. A dialog will appear; check 'Generate Google Sitemap file'. (There is no reason to check the Yahoo urllist file, since Yahoo now recognizes the sitemap file.) This generates new sitemap.xml and sitemap.xml.gz files in the directory specified under 'Location and name of sitemap file'.
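For reference, the generated file follows the standard sitemap protocol. A minimal sitemap looks roughly like this (the URL and values are illustrative; inspect your own generated file for the exact contents):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.mydomain.com/</loc>
    <lastmod>2024-01-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>
```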
Review the statistics (Statistics tab) and correct any errors. If errors occurred, rerun the crawler.
Upload the sitemap.xml and sitemap.xml.gz files to the root directory (where robots.txt is located).
URL List Tab
‘Confirm Existance’ (sic) checks each URL to see if it is valid.
Hand-check to make sure all the URLs are present. Use XENU to help.
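One way to speed up the hand-check is to dump the <loc> entries from the generated file and compare them against what XENU reports. A minimal Python sketch (not part of GSC; the file name passed in is up to you):

```python
# List the URLs contained in a local sitemap file for hand-checking.
# Handles both sitemap.xml and the compressed sitemap.xml.gz.
import gzip
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(path):
    """Return the <loc> values from a sitemap.xml or sitemap.xml.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        root = ET.parse(f).getroot()
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]
```

Printing the result of `sitemap_urls("sitemap.xml")` gives a plain list of URLs that is easy to diff against a XENU export.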
There are several types of statistics. Click ‘generate statistics’ on each page; the stats are updated based on the last time GSC was run, and that date is shown alongside them.
Aborted URLs stat
This lists the URLs that are invalid. They need to be fixed.
Duplicate URLs stat
GSiteCrawler found identical files, or an old link that was redirected to a new link. Search your code for the old link and correct it.
- Main URL
- Number of URLs listed total - the total number of URLs GSC found, plus any that were entered manually
- Number of URLs to be included - the number of URLs included in sitemap.xml
- Number of URLs listed to be crawled
- Number of URLs still waiting in the crawler
- Number of URLs aborted in the crawler - a summary of the Aborted URLs stat. These need to be fixed.
Last ROBOTS.TXT stat
Shows the current robots.txt contents.
These stats can help track down bloated web pages.
If you wish to view the sitemap online, you need to upload a file called gss.xsl to the same directory as the sitemap.xml file. The gss.xsl file is in the directory where sitemap.xml is created on your computer. Note this file is not necessary for Google or any other search engine to use your sitemap file.
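If the stylesheet is in play, the top of the generated sitemap.xml should reference it with an xml-stylesheet processing instruction along these lines (check your own generated file; this is how such a reference normally looks, not a guaranteed verbatim copy of GSC's output):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="gss.xsl"?>
```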
Questions and answers about troubleshooting GSC issues.
- Q: GSiteCrawler never finishes.
1) Make sure your links are OK; use XENU to analyze them.
2) Perhaps it finished crawling and you missed the popup message saying so.
3) Check Crawler Watch.
4) Check URL list (and refresh).
5) Check Statistics.
6) If it's finished, after you review the URL List you can modify the priority or frequency of any URL or block of URLs.
7) Don't forget to click the button to Generate > Google sitemap
8) Did you check for error reports?
9) Perhaps the server decided to block the IP.
10) Or maybe your robots.txt file blocks certain areas.
11) Or some areas need login.
12) Or other URLs are not following the same canonical form (www or non-www) as whatever you start out with.
13) Or your computer went into sleep mode, or rebooted (maybe there was an automatic update during the night or something).
14) Your homepage is very slow in loading, as are all subforums. GSC might have timed out.
15) If you have not actually stopped the crawlers and flushed the crawler queue, when you reopen the program crawling resumes where it left off.
- Q: When I click 'generate statistics', the statistics are not updated.
- A: Go to the project file directory (named in the Settings - General tab). Make sure the following files are NOT read-only (right-click on the file > Properties):
- Q: GSiteCrawler only crawls the root URL.
- A: It could be many things.
1) using meta refresh on the homepage to redirect to another page
2) navigation buried in frames/iframes
3) a switch from www to non-www URLs (or vice-versa) for the other links on the homepage
4) badly broken code that makes the code with the links inaccessible
5) a robots.txt file which disallows other URLs
6) the use of a robots meta tag that specifies "nofollow"
7) the use of rel="nofollow" on your other links
8) the robot being blocked by the server
9) a robots meta tag on the home page that specifies nofollow
- Q: GSiteCrawler does not run, or only scans the root URL.
- A: 1) Run GSC as administrator.
2) Try running in compatibility mode.
In Windows 10:
- In the Start menu under Softplus GSiteCrawler, right-click on GSiteCrawler and choose Open File Location
- Right click on GSiteCrawler and click Properties
- Under Shortcut tab, click on Advanced and check 'Run as administrator'
- Under the Compatibility tab, run the Compatibility troubleshooter. I believe GSC should be run in Windows Vista Service Pack 2 compatibility mode. Click Apply
Learn about Sitemaps - Google explains why a sitemap is or is not necessary
XENU - analyzes a website for missing and malformed URLs. Gives a good idea of what a bot sees.