blogblogblog SyKoHPaTh

Sitemap XML Generator

Forgot I made this a while back... OK, you know how there are "free" sitemap.xml generators? And you know how they limit you to, like, 10 links? And you know how they suck?

Time to fix that.

Instructions:

1) Open "sitexmlgen.php"
2) Change the site to scan in the VARIABLES section near the top: $linklist[0] = "http://www.sykohpath.com/";
3) Place on your site
4) Run the script. Example: www.yoursite.com/sitexmlgen.php
5) Wait until it finishes. This will take a while on large sites with many links.
6) When finished, view source
7) Copy ALL the text under "XML IS BELOW" section.
8) Paste into new document "sitemap.xml".
9) Done with generation! If you don't know what to do with that file, learn to SEO.
10) there is no 10.
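
Steps 6 through 8 are manual copy-and-paste. If the web server is allowed to write to the target directory, a rough sketch like the one below could save the file instead. `save_sitemap()` is a made-up helper, not part of sitexmlgen.php, and `$xml` stands in for whatever XML string the script built; the temp directory is only used here so the example runs anywhere.

```php
<?php
// Hypothetical helper (NOT in sitexmlgen.php): write the generated XML
// string to disk so there is no view-source copy/paste step.
function save_sitemap($xml, $path) {
    // LOCK_EX keeps two simultaneous runs from interleaving their writes.
    $bytes = file_put_contents($path, $xml, LOCK_EX);
    if ($bytes === false) {
        print "Unable to write: {$path}\n";
        return false;
    }
    return true;
}

// Example with a tiny hand-made sitemap. Assumption: the directory is writable.
$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n"
     . "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n"
     . "\t<url>\n\t\t<loc>http://www.yoursite.com/</loc>\n\t</url>\n"
     . "</urlset>\n";
$target = sys_get_temp_dir() . "/sitemap.xml";
$saved = save_sitemap($xml, $target);
```

On a real site the path would more likely be the web root, so the file ends up at www.yoursite.com/sitemap.xml.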


And here's the code. Note that there *is* a 50,000 link limit, imposed by sitemaps.org:

Code Sample:
  <?php /* ---- Information ----
  Name: sitexmlgen.php "sitemap.xml Generator"
  Last Updated: 20110512 080000
  Page Version: sitexmlgen.php v1.0
  Author: SyKoHPaTh
  ---- Version History ----
  1.0 Initial Coding
  -------------------------
  PURPOSE:
  "crawl" a site and record the links if they are valid.
  Note: this only picks up links between <a></a> tags.
  -------------------------
  TODO:
  -------------------------
  LICENSE:
  Modification: OK, but must keep credit line: "SyKoHPaTh (www.sykohpath.com)", and this License. Any modifications MUST be written in "Version History", with your name and/or handle, and what the modification was.
  Free for public and commercial use. If you paid for this, you got scammed.
  --------------------------
  */
  /* -------- VARIABLES -------- */
  $linklist = array();
  $linklist[0] = "http://www.sykohpath.com/";
  $sitemap_limit = 50000; //enforced by sitemaps.org, max number of links in one sitemap XML file.
  // there is also a size limit on sitemap XML files (10MB when this was written, 50MB uncompressed now), but we're not checking for that here.
  /* -------- FUNCTIONS -------- */
  function digger($scanlink) {
      //does the work of scanning a page and putting links into an array
      $linkcontents = @file_get_contents($scanlink);
      if(!$linkcontents) {
          print "Unable to open: {$scanlink}<br />\n";
          return array();
      }
      $linkinfo = parse_url($scanlink);
      $linkcore = $linkinfo['scheme'] . "://" . $linkinfo['host'];
      $linkpath = isset($linkinfo['path']) ? $linkinfo['path'] : "/";
      //throw away every tag except <a>
      $linkcontents_strip = strip_tags($linkcontents, "<a>");
      //assumed behaviour: prefix scheme and host onto href="/..." links,
      //and scheme, host and current path onto href="?..." links
      $linkcontents_mod = preg_replace("/<a([^>]*)href=\"\//is", '<a${1}href="' . $linkcore . '/', $linkcontents_strip);
      $linkcontents_mod = preg_replace("/<a([^>]*)href=\"\?/is", '<a${1}href="' . $linkcore . $linkpath . '?', $linkcontents_mod);
      //grab the href of every remaining <a ...>text</a>
      preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $linkcontents_mod, $matches);
      return $matches[1];
  }
  function checklink($linkcheck) {
      //simply checks a link to see if it loads up or not
      $linkcontents = @file_get_contents($linkcheck);
      if(!$linkcontents) {
          print "Unable to open: {$linkcheck}<br />\n";
          return false;
      }
      return true;
  }
  /* -------- Initial header thing -------- */
  $xmloutput = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
  //the start page goes in the sitemap too
  $xmloutput .= "\t<url>\n\t\t<loc>" . htmlspecialchars($linklist[0]) . "</loc>\n\t</url>\n";
  $x = 0;
  while(1==1){
      //gen list
      print "Scanning [$x of " . (count($linklist)-1) . "]: " . $linklist[$x] . "<br />\n";
      $linkmatch = digger($linklist[$x]);
      //scan list
      foreach($linkmatch as $key=>$value){
          //print $key . ": " . $value . "<br />\n";
          //stop adding once the sitemap is full
          if(count($linklist) >= $sitemap_limit){ break; }
          $value = trim($value); //strip whitespace BAD CODER, BAD!
          //filter bad data
          if($value == "" || preg_match("/^(#|mailto:|javascript:)/i", $value)){ continue; }
          //check if it's a relative link
          if(substr($value, 0, 4) != "http"){
              //add scanned linklist to front, and see if it's a valid link
              //cut out everything after the slash: http://w3dev.millerind.com/parts/index.php?bid=2
              $pattern = preg_replace("/[^\/]*$/s", "", $linklist[$x]);
              $value = preg_replace("/^[\/]/s", "", $value); //strip beginning / if there is one
              if(!checklink($pattern . $value)){ continue; }
              $value = $pattern . $value;
          }
          //check link against $linklist[0]: skip foreign links and links we already have
          if(substr($value, 0, strlen($linklist[0])) == $linklist[0] && !in_array($value, $linklist)){
              //push to array
              $linklist[] = $value;
              //escape & and friends so the XML stays valid
              $xmloutput .= "\t<url>\n\t\t<loc>" . htmlspecialchars($value) . "</loc>\n\t</url>\n";
          }
      }
      //echo "Total links: " . count($linklist) . "<br />\n";
      //if nothing new was added, exit loop
      if($x + 1 >= count($linklist)){ break; }
      //if limit reached, exit loop
      if(count($linklist) >= $sitemap_limit){ break; }
      $x = $x + 1;
  }
  //Optional tags for each link.
  //<lastmod>" . date("Y-m-d") . "</lastmod>
  //<changefreq>yearly</changefreq>
  //<priority>0.5</priority>
  $xmloutput .= "</urlset>";
  print "<br /><br />
  -----------------------------------------------------------<br />
  XML IS BELOW (view source)<br />
  Copy and paste into \"sitemap.xml\"<br />
  -----------------------------------------------------------<br />
  " . $xmloutput;
  ?>
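
The fiddly part of the crawl is turning whatever sits in an href into a full URL before checking it against the start page. The script does this with a couple of regex replacements; the sketch below pulls the same idea into one function so it can be tried on its own. `resolve_link()` is a hypothetical helper, not something in sitexmlgen.php, and it only handles the simple cases (no "../" cleanup).

```php
<?php
// Hypothetical helper (NOT in sitexmlgen.php): turn an href found on
// $pageurl into an absolute URL, or null if it is not a crawlable page link.
function resolve_link($pageurl, $href) {
    $href = trim($href);
    $info = parse_url($pageurl);
    if ($info === false || !isset($info['scheme'], $info['host'])) {
        return null; // the page URL itself is unusable
    }
    // Skip things that are not pages: empty hrefs, anchors, mailto:, javascript:
    if ($href === "" || $href[0] === "#" || preg_match("/^(mailto|javascript|tel):/i", $href)) {
        return null;
    }
    // Already absolute: leave it alone
    if (preg_match("/^https?:\/\//i", $href)) {
        return $href;
    }
    // Protocol-relative: //host/path borrows the page's scheme
    if (substr($href, 0, 2) === "//") {
        return $info['scheme'] . ":" . $href;
    }
    $core = $info['scheme'] . "://" . $info['host'];
    if (isset($info['port'])) {
        $core .= ":" . $info['port'];
    }
    $path = isset($info['path']) ? $info['path'] : "/";
    if ($href[0] === "/") {   // root-relative: /about.php
        return $core . $href;
    }
    if ($href[0] === "?") {   // query-only: ?p=2 stays on the current page
        return $core . $path . $href;
    }
    // Relative to the current directory: drop everything after the last slash
    $dir = preg_replace("/[^\/]*$/s", "", $path);
    return $core . $dir . $href;
}

$page = "http://www.sykohpath.com/blog/index.php?p=1";
print resolve_link($page, "post.php") . "\n";
print resolve_link($page, "/about.php") . "\n";
print resolve_link($page, "?p=2") . "\n";
```

Anything that comes back null, or that doesn't start with the start page URL, would simply be skipped.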



php, xml, sitemap