How to Crawl and Update Private Sites
A Guide on How to Crawl Password Protected Webpages
The website you would like to crawl may keep much of its rich content behind a login page.
To get the crawler past this login, follow the steps below, which include installing a browser extension. Please also ensure that you have the full authority and right to crawl a password-protected site.
Install a Chrome extension that fetches a session cookie, which gives Wonderchat access to crawl your site.
Get the extension at this link: Get cookies.txt LOCALLY, then click on "Add to Chrome". This tool exports cookies locally on your own computer, so you can store the cookie safely.
Once the extension has been installed successfully, it will show up in your browser's extensions bar.
Log into your password-protected website. For instance, if you want to crawl a private WordPress community site, you have to be logged into that site so that the session cookie can be saved.
After you have logged into your private website, open the "Get cookies.txt LOCALLY" extension.
Set "Export Format" as JSON. This is critically important, as the default format is set as "Netscape".
Click "copy" to copy your session cookie to your clipboard.
Go back to your Wonderchat dashboard and click on "Create Chatbot", or for an existing chatbot go to "Edit Chatbot > Chatbot Settings > Data Sources".
Enter the link to your website that requires a login.
Adjust the settings of the crawl. If you only want to crawl one page or a sub-directory of your private site, remember to specify that.
Under "Advanced Settings", paste the previously copied session cookie into the field.
Hit "create" to create a chatbot trained on your private website data. After a successful crawl, the pages that were crawled are listed in the "pages crawled" column.
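Because the export format step above is easy to miss, it can help to check the copied text before pasting it. The sketch below is a local check only, not part of Wonderchat. It assumes that the JSON export is a list of cookie objects with "name" and "value" keys, and that a Netscape export uses seven tab-separated fields per line. The function name detect_cookie_format is hypothetical.

```python
import json


def detect_cookie_format(text: str) -> str:
    """Return 'json', 'netscape', or 'unknown' for exported cookie text."""
    stripped = text.strip()
    if not stripped:
        return "unknown"

    try:
        data = json.loads(stripped)
    except json.JSONDecodeError:
        data = None

    # Assumption: a JSON export is a non-empty list of objects,
    # each carrying at least "name" and "value" keys.
    if isinstance(data, list) and data and all(
        isinstance(c, dict) and "name" in c and "value" in c for c in data
    ):
        return "json"

    # Netscape cookie files are tab-separated, seven fields per cookie line.
    # Lines starting with "#" are comments, except the "#HttpOnly_" prefix.
    lines = [
        ln for ln in stripped.splitlines()
        if ln.strip() and (not ln.startswith("#") or ln.startswith("#HttpOnly_"))
    ]
    if lines and all(len(ln.split("\t")) == 7 for ln in lines):
        return "netscape"

    return "unknown"
```

If the function returns "netscape", go back to the extension, set the export format to JSON, and copy the cookie again.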
Many websites handle authentication differently. While we have tried to support as many websites as possible, edge cases may still fall through.
If your website still cannot be crawled, please reach out to us at https://wonderchat.io/contact and we will do our best to assist you.
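Before contacting support, you may want to check locally whether the exported cookie is accepted by your site at all. The following is a minimal sketch using only the Python standard library. It does not reproduce how Wonderchat's crawler works, and it assumes the JSON export is a list of objects with "name" and "value" keys. The function names and the URL in the usage note are hypothetical.

```python
import json
import urllib.request
from typing import Tuple


def cookie_header(exported_json: str) -> str:
    """Build a Cookie header value from the exported JSON cookie list."""
    cookies = json.loads(exported_json)
    # Assumption: every cookie object has "name" and "value" keys.
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


def fetch_status(url: str, exported_json: str) -> Tuple[int, str]:
    """Request the page with the cookies and return (status, final URL)."""
    request = urllib.request.Request(
        url, headers={"Cookie": cookie_header(exported_json)}
    )
    # urlopen follows redirects and raises urllib.error.HTTPError on 4xx/5xx.
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status, response.geturl()
```

For example, calling fetch_status("https://example.com/members", exported) with your own private URL should return status 200 and the same URL if the cookie is accepted. If the final URL is your login page instead, the site most likely did not accept the session cookie, which is useful to mention when you reach out.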
To add more private pages to Wonderchat, click on the "add pages" section and enter the link of the page you want to crawl.
Always remember to paste your session cookie into the field under the advanced settings button so that the chatbot can crawl the private pages.
Once the session cookie is added, click on 'confirm' to add the new page into your chatbot.
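If you add pages some time after the first crawl, the session cookie you saved earlier may have expired. The sketch below lists the cookies in a JSON export whose expiry time has passed. It assumes each cookie object may carry an "expirationDate" field in Unix seconds, as Chrome's cookie objects do, and that cookies without that field last for the browser session only. The function name expired_cookies is hypothetical.

```python
import json
import time
from typing import List, Optional


def expired_cookies(exported_json: str, now: Optional[float] = None) -> List[str]:
    """Return the names of cookies whose expiry time is in the past."""
    current = time.time() if now is None else now
    expired = []
    for cookie in json.loads(exported_json):
        # Assumption: "expirationDate" is Unix time in seconds, when present.
        expires = cookie.get("expirationDate")
        if expires is not None and float(expires) <= current:
            expired.append(cookie["name"])
    return expired
```

If the list is not empty, log into your site again and export a fresh cookie before adding the new pages.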
If you have any more questions, feel free to reach out to us at support@wonderchat.io