
WinHTTrack 3.48 & My Blog settings & wget64


Using the Wget Command to Download a Single File

One of the most basic uses of wget is downloading a single file and saving it in your current working directory. For example, you can grab the youtube-dl executable with the following command:

wget64 https://yt-dl.org/downloads/2021.06.06/youtube-dl.exe --no-check-certificate

Using the Wget Command to Download Numbered Files

If you have files or images that are numbered sequentially, you can easily download all of them with the following syntax:

wget http://example.com/images/{1..50}.jpg

Scan Rules (WinHTTrack):
+*.css +*.js -ad.doubleclick.net/* -mime:application/foobar
+*.gif +*.jpg +*.jpeg +*.png +*.tif +*.bmp +*.svg +*.tiff +*.webp +*.psd +*.pdf +*.ico +*.pcx +*.tga +*.pxm +*.pcl +*.pns +*.pdd +*.psb +*.rle +*.dib +*.eps +*.iff +*.tdi +*.jpe +*.jpf +*.jpx +*.jp2 +*.j2c +*.j2k +*.jpc +*.jps +*.mpo +*.raw
+*.zip +*.tar +*.tgz +*.gz +*.rar +*.z +*.exe
+*.mov +*.mpg +*.mpeg +*.avi +*.asf +*.mp3 +*.mp2 +*.rm +*.wav +*.vob +*.qt +*.vid +*.ac3 +*.wma +*.wmv
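
For reference, the same scan rules can also be passed to the httrack command-line tool that ships with WinHTTrack. This is only a minimal sketch; the URL and the ./mirror output folder are placeholders, and only the first few filters are shown:

# mirror a site into ./mirror, applying +/- scan-rule filters like the ones above
httrack "https://example.com/" -O "./mirror" "+*.css" "+*.js" "-ad.doubleclick.net/*" "+*.jpg" "+*.png" -v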



Blog original setting

Blogger Template Designer
1. Templates = no change
2. Background
3. Adjust widths
4. Layout
5. Advanced

Download Full Website

wget --limit-rate=200k --no-clobber --convert-links --random-wait -r -p -E -e robots=off -U mozilla https://facebook.com

WHAT DO ALL THE SWITCHES MEAN:
--limit-rate=200k: limit the download to 200 KB/sec
--no-clobber: don't overwrite any existing files (used in case the download is interrupted and resumed)
--convert-links: convert links so that they work locally, offline, instead of pointing to the online website
--random-wait: random waits between downloads - websites don't like having their content downloaded in bulk
-r: recursive - downloads the full website
-p: downloads everything needed to display the page, even pictures (same as --page-requisites: images, CSS and so on)
-E: gives each file the right extension - without it, most HTML and other files would have no extension
-e robots=off: act like we are not a robot/crawler - websites don't like robots/crawlers unless they are Google or another famous search engine
-U mozilla: pretend to be a Mozilla browser looking at the page instead of a crawler like wget

(DIDN'T INCLUDE THE FOLLOWING, AND WHY)
-o /websitedl/wget1.txt: log everything to a file - didn't do this because it gives no output on the screen and I don't like that; I'd rather use nohup with & and tail -f the output from nohup.out (see the sketch after this list)
-b: runs wget in the background so you can't see progress - I like "nohup <command> &" better
--domains=kossboss.com: didn't include because this site is hosted by Google, so wget might need to step into Google's domains
--restrict-file-names=windows: modify filenames so that they also work on Windows - seems to work fine without it
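
As mentioned above, this is the rough pattern I use instead of -o or -b, so the run survives a logout and I can still watch progress (the URL is a placeholder):

# run the full-website download in the background; output is appended to nohup.out
nohup wget --limit-rate=200k --no-clobber --convert-links --random-wait -r -p -E -e robots=off -U mozilla https://example.com &
# follow the progress live
tail -f nohup.out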

=======================================================

Spider Websites with Wget - 20 Practical Examples

Wget is extremely powerful, but like with most other command-line programs, the plethora of options it supports can be intimidating to new users. So what we have here is a collection of wget commands that you can use to accomplish common tasks, from downloading single files to mirroring entire websites. It will help if you can read through the wget manual, but for the busy souls these commands are ready to execute.

1. Download a single file from the Internet

wget http://example.com/file.iso

2. Download a file but save it locally under a different name

wget --output-document=filename.html example.com

3. Download a file and save it in a specific folder

wget --directory-prefix=folder/subfolder example.com

4. Resume an interrupted download previously started by wget itself

wget --continue example.com/big.file.iso

5. Download a file but only if the version on the server is newer than your local copy

wget --continue --timestamping wordpress.org/latest.zip

6. Download multiple URLs with wget. Put the list of URLs in another text file on separate lines and pass it to wget.

wget --input-file=list-of-file-urls.txt
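
For example, list-of-file-urls.txt is just a plain text file with one URL per line (these URLs are placeholders):

http://example.com/file1.iso
http://example.com/file2.iso
http://example.com/archive/file3.zip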

7. Download a list of sequentially numbered files from a server

wget http://example.com/images/{1..20}.jpg
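
Note that the {1..20} range is expanded by the shell (bash brace expansion), not by wget itself. If your shell lacks brace expansion, a rough equivalent is to generate the URLs and feed them to wget over stdin (example.com is a placeholder):

seq 1 20 | sed 's|^|http://example.com/images/|; s|$|.jpg|' | wget -i -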

8. Download a web page with all assets - like stylesheets and inline images - that are required to properly display the web page offline.

wget --page-requisites --span-hosts --convert-links --adjust-extension http://example.com/dir/file

Mirror websites with Wget

9. Download an entire website including all the linked pages and files

wget --execute robots=off --recursive --no-parent --continue --no-clobber http://example.com/

10. Download all the MP3 files from a sub-directory

wget --level=1 --recursive --no-parent --accept mp3,MP3 http://example.com/mp3/

11. Download all images from a website in a common folder

wget --directory-prefix=files/pictures --no-directories --recursive --no-clobber --accept jpg,gif,png,jpeg http://example.com/images/

12. Download the PDF documents from a website through recursion but stay within specific domains.

wget --mirror --domains=abc.com,files.abc.com,docs.abc.com --accept=pdf http://abc.com/

13. Download all files from a website but exclude a few directories.

wget --recursive --no-clobber --no-parent --exclude-directories /forums,/support http://example.com

Wget for Downloading Restricted Content

Wget can be used for downloading content from sites that are behind a login screen, or from sites that check the HTTP Referer and User-Agent strings to prevent screen scraping.

14. Download files from websites that check the User-Agent and the HTTP Referer

wget --referer=http://google.com --user-agent="Mozilla/5.0 Firefox/4.0.1" http://nytimes.com

15. Download files from a password-protected site

wget --http-user=labnol --http-password=hello123 http://example.com/secret/file.zip
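
To keep the password out of your shell history, wget can also prompt for it interactively; a small variation on the command above, with the same example credentials and URL:

wget --http-user=labnol --ask-password http://example.com/secret/file.zip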

16. Fetch pages that are behind a login page. You need to replace user and password with the actual form field names, while the URL should point to the form's submit (action) page.

wget --cookies=on --save-cookies cookies.txt --keep-session-cookies --post-data 'user=labnol&password=123' http://example.com/login.php
wget --cookies=on --load-cookies cookies.txt --keep-session-cookies http://example.com/paywall

Retrieve File Details with wget

17. Find the size of a file without downloading it (look for Content-Length in the response; the size is in bytes)

wget --spider --server-response http://example.com/file.iso
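
The headers are printed to stderr, so one way to pull out just the size line is to redirect and grep for it (a sketch using the same placeholder URL):

wget --spider --server-response http://example.com/file.iso 2>&1 | grep -i 'Content-Length'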

18. Download a file and display the content on the screen without saving it locally.

wget --output-document - --quiet google.com/humans.txt


19. Know the last modified date of a web page (check the Last-Modified header in the HTTP response).

wget --server-response --spider http://www.labnol.org/

20. Check the links on your website to ensure that they are working. The spider option will not save the pages locally.

wget --output-file=logfile.txt --recursive --spider http://example.com
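
When the crawl finishes, the log can be searched for broken links, for instance (a sketch; the exact wording in the log can vary between wget versions):

grep -B 2 '404 Not Found' logfile.txt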

Also see: Essential Linux Commands

Wget - How to be nice to the server?

The wget tool is essentially a spider that scrapes / leeches web pages, but some web hosts may block these spiders via their robots.txt file. Also, wget will not follow links on web pages that use the rel=nofollow attribute.

You can however force wget to ignore the robots.txt and the nofollow directives by adding the switch --execute robots=off to all your wget commands. If a web host is blocking wget requests by looking at the User-Agent string, you can always fake that with the --user-agent=Mozilla switch.

The wget command will put additional strain on the site’s server because it will continuously traverse the links and download files. A good scraper would therefore limit the retrieval rate and also include a wait period between consecutive fetch requests to reduce the server load.

wget --limit-rate=20k --wait=60 --random-wait --mirror example.com

In the above example, we have limited the download bandwidth to 20 KB/s, and the wget utility will wait anywhere between 30 and 90 seconds before retrieving the next resource.

Finally, a little quiz. What do you think this wget command will do?

wget --span-hosts --level=inf --recursive dmoz.org