
Downloading the JavaScript

There's one more step before we can point this at a site: we need to download the actual JavaScript! Before analyzing the source code using our scanjs wrapper, we need to pull it from the target page. Pulling the code in a single, discrete process (and from a single URL) means that, even as we develop more tooling around attack-surface reconnaissance, we can hook this script up to other services: it could pull the JavaScript from a URL supplied by a crawler, feed JavaScript or other assets into other analysis tools, or analyze other page metrics.

So the simplest version of this script should be: the script takes a URL, looks at the source code for that page to find all JavaScript libraries, and then downloads those files to the specified location.

The first thing we need to do is grab the HTML from the URL of the page we're inspecting. Let's add some code that accepts the url and directory CLI arguments, and defines our target and where to store the downloaded JavaScript. Then, let's use the requests library to pull the data and Beautiful Soup to make the HTML string a searchable object:

#!/usr/bin/env python2.7

import os, sys
import requests
from bs4 import BeautifulSoup

# Take the target URL and the output directory from the command line
url = sys.argv[1]
directory = sys.argv[2]

# Fetch the page and parse the HTML into a searchable object
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

Then we need to iterate over each script tag and use the src attribute data to download the file to a directory within our current root:

for script in soup.find_all('script'):
    # Only external scripts carry a src attribute; inline scripts are skipped
    if script.get('src'): download_script(script.get('src'))

That download_script() function might not ring a bell because we haven't written it yet. But that's what we want: a function that takes the src attribute path, builds the link to the resource, and downloads it into the directory we've specified (note that in the finished script, this function definition needs to sit above the loop that calls it):

def download_script(uri):
    # Build the full address: a leading / means a relative path,
    # so prepend the target URL; otherwise treat it as an absolute link
    address = url + uri if uri[0] == '/' else uri
    # Slice out the filename: everything from just after the last
    # slash through the end of the "js" extension
    filename = address[address.rfind("/")+1:address.rfind("js")+2]
    # Fetch the assembled address (not the original page URL)
    req = requests.get(address)
    with open(directory + '/' + filename, 'wb') as file:
        file.write(req.content)

Each line is pretty direct. After the function definition, the HTTP address of the script is created using a Python ternary. If the src attribute starts with /, it's a relative path and can just be appended onto the hostname; if it doesn't, it must be a full/absolute link. Ternaries can be funky but also powerfully expressive once you get the hang of them.
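
If the ternary is hard to parse at first, here's the same logic unrolled into a plain if/else, purely for comparison:

if uri[0] == '/':
    # relative path: prepend the target URL
    address = url + uri
else:
    # already a full link
    address = uri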

The second line of the function creates the filename of the JavaScript library link by finding the character index of the last forward slash (address.rfind("/")) and the index of the js file extension, plus 2 to avoid slicing off the js part (address.rfind("js")+2), and then uses the [begin:end] slice syntax to create a new string from just the specified indices.
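
To make the slicing concrete, here's a hypothetical address worked through by hand (the URL itself is just an example):

address = "https://example.com/static/app.js"
address.rfind("/")       # 26, the last slash before the filename
address.rfind("js") + 2  # 33, one past the end of "js"
address[27:33]           # "app.js"

A side effect worth noting: because the slice stops at the end of js, a simple trailing query string such as ?v=2 gets dropped from the filename.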

Then, in the final lines, the script pulls data from the assembled address using requests, creates a new file using a context manager, and writes the script's source code to directory/filename. Now you have a location, the path passed in as an argument, with all of the JavaScript from a particular page saved inside it.
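
As a quick sanity check, assuming you've saved this as download_js.py and created the output directory beforehand (both names here are arbitrary), a run might look like:

mkdir js_source
python download_js.py https://example.com js_source

Every external script referenced by the page should land as a .js file inside js_source.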
