9th Edition Changes & Summary
If you have the 8th edition of this book, you may want to know what has changed in this 9th edition. The previous edition of this book was originally written in late 2020. Soon after publication, I declared that I was taking a break from writing, which I did. In late 2021, I was asked to update this book, as it is required reading for numerous college courses, university degrees, and government training academies. I never want stale or inaccurate information to be presented within training programs, so I created a special hardcover revision for these audiences. I then replicated the content within a more affordable paperback version. In the previous editions, I only published a new version once I had at least 30% new material and 30% updated content. The recycled material was kept to a maximum of 40%. With this edition, I have deviated from that rule. I estimate that only 20% of the content here is changed, with the remaining 80% recycled from the previous edition. Much of the eighth edition content was still applicable and only needed minor updates to reflect changes since 2020. If you have read the previous edition, you will find most of those overall strategies within this book. However, I have added many new OSINT methods which complement the original text in order to cater to those who always need accurate information. I also removed a lot of outdated content which was no longer applicable. I believe there is much new value within this updated text. The majority of the updates are available in chapters 3, 4, 5, 6, 27, and 28, along with the digital files which accompany them. The other chapters all have minor updates.
All purchases include free download of updated custom search tools; updated Linux, Mac, and Windows OSINT scripts to build your own virtual machines; detailed cheat-sheets to simplify each process; and a single Linux command to build a complete 2022 OSINT VM with every tool in the entire book. You can find your custom login link and credentials within chapters 3, 4, 5, and 6 of each book; these provide permanent online access to all files. The outline is below.
SECTION I: OSINT Preparation
CHAPTER 01: Computer Optimization
CHAPTER 02: Linux Virtual Machine
CHAPTER 03: Web Browsers
CHAPTER 04: Linux Applications
CHAPTER 05: VM Maintenance & Preservation
CHAPTER 06: Mac & Windows Hosts
CHAPTER 07: Android Emulation
CHAPTER 08: Custom Search Tools
SECTION II: OSINT Resources and Techniques
CHAPTER 09: Search Engines
CHAPTER 10: Social Networks: Facebook
CHAPTER 11: Social Networks: Twitter
CHAPTER 12: Social Networks: Instagram
CHAPTER 13: Social Networks: General
CHAPTER 14: Online Communities
CHAPTER 15: Email Addresses
CHAPTER 16: Usernames
CHAPTER 17: People Search Engines
CHAPTER 18: Telephone Numbers
CHAPTER 19: Online Maps
CHAPTER 20: Documents
CHAPTER 21: Images
CHAPTER 22: Videos
CHAPTER 23: Domain Names
CHAPTER 24: IP Addresses
CHAPTER 25: Government & Business Records
CHAPTER 26: Virtual Currencies
CHAPTER 27: Advanced Linux Tools
CHAPTER 28: Data Breaches & Leaks
SECTION III: OSINT Methodology
CHAPTER 29: Methodology & Workflow
CHAPTER 30: Documentation
CHAPTER 31: Policy & Ethics
“Author Michael Bazzell has been well known in government circles for his ability to locate personal information about any target through Open Source Intelligence (OSINT). In this book, he shares his methods in great detail. Each step of his process is explained throughout twenty-five chapters of specialized websites, software solutions, and creative search techniques. Over 250 resources are identified with narrative tutorials and screen captures. This book will serve as a reference guide for anyone that is responsible for the collection of online content. It is written in a hands-on style that encourages the reader to execute the tutorials as they go. The search techniques offered will inspire analysts to “think outside the box” when scouring the internet for personal information. Much of the content of this book has never been discussed in any publication. Always thinking like a hacker, the author has identified new ways to use various technologies for an unintended purpose” — Amazon.com
This chapter presents the theoretical framework of an Open Source Intelligence operation. The process of undertaking an osint investigation is outlined, the terms data, information, and intelligence are clarified, and selected tools and techniques are presented.
Modelling the Process of an osint Investigation
Different models to formalize the process of an osint investigation exist. In order to transform raw data into actionable intelligence, the intelligence community derived a model called the intelligence cycle (Office of the Director of National Intelligence 2011). It is applied to all sources of intelligence and, in particular, to osint. This model has been adopted by Gibson (Gibson 2016) and, with some adjustments, also by Hassan (Hassan and Hijazi 2018). Bazzell presents a practical interpretation which is used as a mandatory training manual by U.S. government agencies (Bazzell 2021). Other works emphasize information gathering and analysis and, therefore, introduce models focusing on these tasks. This applies to the comprehensive three-step model derived by Pastor et al. (Pastor-Galindo et al. 2016), as well as to the model of Tabatabaei and Wells (Tabatabaei and Wells 2016).
Visualization of the intelligence cycle as described in Office of the Director of National Intelligence (2011); Gibson (2016)
The intelligence cycle is visualized in Fig. 1. It contains six steps labeled Direction, Collection, Processing, Analysis, Dissemination, and Feedback which are described in detail in the rest of the section.
Direction
This phase is dedicated to planning and preparation before the actual investigation begins. Gibson summarizes this step as the “identification of intelligence required” (Gibson 2016). More details are offered by Bazzell in Bazzell (2021), who describes a more practical approach which is detailed in the following. As Bazzell addresses osint operations related to human activities, his explanations are enhanced here for cases in which IT-systems are targeted. An osint investigation is most likely to begin with a specific request or a mission assignment by a client. Bazzell lists the following examples as typical tasks: threat assessment of individuals or events, target profiles for individuals or organizations, account attribution, or subscriber identification. IT-security related tasks might focus on the threat assessment of a system or the digital footprint of an organization. Bazzell recommends clarifying the given task and verifying provided identifiers like real names, user names, or email addresses. If the target is an IT-system, a common identifier is the domain name, but other identifiers like an organization or company name as well as email addresses are possible. Further, he suggests applying a technique called triage: an assessment to derive a plan that is likely to provide the best possible outcome for the investigation. Moreover, he advises preparing the technical environment with an instance of a virtual machine dedicated to this specific investigation, equipped with applications and tools relevant for osint. He also offers instructions on how to set up such a virtual machine addressing the requirements of a search for human activities.
Collection
This phase focuses on the collection of data and is described as gathering data. The idea is to systematically search public data using the known identifiers and link the findings to produce results. Bazzell gives the most detailed account of this phase by structuring it in three steps (Bazzell 2021). First, he suggests invoking specialized search engines, websites, and services which might require a fee. If the responsible investigator is associated with a law enforcement agency, this includes their respective closed-source information systems. Second, the investigation continues with an initial web search of the identifiers, followed by the utilization of selected osint applications, tools, and techniques depending on the target of the investigation. In order to structure this utilization, he derives a workflow for each of the identifiers email address, user name, real name, telephone number, domain name, and location. Each workflow starts with a given identifier and proposes different paths including specific tools. This results in new pieces of information which can be further exploited to gain more insights. For example, a given user name is potentially helpful to identify a real name, email address, or a social network profile. The workflow includes several approaches for how to proceed. One approach is a manual check of all social networks for the given user name, thereby potentially identifying the real name. Another path described in the workflow is guessing the email address based on the provided information. A third path takes the user name as input to a set of tools provided by Bazzell. In addition, the input is processed by standard and specialized web search engines and further enriched with information from compromised databases. All these workflows, however, are in most cases tailored to a search located in the USA. In the third and final step of the collection phase, all findings are captured.
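The email-guessing path of the user-name workflow can be sketched in a few lines of Python. The naming patterns and mail providers below are illustrative assumptions, not part of Bazzell's actual tooling:

```python
def guess_emails(first, last, providers=("gmail.com", "outlook.com")):
    """Generate candidate email addresses from a real name.

    The naming patterns and provider list are illustrative
    assumptions; a real investigation would use a far larger,
    locale-specific set."""
    first, last = first.lower(), last.lower()
    patterns = (
        f"{first}.{last}",
        f"{first}{last}",
        f"{first[0]}{last}",
        f"{last}.{first}",
    )
    return [f"{p}@{d}" for p in patterns for d in providers]

# Candidates can then be fed into verification services or
# compromised-database lookups, as the workflow describes.
candidates = guess_emails("Jane", "Doe")
```

Each candidate address becomes a new identifier that can re-enter the workflow for further enrichment.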
Processing
This phase serves different purposes depending on the underlying model. The intelligence community and Gibson describe this step as the transformation of the collected data into information (Office of the Director of National Intelligence 2011; Gibson 2016). This implies translating, decrypting, or converting the data into a useful and understandable format. The models utilized by Bazzell and Pastor et al. do not specifically mention this step (Bazzell 2021; Pastor-Galindo et al. 2016), while others repurpose it for data enrichment (Tabatabaei and Wells 2016) or data verification (Hassan and Hijazi 2018). Although not explicitly indicated as a dedicated phase in the models of Bazzell, Hassan, Pastor et al., or Tabatabaei et al., the task of transforming data into information is not neglected: they include the required effort in the analysis phase.
Analysis
This phase converts information into intelligence as described by Gibson (Gibson 2016). This includes the integration, evaluation, and analysis of the gained information to produce a result meeting the requirements (Office of the Director of National Intelligence 2011). Bazzell points out that this step aims to understand how information is connected and how to represent these connections. Therefore, he advises using a link analysis tool in order to visualize the results of the investigation (Bazzell 2021). This phase is split up by Pastor-Galindo et al. to emphasize that the analyzed information can be subjected to additional data mining or artificial intelligence techniques in a dedicated knowledge extraction phase (Pastor-Galindo et al. 2016).
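At its core, a link analysis is a graph whose nodes are identifiers and whose edges are discovered relations. The following minimal sketch, with invented sample identifiers, illustrates the idea behind such tools:

```python
from collections import defaultdict

# Minimal link-analysis sketch: identifiers become nodes, discovered
# relations become labeled edges. All sample identifiers are invented.
graph = defaultdict(list)

def link(a, relation, b):
    """Record a bidirectional, labeled connection between two findings."""
    graph[a].append((relation, b))
    graph[b].append((relation, a))

def neighbours(node):
    """All identifiers directly connected to a node."""
    return {b for _, b in graph[node]}

link("jdoe42", "registered with", "jane.doe@example.org")
link("jane.doe@example.org", "appears in", "example-breach-2019")
link("jdoe42", "profile on", "social.example/jdoe42")
```

Dedicated link analysis tools add visualization and richer queries on top of exactly this kind of structure.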
Dissemination
This phase distributes the results of the investigation to the client (Gibson 2016). These might be provided in the form of a written report (Bazzell 2021).
Feedback
This phase concludes the investigation. While the U.S. Office of the Director of National Intelligence and Gibson include the evaluation of feedback to improve their processes (Office of the Director of National Intelligence 2011; Gibson 2016), Bazzell finishes an investigation with the archiving of the results and a cleanup process (Bazzell 2021). The remaining models omit this phase (Pastor-Galindo et al. 2016; Tabatabaei and Wells 2016; Hassan and Hijazi 2018).
Data, Information, and Intelligence
The previous section presented the process of an osint investigation not only as the acquisition of data but as the transformation of collected data into information and, finally, into intelligence. This section clarifies the terms data, information, and intelligence. Next, a proposed classification of the information collected and compiled during an investigation is presented.
The differentiation between data, information, and intelligence derives from the NATO Open Source Intelligence Handbook and is frequently included in the discussion about the methodology of osint investigations, for example by Gibson or Hassan (Gibson 2016; Hassan and Hijazi 2018). This transformation process from data to intelligence is visualized in Fig. 2.
Visualization of the data processing during an osint investigation as described in Gibson (2016)
Contingent on the underlying model, information is either the output of the collection or the processing phase in Fig. 1. It is produced by processing the collected data. Depending on the nature of the data, processing includes translation, decryption, or format conversion, as well as filtering, correlating, classifying, clustering, and interpreting the given data. Fig. 2 refers to this output as open source information.
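These processing operations can be illustrated with a minimal sketch. The record format and sample data are invented for illustration:

```python
def process(records):
    """Turn raw collected data into information: normalize the
    format, filter out empty entries, and deduplicate.

    The (source, identifier) record format is an invented example."""
    seen, information = set(), []
    for source, identifier in records:
        identifier = identifier.strip().lower()   # format conversion
        if not identifier:                        # filtering
            continue
        if identifier in seen:                    # deduplication (correlating)
            continue
        seen.add(identifier)
        information.append({"source": source, "identifier": identifier})
    return information

raw = [("forum", " JDoe42 "), ("paste", "jdoe42"), ("blog", "")]
info = process(raw)  # a single normalized entry for "jdoe42"
```

Real processing pipelines additionally translate, decrypt, classify, and cluster, but the principle of transforming raw records into a consistent, interpretable form is the same.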
Compiling information to address a specific query results in intelligence (Gibson 2016). It is the result of the integration, evaluation, and analysis of information during the analysis phase in Fig. 1. Fig. 2 denotes it as open source intelligence.
Until this point, the phases of the intelligence cycle in Fig. 1 map to the different outputs in Fig. 2. However, Fig. 2 depicts a fourth and additional type of output which is not covered within the intelligence cycle. According to NATO, it is called validated open source intelligence (Gibson 2016). It is described as open source intelligence to which “a high degree of certainty can be attributed” (Gibson 2016). This demands verification and validation of the derived open source intelligence, which can potentially be done using other intelligence sources.
The transformation from data to information can result in an abundance of different kinds of information. Therefore, this work proposes a classification to structure which kinds of information can be expected from an osint investigation. Although osint is a very active field and the classification might therefore need constant adaptation, this paper presents a starting point for further discussion.
Information is classified with respect to the intended target of an osint investigation. Seven types of addressed entities were identified during this research. Acquired information can relate to:
Person
Findings range from information about real life, like full name, address, employment, or financial information, to the online persona with user name, email address, or social media presence.
Group of People
In particular, criminal investigations often do not only focus on a single person, but on a set of people to understand their personal relationships and interactions.
Organization
In this paper, an organization is understood as an entity with a defined and articulated purpose that distinguishes it from a person or group of people. Examples are companies, institutions, associations, or even countries. Interesting information includes details about business deals, strategic planning, customer relations, employees, and customers.
IT-System
This summarizes all information related to IT-systems. It contains information about domain names and existing subdomains, used software and their respective versions, as well as open ports.
Event
Observing interactions between people might lead to information about an event happening online or in real life, with details about date, location, and participants.
Location
This includes details about a physical address or a set of coordinates.
Other
This covers every target not addressed in one of the above classes. It includes images and videos as well as their content.
The classes of this classification are not mutually exclusive, meaning that a target might fit into more than one class. For example, the information “Woodstock” can be classified as group of people, event, or location. Its classification depends on the context of the initial query, which also influences how the information is further processed.
Selected Tools and Techniques
Even if the intelligence cycle is initiated with a precise query as input, the response to this query relies on collecting, processing, and analyzing massive amounts of public data. This requires at least the partial automation of certain tasks. In particular, the collection phase can be facilitated by the utilization of different tools and techniques.
A complete overview of tools is difficult to provide because many specialized tools are utilized. In addition, the landscape of external tools is extremely active and subject to change. One reason for change is the withdrawal of tools by their developers, as observed in June 2019 when Bazzell withdrew his set of popular interactive online tools (Bazzell 2021), or the disappearance of the meta-crawler website searx.me. Another aspect is related to the dynamic nature of social networks. For example, Facebook and Instagram are known to actively undermine the usage of osint-related tools and techniques. Therefore, they block respective web services, regularly change their source code, and restrict capabilities exploited by the osint community (Bazzell 2021). For example, Instagram includes special character encoding in the source code of its website to make it difficult to directly extract URLs.
Nevertheless, the osint community has shown some resilience in the face of these challenges and constantly adapts or renews its approaches. Given the aforementioned aspects, a selection of tools and techniques only represents a limited snapshot and might be obsolete soon.
Despite all these difficulties, different approaches to providing an overview of tools and techniques exist. The OSINT Framework is the most developed resource (Nordine 0000). Its goal is the easy identification of osint tools suitable for a search based on specific identifiers. It is organized as a tree structure with identifiers as the roots and candidate tools as the leaves. In addition to this excellent classification, Michael Bazzell’s selection of tools with respect to their usability should be highlighted. The tools are provided through the setup of a virtual machine according to the instructions found in Bazzell (2021). These tools are well documented, regularly updated, and maintained.
As the remainder of this section can only offer a limited overview, the focus is on the most promising starting points for an osint investigation of human activities as well as of computer systems. First, web search engines are discussed as a general basis, followed by social network searches addressing the search for human activities, and finally information gathering focusing on IT-systems.
Web Search Engines
Tarakeswar et al. describe four types of search engines (Tarakeswar and Kavitha 2011): crawler-based search engines, human-powered directories, hybrid search engines, and meta search engines. The most widely known search engines are general crawler-based search engines. They perform three operations: crawling, indexing, and searching. Crawling is the process of finding and reading the content of a website. At this point, a crawled website will not yet be listed in the search results; indexing is necessary first. During indexing, the information on the website is extracted and stored within the search engine’s index. When a search is performed, the index is queried and returns results pointing to the crawled websites.
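The three operations can be illustrated with a toy inverted index. Actual crawling over HTTP is omitted; a static dictionary stands in for the fetched pages:

```python
from collections import defaultdict

# Toy corpus standing in for crawled web pages (no real HTTP fetching).
pages = {
    "https://example.org/a": "open source intelligence techniques",
    "https://example.org/b": "penetration testing with open ports",
}

def build_index(pages):
    """Indexing: map each word to the set of URLs it occurs on."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Searching: return the URLs containing every query word."""
    sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()

index = build_index(pages)
```

Production search engines add ranking, stemming, and continuous re-crawling, but the crawl-index-search division of labor is the same.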
As described in Sect. 4.1, the collection phase invokes a web search based on the provided information, the so-called identifiers. A web search relies on a web search engine which crawls the web, indexes the findings systematically, and supports a search over the results. The classic textual search is well known: the input of one or more words initiates the search, and the output is presented in the form of links to websites containing the input. Besides Google (Google 0000), there are several competitors offering similar services like Bing (Bing 0000), Yandex (Yandex 0000), Baidu (Baidu 0000), DuckDuckGo (Duckduckgo 0000), or StartPage (Startpage 0000). These services are a good starting point for identifiers like people’s names, organizations, or public events.
Compared to the simple textual search, a careful formulation of the search query might improve the results significantly. To this end, the input to the well-known user interface is enriched with extended search parameters combined with Boolean expressions. Extended search parameters are special characters and commands which extend the capability of textual search significantly. Examples include the usage of quotation marks requesting an exact match, the term intext: for a search in the body of a document, and the term inurl: for a term in the URL. Their utilization produces refined results. Consider the following example: a search is made for PDF documents containing the term osint. The input OSINT, PDF produces almost 834,000 results of mixed file types, compared to the input OSINT filetype:PDF with only 42,000 PDF files as a result. This technique was first applied in the Google search engine and is therefore known as Google Hacking (Long 2005) or Dorking (Hassan and Hijazi 2018). However, it is also applicable to some extent to other search engines. Johnny Long has reported on this approach since the early 2000s, for example in Long (2005). This technique is useful for personal-related information as well as for system-related information.
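Such refined queries can also be composed programmatically. The helper below is hypothetical; only the operator syntax (quotation marks, intext:, inurl:, filetype:) comes from the description above:

```python
def dork(terms="", exact="", intext="", inurl="", filetype="", site=""):
    """Compose a query string from the extended search parameters
    described above. The helper itself is hypothetical; it only
    builds the string, and submitting it to a search engine is
    left to the investigator."""
    parts = []
    if terms:
        parts.append(terms)
    if exact:
        parts.append(f'"{exact}"')          # quotation marks: exact match
    if intext:
        parts.append(f"intext:{intext}")
    if inurl:
        parts.append(f"inurl:{inurl}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    if site:
        parts.append(f"site:{site}")
    return " ".join(parts)

# The refined query from the example in the text:
query = dork(terms="OSINT", filetype="PDF")
```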
Given identifiers like user names, phone numbers, or email addresses as inputs, the results of the described web search engines might be limited. This is also true for the analysis of other potentially relevant content such as images or locations. Specialized search engines are able to address such kinds of inputs. These include image search (Google Images 0000), as well as searches for user names (Knowem 0000), email addresses (Hunter.io 0000), locations (Google Earth 0000), or phone numbers (Das Telefonbuch 0000). Looking beyond the surface web to the dark web, different web search engines are available as well (Not Evil 0000).
Other search engines are tailored to output only specific results, e.g. real names (True People Search 0000), news (Google News 0000), scientific papers (dblp – Computer Science Bibliography 0000), patents (Europäisches Patentamt 0000), security vulnerabilities (Common Vulnerabilities and Exposures 0000), or Internet-connected assets (Shodan 0000). Bazzell suggests the utilization of the programmable search engine by Google which allows customized searching and filtering for osint investigations (Bazzell 2021).
Social Network Searches
Promising sources of information about an individual or a group of people are social networks like Facebook, Instagram, LinkedIn, Twitter, Pinterest, YouTube, or even PayPal (Lolagar 0000), where people share personal information and interact with families, friends, colleagues, or even strangers. This approach is of particular interest if the given identifier is a user name and might finally result in the identification of the real name. However, a real name as a given identifier can also be exploited, as it is customary to use real names in some social networks like LinkedIn and, to some extent, Facebook. Findings in one social network can be included in searches in other social networks, and merging the information leads to even more detailed results.
Before an osint investigation can collect any information, it is often necessary to register with the respective social network, with Twitter being a notable exception as tweets and timelines are accessible even without an account. For social networks requiring registration, this step might be sufficient to learn many (if not all) interesting facts about an account in case the target has a public profile. Detailed analysis is supported by automated download tools adapted to a respective social network, for example InstaLooter (Instalooter 0000) or Instaloader (InstaLoader 0000) focusing on Instagram, or TweetBeaver (Tweetbeaver 0000) or exportdata (ExportData 0000) focusing on Twitter.
Apart from a manual examination of a target’s profile, an important technique for the collection of information in social networks is the submission of search queries. The simplest approach utilizes the internal search functionality provided by the social network itself. However, search features vary depending on the respective social network. For example, Facebook’s internal search is simple compared to Twitter’s, which offers search operators comparable to the extended search parameters of web search engines. Facing such limitations, there are attempts to introduce search options beyond the designed scope. This relies on the knowledge and manipulation of certain URLs which are usually automatically created and deployed, producing results for search queries which cannot be submitted directly. This can be automated by a range of tools, for example Bazzell’s Facebook tool (Bazzell 2021).
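The URL-manipulation technique can be sketched generically: query parameters are filled into a search URL by hand instead of going through the search form. The endpoint and parameter names below are a simplified illustration, not the actual URL scheme of any social network:

```python
from urllib.parse import urlencode

def build_search_url(base, **params):
    """Assemble a search URL directly instead of submitting the
    network's own search form. The base URL and parameter names
    are simplified illustrations, not documented endpoints of any
    actual social network."""
    return f"{base}?{urlencode(params)}"

url = build_search_url(
    "https://social.example/search",
    q="jane doe",
    type="posts",
    since="2021-01-01",
)
```

Tools like Bazzell's automate exactly this kind of URL assembly for parameter combinations the site's own interface does not expose.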
In addition, there are a number of applications, browser extensions, and web services specialized in different social networks which offer the extraction of certain information, either as a preparation for URL manipulation (e.g., the extraction of a Facebook user ID (Facebook UserID LookUp 0000)) or as stand-alone information (e.g., displaying biography changes on Twitter (Twitter Biography Changes 0000)).
Challenges arise as some social networks allow users to apply strong privacy settings to their profiles, which prevent a detailed examination. Although this omits a substantial amount of information, it does not necessarily prevent information leakage. This is documented by a number of practitioners’ tutorials on deriving information about private profiles without compromising the target’s user account. For example, it is possible to identify public posts not directly shown on the private profile page (OSINT Curious 0000). Another option is the analysis of connected accounts which might be public and reveal information about the target.
Tools and techniques for searches on social networks are constantly evolving. As cited before, this is caused by frequent changes by the social networks themselves to prevent the exploitation of information in ways not originally intended. A notable example is the disappearance of Facebook’s graph search. Initially introduced for general usage, it offered a powerful approach to derive information. This changed in 2014, and graph search functionality was only available using URL modifications. Finally, all tools and techniques relying on graph search stopped working in mid-2019 (Bazzell 2021). The assumption that Facebook is actively preventing data collection in non-intended ways is further fuelled by the blocking of web services offering automated exploitation. Furthermore, Facebook user accounts which invoke osint-related tools and techniques are regularly blocked (Bazzell 2021).
Information Gathering for Penetration Testing
Though easily overlooked, osint is a key technique to support information gathering for penetration testing and similar IT-related tasks. As for all osint-related investigations, the utilization of web search engines and in particular Google Hacking is a valuable starting point (Long 2005). Similar to the tailored tools available for the search of human activities, a range of tools dedicated to investigations of IT-systems exists.
Depending on the given identifier, a multitude of tools allows further inspections. A company’s website can be evaluated with respect to applied technologies (BuiltWith 0000) as well as possible vulnerabilities (Common Vulnerabilities and Exposures 0000). Further information can be gained by identifying which websites share the same Google Analytics ID (Reverse analytics id 0000) or inspecting the website’s history using an Internet archive (The Wayback Machine 0000).
Given a domain name as identifier, tools provide information about the ownership (WhoIs Online Service 0000) as well as about possible subdomains (Aboul-Ela 0000). Open ports can be identified by specialized web search engines supporting a search over pre-scanned results (Censys 0000). In addition, there are also tools and services available that perform active port scans (Nmap 0000). This functionality is sometimes enriched with additional information about unpatched security vulnerabilities (Cyberscan 0000).
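A basic active port check of the kind such services perform can be sketched with a plain TCP connection attempt. Dedicated scanners like Nmap are far more sophisticated, and active probing should only be directed at systems one is authorized to test:

```python
import socket

def is_port_open(host, port, timeout=1.0):
    """Active check: attempt a TCP connection to host:port.
    Dedicated scanners such as Nmap do this far more thoroughly;
    only probe systems you are authorized to test."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

# Example: report the state of a few common service ports.
for port in (22, 80, 443):
    state = "open" if is_port_open("127.0.0.1", port) else "closed"
    print(f"127.0.0.1:{port} is {state}")
```

Pre-scanned search engines such as Censys offer the same information passively, without the investigator touching the target system.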
In order to collect more information about the system landscape of a company, another approach examines the target’s job advertisements and employee profiles on social networks, where declared skills allow conclusions about the technologies and systems in use.
The collected information can be analyzed by dedicated analysis tools like spiderSilk (Spidersilk 0000), or the versions of the utilized technologies can be matched against a database containing vulnerabilities for specific versions.