404 Not Found Knowledge Base
Last updated: 2020/06/28
Newly added in the past week:
- [Python project packaging and publishing](# tools)
Table of contents:
- computer basics
- Computer theory basics
- computer network
- operating system
- Data Structures and Algorithms
- database
- Basics of cryptography
- Computer technology basics
- language
- framework
- tool
- technology
- underlying research
- Security
- security technology
- vulnerabilities
- Web security
- Penetration testing
- Code audit
- Data security
- Cloud security
- security tools
- security research
- APT detection
- Malicious samples
- Red Team
- WAF
- Malicious URL detection
- Fight against machine traffic
- Anomaly detection
- Figures and Security
- AI and security
- Enterprise security construction
- Secure development
- Security testing
- security products
- Security operations
- Security management
- Security thinking
- security architecture
- Red and blue confrontation
- Intranet security
- Data security
- New technology and new security
- Overview
- cloud native
- trusted computing
- DevSecOps
- safe development
- personal development
- Industry development
- data
- Data system
- Data analysis and operations
- Security data analysis
- algorithm
- AI
- Algorithm system
- basic knowledge
- machine learning
- deep learning
- reinforcement learning
- Application areas
- Industry development
- Comprehensive quality
- Profession
- career planning
- thinking
- communicate
- manage
- think
- Things to note
- appendix
- Domestic outstanding technical personnel
- Excellent foreign technology sites
- abandoned
computer basics
Computer theory basics
operating system
- [Computer Postgraduate Entrance Examination 408 is the most comprehensive in the entire network!!!!!] Kingly Computer Operating System
- Interrupts and Exceptions
- How to understand paging and segmentation of memory management in the operating system in a simple way?
Compared along five dimensions: granularity; logical versus physical units of information; variable versus fixed length; two-dimensional versus one-dimensional addresses; and information integrity versus discrete memory allocation.
- Summary of kernel state and user state of operating system
- Compilation of common interview questions-operating system (a must for every developer)
computer network
- Compilation of common interview questions - computer network (a must for every developer)
The difference between TCP and UDP, the TCP three-way handshake and four-way teardown, what happens after a URL is entered in the browser, HTTP request types, the difference between GET and POST, and ARP (Address Resolution Protocol).
- A complete browser request process
A page request-to-response cycle (browser, HTTP) consists of: domain name resolution, the TCP three-way handshake, the browser sending the HTTP request, the server responding to it, the browser receiving and parsing the HTML, the browser requesting the resources referenced in the HTML, and finally the browser rendering the page for the user.
- What exactly does the reliability of tcp mean? - CYS's answer - Zhihu
TCP reliability means providing a reliable data transfer service at the transport layer on top of the unreliable IP layer: data is neither corrupted nor lost, and it is delivered in the order in which it was sent. TCP achieves this with the following mechanisms: checksums (to detect corrupted data), timers (to retransmit when a packet is lost), sequence numbers (to detect lost and duplicate packets), acknowledgments (the receiver tells the sender that a packet arrived correctly and which packet it expects next), negative acknowledgments (the receiver tells the sender that a packet did not arrive correctly), and windows with pipelining (to increase channel throughput).
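To make the checksum mechanism concrete, here is a minimal sketch of the 16-bit one's-complement checksum used by the TCP/IP family. It is an illustration only: a real TCP checksum also covers a pseudo-header and the TCP header, which are left out here, and the function names are hypothetical.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit one's-complement checksum of data."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF


def is_intact(data: bytes, checksum: int) -> bool:
    """Receiver-side check: recompute the checksum and compare."""
    return internet_checksum(data) == checksum
```

The sender would attach `internet_checksum(payload)` to the segment; the receiver calls `is_intact` and simply does not acknowledge a damaged segment, which leaves the retransmission timer to recover it.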
Data Structures and Algorithms
- Algorithm 3: The most commonly used sorting-quick sort
Quick sort: the idea is "dig a hole and fill it with a number" plus divide and conquer.
- A Tencent interview question: My cup is so awesome (learned)
Solution 1: bisection. Solution 2: searching the interval in segments. Solution 3: a method based on mathematical equations. Solution 4: dynamic programming (learned), described by the formula W(n, k) = 1 + min{max(W(n - 1, x - 1), W(n, k - x))}, x in {2, 3, ..., k}, where n is the number of cups and k is the number of floors.
- How to write algorithm questions effectively
LeetCode questions fall roughly into three types: data structures (linked lists, stacks, queues, hash tables, graphs, tries, binary trees, etc.); basic algorithms (depth-first search, breadth-first search, binary search, recursion, etc.); and basic algorithmic ideas (recursion, divide and conquer, backtracking, greedy, and dynamic programming).
- A brief discussion on what is the divide and conquer algorithm (learned)
The full permutation, merge sort, quick sort, and Tower of Hanoi problems under the divide and conquer idea.
- 2018.08 job interview: the kth largest number in an unsorted array and the median of an unsorted array, solved with the quick sort pointer (partition) in O(N).
- [Video explanation] LeetCode Problem No. 1: The sum of two numbers
- Strategies for grabbing red envelopes at annual meetings
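The dynamic programming formula from the cup-dropping question above can be sketched directly as a memoized recursion. This is a sketch under assumptions: the base cases (zero floors need zero trials, one cup needs a floor-by-floor scan) are not spelled out in the note, and x is allowed to start at 1 here, which does not change the minimum once those base cases are in place. The function name is hypothetical.

```python
from functools import lru_cache


def min_trials(cups: int, floors: int) -> int:
    """Worst-case number of drops needed with `cups` cups and `floors` floors."""

    @lru_cache(maxsize=None)
    def w(n: int, k: int) -> int:
        if k == 0:
            return 0  # nothing left to test
        if n == 1:
            return k  # one cup: must try floors one by one
        # Drop from floor x: the cup breaks (n - 1 cups, x - 1 floors below)
        # or survives (n cups, k - x floors above); plan for the worse case.
        return 1 + min(max(w(n - 1, x - 1), w(n, k - x)) for x in range(1, k + 1))

    return w(cups, floors)
```

For the classic setting of 2 cups and 100 floors, `min_trials(2, 100)` evaluates to 14.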
Basics of cryptography
- Detailed explanation of the advantages and disadvantages of symmetric encryption and asymmetric encryption
Symmetric encryption, also called single-key encryption, includes AES, RC4, and 3DES. It is fast, computationally cheap, and efficient, so it suits encrypting large amounts of data; however, if either party's secret key is leaked, the whole scheme becomes unsafe. Asymmetric encryption includes RSA and DSA/DSS; it is slow but highly secure. Hash algorithms include MD5, SHA1, and SHA256. These three types of algorithms are the basis of HTTPS communication.
database
- Tencent interview: What are the reasons why a SQL statement executes slowly?
Supplementary learning: database engines (InnoDB supports transactions and foreign keys but is slower; ISAM and MyISAM use little disk space and memory and insert data quickly), database encoding (character_set_client, character_set_connection, character_set_database, character_set_results, character_set_server, character_set_system), database indexes (primary key index, clustered index, and non-clustered index), and other basic knowledge points.
The reasons why a SQL statement executes slowly fall into two categories. 1) Normal most of the time but occasionally very slow: (1) the database is flushing dirty pages, for example because the redo log is full and must be synchronized to disk; (2) the statement hits a lock during execution, such as a table lock or row lock. 2) Always slow: (1) no index is used, either because the field has no index or because a calculation or function applied to the field prevents the index from being used; (2) the database picks the wrong index. The optimizer compares the number of rows it expects to scan when going through a non-clustered index and then the primary key index against a direct full table scan; because that estimate comes from sampling, it can misjudge and run a full table scan instead of using the index.
- This is probably the most comprehensive SQL optimization solution
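The "index not used because of a calculation on the field" case can be observed with a small experiment. The sketch below uses SQLite from the Python standard library as a stand-in for MySQL (an assumption made only so the example is self-contained; MySQL's EXPLAIN output looks different). The table, index, and helper names are hypothetical.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the human-readable plan lines SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description


def build_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER, note TEXT)")
    conn.execute("CREATE INDEX idx_orders_amount ON orders (amount)")
    conn.executemany(
        "INSERT INTO orders (amount, note) VALUES (?, ?)",
        [(i, "n") for i in range(1000)],
    )
    return conn


conn = build_demo_db()
# Plain comparison on the indexed column: the index can be used.
indexed_plan = query_plan(conn, "SELECT * FROM orders WHERE amount = 500")
# Arithmetic applied to the column: the same index can no longer be used.
wrapped_plan = query_plan(conn, "SELECT * FROM orders WHERE amount + 1 = 501")
```

Printing the two plans should show a search through idx_orders_amount for the first query and a scan of the whole table for the second; rewriting the predicate so the column stands alone (amount = 501 - 1) restores index use.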
Computer technology basics
language
- An in-depth analysis of Python decorators in a 10,000-word long article
- Python3 iterators and generators
Python: Iterators have two basic methods, iter() and next(). Iterable objects such as strings, tuples, and lists can be used to create iterators because these classes implement __iter__() internally. Calling iter() on a list, for example, returns a list_iterator object, which additionally has a __next__() method. Any object that implements both __iter__ and __next__ is an iterator. An iterator is a stateful object: it records the current position so that the next iteration returns the correct element. __iter__ returns the iterator itself, and __next__ returns the next value in the container. Generator: a function that uses yield is called a generator function. Calling it returns an iterator object, so a generator can be regarded as an iterator.
- python black technology iterator, generator, decorator
- How much do you know about Python’s advanced features? Let’s compare
Python: a lambda is an anonymous function for performing a simple expression or operation without fully defining a function; map is a built-in function that applies a function to the elements of various data structures; filter is a built-in function similar to map, but it returns only the elements for which the applied function returns True; the itertools module is a collection of tools for working with iterators, a data type that can be used in for loops; a generator function is an iterator-like function.
- Why use Go language? What are the advantages of Go language?
Go: the advantages and uses of Go. The main advantages include static typing, strong concurrency support, cross-platform support, direct compilation to machine code, and a rich standard library. The main uses include server programming, network programming, distributed systems, in-memory databases, and cloud platforms.
- Gin practice series - Golang introduction and environment installation
Go: installing the Go environment and the meaning of each folder after installation; the Go workspace and the meaning of each folder in it.
- ruby-on-rails - What is the difference between Ruby and JRuby
Ruby: Ruby is a programming language. The Ruby interpreter usually meant is CRuby, which runs in a native interpreter written in C. JRuby is a Ruby interpreter implemented in pure Java and runs on the Java virtual machine.
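The Python notes on iterators, generators, lambda, map, and filter earlier in this section can be tied together in one small sketch. The class and function names are hypothetical and exist only for illustration.

```python
class Countdown:
    """Iterator: implements both __iter__ and __next__ and keeps its own state."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self  # an iterator returns itself

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # signals that the iteration is finished
        value = self.current
        self.current -= 1
        return value


def countdown(start):
    """Generator function: the same behaviour written with yield."""
    while start > 0:
        yield start
        start -= 1


def even_squares(numbers):
    """filter keeps the even numbers, map squares them; both take a lambda."""
    return list(map(lambda n: n * n, filter(lambda n: n % 2 == 0, numbers)))
```

Both `list(Countdown(3))` and `list(countdown(3))` give `[3, 2, 1]`; the generator version needs no explicit state or StopIteration handling, which is the usual reason to prefer it.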
framework
- Gin - Introduction and use of high-performance Golang web framework
Gin: a web application framework written in Go.
- What is the difference between spring boot and spring mvc?
Spring -> Spring MVC -> Spring Boot.
tool
- Comparison between spark and storm
Big data technology tools - computing type: compared by real-time computing model, real-time latency, throughput, transaction mechanism, robustness/fault tolerance, and dynamic adjustment of parallelism. Spark Streaming is a quasi-real-time model: it collects the data that arrives within a time window and processes it as an RDD. Its latency is at the second level and its throughput is high; it supports a transaction mechanism, though not a complete one; its robustness is average; and it does not support dynamic adjustment of parallelism. Storm is a pure real-time model: it processes each record as it arrives. Its latency is at the millisecond level and its throughput is lower; it supports a complete transaction mechanism, is highly robust, and supports dynamic adjustment of parallelism.
Application scenarios: consider Storm when the scenario is pure real-time and cannot tolerate delays of more than 1 second; when the real-time computation needs a reliable transaction mechanism and reliability mechanism, that is, data processing must be completely accurate; when the parallelism of the real-time program must be adjusted dynamically between peak and off-peak periods to maximize resource utilization; or when the project is purely real-time computing with no need for intermediate operations such as interactive SQL queries. Conversely, if pure real-time, a reliable transaction mechanism, and dynamic parallelism adjustment are not required, consider Spark Streaming. The biggest advantage of Spark Streaming is that it sits in the Spark ecosystem.
From the macro perspective of a project, if it requires not only real-time computing but also offline batch processing and interactive queries, and the real-time computation itself involves high-latency batch processing or interactive queries, then Spark Core can be used for offline batch processing, Spark SQL for interactive queries, and Spark Streaming for real-time computing. The three integrate seamlessly and make the system highly extensible, which greatly strengthens the case for Spark Streaming. The two frameworks excel in different niches.
- Ziyu Big Data Spark Getting Started Tutorial (Python version) (more important)
- What are the differences and connections between the log collection systems flume and kafka? When are they used respectively, and when can they be combined?
Big data technology tools - middleware type: Kafka can be understood as middleware, a cache system, or a database; its main role is to keep the system stable. Flume can be understood as an active collector of log data. By comparison, it is hard to get online applications to change their interfaces so that they write data into Kafka directly.
- What are the advantages and disadvantages between logstash and flume, and what scenarios are they suitable for?
Big data technology tools - agent type: which to choose depends on the requirements; both Logstash and Flume run as agents. Logstash has more plug-ins and better companion products such as Elasticsearch, but it is written in Ruby and runs on JRuby, and data in transit may be lost. Flume has an internal mechanism that guarantees, to a degree, that data is transmitted without loss, and it is written in Java, which makes secondary development easy; its drawback is that the JVM uses a lot of memory.
- Mac shortcut key list
Mac: basic shortcut keys for screenshots, applications, text processing, Finder, and browsers; shortcut keys for startup and shutdown.
- Commonly used Git command sheets
Git: remote repository -> local repository -> staging area -> working directory; git add ., git commit -m message, git push.
- git-lfs
Git-lfs: a Git extension for uploading large files.
- tshark statistical analysis pcap package
- [Python project packaging and publishing](# tools)
Memo: 1. setup.py: long_description and long_description_content_type (note the rendering differences between md and rst). 2. MANIFEST.in vs. .gitignore. 3. README.rst vs. README.md. 4. .pypirc vs. .gitconfig. 5. python setup.py bdist_wheel upload.
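The first memo item, choosing long_description_content_type to match the README format, can be sketched as a small helper. The helper name and the way it is wired into setup() are assumptions for illustration; text/markdown and text/x-rst are the content types setuptools and PyPI recognize for md and rst.

```python
from pathlib import Path

# README suffix -> value for long_description_content_type
CONTENT_TYPES = {".md": "text/markdown", ".rst": "text/x-rst"}


def readme_metadata(readme_path):
    """Build the two long_description keyword arguments from a README file."""
    path = Path(readme_path)
    return {
        "long_description": path.read_text(encoding="utf-8"),
        "long_description_content_type": CONTENT_TYPES[path.suffix.lower()],
    }


# In a setup.py this would be used roughly as follows (hypothetical package name):
#   from setuptools import setup
#   setup(name="example-package", version="0.1.0", **readme_metadata("README.md"))
```

Getting the content type wrong is the usual cause of a README that renders as plain text on the package index, which is the rendering issue the memo warns about.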
technology
- Decoding and XSS (there is a \u72 in the original text; "after html entity encoding" should be -)
Browser technology - decoding order: browser decoding mainly involves two components, the rendering engine and the JS parser. Decoding is performed in each environment, and the encoding that corresponds to the outermost environment is decoded first. For example, in <a href=javascript:alert(1)>click</a>, alert(1) sits in the html -> url -> js environment.
1. The link encodes the e with a Unicode escape, which cannot be decoded in the HTML or URL environments. It can only be decoded into the character e in the JS environment, so no pop-up window occurs.
2. The link uses URL encoding. Before the JS is executed, the URL decoder decodes %65, so the JS engine sees the complete alert(1) when it starts.
3. The link uses HTML entity encoding, which is decoded first.
4. During URL decoding, javascript is not recognized as a pseudo-protocol, so an error occurs.
5. The HTML parser runs before the JavaScript parser, so the HTML-encoded characters are decoded first and the JavaScript event is executed afterwards.
Browser decoding order is the basis for XSS bypasses.
- The relationship between dockerfile and docker-compose
docker technology: the relationship between files and folders.
- What is the difference between dockerfile and docker-compose?
docker technology: docker-compose is for orchestrating containers.
- What is a bastion machine?
Bastion host technology: defines a single entrance for access to the cluster, which makes permission control and monitoring easier.
- From what aspects does the feasibility of a product need to be analyzed?
Feasibility analysis: product feasibility is divided into technical, economic, and social feasibility. I focus on technical feasibility, which is measured mainly by comparison with competitors' functions, technical risks and how to avoid them, ease of use and user threshold, and the product's environment dependencies.
- What roles do Nginx and Gunicorn play in the server?
Application server: Nginx deployment scenarios: load balancing (frameworks such as tornado only use a single core, so a multi-process deployment needs a reverse proxy for load balancing; gunicorn is itself multi-process and does not need one), static file serving, absorbing concurrency pressure, and additional access control.
- Wikipedia: Kerberos
Kerberos: Basic description, protocol content and specific process of Kerberos.
- What is microservices architecture?
- What is Service Mesh
Microservice architecture: Why use a service mesh? Under the traditional three-tier MVC web application architecture, communication between services is not complicated and can be managed inside the application. In today's large, complex websites, however, monolithic applications are decomposed into many microservices, and the dependencies and communication between services become complex. What is it? A service mesh is the infrastructure layer for communication between services. It can be compared to TCP/IP between applications or microservices: it is responsible for network calls, rate limiting, circuit breaking, and monitoring between services. Features of a service mesh: a middle layer for inter-application communication, a lightweight network proxy, application-agnostic, and decoupling of retries/timeouts, monitoring, tracing, and service discovery from the application. Popular open source implementations are Istio and Linkerd, both of which integrate with the cloud native Kubernetes environment.
- Updater fails if not run as admin, even on a user installation
LaTeX: MiKTeX (registry and administrator rights problems) + TeXnicCenter (if it cannot generate a PDF, set the Adobe executable path under Build to the genuine AcroRd32.exe) + Adobe Acrobat Reader DC, then use the cracked version of Adobe Acrobat DC to convert to other formats.
- HTTPS principle and interaction process
HTTPS: before transmitting data, the browser and the website perform a handshake, during which the key material both parties use to encrypt the transmitted data is agreed on. The flow is: obtain the public key -> the browser generates a random (symmetric) secret key -> encrypt the symmetric key with the public key -> send the encrypted symmetric key -> communicate in ciphertext encrypted with the symmetric key. The whole HTTPS process uses symmetric encryption, asymmetric encryption, and hash algorithms.
- Browser’s Same Origin Policy
Browser technology: the same-origin policy is the core and most basic security function of the browser. Two URLs have the same origin when their protocol, host, and port all match.
- Nine cross-domain implementation principles (full version)
Browser technology: cross-domain request solutions: JSONP (relies on the loophole that script tags are not subject to cross-domain restrictions), CORS (cross-origin resource sharing), postMessage, websocket, Node middleware proxy, nginx reverse proxy, window.name + iframe, location.hash + iframe, document.domain + iframe.
CORS supports all types of HTTP requests and is the fundamental solution for cross-domain HTTP requests. JSONP only supports GET requests; its advantages are that it works in old browsers and can request data from websites that do not support CORS. Both the Node middleware proxy and the nginx reverse proxy work because the same-origin policy places no restrictions on servers. In daily work, the most commonly used cross-domain solutions are CORS and the nginx reverse proxy.
- How to use Python virtual environment in Jupyter Notebook?
Anaconda: install the plug-in with conda install nb_conda.
- Since there are HTTP requests, why use RPC calls? - Brother Yi's answer
RPC: RESTful vs. RPC. RPC includes a reverse proxy, serialization and deserialization, communication (HTTP, TCP, UDP), and exception handling.
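The same-origin rule noted in the browser entries above (protocol, host, and port must all match) can be expressed as a short comparison. This is a sketch, not how any browser implements it; the default-port table and function names are assumptions for illustration.

```python
from urllib.parse import urlsplit

# Ports implied when a URL does not state one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that defines a URL's origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return scheme, host, port


def same_origin(url_a, url_b):
    """True when both URLs share protocol, host, and port."""
    return origin(url_a) == origin(url_b)
```

With this definition, https://example.com/a and https://example.com:443/b share an origin, while changing the scheme, the subdomain, or the port produces a cross-origin pair, which is exactly the situation CORS or a reverse proxy is meant to handle.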
underlying research
A brief analysis of the python requests library process
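Python requests library implementation: socket -> httplib -> urllib -> urllib3 -> requests. The internal call chain of requests.get: requests.get -> request() -> Session.request -> Session.send -> adapter.send -> HTTPConnectionPool (urllib3) -> HTTPConnection (httplib).
1. socket: the most direct implementation of TCP/IP; provides end-to-end network transport.
2. httplib: built on socket; the most basic, lowest-level HTTP library. It organizes data according to the HTTP protocol, creates a socket connection, and sends the packaged data to the server.
3. urllib: built on httplib; adds further handling of URL parsing and encoding.
4. urllib3: built on httplib; more advanced than urllib in that it uses PoolManager to provide socket connection reuse and thread safety, which improves efficiency.
5. requests: built on urllib3; more advanced than urllib3 in that it implements the Session object, which stores state and further improves efficiency.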
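The lower layers of that stack can be exercised with the standard library alone. The sketch below assumes Python 3, where httplib is named http.client and urllib's fetching code lives in urllib.request; it starts a throwaway local server (hypothetical handler and function names) and fetches the same page once per layer. urllib3 and requests are third-party and are not used here.

```python
import http.client
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Tiny local server so the example needs no external network."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_with_http_client(host, port):
    """httplib layer: we manage the connection and the request ourselves."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/")
        return conn.getresponse().read()
    finally:
        conn.close()


def fetch_with_urllib(host, port):
    """urllib layer: URL parsing and connection handling are done for us."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))  # ignore proxies
    with opener.open(f"http://{host}:{port}/", timeout=5) as resp:
        return resp.read()


def demo():
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        return fetch_with_http_client(host, port), fetch_with_urllib(host, port)
    finally:
        server.shutdown()
        server.server_close()
```

Both calls return the same body; the difference is how much of the protocol work each layer hides, which is the progression the numbered list describes.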
Analysis of XGBoost principles and underlying implementation (learned)
XGBoost: understand it from two angles: the score of a tree (objective function = loss function, expanded to second order, plus a regularization term) and the structure of a tree (split decisions, using pre-sorting).
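A minimal sketch of the pre-sorted split search may help connect the two angles. It assumes the standard second-order gain, 0.5 * [G_L^2/(H_L + lam) + G_R^2/(H_R + lam) - G^2/(H + lam)] - gamma, where G and H are sums of first and second derivatives of the loss, lam is the L2 regularization weight, and gamma is the per-leaf penalty. The function names are hypothetical, and this is an illustration rather than XGBoost's actual code.

```python
def leaf_score(grad_sum, hess_sum, lam):
    """Structure score of one leaf: G^2 / (H + lambda)."""
    return grad_sum * grad_sum / (hess_sum + lam)


def best_split(values, grads, hess, lam=1.0, gamma=0.0):
    """Scan one feature in sorted order and return (threshold, gain)."""
    order = sorted(range(len(values)), key=values.__getitem__)  # pre-sorting
    total_g, total_h = sum(grads), sum(hess)
    left_g = left_h = 0.0
    best_gain, best_threshold = 0.0, None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        left_g += grads[i]
        left_h += hess[i]
        if values[i] == values[nxt]:
            continue  # cannot split between equal feature values
        gain = 0.5 * (
            leaf_score(left_g, left_h, lam)
            + leaf_score(total_g - left_g, total_h - left_h, lam)
            - leaf_score(total_g, total_h, lam)
        ) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = (values[i] + values[nxt]) / 2
    return best_threshold, best_gain
```

A split is only worth making when its gain is positive, so gamma acts as a built-in pruning threshold.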
In-depth understanding of Lightgbm histogram optimization algorithm
LightGBM: compared with pre-sorting, LightGBM uses histograms to handle node splitting and find the optimal split point. Algorithm idea: before training, convert feature values into bin values, that is, define a piecewise function over each feature's range that assigns every sample's value on that feature to a segment (bin), so that feature values are converted from continuous to discrete. Histograms can also be accelerated by subtraction. The complexity of computing a histogram depends on the number of bins.
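The binning and subtraction ideas can be sketched in a few lines. Equal-width bins are an assumption made for simplicity (LightGBM chooses its bin boundaries differently), only gradient sums are accumulated, and all names are hypothetical.

```python
def bin_index(value, low, high, n_bins):
    """Map a continuous feature value to a discrete bin number."""
    if high == low:
        return 0
    idx = int((value - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)  # the maximum value falls in the last bin


def build_histogram(values, grads, low, high, n_bins):
    """Accumulate the gradient sum of every sample into its bin."""
    hist = [0.0] * n_bins
    for value, grad in zip(values, grads):
        hist[bin_index(value, low, high, n_bins)] += grad
    return hist


def subtract(parent_hist, child_hist):
    """Histogram subtraction: the sibling's histogram in O(number of bins)."""
    return [p - c for p, c in zip(parent_hist, child_hist)]
```

After a node is split, only the histogram of one child has to be built from samples; the other follows from `subtract(parent, child)`, whose cost depends on the number of bins rather than the number of samples.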
Keras text preprocessing source code analysis
Keras - text preprocessing :
Keras sequence preprocessing source code analysis
Word2Vec
- Understanding the Skip-Gram model of Word2Vec
- Implementing Skip-Gram model based on TensorFlow - Tian Yusu's article
- Word2Vec Tutorial - The Skip-Gram Model
- Word2Vec Tutorial Part 2 - Negative Sampling
- Word2Vec word embedding tutorial in Python and TensorFlow
- word2vec_basic TensorFlow source code analysis
- A Word2Vec Keras tutorial
- keras_word2vec@adventures-in-ml-code
Security
security technology
vulnerabilities
- Compilation of WooYun vulnerability library payloads and Burp auxiliary plug-in
- boy-hack/wooyun-payload
- A researcher’s perspective on vulnerability research in the 2010s
Vulnerability research, status and trends over the past 10 years: 1. In the post-PC era, control flow integrity has become a new basic protection mechanism for system security. 2. Hardware security features and hardware security vulnerabilities have both brought surprises. 3. New wine in old bottles: the security design of mobile devices has overtaken on the bend. 4. The battle for network entry points is intensifying. Entry points include browsers, WiFi coprocessors, basebands, Bluetooth, routers, instant messaging devices, social software, email clients, and traditional PCs and servers. 5. Automated vulnerability discovery and exploitation still need improvement.
Web security
- An article to give you an in-depth understanding of vulnerabilities: XXE vulnerabilities
XXE vulnerability: the principle of XXE is the invocation of external entities. Exploitation uses general entities, parameter entities, external entities, and internal entities to read files, probe intranet hosts and ports, and achieve intranet RCE (under PHP this requires the expect extension).
- Mysql comma-free injection techniques
Injection attacks: SQL injection, XML injection (XML is a markup language that represents data structurally through tags), code injection (eval-style functions), CRLF injection (\r\n). MySQL injection: use comments or parentheses to bypass space filtering, or replace spaces with symbols such as %20 and %0a. In a union query, use join to bypass comma filtering: select id,ip from client_ip where 1>2 union select * from ( (select user())a JOIN (select version())b );
Use select case when (condition) then code1 else code2 end to bypass comma filtering: insert into client_ip (ip) values ('ip'+(select case when (substring((select user()) from 1 for 1)='e') then sleep(3) else 0 end));
- [CRLF Injection vulnerability utilization and example analysis](https://wooyun.js.org/drops/CRLF%20Injection%E6%BC%8F%E6%B4%9E%E7%9A%84%E5%88%A9%E7%94%A8%E4%B8%8E%E5%AE%9E%E4%BE%8B%E5%88%86%E6%9E%90.html)
CRLF is the abbreviation of "carriage return + line feed" (\r\n). The HTTP header and HTTP body are separated by two CRLFs. CRLF injection is also called HTTP Response Splitting, or HRS for short. X-XSS-Protection:0 turns off the browser's filtering protection against reflected XSS.
- SSRF vulnerability exploitation and getshell combat (selected)
- Summary of several methods to bypass filtering (IP restrictions) in SSRF vulnerabilities
SSRF: use a 302 redirect (xip.io, short URLs, a self-written service); DNS rebinding (bypasses IP restrictions); change the way the IP address is written; exploit URL parsing problems: http://[email protected]/; go through various non-HTTP protocols.
- Summary of SSRF bypass methods
SSRF: use @; use short URLs; use the special domain name xip.io; use DNS resolution (set an A record on the domain name); use hexadecimal conversion; use the period.
- ThinkPHP 5.0.0~5.0.23 RCE vulnerability analysis
- A brief analysis of character encoding and SQL injection in white box auditing (excellent, learned)
Injection attack based on character encoding: a GBK-encoded Chinese character occupies 2 bytes, and a UTF-8-encoded Chinese character occupies 3 bytes. Wide byte injection exploits a MySQL characteristic: when MySQL uses GBK encoding, it treats two bytes as one Chinese character (under GBK, the first byte's value must be greater than 128 to reach the Chinese character range). For comparison, in GB2312 the high byte ranges over 0xA1-0xF7 and the low byte over 0xA1-0xFE; \ is 0x5c, which is outside the low-byte range, so 0x5c is never part of a GB2312 character and will not be eaten. The idea extends to every multi-byte encoding: as long as the low-byte range contains 0x5c, wide byte injection is possible.
Defense plan one: mysql_set_charset + mysql_real_escape_string, which takes the connection's current character set into account.
Defense plan two: set character_set_client to binary: SET character_set_connection=gbk, character_set_results=gbk, character_set_client=binary. When MySQL receives data from the client, it assumes the data is encoded in character_set_client, converts it to the character_set_connection encoding, and then, on entering the specific table and field, converts it to the field's encoding. When the query results are produced, they are converted from the table and field encoding to the character_set_results encoding and returned to the client. Therefore, if character_set_client is set to binary, the wide byte or multi-byte problem does not arise: all data is transferred in binary form, which effectively prevents wide byte injection.
Problems can still occur if iconv is called after these defenses. When iconv converts UTF-8 to GBK, the payload is 錦': its UTF-8 encoding is 0xe98ca6 and its GBK encoding is 0xe55c, so the input finally becomes %e5%5c%5c%27. The two %5c form \\, which escapes the backslash itself and leaves the single quote unescaped. When iconv converts GBK to UTF-8, the payload is plain wide byte injection: a GBK Chinese character is 2 bytes and a UTF-8 Chinese character is 3 bytes, and when converting GBK to UTF-8, PHP converts two bytes at a time. Therefore, if the number of characters before \' is odd, the \ is swallowed and the ' escapes the restriction. Why can 錦' not be used in this direction? According to the UTF-8 encoding rules, \ (0x0000005c) never appears inside a multi-byte UTF-8 sequence, so an error is reported.
- Security issues caused by client sessions
- An insight into DAST, SAST, and IAST in one article - a brief discussion on the comparison of Web application security testing technologies (learned)
- Talk about SAST/IDAST/IAST
- Introduction to PHP connection methods and how to attack PHP-FPM
- A GET request to get the flag——XCTF 2018 Final PUBG (WEB 2) Writeup
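The byte-level effect behind the wide byte injection note earlier in this section can be reproduced with Python's codecs. This is only a sketch of the encoding behaviour: escape_quotes is a hypothetical stand-in for a byte-oriented escaping function such as addslashes, and no database is involved.

```python
def escape_quotes(raw: bytes) -> bytes:
    """Byte-level stand-in for addslashes: put a backslash before each quote."""
    return raw.replace(b"'", b"\\'")


# Case 1: attacker sends %df%27 and the server escapes it to %df%5c%27.
payload = b"\xdf'"
escaped = escape_quotes(payload)
# Read as GBK, 0xdf 0x5c is one two-byte character, so the backslash
# disappears into it and the quote is left unescaped.
as_gbk = escaped.decode("gbk")

# Case 2: the UTF-8 -> GBK conversion case from the note.
jin_utf8 = "錦".encode("utf-8")
jin_gbk = "錦".encode("gbk")
# The escaped input 錦\' converted to GBK: the character's own low byte is
# 0x5c, so the bytes read as a character's high byte, an escaped backslash,
# and a free quote.
converted = ("錦" + "\\'").encode("gbk")
```

In both cases the quote that the escaping was supposed to neutralize ends up unescaped once the bytes are interpreted as GBK, which is why the defenses above tie escaping to the connection's real character set.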
Penetration testing
- A set of practical penetration testing job interview questions
Code execution functions: eval, preg_replace+/e, assert, call_user_func, call_user_func_array, create_function; command execution functions: system, exec, shell_exec, passthru, pcntl_exec, popen, proc_open. Besides the onerror attribute of the img tag, is there another way to obtain the administrator path? Point src at a remote script file and read the Referer.
- A set of practical penetration testing job interview questions, do you know it?
- My interview experience, penetration testing
Code audit
- Java code audit - layer by layer advancement
Data security
- NO.27 Chat about data security
In the era of big data, data is the core asset of many companies. The traditional security boundary has blurred: we need to assume that our boundary has already been penetrated and still have defense in depth to protect information. Therefore, while strengthening traditional security methods, we should focus security directly on the data itself, which is what data security does. There is one premise: security still serves the business (in most enterprises, business > security), so security and usability must be weighed against each other. Measures commonly used by enterprises today include data classification, data life cycle management, data desensitization and encryption, and data leakage prevention.
- Internet enterprise data security system construction
Cloud security
- Cloud security, what exactly is it?
There are three major research directions in cloud security: cloud computing security, cloudification of security infrastructure, and cloud security services. Data security collaboration is also named among the future trends of cloud security, which shows that data is the focus of security in every scenario. Cloud security services can be compared to a chef cooking (from cdxy's slides): cloud computing is the energy, algorithms are the tools, data is the raw material, engineers are the chefs, and the dishes that can be made are the security services that can be provided.
- The future of cloud security (long in-depth article)
Writing ideas: cloud security market trends -> mainstream cloud security products (cloud platform security products and the third-party cloud security products CWPP, CSPM, and CASB) -> the combination of cloud security and SD-WAN -> cloud native (DevOps, continuous delivery, microservices, containers) security.
other
- Security information: enterprise labs, security communities, security teams, security tools, etc.
security tools
Vulnerability Scan
- Vulnerability scanning using xray proxy mode
security research
APT detection
- APT detection based on machine learning
APT detection model: this paper proposes an APT detection model that detects multiple stages of the APT life cycle, correlates the alert events from each stage, and uses machine learning to train the detection model. It is somewhat similar to my own idea. I had previously thought of using a graph model or a rule association algorithm to reconstruct the attack chain, but this article appears to feed the correlated event set into a prediction model as training input. The purpose is to describe the full set of security events in an APT scenario, reduce the false positive rate, improve accuracy, and avoid the missed detections and false positives caused by traditional single-stage APT detection. The article also has problems, such as the lack of APT data sources. The shortage of security data has always been a problem, and as a result the proposed model could not be demonstrated in a real environment.
Malicious samples
- Use machine learning to detect HTTP malicious external traffic (excellent)
Malicious HTTP outbound traffic detection, general idea:
1. Data collection: run malicious samples in a sandbox, collect the malicious traffic, manually separate malicious traffic from benign traffic, and then group the malicious traffic into families based on threat intelligence.
2. Data analysis (feature engineering): because malicious outbound traffic from the same family is similar, a clustering algorithm can group the traffic of one family into one cluster; extract what the cluster has in common to form a template, then use the template to detect unknown traffic.
3. Algorithm. Training stage: extract HTTP outbound traffic ---> extract request header fields ---> generalize ---> compute similarity (weight the request header fields individually, then compute similarity) ---> hierarchical clustering ---> generate malicious outbound traffic templates (the union of a field's values across the cluster becomes that field's value in the template). Detection stage: unknown HTTP outbound traffic ---> extract request header fields ---> generalize ---> match against the malicious templates ---> check whether the similarity exceeds the threshold.
- Construction of Cuckoo malware automated analysis platform
- Cuckoo malware analysis environment
- Playing with Cuckoo
Cuckoo sandbox: I ran into many pitfalls while building the Cuckoo malicious sample analysis environment. The ones I still remember: switching the pip source with -i https://pypi.tuna.tsinghua.edu.cn/simple; placing agent.py in the startup folder; and paying attention to the network relationship between Windows 10, Ubuntu 16, and Windows 7 (NAT and Host-Only modes). The physical host runs Windows 10 with VMware; VMware runs Ubuntu 16; Ubuntu 16 runs VirtualBox and the Cuckoo server; VirtualBox runs Windows 7 as the agent.
- Summary of malicious sample analysis resources
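The training and detection stages noted above for malicious HTTP outbound traffic can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the article's implementation: the field names, the field weights, the generalization rules, the token-overlap similarity, and the 0.7 threshold are all made up for the example, and a greedy single-pass grouping stands in for real hierarchical clustering.

```python
import re

# Hypothetical per-field weights; the notes only say that fields are weighted.
FIELD_WEIGHTS = {"host": 0.2, "path": 0.4, "user_agent": 0.4}


def tokens(value):
    """Generalize a header value and split it into a set of tokens."""
    value = re.sub(r"[0-9a-f]{8,}", "<HEX>", value.lower())  # long hex runs
    value = re.sub(r"\d+", "<NUM>", value)                    # digit runs
    return set(re.findall(r"<[A-Z]+>|[a-z]+", value))


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def request_similarity(req_a, req_b):
    """Weighted similarity between two requests, field by field."""
    return sum(
        weight * jaccard(tokens(req_a.get(field, "")), tokens(req_b.get(field, "")))
        for field, weight in FIELD_WEIGHTS.items()
    )


def build_templates(malicious_requests, threshold=0.7):
    """Group similar requests (greedy stand-in for hierarchical clustering),
    then use the per-field union of tokens in each group as its template."""
    clusters = []
    for request in malicious_requests:
        for cluster in clusters:
            if request_similarity(request, cluster[0]) >= threshold:
                cluster.append(request)
                break
        else:
            clusters.append([request])
    return [
        {field: set().union(*(tokens(r.get(field, "")) for r in cluster))
         for field in FIELD_WEIGHTS}
        for cluster in clusters
    ]


def template_score(request, template):
    """Weighted share of the request's tokens that the template covers."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        req_tokens = tokens(request.get(field, ""))
        if req_tokens:
            score += weight * len(req_tokens & template[field]) / len(req_tokens)
        elif not template[field]:
            score += weight
    return score


def is_malicious(request, templates, threshold=0.7):
    """Detection stage: does any template match above the threshold?"""
    return any(template_score(request, t) >= threshold for t in templates)
```

In use, `build_templates` would be fed the sandbox-collected requests of known families, and `is_malicious` would be applied to unknown outbound requests; the threshold would have to be tuned on real data.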
Fight against machine traffic
- 2018 Bad Bot Report
Fighting machine traffic: security confrontation has driven the evolution of attack methods into a stage of automated confrontation. Crawlers, credential stuffing tools, and simulators generate large volumes of machine traffic. Search engine crawlers and automatically updating RSS subscription servers generate benign machine traffic, while malicious crawlers and similar tools imitate normal user requests to generate malicious machine traffic. The degree of imitation varies: simple malicious machine traffic is generated directly by scripts, more advanced traffic is generated through browsers such as headless browsers, and the most advanced can simulate mouse movements and clicks. Machine traffic can be distinguished from normal user traffic by network environment (Amazon ISP, data centers, global hosting providers), by the tools used (bot browsers like to disguise themselves as Chrome, Firefox, Internet Explorer, or Safari), and by whether it imitates human interaction such as mouse trajectories and clicks. Once they detect our attempts to stop them, advanced persistent bots (APBs) become persistent and adaptive, changing form in multiple ways. Defense: understand our own operations and the adversary's objectives; block outdated user agents/browsers; block well-known hosting providers; protect sensitive APIs; observe peaks and troughs (waveforms?) by traffic source; investigate the telltale signs of malicious machine traffic; monitor failed login attempts; monitor the number of failed gift card validations; watch for public data leaks to guard against credential stuffing.
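A few of the defensive checks listed above can be expressed as simple rules. The sketch below is only illustrative: the request fields (`user_agent`, `network`, `failed_logins`, `mouse_events`), the minimum browser version, the hosting labels, and the failed-login limit are all hypothetical values chosen for the example, not figures from the report.

```python
import re

MIN_CHROME_MAJOR = 70                          # assumption: older counts as outdated
HOSTING_NETWORKS = {"hosting-a", "hosting-b"}  # hypothetical hosting-provider labels
MAX_FAILED_LOGINS = 5                          # assumption: tolerated failures per window


def bot_signals(request):
    """Return the names of the heuristic rules that a request record triggers."""
    signals = []
    match = re.search(r"Chrome/(\d+)", request.get("user_agent", ""))
    if match and int(match.group(1)) < MIN_CHROME_MAJOR:
        signals.append("outdated_browser")
    if request.get("network") in HOSTING_NETWORKS:
        signals.append("hosting_provider")
    if request.get("failed_logins", 0) > MAX_FAILED_LOGINS:
        signals.append("failed_logins")
    if not request.get("mouse_events", 0):
        signals.append("no_human_interaction")
    return signals
```

Each rule alone is weak; a real deployment would likely combine such signals and still expect advanced bots to adapt to them.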
Malicious URL detection
- Detecting Malicious URLs
Having worked through the domestic material on security algorithms and security data analysis, I began to look abroad and track how machine learning has been applied to network security overseas. Taking URL detection as an example, many application scenarios can be derived from it, including malicious web page detection, malicious communication activity, and malicious web software.
- Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs
Uses malicious URL detection as a complement to malicious web page detection. Data: open-source benign and malicious URL samples, nothing special. Features: lexical features and host-based features, fairly ordinary, with analysis and comparison of each feature subcategory. Models: L1-regularized logistic regression, SVM, and Naive Bayes, also nothing special. What is worth learning is the follow-up analysis of the results: the causes of errors such as false positives and false negatives, mismatched data sources, model performance, and feature performance. After all, the paper was written ten years ago.
- Identifying Suspicious URLs: An Application of Large-Scale Online Learning
- Exploiting Feature Covariance in High-Dimensional Online Learning
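The lexical features mentioned in the notes on "Beyond Blacklists" can be illustrated with a small extractor. This is a sketch under assumptions: the specific feature names and the token delimiters are chosen for the example and are not taken from the paper, and host-based features (which need external lookups) are left out.

```python
import ipaddress
import re
from urllib.parse import urlsplit


def lexical_features(url):
    """Compute a few purely lexical features of a URL (no network lookups)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    # Split path and query on common delimiters to get word-like tokens.
    path_tokens = [
        t for t in re.split(r"[/?.=&_\-]+", parts.path + "?" + parts.query) if t
    ]
    return {
        "url_length": len(url),
        "host_length": len(host),
        "host_dots": host.count("."),
        "host_is_ip": host_is_ip,
        "path_token_count": len(path_tokens),
        "longest_path_token": max((len(t) for t in path_tokens), default=0),
        "digit_count": sum(ch.isdigit() for ch in url),
    }
```

A feature dictionary like this could then be vectorized and passed to a linear classifier such as the L1-regularized logistic regression named in the notes.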
Red Team
- Red Team’s practice and thinking from 0 to 1 (learned)
Definition of Red Team ---> goals of Red Team (learn and use the TTPs of known real attackers to attack, evaluate the effectiveness of existing defenses, identify weaknesses in the defense system and propose specific countermeasures, and use realistic simulated attacks to assess the potential business impact of security issues) ---> who needs a Red Team ---> how a Red Team works (basic composition: knowledge base, infrastructure, technical research capability; work process: full-stage attack simulation, staged attack simulation; collaboration) ---> Red Team quantification and assessment (coverage of known TTPs; detection rate/detection time/detection stage; blocking rate/blocking time/blocking stage) ---> Red Team growth and improvement (simulated environment training, vulnerability analysis and technical research, external communication and sharing)
- Summary of ATT&CK APT organization TTPs
- ATT&CK full platform attack technology summary
- Summary of real APT organization analysis reports
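The quantification metrics listed in the Red Team notes above (coverage of known TTPs, detection rate and time, blocking rate) reduce to simple ratios. The sketch below assumes a hypothetical record format for exercise results; the field names and the technique IDs in the usage example are illustrative only.

```python
def red_team_metrics(known_ttps, results):
    """Summarize an exercise.

    known_ttps: set of technique IDs the team wants to cover.
    results: list of dicts with hypothetical keys
             'technique', 'detected', 'detect_seconds', 'blocked'.
    """
    exercised = {r["technique"] for r in results} & set(known_ttps)
    detected = [r for r in results if r["detected"]]
    blocked = [r for r in results if r["blocked"]]
    total = len(results)
    return {
        "ttp_coverage": len(exercised) / len(known_ttps) if known_ttps else 0.0,
        "detection_rate": len(detected) / total if total else 0.0,
        "blocking_rate": len(blocked) / total if total else 0.0,
        "mean_detect_seconds": (
            sum(r["detect_seconds"] for r in detected) / len(detected)
            if detected else None
        ),
    }
```

For example, exercising three of four target techniques with two detections gives a coverage of 0.75 and a detection rate of about 0.67; the detection and blocking stage could be added as further fields in the same way.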
WAF
- Technical Discussion | Bypassing WAF at the HTTP protocol level
- Use chunked transfer to defeat all WAFs
- Bypass waf from http protocol level and database level
- Four levels of WAF attack and defense research: Bypass WAF
- Some knowledge about WAF
Anomaly detection
- N methods of anomaly detection (learned)
One of the difficulties of anomaly detection is that ground truth is missing. The common approach is to use unsupervised methods to dig out anomalous samples, then use a supervised model to combine multiple features and find more anomalies. Anomalies can be detected from many angles: time series (moving average, year-on-year and month-on-month comparison, STL+GESD), statistics (Mahalanobis distance, box plot), distance (KNN), linear methods (matrix decomposition and PCA dimensionality reduction), distribution (relative entropy/KL divergence, chi-square test), trees, graphs, behavior sequences, and supervised models (which can automatically combine more features, such as GBDT).
- Machine learning anomaly detection algorithms (1): Isolation Forest
- Machine learning anomaly detection algorithms (2): Local Outlier Factor
- Machine learning anomaly detection algorithms (3): Principal Component Analysis
- What is a one-class support vector machine (One-Class SVM)? Is it defined in contrast to the two-class support vector machine?
- Isolation Forest, an anomaly detection algorithm
- Anomaly mining: Isolation Forest
- Preliminary attempt to test
- Intelligent monitoring of time-series data powered by machine learning
- Anomaly mining in massive operations and maintenance logs
- Data preprocessing: outlier identification
- Anomaly Detection and Supervised Learning
- What are the common "anomaly detection" algorithms in data mining? Win-tuning's answer, Zhihu
1. Introduce common unsupervised anomaly detection algorithms and experiments; 2. Compare the detection capabilities of multiple algorithms; 3. Compare the computational overhead of multiple algorithms; 4. Summarize how to approach anomaly detection problems.
1.1) Statistical and probabilistic models: distribution assumptions and hypothesis tests, one-dimensional versus multi-dimensional, independent versus correlated features, Euclidean distance versus Mahalanobis distance; linear models: low-dimensional embedding, Mahalanobis distance, PCA and soft PCA, One-Class SVM; similarity-based models: density, distance, angle, separating hyperplane, clustering; ensemble anomaly detection and model fusion.
1.2) Verifying the connections between algorithms from the decision boundaries in the experimental result plots.
2.1) Comparison of detection results: Isolation Forest and KNN are stable; KNN and other distance-based models are strongly affected by data dimensionality.
3.1) Data volume and data dimensionality also affect algorithm overhead. Isolation Forest is more suitable for high-dimensional spaces.
4.1) What the experiments suggest about model selection: on small and medium datasets KNN and MCD are relatively stable, and on medium and large datasets Isolation Forest is stable; anomaly detection is usually unsupervised, so stability matters more than performance that swings high and low; simple models may also work very well.
4.2) For a brand-new anomaly detection problem, analyze it in the following steps: A. Understand the data: its distribution and the distribution of the anomalies, and choose a model according to those assumptions; B. ... must not be wasted; C. If possible, try different algorithms, especially when the data is limited; D. Choose the algorithm according to the particular characteristics of the data; E. Validating the results of an unsupervised anomaly detection model is not easy; a semi-automatic approach can pass high-confidence results automatically and send low-confidence ones to manual review; F. Anomaly trends and characteristics change constantly, so the model needs retraining and strategy adjustment; G. Do not rely on the model completely; prefer a semi-automated strategy: manual rules + detection model. Manual rules are still very useful; do not try to replace the existing rules with a data-driven strategy in one step.
- Sworing | Anomaly detection
- Anomaly Detection Isolation Forest & Visualization
- Anomaly detection with time series forecasting
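Two of the simplest techniques named in the notes above, the box plot rule from the statistics angle and the moving average from the time-series angle, can be sketched with the standard library alone. The multiplier 1.5, the window of 5, and the factor 3 are conventional defaults assumed for this example, not values from the linked articles.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Box plot rule: indices of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


def moving_average_outliers(series, window=5, k=3.0, min_std=1e-9):
    """Flag points that deviate from the mean of the previous `window`
    points by more than k standard deviations of that window."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean = statistics.fmean(history)
        std = max(statistics.pstdev(history), min_std)
        if abs(series[i] - mean) > k * std:
            flagged.append(i)
    return flagged
```

Both functions return indices rather than values so that the flagged points can be joined back to timestamps or entity IDs for the manual review step the notes recommend.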
Graphs and security
- Graph/Louvain/DGA chat
Graphs carry topology information, which can be treated as a feature dimension, and some attack-defense scenarios have distinctive topological signatures. The key to the Louvain algorithm is the edge weights of the graph, which need dedicated study in each attack-defense scenario. For example, in the DGA scenario, the correlation between domain names A and B = the number of IPs shared by domain names A and B. cdxy implements this logic in SQL.
- Community detection: a first look at the Fast Unfolding (Louvain) algorithm
- A DGA Odyssey: PDNS Driven DGA Analysis
- Graph computing in practice for basic security
Learned: how graphs are applied in intrusion detection, intrusion response, threat intelligence, and UEBA. Intrusion detection: the direction of enterprise intrusion detection and the development of data analysis capabilities. Intrusion response: the problems solved along the way (completeness and richness of logs, correlation analysis over massive data and long time windows, real-time graph construction and querying, interaction and visualization). UEBA: cloud native and zero trust -> secure by default -> attackers obtain trusted service credentials, "supply chain" attacks -> intrusion detection built on top of authentication -> behavioral analysis and profiling. Summary: business problem -> data problem.
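The edge-weight rule from the DGA notes above (the correlation between two domains is the number of IPs they share) can be sketched in a few lines. The input format is an assumption: a list of `(ip, domain)` pairs, where the IP could be either a querying client or a resolved address depending on the log source; the original logic is described as SQL, so this is only an equivalent illustration.

```python
from collections import defaultdict
from itertools import combinations


def domain_edge_weights(records):
    """Build weighted edges between domains.

    records: iterable of (ip, domain) pairs (hypothetical log format).
    Returns {(domain_a, domain_b): number_of_shared_ips} with a < b.
    """
    domains_by_ip = defaultdict(set)
    for ip, domain in records:
        domains_by_ip[ip].add(domain)

    weights = defaultdict(int)
    for domains in domains_by_ip.values():
        # Every pair of domains seen with the same IP gains one unit of weight.
        for a, b in combinations(sorted(domains), 2):
            weights[(a, b)] += 1
    return dict(weights)
```

The resulting weighted edge list is the kind of input a Louvain community detection implementation expects; tightly connected communities of algorithmically generated domains would then be candidates for review.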
AI and Security
- Learning materials on security scenarios, AI-based security algorithms, and security data analysis
- Towards Privacy and Security of Deep Learning Systems: A Survey
AI security attack surface: data and models in the training and testing phases; attacks include data poisoning and adversarial examples, model extraction, and model inversion.
- Intelligent threat detection: a Spark-based SOC machine learning detection platform
Corporate security construction
Safe development
- Safety scanning automation detection platform construction (in web black box)
- Analysis of Kunpeng source code of reading artifacts
Security testing
- Construction plan for the risk control and early warning system
Business security and risk control: discover anomalies quickly and define risks accurately. Find anomalous segments and entities through changes in core metrics, then find all entities in the same cluster through clustering; anomalous entity sampling ---> unconscious manual review --->
- From traditional security to risk control: also on trends in the black market and the risk control industry
Business security and risk control: the fight in risk control is becoming increasingly fierce. The black market has evolved from highly specialized gangs with a clear division of labor into industrialized operations. Risk control now needs support from basic (traditional) security technology. With the heavy crackdown on black and gray markets, large enterprises will pay attention to the product capabilities and legal compliance of risk control vendors.
- Rating model interview preparation: technical article
- Risk control models in practice: the "Magic Mirror Cup" rating competition
- Risk control user recognition method
- github: Sladesha
- Multi-algorithm identification of abnormal users (credential stuffing, coupon grabbing, etc.)
- DNS tunnel covert communication experiment && an attempt at detection with a feature vectorization approach
- HIDS of Enterprise Safety Construction
- Guarantee IDC security: distributed HIDS cluster architecture design
- Dianrong open-sources AgentSmith-HIDS: a lightweight HIDS
- Enterprise security construction: some ideas on agent-based HIDS design
Intrusion detection, host-based intrusion detection system (HIDS): Meituan's systematic practice is well worth learning from. Requirements description, where product managers raise the requirements -> requirements analysis, summarizing the characteristics of the product architecture -> technical difficulties, analyzing the technical challenges encountered -> architecture design and technology selection -> distributed HIDS cluster architecture diagram -> programming language selection -> product implementation.
- ICMP tunnel detection method and implementation based on statistical analysis
security products
- Collect some excellent open source security projects to help Party A's security practitioners build open source security products (learned)
Open source security products: asset management, security development, automated code auditing, security operation and maintenance, bastion hosts, HIDS, network traffic analysis, honeypots, WAF, enterprise cloud disk, phishing site systems, GitHub monitoring, risk control, vulnerability management, SIEM/SOC.
Safe operation
- Security operations as I understand them
The company pays for output, not for knowledge. Security operations are oriented toward solving problems. Main responsibilities and skills for security operations: a background in security, R&D, and operations; good communication skills; some project management ability; data awareness.
- Let's talk about security operations
WHY of security operations: security risks have become visible and the facade has been pierced; the security construction phase has passed, and results are now being pursued.
HOW of security operations: grasp the principal and secondary contradictions, do not let go, and do everything possible to resolve them.
Security management
- Enterprise Security Construction Skill Tree V1.0 released
Six parts: explanation, security concepts, security governance, general skills, professional skills, and high-quality resources.
Safe thinking
- Talk about the development direction of the security of Internet companies
Enterprise security development: there are four goals, from shallow to deep. 1. Eliminate vulnerabilities. The first goal is to make every line of code written by engineers secure. This gave rise to the SDL and to products born of the technical research around it, such as code security scanning tools and fuzzing. 2. Even with the SDL, security cannot be 100%, so the second goal is for all known and unknown attacks to be discovered at the first moment, with fast alerting and tracing. Challenges: massive data and complex requirements. Solutions: massive computing power and a three-dimensional model. 3. The third goal is to make security a core competitive strength of the company, built deep into the features of every product, so that it can better guide users' habits in using the Internet. 4. The last goal is to observe changes in security trends across the whole Internet and give early warning of future risks. Doing security at an Internet company takes imagination, along with close attention to developments in other technical fields. That way you will not stop at a handful of vulnerabilities; you will find many interesting things waiting to be done, a magnificent blueprint.
- Innovative defense: thoughts on building an enterprise Blue Army
- Zhao Yan's CISO Blitz | Two years on the road of Party A security practice (learned)
Scope and objects (company business, challenges, and security needs: defense in depth, own supply chain security, empowering third-party security) ---> goal setting (current needs and future development) ---> challenges (a full-stack team whose knowledge structure and skills match the business; engineering capability; management capability) ---> decomposing the security system (security construction, IT security, infrastructure security, data security, endpoint security, business security, privacy and security compliance) ---> achievements (security governance framework, Industry Devil DequTons (the ability to truly land in production, which a demo does not have), security research). In general, it comes down to a full-stack technical field of vision (trying to rise from the skill level to the technical perspective) plus security management capability.
Security architecture
- [Cyber Security Architecture | Improve security through security architecture](https://mp.weixin.qq.com/s/m90wyaevhzfsdgnfhmgxcw)
Red and blue confrontation
- [Red and Blue confrontation] Security Blue Army construction at large Internet enterprises (learned)
WHY of red and blue confrontation: test the enterprise's security protection system; sort out risk blind spots and attack-defense scenarios to provide valuable suggestions for security construction; demonstrate the value of security; strengthen the security awareness of business colleagues.
WHAT of red and blue confrontation: intrusion detection rate; attack-defense scenarios exercised; attack coverage; exercise frequency/security risks/policy defects/efficiency improvement; attack cost; goal achievement rate.
HOW of red and blue confrontation: simulate APT ---> the Blue Army team needs to build up a systematic knowledge base of attack techniques and an arsenal ---> ATT&CK matrix framework.
Challenges in red and blue confrontation: efficiency/yield; quantifying attack cost; challenges from the business (the core goal of red and blue confrontation is to safeguard the business).
Future of red and blue confrontation: a multi-layered, multi-domain Blue Army; the Blue Army's automated penetration platform/collaborative operations platform; exporting Blue Army capabilities.
- Red and blue confrontation construction in the cyberspace era (related red and blue confrontation articles are in the appendix)
Actual combat is the only criterion for testing security protection capabilities. Penetration testing suits the initial stage of building an enterprise security system, when starting from nothing; red and blue confrontation is an upgraded version of penetration testing. It focuses not only on security vulnerabilities but also on defects in the enterprise's security construction system. The boundary of red and blue confrontation is no longer just network penetration attacks from an information security perspective; with the emergence of new technologies and architectures it has expanded, from a cyberspace security perspective, to AIoT, the Industrial Internet, business risk control, eavesdropping, and so on.
Intranet security
- Intranet security: attack simulation and anomaly detection rules in practice
Writing outline: external information gathering -> perimeter breach -> information gathering, privilege escalation -> persistence -> information gathering, credential extraction -> lateral movement -> data theft -> covering tracks.
Data security
- Tencent Security's first enterprise "Data Security Capability Map"
Writing outline: the data security capability map covers 6 major areas: data asset control, data security operations, data business security management and control, security management and control of the data support environment, data operation and maintenance security control, and data security awareness.
New technology and new security
Overview
- Application modernization and shift-left security in digital transformation
Writing outline: new infrastructure -> digital transformation -> traditional informatization faces challenges -> business drives application modernization -> cloud native, containerization, DevOps, application microservices, orchestration, and other new technologies -> application modernization architecture -> built-in security (comprehensive cloud and network awareness, trustworthiness, security involved throughout the process, security operations).
cloud native
- Cloud native network proxy MOSN: transparent traffic hijacking explained | open source
Writing outline: Service Mesh -> Istio -> data plane -> network proxy -> MOSN -> efficient, transparent traffic hijacking. Problem: traffic takeover. Difficulties: environment adaptation, configuration management, data plane performance.
- Observations on trends in cloud native intrusion detection
Writing outline: diversified assets, fragmented services, proliferating middleware, and secure-by-default infrastructure -> intrusion detection moves toward the "business" layer, and behavioral analysis will become the core capability.
- Avfisher (Avfisher): Red Teaming for Cloud (Mark)
trusted computing
- Zhang Ou: trusted network practice at a digital bank
Writing outline: the essential problem is defense in depth at the network level. Why it is needed (challenges) -> ideas and plan for rollout -> the process.
- He Yi: the road to practicing zero trust security architecture
Core points: the core of zero trust is establishing a chain of trust across user + device + application, with continuous dynamic verification and a narrowed attack surface. Work items: network gateway, host gateway, application gateway, SOC.
DevSecOps
- "Security needs every engineer's participation": DevSecOps concepts and reflections (Mark)
safe development
personal development
interview
- Interview experiences, internships, etc.
Interviews: Didi, Baidu (2), 360 (2), Alibaba (6), Tencent (3), Bilibili, Huawei, Tonghuashun, Mogujie. In general the authors are very strong, and most of them are with Party A security departments. My takeaway: the experts' write-ups and questions vary widely, covering the binary direction, data security, security operations, and more. They have some reference value, but because the directions differ, not all of it will apply to you; you still have to build on your own expertise and first become an expert in your own small field.
- 2018 spring recruitment: summary of security internship interviews
- Tencent 2016 internship recruitment: security post written test questions with detailed answers
Written test: design a secure web authentication scheme. Front end: verification code + CSRF_TOKEN + a timestamp-based encrypted random number; identity information is transmitted to the server back end, and a same-origin policy is set (same origin: same domain name, port, and protocol). After the server verifies the client's identity, it returns a randomly generated, encrypted session and cookie to the client; the client then establishes a connection with the server.
- Interview with large companies' security technology positions
Interview: security technology fundamentals ---> project details (know them deeply and outclass the interviewer in your areas of strength, so that the interviewer cannot probe any deeper) ---> how you approach challenging problems (knowledge breadth and industry awareness usually do not extend beyond your area of strength, so read and think more every day) ---> deep industry awareness and career planning
- What is the 2019 Ali internship interview really like? Answer by Zuo Zuo Vera, Zhihu (learned)
- Ten interviews at Ali, seven at Toutiao: guess whether I got into Ali?
Interview: an excellent Java interview write-up, essential for Java.
- Me and Alibaba (too strong)
- Security recruitment interview questions (learned)
Writing outline: penetration testing (web direction), security R&D (Java direction), security operations (compliance audit direction), security architecture (security management direction).
Topics to study further: CRLF; the differences, advantages, and disadvantages of symmetric and asymmetric encryption; the HTTPS handshake process; the same-origin policy; and cross-origin requests.
- Security recruitment: what does a good resume look like?
- Security recruitment: the current state of the security industry
- Security recruitment: essential qualities of security practitioners
Writing outline: basic qualities = basic ability (self-driven + independent learning) + professional ability (penetration attack and defense + software development). Advanced qualities = intelligence (IQ + EQ) + courage and optimism + introspection.
- Security recruitment interview process is now stolen and spent more costs to make up for it.
- A security engineer's 2019
Writing outline: old track, new journey -> industry explorer or follower -> transparent, interoperable industry information -> add some salt to life.
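For the written-test answer earlier in this list about a secure web authentication scheme, the CSRF_TOKEN with a timestamp-based random number can be sketched as below. Note the substitution: the notes say "encrypted", while this sketch signs the token with HMAC-SHA256 instead of encrypting it, which is a common alternative. The token layout, function names, and the 600-second lifetime are assumptions made for the example.

```python
import hashlib
import hmac
import secrets
import time


def _sign(secret, session_id, timestamp, nonce):
    """HMAC over the session, issue time, and random nonce."""
    message = f"{session_id}|{timestamp}|{nonce}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def issue_csrf_token(secret, session_id, now=None):
    """Create a token of the form '<timestamp>.<nonce>.<signature>'."""
    timestamp = int(time.time() if now is None else now)
    nonce = secrets.token_hex(16)
    return f"{timestamp}.{nonce}.{_sign(secret, session_id, timestamp, nonce)}"


def verify_csrf_token(secret, session_id, token, max_age=600, now=None):
    """Accept only unexpired tokens whose signature matches this session."""
    try:
        timestamp_text, nonce, signature = token.split(".")
        timestamp = int(timestamp_text)
    except ValueError:
        return False
    current = time.time() if now is None else now
    if not 0 <= current - timestamp <= max_age:
        return False
    expected = _sign(secret, session_id, timestamp, nonce)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(signature.encode(), expected.encode())
```

The server would keep `secret` (a bytes value) private, embed the issued token in each form, and call `verify_csrf_token` on every state-changing request; the verification code and the same-origin checks from the answer remain separate layers.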
career development
- Self -cultivation of security researchers
- Self -cultivation of security researchers (continued)
- On career directions for security personnel
Party A security career routes: hardcore technical type ---> labs and security research posts at large companies; non-hardcore technical type ---> red and blue confrontation, technical operations, and security management within Internet enterprise security construction.
- The meaning of the existence of security practitioners
Personal development: the goal is to help advanced productive forces solve their security problems. Security is a problem of trust (support support, origin support), a science that studies confrontation (confrontation between people), and a probability problem (security architecture). Security is an applied science. Each era differs, so many different technical means and tools can be used to achieve its security goals. Security practitioners should therefore be sensitive and receptive to new technologies and advanced productive forces, which bring many new perspectives and capabilities, including machine intelligence and blockchain technology.
- The several roles of a security team within an enterprise
Team development: the security team should act as a service provider and collaborator, using its professional security capabilities to supply ideas and solutions, and prevent security issues from recurring.
Industry development
Security pattern
- Latest statistics: papers published by domestic research institutions at top international security conferences, 2005-2017
- Changes in the security field, seen through content output
Technical landscape: Internet giants such as Penguin (Tencent) have begun to block traffic, which has a great impact on security practitioners. Data can no longer be crawled and APIs are restricted, so the only option is to move up to app hooking; on the technical side, security analysis, data mining, and threat intelligence carry more and more weight; AI is not just a gimmick, and intelligent security is unstoppable; in terms of security careers, more and more experts have begun moving into business security and data security.
- A brief analysis of competition in the network security industry
Market structure: basic security protection (traditional security protection capability), intermediate security protection (massive data modeling and analysis capability), and advanced security protection (cloud threat intelligence and analysis capability); the advanced security protection market is broad. In addition, the article highlights artificial intelligence technology in many places. Has intelligent security begun to climb the slope of enlightenment?! More than half of people are optimistic about intelligent security, and some are not. What will happen in the future? Let us wait and see!
- ZoomEye cyberspace mapping: how the Venezuela power outage affected its critical network infrastructure and important information systems
- 2020 security work outlook
Writing outline: 2019 major events: the HW operation turned security from implicit to explicit and from low frequency to high frequency, exposing problems and pushing management to pay attention to security; this is the big backdrop. Security compliance under Classified Protection 2.0 is becoming stricter. Changes in 2019: leadership attention; a practical, combat-oriented focus. 2020 technical focus for Party A security: security operations (metrics such as coverage rate and normal rate; a verification idea: whether security measures can actively discover issues within a given time) and security asset management (CMDB, host data, traffic, scanning, manual additions). 2020: pay attention to the needs of "people". 2020 industry outlook: the organizational structure of Party A security teams will change drastically, and whether the teams can withstand the change is in question; how Party A and Party B get along; more and more security black swan events.
security products
- The future of consumer-facing (C-end) security products
C-end security products: will mobile security products see a spring like the one PC security products saw in earlier years? In the PC era, Windows was a dominant and open platform, which left third-party security companies enough room between the platform and its users. On mobile, however, Android is becoming more closed, so it is hard to say. Traditional security software revolves around viruses and fraud; C-end security products built around personal information security still have a glimmer of hope.
- The Next Holy Grail, 2019
API security: How security has developed: as predicted in 2015, data is the new center, identity is the new perimeter, behavior is the new control, and intelligence is the new service. Infrastructure evolves, and delivery methods follow. In 2015, WAF products in application security were a good opportunity, which the market confirmed. New situation and new opportunities: microservices, serverless, edge computing. The delivery methods accepted by the market have changed. Cross-segment and cross-infrastructure: API security spans application security, data security, and identity security. APIs are used in a wide range of scenarios, so a product needs to cover many different kinds of infrastructure.
data
Data system
- How do data analysts build a data operations metric system? - Answer by Zhang Ximeng (Simon)
Core points: Enabling a collaborative process: building a data-driven XX metric system requires cross-team collaboration. The process is: requirements gathering, solution planning, data collection, evaluation of the collection plan, go-live of data collection and data verification, and effect assessment. Two models for planning a metric system: OSM and UJM. OSM emphasizes business objectives and UJM emphasizes the user journey. Metric hierarchy: first-, second-, and third-level metrics are linked.
- How to build a data/business analysis department from 0 to 1 in an enterprise? (learned)
Department positioning and value -> milestone design -> team building -> building data IT capabilities -> initial management.
Positioning and value are what a department stands on within the company: a reporting department vs. a strategy department; how the same function is positioned in other companies and how other departments in this company perceive it; amplify the department's value and take the senior-management route.
Establish long-term goals and break them down into milestones: company business goals -> company strategy -> department goals -> department milestones -> work plan. Techniques for setting milestones: borrow momentum, win-win, coincidence, and foundations. Borrow the boss's momentum by finding one or two of the boss's pain points and solving them; look for win-win situations with departments that share the same interests; lay the foundations, such as data cleaning, system interconnection, data warehouse design, and data mart design.
Team building driven by the milestones: do not try to staff everything in one step; be careful about forming cliques; do not miss out on talent; learn to "paint the vision"; pay attention to building team culture.
Build the company's data IT capabilities: build a basic, general data-flow framework: application layer, aggregation layer, processing layer, analysis layer, presentation layer. Relevant metrics include capacity, horizontal scalability, query latency, query flexibility, write speed, transaction support, data storage, data processing scale, and scalability. The point to watch when building the data framework is that a company-level business data architecture must be in place. Once the data is systematized, any business change is reflected in the data, and the data fully reflects the current state of the business. The key to completing this step is company-level master data management: clarify the business meaning and definition of each kind of data, assign a responsible unit to each, connect the data pipelines, and promote data sharing.
Lead the team to victory: do "long -ranking" instead of "military captain"; let the right people do the right things; set clear rules and honor them promptly.
Data analysis and operation
- Data analysis and visualization: who is the top "chicken dinner" (PUBG) player in the security community? (learned)
Data analysis and visualization: collect the dataset -> examine the dataset -> community detection and community relationships -> player profiles.
- Please share your approach to data analysis. How do you do data analysis well?
Core points: The data analysis problem: business data analysis metrics (Point Macrobrid). Data analysis methods: classification and comparison.
Security data analysis
- Data-Knowledge-A): Enterprise security data analysis (excellent, learned)
Summary: 1. Let the model understand the business: build an anomaly baseline from historical business behavior and detect threats on top of the anomalies; feed operations results back to the model so that false positives are returned as normal behavior. 2. Make security operations operable: reduce the cost of incident investigation by collecting and aggregating information automatically. 3. As data accumulates, security data analysis will move toward higher-level knowledge representation based on graph structures. (I strongly agree.) 4. Depth of understanding of the scenario, the attack patterns, and the data matters far more than the choice of tools.
- Security data science learning resources
Summary: The author also works on security data science and has compiled some learning methods and resources. The learning methods fall into three channels: Google Scholar, Twitter, and security conferences. On Google Scholar, follow well-known researchers and their new papers, and follow papers that cite the papers you care about; on Twitter, follow people in specific security sub-fields; and track security conferences and their agendas. Learning resources: books and courses.
- Quickly building a data analysis framework on a lightweight OpenSOC architecture (1) (learned)
Outline: Writing approach: from coarse to fine (from framework, to scenario, to concrete architecture). OpenSOC introduction (framework components and workflow) -> building a lightweight OpenSOC (specific scenarios, tools, and architecture) -> setup steps (each step of environment setup and configuration) -> results.
- Xianzhi (Prophet) TALK: exploring security threats from the perspective of data
- Big data threat modeling methodology (learned a lot)
- Security log dimensions
- Exploring ideas for data security analysis
- DataCon 2019: 1st Place Solution of Malicious DNS Traffic & DGA Analysis (learned)
My understanding: The knowledge points involved are: security scenario: DNS security; data processing: using tshark, MaxCompute and SQL, and PAI for preliminary analysis and visualization; feature engineering: features along the DNS_Type and SRC_IP dimensions; anomaly detection algorithm: single-feature 3-sigma test; manually extracted feature rules.
First sub-task, anomaly detection of malicious DNS traffic: about 80% absorbed. The overall pipeline poses no obstacles, but the details and tools of each step are not fully mastered, for example comprehensive feature extraction and expressing features in SQL.
Second sub-task, multi-class classification of DGA: about 50% absorbed. The process is understood, but some points are not fully grasped, such as community detection algorithms.
- Practice of an enterprise network threat discovery model based on big data
My understanding: Problem: horizontal and vertical linkage of multi-source security analysis devices and services (threat data).
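The single-feature 3-sigma test mentioned in the DataCon notes above can be sketched roughly as follows. This is only an illustration, not the winning team's code: the feature (DNS queries per source IP), the sample values, and the helper name `three_sigma_outliers` are all assumptions made for the example.

```python
from statistics import mean, pstdev

def three_sigma_outliers(values, k=3.0):
    """Return indices of values more than k standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) > k * sigma]

# Hypothetical per-source-IP DNS query counts; the last host is far noisier.
query_counts = [12, 15, 11, 14, 13, 12, 16, 15, 13, 14,
                12, 13, 15, 14, 11, 13, 12, 14, 15, 900]
print(three_sigma_outliers(query_counts))  # -> [19]
```

Note that on very small samples a single extreme value inflates the standard deviation enough to hide itself, so the test is more useful on features aggregated over many hosts.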
algorithm
AI
Algorithm system
- Machine learning algorithm collection: from Bayes to deep learning, with their respective pros and cons
Algorithm knowledge framework: explained mainly in terms of definition, process, representative algorithms, and pros and cons: regression, regularization algorithms, artificial neural networks, deep learning || decision tree algorithms, ensemble algorithms || support vector machines || dimensionality reduction algorithms, clustering algorithms || instance-based algorithms || Bayesian algorithms || association rule learning algorithms || graphical models.
Personal understanding: The regression family is built mainly on linear regression and logistic regression, and includes regression, regularization algorithms, artificial neural networks, and deep learning. The tree family is built mainly on decision trees, and includes decision trees and tree-based ensemble learning algorithms. Support vector machines belong to an older generation of algorithms. Dimensionality reduction and clustering algorithms describe data mainly through its inherent structure. Instance-based algorithms have no real training process; the representative algorithm is KNN, which is memory-based. Bayesian algorithms use Bayes' theorem to compute the output probability. Association rule learning extracts the best explanation of the relationships among variables in the data. A graphical model is a probabilistic model that expresses the conditional dependency structure among random variables.
- Categories of Algorithms Non Exhaustive (learned)
Algorithm knowledge framework: I learned how to build my own algorithm system.
basic knowledge
- HTTP dataset CSIC 2010
Security dataset CSIC 2010: a security dataset generated automatically from an e-commerce web application, containing 36,000 normal requests and 25,000 anomalous requests. The anomalous requests include SQL injection, buffer overflow, information gathering, file disclosure, CRLF injection, XSS, and others.
- Performance evaluation of classification models, using SAS Logistic as an example (3): Lift and Gain
- How to deal with imbalanced datasets in machine learning?
Imbalanced datasets: oversampling and undersampling; penalty weights for positive and negative samples (for example scikit-learn's SVM: class_weight: {dict, 'balanced'}); combination/ensemble methods (train models on sample subsets and ensemble them); feature selection (when the minority class reaches a certain scale, select the significant features).
- What are the differences between GBDT and XGBoost in machine learning algorithms?
Algorithm comparison: GBDT's base learner is CART. XGBoost supports several kinds of base learner, such as a linear classifier, in which case it is equivalent to logistic regression or linear regression with L1 and L2 regularization terms. GBDT uses only the first-order derivative, while XGBoost applies a second-order Taylor expansion to the loss function, which improves accuracy. XGBoost parallelizes at feature granularity: feature values are pre-sorted and stored in a block structure; when a node is split, the gain of every feature has to be computed and the feature with the largest gain is chosen, so the per-feature gain computation can run in multiple threads. When a split brings negative gain, GBM stops splitting, whereas XGBoost keeps splitting to the specified maximum depth and then prunes back globally. From an optimization perspective, GBDT uses numerical optimization: it uses steepest descent to solve for the optimum of the loss function, with CART trees fitting the negative gradient and Newton's method for the step. XGBoost takes an analytic approach: it expands the loss function to a second-order approximation, obtains the analytic solution, and uses it as the gain when building the decision tree so that the loss function is optimal.
- When should SVM and logistic regression each be used?
Usage scenarios for SVM and logistic regression: decide according to the number of features and the number of training samples. If the number of features is much larger than the number of training samples, a linear model already works well and an overly complex model is unnecessary: use LR or an SVM with a linear kernel. If there are enough training samples and few features, an SVM with a complex kernel gives better predictive performance; as long as the sample size does not reach the million level, the complex kernel will not make training too slow. If the training set is extremely large, an SVM with a complex kernel becomes too slow, so consider introducing more features and then building the model with a linear SVM or LR.
- Why is the residual in GBDT replaced by the negative gradient?
- Euclidean distance and Mahalanobis distance
- Summary of common evaluation metrics for machine learning algorithms
- Classification model evaluation: ROC-AUC and PR curves
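For the imbalanced-dataset note above, the 'balanced' class-weight option weights each class inversely to its frequency. The sketch below reproduces that heuristic, n_samples / (n_classes * class_count), in plain Python so that it runs without scikit-learn; the function name `balanced_class_weights` and the toy labels are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Toy labels: 90 normal requests (0) and 10 attacks (1).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)  # the rare class 1 gets the larger weight
```

With scikit-learn installed, a dictionary like this could be passed as `class_weight=weights` to `sklearn.svm.SVC`, or the string `'balanced'` could be passed to let the library compute it.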
machine learning
- Mean encoding: data preprocessing/feature engineering for high-cardinality qualitative (categorical) features
- Mean Encoding
- Kaggle encoding categorical features summary
- Python target encoding for categorical features
- Mean (likelihood) encodings: a comprehensive study
- How to reach the top 10% in your first Kaggle competition
- Kaggle contest summary
- Sharing experience from Kaggle, JData, and Tianchi competitions: reading this one is enough
- Why do GBDT and Random Forest perform so well in real Kaggle competitions?
Supervised learning, tree-family algorithms: For a single model, Gradient Boosting Machine and deep learning are the first choices. GBM needs neither complicated feature engineering nor much tuning time, while DL needs more time to tune the network structure. From the perspective of overfitting, both have the capacity to overfit and even to fit perfectly. The stronger the capacity to overfit, the more malleable the model. The problem then becomes how to train the model so that it is "just right", for example with early_stopping in GBM. Linear regression lacks the capacity to overfit: if the real data follows a linear relationship it gives good results, and if not, feature engineering is required, which is a rather subjective process. Advantages of trees: they are non-parametric models, and GBM has a strong capacity to overfit. Random Forest's capacity for a perfect fit is very poor, because its trees are trained independently and do not cooperate; although it is a non-parametric model, it wastes that innate advantage.
- [Summary] Understanding tree algorithms
Supervised learning, tree algorithms: the difference between classification trees and regression trees; ways to avoid overfitting in decision trees; how random forests are applied to classification and regression; why GBDT beats RF on Kaggle; understanding RF, GBDT, and XGBoost (principles, pros and cons, differences, characteristics).
- LightGBM
- LightGBM algorithm summary
- "I Love Machine Learning": Ensemble learning (4): LightGBM
- How to use LightGBM (walkthrough of the official slides)
Supervised learning, LightGBM, personal understanding: LightGBM's main features and principles: histogram-based splitting and histogram acceleration (two major improvements on the histogram approach, whose complexity is O(#feature × #data): GOSS reduces the number of samples and EFB reduces the number of features) -> better efficiency and memory use. Leaf-wise growth with a max depth limit replaces level-wise growth -> better accuracy. Native support for categorical features. Parallel computation: data parallel (horizontal partitioning of data), feature parallel (vertical partitioning of data), and PV-Tree voting parallel (essentially data parallel).
- Quickly understanding ensemble algorithms in machine learning: principles, frameworks, and practice
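- What are good methods for clustering time series data?
Unsupervised learning, time series: In traditional machine learning data analysis: extract features, then group them with a clustering algorithm. In natural language processing: to find similar news items or group similar texts, use word2vec to turn natural language into vector features and then cluster with an algorithm such as KMeans; another approach is to compute the Jaccard similarity between two texts and then cluster with hierarchical clustering. Common clustering algorithms: distance-based (KMeans) and similarity-based (hierarchical clustering). Methods for clustering time series data: feature construction for time series, and similarity measures for time series. With deep learning, you either provide a large amount of labeled data or are limited to unsupervised encoder methods.
- A preliminary understanding of agglomerative hierarchical clustering
Unsupervised learning, hierarchical clustering: Algorithm steps: compute the proximity matrix -> (merge the two closest clusters -> update the proximity matrix) (repeat), until only one cluster remains or a termination condition is met.
- Introduction to recommendation algorithms (1): a complete guide to similarity computation
Unsupervised learning, hierarchical clustering, similarity computation: Manhattan distance, Euclidean distance, Chebyshev distance, cosine similarity, Pearson correlation coefficient, Jaccard coefficient.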
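The six measures listed above can be written out in a few lines each. The following is a minimal standard-library sketch for equal-length numeric vectors (and sets, for Jaccard); the function names and sample vectors are chosen for illustration only.

```python
import math

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson(a, b):
    # Pearson correlation is the cosine similarity of the mean-centered vectors.
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])

def jaccard(s, t):
    union = s | t
    return len(s & t) / len(union) if union else 0.0

a, b = [1, 2, 3], [4, 6, 8]
print(manhattan(a, b), euclidean(a, b), chebyshev(a, b))
print(cosine_similarity(a, b), pearson(a, b))
print(jaccard({1, 2, 3}, {2, 3, 4}))
```

The first three are distances (smaller means closer) while the last three are similarities (larger means closer), so a clustering routine has to be told which convention the chosen measure follows.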
deep learning
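CPU environment setup
- tensorflow issues#22512
Essence of the problem: error: ImportError: DLL load failed; cause: missing dependencies; fix: pip install --index-url https://pypi.douban.com/simple tensorflow==2.0.0, which installs the dependencies automatically.
GPU environment setup
- TensorFlow and Keras FAQ (continuously updated) (pitfalls)
- Tested build configurations (version compatibility quick reference)
- Installing tensorflow-gpu on Windows (reliable)
- Installing and configuring CUDA and cuDNN on Windows
Essence of the problem: overall, it comes down to matching versions among the NVIDIA GPU driver, CUDA, cuDNN, and tensorflow-gpu. It is best to install tensorflow-gpu==1.14.0. tensorflow-gpu==2.0 requires cuda==10.0; 10.2 raises an error because tensorflow-gpu==2.0 does not support it.
- Setting up a tensorflow-gpu environment on Win10
Essence of the problem: adding the various CUDA environment variables.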
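Deep learning fundamentals
- How does batch size affect learning in deep learning?
- Batch Normalization: principles and practice (not fully understood yet)
Basic neural network components
- How to compute the receptive field (Receptive Field): principles
Receptive field: the deeper the convolutional layer, the larger the receptive field. The formula is (N-1)_RF = f(N_RF, stride, kernel) = (N_RF - 1) * stride + kernel, and the approach is to work backward from the last layer.
- How to understand dilated convolution (answer by Tan Xu)
Dilated convolution: a pooling layer shrinks the image while enlarging the receptive field; the advantage of dilated convolution is that it enlarges the receptive field without the information loss of pooling. Stacking three 3*3 ordinary convolutions with stride 1 only reaches a receptive field of (kernel_size - 1) * layer + 1 = 7, which is linear in the number of layers, whereas the receptive field of dilated convolution grows exponentially: with dilation rates doubling per layer (1, 2, 4), it is (2^layer - 1) * (kernel_size - 1) + 1 = 15.
- Receptive field calculation for dilated convolution
- Dilated convolution
- An intuitive understanding of the final fully connected layer + Softmax in a neural network (easy to follow)
Fully connected layer: can be understood as a weighted sum of features.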
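The receptive-field notes above can be checked numerically. The sketch below applies the backward recursion RF_(N-1) = (RF_N - 1) * stride + kernel layer by layer, treating a dilated kernel as an effective kernel of size dilation * (kernel - 1) + 1. The function name `receptive_field` and the (kernel, stride, dilation) tuple layout are assumptions made for the example.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride, dilation), ordered from first to last layer."""
    rf = 1  # start from a single unit of the final feature map
    for kernel, stride, dilation in reversed(layers):
        effective_kernel = dilation * (kernel - 1) + 1
        rf = (rf - 1) * stride + effective_kernel
    return rf

plain = [(3, 1, 1)] * 3                       # three ordinary 3x3 convolutions
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4)]   # dilation doubles per layer

print(receptive_field(plain))    # -> 7
print(receptive_field(dilated))  # -> 15
```

The two printed values match the 7 and 15 quoted in the notes: linear growth for stacked ordinary convolutions, exponential growth when the dilation rate doubles.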
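Basic neural network architectures
- An illustrated guide to what convolutional networks in deep learning really are
Convolutional neural networks: convolution layer parameters: kernel size (field of view, 3 by 3), stride (downsampling, 2), padding, input and output channels. Convolution types: dilated convolution (which introduces a dilation rate parameter), transposed convolution, separable convolution.
- Convolutional neural network (CNN) model structure
- A summary of the history of convolutional neural networks - article by 没头脑 (very comprehensive)
- Three simplifications, one diagram: understand the LSTM/GRU gating mechanism in one move (very clear)
Recurrent neural networks: the circuit-diagram style used in the article is easy to follow. RNN: input state, hidden state. LSTM: input state, hidden state, cell state, 3 gates. GRU: input state, hidden state, 2 gates. LSTM and GRU mitigate the gradient propagation problem of RNNs through their gating mechanisms.
- gcn
- GRAPH CONVOLUTIONAL NETWORKS
Graph neural networks: compared with CNNs, the difference lies in the formula of the graph convolution operator.
- keras-attention-mechanism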
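To make the graph convolution operator mentioned in the graph neural network note concrete, here is a small sketch of one layer using the commonly cited propagation rule H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W), where D is the degree matrix of A + I. It is written in plain Python for a tiny graph; the helper names, the two-node graph, and the weight values are illustrative assumptions, not taken from the linked repositories.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def normalized_adjacency(adj):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def gcn_layer(adj, features, weights):
    """One graph convolution: aggregate neighbor features, project, apply ReLU."""
    propagated = matmul(matmul(normalized_adjacency(adj), features), weights)
    return [[max(0.0, v) for v in row] for row in propagated]

adj = [[0, 1], [1, 0]]       # two nodes joined by one edge
features = [[1.0], [3.0]]    # one feature per node
weights = [[1.0]]            # hypothetical 1x1 weight matrix
print(gcn_layer(adj, features, weights))  # -> [[2.0], [2.0]]
```

Each node ends up with the normalized average of its own and its neighbor's feature, which is the sense in which the operator replaces the fixed spatial kernel of a CNN with the graph's adjacency structure.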
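Neural network applications
- [AI recognizing people] OpenPose: real-time multi-person 2D pose estimation | with video test and source code links
- Using generative adversarial networks (GAN) to generate DGA
- GAN_for_DGA
- A detailed guide to implementing Wasserstein GAN with Keras
- Wasserstein GAN in Keras
- WassersteinGAN
- keras-acgan
- Solving large-scale text classification with deep learning (CNN, RNN, Attention): survey and practice
NLP: traditional high-dimensional sparse representations -> today's low-dimensional dense representations. Things to note: class imbalance; understand the data (bad cases); fine-tuning (using only word2vec-trained word vectors as the feature representation may lose a lot of performance; pretrain, then fine-tune); always use dropout; avoid training oscillation; hyperparameter tuning; softmax loss is not always required; the model is not the most important thing; focus on iteration quality (why? conclusion? next step?).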
reinforcement learning
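- Weaknesses and limitations of deep reinforcement learning
- Some thoughts on the limitations of reinforcement learning
Limitations of reinforcement learning: very poor sample efficiency; it is hard to design a suitable reward function.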
Application areas
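- The most complete security data site in the world? (worth organizing properly when there is time)
- A first look at detecting PHP webshells with machine learning
- Exploring machine-learning-based webshell discovery
- Is network security about to enter the era of machine-vs-machine confrontation?
Intelligent security, intelligent attacks: researchers abroad are already studying how to use machine learning to build smarter attack tools. Deep reinforcement learning, for example, combines deep learning and reinforcement learning; it can perceive the environment and make optimal decisions, and might be applied to vulnerability scanners so that a scanner can intrude on targets automatically.
Personal understanding: there is already a case abroad, Deep Exploit, which combines deep reinforcement learning with Metasploit for automated penetration testing; I have not seen a comparable public case domestically. Because of the high learning threshold, the fine-grained operations that attack scenarios require, and the shallow integration of machine learning with security scenarios that comes from weak-AI machine learning, most existing machine learning + security research concentrates on defense, and research on machine learning + attack is scarce and limited. I still believe this scenario has great potential and may one day become an attack weapon for the blue team.
- AI anti-fraud trilogy: device fingerprinting
Intelligent security, business security, device fingerprinting: IP, cookie, device ID. Active device fingerprinting: use JS or an SDK to collect all kinds of device attribute values from the client, combine them, and derive a device ID with a hash algorithm. Pros: high accuracy within the Web or within an app. Cons: active fingerprints produce different device IDs between Web and app and between different browsers, so devices cannot be associated across Web and app or across browsers; because it relies on client-side code, the fingerprint is weak against adversaries in anti-fraud scenarios. Passive device fingerprinting: extract feature sets for the device OS, protocol stack, and network state from packets, and identify the terminal device with machine learning algorithms. Pros: compensates for the shortcomings of active fingerprinting. Cons: consumes more processing resources; response latency is longer than with the active approach.
- Risk Brain payment risk identification, preliminary round experience sharing [谋杀电冰箱-凤凰还未涅槃]
Intelligent security, business security, risk control: for my understanding see https://github.com/404notf0und/Risk-Operation-Detection/blob/master/atec.ipynb.
- Machine learning practice at Internet giants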
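Intrusion detection: The main difference between machine learning and statistical modeling: machine learning relies mainly on data and algorithms, while statistical modeling relies on the modeler's understanding of the data's characteristics. Pros and cons of each: Machine learning: labeled data is hard to obtain; with unsupervised learning, performance is not good enough to operate on; the results are not interpretable. This is why machine learning for intrusion detection is usually restricted to one specific scenario. Statistical modeling: in the data preprocessing stage, remove the interference of normal data (focus on recall, emphasizing the ability to filter normal data and screen out as much of it as possible), then build an attack model that can recognize malicious or suspicious behavior (focus on precision, emphasizing how accurately the model judges anomalous attack patterns; attack chain model); the drawbacks are limited generalization and, in some intrusion detection scenarios, models that are easily disturbed. Our final goal: security analysis that is operable in big data scenarios.
- Lessons learned from machine learning in web security detection
Web security: approach web security detection as a text classification problem. Diversity of data samples, short-text classification, word vectors, sentence vectors, text vectors. Text classification + multi-dimensional features. Comparing against traditional methods leads to a better detection approach: traditional methods + machine learning: use traditional WAF/regex rules to label data; filter with traditional methods first.
- The ins and outs of word embeddings (learned)
NLP: the core of deep NLP: language representation -> types of word representation in NLP: one-hot representation and distributed representation (these methods all rest on the distributional hypothesis: a word's meaning is determined by its context; the core of the method is representing the context and modeling the relationship between the context and the target word) -> NLP language models: statistical language models -> distributed representation of words: matrix-based, clustering-based, and neural-network-based distributed representations, word embeddings -> word embeddings (a word embedding is a by-product of training a neural network language model) -> neural network language models and word2vec.
- An accessible explanation of language models
NLP: statistical language models: definition (a model that computes the probability of a sentence, that is, the probability that a sentence is natural human language), the Markov assumption (the probability of any word depends only on a limited number of words before it, one or a few), N-gram models (unigram, bigram).
- Can anyone explain word embedding? - YJango's answer - Zhihu
NLP: word representation: one hot representation, distributed representation. Word embedding: the answer analyzes neural networks on one hot representation and distributed representation as examples, showing that representing a word with a distributed representation is better. Word embedding is the effect that a neural network's analysis of distributed representations exhibits: it reduces the amount of training data needed, by automatically learning from data the mapping f from the input space to the distributed representation space (equivalent to adding prior knowledge: the same things do not need to be learned separately from different data). Training method: how do we automatically find the mapping f that turns a one hot representation into a distributed representation? Idea: a word's meaning has to be understood in a specific context. Example: "this cute Teddy licked my face" and "this cute Pekingese licked my face": use the input word x as the center word to predict how likely other words z are to appear around it (only at this point did I understand why word embeddings are called a by-product of training a neural network language model). Using the input word as the center word to predict the surrounding words is called skip-gram; using the input words as the surrounding words to predict the center word is called CBOW.
- Chars2vec: character-based language model for handling real world texts with spelling errors and…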
- Character Level Embeddings
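- Exploring malware detection with a TextCNN model
Malware detection: the improvements fall into two areas: tuning and structure. Tuning: the Embedding layer's inputLen and output_dim, EarlyStopping, the sample ratio parameter class_weight, the l2 regularization parameter of the convolutional and fully connected layers, and a batch_size adapted to the hardware (GPU, TPU). Structure: a global pooling layer was added.
Learned: a trick: use the training set and the logloss evaluation metric to compute the number of each label in the test set, and adjust the class_weight parameter during training accordingly; this can even achieve the effect of "checking the answers" in advance. It is exactly the same trick a top contestant from T大 used in the DataCon domain name security detection competition.
- Identifying video web pages from massive URL data
CV, article outline: Problem: identifying video web pages. Solution: coarse filtering by URL -> coarse filtering with video page rules -> page screenshots and CNN recognition.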
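The statistical language model notes earlier in this section (sentence probability, Markov assumption, bigram) can be illustrated with a tiny maximum-likelihood bigram model. This is a bare sketch with no smoothing; the two-sentence corpus, the `<s>` and `</s>` boundary tokens, and the function names are assumptions made for the example.

```python
from collections import Counter

def train_bigram(sentences):
    """Count context words and adjacent word pairs, with sentence boundary tokens."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs
    return unigrams, bigrams

def sentence_probability(sentence, unigrams, bigrams):
    """P(sentence) under the Markov assumption: product of P(word | previous word)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen context; a real model would apply smoothing
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob

corpus = ["the dog licked my face", "the cat licked my hand"]
unigrams, bigrams = train_bigram(corpus)
print(sentence_probability("the dog licked my hand", unigrams, bigrams))  # -> 0.25
print(sentence_probability("my dog", unigrams, bigrams))                  # -> 0.0
```

The first sentence never appears in the corpus but still receives a non-zero probability because every adjacent word pair was seen, which is what the Markov assumption buys.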
Industry development
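- Another breakthrough in cognitive intelligence: 18 Alibaba papers accepted at the top AI conference KDD
Cognitive intelligence: computational intelligence -> perceptual intelligence -> cognitive intelligence. Fast computing, memory, and storage -> recognizing and processing language, images, and video -> thinking, understanding, reasoning, and explaining. Three key technologies of cognitive intelligence: the knowledge graph is the base material, graph neural networks are the reasoning tool, and user interaction is the goal.
- Which direction of machine learning talent will be scarcest in the next 3 to 5 years? - Wang Zhe's answer
Key notes: an algorithm "engineer" who stands on top of the machine learning "engineering system" and weighs "model structure", "engineering constraints", and "problem objectives" together. My understanding: the dividend is shifting, from the gains of single-point innovation in model structure -> to the gains of coordination across the whole system.
- Alibaba technology vice president Jia Yangqing: my humble views on artificial intelligence
AI development: the successes and limits of neural networks and deep learning: the successes come from big data and high-performance computing; the limits lie in structured understanding and in effective learning algorithms for small data. Where will AI go? How should traditional deep learning applications such as image and speech deliver products and value, going into broader fields rather than staying at the level of security surveillance? Beyond speech and images, how can more problems be solved, rather than staying within a few domains such as speech and image?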
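Comprehensive quality
- The interview skills radar chart every algorithm engineer must know (learned)
Personal development, essential technical qualities: an algorithm engineer's essential technical qualities break down into knowledge, tools, logic, and business. Beyond meeting the minimum requirements, an algorithm engineer's abilities across these four areas are relatively well-rounded, covering both "algorithm" and "engineering", whereas a big data engineer emphasizes "tools" and a researcher emphasizes "knowledge" and "logic".
An algorithm engineer working on security business is a security algorithm engineer. To make this easier to understand, here is an example: if XGBoost is used to solve some security problem, you can understand it step by step from shallow to deep, stringing together knowledge, tools, logic, and business:
1. The principles of GBDT (knowledge)
2. How are features selected when a decision tree node splits? (knowledge)
3. Write out the formulas for Gini Index and Information Gain and give examples (knowledge)
4. What is the difference between classification trees and regression trees? (knowledge)
5. Compared with Random Forest, understand what model bias and variance are (knowledge)
6. What experience is there in tuning XGBoost's parameters? (tools)
7. How are XGBoost's regularization and parallelization each implemented? (tools)
8. Why does severe overfitting occur when solving this security problem? (business)
9. If you chose another model to replace XGBoost, or improved XGBoost, what would you do? Why? (business, logic, knowledge)
The above takes "knowledge" as the entry point and leads to a deep understanding not only of "knowledge" but also of "tools", "logic", and "business".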
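As one possible answer to item 3 in the list above, the Gini index of a label set is 1 - sum(p_k^2) and information gain is the parent's entropy minus the size-weighted entropy of the children. The sketch below computes both on a toy split; the labels and function names are illustrative.

```python
import math
from collections import Counter

def gini(labels):
    """Gini index: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["attack", "attack", "normal", "normal"]
perfect_split = [["attack", "attack"], ["normal", "normal"]]
useless_split = [["attack", "normal"], ["attack", "normal"]]

print(gini(parent))                             # -> 0.5
print(information_gain(parent, perfect_split))  # -> 1.0
print(information_gain(parent, useless_split))  # -> 0.0
```

A split that separates the classes completely recovers the full one bit of entropy, while a split that leaves each child as mixed as the parent gains nothing, which is the criterion behind item 2 as well.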
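- [Campus recruiting experience] BAT machine learning algorithm internship interview notes (learned)
Personal development, interview experience: use the questions that often come up in interviews to deepen your understanding of machine learning and build up your own algorithm knowledge base.
- How can machine learning avoid being "just parameter tuning"? (learned)
Personal development, career development: machine learning engineers fall into three types: applied (ability: stay full-stack in algorithms, that is, data, modeling, business, operations, and backend, with the focus on modeling; the workflow is that, faced with a given business scenario, you should quickly know which data to use as features, which model to use, how timely and robust that model is in engineering terms, and whether it will ultimately create business risk, the whole chain. Expected goal: develop strong business sensitivity and validate proposed requirements quickly), wheel-building (read top conference papers to keep pace with the times, have very strong engineering ability, and build ML frameworks for applied machine learning engineers to use), and research (AI Lab, reading papers + experimental reproduction). Personal development: build business and engineering ability; the growth plan for the next few years is still the full-stack algorithm route: stand on my own technically, bring KPIs on the business side, and later get promoted quickly and lead a team. At the same time keep a reading habit and keep learning new things.
- What is it like to work as a machine learning algorithm engineer?
Personal development, work experience: business understanding, data cleaning and feature engineering, continuous learning (to strengthen judgment about solutions), programming ability, common tools (XGB, TensorFlow, ScikitLearn, Pandas (tabular or time series data), Spark, SQL, FbProphet (time series)).
- Junior-year internship interview notes (learned)
- If you were the interviewer, how would you judge a candidate's deep learning level?
Personal development, reflections: deep learning is good at problems and data with local correlation, and works remarkably well in image, speech, and natural language processing, because images are made of pixels, speech of phonemes, and language of words; all have local correlation, from which high-level features can be constructed.
- How do interviewers judge a candidate's machine learning level? - 微调's answer - Zhihu
Personal development, reflections: consider the strengths and limits of each method and cultivate independent thinking; judge correctly how much machine learning affects the business; learn to discuss case by case (for example deep learning relative to machine learning); learning machine learning cannot stop at "knowing": learn at the level of principles, even at the level of source code, knowing both the what and the why, and aim to be the best at machine learning in the security community.
- Personal summary and study advice from an algorithm engineer with two years at Meituan
Personal development, reflections: a basic understanding of algorithms (knowledge), solid coding ability (tools), data processing and analysis ability (business and logic), the ability to accumulate and transfer models (business and logic), product sense, soft skills.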
Profession
career planning
thinking
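- How to fix muddled thinking and disorganized speech? (learned)
Structured thinking -> organized speech.
- Which ways of thinking have you deliberately trained? (learned)
structured thinking
Pyramid thinking: conclusion first, upper levels govern lower levels, group and categorize, logical progression.
Pyramid structure: extend vertically, classify horizontally.
How to reach the pyramid's conclusion: induction and deductive reasoning. In real life there is not always a relevant model to apply or a deduction available; in that case use induction, sorting things out bottom-up to reach a conclusion. For example, brainstorm and write down every fragmentary idea that comes to mind, then abstract and classify, and finally draw the conclusion.
- How do highly capable people analyze problems? (learned)
Defining/describing the problem: the essence of a problem is the gap between reality and expectation; clarify the expected value B', pinpoint the current state B, and use the gap B -> B' to describe the problem precisely.
Analyzing/solving the problem: do not start from the current state B and look for a path B -> B'; look through the phenomenon to the essence. Method A, reality B, expectation B', variable C. Calibrate the expectation B', reconstruct the method A, eliminate the variable C.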
communicate
manage
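- "I'm the technical director, why do you keep asking me about technical details?"
(Business stages such as rapid growth, plateau, and decline as the timeline) (middle and senior managers) (need to master) (application scenarios, technical foundations, and the technical details in the technology stack). Technical foundations must be solid, the technology stack must be understood deeply (clear on principles and details), and knowledge of application scenarios must not be superficial. In one sentence: technical details and technical depth.
- Alibaba senior algorithm expert Weishi (威视): some thoughts on building a technical team (learned)
Article outline: team positioning (positioning (capability, business, service), moat (accumulating risk control knowledge as a moat, meeting change with constancy), and value (offering different levels of service)) -> team capabilities (connect, produce, disseminate, serve) -> the relationship between the organization and the individual -> hiring -> deploying people -> internal management model (finding the right direction; performance evaluation (3 dimensions: business results, capability growth, technical influence)).
Learned: build a technical system that solves a class of problems, rather than using one technique to solve one problem.
- Data director at 26: lessons from my first time as a leader
Fundamentals and methodology of team management: set strategy, build the team, establish rules, deliver results.
Set strategy: be clear about senior management's real intentions; know your own team inside out; the manager's specialized industry knowledge and experience.
Build the team: avoid jealousy of talent, workplace nepotism, and fragile egos.
Establish rules: set the rules and follow them.
Deliver results: mind your manners when claiming credit.
Common management pitfalls: giving up your original specialty after moving into management (keep watching industry direction and frontier technology); over-managing (aim for a stable, mature team that runs itself); over-pursuing team stability (the core measure of team stability is not whether people stay, but whether the team's efficiency and output keep growing steadily).
- What traits make an employee likely to become a manager
Internal promotion to manager: timing: the stage the company/industry is in; position: the stage the department/business is in; people: relationships + your own ability.
Becoming a manager by changing jobs: move from a big company to a small one in search of a career breakthrough, with the drawback that it is easy to leave and hard to come back; or become an influential figure in the industry and get poached by a big company. Most people are in the first situation. Those at big companies should be a bit more patient and work toward promotion inside the company, because the roundabout route of job-hopping no longer has a market.
- Must the leader of a technical department be a top technical expert?
Core points: Manager vs Tech Leader, methodology, soft skills, empowering team members, all-round ability.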
think
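- Where do good research ideas come from
The essence of research is exploring unknown territory and pursuing answers to open questions. The definition of "good" -> the ability to tell good from bad -> a thorough understanding of the history and current state of your research direction -> the practice method / analogy method / combination method. This is like the training and testing phases of machine learning. Training: thoroughly learn the history and current state of your research direction and judge whether the research of each period was good or bad. Testing: generate ideas with the practice, analogy, and combination methods, and judge whether your own research is good or bad.
- How to come up with good ideas for research papers?
Modular learning, cross-disciplinary combination, and positioning for foreseeable trends.
- What is the most essential ability when you are young?
Core points: reaching heights you have never reached before: do the basic things to the extreme, focus, persist in doing one thing for a long time, delay gratification, know yourself + understand the environment -> position yourself accurately.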
Things to note
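- Domain point-line-plane system: point: the field you focus on; line: upstream and downstream; plane: the broader domain. Do not over-focus on the field you work in; keep a global view, especially of your upstream and downstream.
- Daily learning point-line-plane system: point: security data analysis, the field I focus on; line: security / data analysis; plane: overall security content / industry development / career planning. Every day, study the narrow field intensively for at least one hour; read closely for at least half an hour, or at least one high-quality article on security / data analysis / industry development / career planning; skim a large number of new and existing articles. Stay sensitive in learning and thinking.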
appendix
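Excellent foreign technology sites
- https://resources.distilnetworks.com
Site overview: focuses on combating and mitigating bot traffic.
- http://www.covert.io
Tech stack: Jason Trost, focusing on security research, big data, cloud computing, and machine learning, that is, security data science.
- http://cyberdatascientist.com
Site overview: focuses on security data science; provides learning materials on network security, statistics, and AI, plus 14 security datasets including spam, malicious websites, malware, and botnets. Less comprehensive than the materials provided by secrepo.com.
- https://towardsdatascience.com
Site overview: focuses on data science.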
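Domestic outstanding technical personnel
- michael282694
Tech stack: data analysis and mining product development, web crawlers, Java, Python.
- LittleHann
Tech stack: I do not know how to describe it; Master Han knows too much: C++, Java, Python, PHP, web security, system security, though at the moment he seems to be doing more algorithm work.
- FeeiCN
Tech stack: focuses on automated vulnerability discovery and intrusion detection and defense.
- xiaojunjie
Tech stack: focuses on code audit and CTF.
- 云雷
Tech stack: Alibaba Cloud storage technology expert, focusing on log analysis and business, using log computing to drive business growth.
- iami
Tech stack: mainly researches web security and machine learning; likes Python and Go. I have been quietly learning from his blog all along.
- cdxy
Tech stack: earlier mainly web security, CTF, and code audit; now mainly security research and data analysis. By a rough estimate he is 1 to 2 years ahead of me technically; master, please stop learning.
- csuldw
Tech stack: focuses on machine learning, data mining, and artificial intelligence.
- molunerfinn
Tech stack: focuses on front end; a BUPT expert, in the same year as 404notfound.
- 刘建平Pinard
Tech stack: machine learning, deep learning, reinforcement learning, natural language processing, mathematical statistics, and big data mining; the related tutorials are excellent.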
abandoned
- Efficient and Flexible Discovery of PHP Vulnerability (Chinese translation)
- Efficient and Flexible Discovery of PHP Application Vulnerabilities (original)
- The Code Analysis Platform "Octopus"
- A Code Intelligence System:The Octopus Platform