What it can do
It visualizes WeChat chat records.
The following pictures can be automatically generated:
How to do it
We use the software Liuhen (MemoTrace), which has 22.8k stars on GitHub and has been iterated to version 1.1.1. It is very mature and trustworthy.
GitHub address: LC044/WeChatMsg — extract WeChat chat records, export them to HTML, Word, and CSV documents for permanent storage, analyze them, and generate annual chat reports (github.com)
Software website: https://memotrace.lc044.love/
Just download the exe and install it.
I believe most people's chat records are on their phones, and the records on their computers are incomplete, so first synchronize the phone's chat history to the computer. You may have done this when changing phones: WeChat → Settings → Chats → Chat History Migration and Backup → Migrate. It takes a few minutes, depending on the size of your chat history.
Then decrypt: in MemoTrace, enter and obtain your personal account information, then start the decryption.
Then you can export the chat history with a friend. To reduce garbled characters, do not check pictures, videos, or emoticons; the export will not include pictures/videos/files!
When the export is complete, exit MemoTrace. A data directory will appear alongside the software; inside it there is a CSV file under data/聊天记录/. It looks roughly like this:
Copy this CSV file to the input_data/ directory of WechatVisualization.
Note: while using MemoTrace you may find that it also integrates an analysis + visualization feature that exports annual reports. But look closely at the report it produces and you will find it too rough: the words in its word cloud are messy because no data cleaning is done, which is exactly why I wanted to develop my own. Of course, if you feel MemoTrace's report is already good enough, you don't need to read any further.
Users need basic Python knowledge (how to run code) and a computer with Anaconda or Python (version >= 3.7) installed. If you use Anaconda, it is best to create a new environment.
Install the necessary third-party libraries in sequence:
Third-party library | Function |
---|---|
pandas | Table processing |
matplotlib | Bar charts |
pyyaml | Reading the configuration file |
jieba | Chinese word segmentation |
tqdm | Progress bars |
pyecharts | Word clouds |
One-click installation method:
pip install -r requirements.txt
The installation method is not the focus of this article; basically, it is pip install. If you encounter problems, please search for solutions online; I won't go into details here.
The configuration file is config.yml, which can be opened with Notepad, though a code editor is better since it gives you syntax highlighting. The settings you can adjust are:
    # Input data
    # The following files all go in the input_data directory
    # Chat records
    msg_file: msg.csv
    # Chinese-English mapping table for WeChat emoticons
    emoji_file: emoji.txt
    # Stop word list: usually words with no real meaning; put any word you don't want analyzed here
    stopword_file: stopwords_hit_modified.txt
    # Word transform table, used to merge words with similar meanings,
    # e.g. convert "看到", "看见", "看看" all into "看"
    transform_file: transformDict.txt
    # User-defined dictionary, for adding words that most dictionaries lack
    # but that you feel should not be split, e.g. i人, e人, 腾讯会议
    user_dict_file: userDict.txt
    # Names
    # name1 is your own name
    name1: person 1
    # name2 is the other party's name
    name2: person 2
    # name_both is the shared label for both
    name_both: both
    # Local parameters
    # top_k is how many of the top words to plot
    # Words or emoticons whose frequency is below word_min_count or emoji_min_count are not analyzed
    # figsize is the figure window size: first width, then height
    word_specificity:
      top_k: 25
      word_min_count: 2
      figsize:
        - 10
        - 12
    emoji_specificity:
      emoji_min_count: 1
      top_k: 5
      figsize:
        - 10
        - 12
    word_commonality:
      top_k: 25
      figsize:
        - 10
        - 12
    emoji_commonality:
      top_k: 5
      figsize:
        - 12
        - 12
    time_analysis:
      figsize:
        - 12
        - 8
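For reference, here is a minimal sketch of how a config like this can be read with pyyaml. The key names follow the file above; the loading code itself is an illustration, not necessarily how the project's main.py does it:

```python
import yaml  # provided by the pyyaml package

with open("config.yml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Nested keys become plain dicts and lists
msg_file = config["msg_file"]                           # "msg.csv"
top_k = config["word_specificity"]["top_k"]             # 25
width, height = config["word_specificity"]["figsize"]   # 10, 12
```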
You can run main.py directly in your code editor, or run python main.py on the command line (activate the previously created environment first).
A successful run should display the following information:
The generated images can be found in the figs folder of the current directory.
Check the generated images. You may find that some words are not what you want, or that words you want have been split apart. In that case, go to the input_data/ directory and modify the files there. This is a continuously iterative process, i.e. data cleaning, and it is fairly time-consuming. But there is no other way: if you want high-quality results, be patient and clean the data carefully.
- emoji.txt is a Chinese-English mapping of WeChat emoticons. WeChat emoticons appear in chat records in the form [捂脸] or [Facepalm]. My chat history contained [xxx] tags in both Chinese and English, so I made a mapping table and replaced all English tags with Chinese. If you find some emoticons still in English, add Chinese entries for them so they can be merged.
- stopwords_hit_modified.txt is a stop word list. Words that (I think) have no real meaning, such as "now", "going on", and "as if", should not be counted and are eliminated directly. If words you don't want to see appear in the results, add them here.
- transformDict.txt converts some words into other words. Near-synonyms such as "看到", "看见", and "看看" would otherwise be counted separately, which is unnecessary; we can merge them into the single word "看". To do this, fill in the original word and the converted word in two columns, separated by a tab character.
- userDict.txt adds words that are not in conventional dictionaries, such as "e人", "i人", and "腾讯会议". If you don't add them yourself, they may be split into pieces such as "e", "i", "人", "腾讯", and "会议", which is not what we want to see.
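As an illustration of how these two files plug into segmentation, here is a hedged sketch. jieba.load_userdict is jieba's real API; the transform-table handling is just one plausible way to apply a tab-separated two-column file, and the actual parse.py may differ:

```python
import jieba

# Keep user-defined words (e人, 腾讯会议, ...) whole during segmentation
jieba.load_userdict("input_data/userDict.txt")

# Transform table: two tab-separated columns, original word -> target word
transform = {}
with open("input_data/transformDict.txt", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            src, dst = line.rstrip("\n").split("\t")
            transform[src] = dst

# Segment a message, then merge near-synonyms via the table
words = [transform.get(w, w) for w in jieba.cut("我看到了一个i人")]
```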
ValueError: shape mismatch: objects cannot be broadcast to a single shape
ValueError: The number of FixedLocator locations (5), usually from a call to set_ticks, does not match the number of ticklabels (1).
Possible reasons: when either of the above errors occurs, it may be because top_k or min_count at the corresponding position is set too large while the chat record is too small, so there are not enough words to plot.
Solution: with this in mind, the program prints the maximum value each parameter is allowed to take as each small section runs. If double horizontal lines are printed, that section's parameters are set correctly and it ran successfully. Check whether the parameter at the corresponding position is set too large, and reduce it appropriately.
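The mismatch itself is easy to reproduce and to guard against. A minimal sketch (the variable names here are hypothetical, not from the project's source):

```python
import matplotlib.pyplot as plt

word_counts = [("看", 40), ("好", 12)]  # suppose only 2 words survive cleaning
top_k = 25                              # configured value is too large

# Clamp top_k to the number of available words; asking for 25 bars/ticks
# with only 2 labels is what triggers the shape mismatch / FixedLocator errors
top_k = min(top_k, len(word_counts))

labels = [w for w, _ in word_counts[:top_k]]
values = [c for _, c in word_counts[:top_k]]
plt.bar(range(len(labels)), values, tick_label=labels)
plt.show()
```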
Here is what each script does:
- parse.py reads the files in input_data/ and performs word segmentation. It generates keywords.csv in temp_files/, which adds two columns to the original data: one with the segmented words, the other with the extracted WeChat emoticons.
- word_cloud.py counts word frequencies, saves the pickle file keyword_count.pkl to temp_files/, and also draws a word cloud into figs/ (see the sketch after this list).
- The emoticon analysis similarly saves emoji_count.pkl to temp_files/ and computes emoticon specificity; its charts go into figs/.
- The remaining steps (word specificity, word/emoji commonality, time analysis) each save their charts into figs/ as well.
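For a flavor of the intermediate artifacts, here is a hedged sketch of the counting-and-pickling step. The file names follow the list above, but the column name "keywords" and the space-separated token format are assumptions for this sketch, not the project's actual word_cloud.py:

```python
import pickle
from collections import Counter

import pandas as pd

# keywords.csv is produced by parse.py; assume the segmented words sit in
# a "keywords" column as space-separated tokens (assumption for this sketch)
df = pd.read_csv("temp_files/keywords.csv")
counts = Counter(w for row in df["keywords"].dropna() for w in str(row).split())

# Persist the counts for the later plotting steps
with open("temp_files/keyword_count.pkl", "wb") as f:
    pickle.dump(counts, f)
```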
Remember that certain word only you kept sending?
Specificity (exclusiveness) means that you often say a word but the other person rarely does (and vice versa). My reasoning about specificity goes like this. Suppose there are three words A, B, and C.
Word | Own frequency x | Other party's frequency y |
---|---|---|
A | 4 | 0 |
B | 100 | 96 |
C | 1 | 0 |
For myself, word A's specificity should obviously be the highest. As for word B, although the two people differ by 4 occurrences, the base count is large, so a difference of 4 is no real contrast. For word C the base count is too small: calling C my own exclusive word is not very reliable.
Let the specificity measure be $$\alpha_i=\dfrac{x_i-y_i}{x_i+y_i}$$ By itself this is not enough: word C (1, 0) scores a full 1, the same as word A (4, 0), even though its base count is tiny.
What if we multiply by the base, i.e. the total number of occurrences $x_i+y_i$? Then $\alpha_i$ reduces to $x_i-y_i$, and word B (100, 96) would score 4, tying with word A, which is also wrong.
So in my implementation, instead of multiplying by the sum of the term frequencies, I multiply by their maximum: $$\alpha_i=\dfrac{x_i-y_i}{x_i+y_i}\cdot\max(x_i,y_i)$$ This ensures that word A's specificity comes out highest.
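Plugging the three example words in confirms the ranking (a quick numerical check, not project code):

```python
def specificity(x: int, y: int) -> float:
    """alpha = (x - y) / (x + y) * max(x, y)"""
    return (x - y) / (x + y) * max(x, y)

for word, x, y in [("A", 4, 0), ("B", 100, 96), ("C", 1, 0)]:
    print(word, round(specificity(x, y), 2))
# A 4.0  <- highest, as desired
# B 2.04
# C 1.0
```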
Commonality indicates that both people often say a word. So first eliminate words that one party has never said, by taking the intersection of the words spoken by both parties. Again, suppose there are three words A, B, and C.
Word | Own frequency x | Other party's frequency y |
---|---|---|
A | 50 | 50 |
B | 1000 | 1 |
C | 1 | 1 |
I have said word B far more often than the other person has, so its commonality is obviously very low. Word C has been said about equally often by both parties, but the base count is too small for a reliable conclusion. Therefore word A has the highest commonality. How do we calculate it?
Commonality is the opposite of specificity, so could we simply take the reciprocal of the specificity? That feels wrong, partly because the denominator $x_i-y_i$ can be zero (it is zero for words A and C above).
Instead I use the harmonic mean: $$\beta=\dfrac{2}{1/x+1/y}$$ Why the harmonic mean rather than another mean? Because the harmonic mean is the smallest of the four classical means (harmonic ≤ geometric ≤ arithmetic ≤ quadratic), and "commonality" demands that both people say the word often; one side cannot carry it alone. No matter how much one party says a word, if the other rarely says it, the commonality stays small, as with word B (1000, 1).
Using the harmonic mean thus guarantees that word A has the greatest commonality.
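Again, a quick numerical check (illustrative only):

```python
def commonality(x: int, y: int) -> float:
    """beta = harmonic mean of x and y"""
    return 2 / (1 / x + 1 / y)

for word, x, y in [("A", 50, 50), ("B", 1000, 1), ("C", 1, 1)]:
    print(word, round(commonality(x, y), 2))
# A 50.0  <- highest, as desired
# B 2.0
# C 1.0
```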
Without further ado, please look at the source code.
This project does not integrate MemoTrace's functionality. If data extraction from MemoTrace were built in, the whole workflow would be simpler; due to the author's limited ability and time, that idea cannot be implemented for now. For other shortcomings that could be improved, you are welcome to leave messages on GitHub or in the official account's backend. Like-minded people are welcome to join the developer team, and let's build a better WechatVisualization together!