What it can do
It visualizes WeChat chat records.
The following pictures can be automatically generated:
How to do it
We use the software Liuhen (MemoTrace), which has 22.8k stars on GitHub and has been iterated to version 1.1.1. It is very mature and trustworthy.
GitHub address: LC044/WeChatMsg — extract WeChat chat records, export them to HTML, Word, and CSV documents for permanent storage, analyze them, and generate annual chat reports (github.com)
Software website: https://memotrace.lc044.love/
Just download the exe and install it.
I believe most people's chat records are on their phones, and the records on their computers are incomplete, so first synchronize the phone's chat history to the computer. You may have done this when changing phones: WeChat → Settings → Chats → Chat History Migration and Backup → Migrate. It takes a few minutes, depending on the size of your chat history.
Then decrypt: in MemoTrace, enter and obtain your personal account information, then start the decryption.
Then you can export the chat history with a friend. To reduce garbled characters, do not check pictures, videos, or emoticons; the export will not include pictures/videos/files!
When the export is complete, exit MemoTrace. A data directory will appear alongside the software; inside it there is a CSV file under data/聊天记录/. It looks roughly like this:
Copy this CSV file to the input_data/ directory of WechatVisualization.
Note: while using MemoTrace you may find that it also integrates an analysis + visualization feature that exports annual reports. But look closely at the report it produces and you will find it too rough: the words in its word cloud are messy because no data cleaning is done, which is exactly why I wanted to develop my own. Of course, if you feel MemoTrace's report is already good enough, you don't need to read any further.
Users need basic Python knowledge (how to run code) and a computer with Anaconda or Python (version >= 3.7) installed. If you use Anaconda, it is best to create a new environment.
Install the necessary third-party libraries in sequence:
Third-party library | Function |
---|---|
pandas | Table processing |
matplotlib | Bar charts |
pyyaml | Reading the configuration file |
jieba | Chinese word segmentation |
tqdm | Progress bars |
pyecharts | Word clouds |
One-click installation method:
pip install -r requirements.txt
The installation method is not the focus of this article; basically, it is pip install. If you encounter problems, please search for solutions online; I won't go into details here.
The configuration file is config.yml, which can be opened with Notepad, though a code editor is better since it gives you syntax highlighting. The settings you can adjust are:
    # Input data
    # The following files all go in the input_data directory
    # Chat records
    msg_file: msg.csv
    # Chinese-English mapping table for WeChat emoticons
    emoji_file: emoji.txt
    # Stop word list: usually words with no real meaning; put any word you don't want analyzed here
    stopword_file: stopwords_hit_modified.txt
    # Word transform table, used to merge words with similar meanings,
    # e.g. convert "看到", "看见", "看看" all into "看"
    transform_file: transformDict.txt
    # User-defined dictionary, for adding words that most dictionaries lack
    # but that you feel should not be split, e.g. i人, e人, 腾讯会议
    user_dict_file: userDict.txt
    # Names
    # name1 is your own name
    name1: person 1
    # name2 is the other party's name
    name2: person 2
    # name_both is the shared label for both
    name_both: both
    # Local parameters
    # top_k is how many of the top words to plot
    # Words or emoticons whose frequency is below word_min_count or emoji_min_count are not analyzed
    # figsize is the figure window size: first width, then height
    word_specificity:
      top_k: 25
      word_min_count: 2
      figsize:
        - 10
        - 12
    emoji_specificity:
      emoji_min_count: 1
      top_k: 5
      figsize:
        - 10
        - 12
    word_commonality:
      top_k: 25
      figsize:
        - 10
        - 12
    emoji_commonality:
      top_k: 5
      figsize:
        - 12
        - 12
    time_analysis:
      figsize:
        - 12
        - 8
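For reference, here is a minimal sketch of how a config like this can be read with pyyaml. The key names follow the file above; the loading code itself is an illustration, not necessarily how the project's main.py does it:

```python
import yaml  # provided by the pyyaml package

with open("config.yml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Nested keys become plain dicts and lists
msg_file = config["msg_file"]                           # "msg.csv"
top_k = config["word_specificity"]["top_k"]             # 25
width, height = config["word_specificity"]["figsize"]   # 10, 12
```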
You can run main.py directly in your code editor, or run python main.py on the command line (activate the previously created environment first).
A successful run should display the following information:
The generated images can be found in the figs folder of the current directory.
Check the generated images. You may find that some words are not what you want, or that words you want have been split apart. In that case, go to the input_data/ directory and modify the files there. This is a continuously iterative process, i.e. data cleaning, and it is fairly time-consuming. But there is no other way: if you want high-quality results, be patient and clean the data carefully.
- emoji.txt is a Chinese-English mapping of WeChat emoticons. WeChat emoticons appear in chat records in the form [捂脸] or [Facepalm]. My chat history contained [xxx] tags in both Chinese and English, so I made a mapping table and replaced all English tags with Chinese. If you find some emoticons still in English, add Chinese entries for them so they can be merged.
- stopwords_hit_modified.txt is a stop word list. Words that (I think) have no real meaning, such as "now", "going on", and "as if", should not be counted and are eliminated directly. If words you don't want to see appear in the results, add them here.
- transformDict.txt converts some words into other words. Near-synonyms such as "看到", "看见", and "看看" would otherwise be counted separately, which is unnecessary; we can merge them into the single word "看". To do this, fill in the original word and the converted word in two columns, separated by a tab character.
- userDict.txt adds words that are not in conventional dictionaries, such as "e人", "i人", and "腾讯会议". If you don't add them yourself, they may be split into pieces such as "e", "i", "人", "腾讯", and "会议", which is not what we want to see.
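As an illustration of how these two files plug into segmentation, here is a hedged sketch. jieba.load_userdict is jieba's real API; the transform-table handling is just one plausible way to apply a tab-separated two-column file, and the actual parse.py may differ:

```python
import jieba

# Keep user-defined words (e人, 腾讯会议, ...) whole during segmentation
jieba.load_userdict("input_data/userDict.txt")

# Transform table: two tab-separated columns, original word -> target word
transform = {}
with open("input_data/transformDict.txt", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            src, dst = line.rstrip("\n").split("\t")
            transform[src] = dst

# Segment a message, then merge near-synonyms via the table
words = [transform.get(w, w) for w in jieba.cut("我看到了一个i人")]
```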
ValueError: shape mismatch: objects cannot be broadcast to a single shape
ValueError: The number of FixedLocator locations (5), usually from a call to set_ticks, does not match the number of ticklabels (1).
Possible reasons: when either of the above errors occurs, it may be because top_k or min_count at the corresponding position is set too large while the chat record is too small, so there are not enough words to plot.
Solution: with this in mind, the program prints the maximum value each parameter is allowed to take as each small section runs. If double horizontal lines are printed, that section's parameters are set correctly and it ran successfully. Check whether the parameter at the corresponding position is set too large, and reduce it appropriately.
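The mismatch itself is easy to reproduce and to guard against. A minimal sketch (the variable names here are hypothetical, not from the project's source):

```python
import matplotlib.pyplot as plt

word_counts = [("看", 40), ("好", 12)]  # suppose only 2 words survive cleaning
top_k = 25                              # configured value is too large

# Clamp top_k to the number of available words; asking for 25 bars/ticks
# with only 2 labels is what triggers the shape mismatch / FixedLocator errors
top_k = min(top_k, len(word_counts))

labels = [w for w, _ in word_counts[:top_k]]
values = [c for _, c in word_counts[:top_k]]
plt.bar(range(len(labels)), values, tick_label=labels)
plt.show()
```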
Here is what each script does:
- parse.py reads the files in input_data/ and performs word segmentation. It generates keywords.csv in temp_files/, which adds two columns to the original data: one with the segmented words, the other with the extracted WeChat emoticons.
- word_cloud.py counts word frequencies, saves the pickle file keyword_count.pkl to temp_files/, and also draws a word cloud into figs/ (see the sketch after this list).
- The emoticon analysis similarly saves emoji_count.pkl to temp_files/ and computes emoticon specificity; its charts go into figs/.
- The remaining steps (word specificity, word/emoji commonality, time analysis) each save their charts into figs/ as well.
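For a flavor of the intermediate artifacts, here is a hedged sketch of the counting-and-pickling step. The file names follow the list above, but the column name "keywords" and the space-separated token format are assumptions for this sketch, not the project's actual word_cloud.py:

```python
import pickle
from collections import Counter

import pandas as pd

# keywords.csv is produced by parse.py; assume the segmented words sit in
# a "keywords" column as space-separated tokens (assumption for this sketch)
df = pd.read_csv("temp_files/keywords.csv")
counts = Counter(w for row in df["keywords"].dropna() for w in str(row).split())

# Persist the counts for the later plotting steps
with open("temp_files/keyword_count.pkl", "wb") as f:
    pickle.dump(counts, f)
```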
Remember that certain word only you kept sending?
Specificity (exclusiveness) means that you often say a word but the other person rarely does (and vice versa). My reasoning about specificity goes like this. Suppose there are three words A, B, and C.
Word | Own frequency x | Other party's frequency y |
---|---|---|
A | 4 | 0 |
B | 100 | 96 |
C | 1 | 0 |
For myself, word A's specificity should obviously be the highest. As for word B, although the two people differ by 4 occurrences, the base count is large, so a difference of 4 is no real contrast. For word C the base count is too small: calling C my own exclusive word is not very reliable.
Let the specificity measure be $$\alpha_i=\dfrac{x_i-y_i}{x_i+y_i}$$ By itself this is not enough: word C (1, 0) scores a full 1, the same as word A (4, 0), even though its base count is tiny.
What if we multiply by the base, i.e. the total number of occurrences $x_i+y_i$? Then $\alpha_i$ reduces to $x_i-y_i$, and word B (100, 96) would score 4, tying with word A, which is also wrong.
So in my implementation, instead of multiplying by the sum of the term frequencies, I multiply by their maximum: $$\alpha_i=\dfrac{x_i-y_i}{x_i+y_i}\cdot\max(x_i,y_i)$$ This ensures that word A's specificity comes out highest.
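Plugging the three example words in confirms the ranking (a quick numerical check, not project code):

```python
def specificity(x: int, y: int) -> float:
    """alpha = (x - y) / (x + y) * max(x, y)"""
    return (x - y) / (x + y) * max(x, y)

for word, x, y in [("A", 4, 0), ("B", 100, 96), ("C", 1, 0)]:
    print(word, round(specificity(x, y), 2))
# A 4.0  <- highest, as desired
# B 2.04
# C 1.0
```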
Commonality indicates that both people often say a word. So first eliminate words that one party has never said, by taking the intersection of the words spoken by both parties. Again, suppose there are three words A, B, and C.
Word | Own frequency x | Other party's frequency y |
---|---|---|
A | 50 | 50 |
B | 1000 | 1 |
C | 1 | 1 |
I have said word B far more often than the other person has, so its commonality is obviously very low. Word C has been said about equally often by both parties, but the base count is too small for a reliable conclusion. Therefore word A has the highest commonality. How do we calculate it?
Commonality is the opposite of specificity, so could we simply take the reciprocal of the specificity? That feels wrong, partly because the denominator $x_i-y_i$ can be zero (it is zero for words A and C above).
Instead I use the harmonic mean: $$\beta=\dfrac{2}{1/x+1/y}$$ Why the harmonic mean rather than another mean? Because the harmonic mean is the smallest of the four classical means (harmonic ≤ geometric ≤ arithmetic ≤ quadratic), and "commonality" demands that both people say the word often; one side cannot carry it alone. No matter how much one party says a word, if the other rarely says it, the commonality stays small, as with word B (1000, 1).
Using the harmonic mean thus guarantees that word A has the greatest commonality.
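Again, a quick numerical check (illustrative only):

```python
def commonality(x: int, y: int) -> float:
    """beta = harmonic mean of x and y"""
    return 2 / (1 / x + 1 / y)

for word, x, y in [("A", 50, 50), ("B", 1000, 1), ("C", 1, 1)]:
    print(word, round(commonality(x, y), 2))
# A 50.0  <- highest, as desired
# B 2.0
# C 1.0
```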
Without further ado, please look at the source code.
This project does not integrate MemoTrace's functionality. If data extraction from MemoTrace were built in, the whole workflow would be simpler; due to the author's limited ability and time, that idea cannot be implemented for now. For other shortcomings that could be improved, you are welcome to leave messages on GitHub or in the official account's backend. Like-minded people are welcome to join the developer team, and let's build a better WechatVisualization together!