Basic Analysis of Anime Data

Setup

Import

Variables

Basic analysis of `an_ae.csv`

Overview

# Show the first five rows of df_ae, transposed
df_ae.head().T

	0	1	2	3	4
aeid	M19760	M19761	M19762	M19763	M19764
aename	アトム誕生の巻＊	フランケンの巻＊	火星探険の巻＊	ゲルニカの巻＊	スフィンクスの巻＊
date	1963-01-01	1963-01-08	1963-01-15	1963-01-22	1963-01-29
aeno	第1話	第2話	第3話	第4話	第5話
acid	C7163	C7163	C7163	C7163	C7163
acname	鉄腕アトム	鉄腕アトム	鉄腕アトム	鉄腕アトム	鉄腕アトム
asid	C979	C979	C979	C979	C979

# Summarize the missing values (NaN) in df_ae
# isna() marks each missing cell as True
# agg() then computes, per column, the count ("sum") and the share ("mean") of missing values
df_ae.isna().agg(["sum", "mean"])

	aeid	aename	date	aeno	acid	acname	asid
sum	0.0	2676.000000	0.0	606.000000	0.0	0.0	402.00000
mean	0.0	0.024099	0.0	0.005457	0.0	0.0	0.00362
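As a minimal sketch of the missing-value summary above, the following uses a small made-up frame (the values are illustrative and not taken from `an_ae.csv`). Because `isna()` returns booleans, `sum` counts the missing cells in each column and `mean` gives their share of all rows.

```python
import pandas as pd

# Toy data for illustration only; not taken from an_ae.csv
df_toy = pd.DataFrame(
    {
        "aeid": ["M1", "M2", "M3", "M4"],
        "aename": ["a", None, "c", None],
    }
)

# isna() marks missing cells as True; summing counts them and
# averaging gives the fraction of missing rows per column
na_summary = df_toy.isna().agg(["sum", "mean"])
print(na_summary)
```

Passing the aggregation names as strings avoids handing the Python built-in `sum` to pandas, which newer pandas versions warn about.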

# Get descriptive statistics for df_ae
# For these non-numeric columns, describe() reports count, unique, top, and freq
df_ae.describe()

	aeid	aename	date	aeno	acid	acname	asid
count	111041	108365	111041	110435	111041	111041	110639
unique	111041	105955	10360	4844	3637	3631	2527
top	M19760	[総集編]	2015-10-04	1	C8849	クレヨンしんちゃん	C1462
freq	1	68	66	3207	1926	1926	2084

# Count the unique values in each column of df_ae
# nunique() returns the number of distinct values per column
# reset_index(name="nunique") turns the result into a data frame with a `nunique` column
df_ae.nunique().reset_index(name="nunique")

	index	nunique
0	aeid	111041
1	aename	105955
2	date	10360
3	aeno	4844
4	acid	3637
5	acname	3631
6	asid	2527

Deep dive into the `date` column

# Count the unique values of each column per `year`
# Group by `year` with groupby(),
# count the distinct values of each selected column with nunique(),
# then call reset_index() so the result is returned as a data frame
df_ae.groupby("year")[["month", "asid", "acid", "aeid"]].nunique().reset_index()

	year	month	asid	acid	aeid
0	1963	12	1	2	59
1	1964	12	1	2	99
2	1965	12	1	2	76
3	1966	12	1	1	45
4	1971	3	1	1	10
5	1972	3	1	1	13
6	1974	12	2	2	65
7	1975	3	1	1	13
8	1979	9	1	1	39
9	1980	1	1	1	4
10	1990	10	17	17	350
11	1991	12	50	52	1749
12	1992	12	63	64	1944
13	1993	12	53	54	1741
14	1994	12	57	59	1919
15	1995	12	62	64	1838
16	1996	12	63	67	2000
17	1997	12	73	76	1899
18	1998	12	104	108	2699
19	1999	12	137	148	2948
20	2000	12	130	144	3401
21	2001	12	161	173	3850
22	2002	12	171	187	4000
23	2003	12	201	215	4501
24	2004	12	217	241	4775
25	2005	12	222	245	5080
26	2006	12	287	315	6222
27	2007	12	284	306	6015
28	2008	12	268	302	5978
29	2009	12	248	267	5276
30	2010	12	205	236	4589
31	2011	12	193	221	4345
32	2012	12	223	254	5319
33	2013	12	260	291	5564
34	2014	12	296	332	6730
35	2015	12	262	301	6199
36	2016	12	309	346	7044
37	2017	10	164	166	2643

# Count the unique values of each column per `weekday`
# Group by `weekday` with groupby(),
# count the distinct values of each selected column with nunique(),
# then call reset_index() so the result is returned as a data frame
df_ae.groupby("weekday")[["asid", "acid", "aeid"]].nunique().reset_index()

	weekday	asid	acid	aeid
0	0	575	734	14554
1	1	583	773	13572
2	2	578	731	13181
3	3	572	738	13060
4	4	723	903	17066
5	5	656	834	19837
6	6	611	777	19771
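The `year`, `month`, and `weekday` columns used above do not appear in the `df_ae.head()` output, so they were presumably derived from `date` in the setup code that is not shown on this page. The sketch below is one plausible way to do that, on made-up rows; the derivation itself is an assumption. If `dt.weekday` was used, then 0 is Monday and 6 is Sunday.

```python
import pandas as pd

# Toy data for illustration only; the real setup code is not shown on this page
df_toy = pd.DataFrame(
    {
        "aeid": ["M1", "M2", "M3"],
        "date": ["1963-01-01", "1963-01-08", "2015-10-04"],
    }
)

# Parse the date strings, then derive the calendar columns
df_toy["date"] = pd.to_datetime(df_toy["date"])
df_toy["year"] = df_toy["date"].dt.year
df_toy["month"] = df_toy["date"].dt.month
# dt.weekday counts from Monday=0 to Sunday=6
df_toy["weekday"] = df_toy["date"].dt.weekday

print(df_toy.groupby("weekday")["aeid"].nunique().reset_index())
```

Under that assumption, the higher counts for weekdays 5 and 6 in the table above would correspond to Saturday and Sunday.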

Deep dive into the `acid` and `acname` columns

# Count the unique `acname` values per `acid` and summarize the counts
df_ae.groupby("acid")["acname"].nunique().describe().reset_index()

	index	acname
0	count	3637.0
1	mean	1.0
2	std	0.0
3	min	1.0
4	25%	1.0
5	50%	1.0
6	75%	1.0
7	max	1.0

# Count the unique `acid` values per `acname` and summarize the counts
df_ae.groupby("acname")["acid"].nunique().describe().reset_index()

	index	acid
0	count	3631.000000
1	mean	1.001652
2	std	0.040622
3	min	1.000000
4	25%	1.000000
5	50%	1.000000
6	75%	1.000000
7	max	2.000000

# Count the unique `acid` values per `acname`
df_tmp = df_ae.groupby("acname")["acid"].nunique().reset_index()

# Extract the `acname` values that map to more than one `acid`
df_tmp[df_tmp["acid"] > 1]

	acname	acid
9	100％パスカル先生	2
82	BUZZER BEATER	2
202	Fate/Zero	2
739	うっかりペネロペ	2
975	じゃがいぬくん	2
3588	魔法陣グルグル	2
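The check above follows a general pattern for testing whether a name maps to exactly one ID. Here is a minimal sketch on made-up data (the IDs, names, and the `n_acid` column name are hypothetical):

```python
import pandas as pd

# Toy data for illustration only; IDs and names are made up
df_toy = pd.DataFrame(
    {
        "acid": ["C1", "C1", "C2", "C3"],
        "acname": ["Work A", "Work A", "Work A", "Work B"],
    }
)

# Count distinct IDs per name; a count above 1 flags a name shared by several IDs
n_ids = df_toy.groupby("acname")["acid"].nunique().reset_index(name="n_acid")
shared = n_ids[n_ids["n_acid"] > 1]
print(shared)
```

Running the same check in both directions (`acid` per `acname` and `acname` per `acid`) shows which side of the mapping is not unique.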

# Extract the rows of `df_ae` whose `acname` is `100％ パスカル先生`
# Select a few columns for readability
df_ae[df_ae["acname"] == "100％ パスカル先生"][["aename", "acid", "acname", "asid"]]

	aename	acid	acname	asid
107203	[第1話]	C16019	100％パスカル先生	C6536
107926	[第2話]	C16019	100％パスカル先生	C6536
109088	1時間目その名もパスカル先生	C16134	100％パスカル先生	C6536
109089	2時間目完璧[パーフェクト]プレート	C16134	100％パスカル先生	C6536
109184	1時間目恐怖!100!身体検査	C16134	100％パスカル先生	C6536
...	...	...	...	...
110958	2時間目完璧[パーフェクト]プレート	C16134	100％パスカル先生	C6536
110959	3時間目明日[あした]使えるヒエログリフ講座	C16134	100％パスカル先生	C6536
111014	1時間目体育大魔王襲来!!	C16134	100％パスカル先生	C6536
111015	2時間目完璧[パーフェクト]プレート	C16134	100％パスカル先生	C6536
111016	3時間目アメリカ横断パスカリングリッシュ	C16134	100％パスカル先生	C6536

73 rows × 4 columns

# Count the unique episodes (`aeid`) per anime work (`acname`) in `df_ae`
# Then sort in descending order and show the top 10
df_ae.groupby("acname")["aeid"].nunique().sort_values(
    ascending=False
).reset_index().head(10)

	acname	aeid
0	クレヨンしんちゃん	1926
1	親子クラブ	1363
2	サザエさん	1175
3	ちびまる子ちゃん［新］	994
4	それいけ！アンパンマン	958
5	ONE PIECE	783
6	しましまとらのしまじろう	751
7	名探偵コナン	729
8	あたしンち	668
9	ドラえもん［新・第2期］	608

# Get summary statistics for the character count (`l_acname`) of anime work names (`acname`)
df_tmp["l_acname"].describe().reset_index()

	index	l_acname
0	count	3637.000000
1	mean	14.134177
2	std	9.486087
3	min	1.000000
4	25%	8.000000
5	50%	12.000000
6	75%	18.000000
7	max	167.000000
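The `df_tmp` used above has one row per `acid` with `year` and `l_acname` columns, but the cell that builds it does not appear on this page. The following is a hedged sketch of how such a table could be built, on made-up rows; taking the earliest `year` per work and measuring length with `str.len()` are assumptions, not the author's confirmed method.

```python
import pandas as pd

# Toy data; the real df_tmp construction is not shown, so this is a guess
df_toy = pd.DataFrame(
    {
        "acid": ["C1", "C1", "C2"],
        "acname": ["鉄腕アトム", "鉄腕アトム", "Fate/Zero"],
        "year": [1963, 1964, 2011],
    }
)

# One row per work: keep the name and the earliest broadcast year
df_works = (
    df_toy.groupby("acid")
    .agg({"acname": "first", "year": "min"})
    .reset_index()
)
# str.len() counts characters (code points), not bytes
df_works["l_acname"] = df_works["acname"].str.len()
print(df_works)
```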

# Get the 10 works with the longest names, by character count (`l_acname`) of `acname`
df_tmp.sort_values("l_acname", ascending=False).head(10)

	acid	acname	year	l_acname
2908	C16008	Occultic;Nine THERE ARE NO SUCH THING AS “OCCU...	2016	167
2218	C15007	キャプテン･アース WHEN I OPENED THE DOOR CALLED TRUTH,...	2014	104
2394	C15395	四月は君の嘘　I met the girl　 under full-bloomed cher...	2014	93
2153	C14933	LOVELY❤ムービーいとしのムーコ Lovely Muuuuuuuco! The hap...	2013	88
2205	C14994	アオハライド THE SCENT OF AIR AFTER THE RAIN… I HEAR...	2014	76
2331	C15278	ローリング☆ガールズ Rolling, Falling, Scrambling. For o...	2015	67
2843	C15917	アンジュ・ヴィエルジュ “Progress“:Girls facing destiny ag...	2016	66
2089	C14862	戦姫絶唱シンフォギアG In the distance,that day,when the...	2013	65
2449	C15451	艦隊これくしょん艦これ Fleet Girls Collection KanColle ...	2015	64
2629	C15667	コメットルシファーNOW ADVENTURE BEGINS-WITH YOU. FRIEND...	2015	63

# Summarize the character count (`l_acname`) of work names per first-broadcast year (`year`)
df_tmp.groupby("year")["l_acname"].describe()

	count	mean	std	min	25%	50%	75%	max
year
1963	2.0	5.000000	0.000000	5.0	5.00	5.0	5.00	5.0
1971	1.0	5.000000	NaN	5.0	5.00	5.0	5.00	5.0
1974	2.0	8.500000	2.121320	7.0	7.75	8.5	9.25	10.0
1979	1.0	8.000000	NaN	8.0	8.00	8.0	8.00	8.0
1990	17.0	10.705882	3.869184	5.0	8.00	10.0	14.00	20.0
1991	37.0	11.891892	4.903422	5.0	8.00	11.0	16.00	23.0
1992	37.0	11.162162	4.645867	6.0	8.00	10.0	13.00	24.0
1993	24.0	9.625000	3.449165	4.0	8.50	9.0	11.25	19.0
1994	33.0	10.515152	5.166728	5.0	7.00	9.0	11.00	26.0
1995	34.0	10.500000	3.925441	4.0	7.00	10.5	13.75	18.0
1996	40.0	10.325000	5.273968	4.0	7.75	10.0	12.00	35.0
1997	46.0	10.782609	4.816237	4.0	8.00	10.0	13.00	29.0
1998	78.0	12.346154	7.253887	3.0	8.00	10.0	15.75	45.0
1999	108.0	12.314815	6.727886	3.0	8.75	11.0	15.25	40.0
2000	84.0	13.440476	8.322348	3.0	8.00	11.0	16.00	43.0
2001	113.0	12.964602	7.569829	1.0	8.00	11.0	16.00	43.0
2002	116.0	12.224138	6.570138	4.0	8.00	11.0	14.00	37.0
2003	134.0	13.000000	7.188127	3.0	8.00	12.0	15.75	44.0
2004	157.0	14.019108	9.209437	2.0	8.00	12.0	19.00	61.0
2005	151.0	13.980132	8.119499	2.0	8.00	12.0	18.00	38.0
2006	212.0	14.448113	8.705946	2.0	8.00	13.0	18.00	57.0
2007	203.0	13.197044	7.761328	3.0	8.00	11.0	17.00	43.0
2008	200.0	13.230000	7.458906	3.0	8.00	11.0	18.00	37.0
2009	176.0	13.323864	7.556469	3.0	7.75	12.0	17.00	43.0
2010	151.0	14.735099	8.858670	2.0	8.50	13.0	19.00	53.0
2011	164.0	14.054878	9.379035	2.0	8.00	11.0	18.00	54.0
2012	176.0	14.948864	10.790085	1.0	8.00	12.0	17.25	62.0
2013	205.0	15.834146	11.554721	3.0	9.00	13.0	19.00	88.0
2014	257.0	15.731518	12.200870	3.0	9.00	13.0	18.00	104.0
2015	245.0	17.044898	11.041001	3.0	10.00	14.0	22.00	67.0
2016	269.0	17.442379	14.287605	2.0	9.00	14.0	23.00	167.0
2017	164.0	12.664634	6.920475	2.0	7.00	11.0	16.00	36.0

Deep dive into the `asid` column

# Count the unique `acid` values per `asid` and summarize the counts
df_ae.groupby("asid")["acid"].nunique().describe().reset_index()

	index	acid
0	count	2527.000000
1	mean	1.436486
2	std	1.259258
3	min	1.000000
4	25%	1.000000
5	50%	1.000000
6	75%	1.000000
7	max	26.000000

# Count the unique `asid` values per `acid` and summarize the counts
# `asid` has missing values, so rows without it are excluded before aggregating
df_ae[~df_ae["asid"].isna()].groupby("acid")["asid"].nunique().describe().reset_index()

	index	asid
0	count	3630.0
1	mean	1.0
2	std	0.0
3	min	1.0
4	25%	1.0
5	50%	1.0
6	75%	1.0
7	max	1.0

# Sort df_ae by 'date' in ascending order, then group by 'asid'
# For each group, count the unique 'acid' and 'aeid' values and take the first 'acname'
# reset_index() resets the index; the result is stored as the new data frame df_tmp
df_tmp = (
    df_ae.sort_values("date")
    .groupby("asid")
    .agg({"acid": "nunique", "aeid": "nunique", "acname": "first"})
    .reset_index()
)

# Sort df_tmp by 'acid' in descending order and show the top 10 rows
df_tmp.sort_values("acid", ascending=False).head(10)

	asid	acid	aeid	acname
98	C1462	26	2084	忍たま乱太郎［第1期］
310	C2158	20	1653	おじゃる丸
2502	C6820	15	480	ビーストウォーズⅡ
1354	C4102	14	1245	いないいないばあっ![第4期]
696	C2833	14	659	ふたりはプリキュア
597	C2650	13	551	デュエル・マスターズ
237	C1985	13	862	ポケットモンスター
1669	C4727	10	138	てーきゅう
1550	C4517	10	370	CARDFIGHT!! ヴァンガード
458	C2454	9	410	爆転シュートベイブレード
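The sort-then-aggregate step above relies on `groupby` keeping the existing row order inside each group, so that `"first"` returns the name of the earliest work. A minimal sketch on made-up rows (all IDs, names, and dates are hypothetical):

```python
import pandas as pd

# Toy data for illustration only
df_toy = pd.DataFrame(
    {
        "asid": ["S1", "S1", "S1", "S2"],
        "acid": ["C2", "C1", "C2", "C3"],
        "aeid": ["M3", "M1", "M4", "M5"],
        "acname": ["Series 2", "Series 1", "Series 2", "Other"],
        "date": pd.to_datetime(
            ["1994-10-03", "1993-04-10", "1994-10-10", "2000-01-01"]
        ),
    }
)

# Sorting first means "first" picks the earliest work name in each group,
# because groupby keeps the original row order within a group
df_series = (
    df_toy.sort_values("date")
    .groupby("asid")
    .agg({"acid": "nunique", "aeid": "nunique", "acname": "first"})
    .reset_index()
)
print(df_series)
```

Without the initial sort, `"first"` would return whichever name happens to come first in the file, which need not be the earliest broadcast.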

# Select only the rows whose 'asid' is "C1462", group by 'acname',
# then count the unique 'aeid' values and take the earliest 'date'
# reset_index() resets the index; the result is stored as the new data frame df_nintama
df_nintama = (
    df_ae[df_ae["asid"] == "C1462"]
    .groupby("acname")
    .agg({"aeid": "nunique", "date": "min"})
    .reset_index()
)

# Sort df_nintama by 'date' in ascending order
df_nintama.sort_values("date")

	acname	aeid	date
10	忍たま乱太郎［第1期］	95	1993-04-10
12	忍たま乱太郎第2期	120	1994-10-03
17	忍たま乱太郎[第3期]	120	1995-10-02
18	忍たま乱太郎[第4期]	120	1996-04-01
19	忍たま乱太郎[第5期]	100	1997-10-06
20	忍たま乱太郎[第6期]	60	1998-04-06
21	忍たま乱太郎[第7期]	80	1999-04-05
2	忍たま乱太郎 [第8期]	83	1999-12-31
25	忍たま乱太郎［第9期］	85	2000-12-23
22	忍たま乱太郎［第10期］	82	2002-03-21
13	忍たま乱太郎[第11期]	82	2003-04-07
23	忍たま乱太郎［第12期］	83	2004-03-20
24	忍たま乱太郎［第13期］	58	2004-12-31
4	忍たま乱太郎[第14期]	50	2006-04-03
5	忍たま乱太郎[第15期]	53	2007-04-02
14	忍たま乱太郎[第16期]	100	2008-03-31
15	忍たま乱太郎[第17期]	90	2009-03-30
6	忍たま乱太郎[第18期]	90	2010-03-29
7	忍たま乱太郎[第19期]	90	2011-03-28
8	忍たま乱太郎[第20期]	90	2012-04-02
0	忍たま乱太郎 [第21期]	76	2013-03-20
11	忍たま乱太郎［第22期］	75	2014-04-01
9	忍たま乱太郎[第23期]	70	2015-03-30
3	忍たま乱太郎の宇宙大冒険 with コズミックフロント☆ N E X T	2	2016-02-11
1	忍たま乱太郎 [第24期]	70	2016-04-04
16	忍たま乱太郎[第25期]	60	2017-04-03

# Sort by the number of unique `aeid` values in descending order and show the top 10
df_tmp.sort_values("aeid", ascending=False).head(10)

	asid	acid	aeid	acname
98	C1462	26	2084	忍たま乱太郎［第1期］
70	C1327	1	1926	クレヨンしんちゃん
310	C2158	20	1653	おじゃる丸
146	C1640	1	1363	親子クラブ
1354	C4102	14	1245	いないいないばあっ![第4期]
112	C1497	4	1207	しましまとらのしまじろう
226	C1945	1	1175	サザエさん
0	C1022	1	994	ちびまる子ちゃん［新］
2515	C7048	1	958	それいけ！アンパンマン
283	C2116	8	920	遊☆戯☆王

Deep dive into `aeid`, `aename`, and `aeno`

# Count the unique `aename` values per `aeid` and summarize the counts
df_ae.groupby("aeid")["aename"].nunique().describe().reset_index()

	index	aename
0	count	111041.000000
1	mean	0.975901
2	std	0.153358
3	min	0.000000
4	25%	1.000000
5	50%	1.000000
6	75%	1.000000
7	max	1.000000

# Keep only the rows of df_ae that have an `aename`,
# then count the unique `aename` values per `aeid` and summarize the counts
df_ae[~df_ae["aename"].isna()].groupby("aeid")[
    "aename"
].nunique().describe().reset_index()

	index	aename
0	count	108365.0
1	mean	1.0
2	std	0.0
3	min	1.0
4	25%	1.0
5	50%	1.0
6	75%	1.0
7	max	1.0
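The difference between the two summaries above (a minimum of 0 before filtering, 1 after) comes from how `nunique()` treats missing values. A small sketch on made-up rows:

```python
import pandas as pd

# Toy data for illustration only
df_toy = pd.DataFrame(
    {
        "aeid": ["M1", "M2", "M3"],
        "aename": ["Title A", None, "Title C"],
    }
)

# nunique() skips missing values by default, so M2 counts 0 titles
with_na = df_toy.groupby("aeid")["aename"].nunique()

# Dropping rows without a title first removes those zero-count groups
without_na = df_toy.dropna(subset=["aename"]).groupby("aeid")["aename"].nunique()

print(with_na.min(), without_na.min())
```

`dropna(subset=[...])` is used here as an equivalent of the `~isna()` mask in the cell above.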

# Count the unique `aeid` values per `aename` and summarize the counts
df_ae.groupby("aename")["aeid"].nunique().describe().reset_index()

	index	aeid
0	count	105955.000000
1	mean	1.022746
2	std	0.332036
3	min	1.000000
4	25%	1.000000
5	50%	1.000000
6	75%	1.000000
7	max	68.000000

# Count the unique `aeid` values per `aename`
df_tmp = df_ae.groupby("aename")["aeid"].nunique().reset_index(name="n_ae")

# Sort by `n_ae` in descending order and show the top five
df_tmp.sort_values("n_ae", ascending=False).head()

	aename	n_ae
14994	[総集編]	68
4167	2時間目完璧[パーフェクト]プレート	25
29650	てんすけのふるさとめぐり	25
13692	[サブタイトル表示なし]	20
14713	[第1話]	17

# Extract the rows whose `aename` equals `2時間目 完璧[パーフェクト]プレート`
# Show only a few columns for readability
df_ae[df_ae["aename"] == "2時間目 完璧[パーフェクト]プレート"][
    ["date", "acname", "aeid", "aename"]
]

	date	acname	aeid	aename
109089	2017-04-15	100％パスカル先生	M135357	2時間目完璧[パーフェクト]プレート
109185	2017-04-22	100％パスカル先生	M135359	2時間目完璧[パーフェクト]プレート
109276	2017-04-29	100％パスカル先生	M135362	2時間目完璧[パーフェクト]プレート
109362	2017-05-06	100％パスカル先生	M135364	2時間目完璧[パーフェクト]プレート
109455	2017-05-13	100％パスカル先生	M135367	2時間目完璧[パーフェクト]プレート
109546	2017-05-20	100％パスカル先生	M135369	2時間目完璧[パーフェクト]プレート
109632	2017-05-27	100％パスカル先生	M135372	2時間目完璧[パーフェクト]プレート
109717	2017-06-03	100％パスカル先生	M135374	2時間目完璧[パーフェクト]プレート
109809	2017-06-10	100％パスカル先生	M135377	2時間目完璧[パーフェクト]プレート
109895	2017-06-17	100％パスカル先生	M135380	2時間目完璧[パーフェクト]プレート
109978	2017-06-24	100％パスカル先生	M135383	2時間目完璧[パーフェクト]プレート
110032	2017-07-01	100％パスカル先生	M135386	2時間目完璧[パーフェクト]プレート
110100	2017-07-08	100％パスカル先生	M135389	2時間目完璧[パーフェクト]プレート
110180	2017-07-15	100％パスカル先生	M135392	2時間目完璧[パーフェクト]プレート
110257	2017-07-22	100％パスカル先生	M135395	2時間目完璧[パーフェクト]プレート
110336	2017-07-29	100％パスカル先生	M135398	2時間目完璧[パーフェクト]プレート
110414	2017-08-05	100％パスカル先生	M135401	2時間目完璧[パーフェクト]プレート
110483	2017-08-12	100％パスカル先生	M135404	2時間目完璧[パーフェクト]プレート
110564	2017-08-19	100％パスカル先生	M135407	2時間目完璧[パーフェクト]プレート
110643	2017-08-26	100％パスカル先生	M135410	2時間目完璧[パーフェクト]プレート
110722	2017-09-02	100％パスカル先生	M135413	2時間目完璧[パーフェクト]プレート
110801	2017-09-09	100％パスカル先生	M135416	2時間目完璧[パーフェクト]プレート
110880	2017-09-16	100％パスカル先生	M135419	2時間目完璧[パーフェクト]プレート
110958	2017-09-23	100％パスカル先生	M135422	2時間目完璧[パーフェクト]プレート
111015	2017-09-30	100％パスカル先生	M135425	2時間目完璧[パーフェクト]プレート

Basic analysis of `an_ac_crt.csv`

Overview

# Show the first five rows of df_ac_crt, transposed
df_ac_crt.head().T

	0	1	2	3	4
acid	C10010	C12657	C12663	C12681	C13191
acname	グラビテーション	ヒピラくん原作/大友克洋	カウボーイビバップ[WOWOW放送版]	ドラえもん［新］	HUNTER × HUNTER[新]
asid	C2336	C3943	C2111	NaN	C2136
n_ae	13	10	26	224	149
first_date	2000-10-04	2009-12-21	1998-10-24	1999-12-03	2011-10-02
last_date	2001-01-10	2009-12-24	1999-04-24	2005-03-18	2014-09-24
crtid	ACRT00944	ACRT00733	ACRT01173	ACRT01283	ACRT00647
crtname	村上真紀	大友克洋	矢立肇	藤子・F・不二雄	冨樫義博

# Summarize the missing values (NaN) in df_ac_crt
# isna() marks each missing cell as True
# agg() then computes, per column, the count ("sum") and the share ("mean") of missing values
df_ac_crt.isna().agg(["sum", "mean"])

	acid	acname	asid	n_ae	first_date	last_date	crtid	crtname
sum	0.0	0.0	2.000000	0.0	0.0	0.0	0.0	0.0
mean	0.0	0.0	0.001339	0.0	0.0	0.0	0.0	0.0

# Get descriptive statistics for df_ac_crt
# describe() shows basic statistics such as the mean, standard deviation, and median for the numeric columns
df_ac_crt.describe()

	n_ae
count	1494.000000
mean	27.911647
std	47.584073
min	1.000000
25%	12.000000
50%	13.000000
75%	27.000000
max	1175.000000

# Count the unique values in each column of df_ac_crt
# nunique() returns the number of distinct values per column
# reset_index(name="nunique") turns the result into a data frame with a "nunique" column
df_ac_crt.nunique().reset_index(name="nunique")

	index	nunique
0	acid	1107
1	acname	1106
2	asid	896
3	n_ae	111
4	first_date	545
5	last_date	580
6	crtid	1056
7	crtname	1056

Deep dive into the `acid` and `acname` columns

# Count the unique `crtid` values per `acid` and `acname`, and summarize the counts
df_ac_crt.groupby(["acid", "acname"])["crtid"].nunique().describe().reset_index()

	index	crtid
0	count	1107.000000
1	mean	1.349593
2	std	0.573454
3	min	1.000000
4	25%	1.000000
5	50%	1.000000
6	75%	2.000000
7	max	4.000000

# Count the unique `crtid` values per `acid` and `acname`
# Then sort in descending order so the works linked to the most original creators come first
df_ac_crt.groupby(["acid", "acname"])["crtid"].nunique().sort_values(
    ascending=False
).reset_index().head()

	acid	acname	crtid
0	C16041	ぼのぼの[新]	4
1	C15430	てさぐれ! 部活ものすぴんおふプルプルんシャルムと遊ぼう	4
2	C16202	トミカハイパーレスキュードライブヘッド -機動救急警察-	4
3	C16039	フューチャーカードバディファイト DDD	4
4	C16067	BORUTO -ボルト- NARUTO NEXT GENERATIONS	3

# Extract the records of `df_ac_crt` whose `acid` is `C16041`
df_ac_crt[df_ac_crt["acid"] == "C16041"]

	acid	acname	asid	n_ae	first_date	last_date	crtid	crtname
881	C16041	ぼのぼの[新]	C1491	38	2016-04-02	2016-12-24	ACRT00153	いがらしみきお
882	C16041	ぼのぼの[新]	C1491	38	2016-04-02	2016-12-24	ACRT00270	アイ・エム・オー
883	C16041	ぼのぼの[新]	C1491	38	2016-04-02	2016-12-24	ACRT00321	オフィス・コウキ
884	C16041	ぼのぼの[新]	C1491	38	2016-04-02	2016-12-24	ACRT01219	竹書房

Deep dive into the `asid` column

# Count the unique `crtid` values per `asid` and summarize the counts
df_ac_crt.groupby("asid")["crtid"].nunique().describe().reset_index()

	index	crtid
0	count	896.000000
1	mean	1.380580
2	std	0.638803
3	min	1.000000
4	25%	1.000000
5	50%	1.000000
6	75%	2.000000
7	max	5.000000

# Sort df_ac_crt by 'first_date' in ascending order, then group by 'asid'
# For each group, count the unique 'crtid' (creator ID) and 'acid' (anime work ID) values,
# and take the first 'acname' (anime work name) in the group
df_tmp = (
    df_ac_crt.sort_values("first_date")
    .groupby("asid")
    .agg({"crtid": "nunique", "acid": "nunique", "acname": "first"})
)

# Sort df_tmp by 'crtid' (the number of unique creator IDs) in descending order
# Show the top five rows to see which `asid` values are linked to the most creators
df_tmp.sort_values("crtid", ascending=False).head()

	crtid	acid	acname
asid
C2119	5	2	南海奇皇[第1期]
C3797	5	5	Fate/kaleid liner プリズマ ★イリヤ 2ｗｅｉ!
C3174	5	3	THE iDOLM＠STER シンデレラガールズ［第１期］
C5766	5	4	フューチャーカードバディファイト
C1240	4	1	トミカハイパーレスキュードライブヘッド -機動救急警察-

Deep dive into the `crtid` and `crtname` columns

# Count the unique `crtname` values per `crtid` and summarize the counts
df_ac_crt.groupby("crtid")["crtname"].nunique().describe().reset_index()

	index	crtname
0	count	1056.0
1	mean	1.0
2	std	0.0
3	min	1.0
4	25%	1.0
5	50%	1.0
6	75%	1.0
7	max	1.0

# Count the unique `crtid` values per `crtname` and summarize the counts
df_ac_crt.groupby("crtname")["crtid"].nunique().describe().reset_index()

	index	crtid
0	count	1056.0
1	mean	1.0
2	std	0.0
3	min	1.0
4	25%	1.0
5	50%	1.0
6	75%	1.0
7	max	1.0

# Count the unique `acid` values linked to each `crtid` and `crtname`
# Then get the basic statistics of those counts
df_ac_crt.groupby(["crtid", "crtname"])["acid"].nunique().describe().reset_index()

	index	acid
0	count	1056.000000
1	mean	1.414773
2	std	1.280673
3	min	1.000000
4	25%	1.000000
5	50%	1.000000
6	75%	1.000000
7	max	29.000000

# Count the unique `acid` values linked to each `crtname`
# Sort in descending order and extract the top 10 creators with their number of anime works
df_ac_crt.groupby("crtname")["acid"].nunique().sort_values(
    ascending=False
).reset_index().head(10)

	crtname	acid
0	矢立肇	29
1	富野由悠季	10
2	尼子騒兵衛	10
3	ブシロード	9
4	ルーツ	9
5	Piyo	8
6	タツノコプロ	7
7	武内直子	6
8	サンリオ	6
9	あかほりさとる	6

Deep dive into the `n_ae` column

# Sum `n_ae` per `crtname`
# Then compute the descriptive statistics of those totals
df_ac_crt.groupby("crtname")["n_ae"].sum().describe().reset_index()

	index	n_ae
0	count	1056.000000
1	mean	39.488636
2	std	77.386859
3	min	1.000000
4	25%	12.000000
5	50%	24.000000
6	75%	38.000000
7	max	1267.000000

# Sum `n_ae` per `crtname` and sort in descending order
# Extract the top 10 creators
df_ac_crt.groupby("crtname")["n_ae"].sum().sort_values(
    ascending=False
).reset_index().head(10)

	crtname	n_ae
0	長谷川町子	1267
1	矢立肇	997
2	やなせたかし	958
3	尼子騒兵衛	817
4	サンリオ	644
5	ブシロード	363
6	犬丸りん	343
7	手塚治虫	328
8	富野由悠季	320
9	タツノコプロ	316

Deep dive into the `first_date` and `last_date` columns

# From `df_ac_crt`, get the earliest first broadcast date (`first_date`)
# and the latest last broadcast date (`last_date`) per creator (`crtname`)
df_tmp = (
    df_ac_crt.groupby("crtname")[["first_date", "last_date"]]
    .agg({"first_date": "min", "last_date": "max"})
    .reset_index()
)

# Compute each creator's active period
df_tmp["duration"] = df_tmp["last_date"] - df_tmp["first_date"]

# Get the 10 creators with the longest active period
df_tmp.sort_values(by="duration", ascending=False).head(10)

	crtname	first_date	last_date	duration
645	手塚治虫	1963-01-01	2017-07-08	19912 days
384	モンキー・パンチ	1971-10-24	2016-03-18	16217 days
864	矢立肇	1979-04-07	2016-12-25	13777 days
978	赤塚不二夫	1990-04-21	2016-12-13	9733 days
725	森下裕美	1991-04-11	2017-09-19	9658 days
947	藤子不二雄Ⓐ	1991-03-12	2017-06-19	9596 days
785	池田あきこ	1992-03-31	2016-12-31	9041 days
508	吉岡平	1993-01-25	2017-09-26	9010 days
751	横山光輝	1991-10-18	2016-03-26	8926 days
761	武内直子	1992-03-07	2016-06-27	8878 days
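The `duration` column above is a difference of two date columns, which only works if both were parsed as datetimes beforehand (that parsing is assumed to happen in the setup code not shown here). A minimal sketch on made-up rows; the creator names, dates, and the `duration_days` column are hypothetical:

```python
import pandas as pd

# Toy data for illustration only; names and dates are made up
df_toy = pd.DataFrame(
    {
        "crtname": ["Creator A", "Creator A", "Creator B"],
        "first_date": pd.to_datetime(["2000-01-01", "2005-01-01", "2010-01-01"]),
        "last_date": pd.to_datetime(["2000-12-31", "2005-01-31", "2010-01-11"]),
    }
)

# Earliest start and latest end per creator
df_span = (
    df_toy.groupby("crtname")
    .agg({"first_date": "min", "last_date": "max"})
    .reset_index()
)

# Subtracting datetime columns yields a Timedelta column
df_span["duration"] = df_span["last_date"] - df_span["first_date"]
# .dt.days converts it to an integer number of days
df_span["duration_days"] = df_span["duration"].dt.days
print(df_span)
```

Note that this span runs from the first to the last broadcast in the data, so it includes any gaps between works.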

Basic analysis of `an_ac_act.csv`

Overview

# Show the first five rows of df_ac_act, transposed
df_ac_act.head().T

	0	1	2	3	4
acid	C10001	C10001	C10001	C10001	C10001
acname	ギャラクシーエンジェル	ギャラクシーエンジェル	ギャラクシーエンジェル	ギャラクシーエンジェル	ギャラクシーエンジェル
asid	C2483	C2483	C2483	C2483	C2483
n_ae	24	24	24	24	24
first_date	2001-04-08	2001-04-08	2001-04-08	2001-04-08	2001-04-08
last_date	2001-09-30	2001-09-30	2001-09-30	2001-09-30	2001-09-30
actid	ACT00102	ACT05700	ACT06001	ACT01887	ACT02359
actname	かないみか	保村真	吉野裕行	山口眞弓	新谷良子
wiki_size	116003.0	45464.0	149454.0	19635.0	73259.0
gender	female	male	male	female	female

# Summarize the missing values (NaN) in df_ac_act
# isna() marks each missing cell as True
# agg() then computes, per column, the count ("sum") and the share ("mean") of missing values
df_ac_act.isna().agg(["sum", "mean"]).T

	sum	mean
acid	0.0	0.000000
acname	0.0	0.000000
asid	67.0	0.002197
n_ae	0.0	0.000000
first_date	0.0	0.000000
last_date	0.0	0.000000
actid	0.0	0.000000
actname	0.0	0.000000
wiki_size	0.0	0.000000
gender	0.0	0.000000

# Get descriptive statistics for df_ac_act
# describe() shows basic statistics such as the mean, standard deviation, and median for the numeric columns
df_ac_act.describe()

	n_ae	wiki_size
count	30492.000000	30492.000000
mean	29.810245	120299.661124
std	71.504224	95596.818847
min	1.000000	84.000000
25%	12.000000	40113.000000
50%	13.000000	93599.000000
75%	26.000000	183982.000000
max	1926.000000	393910.000000

# Count the unique values in each column of df_ac_act
# nunique() returns the number of distinct values per column
# reset_index(name="nunique") turns the result into a data frame with a "nunique" column
df_ac_act.nunique().reset_index(name="nunique")

	index	nunique
0	acid	2845
1	acname	2842
2	asid	1977
3	n_ae	150
4	first_date	1331
5	last_date	1478
6	actid	2998
7	actname	2998
8	wiki_size	2932
9	gender	2

Analysis of the `acid` and `acname` columns

# Count the voice actors linked to each `acid` and `acname`, and summarize the counts
df_ac_act.groupby(["acid", "acname"])["actid"].nunique().describe().reset_index()

	index	actid
0	count	2845.000000
1	mean	10.717750
2	std	5.068738
3	min	1.000000
4	25%	7.000000
5	50%	11.000000
6	75%	15.000000
7	max	22.000000

# Extract the anime works linked to the most voice actors
df_ac_act.groupby(["acid", "acname"])["actid"].nunique().sort_values(
    ascending=False
).reset_index().head()

	acid	acname	actid
0	C14818	カーニヴァル	22
1	C15870	甲鉄城のカバネリ	22
2	C14716	キングダム	21
3	C13851	神様のメモ帳 It's the only NEET thing to do.	21
4	C14833	キングダム[第2期]	21

# Extract the rows whose `acid` is C14818 and show only a few columns for readability
df_ac_act[df_ac_act["acid"] == "C14818"][["acname", "actname", "gender"]]

	acname	actname	gender
19711	カーニヴァル	下野紘	male
19712	カーニヴァル	中村悠一	male
19713	カーニヴァル	五十嵐裕美	female
19714	カーニヴァル	佐藤聡美	female
19715	カーニヴァル	保志総一朗	male
19716	カーニヴァル	入野自由	male
19717	カーニヴァル	前野智昭	male
19718	カーニヴァル	喜多村英梨	female
19719	カーニヴァル	宮野真守	male
19720	カーニヴァル	小野大輔	male
19721	カーニヴァル	岡本信彦	male
19722	カーニヴァル	平川大輔	male
19723	カーニヴァル	広瀬正志	male
19724	カーニヴァル	日笠陽子	female
19725	カーニヴァル	本名陽子	female
19726	カーニヴァル	矢作紗友里	female
19727	カーニヴァル	神谷浩史	male
19728	カーニヴァル	緒方賢一	male
19729	カーニヴァル	諏訪部順一	male
19730	カーニヴァル	豊永利行	male
19731	カーニヴァル	遊佐浩二	male
19732	カーニヴァル	遠藤綾	female

Deep dive into the `asid` column

# Count the unique `actid` values per `asid` and summarize the counts
df_ac_act.groupby("asid")["actid"].nunique().describe().reset_index()

	index	actid
0	count	1977.000000
1	mean	12.465352
2	std	8.874857
3	min	1.000000
4	25%	7.000000
5	50%	12.000000
6	75%	16.000000
7	max	165.000000

# First, sort df_ac_act by 'first_date' in ascending order and group by 'asid'
# Next, for each group, count the unique 'actid' values (the number of voice actors)
# and take the first 'acname' (the earliest work name), then reset the index to create df_tmp
df_tmp = (
    df_ac_act.sort_values("first_date")
    .groupby("asid")
    .agg({"actid": "nunique", "acname": "first"})
    .reset_index()
)

# Sort df_tmp by 'actid' (the number of unique voice actors) in descending order
# Show the top five rows to see which `asid` values are linked to the most voice actors
df_tmp.sort_values("actid", ascending=False).head()

	asid	actid	acname
391	C2833	165	ふたりはプリキュア
1959	C6820	91	超生命体トランスフォーマービーストウォーズメタルス
169	C2454	89	爆転シュートベイブレード
301	C2650	84	デュエル・マスターズ
728	C3397	79	バトルスピリッツ少年突破バシン

Deep dive into the `actid` and `actname` columns

# Count the unique `actname` values per `actid` and summarize the counts
df_ac_act.groupby("actid")["actname"].nunique().describe().reset_index()

	index	actname
0	count	2998.0
1	mean	1.0
2	std	0.0
3	min	1.0
4	25%	1.0
5	50%	1.0
6	75%	1.0
7	max	1.0

# Count the unique `actid` values per `actname`
# Then summarize the counts
df_ac_act.groupby("actname")["actid"].nunique().describe().reset_index()

	index	actid
0	count	2998.0
1	mean	1.0
2	std	0.0
3	min	1.0
4	25%	1.0
5	50%	1.0
6	75%	1.0
7	max	1.0

# Count the unique `acid` values per combination of `actid` and `actname`
# Then summarize the counts
df_ac_act.groupby(["actid", "actname"])["acid"].nunique().describe().reset_index()

	index	acid
0	count	2998.000000
1	mean	10.170781
2	std	19.762846
3	min	1.000000
4	25%	1.000000
5	50%	3.000000
6	75%	9.000000
7	max	181.000000

# Count the unique anime works (`acid`) per voice actor name (`actname`) in `df_ac_act`
# Then sort in descending order and show the top 10
df_ac_act.groupby("actname")["acid"].nunique().sort_values(
    ascending=False
).reset_index().head(10)

	actname	acid
0	沢城みゆき	181
1	櫻井孝宏	180
2	子安武人	175
3	能登麻美子	158
4	釘宮理恵	156
5	福山潤	154
6	堀江由衣	149
7	花澤香菜	142
8	浪川大輔	139
9	川澄綾子	138

Deep dive into the `n_ae` column

# Sum `n_ae` per combination of `actid` and `actname`
# Then summarize the totals
df_ac_act.groupby(["actid", "actname"])["n_ae"].sum().describe().reset_index()

	index	n_ae
0	count	2998.000000
1	mean	303.193462
2	std	584.690032
3	min	1.000000
4	25%	26.000000
5	50%	70.000000
6	75%	262.500000
7	max	5468.000000

# Sum the episode counts (`n_ae`) per voice actor name (`actname`) in `df_ac_act`
# Then sort in descending order and show the top 10
df_ac_act.groupby("actname")["n_ae"].sum().sort_values(
    ascending=False
).reset_index().head(10)

	actname	n_ae
0	藤原啓治	5468
1	こおろぎさとみ	4546
2	川澄綾子	4429
3	石田彰	4389
4	山口勝平	4328
5	子安武人	4255
6	一龍斎貞友	4247
7	釘宮理恵	4107
8	櫻井孝宏	3908
9	三石琴乃	3739

# Extract the rows of `df_ac_act` whose `acname` is `クレヨンしんちゃん`
# Show only a few columns for readability
df_ac_act[df_ac_act["acname"] == "クレヨンしんちゃん"][["acname", "actname", "gender"]]

	acname	actname	gender
29115	クレヨンしんちゃん	こおろぎさとみ	female
29116	クレヨンしんちゃん	ならはしみき	female
29117	クレヨンしんちゃん	一龍斎貞友	female
29118	クレヨンしんちゃん	三石琴乃	female
29119	クレヨンしんちゃん	佐藤智恵	female
29120	クレヨンしんちゃん	佳川紘子	female
29121	クレヨンしんちゃん	富沢美智恵	female
29122	クレヨンしんちゃん	川澄綾子	female
29123	クレヨンしんちゃん	林玉緒	female
29124	クレヨンしんちゃん	桜井敏治	male
29125	クレヨンしんちゃん	真柴摩利	female
29126	クレヨンしんちゃん	矢島晶子	female
29127	クレヨンしんちゃん	立木文彦	male
29128	クレヨンしんちゃん	納谷六朗	male
29129	クレヨンしんちゃん	藤原啓治	male
29130	クレヨンしんちゃん	鈴木れい子	female
29131	クレヨンしんちゃん	高田由美	female

Deep dive into the `first_date` and `last_date` columns

# From `df_ac_act`, get the earliest first broadcast date (`first_date`)
# and the latest last broadcast date (`last_date`) per voice actor (`actname`)
df_tmp = (
    df_ac_act.groupby("actname")[["first_date", "last_date"]]
    .agg({"first_date": "min", "last_date": "max"})
    .reset_index()
)

# Compute each voice actor's active period
df_tmp["duration"] = df_tmp["last_date"] - df_tmp["first_date"]

# Show the 10 voice actors with the longest active period
df_tmp.sort_values(by="duration", ascending=False).head(10)

	actname	first_date	last_date	duration
798	増岡弘	1963-11-25	2016-12-25	19389 days
870	大竹宏	1963-11-25	2015-10-24	18961 days
2208	矢島正明	1963-01-01	2014-04-07	18724 days
501	内海賢二	1963-11-25	2011-04-08	17301 days
485	八奈見乗児	1963-11-25	2010-11-21	17163 days
2390	納谷悟朗	1971-10-24	2016-12-25	16499 days
837	大塚周夫	1971-10-24	2016-01-03	16142 days
2000	清水マリ	1963-01-01	2006-01-10	15715 days
2410	緒方賢一	1974-10-06	2017-09-30	15700 days
2984	麻生美代子	1974-01-06	2016-12-25	15694 days

Deep dive into the `wiki_size` column

# Take each voice actor's Wikipedia page size (one value per actor)
df_tmp = df_ac_act.groupby("actname")["wiki_size"].first().reset_index()

# Show the basic statistics of the Wikipedia page sizes
df_tmp["wiki_size"].describe().reset_index()

	index	wiki_size
0	count	2998.000000
1	mean	41377.184456
2	std	53608.653558
3	min	84.000000
4	25%	8906.250000
5	50%	21203.000000
6	75%	51036.500000
7	max	393910.000000
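The per-actor mean above (about 41,377) is much lower than the row-level mean from `df_ac_act.describe()` (about 120,300), because the row-level figure repeats each actor once per work they appear in. A small sketch of that weighting effect on made-up rows (names and sizes are hypothetical):

```python
import pandas as pd

# Toy data for illustration only; names and sizes are made up
df_toy = pd.DataFrame(
    {
        "acid": ["C1", "C2", "C3", "C1"],
        "actname": ["Actor A", "Actor A", "Actor A", "Actor B"],
        "wiki_size": [300.0, 300.0, 300.0, 100.0],
    }
)

# Row-level mean weights each actor by the number of works they appear in
row_mean = df_toy["wiki_size"].mean()

# Taking one value per actor first gives every actor equal weight
per_actor = df_toy.groupby("actname")["wiki_size"].first().reset_index()
actor_mean = per_actor["wiki_size"].mean()

print(row_mean, actor_mean)
```

Using `first()` is safe here only because `wiki_size` is constant within each actor; `drop_duplicates(subset="actname")` would be an alternative under the same assumption.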

# Show the 10 voice actors with the largest Wikipedia pages
df_tmp.sort_values("wiki_size", ascending=False).head(10)

	actname	wiki_size
2446	花澤香菜	393910.0
507	内田真礼	388653.0
1502	早見沙織	364131.0
1769	森川智之	360890.0
1647	松岡禎丞	356962.0
1924	沢城みゆき	354094.0
1863	水樹奈々	336020.0
1594	杉田智和	330195.0
838	大塚明夫	328094.0
1839	櫻井孝宏	325313.0

Deep dive into the `gender` column

# Count the unique voice actors (`actid`) per gender (`gender`)
df_ac_act.groupby("gender")["actid"].nunique().reset_index()

	gender	actid
0	female	1664
1	male	1334

# Count the total number of rows per gender (`gender`)
df_ac_act.value_counts("gender").reset_index()

	gender	count
0	female	16486
1	male	14006

# Count the unique anime works (`acid`) each voice actor (`actid`) took part in, keeping `gender`
df_tmp = (
    df_ac_act.groupby(["actid", "gender"])["acid"].nunique().reset_index(name="n_ac")
)

# Compute the basic statistics of those per-actor work counts for each gender
df_tmp.groupby("gender")["n_ac"].describe().reset_index()

	gender	count	mean	std	min	25%	50%	75%	max
0	female	1664.0	9.907452	18.927906	1.0	1.0	3.0	9.0	181.0
1	male	1334.0	10.499250	20.759826	1.0	1.0	3.0	9.0	180.0
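The comparison above is a two-step aggregation: first one number per voice actor, then a summary of those numbers per gender. A minimal sketch on made-up rows (all IDs are hypothetical):

```python
import pandas as pd

# Toy data for illustration only; IDs are made up
df_toy = pd.DataFrame(
    {
        "actid": ["A1", "A1", "A2", "A3", "A3", "A3"],
        "gender": ["female", "female", "male", "female", "female", "female"],
        "acid": ["C1", "C2", "C1", "C1", "C2", "C3"],
    }
)

# Step 1: number of distinct works per voice actor
df_n_ac = (
    df_toy.groupby(["actid", "gender"])["acid"].nunique().reset_index(name="n_ac")
)

# Step 2: summarize those per-actor counts within each gender
stats = df_n_ac.groupby("gender")["n_ac"].describe()
print(stats[["count", "mean", "max"]])
```

Aggregating per actor first keeps prolific actors from dominating the per-gender figures, which is what a direct row-level summary by `gender` would do.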
